
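# Automate Machine Learning Workflows with Pipelines in Python and scikit-learn

> Original: [https://machinelearningmastery.com/automate-machine-learning-workflows-pipelines-python-scikit-learn/](https://machinelearningmastery.com/automate-machine-learning-workflows-pipelines-python-scikit-learn/)

There are standard workflows in a machine learning project that can be automated. In Python scikit-learn, Pipelines help to clearly define and automate these workflows. In this post you will discover Pipelines in scikit-learn and how you can use them to automate common machine learning workflows. Let's get started.

* **Update Jan/2017**: Updated to reflect changes to the scikit-learn API in version 0.18.
* **Update March/2018**: Added an alternate link to download the dataset, as the original has been taken down.

![Automate Machine Learning Workflows with Pipelines in Python and scikit-learn](https://img.kancloud.cn/73/eb/73ebd92dbbdc7aef3d5d1f672a890f8d_640x480.jpg)

Automate Machine Learning Workflows with Pipelines in Python and scikit-learn. Photo by [Brian Cantoni](https://www.flickr.com/photos/cantoni/4426017757/), some rights reserved.

## Pipelines for Automating Machine Learning Workflows

There are standard workflows in applied machine learning. They are standard because they overcome common problems such as data leakage in your test harness.

Python scikit-learn provides a Pipeline utility to help automate machine learning workflows. Pipelines work by allowing a linear sequence of data transforms to be chained together, culminating in a modeling process that can be evaluated. The goal is to ensure that all of the steps in the pipeline are constrained to the data available for the evaluation, such as the training dataset or each fold of the cross validation procedure.

You can learn more about Pipelines in scikit-learn by reading the [Pipeline section](http://scikit-learn.org/stable/modules/pipeline.html) of the user guide. You can also review the API documentation for the [Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) and [FeatureUnion](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html) classes in the [pipeline module](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline).

## Pipeline 1: Data Preparation and Modeling

An easy trap to fall into in applied machine learning is leaking data from your training dataset to your test dataset. To avoid this trap you need a robust test harness with strong separation of training and testing. This includes data preparation.

Data preparation is one easy way to leak knowledge of the whole dataset to the algorithm. For example, preparing your data using normalization or standardization on the entire dataset before learning would not be a valid test, because the training data would have been influenced by the scale of the data in the test set.

Pipelines help you prevent data leakage in your test harness by ensuring that data preparation such as standardization is constrained to each fold of your cross validation procedure.

The example below demonstrates this important data preparation and model evaluation workflow. The pipeline is defined with two steps:

1. Standardize the data.
2. Learn a Linear Discriminant Analysis model.

The pipeline is then evaluated using 10-fold cross validation.

```
# Create a pipeline that standardizes the data then creates a model
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# create pipeline
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('lda', LinearDiscriminantAnalysis()))
model = Pipeline(estimators)
# evaluate pipeline
# random_state is omitted: it only applies when shuffle=True, and
# recent scikit-learn versions raise an error if it is set otherwise
kfold = KFold(n_splits=10)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
```

Running the example provides a summary of the accuracy of the setup on the dataset.

```
0.773462064252
```

## Pipeline 2: Feature Extraction and Modeling

Feature extraction is another procedure that is susceptible to data leakage. Like data preparation, feature extraction procedures must be restricted to the data in your training dataset.

The pipeline module provides a handy tool called FeatureUnion, which allows the results of multiple feature selection and extraction procedures to be combined into a larger dataset on which a model can be trained. Importantly, all of the feature extraction and the feature union occur within each fold of the cross validation procedure.

The example below demonstrates a pipeline defined with four steps:

1. Feature extraction with Principal Component Analysis (3 features)
2. Feature extraction with statistical selection (6 features)
3. Feature union
4. Learn a Logistic Regression model

The pipeline is then evaluated using 10-fold cross validation.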
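Before the full listing, it may help to see the feature union step in isolation. The sketch below is illustrative only: it uses a synthetic dataset from `make_classification` as a stand-in (an assumption, not the diabetes data used in this post) and shows that a FeatureUnion places the outputs of its transformers side by side as columns.

```python
# Minimal sketch: what FeatureUnion produces, on synthetic stand-in data
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Assumption for illustration: 200 rows with 8 input features
X, y = make_classification(n_samples=200, n_features=8, n_informative=5,
                           n_redundant=0, random_state=7)

# Same two extraction steps as in the pipeline described above
union = FeatureUnion([
    ('pca', PCA(n_components=3)),           # 3 principal components
    ('select_best', SelectKBest(k=6)),      # 6 best-scoring original features
])

# Fit both transformers and concatenate their outputs column-wise
combined = union.fit_transform(X, y)
print(combined.shape)
```

The combined array has 3 + 6 = 9 columns, and this is what the final model in the pipeline is trained on. The complete example on the diabetes dataset follows.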
```
# Create a pipeline that extracts features from the data then creates a model
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# create feature union
features = []
features.append(('pca', PCA(n_components=3)))
features.append(('select_best', SelectKBest(k=6)))
feature_union = FeatureUnion(features)
# create pipeline
estimators = []
estimators.append(('feature_union', feature_union))
# liblinear was the default solver when this post was written
estimators.append(('logistic', LogisticRegression(solver='liblinear')))
model = Pipeline(estimators)
# evaluate pipeline
# random_state is omitted: it only applies when shuffle=True, and
# recent scikit-learn versions raise an error if it is set otherwise
kfold = KFold(n_splits=10)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
```

Running the example provides a summary of the accuracy of the pipeline on the dataset.

```
0.776042378674
```

## Summary

In this post you discovered the difficulties of data leakage in applied machine learning. You discovered the Pipeline utilities in Python scikit-learn and how they can be used to automate standard applied machine learning workflows.

You learned how to use Pipelines in two important use cases:

1. Data preparation and modeling constrained to each fold of the cross validation procedure.
2. Feature extraction and feature union constrained to each fold of the cross validation procedure.

Do you have any questions about data leakage, Pipelines, or this post? Ask your questions in the comments and I will do my best to answer.
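As a closing illustration of the first use case, the sketch below spells out by hand what "constrained to each fold" means: the scaler is fit on the training rows of each fold only, then applied to the held-out rows. It uses synthetic data from `make_classification` (an assumption for illustration, not the dataset from this post) and compares the hand-written loop with `cross_val_score` on an equivalent Pipeline.

```python
# Minimal sketch: per-fold data preparation by hand vs. with a Pipeline
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption for illustration: synthetic data with 8 input features
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_redundant=0, random_state=7)
kfold = KFold(n_splits=10)

# Manual version: the scaler only ever sees the training rows of each fold
manual_scores = []
for train_idx, test_idx in kfold.split(X):
    scaler = StandardScaler().fit(X[train_idx])
    lda = LinearDiscriminantAnalysis()
    lda.fit(scaler.transform(X[train_idx]), y[train_idx])
    manual_scores.append(lda.score(scaler.transform(X[test_idx]), y[test_idx]))

# Pipeline version: cross_val_score refits the whole pipeline in each fold
pipeline = Pipeline([
    ('standardize', StandardScaler()),
    ('lda', LinearDiscriminantAnalysis()),
])
pipeline_scores = cross_val_score(pipeline, X, y, cv=kfold)

# Check whether the two approaches give the same per-fold accuracies
print(np.allclose(manual_scores, pipeline_scores))
```

The two sets of per-fold scores are expected to agree, which suggests the Pipeline is doing the same leak-free preparation as the explicit loop, with less code to get wrong.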