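# Feature Selection For Machine Learning in Python

> Original: [https://machinelearningmastery.com/feature-selection-machine-learning-python/](https://machinelearningmastery.com/feature-selection-machine-learning-python/)

The features in the data you use to train your machine learning models have a huge influence on the performance you can achieve. Irrelevant or partially relevant features can negatively impact model performance. In this post you will discover [automatic feature selection techniques](http://machinelearningmastery.com/an-introduction-to-feature-selection/) that you can use to prepare your machine learning data in Python with scikit-learn. Let's get started.

* **Update Dec/2016**: Fixed a typo in the RFE section regarding the chosen variables. Thanks Anderson.
* **Update Mar/2018**: Added an alternate link to download the dataset, because the original has been taken down.

![Feature Selection For Machine Learning in Python](https://img.kancloud.cn/9e/b6/9eb625ee140161e63106dd2cae2bcd2c_640x426.jpg)

Feature Selection For Machine Learning in Python. Photo by [Baptiste Lafontaine](https://www.flickr.com/photos/magn3tik/6022696093/), some rights reserved.

## Feature Selection

Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output you are interested in. Having irrelevant features in your data can decrease the accuracy of many models, especially linear algorithms such as linear and logistic regression. Three benefits of performing feature selection before modeling your data are:

* **Reduces overfitting**: Less redundant data means less opportunity to make decisions based on noise.
* **Improves accuracy**: Less misleading data means modeling accuracy improves.
* **Reduces training time**: Less data means that algorithms train faster.

You can learn more about feature selection with scikit-learn in the article [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html).

## Feature Selection for Machine Learning

This section lists 4 feature selection recipes for machine learning in Python. Each recipe is designed to be complete and standalone, so you can copy and paste it directly into your project and use it immediately. The recipes use the [Pima Indians onset of diabetes dataset](https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes) to demonstrate the feature selection methods (update: [download from here](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv)). This is a binary classification problem where all of the attributes are numeric.

### 1. Univariate Selection

Statistical tests can be used to select those features that have the strongest relationship with the output variable. The scikit-learn library provides the [SelectKBest](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest) class, which can be used with a suite of different statistical tests to select a specific number of features. The example below uses the chi-squared (chi^2) statistical test for non-negative features to select 4 of the best features from the Pima Indians onset of diabetes dataset.

```
# Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
import pandas
import numpy
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin',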
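         'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction: score every feature with chi2 and keep the best 4
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
# summarize scores
numpy.set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# summarize selected features
print(features[0:5,:])
```

You can see the score for each attribute and the 4 attributes chosen (those with the highest scores): _plas_, _test_, _mass_ and _age_.

```
[  111.52   1411.887    17.605    53.108  2175.565   127.669     5.393   181.304]
[[ 148.     0.    33.6   50. ]
 [  85.     0.    26.6   31. ]
 [ 183.     0.    23.3   32. ]
 [  89.    94.    28.1   21. ]
 [ 137.   168.    43.1   33. ]]
```

### 2. Recursive Feature Elimination

Recursive Feature Elimination (or RFE) works by recursively removing attributes and building a model on those attributes that remain. In scikit-learn, each round ranks the attributes by the fitted model's weights (its coefficients or feature importances) and discards the weakest, which identifies the attributes (and combinations of attributes) that contribute most to predicting the target attribute. You can learn more about the [RFE](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE) class in the scikit-learn documentation. The example below uses RFE with the logistic regression algorithm to select the top 3 features. The choice of algorithm does not matter much, as long as it is skillful and consistent.

```
# Feature Extraction with RFE
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
# liblinear was the default solver in older scikit-learn releases; it is set
# explicitly here, and results may still differ between library versions
model = LogisticRegression(solver='liblinear')
# recent scikit-learn releases require n_features_to_select as a keyword argument
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
```

You can see that RFE chose the top 3 features as _preg_, _mass_ and _pedi_. These are marked True in the `support_` array and marked with a rank of "1" in the `ranking_` array.

```
Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 3 5 6 1 1 4]
```

### 3. Principal Component Analysis

Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form. Generally this is called a data reduction technique. A property of PCA is that you can choose the number of dimensions or principal components in the transformed result. In the example below, we use PCA and select 3 principal components. For more on the PCA class in scikit-learn, see the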
[PCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) API documentation. Dive deeper into the math behind PCA in the [Principal Component Analysis Wikipedia article](https://en.wikipedia.org/wiki/Principal_component_analysis).

```
# Feature Extraction with PCA
import numpy
from pandas import read_csv
from sklearn.decomposition import PCA
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction: fit PCA and keep the first 3 principal components
pca = PCA(n_components=3)
fit = pca.fit(X)
# summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)
```

You can see that the transformed dataset (3 principal components) bears little resemblance to the source data.

```
Explained Variance: [ 0.88854663  0.06159078  0.02579012]
[[ -2.02176587e-03   9.78115765e-02   1.60930503e-02   6.07566861e-02
    9.93110844e-01   1.40108085e-02   5.37167919e-04  -3.56474430e-03]
 [  2.26488861e-02   9.72210040e-01   1.41909330e-01  -5.78614699e-02
   -9.46266913e-02   4.69729766e-02   8.16804621e-04   1.40168181e-01]
 [ -2.24649003e-02   1.43428710e-01  -9.22467192e-01  -3.07013055e-01
    2.09773019e-02  -1.32444542e-01  -6.39983017e-04  -1.25454310e-01]]
```

### 4. Feature Importance

Bagged decision trees such as Random Forest and Extra Trees can be used to estimate the importance of features. In the example below, we construct an ExtraTreesClassifier classifier for the Pima Indians onset of diabetes dataset. You can learn more about the [ExtraTreesClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html) class in the scikit-learn API.

```
# Feature Importance with Extra Trees Classifier
from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction: fit the ensemble, then read one importance score per feature
model = ExtraTreesClassifier()
model.fit(X, Y)
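print(model.feature_importances_)
```

You can see that we are given an importance score for each attribute, where the larger the score, the more important the attribute. The scores suggest the importance of _plas_, _age_ and _mass_.

```
[ 0.11070069  0.2213717   0.08824115  0.08068703  0.07281761  0.14548537
  0.12654214  0.15415431]
```

## Summary

In this post you discovered feature selection for preparing machine learning data in Python with scikit-learn. You learned about 4 different automatic feature selection techniques:

* Univariate Selection.
* Recursive Feature Elimination.
* Principal Component Analysis.
* Feature Importance.

If you are looking for more information on feature selection, see these related posts:

* [Feature Selection with the Caret R Package](http://machinelearningmastery.com/feature-selection-with-the-caret-r-package/)
* [Feature Selection to Improve Accuracy and Decrease Training Time](http://machinelearningmastery.com/feature-selection-to-improve-accuracy-and-decrease-training-time/)
* [An Introduction to Feature Selection](http://machinelearningmastery.com/an-introduction-to-feature-selection/)
* [Feature Selection in Python with Scikit-Learn](http://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/)

Do you have any questions about feature selection or this post? Ask your questions in the comments and I will do my best to answer them.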
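
The univariate recipe prints raw scores and selected columns, which leaves you to match them to attribute names by position. A minimal sketch of doing that mapping in code follows. The helper name `top_k_feature_names` and the tiny three-column dataset are made up for illustration and are not part of the original recipes; the sketch assumes a scikit-learn version where `SelectKBest` exposes `get_support()` and `scores_`.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2


def top_k_feature_names(X, y, names, k):
    """Return (names of the k best features, {name: chi2 score}).

    Hypothetical helper: it wraps SelectKBest and translates the boolean
    support mask back into the column names supplied by the caller.
    """
    selector = SelectKBest(score_func=chi2, k=k).fit(X, y)
    mask = selector.get_support()  # boolean mask, one entry per column
    selected = [name for name, keep in zip(names, mask) if keep]
    scores = dict(zip(names, selector.scores_))
    return selected, scores


# Toy, made-up data (NOT the Pima dataset): column 'b' tracks the class
# closely, 'a' only weakly, and 'c' sums to the same total in both classes.
X = np.array([[1, 10, 5],
              [2,  0, 5],
              [1, 12, 6],
              [2,  1, 5],
              [1, 11, 5],
              [2,  0, 6]], dtype=float)
y = np.array([1, 0, 1, 0, 1, 0])
names = ['a', 'b', 'c']

selected, scores = top_k_feature_names(X, y, names, k=2)
print("Selected:", selected)
# List every feature with its score, strongest first
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print("%s: %.3f" % (name, score))
```

With the Pima data you would pass the first eight entries of `names` from the recipes above, together with `X` and `Y`, in place of the toy arrays.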