
# 1.13. Feature selection

Proofreaders: [@yuezhao9210](https://github.com/yuezhao9210) [@BWM-蜜蜂](https://github.com/apachecn/scikit-learn-doc-zh) Translator: [@v](https://github.com/apachecn/scikit-learn-doc-zh)

The classes in the [`sklearn.feature_selection`](classes.html#module-sklearn.feature_selection "sklearn.feature_selection") module can be used for feature selection and dimensionality reduction on sample sets, either to improve estimators' accuracy scores or to boost their performance on very high-dimensional datasets.

## 1.13.1. Removing features with low variance

[`VarianceThreshold`](generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold "sklearn.feature_selection.VarianceThreshold") is a simple baseline approach to feature selection. It removes all features whose variance does not meet some threshold. By default, it removes all zero-variance features, i.e. features that take the same value in every sample.

As an example, suppose that we have a dataset with boolean features, and we want to remove all features that are either one or zero in more than 80% of the samples. Boolean features are Bernoulli random variables, and the variance of such a variable is

$$\mathrm{Var}[X] = p(1 - p)$$

so we can select using the threshold `.8 * (1 - .8)`:

```
>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
>>> sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
>>> sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
```

As expected, `VarianceThreshold` has removed the first column, which has a probability $p = 5/6 > .8$ of containing a zero.

## 1.13.2. Univariate feature selection

Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. Scikit-learn exposes feature selection routines as objects that implement the `transform` method:

- [`SelectKBest`](generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest "sklearn.feature_selection.SelectKBest") removes all but the K highest scoring features.
- [`SelectPercentile`](generated/sklearn.feature_selection.SelectPercentile.html#sklearn.feature_selection.SelectPercentile "sklearn.feature_selection.SelectPercentile") removes all but a user-specified highest scoring percentage of features.
- Selectors that apply a common univariate statistical test to each feature: false positive rate [`SelectFpr`](generated/sklearn.feature_selection.SelectFpr.html#sklearn.feature_selection.SelectFpr "sklearn.feature_selection.SelectFpr"), false discovery rate [`SelectFdr`](generated/sklearn.feature_selection.SelectFdr.html#sklearn.feature_selection.SelectFdr "sklearn.feature_selection.SelectFdr"), or family wise error [`SelectFwe`](generated/sklearn.feature_selection.SelectFwe.html#sklearn.feature_selection.SelectFwe "sklearn.feature_selection.SelectFwe").
- [`GenericUnivariateSelect`](generated/sklearn.feature_selection.GenericUnivariateSelect.html#sklearn.feature_selection.GenericUnivariateSelect "sklearn.feature_selection.GenericUnivariateSelect") performs univariate feature selection with a configurable strategy. This makes it possible to choose the best univariate selection strategy with a hyper-parameter search estimator.

For instance, we can apply a $\chi^2$ test to the samples to retrieve only the two best features:

```
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)
```

These objects take as input a scoring function that returns univariate scores and p-values (or only scores for [`SelectKBest`](generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest "sklearn.feature_selection.SelectKBest") and [`SelectPercentile`](generated/sklearn.feature_selection.SelectPercentile.html#sklearn.feature_selection.SelectPercentile "sklearn.feature_selection.SelectPercentile")):

- For regression: [`f_regression`](generated/sklearn.feature_selection.f_regression.html#sklearn.feature_selection.f_regression "sklearn.feature_selection.f_regression"), [`mutual_info_regression`](generated/sklearn.feature_selection.mutual_info_regression.html#sklearn.feature_selection.mutual_info_regression "sklearn.feature_selection.mutual_info_regression")
- For classification: [`chi2`](generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2 "sklearn.feature_selection.chi2"), [`f_classif`](generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif "sklearn.feature_selection.f_classif"), [`mutual_info_classif`](generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif "sklearn.feature_selection.mutual_info_classif")

The methods based on the F-test estimate the degree of linear dependency between two random variables. Mutual information methods, on the other hand, can capture any kind of statistical dependency, but being nonparametric, they require more samples for accurate estimation.

**Feature selection with sparse data**

If you use sparse data (i.e. data represented as sparse matrices), [`chi2`](generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2 "sklearn.feature_selection.chi2"), [`mutual_info_regression`](generated/sklearn.feature_selection.mutual_info_regression.html#sklearn.feature_selection.mutual_info_regression "sklearn.feature_selection.mutual_info_regression"), and [`mutual_info_classif`](generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif "sklearn.feature_selection.mutual_info_classif") will process the data without making it dense.

**Warning**

Do not use a regression scoring function with a classification problem; you will get useless results.

Examples:

- [Univariate Feature Selection](../auto_examples/feature_selection/plot_feature_selection.html#sphx-glr-auto-examples-feature-selection-plot-feature-selection-py)
- [Comparison of F-test and mutual information](../auto_examples/feature_selection/plot_f_test_vs_mi.html#sphx-glr-auto-examples-feature-selection-plot-f-test-vs-mi-py)

## 1.13.3. Recursive feature elimination

Given an external estimator that assigns weights to features (e.g. the coefficients of a linear model), recursive feature elimination ([`RFE`](generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE "sklearn.feature_selection.RFE")) selects features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features, and the importance of each feature is obtained through either a `coef_` attribute or a `feature_importances_` attribute. Then, the least important features are pruned from the current set of features. This procedure is repeated recursively on the pruned set until the desired number of features is reached.

[`RFECV`](generated/sklearn.feature_selection.RFECV.html#sklearn.feature_selection.RFECV "sklearn.feature_selection.RFECV") performs RFE in a cross-validation loop to find the optimal number of features.

Examples:

- [Recursive feature elimination](../auto_examples/feature_selection/plot_rfe_digits.html#sphx-glr-auto-examples-feature-selection-plot-rfe-digits-py): A recursive feature elimination example showing the relevance of pixels in a digit classification task.
- [Recursive feature elimination with cross-validation](../auto_examples/feature_selection/plot_rfe_with_cross_validation.html#sphx-glr-auto-examples-feature-selection-plot-rfe-with-cross-validation-py): A recursive feature elimination example with automatic tuning of the number of features selected with cross-validation.

## 1.13.4. Feature selection using SelectFromModel

[`SelectFromModel`](generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel "sklearn.feature_selection.SelectFromModel") is a meta-transformer that can be used with any estimator that has a `coef_` or `feature_importances_` attribute after fitting. Features are considered unimportant and removed if the corresponding `coef_` or `feature_importances_` values are below the provided threshold. Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. Available heuristics are `mean`, `median`, and float multiples of these such as `0.1*mean`.

Examples:

- [Feature selection using SelectFromModel and LassoCV](../auto_examples/feature_selection/plot_select_from_model_boston.html#sphx-glr-auto-examples-feature-selection-plot-select-from-model-boston-py): Selecting the two most important features from the Boston dataset without knowing the threshold beforehand.

### 1.13.4.1. L1-based feature selection

[Linear models](linear_model.html#linear-model) penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero. When the goal is to reduce the dimensionality of the data before using another classifier, they can be used along with [`feature_selection.SelectFromModel`](generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel "sklearn.feature_selection.SelectFromModel") to select the features with non-zero coefficients. In particular, sparse estimators useful for this purpose are [`linear_model.Lasso`](generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso "sklearn.linear_model.Lasso") for regression, and [`linear_model.LogisticRegression`](generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression "sklearn.linear_model.LogisticRegression") and [`svm.LinearSVC`](generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC "sklearn.svm.LinearSVC") for classification:

```
>>> from sklearn.svm import LinearSVC
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
>>> model = SelectFromModel(lsvc, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 3)
```

With SVMs and logistic regression, the parameter C controls the sparsity: the smaller C, the fewer features are selected. With Lasso, the higher the alpha parameter, the fewer features are selected.

Examples:

- [Classification of text documents using sparse features](../auto_examples/text/document_classification_20newsgroups.html#sphx-glr-auto-examples-text-document-classification-20newsgroups-py): Comparison of different algorithms for document classification, including L1-based feature selection.

**L1-recovery and compressive sensing**

For a good choice of alpha, the [Lasso](linear_model.html#lasso) can fully recover the exact set of non-zero variables using only a few observations, provided certain specific conditions are met. In particular, the number of samples should be "sufficiently large"; otherwise the performance of L1 models is not guaranteed. What counts as "sufficiently large" depends on the number of non-zero coefficients, the logarithm of the number of features, the amount of noise, the smallest absolute value of the non-zero coefficients, and the structure of the design matrix X. In addition, the design matrix must have certain specific properties, such as not being too correlated.

There is no general rule for selecting an alpha parameter that recovers the non-zero coefficients. It can be set by cross-validation (`LassoCV` or `LassoLarsCV`), though this may lead to under-penalized models: including a small number of non-relevant variables is not detrimental to the prediction score. BIC (`LassoLarsIC`), on the other hand, tends to set high values of alpha.

**Reference**

Richard G. Baraniuk, "Compressive Sensing", IEEE Signal Processing Magazine \[120\] July 2007 <http://dsp.rice.edu/sites/dsp.rice.edu/files/cs/baraniukCSlecture07.pdf>

### 1.13.4.2. Tree-based feature selection

Tree-based estimators (see the [`sklearn.tree`](classes.html#module-sklearn.tree "sklearn.tree") module and the forests of trees in the [`sklearn.ensemble`](classes.html#module-sklearn.ensemble "sklearn.ensemble") module) can be used to compute feature importances, which in turn can be used to discard irrelevant features (when coupled with the [`sklearn.feature_selection.SelectFromModel`](generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel "sklearn.feature_selection.SelectFromModel") meta-transformer):

```
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> clf = ExtraTreesClassifier()
>>> clf = clf.fit(X, y)
>>> clf.feature_importances_
array([ 0.04...,  0.05...,  0.4...,  0.4...])
>>> model = SelectFromModel(clf, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 2)
```

Examples:

- [Feature importances with forests of trees](../auto_examples/ensemble/plot_forest_importances.html#sphx-glr-auto-examples-ensemble-plot-forest-importances-py): An example on synthetic data showing the recovery of the actually meaningful features.
- [Pixel importances with a parallel forest of trees](../auto_examples/ensemble/plot_forest_importances_faces.html#sphx-glr-auto-examples-ensemble-plot-forest-importances-faces-py): An example on face recognition data.

## 1.13.5. Feature selection as part of a pipeline

Feature selection is usually used as a preprocessing step before the actual learning. The recommended way to do this in scikit-learn is to use a [`sklearn.pipeline.Pipeline`](generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline "sklearn.pipeline.Pipeline"), shown here with `X` and `y` standing for the training samples and their labels:

```
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

clf = Pipeline([
    # dual=False: the L1 penalty is not supported with the dual formulation
    ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
    ('classification', RandomForestClassifier())
])
clf.fit(X, y)
```

In this snippet we use a [`sklearn.svm.LinearSVC`](generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC "sklearn.svm.LinearSVC") coupled with [`sklearn.feature_selection.SelectFromModel`](generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel "sklearn.feature_selection.SelectFromModel") to evaluate feature importances and select the most relevant features. Then, a [`sklearn.ensemble.RandomForestClassifier`](generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier "sklearn.ensemble.RandomForestClassifier") is trained on the transformed output, that is, using only the relevant features. You can perform similar operations with the other feature selection methods, and with other classifiers that provide a way to evaluate feature importances. See the [`sklearn.pipeline.Pipeline`](generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline "sklearn.pipeline.Pipeline") examples for more details.
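
As a hedged, self-contained sketch of the same pipeline pattern with a different selector, the snippet below chains `SelectKBest` with a `LogisticRegression` classifier and scores the whole pipeline with cross-validation. The iris dataset, the choice of `k=2`, the classifier, the step names `"select"` and `"classify"`, and the 5-fold split are illustrative assumptions, not recommendations from this guide.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Illustrative data: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# The selector is a pipeline step, so it is refit on every training fold.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=2)),          # keep the 2 best-scoring features
    ("classify", LogisticRegression(max_iter=1000)),  # trained on the selected features only
])

# One accuracy score per fold.
scores = cross_val_score(pipe, X, y, cv=5)

# Fit on the full data to inspect which features were kept.
pipe.fit(X, y)
mask = pipe.named_steps["select"].get_support()  # boolean mask over the 4 features
n_selected = int(mask.sum())

print("mean CV accuracy:", scores.mean())
print("selected feature mask:", mask)
```

Because the selection step lives inside the pipeline, each cross-validation fold selects features from its own training portion only, so the held-out samples do not influence which features are kept.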