
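# Feature Importance and Feature Selection With XGBoost in Python

> Original: [https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/](https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/)

A benefit of using ensembles of decision tree methods such as gradient boosting is that they can automatically provide estimates of feature importance from a trained predictive model.

In this post, you will discover how to estimate the importance of features for a predictive modeling problem using the XGBoost library in Python.

After reading this post you will know:

* How feature importance is calculated using the gradient boosting algorithm.
* How to plot feature importance in Python as calculated by an XGBoost model.
* How to use feature importance calculated by XGBoost to perform feature selection.

Let's get started.

* **Update Jan/2017**: Updated to reflect changes in scikit-learn API version 0.18.1.
* **Update Mar/2018**: Added an alternate link to download the dataset, because the original has been taken down.

![Feature Importance and Feature Selection With XGBoost in Python](https://img.kancloud.cn/9c/da/9cdac39bb6f895dc677682aa7a47a650_640x480.jpg)

Feature Importance and Feature Selection With XGBoost in Python. Photo by [Keith Roper](https://www.flickr.com/photos/keithroper/15476027141/), some rights reserved.

## Feature Importance in Gradient Boosting

A benefit of using gradient boosting is that, after the boosted trees are constructed, it is relatively straightforward to retrieve an importance score for each attribute.

Generally, importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model. The more an attribute is used to make key decisions in the decision trees, the higher its relative importance.

This importance is calculated explicitly for each attribute in the dataset, allowing attributes to be ranked and compared to each other.

Importance is calculated for a single decision tree by the amount that each attribute split point improves the performance measure, weighted by the number of observations the node is responsible for. The performance measure may be the purity (Gini index) used to select the split points, or another more specific error function.

The feature importances are then averaged across all of the decision trees within the model.

For more technical information on how feature importance is calculated in boosted decision trees, see Section 10.13.1 "_Relative Importance of Predictor Variables_" of the book [The Elements of Statistical Learning: Data Mining, Inference, and Prediction](http://www.amazon.com/dp/0387848576?tag=inspiredalgor-20), page 367.

Also, see Matthew Drury's answer to the Cross Validated question "[Relative variable importance for Boosting](http://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting)", where he provides a very detailed and practical answer.

## Manually Plot Feature Importance

A trained XGBoost model automatically calculates feature importance on your predictive modeling problem.

These importance scores are available in the **feature_importances_** member variable of the trained model. For example, they can be printed directly as follows:

```py
print(model.feature_importances_)
```

We can plot these scores directly on a bar chart to get a visual indication of the relative importance of each feature in the dataset. For example:

```py
# plot
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()
```

We can demonstrate this by training an XGBoost model on the [Pima Indians onset of diabetes dataset](https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes) and creating a bar chart from the calculated feature importances (update: [download from here](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv)).

```py
# plot feature importance manually
from numpy import loadtxt
from xgboost import XGBClassifier
from matplotlib import pyplot
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
y = dataset[:,8]
# fit model on training data
model = XGBClassifier()
model.fit(X, y)
# feature importance
print(model.feature_importances_)
# plot
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()
```

Running this example first outputs the importance scores:

```py
[ 0.089701    0.17109634  0.08139535  0.04651163  0.10465116  0.2026578
  0.1627907   0.14119601]
```

We also get a bar chart of the relative importances.

![Manual Bar Chart of XGBoost Feature Importance](https://img.kancloud.cn/fe/ce/fecec6fcdbed080c86a0e0a9db263103_800x600.jpg)

Manual Bar Chart of XGBoost Feature Importance

A downside of this plot is that the features are ordered by their input index rather than by their importance. We could sort the features before plotting.

Thankfully, there is a built-in plot function to help us.

## Using the Built-in XGBoost Feature Importance Plot

The XGBoost library provides a built-in function to plot features ordered by their importance.

The function is called **plot_importance()** and can be used as follows:

```py
# plot feature importance
plot_importance(model)
pyplot.show()
```

For example, below is a complete code listing that plots the feature importance for the Pima Indians dataset using the built-in **plot_importance()** function.

```py
# plot feature importance using built-in function
from numpy import loadtxt
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
y = dataset[:,8]
# fit model on training data
model = XGBClassifier()
model.fit(X, y)
# plot feature importance
plot_importance(model)
pyplot.show()
```

Running the example gives us a more useful bar chart.

![XGBoost Feature Importance Bar Chart](https://img.kancloud.cn/cb/2b/cb2b60873128ec537cf33413d5302493_800x600.jpg)

XGBoost Feature Importance Bar Chart

You can see that features are automatically named according to their index in the input array (X), from F0 to F7.

Manually mapping these indices to the [names in the problem description](https://github.com/jbrownlee/Datasets/blob/master/pima-indians-diabetes.names), we can see that the plot shows F5 (body mass index) has the highest importance and F3 (skin fold thickness) has the lowest importance.

## Feature Selection With XGBoost Feature Importance Scores

Feature importance scores can be used for feature selection in scikit-learn.

This is done using the [SelectFromModel](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html) class, which takes a model and can transform a dataset into a subset with selected features.

This class can take a pre-trained model, such as one trained on the entire training dataset. It can then use a threshold to decide which features to select. This threshold is used when you call the **transform()** method on the **SelectFromModel** instance, so that the same features are consistently selected on the training dataset and the test dataset.

In the example below, we first train an XGBoost model on the entire training dataset and evaluate it on the test dataset. Using the feature importances calculated from the training dataset, we then wrap the model in a SelectFromModel instance. We use this to select features on the training dataset, train a model from the selected subset of features, then evaluate the model on the test set, subject to the same feature selection scheme.

For example:

```py
# select features using threshold
selection = SelectFromModel(model, threshold=thresh, prefit=True)
select_X_train = selection.transform(X_train)
# train model
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
# eval model
select_X_test = selection.transform(X_test)
y_pred = selection_model.predict(select_X_test)
```

For interest, we can test multiple thresholds for selecting features by feature importance. Specifically, we use the feature importance of each input variable as a threshold, which lets us test each subset of features by importance, starting with all features and ending with a subset containing only the most important feature.

The complete code listing is provided below.

```py
# use feature importance for feature selection
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
	# select features using threshold
	selection = SelectFromModel(model, threshold=thresh, prefit=True)
	select_X_train = selection.transform(X_train)
	# train model
	selection_model = XGBClassifier()
	selection_model.fit(select_X_train, y_train)
	# eval model
	select_X_test = selection.transform(X_test)
	y_pred = selection_model.predict(select_X_test)
	predictions = [round(value) for value in y_pred]
	accuracy = accuracy_score(y_test, predictions)
	print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))
```

Running this example prints the following output:

```py
Accuracy: 77.95%
Thresh=0.071, n=8, Accuracy: 77.95%
Thresh=0.073, n=7, Accuracy: 76.38%
Thresh=0.084, n=6, Accuracy: 77.56%
Thresh=0.090, n=5, Accuracy: 76.38%
Thresh=0.128, n=4, Accuracy: 76.38%
Thresh=0.160, n=3, Accuracy: 74.80%
Thresh=0.186, n=2, Accuracy: 71.65%
Thresh=0.208, n=1, Accuracy: 63.78%
```

We can see that the performance of the model generally decreases as the number of selected features decreases.

On this problem there is a trade-off between the number of features and test set accuracy. We could decide to take a less complex model (fewer attributes, such as n=4) and accept a modest decrease in estimated accuracy from 77.95% down to 76.38%.

This is likely to be a wash on such a small dataset, but it may be a more useful strategy on a larger dataset, using cross validation as the model evaluation scheme.

## Summary

In this post you discovered how to access and use feature importance in a trained XGBoost gradient boosting model.

Specifically, you learned:

* What feature importance is and, generally, how it is calculated in XGBoost.
* How to access and plot feature importance scores from an XGBoost model.
* How to use feature importance from an XGBoost model for feature selection.

Do you have any questions about feature importance in XGBoost or about this post? Ask your questions in the comments and I will do my best to answer them.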
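
As a small follow-up to the manual bar chart, which orders features by input index rather than by importance, the sketch below shows one way the scores could be sorted and labeled before plotting. It uses only NumPy and reuses the importance scores printed earlier in this post. The short feature labels and the `rank_features` helper are assumptions introduced for illustration; they are not part of XGBoost or of the original example.

```python
import numpy as np

# Importance scores copied from the printed output earlier in this post.
importances = np.array([0.089701, 0.17109634, 0.08139535, 0.04651163,
                        0.10465116, 0.2026578, 0.1627907, 0.14119601])

# Hypothetical short labels for the eight Pima input columns (F0 to F7),
# chosen here for readability; they are not read from the dataset file.
feature_names = ["pregnancies", "glucose", "blood_pressure", "skin_thickness",
                 "insulin", "bmi", "pedigree", "age"]

def rank_features(scores, names):
    """Return (name, score) pairs sorted from most to least important."""
    order = np.argsort(scores)[::-1]  # indices of scores in descending order
    return [(names[i], float(scores[i])) for i in order]

ranked = rank_features(importances, feature_names)
for name, score in ranked:
    print("%-15s %.3f" % (name, score))
```

The ranked pairs could then be passed to a bar chart in place of the raw array, so that the bars appear in importance order with readable labels.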