Stochastic Gradient Boosting with XGBoost and scikit-learn in Python · Machine Learning Mastery blog post translation

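# Stochastic Gradient Boosting with XGBoost and scikit-learn in Python

> Original: [https://machinelearningmastery.com/stochastic-gradient-boosting-xgboost-scikit-learn-python/](https://machinelearningmastery.com/stochastic-gradient-boosting-xgboost-scikit-learn-python/)

A simple technique for ensembling decision trees involves training trees on subsamples of the training dataset.

Subsets of the rows in the training data can be taken to train individual trees, which is called bagging. When subsets of the columns (features) are also taken when calculating each split point, this is called random forest.

These techniques can also be used in gradient tree boosting, in a variant called stochastic gradient boosting.

In this post you will discover stochastic gradient boosting and how to tune the sampling parameters using XGBoost with scikit-learn in Python.

After reading this post you will know:

* The rationale for training trees on subsamples of your data and how this can be used in gradient boosting.
* How to tune row-based subsampling in XGBoost using scikit-learn.
* How to tune column-based subsampling by both tree and split point in XGBoost.

Let's get started.

* **Update Jan/2017**: Updated to reflect changes in scikit-learn API version 0.18.1.

![Stochastic Gradient Boosting with XGBoost and scikit-learn in Python](https://img.kancloud.cn/12/22/1222fcdaf3e91b7ccbf5d1df927dc097_500x375.jpg)

Stochastic Gradient Boosting with XGBoost and scikit-learn in Python. Photo by [Henning Klokkeråsen](https://www.flickr.com/photos/photohenning/379603235/), some rights reserved.

## Stochastic Gradient Boosting

Gradient boosting is a greedy procedure.

New decision trees are added to the model to correct the residual error of the existing model.

Each decision tree is created using a greedy search procedure that selects the split points that best minimize an objective function. This can result in trees that use the same attributes, and even the same split points, again and again.

[Bagging](http://machinelearningmastery.com/bagging-and-random-forest-ensemble-algorithms-for-machine-learning/) is a technique where a collection of decision trees is created, each from a different random subset of rows from the training data. The effect is that better performance is extracted from the ensemble of trees, because the randomness in the sample allows slightly different trees to be created, adding variance among the trees whose predictions are combined.

Random forest takes this one step further by allowing the features (columns) to be subsampled when choosing split points, adding further variance to the ensemble of trees.

These same techniques can be used in the construction of decision trees in gradient boosting, in a variation called stochastic gradient boosting.

It is common to use aggressive subsamples of the training data, such as 40% to 80%.

## Tutorial Overview

In this tutorial we are going to look at the effect of different subsampling techniques in gradient boosting.

We will tune three different flavors of stochastic gradient boosting supported by the XGBoost library in Python, specifically:

1. Subsampling of rows in the dataset when creating each tree.
2. Subsampling of columns in the dataset when creating each tree.
3. Subsampling of columns for each split in the dataset when creating each tree.

## Problem Description: Otto Dataset

In this tutorial we will use the [Otto Group Product Classification Challenge](https://www.kaggle.com/c/otto-group-product-classification-challenge) dataset.

This dataset is available for free from Kaggle (you will need to sign up to Kaggle to be able to download it). You can download the training dataset **train.csv.zip** from the [Data page](https://www.kaggle.com/c/otto-group-product-classification-challenge/data) and place the unzipped **train.csv** file into your working directory.

The dataset describes 93 obfuscated features of more than 61,000 products grouped into nine product categories (e.g. fashion, electronics, etc.). The input attributes are counts of different events of some kind.

The goal is to make predictions for new products as a set of probabilities, one for each category, and models are evaluated using multiclass logarithmic loss (also called cross entropy).

This competition was completed in May 2015 and the dataset is a good challenge for XGBoost because of the nontrivial number of examples, the difficulty of the problem, and the fact that little data preparation is required (other than encoding the string class variable as integers).

## Tuning Row Subsampling in XGBoost

Row subsampling involves selecting a random sample of the training dataset without replacement.

Row subsampling can be specified in the scikit-learn wrapper of the XGBoost class with the **subsample** parameter. The default is 1.0, which means no subsampling.

We can use the grid search capability built into scikit-learn to evaluate the effect of different subsample values from 0.1 to 1.0 on the Otto dataset.

```py
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
```

There are 9 variations of subsample and each model will be evaluated using 10-fold cross validation, meaning that 9×10 or 90 models need to be trained and tested.

The full code listing is provided below.

```py
# XGBoost on Otto dataset, tune subsample
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X and y
# note: this slice keeps column 0 (the id column) alongside the 93 feature columns
X = dataset[:,0:94]
y = dataset[:,94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = XGBClassifier()
subsample = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
param_grid = dict(subsample=subsample)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
# plot
pyplot.errorbar(subsample, means, yerr=stds)
pyplot.title("XGBoost subsample vs Log Loss")
pyplot.xlabel('subsample')
pyplot.ylabel('Log Loss')
pyplot.savefig('subsample.png')
```

Running this example prints the best configuration as well as the log loss for each tested configuration. The scores come from the `neg_log_loss` scorer, so they are negated and values closer to zero are better.

We can see that the best result was achieved with a value of 0.3, that is, training each tree on a 30% sample of the training dataset.

```py
Best: -0.000647 using {'subsample': 0.3}
-0.001156 (0.000286) with: {'subsample': 0.1}
-0.000765 (0.000430) with: {'subsample': 0.2}
-0.000647 (0.000471) with: {'subsample': 0.3}
-0.000659 (0.000635) with: {'subsample': 0.4}
-0.000717 (0.000849) with: {'subsample': 0.5}
-0.000773 (0.000998) with: {'subsample': 0.6}
-0.000877 (0.001179) with: {'subsample': 0.7}
-0.001007 (0.001371) with: {'subsample': 0.8}
-0.001239 (0.001730) with: {'subsample': 1.0}
```

We can plot the mean and standard deviation of these log loss values to get a better understanding of how performance varies with the subsample value.

![Plot of Tuning Row Sample Rate in XGBoost](https://img.kancloud.cn/38/ac/38acbb0ffa6e6c58937e0f55fd91b1e5_800x600.jpg)

Plot of Tuning Row Sample Rate in XGBoost

We can see that 30% does have the best mean performance, but we can also see that the variance in performance grows markedly as the ratio increases.

It is interesting to note that the mean performance of every smaller **subsample** value is better than the mean performance without subsampling (**subsample=1.0**).

## Tuning Column Subsampling in XGBoost By Tree

We can also create a random sample of the features (or columns) to use prior to creating each decision tree in the boosted model.

In the XGBoost wrapper for scikit-learn, this is controlled by the **colsample_bytree** parameter.

The default value is 1.0, meaning that all columns are used in each decision tree. We can evaluate values for **colsample_bytree** between 0.1 and 1.0 in increments of 0.1 (the list below omits 0.9).

```py
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
```

The full code listing is provided below.

```py
# XGBoost on Otto dataset, tune colsample_bytree
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X and y
# note: this slice keeps column 0 (the id column) alongside the 93 feature columns
X = dataset[:,0:94]
y = dataset[:,94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = XGBClassifier()
colsample_bytree = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
param_grid = dict(colsample_bytree=colsample_bytree)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
# plot
pyplot.errorbar(colsample_bytree, means, yerr=stds)
pyplot.title("XGBoost colsample_bytree vs Log Loss")
pyplot.xlabel('colsample_bytree')
pyplot.ylabel('Log Loss')
pyplot.savefig('colsample_bytree.png')
```

Running this example prints the best configuration as well as the log loss for each tested configuration.

We can see that the best performance for the model was achieved with **colsample_bytree=1.0**. This suggests that subsampling columns per tree does not add value on this problem.

```py
Best: -0.001239 using {'colsample_bytree': 1.0}
-0.298955 (0.002177) with: {'colsample_bytree': 0.1}
-0.092441 (0.000798) with: {'colsample_bytree': 0.2}
-0.029993 (0.000459) with: {'colsample_bytree': 0.3}
-0.010435 (0.000669) with: {'colsample_bytree': 0.4}
-0.004176 (0.000916) with: {'colsample_bytree': 0.5}
-0.002614 (0.001062) with: {'colsample_bytree': 0.6}
-0.001694 (0.001221) with: {'colsample_bytree': 0.7}
-0.001306 (0.001435) with: {'colsample_bytree': 0.8}
-0.001239 (0.001730) with: {'colsample_bytree': 1.0}
```

Plotting the results, we can see the performance of the model plateau (at least at this scale) for values between 0.5 and 1.0.

![Plot of Tuning Per-Tree Column Sampling in XGBoost](https://img.kancloud.cn/02/d1/02d15065db9e1f37952d4c367dfe3b16_800x600.jpg)

Plot of Tuning Per-Tree Column Sampling in XGBoost

## Tuning Column Subsampling in XGBoost By Split

Rather than subsample the columns once for each tree, we can subsample them at each split in the decision tree. In principle, this is the approach used in random forest.

We can set the size of the sample of columns used at each split with the **colsample_bylevel** parameter in the XGBoost wrapper classes for scikit-learn. Note that XGBoost's parameter documentation describes **colsample_bylevel** as drawing a new column sample once for each depth level reached in a tree, which is slightly coarser than a fresh sample at every split.

As before, we will vary the ratio from 10% to the default of 100%.

The full code listing is provided below.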
```py
# XGBoost on Otto dataset, tune colsample_bylevel
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X and y
# note: this slice keeps column 0 (the id column) alongside the 93 feature columns
X = dataset[:,0:94]
y = dataset[:,94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = XGBClassifier()
colsample_bylevel = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
param_grid = dict(colsample_bylevel=colsample_bylevel)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
# plot
pyplot.errorbar(colsample_bylevel, means, yerr=stds)
pyplot.title("XGBoost colsample_bylevel vs Log Loss")
pyplot.xlabel('colsample_bylevel')
pyplot.ylabel('Log Loss')
pyplot.savefig('colsample_bylevel.png')
```

Running this example prints the best configuration as well as the log loss for each tested configuration.
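The scores in these printouts are negative because the listings request scikit-learn's `neg_log_loss` scorer, which negates log loss so that a larger score is always better. As a rough illustration of the quantity being negated, the plain-Python sketch below computes multiclass log loss by hand. The helper name `multiclass_log_loss`, the clipping constant, and the three labels and probability rows are all made up for illustration; none of them come from the Otto data or from scikit-learn's own implementation.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        # clip the probability so that log(0) can never occur
        p = min(max(row[label], eps), 1.0 - eps)
        total += -math.log(p)
    return total / len(y_true)

# hypothetical integer-encoded labels and predicted class probabilities
y_true = [0, 2, 1]
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]

loss = multiclass_log_loss(y_true, probs)
neg_loss = -loss  # the sign convention used when a scorer must be maximized

print("log loss: %f, negated: %f" % (loss, neg_loss))
```

A perfect model would score 0 under either convention, so when reading the printed tables the configuration with the score closest to zero is the best one.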
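We can see that the best result was achieved by setting **colsample_bylevel** to 70%, giving an (inverted) log loss of -0.001062, which is better than the -0.001239 seen when the per-tree column sampling was set to 100%.

This suggests that you should not give up on column subsampling when the per-tree results favor using 100% of the columns; try per-split column subsampling instead.

```py
Best: -0.001062 using {'colsample_bylevel': 0.7}
-0.159455 (0.007028) with: {'colsample_bylevel': 0.1}
-0.034391 (0.003533) with: {'colsample_bylevel': 0.2}
-0.007619 (0.000451) with: {'colsample_bylevel': 0.3}
-0.002982 (0.000726) with: {'colsample_bylevel': 0.4}
-0.001410 (0.000946) with: {'colsample_bylevel': 0.5}
-0.001182 (0.001144) with: {'colsample_bylevel': 0.6}
-0.001062 (0.001221) with: {'colsample_bylevel': 0.7}
-0.001071 (0.001427) with: {'colsample_bylevel': 0.8}
-0.001239 (0.001730) with: {'colsample_bylevel': 1.0}
```

We can plot the performance of each **colsample_bylevel** variation. The results show relatively low variance and, at this scale, what looks like a plateau in performance after a value of 0.3.

![Plot of Tuning Per-Split Column Sampling in XGBoost](https://img.kancloud.cn/57/b6/57b610d8ce4185f7ada0ca71c00f7c22_800x600.jpg)

Plot of Tuning Per-Split Column Sampling in XGBoost

## Summary

In this post you discovered stochastic gradient boosting with XGBoost in Python.

Specifically, you learned:

* About stochastic boosting and how you can subsample your training data to improve the generalization of your model.
* How to tune row subsampling with XGBoost in Python and scikit-learn.
* How to tune column subsampling with XGBoost both per tree and per split.

Do you have any questions about stochastic gradient boosting or about this post? Ask your questions in the comments and I will do my best to answer.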
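To make the three sampling styles covered in this post a little more concrete, here is a minimal sketch in plain Python of which rows and columns a single tree might see under each setting. The helper `sample_indices`, the toy sizes, and the ratios are hypothetical choices for illustration. The sketch follows XGBoost's documented behavior in which `colsample_bylevel` draws once per depth level from the columns already chosen for the tree, but it only mimics the idea of sampling without replacement and is not XGBoost's internal implementation.

```python
import random

def sample_indices(n_items, ratio, rng):
    """Pick round(n_items * ratio) distinct indices (at least one), sorted."""
    k = max(1, int(round(n_items * ratio)))
    return sorted(rng.sample(range(n_items), k))

rng = random.Random(7)   # fixed seed so repeated runs pick the same indices
n_rows, n_cols = 10, 8   # toy dataset shape, chosen only for illustration

# subsample=0.3: the tree is fit on a random 30% of rows, without replacement
rows = sample_indices(n_rows, 0.3, rng)

# colsample_bytree=0.5: columns are drawn once for the whole tree
tree_cols = sample_indices(n_cols, 0.5, rng)

# colsample_bylevel=0.5: a further draw from the tree's columns at each depth
level_cols = []
for depth in range(3):
    picked = sample_indices(len(tree_cols), 0.5, rng)
    level_cols.append([tree_cols[i] for i in picked])

print("rows:", rows)
print("tree columns:", tree_cols)
for depth, cols in enumerate(level_cols):
    print("depth", depth, "columns:", cols)
```

Because the draws nest, combining a per-tree ratio with a per-level ratio shrinks the candidate columns multiplicatively, which is worth keeping in mind when tuning the two parameters together.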