
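# How to Tune the Number and Size of Decision Trees with XGBoost in Python

> Original: [https://machinelearningmastery.com/tune-number-size-decision-trees-xgboost-python/](https://machinelearningmastery.com/tune-number-size-decision-trees-xgboost-python/)

Gradient boosting involves creating and adding decision trees sequentially, each attempting to correct the mistakes of the learners that came before it.

This raises the question of how many trees (weak learners or estimators) to configure in your gradient boosting model and how big each tree should be.

In this post you will discover how to design a systematic experiment to select the number and size of decision trees for your problem.

After reading this post you will know:

* How to evaluate the effect of adding more decision trees to your XGBoost model.
* How to evaluate the effect of creating larger decision trees for your XGBoost model.
* How to investigate the relationship between the number and depth of trees on your problem.

Let's get started.

* **Update Jan/2017**: Updated to reflect changes in scikit-learn API version 0.18.1.

![How to Tune the Number and Size of Decision Trees with XGBoost in Python](https://img.kancloud.cn/b7/64/b76453df69900112cd27035e93638b7a_640x427.jpg)

How to Tune the Number and Size of Decision Trees with XGBoost in Python. Photo by [USFWSmidwest](https://www.flickr.com/photos/usfwsmidwest/15857830226/), some rights reserved.

## Problem Description: Otto Dataset

In this tutorial we will use the [Otto Group Product Classification Challenge](https://www.kaggle.com/c/otto-group-product-classification-challenge) dataset.

This dataset is available for free from Kaggle (you will need to sign up to Kaggle to download it). You can download the training dataset **train.csv.zip** from the [Data page](https://www.kaggle.com/c/otto-group-product-classification-challenge/data) and place the unzipped **train.csv** file into your working directory.

The dataset describes 93 obfuscated details of more than 61,000 products grouped into nine product categories (e.g. fashion, electronics, etc.). The input attributes are counts of different events of some kind.

The goal is to make predictions for new products as an array of probabilities, one for each of the nine categories. Models are evaluated using multiclass logarithmic loss (also called cross entropy).

The competition was completed in May 2015. The dataset is a good challenge for XGBoost because of the large number of examples, the difficulty of the problem, and the fact that little data preparation is required (other than encoding the string class variables as integers).

## Tune the Number of Decision Trees in XGBoost

Most implementations of gradient boosting are configured by default with a relatively small number of trees, such as hundreds or thousands.

The general reason is that on most problems, adding more trees beyond a limit does not improve the performance of the model.

This follows from the way the boosted tree model is constructed: each new tree is added sequentially and attempts to model and correct the errors made by the sequence of previous trees. Quickly, the model reaches a point of diminishing returns.

We can demonstrate this point of diminishing returns easily on the Otto dataset.

The number of trees (or boosting rounds) in an XGBoost model is specified to the XGBClassifier or XGBRegressor class in the **n_estimators** argument. The default in the XGBoost library is 100.

Using scikit-learn we can perform a grid search of the **n_estimators** model parameter, evaluating a series of values from 50 to 350 with a step size of 50 (50, 100, 150, 200, 250, 300, 350).

```py
# grid search
model = XGBClassifier()
n_estimators = range(50, 400, 50)
param_grid = dict(n_estimators=n_estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
result = grid_search.fit(X, label_encoded_y)
```

We can perform this grid search on the Otto dataset using 10-fold cross validation, which requires 70 models to be trained (7 configurations * 10 folds).

The full code listing is provided below for completeness.

```py
# XGBoost on Otto dataset, Tune n_estimators
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X and y
# NOTE: columns 0..93 are the id column plus the 93 features, as in the
# original post; use dataset[:,1:94] to leave the id out of the inputs.
# The scores printed below come from the listing as written.
X = dataset[:,0:94]
y = dataset[:,94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = XGBClassifier()
n_estimators = range(50, 400, 50)
param_grid = dict(n_estimators=n_estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
# plot
pyplot.errorbar(n_estimators, means, yerr=stds)
pyplot.title("XGBoost n_estimators vs Log Loss")
pyplot.xlabel('n_estimators')
pyplot.ylabel('Log Loss')
pyplot.savefig('n_estimators.png')
```

Running this example prints the following results.

```py
Best: -0.001152 using {'n_estimators': 250}
-0.010970 (0.001083) with: {'n_estimators': 50}
-0.001239 (0.001730) with: {'n_estimators': 100}
-0.001163 (0.001715) with: {'n_estimators': 150}
-0.001153 (0.001702) with: {'n_estimators': 200}
-0.001152 (0.001702) with: {'n_estimators': 250}
-0.001152 (0.001704) with: {'n_estimators': 300}
-0.001153 (0.001706) with: {'n_estimators': 350}
```

We can see that the cross validation log loss scores are negative. This is because the scikit-learn cross validation framework inverts them. Internally, the framework requires that every metric being optimized is maximized, whereas log loss is a metric to be minimized. Negating the score turns it into a quantity to maximize.

The best number of trees was **n_estimators = 250**, resulting in a log loss of 0.001152, but this is not significantly different from **n_estimators = 200**. In fact, if we plot the results, there is little relative difference between any of the tree counts from 100 to 350.

The line plot below shows the relationship between the number of trees and the mean (inverted) log loss, with the standard deviation shown as error bars.

![Tune The Number of Trees in XGBoost](https://img.kancloud.cn/fa/cc/faccd94c383f0510cb18692f821fe46d_800x600.jpg)

Tune The Number of Trees in XGBoost

## Tune the Size of Decision Trees in XGBoost

In gradient boosting, we can control the size of the decision trees, also called the number of layers or the depth.

Shallow trees are expected to perform poorly on their own because they capture few details of the problem, and they are generally referred to as weak learners. Deeper trees generally capture too many details of the problem and overfit the training dataset, limiting the ability to make good predictions on new data.

Generally, boosting algorithms are configured with weak learners: decision trees with few layers, sometimes as simple as just a root node, also called a decision stump rather than a decision tree.

The maximum depth can be specified in the **XGBClassifier** and **XGBRegressor** wrapper classes for XGBoost in the **max_depth** parameter. This parameter takes an integer value and defaulted to 3 in the XGBoost version used for this post (more recent releases default to 6).

```py
model = XGBClassifier(max_depth=3)
```

We can tune this hyperparameter of XGBoost using the grid search infrastructure in scikit-learn on the Otto dataset. Below we evaluate odd values for **max_depth** between 1 and 9 (1, 3, 5, 7, 9).

Each of the 5 configurations is evaluated using 10-fold cross validation, resulting in 50 models being constructed. The full code listing is provided below for completeness.

```py
# XGBoost on Otto dataset, Tune max_depth
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X and y
X = dataset[:,0:94]
y = dataset[:,94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = XGBClassifier()
max_depth = range(1, 11, 2)
print(max_depth)
param_grid = dict(max_depth=max_depth)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold, verbose=1)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
# plot
pyplot.errorbar(max_depth, means, yerr=stds)
pyplot.title("XGBoost max_depth vs Log Loss")
pyplot.xlabel('max_depth')
pyplot.ylabel('Log Loss')
pyplot.savefig('max_depth.png')
```

Running this example prints the log loss for each **max_depth**.

The best configuration was **max_depth = 5**, resulting in a log loss of 0.001236.

```py
Best: -0.001236 using {'max_depth': 5}
-0.026235 (0.000898) with: {'max_depth': 1}
-0.001239 (0.001730) with: {'max_depth': 3}
-0.001236 (0.001701) with: {'max_depth': 5}
-0.001237 (0.001701) with: {'max_depth': 7}
-0.001237 (0.001701) with: {'max_depth': 9}
```

Reviewing the plot of log loss scores, we can see a marked jump from **max_depth = 1** to **max_depth = 3**, then fairly even performance for the remaining values of **max_depth**.

Although the best score was observed for **max_depth = 5**, it is worth noting that there is practically no difference between using **max_depth = 3** and **max_depth = 7**.

This suggests a point of diminishing returns in **max_depth** on a problem, which you can tease out using grid search. The values of **max_depth** are plotted against the (inverted) log loss below.

![Tune Max Tree Depth in XGBoost](https://img.kancloud.cn/00/a4/00a457f970bda15cbe20e8076f80ebf6_800x600.jpg)

Tune Max Tree Depth in XGBoost

## Tune the Number of Trees and Max Depth in XGBoost

There is a relationship between the number of trees in the model and the depth of each tree.

We would expect that deeper trees would result in fewer trees being required in the model, and conversely that simpler trees (such as decision stumps) would require many more trees to achieve similar results.

We can investigate this relationship by evaluating a grid of **n_estimators** and **max_depth** configuration values. To keep the evaluation from taking too long, we will limit the total number of configuration values evaluated. The parameters were chosen to tease out the relationship rather than to optimize the model.

We will create a grid of 4 different **n_estimators** values (50, 100, 150, 200) and 4 different **max_depth** values (2, 4, 6, 8), and each combination will be evaluated using 10-fold cross validation. A total of 4 * 4 * 10, or 160, models will be trained and evaluated.

The full code listing is provided below.

```py
# XGBoost on Otto dataset, Tune n_estimators and max_depth
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
import numpy
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X and y
X = dataset[:,0:94]
y = dataset[:,94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = XGBClassifier()
n_estimators = [50, 100, 150, 200]
max_depth = [2, 4, 6, 8]
print(max_depth)
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold, verbose=1)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
# plot results: one row of scores per max_depth, one column per n_estimators
scores = numpy.array(means).reshape(len(max_depth), len(n_estimators))
for i, value in enumerate(max_depth):
    pyplot.plot(n_estimators, scores[i], label='depth: ' + str(value))
pyplot.legend()
pyplot.xlabel('n_estimators')
pyplot.ylabel('Log Loss')
pyplot.savefig('n_estimators_vs_max_depth.png')
```

Running the code produces a listing of the log loss for each parameter pair.

```py
Best: -0.001141 using {'n_estimators': 200, 'max_depth': 4}
-0.012127 (0.001130) with: {'n_estimators': 50, 'max_depth': 2}
-0.001351 (0.001825) with: {'n_estimators': 100, 'max_depth': 2}
-0.001278 (0.001812) with: {'n_estimators': 150, 'max_depth': 2}
-0.001266 (0.001796) with: {'n_estimators': 200, 'max_depth': 2}
-0.010545 (0.001083) with: {'n_estimators': 50, 'max_depth': 4}
-0.001226 (0.001721) with: {'n_estimators': 100, 'max_depth': 4}
-0.001150 (0.001704) with: {'n_estimators': 150, 'max_depth': 4}
-0.001141 (0.001693) with: {'n_estimators': 200, 'max_depth': 4}
-0.010341 (0.001059) with: {'n_estimators': 50, 'max_depth': 6}
-0.001237 (0.001701) with: {'n_estimators': 100, 'max_depth': 6}
-0.001163 (0.001688) with: {'n_estimators': 150, 'max_depth': 6}
-0.001154 (0.001679) with: {'n_estimators': 200, 'max_depth': 6}
-0.010342 (0.001059) with: {'n_estimators': 50, 'max_depth': 8}
-0.001237 (0.001701) with: {'n_estimators': 100, 'max_depth': 8}
-0.001161 (0.001688) with: {'n_estimators': 150, 'max_depth': 8}
-0.001153 (0.001679) with: {'n_estimators': 200, 'max_depth': 8}
```

We can see that the best result was achieved with **n_estimators = 200** and **max_depth = 4**, similar to the best values found in the previous two rounds of standalone parameter tuning (**n_estimators = 250**, **max_depth = 5**).

We can plot the relationship between each series of **max_depth** values for a given **n_estimators**.

![Tune The Number of Trees and Max Tree Depth in XGBoost](https://img.kancloud.cn/f0/17/f017be65ede5f40d8da6a67b85cec3a8_800x600.jpg)

Tune The Number of Trees and Max Tree Depth in XGBoost

The lines overlap, making it hard to see the relationship, but generally we can see the interaction we expect: fewer trees are required as the tree depth is increased.

Further, we would expect the increased complexity provided by deeper individual trees to result in greater overfitting of the training data, which would be exacerbated by having more trees, in turn resulting in a lower cross validation score. We don't see this here because our trees are not that deep and we do not have that many of them. Exploring this expectation is left as an exercise you could try yourself.

## Summary

In this post, you discovered how to tune the number and depth of decision trees when using gradient boosting with XGBoost in Python.

Specifically, you learned:

* How to tune the number of decision trees in an XGBoost model.
* How to tune the depth of decision trees in an XGBoost model.
* How to jointly tune the number of trees and tree depth in an XGBoost model.

Do you have any questions about the number or size of decision trees in your gradient boosting model or about this post? Ask your questions in the comments and I will do my best to answer.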
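
The observation that the scores for 100 to 350 trees are practically indistinguishable can be turned into a simple selection rule. The sketch below is an illustration rather than part of the original experiments: it copies the `n_estimators` results printed in this post into a list and picks the smallest setting whose mean score lies within one standard deviation of the best mean. The helper name `smallest_within_one_std` and the one-standard-deviation threshold are assumptions made for this example; other tolerances are equally defensible.

```python
# Results copied from the n_estimators grid search output in this post:
# (n_estimators, mean neg_log_loss, std across the 10 folds)
results = [
    (50, -0.010970, 0.001083),
    (100, -0.001239, 0.001730),
    (150, -0.001163, 0.001715),
    (200, -0.001153, 0.001702),
    (250, -0.001152, 0.001702),
    (300, -0.001152, 0.001704),
    (350, -0.001153, 0.001706),
]


def smallest_within_one_std(rows):
    """Hypothetical helper: return the smallest setting whose mean score is
    within one standard deviation of the best mean (scores are maximized)."""
    _, best_mean, best_std = max(rows, key=lambda row: row[1])
    threshold = best_mean - best_std
    candidates = [value for value, mean, _ in rows if mean >= threshold]
    return min(candidates)


chosen = smallest_within_one_std(results)
for value, mean, std in results:
    # scikit-learn reports negated log loss; flip the sign to read it as a loss
    print("n_estimators=%d log loss=%.6f (+/- %.6f)" % (value, -mean, std))
print("Smallest n_estimators within one std of the best:", chosen)
```

A rule like this favors the cheaper model when the extra trees do not buy a measurable improvement. The same function can be applied to the **max_depth** results by swapping in those rows.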