How to Best Tune Multithreading Support for XGBoost in Python · Machine Learning Mastery blog post translation

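# How to Best Tune Multithreading Support for XGBoost in Python

> Original: [https://machinelearningmastery.com/best-tune-multithreading-support-xgboost-python/](https://machinelearningmastery.com/best-tune-multithreading-support-xgboost-python/)

The XGBoost library for gradient boosting is designed for efficient multi-core parallel processing. This allows it to use all of the CPU cores in your system during training. In this post you will discover the parallel processing capabilities of XGBoost in Python. After reading this post you will know:

* How to confirm that XGBoost multithreading support is working on your system.
* How to evaluate the effect of increasing the number of threads on XGBoost.
* How to get the most out of multithreaded XGBoost when using cross validation and grid search.

Let's get started.

* **Update Jan/2017**: Updated to reflect changes in scikit-learn API version 0.18.1.

![How to Best Tune Multithreading Support for XGBoost in Python](https://img.kancloud.cn/0e/3c/0e3cda30896803da288ae6ffc2e8d4a7_640x417.jpg)

How to Best Tune Multithreading Support for XGBoost in Python. Photo by [Nicholas A. Tonelli](https://www.flickr.com/photos/nicholas_t/14946860658/), some rights reserved.

## Problem Description: Otto Dataset

In this tutorial we will use the [Otto Group Product Classification Challenge](https://www.kaggle.com/c/otto-group-product-classification-challenge) dataset. The dataset is available from Kaggle (you will need to sign up to Kaggle to be able to download it). Download the training dataset **train.zip** from the [Data page](https://www.kaggle.com/c/otto-group-product-classification-challenge/data) and place the unzipped **train.csv** file in your working directory.

The dataset describes 93 obfuscated details of more than 61,000 products. The products are grouped into 9 categories (for example fashion, electronics and so on). The input attributes are counts of different events of some kind. The goal is to make predictions for new products as an array of probabilities, one for each of the 9 categories. Models are evaluated using multiclass logarithmic loss (also called cross entropy).

The competition was completed in May 2015. The dataset is a good challenge for XGBoost because of the nontrivial number of examples, the difficulty of the problem, and the fact that little data preparation is required (other than encoding the string class variables as integers).

## Impact of the Number of Threads

XGBoost is implemented in C++ and explicitly uses the [OpenMP API](https://en.wikipedia.org/wiki/OpenMP) for parallel processing. The parallelism in gradient boosting is applied to the construction of individual trees, rather than to creating trees in parallel as in random forest. This is because in boosting, trees are added to the model sequentially. The speed of XGBoost comes both from adding parallelism to the construction of individual trees and from the efficient preparation of the input data, which helps speed up tree construction.

Depending on your platform, you may need to compile XGBoost specially to support multithreading. See the [XGBoost installation instructions](https://github.com/dmlc/xgboost/blob/master/doc/build.md) for details.

The **XGBClassifier** and **XGBRegressor** wrapper classes, which provide XGBoost for use with scikit-learn, offer the **nthread** parameter to specify the number of threads that XGBoost can use during training. By default, this parameter is set to -1 to use all of the cores in your system. (More recent XGBoost releases expose the same setting on these wrappers as **n_jobs**.)

```py
model = XGBClassifier(nthread=-1)
```

Generally, you should get multithreading support for your XGBoost installation without any extra work. Depending on your Python environment (for example Python 3), you may need to explicitly enable multithreading support for XGBoost. [The XGBoost library provides an example](https://github.com/dmlc/xgboost/blob/master/demo/guide-python/sklearn_parallel.py) if you need help.

You can confirm that XGBoost multithreading support is working by building a number of different XGBoost models, specifying the number of threads for each, and timing how long it takes to build each model. The trend will show you that multithreading support is enabled and indicate its effect on model build time. For example, if your system has 4 cores, you can train a separate model for each thread count from 1 to 4, time how long it takes in seconds to create each model, and then compare the times.

```py
# evaluate the effect of the number of threads
# (X_train and y_train are assumed to be loaded already)
import time
from xgboost import XGBClassifier

results = []
num_threads = [1, 2, 3, 4]
for n in num_threads:
    start = time.time()
    model = XGBClassifier(nthread=n)
    model.fit(X_train, y_train)
    elapsed = time.time() - start
    print(n, elapsed)
    results.append(elapsed)
```

We can use this approach on the Otto dataset. The full example is provided below for completeness. You can change the **num_threads** array to match the number of cores on your system.

```py
# Otto, tune number of threads
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder
import time
from matplotlib import pyplot

# load data
data = read_csv('train.csv')
dataset = data.values

# split data into X and y (note: column 0 of train.csv is the row id)
X = dataset[:,0:94]
y = dataset[:,94]

# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)

# evaluate the effect of the number of threads
results = []
num_threads = [1, 2, 3, 4]
for n in num_threads:
    start = time.time()
    model = XGBClassifier(nthread=n)
    model.fit(X, label_encoded_y)
    elapsed = time.time() - start
    print(n, elapsed)
    results.append(elapsed)

# plot results
pyplot.plot(num_threads, results)
pyplot.ylabel('Speed (seconds)')
pyplot.xlabel('Number of Threads')
pyplot.title('XGBoost Training Speed vs Number of Threads')
pyplot.show()
```

Running this example records the training time in seconds for each configuration, for example:

```py
(1, 115.51652717590332)
(2, 62.7727689743042)
(3, 46.042901039123535)
(4, 40.55334496498108)
```

A plot of these timings is provided below.

![XGBoost Tune Number of Threads for Single Model](https://img.kancloud.cn/6c/38/6c38549ccc02ff11e9628fb5ad0f910f_800x600.jpg)

XGBoost Tune Number of Threads for Single Model

We can see a clear trend of decreasing execution time as the number of threads increases. If you do not see an improvement in running time for each new thread, you may want to investigate how to enable multithreading support in XGBoost as part of your install or at runtime.

We can run the same code on a machine with many more cores. For example, a large Amazon Web Services EC2 instance has 32 cores. We can adapt the code above to time how long it takes to train the model with 1 to 32 cores. The results are plotted below.

![XGBoost Time to Train Model on 1 to 32 Cores](https://img.kancloud.cn/3e/e6/3ee60cfcec5432309ddf330db330780e_800x600.jpg)

XGBoost Time to Train Model on 1 to 32 Cores

It is interesting to note that we do not see much improvement beyond 16 threads (at about 7 seconds). I expect the reason is that the Amazon instance only provides 16 cores in hardware and the additional 16 cores are made available by hyperthreading. The results suggest that if you have a machine with hyperthreading, you may want to set **nthread** equal to the number of physical CPU cores in your machine.

The low-level optimized implementation of XGBoost with OpenMP squeezes every last cycle out of a large machine like this.

## Parallelism When Cross Validating XGBoost Models

The k-fold cross validation support in scikit-learn also supports multithreading. For example, when evaluating a model on a dataset with k-fold cross validation, the **n_jobs** argument of the **cross_val_score()** function lets you specify the number of parallel jobs to run. By default this is set to 1, but it can be set to -1 to use all of the CPU cores on your system, which is good practice. For example:

```py
results = cross_val_score(model, X, label_encoded_y, cv=kfold, scoring='neg_log_loss', n_jobs=-1, verbose=1)
```

This raises the question of how cross validation should be configured:

* Disable multithreading support in XGBoost and allow cross validation to run on all cores.
* Disable multithreading support in cross validation and allow XGBoost to run on all cores.
* Enable multithreading support in both XGBoost and cross validation.

We can answer this question by simply timing how long it takes to evaluate the model in each case. In the example below we use 10-fold cross validation to evaluate the default XGBoost model on the Otto training dataset. Each of the above scenarios is evaluated and the time taken is reported. The full code example is provided below.

```py
# Otto, parallel cross validation
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
import time

# load data
data = read_csv('train.csv')
dataset = data.values

# split data into X and y (note: column 0 of train.csv is the row id)
X = dataset[:,0:94]
y = dataset[:,94]

# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)

# prepare cross validation
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)

# Single Thread XGBoost, Parallel Thread CV
start = time.time()
model = XGBClassifier(nthread=1)
results = cross_val_score(model, X, label_encoded_y, cv=kfold, scoring='neg_log_loss', n_jobs=-1)
elapsed = time.time() - start
print("Single Thread XGBoost, Parallel Thread CV: %f" % (elapsed))

# Parallel Thread XGBoost, Single Thread CV
start = time.time()
model = XGBClassifier(nthread=-1)
results = cross_val_score(model, X, label_encoded_y, cv=kfold, scoring='neg_log_loss', n_jobs=1)
elapsed = time.time() - start
print("Parallel Thread XGBoost, Single Thread CV: %f" % (elapsed))

# Parallel Thread XGBoost and CV
start = time.time()
model = XGBClassifier(nthread=-1)
results = cross_val_score(model, X, label_encoded_y, cv=kfold, scoring='neg_log_loss', n_jobs=-1)
elapsed = time.time() - start
print("Parallel Thread XGBoost and CV: %f" % (elapsed))
```

Running the example prints the following results:

```py
Single Thread XGBoost, Parallel Thread CV: 359.854589
Parallel Thread XGBoost, Single Thread CV: 330.498101
Parallel Thread XGBoost and CV: 313.382301
```

We can see that parallelizing XGBoost gives a benefit over parallelizing the cross validation folds. This makes sense, as 10 sequential fast tasks are better than (10 divided by num_cores) slow tasks.

Interestingly, the best result was achieved by enabling multithreading in both XGBoost and cross validation. This is surprising, because it means that num_cores parallel XGBoost models are competing for the same num_cores when building their models. Nevertheless, this gives the fastest result and is the suggested way to use XGBoost for cross validation.

Because grid search uses the same underlying approach to parallelism, we expect the same finding to hold when optimizing the hyperparameters of XGBoost.

## Summary

In this post you discovered the multithreading capability of XGBoost. You learned:

* How to check that multithreading support in XGBoost is enabled on your system.
* How increasing the number of threads affects the performance of training XGBoost models.
* How to best configure XGBoost and cross validation in Python for minimum running time.

Do you have any questions about multithreading support in XGBoost or about this post? Ask your questions in the comments and I will do my best to answer.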
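
As a small worked example, the single-model timings reported earlier in this post can be turned into speedup and parallel efficiency figures, which makes the diminishing return per added thread easier to read than raw seconds. This is only a sketch: the helper name `speedup_table` is hypothetical (it is not part of XGBoost or scikit-learn), and the timings are the ones printed in the post, so your own numbers will differ.

```python
# Timings in seconds, copied from the single-model run reported in this post.
REPORTED_TIMES = {
    1: 115.51652717590332,
    2: 62.7727689743042,
    3: 46.042901039123535,
    4: 40.55334496498108,
}


def speedup_table(times):
    """Return (threads, seconds, speedup, efficiency) rows sorted by threads.

    speedup    = single-thread time / time with n threads
    efficiency = speedup / n  (1.0 would mean perfect scaling)

    Hypothetical helper for illustration; expects a key for 1 thread.
    """
    baseline = times[1]  # the single-thread run is the reference point
    rows = []
    for n in sorted(times):
        speedup = baseline / times[n]
        rows.append((n, times[n], speedup, speedup / n))
    return rows


for n, seconds, speedup, efficiency in speedup_table(REPORTED_TIMES):
    print("%d threads: %.1fs, speedup %.2fx, efficiency %.0f%%"
          % (n, seconds, speedup, 100 * efficiency))
```

With the timings above, four threads give roughly a 2.85x speedup over one thread, or about 71% efficiency, so each extra thread helps but by less than the one before it. You can replace `REPORTED_TIMES` with the `num_threads` and `results` values collected on your own machine.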