A Gentle Introduction to XGBoost for Applied Machine Learning · Translation of a Machine Learning Mastery blog post

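# A Gentle Introduction to XGBoost for Applied Machine Learning

> Original: [https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/](https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/)

XGBoost is an algorithm library that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data. It is an implementation of gradient boosted decision trees designed for speed and performance.

In this post you will get a gentle introduction to XGBoost: what it is, where it came from, and how you can learn more about it. After reading this post you will know:

* What XGBoost is and the goals of the project.
* Why XGBoost must be a part of your machine learning toolkit.
* Where you can learn more to start using XGBoost on your next machine learning project.

Let's get started.

![A Gentle Introduction to XGBoost for Applied Machine Learning](https://img.kancloud.cn/ae/c0/aec011aa790d7bc70d46c8b37fd829fc_640x426.jpg)

A Gentle Introduction to XGBoost for Applied Machine Learning. Photo by [Sigfrid Lundberg](https://www.flickr.com/photos/sigfridlundberg/14945045482/), some rights reserved.

## What is XGBoost?

XGBoost stands for e**X**treme **G**radient **B**oosting.

> The name xgboost, though, actually refers to the engineering goal to push the limit of computation resources for boosted tree algorithms. Which is the reason why many people use xgboost.
>
> - Tianqi Chen, in answer to the Quora question "[What is the difference between the R gbm (gradient boosting machine) and xgboost (extreme gradient boosting)?](https://www.quora.com/What-is-the-difference-between-the-R-gbm-gradient-boosting-machine-and-xgboost-extreme-gradient-boosting)"

It is an implementation of gradient boosting machines created by [Tianqi Chen](http://homes.cs.washington.edu/~tqchen/), now with contributions from many developers. It belongs to a broader collection of tools under the umbrella of the Distributed Machine Learning Community ([DMLC](http://dmlc.ml/)). Chen is also one of the creators of the popular [mxnet deep learning library](https://github.com/dmlc/mxnet).

Tianqi Chen provides a brief and interesting back story on the evolution of XGBoost in the post [Story and Lessons Behind the Evolution of XGBoost](http://homes.cs.washington.edu/~tqchen/2016/03/10/story-and-lessons-behind-the-evolution-of-xgboost.html).

XGBoost is a software library that you can download and install on your machine, then access from a variety of interfaces. Specifically, XGBoost supports the following main interfaces:

* Command Line Interface (CLI).
* C++ (the language in which the library is written).
* Python interface, as well as a model in scikit-learn.
* R interface, as well as a model in the caret package.
* Julia.
* Java and JVM languages such as Scala, and platforms such as Hadoop.

## XGBoost Features

The library is tightly focused on computational speed and model performance, so there are few frills. Nevertheless, it does offer a number of advanced features.

### Model Features

The implementation of the model supports the features of the scikit-learn and R implementations, with new additions such as regularization. Three main forms of gradient boosting are supported:

* **Gradient Boosting** algorithm, also called gradient boosting machine, including the learning rate.
* **Stochastic Gradient Boosting** with sub-sampling at the row, column and column-per-split levels.
* **Regularized Gradient Boosting** with both L1 and L2 regularization.

### System Features

The library provides a system for use in a range of computing environments, including but not limited to:

* **Parallelization** of tree construction using all of your CPU cores during training.
* **Distributed Computing** for training very large models using a cluster of machines.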
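* **Out-of-Core Computing** for very large datasets that do not fit into memory.
* **Cache Optimization** of data structures and algorithms to make the best use of hardware.

### Algorithm Features

The implementation of the algorithm was engineered for efficient use of compute time and memory resources. A design goal was to make the best use of available resources to train the model. Some key algorithm implementation features include:

* **Sparse Aware** implementation with automatic handling of missing data values.
* **Block Structure** to support the parallelization of tree construction.
* **Continued Training** so that you can further boost an already fitted model on new data.

XGBoost is free open source software available for use under the Apache-2 license.

## Why Use XGBoost?

The two reasons to use XGBoost are also the two goals of the project:

1. Execution speed.
2. Model performance.

### 1\. XGBoost Execution Speed

Generally, XGBoost is fast. Really fast when compared to other implementations of gradient boosting.

[Szilard Pafka](https://www.linkedin.com/in/szilard) performed some objective benchmarks comparing XGBoost to other implementations of gradient boosting and bagged decision trees. He wrote up his results in May 2015 in the blog post "[Benchmarking Random Forest Implementations](http://datascience.la/benchmarking-random-forest-implementations/)". He also provides all the code on [GitHub](https://github.com/szilard/benchm-ml), along with a more extensive report of results with hard numbers.

![Benchmark Performance of XGBoost](https://img.kancloud.cn/69/5f/695f5662b4ec6ca8adff29889a2cae31_500x300.jpg)

Benchmark performance of XGBoost, taken from [Benchmarking Random Forest Implementations](http://datascience.la/benchmarking-random-forest-implementations/).

His results showed that XGBoost was almost always faster than the other benchmarked implementations from R, Python, Spark and H2O. From his experiment, he commented:

> I also tried xgboost, a popular library for boosting which is capable of building random forests as well. It is fast, memory efficient and of high accuracy.
>
> - Szilard Pafka, [Benchmarking Random Forest Implementations](http://datascience.la/benchmarking-random-forest-implementations/).

### 2\. XGBoost Model Performance

XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems. The evidence is that it is the go-to algorithm for competition winners on the Kaggle competitive data science platform.

For example, there is an incomplete list of first, second and third place competition winners that used XGBoost, titled [XGBoost: Machine Learning Challenge Winning Solutions](https://github.com/dmlc/xgboost/tree/master/demo#machine-learning-challenge-winning-solutions).

To make this point more tangible, below are some insightful quotes from Kaggle competition winners:

> As the winner of an increasing number of Kaggle competitions, XGBoost showed us again to be a great all-round algorithm worth having in your toolbox.
>
> - [Dato Winners' Interview: 1st place, Mad Professors](http://blog.kaggle.com/2015/12/03/dato-winners-interview-1st-place-mad-professors/)

> When in doubt, use xgboost.
>
> - [Avito Winner's Interview: 1st place, Owen Zhang](http://blog.kaggle.com/2015/08/26/avito-winners-interview-1st-place-owen-zhang/)

> I love single models that do well, and my best single model was an XGBoost that could get the 10th place by itself.
>
> - [Caterpillar Winners' Interview: 1st place](http://blog.kaggle.com/2015/09/22/caterpillar-winners-interview-1st-place-gilberto-josef-leustagos-mario/)

> I only used XGBoost.
>
> - [Liberty Mutual Property Inspection, Winner's Interview: 1st place, Qingchen Wang](http://blog.kaggle.com/2015/09/28/liberty-mutual-property-inspection-winners-interview-qingchen-wang/)

> The only supervised learning method I used was gradient boosting, as implemented in the excellent xgboost package.
>
> - [Recruit Coupon Purchase Winner's Interview: 2nd place, Halla Yang](http://blog.kaggle.com/2015/10/21/recruit-coupon-purchase-winners-interview-2nd-place-halla-yang/)

## What Algorithm Does XGBoost Use?

The XGBoost library implements the [gradient boosting decision tree algorithm](https://en.wikipedia.org/wiki/Gradient_boosting).

This algorithm goes by many different names, such as gradient boosting, multiple additive regression trees, stochastic gradient boosting or gradient boosting machines.

Boosting is an ensemble technique where new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made. A popular example is the [AdaBoost algorithm](http://machinelearningmastery.com/boosting-and-adaboost-for-machine-learning/), which weights data points that are hard to predict.

Gradient boosting is an approach where new models are created that predict the residuals or errors of prior models, and are then added together to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.

This approach supports both regression and classification predictive modeling problems.

For more on boosting and gradient boosting, see Trevor Hastie's talk on [Gradient Boosting Machine Learning](https://www.youtube.com/watch?v=wPqtzj5VZus).

<iframe allowfullscreen="" frameborder="0" height="281" src="https://www.youtube.com/embed/wPqtzj5VZus?feature=oembed" width="500"></iframe>

## Official XGBoost Resources

The best source of information on XGBoost is the project's [official GitHub repository](https://github.com/dmlc/xgboost).

From there you can get access to the [Issue Tracker](https://github.com/dmlc/xgboost/issues) and the [User Group](https://groups.google.com/forum/#!forum/xgboost-user/), which can be used for asking questions and reporting bugs.

The [Awesome XGBoost page](https://github.com/dmlc/xgboost/tree/master/demo) is a great collection of links with example code and help.

There is also an [official documentation page](https://xgboost.readthedocs.io/en/latest/) that includes getting started guides for a range of different languages, tutorials, how-to guides and more.

There are some more formal papers on XGBoost that are worth reading for more background on the library:

* [Higgs Boson Discovery with Boosted Trees](http://jmlr.org/proceedings/papers/v42/chen14.pdf), 2014.
* [XGBoost: A Scalable Tree Boosting System](http://arxiv.org/abs/1603.02754), 2016.

## Talks on XGBoost

When getting started with a new tool like XGBoost, it can be helpful to review a few talks on the topic before diving into the code.

### XGBoost: A Scalable Tree Boosting System

Tianqi Chen, the creator of the library, gave a talk to the LA Data Science Group in June 2016 titled "[XGBoost: A Scalable Tree Boosting System](https://www.youtube.com/watch?v=Vly8xGnNiWs)".

<iframe allowfullscreen="" frameborder="0" height="281" src="https://www.youtube.com/embed/Vly8xGnNiWs?feature=oembed" width="500"></iframe>

You can review the slides from his talk here:

<iframe allowfullscreen="true" allowtransparency="true" frameborder="0" height="345" id="talk_frame_345261" mozallowfullscreen="true" src="//speakerdeck.com/player/5c6dab45648344208185d2b1ab4fdc95" style="border:0; padding:0; margin:0; background:transparent;" webkitallowfullscreen="true" width="500"></iframe>

There is more information on the [DataScience LA blog](http://datascience.la/xgboost-workshop-and-meetup-talk-with-tianqi-chen/).

### XGBoost: eXtreme Gradient Boosting

A contributor to the XGBoost R interface gave a talk at the NYC Data Science Academy in December 2015 titled "[XGBoost: eXtreme Gradient Boosting](https://www.youtube.com/watch?v=ufHo8vbk6g4)".

<iframe allowfullscreen="" frameborder="0" height="281" src="https://www.youtube.com/embed/ufHo8vbk6g4?feature=oembed" width="500"></iframe>

You can also review the slides from his talk here:

<iframe allowfullscreen="" frameborder="0" height="356" marginheight="0" marginwidth="0" scrolling="no" src="https://www.slideshare.net/slideshow/embed_code/key/lhcV8LfZ8RfrG" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" width="427"></iframe>

**[Xgboost](https://www.slideshare.net/ShangxuanZhang/xgboost-55872323 "Xgboost")** from **[Vivian Shangxuan Zhang](http://www.slideshare.net/ShangxuanZhang)**

There is more information about this talk on the [NYC Data Science Academy blog](http://blog.nycdatascience.com/faculty/kaggle-winning-solution-xgboost-algorithm-let-us-learn-from-its-author-3/).

## Installing XGBoost

There is a comprehensive installation guide on the [XGBoost documentation website](http://xgboost.readthedocs.io/en/latest/build.html).

It covers installation for Linux, Mac OS X and Windows. It also covers installation on platforms such as R and Python.

### XGBoost in R

If you are an R user, the best place to get started is the [CRAN page for the xgboost package](https://cran.r-project.org/web/packages/xgboost/index.html).

From this page you can access the [R vignette Package 'xgboost'](https://cran.r-project.org/web/packages/xgboost/xgboost.pdf) (pdf).

There are also some excellent R tutorials linked from this page to help you get started:

* [Discover your data](https://cran.r-project.org/web/packages/xgboost/vignettes/discoverYourData.html)
* [XGBoost Presentation](https://cran.r-project.org/web/packages/xgboost/vignettes/xgboostPresentation.html)
* [xgboost: eXtreme Gradient Boosting](https://cran.r-project.org/web/packages/xgboost/vignettes/xgboost.pdf) (pdf)

There are also the official [XGBoost R Tutorial](http://xgboost.readthedocs.io/en/latest/R-package/xgboostPresentation.html) and [Understand your dataset with XGBoost](http://xgboost.readthedocs.io/en/latest/R-package/discoverYourData.html) tutorials.

### XGBoost in Python

Installation instructions are available in the [Python section of the XGBoost installation guide](https://github.com/dmlc/xgboost/blob/master/doc/build.md#python-package-installation).

The official [Python Package Introduction](http://xgboost.readthedocs.io/en/latest/python/python_intro.html) is the best place to start when working with XGBoost in Python.

To get started quickly, you can type:

```sh
sudo pip install xgboost
```

There is also an excellent list of sample Python source code in the [XGBoost Python Feature Walkthrough](https://github.com/tqchen/xgboost/tree/master/demo/guide-python).

## Summary

In this post you discovered the XGBoost algorithm for applied machine learning. You learned:

* That XGBoost is a library for developing fast and high performance gradient boosting tree models.
* That XGBoost is achieving the best performance on a range of difficult machine learning tasks.
* That you can use this library from the command line, Python and R, and how to get started.

Have you used XGBoost? Share your experiences in the comments below.

Do you have any questions about XGBoost or about this post? Ask your questions in the comments below and I will do my best to answer.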
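
As a rough illustration of the gradient boosting idea described earlier in this post (each new model is fit to the residuals of the models before it, and all models are added together), here is a minimal pure-Python sketch. It is not how XGBoost is implemented: it assumes squared-error loss, one-dimensional inputs and one-split "stump" trees, and every name in it (`fit_stump`, `fit_boosted`, `predict`, and so on) is hypothetical and chosen only for this example.

```python
# A toy gradient boosting sketch for squared-error regression on 1-D inputs.
# Illustrative assumption only: this is NOT XGBoost's implementation or API.

def fit_stump(xs, residuals):
    """Fit a one-split regression tree (a stump) to the residuals by least squares."""
    best = None
    # Try every threshold that leaves at least one point on each side.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosted(xs, ys, n_rounds=100, learning_rate=0.1):
    """Add stumps one at a time, each fit to the current residuals."""
    base = sum(ys) / len(ys)          # initial model: predict the mean
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the residual is the negative gradient of the loss.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each new model's contribution by the learning rate.
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for x, p in zip(xs, predictions)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: y = x^2 on the integers 0..9.
xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]

baseline_model = fit_boosted(xs, ys, n_rounds=0)    # mean-only model
boosted_model = fit_boosted(xs, ys, n_rounds=100)

baseline_error = mean_squared_error(baseline_model, xs, ys)
boosted_error = mean_squared_error(boosted_model, xs, ys)
print("training MSE, mean only:", baseline_error)
print("training MSE, 100 boosting rounds:", boosted_error)
```

On this toy data the boosted model's training error falls below that of the mean-only baseline, because each round removes part of the remaining residual. XGBoost builds on the same additive idea but adds the regularization, sub-sampling and systems optimizations listed in the features sections above.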