A Gentle Introduction to Scikit-Learn: A Python Machine Learning Library · Machine Learning Mastery blog post translations

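# A Gentle Introduction to Scikit-Learn: A Python Machine Learning Library

> Original: [https://machinelearningmastery.com/a-gentle-introduction-to-scikit-learn-a-python-machine-learning-library/](https://machinelearningmastery.com/a-gentle-introduction-to-scikit-learn-a-python-machine-learning-library/)

If you are a Python programmer, or you are looking for a robust library that you can use to bring machine learning into a production system, then a library you will want to consider seriously is scikit-learn. In this post you will get an overview of the scikit-learn library and useful references for where you can learn more.

## Where did it come from?

Scikit-learn was initially developed by David Cournapeau as a Google Summer of Code project in 2007. Later, Matthieu Brucher joined the project and started to use it as part of his thesis work. In 2010 INRIA got involved, and the first public release (v0.1 beta) was published in late January 2010. The project now has more than 30 active contributors and has had paid sponsorship from [INRIA](http://www.inria.fr/en/), Google, [Tinyclues](http://www.tinyclues.com/) and the [Python Software Foundation](https://www.python.org/psf/).

[![Scikit-learn Homepage](https://img.kancloud.cn/86/5a/865ae296ad81babde47c19817040903d_1024x642.jpg)](https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2014/04/scikit-learn.png)

[Scikit-learn homepage](http://scikit-learn.org/stable/index.html)

## What is scikit-learn?

Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python. It is licensed under a permissive simplified BSD license and is distributed with many Linux distributions, which encourages both academic and commercial use. The library is built upon SciPy (Scientific Python), which must be installed before you can use scikit-learn. This stack includes:

* **NumPy**: base n-dimensional array package
* **SciPy**: fundamental library for scientific computing
* **Matplotlib**: comprehensive 2D/3D plotting
* **IPython**: enhanced interactive console
* **SymPy**: symbolic mathematics
* **Pandas**: data structures and analysis

Extensions or modules for SciPy are conventionally named [SciKits](http://scikits.appspot.com/scikits). This module provides learning algorithms, and so it is named scikit-learn.

The vision for the library is a level of robustness and support that is required for use in production systems. This means a deep focus on concerns such as ease of use, code quality, collaboration, documentation and performance. Although the interface is Python, C libraries are leveraged for performance, such as NumPy for array and matrix operations, [LAPACK](http://www.netlib.org/lapack/), [LibSVM](http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and the careful use of Cython.

## What are the features?

The library is focused on modeling data. It is not focused on loading, manipulating and summarizing data. For those features, refer to NumPy and Pandas.

[![mean-shift clustering algorithm](https://img.kancloud.cn/99/93/9993d6f5446cca229bc863e615f8af39_800x600.jpg)](https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2014/04/plot_mean_shift_1.png)

Screenshot taken from a demo of the [mean-shift clustering algorithm](http://scikit-learn.org/stable/auto_examples/cluster/plot_mean_shift.html)
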
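To make the idea of a consistent interface a little more concrete, here is a minimal sketch, not taken from the original post. It fits an unsupervised model (mean-shift clustering, as in the screenshot) and a supervised model (a decision tree) on the same synthetic data, using the same `fit` and `predict` calls. The dataset, its size and the random seeds are assumptions chosen purely for illustration.

```python
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

# Synthetic 2-D data: 150 points scattered around 3 centers.
# (Sizes and seed are illustrative assumptions.)
X, y = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

# Unsupervised: fit() sees only the features.
clusterer = MeanShift()
clusterer.fit(X)
cluster_ids = clusterer.predict(X)

# Supervised: fit() sees the features and the known labels.
classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X, y)
class_ids = classifier.predict(X)

print("first cluster ids:", cluster_ids[:5])
print("first class ids:  ", class_ids[:5])
```

Both objects are constructed, fitted and queried the same way, which is what lets you swap one algorithm for another with few code changes. Note that cluster ids are arbitrary integers and need not match the class labels.
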
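Some popular groups of models provided by scikit-learn include:

* **Clustering**: for grouping unlabeled data, such as KMeans.
* **Cross Validation**: for estimating the performance of supervised models on unseen data.
* **Datasets**: for test datasets and for generating datasets with specific properties for investigating model behavior.
* **Dimensionality Reduction**: for reducing the number of attributes in data for summarization, visualization and feature selection, such as principal component analysis.
* **Ensemble Methods**: for combining the predictions of multiple supervised models.
* **Feature Extraction**: for defining attributes in image and text data.
* **Feature Selection**: for identifying meaningful attributes from which to create supervised models.
* **Parameter Tuning**: for getting the most out of supervised models.
* **Manifold Learning**: for summarizing and depicting complex multi-dimensional data.
* **Supervised Models**: a vast array, not limited to generalized linear models, discriminant analysis, naive Bayes, lazy methods, neural networks, support vector machines and decision trees.

## Example: Classification and Regression Trees

I want to give you an example to show you how easy it is to use the library. In this example, we use the Classification and Regression Trees (CART) decision tree algorithm to model the Iris flower dataset. This dataset ships with the library as an example dataset and is loaded from it. The classifier is fit to the data, and then predictions are made on the training data. Finally, the classification accuracy and a confusion matrix are printed. Because the predictions are made on the same data the model was trained on, the scores describe how well the tree fits that data, not how it would perform on unseen data.

```
# Sample Decision Tree Classifier
from sklearn import datasets
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

# load the iris datasets
dataset = datasets.load_iris()

# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(dataset.data, dataset.target)
print(model)

# make predictions
expected = dataset.target
predicted = model.predict(dataset.data)

# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
```

Running this example produces the following output, showing the details of the trained model, the skill of the model according to some common metrics, and a confusion matrix. The exact formatting depends on the installed scikit-learn version; newer releases print a shorter model description and a slightly different report layout.

```
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        50
          1       1.00      1.00      1.00        50
          2       1.00      1.00      1.00        50

avg / total       1.00      1.00      1.00       150

[[50  0  0]
 [ 0 50  0]
 [ 0  0 50]]
```

## Who is using it?

The [scikit-learn testimonials page](http://scikit-learn.org/stable/testimonials/testimonials.html) lists Inria, Mendeley, wise.io, Evernote, Telecom ParisTech and AWeber as users of the library. If this is a small indication of the companies that have spoken about their use, then there are very likely tens to hundreds of larger organizations using the library. It has good test coverage and managed releases, and it is suitable for prototype and production projects alike.

## Resources

If you are interested in learning more, check out the [Scikit-Learn homepage](http://scikit-learn.org), which includes documentation and related resources. You can get the code from the [github repository](https://github.com/scikit-learn), and historical releases are available on the [Sourceforge project](http://sourceforge.net/projects/scikit-learn/).

### Documentation

I recommend starting with the quick-start tutorial and then flicking through the user guide and example gallery for algorithms that interest you. Ultimately, scikit-learn is a library, and the API reference will be the best documentation for getting things done.

* Quick Start Tutorial [http://scikit-learn.org/stable/tutorial/basic/tutorial.html](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)
* User Guide [http://scikit-learn.org/stable/user_guide.html](http://scikit-learn.org/stable/user_guide.html)
* API Reference [http://scikit-learn.org/stable/modules/classes.html](http://scikit-learn.org/stable/modules/classes.html)
* Examples Gallery [http://scikit-learn.org/stable/auto_examples/index.html](http://scikit-learn.org/stable/auto_examples/index.html)

### Papers

If you are interested in more information about how the project started and its vision, there are some papers you may want to check out.

* [Scikit-learn: Machine Learning in Python](http://jmlr.org/papers/v12/pedregosa11a.html) (2011)
* [API design for machine learning software: experiences from the scikit-learn project](http://arxiv.org/abs/1309.0238) (2013)

### Books

If you are looking for a good book, I recommend "Building Machine Learning Systems with Python". It is well written and the examples are interesting.

* [Learning scikit-learn: Machine Learning in Python](http://www.amazon.com/dp/1783281936?tag=inspiredalgor-20) (2013)
* [Building Machine Learning Systems with Python](http://www.amazon.com/dp/1782161406?tag=inspiredalgor-20) (2013)
* [Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data](http://www.amazon.com/dp/0691151687?tag=inspiredalgor-20) (2014)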
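
As a closing sketch, recall that the decision tree example earlier in this post makes its predictions on the training data, which is why every score comes out as 1.00. The variant below is not from the original post. It shows one possible way to estimate performance on unseen data instead, using a held-out test split and cross-validation (two of the model groups listed earlier). The 30% test size, the 5 folds and the random seeds are assumptions chosen for illustration.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

dataset = datasets.load_iris()

# Hold out 30% of the rows for testing, keeping class proportions equal.
# (Test size and seed are illustrative assumptions.)
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data,
    dataset.target,
    test_size=0.3,
    random_state=0,
    stratify=dataset.target,
)

# Fit the CART model on the training rows only.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Score the model on rows it has not seen during fitting.
predicted = model.predict(X_test)
print(metrics.classification_report(y_test, predicted))
print(metrics.confusion_matrix(y_test, predicted))

# 5-fold cross-validation: each fold is held out once for scoring.
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), dataset.data, dataset.target, cv=5
)
print("cross-validated accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

Scores obtained this way are usually a little lower than the perfect training-set scores, and they give a more honest picture of how the model might behave on new flowers.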