
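# An introduction to machine learning with scikit-learn

Proofreaders: [@小瑤](https://github.com/apachecn/scikit-learn-doc-zh) [@hlxstc](https://github.com/hlxstc) [@BWM-蜜蜂](https://github.com/apachecn/scikit-learn-doc-zh)
Translators: [@李昊偉](https://github.com/apachecn/scikit-learn-doc-zh) [@...](https://github.com/apachecn/scikit-learn-doc-zh)

**Section contents.** In this section, we introduce the [machine learning](https://en.wikipedia.org/wiki/Machine_learning) vocabulary used throughout scikit-learn and give some examples to illustrate it.

## Machine learning: the problem setting

In general, a learning problem considers a set of n [samples](https://en.wikipedia.org/wiki/Sample_(statistics)) of data and then tries to predict properties of unknown data. If each sample is more than a single number, for instance a multi-dimensional record (that is, [multivariate](https://en.wikipedia.org/wiki/Multivariate_random_variable) data), it is said to have several attributes, or **features**.

Learning problems fall into a few broad categories:

> - [Supervised learning](https://en.wikipedia.org/wiki/Supervised_learning), in which the data comes with an additional attribute: the value we want to predict ([click here](../../supervised_learning.html#supervised-learning) to go to the scikit-learn supervised learning page). This problem can be either:
>
>   - [Classification](https://en.wikipedia.org/wiki/Classification_in_machine_learning): samples belong to two or more classes, and we want to learn from already labeled data how to predict the class of unlabeled data. An example of a classification problem is handwritten digit recognition, in which the aim is to assign each input vector to one of a finite number of discrete categories. Classification can be thought of as a discrete (as opposed to continuous) form of supervised learning, where each sample is given the correct label out of a limited number of categories.
>   - [Regression](https://en.wikipedia.org/wiki/Regression_analysis): if the desired output consists of one or more continuous variables, the task is called *regression*. An example of a regression problem is predicting the length of a salmon as a function of its age and weight.
> - [Unsupervised learning](https://en.wikipedia.org/wiki/Unsupervised_learning), in which the training data consists of a set of input vectors x without any corresponding target values. The goal may be to discover groups of similar examples within the data, which is called [clustering](https://en.wikipedia.org/wiki/Cluster_analysis); to determine the distribution of the data within the input space, known as [density estimation](https://en.wikipedia.org/wiki/Density_estimation); or to project the data from a high-dimensional space down to two or three dimensions for *visualization* ([click here](../../unsupervised_learning.html#unsupervised-learning) to go to the scikit-learn unsupervised learning page).

**Training set and testing set.** Machine learning is about learning properties of a data set and then applying them to new data. This is why a common practice when evaluating an algorithm is to split the data into a **training set**, from which we learn the properties of the data, and a **testing set**, on which we test those properties.

## Loading an example dataset

scikit-learn comes with a few standard datasets, for instance the [iris](https://en.wikipedia.org/wiki/Iris_flower_data_set) and [digits](http://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits) datasets for classification and the [Boston house prices dataset](http://archive.ics.uci.edu/ml/datasets/Housing) for regression.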
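As a brief illustration of the training/testing split mentioned earlier, applied to one of these standard datasets, here is a minimal sketch. It assumes the `train_test_split` helper from `sklearn.model_selection`; the 25% test fraction and `random_state=0` are arbitrary values chosen for this illustration and do not come from the tutorial.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the iris dataset: 150 samples with 4 features each.
iris = datasets.load_iris()

# Hold out a quarter of the samples for testing (illustrative choice).
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

# Every sample ends up in exactly one of the two sets.
print("training samples:", X_train.shape[0])
print("testing samples:", X_test.shape[0])
```

An estimator would then be fitted on `X_train, y_train` only and evaluated on `X_test, y_test`, so that the evaluation uses data the model has not seen.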
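In the following, we start a Python interpreter from our shell and then load the `iris` and `digits` datasets. Our notational convention is that `$` denotes the shell prompt while `>>>` denotes the Python interpreter prompt:

```
$ python
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> digits = datasets.load_digits()
```

A dataset is a dictionary-like object that holds all the data and some metadata about the data. The data is stored in the `.data` member, which is an array of shape `(n_samples, n_features)`. In the case of a supervised problem, one or more response variables are stored in the `.target` member. More details on the different datasets can be found in the [dedicated datasets section](../../datasets/index.html#datasets).

For instance, in the case of the digits dataset, `digits.data` gives access to the features that can be used to classify the digit samples:

```
>>> print(digits.data)
[[  0.   0.   5. ...,   0.   0.   0.]
 [  0.   0.   0. ...,  10.   0.   0.]
 [  0.   0.   0. ...,  16.   9.   0.]
 ...,
 [  0.   0.   1. ...,   6.   0.   0.]
 [  0.   0.   2. ...,  12.   0.   0.]
 [  0.   0.  10. ...,  12.   1.   0.]]
```

and `digits.target` gives the ground truth for the digits dataset, that is, the number corresponding to each digit image that we are trying to learn:

```
>>> digits.target
array([0, 1, 2, ..., 8, 9, 8])
```

**Shape of the data arrays.** The data is always a 2D array of shape `(n_samples, n_features)`, although the original data may have had a different shape. In the case of the digits, each original sample is an image of shape `(8, 8)` and can be accessed using:

```
>>> digits.images[0]
array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
       [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
       [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
       [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
       [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
       [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
       [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
       [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])
```

The [simple example on this dataset](../../auto_examples/classification/plot_digits_classification.html#sphx-glr-auto-examples-classification-plot-digits-classification-py) illustrates how, starting from the original problem, one can shape the data for consumption in scikit-learn.

**Loading from external datasets.** To load from an external dataset, please refer to [loading external datasets](../../datasets/index.html#external-datasets).

## Learning and predicting

In the case of the digits dataset, the task is to predict, given an image, which digit it represents. We are given samples of each of the 10 possible classes (the digits zero through nine), on which we *fit* an [estimator](https://en.wikipedia.org/wiki/Estimator) so that it can *predict* the classes to which unseen samples belong.

In scikit-learn, an estimator for classification is a Python object that implements the methods `fit(X, y)` and `predict(T)`.

An example of an estimator is the class `sklearn.svm.SVC`, which implements [support vector classification](https://en.wikipedia.org/wiki/Support_vector_machine). The constructor of an estimator takes the model's parameters as arguments, but for the time being we will treat the estimator as a black box:

```
>>> from sklearn import svm
>>> clf = svm.SVC(gamma=0.001, C=100.)
```

**Choosing the parameters of the model.** In this example, we set the value of `gamma` manually. Good values for the parameters can, however, be found automatically by using tools such as [grid search](../../modules/grid_search.html#grid-search) and [cross validation](../../modules/cross_validation.html#cross-validation).

We call our estimator instance `clf`, as it is a classifier. It must now be fitted to the data, that is, it must *learn* from the data. This is done by passing our training set to the `fit` method. As a training set, let us use all the images of our dataset apart from the last one. We select this training set with the `[:-1]` Python syntax, which produces an array containing all but the last entry of `digits.data`:

```
>>> clf.fit(digits.data[:-1], digits.target[:-1])
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
```

Now you can predict new values. In particular, we can ask the classifier which digit is shown in the last image of the `digits` dataset, which was not used to train the classifier:

```
>>> clf.predict(digits.data[-1:])
array([8])
```

The corresponding image is the following:

[![The last image in the digits dataset](https://box.kancloud.cn/43592ff4c7cb588f6902be555ee8ad67_300x300.jpg)](../../auto_examples/datasets/plot_digits_last_image.html)

As you can see, it is a challenging task: the images are of poor resolution. Do you agree with the classifier?

A complete example of this classification problem is available for you to run and study: [Recognizing hand-written digits](../../auto_examples/classification/plot_digits_classification.html#sphx-glr-auto-examples-classification-plot-digits-classification-py).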
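The example in this section holds out a single image. As a rough sketch of the same `fit`/`predict` workflow on a larger held-out portion, the following trains on the first half of the digits and predicts the second half. The half/half split and the use of `accuracy_score` from `sklearn.metrics` are assumptions made for this illustration; they are not part of the text above.

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score

digits = datasets.load_digits()
n_samples = len(digits.data)
half = n_samples // 2  # illustrative split point, not from the tutorial

# Same estimator settings as in the section above.
clf = svm.SVC(gamma=0.001, C=100.)

# Learn from the first half of the samples only.
clf.fit(digits.data[:half], digits.target[:half])

# Predict the digit for every image in the unseen second half.
predicted = clf.predict(digits.data[half:])

# Fraction of held-out images whose predicted digit matches the label.
accuracy = accuracy_score(digits.target[half:], predicted)
print("accuracy on the held-out half: %.3f" % accuracy)
```

Evaluating on many unseen samples gives a more reliable picture of the classifier than inspecting a single prediction.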
## Model persistence

It is possible to save a model in scikit-learn by using Python's built-in persistence module, namely [pickle](https://docs.python.org/2/library/pickle.html):

```
>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC()
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.predict(X[0:1])
array([0])
>>> y[0]
0
```

In the specific case of scikit-learn, it may be more interesting to use joblib's replacement for pickle (`joblib.dump` & `joblib.load`), which is more efficient on big data but can only pickle to disk and not to a string:

```
>>> from sklearn.externals import joblib  # scikit-learn >= 0.23: use "import joblib"
>>> joblib.dump(clf, 'filename.pkl')
```

Later, you can load the saved model back (possibly in another Python process) with:

```
>>> clf = joblib.load('filename.pkl')
```

Warning: `joblib.dump` and `joblib.load` also accept file-like objects instead of filenames. More information on data persistence with Joblib is available [here](https://pythonhosted.org/joblib/persistence.html). Note that pickle has some security and maintainability issues. Please refer to the [model persistence](../../modules/model_persistence.html#model-persistence) section for more detailed information about model persistence with scikit-learn.

## Conventions

scikit-learn estimators follow certain rules to make their behavior more predictable.

### Type casting

Unless otherwise specified, input will be cast to `float64`:

```
>>> import numpy as np
>>> from sklearn import random_projection

>>> rng = np.random.RandomState(0)
>>> X = rng.rand(10, 2000)
>>> X = np.array(X, dtype='float32')
>>> X.dtype
dtype('float32')

>>> transformer = random_projection.GaussianRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> X_new.dtype
dtype('float64')
```

In this example, `X` is `float32` and is cast to `float64` by `fit_transform(X)`.

Regression targets are cast to `float64`, whereas classification targets are left unchanged:

```
>>> from sklearn import datasets
>>> from sklearn.svm import SVC
>>> iris = datasets.load_iris()
>>> clf = SVC()
>>> clf.fit(iris.data, iris.target)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

>>> list(clf.predict(iris.data[:3]))
[0, 0, 0]

>>> clf.fit(iris.data, iris.target_names[iris.target])
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

>>> list(clf.predict(iris.data[:3]))
['setosa', 'setosa', 'setosa']
```

Here, the first `predict()` returns an integer array, since `iris.target` (an integer array) was used in `fit`. The second `predict()` returns a string array, since `iris.target_names` is an array of strings and was used to build the targets for fitting.

### Refitting and updating parameters

Hyper-parameters of an estimator can be updated after it has been constructed via the [`sklearn.pipeline.Pipeline.set_params`](../../modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline.set_params "sklearn.pipeline.Pipeline.set_params") method. Calling `fit()` more than once overwrites what was learned by any previous `fit()`:

```
>>> import numpy as np
>>> from sklearn.svm import SVC

>>> rng = np.random.RandomState(0)
>>> X = rng.rand(100, 10)
>>> y = rng.binomial(1, 0.5, 100)
>>> X_test = rng.rand(5, 10)

>>> clf = SVC()
>>> clf.set_params(kernel='linear').fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
>>> clf.predict(X_test)
array([1, 0, 1, 1, 0])

>>> clf.set_params(kernel='rbf').fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
>>> clf.predict(X_test)
array([0, 0, 0, 1, 0])
```

Here, after the estimator has been constructed with `SVC()`, the default kernel `rbf` is first changed to `linear`, and then changed back to `rbf` to refit the estimator and make a second prediction.

### Multiclass vs. multilabel fitting

When using [`multiclass classifiers`](../../modules/classes.html#module-sklearn.multiclass "sklearn.multiclass"), the learning and prediction task that is performed depends on the format of the target data fit upon:

```
>>> from sklearn.svm import SVC
>>> from sklearn.multiclass import OneVsRestClassifier
>>> from sklearn.preprocessing import LabelBinarizer

>>> X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
>>> y = [0, 0, 1, 1, 2]

>>> classif = OneVsRestClassifier(estimator=SVC(random_state=0))
>>> classif.fit(X, y).predict(X)
array([0, 0, 1, 1, 2])
```

In the above case, the classifier is fit on a 1D array of multiclass labels, and the `predict()` method therefore provides corresponding multiclass predictions. It is also possible to fit upon a 2D array of binary label indicators:

```
>>> y = LabelBinarizer().fit_transform(y)
>>> classif.fit(X, y).predict(X)
array([[1, 0, 0],
       [1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 0]])
```

Here, the classifier is `fit()` on a 2D binary label representation of `y`, obtained with [`LabelBinarizer`](../../modules/generated/sklearn.preprocessing.LabelBinarizer.html#sklearn.preprocessing.LabelBinarizer "sklearn.preprocessing.LabelBinarizer"). In this case `predict()` returns a 2D array representing the corresponding multilabel predictions.

Note that the fourth and fifth instances returned all zeroes, indicating that they matched none of the three labels fit upon. With multilabel outputs, it is similarly possible for an instance to be assigned multiple labels:

```
>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
>>> y = MultiLabelBinarizer().fit_transform(y)
>>> classif.fit(X, y).predict(X)
array([[1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0],
       [0, 1, 0, 1, 0],
       [1, 0, 1, 1, 0],
       [0, 0, 1, 0, 1]])
```

In this case, the classifier is fit upon instances that are each assigned multiple labels. The [`MultiLabelBinarizer`](../../modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html#sklearn.preprocessing.MultiLabelBinarizer "sklearn.preprocessing.MultiLabelBinarizer") is used to binarize the multiple labels into a 2D array to fit upon. As a result, `predict()` returns a 2D array with multiple predicted labels for each instance.
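To make the binarized representation easier to read, the following minimal sketch shows how the columns of the 2D indicator array map back to the original labels. It uses the `classes_` attribute and the `inverse_transform` method of `MultiLabelBinarizer`; the variable names `mlb`, `Y` and `recovered` are introduced here for illustration only.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# The same label sets as in the multilabel example above.
y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y)  # 2D indicator array, one column per label

# classes_ lists which label each column of Y stands for.
print("labels per column:", mlb.classes_)
print(Y)

# inverse_transform turns each indicator row back into a tuple of labels.
recovered = mlb.inverse_transform(Y)
print(recovered)
```

The same `inverse_transform` call can be applied to a 2D array returned by `predict()` to obtain the predicted label set for each instance, provided the binarizer object used for fitting has been kept.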