
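# Supervised learning: predicting an output variable from high-dimensional observations

Chinese translation proofread by [@Kyrie](https://github.com/apachecn/scikit-learn-doc-zh) and [@片刻](https://github.com/apachecn/scikit-learn-doc-zh), translated by [@森系](https://github.com/apachecn/scikit-learn-doc-zh)

**The problem solved in supervised learning**

[Supervised learning](../../supervised_learning.html#supervised-learning) consists in learning the link between two datasets: the observed data `X` and an external variable `y` that we are trying to predict, usually called the "target" or "labels". Most often, `y` is a 1D array of length `n_samples`.

All supervised [estimators](https://en.wikipedia.org/wiki/Estimator) in scikit-learn implement a `fit(X, y)` method to fit the model and a `predict(X)` method that, given unlabeled observations `X`, returns the predicted labels `y`.

**Vocabulary: classification and regression**

If the prediction task is to classify the observations into a finite set of labels, in other words to "name" the objects observed, the task is said to be a **classification** task. On the other hand, if the goal is to predict a continuous target variable, it is said to be a **regression** task.

When doing classification in scikit-learn, `y` is a vector of integers or strings.

Note: See the tutorial "An introduction to machine learning with scikit-learn" for a quick run-through of the basic machine learning vocabulary.

## Nearest neighbor and the curse of dimensionality

**Classifying irises:**

[![http://sklearn.apachecn.org/cn/0.19.0/_images/sphx_glr_plot_iris_dataset_001.png](https://box.kancloud.cn/6fd8442ce2ba55457339d694dcdfc640_566x424.jpg)](../../auto_examples/datasets/plot_iris_dataset.html)

The iris dataset is a classification task consisting in identifying 3 different types of irises from 4 features: petal length, petal width, sepal length and sepal width:

```
>>> import numpy as np
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> iris_X = iris.data
>>> iris_y = iris.target
>>> np.unique(iris_y)
array([0, 1, 2])
```

### k-Nearest neighbors classifier

The [nearest neighbor](https://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm) classifier is perhaps the simplest possible classifier: given a new observation `X_test`, find in the training set (i.e. the data used to train the estimator) the observation with the closest feature vector. (See the [Nearest Neighbors section](../../modules/neighbors.html#neighbors) of the online scikit-learn documentation for more information about this type of classifier.)

**Training set and testing set**

While experimenting with any learning algorithm, it is important not to test the prediction of an estimator on the data used to fit it, as this would not evaluate the performance of the estimator on **new data**. This is why datasets are often split into *train* and *test* data.

**KNN (k nearest neighbors) classification example**:

[![http://sklearn.apachecn.org/cn/0.19.0/_images/sphx_glr_plot_classification_001.png](https://box.kancloud.cn/390139a85024c76bfbc8d231e4870bb1_566x424.jpg)](../../auto_examples/neighbors/plot_classification.html)

```
>>> # Split iris data in train and test data
>>> # A random permutation, to split the data randomly
>>> np.random.seed(0)
>>> indices = np.random.permutation(len(iris_X))
>>> iris_X_train = iris_X[indices[:-10]]
>>> iris_y_train = iris_y[indices[:-10]]
>>> iris_X_test = iris_X[indices[-10:]]
>>> iris_y_test = iris_y[indices[-10:]]
>>> # Create and fit a nearest-neighbor classifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier()
>>> knn.fit(iris_X_train, iris_y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
>>> knn.predict(iris_X_test)
array([1, 2, 1, 0, 0, 0, 2, 1, 2, 0])
>>> iris_y_test
array([1, 1, 1, 0, 0, 0, 2, 1, 2, 0])
```

### The curse of dimensionality

For an estimator to be effective, you need the distance between neighboring points to be less than some value ![d](https://box.kancloud.cn/c7d8c62c9dba7f2dfd95aa73d579b8ae_10x13.jpg), which depends on the problem. In one dimension, this requires on average $n \sim 1/d$ points. In the context of the above ![k](https://box.kancloud.cn/300675e73ace6bf4c352cfbb633f0199_9x13.jpg)-NN example, if the data is described by just one feature with values ranging from 0 to 1 and with ![n](https://box.kancloud.cn/ee463e4b2bbbc723c7017b00e6d51b41_11x8.jpg) training observations, then new data will be no further away than ![1/n](https://box.kancloud.cn/babe0a449070dc769fd98272fc99c740_28x18.jpg) from a training observation. Therefore, the nearest neighbor decision rule will be efficient as soon as ![1/n](https://box.kancloud.cn/babe0a449070dc769fd98272fc99c740_28x18.jpg) is small compared to the scale of between-class feature variations.

If the number of features is ![p](https://box.kancloud.cn/251fcba434769c07a22131f9d8b84b32_10x12.jpg), you now require ![n \sim 1/d^p](https://box.kancloud.cn/0742de83b5051c2f04e0e3627a8b689a_69x18.jpg) points. Say that we require 10 points in the one-dimensional ![[0, 1]](https://box.kancloud.cn/06fb25c9c5ed966849477a076a62532b_32x18.jpg) space: ![10^p](https://box.kancloud.cn/1512d80cbdc5f5536812f59f27effd4b_24x13.jpg) points are then required in ![p](https://box.kancloud.cn/251fcba434769c07a22131f9d8b84b32_10x12.jpg) dimensions. As ![p](https://box.kancloud.cn/251fcba434769c07a22131f9d8b84b32_10x12.jpg) becomes large, the number of training points required for a good estimator grows exponentially.

For example, if each point is just a single number (8 bytes), then an effective ![k](https://box.kancloud.cn/300675e73ace6bf4c352cfbb633f0199_9x13.jpg)-NN estimator in a paltry ![p \sim 20](https://box.kancloud.cn/12d7e30a35ccab39cb2dcf9ee14e8b66_53x16.jpg) dimensions would require more training data than the current estimated size of the entire internet (±1000 exabytes or so).

This is called the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) and is a core problem that machine learning addresses.

## Linear model: from regression to sparsity

**Diabetes dataset**

The diabetes dataset consists of 10 physiological variables (age, sex, weight, blood pressure, among others) measured on 442 patients, and an indication of disease progression after one year:

```
>>> diabetes = datasets.load_diabetes()
>>> diabetes_X_train = diabetes.data[:-20]
>>> diabetes_X_test = diabetes.data[-20:]
>>> diabetes_y_train = diabetes.target[:-20]
>>> diabetes_y_test = diabetes.target[-20:]
```

The task at hand is to predict disease progression from physiological variables.

### Linear regression

[`LinearRegression`](../../modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression "sklearn.linear_model.LinearRegression"), in its simplest form, fits a linear model to the data set by adjusting a set of parameters in order to make the sum of the squared residuals of the model as small as possible.

[![http://sklearn.apachecn.org/cn/0.19.0/_images/sphx_glr_plot_ols_001.png](https://box.kancloud.cn/befce407d9e45b7dbcedd39546005038_566x424.jpg)](../../auto_examples/linear_model/plot_ols.html)

Linear models: ![y = X\beta + \epsilon](https://box.kancloud.cn/6ac53ed498bd7955da53e99cfe921c3e_90x16.jpg)

> - ![X](https://box.kancloud.cn/b422495cc9601a331610a3a428f9133b_16x12.jpg): data
> - ![y](https://box.kancloud.cn/0255a09d3dccb9843dcf063bbeec303f_9x12.jpg): target variable
> - ![\beta](https://box.kancloud.cn/0b61934b8a4d7388dbc1b4fc82b0d49f_11x16.jpg): coefficients
> - ![\epsilon](https://box.kancloud.cn/3ec5a738819e1e6501032891d360ef4a_7x8.jpg): observation noise

```
>>> from sklearn import linear_model
>>> regr = linear_model.LinearRegression()
>>> regr.fit(diabetes_X_train, diabetes_y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> print(regr.coef_)
[   0.30349955 -237.63931533  510.53060544  327.73698041 -814.13170937
  492.81458798  102.84845219  184.60648906  743.51961675   76.09517222]
>>> # The mean square error
>>> np.mean((regr.predict(diabetes_X_test)-diabetes_y_test)**2)
2004.56760268...
>>> # Explained variance score: 1 is perfect prediction
>>> # and 0 means that there is no linear relationship
>>> # between X and y.
>>> regr.score(diabetes_X_test, diabetes_y_test)
0.5850753022690...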
```

### Shrinkage

If there are few data points per dimension, noise in the observations induces high variance:

[![http://sklearn.apachecn.org/cn/0.19.0/_images/sphx_glr_plot_ols_ridge_variance_001.png](https://box.kancloud.cn/a9efdb957ec879c6a2d923b4d11f492a_400x300.jpg)](../../auto_examples/linear_model/plot_ols_ridge_variance.html)

```
>>> X = np.c_[ .5, 1].T
>>> y = [.5, 1]
>>> test = np.c_[ 0, 2].T
>>> regr = linear_model.LinearRegression()

>>> import matplotlib.pyplot as plt
>>> plt.figure()

>>> np.random.seed(0)
>>> for _ in range(6):
...     this_X = .1*np.random.normal(size=(2, 1)) + X
...     regr.fit(this_X, y)
...     plt.plot(test, regr.predict(test))
...     plt.scatter(this_X, y, s=3)
```

A solution in high-dimensional statistical learning is to *shrink* the regression coefficients towards zero: any two randomly chosen sets of observations are likely to be uncorrelated. This is called ridge regression:

[![http://sklearn.apachecn.org/cn/0.19.0/_images/sphx_glr_plot_ols_ridge_variance_002.png](https://box.kancloud.cn/f9d7e9d12644e488c4392e9cce9f373f_400x300.jpg)](../../auto_examples/linear_model/plot_ols_ridge_variance.html)

```
>>> regr = linear_model.Ridge(alpha=.1)

>>> plt.figure()

>>> np.random.seed(0)
>>> for _ in range(6):
...     this_X = .1*np.random.normal(size=(2, 1)) + X
...     regr.fit(this_X, y)
...     plt.plot(test, regr.predict(test))
...     plt.scatter(this_X, y, s=3)
```

This is an example of the **bias/variance tradeoff**: the larger the ridge `alpha` parameter, the higher the bias and the lower the variance.

We can choose `alpha` to minimize the left-out error, this time using the diabetes dataset rather than our synthetic data:

```
>>> alphas = np.logspace(-4, -1, 6)
>>> print([regr.set_params(alpha=alpha
...             ).fit(diabetes_X_train, diabetes_y_train,
...             ).score(diabetes_X_test, diabetes_y_test) for alpha in alphas])
[0.5851110683883..., 0.5852073015444..., 0.5854677540698..., 0.5855512036503..., 0.5830717085554..., 0.57058999437...]
```

Note: Capturing in the fitted parameters noise that prevents the model from generalizing to new data is called [overfitting](https://en.wikipedia.org/wiki/Overfitting). The bias introduced by ridge regression is called a [regularization](https://en.wikipedia.org/wiki/Regularization_%28machine_learning%29).

### Sparsity

**Fitting only features 1 and 2**

[![diabetes_ols_1](https://box.kancloud.cn/a172812f0ada5911d389b183d0cc6787_400x300.jpg)](../../auto_examples/linear_model/plot_ols_3d.html) [![diabetes_ols_3](https://box.kancloud.cn/f5c0ba82332684eb428857b2cb86ebc0_400x300.jpg)](../../auto_examples/linear_model/plot_ols_3d.html) [![diabetes_ols_2](https://box.kancloud.cn/2511260dcd42d5d0007ca0556a923b4f_400x300.jpg)](../../auto_examples/linear_model/plot_ols_3d.html)

Note: A representation of the full diabetes dataset would involve 11 dimensions (10 feature dimensions and one for the target variable). It is hard to develop an intuition for such a representation, but it may be useful to keep in mind that it would be a fairly *empty* space.

We can see that, although feature 2 has a strong coefficient in the full model, it conveys little information on `y` when considered together with feature 1.

To improve the conditioning of the problem (i.e. to mitigate the curse of dimensionality), it would be interesting to select only the informative features and set the non-informative ones, like feature 2, to 0. Ridge regression will decrease their contribution, but not set them to zero. Another penalization approach, called [Lasso](../../modules/linear_model.html#lasso) (least absolute shrinkage and selection operator), can set some coefficients to zero. Such methods are called **sparse methods**, and sparsity can be seen as an application of Occam's razor: *prefer simpler models*.

```
>>> regr = linear_model.Lasso()
>>> scores = [regr.set_params(alpha=alpha
...             ).fit(diabetes_X_train, diabetes_y_train
...             ).score(diabetes_X_test, diabetes_y_test)
...        for alpha in alphas]
>>> best_alpha = alphas[scores.index(max(scores))]
>>> regr.alpha = best_alpha
>>> regr.fit(diabetes_X_train, diabetes_y_train)
Lasso(alpha=0.025118864315095794, copy_X=True, fit_intercept=True,
   max_iter=1000, normalize=False, positive=False, precompute=False,
   random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
>>> print(regr.coef_)
[   0.         -212.43764548  517.19478111  313.77959962 -160.8303982    -0.
 -187.19554705   69.38229038  508.66011217   71.84239008]
```

**Different algorithms for the same problem**

Different algorithms can be used to solve the same mathematical problem. For instance, the `Lasso` object in scikit-learn solves the lasso regression problem using a [coordinate descent](https://en.wikipedia.org/wiki/Coordinate_descent) method, which is efficient on large datasets. However, scikit-learn also provides the `LassoLars` object using the *LARS* algorithm, which is very efficient for problems in which the estimated weight vector is very sparse (i.e. problems with very few observations).

### Classification

[![http://sklearn.apachecn.org/cn/0.19.0/_images/sphx_glr_plot_logistic_001.png](https://box.kancloud.cn/cdc841cec84a072dd475857d852c3f87_400x300.jpg)](../../auto_examples/linear_model/plot_logistic.html)

For classification, as in the [iris](https://en.wikipedia.org/wiki/Iris_flower_data_set) labeling task, linear regression is not the right approach, as it gives too much weight to data far from the decision frontier. A linear approach is to fit a sigmoid function, or **logistic** function:

![y = \textrm{sigmoid}(X\beta - \textrm{offset}) + \epsilon = \frac{1}{1 + \textrm{exp}(- X\beta + \textrm{offset})} + \epsilon](https://box.kancloud.cn/0de312bdeab0470cc60108b7cf118094_462x41.jpg)

```
>>> logistic = linear_model.LogisticRegression(C=1e5)
>>> logistic.fit(iris_X_train, iris_y_train)
LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
```

This is known as [`LogisticRegression`](../../modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression "sklearn.linear_model.LogisticRegression").

[![http://sklearn.apachecn.org/cn/0.19.0/_images/sphx_glr_plot_iris_logistic_001.png](https://box.kancloud.cn/3af3e5b59fbdb61bd966b118f5ab5b5d_400x300.jpg)](../../auto_examples/linear_model/plot_iris_logistic.html)

**Multiclass classification**

If you have several classes to predict, an option often used is to fit one-versus-all classifiers and then use a voting heuristic for the final decision.

**Shrinkage and sparsity with logistic regression**

The `C` parameter controls the amount of regularization in the [`LogisticRegression`](../../modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression "sklearn.linear_model.LogisticRegression") object: a large value for `C` results in less regularization. `penalty="l2"` gives shrinkage (i.e. non-sparse coefficients), while `penalty="l1"` gives sparsity.

**Exercise**

Try classifying the digits dataset with nearest neighbors and a linear model. Leave out the last 10% of the data and test prediction performance on these observations.

```
from sklearn import datasets, neighbors, linear_model

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
```

Solution: [`../../auto_examples/exercises/plot_digits_classification_exercise.py`](../../_downloads/plot_digits_classification_exercise.py)

## Support vector machines (SVMs)

### Linear SVMs

[Support vector machines](../../modules/svm.html#svm) belong to the discriminant model family: they try to find a combination of samples to build a plane maximizing the margin between the two classes. Regularization is set by the `C` parameter: a small value for `C` means the margin is calculated using many or all of the observations around the separating line (more regularization); a large value for `C` means the margin is calculated on observations close to the separating line (less regularization).

Example:

- [Plot different SVM classifiers in the iris dataset](../../auto_examples/svm/plot_iris.html#sphx-glr-auto-examples-svm-plot-iris-py)

SVMs can be used for regression, with `SVR` (Support Vector Regression), or for classification, with `SVC` (Support Vector Classification).

```
>>> from sklearn import svm
>>> svc = svm.SVC(kernel='linear')
>>> svc.fit(iris_X_train, iris_y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
```

Warning: **Normalizing data**

For many estimators, including SVMs, having datasets with unit standard deviation for each feature is important to get good predictions.

### Using kernels

Classes are not always linearly separable in feature space. The solution is to build a decision function that is not linear but may be polynomial instead. This is done using the *kernel trick*, which can be seen as creating a decision energy by positioning *kernels* on observations:

**Linear kernel**

[![svm_kernel_linear](https://box.kancloud.cn/3f9ee247c82ffcb57d515f45203891bf_400x300.jpg)](../../auto_examples/svm/plot_svm_kernels.html)

```
>>> svc = svm.SVC(kernel='linear')
```

**Polynomial kernel**

[![svm_kernel_poly](https://box.kancloud.cn/70d2295d520eae282b08c0a1695b28a1_400x300.jpg)](../../auto_examples/svm/plot_svm_kernels.html)

```
>>> svc = svm.SVC(kernel='poly',
...               degree=3)
>>> # degree: polynomial degree
```

**RBF kernel (Radial Basis Function)**

[![svm_kernel_rbf](https://box.kancloud.cn/709fcb8c6eb56cc7abbc29ec4a57e2e4_400x300.jpg)](../../auto_examples/svm/plot_svm_kernels.html)

```
>>> svc = svm.SVC(kernel='rbf')
>>> # gamma: inverse of size of
>>> # radial kernel
```

**Interactive example**

See the [SVM GUI](../../auto_examples/applications/svm_gui.html#sphx-glr-auto-examples-applications-svm-gui-py) to download `svm_gui.py`; add data points of both classes with the left and right mouse buttons, fit the model, and change parameters and data.

[![http://sklearn.apachecn.org/cn/0.19.0/_images/sphx_glr_plot_iris_dataset_001.png](https://box.kancloud.cn/6fd8442ce2ba55457339d694dcdfc640_566x424.jpg)](../../auto_examples/datasets/plot_iris_dataset.html)

**Exercise**

Try classifying classes 1 and 2 from the iris dataset with SVMs, using the first 2 features. Leave out 10% of each class and test prediction performance on these observations.

**Warning**: the classes are ordered, so do not leave out the last 10%, or you would be testing on only one class.

**Hint**: you can use the `decision_function` method on a grid to get intuitions.

```
iris = datasets.load_iris()
X = iris.data
y = iris.target

X = X[y != 0, :2]
y = y[y != 0]
```

Solution: [`../../auto_examples/exercises/plot_iris_exercise.py`](../../_downloads/plot_iris_exercise.py)
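One possible way to set up the exercise above is sketched below. It is not the linked solution file. The stratified hold-out via `train_test_split(..., stratify=y)`, the `random_state` value, the default kernel parameters, the 50x50 grid and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Keep only classes 1 and 2, and only the first two features
X = X[y != 0, :2]
y = y[y != 0]

# Hold out 10% of each class; stratify=y keeps the class balance,
# which avoids the "ordered classes" pitfall mentioned in the warning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

# Fit one SVC per kernel and record the accuracy on the held-out data
scores = {}
for kernel in ("linear", "poly", "rbf"):
    clf = svm.SVC(kernel=kernel)
    clf.fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)
    print(kernel, scores[kernel])

# Evaluate decision_function on a grid (here for the last fitted
# classifier, the RBF one); Z can then be passed to a contour plot
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min(), X[:, 0].max(), 50),
    np.linspace(X[:, 1].min(), X[:, 1].max(), 50))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
```

With only 10 held-out observations the scores are coarse (each error costs 0.1), so differences between kernels should be read with caution.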