# 3.1. Cross-validation: evaluating estimator performance

Reviewers: [@想和太陽肩并肩](https://github.com/apachecn/scikit-learn-doc-zh) [@樊雯](https://github.com/apachecn/scikit-learn-doc-zh) Translator: [@\\S^R^Y/](https://github.com/apachecn/scikit-learn-doc-zh)

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that merely repeats the labels of the samples it has just seen would score perfectly, yet fail to predict anything useful on unseen data. This situation is called **overfitting**. To avoid it, it is common practice in a (supervised) machine learning experiment to hold out part of the available data as a **test set** `X_test, y_test`. Note that the word "experiment" is not meant to denote academic use only: even in commercial settings, machine learning usually starts out experimentally.

In scikit-learn, a random split into training and test sets can be computed quickly with the [`train_test_split`](generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split "sklearn.model_selection.train_test_split") helper function. Let's load the iris data set and fit a linear support vector machine on it:

```
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> from sklearn import datasets
>>> from sklearn import svm
>>> iris = datasets.load_iris()
>>> iris.data.shape, iris.target.shape
((150, 4), (150,))
```

We can now quickly hold out 40% of the data as a test set for evaluating our classifier:

```
>>> X_train, X_test, y_train, y_test = train_test_split(
...     iris.data, iris.target, test_size=0.4, random_state=0)
>>> X_train.shape, y_train.shape
((90, 4), (90,))
>>> X_test.shape, y_test.shape
((60, 4), (60,))
>>> clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.96...
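```

When evaluating different settings ("hyperparameters") for estimators, such as the `C` setting that must be chosen manually for an SVM, there is still a risk of overfitting *on the test set*, because the parameters can be tweaked until the estimator performs optimally on it. Knowledge about the test set then leaks into the model, and the evaluation metrics no longer reflect generalization performance. To solve this problem, yet another part of the data set can be held out as a so-called "validation set": the model is trained on the training set and evaluated on the validation set, and once the experiment looks successful, a final evaluation is done on the test set.

However, partitioning the available data into three sets drastically reduces the number of samples that can be used for learning the model, and the results depend on the particular random choice of the (train, validation) pair.

A solution to this problem is a procedure called [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) (CV for short). A test set should still be held out for the final evaluation, but the validation set is no longer needed. In the basic approach, called *k-fold cross-validation*, the training set is split into $k$ smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the *k* folds:

> - A model is trained using $k-1$ of the folds as training data;
> - the resulting model is validated on the remaining fold (that is, the fold serves as a validation set on which a performance measure such as accuracy is computed).

The performance measure reported by *k*-fold cross-validation is the average of the values computed in the loop. This approach can be computationally expensive, but it does not waste much data (as is the case when fixing an arbitrary test set), which is a major advantage in problems such as inverse inference where the number of samples is very small.

## 3.1.1. Computing cross-validated metrics

The simplest way to use cross-validation is to call the [`cross_val_score`](generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score "sklearn.model_selection.cross_val_score") helper function on the estimator and the data set.

The following example estimates the accuracy of a linear-kernel support vector machine on the iris data set by splitting the data, fitting a model and computing the score 5 consecutive times (with a different split each time):

```
>>> from sklearn.model_selection import cross_val_score
>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_val_score(clf, iris.data, iris.target, cv=5)
>>> scores
array([ 0.96...,  1.  ...,  0.96...,  0.96...,  1.        ])
```

The mean score and an approximate 95% confidence interval of the score estimate (two standard deviations around the mean) are then given by:

```
>>> print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.98 (+/- 0.03)
```

By default, the score computed at each CV iteration is the one returned by the estimator's `score` method. This can be changed with the `scoring` parameter:

```
>>> from sklearn import metrics
>>> scores = cross_val_score(
...     clf, iris.data, iris.target, cv=5, scoring='f1_macro')
>>> scores
array([ 0.96...,  1.  ...,  0.96...,  0.96...,  1.        ])
```

See [The scoring parameter: defining model evaluation rules](model_evaluation.html#scoring-parameter) for details. In the case of the iris data set, the samples are balanced across target classes, so the accuracy and the F1-score are almost equal.

When the `cv` argument is an integer, [`cross_val_score`](generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score "sklearn.model_selection.cross_val_score") uses the [`KFold`](generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold "sklearn.model_selection.KFold") or [`StratifiedKFold`](generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold "sklearn.model_selection.StratifiedKFold") strategy by default, the latter being used if the estimator derives from [`ClassifierMixin`](generated/sklearn.base.ClassifierMixin.html#sklearn.base.ClassifierMixin "sklearn.base.ClassifierMixin").

It is also possible to use other cross-validation strategies by passing a cross-validation iterator instead, for instance:

```
>>> from sklearn.model_selection import ShuffleSplit
>>> n_samples = iris.data.shape[0]
>>> cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
>>> cross_val_score(clf, iris.data, iris.target, cv=cv)
array([ 0.97...,  0.97...,  1.        ])
```

**Data transformation with held-out data**

Just as it is important to test a predictor on data held out from training, preprocessing (such as standardization, feature selection, etc.) and similar [data transformations](../data_transforms.html#data-transforms) should be learnt from a training set and then applied to the held-out data for prediction:

```
>>> from sklearn import preprocessing
>>> X_train, X_test, y_train, y_test = train_test_split(
...     iris.data, iris.target, test_size=0.4, random_state=0)
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train_transformed = scaler.transform(X_train)
>>> clf = svm.SVC(C=1).fit(X_train_transformed, y_train)
>>> X_test_transformed = scaler.transform(X_test)
>>> clf.score(X_test_transformed, y_test)
0.9333...
```

A [`Pipeline`](generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline "sklearn.pipeline.Pipeline") makes it easier to compose estimators, and provides this behavior under cross-validation:

```
>>> from sklearn.pipeline import make_pipeline
>>> clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
>>> cross_val_score(clf, iris.data, iris.target, cv=cv)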
array([ 0.97...,  0.93...,  0.95...])
```

See [Pipeline and FeatureUnion: combining estimators](pipeline.html#combining-estimators).

### 3.1.1.1. The cross\_validate function and multiple metric evaluation

The `cross_validate` function differs from `cross_val_score` in two ways:

- It allows specifying multiple metrics for evaluation.
- It returns a dict containing training scores, fit times and score times in addition to the test score.

For single metric evaluation, where the `scoring` parameter is a string, a callable or `None`, the keys will be `['test_score', 'fit_time', 'score_time']`.

For multiple metric evaluation, the return value is a dict with the following keys: `['test_<scorer1_name>', 'test_<scorer2_name>', 'test_<scorer...>', 'fit_time', 'score_time']`.

`return_train_score` is set to `True` by default in the release documented here. It adds train score keys for all the scorers. If train scores are not needed, it should be set explicitly to `False`.

Multiple metrics can be specified as a list, tuple or set of predefined scorer names:

```
>>> from sklearn.model_selection import cross_validate
>>> from sklearn.metrics import recall_score
>>> scoring = ['precision_macro', 'recall_macro']
>>> clf = svm.SVC(kernel='linear', C=1, random_state=0)
>>> scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
...                         cv=5, return_train_score=False)
>>> sorted(scores.keys())
['fit_time', 'score_time', 'test_precision_macro', 'test_recall_macro']
>>> scores['test_recall_macro']
array([ 0.96...,  1.  ...,  0.96...,  0.96...,  1.        ])
```

Or as a dict mapping scorer names to predefined or custom scoring functions:

```
>>> from sklearn.metrics import make_scorer
>>> scoring = {'prec_macro': 'precision_macro',
...            'rec_macro': make_scorer(recall_score, average='macro')}
>>> scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
...                         cv=5, return_train_score=True)
>>> sorted(scores.keys())
['fit_time', 'score_time', 'test_prec_macro', 'test_rec_macro',
 'train_prec_macro', 'train_rec_macro']
>>> scores['train_rec_macro']
array([ 0.97...,  0.97...,  0.99...,  0.98...,  0.98...])
```

Here is an example of `cross_validate` using a single metric:

```
>>> scores = cross_validate(clf, iris.data, iris.target,
...                         scoring='precision_macro')
>>> sorted(scores.keys())
['fit_time', 'score_time', 'test_score', 'train_score']
```

### 3.1.1.2. Obtaining predictions by cross-validation

The function [`cross_val_predict`](generated/sklearn.model_selection.cross_val_predict.html#sklearn.model_selection.cross_val_predict "sklearn.model_selection.cross_val_predict") has the same interface as [`cross_val_score`](generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score "sklearn.model_selection.cross_val_score") but returns something different: for each element in the input, the prediction that was obtained for that element when it was in the test set. Only cross-validation strategies that assign every element to a test set exactly once can be used (otherwise, an exception is raised).

These predictions can then be used to evaluate the classifier:

```
>>> from sklearn.model_selection import cross_val_predict
>>> predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)
>>> metrics.accuracy_score(iris.target, predicted)
0.973...
```

Note that the result of this computation may differ slightly from the one obtained with [`cross_val_score`](generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score "sklearn.model_selection.cross_val_score"), because the elements are grouped in a different way.

The available cross-validation iterators are introduced in the following sections.

Examples:

- [Receiver Operating Characteristic (ROC) with cross validation](../auto_examples/model_selection/plot_roc_crossval.html#sphx-glr-auto-examples-model-selection-plot-roc-crossval-py)
- [Recursive feature elimination with cross-validation](../auto_examples/feature_selection/plot_rfe_with_cross_validation.html#sphx-glr-auto-examples-feature-selection-plot-rfe-with-cross-validation-py)
- [Parameter estimation using grid search with cross-validation](../auto_examples/model_selection/plot_grid_search_digits.html#sphx-glr-auto-examples-model-selection-plot-grid-search-digits-py)
- [Sample pipeline for text feature extraction and evaluation](../auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py)
- [Plotting cross-validated predictions](../auto_examples/plot_cv_predict.html#sphx-glr-auto-examples-plot-cv-predict-py)
- [Nested versus non-nested
cross-validation](../auto_examples/model_selection/plot_nested_cross_validation_iris.html#sphx-glr-auto-examples-model-selection-plot-nested-cross-validation-iris-py).

## 3.1.2. Cross-validation iterators

The following sections list utilities that generate the indices used to split a data set according to different cross-validation strategies.

## 3.1.3. Cross-validation iterators for i.i.d. data

Assuming that some data is independent and identically distributed (i.i.d.) amounts to assuming that all samples stem from the same generative process and that this process has no memory of previously generated samples. The following cross-validators can be used in such cases.

**Note**

While i.i.d. data is a common assumption in machine learning theory, it rarely holds in practice. If the samples are known to have been generated by a time-dependent process, it is safer to use a [time-series aware cross-validation scheme](#timeseries-cv). Similarly, if the generative process is known to have a group structure (samples collected from different subjects, experiments or measurement devices), it is safer to use [group-wise cross-validation](#group-cv).

### 3.1.3.1. K-fold

[`KFold`](generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold "sklearn.model_selection.KFold") divides all the samples into $k$ groups of samples, called folds (if $k = n$, this is equivalent to the *Leave One Out* strategy), of equal size where possible. The prediction function is learned using $k - 1$ folds, and the fold left out is used for testing.

Example of 2-fold cross-validation on a data set with 4 samples:

```
>>> import numpy as np
>>> from sklearn.model_selection import KFold
>>> X = ["a", "b", "c", "d"]
>>> kf = KFold(n_splits=2)
>>> for train, test in kf.split(X):
...     print("%s %s" % (train, test))
[2 3] [0 1]
[0 1] [2 3]
```

Each fold consists of two arrays: the first holds the indices of the *training set* and the second those of the *test set*. The training and test sets can therefore be created with numpy indexing:

```
>>> X = np.array([[0., 0.], [1., 1.], [-1., -1.], [2., 2.]])
>>> y = np.array([0, 1, 0, 1])
>>> X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
```

### 3.1.3.2. Repeated K-Fold

[`RepeatedKFold`](generated/sklearn.model_selection.RepeatedKFold.html#sklearn.model_selection.RepeatedKFold "sklearn.model_selection.RepeatedKFold") repeats K-Fold n times. It can be used when [`KFold`](generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold "sklearn.model_selection.KFold") needs to be run n times, producing different splits in each repetition.

Example of 2-fold K-Fold repeated 2 times:

```
>>> import numpy as np
>>> from sklearn.model_selection import RepeatedKFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> random_state = 12883823
>>> rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=random_state)
>>> for train, test in rkf.split(X):
...     print("%s %s" % (train, test))
...
[2 3] [0 1]
[0 1] [2 3]
[0 2] [1 3]
[1 3] [0 2]
```

Similarly, [`RepeatedStratifiedKFold`](generated/sklearn.model_selection.RepeatedStratifiedKFold.html#sklearn.model_selection.RepeatedStratifiedKFold "sklearn.model_selection.RepeatedStratifiedKFold") repeats Stratified K-Fold n times with a different randomization in each repetition.

### 3.1.3.3. Leave One Out (LOO)

[`LeaveOneOut`](generated/sklearn.model_selection.LeaveOneOut.html#sklearn.model_selection.LeaveOneOut "sklearn.model_selection.LeaveOneOut") (or LOO) is a simple cross-validation. Each learning set is created by taking all the samples except one, the test set being the sample left out. Thus, for $n$ samples, we have $n$ different training sets and $n$ different test sets. This cross-validation procedure does not waste much data, as only one sample is removed from the training set:

```
>>> from sklearn.model_selection import LeaveOneOut
>>> X = [1, 2, 3, 4]
>>> loo = LeaveOneOut()
>>> for train, test in loo.split(X):
...     print("%s %s" % (train, test))
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]
```

Potential users of LOO for model selection should weigh a few known caveats. Compared with $k$-fold cross-validation, one builds $n$ models from $n$ samples instead of $k$ models, where $n > k$. Moreover, each model is trained on $n - 1$ samples rather than $(k-1) n / k$. On both counts, assuming $k$ is not too large and $k < n$, LOO is more computationally expensive than $k$-fold cross-validation.

In terms of accuracy, LOO often results in high variance as an estimator of the test error. Intuitively, since $n - 1$ of the $n$ samples are used to build each model, the models constructed from the folds are virtually identical to each other and to the model built from the entire training set.

However, if the learning curve is steep for the training size in question, then 5- or 10-fold cross-validation can overestimate the generalization error.

As a general rule, most authors and empirical evidence suggest that 5- or 10-fold cross-validation should be preferred to LOO.

References:

- <http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html>;
- T. Hastie, R. Tibshirani, J. Friedman, [The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn), Springer 2009;
- L. Breiman, P. Spector, [Submodel selection and evaluation in regression: The X-random case](http://digitalassets.lib.berkeley.edu/sdtr/ucb/text/197.pdf), International Statistical Review 1992;
- R. Kohavi, [A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection](http://web.cs.iastate.edu/~jtian/cs573/Papers/Kohavi-IJCAI-95.pdf), Intl. Jnt. Conf. AI;
- R. Bharat Rao, G. Fung, R. Rosales, [On the Dangers of Cross-Validation.
An Experimental Evaluation](http://people.csail.mit.edu/romer/papers/CrossVal_SDM08.pdf), SIAM 2008;
- G. James, D. Witten, T. Hastie, R. Tibshirani, [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL), Springer 2013.

### 3.1.3.4. Leave P Out (LPO)

[`LeavePOut`](generated/sklearn.model_selection.LeavePOut.html#sklearn.model_selection.LeavePOut "sklearn.model_selection.LeavePOut") is very similar to [`LeaveOneOut`](generated/sklearn.model_selection.LeaveOneOut.html#sklearn.model_selection.LeaveOneOut "sklearn.model_selection.LeaveOneOut"), as it creates all the possible training/test sets by removing $p$ samples from the complete set. For $n$ samples, this produces $\binom{n}{p}$ train-test pairs. Unlike [`LeaveOneOut`](generated/sklearn.model_selection.LeaveOneOut.html#sklearn.model_selection.LeaveOneOut "sklearn.model_selection.LeaveOneOut") and [`KFold`](generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold "sklearn.model_selection.KFold"), the test sets overlap when $p > 1$.

Example of Leave-2-Out on a data set with 4 samples:

```
>>> from sklearn.model_selection import LeavePOut
>>> X = np.ones(4)
>>> lpo = LeavePOut(p=2)
>>> for train, test in lpo.split(X):
...     print("%s %s" % (train, test))
[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]
```

### 3.1.3.5. Random permutations cross-validation a.k.a. Shuffle & Split

The [`ShuffleSplit`](generated/sklearn.model_selection.ShuffleSplit.html#sklearn.model_selection.ShuffleSplit "sklearn.model_selection.ShuffleSplit") iterator generates a user-defined number of independent train/test splits. Samples are first shuffled and then split into a pair of train and test sets.

The randomness can be made reproducible by explicitly seeding the `random_state` pseudo-random number generator.

Here is a usage example:

```
>>> from sklearn.model_selection import ShuffleSplit
>>> X = np.arange(5)
>>> ss = ShuffleSplit(n_splits=3, test_size=0.25,
...     random_state=0)
>>> for train_index, test_index in ss.split(X):
...     print("%s %s" % (train_index, test_index))
...
[1 3 4] [2 0]
[1 4 3] [0 2]
[4 0 2] [1 3]
```

[`ShuffleSplit`](generated/sklearn.model_selection.ShuffleSplit.html#sklearn.model_selection.ShuffleSplit "sklearn.model_selection.ShuffleSplit") is thus a good alternative to [`KFold`](generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold "sklearn.model_selection.KFold") cross-validation, as it allows finer control over the number of iterations and the proportion of samples on each side of the train/test split.

## 3.1.4. Cross-validation iterators with stratification based on class labels

Some classification problems exhibit a large imbalance in the distribution of the target classes: for instance, there could be several times more negative samples than positive samples. In such cases it is recommended to use stratified sampling, as implemented in [`StratifiedKFold`](generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold "sklearn.model_selection.StratifiedKFold") and [`StratifiedShuffleSplit`](generated/sklearn.model_selection.StratifiedShuffleSplit.html#sklearn.model_selection.StratifiedShuffleSplit "sklearn.model_selection.StratifiedShuffleSplit"), to ensure that the relative class frequencies are approximately preserved in each train and validation fold.

### 3.1.4.1. Stratified k-fold

[`StratifiedKFold`](generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold "sklearn.model_selection.StratifiedKFold") is a variation of *k-fold* that returns *stratified* folds: each set contains approximately the same percentage of samples of each target class as the complete set.

Example of stratified 3-fold cross-validation on a data set with 10 samples from two slightly unbalanced classes:

```
>>> from sklearn.model_selection import StratifiedKFold
>>> X = np.ones(10)
>>> y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
>>> skf = StratifiedKFold(n_splits=3)
>>> for train, test in skf.split(X, y):
...     print("%s %s" % (train, test))
[2 3 6 7 8 9] [0 1 4 5]
[0 1 3 4 5 8 9] [2 6 7]
[0 1 2 4 5 6 7] [3 8 9]
```

[`RepeatedStratifiedKFold`](generated/sklearn.model_selection.RepeatedStratifiedKFold.html#sklearn.model_selection.RepeatedStratifiedKFold "sklearn.model_selection.RepeatedStratifiedKFold") can be used to repeat Stratified K-Fold n times with a different randomization in each repetition.

### 3.1.4.2. Stratified Shuffle Split

[`StratifiedShuffleSplit`](generated/sklearn.model_selection.StratifiedShuffleSplit.html#sklearn.model_selection.StratifiedShuffleSplit "sklearn.model_selection.StratifiedShuffleSplit") is a variation of *ShuffleSplit* that returns stratified splits, i.e. it creates splits that preserve the same percentage of each target class as in the complete set.

## 3.1.5. Cross-validation iterators for grouped data

The i.i.d. assumption is broken if the underlying generative process yields groups of dependent samples.

Such a grouping of data is domain specific. An example is medical data collected from multiple patients, with multiple samples taken from each patient. Such data is likely to depend on the individual group. In this example, the patient ID of each sample is its group identifier.

In this case we would like to know whether a model trained on a particular set of groups generalizes well to unseen groups. To measure this, we need to ensure that all the samples in the validation fold come from groups that are not represented at all in the paired training fold.

The following cross-validation splitters can be used to do that. The grouping identifier of the samples is specified via the `groups` parameter.

### 3.1.5.1. Group k-fold

[`GroupKFold`](generated/sklearn.model_selection.GroupKFold.html#sklearn.model_selection.GroupKFold "sklearn.model_selection.GroupKFold") is a variation of k-fold which ensures that the same group is not represented in both the testing and the training sets. For example, if the data is obtained from different subjects with several samples per subject, and if the model is flexible enough to learn from highly person-specific features, it could fail to generalize to new subjects. [`GroupKFold`](generated/sklearn.model_selection.GroupKFold.html#sklearn.model_selection.GroupKFold "sklearn.model_selection.GroupKFold") makes it possible to detect this kind of overfitting.

Imagine you have three subjects, each with an associated number from 1 to 3:

```
>>> from sklearn.model_selection import GroupKFold
>>> X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
>>> y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
>>> groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
>>> gkf = GroupKFold(n_splits=3)
>>> for train, test in gkf.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]
```

Each subject is in a different testing fold, and the same subject is never in both testing and training. Notice that the folds do not have exactly the same size because the data is imbalanced.

### 3.1.5.2. Leave One Group Out

[`LeaveOneGroupOut`](generated/sklearn.model_selection.LeaveOneGroupOut.html#sklearn.model_selection.LeaveOneGroupOut "sklearn.model_selection.LeaveOneGroupOut") is a cross-validation scheme which holds out samples according to a third-party provided array of integer groups. This group information can be used to encode arbitrary domain-specific predefined cross-validation folds.

Each training set is thus made up of all the samples except the ones related to a specific group.

For example, in the case of multiple experiments, [`LeaveOneGroupOut`](generated/sklearn.model_selection.LeaveOneGroupOut.html#sklearn.model_selection.LeaveOneGroupOut "sklearn.model_selection.LeaveOneGroupOut") can be used to create a cross-validation based on the different experiments: we create a training set using the samples of all the experiments except one:

```
>>> from sklearn.model_selection import LeaveOneGroupOut
>>> X = [1, 5, 10, 50, 60, 70, 80]
>>> y = [0, 1, 1, 2, 2, 2, 2]
>>> groups = [1, 1, 2, 2, 3, 3, 3]
>>> logo = LeaveOneGroupOut()
>>> for train, test in logo.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
[2 3 4 5 6] [0 1]
[0 1 4 5 6] [2 3]
[0 1 2 3] [4 5 6]
```

Another common application is to use time information: for instance, the groups could be the year in which the samples were collected, which allows cross-validation against time-based splits.

### 3.1.5.3. Leave P Groups Out

[`LeavePGroupsOut`](generated/sklearn.model_selection.LeavePGroupsOut.html#sklearn.model_selection.LeavePGroupsOut "sklearn.model_selection.LeavePGroupsOut") is similar to [`LeaveOneGroupOut`](generated/sklearn.model_selection.LeaveOneGroupOut.html#sklearn.model_selection.LeaveOneGroupOut "sklearn.model_selection.LeaveOneGroupOut"), but removes the samples related to $P$ groups for each training/test set.

Example of Leave-2-Group Out:

```
>>> from sklearn.model_selection import LeavePGroupsOut
>>> X = np.arange(6)
>>> y = [1, 1, 1, 2, 2, 2]
>>> groups = [1, 1, 2, 2, 3, 3]
>>> lpgo = LeavePGroupsOut(n_groups=2)
>>> for train, test in lpgo.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
[4 5] [0 1 2 3]
[2 3] [0 1 4 5]
[0 1] [2 3 4 5]
```

### 3.1.5.4. Group Shuffle Split

The [`GroupShuffleSplit`](generated/sklearn.model_selection.GroupShuffleSplit.html#sklearn.model_selection.GroupShuffleSplit "sklearn.model_selection.GroupShuffleSplit") iterator behaves as a combination of [`ShuffleSplit`](generated/sklearn.model_selection.ShuffleSplit.html#sklearn.model_selection.ShuffleSplit "sklearn.model_selection.ShuffleSplit") and [`LeavePGroupsOut`](generated/sklearn.model_selection.LeavePGroupsOut.html#sklearn.model_selection.LeavePGroupsOut "sklearn.model_selection.LeavePGroupsOut"): it generates a sequence of randomized partitions in which a subset of groups is held out for each split.

Here is a usage example:

```
>>> from sklearn.model_selection import GroupShuffleSplit
>>> X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 0.001]
>>> y = ["a", "b", "b", "b", "c", "c", "c", "a"]
>>> groups = [1, 1, 2, 2, 3, 3, 4, 4]
>>> gss = GroupShuffleSplit(n_splits=4, test_size=0.5, random_state=0)
>>> for train, test in gss.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
...
[0 1 2 3] [4 5 6 7]
[2 3 6 7] [0 1 4 5]
[2 3 4 5] [0 1 6 7]
[4 5 6 7] [0 1 2 3]
```

This class is useful when the behavior of [`LeavePGroupsOut`](generated/sklearn.model_selection.LeavePGroupsOut.html#sklearn.model_selection.LeavePGroupsOut "sklearn.model_selection.LeavePGroupsOut") is desired but the number of groups is so large that generating all possible partitions with $P$ groups withheld would be prohibitively expensive. In such a scenario, [`GroupShuffleSplit`](generated/sklearn.model_selection.GroupShuffleSplit.html#sklearn.model_selection.GroupShuffleSplit "sklearn.model_selection.GroupShuffleSplit") provides a random sample (with replacement) of the train/test splits generated by [`LeavePGroupsOut`](generated/sklearn.model_selection.LeavePGroupsOut.html#sklearn.model_selection.LeavePGroupsOut "sklearn.model_selection.LeavePGroupsOut").

## 3.1.6. Predefined fold-splits / validation sets

For some data sets, a predefined split of the data into training and validation folds, or into several cross-validation folds, already exists. With [`PredefinedSplit`](generated/sklearn.model_selection.PredefinedSplit.html#sklearn.model_selection.PredefinedSplit "sklearn.model_selection.PredefinedSplit") it is possible to use these folds, for example when searching for hyperparameters.

For example, when using a validation set, set `test_fold` to 0 for all samples that are part of the validation set, and to -1 for all other samples.

## 3.1.7. Cross-validation of time series data

Time series data is characterized by the correlation between observations that are near in time (*autocorrelation*). However, classical cross-validation techniques such as [`KFold`](generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold "sklearn.model_selection.KFold") and [`ShuffleSplit`](generated/sklearn.model_selection.ShuffleSplit.html#sklearn.model_selection.ShuffleSplit "sklearn.model_selection.ShuffleSplit") assume the samples are independent and identically distributed, and on time series data they would produce an unreasonable correlation between training and testing instances (yielding poor estimates of the generalization error). Therefore, it is very important to evaluate a time series model on the "future" observations least like those used to train it. To achieve this, one solution is provided by [`TimeSeriesSplit`](generated/sklearn.model_selection.TimeSeriesSplit.html#sklearn.model_selection.TimeSeriesSplit "sklearn.model_selection.TimeSeriesSplit").

### 3.1.7.1. Time Series Split

[`TimeSeriesSplit`](generated/sklearn.model_selection.TimeSeriesSplit.html#sklearn.model_selection.TimeSeriesSplit "sklearn.model_selection.TimeSeriesSplit") is a variation of *k-fold* which returns the first $k$ folds as the training set and the $(k+1)$-th fold as the test set. Note that, unlike standard cross-validation methods, successive training sets are supersets of those that come before them. It also adds all surplus data to the first training partition, which is always used to train the model.

This class can be used to cross-validate time series samples that are observed at fixed time intervals.

Example of 3-split time series cross-validation on a data set with 6 samples:

```
>>> from sklearn.model_selection import TimeSeriesSplit
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4, 5, 6])
>>> tscv = TimeSeriesSplit(n_splits=3)
>>> print(tscv)
TimeSeriesSplit(max_train_size=None, n_splits=3)
>>> for train, test in tscv.split(X):
...     print("%s %s" % (train, test))
[0 1 2] [3]
[0 1 2 3] [4]
[0 1 2 3 4] [5]
```

## 3.1.8. A note on shuffling

If the data ordering is not arbitrary (e.g. samples with the same class label are contiguous), shuffling it first may be essential to get a meaningful cross-validation result. However, the opposite may be true if the samples are not independently and identically distributed. For example, if the samples are news articles ordered by their time of publication, shuffling the data will likely lead to an overfit model and an inflated validation score, because the model is tested on samples that are artificially similar (close in time) to the training samples.

Some cross-validation iterators, such as [`KFold`](generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold "sklearn.model_selection.KFold"), have a built-in option to shuffle the data indices before splitting them. Note that:

- This consumes less memory than shuffling the data directly.
- By default no shuffling occurs, including for the (stratified) K-fold cross-validation performed by specifying `cv=some_integer` to [`cross_val_score`](generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score "sklearn.model_selection.cross_val_score"), grid search, etc. Keep in mind that [`train_test_split`](generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split "sklearn.model_selection.train_test_split") still returns a random split.
- The `random_state` parameter defaults to `None`, meaning that the shuffling is different every time `KFold(..., shuffle=True)` is iterated. However, `GridSearchCV` uses the same shuffling for each set of parameters validated by a single call to its `fit` method.
- To make results repeatable (on the same platform), use a fixed value for `random_state`.

## 3.1.9. Cross validation and model selection

Cross-validation iterators can also be used to perform model selection directly, using grid search to find the optimal hyperparameters of the model. This is the topic of the next section: [Tuning the hyper-parameters of an estimator](grid_search.html#grid-search).
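
As a rough illustration of how a cross-validation iterator can drive model selection, and of the seeded shuffling discussed in the note on shuffling, the sketch below passes a shuffled `StratifiedKFold` to `GridSearchCV` and keeps a separate test set for the final evaluation. The candidate values for `C`, the number of splits and the seed are assumptions chosen for this example, not recommendations from the text.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

iris = datasets.load_iris()

# Hold out a final test set that the hyperparameter search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

# Shuffle the indices before splitting; the fixed seed makes the folds repeatable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Candidate values for C are an assumption made for this illustration.
param_grid = {"C": [0.1, 1, 10]}

# Every candidate is scored on the same 5 folds of the training set.
search = GridSearchCV(svm.SVC(kernel="linear"), param_grid, cv=cv)
search.fit(X_train, y_train)

best_C = search.best_params_["C"]
cv_accuracy = search.best_score_              # mean accuracy over the folds
test_accuracy = search.score(X_test, y_test)  # refitted best model on held-out data

print("best C:", best_C)
print("CV accuracy: %0.2f, test accuracy: %0.2f" % (cv_accuracy, test_accuracy))
```

Because the folds are taken from the training portion only, the test accuracy remains an estimate on data that played no part in choosing `C`.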