14.4 What does "predict" really mean? · Stanford Stats60: Statistical Thinking for the 21st Century

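## 14.4 What does "predict" really mean?

When we talk about "prediction" in daily life, we generally mean the ability to estimate the value of some variable before seeing the data. In the context of linear regression, however, the term is often used to refer to the fit of the model to the data: the estimated values (ŷ) are sometimes called "predictions", and the independent variables are called "predictors". This has an unfortunate connotation, because it implies that our model should also be able to predict the values of new data points in the future. In reality, the fit of a model to the dataset used to estimate its parameters will nearly always be better than its fit to a new dataset (Copas 1983).

As an example, let's take a sample of 48 children from NHANES and fit a regression model for weight that includes several regressors (age, height, hours spent watching TV and using the computer, and household income) along with their interactions.

```r
# packages used throughout this section
library(tidyverse)
library(NHANES)
library(pander)

# create dataframe with children with complete data on all variables
NHANES_child <- NHANES %>%
  drop_na(Height, Weight, TVHrsDayChild, HHIncomeMid, CompHrsDayChild, Age) %>%
  dplyr::filter(Age < 18)
```

```r
# create function to sample data and compute regression on in-sample and out-of-sample data
get_sample_predictions <- function(sample_size, shuffle = FALSE) {
  # generate a sample from NHANES
  orig_sample <- NHANES_child %>%
    sample_n(sample_size)

  # if shuffle is turned on, then randomly shuffle the weight variable
  if (shuffle) {
    orig_sample$Weight <- sample(orig_sample$Weight)
  }

  # compute the regression line for Weight, as a function of several
  # other variables (with all possible interactions between variables)
  heightRegressOrig <- lm(
    Weight ~ Height * TVHrsDayChild * CompHrsDayChild * HHIncomeMid * Age,
    data = orig_sample
  )

  # compute the predictions
  pred_orig <- predict(heightRegressOrig)

  # create a new sample from the same population
  new_sample <- NHANES_child %>%
    sample_n(sample_size)

  # use the model from the original sample to predict the
  # Weight values for the new sample
  pred_new <- predict(heightRegressOrig, new_sample)

  # return r-squared and rmse for original and new data
  return(c(
    cor(pred_orig, orig_sample$Weight)^2,
    cor(pred_new, new_sample$Weight)^2,
    sqrt(mean((pred_orig - orig_sample$Weight)^2)),
    sqrt(mean((pred_new - new_sample$Weight)^2))
  ))
}
```

```r
# implement the function
sim_results <- replicate(100, get_sample_predictions(sample_size = 48, shuffle = FALSE))

sim_results <- t(sim_results) %>%
  data.frame()

# columns X3 and X4 hold the RMSE for the original and new data
mean_rsquared <- sim_results %>%
  summarize(
    rmse_original_data = mean(X3),
    rmse_new_data = mean(X4)
  )

pander(mean_rsquared)
```

| rmse_original_data | rmse_new_data |
| --- | --- |
| 2.97 | 25.72 |

Here we see that whereas the model fit on the original data shows a very good fit (off by only a few kilograms per individual; NHANES records `Weight` in kg), the same model does a much worse job of predicting the weight of new children sampled from the same population (off by more than 25 kg per individual). This happens because the model we specified is very complex: it includes not only each individual variable but also every possible combination of them (that is, their _interactions_), which yields a model with 32 parameters. Because this is almost as many coefficients as there are data points (the weights of 48 children), the model _overfits_ the data, just like the complex polynomial curve in our initial overfitting example in [Section 5.4](#overfitting).

Another way to see the effects of overfitting is to look at what happens if we randomly shuffle the values of the weight variable. Shuffling should make it impossible to predict weight from the other variables, because they should no longer have any systematic relationship.

```r
print("using shuffled y variable to simulate null effect")
```

```r
## [1] "using shuffled y variable to simulate null effect"
```

```r
sim_results <- replicate(100, get_sample_predictions(sample_size = 48, shuffle = TRUE))

sim_results <- t(sim_results) %>%
  data.frame()

mean_rsquared <- sim_results %>%
  summarize(
    rmse_original_data = mean(X3),
    rmse_new_data = mean(X4)
  )

pander(mean_rsquared)
```

| rmse_original_data | rmse_new_data |
| --- | --- |
| 7.56 | 60.1 |

This shows us that even when there is no true relationship to be modeled (because shuffling should have obliterated it), the complex model still shows a very low error on the data it was fit to, because it fits the noise in that specific dataset. When the model is applied to a new dataset, however, the error is much larger, as it should be.

### 14.4.1 Cross-validation

One method that has been developed to help address the problem of overfitting is _cross-validation_. This technique is commonly used in machine learning, a field focused on building models that generalize well to new data, even when we don't have a new dataset to test the model on. The idea behind cross-validation is that we fit our model repeatedly, each time leaving out a subset of the data, and then test the model's ability to predict the values in each held-out subset.

![A schematic of the cross-validation procedure.](https://img.kancloud.cn/7e/67/7e6793ba6a5e7dc79eb97f24e6101892_3293x1690.png)

Figure 14.9: A schematic of the cross-validation procedure.

Let's see how that works for our weight prediction example. In this case we will perform 12-fold cross-validation, which means that we split the data into 12 subsets and then fit the model 12 times, each time leaving out one of the subsets and then testing the model's ability to accurately predict the value of the dependent variable for those held-out data points. The `caret` package in R makes it easy to run cross-validation on a dataset:

```r
library(caret)

# create a function to run cross-validation
# returns the mean r-squared and RMSE for the out-of-sample predictions
compute_cv <- function(d, nfolds = 12) {
  # based on https://quantdev.ssri.psu.edu/tutorials/cross-validation-tutorial
  train_ctrl <- trainControl(method = "cv", number = nfolds)
  model_caret <- train(
    Weight ~ Height * TVHrsDayChild * CompHrsDayChild * HHIncomeMid * Age,
    data = d,
    trControl = train_ctrl, # folds
    method = "lm" # specifying regression model
  )

  r2_cv <- mean(model_caret$resample$Rsquared)
  rmse_cv <- mean(model_caret$resample$RMSE)
  return(c(r2_cv, rmse_cv))
}
```

Using this function, we can run cross-validation on 100 samples from the NHANES dataset and compute the cross-validated RMSE, along with the RMSE for the original data and for a new dataset, as we computed above.

```r
# implement the function
# NOTE: get_sample_predictions_cv() is not defined in this section. It is
# assumed to extend get_sample_predictions() with compute_cv(), returning
# R-squared for the original, new, and cross-validated data (X1-X3)
# followed by the corresponding RMSE values (X4-X6).
sim_results <- replicate(100, get_sample_predictions_cv(sample_size = 48, shuffle = FALSE))

sim_results <- t(sim_results) %>%
  data.frame()

mean_rsquared <- sim_results %>%
  summarize(
    rmse_original_data = mean(X4),
    rmse_new_data = mean(X5),
    rmse_crossvalidation = mean(X6)
  )

pander(mean_rsquared)
```

| rmse_original_data | rmse_new_data | rmse_crossvalidation |
| --- | --- | --- |
| 2.98 | 21.64 | 29.29 |

Here we see that cross-validation gives us an estimate of predictive accuracy that is much closer to what we see with a completely new dataset than to the inflated accuracy we see with the original dataset. In fact, it is even somewhat more pessimistic than the average for a new dataset, probably because only part of the data is used to train each model.

We can also confirm that cross-validation accurately estimates predictive accuracy when the dependent variable is randomly shuffled:

| rmse_original_data | rmse_new_data | rmse_crossvalidation |
| --- | --- | --- |
| 7.9 | 73.7 | 75.31 |

Here again we see that cross-validation gives us an assessment of prediction accuracy that is much closer to what we expect with new data, and again slightly more pessimistic.

Using cross-validation properly is tricky, and it is recommended that you consult an expert before using it in practice. However, this section has hopefully shown you three things:

* "Prediction" doesn't always mean what you think it means.
* Complex models can overfit data very badly, such that one can see seemingly good prediction even when there is no true signal to predict.
* You should view claims about prediction accuracy very skeptically unless they have been obtained using the appropriate methods.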
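
To make the cross-validation procedure from Section 14.4.1 concrete, the following is a minimal base-R sketch of what k-fold cross-validation does internally. It is not the code used to produce the tables above: the data are simulated (pure noise standing in for the shuffled NHANES sample), the model has four made-up predictors `x1` to `x4` rather than the five NHANES variables, and `kfold_rmse()` is a hypothetical helper written for illustration, not a function from `caret`.

```r
# Minimal base-R sketch of k-fold cross-validation.
# ASSUMPTIONS: simulated data stand in for the NHANES sample, and
# kfold_rmse() is a hypothetical helper, not part of caret.
set.seed(123)
n <- 48
sim_data <- data.frame(
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n)
)
# pure-noise outcome: no true relationship to the predictors
sim_data$y <- rnorm(n)

kfold_rmse <- function(d, formula, nfolds = 12) {
  # randomly assign each row to one of nfolds (near-)equal folds
  fold_id <- sample(rep(seq_len(nfolds), length.out = nrow(d)))
  # name of the response variable, taken from the left side of the formula
  response <- all.vars(formula)[1]
  fold_rmse <- sapply(seq_len(nfolds), function(k) {
    train <- d[fold_id != k, ]          # fit on all folds except k
    test <- d[fold_id == k, ]           # hold out fold k
    fit <- lm(formula, data = train)
    pred <- predict(fit, newdata = test)
    sqrt(mean((pred - test[[response]])^2))
  })
  # average the held-out RMSE across folds
  mean(fold_rmse)
}

# all main effects and interactions of four predictors: 16 coefficients
model_formula <- y ~ x1 * x2 * x3 * x4

# in-sample RMSE from a single fit to all of the data
full_fit <- lm(model_formula, data = sim_data)
rmse_in_sample <- sqrt(mean(residuals(full_fit)^2))

# cross-validated RMSE from 12 fits, each tested on its held-out fold
rmse_cv <- kfold_rmse(sim_data, model_formula, nfolds = 12)

print(c(in_sample = rmse_in_sample, cross_validated = rmse_cv))
```

Because the outcome here is noise, the in-sample RMSE reflects only how well the 16 coefficients absorb that noise, while the cross-validated RMSE is computed on rows the model never saw, which is the same contrast the tables above show for the NHANES samples.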