14.2 Fitting more complex models · Stanford Stats60: Statistical Thinking for the 21st Century

## 14.2 Fitting more complex models

We often want to understand how several variables affect a particular outcome, and how those variables relate to one another. In our study-time example, suppose we discover that some of the students had previously taken a course on the topic. If we plot their grades (see Figure [14.4](#fig:StudytimeGradesPrior)), we can see that, for the same amount of study time, students who had taken a prior course performed much better than those who had not.

![The relationship between study time and grades, with color identifying whether each student had taken a previous course on the topic](https://img.kancloud.cn/d2/bd/d2bda9a47126b3bdfc4dde976a3e7457_576x384.png)

Figure 14.4 The relationship between study time and grades, with color identifying whether each student had taken a previous course on the topic.

We would like to build a statistical model that takes this into account, which we can do by extending the model we built above:

![](https://img.kancloud.cn/cd/56/cd5690af9fe7459761bc3372b7dbb896_336x21.jpg)

To model whether each person has had a previous class, we use what we call _dummy coding_: we create a new variable that takes the value 1 if the person has had a class before and 0 otherwise. This means that, for people who have had the class before, we simply add the value of ![](https://img.kancloud.cn/ae/a2/aea2e4cd5281920eaec9f6e53c40de1e_16x21.jpg) to their predicted value; that is, with dummy coding, ![](https://img.kancloud.cn/ae/a2/aea2e4cd5281920eaec9f6e53c40de1e_16x21.jpg) simply reflects the difference in means between the two groups. Our estimate of ![](https://img.kancloud.cn/0a/ae/0aae65eceb0e309cde94947b99a24033_16x21.jpg) reflects the regression slope over all of the data points: we assume that the slope is the same regardless of whether someone has had a class before (see Figure [14.5](#fig:LinearRegressionByPriorClass)).

```r
# perform linear regression for study time and prior class
# must change priorClass to a factor variable
df$priorClass <- as.factor(df$priorClass)
lmResultTwoVars <- lm(grade ~ studyTime + priorClass, data = df)
summary(lmResultTwoVars)
```

```r
## 
## Call:
## lm(formula = grade ~ studyTime + priorClass, data = df)
## 
## Residuals:
##       1       2       3       4       5       6       7       8 
##  3.5833  0.7500 -3.5833 -0.0833  0.7500 -6.4167  2.0833  2.9167 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    70.08       3.77   18.60  8.3e-06 ***
## studyTime       5.00       1.37    3.66    0.015 *  
## priorClass1     9.17       2.88    3.18    0.024 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4 on 5 degrees of freedom
## Multiple R-squared:  0.803,  Adjusted R-squared:  0.724 
## F-statistic: 10.2 on 2 and 5 DF,  p-value: 0.0173
```

![The relation between study time and grade including prior experience as an additional component in the model. The blue line shows the slope relating grades to study time, and the black dotted line corresponds to the difference in means between the two groups.](https://img.kancloud.cn/78/72/787299ed24fbe6c360ca92e43dbd58ea_576x384.png)

Figure 14.5 The relation between study time and grade, including prior experience as an additional component in the model. The blue line shows the slope relating grades to study time, and the black dotted line corresponds to the difference in means between the two groups.
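
To make the dummy-coding idea concrete, here is a minimal sketch in R. It first shows how R turns a two-level factor into a 0/1 column, which is why the coefficient in the output is labeled `priorClass1`. It then plugs the rounded estimates printed in the regression output into the model by hand. The vector `prior`, the helper `predict_grade`, and the study time of 2 are illustrative assumptions, not part of the original analysis.

```r
# Dummy coding: a factor with levels "0" and "1" becomes one 0/1 column.
# `prior` is a made-up vector used only to show the coding.
prior <- factor(c(0, 1, 1, 0))
design <- model.matrix(~ prior)
print(design)  # columns "(Intercept)" and "prior1"

# Rounded estimates copied from the summary() output in this section
b_intercept <- 70.08  # (Intercept)
b_study     <- 5.00   # studyTime
b_prior     <- 9.17   # priorClass1

# Hypothetical helper: predicted grade under the additive model
predict_grade <- function(study_time, prior_class) {
  b_intercept + b_study * study_time + b_prior * prior_class
}

# Two hypothetical students with the same study time (2)
no_prior   <- predict_grade(2, 0)
with_prior <- predict_grade(2, 1)

print(no_prior)               # 80.08
print(with_prior)             # 89.25
print(with_prior - no_prior)  # 9.17, the priorClass1 estimate
```

Because the model has no interaction term, the gap between the two predictions is the same at every study time. This is the constant vertical offset between the two groups that the model assumes.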