7.2 Sampling error · Stanford Stats60: Statistical Thinking for the 21st Century

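## 7.2 Sampling error

No matter how representative our sample is, the statistic we compute from it will very likely differ at least slightly from the population parameter. We call this difference _sampling error_. The value of our statistical estimate will also vary from sample to sample; we call the distribution of the statistic across samples the _sampling distribution_.

Sampling error is directly related to the quality of our measurement of the population. Clearly, we want the estimates obtained from our sample to be as close as possible to the true value of the population parameter. However, even if our statistic is unbiased (that is, in the long run we expect it to have the same value as the population parameter), the value of any particular estimate will differ from the population value, and those differences will be larger when the sampling error is larger. Reducing sampling error is therefore an important step toward better measurement.

We will use the NHANES dataset as our example. We will assume that NHANES is the entire population, and then draw random samples from that population. In the next chapter we will have more to say about how the generation of "random" samples works in a computer.

```r
# load dplyr, which provides %>%, distinct(), filter(), sample_n() and tibble()
library(dplyr)

# load the NHANES data library
library(NHANES)

# create a NHANES dataset without duplicated IDs
NHANES <- NHANES %>%
  distinct(ID, .keep_all = TRUE)

# create a dataset of only adults
NHANES_adult <- NHANES %>%
  filter(
    !is.na(Height),
    Age >= 18
  )

# print the NHANES population mean and standard deviation of adult height
sprintf(
  "Population height: mean = %.2f",
  mean(NHANES_adult$Height)
)
```

```r
## [1] "Population height: mean = 168.35"
```

```r
sprintf(
  "Population height: std deviation = %.2f",
  sd(NHANES_adult$Height)
)
```

```r
## [1] "Population height: std deviation = 10.16"
```

In this example we know the mean and standard deviation of height for the adult population, because we are assuming that the NHANES dataset contains the entire adult population. Now let's draw a single sample of 50 individuals from the NHANES population and compare the resulting statistics with the population parameters.

```r
# sample 50 individuals from NHANES dataset
exampleSample <- NHANES_adult %>%
  sample_n(50)

# print the sample mean and standard deviation of adult height
sprintf(
  'Sample height: mean = %.2f',
  mean(exampleSample$Height)
)
```

```r
## [1] "Sample height: mean = 169.46"
```

```r
sprintf(
  'Sample height: std deviation = %.2f',
  sd(exampleSample$Height)
)
```

```r
## [1] "Sample height: std deviation = 10.07"
```

The sample mean and standard deviation are similar to the population values, but not exactly equal to them. Now let's take a large number of samples of 50 individuals, compute the mean of each sample, and look at the resulting sampling distribution of the mean. To estimate the sampling distribution well, we have to decide how many samples to take. In this case, let's take 5000 samples so that we can be really confident in the answer. Note that simulations like this can sometimes take a few minutes to run, and may make your computer huff and puff. The histogram in Figure [7.1](#fig:samplePlot) shows that the means estimated from each sample of 50 individuals vary somewhat, but that overall they are centered around the population mean.

```r
# compute sample means across 5000 samples from NHANES data
sampSize <- 50 # size of sample
nsamps <- 5000 # number of samples we will take

# set up variable to store all of the results
sampMeans <- array(NA, nsamps)

# Loop through and repeatedly sample and compute the mean
for (i in 1:nsamps) {
  NHANES_sample <- sample_n(NHANES_adult, sampSize)
  sampMeans[i] <- mean(NHANES_sample$Height)
}

sprintf(
  "Average sample mean = %.2f",
  mean(sampMeans)
)
```

```r
## [1] "Average sample mean = 168.33"
```

```r
# store the sample means in a data frame for plotting
sampMeans_df <- tibble(sampMeans = sampMeans)
```

![The blue histogram shows the sampling distribution of the mean over 5000 random samples from the NHANES dataset. The histogram for the full dataset is shown in gray for reference.](https://img.kancloud.cn/59/ec/59ec8a12c53f30d0237f9e949e084785_768x384.png)

Figure 7.1: The blue histogram shows the sampling distribution of the mean across 5000 random samples from the NHANES dataset. The histogram for the full dataset is shown in gray for reference.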
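
The same simulation can be repeated at several sample sizes to see how the spread of the sampling distribution changes. The block below is a minimal sketch rather than part of the original analysis. To stay self-contained it does not load NHANES; instead it assumes a hypothetical population of 5000 normally distributed heights whose mean and standard deviation are set to the NHANES values printed above. The helper `simulateSampleMeans()` and the chosen sample sizes are illustrative names and values, not taken from the source.

```r
# Hypothetical stand-in for the population: 5000 simulated adult heights (cm).
# The mean and SD are the NHANES values reported above; the normal shape and
# the population size are assumptions made only for this sketch.
set.seed(12345)
population <- rnorm(5000, mean = 168.35, sd = 10.16)

# Draw `nsamps` random samples of size `sampSize` (without replacement)
# from `population` and return the mean of each sample.
simulateSampleMeans <- function(population, sampSize, nsamps = 5000) {
  replicate(nsamps, mean(sample(population, sampSize)))
}

# Sample sizes to compare (illustrative choices)
sampleSizes <- c(10, 50, 250)

# One vector of 5000 sample means per sample size
sampMeansBySize <- lapply(
  sampleSizes,
  function(n) simulateSampleMeans(population, n)
)
names(sampMeansBySize) <- sampleSizes

# Center and spread of each simulated sampling distribution
summaryBySize <- data.frame(
  sampSize = sampleSizes,
  meanOfSampleMeans = sapply(sampMeansBySize, mean),
  sdOfSampleMeans = sapply(sampMeansBySize, sd)
)
print(summaryBySize)
```

In the printed table, `meanOfSampleMeans` should stay close to the population mean at every sample size, while `sdOfSampleMeans` should shrink as the sample size grows. Larger samples give estimates that vary less from sample to sample, which is one way to reduce sampling error.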