Estimate the Number of Experiment Repeats for Stochastic Machine Learning Algorithms · Machine Learning Mastery Blog Post Translations

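# Estimate the Number of Experiment Repeats for Stochastic Machine Learning Algorithms

> Original: [https://machinelearningmastery.com/estimate-number-experiment-repeats-stochastic-machine-learning-algorithms/](https://machinelearningmastery.com/estimate-number-experiment-repeats-stochastic-machine-learning-algorithms/)

A problem with many stochastic machine learning algorithms is that different runs of the same algorithm on the same data return different results.

This means that when you run experiments to configure a stochastic algorithm or to compare algorithms, you must collect multiple results and use the average performance to summarize the skill of the model.

This raises the question of how many repeats of an experiment are enough to characterize the skill of a stochastic machine learning algorithm on a given problem.

It is often recommended to use 30 or more repeats, or even 100. Some practitioners use thousands of repeats, seemingly well past the point of diminishing returns.

In this tutorial, you will explore statistical methods that you can use to estimate the right number of repeats to effectively characterize the performance of a stochastic machine learning algorithm.

Let's get started.

![Estimate the Number of Experiment Repeats for Stochastic Machine Learning Algorithms](img/471927a6fd8dce30b642fff079e8c02a.jpg)

Estimate the Number of Experiment Repeats for Stochastic Machine Learning Algorithms
Photo by [oatsy40](https://www.flickr.com/photos/oatsy40/9920211843/), some rights reserved.

## Tutorial Overview

This tutorial is divided into 4 parts. They are:

1. Generate Data.
2. Basic Analysis.
3. Impact of the Number of Repeats.
4. Calculate Standard Error.

This tutorial assumes you have a working Python 2 or 3 SciPy environment installed with NumPy, Pandas and Matplotlib.

## 1. Generate Data

The first step is to generate some data.

We will pretend that we have fit a neural network or some other stochastic algorithm to a training dataset 1000 times and collected the final RMSE score on a dataset. We will further assume that the data is normally distributed, which is a requirement for the type of analysis we will use in this tutorial.

Always check the distribution of your results; more often than not the results will be Gaussian.

We will generate a population of results to analyze. This is useful because we will know the true population mean and standard deviation, which we would not know in a real scenario.

We will use a mean score of 60 with a standard deviation of 10.

The code below generates the sample of 1000 random results and saves them to a CSV file called _results.csv_.

We use the [seed()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.seed.html) function to seed the random number generator so that we always get the same results each time this code is run (so you get the same numbers I do). We then use the [normal()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.normal.html) function to generate Gaussian random numbers and the [savetxt()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html) function to save the array of numbers in ASCII format.

```py
from numpy.random import seed
from numpy.random import normal
from numpy import savetxt
# define underlying distribution of results
mean = 60
stdev = 10
# generate samples from ideal distribution
seed(1)
results = normal(mean, stdev, 1000)
# save to ASCII file
savetxt('results.csv', results)
```

You should now have a file called _results.csv_ with 1000 final results from our pretend stochastic algorithm test harness.

Below are the last 10 rows of the file for context.

```py
...
6.160564991742511864e+01
5.879850024371251038e+01
6.385602292344325548e+01
6.718290735754342791e+01
7.291188902850875309e+01
5.883555851728335995e+01
3.722702003339634302e+01
5.930375460544870947e+01
6.353870426882840405e+01
5.813044983467250404e+01
```

For the time being, we will forget that we know how these fake results were generated.

## 2. Basic Analysis

The first step when we have a population of results is to do some basic statistical analysis and see what we have.

Three useful tools for basic analysis include:

1. Calculate summary statistics, such as mean, standard deviation, and percentiles.
2. Review the spread of the data using a box and whisker plot.
3. Review the distribution of the data using a histogram.

The code below performs this basic analysis. First _results.csv_ is loaded, then summary statistics are calculated and the plots are shown.

```py
from pandas import read_csv
from matplotlib import pyplot
# load results file
results = read_csv('results.csv', header=None)
# descriptive stats
print(results.describe())
# box and whisker plot
results.boxplot()
pyplot.show()
# histogram
results.hist()
pyplot.show()
```

Running the example first prints the summary statistics.

We can see that the average performance of the algorithm is about 60.3 units with a standard deviation of about 9.8.

If we assume the score is a minimizing score like RMSE, we can see that the worst performance was about 99.5 and the best performance was about 29.4.

```py
count    1000.000000
mean       60.388125
std         9.814950
min        29.462356
25%        53.998396
50%        60.412926
75%        67.039989
max        99.586027
```

A box and whisker plot is created to summarize the spread of the data, showing the middle 50% (box), outliers (dots), and the median (green line).

We can see that the spread of the results appears reasonable, even around the median.

![Box and Whisker Plot of Model Skill](img/6fdb700d723d4493b24cea722235372c.jpg)

Box and Whisker Plot of Model Skill

Finally, a histogram of the results is created. We can see the tell-tale bell curve shape of the Gaussian distribution, which is a good sign as it means we can use the standard statistical tools.

We do not see any obvious skew in the distribution; it appears to be centered around 60.

![Histogram of Model Skill Distribution](img/d6c3bd2d9265c9b926c9e9aad9b2a5fa.jpg)

Histogram of Model Skill Distribution

## 3. Impact of the Number of Repeats

We have a lot of results, 1000 to be exact.

This may be far more results than we need, or not enough.

How do we know?

We can get a first-cut idea by plotting the number of repeats of an experiment against the average of the scores from those repeats.

We would expect the average score to stabilize quickly as the number of repeats increases. The plot should be noisy at first, followed by a long, stable tail.

The code below creates this graph.

```py
from pandas import read_csv
from numpy import mean
from matplotlib import pyplot
# load results file
results = read_csv('results.csv', header=None)
values = results.values
# collect cumulative stats: the mean of the first i results
means = list()
for i in range(1,len(values)+1):
    data = values[0:i, 0]
    mean_rmse = mean(data)
    means.append(mean_rmse)
# line plot of cumulative values
pyplot.plot(means)
pyplot.show()
```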
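The loop above recomputes the mean of every prefix from scratch. As a minimal sketch, the same running mean can be obtained in one pass from a cumulative sum. The `scores` array here is a stand-in generated with NumPy's `default_rng` (an assumption made so the snippet runs without _results.csv_), so its numbers will not match the tutorial's file.

```python
import numpy as np

# Stand-in for the scores loaded from results.csv (assumption: the same
# Gaussian setup as the tutorial, mean 60 and standard deviation 10).
rng = np.random.default_rng(1)
scores = rng.normal(60, 10, 1000)

# Running sum divided by running count gives the mean of the first k scores.
counts = np.arange(1, len(scores) + 1)
running_means = np.cumsum(scores) / counts

# Report the sample mean at a few repeat counts.
for k in (10, 30, 100, 500, 1000):
    print(k, round(float(running_means[k - 1]), 3))
```

Passing `running_means` to `pyplot.plot()` should give the same kind of curve as the loop-based version, with the work growing linearly rather than quadratically in the number of repeats.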
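The line plot of cumulative means does show a period of noisy average results, for perhaps the first 200 repeats, until it becomes stable. It appears to become even more stable after about 600 repeats.

![Line Plot of the Number of Experiment Repeats vs Mean Model Skill](img/7181705d6f97fb147593d1144945ea84.jpg)

Line Plot of the Number of Experiment Repeats vs Mean Model Skill

We can zoom this plot in to the first 500 repeats to see if we can better understand what is happening.

We can also overlay the final mean score (the mean of all 1000 runs) and try to locate a point of diminishing returns.

```py
from pandas import read_csv
from numpy import mean
from matplotlib import pyplot
# load results file
results = read_csv('results.csv', header=None)
values = results.values
# mean of all 1000 results, used as a reference line
final_mean = mean(values)
# collect cumulative stats for the first 500 repeats
means = list()
for i in range(1,501):
    data = values[0:i, 0]
    mean_rmse = mean(data)
    means.append(mean_rmse)
# line plot of cumulative values
pyplot.plot(means)
pyplot.plot([final_mean for x in range(len(means))])
pyplot.show()
```

The orange line shows the mean of all 1000 runs.

We can see that 100 runs might be a good point to stop. Going on to perhaps 400 runs gives a more refined result, but only slightly.

![Line Plot of the Number of Experiment Repeats vs Mean Model Skill Truncated to 500 Repeats and Showing the Final Mean](img/ae34004f9fc1f33fbf906fd492127762.jpg)

Line Plot of the Number of Experiment Repeats vs Mean Model Skill Truncated to 500 Repeats and Showing the Final Mean

This is a good start, but can we do better?

## 4. Calculate Standard Error

Standard error is a calculation of how much the "sample mean" differs from the "population mean".

This is different from standard deviation, which describes the average amount of variation of the observations within a sample.

For a given sample size, the standard error indicates the amount of error, or the spread of error, that can be expected between the sample mean and the underlying, unknown population mean.

Standard error can be calculated as follows:

```py
standard_error = sample_standard_deviation / sqrt(number of repeats)
```

In this case, that is the standard deviation of the sample of model scores divided by the square root of the total number of repeats.

We would expect the standard error to decrease as the number of repeats of the experiment increases.

Given the results, we can calculate the standard error of the sample mean from the population mean at each number of repeats. The full code listing is below.

```py
from pandas import read_csv
from numpy import std
from matplotlib import pyplot
from math import sqrt
# load results file
results = read_csv('results.csv', header=None)
values = results.values
# collect cumulative stats: standard error of the first i results
std_errors = list()
for i in range(1,len(values)+1):
    data = values[0:i, 0]
    # note: numpy's std() defaults to the population form (ddof=0)
    stderr = std(data) / sqrt(len(data))
    std_errors.append(stderr)
# line plot of cumulative values
pyplot.plot(std_errors)
pyplot.show()
```

A line plot of standard error vs the number of repeats is created.

We can see that, as expected, the standard error decreases as the number of repeats increases. We can also see that there will be a point of acceptable error, say one or two units.

The units of the standard error are the same as the units of the model skill.

![Line Plot of the Standard Error of the Sample Mean from the Population Mean](img/76e04560c7bed62bb574891795f60fa1.jpg)

Line Plot of the Standard Error of the Sample Mean from the Population Mean

We can recreate the graph above and draw lines at 0.5 and 1 units as guides that can be used to find an acceptable level of error.

```py
from pandas import read_csv
from numpy import std
from matplotlib import pyplot
from math import sqrt
# load results file
results = read_csv('results.csv', header=None)
values = results.values
# collect cumulative stats: standard error of the first i results
std_errors = list()
for i in range(1,len(values)+1):
    data = values[0:i, 0]
    stderr = std(data) / sqrt(len(data))
    std_errors.append(stderr)
# line plot of cumulative values with guides at 0.5 and 1 units
pyplot.plot(std_errors)
pyplot.plot([0.5 for x in range(len(std_errors))], color='red')
pyplot.plot([1 for x in range(len(std_errors))], color='red')
pyplot.show()
```

Again, we see the same line plot of standard error, now with red guide lines at a standard error of 1 and 0.5.

We can see that if a standard error of 1 is acceptable, then about 100 repeats would be sufficient. If a standard error of 0.5 is required, perhaps 350 to 400 repeats would be needed, since with a standard deviation near 9.8 the formula gives (9.8 / 0.5)^2 ≈ 384.

We can see that the number of repeats quickly reaches a point of diminishing returns on standard error.

Again, remember that standard error is a measure of how much the mean of the sample of model skill scores differs from the mean of the true underlying population of possible scores for a given model configuration under random initial conditions.

![Line Plot of the Standard Error of the Sample Mean from the Population Mean With Markers](img/cd8ed82f64f58581e6d17f6bee5a907f.jpg)

Line Plot of the Standard Error of the Sample Mean from the Population Mean With Markers

We can also use the standard error to put a confidence interval on the mean model skill.

For example, we can state that the unknown population mean performance of a model has a 95% likelihood of lying between an upper and a lower bound.

Note that this method is only appropriate for modest and large numbers of repeats, such as 20 or more.

The confidence interval can be defined as:

```py
sample mean +/- (standard error * 1.96)
```

We can calculate this confidence interval and add it as error bars to the sample mean for each number of repeats.

The full code listing is below.

```py
from pandas import read_csv
from numpy import std
from numpy import mean
from matplotlib import pyplot
from math import sqrt
# load results file
results = read_csv('results.csv', header=None)
values = results.values
# collect cumulative stats, starting at 20 repeats
means, confidence = list(), list()
n = len(values) + 1
for i in range(20,n):
    data = values[0:i, 0]
    mean_rmse = mean(data)
    stderr = std(data) / sqrt(len(data))
    # half-width of the 95% confidence interval
    conf = stderr * 1.96
    means.append(mean_rmse)
    confidence.append(conf)
# line plot of cumulative values with error bars
pyplot.errorbar(range(20, n), means, yerr=confidence)
# true population mean, known only because the data was contrived
pyplot.plot(range(20, n), [60 for x in range(len(means))], color='red')
pyplot.show()
```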
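To make the interval definition above concrete for a single set of repeats, here is a minimal sketch of a helper that returns the sample mean with its lower and upper 95% bounds. The function name `confidence_interval` and the generated `example_scores` are hypothetical and used only for illustration. Unlike the listings in this tutorial, it uses the standard library's `statistics.stdev`, which applies the sample (n - 1) form of the standard deviation.

```python
import random
from math import sqrt
from statistics import mean, stdev

def confidence_interval(scores, z=1.96):
    """Return (sample mean, lower bound, upper bound) for a list of scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate the spread")
    centre = mean(scores)
    # statistics.stdev uses the n - 1 (sample) form, unlike numpy.std's default
    std_error = stdev(scores) / sqrt(n)
    margin = z * std_error
    return centre, centre - margin, centre + margin

# Hypothetical scores from 30 repeats, drawn from the tutorial's distribution
random.seed(1)
example_scores = [random.gauss(60, 10) for _ in range(30)]
centre, lower, upper = confidence_interval(example_scores)
print('mean=%.3f, 95%% interval=[%.3f, %.3f]' % (centre, lower, upper))
```

The default `z=1.96` corresponds to the 95% level used in the text. As noted there, the normal approximation behind it is only reasonable for around 20 or more repeats.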
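Running the error bar listing creates a line plot showing the sample mean for each number of repeats, with error bars showing the confidence interval within which each mean is expected to capture the unknown underlying population mean.

A red line is drawn showing the actual population mean, which is known only because we contrived the model skill scores at the beginning of the tutorial. As a surrogate for the population mean, you could add a line for the final sample mean after 1000 or more repeats.

The error bars obscure the line of the mean scores. We can see that the mean overestimates the population mean, but the 95% confidence interval captures it.

Note that a 95% confidence interval means that, out of 100 sample means with their intervals, about 95 will capture the population mean and about 5 will not.

We can see that the 95% confidence interval tightens as the number of repeats increases and the standard error decreases, but there are perhaps diminishing returns beyond 500 repeats.

![Line Plot of Mean Result with Standard Error Bars and Population Mean](img/142850706a0da76cdf421b69b6f1dad3.jpg)

Line Plot of Mean Result with Standard Error Bars and Population Mean

We can get a clearer idea of what is going on by zooming this graph in, highlighting repeats from 20 to 200.

```py
from pandas import read_csv
from numpy import std
from numpy import mean
from matplotlib import pyplot
from math import sqrt
# load results file
results = read_csv('results.csv', header=None)
values = results.values
# collect cumulative stats for 20 to 200 repeats
means, confidence = list(), list()
n = 200 + 1
for i in range(20,n):
    data = values[0:i, 0]
    mean_rmse = mean(data)
    stderr = std(data) / sqrt(len(data))
    # half-width of the 95% confidence interval
    conf = stderr * 1.96
    means.append(mean_rmse)
    confidence.append(conf)
# line plot of cumulative values with error bars
pyplot.errorbar(range(20, n), means, yerr=confidence)
# true population mean, known only because the data was contrived
pyplot.plot(range(20, n), [60 for x in range(len(means))], color='red')
pyplot.show()
```

In the line plot created, we can clearly see the sample mean and the symmetrical error bars around it. This plot does a better job of showing the bias in the sample mean.

![Zoomed Line Plot of Mean Result with Standard Error Bars and Population Mean](img/1483a0367221f7a13d0d2e05168e021a.jpg)

Zoomed Line Plot of Mean Result with Standard Error Bars and Population Mean

## Further Reading

There are not many resources that link the statistics required with the computational-experimental methodology used with stochastic algorithms.

The best book on the topic that I have found is:

* [Empirical Methods for Artificial Intelligence](http://amzn.to/2mq9THq), Cohen, 1995.

If this post interested you, I highly recommend this book.

[![Amazon Image](img/65e0043914eb14513ee5774aa90daa3d.jpg)](http://www.amazon.com/dp/0262032252?tag=inspiredalgor-20)

Below are some other articles that you may find useful:

* [Standard error](https://en.wikipedia.org/wiki/Standard_error)
* [Confidence interval](https://en.wikipedia.org/wiki/Confidence_interval)
* [68-95-99.7 rule](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule)

Do you know of any other good related material? Let me know in the comments below.

## Summary

In this tutorial, you discovered techniques that you can use to help choose the number of repeats that is appropriate for evaluating stochastic machine learning algorithms.

You discovered a number of methods that you can use immediately:

* A rough guess of 30, 100, or 1000 repeats.
* A plot of the sample mean vs the number of repeats, choosing based on the inflection point.
* A plot of the standard error vs the number of repeats, choosing based on an error threshold.
* A plot of the sample confidence interval vs the number of repeats, choosing based on the spread of error.

Have you used any of these methods in your own experiments? Share your results in the comments; I'd love to hear about them.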
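For the error-threshold approach listed in the summary, the standard error formula from section 4 can be rearranged to estimate the number of repeats directly: n = (standard deviation / target standard error)^2. The sketch below assumes you already have a reasonable estimate of the standard deviation, for example from a pilot batch of runs. The function name `repeats_for_standard_error` is hypothetical.

```python
from math import ceil

def repeats_for_standard_error(sample_stdev, target_std_error):
    """Smallest number of repeats whose standard error is at or below the target."""
    if sample_stdev < 0 or target_std_error <= 0:
        raise ValueError("stdev must be non-negative and the target positive")
    # standard_error = stdev / sqrt(n)  =>  n = (stdev / standard_error) ** 2
    return max(1, ceil((sample_stdev / target_std_error) ** 2))

# The tutorial's 1000 results have a sample standard deviation of about 9.8
for target in (2.0, 1.0, 0.5):
    print(target, repeats_for_standard_error(9.8, target))
```

Because the standard error shrinks with the square root of the number of repeats, halving the acceptable error roughly quadruples the repeats required, which is one way to see the diminishing returns described in section 4.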