
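# How to Model Residual Errors to Correct Time Series Forecasts with Python

> Original: [https://machinelearningmastery.com/model-residual-errors-correct-time-series-forecasts-python/](https://machinelearningmastery.com/model-residual-errors-correct-time-series-forecasts-python/)

The residual errors from forecasts on a time series provide another source of information that we can model. The residual errors themselves form a time series that can have temporal structure. A simple autoregression model of this structure can be used to predict the forecast error, which in turn can be used to correct forecasts. This type of model is called a moving average model: the same name as moving average smoothing, but a very different thing.

In this tutorial, you will discover how to model a residual error time series and use it to correct predictions with Python.

After completing this tutorial, you will know:

* How to model a residual error time series using an autoregressive model.
* How to develop and evaluate a model of a residual error time series.
* How to use a residual error model to correct predictions and improve forecast skill.

Let's get started.

* **Updated Jan/2017**: Improved some of the code examples to be more complete.

## Model of Residual Errors

The difference between what was expected and what was predicted is called the residual error. It is calculated as:

```py
residual error = expected - predicted
```

Just like the input observations themselves, the residual errors from a time series can have temporal structure such as trends, bias, and seasonality.

Any temporal structure in the time series of residual forecast errors is useful as a diagnostic, because it suggests information that could be incorporated into the predictive model. An ideal model would leave no structure in the residual error, only random fluctuations that cannot be modeled.

Structure in the residual error can also be modeled directly. There may be complex signals in the residual error that are difficult to incorporate directly into the model. Instead, you can create a model of the residual error time series and predict the expected error of your model. The predicted error can then be used to adjust the model prediction, which in turn provides an additional lift in performance.

A simple and effective model of residual error is an autoregression. This is where some number of lagged error values are used to predict the error at the next time step. These lag errors are combined in a linear regression model, much like an autoregression model of the direct time series observations.

An autoregression of the residual error time series is called a Moving Average (MA) model. This is confusing because it has nothing to do with the moving average smoothing process. Think of it as the sibling of the autoregressive (AR) process, except that it works on lagged residual errors rather than lagged raw observations.

In this tutorial, we will develop an autoregression model of the residual error time series. Before we dive in, let's look at the univariate dataset for which we will develop the model.

## Daily Female Births Dataset

This dataset describes the number of daily female births in California in 1959. The units are a count and there are 365 observations. The source of the dataset is credited to Newton (1988).

[Download and learn more about the dataset here](https://datamarket.com/data/set/235k/daily-total-female-births-in-california-1959).

Download the dataset and place it in your current working directory with the filename "_daily-total-female-births.csv_".

Below is an example of loading the Daily Female Births dataset from CSV.

```py
from pandas import read_csv
from matplotlib import pyplot
# load the CSV as a Series indexed by date
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
print(series.head())
# line plot of the observations over time
series.plot()
pyplot.show()
```

Running the example prints the first 5 rows of the loaded file.

```py
Date
1959-01-01    35
1959-01-02    32
1959-01-03    30
1959-01-04    31
1959-01-05    44
Name: Births, dtype: int64
```

The dataset is also shown in a line plot of observations over time.

![Daily Total Female Births Plot](https://img.kancloud.cn/4f/7d/4f7d7462b504c7b52081b73827b67cb5_800x600.jpg)

Daily Total Female Births Plot

We can see that there is no obvious trend or seasonality. The dataset looks stationary, which is an expectation of using an autoregression model.

## Persistence Forecast Model

The simplest forecast we can make is to predict that what happened in the previous time step will be the same as what happens in the next time step. This is called the "naive forecast" or the persistence forecast model.

This model will provide the predictions from which we can calculate the residual error time series. Alternatively, we could develop an autoregression model of the time series and use that as our model. For brevity, we will not develop an autoregression model of the observations here and will focus on the model of the residual error.

We can implement the persistence model in Python.

After the dataset is loaded, it is framed as a supervised learning problem. A lagged version of the dataset is created where the prior time step (t-1) is used as the input variable and the next time step (t+1) is taken as the output variable.

```py
# create lagged dataset
values = DataFrame(series.values)
dataframe = concat([values.shift(1), values], axis=1)
dataframe.columns = ['t-1', 't+1']
```

Next, the dataset is split into training and test sets. A total of 66% of the data is kept for training and the remaining 34% is held back for the test set. No training is required for the persistence model; this is just a standard test harness approach.

Once split, the train and test sets are separated into their input and output components. The first row is skipped because its lagged input is undefined (NaN).

```py
# split into train and test sets
X = dataframe.values
train_size = int(len(X) * 0.66)
train, test = X[1:train_size], X[train_size:]
train_X, train_y = train[:,0], train[:,1]
test_X, test_y = test[:,0], test[:,1]
```

The persistence model is applied by predicting the output value (_y_) as a copy of the input value (_x_).

```py
# persistence model
predictions = [x for x in test_X]
```

The residual errors are then calculated as the difference between the expected outcome (_test_y_) and the prediction (_predictions_).

```py
# calculate residuals
residuals = [test_y[i]-predictions[i] for i in range(len(predictions))]
```

The example below puts all of this together and gives us a set of residual forecast errors that we can explore in this tutorial.

```py
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
# create lagged dataset
values = DataFrame(series.values)
dataframe = concat([values.shift(1), values], axis=1)
dataframe.columns = ['t-1', 't+1']
# split into train and test sets
X = dataframe.values
train_size = int(len(X) * 0.66)
train, test = X[1:train_size], X[train_size:]
train_X, train_y = train[:,0], train[:,1]
test_X, test_y = test[:,0], test[:,1]
# persistence model
predictions = [x for x in test_X]
# calculate residuals
residuals = [test_y[i]-predictions[i] for i in range(len(predictions))]
residuals = DataFrame(residuals)
print(residuals.head())
```

The example then prints the first 5 rows of the forecast residual errors.

```py
      0
0   9.0
1 -10.0
2   3.0
3  -6.0
4  30.0
```

We now have a residual error time series that we can model.

## Autoregression of Residual Error

We can model the residual error time series using an autoregression model. This is a linear regression model that creates a weighted linear sum of lagged residual error terms. For example:

```py
error(t+1) = b0 + b1*error(t-1) + b2*error(t-2) ...+ bn*error(t-n)
```

We can use the autoregression model (AR) provided by the [statsmodels library](http://statsmodels.sourceforge.net/).

Building on the persistence model in the previous section, we can first train the model on the residual errors calculated on the training dataset. This requires that we make persistence predictions for each observation in the training dataset, then create the AR model, as follows.

```py
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
# note: AR was deprecated and dropped in recent statsmodels releases;
# AutoReg with an explicit lags argument is the usual replacement there
from statsmodels.tsa.ar_model import AR
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
# create lagged dataset
values = DataFrame(series.values)
dataframe = concat([values.shift(1), values], axis=1)
dataframe.columns = ['t-1', 't+1']
# split into train and test sets
X = dataframe.values
train_size = int(len(X) * 0.66)
train, test = X[1:train_size], X[train_size:]
train_X, train_y = train[:,0], train[:,1]
test_X, test_y = test[:,0], test[:,1]
# persistence model on training set
train_pred = [x for x in train_X]
# calculate residuals
train_resid = [train_y[i]-train_pred[i] for i in range(len(train_pred))]
# model the training set residuals
model = AR(train_resid)
model_fit = model.fit()
# number of lags chosen by the fit, and the fitted coefficients
window = model_fit.k_ar
coef = model_fit.params
print('Lag=%d, Coef=%s' % (window, coef))
```

Running this snippet prints the lag of 15 chosen by the trained linear regression model and its 16 coefficients (an intercept and one coefficient for each lag).

```py
Lag=15, Coef=[ 0.10120699 -0.84940615 -0.77783609 -0.73345006 -0.68902061 -0.59270551
 -0.5376728  -0.42553356 -0.24861246 -0.19972102 -0.15954013 -0.11045476
 -0.14045572 -0.13299964 -0.12515801 -0.03615774]
```

Next, we can step through the test dataset and, for each time step, we must:

1. Calculate the persistence prediction (t+1 = t-1).
2. Predict the residual error using the autoregression model.

The autoregression model requires the residual errors of the 15 previous time steps. Therefore, we must keep these values handy.

As we step through the test dataset time step by time step, making predictions and estimating the error, we can calculate the actual residual error and update the lagged values of the residual error time series (the history) so that we can calculate the error at the next time step.

This is a walk-forward forecast, or a rolling forecast, model.

We end up with a time series of the residual forecast errors from the training dataset and the predicted residual errors on the test dataset. We can plot these and get a quick idea of how skillful the model is at predicting the residual error.

The complete example is listed below.

```py
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from statsmodels.tsa.ar_model import AR
from matplotlib import pyplot
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
# create lagged dataset
values = DataFrame(series.values)
dataframe = concat([values.shift(1), values], axis=1)
dataframe.columns = ['t-1', 't+1']
# split into train and test sets
X = dataframe.values
train_size = int(len(X) * 0.66)
train, test = X[1:train_size], X[train_size:]
train_X, train_y = train[:,0], train[:,1]
test_X, test_y = test[:,0], test[:,1]
# persistence model on training set
train_pred = [x for x in train_X]
# calculate residuals
train_resid = [train_y[i]-train_pred[i] for i in range(len(train_pred))]
# model the training set residuals
model = AR(train_resid)
model_fit = model.fit()
window = model_fit.k_ar
coef = model_fit.params
# walk forward over time steps in test
# seed the history with the last `window` training residuals
history = train_resid[len(train_resid)-window:]
history = [history[i] for i in range(len(history))]
predictions = list()
expected_error = list()
for t in range(len(test_y)):
    # persistence
    yhat = test_X[t]
    error = test_y[t] - yhat
    expected_error.append(error)
    # predict error from the most recent `window` residuals
    length = len(history)
    lag = [history[i] for i in range(length-window,length)]
    pred_error = coef[0]
    for d in range(window):
        pred_error += coef[d+1] * lag[window-d-1]
    predictions.append(pred_error)
    # the actual residual becomes available for the next step
    history.append(error)
    print('predicted error=%f, expected error=%f' % (pred_error, error))
# plot predicted error
pyplot.plot(expected_error)
pyplot.plot(predictions, color='red')
pyplot.show()
```

Running the example first prints the predicted and expected residual error for each time step in the test dataset.

```py
...
predicted error=-1.951332, expected error=-10.000000
predicted error=6.675538, expected error=3.000000
predicted error=3.419129, expected error=15.000000
predicted error=-7.160046, expected error=-4.000000
predicted error=-4.179003, expected error=7.000000
predicted error=-10.425124, expected error=-5.000000
```

Next, the actual residual error for the time series is plotted (blue) compared to the predicted residual error (red).

![Prediction of Residual Error Time Series](https://img.kancloud.cn/0d/36/0d364ede4a7f5626f5e97c605a854c13_800x600.jpg)

Prediction of Residual Error Time Series

Now that we know how to model residual error, next we will look at how to correct forecasts and improve model skill.

## Correct Predictions with a Model of Residual Errors

A model of forecast residual error is interesting, but it can also be used to make better predictions.

With a good estimate of the forecast error at a time step, we can make better predictions. For example, we can add the expected forecast error to a prediction to correct it and, in turn, improve the skill of the model.

```py
improved forecast = forecast + estimated error
```

Let's make this concrete with an example.

Suppose the expected value for a time step is 10. The model predicts 8 and estimates the error to be 3. The improved forecast would be:

```py
improved forecast = forecast + estimated error
improved forecast = 8 + 3
improved forecast = 11
```

This reduces the actual forecast error from 2 units to 1 unit.

We can update the example from the previous section to add the estimated forecast error to the persistence forecast as follows:

```py
# correct the prediction
yhat = yhat + pred_error
```

The complete example is listed below.

```py
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from statsmodels.tsa.ar_model import AR
from matplotlib import pyplot
from sklearn.metrics import mean_squared_error
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
# create lagged dataset
values = DataFrame(series.values)
dataframe = concat([values.shift(1), values], axis=1)
dataframe.columns = ['t-1', 't+1']
# split into train and test sets
X = dataframe.values
train_size = int(len(X) * 0.66)
train, test = X[1:train_size], X[train_size:]
train_X, train_y = train[:,0], train[:,1]
test_X, test_y = test[:,0], test[:,1]
# persistence model on training set
train_pred = [x for x in train_X]
# calculate residuals
train_resid = [train_y[i]-train_pred[i] for i in range(len(train_pred))]
# model the training set residuals
model = AR(train_resid)
model_fit = model.fit()
window = model_fit.k_ar
coef = model_fit.params
# walk forward over time steps in test
history = train_resid[len(train_resid)-window:]
history = [history[i] for i in range(len(history))]
predictions = list()
for t in range(len(test_y)):
    # persistence
    yhat = test_X[t]
    error = test_y[t] - yhat
    # predict error
    length = len(history)
    lag = [history[i] for i in range(length-window,length)]
    pred_error = coef[0]
    for d in range(window):
        pred_error += coef[d+1] * lag[window-d-1]
    # correct the prediction
    yhat = yhat + pred_error
    predictions.append(yhat)
    history.append(error)
    print('predicted=%f, expected=%f' % (yhat, test_y[t]))
# error
mse = mean_squared_error(test_y, predictions)
print('Test MSE: %.3f' % mse)
# plot expected values against corrected predictions
pyplot.plot(test_y)
pyplot.plot(predictions, color='red')
pyplot.show()
```

Running the example prints the predicted and expected outcome for each time step in the test dataset.

The mean squared error of the corrected forecasts is calculated to be 56.234, which is much better than the score of 83.744 for the persistence model alone.

```py
...
predicted=40.675538, expected=37.000000
predicted=40.419129, expected=52.000000
predicted=44.839954, expected=48.000000
predicted=43.820997, expected=55.000000
predicted=44.574876, expected=50.000000
Test MSE: 56.234
```

Finally, the expected values for the test dataset are plotted (blue) compared to the corrected forecast (red).

We can see that the persistence model has been aggressively corrected back to a time series that looks something like a moving average.

![Corrected Persistence Forecast for Daily Female Births](https://img.kancloud.cn/04/7d/047d3c95900595132e8cae016217ee52_800x600.jpg)

Corrected Persistence Forecast for Daily Female Births

## Summary

In this tutorial, you discovered how to model a residual error time series and use it to correct predictions with Python.

Specifically, you learned:

* About the Moving Average (MA) approach of developing an autoregressive model of residual error.
* How to develop and evaluate a model of residual error to predict forecast error.
* How to use the predictions of forecast error to correct predictions and improve model skill.

Do you have any questions about moving average models or about this tutorial? Ask your questions in the comments below and I will do my best to answer.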
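
As a compact recap of the whole procedure, the sketch below wraps the walk-forward correction in two small functions. It is only an illustration under stated assumptions: the names `fit_ar` and `corrected_persistence` are hypothetical, the autoregression is fitted with ordinary least squares via `numpy.linalg.lstsq` rather than with the statsmodels model used in the tutorial, and a synthetic noisy series stands in for the births CSV, so its numbers will not match the scores reported above.

```python
import numpy as np


def fit_ar(resid, lags):
    """Least-squares fit of error(t) = b0 + b1*error(t-1) + ... + bn*error(t-n).

    Hypothetical helper: a plain OLS stand-in for the statsmodels AR model.
    Returns the coefficients as [b0, b1, ..., bn].
    """
    resid = np.asarray(resid, dtype=float)
    # each row holds the `lags` previous residuals, most recent first
    rows = [resid[t - lags:t][::-1] for t in range(lags, len(resid))]
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])
    y = resid[lags:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def corrected_persistence(values, lags=15, train_frac=0.66):
    """Walk-forward persistence forecast corrected by an AR model of its residuals.

    Returns (expected, persistence forecasts, corrected forecasts) on the test part.
    """
    values = np.asarray(values, dtype=float)
    split = int(len(values) * train_frac)
    # persistence residual at time t is values[t] - values[t-1]
    train_resid = np.diff(values[:split])
    coef = fit_ar(train_resid, lags)
    history = list(train_resid)
    expected, naive, corrected = [], [], []
    for t in range(split, len(values)):
        yhat = values[t - 1]                       # persistence forecast
        lag = np.array(history[-lags:])[::-1]      # most recent residual first
        pred_error = coef[0] + coef[1:] @ lag      # predicted residual
        expected.append(values[t])
        naive.append(yhat)
        corrected.append(yhat + pred_error)        # corrected forecast
        history.append(values[t] - yhat)           # actual residual, now known
    return np.array(expected), np.array(naive), np.array(corrected)


if __name__ == "__main__":
    # synthetic stand-in for the dataset (an assumption, not the real data)
    rng = np.random.default_rng(0)
    series = 40 + rng.normal(0, 7, 365)
    y, naive, corrected = corrected_persistence(series)
    print('Persistence MSE: %.3f' % np.mean((y - naive) ** 2))
    print('Corrected MSE:   %.3f' % np.mean((y - corrected) ** 2))
```

To try it on the real data, pass `series.values` from the loaded births dataset to `corrected_persistence`. The result may differ slightly from the statsmodels-based listing because the fitting routine is not the same.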