
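# How to Develop an Autoregression Forecast Model for Household Electricity Consumption

> Original: [https://machinelearningmastery.com/how-to-develop-an-autoregression-forecast-model-for-household-electricity-consumption/](https://machinelearningmastery.com/how-to-develop-an-autoregression-forecast-model-for-household-electricity-consumption/)

Given the rise of smart electricity meters and the wide adoption of electricity generation technology such as solar panels, there is a wealth of electricity usage data available.

This data represents a multivariate time series of power-related variables that can in turn be used to model and even forecast future electricity consumption.

Autoregression models are very simple and can provide a fast and effective way to make skillful one-step and multi-step forecasts of electricity consumption.

In this tutorial, you will discover how to develop and evaluate an autoregression model for multi-step forecasting of household power consumption.

After completing this tutorial, you will know:

* How to create and analyze autocorrelation and partial autocorrelation plots for univariate time series data.
* How to use the findings from autocorrelation plots to configure an autoregression model.
* How to develop and evaluate an autoregression model used to make one-week forecasts.

Let's get started.

![How to Develop an Autoregression Forecast Model for Household Electricity Consumption](https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2018/10/How-to-Develop-an-Autoregression-Forecast-Model-for-Household-Electricity-Consumption.jpg)

How to Develop an Autoregression Forecast Model for Household Electricity Consumption. Photo by [wongaboo](https://www.flickr.com/photos/27146806@N00/22122826108/), some rights reserved.

## Tutorial Overview

This tutorial is divided into five parts; they are:

1. Problem Description
2. Load and Prepare Dataset
3. Model Evaluation
4. Autocorrelation Analysis
5. Develop an Autoregression Model

## Problem Description

The '[Household Power Consumption](https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption)' dataset is a multivariate time series dataset that describes the electricity consumption of a single household over four years.

The data was collected between December 2006 and November 2010, and observations of power consumption within the household were collected every minute.

It is a multivariate series comprised of seven variables (besides the date and time); they are:

* **global_active_power**: The total active power consumed by the household (kilowatts).
* **global_reactive_power**: The total reactive power consumed by the household (kilowatts).
* **voltage**: Average voltage (volts).
* **global_intensity**: Average current intensity (amps).
* **sub_metering_1**: Active energy for the kitchen (watt-hours of active energy).
* **sub_metering_2**: Active energy for the laundry (watt-hours of active energy).
* **sub_metering_3**: Active energy for climate control systems (watt-hours of active energy).

Active and reactive energy refer to the technical details of [alternating current](https://en.wikipedia.org/wiki/AC_power).

A fourth sub-metering variable can be created by subtracting the sum of the three defined sub-metering variables from the total active energy. The factor `1000 / 60` converts one minute of average power in kilowatts into watt-hours, the unit of the sub-metering columns:

```py
sub_metering_remainder = (global_active_power * 1000 / 60) - (sub_metering_1 + sub_metering_2 + sub_metering_3)
```

## Load and Prepare Dataset

The dataset can be downloaded from the UCI Machine Learning Repository as a single 20-megabyte .zip file:

* [household_power_consumption.zip](https://archive.ics.uci.edu/ml/machine-learning-databases/00235/household_power_consumption.zip)

Download the dataset and unzip it into your current working directory. You will now have the file "_household_power_consumption.txt_", which is about 127 megabytes in size and contains all of the observations.

We can use the _read_csv()_ function to load the data and combine the first two columns into a single date-time column that we can use as an index.

```py
# load all data
dataset = read_csv('household_power_consumption.txt', sep=';', header=0, low_memory=False, infer_datetime_format=True, parse_dates={'datetime':[0,1]}, index_col=['datetime'])
```

Next, we can mark all [missing values](https://machinelearningmastery.com/handle-missing-timesteps-sequence-prediction-problems-python/), indicated with a '_?_' character, with a _NaN_ value, which is a float.

This will allow us to work with the data as one array of floating point values rather than mixed types, which is less efficient.

```py
# mark all missing values
dataset.replace('?', nan, inplace=True)
# make dataset numeric
dataset = dataset.astype('float32')
```

We also need to fill in the missing values now that they have been marked.

A very simple approach is to copy the observation from the same time on the previous day. We can implement this in a function named _fill_missing()_ that takes the NumPy array of the data and copies values from exactly 24 hours earlier.

```py
# fill missing values with a value at the same time one day ago
def fill_missing(values):
    one_day = 60 * 24
    for row in range(values.shape[0]):
        for col in range(values.shape[1]):
            if isnan(values[row, col]):
                values[row, col] = values[row - one_day, col]
```

We can apply this function directly to the data within the DataFrame.

```py
# fill missing
fill_missing(dataset.values)
```

Now we can create a new column that contains the remainder of the sub-metering, using the calculation from the previous section.

```py
# add a column for the remainder of sub metering
values = dataset.values
dataset['sub_metering_4'] = (values[:,0] * 1000 / 60) - (values[:,4] + values[:,5] + values[:,6])
```

We can now save the cleaned-up version of the dataset to a new file; in this case we will just change the file extension to .csv and save the dataset as "_household_power_consumption.csv_".

```py
# save updated dataset
dataset.to_csv('household_power_consumption.csv')
```

Tying all of this together, the complete example of loading, cleaning up, and saving the dataset is listed below.

```py
# load and clean-up data
from numpy import nan
from numpy import isnan
from pandas import read_csv

# fill missing values with a value at the same time one day ago
def fill_missing(values):
    one_day = 60 * 24
    for row in range(values.shape[0]):
        for col in range(values.shape[1]):
            if isnan(values[row, col]):
                values[row, col] = values[row - one_day, col]

# load all data
dataset = read_csv('household_power_consumption.txt', sep=';', header=0, low_memory=False, infer_datetime_format=True, parse_dates={'datetime':[0,1]},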
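    index_col=['datetime'])
# mark all missing values
dataset.replace('?', nan, inplace=True)
# make dataset numeric
dataset = dataset.astype('float32')
# fill missing
fill_missing(dataset.values)
# add a column for the remainder of sub metering
values = dataset.values
dataset['sub_metering_4'] = (values[:,0] * 1000 / 60) - (values[:,4] + values[:,5] + values[:,6])
# save updated dataset
dataset.to_csv('household_power_consumption.csv')
```

Running the example creates the new file '_household_power_consumption.csv_' that we can use as the starting point for our modeling project.

## Model Evaluation

In this section, we will consider how we can develop and evaluate predictive models for the household power dataset.

This section is divided into four parts; they are:

1. Problem Framing
2. Evaluation Metric
3. Train and Test Sets
4. Walk-Forward Validation

### Problem Framing

There are many ways to harness and explore the household power consumption dataset.

In this tutorial, we will use the data to explore a very specific question; that is:

> Given recent power consumption, what is the expected power consumption for the week ahead?

This requires that a predictive model forecast the total active power for each day over the next seven days.

Technically, this framing of the problem is referred to as a multi-step time series forecasting problem, given the multiple forecast steps. A model that makes use of multiple input variables may be referred to as a multivariate multi-step time series forecasting model.

A model of this type could be helpful within the household for planning expenditures. It could also be helpful on the supply side for planning electricity demand for a specific household.

This framing of the dataset also suggests that it would be useful to downsample the per-minute observations of power consumption to daily totals. This is not required, but makes sense given that we are interested in total power per day.

We can achieve this easily using the [resample() function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html) on the pandas DataFrame. Calling this function with the argument '_D_' allows the loaded data, indexed by date-time, to be grouped by day ([see all offset aliases](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases)). We can then calculate the sum of all observations for each day and create a new dataset of daily power consumption data for each of the eight variables.

The complete example is listed below.

```py
# resample minute data to total for each day
from pandas import read_csv
# load the new file
dataset = read_csv('household_power_consumption.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
# resample data to daily
daily_groups = dataset.resample('D')
daily_data = daily_groups.sum()
# summarize
print(daily_data.shape)
print(daily_data.head())
# save
daily_data.to_csv('household_power_consumption_days.csv')
```

Running the example creates a new daily total power consumption dataset and saves the result into a separate file named "_household_power_consumption_days.csv_".

We can use this as the dataset for fitting and evaluating predictive models for the chosen framing of the problem.

### Evaluation Metric

A forecast will be comprised of seven values, one for each day of the week ahead.

It is common with multi-step forecasting problems to evaluate each forecasted time step separately. This is helpful for a few reasons:

* To comment on the skill at a specific lead time (e.g. +1 day vs +3 days).
* To contrast models based on their skill at different lead times (e.g. models good at +1 day vs models good at +5 days).

The units of the total power are kilowatts, and it would be useful to have an error metric that is also in the same units. Both Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) fit this bill, although RMSE is more commonly used and will be adopted in this tutorial. Unlike MAE, RMSE punishes large forecast errors more heavily.

The performance metric for this problem will be the RMSE for each lead time from day 1 to day 7.

As a shortcut, it may be useful to summarize the performance of a model using a single score in order to aid in model selection.

One possible score that could be used is the RMSE across all forecast days.

The function _evaluate_forecasts()_ below implements this behavior and returns the performance of a model based on multiple seven-day forecasts.

```py
# evaluate one or more weekly forecasts against expected values
def evaluate_forecasts(actual, predicted):
    scores = list()
    # calculate an RMSE score for each day
    for i in range(actual.shape[1]):
        # calculate mse
        mse = mean_squared_error(actual[:, i], predicted[:, i])
        # calculate rmse
        rmse = sqrt(mse)
        # store
        scores.append(rmse)
    # calculate overall RMSE
    s = 0
    for row in range(actual.shape[0]):
        for col in range(actual.shape[1]):
            s += (actual[row, col] - predicted[row, col])**2
    score = sqrt(s / (actual.shape[0] * actual.shape[1]))
    return score, scores
```

Running the function first returns the overall RMSE regardless of day, then a list of RMSE scores, one for each day.

### Train and Test Sets

We will use the first three years of data for training predictive models and the final year for evaluating models.

The data in a given dataset will be divided into standard weeks. These are weeks that begin on a Sunday and end on a Saturday.

This is a realistic and useful way of using the chosen framing of the model, where the power consumption for the week ahead can be predicted. It is also helpful with modeling, where models can be used to predict a specific day (e.g. Wednesday) or the entire sequence.

We will split the data into standard weeks, working backwards from the test dataset.

The final year of the data is 2010, and the first Sunday of 2010 was January 3rd. The data ends in mid-November 2010, and the closest final Saturday in the data is November 20th. This gives 46 weeks of test data.

The first and last rows of daily data for the test dataset are provided below for confirmation.

```py
2010-01-03,2083.4539999999984,191.61000000000055,350992.12000000034,8703.600000000033,3842.0,4920.0,10074.0,15888.233355799992
...
2010-11-20,2197.006000000004,153.76800000000028,346475.9999999998,9320.20000000002,4367.0,2947.0,11433.0,17869.76663959999
```

The daily data starts in late 2006.

The first Sunday in the dataset is December 17th, which is the second row of data.

Organizing the data into standard weeks gives 159 full standard weeks for training a predictive model.

```py
2006-12-17,3390.46,226.0059999999994,345725.32000000024,14398.59999999998,2033.0,4187.0,13341.0,36946.66673200004
...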
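2010-01-02,1309.2679999999998,199.54600000000016,352332.8399999997,5489.7999999999865,801.0,298.0,6425.0,14297.133406600002
```

The function _split_dataset()_ below splits the daily data into train and test sets and organizes each into standard weeks.

Specific row offsets, chosen using knowledge of the dataset, are used to split the data. The split datasets are then organized into weekly data using the NumPy [split() function](https://docs.scipy.org/doc/numpy/reference/generated/numpy.split.html).

```py
# split a univariate dataset into train/test sets
def split_dataset(data):
    # split into standard weeks
    train, test = data[1:-328], data[-328:-6]
    # restructure into windows of weekly data (integer division: split() needs a whole number of sections)
    train = array(split(train, len(train) // 7))
    test = array(split(test, len(test) // 7))
    return train, test
```

We can test this function by loading the daily dataset and printing the first and last rows of data from both the train and test sets to confirm they match the expectations above.

The complete code example is listed below.

```py
# split into standard weeks
from numpy import split
from numpy import array
from pandas import read_csv

# split a univariate dataset into train/test sets
def split_dataset(data):
    # split into standard weeks
    train, test = data[1:-328], data[-328:-6]
    # restructure into windows of weekly data (integer division: split() needs a whole number of sections)
    train = array(split(train, len(train) // 7))
    test = array(split(test, len(test) // 7))
    return train, test

# load the new file
dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
train, test = split_dataset(dataset.values)
# validate train data
print(train.shape)
print(train[0, 0, 0], train[-1, -1, 0])
# validate test
print(test.shape)
print(test[0, 0, 0], test[-1, -1, 0])
```

Running the example shows that the train dataset does indeed have 159 weeks of data, whereas the test dataset has 46 weeks.

We can see that the total active power for the first and last rows of the train and test datasets matches the data for the specific dates that we defined as the bounds of the standard weeks for each set.

```py
(159, 7, 8)
3390.46 1309.2679999999998
(46, 7, 8)
2083.4539999999984 2197.006000000004
```

### Walk-Forward Validation

Models will be evaluated using a scheme called [walk-forward validation](https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/).

Here a model is required to make a one-week prediction; the actual data for that week is then made available to the model so that it can be used as the basis for making a prediction for the subsequent week. This is both realistic for how the model may be used in practice and beneficial to the model, allowing it to make use of the best available data.

We can demonstrate this below by separating the input data from the output (predicted) data.

```py
Input,                      Predict
[Week1]                     Week2
[Week1 + Week2]             Week3
[Week1 + Week2 + Week3]     Week4
...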
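```

The walk-forward validation approach to evaluating predictive models on this dataset is implemented below in a function named _evaluate_model()_.

The name of a function is provided for the model as the argument "_model_func_". This function is responsible for defining the model, fitting the model on the training data, and making a one-week forecast.

The forecasts made by the model are then evaluated against the test dataset using the previously defined _evaluate_forecasts()_ function.

```py
# evaluate a single model
def evaluate_model(model_func, train, test):
    # history is a list of weekly data
    history = [x for x in train]
    # walk-forward validation over each week
    predictions = list()
    for i in range(len(test)):
        # predict the week
        yhat_sequence = model_func(history)
        # store the predictions
        predictions.append(yhat_sequence)
        # get real observation and add to history for predicting the next week
        history.append(test[i, :])
    predictions = array(predictions)
    # evaluate predictions days for each week
    score, scores = evaluate_forecasts(test[:, :, 0], predictions)
    return score, scores
```

Once we have the evaluation for a model, we can summarize the performance.

The function below, named _summarize_scores()_, displays the performance of a model on a single line for easy comparison with other models.

```py
# summarize scores
def summarize_scores(name, score, scores):
    s_scores = ', '.join(['%.1f' % s for s in scores])
    print('%s: [%.3f] %s' % (name, score, s_scores))
```

We now have all of the elements needed to begin evaluating predictive models on the dataset.

## Autocorrelation Analysis

Statistical correlation summarizes the strength of the relationship between two variables.

We can assume that the distribution of each variable fits a [Gaussian](https://machinelearningmastery.com/statistical-data-distributions/) (bell curve) distribution. If this is the case, we can use the Pearson correlation coefficient to summarize the correlation between the variables.

The Pearson correlation coefficient is a number between -1 and 1 that describes a negative or positive correlation respectively. A value of zero indicates no correlation.

We can calculate the correlation between time series observations and observations at previous time steps, called lags. Because the correlation is calculated against values of the same series at previous times, this is called serial correlation, or autocorrelation.

A plot of the autocorrelation of a time series by lag is called the [AutoCorrelation Function](https://machinelearningmastery.com/gentle-introduction-autocorrelation-partial-autocorrelation/), or the acronym ACF. This plot is sometimes called a [correlogram](https://en.wikipedia.org/wiki/Correlogram), or an autocorrelation plot.

A partial autocorrelation function, or PACF, is a summary of the relationship between an observation in a time series and observations at prior time steps, with the relationships of the intervening observations removed.

The autocorrelation between an observation and an observation at a prior time step is comprised of both the direct correlation and indirect correlations. These indirect correlations are a linear function of the correlation of the observation with observations at the intervening time steps.

It is these indirect correlations that the partial autocorrelation function seeks to remove. Without going into the math, this is the intuition for the partial autocorrelation.

We can calculate autocorrelation and partial autocorrelation plots using the [plot_acf()](http://www.statsmodels.org/dev/generated/statsmodels.graphics.tsaplots.plot_acf.html) and [plot_pacf()](http://www.statsmodels.org/dev/generated/statsmodels.graphics.tsaplots.plot_pacf.html) statsmodels functions respectively.

In order to calculate and plot the autocorrelation, we must convert the data into a univariate time series: specifically, the observed total power consumed each day.

The _to_series()_ function below takes the multivariate data divided into weekly windows and returns a single univariate time series.

```py
# convert windows of weekly multivariate data into a series of total power
def to_series(data):
    # extract just the total power from each week
    series = [week[:, 0] for week in data]
    # flatten into a single series
    series = array(series).flatten()
    return series
```

We can call this function for the prepared training dataset.

First, the daily power consumption dataset must be loaded.

```py
# load the new file
dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
```

The dataset must then be split into train and test sets with the standard-week window structure.

```py
# split into train and test
train, test = split_dataset(dataset.values)
```

A univariate time series of daily power consumption can then be extracted from the training dataset.

```py
# convert training data into a series
series = to_series(train)
```

We can then create a single figure that contains both an ACF and a PACF plot. The number of lag time steps can be specified. We will fix this to one year of daily observations, or 365 days.

```py
# plots
pyplot.figure()
lags = 365
# acf
axis = pyplot.subplot(2, 1, 1)
plot_acf(series, ax=axis, lags=lags)
# pacf
axis = pyplot.subplot(2, 1, 2)
plot_pacf(series, ax=axis, lags=lags)
# show plot
pyplot.show()
```

The complete example is listed below.

We would expect that the power consumed tomorrow and in the coming week depends on the power consumed in the prior days. As such, we would expect to see a strong autocorrelation signal in the ACF and PACF plots.

```py
# acf and pacf plots of total power
from numpy import split
from numpy import array
from pandas import read_csv
from matplotlib import pyplot
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf

# split a univariate dataset into train/test sets
def split_dataset(data):
    # split into standard weeks
    train, test = data[1:-328], data[-328:-6]
    # restructure into windows of weekly data (integer division: split() needs a whole number of sections)
    train = array(split(train, len(train) // 7))
    test = array(split(test, len(test) // 7))
    return train, test

# convert windows of weekly multivariate data into a series of total power
def to_series(data):
    # extract just the total power from each week
    series = [week[:, 0] for week in data]
    # flatten into a single series
    series = array(series).flatten()
    return series

# load the new file
dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True,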
    parse_dates=['datetime'], index_col=['datetime'])
# split into train and test
train, test = split_dataset(dataset.values)
# convert training data into a series
series = to_series(train)
# plots
pyplot.figure()
lags = 365
# acf
axis = pyplot.subplot(2, 1, 1)
plot_acf(series, ax=axis, lags=lags)
# pacf
axis = pyplot.subplot(2, 1, 2)
plot_pacf(series, ax=axis, lags=lags)
# show plot
pyplot.show()
```

Running the example creates a single figure with both ACF and PACF plots.

The plots are very dense and hard to read. Nevertheless, we might be able to see a familiar autoregression pattern.

We might also see some significant lag observations at one year out. Further investigation may suggest a seasonal autocorrelation component, which would not be a surprising finding.

![ACF and PACF plots for the univariate series of power consumption](https://img.kancloud.cn/a9/74/a9740e50f46a592efa3fa523761db50e_1280x960.jpg)

ACF and PACF plots for the univariate series of power consumption

We can zoom in on the plot by changing the number of lag observations from 365 to 50.

```py
lags = 50
```

Re-running the code example with this change results in a zoomed-in version of the plots with much less clutter.

We can clearly see a familiar autoregression pattern across the two plots. This pattern is comprised of two elements:

* **ACF**: A large number of significant lag observations that slowly degrade as the lag increases.
* **PACF**: A few significant lag observations that drop off abruptly as the lag increases.

The ACF plot indicates that there is a strong autocorrelation component, whereas the PACF plot indicates that this component is distinct for roughly the first seven lag observations.

This suggests that a good starting model would be an AR(7); that is, an autoregression model that takes seven lag observations as input.

![Zoomed in ACF and PACF plots for the univariate series of power consumption](https://img.kancloud.cn/0f/17/0f171a0d862237876ad4268d20fbec80_1280x960.jpg)

Zoomed-in ACF and PACF plots for the univariate series of power consumption

## Develop an Autoregression Model

We can develop an autoregression model for the univariate series of daily power consumption.

The Statsmodels library provides multiple ways of developing an AR model, such as using the AR, ARMA, ARIMA, and SARIMAX classes.

We will use the [ARIMA implementation](http://www.statsmodels.org/dev/generated/statsmodels.tsa.arima_model.ARIMA.html), as it allows for easy extension to differencing and moving average terms.

First, the history data, comprised of weeks of prior observations, must be converted into a univariate time series of daily power consumption. We can use the _to_series()_ function developed in the previous section.

```py
# convert history into a univariate series
series = to_series(history)
```

Next, an ARIMA model can be defined by passing arguments to the constructor of the ARIMA class.

We will specify an AR(7) model, which in ARIMA notation is ARIMA(7,0,0).

```py
# define the model
model = ARIMA(series, order=(7,0,0))
```

Next, the model can be fit on the training data. We will use the defaults and disable all debugging information during the fit by setting _disp=False_.

```py
# fit the model
model_fit = model.fit(disp=False)
```

Now that the model has been fit, we can make a prediction.

A prediction can be made by calling the _predict()_ function and passing it either an interval of dates or indices relative to the training data. We will use indices, starting with the first time step beyond the training data and extending six more days, giving a total forecast period of seven days beyond the training dataset.

```py
# make forecast
yhat = model_fit.predict(len(series), len(series)+6)
```

We can wrap all of this up into a function named _arima_forecast()_ that takes the history and returns a one-week forecast.

```py
# arima forecast
def arima_forecast(history):
    # convert history into a univariate series
    series = to_series(history)
    # define the model
    model = ARIMA(series, order=(7,0,0))
    # fit the model
    model_fit = model.fit(disp=False)
    # make forecast
    yhat = model_fit.predict(len(series), len(series)+6)
    return yhat
```

This function can be used directly in the test harness described previously.

The complete example is listed below.

```py
# arima forecast
from math import sqrt
from numpy import split
from numpy import array
from pandas import read_csv
from sklearn.metrics import mean_squared_error
from matplotlib import pyplot
from statsmodels.tsa.arima_model import ARIMA

# split a univariate dataset into train/test sets
def split_dataset(data):
    # split into standard weeks
    train, test = data[1:-328], data[-328:-6]
    # restructure into windows of weekly data (integer division: split() needs a whole number of sections)
    train = array(split(train, len(train) // 7))
    test = array(split(test, len(test) // 7))
    return train, test

# evaluate one or more weekly forecasts against expected values
def evaluate_forecasts(actual, predicted):
    scores = list()
    # calculate an RMSE score for each day
    for i in range(actual.shape[1]):
        # calculate mse
        mse = mean_squared_error(actual[:, i], predicted[:, i])
        # calculate rmse
        rmse = sqrt(mse)
        # store
        scores.append(rmse)
    # calculate overall RMSE
    s = 0
    for row in range(actual.shape[0]):
        for col in range(actual.shape[1]):
            s += (actual[row, col] - predicted[row, col])**2
    score = sqrt(s / (actual.shape[0] * actual.shape[1]))
    return score, scores

# summarize scores
def summarize_scores(name, score, scores):
    s_scores = ', '.join(['%.1f' % s for s in scores])
    print('%s: [%.3f] %s' % (name, score, s_scores))

# evaluate a single model
def evaluate_model(model_func, train, test):
    # history is a list of weekly data
    history = [x for x in train]
    # walk-forward validation over each week
    predictions = list()
    for i in range(len(test)):
        # predict the week
        yhat_sequence = model_func(history)
        # store the predictions
        predictions.append(yhat_sequence)
        # get real observation and
        # add to history for predicting the next week
        history.append(test[i, :])
    predictions = array(predictions)
    # evaluate predictions days for each week
    score, scores = evaluate_forecasts(test[:, :, 0], predictions)
    return score, scores

# convert windows of weekly multivariate data into a series of total power
def to_series(data):
    # extract just the total power from each week
    series = [week[:, 0] for week in data]
    # flatten into a single series
    series = array(series).flatten()
    return series

# arima forecast
def arima_forecast(history):
    # convert history into a univariate series
    series = to_series(history)
    # define the model
    model = ARIMA(series, order=(7,0,0))
    # fit the model
    model_fit = model.fit(disp=False)
    # make forecast
    yhat = model_fit.predict(len(series), len(series)+6)
    return yhat

# load the new file
dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
# split into train and test
train, test = split_dataset(dataset.values)
# define the names and functions for the models we wish to evaluate
models = dict()
models['arima'] = arima_forecast
# evaluate each model
days = ['sun', 'mon', 'tue', 'wed', 'thr', 'fri', 'sat']
for name, func in models.items():
    # evaluate and get scores
    score, scores = evaluate_model(func, train, test)
    # summarize scores
    summarize_scores(name, score, scores)
    # plot scores
    pyplot.plot(days, scores, marker='o', label=name)
# show plot
pyplot.legend()
pyplot.show()
```

Running the example first prints the performance of the AR(7) model on the test dataset.

We can see that the model achieves an overall RMSE of about 381 kilowatts.

This model has skill when compared to naive forecast models, such as a model that forecasts the week ahead using observations from the same time one year ago, which achieved an overall RMSE of about 465 kilowatts.

```py
arima: [381.636] 393.8, 398.9, 357.0, 377.2, 393.9, 306.1, 432.2
```

A line plot of the forecast is also created, showing the RMSE in kilowatts for each of the seven lead times of the forecast.

We can see an interesting pattern.

We might expect that earlier lead times are easier to forecast than later lead times, as the error at each successive lead time compounds.

Instead, we see that Friday (lead time +6) is the easiest to forecast and Saturday (lead time +7) is the most challenging. We can also see that the remaining lead times all have a similar error, in the mid- to high-300 kilowatt range.

![Line plot of ARIMA forecast error for each forecasted lead time](https://img.kancloud.cn/3f/f2/3ff2c894718fc4f21175935b622cc657_1280x960.jpg)

Line plot of ARIMA forecast error for each forecasted lead time

## Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

* **Tune ARIMA**. The parameters of the ARIMA model were not tuned. Explore or search a suite of ARIMA parameters (p, d, q) to see if performance can be further improved.
* **Explore Seasonal AR**. Explore whether the performance of the AR model can be improved by including seasonal autoregression elements. This may require the use of a SARIMA model.
* **Explore Data Preparation**. The model was fit on the raw data directly. Explore whether standardization, normalization, or even power transforms can further improve the skill of the AR model.

If you explore any of these extensions, I'd love to know.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### API

* [pandas.read_csv API](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
* [pandas.DataFrame.resample API](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html)
* [Resample Offset Aliases](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases)
* [sklearn.metrics.mean_squared_error API](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html)
* [numpy.split API](https://docs.scipy.org/doc/numpy/reference/generated/numpy.split.html)
* [statsmodels.graphics.tsaplots.plot_acf API](http://www.statsmodels.org/dev/generated/statsmodels.graphics.tsaplots.plot_acf.html)
* [statsmodels.graphics.tsaplots.plot_pacf API](http://www.statsmodels.org/dev/generated/statsmodels.graphics.tsaplots.plot_pacf.html)
* [statsmodels.tsa.arima_model.ARIMA API](http://www.statsmodels.org/dev/generated/statsmodels.tsa.arima_model.ARIMA.html)

### Articles

* [Individual household electric power consumption Data Set, UCI Machine Learning Repository.](https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption)
* [AC power, Wikipedia.](https://en.wikipedia.org/wiki/AC_power)
* [Correlogram, Wikipedia.](https://en.wikipedia.org/wiki/Correlogram)

## Summary

In this tutorial, you discovered how to develop and evaluate an autoregression model for multi-step forecasting of household power consumption.

Specifically, you learned:

* How to create and analyze autocorrelation and partial autocorrelation plots for univariate time series data.
* How to use the findings from autocorrelation plots to configure an autoregression model.
* How to develop and evaluate an autoregression model used to make one-week forecasts.

Do you have any questions? Ask your questions in the comments below and I will do my best to answer.
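
As a compact recap of the walk-forward evaluation scheme used in this tutorial, the sketch below runs it end to end in pure Python on a tiny synthetic dataset. It is an illustration under stated assumptions, not the tutorial's own code: the helper names `naive_weekly_forecast` and `walk_forward_rmse` are hypothetical, the weekly values are made up, and a simple "repeat last week" forecaster stands in for the ARIMA model so that the example has no third-party dependencies.

```python
# Illustrative sketch: walk-forward validation with per-day and overall RMSE.
# Assumptions: helper names and data below are hypothetical, not from the tutorial.
from math import sqrt


def naive_weekly_forecast(history):
    """Forecast the next 7 days by repeating the most recent observed week."""
    return list(history[-1])


def walk_forward_rmse(model_func, train, test):
    """Return (overall RMSE, list of per-day RMSE) under walk-forward validation."""
    history = [list(week) for week in train]
    predictions = []
    for week in test:
        # forecast the week using only the data seen so far
        predictions.append(model_func(history))
        # the real week becomes available before the next forecast is made
        history.append(list(week))
    n_weeks, n_days = len(test), len(test[0])
    per_day = []
    total_sq = 0.0
    for d in range(n_days):
        # squared error for this lead time, summed over all test weeks
        sq = sum((test[w][d] - predictions[w][d]) ** 2 for w in range(n_weeks))
        per_day.append(sqrt(sq / n_weeks))
        total_sq += sq
    overall = sqrt(total_sq / (n_weeks * n_days))
    return overall, per_day


# synthetic weekly windows of daily totals (made-up values)
train = [[10, 11, 12, 13, 14, 15, 16]]
test = [[11, 12, 13, 14, 15, 16, 17],
        [13, 14, 15, 16, 17, 18, 19]]

score, scores = walk_forward_rmse(naive_weekly_forecast, train, test)
print('naive: [%.3f] %s' % (score, ', '.join('%.3f' % s for s in scores)))
```

In this toy data the first forecast is off by 1 on every day and the second by 2, so each per-day RMSE and the overall RMSE equal sqrt(2.5). Any forecaster that accepts the list of weekly windows and returns seven values could be passed in place of the naive one.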