
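# How to Grid Search SARIMA Model Hyperparameters for Time Series Forecasting in Python

> Original: [https://machinelearningmastery.com/how-to-grid-search-sarima-model-hyperparameters-for-time-series-forecasting-in-python/](https://machinelearningmastery.com/how-to-grid-search-sarima-model-hyperparameters-for-time-series-forecasting-in-python/)

The Seasonal Autoregressive Integrated Moving Average (SARIMA) model is an approach for modeling univariate time series data that may contain trend and seasonal components.

It is an effective approach for time series forecasting, although it requires careful analysis and domain expertise in order to configure the seven or more model hyperparameters.

An alternative approach to configuring the model is to use fast, parallel modern hardware to grid search a suite of hyperparameter configurations and discover what works best. Often, this process reveals non-intuitive model configurations that result in lower forecast error than configurations specified through careful analysis.

In this tutorial, you will discover how to develop a framework for grid searching all of the SARIMA model hyperparameters for univariate time series forecasting.

After completing this tutorial, you will know:

* How to develop a framework for grid searching SARIMA models from scratch using walk-forward validation.
* How to grid search SARIMA model hyperparameters for daily time series data for births.
* How to grid search SARIMA model hyperparameters for monthly time series data for shampoo sales, car sales, and temperature.

Let's get started.

![How to Grid Search SARIMA Model Hyperparameters for Time Series Forecasting in Python](https://img.kancloud.cn/70/3a/703a8d39b3a2b67789f32457a5e6dba1_640x360.jpg)

How to Grid Search SARIMA Model Hyperparameters for Time Series Forecasting in Python
Photo by [Thomas](https://www.flickr.com/photos/photommo/17832992898/), some rights reserved.

## Tutorial Overview

This tutorial is divided into six parts; they are:

1. SARIMA for Time Series Forecasting
2. Develop a Grid Search Framework
3. Case Study 1: No Trend or Seasonality
4. Case Study 2: Trend
5. Case Study 3: Seasonality
6. Case Study 4: Trend and Seasonality

## SARIMA for Time Series Forecasting

Seasonal Autoregressive Integrated Moving Average, SARIMA or Seasonal ARIMA, is an extension of ARIMA that explicitly supports univariate time series data with a seasonal component.

It adds three new hyperparameters to specify the autoregression (AR), differencing (I), and moving average (MA) for the seasonal component of the series, as well as an additional parameter for the period of the seasonality.

> A seasonal ARIMA model is formed by including additional seasonal terms in the ARIMA [...] The seasonal part of the model consists of terms that are very similar to the non-seasonal components of the model, but they involve backshifts of the seasonal period.

- Page 242, [Forecasting: principles and practice](https://amzn.to/2xlJsfV), 2013.

Configuring a SARIMA requires selecting hyperparameters for both the trend and seasonal elements of the series.

There are three trend elements that require configuration. They are the same as for the ARIMA model; specifically:

* **p**: Trend autoregression order.
* **d**: Trend difference order.
* **q**: Trend moving average order.

There are four seasonal elements that are not part of ARIMA and must be configured; they are:

* **P**: Seasonal autoregressive order.
* **D**: Seasonal difference order.
* **Q**: Seasonal moving average order.
* **m**: The number of time steps in a single seasonal period.

Together, the notation for a SARIMA model is specified as:

```py
SARIMA(p,d,q)(P,D,Q)m
```

Through its configuration parameters, the SARIMA model can subsume the ARIMA, ARMA, AR, and MA models.

The trend and seasonal hyperparameters of the model can be configured by analyzing autocorrelation and partial autocorrelation plots, and this can take some expertise.

An alternative approach is to grid search a suite of model configurations and discover which configurations work best for a specific univariate time series.

> Seasonal ARIMA models can potentially have a large number of parameters and combinations of terms. Therefore, it is appropriate to try out a wide range of models when fitting to data and choose a best-fitting model using an appropriate criterion ...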
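- Pages 143-144, [Introductory Time Series with R](https://amzn.to/2smB9LR), 2009.

On modern computers, this approach can be faster than the analysis process, and it can reveal surprising findings that are not obvious and that result in lower forecast error.

## Develop a Grid Search Framework

In this section, we will develop a framework for grid searching SARIMA model hyperparameters for a given univariate time series forecasting problem.

We will use the implementation of [SARIMA](http://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html) provided by the statsmodels library.

This model has hyperparameters that control the nature of the model fit for the series, trend, and seasonality, specifically:

* **order**: A tuple of the p, d, and q parameters for modeling the trend.
* **seasonal_order**: A tuple of the P, D, Q, and m parameters for modeling the seasonality.
* **trend**: A parameter for controlling the deterministic trend model, one of 'n', 'c', 't', or 'ct' for no trend, constant, linear, and constant with linear trend, respectively.

If you know enough about your problem to specify one or more of these parameters, then you should specify them. If not, you can try grid searching these parameters.

We can start by defining a function that fits a model with a given configuration and makes a one-step forecast.

The _sarima_forecast()_ function below implements this behavior.

The function takes an array or list of contiguous prior observations and a list of configuration parameters used to configure the model: two tuples for the trend order and the seasonal order, and a string for the trend parameter.

We also try to make the model robust by relaxing constraints, such as the requirement that the data be stationary and that the MA transform be invertible.

```py
# one-step sarima forecast
def sarima_forecast(history, config):
    order, sorder, trend = config
    # define model
    model = SARIMAX(history, order=order, seasonal_order=sorder, trend=trend, enforce_stationarity=False, enforce_invertibility=False)
    # fit model
    model_fit = model.fit(disp=False)
    # make one step forecast
    yhat = model_fit.predict(len(history), len(history))
    return yhat[0]
```

Next, we need to build up some functions for fitting and evaluating a model repeatedly via walk-forward validation, including splitting a dataset into train and test sets and evaluating one-step forecasts.

We can split a list or NumPy array of data using a slice, given a specified size of the split, e.g. the number of time steps to use from the data in the test set.

The _train_test_split()_ function below implements this for a provided dataset and a specified number of time steps to use in the test set.

```py
# split a univariate dataset into train/test sets
def train_test_split(data, n_test):
    return data[:-n_test], data[-n_test:]
```

After forecasts have been made for each step in the test dataset, they need to be compared to the test set in order to calculate an error score.

There are many popular error scores for time series forecasting. In this case, we will use root mean squared error (RMSE), but you can change this to your preferred measure, e.g. MAPE, MAE, etc.

The _measure_rmse()_ function below calculates the RMSE given a list of actual (the test set) and predicted values.

```py
# root mean squared error or rmse
def measure_rmse(actual, predicted):
    return sqrt(mean_squared_error(actual, predicted))
```

We can now implement the walk-forward validation scheme. This is a standard approach to evaluating a time series forecasting model that respects the temporal ordering of observations.

First, a provided univariate time series dataset is split into train and test sets using the _train_test_split()_ function. Then the observations in the test set are enumerated. For each one, we fit a model on all of the history and make a one-step forecast. The true observation for the time step is then added to the history, and the process is repeated. The _sarima_forecast()_ function is called in order to fit a model and make a prediction. Finally, an error score is calculated by comparing all one-step forecasts to the actual test set by calling the _measure_rmse()_ function.

The _walk_forward_validation()_ function below implements this, taking a univariate time series, a number of time steps to use in the test set, and an array of model configuration.

```py
# walk-forward validation for univariate data
def walk_forward_validation(data, n_test, cfg):
    predictions = list()
    # split dataset
    train, test = train_test_split(data, n_test)
    # seed history with training dataset
    history = [x for x in train]
    # step over each time-step in the test set
    for i in range(len(test)):
        # fit model and make forecast for history
        yhat = sarima_forecast(history, cfg)
        # store forecast in list of predictions
        predictions.append(yhat)
        # add actual observation to history for the next loop
        history.append(test[i])
    # estimate prediction error
    error = measure_rmse(test, predictions)
    return error
```

If you are interested in making multi-step predictions, you can change the call to _predict()_ in the _sarima_forecast()_ function and also change the calculation of error in the _measure_rmse()_ function.

We can call _walk_forward_validation()_ repeatedly with different lists of model configurations.

One possible issue is that some combinations of model configurations are not valid for the model and will throw an exception, e.g. specifying some but not all aspects of the seasonal structure in the data.

Further, some models may also raise warnings on some data, e.g. from the linear algebra libraries called by the statsmodels library.

We can trap exceptions and ignore warnings during the grid search by wrapping all calls to _walk_forward_validation()_ with a try-except and a block to ignore warnings. We can also add debugging support to disable these protections in case we want to see what is really going on. Finally, if an error does occur, we can return a None result; otherwise, we can print some information about the skill of each model evaluated. This is helpful when a large number of models are evaluated.

The _score_model()_ function below implements this and returns a tuple of (key, result), where the key is a string version of the tested model configuration.

```py
# score a model, return None on failure
def score_model(data, n_test, cfg, debug=False):
    result = None
    # convert config to a key
    key = str(cfg)
    # show all warnings and fail on exception if debugging
    if debug:
        result = walk_forward_validation(data, n_test, cfg)
    else:
        # one failure during model validation suggests an unstable config
        try:
            # never show warnings when grid searching, too noisy
            with catch_warnings():
                filterwarnings("ignore")
                result = walk_forward_validation(data, n_test, cfg)
        except Exception:
            result = None
    # check for an interesting result
    if result is not None:
        print(' > Model[%s] %.3f' % (key, result))
    return (key, result)
```

Next, we need a loop to test a list of different model configurations.

This is the main function that drives the grid search process and will call the _score_model()_ function for each model configuration.

We can dramatically speed up the grid search process by evaluating model configurations in parallel. One way to do that is to use the [Joblib library](https://pythonhosted.org/joblib/).

We can define a Parallel object with the number of cores to use and set it to the number of cores detected in your hardware.

```py
executor = Parallel(n_jobs=cpu_count(), backend='multiprocessing')
```

We can then create a list of tasks to execute in parallel, which will be one call to the _score_model()_ function for each model configuration we have.

```py
tasks = (delayed(score_model)(data, n_test, cfg) for cfg in cfg_list)
```

Finally, we can use the Parallel object to execute the list of tasks in parallel.

```py
scores = executor(tasks)
```

That's it.

We can also provide a non-parallel version of evaluating all model configurations in case we want to debug something.

```py
scores = [score_model(data, n_test, cfg) for cfg in cfg_list]
```

The result of evaluating a list of configurations is a list of tuples, each with a name that summarizes a specific model configuration and the error of the model evaluated with that configuration, either the RMSE or None if there was an error.

We can filter out all scores that are None.

```py
scores = [r for r in scores if r[1] is not None]
```

We can then sort all tuples in the list by score in ascending order (best first), then return this list of scores for review.

The _grid_search()_ function below implements this behavior given a univariate time series dataset, a list of model configurations (a list of lists), and the number of time steps to use in the test set. An optional _parallel_ argument allows the evaluation of models across all cores to be turned on or off; it is on by default.

```py
# grid search configs
def grid_search(data, cfg_list, n_test, parallel=True):
    scores = None
    if parallel:
        # execute configs in parallel
        executor = Parallel(n_jobs=cpu_count(), backend='multiprocessing')
        tasks = (delayed(score_model)(data, n_test, cfg) for cfg in cfg_list)
        scores = executor(tasks)
    else:
        scores = [score_model(data, n_test, cfg) for cfg in cfg_list]
    # remove empty results
    scores = [r for r in scores if r[1] is not None]
    # sort configs by error, asc
    scores.sort(key=lambda tup: tup[1])
    return scores
```

We're nearly done.

The only thing left to do is to define a list of model configurations to try for a dataset.

We can define this generically. The only parameter we may want to specify is the periodicity of the seasonal component in the series, if one exists. By default, we assume no seasonal component.

The _sarima_configs()_ function below creates a list of model configurations to evaluate.

The configurations assume that each of the AR, MA, and I components for trend and seasonality are low order, e.g. off (0) or in [1,2]. You may want to extend these ranges if you believe the order may be higher. An optional list of seasonal periods can be specified, and you could even change the function to specify other elements that you may know about your time series.

In theory, with the default single seasonal period there are 1,296 possible model configurations to evaluate, but in practice many are not valid and will result in an error that we trap and ignore.

```py
# create a set of sarima configs to try
def sarima_configs(seasonal=[0]):
    models = list()
    # define config lists
    p_params = [0, 1, 2]
    d_params = [0, 1]
    q_params = [0, 1, 2]
    t_params = ['n','c','t','ct']
    P_params = [0, 1, 2]
    D_params = [0, 1]
    Q_params = [0, 1, 2]
    m_params = seasonal
    # create config instances
    for p in p_params:
        for d in d_params:
            for q in q_params:
                for t in t_params:
                    for P in P_params:
                        for D in D_params:
                            for Q in Q_params:
                                for m in m_params:
                                    cfg = [(p,d,q), (P,D,Q,m), t]
                                    models.append(cfg)
    return models
```

We now have a framework for grid searching SARIMA model hyperparameters via one-step walk-forward validation.

It is generic and will work for any in-memory univariate time series provided as a list or NumPy array.

We can make sure all the pieces work together by testing the framework on a contrived 10-step dataset.

The complete example is listed below.

```py
# grid search sarima hyperparameters
from math import sqrt
from multiprocessing import cpu_count
from joblib import Parallel
from joblib import delayed
from warnings import catch_warnings
from warnings import filterwarnings
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error

# one-step sarima forecast
def sarima_forecast(history, config):
    order, sorder, trend = config
    # define model
    model = SARIMAX(history, order=order, seasonal_order=sorder, trend=trend, enforce_stationarity=False, enforce_invertibility=False)
    # fit model
    model_fit = model.fit(disp=False)
    # make one step forecast
    yhat = model_fit.predict(len(history), len(history))
    return yhat[0]

# root mean squared error or rmse
def measure_rmse(actual, predicted):
    return sqrt(mean_squared_error(actual, predicted))

# split a univariate dataset into train/test sets
def train_test_split(data, n_test):
    return data[:-n_test], data[-n_test:]

# walk-forward validation for univariate data
def walk_forward_validation(data, n_test, cfg):
    predictions = list()
    # split dataset
    train, test = train_test_split(data, n_test)
    # seed history with training dataset
    history = [x for x in train]
    # step over each time-step in the test set
    for i in range(len(test)):
        # fit model and make forecast for history
        yhat = sarima_forecast(history, cfg)
        # store forecast in list of predictions
        predictions.append(yhat)
        # add actual observation to history for the next loop
        history.append(test[i])
    # estimate prediction error
    error = measure_rmse(test, predictions)
    return error

# score a model, return None on failure
def score_model(data, n_test, cfg, debug=False):
    result = None
    # convert config to a key
    key = str(cfg)
    # show all warnings and fail on exception if debugging
    if debug:
        result = walk_forward_validation(data, n_test, cfg)
    else:
        # one failure during model validation suggests an unstable config
        try:
            # never show warnings when grid searching, too noisy
            with catch_warnings():
                filterwarnings("ignore")
                result = walk_forward_validation(data, n_test, cfg)
        except Exception:
            result = None
    # check for an interesting result
    if result is not None:
        print(' > Model[%s] %.3f' % (key, result))
    return (key, result)

# grid search configs
def grid_search(data, cfg_list, n_test, parallel=True):
    scores = None
    if parallel:
        # execute configs in parallel
        executor = Parallel(n_jobs=cpu_count(), backend='multiprocessing')
        tasks = (delayed(score_model)(data, n_test, cfg) for cfg in cfg_list)
        scores = executor(tasks)
    else:
        scores = [score_model(data, n_test, cfg) for cfg in cfg_list]
    # remove empty results
    scores = [r for r in scores if r[1] is not None]
    # sort configs by error, asc
    scores.sort(key=lambda tup: tup[1])
    return scores

# create a set of sarima configs to try
def sarima_configs(seasonal=[0]):
    models = list()
    # define config lists
    p_params = [0, 1, 2]
    d_params = [0, 1]
    q_params = [0, 1, 2]
    t_params = ['n','c','t','ct']
    P_params = [0, 1, 2]
    D_params = [0, 1]
    Q_params = [0, 1, 2]
    m_params = seasonal
    # create config instances
    for p in p_params:
        for d in d_params:
            for q in q_params:
                for t in t_params:
                    for P in P_params:
                        for D in D_params:
                            for Q in Q_params:
                                for m in m_params:
                                    cfg = [(p,d,q), (P,D,Q,m), t]
                                    models.append(cfg)
    return models

if __name__ == '__main__':
    # define dataset
    data = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
    print(data)
    # data split
    n_test = 4
    # model configs
    cfg_list = sarima_configs()
    # grid search
    scores = grid_search(data, cfg_list, n_test)
    print('done')
    # list top 3 configs
    for cfg, error in scores[:3]:
        print(cfg, error)
```

Running the example first prints the contrived time series dataset.

Next, the model configurations and their errors are reported as they are evaluated; the output is truncated here for brevity.

Finally, the configurations and errors for the top three configurations are reported. We can see that many models achieve perfect performance on this simple, linearly increasing time series problem.

```py
[10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
...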
 > Model[[(2, 0, 0), (2, 0, 0, 0), 'ct']] 0.001
 > Model[[(2, 0, 0), (2, 0, 1, 0), 'ct']] 0.000
 > Model[[(2, 0, 1), (0, 0, 0, 0), 'n']] 0.000
 > Model[[(2, 0, 1), (0, 0, 1, 0), 'n']] 0.000
done
[(2, 1, 0), (1, 0, 0, 0), 'n'] 0.0
[(2, 1, 0), (2, 0, 0, 0), 'n'] 0.0
[(2, 1, 1), (1, 0, 1, 0), 'n'] 0.0
```

Now that we have a robust framework for grid searching SARIMA model hyperparameters, let's test it out on a suite of standard univariate time series datasets.

The datasets were chosen for demonstration purposes; I am not suggesting that a SARIMA model is the best approach for each dataset. In some cases, ETS or something else may be more appropriate.

## Case Study 1: No Trend or Seasonality

The 'daily female births' dataset summarizes the daily total female births in California, USA in 1959.

The dataset has no obvious trend or seasonal component.

![Line Plot of the Daily Female Births Dataset](https://img.kancloud.cn/82/c2/82c2332333012a46b0561998c9b6224b_1440x780.jpg)

Line Plot of the Daily Female Births Dataset

You can learn more about the dataset from [DataMarket](https://datamarket.com/data/set/235k/daily-total-female-births-in-california-1959#!ds=235k&display=line).

Download the dataset directly from here:

* [daily-total-female-births.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-total-female-births.csv)

Save the file with the filename '_daily-total-female-births.csv_' in your current working directory.

We can load this dataset as a Pandas series using the function _read_csv()_.

```py
series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
```

The dataset has one year, or 365 observations. We will use the first 200 for training and the remaining 165 as the test set.

The complete example of grid searching the daily female births univariate time series forecasting problem is listed below.

```py
# grid search sarima hyperparameters for daily female dataset
from math import sqrt
from multiprocessing import cpu_count
from joblib import Parallel
from joblib import delayed
from warnings import catch_warnings
from warnings import filterwarnings
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error
from pandas import read_csv

# one-step sarima forecast
def sarima_forecast(history, config):
    order, sorder, trend = config
    # define model
    model = SARIMAX(history, order=order, seasonal_order=sorder, trend=trend, enforce_stationarity=False, enforce_invertibility=False)
    # fit model
    model_fit = model.fit(disp=False)
    # make one step forecast
    yhat = model_fit.predict(len(history), len(history))
    return yhat[0]

# root mean squared error or rmse
def measure_rmse(actual, predicted):
    return sqrt(mean_squared_error(actual, predicted))

# split a univariate dataset into train/test sets
def train_test_split(data, n_test):
    return data[:-n_test], data[-n_test:]

# walk-forward validation for univariate data
def walk_forward_validation(data, n_test, cfg):
    predictions = list()
    # split dataset
    train, test = train_test_split(data, n_test)
    # seed history with training dataset
    history = [x for x in train]
    # step over each time-step in the test set
    for i in range(len(test)):
        # fit model and make forecast for history
        yhat = sarima_forecast(history, cfg)
        # store forecast in list of predictions
        predictions.append(yhat)
        # add actual observation to history for the next loop
        history.append(test[i])
    # estimate prediction error
    error = measure_rmse(test, predictions)
    return error

# score a model, return None on failure
def score_model(data, n_test, cfg, debug=False):
    result = None
    # convert config to a key
    key = str(cfg)
    # show all warnings and fail on exception if debugging
    if debug:
        result = walk_forward_validation(data, n_test, cfg)
    else:
        # one failure during model validation suggests an unstable config
        try:
            # never show warnings when grid searching, too noisy
            with catch_warnings():
                filterwarnings("ignore")
                result = walk_forward_validation(data, n_test, cfg)
        except Exception:
            result = None
    # check for an interesting result
    if result is not None:
        print(' > Model[%s] %.3f' % (key, result))
    return (key, result)

# grid search configs
def grid_search(data, cfg_list, n_test, parallel=True):
    scores = None
    if parallel:
        # execute configs in parallel
        executor = Parallel(n_jobs=cpu_count(), backend='multiprocessing')
        tasks = (delayed(score_model)(data, n_test, cfg) for cfg in cfg_list)
        scores = executor(tasks)
    else:
        scores = [score_model(data, n_test, cfg) for cfg in cfg_list]
    # remove empty results
    scores = [r for r in scores if r[1] is not None]
    # sort configs by error, asc
    scores.sort(key=lambda tup: tup[1])
    return scores

# create a set of sarima configs to try
def sarima_configs(seasonal=[0]):
    models = list()
    # define config lists
    p_params = [0, 1, 2]
    d_params = [0, 1]
    q_params = [0, 1, 2]
    t_params = ['n','c','t','ct']
    P_params = [0, 1, 2]
    D_params = [0, 1]
    Q_params = [0, 1, 2]
    m_params = seasonal
    # create config instances
    for p in p_params:
        for d in d_params:
            for q in q_params:
                for t in t_params:
                    for P in P_params:
                        for D in D_params:
                            for Q in Q_params:
                                for m in m_params:
                                    cfg = [(p,d,q), (P,D,Q,m), t]
                                    models.append(cfg)
    return models

if __name__ == '__main__':
    # load dataset
    series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
    data = series.values
    print(data.shape)
    # data split
    n_test = 165
    # model configs
    cfg_list = sarima_configs()
    # grid search
    scores = grid_search(data, cfg_list, n_test)
    print('done')
    # list top 3 configs
    for cfg, error in scores[:3]:
        print(cfg, error)
```

Running the example may take a few minutes on modern hardware.

Model configurations and their RMSE are printed as the models are evaluated. The top three model configurations and their errors are reported at the end of the run.

We can see that the best result was an RMSE of about 6.77 births with the following configuration:

* **Order**: (1, 0, 2)
* **Seasonal Order**: (1, 0, 1, 0)
* **Trend Parameter**: 't' for linear trend

It is surprising that a configuration with some seasonal elements resulted in the lowest error. I would not have guessed this configuration and would likely have stuck with an ARIMA model.

```py
...
 > Model[[(2, 1, 2), (1, 0, 1, 0), 'ct']] 6.905
 > Model[[(2, 1, 2), (2, 0, 0, 0), 'ct']] 7.031
 > Model[[(2, 1, 2), (2, 0, 1, 0), 'ct']] 6.985
 > Model[[(2, 1, 2), (1, 0, 2, 0), 'ct']] 6.941
 > Model[[(2, 1, 2), (2, 0, 2, 0), 'ct']] 7.056
done
[(1, 0, 2), (1, 0, 1, 0), 't'] 6.770349800255089
[(0, 1, 2), (1, 0, 2, 0), 'ct'] 6.773217122759515
[(2, 1, 1), (2, 0, 2, 0), 'ct'] 6.886633191752254
```

## Case Study 2: Trend

The 'shampoo' dataset summarizes the monthly sales of shampoo over a three-year period.

The dataset contains an obvious trend but no obvious seasonal component.

![Line Plot of the Monthly Shampoo Sales Dataset](https://img.kancloud.cn/ae/a5/aea5992c9bbc15a4ef6046500013d962_1438x776.jpg)

Line Plot of the Monthly Shampoo Sales Dataset

You can learn more about the dataset from [DataMarket](https://datamarket.com/data/set/22r0/sales-of-shampoo-over-a-three-year-period#!ds=22r0&display=line).

Download the dataset directly from here:

* [shampoo.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/shampoo.csv)

Save the file with the filename 'shampoo.csv' in your current working directory.

We can load this dataset as a Pandas series using the function _read_csv()_.

```py
# parse dates
def custom_parser(x):
    return datetime.strptime('195'+x, '%Y-%m')

# load dataset
series = read_csv('shampoo.csv', header=0, index_col=0, date_parser=custom_parser)
```

The dataset has three years, or 36 observations. We will use the first 24 for training and the remaining 12 as the test set.

The complete example of grid searching the shampoo sales univariate time series forecasting problem is listed below.

```py
# grid search sarima hyperparameters for monthly shampoo sales dataset
from math import sqrt
from multiprocessing import cpu_count
from datetime import datetime
from joblib import Parallel
from joblib import delayed
from warnings import catch_warnings
from warnings import filterwarnings
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error
from pandas import read_csv

# one-step sarima forecast
def sarima_forecast(history, config):
    order, sorder, trend = config
    # define model
    model = SARIMAX(history, order=order, seasonal_order=sorder, trend=trend, enforce_stationarity=False, enforce_invertibility=False)
    # fit model
    model_fit = model.fit(disp=False)
    # make one step forecast
    yhat = model_fit.predict(len(history), len(history))
    return yhat[0]

# root mean squared error or rmse
def measure_rmse(actual, predicted):
    return sqrt(mean_squared_error(actual, predicted))

# split a univariate dataset into train/test sets
def train_test_split(data, n_test):
    return data[:-n_test], data[-n_test:]

# walk-forward validation for univariate data
def walk_forward_validation(data, n_test, cfg):
    predictions = list()
    # split dataset
    train, test = train_test_split(data, n_test)
    # seed history with training dataset
    history = [x for x in train]
    # step over each time-step in the test set
    for i in range(len(test)):
        # fit model and make forecast for history
        yhat = sarima_forecast(history, cfg)
        # store forecast in list of predictions
        predictions.append(yhat)
        # add actual observation to history for the next loop
        history.append(test[i])
    # estimate prediction error
    error = measure_rmse(test, predictions)
    return error

# score a model, return None on failure
def score_model(data, n_test, cfg, debug=False):
    result = None
    # convert config to a key
    key = str(cfg)
    # show all warnings and fail on exception if debugging
    if debug:
        result = walk_forward_validation(data, n_test, cfg)
    else:
        # one failure during model validation suggests an unstable config
        try:
            # never show warnings when grid searching, too noisy
            with catch_warnings():
                filterwarnings("ignore")
                result = walk_forward_validation(data, n_test, cfg)
        except Exception:
            result = None
    # check for an interesting result
    if result is not None:
        print(' > Model[%s] %.3f' % (key, result))
    return (key, result)

# grid search configs
def grid_search(data, cfg_list, n_test, parallel=True):
    scores = None
    if parallel:
        # execute configs in parallel
        executor = Parallel(n_jobs=cpu_count(), backend='multiprocessing')
        tasks = (delayed(score_model)(data, n_test, cfg) for cfg in cfg_list)
        scores = executor(tasks)
    else:
        scores = [score_model(data, n_test, cfg) for cfg in cfg_list]
    # remove empty results
    scores = [r for r in scores if r[1] is not None]
    # sort configs by error, asc
    scores.sort(key=lambda tup: tup[1])
    return scores

# create a set of sarima configs to try
def sarima_configs(seasonal=[0]):
    models = list()
    # define config lists
    p_params = [0, 1, 2]
    d_params = [0, 1]
    q_params = [0, 1, 2]
    t_params = ['n','c','t','ct']
    P_params = [0, 1, 2]
    D_params = [0, 1]
    Q_params = [0, 1, 2]
    m_params = seasonal
    # create config instances
    for p in p_params:
        for d in d_params:
            for q in q_params:
                for t in t_params:
                    for P in P_params:
                        for D in D_params:
                            for Q in Q_params:
                                for m in m_params:
                                    cfg = [(p,d,q), (P,D,Q,m), t]
                                    models.append(cfg)
    return models

# parse dates
def custom_parser(x):
    return datetime.strptime('195'+x, '%Y-%m')

if __name__ == '__main__':
    # load dataset (date_parser is deprecated in pandas 2.0 and later)
    series = read_csv('shampoo.csv', header=0, index_col=0, date_parser=custom_parser)
    data = series.values
    print(data.shape)
    # data split
    n_test = 12
    # model configs
    cfg_list = sarima_configs()
    # grid search
    scores = grid_search(data, cfg_list, n_test)
    print('done')
    # list top 3 configs
    for cfg, error in scores[:3]:
        print(cfg, error)
```

Running the example may take a few minutes on modern hardware.

Model configurations and their RMSE are printed as the models are evaluated. The top three model configurations and their errors are reported at the end of the run.

We can see that the best result was an RMSE of about 54.76 with the following configuration:

* **Trend Order**: (0, 1, 2)
* **Seasonal Order**: (2, 0, 2, 0)
* **Trend Parameter**: 't' (linear trend)

```py
...
 > Model[[(2, 1, 2), (1, 0, 1, 0), 'ct']] 68.891
 > Model[[(2, 1, 2), (2, 0, 0, 0), 'ct']] 75.406
 > Model[[(2, 1, 2), (1, 0, 2, 0), 'ct']] 80.908
 > Model[[(2, 1, 2), (2, 0, 1, 0), 'ct']] 78.734
 > Model[[(2, 1, 2), (2, 0, 2, 0), 'ct']] 82.958
done
[(0, 1, 2), (2, 0, 2, 0), 't'] 54.767582003072874
[(0, 1, 1), (2, 0, 2, 0), 'ct'] 58.69987083057107
[(1, 1, 2), (0, 0, 1, 0), 't'] 58.709089340600094
```

## Case Study 3: Seasonality

The 'monthly mean temperatures' dataset summarizes the monthly mean air temperature at Nottingham Castle, England from 1920 to 1939, in degrees Fahrenheit.

The dataset has an obvious seasonal component and no obvious trend.

![Line Plot of the Monthly Mean Temperatures Dataset](https://img.kancloud.cn/24/3c/243cfe0fd0e8ab5923b76dcc30ca7a95_1454x766.jpg)

Line Plot of the Monthly Mean Temperatures Dataset

You can learn more about the dataset from [DataMarket](https://datamarket.com/data/set/22li/mean-monthly-air-temperature-deg-f-nottingham-castle-1920-1939#!ds=22li&display=line).

Download the dataset directly from here:

* [monthly-mean-temp.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/monthly-mean-temp.csv)

Save the file with the filename '_monthly-mean-temp.csv_' in your current working directory.

We can load this dataset as a Pandas series using the function _read_csv()_.

```py
series = read_csv('monthly-mean-temp.csv', header=0, index_col=0)
```

The dataset has 20 years, or 240 observations. We will trim the dataset to the last five years of data (60 observations) in order to speed up the model evaluation process, and we will use the last year, or 12 observations, as the test set.

```py
# trim dataset to 5 years
data = data[-(5*12):]
```

The period of the seasonal component is about one year, or 12 observations. We will pass this as the seasonal period in the call to the _sarima_configs()_ function when preparing the model configurations.

```py
# model configs
cfg_list = sarima_configs(seasonal=[0, 12])
```

The complete example of grid searching the monthly mean temperature time series forecasting problem is listed below.

```py
# grid search sarima hyperparameters for monthly mean temp dataset
from math import sqrt
from multiprocessing import cpu_count
from joblib import Parallel
from joblib import delayed
from warnings import catch_warnings
from warnings import filterwarnings
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error
from pandas import read_csv

# one-step sarima forecast
def sarima_forecast(history, config):
    order, sorder, trend = config
    # define model
    model = SARIMAX(history, order=order, seasonal_order=sorder, trend=trend, enforce_stationarity=False, enforce_invertibility=False)
    # fit model
    model_fit = model.fit(disp=False)
    # make one step forecast
    yhat = model_fit.predict(len(history), len(history))
    return yhat[0]

# root mean squared error or rmse
def measure_rmse(actual, predicted):
    return sqrt(mean_squared_error(actual, predicted))

# split a univariate dataset into train/test sets
def train_test_split(data, n_test):
    return data[:-n_test], data[-n_test:]

# walk-forward validation for univariate data
def walk_forward_validation(data, n_test, cfg):
    predictions = list()
    # split dataset
    train, test = train_test_split(data, n_test)
    # seed history with training dataset
    history = [x for x in train]
    # step over each time-step in the test set
    for i in range(len(test)):
        # fit model and make forecast for history
        yhat = sarima_forecast(history, cfg)
        # store forecast in list of predictions
        predictions.append(yhat)
        # add actual observation to history for the next loop
        history.append(test[i])
    # estimate prediction error
    error = measure_rmse(test, predictions)
    return error

# score a model, return None on failure
def score_model(data, n_test, cfg, debug=False):
    result = None
    # convert config to a key
    key = str(cfg)
    # show all warnings and fail on exception if debugging
    if debug:
        result = walk_forward_validation(data, n_test, cfg)
    else:
        # one failure during model validation suggests an unstable config
        try:
            # never show warnings when grid searching, too noisy
            with catch_warnings():
                filterwarnings("ignore")
                result = walk_forward_validation(data, n_test, cfg)
        except Exception:
            result = None
    # check for an interesting result
    if result is not None:
        print(' > Model[%s] %.3f' % (key, result))
    return (key, result)

# grid search configs
def grid_search(data, cfg_list, n_test, parallel=True):
    scores = None
    if parallel:
        # execute configs in parallel
        executor = Parallel(n_jobs=cpu_count(), backend='multiprocessing')
        tasks = (delayed(score_model)(data, n_test, cfg) for cfg in cfg_list)
        scores = executor(tasks)
    else:
        scores = [score_model(data, n_test, cfg) for cfg in cfg_list]
    # remove empty results
    scores = [r for r in scores if r[1] is not None]
    # sort configs by error, asc
    scores.sort(key=lambda tup: tup[1])
    return scores

# create a set of sarima configs to try
def sarima_configs(seasonal=[0]):
    models = list()
    # define config lists
    p_params = [0, 1, 2]
    d_params = [0, 1]
    q_params = [0, 1, 2]
    t_params = ['n','c','t','ct']
    P_params = [0, 1, 2]
    D_params = [0, 1]
    Q_params = [0, 1, 2]
    m_params = seasonal
    # create config instances
    for p in p_params:
        for d in d_params:
            for q in q_params:
                for t in t_params:
                    for P in P_params:
                        for D in D_params:
                            for Q in Q_params:
                                for m in m_params:
                                    cfg = [(p,d,q), (P,D,Q,m), t]
                                    models.append(cfg)
    return models

if __name__ == '__main__':
    # load dataset
    series = read_csv('monthly-mean-temp.csv', header=0, index_col=0)
    data = series.values
    # trim dataset to 5 years
    data = data[-(5*12):]
    # data split
    n_test = 12
    # model configs
    cfg_list = sarima_configs(seasonal=[0, 12])
    # grid search
    scores = grid_search(data, cfg_list, n_test)
    print('done')
    # list top 3 configs
    for cfg, error in scores[:3]:
        print(cfg, error)
```

Running the example may take a few minutes on modern hardware.

Model configurations and their RMSE are printed as the models are evaluated. The top three model configurations and their errors are reported at the end of the run.

We can see that the best result was an RMSE of about 1.5 degrees with the following configuration:

* **Trend Order**: (0, 0, 0)
* **Seasonal Order**: (1, 0, 1, 12)
* **Trend Parameter**: 'n' (no trend)

As we would expect, the model has no trend component and a 12-month seasonal ARMA component.

```py
...
 > Model[[(2, 1, 2), (2, 1, 0, 12), 't']] 4.599
 > Model[[(2, 1, 2), (1, 1, 0, 12), 'ct']] 2.477
 > Model[[(2, 1, 2), (2, 0, 0, 12), 'ct']] 2.548
 > Model[[(2, 1, 2), (2, 0, 1, 12), 'ct']] 2.893
 > Model[[(2, 1, 2), (2, 1, 0, 12), 'ct']] 5.404
done
[(0, 0, 0), (1, 0, 1, 12), 'n'] 1.5577613610905712
[(0, 0, 0), (1, 1, 0, 12), 'n'] 1.6469530713847962
[(0, 0, 0), (2, 0, 0, 12), 'n'] 1.7314448163607488
```

## Case Study 4: Trend and Seasonality

The 'monthly car sales' dataset summarizes the monthly car sales in Quebec, Canada between 1960 and 1968.

The dataset has an obvious trend and seasonal component.

![Line Plot of the Monthly Car Sales Dataset](https://img.kancloud.cn/04/5f/045f949f08b91dfff5ec9152a3aaca14_1462x768.jpg)

Line Plot of the Monthly Car Sales Dataset

You can learn more about the dataset from [DataMarket](https://datamarket.com/data/set/22n4/monthly-car-sales-in-quebec-1960-1968#!ds=22n4&display=line).

Download the dataset directly from here:

* [monthly-car-sales.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/monthly-car-sales.csv)

Save the file with the filename '_monthly-car-sales.csv_' in your current working directory.

We can load this dataset as a Pandas series using the function _read_csv()_.

```py
series = read_csv('monthly-car-sales.csv', header=0, index_col=0)
```

The dataset has nine years, or 108 observations. We will use the last year, or 12 observations, as the test set.

The period of the seasonal component could be six months or 12 months. We will try both as the seasonal period in the call to the _sarima_configs()_ function when preparing the model configurations.

```py
# model configs
cfg_list = sarima_configs(seasonal=[0,6,12])
```

The complete example of grid searching the monthly car sales time series forecasting problem is listed below.

```py
# grid search sarima hyperparameters for monthly car sales dataset
from math import sqrt
from multiprocessing import cpu_count
from joblib import Parallel
from joblib import delayed
from warnings import catch_warnings
from warnings import filterwarnings
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error
from pandas import read_csv

# one-step sarima forecast
def sarima_forecast(history, config):
    order, sorder, trend = config
    # define model
    model = SARIMAX(history, order=order, seasonal_order=sorder, trend=trend, enforce_stationarity=False, enforce_invertibility=False)
    # fit model
    model_fit = model.fit(disp=False)
    # make one step forecast
    yhat = model_fit.predict(len(history), len(history))
    return yhat[0]

# root mean squared error or rmse
def measure_rmse(actual, predicted):
    return sqrt(mean_squared_error(actual, predicted))

# split a univariate dataset into train/test sets
def train_test_split(data, n_test):
    return data[:-n_test], data[-n_test:]

# walk-forward validation for univariate data
def walk_forward_validation(data, n_test, cfg):
    predictions = list()
    # split dataset
    train, test = train_test_split(data, n_test)
    # seed history with training dataset
    history = [x for x in train]
    # step over each time-step in the test set
    for i in range(len(test)):
        # fit model and make forecast for history
        yhat = sarima_forecast(history, cfg)
        # store forecast in list of predictions
        predictions.append(yhat)
        # add actual observation to history for the next loop
        history.append(test[i])
    # estimate prediction error
    error = measure_rmse(test, predictions)
    return error

# score a model, return None on failure
def score_model(data, n_test, cfg, debug=False):
    result = None
    # convert config to a key
    key = str(cfg)
    # show all warnings and fail on exception if debugging
    if debug:
        result = walk_forward_validation(data, n_test, cfg)
    else:
        # one failure during model validation suggests an unstable config
        try:
            # never show warnings when grid searching, too noisy
            with catch_warnings():
                filterwarnings("ignore")
                result = walk_forward_validation(data, n_test, cfg)
        except Exception:
            result = None
    # check for an interesting result
    if result is not None:
        print(' > Model[%s] %.3f' % (key, result))
    return (key, result)

# grid search configs
def grid_search(data, cfg_list, n_test, parallel=True):
    scores = None
    if parallel:
        # execute configs in parallel
        executor = Parallel(n_jobs=cpu_count(), backend='multiprocessing')
        tasks = (delayed(score_model)(data, n_test, cfg) for cfg in cfg_list)
        scores = executor(tasks)
    else:
        scores = [score_model(data, n_test, cfg) for cfg in cfg_list]
    # remove empty results
    scores = [r for r in scores if r[1] is not None]
    # sort configs by error, asc
    scores.sort(key=lambda tup: tup[1])
    return scores

# create a set of sarima configs to try
def sarima_configs(seasonal=[0]):
    models = list()
    # define config lists
    p_params = [0, 1, 2]
    d_params = [0, 1]
    q_params = [0, 1, 2]
    t_params = ['n','c','t','ct']
    P_params = [0, 1, 2]
    D_params = [0, 1]
    Q_params = [0, 1, 2]
    m_params = seasonal
    # create config instances
    for p in p_params:
        for d in d_params:
            for q in q_params:
                for t in t_params:
                    for P in P_params:
                        for D in D_params:
                            for Q in Q_params:
                                for m in m_params:
                                    cfg = [(p,d,q), (P,D,Q,m), t]
                                    models.append(cfg)
    return models

if __name__ == '__main__':
    # load dataset
    series = read_csv('monthly-car-sales.csv', header=0, index_col=0)
    data = series.values
    print(data.shape)
    # data split
    n_test = 12
    # model configs
    cfg_list = sarima_configs(seasonal=[0,6,12])
    # grid search
    scores = grid_search(data, cfg_list, n_test)
    print('done')
    # list top 3 configs
    for cfg, error in scores[:3]:
        print(cfg, error)
```

Running the example may take a few minutes on modern hardware.

Model configurations and their RMSE are printed as the models are evaluated. The top three model configurations and their errors are reported at the end of the run.

We can see that the best result was an RMSE of about 1,551 sales with the following configuration:

* **Trend Order**: (0, 0, 0)
* **Seasonal Order**: (1, 1, 0, 12)
* **Trend Parameter**: 't' (linear trend)

```py
 > Model[[(2, 1, 2), (2, 1, 1, 6), 'ct']] 2246.248
 > Model[[(2, 1, 2), (2, 0, 2, 12), 'ct']] 10710.462
 > Model[[(2, 1, 2), (2, 1, 2, 6), 'ct']] 2183.568
 > Model[[(2, 1, 2), (2, 1, 0, 12), 'ct']] 2105.800
 > Model[[(2, 1, 2), (2, 1, 1, 12), 'ct']] 2330.361
 > Model[[(2, 1, 2), (2, 1, 2, 12), 'ct']] 31580326686.803
done
[(0, 0, 0), (1, 1, 0, 12), 't'] 1551.8423920342414
[(0, 0, 0), (2, 1, 1, 12), 'c'] 1557.334614575545
[(0, 0, 0), (1, 1, 0, 12), 'c'] 1559.3276311282675
```

## Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

* **Data Transforms**. Update the framework to support configurable data transforms such as normalization and standardization.
* **Plot Forecast**. Update the framework to re-fit a model with the best configuration and forecast the entire test dataset, then plot the forecast compared to the actual observations in the test set.
* **Tune Amount of History**. Update the framework to tune the amount of historical data used to fit the model (e.g. in the case of the 10 years of max temperature data).

If you explore any of these extensions, I'd love to know.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Posts

* [How to Create an ARIMA Model for Time Series Forecasting with Python](https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/)
* [How to Grid Search ARIMA Model Hyperparameters with Python](https://machinelearningmastery.com/grid-search-arima-hyperparameters-with-python/)
* [A Gentle Introduction to Autocorrelation and Partial Autocorrelation](https://machinelearningmastery.com/gentle-introduction-autocorrelation-partial-autocorrelation/)

### Books

* Chapter 8, ARIMA models, [Forecasting: principles and practice](https://amzn.to/2xlJsfV), 2013.
* Chapter 7, Non-stationary Models, [Introductory Time Series with R](https://amzn.to/2smB9LR), 2009.

### API

* [Statsmodels Time Series Analysis by State Space Methods](http://www.statsmodels.org/dev/statespace.html)
* [statsmodels.tsa.statespace.sarimax.SARIMAX API](http://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html)
* [statsmodels.tsa.statespace.sarimax.SARIMAXResults API](http://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAXResults.html)
* [Statsmodels SARIMAX Notebook](http://www.statsmodels.org/dev/examples/notebooks/generated/statespace_sarimax_stata.html)
* [Joblib: running Python functions as pipeline jobs](https://pythonhosted.org/joblib/)

### Articles

* [Autoregressive integrated moving average on Wikipedia](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average)

## Summary

In this tutorial, you discovered how to develop a framework for grid searching all of the SARIMA model hyperparameters for univariate time series forecasting.

Specifically, you learned:

* How to develop a framework for grid searching SARIMA models from scratch using walk-forward validation.
* How to grid search SARIMA model hyperparameters for daily time series data for births.
* How to grid search SARIMA model hyperparameters for monthly time series data for shampoo sales, car sales, and temperature.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
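
As a compact recap of the two ideas this tutorial relies on, walk-forward validation and an exhaustive configuration grid, here is a minimal sketch that runs without statsmodels. It is an illustration under stated assumptions, not the tutorial's framework: `naive_forecast()`, `walk_forward_rmse()`, and `enumerate_configs()` are hypothetical helper names, and a simple persistence forecast (repeat the last observation) is swapped in for the SARIMA model so that the evaluation loop can be checked by hand.

```py
# recap sketch: walk-forward validation and the config grid, without statsmodels
from itertools import product
from math import sqrt

# hypothetical stand-in for a model: predict the last observed value
def naive_forecast(history):
    return history[-1]

# walk-forward validation with one-step forecasts, scored by RMSE
def walk_forward_rmse(data, n_test, forecast_fn):
    train, test = data[:-n_test], data[-n_test:]
    history = list(train)
    squared_errors = []
    for actual in test:
        # forecast from everything seen so far
        yhat = forecast_fn(history)
        squared_errors.append((actual - yhat) ** 2)
        # reveal the true observation before the next forecast
        history.append(actual)
    return sqrt(sum(squared_errors) / len(squared_errors))

# enumerate the same grid as the nested loops, in the same order
def enumerate_configs(seasonal=(0,)):
    grid = product([0, 1, 2], [0, 1], [0, 1, 2], ['n', 'c', 't', 'ct'],
                   [0, 1, 2], [0, 1], [0, 1, 2], seasonal)
    return [[(p, d, q), (P, D, Q, m), t] for p, d, q, t, P, D, Q, m in grid]

# contrived dataset: every persistence forecast is off by exactly 10
data = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
print(walk_forward_rmse(data, 4, naive_forecast))  # 10.0
print(len(enumerate_configs()))  # 1296
```

On the contrived series each persistence forecast misses by exactly 10, which gives a convenient baseline RMSE to compare any fitted model against. The grid count also shows how quickly the search grows: each additional seasonal period adds another 1,296 candidate configurations.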