如何開發一種編碼器 - 解碼器模型，注重Keras中的序列到序列預測 · Machine Learning Mastery 博客文章翻譯

# 如何開發一種編碼器 - 解碼器模型，注重Keras中的序列到序列預測 > 原文： [https://machinelearningmastery.com/encoder-decoder-attention-sequence-to-sequence-prediction-keras/](https://machinelearningmastery.com/encoder-decoder-attention-sequence-to-sequence-prediction-keras/) 用于循環神經網絡的編碼器 - 解碼器架構被證明在諸如機器翻譯和字幕生成之類的自然語言處理領域中的大量序列到序列預測問題上是強大的。注意力是一種解決編碼器 - 解碼器架構在長序列上的限制的機制，并且通常加速學習并且提升模型的技能而沒有序列來排序預測問題。在本教程中，您將了解如何使用Keras在Python中開發一個編碼器 - 解碼器循環神經網絡。完成本教程后，您將了解： * 如何設計一個小的可配置問題來評估編碼器 - 解碼器循環神經網絡有無注意。 * 如何設計和評估編碼器 - 解碼器網絡，有和沒有注意序列預測問題。 * 如何有力地比較編碼器 - 解碼器網絡的表現有沒有注意。讓我們開始吧。 ![How to Develop an Encoder-Decoder Model with Attention for Sequence-to-Sequence Prediction in Keras](img/a508d456c2630712158b024c6041c69e.jpg) 如何開發一個編碼器 - 解碼器模型，注意Keras中的序列到序列預測照片由 [Angela和Andrew](https://www.flickr.com/photos/150568953@N07/34585914155/) ，保留一些權利。 ## 教程概述本教程分為6個部分;他們是： 1. 注意編碼器解碼器 2. 注意力的測試問題 3. 編碼器 - 解碼器沒有注意 4. 自定義Keras注意層 5. 注意編碼器解碼器 6. 模型比較 ### Python環境本教程假定您已安裝Python 3 SciPy環境。您必須安裝帶有TensorFlow或Theano后端的Keras（2.0或更高版本）。本教程還假設您安裝了scikit-learn，Pandas，NumPy和Matplotlib。如果您需要有關環境的幫助，請參閱此帖子： * [如何使用Anaconda設置用于機器學習和深度學習的Python環境](https://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/) ## 注意編碼器解碼器用于循環神經網絡的編碼器 - 解碼器模型是用于序列到序列預測問題的架構。它由兩個子模型組成，顧名思義： * **編碼器**：編碼器負責逐步執行輸入時間步長并將整個序列編碼為稱為上下文向量的固定長度向量。 * **解碼器**：解碼器負責在從上下文向量讀取時逐步執行輸出時間步長。該體系結構的問題在于長輸入或輸出序列的表現差。原因被認為是由于編碼器使用的固定大小的內部表示。注意是解決此限制的體系結構的擴展。它的工作原理是首先提供從編碼器到解碼器的更豐富的上下文和學習機制，其中解碼器可以在預測輸出序列中的每個時間步長時學習在更豐富的編碼中注意的位置。有關編碼器 - 解碼器架構的更多關注，請參閱帖子： * [長期短期記憶循環神經網絡](https://machinelearningmastery.com/attention-long-short-term-memory-recurrent-neural-networks/)的注意事項 * [編碼器 - 解碼器循環神經網絡中的注意事項如何工作](https://machinelearningmastery.com/how-does-attention-work-in-encoder-decoder-recurrent-neural-networks/) ## 注意力的測試問題在我們開發注意力模型之前，我們將首先定義一個可設計的可擴展測試問題，我們可以用它來確定注意力是否提供任何好處。在這個問題中，我們將生成隨機整數序列作為輸入和匹配輸出序列，該輸出序列由輸入序列中的整數子集組成。例如，輸入序列可能是[1,6,2,7,3]，預期的輸出序列可能是序列[1,6]中的前兩個隨機整數。我們將定義問題，使輸入和輸出序列長度相同，并根據需要用“0”值填充輸出序列。首先，我們需要一個函數來生成隨機整數序列。我們將使用Python [randint（）](https://docs.python.org/3/library/random.html)函數生成0和最大值之間的隨機整數，并使用此范圍作為問題的基數（例如，要素數或難度軸）。下面的函數 _generate_sequence（）_將生成一個固定長度和指定基數的隨機整數序列。 ```py from random import randint # generate a sequence of random integers def generate_sequence(length, n_unique): return [randint(0, n_unique-1) for _ in range(length)] # generate random sequence sequence = generate_sequence(5, 50) print(sequence) ``` 運行此示例將生成一個包含5個時間步長的序列，其中序列中的每個值都是0到49之間的隨機整數。 ```py [43, 3, 28, 34, 33] ``` 接下來，我們需要一個函數[將一個熱編碼](https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/)的離散整數值轉換成二進制向量。如果使用50的基數，則每個整數將由0個值的50個元素向量和指定整數值的索引中的1表示。下面的 _one_hot_encode（）_函數將對給定的整數序列進行熱編碼。 ```py # one hot encode sequence def one_hot_encode(sequence, n_unique): encoding = list() for value in sequence: vector = [0 for _ in range(n_unique)] vector[value] = 1 encoding.append(vector) return array(encoding) ``` 我們還需要能夠解碼編碼序列。需要將預測從模型或編碼的預期序列轉換回我們可以讀取和評估的整數序列。下面的 _one_hot_decode（）_函數將一個熱編碼序列解碼回整數序列。 ```py # decode a one hot encoded string def one_hot_decode(encoded_seq): return [argmax(vector) for vector in encoded_seq] ``` 我們可以在下面的例子中測試這些操作。 ```py from random import randint from numpy import array from numpy import argmax # generate a sequence of random integers def generate_sequence(length, n_unique): return [randint(0, n_unique-1) for _ in range(length)] # one hot encode sequence def one_hot_encode(sequence, n_unique): encoding = list() for value in sequence: vector = [0 for _ in range(n_unique)] vector[value] = 1 encoding.append(vector) return array(encoding) # decode a one hot encoded string def one_hot_decode(encoded_seq): return [argmax(vector) for vector in encoded_seq] # generate random sequence sequence = generate_sequence(5, 50) print(sequence) # one hot encode encoded = one_hot_encode(sequence, 50) print(encoded) # decode decoded = one_hot_decode(encoded) print(decoded) ``` 首先運行該示例打印隨機生成的序列，然后打印一個熱編碼版本，最后再打印解碼序列。 ```py [3, 18, 32, 11, 36] [[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]] [3, 18, 32, 11, 36] ``` 最后，我們需要一個能夠創建輸入和輸出序列對的函數來訓練和評估模型。下面命名為 _get_pair（）_的函數將返回一個輸入和輸出序列對，給定指定的輸入長度，輸出長度和基數。輸入和輸出序列的長度和輸入序列的長度相同，但輸出序列將作為輸入序列的第一個 _n_ 字符，并用零值填充到所需長度。然后對整數序列進行編碼，然后重新成形為循環神經網絡所需的3D格式，其尺寸為：_樣本_，_時間步長_和_特征_。在這種情況下，樣本總是1，因為我們只生成一個輸入 - 輸出對，時間步長是輸入序列長度，特征是每個時間步長的基數。 ```py # prepare data for the LSTM def get_pair(n_in, n_out, n_unique): # generate random sequence sequence_in = generate_sequence(n_in, n_unique) sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)] # one hot encode X = one_hot_encode(sequence_in, n_unique) y = one_hot_encode(sequence_out, n_unique) # reshape as 3D X = X.reshape((1, X.shape[0], X.shape[1])) y = y.reshape((1, y.shape[0], y.shape[1])) return X,y ``` 我們可以將它們放在一起并演示數據準備代碼。 ```py from random import randint from numpy import array from numpy import argmax # generate a sequence of random integers def generate_sequence(length, n_unique): return [randint(0, n_unique-1) for _ in range(length)] # one hot encode sequence def one_hot_encode(sequence, n_unique): encoding = list() for value in sequence: vector = [0 for _ in range(n_unique)] vector[value] = 1 encoding.append(vector) return array(encoding) # decode a one hot encoded string def one_hot_decode(encoded_seq): return [argmax(vector) for vector in encoded_seq] # prepare data for the LSTM def get_pair(n_in, n_out, n_unique): # generate random sequence sequence_in = generate_sequence(n_in, n_unique) sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)] # one hot encode X = one_hot_encode(sequence_in, n_unique) y = one_hot_encode(sequence_out, n_unique) # reshape as 3D X = X.reshape((1, X.shape[0], X.shape[1])) y = y.reshape((1, y.shape[0], y.shape[1])) return X,y # generate random sequence X, y = get_pair(5, 2, 50) print(X.shape, y.shape) print('X=%s, y=%s' % (one_hot_decode(X[0]), one_hot_decode(y[0]))) ``` 運行該示例生成單個輸入 - 輸出對并打印兩個陣列的形狀。然后以解碼的形式打印生成的對，其中我們可以看到序列的前兩個整數在輸出序列中被再現，隨后是零值的填充。 ```py (1, 5, 50) (1, 5, 50) X=[12, 20, 36, 40, 12], y=[12, 20, 0, 0, 0] ``` ## 編碼器 - 解碼器沒有注意在本節中，我們將在沒有注意的情況下使用編碼器 - 解碼器模型開發關于問題表現的基線。我們將在5個時間步的輸入和輸出序列中修復問題定義，輸出序列中輸入序列的前2個元素和基數為50。 ```py # configure problem n_features = 50 n_timesteps_in = 5 n_timesteps_out = 2 ``` 我們可以通過從編碼器LSTM模型獲得輸出，在Keras中開發一個簡單的編碼器 - 解碼器模型，對輸出序列中的時間步長重復n次，然后使用解碼器來預測輸出序列。有關如何在Keras中定義編碼器 - 解碼器架構的更多詳細信息，請參閱帖子： * [編碼器 - 解碼器長短期存儲器網絡](https://machinelearningmastery.com/encoder-decoder-long-short-term-memory-networks/) 我們將使用相同數量的單位配置編碼器和解碼器，在本例中為150.我們將使用梯度下降的有效Adam實現并優化分類交叉熵損失函數，因為該問題在技術上是一個多類別分類問題。模型的配置是在經過一些試驗和錯誤之后找到的，并未進行優化。下面列出了Keras中編碼器 - 解碼器架構的代碼。 ```py # define model model = Sequential() model.add(LSTM(150, input_shape=(n_timesteps_in, n_features))) model.add(RepeatVector(n_timesteps_in)) model.add(LSTM(150, return_sequences=True)) model.add(TimeDistributed(Dense(n_features, activation='softmax'))) model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc']) ``` 我們將在5,000個隨機輸入 - 輸出對的整數序列上訓練模型。 ```py # train LSTM for epoch in range(5000): # generate new random sequence X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features) # fit model for one epoch on this sequence model.fit(X, y, epochs=1, verbose=2) ``` 一旦經過訓練，我們將在100個新的隨機生成的整數序列上評估模型，并且只在整個輸出序列與預期值匹配時才標記預測正確。 ```py # evaluate LSTM total, correct = 100, 0 for _ in range(total): X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features) yhat = model.predict(X, verbose=0) if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])): correct += 1 print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0)) ``` 最后，我們將打印10個預期輸出序列和模型預測序列的例子。將所有這些放在一起，下面列出了完整的示例。 ```py from random import randint from numpy import array from numpy import argmax from numpy import array_equal from keras.models import Sequential from keras.layers import LSTM from keras.layers import Dense from keras.layers import TimeDistributed from keras.layers import RepeatVector # generate a sequence of random integers def generate_sequence(length, n_unique): return [randint(0, n_unique-1) for _ in range(length)] # one hot encode sequence def one_hot_encode(sequence, n_unique): encoding = list() for value in sequence: vector = [0 for _ in range(n_unique)] vector[value] = 1 encoding.append(vector) return array(encoding) # decode a one hot encoded string def one_hot_decode(encoded_seq): return [argmax(vector) for vector in encoded_seq] # prepare data for the LSTM def get_pair(n_in, n_out, cardinality): # generate random sequence sequence_in = generate_sequence(n_in, cardinality) sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)] # one hot encode X = one_hot_encode(sequence_in, cardinality) y = one_hot_encode(sequence_out, cardinality) # reshape as 3D X = X.reshape((1, X.shape[0], X.shape[1])) y = y.reshape((1, y.shape[0], y.shape[1])) return X,y # configure problem n_features = 50 n_timesteps_in = 5 n_timesteps_out = 2 # define model model = Sequential() model.add(LSTM(150, input_shape=(n_timesteps_in, n_features))) model.add(RepeatVector(n_timesteps_in)) model.add(LSTM(150, return_sequences=True)) model.add(TimeDistributed(Dense(n_features, activation='softmax'))) model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc']) # train LSTM for epoch in range(5000): # generate new random sequence X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features) # fit model for one epoch on this sequence model.fit(X, y, epochs=1, verbose=2) # evaluate LSTM total, correct = 100, 0 for _ in range(total): X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features) yhat = model.predict(X, verbose=0) if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])): correct += 1 print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0)) # spot check some examples for _ in range(10): X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features) yhat = model.predict(X, verbose=0) print('Expected:', one_hot_decode(y[0]), 'Predicted', one_hot_decode(yhat[0])) ``` 運行此示例不會花費很長時間，可能需要幾分鐘的CPU，不需要GPU。據報道該模型的準確率略低于20％。鑒于[神經網絡的隨機性](https://machinelearningmastery.com/randomness-in-machine-learning/)，您的結果會有所不同;考慮運行幾次示例并取平均值。 ```py Accuracy: 19.00% ``` 我們可以從樣本輸出中看到，模型確實在輸出序列中得到一個數字，對于大多數或所有情況都是正確的，并且只與第二個數字斗爭。正確預測所有零填充值。 ```py Expected: [47, 0, 0, 0, 0] Predicted [47, 47, 0, 0, 0] Expected: [43, 31, 0, 0, 0] Predicted [43, 31, 0, 0, 0] Expected: [14, 22, 0, 0, 0] Predicted [14, 14, 0, 0, 0] Expected: [39, 31, 0, 0, 0] Predicted [39, 39, 0, 0, 0] Expected: [6, 4, 0, 0, 0] Predicted [6, 4, 0, 0, 0] Expected: [47, 0, 0, 0, 0] Predicted [47, 47, 0, 0, 0] Expected: [39, 33, 0, 0, 0] Predicted [39, 39, 0, 0, 0] Expected: [23, 2, 0, 0, 0] Predicted [23, 23, 0, 0, 0] Expected: [19, 28, 0, 0, 0] Predicted [19, 3, 0, 0, 0] Expected: [32, 33, 0, 0, 0] Predicted [32, 32, 0, 0, 0] ``` ## 自定義Keras注意層現在我們需要關注編碼器 - 解碼器模型。在撰寫本文時，Keras沒有內置于庫中的注意力，但很快就會 [](https://github.com/fchollet/keras/pull/7980)。在Keras正式提供關注之前，我們可以開發自己的實現或使用現有的第三方實現。為了加快速度，讓我們使用現有的第三方實現。 [Zafarali Ahmed](http://www.zafarali.me/) [Datalogue](https://www.datalogue.io/) 的實習生為Keras開發了一個[自定義層](https://keras.io/layers/writing-your-own-keras-layers/)，提供了關注支持，在一篇名為“[如何可視化您的復發的帖子中提出2017年Keras](https://medium.com/datalogue/attention-in-keras-1892773a4f22) 中關注的神經網絡和GitHub項目稱為“ [keras-attention](https://github.com/datalogue/keras-attention) ”。自定義注意層稱為 _AttentionDecoder_ ，可在GitHub項目的 [custom_recurrents.py](https://github.com/datalogue/keras-attention/blob/master/models/custom_recurrents.py) 文件中找到。我們可以在項目的 [GNU Affero通用公共許可證v3.0許可證](https://github.com/datalogue/keras-attention/blob/master/LICENSE)下重用此代碼。下面列出了自定義層的副本以確保完整性。將其復制并粘貼到當前工作目錄中名為“ _attention_decoder.py_ ”的新單獨文件中。 ```py import tensorflow as tf from keras import backend as K from keras import regularizers, constraints, initializers, activations from keras.layers.recurrent import Recurrent, _time_distributed_dense from keras.engine import InputSpec tfPrint = lambda d, T: tf.Print(input_=T, data=[T, tf.shape(T)], message=d) class AttentionDecoder(Recurrent): def __init__(self, units, output_dim, activation='tanh', return_probabilities=False, name='AttentionDecoder', kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None, **kwargs): """ Implements an AttentionDecoder that takes in a sequence encoded by an encoder and outputs the decoded states :param units: dimension of the hidden state and the attention matrices :param output_dim: the number of labels in the output space references: Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014). """ self.units = units self.output_dim = output_dim self.return_probabilities = return_probabilities self.activation = activations.get(activation) self.kernel_initializer = initializers.get(kernel_initializer) self.recurrent_initializer = initializers.get(recurrent_initializer) self.bias_initializer = initializers.get(bias_initializer) self.kernel_regularizer = regularizers.get(kernel_regularizer) self.recurrent_regularizer = regularizers.get(kernel_regularizer) self.bias_regularizer = regularizers.get(bias_regularizer) self.activity_regularizer = regularizers.get(activity_regularizer) self.kernel_constraint = constraints.get(kernel_constraint) self.recurrent_constraint = constraints.get(kernel_constraint) self.bias_constraint = constraints.get(bias_constraint) super(AttentionDecoder, self).__init__(**kwargs) self.name = name self.return_sequences = True # must return sequences def build(self, input_shape): """ See Appendix 2 of Bahdanau 2014, arXiv:1409.0473 for model details that correspond to the matrices here. """ self.batch_size, self.timesteps, self.input_dim = input_shape if self.stateful: super(AttentionDecoder, self).reset_states() self.states = [None, None] # y, s """ Matrices for creating the context vector """ self.V_a = self.add_weight(shape=(self.units,), name='V_a', initializer=self.kernel_initializer, regularizer=self.kernel_regularizer, constraint=self.kernel_constraint) self.W_a = self.add_weight(shape=(self.units, self.units), name='W_a', initializer=self.kernel_initializer, regularizer=self.kernel_regularizer, constraint=self.kernel_constraint) self.U_a = self.add_weight(shape=(self.input_dim, self.units), name='U_a', initializer=self.kernel_initializer, regularizer=self.kernel_regularizer, constraint=self.kernel_constraint) self.b_a = self.add_weight(shape=(self.units,), name='b_a', initializer=self.bias_initializer, regularizer=self.bias_regularizer, constraint=self.bias_constraint) """ Matrices for the r (reset) gate """ self.C_r = self.add_weight(shape=(self.input_dim, self.units), name='C_r', initializer=self.recurrent_initializer, regularizer=self.recurrent_regularizer, constraint=self.recurrent_constraint) self.U_r = self.add_weight(shape=(self.units, self.units), name='U_r', initializer=self.recurrent_initializer, regularizer=self.recurrent_regularizer, constraint=self.recurrent_constraint) self.W_r = self.add_weight(shape=(self.output_dim, self.units), name='W_r', initializer=self.recurrent_initializer, regularizer=self.recurrent_regularizer, constraint=self.recurrent_constraint) self.b_r = self.add_weight(shape=(self.units, ), name='b_r', initializer=self.bias_initializer, regularizer=self.bias_regularizer, constraint=self.bias_constraint) """ Matrices for the z (update) gate """ self.C_z = self.add_weight(shape=(self.input_dim, self.units), name='C_z', initializer=self.recurrent_initializer, regularizer=self.recurrent_regularizer, constraint=self.recurrent_constraint) self.U_z = self.add_weight(shape=(self.units, self.units), name='U_z', initializer=self.recurrent_initializer, regularizer=self.recurrent_regularizer, constraint=self.recurrent_constraint) self.W_z = self.add_weight(shape=(self.output_dim, self.units), name='W_z', initializer=self.recurrent_initializer, regularizer=self.recurrent_regularizer, constraint=self.recurrent_constraint) self.b_z = self.add_weight(shape=(self.units, ), name='b_z', initializer=self.bias_initializer, regularizer=self.bias_regularizer, constraint=self.bias_constraint) """ Matrices for the proposal """ self.C_p = self.add_weight(shape=(self.input_dim, self.units), name='C_p', initializer=self.recurrent_initializer, regularizer=self.recurrent_regularizer, constraint=self.recurrent_constraint) self.U_p = self.add_weight(shape=(self.units, self.units), name='U_p', initializer=self.recurrent_initializer, regularizer=self.recurrent_regularizer, constraint=self.recurrent_constraint) self.W_p = self.add_weight(shape=(self.output_dim, self.units), name='W_p', initializer=self.recurrent_initializer, regularizer=self.recurrent_regularizer, constraint=self.recurrent_constraint) self.b_p = self.add_weight(shape=(self.units, ), name='b_p', initializer=self.bias_initializer, regularizer=self.bias_regularizer, constraint=self.bias_constraint) """ Matrices for making the final prediction vector """ self.C_o = self.add_weight(shape=(self.input_dim, self.output_dim), name='C_o', initializer=self.recurrent_initializer, regularizer=self.recurrent_regularizer, constraint=self.recurrent_constraint) self.U_o = self.add_weight(shape=(self.units, self.output_dim), name='U_o', initializer=self.recurrent_initializer, regularizer=self.recurrent_regularizer, constraint=self.recurrent_constraint) self.W_o = self.add_weight(shape=(self.output_dim, self.output_dim), name='W_o', initializer=self.recurrent_initializer, regularizer=self.recurrent_regularizer, constraint=self.recurrent_constraint) self.b_o = self.add_weight(shape=(self.output_dim, ), name='b_o', initializer=self.bias_initializer, regularizer=self.bias_regularizer, constraint=self.bias_constraint) # For creating the initial state: self.W_s = self.add_weight(shape=(self.input_dim, self.units), name='W_s', initializer=self.recurrent_initializer, regularizer=self.recurrent_regularizer, constraint=self.recurrent_constraint) self.input_spec = [ InputSpec(shape=(self.batch_size, self.timesteps, self.input_dim))] self.built = True def call(self, x): # store the whole sequence so we can "attend" to it at each timestep self.x_seq = x # apply the a dense layer over the time dimension of the sequence # do it here because it doesn't depend on any previous steps # thefore we can save computation time: self._uxpb = _time_distributed_dense(self.x_seq, self.U_a, b=self.b_a, input_dim=self.input_dim, timesteps=self.timesteps, output_dim=self.units) return super(AttentionDecoder, self).call(x) def get_initial_state(self, inputs): # apply the matrix on the first time step to get the initial s0. s0 = activations.tanh(K.dot(inputs[:, 0], self.W_s)) # from keras.layers.recurrent to initialize a vector of (batchsize, # output_dim) y0 = K.zeros_like(inputs) # (samples, timesteps, input_dims) y0 = K.sum(y0, axis=(1, 2)) # (samples, ) y0 = K.expand_dims(y0) # (samples, 1) y0 = K.tile(y0, [1, self.output_dim]) return [y0, s0] def step(self, x, states): ytm, stm = states # repeat the hidden state to the length of the sequence _stm = K.repeat(stm, self.timesteps) # now multiplty the weight matrix with the repeated hidden state _Wxstm = K.dot(_stm, self.W_a) # calculate the attention probabilities # this relates how much other timesteps contributed to this one. et = K.dot(activations.tanh(_Wxstm + self._uxpb), K.expand_dims(self.V_a)) at = K.exp(et) at_sum = K.sum(at, axis=1) at_sum_repeated = K.repeat(at_sum, self.timesteps) at /= at_sum_repeated # vector of size (batchsize, timesteps, 1) # calculate the context vector context = K.squeeze(K.batch_dot(at, self.x_seq, axes=1), axis=1) # ~~~> calculate new hidden state # first calculate the "r" gate: rt = activations.sigmoid( K.dot(ytm, self.W_r) + K.dot(stm, self.U_r) + K.dot(context, self.C_r) + self.b_r) # now calculate the "z" gate zt = activations.sigmoid( K.dot(ytm, self.W_z) + K.dot(stm, self.U_z) + K.dot(context, self.C_z) + self.b_z) # calculate the proposal hidden state: s_tp = activations.tanh( K.dot(ytm, self.W_p) + K.dot((rt * stm), self.U_p) + K.dot(context, self.C_p) + self.b_p) # new hidden state: st = (1-zt)*stm + zt * s_tp yt = activations.softmax( K.dot(ytm, self.W_o) + K.dot(stm, self.U_o) + K.dot(context, self.C_o) + self.b_o) if self.return_probabilities: return at, [yt, st] else: return yt, [yt, st] def compute_output_shape(self, input_shape): """ For Keras internal compatability checking """ if self.return_probabilities: return (None, self.timesteps, self.timesteps) else: return (None, self.timesteps, self.output_dim) def get_config(self): """ For rebuilding models on load time. """ config = { 'output_dim': self.output_dim, 'units': self.units, 'return_probabilities': self.return_probabilities } base_config = super(AttentionDecoder, self).get_config() return dict(list(base_config.items()) + list(config.items())) ``` 我們可以通過以下方式導入項目中來使用此自定義層： ```py from attention_decoder import AttentionDecoder ``` 如Bahdanau等人所述，該層實現了關注。在他們的論文“[神經機器翻譯中通過聯合學習來對齊和翻譯](https://arxiv.org/abs/1409.0473)。” 該代碼在原始帖子中得到了很好的解釋，并與LSTM和注意力方程相關聯。該實現的限制是它必須輸出與輸入序列長度相同的序列，編碼器 - 解碼器架構被設計為克服的特定限制。重要的是，新層管理由第二LSTM執行的重復解碼，以及由編碼器 - 解碼器模型中的密集輸出層執行的模型的softmax輸出，而沒有注意。這極大地簡化了模型的代碼。值得注意的是，自定義層建立在Keras的 [Recurrent](https://github.com/fchollet/keras/blob/master/keras/legacy/layers.py#L762) 層上，在編寫本文時，它被標記為遺留代碼，并且可能會在某個時候從項目中刪除。 ## 編碼器解碼器注意現在我們已經可以使用我們的注意力實現，我們可以開發一個編碼器 - 解碼器模型，注意我們設計的序列預測問題。具有關注層的模型定義如下。我們可以看到該層處理編碼器 - 解碼器模型本身的一些機制，使得定義模型更簡單。 ```py # define model model = Sequential() model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True)) model.add(AttentionDecoder(150, n_features)) model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc']) ``` 而已。示例的其余部分是相同的。下面列出了完整的示例。 ```py from random import randint from numpy import array from numpy import argmax from numpy import array_equal from keras.models import Sequential from keras.layers import LSTM from attention_decoder import AttentionDecoder # generate a sequence of random integers def generate_sequence(length, n_unique): return [randint(0, n_unique-1) for _ in range(length)] # one hot encode sequence def one_hot_encode(sequence, n_unique): encoding = list() for value in sequence: vector = [0 for _ in range(n_unique)] vector[value] = 1 encoding.append(vector) return array(encoding) # decode a one hot encoded string def one_hot_decode(encoded_seq): return [argmax(vector) for vector in encoded_seq] # prepare data for the LSTM def get_pair(n_in, n_out, cardinality): # generate random sequence sequence_in = generate_sequence(n_in, cardinality) sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)] # one hot encode X = one_hot_encode(sequence_in, cardinality) y = one_hot_encode(sequence_out, cardinality) # reshape as 3D X = X.reshape((1, X.shape[0], X.shape[1])) y = y.reshape((1, y.shape[0], y.shape[1])) return X,y # configure problem n_features = 50 n_timesteps_in = 5 n_timesteps_out = 2 # define model model = Sequential() model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True)) model.add(AttentionDecoder(150, n_features)) model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc']) # train LSTM for epoch in range(5000): # generate new random sequence X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features) # fit model for one epoch on this sequence model.fit(X, y, epochs=1, verbose=2) # evaluate LSTM total, correct = 100, 0 for _ in range(total): X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features) yhat = model.predict(X, verbose=0) if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])): correct += 1 print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0)) # spot check some examples for _ in range(10): X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features) yhat = model.predict(X, verbose=0) print('Expected:', one_hot_decode(y[0]), 'Predicted', one_hot_decode(yhat[0])) ``` 運行該示例在100個隨機生成的輸入 - 輸出對上打印模型的技能。在相同的資源和相同數量的訓練下，關注的模型表現得更好。鑒于神經網絡的隨機性，您的結果可能會有所不同。嘗試運行幾次示例。 ```py Accuracy: 95.00% ``` 通過抽查一些樣本輸出和預測序列，我們可以看到很少的錯誤，即使在前兩個元素中存在零值的情況下也是如此。 ```py Expected: [48, 47, 0, 0, 0] Predicted [48, 47, 0, 0, 0] Expected: [7, 46, 0, 0, 0] Predicted [7, 46, 0, 0, 0] Expected: [32, 30, 0, 0, 0] Predicted [32, 2, 0, 0, 0] Expected: [3, 25, 0, 0, 0] Predicted [3, 25, 0, 0, 0] Expected: [45, 4, 0, 0, 0] Predicted [45, 4, 0, 0, 0] Expected: [49, 9, 0, 0, 0] Predicted [49, 9, 0, 0, 0] Expected: [22, 23, 0, 0, 0] Predicted [22, 23, 0, 0, 0] Expected: [29, 36, 0, 0, 0] Predicted [29, 36, 0, 0, 0] Expected: [0, 29, 0, 0, 0] Predicted [0, 29, 0, 0, 0] Expected: [11, 26, 0, 0, 0] Predicted [11, 26, 0, 0, 0] ``` ## 模型比較盡管我們從模型中獲得了更好的結果，但是每個模型的單次運行都會報告結果。在這種情況下，我們通過多次重復評估每個模型并報告這些運行的平均表現來尋求更穩健的發現。有關評估神經網絡模型的這種強大方法的更多信息，請參閱帖子： * [如何評估深度學習模型的技巧](https://machinelearningmastery.com/evaluate-skill-deep-learning-models/) 我們可以定義一個函數來創建每種類型的模型，如下所示。 ```py # define the encoder-decoder model def baseline_model(n_timesteps_in, n_features): model = Sequential() model.add(LSTM(150, input_shape=(n_timesteps_in, n_features))) model.add(RepeatVector(n_timesteps_in)) model.add(LSTM(150, return_sequences=True)) model.add(TimeDistributed(Dense(n_features, activation='softmax'))) model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc']) return model # define the encoder-decoder with attention model def attention_model(n_timesteps_in, n_features): model = Sequential() model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True)) model.add(AttentionDecoder(150, n_features)) model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc']) return model ``` 然后，我們可以定義一個函數來擬合和評估擬合模型的準確性并返回準確度分數。 ```py # train and evaluate a model, return accuracy def train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features): # train LSTM for epoch in range(5000): # generate new random sequence X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features) # fit model for one epoch on this sequence model.fit(X, y, epochs=1, verbose=0) # evaluate LSTM total, correct = 100, 0 for _ in range(total): X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features) yhat = model.predict(X, verbose=0) if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])): correct += 1 return float(correct)/float(total)*100.0 ``` 綜上所述，我們可以多次重復創建，訓練和評估每種類型的模型，并報告重復的平均準確度。為了減少運行時間，我們將重復每次模型評估10次，但如果您有資源，則可以將此值增加到30或100次。 The complete example is listed below. ```py from random import randint from numpy import array from numpy import argmax from numpy import array_equal from keras.models import Sequential from keras.layers import LSTM from keras.layers import Dense from keras.layers import TimeDistributed from keras.layers import RepeatVector from attention_decoder import AttentionDecoder # generate a sequence of random integers def generate_sequence(length, n_unique): return [randint(0, n_unique-1) for _ in range(length)] # one hot encode sequence def one_hot_encode(sequence, n_unique): encoding = list() for value in sequence: vector = [0 for _ in range(n_unique)] vector[value] = 1 encoding.append(vector) return array(encoding) # decode a one hot encoded string def one_hot_decode(encoded_seq): return [argmax(vector) for vector in encoded_seq] # prepare data for the LSTM def get_pair(n_in, n_out, cardinality): # generate random sequence sequence_in = generate_sequence(n_in, cardinality) sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)] # one hot encode X = one_hot_encode(sequence_in, cardinality) y = one_hot_encode(sequence_out, cardinality) # reshape as 3D X = X.reshape((1, X.shape[0], X.shape[1])) y = y.reshape((1, y.shape[0], y.shape[1])) return X,y # define the encoder-decoder model def baseline_model(n_timesteps_in, n_features): model = Sequential() model.add(LSTM(150, input_shape=(n_timesteps_in, n_features))) model.add(RepeatVector(n_timesteps_in)) model.add(LSTM(150, return_sequences=True)) model.add(TimeDistributed(Dense(n_features, activation='softmax'))) model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc']) return model # define the encoder-decoder with attention model def attention_model(n_timesteps_in, n_features): model = Sequential() model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True)) model.add(AttentionDecoder(150, n_features)) model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc']) return model # train and evaluate a model, return accuracy def train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features): # train LSTM for epoch in range(5000): # generate new random sequence X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features) # fit model for one epoch on this sequence model.fit(X, y, epochs=1, verbose=0) # evaluate LSTM total, correct = 100, 0 for _ in range(total): X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features) yhat = model.predict(X, verbose=0) if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])): correct += 1 return float(correct)/float(total)*100.0 # configure problem n_features = 50 n_timesteps_in = 5 n_timesteps_out = 2 n_repeats = 10 # evaluate encoder-decoder model print('Encoder-Decoder Model') results = list() for _ in range(n_repeats): model = baseline_model(n_timesteps_in, n_features) accuracy = train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features) results.append(accuracy) print(accuracy) print('Mean Accuracy: %.2f%%' % (sum(results)/float(n_repeats))) # evaluate encoder-decoder with attention model print('Encoder-Decoder With Attention Model') results = list() for _ in range(n_repeats): model = attention_model(n_timesteps_in, n_features) accuracy = train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features) results.append(accuracy) print(accuracy) print('Mean Accuracy: %.2f%%' % (sum(results)/float(n_repeats))) ``` 運行此示例將打印每個模型重復的準確性，以便您了解運行的進度。 ```py Encoder-Decoder Model 20.0 23.0 23.0 18.0 28.000000000000004 28.999999999999996 23.0 26.0 21.0 20.0 Mean Accuracy: 23.10% Encoder-Decoder With Attention Model 98.0 91.0 94.0 93.0 96.0 99.0 97.0 94.0 99.0 96.0 Mean Accuracy: 95.70% ``` 我們可以看到，即使平均超過10次運行，注意模型仍然表現出比沒有注意的編碼器 - 解碼器模型更好的表現，23.10％對95.70％。此評估的一個很好的擴展是捕獲每個模型的每個時期的模型損失，取平均值，并比較損失如何隨著時間的推移而變化，無論是否受到關注。我希望這種追蹤能夠比非注意力模型更快，更快地顯示出更好的技能，進一步突出了這種方法的好處。 ## 進一步閱讀如果您希望深入了解，本節將提供有關該主題的更多資源。 * [長期短期記憶循環神經網絡](https://machinelearningmastery.com/attention-long-short-term-memory-recurrent-neural-networks/)的注意事項 * [編碼器 - 解碼器循環神經網絡中的注意事項如何工作](https://machinelearningmastery.com/how-does-attention-work-in-encoder-decoder-recurrent-neural-networks/) * [編碼器 - 解碼器長短期存儲器網絡](https://machinelearningmastery.com/encoder-decoder-long-short-term-memory-networks/) * [如何評估深度學習模型的技巧](https://machinelearningmastery.com/evaluate-skill-deep-learning-models/) * [如何在Keras中注意循環神經網絡](https://medium.com/datalogue/attention-in-keras-1892773a4f22)，2017。 * [keras-attention GitHub Project](https://github.com/datalogue/keras-attention) * [通過共同學習對齊和翻譯的神經機器翻譯](https://github.com/datalogue/keras-attention)，2015。 ## 摘要在本教程中，您了解了如何使用Keras在Python中開發編碼器 - 解碼器循環神經網絡。具體來說，你學到了： * 如何設計一個小的可配置問題來評估編碼器 - 解碼器循環神經網絡有無注意。 * 如何設計和評估編碼器 - 解碼器網絡，有和沒有注意序列預測問題。 * 如何有力地比較編碼器 - 解碼器網絡的表現有沒有注意。你有任何問題嗎？在下面的評論中提出您的問題，我會盡力回答。