Implementing an LSTM Model · TensorFlow Machine Learning Cookbook, Chinese Second Edition

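# Implementing an LSTM Model

In this recipe, we will extend our RNN model to handle longer sequences by introducing LSTM units.

## Getting ready

Long short-term memory (LSTM) is a variant of the traditional RNN. LSTM is a way to address the vanishing/exploding gradient problem that variable-length RNNs suffer from. To address it, LSTM cells introduce an internal forget gate, which can modify the flow of information from one cell to the next. To conceptualize how this works, we will walk through a bias-free version of the LSTM equations one step at a time. The first step is the same as for a regular RNN:

![](https://img.kancloud.cn/c0/b6/c0b6487193d132da75126c5baa925f4b_1780x220.png)

To determine which values we want to forget or pass through, we evaluate candidate values as follows. These values are often referred to as the memory cells:

![](https://img.kancloud.cn/af/6c/af6c0c8c5e0abbadb68d6bf93d9c6a4f_2150x220.png)

Now we modify the candidate memory cells with a forget matrix, which is calculated as follows:

![](https://img.kancloud.cn/77/a0/77a0f876c8aac693a50319b71ea86ff6_1860x230.png)

We now combine the forget memory with the previous memory step and add it to the candidate memory cells to arrive at the new memory values:

![](https://img.kancloud.cn/02/cc/02cc376ecfad5f33c1a2846af74553a4_1640x200.png)

Now we combine everything to get the output of the cell:

![](https://img.kancloud.cn/2b/93/2b93629782163bca7a0d40d2112a9b21_2570x220.png)

Then, for the next iteration, we update h as follows:

![](https://img.kancloud.cn/f2/5e/f25e1357fd5b4e27bac9faaf9d66d57f_1510x220.png)

The idea behind LSTM is to have a self-regulating flow of information through the cells, which can forget or modify information based on what is input to the cell.

> One benefit of using TensorFlow here is that we do not have to keep track of these operations and their corresponding backpropagation attributes. TensorFlow keeps track of them and automatically updates the model variables according to the gradients specified by our loss function, optimizer, and learning rate.

For this recipe, we will use a sequence RNN with LSTM cells to try to predict the next words, trained on the works of Shakespeare. To test how well it works, we will feed the model candidate phrases, such as `thou art more`, and see whether the model can figure out which words should follow the phrase.

## How to do it...

1. First, we load the necessary libraries for the script:

```py
import os
import re
import string
import requests
import numpy as np
import collections
import random
import pickle
import matplotlib.pyplot as plt
import tensorflow as tf
```

2. Next, we start a graph session and set the RNN parameters:

```py
sess = tf.Session()

# Set RNN Parameters
min_word_freq = 5        # Keep only words that appear more often than this
rnn_size = 128           # Size of the LSTM cell
epochs = 10
batch_size = 100
learning_rate = 0.001
training_seq_len = 50    # Number of words per training sequence
embedding_size = rnn_size
save_every = 500         # Save the model every 500 iterations
eval_every = 50          # Generate sample text every 50 iterations
prime_texts = ['thou art more', 'to be or not to', 'wherefore art thou']
```

3. We set up the data and model folders and filenames, and declare the punctuation to remove. We want to keep hyphens and apostrophes because Shakespeare uses them frequently to combine words and syllables:

```py
data_dir = 'temp'
data_file = 'shakespeare.txt'
model_path = 'shakespeare_model'
full_model_dir = os.path.join(data_dir, model_path)

# Declare punctuation to remove, everything except hyphens and apostrophes
punctuation = string.punctuation
punctuation = ''.join([x for x in punctuation if x not in ['-', "'"]])
```

4. Next, we get the data. If the data file does not exist, we download and save the Shakespeare text. If it does exist, we load the data:

```py
# Make model directory
if not os.path.exists(full_model_dir):
    os.makedirs(full_model_dir)

# Make data directory
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

print('Loading Shakespeare Data')
# Check if file is downloaded.
if not os.path.isfile(os.path.join(data_dir, data_file)):
    print('Not found, downloading Shakespeare texts from www.gutenberg.org')
    shakespeare_url = 'http://www.gutenberg.org/cache/epub/100/pg100.txt'
    # Get Shakespeare text
    response = requests.get(shakespeare_url)
    shakespeare_file = response.content
    # Decode binary into string
    s_text = shakespeare_file.decode('utf-8')
    # Drop first few descriptive paragraphs.
    s_text = s_text[7675:]
    # Remove newlines
    s_text = s_text.replace('\r\n', '')
    s_text = s_text.replace('\n', '')

    # Write to file
    with open(os.path.join(data_dir, data_file), 'w') as out_conn:
        out_conn.write(s_text)
else:
    # If file has been saved, load from that file
    with open(os.path.join(data_dir, data_file), 'r') as file_conn:
        s_text = file_conn.read().replace('\n', '')
```

5. We clean the Shakespeare text by removing punctuation and extra whitespace:

```py
# Replace the unwanted punctuation with spaces
s_text = re.sub(r'[{}]'.format(punctuation), ' ', s_text)
# Collapse runs of whitespace into a single space, then trim and lowercase
s_text = re.sub(r'\s+', ' ', s_text).strip().lower()
```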
6. We now build the Shakespeare vocabulary to use. We create a function that returns two dictionaries (word to index and index to word) containing the words that appear more often than the specified frequency:

```py
def build_vocab(text, min_word_freq):
    word_counts = collections.Counter(text.split(' '))
    # Limit word counts to those more frequent than cutoff
    word_counts = {key: val for key, val in word_counts.items() if val > min_word_freq}
    # Create vocab --> index mapping
    words = word_counts.keys()
    vocab_to_ix_dict = {key: (ix + 1) for ix, key in enumerate(words)}
    # Add unknown key --> 0 index
    vocab_to_ix_dict['unknown'] = 0
    # Create index --> vocab mapping
    ix_to_vocab_dict = {val: key for key, val in vocab_to_ix_dict.items()}
    return ix_to_vocab_dict, vocab_to_ix_dict


ix2vocab, vocab2ix = build_vocab(s_text, min_word_freq)
vocab_size = len(ix2vocab) + 1
```

> Note that we have to be careful with words indexed as zero when processing text. We should reserve the zero value for padding, and possibly for unknown words as well.

7. Now that we have the vocabulary, we turn the Shakespeare text into an array of indices:

```py
s_text_words = s_text.split(' ')
s_text_ix = []
for ix, x in enumerate(s_text_words):
    try:
        s_text_ix.append(vocab2ix[x])
    except KeyError:
        # Words below the frequency cutoff map to the 'unknown' index
        s_text_ix.append(0)
s_text_ix = np.array(s_text_ix)
```

8. In this recipe, we show how to create the model in a class object. This is helpful because we want to use the same model (with the same weights) both to train on batches and to generate text from sample text. Without a class that has an internal sampling method, this would be hard to do. Ideally, this class code should live in a separate Python file that we import at the beginning of this script:

```py
class LSTM_Model():
    def __init__(self, rnn_size, batch_size, learning_rate,
                 training_seq_len, vocab_size, infer=False):
        self.rnn_size = rnn_size
        self.vocab_size = vocab_size
        self.infer = infer
        self.learning_rate = learning_rate

        # When generating text we feed one word at a time
        if infer:
            self.batch_size = 1
            self.training_seq_len = 1
        else:
            self.batch_size = batch_size
            self.training_seq_len = training_seq_len

        self.lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(rnn_size)
        self.initial_state = self.lstm_cell.zero_state(self.batch_size, tf.float32)

        self.x_data = tf.placeholder(tf.int32, [self.batch_size, self.training_seq_len])
        self.y_output = tf.placeholder(tf.int32, [self.batch_size, self.training_seq_len])

        with tf.variable_scope('lstm_vars'):
            # Softmax Output Weights
            W = tf.get_variable('W', [self.rnn_size, self.vocab_size],
                                tf.float32, tf.random_normal_initializer())
            b = tf.get_variable('b', [self.vocab_size],
                                tf.float32, tf.constant_initializer(0.0))

            # Define Embedding
            embedding_mat = tf.get_variable('embedding_mat',
                                            [self.vocab_size, self.rnn_size],
                                            tf.float32, tf.random_normal_initializer())
            embedding_output = tf.nn.embedding_lookup(embedding_mat, self.x_data)
            rnn_inputs = tf.split(embedding_output,
                                  num_or_size_splits=self.training_seq_len, axis=1)
            rnn_inputs_trimmed = [tf.squeeze(x, [1]) for x in rnn_inputs]

        # If we are inferring (generating text), we add a 'loop' function
        # Define how to get the i+1 th input from the i th output
        def inferred_loop(prev, count):
            prev_transformed = tf.matmul(prev, W) + b
            prev_symbol = tf.stop_gradient(tf.argmax(prev_transformed, 1))
            output = tf.nn.embedding_lookup(embedding_mat, prev_symbol)
            return output

        # In TensorFlow 1.x the decoder lives in tf.contrib.legacy_seq2seq
        decoder = tf.contrib.legacy_seq2seq.rnn_decoder
        outputs, last_state = decoder(rnn_inputs_trimmed,
                                      self.initial_state,
                                      self.lstm_cell,
                                      loop_function=inferred_loop if infer else None)
        # Non inferred outputs (TensorFlow 1.x argument order: values, axis)
        output = tf.reshape(tf.concat(outputs, 1), [-1, self.rnn_size])
        # Logits and output
        self.logit_output = tf.matmul(output, W) + b
        self.model_output = tf.nn.softmax(self.logit_output)

        loss_fun = tf.contrib.legacy_seq2seq.sequence_loss_by_example
        loss = loss_fun([self.logit_output],
                        [tf.reshape(self.y_output, [-1])],
                        [tf.ones([self.batch_size * self.training_seq_len])])
        self.cost = tf.reduce_sum(loss) / (self.batch_size * self.training_seq_len)
        self.final_state = last_state
        # Clip the gradients to a global norm of 4.5 before applying them
        gradients, _ = tf.clip_by_global_norm(
            tf.gradients(self.cost, tf.trainable_variables()), 4.5)
        optimizer = tf.train.AdamOptimizer(self.learning_rate)
        self.train_op = optimizer.apply_gradients(zip(gradients, tf.trainable_variables()))

    def sample(self, sess, words=ix2vocab, vocab=vocab2ix, num=10, prime_text='thou art'):
        state = sess.run(self.lstm_cell.zero_state(1, tf.float32))
        word_list = prime_text.split()
        # Warm up the LSTM state on all but the last word of the prime text
        for word in word_list[:-1]:
            x = np.zeros((1, 1))
            x[0, 0] = vocab[word]
            feed_dict = {self.x_data: x, self.initial_state: state}
            [state] = sess.run([self.final_state], feed_dict=feed_dict)

        out_sentence = prime_text
        word = word_list[-1]
        for n in range(num):
            x = np.zeros((1, 1))
            x[0, 0] = vocab[word]
            feed_dict = {self.x_data: x, self.initial_state: state}
            [model_output, state] = sess.run([self.model_output, self.final_state],
                                             feed_dict=feed_dict)
            # Greedy choice: take the most probable next word
            sample = np.argmax(model_output[0])
            # Stop when the model predicts the unknown/padding index
            if sample == 0:
                break
            word = words[sample]
            out_sentence = out_sentence + ' ' + word
        return out_sentence
```

9. Now we declare the LSTM model as well as the test model. We do this inside a variable scope and tell the scope that the test LSTM model will reuse the variables:

```py
with tf.variable_scope('lstm_model', reuse=tf.AUTO_REUSE) as scope:
    # Define LSTM Model
    lstm_model = LSTM_Model(rnn_size, batch_size, learning_rate,
                            training_seq_len, vocab_size)
    scope.reuse_variables()
    test_lstm_model = LSTM_Model(rnn_size, batch_size, learning_rate,
                                 training_seq_len, vocab_size, infer=True)
```

10. We create a saver operation and split the input text into equal batch-sized chunks. Then we initialize the variables of the model:

```py
saver = tf.train.Saver()

# Create batches for each epoch
num_batches = int(len(s_text_ix)/(batch_size * training_seq_len)) + 1
# Split up text indices into subarrays, of equal size
batches = np.array_split(s_text_ix, num_batches)
# Reshape each split into [batch_size, training_seq_len]
batches = [np.resize(x, [batch_size, training_seq_len]) for x in batches]

# Initialize all variables
init = tf.global_variables_initializer()
sess.run(init)
```
11. We can now loop through our epochs, shuffling the data before each epoch starts. The targets for our data are just the same data, shifted by one position (using the `numpy.roll()` function):

```py
train_loss = []
iteration_count = 1
for epoch in range(epochs):
    # Shuffle word indices
    random.shuffle(batches)
    # Create targets from shuffled batches
    targets = [np.roll(x, -1, axis=1) for x in batches]
    # Run through one epoch
    print('Starting Epoch #{} of {}.'.format(epoch+1, epochs))
    # Reset initial LSTM state every epoch
    state = sess.run(lstm_model.initial_state)
    for ix, batch in enumerate(batches):
        training_dict = {lstm_model.x_data: batch, lstm_model.y_output: targets[ix]}
        # Carry the LSTM state over from the previous batch
        c, h = lstm_model.initial_state
        training_dict[c] = state.c
        training_dict[h] = state.h

        temp_loss, state, _ = sess.run([lstm_model.cost, lstm_model.final_state,
                                        lstm_model.train_op],
                                       feed_dict=training_dict)
        train_loss.append(temp_loss)

        # Print status every 10 gens
        if iteration_count % 10 == 0:
            summary_nums = (iteration_count, epoch+1, ix+1, num_batches+1, temp_loss)
            print('Iteration: {}, Epoch: {}, Batch: {} out of {}, Loss: {:.2f}'.format(*summary_nums))

        # Save the model and the vocab
        if iteration_count % save_every == 0:
            # Save model
            model_file_name = os.path.join(full_model_dir, 'model')
            saver.save(sess, model_file_name, global_step=iteration_count)
            print('Model Saved To: {}'.format(model_file_name))
            # Save vocabulary
            dictionary_file = os.path.join(full_model_dir, 'vocab.pkl')
            with open(dictionary_file, 'wb') as dict_file_conn:
                pickle.dump([vocab2ix, ix2vocab], dict_file_conn)

        # Generate sample text from each prime text
        if iteration_count % eval_every == 0:
            for sample in prime_texts:
                print(test_lstm_model.sample(sess, ix2vocab, vocab2ix, num=10, prime_text=sample))

        iteration_count += 1
```

12. This results in the following output:

```py
Loading Shakespeare Data
Cleaning Text
Building Shakespeare Vocab
Vocabulary Length = 8009
Starting Epoch #1 of 10.
Iteration: 10, Epoch: 1, Batch: 10 out of 182, Loss: 10.37
Iteration: 20, Epoch: 1, Batch: 20 out of 182, Loss: 9.54
...
Iteration: 1790, Epoch: 10, Batch: 161 out of 182, Loss: 5.68
Iteration: 1800, Epoch: 10, Batch: 171 out of 182, Loss: 6.05
thou art more than i am a
to be or not to the man i have
wherefore art thou art of the long
Iteration: 1810, Epoch: 10, Batch: 181 out of 182, Loss: 5.99
```

13. Finally, here is how we plot the training loss over the course of training:

```py
plt.plot(train_loss, 'k-')
plt.title('Sequence to Sequence Loss')
plt.xlabel('Generation')
plt.ylabel('Loss')
plt.show()
```

This results in the following plot of our loss values:

![](https://img.kancloud.cn/ce/d8/ced87695c3298877752963275b78b8f1_393x281.png)

Figure 4: The sequence-to-sequence loss over all generations of the model

## How it works...

In this example, we built an RNN model with LSTM units to predict the next word, based on a Shakespearean vocabulary. Several things could improve the model: increasing the sequence size, using a decaying learning rate, or training the model for more epochs.

## There's more...

For sampling, we implemented a greedy sampler. A greedy sampler can get stuck repeating the same phrase over and over; for example, it may get stuck on `for the for the for the...`. To prevent this, we could also implement a more random way of sampling, for instance a weighted sampler based on the logits or the probability distribution of the output.
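
The weighted sampler suggested above could look something like the following. This is a minimal sketch in plain NumPy, not code from the recipe: the function name `sample_weighted`, the `temperature` parameter, and the toy probabilities are all assumptions made for illustration. It draws the next word index at random in proportion to the model's output probabilities instead of always taking the `argmax`.

```python
import numpy as np


def sample_weighted(probs, temperature=1.0, rng=None):
    """Draw one word index at random, weighted by the given probabilities.

    `probs` stands in for `model_output[0]` in the recipe's `sample()` method.
    `temperature` is an assumption of this sketch, not part of the recipe:
    values below 1 sharpen the distribution (closer to greedy sampling),
    values above 1 flatten it (more random).
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=np.float64)
    # Work in log space; the small constant avoids log(0).
    scaled = np.log(probs + 1e-12) / temperature
    # Subtract the maximum before exponentiating for numerical stability.
    weights = np.exp(scaled - scaled.max())
    weights /= weights.sum()
    return int(rng.choice(len(weights), p=weights))


# Toy distribution over a five-word vocabulary (made-up numbers).
toy_probs = [0.05, 0.40, 0.35, 0.15, 0.05]
rng = np.random.default_rng(0)
draws = [sample_weighted(toy_probs, temperature=1.0, rng=rng) for _ in range(1000)]
# How often each index was drawn; likelier words appear more often
counts = np.bincount(draws, minlength=len(toy_probs))
print(counts)
```

To try this in the recipe, one option would be to replace `np.argmax(model_output[0])` inside `sample()` with `sample_weighted(model_output[0])`. Because the second most likely word is now chosen some of the time, the generated text is less likely to loop on a single phrase, at the cost of occasionally picking less plausible words.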