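# How to Develop a Character-Based Neural Language Model in Keras

> Original: [https://machinelearningmastery.com/develop-character-based-neural-language-model-keras/](https://machinelearningmastery.com/develop-character-based-neural-language-model-keras/)

A language model predicts the next word in a sequence based on the specific words that have come before it.

It is also possible to develop language models at the character level using neural networks. The benefit of character-based language models is their small vocabulary and their flexibility in handling any words, punctuation, and other document structure. This comes at the cost of requiring larger models that are slower to train.

Nevertheless, in the field of neural language models, character-based models offer a lot of promise for a general, flexible and powerful approach to language modeling.

In this tutorial, you will discover how to develop a character-based neural language model.

After completing this tutorial, you will know:

* How to prepare text for character-based language modeling.
* How to develop a character-based language model using LSTMs.
* How to use a trained character-based language model to generate text.

Let's get started.

* **Update Feb/2018**: Minor update to the generation code for an API change in Keras 2.1.3.

![How to Develop a Character-Based Neural Language Model in Keras](img/f5b42db5f9585614acf505c93ccca994.jpg)

How to Develop a Character-Based Neural Language Model in Keras. [hedera.baltica](https://www.flickr.com/photos/hedera_baltica/33907382116/), some rights reserved.

## Tutorial Overview

This tutorial is divided into 4 parts; they are:

1. Sing a Song of Sixpence
2. Data Preparation
3. Train Language Model
4. Generate Text

## Sing a Song of Sixpence

The nursery rhyme "[Sing a Song of Sixpence](https://en.wikipedia.org/wiki/Sing_a_Song_of_Sixpence)" is well known in the West.

The first verse is common, but there is also a 4 verse version that we will use to develop our character-based language model.

It is short, so fitting the model will be fast, but not so short that we won't see anything interesting.

The complete 4 verse version we will use as source text is listed below.

```py
Sing a song of sixpence,
A pocket full of rye.
Four and twenty blackbirds,
Baked in a pie.

When the pie was opened
The birds began to sing;
Wasn't that a dainty dish,
To set before the king.

The king was in his counting house,
Counting out his money;
The queen was in the parlour,
Eating bread and honey.

The maid was in the garden,
Hanging out the clothes,
When down came a blackbird
And pecked off her nose.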
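```

Copy the text and save it in a new file in your current working directory with the file name '_rhyme.txt_'.

## Data Preparation

The first step is to prepare the text data.

We will start by defining the type of language model.

### Language Model Design

A language model must be trained on the text, and in the case of a character-based language model, the input and output sequences must be characters.

The number of characters used as input also defines the number of characters that must be provided to the model in order to elicit the first predicted character.

After the first character has been generated, it can be appended to the input sequence and used as input for the model to generate the next character.

Longer sequences offer more context for the model to learn what character to output next, but they take longer to train and impose more of a burden on seeding the model when generating text.

We will use an arbitrary length of 10 characters for this model.

There is not a lot of text, and 10 characters is a few words.

We can now transform the raw text into a form that our model can learn; specifically, input and output sequences of characters.

### Load Text

We must load the text into memory so that we can work with it.

Below is a function named _load_doc()_ that will load a text file given a filename and return the loaded text.

```py
# load doc into memory
def load_doc(filename):
    # open the file as read only; the file is closed when the block exits
    with open(filename, 'r') as file:
        # read all text
        text = file.read()
    return text
```

We can call this function with the filename of the nursery rhyme, '_rhyme.txt_', to load the text into memory. The contents of the file are then printed to screen as a sanity check.

```py
# load text
raw_text = load_doc('rhyme.txt')
print(raw_text)
```

### Clean Text

Next, we need to clean the loaded text.

We will not do much to it here. Specifically, we will strip all of the new line characters so that we have one long sequence of characters separated only by white space.

```py
# clean
tokens = raw_text.split()
raw_text = ' '.join(tokens)
```

You may want to explore other methods for data cleaning, such as normalizing the case to lowercase or removing punctuation, in an effort to reduce the final vocabulary size and develop a smaller and leaner model.

### Create Sequences

Now that we have a long list of characters, we can create the input-output sequences used to train the model.

Each input sequence will be 10 characters with one output character, making each sequence 11 characters long.

We can create the sequences by enumerating the characters in the text, starting at the 11th character at index 10.

```py
# organize into sequences of characters
length = 10
sequences = list()
for i in range(length, len(raw_text)):
    # select sequence of characters: 10 inputs plus 1 output
    seq = raw_text[i-length:i+1]
    # store
    sequences.append(seq)
print('Total Sequences: %d' % len(sequences))
```

Running this snippet, we can see that we end up with just under 400 sequences of characters for training our language model.

```py
Total Sequences: 399
```

### Save Sequences

Finally, we can save the prepared data to file so that we can load it later when we develop our model.

Below is a function _save_doc()_ that, given a list of strings and a filename, will save the strings to file, one per line.

```py
# save sequences to file, one sequence per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    with open(filename, 'w') as file:
        file.write(data)
```

We can call this function and save our prepared sequences to the file '_char_sequences.txt_' in our current working directory.

```py
# save sequences to file
out_filename = 'char_sequences.txt'
save_doc(sequences, out_filename)
```

### Complete Example

Tying all of this together, the complete code listing is provided below.

```py
# load doc into memory
def load_doc(filename):
    # open the file as read only; the file is closed when the block exits
    with open(filename, 'r') as file:
        # read all text
        text = file.read()
    return text

# save sequences to file, one sequence per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    with open(filename, 'w') as file:
        file.write(data)

# load text
raw_text = load_doc('rhyme.txt')
print(raw_text)

# clean
tokens = raw_text.split()
raw_text = ' '.join(tokens)

# organize into sequences of characters
length = 10
sequences = list()
for i in range(length, len(raw_text)):
    # select sequence of characters: 10 inputs plus 1 output
    seq = raw_text[i-length:i+1]
    # store
    sequences.append(seq)
print('Total Sequences: %d' % len(sequences))

# save sequences to file
out_filename = 'char_sequences.txt'
save_doc(sequences, out_filename)
```

Run the example to create the '_char_sequences.txt_' file.

Take a look inside and you should see something like the following:

```py
Sing a song
ing a song 
ng a song o
g a song of
 a song of 
a song of s
 song of si
song of six
ong of sixp
ng of sixpe
...
```

We are now ready to train our character-based neural language model.

## Train Language Model

In this section, we will develop a neural language model for the prepared sequence data.

The model will read encoded characters and predict the next character in the sequence. A Long Short-Term Memory recurrent neural network hidden layer will be used to learn the context from the input sequence in order to make the predictions.

### Load Data

The first step is to load the prepared character sequence data from '_char_sequences.txt_'.

We can use the same _load_doc()_ function developed in the previous section. Once loaded, we split the text by new line to give a list of sequences ready to be encoded.

```py
# load doc into memory
def load_doc(filename):
    # open the file as read only; the file is closed when the block exits
    with open(filename, 'r') as file:
        # read all text
        text = file.read()
    return text

# load
in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')
```

### Encode Sequences

The sequences of characters must be encoded as integers.

This means that each unique character will be assigned a specific integer value and each sequence of characters will be encoded as a sequence of integers.

We can create the mapping from a sorted set of the unique characters in the raw input data. The mapping is a dictionary of character values to integer values.

```py
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))
```

Next, we can process each sequence of characters one at a time and use the dictionary mapping to look up the integer value for each character.

```py
sequences = list()
for line in lines:
    # integer encode line
    encoded_seq = [mapping[char] for char in line]
    # store
    sequences.append(encoded_seq)
```

The result is a list of integer lists.

We need to know the size of the vocabulary later. We can retrieve this as the size of the dictionary mapping.

```py
# vocabulary size
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)
```

Running this piece, we can see that there are 38 unique characters in the input sequence data.

```py
Vocabulary Size: 38
```

### Split Inputs and Output

Now that the sequences have been integer encoded, we can separate the columns into input and output sequences of characters.

We can do this using a simple array slice.

```py
from numpy import array

sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
```

Next, we need to one hot encode each character. That is, each character becomes a vector as long as the vocabulary (38 elements) with a 1 marked for the specific character. This provides a more precise input representation for the network. It also provides a clear objective for the network to predict: the model outputs a probability distribution over characters, which is compared to the ideal case of all 0 values with a 1 for the actual next character.

We can use the _to_categorical()_ function in the Keras API to one hot encode the input and output sequences.

```py
from keras.utils import to_categorical

sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
X = array(sequences)
y = to_categorical(y, num_classes=vocab_size)
```

We are now ready to fit the model.

### Fit Model

The model is defined with an input layer that takes sequences that have 10 time steps and 38 features for the one hot encoded input sequences.

Rather than specify these numbers, we use the second and third dimensions of the X input data. This way, if we change the length of the sequences or the size of the vocabulary, we do not need to change the model definition.

The model has a single LSTM hidden layer with 75 memory cells, chosen with a little trial and error.

The model has a fully connected output layer that outputs one vector with a probability distribution across all characters in the vocabulary. A softmax activation function is used on the output layer to ensure the output has the properties of a probability distribution.

```py
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

# define model
model = Sequential()
model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
```

Running this prints a summary of the defined network as a sanity check.

```py
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm_1 (LSTM)                (None, 75)                34200
_________________________________________________________________
dense_1 (Dense)              (None, 38)                2888
=================================================================
Total params: 37,088
Trainable params: 37,088
Non-trainable params: 0
_________________________________________________________________
```

The model is learning a multi-class classification problem, therefore we use the categorical log loss intended for this type of problem. The efficient Adam implementation of gradient descent is used to optimize the model, and accuracy is reported at the end of each training epoch.

The model is fit for 100 training epochs, again found with a little trial and error.

```py
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(X, y, epochs=100, verbose=2)
```

### Save Model

After the model is fit, we save it to file for later use.

The Keras model API provides the _save()_ function that we can use to save the model to a single file, including weights and topology information.

```py
# save the model to file
model.save('model.h5')
```

We also save the mapping from characters to integers, which we will need to encode any input when using the model and to decode any output from the model.

```py
from pickle import dump

# save the mapping
dump(mapping, open('mapping.pkl', 'wb'))
```

### Complete Example

Tying all of this together, the complete code listing for fitting the character-based neural language model is listed below.

```py
from numpy import array
from pickle import dump
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

# load doc into memory
def load_doc(filename):
    # open the file as read only; the file is closed when the block exits
    with open(filename, 'r') as file:
        # read all text
        text = file.read()
    return text

# load
in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')

# integer encode sequences of characters
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))
sequences = list()
for line in lines:
    # integer encode line
    encoded_seq = [mapping[char] for char in line]
    # store
    sequences.append(encoded_seq)

# vocabulary size
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)

# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
X = array(sequences)
y = to_categorical(y, num_classes=vocab_size)

# define model
model = Sequential()
model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(X, y, epochs=100, verbose=2)

# save the model to file
model.save('model.h5')
# save the mapping
dump(mapping, open('mapping.pkl', 'wb'))
```

Running the example might take one minute.

You will see that the model learns the problem well, perhaps too well for generating surprising sequences of characters.

```py
...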
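Epoch 96/100
0s - loss: 0.2193 - acc: 0.9950
Epoch 97/100
0s - loss: 0.2124 - acc: 0.9950
Epoch 98/100
0s - loss: 0.2054 - acc: 0.9950
Epoch 99/100
0s - loss: 0.1982 - acc: 0.9950
Epoch 100/100
0s - loss: 0.1910 - acc: 0.9950
```

At the end of the run, you will have two files saved to the current working directory, specifically _model.h5_ and _mapping.pkl_.

Next, we can look at using the learned model.

## Generate Text

We will use the learned language model to generate new sequences of text that have the same statistical properties.

### Load Model

The first step is to load the model saved to the file '_model.h5_'.

We can use the _load_model()_ function from the Keras API.

```py
from keras.models import load_model

# load the model
model = load_model('model.h5')
```

We also need to load the pickled dictionary for mapping characters to integers from the file '_mapping.pkl_'. We will use the Pickle API to load the object.

```py
from pickle import load

# load the mapping
mapping = load(open('mapping.pkl', 'rb'))
```

We are now ready to use the loaded model.

### Generate Characters

We must provide sequences of 10 characters as input to the model in order to start the generation process. We will pick these manually.

A given input sequence needs to be prepared in the same way as the training data was prepared for the model.

First, the sequence of characters must be integer encoded using the loaded mapping.

```py
# encode the characters as integers
encoded = [mapping[char] for char in in_text]
```

Next, the sequence needs to be one hot encoded using the _to_categorical()_ Keras function.

```py
# one hot encode
encoded = to_categorical(encoded, num_classes=len(mapping))
```

We can then use the model to predict the next character in the sequence.

We use _predict_classes()_ instead of _predict()_ to directly select the integer for the character with the highest probability, instead of getting the full probability distribution across the entire set of characters.

```py
# predict character
yhat = model.predict_classes(encoded, verbose=0)
```

We can then decode this integer by looking up the mapping to see the character to which it maps.

```py
out_char = ''
for char, index in mapping.items():
    if index == yhat:
        out_char = char
        break
```

This character can then be added to the input sequence. We then need to make sure that the input sequence is 10 characters long by truncating the first character from the input sequence text.

We can use the _pad_sequences()_ function from the Keras API to perform this truncation operation.

Putting all of this together, we can define a new function named _generate_seq()_ for using the loaded model to generate new sequences of text.

```py
# generate a sequence of characters with a language model
def generate_seq(model, mapping, seq_length, seed_text, n_chars):
    in_text = seed_text
    # generate a fixed number of characters
    for _ in range(n_chars):
        # encode the characters as integers
        encoded = [mapping[char] for char in in_text]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # one hot encode
        encoded = to_categorical(encoded, num_classes=len(mapping))
        # ensure shape (1, time steps, features), whether or not
        # to_categorical() kept the leading batch dimension
        encoded = encoded.reshape(1, seq_length, len(mapping))
        # predict character
        yhat = model.predict_classes(encoded, verbose=0)
        # reverse map integer to character
        out_char = ''
        for char, index in mapping.items():
            if index == yhat:
                out_char = char
                break
        # append to input
        in_text += out_char
    return in_text
```

### Complete Example

Tying all of this together, the complete example for generating text using the fit neural language model is listed below.

```py
from pickle import load
from keras.models import load_model
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences

# generate a sequence of characters with a language model
def generate_seq(model, mapping, seq_length, seed_text, n_chars):
    in_text = seed_text
    # generate a fixed number of characters
    for _ in range(n_chars):
        # encode the characters as integers
        encoded = [mapping[char] for char in in_text]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # one hot encode
        encoded = to_categorical(encoded, num_classes=len(mapping))
        # ensure shape (1, time steps, features), whether or not
        # to_categorical() kept the leading batch dimension
        encoded = encoded.reshape(1, seq_length, len(mapping))
        # predict character
        yhat = model.predict_classes(encoded, verbose=0)
        # reverse map integer to character
        out_char = ''
        for char, index in mapping.items():
            if index == yhat:
                out_char = char
                break
        # append to input
        in_text += out_char
    return in_text

# load the model
model = load_model('model.h5')
# load the mapping
mapping = load(open('mapping.pkl', 'rb'))

# test start of rhyme
print(generate_seq(model, mapping, 10, 'Sing a son', 20))
# test mid-line
print(generate_seq(model, mapping, 10, 'king was i', 20))
# test not in original
print(generate_seq(model, mapping, 10, 'hello worl', 20))
```

Running the example generates three sequences of text.

The first is a test to see how the model does when starting from the beginning of the rhyme. The second is a test to see how well it does when starting in the middle of a line. The final example is a test to see how well it does with a sequence of characters it has never seen before.

```py
Sing a song of sixpence, A poc
king was in his counting house
hello worls e pake wofey. The 
```

We can see that the model did very well with the first two examples, as we would expect. We can also see that the model still generated something for the new text, but it is nonsense.

## Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

* **Padding**. Update the example to provide sequences line by line only and use padding to fill out each sequence to the maximum line length.
* **Sequence Length**. Experiment with different sequence lengths and see how they impact the behavior of the model.
* **Tune Model**. Experiment with different model configurations, such as the number of memory cells and epochs, and try to develop a better model for fewer resources.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

* [Sing a Song of Sixpence on Wikipedia](https://en.wikipedia.org/wiki/Sing_a_Song_of_Sixpence)
* [Text Generation With LSTM Recurrent Neural Networks in Python with Keras](https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/)
* [Keras Utils API](https://keras.io/utils/)
* [Keras Sequence Processing API](https://keras.io/preprocessing/sequence/)

## Summary

In this tutorial, you discovered how to develop a character-based neural language model.

Specifically, you learned:

* How to prepare text for character-based language modeling.
* How to develop a character-based language model using LSTMs.
* How to use a trained character-based language model to generate text.

Do you have any questions? Ask your questions in the comments below and I will do my best to answer.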
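
One follow-up note on the generation loop from the Generate Text section. To my knowledge, more recent Keras releases no longer provide _predict_classes()_, so it may help to see the same greedy loop written with an explicit argmax over the predicted probabilities. The sketch below is not from the original tutorial and uses only NumPy. The names `one_hot`, `generate_greedy` and `predict_fn` are hypothetical: `predict_fn` stands in for something like `model.predict`, and the toy predictor only exists to make the example runnable without a trained model. It also assumes the seed text is at least `seq_length` characters long, so no padding is applied.

```python
import numpy as np

def one_hot(indices, num_classes):
    """One hot encode a list of integers into an array of shape (len(indices), num_classes)."""
    encoded = np.zeros((len(indices), num_classes), dtype="float32")
    encoded[np.arange(len(indices)), indices] = 1.0
    return encoded

def generate_greedy(predict_fn, mapping, seq_length, seed_text, n_chars):
    """Append n_chars characters, each time taking the most probable next character.

    predict_fn is a hypothetical stand-in for model.predict: it receives an array of
    shape (1, seq_length, vocab_size) and returns probabilities of shape (1, vocab_size).
    Assumes len(seed_text) >= seq_length, so no padding is needed.
    """
    # build the integer-to-character lookup once instead of scanning the dict each step
    reverse_mapping = {index: char for char, index in mapping.items()}
    in_text = seed_text
    for _ in range(n_chars):
        # keep only the last seq_length characters and integer encode them
        window = in_text[-seq_length:]
        encoded = [mapping[char] for char in window]
        # one hot encode and add the leading batch dimension
        x = one_hot(encoded, len(mapping))[np.newaxis, :, :]
        # select the index with the highest predicted probability
        probs = predict_fn(x)
        yhat = int(np.argmax(probs, axis=-1)[0])
        in_text += reverse_mapping[yhat]
    return in_text

# toy stand-in for a trained model: always predicts the next symbol in the cycle a -> b -> c -> a
toy_mapping = {"a": 0, "b": 1, "c": 2}

def toy_predict(x):
    last = int(np.argmax(x[0, -1]))
    probs = np.zeros((1, 3))
    probs[0, (last + 1) % 3] = 1.0
    return probs

print(generate_greedy(toy_predict, toy_mapping, 2, "ab", 4))  # prints: abcabc
```

With the model and mapping loaded as in the complete example, a call along the lines of `generate_greedy(lambda x: model.predict(x, verbose=0), mapping, 10, 'Sing a son', 20)` should behave like _generate_seq()_, under the assumptions stated above.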