How to Develop Word-Based Neural Language Models in Python with Keras · Machine Learning Mastery blog post translations

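# How to Develop Word-Based Neural Language Models in Python with Keras

> Original: [https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/](https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/)

Language modeling involves predicting the next word in a sequence given the sequence of words already present.

A language model is a key element in many natural language processing models, such as machine translation and speech recognition. The choice of how the language model is framed must match how the language model is intended to be used.

In this tutorial, you will discover how the framing of a language model affects the skill of the model when generating short sequences from a nursery rhyme.

After completing this tutorial, you will know:

* The challenge of developing a good framing of a word-based language model for a given application.
* How to develop one-word, two-word, and line-based framings for word-based language models.
* How to generate sequences using a fit language model.

Let's get started.

![How to Develop Word-Based Neural Language Models in Python with Keras](img/d1aa5edf765e5e408fc694194fef5048.jpg)

How to Develop Word-Based Neural Language Models in Python with Keras. Photo by [Stephanie Chapman](https://www.flickr.com/photos/imcountingufoz/5602273537/), some rights reserved.

## Tutorial Overview

This tutorial is divided into 5 parts; they are:

1. Framing Language Modeling
2. Jack and Jill Nursery Rhyme
3. Model 1: One-Word-In, One-Word-Out Sequences
4. Model 2: Line-by-Line Sequence
5. Model 3: Two-Words-In, One-Word-Out Sequence

## Framing Language Modeling

A statistical language model is learned from raw text and predicts the probability of the next word in the sequence given the words already present in the sequence.

Language models are a key component in larger models for challenging natural language processing problems, like machine translation and speech recognition. They can also be developed as standalone models and used for generating new sequences that have the same statistical properties as the source text.

Language models both learn and predict one word at a time. Training the network involves providing sequences of words as input that are processed one at a time, where a prediction can be made and learned for each input sequence.

Similarly, when making predictions, the process can be seeded with one or a few words, then predicted words can be gathered and presented as input on subsequent predictions in order to build up a generated output sequence.

Therefore, each model will involve splitting the source text into input and output sequences, such that the model can learn to predict words.

There are many ways to frame the sequences from a source text for language modeling. In this tutorial, we will explore 3 different ways of developing word-based language models in the Keras deep learning library.

There is no single best approach, just different framings that may suit different applications.

## Jack and Jill Nursery Rhyme

Jack and Jill is a simple nursery rhyme. It is comprised of 4 lines, as follows:

> Jack and Jill went up the hill
> To fetch a pail of water
> Jack fell down and broke his crown
> And Jill came tumbling after

We will use this as our source text for exploring different framings of a word-based language model.

We can define this text in Python as follows:

```py
# source text
data = """ Jack and Jill went up the hill\n
 To fetch a pail of water\n
 Jack fell down and broke his crown\n
 And Jill came tumbling after\n """
```

## Model 1: One-Word-In, One-Word-Out Sequences

We can start with a very simple model.

Given one word as input, the model will learn to predict the next word in the sequence.

For example:

```py
X, y
Jack, and
and, Jill
Jill, went
...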
```

The first step is to encode the text as integers.

Each lowercase word in the source text is assigned a unique integer, and we can convert the sequences of words to sequences of integers.

Keras provides the [Tokenizer](https://keras.io/preprocessing/text/#tokenizer) class that can be used to perform this encoding. First, the Tokenizer is fit on the source text to develop the mapping from words to unique integers. Then sequences of text can be converted to sequences of integers by calling the _texts_to_sequences()_ function.

```py
# integer encode text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]
```

We will need to know the size of the vocabulary later, both for defining the word embedding layer in the model and for encoding output words using a one hot encoding.

The size of the vocabulary can be retrieved from the trained Tokenizer by accessing the _word_index_ attribute.

```py
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
```

Running this example reports a vocabulary size of 22. The rhyme contains 21 unique words, and we add one because the largest encoded word integer must be usable as an array index: words are encoded 1 to 21, so the array needs indices 0 to 21, or 22 positions.

Next, we need to create sequences of words to fit the model, with one word as input and one word as output.

```py
# create word -> word sequences
sequences = list()
for i in range(1, len(encoded)):
    sequence = encoded[i-1:i+1]
    sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
```

Running this piece shows that we have a total of 24 input-output pairs to train the network.

```py
Total Sequences: 24
```

We can then split the sequences into input (_X_) and output (_y_) elements. This is straightforward as we only have two columns in the data.

```py
# split into X and y elements
sequences = array(sequences)
X, y = sequences[:,0],sequences[:,1]
```

We will fit our model to predict a probability distribution across all words in the vocabulary. That means we need to turn the output element from a single integer into a one hot encoding, with a 0 for every word in the vocabulary and a 1 for the actual word. This gives the network a ground truth to aim for, from which we can calculate error and update the model.

Keras provides the _to_categorical()_ function that we can use to convert the integer to a one hot encoding while specifying the number of classes as the vocabulary size.

```py
# one hot encode outputs
y = to_categorical(y, num_classes=vocab_size)
```

We are now ready to define the neural network model.

The model uses a learned word embedding in the input layer. This has one real-valued vector for each word in the vocabulary, where each word vector has a specified length. In this case we will use a 10-dimensional projection. The input sequence contains a single word, therefore _input_length=1_.

The model has a single hidden LSTM layer with 50 units. This is far more than is needed. The output layer is comprised of one neuron for each word in the vocabulary and uses a softmax activation function to ensure the output is normalized to look like a probability.

```py
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
```

The structure of the network can be summarized as follows:

```py
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 1, 10)             220
_________________________________________________________________
lstm_1 (LSTM)                (None, 50)                12200
_________________________________________________________________
dense_1 (Dense)              (None, 22)                1122
=================================================================
Total params: 13,542
Trainable params: 13,542
Non-trainable params: 0
_________________________________________________________________
```

We will use this same general network structure for each example in this tutorial, with minor changes to the learned embedding layer.

Next, we can compile and fit the network on the encoded text data. Technically, we are modeling a multi-class classification problem (predict the word in the vocabulary), therefore we use the categorical cross entropy loss function. We use the efficient Adam implementation of gradient descent and track accuracy at the end of each epoch. The model is fit for 500 training epochs, again, perhaps more than is needed.

The network configuration was not tuned for this and later experiments; an over-prescribed configuration was chosen to ensure that we could focus on the framing of the language model.

```py
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)
```

After the model is fit, we test it by passing it a given word from the vocabulary and having the model predict the next word. Here we pass in '_Jack_' by encoding it and calling _model.predict_classes()_ to get the integer output for the predicted word. This is then looked up in the vocabulary mapping to give the associated word.

```py
# evaluate
in_text = 'Jack'
print(in_text)
encoded = tokenizer.texts_to_sequences([in_text])[0]
encoded = array(encoded)
yhat = model.predict_classes(encoded, verbose=0)
for word, index in tokenizer.word_index.items():
    if index == yhat:
        print(word)
```

This process could then be repeated a few times to build up a generated sequence of words.

To make this easier, we wrap up the behavior in a function that we can call by passing in our model and the seed word.

```py
# generate a sequence from the model
def generate_seq(model, tokenizer, seed_text, n_words):
    in_text, result = seed_text, seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        encoded = array(encoded)
        # predict a word in the vocabulary
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text, result = out_word, result + ' ' + out_word
    return result
```

We can put all of this together. The complete code listing is provided below.

```py
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

# generate a sequence from the model
def generate_seq(model, tokenizer, seed_text, n_words):
    in_text, result = seed_text, seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        encoded = array(encoded)
        # predict a word in the vocabulary
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text, result = out_word, result + ' ' + out_word
    return result

# source text
data = """ Jack and Jill went up the hill\n
 To fetch a pail of water\n
 Jack fell down and broke his crown\n
 And Jill came tumbling after\n """
# integer encode text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# create word -> word sequences
sequences = list()
for i in range(1, len(encoded)):
    sequence = encoded[i-1:i+1]
    sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
# split into X and y elements
sequences = array(sequences)
X, y = sequences[:,0],sequences[:,1]
# one hot encode outputs
y = to_categorical(y, num_classes=vocab_size)
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)
# evaluate
print(generate_seq(model, tokenizer, 'Jack', 6))
```

Running the example prints the loss and accuracy for each training epoch.

```py
...
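Epoch 496/500
0s - loss: 0.2358 - acc: 0.8750
Epoch 497/500
0s - loss: 0.2355 - acc: 0.8750
Epoch 498/500
0s - loss: 0.2352 - acc: 0.8750
Epoch 499/500
0s - loss: 0.2349 - acc: 0.8750
Epoch 500/500
0s - loss: 0.2346 - acc: 0.8750
```

We can see that the model does not memorize the source sequences, likely because there is some ambiguity in the input sequences, for example:

```py
jack => and
jack => fell
```

And so on.

At the end of the run, '_Jack_' is passed in and a prediction or new sequence is generated.

We get a reasonable sequence as output that has some elements of the source.

```py
Jack and jill came tumbling after down
```

This is a good first-cut language model, but it does not take full advantage of the LSTM's ability to handle sequences of input and to disambiguate some of the ambiguous pairwise sequences by using a broader context.

## Model 2: Line-by-Line Sequence

Another approach is to split up the source text line by line, then break each line down into a series of words that build up.

For example:

```py
X,                                  y
_, _, _, _, _, Jack,                and
_, _, _, _, Jack, and,              Jill
_, _, _, Jack, and, Jill,           went
_, _, Jack, and, Jill, went,        up
_, Jack, and, Jill, went, up,       the
Jack, and, Jill, went, up, the,     hill
```

This approach may allow the model to use the context of each line to help in those cases where a simple one-word-in, one-word-out model creates ambiguity.

In this case, this comes at the cost of predicting words across lines, which might be fine for now if we are only interested in modeling and generating lines of text.

Note that in this representation, we will require a padding of sequences to ensure they meet a fixed-length input. This is a requirement when using Keras.

First, we can create the sequences of integers, line by line, using the Tokenizer already fit on the source text.

```py
# create line-based sequences
sequences = list()
for line in data.split('\n'):
    encoded = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(encoded)):
        sequence = encoded[:i+1]
        sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
```

Next, we can pad the prepared sequences. We can do this using the [pad_sequences()](https://keras.io/preprocessing/sequence/#pad_sequences) function provided in Keras. This first involves finding the longest sequence, then using that as the length to which all other sequences are padded.

```py
# pad input sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)
```

Next, we can split the sequences into input and output elements, much like before.

```py
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1],sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
```

The model can then be defined as before, except the input sequences are now longer than a single word. Specifically, they are _max_length-1_ in length, -1 because when we calculated the maximum length of sequences, they included the input and output elements.

```py
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))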
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)
```

We can use the model to generate new sequences as before. The _generate_seq()_ function can be updated to build up an input sequence by adding predictions to the list of input words each iteration.

```py
# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # predict the index of the most probable next word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
    return in_text
```

Tying all of this together, the complete code example is provided below.

```py
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # predict the index of the most probable next word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
    return in_text

# source text
data = """ Jack and Jill went up the hill\n
 To fetch a pail of water\n
 Jack fell down and broke his crown\n
 And Jill came tumbling after\n """
# prepare the tokenizer on the source text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# create line-based sequences
sequences = list()
for line in data.split('\n'):
    encoded = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(encoded)):
        sequence = encoded[:i+1]
        sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
# pad input sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1],sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)
# evaluate model
print(generate_seq(model, tokenizer, max_length-1, 'Jack', 4))
print(generate_seq(model, tokenizer, max_length-1, 'Jill', 4))
```

Running the example achieves a better fit on the source data. The added context has allowed the model to disambiguate some of the examples.

There are still two lines of text that start with '_Jack_', which may still be a problem for the network.

```py
...
Epoch 496/500
0s - loss: 0.1039 - acc: 0.9524
Epoch 497/500
0s - loss: 0.1037 - acc: 0.9524
Epoch 498/500
0s - loss: 0.1035 - acc: 0.9524
Epoch 499/500
0s - loss: 0.1033 - acc: 0.9524
Epoch 500/500
0s - loss: 0.1032 - acc: 0.9524
```

At the end of the run, we generate two sequences with different seed words: '_Jack_' and '_Jill_'.

The first generated line looks good, directly matching the source text. The second is a bit strange. This makes sense, because the network only ever saw '_Jill_' within an input sequence, not at the beginning of a sequence, so it has forced an output that uses the word '_Jill_', i.e. the last line of the rhyme.

```py
Jack fell down and broke
Jill jill came tumbling after
```

This is a good example of how the framing may result in better new lines, but not in good continuations of partial lines of input.

## Model 3: Two-Words-In, One-Word-Out Sequence

We can use an intermediate between the one-word-in and the whole-sentence-in approaches and pass in a sub-sequence of words as input.

This provides a trade-off between the two framings, allowing new lines to be generated and allowing generation to be picked up mid-line.

We will use 2 words as input to predict one word as output. The preparation of the sequences is much like the first example, except with different offsets in the source sequence arrays, as follows:

```py
# encode 2 words -> 1 word
sequences = list()
for i in range(2, len(encoded)):
    sequence = encoded[i-2:i+1]
    sequences.append(sequence)
```

The complete example is listed below.

```py
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # predict the index of the most probable next word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
    return in_text

# source text
data = """ Jack and Jill went up the hill\n
 To fetch a pail of water\n
 Jack fell down and broke his crown\n
 And Jill came tumbling after\n """
# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]
# retrieve vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# encode 2 words -> 1 word
sequences = list()
for i in range(2, len(encoded)):
    sequence = encoded[i-2:i+1]
    sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
# pad sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1],sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)
# evaluate model
print(generate_seq(model, tokenizer, max_length-1, 'Jack and', 5))
print(generate_seq(model, tokenizer, max_length-1, 'And Jill', 3))
print(generate_seq(model, tokenizer, max_length-1, 'fell down', 5))
print(generate_seq(model, tokenizer, max_length-1, 'pail of', 5))
```

Running the example again gets a good fit on the source text, with an accuracy of about 96%.

```py
...
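Epoch 496/500
0s - loss: 0.0685 - acc: 0.9565
Epoch 497/500
0s - loss: 0.0685 - acc: 0.9565
Epoch 498/500
0s - loss: 0.0684 - acc: 0.9565
Epoch 499/500
0s - loss: 0.0684 - acc: 0.9565
Epoch 500/500
0s - loss: 0.0684 - acc: 0.9565
```

We look at 4 generation examples: two that start at the beginning of a line and two that start mid-line.

```py
Jack and jill went up the hill
And Jill went up the
fell down and broke his crown and
pail of water jack fell down and
```

The first start-of-line case was generated correctly, but the second was not. The second case is taken from the 4th line of the rhyme, whose opening words are ambiguous with content from the first line. Perhaps a further expansion to 3 input words would be better.

The two mid-line generation examples were generated correctly, matching the source text.

We can see that the choice of how the language model is framed and the requirements on how the model will be used must be compatible. In general, careful design is required when using language models, perhaps followed up by spot testing with sequence generation to confirm that the model requirements have been met.

## Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

* **Whole Rhyme as Sequence**. Consider updating one of the above examples to build up the entire rhyme as an input sequence. The model should be able to generate the entire thing given the seed of the first word; demonstrate this.
* **Pre-Trained Embeddings**. Explore using pre-trained word vectors in the embedding instead of learning the embedding as part of the model. This would not be required on such a small source text, but could be good practice.
* **Character Models**. Explore the use of a character-based language model for the source text instead of the word-based approach demonstrated in this tutorial.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

* [Jack and Jill on Wikipedia](https://en.wikipedia.org/wiki/Jack_and_Jill_(nursery_rhyme))
* [Language Model](https://en.wikipedia.org/wiki/Language_model) on Wikipedia
* [Keras Embedding Layer API](https://keras.io/layers/embeddings/#embedding)
* [Keras Text Processing API](https://keras.io/preprocessing/text/)
* [Keras Sequence Processing API](https://keras.io/preprocessing/sequence/)
* [Keras Utils API](https://keras.io/utils/)

## Summary

In this tutorial, you discovered how to develop different word-based language models for a simple nursery rhyme.

Specifically, you learned:

* The challenge of developing a good framing of a word-based language model for a given application.
* How to develop one-word, two-word, and line-based framings for word-based language models.
* How to generate sequences using a fit language model.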
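
The training accuracies reported for the three framings (0.8750, 0.9524 and 0.9565) can be sanity-checked without training a network. The sketch below is not the Keras model from this tutorial: it swaps the LSTM for a plain frequency table and, for each framing, counts how many training pairs a deterministic predictor could get right if it always chose the most frequent continuation of each distinct input. All names here (`RHYME`, `tokenize`, `one_word_pairs`, `line_prefix_pairs`, `two_word_pairs`, `best_accuracy`) are illustrative and do not appear in the original code, and the whitespace tokenizer is only an assumed stand-in for the Keras Tokenizer, adequate because the rhyme has no punctuation.

```python
from collections import Counter, defaultdict

# The source text, one entry per line of the rhyme.
RHYME = [
    "Jack and Jill went up the hill",
    "To fetch a pail of water",
    "Jack fell down and broke his crown",
    "And Jill came tumbling after",
]


def tokenize(text):
    # Lowercase and split on whitespace (assumed stand-in for the Keras Tokenizer).
    return text.lower().split()


def one_word_pairs(lines):
    # Model 1 framing: previous word -> next word, across the whole text.
    words = tokenize(" ".join(lines))
    return [((words[i - 1],), words[i]) for i in range(1, len(words))]


def line_prefix_pairs(lines):
    # Model 2 framing: every prefix of a line -> the word that follows it.
    pairs = []
    for line in lines:
        words = tokenize(line)
        for i in range(1, len(words)):
            pairs.append((tuple(words[:i]), words[i]))
    return pairs


def two_word_pairs(lines):
    # Model 3 framing: previous two words -> next word, across the whole text.
    words = tokenize(" ".join(lines))
    return [((words[i - 2], words[i - 1]), words[i]) for i in range(2, len(words))]


def best_accuracy(pairs):
    # A deterministic predictor gives one answer per distinct context, so at
    # best it matches the most frequent continuation of each context.
    continuations = defaultdict(Counter)
    for context, target in pairs:
        continuations[context][target] += 1
    correct = sum(max(counts.values()) for counts in continuations.values())
    return correct / len(pairs)


if __name__ == "__main__":
    framings = [
        ("one word in", one_word_pairs),
        ("line by line", line_prefix_pairs),
        ("two words in", two_word_pairs),
    ]
    for name, build in framings:
        pairs = build(RHYME)
        print("%s: %d pairs, ceiling %.4f" % (name, len(pairs), best_accuracy(pairs)))
```

Run as a script, this prints 24, 21 and 23 pairs with ceilings of 0.8750, 0.9524 and 0.9565. These equal the final training accuracies in the logs above, which suggests the remaining errors come from ambiguous inputs (for example '_jack_', '_and_' and '_jill_' in the one-word framing) rather than from under-training.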