
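# How to Develop a Neural Machine Translation System from Scratch

> Original: [https://machinelearningmastery.com/develop-neural-machine-translation-system-keras/](https://machinelearningmastery.com/develop-neural-machine-translation-system-keras/)

#### Develop a deep learning model to automatically translate from German to English in Python with Keras, step by step.

Machine translation is a challenging task that traditionally involves large statistical models developed using highly sophisticated linguistic knowledge.

[Neural machine translation](https://machinelearningmastery.com/introduction-neural-machine-translation/) is the use of deep neural networks for the problem of machine translation.

In this tutorial, you will discover how to develop a neural machine translation system for translating German phrases to English.

After completing this tutorial, you will know:

* How to clean and prepare data ready to train a neural machine translation system.
* How to develop an encoder-decoder model for machine translation.
* How to use a trained model for inference on new input phrases and evaluate the model skill.

Let's get started.

**Note**: This post is an excerpt from "[Deep Learning for Natural Language Processing](https://machinelearningmastery.com/deep-learning-for-nlp/)". Take a look if you want more step-by-step tutorials on getting the most out of deep learning methods when working with text data.

![How to Develop a Neural Machine Translation System in Keras](img/0f92cb4ebcdf0a35d478ceb006527e87.jpg)

How to Develop a Neural Machine Translation System in Keras. Photo by [Björn Groß](https://www.flickr.com/photos/damescalito/34527830324/), some rights reserved.

## Tutorial Overview

This tutorial is divided into 4 parts; they are:

1. German to English Translation Dataset
2. Preparing the Text Data
3. Train Neural Translation Model
4. Evaluate Neural Translation Model

### Python Environment

This tutorial assumes you have a Python 3 SciPy environment installed.

You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have NumPy and Matplotlib installed.

If you need help with your environment, see this post:

* [How to Setup a Python Environment for Machine Learning and Deep Learning with Anaconda](https://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/)

A GPU is not required for this tutorial; nevertheless, you can access GPUs cheaply on Amazon Web Services. Learn how in this tutorial:

* [How to Setup Amazon AWS EC2 GPUs to Train Keras Deep Learning Models (step-by-step)](https://machinelearningmastery.com/develop-evaluate-large-deep-learning-models-keras-amazon-web-services/)

Let's dive in.

## German to English Translation Dataset

In this tutorial, we will use a dataset of German to English terms used as the basis for flashcards for language learning.

The dataset is available from the [ManyThings.org](http://www.manythings.org) website, with examples drawn from the [Tatoeba Project](http://tatoeba.org/home). The dataset is comprised of German phrases and their English counterparts and is intended to be used with the [Anki flashcard software](https://apps.ankiweb.net/).

The page provides a list of many language pairs, and I encourage you to explore other languages:

* [Tab-delimited Bilingual Sentence Pairs](http://www.manythings.org/anki/)

The dataset we will use in this tutorial is available for download here:

* [German - English deu-eng.zip](http://www.manythings.org/anki/deu-eng.zip)

Download the dataset to your current working directory and decompress it; for example:

```py
unzip deu-eng.zip
```

You will have a file called _deu.txt_ that contains 152,820 pairs of English to German phrases, one pair per line, with a tab separating the two languages.

For example, the first 5 lines of the file look as follows:

```py
Hi.    Hallo!
Hi.    Grüß Gott!
Run!    Lauf!
Wow!    Potzdonner!
Wow!    Donnerwetter!
```

We will frame the prediction problem as follows: given a sequence of words in German as input, translate or predict the sequence of words in English.

The model we will develop will be suitable for some beginner German phrases.

## Preparing the Text Data

The next step is to prepare the text data ready for modeling.

If you are new to cleaning text data, see this post:

* [How to Clean Text for Machine Learning with Python](https://machinelearningmastery.com/clean-text-machine-learning-python/)

Take a look at the raw data and note what you see that we might need to handle in a data cleaning operation.

For example, here are some observations I noted from reviewing the raw data:

* There is punctuation.
* The text contains uppercase and lowercase.
* There are special characters in the German.
* There are duplicate phrases in English with different translations in German.
* The file is ordered by sentence length, with very long sentences toward the end of the file.

Did you note anything else that could be important? Let me know in the comments below.

A good text cleaning procedure may handle some or all of these observations.

Data preparation is divided into two subsections:

1. Clean Text
2. Split Text

### 1. Clean Text

First, we must load the data in a way that preserves the Unicode German characters. The function below called _load_doc()_ will load the file as a blob of text.

```py
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, mode='rt', encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
```

Each line contains a single pair of phrases, first English and then German, separated by a tab character.

We must split the loaded text by line and then by phrase. The function _to_pairs()_ below will split the loaded text.

```py
# split a loaded document into sentences
def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in lines]
    return pairs
```

We are now ready to clean each sentence. The specific cleaning operations we will perform are as follows:

* Remove all non-printable characters.
* Remove all punctuation characters.
* Normalize all Unicode characters to ASCII (e.g. Latin characters).
* Normalize the case to lowercase.
* Remove any remaining tokens that are not alphabetic.

We will perform these operations on each phrase for each pair in the loaded dataset.

The _clean_pairs()_ function below implements these operations.

```py
# clean a list of lines
def clean_pairs(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for pair in lines:
        clean_pair = list()
        for line in pair:
            # normalize unicode characters
            line = normalize('NFD', line).encode('ascii', 'ignore')
            line = line.decode('UTF-8')
            # tokenize on white space
            line = line.split()
            # convert to lowercase
            line = [word.lower() for word in line]
            # remove punctuation from each token
            line = [word.translate(table) for word in line]
            # remove non-printable chars from each token
            line = [re_print.sub('', w) for w in line]
            # remove tokens with numbers in them
            line = [word for word in line if word.isalpha()]
            # store as string
            clean_pair.append(' '.join(line))
        cleaned.append(clean_pair)
    return array(cleaned)
```

Finally, now that the data has been cleaned, we can save the list of phrase pairs to a file ready for use.

The function _save_clean_data()_ uses the pickle API to save the list of clean text to file.

Pulling all of this together, the complete example is listed below.

```py
import string
import re
from pickle import dump
from unicodedata import normalize
from numpy import array

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, mode='rt', encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# split a loaded document into sentences
def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in lines]
    return pairs

# clean a list of lines
def clean_pairs(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for pair in lines:
        clean_pair = list()
        for line in pair:
            # normalize unicode characters
            line = normalize('NFD', line).encode('ascii', 'ignore')
            line = line.decode('UTF-8')
            # tokenize on white space
            line = line.split()
            # convert to lowercase
            line = [word.lower() for word in line]
            # remove punctuation from each token
            line = [word.translate(table) for word in line]
            # remove non-printable chars from each token
            line = [re_print.sub('', w) for w in line]
            # remove tokens with numbers in them
            line = [word for word in line if word.isalpha()]
            # store as string
            clean_pair.append(' '.join(line))
        cleaned.append(clean_pair)
    return array(cleaned)

# save a list of clean sentences to file
def save_clean_data(sentences, filename):
    dump(sentences, open(filename, 'wb'))
    print('Saved: %s' % filename)

# load dataset
filename = 'deu.txt'
doc = load_doc(filename)
# split into english-german pairs
pairs = to_pairs(doc)
# clean sentences
clean_pairs = clean_pairs(pairs)
# save clean pairs to file
save_clean_data(clean_pairs, 'english-german.pkl')
# spot check
for i in range(100):
    print('[%s] => [%s]' % (clean_pairs[i,0], clean_pairs[i,1]))
```

Running the example creates a new file in the current working directory, called _english-german.pkl_, that contains the cleaned text.

Some examples of the clean text are printed at the end of the run so that we can confirm the cleaning operations were performed as expected.

```py
[hi] => [hallo]
[hi] => [gru gott]
[run] => [lauf]
[wow] => [potzdonner]
[wow] => [donnerwetter]
[fire] => [feuer]
[help] => [hilfe]
[help] => [zu hulf]
[stop] => [stopp]
[wait] => [warte]
...
```

### 2. Split Text

The clean data contains a little over 150,000 phrase pairs, and some of the pairs toward the end of the file are very long.

This is a good number of examples for developing a small translation model. The complexity of the model increases with the number of examples, the length of the phrases, and the size of the vocabulary.

Although we have a good dataset for modeling translation, we will simplify the problem slightly to dramatically reduce the size of the model required, and in turn the training time required to fit the model.

You can explore developing a model on the fuller dataset as an extension; I would love to hear how you do.

We will simplify the problem by reducing the dataset to the first 10,000 examples in the file; these will be the shortest phrases in the dataset.

Further, after shuffling these examples, we will take the first 9,000 as examples for training and the remaining 1,000 examples to test the fit model.

Below is the complete example of loading the clean data, splitting it, and saving the split portions of data to new files.

```py
from pickle import load
from pickle import dump
from numpy.random import shuffle

# load a clean dataset
def load_clean_sentences(filename):
    return load(open(filename, 'rb'))

# save a list of clean sentences to file
def save_clean_data(sentences, filename):
    dump(sentences, open(filename, 'wb'))
    print('Saved: %s' % filename)

# load dataset
raw_dataset = load_clean_sentences('english-german.pkl')

# reduce dataset size
n_sentences = 10000
dataset = raw_dataset[:n_sentences, :]
# random shuffle
shuffle(dataset)
# split into train/test
train, test = dataset[:9000], dataset[9000:]
# save
save_clean_data(dataset, 'english-german-both.pkl')
save_clean_data(train, 'english-german-train.pkl')
save_clean_data(test, 'english-german-test.pkl')
```

Running the example creates three new files: _english-german-both.pkl_, which contains all of the train and test examples and which we can use to define the parameters of the problem, such as maximum phrase lengths and the vocabulary, and the _english-german-train.pkl_ and _english-german-test.pkl_ files for the train and test datasets.

We are now ready to start developing our translation model.

## Train Neural Translation Model

In this section, we will develop the neural translation model.

If you are new to neural translation models, see the post:

* [A Gentle Introduction to Neural Machine Translation](https://machinelearningmastery.com/introduction-neural-machine-translation/)

This involves both loading and preparing the clean text data ready for modeling, and defining and training the model on the prepared data.

Let's start off by loading the datasets so that we can prepare the data. The function below named _load_clean_sentences()_ can be used to load the train, test, and both datasets in turn.

```py
# load a clean dataset
def load_clean_sentences(filename):
    return load(open(filename, 'rb'))

# load datasets
dataset = load_clean_sentences('english-german-both.pkl')
train = load_clean_sentences('english-german-train.pkl')
test = load_clean_sentences('english-german-test.pkl')
```

We will use the "both" dataset, the combination of the train and test datasets, to define the maximum length and vocabulary of the problem.

This is for simplicity. Alternatively, we could define these properties from the training dataset alone and truncate examples in the test set that are too long or that contain words outside the vocabulary.

We can use the Keras _Tokenizer_ class to map words to integers, as needed for modeling. We will use separate tokenizers for the English sequences and the German sequences. The function below named _create_tokenizer()_ will train a tokenizer on a list of phrases.

```py
# fit a tokenizer
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer
```

Similarly, the function below named _max_length()_ will find the length of the longest sequence in a list of phrases.

```py
# max sentence length
def max_length(lines):
    return max(len(line.split()) for line in lines)
```

We can call these functions with the combined dataset to prepare tokenizers, vocabulary sizes, and maximum lengths for both the English and German phrases.

```py
# prepare english tokenizer
eng_tokenizer = create_tokenizer(dataset[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = max_length(dataset[:, 0])
print('English Vocabulary Size: %d' % eng_vocab_size)
print('English Max Length: %d' % (eng_length))
# prepare german tokenizer
ger_tokenizer = create_tokenizer(dataset[:, 1])
ger_vocab_size = len(ger_tokenizer.word_index) + 1
ger_length = max_length(dataset[:, 1])
print('German Vocabulary Size: %d' % ger_vocab_size)
print('German Max Length: %d' % (ger_length))
```

We are now ready to prepare the training dataset.

Each input and output sequence must be encoded to integers and padded to the maximum phrase length. This is because we will use a word embedding for the input sequences and one hot encode the output sequences. The function below named _encode_sequences()_ will perform these operations and return the result.

```py
# encode and pad sequences
def encode_sequences(tokenizer, length, lines):
    # integer encode sequences
    X = tokenizer.texts_to_sequences(lines)
    # pad sequences with 0 values
    X = pad_sequences(X, maxlen=length, padding='post')
    return X
```

The output sequence needs to be one hot encoded. This is because the model will predict the probability of each word in the vocabulary as output.

The function _encode_output()_ below will one hot encode the English output sequences.

```py
# one hot encode target sequence
def encode_output(sequences, vocab_size):
    ylist = list()
    for sequence in sequences:
        encoded = to_categorical(sequence, num_classes=vocab_size)
        ylist.append(encoded)
    y = array(ylist)
    y = y.reshape(sequences.shape[0], sequences.shape[1], vocab_size)
    return y
```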
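To make the shapes produced by these two encoding steps concrete, here is a minimal NumPy-only sketch of post-padding and one hot encoding on a toy dataset. The helper names `pad_post` and `one_hot` and the toy sequences are hypothetical and stand in for the Keras utilities used in the tutorial; note that this sketch truncates over-long sequences at the end, whereas Keras `pad_sequences` truncates from the front by default.

```python
import numpy as np

def pad_post(sequences, length):
    """Pad (or truncate) integer sequences with trailing zeros to a fixed length."""
    padded = np.zeros((len(sequences), length), dtype=int)
    for row, seq in enumerate(sequences):
        trimmed = seq[:length]  # assumption: truncate at the end for simplicity
        padded[row, :len(trimmed)] = trimmed
    return padded

def one_hot(sequences, vocab_size):
    """One hot encode a 2D integer array into (samples, timesteps, vocab_size)."""
    # indexing the identity matrix by integer id selects the matching one hot row
    return np.eye(vocab_size, dtype=int)[sequences]

# three toy "phrases" already mapped to integer word ids (0 is reserved for padding)
toy = [[1, 2], [3], [2, 1, 3]]
X = pad_post(toy, 4)
y = one_hot(X, 4)
print(X)
print(X.shape, y.shape)
```

Each padded position maps to the one hot vector for index 0, which is why the vocabulary sizes in the tutorial are computed as `len(word_index) + 1`.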
We can make use of these two functions and prepare both the train and test datasets ready for training the model.

```py
# prepare training data
trainX = encode_sequences(ger_tokenizer, ger_length, train[:, 1])
trainY = encode_sequences(eng_tokenizer, eng_length, train[:, 0])
trainY = encode_output(trainY, eng_vocab_size)
# prepare validation data
testX = encode_sequences(ger_tokenizer, ger_length, test[:, 1])
testY = encode_sequences(eng_tokenizer, eng_length, test[:, 0])
testY = encode_output(testY, eng_vocab_size)
```

We are now ready to define the model.

We will use an encoder-decoder LSTM model on this problem. In this architecture, the input sequence is encoded by a front-end model called the encoder and then decoded word by word by a back-end model called the decoder.

The function _define_model()_ below defines the model and takes a number of arguments used to configure it, such as the size of the input and output vocabularies, the maximum length of input and output phrases, and the number of memory units.

The model is trained using the efficient Adam approach to stochastic gradient descent and minimizes the categorical cross-entropy loss function, because we have framed the prediction problem as multi-class classification.

The model configuration was not optimized for this problem, meaning that there is plenty of opportunity for you to tune it and lift the skill of the translations. I would love to see what you can come up with.

For more advice on configuring neural machine translation models, see the post:

* [How to Configure an Encoder-Decoder Model for Neural Machine Translation](https://machinelearningmastery.com/configure-encoder-decoder-model-neural-machine-translation/)

```py
# define NMT model
def define_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
    model = Sequential()
    model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
    model.add(LSTM(n_units))
    model.add(RepeatVector(tar_timesteps))
    model.add(LSTM(n_units, return_sequences=True))
    model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
    return model

# define model
model = define_model(ger_vocab_size, eng_vocab_size, ger_length, eng_length, 256)
model.compile(optimizer='adam', loss='categorical_crossentropy')
# summarize defined model
print(model.summary())
plot_model(model, to_file='model.png', show_shapes=True)
```

Finally, we can train the model.

We train the model for 30 epochs with a batch size of 64 examples.

We use checkpointing to ensure that each time the model skill on the test set improves, the model is saved to file.

```py
# fit model
filename = 'model.h5'
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
model.fit(trainX, trainY, epochs=30, batch_size=64, validation_data=(testX, testY), callbacks=[checkpoint], verbose=2)
```

We can tie all of this together and fit the neural translation model.

The complete working example is listed below.

```py
from pickle import load
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import RepeatVector
from keras.layers import TimeDistributed
from keras.callbacks import ModelCheckpoint

# load a clean dataset
def load_clean_sentences(filename):
    return load(open(filename, 'rb'))

# fit a tokenizer
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# max sentence length
def max_length(lines):
    return max(len(line.split()) for line in lines)

# encode and pad sequences
def encode_sequences(tokenizer, length, lines):
    # integer encode sequences
    X = tokenizer.texts_to_sequences(lines)
    # pad sequences with 0 values
    X = pad_sequences(X, maxlen=length, padding='post')
    return X

# one hot encode target sequence
def encode_output(sequences, vocab_size):
    ylist = list()
    for sequence in sequences:
        encoded = to_categorical(sequence, num_classes=vocab_size)
        ylist.append(encoded)
    y = array(ylist)
    y = y.reshape(sequences.shape[0], sequences.shape[1], vocab_size)
    return y

# define NMT model
def define_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
    model = Sequential()
    model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
    model.add(LSTM(n_units))
    model.add(RepeatVector(tar_timesteps))
    model.add(LSTM(n_units, return_sequences=True))
    model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
    return model

# load datasets
dataset = load_clean_sentences('english-german-both.pkl')
train = load_clean_sentences('english-german-train.pkl')
test = load_clean_sentences('english-german-test.pkl')

# prepare english tokenizer
eng_tokenizer = create_tokenizer(dataset[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = max_length(dataset[:, 0])
print('English Vocabulary Size: %d' % eng_vocab_size)
print('English Max Length: %d' % (eng_length))
# prepare german tokenizer
ger_tokenizer = create_tokenizer(dataset[:, 1])
ger_vocab_size = len(ger_tokenizer.word_index) + 1
ger_length = max_length(dataset[:, 1])
print('German Vocabulary Size: %d' % ger_vocab_size)
print('German Max Length: %d' % (ger_length))

# prepare training data
trainX = encode_sequences(ger_tokenizer, ger_length, train[:, 1])
trainY = encode_sequences(eng_tokenizer, eng_length, train[:, 0])
trainY = encode_output(trainY, eng_vocab_size)
# prepare validation data
testX = encode_sequences(ger_tokenizer, ger_length, test[:, 1])
testY = encode_sequences(eng_tokenizer, eng_length, test[:, 0])
testY = encode_output(testY, eng_vocab_size)

# define model
model = define_model(ger_vocab_size, eng_vocab_size, ger_length, eng_length, 256)
model.compile(optimizer='adam', loss='categorical_crossentropy')
# summarize defined model
print(model.summary())
plot_model(model, to_file='model.png', show_shapes=True)
# fit model
filename = 'model.h5'
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
model.fit(trainX, trainY, epochs=30, batch_size=64, validation_data=(testX, testY), callbacks=[checkpoint], verbose=2)
```

Running the example first prints a summary of the parameters of the dataset, such as vocabulary size and maximum phrase lengths.

```py
English Vocabulary Size: 2404
English Max Length: 5
German Vocabulary Size: 3856
German Max Length: 10
```

Next, a summary of the defined model is printed, allowing us to confirm the model configuration.

```py
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 10, 256)           987136
_________________________________________________________________
lstm_1 (LSTM)                (None, 256)               525312
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 5, 256)            0
_________________________________________________________________
lstm_2 (LSTM)                (None, 5, 256)            525312
_________________________________________________________________
time_distributed_1 (TimeDist (None, 5, 2404)           617828
=================================================================
Total params: 2,655,588
Trainable params: 2,655,588
Non-trainable params: 0
_________________________________________________________________
```

A plot of the model is also created, providing another perspective on the model configuration.

![Plot of Model Graph for NMT](img/4319d3f70ffb87578d739d5d94a563f7.jpg)

Plot of Model Graph for NMT

Next, the model is trained. Each epoch takes about 30 seconds on modern CPU hardware; no GPU is required.

During the run, the model will be saved to the file _model.h5_, ready for inference in the next step.

```py
...
Epoch 26/30
Epoch 00025: val_loss improved from 2.20048 to 2.19976, saving model to model.h5
17s - loss: 0.7114 - val_loss: 2.1998
Epoch 27/30
Epoch 00026: val_loss improved from 2.19976 to 2.18255, saving model to model.h5
17s - loss: 0.6532 - val_loss: 2.1826
Epoch 28/30
Epoch 00027: val_loss did not improve
17s - loss: 0.5970 - val_loss: 2.1970
Epoch 29/30
Epoch 00028: val_loss improved from 2.18255 to 2.17872, saving model to model.h5
17s - loss: 0.5474 - val_loss: 2.1787
Epoch 30/30
Epoch 00029: val_loss did not improve
17s - loss: 0.5023 - val_loss: 2.1823
```

## Evaluate Neural Translation Model

We will evaluate the model on the train and the test datasets.

The model should perform very well on the train dataset and ideally have generalized to perform well on the test dataset.

Ideally, we would use a separate validation dataset to help with model selection during training instead of the test set. You can try this as an extension.

The clean datasets must be loaded and prepared as before.

```py
...
# load datasets
dataset = load_clean_sentences('english-german-both.pkl')
train = load_clean_sentences('english-german-train.pkl')
test = load_clean_sentences('english-german-test.pkl')
# prepare english tokenizer
eng_tokenizer = create_tokenizer(dataset[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = max_length(dataset[:, 0])
# prepare german tokenizer
ger_tokenizer = create_tokenizer(dataset[:, 1])
ger_vocab_size = len(ger_tokenizer.word_index) + 1
ger_length = max_length(dataset[:, 1])
# prepare data
trainX = encode_sequences(ger_tokenizer, ger_length, train[:, 1])
testX = encode_sequences(ger_tokenizer, ger_length, test[:, 1])
```

Next, the best model saved during training must be loaded.

```py
# load model
model = load_model('model.h5')
```

Evaluation involves two steps: first generating a translated output sequence, and then repeating this process for many input examples and summarizing the skill of the model across multiple cases.

Starting with inference, the model can predict the entire output sequence in a one-shot manner.

```py
translation = model.predict(source, verbose=0)
```

The prediction is a sequence of probability vectors, one per output time step; taking the argmax of each gives a sequence of integers that we can enumerate and look up in the tokenizer to map back to words.

The function below, named _word_for_id()_, will perform this reverse mapping.

```py
# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None
```

We can perform this mapping for each integer in the translation and return the result as a string of words.

The function _predict_sequence()_ below performs this operation for a single encoded source phrase.

```py
# generate target given source sequence
def predict_sequence(model, tokenizer, source):
    prediction = model.predict(source, verbose=0)[0]
    integers = [argmax(vector) for vector in prediction]
    target = list()
    for i in integers:
        word = word_for_id(i, tokenizer)
        if word is None:
            break
        target.append(word)
    return ' '.join(target)
```

Next, we can repeat this for each source phrase in a dataset and compare the predicted result to the expected target phrase in English.

We can print some of these comparisons to the screen to get an idea of how the model performs in practice.

We will also calculate BLEU scores to get a quantitative idea of how well the model has performed.

You can learn more about the BLEU score here:

* [A Gentle Introduction to Calculating the BLEU Score for Text in Python](https://machinelearningmastery.com/calculate-bleu-score-for-text-python/)

The _evaluate_model()_ function below implements this, calling the above _predict_sequence()_ function for each phrase in a provided dataset.

```py
# evaluate the skill of the model
def evaluate_model(model, tokenizer, sources, raw_dataset):
    actual, predicted = list(), list()
    for i, source in enumerate(sources):
        # translate encoded source text
        source = source.reshape((1, source.shape[0]))
        translation = predict_sequence(model, tokenizer, source)
        raw_target, raw_src = raw_dataset[i]
        if i < 10:
            print('src=[%s], target=[%s], predicted=[%s]' % (raw_src, raw_target, translation))
        # corpus_bleu expects a list of reference translations per prediction
        actual.append([raw_target.split()])
        predicted.append(translation.split())
    # calculate BLEU score
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))
```

We can tie all of this together and evaluate the loaded model on both the training and test datasets.

The complete code listing is provided below.

```py
from pickle import load
from numpy import array
from numpy import argmax
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from nltk.translate.bleu_score import corpus_bleu

# load a clean dataset
def load_clean_sentences(filename):
    return load(open(filename, 'rb'))

# fit a tokenizer
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# max sentence length
def max_length(lines):
    return max(len(line.split()) for line in lines)

# encode and pad sequences
def encode_sequences(tokenizer, length, lines):
    # integer encode sequences
    X = tokenizer.texts_to_sequences(lines)
    # pad sequences with 0 values
    X = pad_sequences(X, maxlen=length, padding='post')
    return X

# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

# generate target given source sequence
def predict_sequence(model, tokenizer, source):
    prediction = model.predict(source, verbose=0)[0]
    integers = [argmax(vector) for vector in prediction]
    target = list()
    for i in integers:
        word = word_for_id(i, tokenizer)
        if word is None:
            break
        target.append(word)
    return ' '.join(target)

# evaluate the skill of the model
def evaluate_model(model, tokenizer, sources, raw_dataset):
    actual, predicted = list(), list()
    for i, source in enumerate(sources):
        # translate encoded source text
        source = source.reshape((1, source.shape[0]))
        translation = predict_sequence(model, tokenizer, source)
        raw_target, raw_src = raw_dataset[i]
        if i < 10:
            print('src=[%s], target=[%s], predicted=[%s]' % (raw_src, raw_target, translation))
        # corpus_bleu expects a list of reference translations per prediction
        actual.append([raw_target.split()])
        predicted.append(translation.split())
    # calculate BLEU score
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))

# load datasets
dataset = load_clean_sentences('english-german-both.pkl')
train = load_clean_sentences('english-german-train.pkl')
test = load_clean_sentences('english-german-test.pkl')
# prepare english tokenizer
eng_tokenizer = create_tokenizer(dataset[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = max_length(dataset[:, 0])
# prepare german tokenizer
ger_tokenizer = create_tokenizer(dataset[:, 1])
ger_vocab_size = len(ger_tokenizer.word_index) + 1
ger_length = max_length(dataset[:, 1])
# prepare data
trainX = encode_sequences(ger_tokenizer, ger_length, train[:, 1])
testX = encode_sequences(ger_tokenizer, ger_length, test[:, 1])

# load model
model = load_model('model.h5')
# test on some training sequences
print('train')
evaluate_model(model, eng_tokenizer, trainX, train)
# test on some test sequences
print('test')
evaluate_model(model, eng_tokenizer, testX, test)
```

Running the example first prints examples of source text, expected and predicted translations, as well as scores for the training dataset, followed by the test dataset.

Your specific results will differ given the random shuffling of the dataset and the stochastic nature of neural networks.

Looking at the results for the training dataset first, we can see that the translations are readable and mostly correct.

For example: "_ich liebe dich_" was correctly translated to "_i love you_".

We can also see that the translations are not perfect, with "_ich konnte nicht gehen_" translated as "_i cant go_" instead of the expected "_i couldnt walk_".

We can also see a BLEU-4 score of 0.51, which provides an upper bound on what we might expect from this model. Note that BLEU-1 is lower than BLEU-4 in the listings below, which is unusual; it suggests these scores were recorded with references passed to _corpus_bleu()_ as flat token lists, so treat the absolute values with caution.

```py
src=[ich liebe dich], target=[i love you], predicted=[i love you]
src=[ich sagte du sollst den mund halten], target=[i said shut up], predicted=[i said stop up]
src=[wie geht es eurem vater], target=[hows your dad], predicted=[hows your dad]
src=[das gefallt mir], target=[i like that], predicted=[i like that]
src=[ich gehe immer zu fu], target=[i always walk], predicted=[i will to]
src=[ich konnte nicht gehen], target=[i couldnt walk], predicted=[i cant go]
src=[er ist sehr jung], target=[he is very young], predicted=[he is very young]
src=[versucht es doch einfach], target=[just try it], predicted=[just try it]
src=[sie sind jung], target=[youre young], predicted=[youre young]
src=[er ging surfen], target=[he went surfing], predicted=[he went surfing]

BLEU-1: 0.085682
BLEU-2: 0.284191
BLEU-3: 0.459090
BLEU-4: 0.517571
```

Looking at the results on the test set, we do see readable translations, which is not an easy task.

For example, we see "_ich mag dich nicht_" correctly translated to "_i dont like you_".

We also see some poor translations and a good case that the model could benefit from further tuning, such as "_ich bin etwas beschwipst_" translated as "_i a bit bit_" instead of the expected "_im a bit tipsy_".

A BLEU-4 score of 0.076238 was achieved, providing a baseline skill to improve upon with further refinements to the model.

```py
src=[tom erblasste], target=[tom turned pale], predicted=[tom went pale]
src=[bring mich nach hause], target=[take me home], predicted=[let us at]
src=[ich bin etwas beschwipst], target=[im a bit tipsy], predicted=[i a bit bit]
src=[das ist eine frucht], target=[its a fruit], predicted=[thats a a]
src=[ich bin pazifist], target=[im a pacifist], predicted=[im am]
src=[unser plan ist aufgegangen], target=[our plan worked], predicted=[who is a man]
src=[hallo tom], target=[hi tom], predicted=[hello tom]
src=[sei nicht nervos], target=[dont be nervous], predicted=[dont be crazy]
src=[ich mag dich nicht], target=[i dont like you], predicted=[i dont like you]
src=[tom stellte eine falle], target=[tom set a trap], predicted=[tom has a cough]

BLEU-1: 0.082088
BLEU-2: 0.006182
BLEU-3: 0.046129
BLEU-4: 0.076238
```

## Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

* **Data Cleaning**. Different data cleaning operations could be performed on the data, such as not removing punctuation or not normalizing case, or perhaps removing duplicate English phrases.
* **Vocabulary**. The vocabulary could be refined, perhaps by removing words used fewer than 5 or 10 times in the dataset and replacing them with "_unk_".
* **More Data**. The dataset used to fit the model could be expanded to 50,000, 100,000 phrases, or more.
* **Input Order**. The order of the input phrases could be reversed, which has been reported to lift skill, or a bidirectional input layer could be used.
* **Layers**. The encoder and/or the decoder models could be expanded with additional layers and trained for more epochs, providing more representational capacity for the model.
* **Units**. The number of memory units in the encoder and decoder could be increased, providing more representational capacity for the model.
* **Regularization**. The model could use regularization, such as weight or activation regularization, or dropout on the LSTM layers.
* **Pre-Trained Word Vectors**. Pre-trained word vectors could be used in the model.
* **Recursive Model**. A recursive formulation of the model could be used, where the next word in the output sequence is conditioned on the input sequence and the output sequence generated so far.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

* [Tab-delimited Bilingual Sentence Pairs](http://www.manythings.org/anki/)
* [German - English deu-eng.zip](http://www.manythings.org/anki/deu-eng.zip)
* [Encoder-Decoder Long Short-Term Memory Networks](https://machinelearningmastery.com/encoder-decoder-long-short-term-memory-networks/)

## Summary

In this tutorial, you discovered how to develop a neural machine translation system for translating German phrases to English.

Specifically, you learned:

* How to clean and prepare data ready to train a neural machine translation system.
* How to develop an encoder-decoder model for machine translation.
* How to use a trained model for inference on new input phrases and evaluate the model skill.

Do you have any questions? Ask your questions in the comments below and I will do my best to answer.

**Note**: This post is an excerpt from "[Deep Learning for Natural Language Processing](https://machinelearningmastery.com/deep-learning-for-nlp/)". Take a look if you want more step-by-step tutorials on getting the most out of deep learning methods when working with text data.