How to Prepare a French-to-English Dataset for Machine Translation · Machine Learning Mastery blog post translation

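# How to Prepare a French-to-English Dataset for Machine Translation

> Original: [https://machinelearningmastery.com/prepare-french-english-dataset-machine-translation/](https://machinelearningmastery.com/prepare-french-english-dataset-machine-translation/)

Machine translation is the challenging task of converting text from a source language into coherent and matching text in a target language.

Neural machine translation systems such as encoder-decoder recurrent neural networks are achieving state-of-the-art results for machine translation with a single end-to-end system trained directly on the source and target languages.

Standard datasets are required to develop, explore, and become familiar with how to build neural machine translation systems.

In this tutorial, you will discover the Europarl standard machine translation dataset and how to prepare the data for modeling.

After completing this tutorial, you will know:

* The Europarl dataset comprises the proceedings of the European Parliament in 11 languages.
* How to load and clean the parallel French and English transcripts ready for modeling in a neural machine translation system.
* How to reduce the vocabulary size of both the French and English data in order to reduce the complexity of the translation task.

Let's get started.

![How to Prepare a French-to-English Dataset for Machine Translation](img/8792dfc88847977ba9077568a855a5fb.jpg)

How to Prepare a French-to-English Dataset for Machine Translation
Photo by [Giuseppe Milo](https://www.flickr.com/photos/giuseppemilo/15366744101/), some rights reserved.

## Tutorial Overview

This tutorial is divided into 5 parts; they are:

1. Europarl Machine Translation Dataset
2. Download French-English Dataset
3. Load Dataset
4. Clean Dataset
5. Reduce Vocabulary

### Python Environment

This tutorial assumes you have a Python SciPy environment installed with Python 3.

This tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help with your environment, see this post:

* [How to Setup a Python Environment for Machine Learning and Deep Learning with Anaconda](https://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/)

## Europarl Machine Translation Dataset

Europarl is a standard dataset used for statistical machine translation and, more recently, neural machine translation.

It consists of the proceedings of the European Parliament, hence the name of the dataset, the contraction _Europarl_.

The proceedings are transcriptions of speakers at the European Parliament, translated into 11 different languages.

> It is a collection of the proceedings of the European Parliament, dating back to 1996. Altogether, the corpus comprises of about 30 million words for each of the 11 official languages of the European Union

- [Europarl: A Parallel Corpus for Statistical Machine Translation](http://homepages.inf.ed.ac.uk/pkoehn/publications/europarl-mtsummit05.pdf), 2005.

The raw data is available on the European Parliament website in HTML format.

The creation of the dataset was led by [Philipp Koehn](http://www.cs.jhu.edu/~phi/), author of the book "[Statistical Machine Translation](http://amzn.to/2xbAuwx)".

The dataset is made freely available to researchers on the website "[European Parliament Proceedings Parallel Corpus 1996-2011](http://www.statmt.org/europarl/)", and often appears as part of machine translation challenges, such as the [Machine Translation task](http://www.statmt.org/wmt14/translation-task.html) in the 2014 Workshop on Statistical Machine Translation.

The most recent version of the dataset is version 7, released in 2012, comprising data from 1996 to 2011.

## Download French-English Dataset

We will focus on the parallel French-English dataset.

This is a prepared corpus of aligned French and English sentences recorded between 1996 and 2011.

The dataset has the following statistics:

* Sentences: 2,007,723
* French words: 51,388,643
* English words: 50,196,035

You can download the dataset from here:

* [Parallel corpus French-English](http://www.statmt.org/europarl/v7/fr-en.tgz) (194 Megabytes)

Once downloaded, you should have the file "_fr-en.tgz_" in your current working directory.

You can unpack this archive file using the tar command, as follows:

```sh
tar zxvf fr-en.tgz
```

You will now have two files, as follows:

* English: europarl-v7.fr-en.en (288M)
* French: europarl-v7.fr-en.fr (331M)

Below is a sample of the English file.

```
Resumption of the session
I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.
Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.
You have requested a debate on this subject in the course of the next few days, during this part-session.
In the meantime, I should like to observe a minute' s silence, as a number of Members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the European Union.
```

Below is a sample of the French file.

```
Reprise de la session
Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances.
Comme vous avez pu le constater, le grand "bogue de l'an 2000" ne s'est pas produit. En revanche, les citoyens d'un certain nombre de nos pays ont été victimes de catastrophes naturelles qui ont vraiment été terribles.
Vous avez souhaité un débat à ce sujet dans les prochains jours, au cours de cette période de session.
En attendant, je souhaiterais, comme un certain nombre de collègues me l'ont demandé, que nous observions une minute de silence pour toutes les victimes, des tempêtes notamment, dans les différents pays de l'Union européenne qui ont été touchés.
```

## Load Dataset

Let's start off by loading the data files.

We can load each file as a string. Because the files contain unicode characters, we must specify an encoding when loading the files as text. In this case, we will use [UTF-8](https://en.wikipedia.org/wiki/UTF-8), which easily handles the unicode characters in both files.

The function below, named _load_doc()_, will load a given file and return it as a single blob of text.

```py
# load doc into memory
def load_doc(filename):
    # open the file as read only; the with-block closes it for us
    with open(filename, mode='rt', encoding='utf-8') as file:
        # read all text
        return file.read()
```

Next, we can split the file into sentences.

Generally, one utterance is stored on each line. We can treat these as sentences and split the file on newline characters. The function _to_sentences()_ below will split a loaded document.

```py
# split a loaded document into sentences
def to_sentences(doc):
    return doc.strip().split('\n')
```

When preparing our model later, we will need to know the length of the sentences in the dataset. We can write a short function to calculate the shortest and longest sentences, measured in whitespace-separated tokens.

```py
# shortest and longest sentence lengths (in tokens)
def sentence_lengths(sentences):
    lengths = [len(s.split()) for s in sentences]
    return min(lengths), max(lengths)
```

We can tie all of this together to load and summarize the English and French data files. The complete example is listed below.

```py
# load doc into memory
def load_doc(filename):
    # open the file as read only; the with-block closes it for us
    with open(filename, mode='rt', encoding='utf-8') as file:
        # read all text
        return file.read()

# split a loaded document into sentences
def to_sentences(doc):
    return doc.strip().split('\n')

# shortest and longest sentence lengths (in tokens)
def sentence_lengths(sentences):
    lengths = [len(s.split()) for s in sentences]
    return min(lengths), max(lengths)

# load English data
filename = 'europarl-v7.fr-en.en'
doc = load_doc(filename)
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('English data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))

# load French data
filename = 'europarl-v7.fr-en.fr'
doc = load_doc(filename)
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('French data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))
```

Running the example summarizes the number of lines, or sentences, in each file and the length in tokens of the longest and shortest line in each file.

```
English data: sentences=2007723, min=0, max=668
French data: sentences=2007723, min=0, max=693
```

Importantly, we can see that the number of lines, 2,007,723, matches the expectation. The minimum of 0 also tells us that both files contain some empty lines.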
## Clean Dataset

The data needs some minimal cleaning before being used to train a neural translation model.

Looking at some samples of the text, minimal text cleaning may include:

* Tokenizing text by white space.
* Normalizing case to lowercase.
* Removing punctuation from each word.
* Removing non-printable characters.
* Converting accented French characters to their unaccented Latin equivalents.
* Removing words that contain non-alphabetic characters.

These are just some basic operations as a starting point; you may know of, or require, more elaborate data cleaning operations.

The function _clean_lines()_ below implements these cleaning operations. Some notes:

* We use the unicode API to normalize unicode characters, which converts accented French characters to their Latin equivalents.
* We use an inverse regex match to retain only those characters in words that are printable.
* We use a translation table that maps characters to themselves, but removes all punctuation characters.

```py
import re
import string
from unicodedata import normalize

# clean a list of lines
def clean_lines(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for line in lines:
        # normalize unicode characters (decompose accents, then drop non-ASCII)
        line = normalize('NFD', line).encode('ascii', 'ignore')
        line = line.decode('UTF-8')
        # tokenize on white space
        line = line.split()
        # convert to lower case
        line = [word.lower() for word in line]
        # remove punctuation from each token
        line = [word.translate(table) for word in line]
        # remove non-printable chars from each token
        line = [re_print.sub('', w) for w in line]
        # remove tokens that are not purely alphabetic (e.g. numbers)
        line = [word for word in line if word.isalpha()]
        # store as string
        cleaned.append(' '.join(line))
    return cleaned
```

Once normalized, we save the lists of clean lines directly in binary format using the pickle API. This will speed up loading for further operations later and in the future.

Reusing the loading and splitting functions developed in the previous section, the complete example is listed below.

```py
import string
import re
from pickle import dump
from unicodedata import normalize

# load doc into memory
def load_doc(filename):
    # open the file as read only; the with-block closes it for us
    with open(filename, mode='rt', encoding='utf-8') as file:
        # read all text
        return file.read()

# split a loaded document into sentences
def to_sentences(doc):
    return doc.strip().split('\n')

# clean a list of lines
def clean_lines(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for line in lines:
        # normalize unicode characters (decompose accents, then drop non-ASCII)
        line = normalize('NFD', line).encode('ascii', 'ignore')
        line = line.decode('UTF-8')
        # tokenize on white space
        line = line.split()
        # convert to lower case
        line = [word.lower() for word in line]
        # remove punctuation from each token
        line = [word.translate(table) for word in line]
        # remove non-printable chars from each token
        line = [re_print.sub('', w) for w in line]
        # remove tokens that are not purely alphabetic (e.g. numbers)
        line = [word for word in line if word.isalpha()]
        # store as string
        cleaned.append(' '.join(line))
    return cleaned

# save a list of clean sentences to file
def save_clean_sentences(sentences, filename):
    with open(filename, 'wb') as file:
        dump(sentences, file)
    print('Saved: %s' % filename)

# load English data
filename = 'europarl-v7.fr-en.en'
doc = load_doc(filename)
sentences = to_sentences(doc)
sentences = clean_lines(sentences)
save_clean_sentences(sentences, 'english.pkl')
# spot check
for i in range(10):
    print(sentences[i])

# load French data
filename = 'europarl-v7.fr-en.fr'
doc = load_doc(filename)
sentences = to_sentences(doc)
sentences = clean_lines(sentences)
save_clean_sentences(sentences, 'french.pkl')
# spot check
for i in range(10):
    print(sentences[i])
```

After running, the clean sentences are saved in the _english.pkl_ and _french.pkl_ files respectively.

As part of the run, we also print the first few lines of each list of clean sentences, reproduced below.

English:

```
resumption of the session
i declare resumed the session of the european parliament adjourned on friday december and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period
although as you will have seen the dreaded millennium bug failed to materialise still the people in a number of countries suffered a series of natural disasters that truly were dreadful
you have requested a debate on this subject in the course of the next few days during this partsession
in the meantime i should like to observe a minute s silence as a number of members have requested on behalf of all the victims concerned particularly those of the terrible storms in the various countries of the european union
please rise then for this minute s silence
the house rose and observed a minute s silence
madam president on a point of order
you will be aware from the press and television that there have been a number of bomb explosions and killings in sri lanka
one of the people assassinated very recently in sri lanka was mr kumar ponnambalam who had visited the european parliament just a few months ago
```

French:

```
reprise de la session
je declare reprise la session du parlement europeen qui avait ete interrompue le vendredi decembre dernier et je vous renouvelle tous mes vux en esperant que vous avez passe de bonnes vacances
comme vous avez pu le constater le grand bogue de lan ne sest pas produit en revanche les citoyens dun certain nombre de nos pays ont ete victimes de catastrophes naturelles qui ont vraiment ete terribles
vous avez souhaite un debat a ce sujet dans les prochains jours au cours de cette periode de session
en attendant je souhaiterais comme un certain nombre de collegues me lont demande que nous observions une minute de silence pour toutes les victimes des tempetes notamment dans les differents pays de lunion europeenne qui ont ete touches
je vous invite a vous lever pour cette minute de silence
le parlement debout observe une minute de silence
madame la presidente cest une motion de procedure
vous avez probablement appris par la presse et par la television que plusieurs attentats a la bombe et crimes ont ete perpetres au sri lanka
lune des personnes qui vient detre assassinee au sri lanka est m kumar ponnambalam qui avait rendu visite au parlement europeen il y a quelques mois a peine
```

My reading of French is very limited, but at least as far as the English is concerned, further improvements could be made, such as dropping or re-attaching the hanging 's' tokens left over from possessives (for example, "minute s silence").

## Reduce Vocabulary

As part of the data cleaning, it is important to constrain the vocabulary of both the source and target languages.

The difficulty of the translation task grows with the size of the vocabularies, which in turn affects model training time and the size of the dataset required to make the model viable.

In this section, we will reduce the vocabulary of both the English and French text and mark all out-of-vocabulary (OOV) words with a special token.

We can start by loading the pickled clean lines saved in the previous section. The _load_clean_sentences()_ function below will load and return the list for a given filename.

```py
from pickle import load

# load a clean dataset
def load_clean_sentences(filename):
    with open(filename, 'rb') as file:
        return load(file)
```

Next, we can count the occurrences of each word in the dataset. For this we can use a _Counter_ object.
A _Counter_ is a Python dictionary keyed on words that updates a count each time a new occurrence of a word is added. The _to_vocab()_ function below creates a vocabulary for a given list of sentences.

```py
from collections import Counter

# create a frequency table for all words
def to_vocab(lines):
    vocab = Counter()
    for line in lines:
        tokens = line.split()
        vocab.update(tokens)
    return vocab
```

We can then process the created vocabulary and remove all words from the Counter that occur fewer times than a given threshold. The _trim_vocab()_ function below does this; it accepts a minimum occurrence count as a parameter and returns the trimmed vocabulary as a set.

```py
# remove all words with a frequency below a threshold
def trim_vocab(vocab, min_occurrence):
    tokens = [k for k, c in vocab.items() if c >= min_occurrence]
    return set(tokens)
```

Finally, we can update the sentences, removing all words that are not in the trimmed vocabulary and marking their removal with a special token, in this case the string "_unk_". The _update_dataset()_ function below does this and returns a list of updated lines that can then be saved to a new file.

```py
# mark all OOV with "unk" for all lines
def update_dataset(lines, vocab):
    new_lines = list()
    for line in lines:
        new_tokens = list()
        for token in line.split():
            if token in vocab:
                new_tokens.append(token)
            else:
                new_tokens.append('unk')
        new_line = ' '.join(new_tokens)
        new_lines.append(new_line)
    return new_lines
```

We can tie all of this together to reduce the vocabulary of both the English and French datasets and save the results to new data files. We will use a minimum occurrence of 5, but you are free to explore other values suitable for your application.

The complete code example is listed below.

```py
from pickle import load
from pickle import dump
from collections import Counter

# load a clean dataset
def load_clean_sentences(filename):
    with open(filename, 'rb') as file:
        return load(file)

# save a list of clean sentences to file
def save_clean_sentences(sentences, filename):
    with open(filename, 'wb') as file:
        dump(sentences, file)
    print('Saved: %s' % filename)

# create a frequency table for all words
def to_vocab(lines):
    vocab = Counter()
    for line in lines:
        tokens = line.split()
        vocab.update(tokens)
    return vocab

# remove all words with a frequency below a threshold
def trim_vocab(vocab, min_occurrence):
    tokens = [k for k, c in vocab.items() if c >= min_occurrence]
    return set(tokens)

# mark all OOV with "unk" for all lines
def update_dataset(lines, vocab):
    new_lines = list()
    for line in lines:
        new_tokens = list()
        for token in line.split():
            if token in vocab:
                new_tokens.append(token)
            else:
                new_tokens.append('unk')
        new_line = ' '.join(new_tokens)
        new_lines.append(new_line)
    return new_lines

# load English dataset
filename = 'english.pkl'
lines = load_clean_sentences(filename)
# calculate vocabulary
vocab = to_vocab(lines)
print('English Vocabulary: %d' % len(vocab))
# reduce vocabulary
vocab = trim_vocab(vocab, 5)
print('New English Vocabulary: %d' % len(vocab))
# mark out of vocabulary words
lines = update_dataset(lines, vocab)
# save updated dataset
filename = 'english_vocab.pkl'
save_clean_sentences(lines, filename)
# spot check
for i in range(10):
    print(lines[i])

# load French dataset
filename = 'french.pkl'
lines = load_clean_sentences(filename)
# calculate vocabulary
vocab = to_vocab(lines)
print('French Vocabulary: %d' % len(vocab))
# reduce vocabulary
vocab = trim_vocab(vocab, 5)
print('New French Vocabulary: %d' % len(vocab))
# mark out of vocabulary words
lines = update_dataset(lines, vocab)
# save updated dataset
filename = 'french_vocab.pkl'
save_clean_sentences(lines, filename)
# spot check
for i in range(10):
    print(lines[i])
```

First, the size of the English vocabulary is reported, followed by the updated size. The updated dataset is saved to the file '_english_vocab.pkl_' and a spot check of some updated examples is printed, with out-of-vocabulary words replaced by "_unk_".

```
English Vocabulary: 105357
New English Vocabulary: 41746
Saved: english_vocab.pkl
```

We can see that the vocabulary shrank by about 60%, from 105,357 words to a little over 40,000.

```
resumption of the session
i declare resumed the session of the european parliament adjourned on friday december and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period
although as you will have seen the dreaded millennium bug failed to materialise still the people in a number of countries suffered a series of natural disasters that truly were dreadful
you have requested a debate on this subject in the course of the next few days during this partsession
in the meantime i should like to observe a minute s silence as a number of members have requested on behalf of all the victims concerned particularly those of the terrible storms in the various countries of the european union
please rise then for this minute s silence
the house rose and observed a minute s silence
madam president on a point of order
you will be aware from the press and television that there have been a number of bomb explosions and killings in sri lanka
one of the people assassinated very recently in sri lanka was mr unk unk who had visited the european parliament just a few months ago
```

The same procedure is then performed on the French dataset, saving the result to the file '_french_vocab.pkl_'.

```
French Vocabulary: 141642
New French Vocabulary: 58800
Saved: french_vocab.pkl
```

We see a similar reduction in the size of the French vocabulary, about 58%.

```
reprise de la session
je declare reprise la session du parlement europeen qui avait ete interrompue le vendredi decembre dernier et je vous renouvelle tous mes vux en esperant que vous avez passe de bonnes vacances
comme vous avez pu le constater le grand bogue de lan ne sest pas produit en revanche les citoyens dun certain nombre de nos pays ont ete victimes de catastrophes naturelles qui ont vraiment ete terribles
vous avez souhaite un debat a ce sujet dans les prochains jours au cours de cette periode de session
en attendant je souhaiterais comme un certain nombre de collegues me lont demande que nous observions une minute de silence pour toutes les victimes des tempetes notamment dans les differents pays de lunion europeenne qui ont ete touches
je vous invite a vous lever pour cette minute de silence
le parlement debout observe une minute de silence
madame la presidente cest une motion de procedure
vous avez probablement appris par la presse et par la television que plusieurs attentats a la bombe et crimes ont ete perpetres au sri lanka
lune des personnes qui vient detre assassinee au sri lanka est m unk unk qui avait rendu visite au parlement europeen il y a quelques mois a peine
```

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

* [Europarl: A Parallel Corpus for Statistical Machine Translation](http://homepages.inf.ed.ac.uk/pkoehn/publications/europarl-mtsummit05.pdf), 2005.
* [European Parliament Proceedings Parallel Corpus 1996-2011 Homepage](http://www.statmt.org/europarl/)
* [Europarl Corpus on Wikipedia](https://en.wikipedia.org/wiki/Europarl_Corpus)

## Summary

In this tutorial, you discovered the Europarl machine translation dataset and how to prepare the data ready for modeling.

Specifically, you learned:

* The Europarl dataset comprises the proceedings of the European Parliament in 11 languages.
* How to load and clean the parallel French and English transcripts ready for modeling in a neural machine translation system.
* How to reduce the vocabulary size of both the French and English data in order to reduce the complexity of the translation task.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
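
A closing note on alignment: the English and French files correspond line by line, and the loading summary reported a minimum length of 0, so some lines are empty on at least one side. If you later drop empty or very long sentences, the same line index has to be removed from both lists, or the pairs fall out of step. The following is a minimal sketch of that idea, not part of the tutorial's own code; the helper name `filter_pairs` and the `max_len` cutoff are assumptions made for illustration.

```py
# Hypothetical helper (an assumption, not from the tutorial): drop sentence
# pairs where either side is empty or longer than max_len tokens, so the
# two lists stay aligned index by index.
def filter_pairs(src_lines, tgt_lines, max_len=80):
    if len(src_lines) != len(tgt_lines):
        raise ValueError('source and target must have the same number of lines')
    kept_src, kept_tgt = list(), list()
    for src, tgt in zip(src_lines, tgt_lines):
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # keep the pair only if both sides are non-empty and short enough
        if 0 < n_src <= max_len and 0 < n_tgt <= max_len:
            kept_src.append(src)
            kept_tgt.append(tgt)
    return kept_src, kept_tgt

# toy data standing in for the cleaned English and French lists
english = ['resumption of the session', '', 'madam president on a point of order']
french = ['reprise de la session', 'texte sans equivalent', 'madame la presidente cest une motion de procedure']
en, fr = filter_pairs(english, french, max_len=10)
print(len(en), len(fr))
```

In the toy data the second pair is dropped because its English side is empty, and the remaining two pairs still line up. With the real data you would pass the lists loaded from _english.pkl_ and _french.pkl_ and pick a cutoff that suits your model.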