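# A Gentle Introduction to Calculating the BLEU Score for Text in Python

> Original: [https://machinelearningmastery.com/calculate-bleu-score-for-text-python/](https://machinelearningmastery.com/calculate-bleu-score-for-text-python/)

BLEU, or the Bilingual Evaluation Understudy, is a score for comparing a candidate translation of text to one or more reference translations.

Although developed for translation, it can be used to evaluate text generated for a range of natural language processing tasks.

In this tutorial, you will discover the BLEU score for evaluating and scoring candidate text using the NLTK library in Python.

After completing this tutorial, you will know:

* A gentle introduction to the BLEU score and an intuition for what is being calculated.
* How you can calculate BLEU scores in Python using the NLTK library for sentences and documents.
* How you can use a suite of small examples to develop an intuition for how differences between a candidate and reference text impact the final BLEU score.

Let's get started.

![A Gentle Introduction to Calculating the BLEU Score for Text in Python](img/114a80a1936806a1189371f26560ca8e.jpg)

A Gentle Introduction to Calculating the BLEU Score for Text in Python
Photo by [Bernard Spragg. NZ](https://www.flickr.com/photos/volvob12b/15624500507/), some rights reserved.

## Tutorial Overview

This tutorial is divided into 4 parts; they are:

1. Bilingual Evaluation Understudy Score
2. Calculate BLEU Scores
3. Cumulative and Individual BLEU Scores
4. Worked Examples

## Bilingual Evaluation Understudy Score

The Bilingual Evaluation Understudy Score, or BLEU for short, is a metric for evaluating a generated sentence against a reference sentence.

A perfect match results in a score of 1.0, whereas a perfect mismatch results in a score of 0.0.

The score was developed for evaluating the predictions made by automatic machine translation systems. It is not perfect, but it does offer 5 compelling benefits:

* It is quick and inexpensive to calculate.
* It is easy to understand.
* It is language independent.
* It correlates highly with human evaluation.
* It has been widely adopted.

The BLEU score was proposed by Kishore Papineni, et al. in their 2002 paper "[BLEU: a Method for Automatic Evaluation of Machine Translation](http://www.aclweb.org/anthology/P02-1040.pdf)".

The approach works by counting the n-grams in the candidate translation that match n-grams in the reference text, where a 1-gram or unigram is a single token and a bigram comparison looks at each word pair. The comparison is made regardless of word order.

> The primary programming task for a BLEU implementor is to compare n-grams of the candidate with the n-grams of the reference translation and count the number of matches. These matches are position-independent. The more the matches, the better the candidate translation is.

- [BLEU: a Method for Automatic Evaluation of Machine Translation](http://www.aclweb.org/anthology/P02-1040.pdf), 2002.

The counting of matching n-grams is modified so that it takes the number of occurrences of each word in the reference text into account, rather than rewarding a candidate translation that generates an abundance of reasonable words. This is referred to in the paper as modified n-gram precision.

> Unfortunately, MT systems can overgenerate "reasonable" words, resulting in improbable, but high-precision, translations [...] Intuitively the problem is clear: a reference word should be considered exhausted after a matching candidate word is identified. We formalize this intuition as the modified unigram precision.

- [BLEU: a Method for Automatic Evaluation of Machine Translation](http://www.aclweb.org/anthology/P02-1040.pdf), 2002.

The score is for comparing sentences, but a modified version that normalizes n-grams by their occurrence is also proposed for better scoring blocks of multiple sentences.

> We first compute the n-gram matches sentence by sentence. Next, we add the clipped n-gram counts for all the candidate sentences and divide by the number of candidate n-grams in the test corpus to compute a modified precision score, pn, for the entire test corpus.

- [BLEU: a Method for Automatic Evaluation of Machine Translation](http://www.aclweb.org/anthology/P02-1040.pdf), 2002.

A perfect score is not possible in practice, as a translation would have to match the reference exactly. Even human translators cannot achieve this. Because the score depends on the number and quality of the references used to calculate it, comparing scores across datasets can be troublesome.

> The BLEU metric ranges from 0 to 1. Few translations will attain a score of 1 unless they are identical to a reference translation. For this reason, even a human translator will not necessarily score 1 [...] on a test corpus of about 500 sentences (40 general news stories), a human translator scored 0.3468 against four references and scored 0.2571 against two references.

- [BLEU: a Method for Automatic Evaluation of Machine Translation](http://www.aclweb.org/anthology/P02-1040.pdf), 2002.

In addition to translation, we can use the BLEU score for other language generation problems tackled with deep learning methods, such as:

* Language generation.
* Image caption generation.
* Text summarization.
* Speech recognition.

And much more.

## Calculate BLEU Scores

The Python Natural Language Toolkit library, or NLTK, provides an implementation of the BLEU score that you can use to evaluate your generated text against a reference.

### Sentence BLEU Score

NLTK provides the [sentence_bleu()](http://www.nltk.org/api/nltk.translate.html#nltk.translate.bleu_score.sentence_bleu) function for evaluating a candidate sentence against one or more reference sentences.

The reference sentences must be provided as a list of sentences, where each reference is a list of tokens. The candidate sentence is provided as a list of tokens. For example:

```py
from nltk.translate.bleu_score import sentence_bleu
# two alternative references, each a list of tokens
reference = [['this', 'is', 'a', 'test'], ['this', 'is', 'test']]
candidate = ['this', 'is', 'a', 'test']
score = sentence_bleu(reference, candidate)
print(score)
```

Running this example prints a perfect score, as the candidate matches one of the references exactly.

```py
1.0
```

### Corpus BLEU Score

NLTK also provides a function called [corpus_bleu()](http://www.nltk.org/api/nltk.translate.html#nltk.translate.bleu_score.corpus_bleu) for calculating the BLEU score for multiple sentences, such as a paragraph or a document.

The references must be specified as a list of documents, where each document is a list of references and each alternative reference is a list of tokens, i.e. a list of lists of lists of tokens. The candidate documents must be specified as a list where each document is a list of tokens, i.e. a list of lists of tokens.

This is a little confusing; here is an example of two references for one document.

```py
# two references for one document
from nltk.translate.bleu_score import corpus_bleu
references = [[['this', 'is', 'a', 'test'], ['this', 'is', 'test']]]
candidates = [['this', 'is', 'a', 'test']]
score = corpus_bleu(references, candidates)
print(score)
```

Running the example prints a perfect score, as before.

```py
1.0
```

## Cumulative and Individual BLEU Scores

The BLEU score calculations in NLTK allow you to specify the weighting of different n-grams in the calculation of the BLEU score.

This gives you the flexibility to calculate different types of BLEU score, such as individual and cumulative n-gram scores.

Let's take a look.

### Individual N-Gram Scores

An individual n-gram score is the evaluation of matching grams of one specific order only, such as single words (1-gram) or word pairs (2-gram or bigram).

The weights are specified as a tuple where each index refers to the gram order. To calculate the BLEU score only for 1-gram matches, you can specify a weight of 1 for 1-gram and 0 for 2-, 3- and 4-gram, i.e. (1, 0, 0, 0). For example:

```py
# 1-gram individual BLEU
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']
score = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))
print(score)
```

Running this example prints a score of 0.75, because three of the four candidate words appear in the reference.

```py
0.75
```

We can repeat this example for individual n-grams from 1 to 4 as follows:

```py
# n-gram individual BLEU
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'a', 'test']]
candidate = ['this', 'is', 'a', 'test']
print('Individual 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Individual 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 1, 0, 0)))
print('Individual 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 1, 0)))
print('Individual 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 0, 1)))
```

Running the example gives the following results.

```py
Individual 1-gram: 1.000000
Individual 2-gram: 1.000000
Individual 3-gram: 1.000000
Individual 4-gram: 1.000000
```

Although we can calculate the individual BLEU scores, this is not how the method was intended to be used, and the scores do not carry much meaning or seem very interpretable.

### Cumulative N-Gram Scores

Cumulative scores refer to the calculation of individual n-gram scores at all orders from 1 to n, combined by calculating their weighted geometric mean.

By default, the _sentence_bleu()_ and _corpus_bleu()_ functions calculate the cumulative 4-gram BLEU score, also called BLEU-4.

The weights for BLEU-4 are 1/4 (25%) or 0.25 for each of the 1-gram, 2-gram, 3-gram and 4-gram scores. For example:

```py
# 4-gram cumulative BLEU
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']
score = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25))
print(score)
```

Running this example prints the following score:

```py
0.707106781187
```

Note that this candidate shares no 3-grams or 4-grams with the reference. The value shown reflects how the NLTK version used for this tutorial handles that case; other versions may report a different score, possibly with a warning.

The cumulative and individual 1-gram BLEU use the same weights, i.e. (1, 0, 0, 0). The 2-gram weights assign 50% to each of the 1-gram and 2-gram scores, and the 3-gram weights assign 33% to each of the 1-, 2- and 3-gram scores.

Let's make this concrete by calculating the cumulative scores for BLEU-1, BLEU-2, BLEU-3 and BLEU-4:

```py
# cumulative BLEU scores
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']
print('Cumulative 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Cumulative 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0)))
print('Cumulative 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0.33, 0.33, 0.33, 0)))
print('Cumulative 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25)))
```

Running the example prints the following scores. They are quite different from, and more expressive than, the standalone individual n-gram scores.

```py
Cumulative 1-gram: 0.750000
Cumulative 2-gram: 0.500000
Cumulative 3-gram: 0.632878
Cumulative 4-gram: 0.707107
```

It is common to report the cumulative BLEU-1 to BLEU-4 scores when describing the skill of a text generation system.

## Worked Examples

In this section, we try to develop further intuition for the BLEU score with some examples.

We work at the sentence level with the following single reference sentence:

> the quick brown fox jumped over the lazy dog

First, let's look at a perfect score.

```py
# perfect match
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
score = sentence_bleu(reference, candidate)
print(score)
```

Running the example prints a perfect match.

```py
1.0
```

Next, let's change one word, '_quick_' to '_fast_'.

```py
# one word different
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'fast', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
score = sentence_bleu(reference, candidate)
print(score)
```

The result is a drop in score to about 0.75.

```py
0.7506238537503395
```

Try changing two words, both '_quick_' to '_fast_' and '_lazy_' to '_sleepy_'.

```py
# two words different
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'fast', 'brown', 'fox', 'jumped', 'over', 'the', 'sleepy', 'dog']
score = sentence_bleu(reference, candidate)
print(score)
```

Running the example, we can see a roughly linear drop in skill.

```py
0.4854917717073234
```

What if all words in the candidate are different?

```py
# all words different
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
score = sentence_bleu(reference, candidate)
print(score)
```

We get the worst possible score.

```py
0.0
```

Now, let's try a candidate that has fewer words than the reference (e.g. drop the last two words), but where the words are all correct.

```py
# shorter candidate
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the']
score = sentence_bleu(reference, candidate)
print(score)
```

The score is much like the score when one word was wrong above (about 0.75). Every n-gram in the candidate matches, so the drop comes entirely from the penalty for being shorter than the reference.

```py
0.7514772930752859
```

How about if we make the candidate two words longer than the reference?

```py
# longer candidate
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', 'from', 'space']
score = sentence_bleu(reference, candidate)
print(score)
```

Again, the score lands in the same region as the one-word-different and shorter-candidate cases: the two extra words are penalized as wrong words.

```py
0.7860753021519787
```

Finally, let's compare a candidate that is far too short: only two words in length.

```py
# very short
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'quick']
score = sentence_bleu(reference, candidate)
print(score)
```

Running this example first prints a warning message indicating that the 3-gram and above part of the evaluation (up to 4-gram) cannot be performed. This is fair, given that the candidate only offers 1-grams and 2-grams to work with.

```py
UserWarning:
Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
  warnings.warn(_msg)
```

Next, we can see a score that is very low indeed.

```py
0.0301973834223185
```

I encourage you to continue to play with examples.

The math is pretty simple, and I would also encourage you to read the paper and explore calculating the sentence-level score yourself in a spreadsheet.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

* [BLEU](https://en.wikipedia.org/wiki/BLEU) on Wikipedia
* [BLEU: a Method for Automatic Evaluation of Machine Translation](http://www.aclweb.org/anthology/P02-1040.pdf), 2002.
* [Source code for nltk.translate.bleu_score](http://www.nltk.org/_modules/nltk/translate/bleu_score.html)
* [nltk.translate package API Documentation](http://www.nltk.org/api/nltk.translate.html)

## Summary

In this tutorial, you discovered the BLEU score for evaluating and scoring candidate text against reference text in machine translation and other language generation tasks.

Specifically, you learned:

* A gentle introduction to the BLEU score and an intuition for what is being calculated.
* How you can calculate BLEU scores in Python using the NLTK library for sentences and documents.
* How you can use a suite of small examples to develop an intuition for how differences between a candidate and reference text impact the final BLEU score.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.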
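
For readers who would rather do the hand calculation suggested above in code than in a spreadsheet, the following is a minimal sketch of sentence-level BLEU in plain Python, following the description in the Papineni et al. paper: clipped (modified) n-gram precision, a uniformly weighted geometric mean, and a brevity penalty. The function names (`ngrams`, `modified_precision`, `brevity_penalty`, `simple_bleu`) are hypothetical helpers introduced for illustration and are not part of NLTK. As a simplifying assumption, the sketch returns 0.0 whenever any n-gram order has no matches, which may differ from what NLTK reports in that situation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(references, candidate, n):
    """Return (clipped matches, total candidate n-grams) for order n."""
    cand_counts = Counter(ngrams(candidate, n))
    # For each n-gram, the largest count observed in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Clip each candidate count so a reference n-gram cannot be reused.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped, sum(cand_counts.values())


def brevity_penalty(references, candidate):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties are broken toward the shorter one.
    r = min((len(ref) for ref in references),
            key=lambda ref_len: (abs(ref_len - c), ref_len))
    return 1.0 if c > r else math.exp(1 - r / c)


def simple_bleu(references, candidate, max_n=4):
    """Unsmoothed BLEU with uniform weights (assumption: 0.0 on any zero precision)."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        clipped, total = modified_precision(references, candidate, n)
        if clipped == 0:
            return 0.0
        log_sum += math.log(clipped / total) / max_n
    return brevity_penalty(references, candidate) * math.exp(log_sum)


reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
one_word_different = ['the', 'fast', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
shorter = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the']

print('One word different: %f' % simple_bleu(reference, one_word_different))
print('Shorter candidate:  %f' % simple_bleu(reference, shorter))
```

For the worked examples in which every n-gram order up to 4 has at least one match, this sketch should agree with the scores listed in the tutorial up to floating-point rounding. For the shorter candidate, all four precisions are 1.0, so the whole score is the brevity penalty exp(1 - 9/7).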