How to Prepare News Articles for Text Summarization · Machine Learning Mastery blog post translation

# How to Prepare News Articles for Text Summarization

> Original: [https://machinelearningmastery.com/prepare-news-articles-text-summarization/](https://machinelearningmastery.com/prepare-news-articles-text-summarization/)

Text summarization is the task of creating a short, accurate, and fluent summary of an article.

The CNN News story dataset is a popular, free dataset for text summarization experiments with deep learning methods.

In this tutorial, you will discover how to prepare the CNN News dataset for text summarization.

After completing this tutorial, you will know:

* About the CNN News dataset and how to download the story data to your workstation.
* How to load the dataset and split each article into story text and highlights.
* How to clean the dataset ready for modeling and save the cleaned data to file for later use.

Let's get started.

![How to Prepare News Articles for Text Summarization](img/8629787a201fe5d6aadfbd08b965f80c.jpg)

How to Prepare News Articles for Text Summarization. Photo by [DieselDemon](https://www.flickr.com/photos/28096801@N05/6252168841/), some rights reserved.

## Tutorial Overview

This tutorial is divided into 5 parts; they are:

1. CNN News Story Dataset
2. Inspect the Dataset
3. Load Data
4. Data Cleaning
5. Save Clean Data

## CNN News Story Dataset

The DeepMind Q&A Dataset is a large collection of news articles from CNN and the Daily Mail with associated questions.

The dataset was developed as a question answering task for deep learning and was presented in the 2015 paper "[Teaching Machines to Read and Comprehend](https://arxiv.org/abs/1506.03340)".

The dataset has since been used for text summarization, where the sentences of each news article are summarized. Notable examples are the papers:

* [Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond](https://arxiv.org/abs/1602.06023), 2016.
* [Get To The Point: Summarization with Pointer-Generator Networks](https://arxiv.org/abs/1704.04368), 2017.

Kyunghyun Cho, an academic at New York University, has made the dataset available for download:

* [DeepMind Q&A Dataset](http://cs.nyu.edu/~kcho/DMQA/)

In this tutorial, we will work with the CNN dataset, specifically the download of the ASCII text of the news stories available here:

* [cnn_stories.tgz](https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ) (151 megabytes)

This dataset contains more than 92,000 news articles, where each article is stored in a single "_.story_" file.

Download this dataset to your workstation and unzip it. Once downloaded, you can unzip the archive on the command line as follows:

```sh
tar xvf cnn_stories.tgz
```

This will create a _cnn/stories/_ directory filled with _.story_ files.

For example, from inside that directory we can count the story files on the command line as follows:

```sh
ls -ltr | wc -l
```

This reports 92,580 lines. Because _ls -l_ also prints a leading "total" line, that corresponds to 92,579 story files.

```text
92580
```

## Inspect the Dataset

Using a text editor, review some of the stories and note down some ideas for preparing this data.

For example, below is one story, with the body truncated for brevity.

```text
(CNN) -- If you travel by plane and arriving on time makes a difference, try to book on Hawaiian Airlines.

In 2012, passengers got where they needed to go without delay on the carrier more than nine times out of 10, according to a study released on Monday.

In fact, Hawaiian got even better from 2011, when it had a 92.8% on-time performance. Last year, it improved to 93.4%.

[...]

@highlight

Hawaiian Airlines again lands at No. 1 in on-time performance

@highlight

The Airline Quality Rankings Report looks at the 14 largest U.S. airlines

@highlight

ExpressJet and American Airlines had the worst on-time performance

@highlight

Virgin America had the best baggage handling; Southwest had lowest complaint rate
```

I note that the general structure of the dataset is the story text followed by a number of "_highlight_" points.

Reviewing articles on the CNN website, I can see that this pattern is still common.

![Example of a CNN News Article With Highlights from cnn.com](img/cc79667253e4757c2223dd24295aec31.jpg)

Example of a CNN news article with highlights from [cnn.com](http://edition.cnn.com/2017/08/28/politics/donald-trump-hurricane-harvey-response-texas/index.html)

The ASCII text does not include the article titles, but we can use these human-written "_highlights_" as multiple reference summaries for each news article.

I can also see that many articles start with source information, presumably the CNN office that produced the story; for example:

```text
(CNN) --
Gaza City (CNN) --
Los Angeles (CNN) --
```

These can be removed completely.

Data cleaning is a challenging problem and must be tailored to the specific application of the system.

If we are generally interested in developing a news article summarization system, then we may clean the text in order to simplify the learning problem by reducing the size of the vocabulary.

Some data cleaning ideas for this data include:

* Normalize case to lowercase (e.g. "An Italian").
* Remove punctuation (e.g. "on-time").

We could also reduce the vocabulary further to speed up testing models, such as:

* Remove numbers (e.g. "93.4%").
* Remove low-frequency words such as names (e.g. "Tom Watkins").
* Truncate stories to the first 5 or 10 sentences.

## Load Data

The first step is to load the data.

We can start by writing a function to load a single document given a filename. The data contains some unicode characters, so we will load the dataset by forcing the encoding to be [UTF-8](https://en.wikipedia.org/wiki/UTF-8).

The function below, named _load_doc()_, loads a single document as text given a filename.

```py
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
```

Next, we need to step over each filename in the stories directory and load it.

We can use the _listdir()_ function to list all filenames in the directory, then load each in turn. The function below, named _load_stories()_, implements this behavior and provides a starting point for preparing the loaded documents.

```py
# load all stories in a directory
def load_stories(directory):
    for name in listdir(directory):
        filename = directory + '/' + name
        # load document
        doc = load_doc(filename)
```

Each document can be separated into the news story text and the highlights, or summary text.

The split between the two is the first occurrence of the '_@highlight_' token. Once split, we can organize the highlights into a list.

The function below, named _split_story()_, implements this behavior and splits a given loaded document text into a story and a list of highlights.

```py
# split a document into news story and highlights
def split_story(doc):
    # find first highlight
    index = doc.find('@highlight')
    # split into story and highlights
    story, highlights = doc[:index], doc[index:].split('@highlight')
    # strip extra white space around each highlight
    highlights = [h.strip() for h in highlights if len(h) > 0]
    return story, highlights
```

We can now update the _load_stories()_ function to call the _split_story()_ function for each loaded document and then store the results in a list.

```py
# load all stories in a directory
def load_stories(directory):
    all_stories = list()
    for name in listdir(directory):
        filename = directory + '/' + name
        # load document
        doc = load_doc(filename)
        # split into story and highlights
        story, highlights = split_story(doc)
        # store
        all_stories.append({'story':story, 'highlights':highlights})
    return all_stories
```

Tying all of this together, the complete example of loading the entire dataset is listed below.

```py
from os import listdir

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# split a document into news story and highlights
def split_story(doc):
    # find first highlight
    index = doc.find('@highlight')
    # split into story and highlights
    story, highlights = doc[:index], doc[index:].split('@highlight')
    # strip extra white space around each highlight
    highlights = [h.strip() for h in highlights if len(h) > 0]
    return story, highlights

# load all stories in a directory
def load_stories(directory):
    stories = list()
    for name in listdir(directory):
        filename = directory + '/' + name
        # load document
        doc = load_doc(filename)
        # split into story and highlights
        story, highlights = split_story(doc)
        # store
        stories.append({'story':story, 'highlights':highlights})
    return stories

# load stories
directory = 'cnn/stories/'
stories = load_stories(directory)
print('Loaded Stories %d' % len(stories))
```

Running the example prints the number of loaded stories.

```text
Loaded Stories 92579
```

We can now access the loaded story and highlight data, for example:

```py
print(stories[4]['story'])
print(stories[4]['highlights'])
```

## Data Cleaning

Now that we can load the story data, we can pre-process the text by cleaning it.

We can process the stories line by line and apply the same cleaning operations to each highlight line.

For a given line, we will perform the following operations:

Remove the CNN office information.

```py
# strip source cnn office if it exists
index = line.find('(CNN) -- ')
if index > -1:
    line = line[index+len('(CNN)'):]
```

Split the line into tokens on white space:

```py
# tokenize on white space
line = line.split()
```

Normalize the case to lowercase.

```py
# convert to lower case
line = [word.lower() for word in line]
```

Remove all punctuation characters from each token (Python 3 specific).

```py
# prepare a translation table to remove punctuation
table = str.maketrans('', '', string.punctuation)
# remove punctuation from each token
line = [w.translate(table) for w in line]
```

Remove any words that contain non-alphabetic characters.

```py
# remove tokens that are not purely alphabetic (e.g. numbers)
line = [word for word in line if word.isalpha()]
```

Putting this all together, below is a new function named _clean_lines()_ that takes a list of lines of text and returns a list of clean lines of text.

```py
# clean a list of lines
def clean_lines(lines):
    cleaned = list()
    # prepare a translation table to remove punctuation
    table = str.maketrans('', '', string.punctuation)
    for line in lines:
        # strip source cnn office if it exists
        index = line.find('(CNN) -- ')
        if index > -1:
            line = line[index+len('(CNN)'):]
        # tokenize on white space
        line = line.split()
        # convert to lower case
        line = [word.lower() for word in line]
        # remove punctuation from each token
        line = [w.translate(table) for w in line]
        # remove tokens that are not purely alphabetic (e.g. numbers)
        line = [word for word in line if word.isalpha()]
        # store as string
        cleaned.append(' '.join(line))
    # remove empty strings
    cleaned = [c for c in cleaned if len(c) > 0]
    return cleaned
```

We can call this for a story by first converting it to a list of lines. The function can be called directly on the list of highlights.

```py
example['story'] = clean_lines(example['story'].split('\n'))
example['highlights'] = clean_lines(example['highlights'])
```

The complete example of loading and cleaning the dataset is listed below.

```py
from os import listdir
import string

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# split a document into news story and highlights
def split_story(doc):
    # find first highlight
    index = doc.find('@highlight')
    # split into story and highlights
    story, highlights = doc[:index], doc[index:].split('@highlight')
    # strip extra white space around each highlight
    highlights = [h.strip() for h in highlights if len(h) > 0]
    return story, highlights

# load all stories in a directory
def load_stories(directory):
    stories = list()
    for name in listdir(directory):
        filename = directory + '/' + name
        # load document
        doc = load_doc(filename)
        # split into story and highlights
        story, highlights = split_story(doc)
        # store
        stories.append({'story':story, 'highlights':highlights})
    return stories

# clean a list of lines
def clean_lines(lines):
    cleaned = list()
    # prepare a translation table to remove punctuation
    table = str.maketrans('', '', string.punctuation)
    for line in lines:
        # strip source cnn office if it exists
        index = line.find('(CNN) -- ')
        if index > -1:
            line = line[index+len('(CNN)'):]
        # tokenize on white space
        line = line.split()
        # convert to lower case
        line = [word.lower() for word in line]
        # remove punctuation from each token
        line = [w.translate(table) for w in line]
        # remove tokens that are not purely alphabetic (e.g. numbers)
        line = [word for word in line if word.isalpha()]
        # store as string
        cleaned.append(' '.join(line))
    # remove empty strings
    cleaned = [c for c in cleaned if len(c) > 0]
    return cleaned

# load stories
directory = 'cnn/stories/'
stories = load_stories(directory)
print('Loaded Stories %d' % len(stories))

# clean stories
for example in stories:
    example['story'] = clean_lines(example['story'].split('\n'))
    example['highlights'] = clean_lines(example['highlights'])
```

Note that each story is now stored as a list of clean lines, nominally separated by sentences.

## Save Clean Data

Finally, now that the data has been cleaned, we can save it to file.

An easy way to save the cleaned data is to pickle the list of stories and highlights.

For example:

```py
# save to file
from pickle import dump
dump(stories, open('cnn_dataset.pkl', 'wb'))
```

This will create a new file named _cnn_dataset.pkl_ with all of the cleaned data. The file is about 374 megabytes in size.

We can then load it later and use it with a text summarization model as follows:

```py
# load from file
from pickle import load
stories = load(open('cnn_dataset.pkl', 'rb'))
print('Loaded Stories %d' % len(stories))
```

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

* [DeepMind Q&A Dataset](http://cs.nyu.edu/~kcho/DMQA/)
* [Teaching Machines to Read and Comprehend](https://arxiv.org/abs/1506.03340), 2015.
* [Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond](https://arxiv.org/abs/1602.06023), 2016.
* [Get To The Point: Summarization with Pointer-Generator Networks](https://arxiv.org/abs/1704.04368), 2017.

## Summary

In this tutorial, you discovered how to prepare the CNN News dataset for text summarization.

Specifically, you learned:

* About the CNN News dataset and how to download the story data to your workstation.
* How to load the dataset and split each article into story text and highlights.
* How to clean the dataset ready for modeling and save the cleaned data to file for later use.
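
As a possible next step, two of the vocabulary-reduction ideas listed while inspecting the dataset (truncating each story and dropping low-frequency words) are not implemented in this tutorial. The sketch below shows one way they might be applied to the cleaned data. The helper names `truncate_stories` and `drop_rare_words`, the thresholds, and the tiny hand-made sample are assumptions for illustration only. Because cleaned stories are stored as lists of lines that are only nominally sentences, truncating by line is an approximation of truncating by sentence.

```python
from collections import Counter


def truncate_stories(stories, max_lines=10):
    """Keep only the first max_lines cleaned lines of each story (hypothetical helper)."""
    for example in stories:
        example['story'] = example['story'][:max_lines]
    return stories


def drop_rare_words(stories, min_count=2):
    """Remove words seen fewer than min_count times across all stories and highlights."""
    # count every word in every story line and highlight line
    counts = Counter()
    for example in stories:
        for line in example['story'] + example['highlights']:
            counts.update(line.split())
    keep = {word for word, count in counts.items() if count >= min_count}

    def filter_lines(lines):
        # drop rare words, then drop any line left empty
        filtered = [' '.join(w for w in line.split() if w in keep) for line in lines]
        return [line for line in filtered if line]

    for example in stories:
        example['story'] = filter_lines(example['story'])
        example['highlights'] = filter_lines(example['highlights'])
    return stories


# tiny hand-made sample in the same structure that clean_lines() produces
sample = [
    {'story': ['hawaiian airlines was on time',
               'tom watkins wrote the report',
               'it improved last year'],
     'highlights': ['hawaiian airlines on time']},
    {'story': ['the report ranks airlines',
               'airlines improved on time'],
     'highlights': ['the report ranks airlines']},
]

sample = truncate_stories(sample, max_lines=2)
sample = drop_rare_words(sample, min_count=2)
print(sample[0]['story'])
print(sample[0]['highlights'])
```

Truncation is applied first so that word counts reflect only the text a model would actually see. On the real dataset, both thresholds would need to be tuned by experiment.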