How to Develop a Deep Learning Bag-of-Words Model for Predicting Movie Review Sentiment · Machine Learning Mastery blog post translation

# How to Develop a Deep Learning Bag-of-Words Model for Predicting Movie Review Sentiment

> Original: [https://machinelearningmastery.com/deep-learning-bag-of-words-model-sentiment-analysis/](https://machinelearningmastery.com/deep-learning-bag-of-words-model-sentiment-analysis/)

Movie reviews can be classified as either favorable or not.

The evaluation of movie review text is a classification problem often called sentiment analysis. A popular technique for developing sentiment analysis models is to use a bag-of-words model that transforms documents into vectors, where each word in the document is assigned a score.

In this tutorial, you will discover how to develop a deep learning predictive model using the bag-of-words representation for movie review sentiment classification.

After completing this tutorial, you will know:

* How to prepare the review text data for modeling with a restricted vocabulary.
* How to use the bag-of-words model to prepare train and test data.
* How to develop a multilayer Perceptron bag-of-words model and use it to make predictions on new review text data.

Let's get started.

* **Update Oct/2017**: Fixed a small typo when loading and naming positive and negative reviews (thanks Arthur).

![How to Develop a Deep Learning Bag-of-Words Model for Predicting Sentiment in Movie Reviews](img/b6d93ac7970686bc3488e5204a5e6459.jpg)

How to Develop a Deep Learning Bag-of-Words Model for Predicting Sentiment in Movie Reviews
Photo by [jai Mansson](https://www.flickr.com/photos/75348994@N00/302260108/), some rights reserved.

## Tutorial Overview

This tutorial is divided into 4 parts; they are:

1. Movie Review Dataset
2. Data Preparation
3. Bag-of-Words Representation
4. Sentiment Analysis Models

## Movie Review Dataset

The Movie Review Data is a collection of movie reviews retrieved from the imdb.com website in the early 2000s by Bo Pang and Lillian Lee. The reviews were collected and made available as part of their research on natural language processing.

The reviews were originally released in 2002, but an updated and cleaned up version was released in 2004, referred to as "v2.0".

The dataset is comprised of 1,000 positive and 1,000 negative movie reviews drawn from an archive of the rec.arts.movies.reviews newsgroup hosted at [imdb.com](http://reviews.imdb.com/Reviews). The authors refer to this dataset as the "polarity dataset".

> Our data contains 1000 positive and 1000 negative reviews all written before 2002, with a cap of 20 reviews per author (312 authors total) per category. We refer to this corpus as the polarity dataset.

- [A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts](http://xxx.lanl.gov/abs/cs/0409058), 2004.

The data has been cleaned up somewhat, for example:

* The dataset is comprised of only English reviews.
* All text has been converted to lowercase.
* There is white space around punctuation like periods, commas, and brackets.
* Text has been split into one sentence per line.

The data has been used for a few related natural language processing tasks. For classification, the performance of classical models (such as Support Vector Machines) on the data is in the range of high 70% to low 80% (e.g. 78%-82%).

More sophisticated data preparation may see results as high as 86% with 10-fold cross-validation. This gives us a ballpark of low-to-mid 80s if we were looking to use this dataset in experiments on modern methods.

> ... depending on choice of downstream polarity classifier, we can achieve highly statistically significant improvement (from 82.8% to 86.4%)

- [A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts](http://xxx.lanl.gov/abs/cs/0409058), 2004.

You can download the dataset from here:

* [Movie Review Polarity Dataset](http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz) (review_polarity.tar.gz, 3MB)

After unzipping the file, you will have a directory called "_txt_sentoken_" with two subdirectories containing the text, "_neg_" and "_pos_", for negative and positive reviews. Reviews are stored one per file with a naming convention from _cv000_ to _cv999_ for each of neg and pos.

Next, let's look at loading and preparing the text data.

## Data Preparation

In this section, we will look at 3 things:

1. Separation of data into training and test sets.
2. Loading and cleaning the data to remove punctuation and numbers.
3. Defining a vocabulary of preferred words.

### Split into Train and Test Sets

We are pretending that we are developing a system that can predict the sentiment of a textual movie review as either positive or negative.

This means that after the model is developed, we will need to make predictions on new textual reviews. This will require all of the same data preparation to be performed on those new reviews as is performed on the training data for the model.

We will ensure that this constraint is built into the evaluation of our models by splitting the training and test datasets prior to any data preparation. This means that any knowledge in the test set that could help us better prepare the data (e.g. the words used) is unavailable during data preparation and model training.

That being said, we will use the last 100 positive reviews and the last 100 negative reviews as a test set (200 reviews) and the remaining 1,800 reviews as the training dataset.

This is a 90% train, 10% test split of the data.

The split can be imposed easily by using the filenames of the reviews, where reviews named 000 to 899 are used for training data and reviews named 900 onwards are used for testing the model.

### Loading and Cleaning Reviews

The text data is already pretty clean, so not much preparation is required.

Without getting too much into the details, we will prepare the data as follows:

* Split tokens on white space.
* Remove all punctuation from words.
* Remove all words that are not purely comprised of alphabetical characters.
* Remove all words that are known stop words.
* Remove all words that have a length <= 1 character.

We can put all of these steps into a function called _clean_doc()_ that takes the raw text loaded from a file as an argument and returns a list of cleaned tokens. We can also define a function _load_doc()_ that loads a document from file ready for use with the _clean_doc()_ function.

An example of cleaning the first positive review is listed below.

```py
import string
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
    # open the file as read only and read all text
    with open(filename, 'r') as file:
        text = file.read()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# load the document
filename = 'txt_sentoken/pos/cv000_29590.txt'
text = load_doc(filename)
tokens = clean_doc(text)
print(tokens)
```

Running the example prints a long list of clean tokens.

There are many more cleaning steps we may want to explore, and I leave them as further exercises. I'd love to see what you can come up with.

```py
...
'creepy', 'place', 'even', 'acting', 'hell', 'solid', 'dreamy', 'depp', 'turning', 'typically', 'strong', 'performance', 'deftly', 'handling', 'british', 'accent', 'ians', 'holm', 'joe', 'goulds', 'secret', 'richardson', 'dalmatians', 'log', 'great', 'supporting', 'roles', 'big', 'surprise', 'graham', 'cringed', 'first', 'time', 'opened', 'mouth', 'imagining', 'attempt', 'irish', 'accent', 'actually', 'wasnt', 'half', 'bad', 'film', 'however', 'good', 'strong', 'violencegore', 'sexuality', 'language', 'drug', 'content']
```

### Define a Vocabulary

It is important to define a vocabulary of known words when using a bag-of-words model.

The more words, the larger the representation of documents, therefore it is important to constrain the words to only those believed to be predictive. This is difficult to know beforehand, and often it is important to test different hypotheses about how to construct a useful vocabulary.

We have already seen how we can remove punctuation and numbers from the vocabulary in the previous section. We can repeat this for all documents and build a set of all known words.

We can develop a vocabulary as a _Counter_, which is a dictionary mapping of words to their counts that allows us to easily update and query.

Each document can be added to the counter (a new function called _add_doc_to_vocab()_) and we can step over all of the reviews in the positive directory and then the negative directory (a new function called _process_docs()_).

The complete example is listed below.

```py
from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
    # open the file as read only and read all text
    with open(filename, 'r') as file:
        text = file.read()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
    # load doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # update counts
    vocab.update(tokens)

# load all docs in a directory
def process_docs(directory, vocab):
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # add doc to vocab
        add_doc_to_vocab(path, vocab)

# define vocab
vocab = Counter()
# add all docs to vocab
process_docs('txt_sentoken/pos', vocab)
process_docs('txt_sentoken/neg', vocab)
# print the size of the vocab
print(len(vocab))
# print the top words in the vocab
print(vocab.most_common(50))
```

Running the example shows that we have a vocabulary of 44,276 words.

We can also see a sample of the top 50 most used words in the movie reviews.

Note that this vocabulary was constructed based only on those reviews in the training dataset.

```py
44276
[('film', 7983), ('one', 4946), ('movie', 4826), ('like', 3201), ('even', 2262), ('good', 2080), ('time', 2041), ('story', 1907), ('films', 1873), ('would', 1844), ('much', 1824), ('also', 1757), ('characters', 1735), ('get', 1724), ('character', 1703), ('two', 1643), ('first', 1588), ('see', 1557), ('way', 1515), ('well', 1511), ('make', 1418), ('really', 1407), ('little', 1351), ('life', 1334), ('plot', 1288), ('people', 1269), ('could', 1248), ('bad', 1248), ('scene', 1241), ('movies', 1238), ('never', 1201), ('best', 1179), ('new', 1140), ('scenes', 1135), ('man', 1131), ('many', 1130), ('doesnt', 1118), ('know', 1092), ('dont', 1086), ('hes', 1024), ('great', 1014), ('another', 992), ('action', 985), ('love', 977), ('us', 967), ('go', 952), ('director', 948), ('end', 946), ('something', 945), ('still', 936)]
```

We can step through the vocabulary and remove all words that have a low occurrence, such as only being used once or twice in all reviews.

For example, the following snippet will retrieve only the tokens that appear 2 or more times in all reviews.

```py
# keep tokens with a min occurrence
min_occurrence = 2
tokens = [k for k, c in vocab.items() if c >= min_occurrence]
print(len(tokens))
```

Running the above example with this addition shows that the vocabulary size drops to a little more than half of its size, from 44,276 to 25,767 words.

```py
25767
```

Finally, the vocabulary can be saved to a new file called _vocab.txt_ that we can later load and use to filter movie reviews prior to encoding them for modeling.

We define a new function called _save_list()_ that saves the vocabulary to file, with one word per line.

For example:

```py
# save list to file
def save_list(lines, filename):
    # convert lines to a single blob of text
    data = '\n'.join(lines)
    # open the file, write the text, and close the file
    with open(filename, 'w') as file:
        file.write(data)

# save tokens to a vocabulary file
save_list(tokens, 'vocab.txt')
```

After running the min occurrence filter on the vocabulary and saving it to file, you should now have a new file called _vocab.txt_ with only the words we are interested in.

The order of words in your file will differ, but should look something like the following:

```py
aberdeen
dupe
burt
libido
hamlet
arlene
available
corners
web
columbia
...
```

We are now ready to look at extracting features from the reviews ready for modeling.

## Bag-of-Words Representation

In this section, we will look at how we can convert each review into a representation that we can provide to a Multilayer Perceptron model.

A bag-of-words model is a way of extracting features from text so that the text input can be used with machine learning algorithms like neural networks.

Each document, in this case a review, is converted into a vector representation. The number of items in the vector representing a document corresponds to the number of words in the vocabulary. The larger the vocabulary, the longer the vector representation, hence the preference for smaller vocabularies in the previous section.

Words in a document are scored and the scores are placed in the corresponding location in the representation. We will look at different word scoring methods in the next section.

In this section, we are concerned with converting reviews into vectors ready for training a first neural network model.

This section is divided into 2 steps:

1. Converting reviews to lines of tokens.
2. Encoding reviews with a bag-of-words model representation.

### Reviews to Lines of Tokens

Before we can convert reviews to vectors for modeling, we must first clean them up.

This involves loading them, performing the cleaning operation developed above, filtering out words not in the chosen vocabulary, and converting the remaining tokens into a single string or line ready for encoding.

First, we need a function to prepare one document. Below lists the function _doc_to_line()_ that will load a document, clean it, filter out tokens not in the vocabulary, then return the document as a string of white space separated tokens.

```py
# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
    # load the doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # filter by vocab
    tokens = [w for w in tokens if w in vocab]
    return ' '.join(tokens)
```

Next, we need a function to work through all documents in a directory (such as '_pos_' and '_neg_') to convert the documents into lines.

Below lists the _process_docs()_ function that does just this, expecting a directory name and a vocabulary set as input arguments and returning a list of processed documents.

```py
# load all docs in a directory
def process_docs(directory, vocab):
    lines = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load and clean the doc
        line = doc_to_line(path, vocab)
        # add to list
        lines.append(line)
    return lines
```

Finally, we need to load the vocabulary and turn it into a set for use in cleaning reviews.

```py
# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)
```

We can put all of this together, reusing the loading and cleaning functions developed in previous sections.

The complete example is listed below, demonstrating how to prepare the positive and negative reviews from the training dataset.

```py
from string import punctuation
from os import listdir
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
    # open the file as read only and read all text
    with open(filename, 'r') as file:
        text = file.read()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
    # load the doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # filter by vocab
    tokens = [w for w in tokens if w in vocab]
    return ' '.join(tokens)

# load all docs in a directory
def process_docs(directory, vocab):
    lines = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load and clean the doc
        line = doc_to_line(path, vocab)
        # add to list
        lines.append(line)
    return lines

# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)
# load all training reviews
positive_lines = process_docs('txt_sentoken/pos', vocab)
negative_lines = process_docs('txt_sentoken/neg', vocab)
# summarize what we have
print(len(positive_lines), len(negative_lines))
```

### Movie Reviews to Bag-of-Words Vectors

We will use the Keras API to convert reviews to encoded document vectors.

Keras provides the [Tokenizer class](https://keras.io/preprocessing/text/#tokenizer) that can do some of the cleaning and vocabulary definition tasks that we took care of in the previous section.

It is better to do this ourselves to know exactly what was done and why. Nevertheless, the Tokenizer class is convenient and will easily transform documents into encoded vectors.

First, the Tokenizer must be created, then fit on the text documents in the training dataset.

In this case, these are the concatenation of the _negative_lines_ and _positive_lines_ lists developed in the previous section. The negative reviews come first so that the order matches the class labels assigned later.

```py
# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
docs = negative_lines + positive_lines
tokenizer.fit_on_texts(docs)
```

This process determines a consistent way to convert the vocabulary to a fixed-length vector with 25,768 elements. That is one more than the 25,767 words in the vocabulary file _vocab.txt_, because the Tokenizer reserves index 0 and never assigns it to a word.

Next, documents can be encoded using the Tokenizer by calling _texts_to_matrix()_. The function takes both a list of documents to encode and an encoding mode, which is the method used to score words in the document. Here we specify '_freq_' to score words based on their frequency in the document.

This can be used to encode the training data, for example:

```py
# encode training data set
Xtrain = tokenizer.texts_to_matrix(docs, mode='freq')
print(Xtrain.shape)
```

This encodes all of the positive and negative reviews in the training dataset and prints the shape of the resulting matrix as 1,800 documents, each with a length of 25,768 elements. It is ready to use as training data for a model.

```py
(1800, 25768)
```

We can encode the test data in a similar way.

First, the _process_docs()_ function from the previous section needs to be modified so that it can process only the reviews in the test dataset rather than the training dataset.

We support the loading of both the training and test datasets by adding an _is_train_ argument and using that to decide which review file names to skip.

```py
# load all docs in a directory
def process_docs(directory, vocab, is_train):
    lines = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip test reviews when loading train, and train reviews when loading test
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load and clean the doc
        line = doc_to_line(path, vocab)
        # add to list
        lines.append(line)
    return lines
```

Next, we can load and encode positive and negative reviews in the test set in the same way as we did for the training set.

```py
...
# load all test reviews
positive_lines = process_docs('txt_sentoken/pos', vocab, False)
negative_lines = process_docs('txt_sentoken/neg', vocab, False)
docs = negative_lines + positive_lines
# encode test data set
Xtest = tokenizer.texts_to_matrix(docs, mode='freq')
print(Xtest.shape)
```

We can put all of this together in a single example.

```py
from string import punctuation
from os import listdir
from nltk.corpus import stopwords
from keras.preprocessing.text import Tokenizer

# load doc into memory
def load_doc(filename):
    # open the file as read only and read all text
    with open(filename, 'r') as file:
        text = file.read()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
    # load the doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # filter by vocab
    tokens = [w for w in tokens if w in vocab]
    return ' '.join(tokens)

# load all docs in a directory
def process_docs(directory, vocab, is_train):
    lines = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip test reviews when loading train, and train reviews when loading test
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load and clean the doc
        line = doc_to_line(path, vocab)
        # add to list
        lines.append(line)
    return lines

# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)
# load all training reviews
positive_lines = process_docs('txt_sentoken/pos', vocab, True)
negative_lines = process_docs('txt_sentoken/neg', vocab, True)
# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
docs = negative_lines + positive_lines
tokenizer.fit_on_texts(docs)
# encode training data set
Xtrain = tokenizer.texts_to_matrix(docs, mode='freq')
print(Xtrain.shape)
# load all test reviews
positive_lines = process_docs('txt_sentoken/pos', vocab, False)
negative_lines = process_docs('txt_sentoken/neg', vocab, False)
docs = negative_lines + positive_lines
# encode test data set
Xtest = tokenizer.texts_to_matrix(docs, mode='freq')
print(Xtest.shape)
```

Running the example prints the shapes of the encoded training and test datasets, with 1,800 and 200 documents respectively, each with the same encoded vocabulary size (vector length).

```py
(1800, 25768)
(200, 25768)
```

## Sentiment Analysis Models

In this section, we will develop Multilayer Perceptron (MLP) models to classify encoded documents as either positive or negative.

The models will be simple feedforward network models with fully connected layers, called _Dense_ in the Keras deep learning library.

This section is divided into 3 parts:

1. First sentiment analysis model
2. Comparing word scoring modes
3. Making a prediction for new reviews

### First Sentiment Analysis Model

We can develop a simple MLP model to predict the sentiment of encoded reviews.

The model will have an input layer whose size equals the length of the encoded document vectors, which in turn is determined by the number of words in the vocabulary.

We can store this in a new variable called _n_words_, as follows:

```py
n_words = Xtest.shape[1]
```

We also need class labels for all of the training and test review data. We loaded and encoded these reviews deterministically (negative, then positive), so we can specify the labels directly, as follows:

```py
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])
```

We can now define the network.

All model configuration was found with very little trial and error and should not be considered tuned for this problem.

We will use a single hidden layer with 50 neurons and a rectified linear activation function. The output layer is a single neuron with a sigmoid activation function for predicting 0 for negative and 1 for positive reviews.

The network will be trained using the efficient [Adam implementation](http://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/) of gradient descent and the binary cross entropy loss function, suited to binary classification problems. We will keep track of accuracy when training and evaluating the model.

```py
# define network
model = Sequential()
model.add(Dense(50, input_shape=(n_words,), activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```

Next, we can fit the model on the training data; in this case, the model is small and is easily fit in 50 epochs.

```py
# fit network
model.fit(Xtrain, ytrain, epochs=50, verbose=2)
```

Finally, once the model is trained, we can evaluate its performance by making predictions on the test dataset and printing the accuracy.

```py
# evaluate
loss, acc = model.evaluate(Xtest, ytest, verbose=0)
print('Test Accuracy: %f' % (acc*100))
```

The complete example is listed below.

```py
from numpy import array
from string import punctuation
from os import listdir
from nltk.corpus import stopwords
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense

# load doc into memory
def load_doc(filename):
    # open the file as read only and read all text
    with open(filename, 'r') as file:
        text = file.read()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
    # load the doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # filter by vocab
    tokens = [w for w in tokens if w in vocab]
    return ' '.join(tokens)

# load all docs in a directory
def process_docs(directory, vocab, is_train):
    lines = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip test reviews when loading train, and train reviews when loading test
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load and clean the doc
        line = doc_to_line(path, vocab)
        # add to list
        lines.append(line)
    return lines

# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)
# load all training reviews
positive_lines = process_docs('txt_sentoken/pos', vocab, True)
negative_lines = process_docs('txt_sentoken/neg', vocab, True)
# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
docs = negative_lines + positive_lines
tokenizer.fit_on_texts(docs)
# encode training data set
Xtrain = tokenizer.texts_to_matrix(docs, mode='freq')
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])
# load all test reviews
positive_lines = process_docs('txt_sentoken/pos', vocab, False)
negative_lines = process_docs('txt_sentoken/neg', vocab, False)
docs = negative_lines + positive_lines
# encode test data set
Xtest = tokenizer.texts_to_matrix(docs, mode='freq')
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])

n_words = Xtest.shape[1]
# define network
model = Sequential()
model.add(Dense(50, input_shape=(n_words,), activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(Xtrain, ytrain, epochs=50, verbose=2)
# evaluate
loss, acc = model.evaluate(Xtest, ytest, verbose=0)
print('Test Accuracy: %f' % (acc*100))
```

Running the example, we can see that the model easily fits the training data within the 50 epochs, achieving 100% accuracy.

Evaluating the model on the test dataset, we can see that the model does well, achieving an accuracy above 90%, which compares favorably with the low-to-mid 80s seen in the original paper.

It is important to note, however, that this is not an apples-to-apples comparison, as the original paper used 10-fold cross-validation to estimate model skill instead of a single train/test split.

```py
...
Epoch 46/50
0s - loss: 0.0167 - acc: 1.0000
Epoch 47/50
0s - loss: 0.0157 - acc: 1.0000
Epoch 48/50
0s - loss: 0.0148 - acc: 1.0000
Epoch 49/50
0s - loss: 0.0140 - acc: 1.0000
Epoch 50/50
0s - loss: 0.0132 - acc: 1.0000

Test Accuracy: 91.000000
```

Next, let's look at testing different word scoring methods for the bag-of-words model.

### Comparing Word Scoring Methods

The _texts_to_matrix()_ function for the Tokenizer in the Keras API provides 4 different methods for scoring words; they are:

* "_binary_" where words are marked as present (1) or absent (0).
* "_count_" where the occurrence count for each word is marked as an integer.
* "_tfidf_" where each word is scored based on its frequency, and words that are common across all documents are penalized.
* "_freq_" where words are scored based on their frequency of occurrence within the document.

We can evaluate the skill of the model developed in the previous section using each of the 4 supported word scoring modes.

This first involves developing a function to create an encoding of the loaded documents based on a chosen scoring mode. The function creates the tokenizer, fits it on the training documents, then creates the train and test encodings using the chosen mode. The function _prepare_data()_ implements this behavior given lists of train and test documents.

```py
# prepare bag of words encoding of docs
def prepare_data(train_docs, test_docs, mode):
    # create the tokenizer
    tokenizer = Tokenizer()
    # fit the tokenizer on the documents
    tokenizer.fit_on_texts(train_docs)
    # encode training data set
    Xtrain = tokenizer.texts_to_matrix(train_docs, mode=mode)
    # encode test data set
    Xtest = tokenizer.texts_to_matrix(test_docs, mode=mode)
    return Xtrain, Xtest
```

We also need a function to evaluate the MLP given a specific encoding of the data.

Because neural networks are stochastic, they can produce different results when the same model is fit on the same data. This is mainly because of the random initial weights and the shuffling of patterns during mini-batch gradient descent. This means that any one scoring of a model is unreliable and we should estimate model skill based on an average of multiple runs.

The function below, named _evaluate_mode()_, takes encoded documents and evaluates the MLP by training it on the training set and estimating skill on the test set 30 times, and returns a list of the accuracy scores across all of these runs.

```py
# evaluate a neural network model
def evaluate_mode(Xtrain, ytrain, Xtest, ytest):
    scores = list()
    n_repeats = 30
    n_words = Xtest.shape[1]
    for i in range(n_repeats):
        # define network
        model = Sequential()
        model.add(Dense(50, input_shape=(n_words,), activation='relu'))
        model.add(Dense(1, activation='sigmoid'))
        # compile network
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        # fit network
        model.fit(Xtrain, ytrain, epochs=50, verbose=2)
        # evaluate
        loss, acc = model.evaluate(Xtest, ytest, verbose=0)
        scores.append(acc)
        print('%d accuracy: %s' % ((i+1), acc))
    return scores
```

We are now ready to evaluate the performance of the 4 different word scoring methods.

Pulling all of this together, the complete example is listed below.

```py
from numpy import array
from string import punctuation
from os import listdir
from nltk.corpus import stopwords
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense
from pandas import DataFrame
from matplotlib import pyplot

# load doc into memory
def load_doc(filename):
    # open the file as read only and read all text
    with open(filename, 'r') as file:
        text = file.read()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
    # load the doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # filter by vocab
    tokens = [w for w in tokens if w in vocab]
    return ' '.join(tokens)

# load all docs in a directory
def process_docs(directory, vocab, is_train):
    lines = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip test reviews when loading train, and train reviews when loading test
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load and clean the doc
        line = doc_to_line(path, vocab)
        # add to list
        lines.append(line)
    return lines

# evaluate a neural network model
def evaluate_mode(Xtrain, ytrain, Xtest, ytest):
    scores = list()
    n_repeats = 30
    n_words = Xtest.shape[1]
    for i in range(n_repeats):
        # define network
        model = Sequential()
        model.add(Dense(50, input_shape=(n_words,), activation='relu'))
        model.add(Dense(1, activation='sigmoid'))
        # compile network
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        # fit network
        model.fit(Xtrain, ytrain, epochs=50, verbose=2)
        # evaluate
        loss, acc = model.evaluate(Xtest, ytest, verbose=0)
        scores.append(acc)
        print('%d accuracy: %s' % ((i+1), acc))
    return scores

# prepare bag of words encoding of docs
def prepare_data(train_docs, test_docs, mode):
    # create the tokenizer
    tokenizer = Tokenizer()
    # fit the tokenizer on the documents
    tokenizer.fit_on_texts(train_docs)
    # encode training data set
    Xtrain = tokenizer.texts_to_matrix(train_docs, mode=mode)
    # encode test data set
    Xtest = tokenizer.texts_to_matrix(test_docs, mode=mode)
    return Xtrain, Xtest

# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)
# load all training reviews
positive_lines = process_docs('txt_sentoken/pos', vocab, True)
negative_lines = process_docs('txt_sentoken/neg', vocab, True)
train_docs = negative_lines + positive_lines
# load all test reviews
positive_lines = process_docs('txt_sentoken/pos', vocab, False)
negative_lines = process_docs('txt_sentoken/neg', vocab, False)
test_docs = negative_lines + positive_lines
# prepare labels
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])

modes = ['binary', 'count', 'tfidf', 'freq']
results = DataFrame()
for mode in modes:
    # prepare data for mode
    Xtrain, Xtest = prepare_data(train_docs, test_docs, mode)
    # evaluate model on data for mode
    results[mode] = evaluate_mode(Xtrain, ytrain, Xtest, ytest)
# summarize results
print(results.describe())
# plot results
results.boxplot()
pyplot.show()
```

Running the example may take a while (about an hour on modern hardware with a CPU rather than a GPU).

At the end of the run, summary statistics for each word scoring method are provided, summarizing the distribution of model skill scores across each of the 30 runs per mode.

We can see that the mean scores of the '_freq_' and '_binary_' methods appear to be better than those of '_count_' and '_tfidf_'.

```py
          binary     count      tfidf       freq
count  30.000000  30.00000  30.000000  30.000000
mean    0.915833   0.88900   0.856333   0.908167
std     0.009010   0.01012   0.013126   0.002451
min     0.900000   0.86500   0.830000   0.905000
25%     0.906250   0.88500   0.850000   0.905000
50%     0.915000   0.89000   0.857500   0.910000
75%     0.920000   0.89500   0.865000   0.910000
max     0.935000   0.90500   0.885000   0.910000
```

A box and whisker plot of the results is also presented, summarizing the accuracy distributions per configuration.

We can see that the distribution for the '_freq_' configuration is tight, which is encouraging given that it also performs well. Additionally, we can see that '_binary_' achieved the best results with a modest spread and might be the preferred approach for this dataset.

![Box and Whisker Plot for Model Accuracy with Different Word Scoring Methods](img/41c1d5a636ede307ab02540acd449347.jpg)

Box and Whisker Plot for Model Accuracy with Different Word Scoring Methods

## Making a Prediction for New Reviews

Finally, we can use the final model to make predictions for new textual reviews.

This is why we wanted the model in the first place.

Predicting the sentiment of new reviews involves following the same steps used to prepare the test data. Specifically, loading the text, cleaning the document, filtering tokens by the chosen vocabulary, converting the remaining tokens to a line, encoding it using the Tokenizer, and making a prediction.

We can make a prediction of a class value directly with the fit model by calling _predict()_, which returns a value that can be rounded to an integer of 0 for a negative review and 1 for a positive review.

All of these steps can be put into a new function called _predict_sentiment()_ that requires the review text, the vocabulary, the tokenizer, and the fit model, as follows:

```py
# classify a review as negative (0) or positive (1)
def predict_sentiment(review, vocab, tokenizer, model):
    # clean
    tokens = clean_doc(review)
    # filter by vocab
    tokens = [w for w in tokens if w in vocab]
    # convert to line
    line = ' '.join(tokens)
    # encode
    encoded = tokenizer.texts_to_matrix([line], mode='freq')
    # prediction
    yhat = model.predict(encoded, verbose=0)
    # round the predicted probability to a plain Python int (0 or 1)
    return int(round(float(yhat[0, 0])))
```

We can now make predictions for new review texts.

Below is an example with both a clearly positive and a clearly negative review, using the simple MLP developed above with the frequency word scoring mode.

```py
# test positive text
text = 'Best movie ever!'
print(predict_sentiment(text, vocab, tokenizer, model))
# test negative text
text = 'This is a bad movie.'
print(predict_sentiment(text, vocab, tokenizer, model))
```

Running the example correctly classifies these reviews.

```py
1
0
```

Ideally, we would fit the model on all available data (train and test) to create a [final model](http://machinelearningmastery.com/train-final-machine-learning-model/) and save the model and tokenizer to file so that they can be loaded and used in new software.

## Extensions

This section lists some extensions if you are looking to get more out of this tutorial.

* **Manage Vocabulary**. Explore using a larger or smaller vocabulary. Perhaps you can get better performance with a smaller set of words.
* **Tune the Network Topology**. Explore alternate network topologies such as deeper or wider networks. Perhaps you can get better performance with a more suitable network.
* **Use Regularization**. Explore the use of regularization techniques, such as dropout. Perhaps you can delay the convergence of the model and achieve better test set performance.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Dataset

* [Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data/)
* [A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts](http://xxx.lanl.gov/abs/cs/0409058), 2004.
* [Movie Review Polarity Dataset](http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz) (.tgz)
* Dataset Readme [v2.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/poldata.README.2.0.txt) and [v1.1](http://www.cs.cornell.edu/people/pabo/movie-review-data/README.1.1).

### APIs

* [nltk.tokenize package API](http://www.nltk.org/api/nltk.tokenize.html)
* [Chapter 2, Accessing Text Corpora and Lexical Resources](http://www.nltk.org/book/ch02.html)
* [os API Miscellaneous operating system interfaces](https://docs.python.org/3/library/os.html)
* [collections API - Container datatypes](https://docs.python.org/3/library/collections.html)
* [Tokenizer Keras API](https://keras.io/preprocessing/text/#tokenizer)

## Summary

In this tutorial, you discovered how to develop a bag-of-words model for predicting the sentiment of movie reviews.

Specifically, you learned:

* How to prepare the review text data for modeling with a restricted vocabulary.
* How to use the bag-of-words model to prepare train and test data.
* How to develop a multilayer Perceptron bag-of-words model and use it to make predictions on new review text data.

Do you have any questions? Ask your questions in the comments below and I will do my best to answer.
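
As a small recap of the word scoring modes compared in this tutorial, the sketch below re-implements the 'binary', 'count' and 'freq' scores in plain Python for a single toy document. It is an illustration under stated assumptions and not the Keras implementation: the helper `encode_doc()`, the four-word `vocab_index` and the sample tokens are all made up for this example, index 0 is not reserved the way the Keras Tokenizer reserves it, and 'tfidf' is left out because its exact weighting is library-specific.

```python
from collections import Counter

# hypothetical helper: encode one tokenized document as a bag-of-words vector
def encode_doc(tokens, vocab_index, mode='binary'):
    # one slot per vocabulary word, initialized to zero
    vector = [0.0] * len(vocab_index)
    # count only the tokens that are in the vocabulary
    counts = Counter(t for t in tokens if t in vocab_index)
    total = sum(counts.values())
    for word, count in counts.items():
        i = vocab_index[word]
        if mode == 'binary':
            # word is present (1) or absent (0)
            vector[i] = 1.0
        elif mode == 'count':
            # number of occurrences in the document
            vector[i] = float(count)
        elif mode == 'freq':
            # occurrences divided by the number of in-vocabulary tokens
            vector[i] = count / total
        else:
            raise ValueError('unknown mode: %s' % mode)
    return vector

# toy vocabulary (word -> vector position) and document, both invented for this sketch
vocab_index = {'bad': 0, 'good': 1, 'movie': 2, 'plot': 3}
tokens = 'good movie good plot unknownword'.split()

for mode in ('binary', 'count', 'freq'):
    print(mode, encode_doc(tokens, vocab_index, mode))
```

The out-of-vocabulary token is ignored in every mode, and the 'freq' vector is simply the 'count' vector divided by the number of in-vocabulary tokens, so its entries sum to 1 for any document that contains at least one known word.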