
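# How to Prepare Movie Review Data for Sentiment Analysis

> Original: [https://machinelearningmastery.com/prepare-movie-review-data-sentiment-analysis/](https://machinelearningmastery.com/prepare-movie-review-data-sentiment-analysis/)

Text data preparation is different for each problem.

Preparation starts with simple steps, like loading data, but quickly gets difficult with cleaning tasks that are very specific to the data you are working with. You need help on where to begin and the order in which to work through the steps from raw data to data ready for modeling.

In this tutorial, you will discover step by step how to prepare movie review text data for sentiment analysis.

After completing this tutorial, you will know:

* How to load text data and clean it to remove punctuation and other non-words.
* How to develop a vocabulary, tailor it, and save it to file.
* How to prepare movie reviews using cleaning and a predefined vocabulary, and save them to new files ready for modeling.

Let's get started.

* **Update Oct/2017**: Fixed a small bug when skipping non-matching files, thanks Jan Zett.
* **Update Dec/2017**: Fixed a small typo in the full example, thanks Ray and Zain.

![How to Prepare Movie Review Data for Sentiment Analysis](img/94ff9955426ec7150d7216dce6dc3f89.jpg)

How to Prepare Movie Review Data for Sentiment Analysis
Photo by [Kenneth Lu](https://www.flickr.com/photos/toasty/1125019024/), some rights reserved.

## Tutorial Overview

This tutorial is divided into 5 parts; they are:

1. Movie Review Dataset
2. Load Text Data
3. Clean Text Data
4. Develop Vocabulary
5. Save Prepared Data

## 1. Movie Review Dataset

The Movie Review Data is a collection of movie reviews retrieved from the imdb.com website in the early 2000s by Bo Pang and Lillian Lee. The reviews were collected as part of their research on natural language processing.

The reviews were originally released in 2002, but an updated and cleaned-up version was released in 2004, referred to as "_v2.0_".

The dataset comprises 1,000 positive and 1,000 negative movie reviews drawn from an archive of the rec.arts.movies.reviews newsgroup hosted at [IMDB](http://reviews.imdb.com/Reviews). The authors refer to this dataset as the "_polarity dataset_".

> Our data contains 1000 positive and 1000 negative reviews all written before 2002, with a cap of 20 reviews per author (312 authors total) per category. We refer to this corpus as the polarity dataset.

- [A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts](http://xxx.lanl.gov/abs/cs/0409058), 2004.

The data has already been cleaned up somewhat, for example:

* The dataset contains only English reviews.
* All text has been converted to lowercase.
* There is white space around punctuation such as periods, commas, and brackets.
* Text has been split into one sentence per line.

The data has been used for a few related natural language processing tasks. For classification, the accuracy of classical models (such as Support Vector Machines) on the data is in the range of the high 70s to low 80s percent (e.g. 78% to 82%).

More sophisticated data preparation may see results as high as 86% with 10-fold cross-validation. This gives us a ballpark of the low-to-mid 80s if we want to use this dataset in experiments with modern methods.

> ... depending on choice of downstream polarity classifier, we can achieve highly statistically significant improvement (from 82.8% to 86.4%)

- [A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts](http://xxx.lanl.gov/abs/cs/0409058), 2004.

You can download the dataset from here:

* [Movie Review Polarity Dataset](http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz) (review_polarity.tar.gz, 3MB)

After unzipping the file, you will have a directory called "_txt_sentoken_" with two sub-directories, "_neg_" and "_pos_", containing the text of the negative and positive reviews respectively. Within each of neg and pos, reviews are stored one per file with a naming convention running from _cv000_ to _cv999_.

Next, let's look at loading the text data.

## 2. Load Text Data

In this section, we will look at loading individual text files, then processing the directories of files.

We will assume that the review data has been downloaded and is available in the current working directory in the folder "_txt_sentoken_".

We can load an individual text file by opening it, reading in the ASCII text, and closing the file. This is standard file handling. For example, we can load the first negative review file "_cv000_29416.txt_" as follows:

```py
# load one file
filename = 'txt_sentoken/neg/cv000_29416.txt'
# open the file as read only
file = open(filename, 'r')
# read all text
text = file.read()
# close the file
file.close()
```

This loads the document as text and preserves any white space, such as new lines.

We can turn this into a function called _load_doc()_ that takes the filename of the document to load and returns the text.

```py
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
```

We have two directories, each with 1,000 documents. We can process each directory in turn by first getting a list of the files in the directory using the [listdir() function](https://docs.python.org/3/library/os.html#os.listdir), then loading each file in turn.

For example, we can load each document in the negative directory, using the _load_doc()_ function to do the actual loading.

```py
from os import listdir

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# specify directory to load
directory = 'txt_sentoken/neg'
# walk through all files in the folder
for filename in listdir(directory):
    # skip files that do not have the right extension
    if not filename.endswith(".txt"):
        continue
    # create the full path of the file to open
    path = directory + '/' + filename
    # load document
    doc = load_doc(path)
    print('Loaded %s' % filename)
```

Running this example prints the filename of each review after it is loaded.

```py
...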
Loaded cv995_23113.txt
Loaded cv996_12447.txt
Loaded cv997_5152.txt
Loaded cv998_15691.txt
Loaded cv999_14636.txt
```

We can also turn the processing of the documents into a function, to use later as a template for developing a function that cleans all documents in a folder. For example, below we define a _process_docs()_ function to do the same thing.

```py
from os import listdir

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load all docs in a directory
def process_docs(directory):
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load document
        doc = load_doc(path)
        print('Loaded %s' % filename)

# specify directory to load
directory = 'txt_sentoken/neg'
process_docs(directory)
```

Now that we know how to load the movie review text data, let's look at cleaning it.

## 3. Clean Text Data

In this section, we will look at what data cleaning we might want to do to the movie review data.

We will assume that we will be using a bag-of-words model, or perhaps a word embedding that does not require too much preparation.

### Split into Tokens

First, let's load one document and look at the raw tokens split by white space. We will use the _load_doc()_ function developed in the previous section. We can use the _split()_ function to split the loaded document into tokens separated by white space.

```py
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load the document
filename = 'txt_sentoken/neg/cv000_29416.txt'
text = load_doc(filename)
# split into tokens by white space
tokens = text.split()
print(tokens)
```

Running the example gives a long list of raw tokens from the document.

```py
...
'years', 'ago', 'and', 'has', 'been', 'sitting', 'on', 'the', 'shelves', 'ever', 'since', '.', 'whatever', '.', '.', '.', 'skip', 'it', '!', "where's", 'joblo', 'coming', 'from', '?', 'a', 'nightmare', 'of', 'elm', 'street', '3', '(', '7/10', ')', '-', 'blair', 'witch', '2', '(', '7/10', ')', '-', 'the', 'crow', '(', '9/10', ')', '-', 'the', 'crow', ':', 'salvation', '(', '4/10', ')', '-', 'lost', 'highway', '(', '10/10', ')', '-', 'memento', '(', '10/10', ')', '-', 'the', 'others', '(', '9/10', ')', '-', 'stir', 'of', 'echoes', '(', '8/10', ')']
```

Just looking at the raw tokens gives us plenty of ideas for things to try, such as:

* Remove punctuation from words (e.g. "what's").
* Remove tokens that are just punctuation (e.g. '-').
* Remove tokens that contain numbers (e.g. '10/10').
* Remove tokens that have one character (e.g. 'a').
* Remove tokens that don't carry much meaning (e.g. 'and').

Some ideas for how to do this:

* We can filter out punctuation from tokens using the string _translate()_ function.
* We can remove tokens that are just punctuation or that contain numbers by using an _isalpha()_ check on each token.
* We can remove English stop words using the list loaded with NLTK.
* We can filter out short tokens by checking their length.

Below is an updated version that cleans this review.

```py
from nltk.corpus import stopwords
import string

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load the document
filename = 'txt_sentoken/neg/cv000_29416.txt'
text = load_doc(filename)
# split into tokens by white space
tokens = text.split()
# remove punctuation from each token
table = str.maketrans('', '', string.punctuation)
tokens = [w.translate(table) for w in tokens]
# remove remaining tokens that are not alphabetic
tokens = [word for word in tokens if word.isalpha()]
# filter out stop words
stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if w not in stop_words]
# filter out short tokens
tokens = [word for word in tokens if len(word) > 1]
print(tokens)
```

Running the example gives a much cleaner list of tokens.

```py
...
'explanation', 'craziness', 'came', 'oh', 'way', 'horror', 'teen', 'slasher', 'flick', 'packaged', 'look', 'way', 'someone', 'apparently', 'assuming', 'genre', 'still', 'hot', 'kids', 'also', 'wrapped', 'production', 'two', 'years', 'ago', 'sitting', 'shelves', 'ever', 'since', 'whatever', 'skip', 'wheres', 'joblo', 'coming', 'nightmare', 'elm', 'street', 'blair', 'witch', 'crow', 'crow', 'salvation', 'lost', 'highway', 'memento', 'others', 'stir', 'echoes']
```

We can put this into a function called _clean_doc()_ and test it on another review, this time a positive one.

```py
from nltk.corpus import stopwords
import string

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# load the document
filename = 'txt_sentoken/pos/cv000_29590.txt'
text = load_doc(filename)
tokens = clean_doc(text)
print(tokens)
```

Again, the cleaning procedure seems to produce a good set of tokens, at least as a first cut.

```py
...
'comic', 'oscar', 'winner', 'martin', 'childs', 'shakespeare', 'love', 'production', 'design', 'turns', 'original', 'prague', 'surroundings', 'one', 'creepy', 'place', 'even', 'acting', 'hell', 'solid', 'dreamy', 'depp', 'turning', 'typically', 'strong', 'performance', 'deftly', 'handling', 'british', 'accent', 'ians', 'holm', 'joe', 'goulds', 'secret', 'richardson', 'dalmatians', 'log', 'great', 'supporting', 'roles', 'big', 'surprise', 'graham', 'cringed', 'first', 'time', 'opened', 'mouth', 'imagining', 'attempt', 'irish', 'accent', 'actually', 'wasnt', 'half', 'bad', 'film', 'however', 'good', 'strong', 'violencegore', 'sexuality', 'language', 'drug', 'content']
```

There are many more cleaning steps we could take; I leave them to your imagination.

Next, let's look at how we can manage a preferred vocabulary of tokens.

## 4. Develop Vocabulary

When working with predictive models of text, such as a bag-of-words model, there is pressure to reduce the size of the vocabulary. The larger the vocabulary, the sparser the representation of each word or document.

Part of preparing text for sentiment analysis involves defining and tailoring the vocabulary of words supported by the model. We can do this by loading all of the documents in the dataset and building a set of words. We may decide to support all of these words, or perhaps discard some. The final chosen vocabulary can then be saved to file for later use, such as filtering words in new documents in the future.

We can keep track of the vocabulary in a [Counter](https://docs.python.org/3/library/collections.html#collections.Counter), which is a dictionary of words and their counts with some additional convenience functions.

We need to develop a new function to process a document and add it to the vocabulary. The function needs to load a document by calling the previously developed _load_doc()_ function. It needs to clean the loaded document using the previously developed _clean_doc()_ function, then add all of the tokens to the Counter and update the counts. We can do this last step by calling the _update()_ function on the counter object.

Below is a function called _add_doc_to_vocab()_ that takes a document filename and a Counter vocabulary as arguments.

```py
# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
    # load doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # update counts
    vocab.update(tokens)
```

Finally, we can take the _process_docs()_ template above for processing all documents in a directory and update it to call _add_doc_to_vocab()_.

```py
# load all docs in a directory
def process_docs(directory, vocab):
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # add doc to vocab
        add_doc_to_vocab(path, vocab)
```

We can put all of this together and develop a full vocabulary from all documents in the dataset.

```py
from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
    # load doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # update counts
    vocab.update(tokens)

# load all docs in a directory
def process_docs(directory, vocab):
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # add doc to vocab
        add_doc_to_vocab(path, vocab)

# define vocab
vocab = Counter()
# add all docs to vocab
process_docs('txt_sentoken/neg', vocab)
process_docs('txt_sentoken/pos', vocab)
# print the size of the vocab
print(len(vocab))
# print the top words in the vocab
print(vocab.most_common(50))
```

Running the example creates a vocabulary from all documents in the dataset, including positive and negative reviews.

We can see that there are just over 46,000 unique words across all reviews and that the top 3 words are '_film_', '_one_', and '_movie_'.

```py
46557
[('film', 8860), ('one', 5521), ('movie', 5440), ('like', 3553), ('even', 2555), ('good', 2320), ('time', 2283), ('story', 2118), ('films', 2102), ('would', 2042), ('much', 2024), ('also', 1965), ('characters', 1947), ('get', 1921), ('character', 1906), ('two', 1825), ('first', 1768), ('see', 1730), ('well', 1694), ('way', 1668), ('make', 1590), ('really', 1563), ('little', 1491), ('life', 1472), ('plot', 1451), ('people', 1420), ('movies', 1416), ('could', 1395), ('bad', 1374), ('scene', 1373), ('never', 1364), ('best', 1301), ('new', 1277), ('many', 1268), ('doesnt', 1267), ('man', 1266), ('scenes', 1265), ('dont', 1210), ('know', 1207), ('hes', 1150), ('great', 1141), ('another', 1111), ('love', 1089), ('action', 1078), ('go', 1075), ('us', 1065), ('director', 1056), ('something', 1048), ('end', 1047), ('still', 1038)]
```

Perhaps the least common words, those that appear only once across all reviews, are not predictive. Perhaps some of the most common words are not useful either.

These are good questions and really should be tested with a specific predictive model.

Generally, words that appear only once or a few times across 2,000 reviews are probably not predictive and can be removed from the vocabulary, greatly cutting down on the tokens we need to model.

We can do this by stepping through the words and their counts and keeping only those whose count is at or above a chosen threshold. Here we will use a minimum of 5 occurrences.

```py
# keep tokens with at least 5 occurrences
min_occurrence = 5
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print(len(tokens))
```

This reduces the vocabulary from 46,557 to 14,803 words, a huge drop. Perhaps a minimum of 5 occurrences is too aggressive; you can experiment with different values.

We can then save the chosen vocabulary of words to a new file. I like to save the vocabulary as ASCII with one word per line.

Below is a function called _save_list()_ that saves a list of items, in this case tokens, to a file, one per line.

```py
def save_list(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()
```

The complete example for defining and saving the vocabulary is listed below.

```py
from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
    # load doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # update counts
    vocab.update(tokens)

# load all docs in a directory
def process_docs(directory, vocab):
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # add doc to vocab
        add_doc_to_vocab(path, vocab)

# save list to file
def save_list(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

# define vocab
vocab = Counter()
# add all docs to vocab
process_docs('txt_sentoken/neg', vocab)
process_docs('txt_sentoken/pos', vocab)
# print the size of the vocab
print(len(vocab))
# print the top words in the vocab
print(vocab.most_common(50))
# keep tokens with at least 5 occurrences
min_occurrence = 5
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print(len(tokens))
# save tokens to a vocabulary file
save_list(tokens, 'vocab.txt')
```

Running this final snippet after creating the vocabulary saves the chosen words to file.

It is a good idea to look at, and even study, your chosen vocabulary in order to get ideas for better preparing this data, or text data in the future.

```py
hasnt
updating
figuratively
symphony
civilians
might
fisherman
hokum
witch
buffoons
...
```

Next, we can look at using the vocabulary to create a prepared version of the movie review dataset.

## 5. Save Prepared Data

We can use the data cleaning and the chosen vocabulary to prepare each movie review and save the prepared versions of the reviews ready for modeling.

This is a good practice because it decouples data preparation from modeling, allowing you to focus on modeling and circle back to data preparation if you have new ideas.

We can start by loading the vocabulary from '_vocab.txt_'.

```py
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)
```

Next, we can clean the reviews, use the loaded vocabulary to filter out unwanted tokens, and save the clean reviews in a new file.

One approach is to save all of the positive reviews in one file and all of the negative reviews in another file, with the filtered tokens separated by white space and each review on its own line.

First, we can define a function to process a document, clean it, filter it, and return it as a single line that could be saved in a file. Below is the _doc_to_line()_ function, which takes a filename and the vocabulary (as a set) as arguments.

It calls the previously defined _load_doc()_ function to load the document and _clean_doc()_ to tokenize it.

```py
# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
    # load the doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # filter by vocab
    tokens = [w for w in tokens if w in vocab]
    return ' '.join(tokens)
```

Next, we can define a new version of _process_docs()_ that steps through all reviews in a folder and converts them to lines by calling _doc_to_line()_ for each document. A list of lines is then returned.

```py
# load all docs in a directory
def process_docs(directory, vocab):
    lines = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load and clean the doc
        line = doc_to_line(path, vocab)
        # add to list
        lines.append(line)
    return lines
```

We can then call _process_docs()_ for both the positive and negative review directories, then call _save_list()_ from the previous section to save each list of processed reviews to a file.

The complete code listing is provided below.

```py
from string import punctuation
from os import listdir
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# save list to file
def save_list(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
    # load the doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # filter by vocab
    tokens = [w for w in tokens if w in vocab]
    return ' '.join(tokens)

# load all docs in a directory
def process_docs(directory, vocab):
    lines = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load and clean the doc
        line = doc_to_line(path, vocab)
        # add to list
        lines.append(line)
    return lines

# load vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)
# prepare negative reviews
negative_lines = process_docs('txt_sentoken/neg', vocab)
save_list(negative_lines, 'negative.txt')
# prepare positive reviews
positive_lines = process_docs('txt_sentoken/pos', vocab)
save_list(positive_lines, 'positive.txt')
```

Running the example saves two new files, '_negative.txt_' and '_positive.txt_', that contain the prepared negative and positive reviews respectively.

The data is ready for use in a bag-of-words or even a word embedding model.

## Extensions

This section lists some extensions that you may wish to explore.

* **Stemming**. We could reduce each word in the documents to its stem using a stemming algorithm such as the Porter stemmer.
* **N-Grams**. Instead of working with individual words, we could work with a vocabulary of word pairs, called bigrams. We could also investigate the use of larger groups, such as triplets (trigrams) and more (n-grams).
* **Encode Words**. Instead of saving tokens as-is, we could save the integer encoding of the words, where the index of the word in the vocabulary represents a unique integer for that word. This would make the data easier to work with when modeling.
* **Encode Documents**. Instead of saving tokens in documents, we could encode the documents using a bag-of-words model, encoding each word as a boolean present/absent flag or using a more sophisticated scoring such as TF-IDF.

If you try any of these extensions, I'd love to know. Share your results in the comments below.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Dataset

* [Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data/)
* [A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts](http://xxx.lanl.gov/abs/cs/0409058), 2004.
* [Movie Review Polarity Dataset](http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz) (.tgz)
* Dataset Readme [v2.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/poldata.README.2.0.txt) and [v1.1](http://www.cs.cornell.edu/people/pabo/movie-review-data/README.1.1).

### APIs

* [nltk.tokenize package API](http://www.nltk.org/api/nltk.tokenize.html)
* [Chapter 2, Accessing Text Corpora and Lexical Resources](http://www.nltk.org/book/ch02.html)
* [os API: Miscellaneous operating system interfaces](https://docs.python.org/3/library/os.html)
* [collections API - Container datatypes](https://docs.python.org/3/library/collections.html)

## Summary

In this tutorial, you discovered step by step how to prepare movie review text data for sentiment analysis.

Specifically, you learned:

* How to load text data and clean it to remove punctuation and other non-words.
* How to develop a vocabulary, tailor it, and save it to file.
* How to prepare movie reviews using cleaning and a predefined vocabulary, and save them to new files ready for modeling.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
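
As a rough illustration of the "Encode Words" extension listed in the Extensions section, the sketch below maps each vocabulary word to a unique integer and uses that mapping to encode one cleaned review line. The helper names `build_word_index()` and `encode_line()`, the tiny sample vocabulary, and the choice to reserve 0 for words outside the vocabulary are all assumptions made for this example; they are not part of the tutorial's code.

```python
# Hypothetical helpers for illustration only; not part of the tutorial's code.

def build_word_index(vocab_words):
    # sort the words so the mapping is reproducible between runs;
    # start at 1 so that 0 stays free to mean "not in the vocabulary"
    return {word: i for i, word in enumerate(sorted(vocab_words), start=1)}

def encode_line(line, word_index, unknown=0):
    # map each white-space separated token to its integer, or to `unknown`
    return [word_index.get(token, unknown) for token in line.split()]

# made-up sample data standing in for vocab.txt and one prepared review line
sample_vocab = ['film', 'good', 'plot', 'story']
word_index = build_word_index(sample_vocab)
encoded = encode_line('good film weak plot', word_index)
print(word_index)
print(encoded)
```

In practice the words could be read from '_vocab.txt_' and the lines from '_negative.txt_' or '_positive.txt_'. Since those lines were already filtered by the vocabulary, the unknown value should only come into play for new documents.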