
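# How to Clean Text for Machine Learning with Python

> Original: [https://machinelearningmastery.com/clean-text-machine-learning-python/](https://machinelearningmastery.com/clean-text-machine-learning-python/)

You cannot go straight from raw text to fitting a machine learning or deep learning model.

You must clean your text first, which means splitting it into words and handling punctuation and case.

In fact, there is a whole suite of text preparation methods that you may need to use, and the choice of methods really depends on your natural language processing task.

In this tutorial, you will discover how you can clean and prepare your text ready for modeling with machine learning.

After completing this tutorial, you will know:

* How to get started by developing your own very simple text cleaning tools.
* How to take a step up and use the more sophisticated methods in the NLTK library.
* How to prepare text when using modern text representation methods like word embeddings.

Let's get started.

* **Update Nov/2017**: Fixed a code typo in the "split into words" section, thanks David Comfort.

![How to Develop Multilayer Perceptron Models for Time Series Forecasting](img/3845e70194ea7d465b653bbb0d8b993a.jpg)

How to Develop Multilayer Perceptron Models for Time Series Forecasting
Photo by the Bureau of Land Management, some rights reserved.

## Tutorial Overview

This tutorial is divided into 6 parts; they are:

1. Metamorphosis by Franz Kafka
2. Text Cleaning Is Task Specific
3. Manual Tokenization
4. Tokenization and Cleaning with NLTK
5. Additional Text Cleaning Considerations
6. Tips for Cleaning Text for Word Embedding

## Metamorphosis by Franz Kafka

Let's start off by selecting a dataset.

In this tutorial, we will use the text from the book [Metamorphosis](https://en.wikipedia.org/wiki/The_Metamorphosis) by [Franz Kafka](https://en.wikipedia.org/wiki/Franz_Kafka). No specific reason, other than it's short, I like it, and you may like it too. I expect it's one of those classics that most students have to read in school.

The full text of Metamorphosis is available for free from Project Gutenberg.

* [Metamorphosis by Franz Kafka on Project Gutenberg](http://www.gutenberg.org/ebooks/5200)

You can download the plain text version here:

* [Metamorphosis by Franz Kafka, Plain Text UTF-8](http://www.gutenberg.org/cache/epub/5200/pg5200.txt) (may need to load the page twice).

Download the file and place it in your current working directory with the file name "_metamorphosis.txt_".

The file contains header and footer information that we are not interested in, specifically copyright and license information. Open the file, delete the header and footer information, and save the file as "_metamorphosis_clean.txt_".

The start of the clean file should look like:

> One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.

The file should end with:

> And, as if in confirmation of their new dreams and good intentions, as soon as they reached their destination Grete was the first to get up and stretch out her young body.

Poor Gregor...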
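The manual edit described above can also be roughed out in code. The sketch below is only an illustration: the function name `strip_gutenberg_boilerplate` is made up for this example, the marker strings `*** START OF` and `*** END OF` are assumptions about how the download is laid out (check them against your own copy), and the sample string is a made-up miniature of the file rather than the real text.

```python
def strip_gutenberg_boilerplate(text,
                                start_marker='*** START OF',
                                end_marker='*** END OF'):
    """Return the text between the start and end marker lines.

    The default marker strings are assumptions; if a marker is not
    found, the text is kept from the beginning or up to the end.
    """
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith(start_marker):
            start = i + 1  # body begins on the line after the start marker
        elif line.startswith(end_marker):
            end = i        # body stops on the line before the end marker
            break
    return '\n'.join(lines[start:end]).strip()

# made-up miniature of the downloaded file, for illustration only
sample = (
    'Header text we do not want\n'
    '*** START OF THIS PROJECT GUTENBERG EBOOK METAMORPHOSIS ***\n'
    '\n'
    'One morning, when Gregor Samsa woke from troubled dreams...\n'
    '\n'
    '*** END OF THIS PROJECT GUTENBERG EBOOK METAMORPHOSIS ***\n'
    'License text we do not want\n'
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

Treat this as a first pass only. Lines such as the title, the translator credit, and the first section marker may still sit between the markers, so open the result and check it against the expected start and end before saving it as "_metamorphosis_clean.txt_".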
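## Text Cleaning Is Task Specific

After actually getting a hold of your text data, the first step in cleaning it up is to have a strong idea of what you're trying to achieve, and in that context review your text to see what exactly might help.

Take a moment to look at the text. What do you notice?

Here's what I see:

* It's plain text, so there is no markup to parse (yay!).
* The translation of the original German uses UK English (e.g. "_travelling_").
* The lines are artificially wrapped with new lines at about 70 characters (meh).
* There are no obvious typos or spelling mistakes.
* There's punctuation like commas, apostrophes, quotes, question marks, and more.
* There are hyphenated descriptions like "_armour-like_".
* There's a lot of use of the em dash ("-") to continue sentences (maybe replace with commas?).
* There are names (e.g. "_Mr. Samsa_").
* There do not appear to be numbers that require handling (e.g. 1999).
* There are section markers (e.g. "II" and "III"), and we have removed the first "I".

I'm sure there is a lot more going on to the trained eye.

We are going to look at general text cleaning steps in this tutorial.

Nevertheless, consider some possible objectives we may have when working with this text document.

For example:

* If we were interested in developing a [Kafkaesque](http://www.thefreedictionary.com/Kafkaesk) language model, we may want to keep all of the case, quotes, and other punctuation in place.
* If we were interested in classifying documents as "_Kafka_" and "_Not Kafka_", maybe we would want to strip case, punctuation, and even trim words back to their stem.

Use your task as the lens by which to choose how to ready your text data.

## Manual Tokenization

Text cleaning is hard, but the text we have chosen to work with is pretty clean already.

We could just write some Python code to clean it up manually, and this is a good exercise for those simple problems that you encounter. Tools like regular expressions and splitting strings can get you a long way.

### 1. Load Data

Let's load the text data so that we can work with it.

The text is small and will load quickly and easily fit into memory. This will not always be the case, and you may need to write code to memory map the file. Tools like NLTK (covered in the next section) will make working with large files much easier.

We can load the entire "_metamorphosis_clean.txt_" into memory as follows:

```py
# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
```

Running the example loads the whole file into memory, ready to work with.

### 2. Split by Whitespace

Clean text often means a list of words or tokens that we can work with in our machine learning models.

This means converting the raw text into a list of words and saving it again.

A very simple way to do this would be to split the document by white space, including " ", new lines, tabs, and more. We can do this in Python with the split() function on the loaded string.

```py
# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words by white space
words = text.split()
print(words[:100])
```

Running the example splits the document into a long list of words and prints the first 100 for us to review.

We can see that punctuation is preserved (e.g. "_wasn't_" and "_armour-like_"), which is nice. We can also see that end-of-sentence punctuation is kept with the last word (e.g. "_thought._"), which is not great.

```py
['One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'He', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready',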
'to', 'slide', 'off', 'any', 'moment.', 'His', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', '"What\'s', 'happened', 'to', 'me?"', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.', 'His', 'room,', 'a', 'proper', 'human']
```

### 3. Select Words

Another approach might be to use the regex module (re) and split the document into words by selecting for strings of alphanumeric characters (a-z, A-Z, 0-9 and '_').

For example:

```py
# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split based on words only
import re
words = re.split(r'\W+', text)
print(words[:100])
```

Again, running the example we can see that we get our list of words. This time, we can see that "_armour-like_" is now two words, "_armour_" and "_like_" (fine), but contractions like "_What's_" are also two words, "_What_" and "_s_" (not great).

```py
['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armour', 'like', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'What', 's', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasn', 't', 'a', 'dream', 'His', 'room']
```

### 4. Split by Whitespace and Remove Punctuation

Note: This example was written for Python 3.

We may want the words, but without the punctuation like commas and quotes. We may also want to keep contractions together.

One way would be to split the document into words by white space (as in "_2. Split by Whitespace_"), then use string translation to replace all punctuation with nothing (e.g. remove it).

Python provides a constant called _string.punctuation_ that provides a great list of punctuation characters. For example:

```py
import string
print(string.punctuation)
```

Results in:

```py
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
```

Python offers a function called [translate()](https://docs.python.org/3/library/stdtypes.html#str.translate) that will map one set of characters to another.

We can use the function
[maketrans()](https://docs.python.org/3/library/stdtypes.html#str.maketrans) to create a mapping table. We can create an empty mapping table, but the third argument of this function allows us to list all of the characters to remove during the translation process. For example:

```py
table = str.maketrans('', '', string.punctuation)
```

We can put all of this together: load the text file, split it into words by white space, then translate each word to remove the punctuation.

```py
# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words by white space
words = text.split()
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]
print(stripped[:100])
```

We can see that this has had the desired effect, mostly.

Contractions like "_What's_" have become "_Whats_", but "_armour-like_" has become "_armourlike_".

```py
['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armourlike', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'Whats', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasnt', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human']
```

If you know anything about regex, then you know things can get complex from here.

### 5. Normalizing Case

It is common to convert all words to one case.

This means that the vocabulary will shrink in size, but some distinctions are lost (e.g. "_Apple_" the company vs "_apple_" the fruit is a commonly used example).

We can convert all words to lowercase by calling the lower() function on each word.

For example:

```py
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words by white space
words = text.split()
# convert to lower case
words = [word.lower() for word in words]
print(words[:100])
```

Running the example, we can see that all words are now lowercase.

```py
['one', 'morning,', 'when', 'gregor', 'samsa', 'woke',
'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'he', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'the', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'his', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', '"what\'s', 'happened', 'to', 'me?"', 'he', 'thought.', 'it', "wasn't", 'a', 'dream.', 'his', 'room,', 'a', 'proper', 'human']
```

### Note

Cleaning text is really hard, problem specific, and full of tradeoffs.

Remember, simple is better.

Simpler text data, simpler models, smaller vocabularies. You can always make things more complex later to see if it results in better model skill.

Next, we'll look at some of the tools in the NLTK library that offer more than simple string splitting.

## Tokenization and Cleaning with NLTK

The [Natural Language Toolkit](http://www.nltk.org/), or NLTK for short, is a Python library written for working with and modeling text.

It provides good tools for loading and cleaning text that we can use to get our data ready for machine learning and deep learning algorithms.

### 1. Install NLTK

You can install NLTK using your favorite package manager, such as pip:

```sh
sudo pip install -U nltk
```

After installation, you will need to install the data used with the library, including a great set of documents that you can use later for testing other tools in NLTK.

There are a few ways to do this, such as from within a script:

```py
import nltk
nltk.download()
```

Or from the command line:

```sh
python -m nltk.downloader all
```

For more help installing and setting up NLTK, see:

* [Installing NLTK](http://www.nltk.org/install.html)
* [Installing NLTK Data](http://www.nltk.org/data.html)

### 2. Split into Sentences

A useful first step is to split the text into sentences.

Some modeling tasks prefer input to be in the form of paragraphs or sentences, such as word2vec. You could first split your text into sentences, split each sentence into words, then save each sentence to file, one per line.

NLTK provides the _sent_tokenize()_ function to split text into sentences.

The example below loads the "_metamorphosis_clean.txt_" file into memory, splits it into sentences, and prints the first sentence.

```py
# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into sentences
from nltk import sent_tokenize
sentences = sent_tokenize(text)
print(sentences[0])
```

Running the example, we can see that although the document is split into sentences, each sentence still preserves the new line from the artificial wrap of the lines in the original document.

> One morning, when Gregor Samsa woke from troubled dreams, he found
> himself transformed in his bed into a horrible vermin.

### 3. Split into Words

NLTK provides a function called _word_tokenize()_
for splitting strings into tokens (nominally words).

It splits tokens based on white space and punctuation. For example, commas and periods are taken as separate tokens. Contractions are split apart (e.g. "_What's_" becomes "_What_" and "_'s_"). Quotes are kept, and so on.

For example:

```py
# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
print(tokens[:100])
```

Running the code, we can see that punctuation marks are now tokens that we could then decide to specifically filter out.

```py
['One', 'morning', ',', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', ',', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', '.', 'He', 'lay', 'on', 'his', 'armour-like', 'back', ',', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', ',', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', '.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', '.', 'His', 'many', 'legs', ',', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', ',', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', '.', '``', 'What', "'s", 'happened', 'to']
```

### 4. Filter Out Punctuation

We can filter out all tokens that we are not interested in, such as all standalone punctuation.

This can be done by iterating over all tokens and only keeping those tokens that are all alphabetic. Python has the function [isalpha()](https://docs.python.org/3/library/stdtypes.html#str.isalpha) that can be used. For example:

```py
# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]
print(words[:100])
```

Running the example, you can see that not only punctuation tokens, but also examples like "_armour-like_" and "_'s_", were filtered out.

```py
['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little',
'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'What', 'happened', 'to', 'me', 'he', 'thought', 'It', 'was', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human', 'room']
```

### 5. Filter Out Stop Words (and Pipeline)

[Stop words](https://en.wikipedia.org/wiki/Stop_words) are those words that do not contribute to the deeper meaning of the phrase.

They are the most common words, such as "_the_", "_a_", and "_is_".

For some applications like document classification, it may make sense to remove stop words.

NLTK provides a list of commonly agreed upon stop words for a variety of languages, such as English. They can be loaded as follows:

```py
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)
```

You can see the full list as follows:

```py
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn',
'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']
```

You can see that they are all lower case and have punctuation removed.

You could compare your tokens to the stop words and filter them out, but you must ensure that your text is prepared the same way.

Let's demonstrate this with a small pipeline of text preparation, including:

1. Load the raw text.
2. Split into tokens.
3. Convert to lowercase.
4. Remove punctuation from each token.
5. Filter out remaining tokens that are not alphabetic.
6. Filter out tokens that are stop words.

```py
# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# convert to lower case
tokens = [w.lower() for w in tokens]
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
print(words[:100])
```

Running this example, we can see that in addition to all of the other transforms, stop words like "_a_" and "_to_" have been removed.

I note that we are still left with tokens like "_nt_". The rabbit hole is deep; there's always more we can do.

```py
['one', 'morning', 'gregor', 'samsa', 'woke', 'troubled', 'dreams', 'found', 'transformed', 'bed', 'horrible', 'vermin', 'lay', 'armourlike', 'back', 'lifted', 'head', 'little', 'could', 'see', 'brown', 'belly', 'slightly', 'domed', 'divided', 'arches', 'stiff', 'sections', 'bedding', 'hardly', 'able', 'cover', 'seemed', 'ready', 'slide', 'moment', 'many', 'legs', 'pitifully', 'thin', 'compared', 'size', 'rest', 'waved', 'helplessly', 'looked', 'happened', 'thought', 'nt', 'dream', 'room', 'proper', 'human', 'room', 'although', 'little', 'small', 'lay', 'peacefully', 'four', 'familiar', 'walls', 'collection', 'textile', 'samples', 'lay', 'spread', 'table', 'samsa', 'travelling', 'salesman', 'hung', 'picture', 'recently', 'cut', 'illustrated', 'magazine', 'housed', 'nice', 'gilded', 'frame', 'showed', 'lady', 'fitted', 'fur', 'hat', 'fur', 'boa', 'sat', 'upright', 'raising', 'heavy', 'fur', 'muff', 'covered', 'whole', 'lower', 'arm', 'towards', 'viewer']
```
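The point about preparing tokens the same way as the stop word list can be seen in isolation with a small, NLTK-free sketch. Everything here is illustrative: `STOP_WORDS` is a tiny hand-picked set standing in for `stopwords.words('english')`, the helper names `normalize` and `remove_stop_words` are made up for this example, and the token list is a hand-written sample rather than real tokenizer output.

```python
import string

# Hypothetical, tiny stand-in for NLTK's English stop word list
STOP_WORDS = {'the', 'a', 'to', 'is', 'it', 'he', 'his', 'and'}

def normalize(tokens):
    """Lowercase tokens, strip punctuation, and drop non-alphabetic leftovers."""
    table = str.maketrans('', '', string.punctuation)
    stripped = [t.lower().translate(table) for t in tokens]
    return [t for t in stripped if t.isalpha()]

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word set."""
    return [t for t in tokens if t not in stop_words]

# hand-written sample tokens, as a whitespace split might produce them
raw = ['The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it,', 'and', 'He', 'looked.']

naive = remove_stop_words(raw)                # filter without normalizing first
prepared = remove_stop_words(normalize(raw))  # normalize, then filter

print(naive)
print(prepared)
```

In the naive result, "The", "it," and "He" survive because they do not match the lowercase, punctuation-free entries in the set; after normalization they are filtered out as intended.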
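### 6. Stem Words

[Stemming](https://en.wikipedia.org/wiki/Stemming) refers to the process of reducing each word to its root or base.

For example, "_fishing_", "_fished_", and "_fisher_" all reduce to the stem "_fish_".

Some applications, like document classification, may benefit from stemming in order to both reduce the vocabulary and to focus on the sense or sentiment of a document rather than deeper meaning.

There are many stemming algorithms, although a popular and long-standing method is the Porter Stemming algorithm. This method is available in NLTK via the [PorterStemmer](https://tartarus.org/martin/PorterStemmer/) class.

For example:

```py
# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# stemming of words
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
print(stemmed[:100])
```

Running the example, you can see that words have been reduced to their stems, such as "_troubled_" becoming "_troubl_". You can also see that the stemming implementation has reduced most of the tokens to lowercase, likely for internal look-ups in word tables.

```py
['one', 'morn', ',', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubl', 'dream', ',', 'he', 'found', 'himself', 'transform', 'in', 'hi', 'bed', 'into', 'a', 'horribl', 'vermin', '.', 'He', 'lay', 'on', 'hi', 'armour-lik', 'back', ',', 'and', 'if', 'he', 'lift', 'hi', 'head', 'a', 'littl', 'he', 'could', 'see', 'hi', 'brown', 'belli', ',', 'slightli', 'dome', 'and', 'divid', 'by', 'arch', 'into', 'stiff', 'section', '.', 'the', 'bed', 'wa', 'hardli', 'abl', 'to', 'cover', 'it', 'and', 'seem', 'readi', 'to', 'slide', 'off', 'ani', 'moment', '.', 'hi', 'mani', 'leg', ',', 'piti', 'thin', 'compar', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', ',', 'wave', 'about', 'helplessli', 'as', 'he', 'look', '.', '``', 'what', "'s", 'happen', 'to']
```

There is a nice suite of stemming and lemmatization algorithms to choose from in NLTK, if reducing words to their root is something you need for your project.

## Additional Text Cleaning Considerations

We are only getting started.

Because the source text for this tutorial was reasonably clean to begin with, we skipped many concerns of text cleaning that you may need to deal with in your own project.

Here is a short list of additional considerations when cleaning text:

* Handling large documents and large collections of text documents that do not fit into memory.
* Extracting text from markup like HTML, PDF, or other structured document formats.
* Transliteration of characters from other languages into English.
* Decoding Unicode characters into a normalized form, such as UTF8.
* Handling of domain specific words, phrases, and acronyms.
* Handling or removing numbers, such as dates and amounts.
* Locating and correcting common typos and misspellings.
* ...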
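Two of the items above, Unicode normalization and number handling, can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: NFKC is just one of several normalization forms and may not be the right one for your data, the placeholder `<NUM>` and the helper names are made up for this example, and the sample sentence is invented.

```python
import re
import unicodedata

NUM_TOKEN = '<NUM>'  # hypothetical placeholder; choose whatever suits your vocabulary
NUMBER = re.compile(r'\d+(?:[.,]\d+)*')  # plain integers and simple decimals

def normalize_unicode(text):
    """Fold compatibility characters (ligatures, full-width forms) using NFKC."""
    return unicodedata.normalize('NFKC', text)

def replace_numbers(tokens, placeholder=NUM_TOKEN):
    """Replace tokens that consist entirely of a number with one placeholder."""
    return [placeholder if NUMBER.fullmatch(t) else t for t in tokens]

# invented sample: \uff11 is a full-width digit one, \ufb01 is the "fi" ligature
text = normalize_unicode('Chapter \uff11 has 2 \ufb01gures and 3.5 pages')
tokens = replace_numbers(text.split())
print(tokens)
```

Collapsing all numbers into one token shrinks the vocabulary but throws away the values themselves, so whether it helps depends on the task.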
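The list could go on.

Hopefully, you can see that getting truly clean text is impossible, and that we are really doing the best we can based on the time, resources, and knowledge we have.

The idea of "clean" is really defined by the specific task or concern of your project.

A pro tip is to continually review your tokens after every transform. I have tried to show that in this tutorial and I hope you take that to heart.

Ideally, you would save a new file after each transform so that you can spend time with all of the data in the new form. Things always jump out at you when you take the time to review your data.

Have you done some text cleaning before? What is your favorite pipeline of transforms?
Let me know in the comments below.

## Tips for Cleaning Text for Word Embedding

Recently, the field of natural language processing has been moving away from bag-of-words models and word encoding toward word embeddings.

The benefit of word embeddings is that they encode each word into a dense vector that captures something about its relative meaning within the training text.

This means that variations of words like case, spelling, punctuation, and so on will automatically be learned to be similar in the embedding space. In turn, this can mean that the amount of cleaning required from your text may be less, and perhaps quite different, compared to classical text cleaning.

For example, it may no longer make sense to stem words or remove punctuation.

Tomas Mikolov is one of the developers of word2vec, a popular word embedding method. He suggests that only very minimal text cleaning is required when learning a word embedding model.

Below is his response when pressed with the question of how to best prepare text data for word2vec.

> There is no universal answer. It all depends on what you plan to use the vectors for. In my experience, it is usually good to disconnect (or remove) punctuation from words, and sometimes also convert all characters to lowercase. One can also replace all numbers (possibly greater than some constant) with some single token.
>
> All these pre-processing steps aim to reduce the vocabulary size without removing any important content (which in some cases may not be true when you lowercase certain words, i.e. 'Bush' is different than 'bush', while 'Another' usually has the same sense as 'another'). The smaller the vocabulary is, the lower is the memory complexity, and the more robustly the parameters for the words are estimated. You also have to pre-process the test data in the same manner.
>
> ...
>
> In short, you will understand all this much better if you run experiments.

[Read the full thread on Google Groups](https://groups.google.com/d/msg/word2vec-toolkit/jPfyP6FoB94/tGzZxScO0GsJ).

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

* [Metamorphosis by Franz Kafka on Project Gutenberg](http://www.gutenberg.org/ebooks/5200)
* [nltk.tokenize package API](http://www.nltk.org/api/nltk.tokenize.html)
* [nltk.stem package API](http://www.nltk.org/api/nltk.stem.html)
* [Chapter 3: Processing Raw Text, Natural Language Processing with Python](http://www.nltk.org/book/ch03.html)

## Summary

In this tutorial, you discovered how to clean text for machine learning in Python.

Specifically, you learned:

* How to get started by developing your own very simple text cleaning tools.
* How to take a step up and use the more sophisticated methods in the NLTK library.
* How to prepare text when using modern text representation methods like word embeddings.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Do you have experience with cleaning text?
Please share your experiences in the comments below.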