Spam Filtering with Naive Bayes · Notes on "Machine Learning in Action"

Probability underlies many machine learning algorithms. Building decision trees earlier already used a small piece of it: count how many times a feature takes a particular value in the dataset, then divide by the total number of instances to get the probability that the feature takes that value. The earlier basic experiment implemented a simple naive Bayes classifier and classified text correctly. This section applies it to a practical scenario: spam filtering.

**Example: filtering spam with naive Bayes**

The previous section ([http://blog.csdn.net/liyuefeilong/article/details/48383175](http://blog.csdn.net/liyuefeilong/article/details/48383175)) used simple text files and extracted lists of strings from them. This example looks at one of the best-known applications of naive Bayes: email spam filtering. First, here is how the general framework applies to the problem:

* Collect data: text files are provided. Download them from [http://download.csdn.net/detail/liyuefeilong/9106481](http://download.csdn.net/detail/liyuefeilong/9106481), place them in the project directory, and unzip;
* Prepare data: parse the text files into token vectors;
* Analyze data: inspect the tokens to make sure parsing is correct;
* Train the algorithm: use the trainNaiveBayes(trainMatrix, classLabel) function built earlier;
* Test the algorithm: use naiveBayesClassify(vec2Classify, p0, p1, pBase), and build a new test function that computes the error rate over a set of documents;
* Use the algorithm: build a complete program that classifies a group of documents and prints the misclassified ones to the screen.

**1. Building the Bayes classifier**

This was implemented in the previous section. Before applying naive Bayes, we need the classifier training function from earlier. The complete code is as follows:

~~~
# -*- coding: utf-8 -*-
"""
Created on Tue Sep 08 16:12:55 2015

@author: Administrator
"""
import numpy as np


# Create the sample data; real samples may need preprocessing such as removing punctuation
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    listClass = [0, 1, 0, 1, 0, 1]  # 1: contains abusive words, 0: does not
    return postingList, listClass


# Collect the words of all documents into one list; set() removes duplicates
def createNonRepeatedList(data):
    vocList = set()
    for doc in data:
        vocList = vocList | set(doc)  # union of the two sets
    return list(vocList)


# Convert a list of words into a 0/1 vector over the vocabulary
def detectInput(vocList, inputStream):
    returnVec = [0] * len(vocList)  # all-zero list with the same length as vocList
    for word in inputStream:
        if word in vocList:
            returnVec[vocList.index(word)] = 1  # mark the word as present
        else:
            print("The word: %s is not in the vocabulary!" % word)
    return returnVec


# Naive Bayes classifier training function
def trainNaiveBayes(trainMatrix, classLabel):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pBase = sum(classLabel) / float(numTrainDocs)
    # Start counts at 1 and denominators at 2 to avoid probabilities of 0
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if classLabel[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p0 = np.log(p0Num / p0Denom)  # log-probabilities of each word for class 0
    p1 = np.log(p1Num / p1Denom)  # log-probabilities of each word for class 1
    return p0, p1, pBase


# Train on the sample data
loadData, dataLabel = loadDataSet()
vocList = createNonRepeatedList(loadData)
trainMat = []
for doc in loadData:
    trainMat.append(detectInput(vocList, doc))
p0, p1, pBase = trainNaiveBayes(trainMat, dataLabel)
# print("trainMat : ")
# print(trainMat)


# test the algorithm
def naiveBayesClassify(vec2Classify, p0, p1, pBase):
    p0res = np.sum(vec2Classify * p0) + np.log(1 - pBase)
    p1res = np.sum(vec2Classify * p1) + np.log(pBase)
    if p1res > p0res:
        return 1
    else:
        return 0


def testNaiveBayes():
    loadData, classLabel = loadDataSet()
    vocList = createNonRepeatedList(loadData)
    trainMat = []
    for doc in loadData:
        trainMat.append(detectInput(vocList, doc))
    p0, p1, pBase = trainNaiveBayes(np.array(trainMat), np.array(classLabel))
    testInput = ['love', 'my', 'dalmation']
    thisDoc = np.array(detectInput(vocList, testInput))
    print(testInput, 'classified as:', naiveBayesClassify(thisDoc, p0, p1, pBase))
    testInput = ['stupid', 'garbage']
    thisDoc = np.array(detectInput(vocList, testInput))
    print(testInput, 'classified as:', naiveBayesClassify(thisDoc, p0, p1, pBase))
~~~

**2. Preparing the data: splitting text**

First, write a Python function textParse() that parses the email files and breaks a document into individual words. The emails fall into two groups: normal mail lives under email/ham/ and spam under email/spam/. The code below takes the text that has been read in, splits it into tokens, converts every token to lowercase, and returns only the strings longer than two characters. The splitting relies on a few tricks, including a regular expression and lowercasing with .lower().

~~~
import re


def textParse(bigString):  # parse text with a regular expression
    listOfTokens = re.split(r'\W+', bigString)  # split on runs of non-word characters
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]
~~~

The figure below shows the text split with different processing. The first output treats some punctuation as part of the words. The second uses a regular expression, which removes the punctuation but, with no limit on string length, leaves empty strings. The third adds the length filter and converts all letters to lowercase.

![](https://box.kancloud.cn/2016-01-05_568b3836eafc7.jpg)

**3. Testing the algorithm: cross validation with naive Bayes**

This part integrates the text parser into a complete classifier:

~~~
# Filter spam
def spamTest():
    docList = []; classList = []; fullText = []
    for i in range(1, 26):  # load and parse the text files: 25 normal emails and 25 spam emails
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createNonRepeatedList(docList)
    trainingSet = list(range(50)); testSet = []
    for i in range(10):  # randomly build the test set, which holds 10 samples
        randIndex = int(np.random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del trainingSet[randIndex]
    trainMat = []; trainClasses = []  # training vectors and their labels
    for docIndex in trainingSet:
        trainMat.append(detectInput(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNaiveBayes(np.array(trainMat), np.array(trainClasses))
    errorCount = 0
    for docIndex in testSet:  # classify the test set
        wordVector = detectInput(vocabList, docList[docIndex])
        if naiveBayesClassify(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:  # misclassified
            errorCount += 1
    print('the error rate is:', float(errorCount) / len(testSet))  # print the classification error
~~~

![](https://box.kancloud.cn/2016-01-05_568b38371e6d9.jpg)

**4. Summary**
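The code above classifies 10 randomly selected emails and reports the classification error rate. Over repeated runs the average error rate was 6%, and the errors were spam messages misjudged as normal mail. Misjudging spam as normal mail is preferable to misjudging normal mail as spam, and increasing the number of training samples can lower the error rate further.

The train/test procedure picks indices at random from the full dataset, adds them to the test set, and removes them from the training set. Randomly choosing one part of the data as the test set and using the remainder as the training set is called hold-out cross validation. To estimate the classifier's error rate more precisely, repeat the procedure several times and average the error rates.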
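The averaging step described above can be sketched as follows. This is a minimal, standard-library-only illustration rather than the code used for the results in this post: trainNB, classifyNB, holdOutErrorRate and averageErrorRate are hypothetical names introduced here, the smoothing copies the +1 / +2.0 initialisation of trainNaiveBayes, and the vocabulary is built from all documents as spamTest does. It assumes labels are 0 or 1 and that every training split contains both classes.

```python
import math
import random


def trainNB(trainDocs, trainLabels, vocab):
    """Hypothetical helper: set-of-words naive Bayes with +1 / +2.0 smoothing."""
    index = {word: i for i, word in enumerate(vocab)}
    counts = {0: [1.0] * len(vocab), 1: [1.0] * len(vocab)}
    totals = {0: 2.0, 1: 2.0}
    for doc, label in zip(trainDocs, trainLabels):
        for word in set(doc):  # each word counts once per document
            if word in index:
                counts[label][index[word]] += 1
                totals[label] += 1
    logProb = {c: [math.log(n / totals[c]) for n in counts[c]] for c in (0, 1)}
    pBase = sum(trainLabels) / float(len(trainLabels))  # share of class-1 documents
    return index, logProb, pBase


def classifyNB(doc, model):
    """Hypothetical helper: return 1 if the class-1 log score is larger, else 0."""
    index, logProb, pBase = model
    present = {index[word] for word in doc if word in index}
    score0 = sum(logProb[0][i] for i in present) + math.log(1 - pBase)
    score1 = sum(logProb[1][i] for i in present) + math.log(pBase)
    return 1 if score1 > score0 else 0


def holdOutErrorRate(docs, labels, testSize, rng):
    """One hold-out split: testSize random documents for testing, the rest for training."""
    indices = list(range(len(docs)))
    rng.shuffle(indices)
    testIdx, trainIdx = indices[:testSize], indices[testSize:]
    vocab = sorted({word for doc in docs for word in doc})  # built from all documents
    model = trainNB([docs[i] for i in trainIdx], [labels[i] for i in trainIdx], vocab)
    errors = sum(1 for i in testIdx if classifyNB(docs[i], model) != labels[i])
    return errors / float(testSize)


def averageErrorRate(docs, labels, testSize=10, numTrials=10, seed=None):
    """Repeat the hold-out split numTrials times and return the mean error rate."""
    rng = random.Random(seed)
    rates = [holdOutErrorRate(docs, labels, testSize, rng) for _ in range(numTrials)]
    return sum(rates) / numTrials
```

With docList and classList built as in spamTest, a call such as averageErrorRate(docList, classList, testSize=10, numTrials=10) would give the averaged estimate. Passing a seed makes the splits reproducible, which helps when comparing changes to the tokenizer or the smoothing.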