How to Implement Naive Bayes From Scratch in Python · Machine Learning Mastery blog post translation

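# How to Implement Naive Bayes From Scratch in Python

> Original: [https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/](https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/)

The Naive Bayes algorithm is simple and effective, and it should be one of the first methods you try on a classification problem.

In this tutorial you will learn how the Naive Bayes algorithm works and how to implement it from scratch in Python.

* **Update**: Check out the follow-up on tips for using the Naive Bayes algorithm: "[Better Naive Bayes: 12 Tips To Get The Most From The Naive Bayes Algorithm](http://machinelearningmastery.com/better-naive-bayes/ "Better Naive Bayes: 12 Tips To Get The Most From The Naive Bayes Algorithm")".
* **Update March/2018**: Added an alternate link to download the dataset, because the original was taken down.

[![naive bayes classifier](img/e6a92a6bcab0d5c51968019190f71f21.jpg)](https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2014/12/naive-bayes-classifier.jpg)

Naive Bayes Classifier. Photo by [Matt Buck](https://www.flickr.com/photos/mattbuck007/3676624894), some rights reserved.

## About Naive Bayes

The Naive Bayes algorithm is an intuitive method that uses the probabilities of each attribute belonging to each class to make a prediction. It is the supervised learning approach you would come up with if you wanted to model a predictive modeling problem probabilistically.

Naive Bayes simplifies the calculation of probabilities by assuming that the probability of each attribute belonging to a given class value is independent of all other attributes. This is a strong assumption, but it results in a fast and effective method.

The probability of a class value given a value of an attribute is called the conditional probability. By multiplying the conditional probabilities together for each attribute for a given class value, we obtain the probability of a data instance belonging to that class.

To make a prediction, we calculate the probability of the instance belonging to each class and select the class value with the highest probability.

Naive Bayes is often described using categorical data because it is easy to describe and calculate using ratios. A more useful version of the algorithm for our purposes supports numeric attributes and assumes the values of each numeric attribute are normally distributed (fall somewhere on a bell curve). Again, this is a strong assumption, but it still gives robust results.

## Predict the Onset of Diabetes

The test problem we will use in this tutorial is the [Pima Indians Diabetes problem](https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes).

This problem comprises 768 observations of medical details for Pima Indian patients. The records describe instantaneous measurements taken from the patient, such as their age, the number of times pregnant, and blood workup. All patients are women aged 21 or older. All attributes are numeric, and their units vary from attribute to attribute.

Each record has a class value that indicates whether the patient suffered an onset of diabetes within 5 years of when the measurements were taken (1) or not (0).

This is a standard dataset that has been studied a lot in the machine learning literature. A good prediction accuracy is 70%-76%.

Below is a sample from the _pima-indians-diabetes.data.csv_ file to get a sense of the data we will be working with (update: [download from here](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv)).

```py
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
```

## Naive Bayes Algorithm Tutorial

This tutorial is broken down into the following steps:

1. **Handle Data**: Load the data from a CSV file and split it into training and test datasets.
2. **Summarize Data**: Summarize the properties of the training dataset so that we can calculate probabilities and make predictions.
3. **Make a Prediction**: Use the summaries of the dataset to generate a single prediction.
4. **Make Predictions**: Generate predictions given a test dataset and a summarized training dataset.
5. **Evaluate Accuracy**: Evaluate the accuracy of the predictions made for a test dataset as the percentage correct out of all predictions made.
6. **Tie it Together**: Use all of the code elements to present a complete and standalone implementation of the Naive Bayes algorithm.

### 1. Handle Data

The first thing we need to do is load our data file. The data is in CSV format without a header line or any quotes. We can open the file with the open function and read the data lines using the reader function in the csv module.

We also need to convert the attributes that were loaded as strings into numbers so that we can work with them. Below is the **loadCsv()** function for loading the Pima Indians dataset.

```py
import csv

def loadCsv(filename):
    # Read every row of the CSV file, then convert each field to a float.
    with open(filename, "r", newline="") as f:
        dataset = list(csv.reader(f))
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset
```

We can test this function by loading the Pima Indians dataset and printing the number of data instances that were loaded.

```py
filename = 'pima-indians-diabetes.data.csv'
dataset = loadCsv(filename)
print('Loaded data file {0} with {1} rows'.format(filename, len(dataset)))
```

Running this test, you should see something like:

```py
Loaded data file pima-indians-diabetes.data.csv with 768 rows
```

Next we need to split the data into a training dataset that Naive Bayes can use to make predictions and a test dataset that we can use to evaluate the accuracy of the model. We split the dataset randomly into train and test datasets with a ratio of 67% train and 33% test (a common ratio for testing an algorithm on a dataset).

Below is the **splitDataset()** function, which splits a given dataset according to a given split ratio.

```py
import random

def splitDataset(dataset, splitRatio):
    trainSize = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    # Move randomly chosen rows into the training set until it is full;
    # whatever remains in the copy becomes the test set.
    while len(trainSet) < trainSize:
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    return [trainSet, copy]
```

We can test this by defining a mock dataset with 5 instances, splitting it into training and test datasets, and printing them out to see which data instances ended up where.

```py
dataset = [[1], [2], [3], [4], [5]]
splitRatio = 0.67
train, test = splitDataset(dataset, splitRatio)
print('Split {0} rows into train with {1} and test with {2}'.format(len(dataset), train, test))
```

Running this test, you should see something like:

```py
Split 5 rows into train with [[4], [3], [5]] and test with [[1], [2]]
```

### 2. Summarize Data

The Naive Bayes model consists of a summary of the data in the training dataset. This summary is then used when making predictions.

The summary of the training data involves the mean and the standard deviation of each attribute, by class value. For example, if there are two class values and 7 numerical attributes, then we need a mean and standard deviation for each attribute (7) and class value (2) combination, that is, 14 attribute summaries.

These are required when making predictions to calculate the probability of specific attribute values belonging to each class value.

We can break the preparation of this summary data down into the following sub-tasks:

1. Separate Data By Class
2. Calculate Mean
3. Calculate Standard Deviation
4. Summarize Dataset
5. Summarize Attributes By Class

#### Separate Data By Class

The first task is to separate the training dataset instances by class value so that we can calculate statistics for each class. We can do that by creating a map of each class value to a list of the instances that belong to that class, and sorting the entire dataset of instances into the appropriate lists.

The **separateByClass()** function below does just this.

```py
def separateByClass(dataset):
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        # The last element of each row is its class value.
        if vector[-1] not in separated:
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated
```

You can see that the function assumes that the last attribute (-1) is the class value. The function returns a map of class values to lists of data instances.

We can test this function with some sample data, as follows:

```py
dataset = [[1,20,1], [2,21,0], [3,22,1]]
separated = separateByClass(dataset)
print('Separated instances: {0}'.format(separated))
```

Running this test, you should see something like:

```py
Separated instances: {1: [[1, 20, 1], [3, 22, 1]], 0: [[2, 21, 0]]}
```

#### Calculate Mean

We need to calculate the mean of each attribute for a class value. The mean is the central tendency of the data, and we will use it as the middle of our Gaussian distribution when calculating probabilities.

We also need to calculate the standard deviation of each attribute for a class value. The standard deviation describes the spread of the data, and we will use it to characterize the expected spread of each attribute in our Gaussian distribution when calculating probabilities.

The standard deviation is calculated as the square root of the variance. The variance is calculated as the average of the squared differences of each attribute value from the mean. Note that we use the N-1 method, which subtracts 1 from the number of attribute values when calculating the variance.

```py
import math

def mean(numbers):
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    # Sample variance: divide by N-1 rather than N.
    variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)
```

We can test this by taking the mean and standard deviation of the numbers from 1 to 5.

```py
numbers = [1,2,3,4,5]
print('Summary of {0}: mean={1}, stdev={2}'.format(numbers, mean(numbers), stdev(numbers)))
```

Running this test, you should see something like the following (the exact number of digits printed depends on your Python version):

```py
Summary of [1, 2, 3, 4, 5]: mean=3.0, stdev=1.58113883008
```

#### Summarize Dataset

Now we have the tools to summarize a dataset. For a given list of instances (for a class value), we can calculate the mean and the standard deviation of each attribute.

The zip function groups the values of each attribute across our data instances into their own lists so that we can compute the mean and standard deviation of each attribute.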
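To see why `zip(*dataset)` performs that grouping, here is a minimal sketch on a made-up three-row dataset. The variable names `rows` and `columns` are illustrative only and are not part of the tutorial code. The `*` unpacks the rows as separate arguments, and `zip` then pairs up the values that share a position, which amounts to transposing rows into columns.

```python
# Toy dataset: two attributes followed by a class value (illustrative values).
rows = [[1, 20, 0], [2, 21, 1], [3, 22, 0]]

# zip(*rows) is equivalent to zip(rows[0], rows[1], rows[2]):
# it yields one tuple per column.
columns = list(zip(*rows))

print(columns)  # [(1, 2, 3), (20, 21, 22), (0, 1, 0)]
```

The last tuple holds the class values rather than an attribute, which is why a per-attribute summary built this way needs to discard its final entry.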
The **summarize()** function is shown below.

```py
def summarize(dataset):
    # zip(*dataset) transposes rows into per-attribute columns.
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    # Drop the summary of the last column, which holds the class value.
    del summaries[-1]
    return summaries
```

We can test this **summarize()** function with some test data whose first and second attributes have markedly different means.

```py
dataset = [[1,20,0], [2,21,1], [3,22,0]]
summary = summarize(dataset)
print('Attribute summaries: {0}'.format(summary))
```

Running this test, you should see something like:

```py
Attribute summaries: [(2.0, 1.0), (21.0, 1.0)]
```

#### Summarize Attributes By Class

We can pull it all together by first separating our training dataset into instances grouped by class, then calculating the summaries for each attribute.

```py
def summarizeByClass(dataset):
    separated = separateByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        summaries[classValue] = summarize(instances)
    return summaries
```

We can test this **summarizeByClass()** function with a small test dataset.

```py
dataset = [[1,20,1], [2,21,0], [3,22,1], [4,22,0]]
summary = summarizeByClass(dataset)
print('Summary by class value: {0}'.format(summary))
```

Running this test, you should see something like:

```py
Summary by class value: {1: [(2.0, 1.4142135623730951), (21.0, 1.4142135623730951)], 0: [(3.0, 1.4142135623730951), (21.5, 0.7071067811865476)]}
```

### 3. Make Prediction

We are now ready to make predictions using the summaries prepared from our training data. Making a prediction involves calculating the probability that a given data instance belongs to each class, then selecting the class with the largest probability as the prediction.

We can divide this part into the following tasks:

1. Calculate Gaussian Probability Density Function
2. Calculate Class Probabilities
3. Make a Prediction
4. Estimate Accuracy

#### Calculate Gaussian Probability Density Function

We can use a Gaussian function to estimate the probability of a given attribute value, given the known mean and standard deviation of the attribute estimated from the training data.

Because the attribute summaries were prepared for each attribute and class value, the result is the conditional probability of a given attribute value given a class value.

See the references for the details of the Gaussian probability density function. In summary, we plug our known details into the Gaussian (attribute value, mean and standard deviation) and read off the likelihood of our attribute value belonging to the class.

In the **calculateProbability()** function, we calculate the exponent first, then calculate the main division. This lets us fit the equation nicely on two lines.

```py
import math

def calculateProbability(x, mean, stdev):
    exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent
```

We can test this with some sample data, as follows.

```py
x = 71.5
mean = 73
stdev = 6.2
probability = calculateProbability(x, mean, stdev)
print('Probability of belonging to this class: {0}'.format(probability))
```

Running this test, you should see something like the following (the exact number of digits printed depends on your Python version):

```py
Probability of belonging to this class: 0.0624896575937
```

#### Calculate Class Probabilities

Now that we can calculate the probability of an attribute belonging to a class, we can combine the probabilities of all of the attribute values for a data instance and come up with a probability of the entire data instance belonging to the class.

We combine probabilities by multiplying them together. In the **calculateClassProbabilities()** function below, the probability of a given data instance is calculated by multiplying together the attribute probabilities for each class. The result is a map of class values to probabilities.

```py
def calculateClassProbabilities(summaries, inputVector):
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        # Multiply in the Gaussian density of each attribute for this class.
        for i in range(len(classSummaries)):
            mean, stdev = classSummaries[i]
            x = inputVector[i]
            probabilities[classValue] *= calculateProbability(x, mean, stdev)
    return probabilities
```

We can test the **calculateClassProbabilities()** function.

```py
summaries = {0:[(1, 0.5)], 1:[(20, 5.0)]}
inputVector = [1.1, '?']
probabilities = calculateClassProbabilities(summaries, inputVector)
print('Probabilities for each class: {0}'.format(probabilities))
```

Running this test, you should see something like:

```py
Probabilities for each class: {0: 0.7820853879509118, 1: 6.298736258150442e-05}
```

#### Make a Prediction

Now that we can calculate the probability of a data instance belonging to each class value, we can look for the largest probability and return the associated class.

The **predict()** function does just that.

```py
def predict(summaries, inputVector):
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    # Keep the class with the highest probability seen so far.
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel
```

We can test the **predict()** function as follows:

```py
summaries = {'A':[(1, 0.5)], 'B':[(20, 5.0)]}
inputVector = [1.1, '?']
result = predict(summaries, inputVector)
print('Prediction: {0}'.format(result))
```

Running this test, you should see something like:

```py
Prediction: A
```

### 4. Make Predictions

Finally, we can estimate the accuracy of the model by making predictions for each data instance in our test dataset. The **getPredictions()** function does this and returns a list of predictions, one for each test instance.

```py
def getPredictions(summaries, testSet):
    predictions = []
    for i in range(len(testSet)):
        result = predict(summaries, testSet[i])
        predictions.append(result)
    return predictions
```

We can test the **getPredictions()** function.

```py
summaries = {'A':[(1, 0.5)], 'B':[(20, 5.0)]}
testSet = [[1.1, '?'], [19.1, '?']]
predictions = getPredictions(summaries, testSet)
print('Predictions: {0}'.format(predictions))
```

Running this test, you should see something like:

```py
Predictions: ['A', 'B']
```

### 5. Get Accuracy

The predictions can be compared to the class values in the test dataset, and the classification accuracy can be calculated as a ratio between 0% and 100%. The **getAccuracy()** function calculates this accuracy.

```py
def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        # The last element of each test row is the true class value.
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0
```

We can test the **getAccuracy()** function using the sample code below.

```py
testSet = [[1,1,1,'a'], [2,2,2,'a'], [3,3,3,'b']]
predictions = ['a', 'a', 'a']
accuracy = getAccuracy(testSet, predictions)
print('Accuracy: {0}'.format(accuracy))
```

Running this test, you should see something like:

```py
Accuracy: 66.6666666667
```

### 6. Tie it Together

Finally, we need to tie it all together. Below is the complete code listing for Naive Bayes implemented from scratch in Python.

```py
# Example of Naive Bayes implemented from Scratch in Python
import csv
import random
import math

def loadCsv(filename):
    with open(filename, "r", newline="") as f:
        dataset = list(csv.reader(f))
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset

def splitDataset(dataset, splitRatio):
    trainSize = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    while len(trainSet) < trainSize:
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    return [trainSet, copy]

def separateByClass(dataset):
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        if vector[-1] not in separated:
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated

def mean(numbers):
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)

def summarize(dataset):
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]
    return summaries

def summarizeByClass(dataset):
    separated = separateByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        summaries[classValue] = summarize(instances)
    return summaries

def calculateProbability(x, mean, stdev):
    exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

def calculateClassProbabilities(summaries, inputVector):
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            mean, stdev = classSummaries[i]
            x = inputVector[i]
            probabilities[classValue] *= calculateProbability(x, mean, stdev)
    return probabilities

def predict(summaries, inputVector):
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel

def getPredictions(summaries, testSet):
    predictions = []
    for i in range(len(testSet)):
        result = predict(summaries, testSet[i])
        predictions.append(result)
    return predictions

def getAccuracy(testSet, predictions):
    correct = 0
    for i in range(len(testSet)):
        if testSet[i][-1] == predictions[i]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

def main():
    filename = 'pima-indians-diabetes.data.csv'
    splitRatio = 0.67
    dataset = loadCsv(filename)
    trainingSet, testSet = splitDataset(dataset, splitRatio)
    print('Split {0} rows into train={1} and test={2} rows'.format(len(dataset), len(trainingSet), len(testSet)))
    # prepare model
    summaries = summarizeByClass(trainingSet)
    # test model
    predictions = getPredictions(summaries, testSet)
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: {0}%'.format(accuracy))

main()
```

Running the example provides output like the following. The split is random, so your accuracy will vary from run to run.

```py
Split 768 rows into train=514 and test=254 rows
Accuracy: 76.3779527559%
```

## Implementation Extensions

This section gives you ideas for extensions that you could apply and investigate with the Python code you have implemented as part of this tutorial.

You have implemented your own version of Gaussian Naive Bayes in Python from scratch. You can extend the implementation further.

* **Calculate Class Probabilities**: Update the example to summarize the probabilities of a data instance belonging to each class as a ratio. This can be calculated as the probability of a data instance belonging to one class, divided by the sum of the probabilities of the data instance belonging to each class. For example, given a probability of 0.02 for class A and 0.001 for class B, the likelihood of the instance belonging to class A is (0.02 / (0.02 + 0.001)) * 100, which is about 95.24%.
* **Log Probabilities**: The conditional probabilities for each class given an attribute value are small. When they are multiplied together they result in very small values, which can lead to floating point underflow (numbers too small to represent in Python). A common fix for this is to combine the logs of the probabilities instead. Research and implement this improvement.
* **Nominal Attributes**: Update the implementation to support nominal attributes. This is much the same, except that the summary information you collect for each attribute is the ratio of category values for each class. Dive into the references for more information.
* **Different Density Function** (_bernoulli_ or _multinomial_): We have looked at Gaussian Naive Bayes, but you can also look at other distributions. Implement a different distribution, such as multinomial, bernoulli or kernel naive bayes, which make different assumptions about the distribution of attribute values and/or their relationship with the class value.

## Resources and Further Reading

This section provides some resources that you can use to learn more about the Naive Bayes algorithm, both the theory of how and why it works and the practical concerns of implementing it in code.

### Problem

More resources for learning about the problem of predicting the onset of diabetes.

* [Pima Indians Diabetes Data Set](https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes): This page provides access to the dataset files, describes the attributes and lists papers that use the dataset.
* [Dataset File](https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data): The dataset file.
* [Dataset Summary](https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.names): Description of the dataset attributes.
* [Diabetes Dataset Results](http://www.is.umk.pl/projects/datasets.html#Diabetes): The accuracy of many standard algorithms on this dataset.

### Code

This section links to open source implementations of Naive Bayes in popular machine learning libraries. Review these if you are considering implementing your own version of the method for operational use.

* [Naive Bayes in Scikit-Learn](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/naive_bayes.py): Implementation of Naive Bayes in the scikit-learn library.
* [Naive Bayes documentation](http://scikit-learn.org/stable/modules/naive_bayes.html): Scikit-Learn documentation and sample code for Naive Bayes.

### Books

You may have one or more books on applied machine learning. This section highlights the sections or chapters in common applied books on machine learning that cover Naive Bayes.

* [Applied Predictive Modeling](http://www.amazon.com/dp/1461468485?tag=inspiredalgor-20), page 353
* [Data Mining: Practical Machine Learning Tools and Techniques](http://www.amazon.com/dp/0123748569?tag=inspiredalgor-20), page 94
* [Machine Learning for Hackers](http://www.amazon.com/dp/1449303714?tag=inspiredalgor-20), page 78
* [An Introduction to Statistical Learning: with Applications in R](http://www.amazon.com/dp/1461471370?tag=inspiredalgor-20), page 138
* [Machine Learning: An Algorithmic Perspective](http://www.amazon.com/dp/1420067184?tag=inspiredalgor-20), page 171
* [Machine Learning in Action](http://www.amazon.com/dp/1617290181?tag=inspiredalgor-20), page 61 (Chapter 4)
* [Machine Learning](http://www.amazon.com/dp/0070428077?tag=inspiredalgor-20), page 177 (Chapter 6)

## Next Step

Take action.

Follow the tutorial and implement Naive Bayes from scratch. Adapt the example to another problem. Follow the extensions and improve upon the implementation.

Leave a comment and share your experiences.
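As a possible starting point for the log-probabilities extension suggested earlier, here is a minimal sketch. It assumes the same summary layout used in this tutorial (a dict that maps each class value to a list of `(mean, stdev)` pairs). The function names `log_gaussian` and `class_log_scores` are hypothetical and do not appear in the tutorial code. Instead of multiplying densities, it adds their logarithms, so the score stays representable even when there are many attributes.

```python
import math

def log_gaussian(x, mean, stdev):
    # Log of the Gaussian density, written out directly so that no
    # intermediate exp() call can underflow to zero.
    return -math.log(stdev * math.sqrt(2 * math.pi)) - ((x - mean) ** 2) / (2 * stdev ** 2)

def class_log_scores(summaries, input_vector):
    # Hypothetical helper: sum the per-attribute log densities for each class.
    scores = {}
    for class_value, class_summaries in summaries.items():
        scores[class_value] = sum(
            log_gaussian(input_vector[i], mean, stdev)
            for i, (mean, stdev) in enumerate(class_summaries)
        )
    return scores

summaries = {'A': [(1, 0.5)], 'B': [(20, 5.0)]}
scores = class_log_scores(summaries, [1.1, '?'])
best = max(scores, key=scores.get)
print(best)  # A
```

Because the logarithm is monotonic, the class with the largest log score is the same class that the product of probabilities would select. Like the tutorial code, this sketch does not guard against a standard deviation of zero.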