Theory
Bayes' Theorem
Bayes' theorem describes the relationship between two conditional probabilities:
$$P(A|B) = \cfrac{P(B|A)\,P(A)}{P(B)}$$
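As a quick sanity check, the formula can be evaluated with made-up numbers (a hypothetical diagnostic-test scenario, not taken from this document's dataset):

```python
# Hypothetical numbers: P(B|A) = 0.9 (test is positive given disease),
# P(A) = 0.01 (disease prevalence), P(B) = 0.108 (overall positive rate).
p_b_given_a = 0.9
p_a = 0.01
p_b = 0.108

# Bayes' theorem: posterior = likelihood * prior / evidence
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 4))  # -> 0.0833
```

Even with a 90% accurate test, the posterior is small because the prior P(A) is small; this is exactly the weighting Bayes' theorem makes explicit.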
Naive Bayes Classifier
The naive Bayes classifier is a probabilistic classifier. Define:
- B: the sample has feature vector B
- A: the sample belongs to class A
With these definitions, the terms in Bayes' formula read as follows:
- P(A|B): the probability that a sample with feature vector B belongs to class A (the quantity we want to compute)
- P(B|A): the probability of observing feature vector B within class A (estimated from the training data)
- P(A): the probability of class A (its frequency in the training data)
- P(B): the probability of feature vector B (its frequency in the training data)
The naive Bayes classifier further assumes that the features are independent of one another given the class, so the formula becomes $$P(A|B) = \cfrac{P(A)\prod_{i} P(B_{i} \mid A)}{P(B)}$$
Every quantity on the right-hand side can be estimated from the training data. At prediction time, compute this value for each class and pick the class with the highest one; since P(B) is the same for every class, it can be dropped from the comparison.
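The counting-and-argmax procedure just described can be sketched from scratch on a tiny hypothetical dataset (the feature lists and labels below are invented for illustration; no smoothing is applied, so a feature never seen in a class zeroes out that class):

```python
from collections import Counter, defaultdict

# Tiny made-up training set: (feature list, label) pairs.
data = [
    (["sunny", "hot"], "no"),
    (["sunny", "mild"], "yes"),
    (["rainy", "mild"], "yes"),
    (["rainy", "hot"], "no"),
]

label_counts = Counter(label for _, label in data)
feat_counts = defaultdict(Counter)            # feat_counts[label][feature]
for feats, label in data:
    feat_counts[label].update(feats)

def predict(feats):
    # Score each class by P(A) * prod_i P(B_i|A); P(B) is shared by all
    # classes, so it is dropped when comparing.
    scores = {}
    for label, n in label_counts.items():
        prior = n / len(data)
        likelihood = 1.0
        for f in feats:
            likelihood *= feat_counts[label][f] / n
        scores[label] = prior * likelihood
    return max(scores, key=scores.get)

print(predict(["sunny", "mild"]))  # -> yes
```

Here "mild" never occurs in the "no" class, so that class scores zero and "yes" wins; Laplace smoothing (seen later as `alpha` in sklearn) exists precisely to soften this effect.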
Naive Bayes with Continuous Features
Continuous feature values can be handled in two ways:
- Discretize the continuous values into bins
- Assume each feature follows a normal (or other) distribution (a strong prior assumption), estimate its parameters from the sample, and plug the probability density into Bayes' formula
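The second approach corresponds to sklearn's GaussianNB. A minimal sketch on synthetic data (the two-cluster dataset below is made up so that the Gaussian assumption holds by construction):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
# Two classes drawn from Gaussians with means 0 and 3 (std 1 each).
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# GaussianNB estimates a per-class mean and variance for each feature,
# then plugs the normal density into Bayes' formula.
clf = GaussianNB().fit(X, y)
print(clf.predict([[0, 0], [3, 3]]))  # points at the two class means
```

Each query point sits at its class's true mean, so the densities (and hence the posteriors) clearly favor the correct class.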
Implementation
Loading the data: 20 Newsgroups text data
# from sklearn.datasets import fetch_20newsgroups
# news = fetch_20newsgroups(subset='all')
# print(len(news.data))
# print(news.data[0])
from sklearn import datasets
train = datasets.load_files("./20newsbydate/20news-bydate-train")
test = datasets.load_files("./20newsbydate/20news-bydate-test")
print(train.DESCR)
print(len(train.data))
print(train.data[0])
None
11314
b"From: cubbie@garnet.berkeley.edu ( )\nSubject: Re: Cubs behind Marlins? How?\nArticle-I.D.: agate.1pt592$f9a\nOrganization: University of California, Berkeley\nLines: 12\nNNTP-Posting-Host: garnet.berkeley.edu\n\n\ngajarsky@pilot.njin.net writes:\n\nmorgan and guzman will have era's 1 run higher than last year, and\n the cubs will be idiots and not pitch harkey as much as hibbard.\n castillo won't be good (i think he's a stud pitcher)\n\n This season so far, Morgan and Guzman helped to lead the Cubs\n at top in ERA, even better than THE rotation at Atlanta.\n Cubs ERA at 0.056 while Braves at 0.059. We know it is early\n in the season, we Cubs fans have learned how to enjoy the\n short triumph while it is still there.\n"
Preprocessing: feature extraction (text vectorization)
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(stop_words="english", decode_error="ignore")
train_vec = vec.fit_transform(train.data)
test_vec = vec.transform(test.data)
print(train_vec.shape)
(11314, 129782)
Model Training
from sklearn.naive_bayes import MultinomialNB
bays = MultinomialNB()
bays.fit(train_vec, train.target)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
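The `alpha=1.0` shown in the fitted model is Laplace smoothing: each feature/class count gets 1 added before probabilities are computed, so a word never seen in a class no longer forces that class's whole product to zero. A minimal sketch on a made-up 2x2 count matrix:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical counts: class 0 only ever saw word 0, class 1 only word 1.
X = np.array([[2, 0], [0, 3]])
y = np.array([0, 1])

# With alpha=1.0 (Laplace smoothing) the unseen word/class pairs get a
# small nonzero probability instead of zeroing out the class entirely.
smoothed = MultinomialNB(alpha=1.0).fit(X, y)
print(smoothed.predict([[1, 1]]))
```

A document containing both words would score zero for both classes without smoothing; with alpha=1.0 the smoothed likelihoods (0.75 x 0.25 for class 0 vs. 0.2 x 0.8 for class 1) give a usable comparison.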
Model Evaluation
Using the estimator's built-in score method
bays.score(test_vec, test.target)
0.80244291024960168
Using the metrics module
from sklearn.metrics import classification_report
y = bays.predict(test_vec)
print(classification_report(test.target, y, target_names=test.target_names))
                          precision    recall  f1-score   support

             alt.atheism       0.80      0.81      0.80       319
           comp.graphics       0.65      0.80      0.72       389
 comp.os.ms-windows.misc       0.80      0.04      0.08       394
comp.sys.ibm.pc.hardware       0.55      0.80      0.65       392
   comp.sys.mac.hardware       0.85      0.79      0.82       385
          comp.windows.x       0.69      0.84      0.76       395
            misc.forsale       0.89      0.74      0.81       390
               rec.autos       0.89      0.92      0.91       396
         rec.motorcycles       0.95      0.94      0.95       398
      rec.sport.baseball       0.95      0.92      0.93       397
        rec.sport.hockey       0.92      0.97      0.94       399
               sci.crypt       0.80      0.96      0.87       396
         sci.electronics       0.79      0.70      0.74       393
                 sci.med       0.88      0.87      0.87       396
               sci.space       0.84      0.92      0.88       394
  soc.religion.christian       0.81      0.95      0.87       398
      talk.politics.guns       0.72      0.93      0.81       364
   talk.politics.mideast       0.93      0.94      0.94       376
      talk.politics.misc       0.68      0.62      0.65       310
      talk.religion.misc       0.88      0.44      0.59       251

             avg / total       0.81      0.80      0.78      7532