Implementing a spam classifier with naive Bayes. The solution code is as follows:
numTrainDocs = 700;
numTokens = 2500;
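% Each row of the features file is (document index, token index, count).
M = dlmread('F:\machine\ex6DataPrepared\train-features.txt', ' ');
% Build a documents-by-tokens count matrix from the triplets.
spmatrix = sparse(M(:,1), M(:,2), M(:,3), numTrainDocs, numTokens);
train_matrix = full(spmatrix);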
y = dlmread('F:\machine\ex6DataPrepared\train-labels.txt', ' ');
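% Row indices of the spam (label 1) and nonspam (label 0) training documents.
spam=find(y==1);
nonspam=find(y==0);
% Prior probability that a document is spam.
p_y=length(spam)/numTrainDocs;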
xofspam=zeros(numTokens,1);
xofnonspam=zeros(numTokens,1);
% Count how often each token occurs in spam and in nonspam documents.
for i=1:numTokens
    xofspam(i,1)=sum(train_matrix(spam,i));
    xofnonspam(i,1)=sum(train_matrix(nonspam,i));
end
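% Total number of words in each training document.
word=sum(train_matrix,2);
% Laplace-smoothed estimates of P(token | spam) and P(token | nonspam):
% (token count in class + 1) / (total words in class + vocabulary size).
fi_y1=(xofspam+1)./(sum(word(spam))+numTokens);
fi_y0=(xofnonspam+1)./(sum(word(nonspam))+numTokens);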
% Training ends here.
% Testing starts below.
numTestDocs = 260;
M =dlmread('F:\machine\ex6DataPrepared\test-features.txt', ' ');
test_spmatrix = sparse(M(:,1), M(:,2), M(:,3), numTestDocs, numTokens);
test_matrix = full(test_spmatrix);
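% Score each test document under each class: log prior plus the
% count-weighted sum of log token probabilities.
a=test_matrix*log(fi_y1)+log(p_y);
b=test_matrix*log(fi_y0)+log(1-p_y);
% Predict spam (1) when the spam score is larger.
test_result=a>b;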
test_labels=dlmread('F:\machine\ex6DataPrepared\test-labels.txt', ' ');
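% Count the misclassified test documents and report the error rate
% (no trailing semicolon, so both values are displayed).
numErrors = nnz(test_result ~= test_labels)
errorRate = numErrors / numTestDocs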
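Two misunderstandings of the formula cost me a whole evening of debugging, and my inexperience with MATLAB left the code redundant: I naively wrote a for loop for something that a single matrix operation or function call could handle.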
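
As a rough illustration of the single matrix operation mentioned above, the per-token counting loop could be replaced by one `sum` call per class. This is only a sketch on made-up data: `toy_matrix`, `toy_labels`, and `numToyTokens` are hypothetical stand-ins for the exercise data, assumed to have the same layout as `train_matrix` and `y` (one row per document, one column per token).

```matlab
% Toy data (hypothetical): 4 documents, 3 tokens.
toy_matrix = [2 0 1;
              0 3 0;
              1 1 0;
              0 0 4];
toy_labels = [1; 0; 1; 0];          % 1 = spam, 0 = nonspam
numToyTokens = size(toy_matrix, 2);

% Logical indexing selects the rows of each class; sum(..., 1) adds them
% column by column, so no loop over tokens is needed.
xofspam    = sum(toy_matrix(toy_labels == 1, :), 1)';
xofnonspam = sum(toy_matrix(toy_labels == 0, :), 1)';

% Laplace-smoothed token probabilities, same formula as in the script:
% (token count in class + 1) / (total words in class + vocabulary size).
fi_y1 = (xofspam + 1) ./ (sum(xofspam) + numToyTokens);
fi_y0 = (xofnonspam + 1) ./ (sum(xofnonspam) + numToyTokens);
```

On the exercise data, the same two `sum` lines should work with `train_matrix` and `y` substituted for the toy variables, since the total word count of a class is just the sum of its per-token counts.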