
Machine Learning in Action: The k-Nearest Neighbors Algorithm (Part 2)

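2.1 Overview of the k-Nearest Neighbors Algorithm

Simply put, the k-nearest neighbors algorithm classifies data by measuring the distances between feature values.

k-Nearest Neighbors

Pros: high accuracy, insensitive to outliers, no assumptions about the input data

Cons: computationally expensive, requires a lot of memory

Works with: numeric and nominal values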

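How k-nearest neighbors (kNN) works:

We start with a set of sample data, called the training set, in which every item carries a label; that is, we know which class each item in the set belongs to.

When new, unlabeled data arrives, we compare each of its features with the corresponding features of the items in the training set, and the algorithm extracts the class labels of the most similar items (the nearest neighbors).

In general, we look only at the k most similar items in the training set, which is where the k in k-nearest neighbors comes from; k is usually an integer no greater than 20.

Finally, the class that appears most often among those k most similar items becomes the class of the new data.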

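Mathematical formula: the distance between two points $A$ and $B$ with $n$ features is the Euclidean distance, $d(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}$.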
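To make the distance calculation concrete, here is a minimal Python 3 sketch of the Euclidean distance (the article's own listings target Python 2). The helper name euclidean_distance is hypothetical and not part of kNN.py; the two points are taken from the sample data that createDataSet() returns later in this article.

```python
import numpy as np

def euclidean_distance(a, b):
    """Return the straight-line distance between two feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Square the per-feature differences, add them up, take the square root.
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Distance between the point [0, 0] and the training point [1.0, 1.1].
d = euclidean_distance([0, 0], [1.0, 1.1])
print(d)
```

The smaller this value, the more similar the two samples are considered to be.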

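General approach to k-nearest neighbors

Collect: any method works

Prepare: numeric values are needed for the distance calculation; a structured data format is best

Analyze: any method works

Train: does not apply to the k-nearest neighbors algorithm

Test: calculate the error rate

Use: first supply sample data and structured output, then run the k-nearest neighbors algorithm to determine which class each input belongs to, and finally act on the computed class.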

2.1.1 Preparation: importing data with Python

First, we create a Python module named kNN.py.

from numpy import *
import operator

def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

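The code above imports two modules: the first is NumPy, the scientific computing package, and the second is the operator module, whose functions the k-nearest neighbors algorithm uses for sorting.

Now we want to import our own kNN module.

To call this function from the Python shell, start an interactive Python session.

We first use the os module:

Show the current working directory: os.getcwd()

Change the current working directory: os.chdir()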

In [16]:import os

In [17]:os.getcwd()

Out[17]:'C:\\Users\\Administrator'

In [18]:os.chdir(r"H:\ML")

In [19]:import kNN

In [20]:group,labels=kNN.createDataSet()

In [21]:group

Out[21]:

array([[ 1. , 1.1],

[ 1. , 1. ],

[ 0. , 0. ],

[ 0. , 0.1]])

In [24]:labels

Out[24]:['A', 'A', 'B', 'B']

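2.1.2 Implementing the kNN Classification Algorithm

Here are the pseudocode and the actual Python code for the k-nearest neighbors algorithm.

The pseudocode is:

For every point in the dataset whose class is unknown, do the following:

(1) calculate the distance between the current point and each point in the dataset whose class is known;

(2) sort the distances in increasing order;

(3) take the k points closest to the current point;

(4) count how often each class occurs among those k points;

(5) return the most frequent class among those k points as the predicted class of the current point.

The Python function classify0() is as follows: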

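def classify0(inX, dataSet, labels, k):
    """Classify a test point with kNN and return the predicted class label.

    Keyword arguments:
    inX: the point to classify, as an array
    dataSet: the training samples, as a matrix
    labels: the class labels of the training samples, as an array or list
    k: the number of nearest neighbors to use
    """
    dataSetSize = dataSet.shape[0]
    # Distance calculation: subtract every training sample from the new point
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2                  # square every element
    sqDistances = sqDiffMat.sum(axis=1)       # sum the squares across each row
    distances = sqDistances ** 0.5            # take the square root of each sum
    sortedDistIndicies = distances.argsort()  # indices sorted by increasing distance
    # Take the k smallest distances and pick the most common class among them
    classCount = {}
    for i in range(k):                        # tally the classes of the first k points
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # items() works on Python 2 and 3; iteritems() exists only on Python 2
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]             # most frequent class among the k points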

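The core of kNN:

(1) compute the Euclidean distance between the current point and every point in the training set

(2) sort the training points by increasing distance and take the first k

(3) return the class that appears most often among those k points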
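The three steps above can be sketched compactly in Python 3 with NumPy broadcasting and collections.Counter. This is an illustrative variant rather than the article's classify0(); the function name knn_classify is hypothetical, and the sample data is the one createDataSet() returns.

```python
from collections import Counter
import numpy as np

def knn_classify(query, data, labels, k):
    """Classify `query` by majority vote among its k nearest rows of `data`."""
    data = np.asarray(data, dtype=float)
    query = np.asarray(query, dtype=float)
    # (1) Euclidean distance from the query to every training point;
    #     broadcasting subtracts the query from each row.
    distances = np.sqrt(((data - query) ** 2).sum(axis=1))
    # (2) Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # (3) Most frequent label among those k neighbors.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_classify([0, 0], group, labels, 3))
```

Broadcasting does the same work as tile() in classify0() without building the repeated matrix explicitly.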

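Notes on the syntax used in classify0():

1. The vector to classify is inX, the training set is dataSet, the label vector is labels, and k is the number of nearest neighbors to use. The label vector has as many elements as the matrix dataSet has rows.

2. shape returns the dimensions of an array; shape[0] is the size of the first dimension (the number of training samples).

3. tile(inX, (dataSetSize, 1)) repeats inX according to the shape (dataSetSize, 1): picture a dataSetSize-by-1 matrix in which every element is replaced by a copy of inX. Subtracting dataSet from the result gives the difference between inX and every training sample.

For example: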

In [31]:import numpy as np

In [32]:a=np.array([0,1,2])

In [34]:np.tile(a,2)

Out[34]:array([0, 1, 2, 0, 1, 2])

In [35]:np.tile(a,(2,2))

Out[35]:

array([[0, 1, 2, 0, 1, 2],

[0, 1, 2, 0, 1, 2]])

In [36]:b=np.array([[4,5],[6,7]])

In [37]:np.tile(b,2)

Out[37]:

array([[4, 5, 4, 5],

[6, 7, 6, 7]])

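4. argsort() returns an array of the indices that would sort the array.

5. dict.get(key, x) looks up the value stored under key and returns x if the key is not present.

6. operator.itemgetter(1) returns a callable that fetches the element at index 1 (the second element) of whatever it is given; it is equivalent to the anonymous function lambda x: x[1].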
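Notes 4 to 6 can be tried in isolation. The following Python 3 sketch uses made-up distances and labels purely for illustration; only argsort(), dict.get() and operator.itemgetter() come from the article.

```python
import operator
import numpy as np

# argsort(): indices that would sort the array in increasing order.
distances = np.array([0.9, 0.1, 0.5])
order = distances.argsort()
print(order)

# dict.get(key, 0): returns 0 the first time a label is seen.
counts = {}
for label in ['A', 'B', 'A']:
    counts[label] = counts.get(label, 0) + 1
print(counts)

# itemgetter(1): sort the (label, count) pairs by the count at index 1.
ranked = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)
print(ranked[0][0])
```

Here ranked[0][0] is the label with the highest count, which is exactly what classify0() returns.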

Testing the algorithm:

In [85]:import kNN

In [86]:reload(kNN)

Out[86]:

In [87]:group,labels=kNN.createDataSet()

In [88]:group,labels

Out[88]:

(array([[ 1. , 1.1],

[ 1. , 1. ],

[ 0. , 0. ],

[ 0. , 0.1]]), ['A', 'A', 'B', 'B'])

In [89]:kNN.classify0([0,0],group,labels,3)

Out[89]:'B'

The test shows that [0,0] belongs to class B.

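2.2 Improving a Dating Site with k-Nearest Neighbors

Example: using k-nearest neighbors on a dating site

Collect: a text file is provided

Prepare: parse the text file in Python

Analyze: use Matplotlib to draw 2D scatter plots

Train: does not apply to the k-nearest neighbors algorithm

Test: use part of the data Helen provided as test samples

The difference between test samples and the other samples is that test samples are already classified; if the predicted class differs from the actual class, we count it as an error.

Use: build a simple command-line program that lets Helen enter a few features and tells her whether the person is the type she likes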

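2.2.1 Preparing data: parsing data from a text file

The features in Helen's data samples:

Frequent flier miles earned per year

Percentage of time spent playing video games

Liters of ice cream consumed per week

Before we can feed Helen's feature data into the classifier, we have to convert it into a format the classifier accepts.

We create a function named file2matrix in kNN.py to handle the input format.

The function takes a filename string as input and returns a matrix of training samples and a vector of class labels.

Add the following code to kNN.py: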

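def file2matrix(filename):
    # Read every line, closing the file when done
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    # Number of lines in the file
    numberOfLines = len(arrayOLines)
    # Create the NumPy matrix to return
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    # Parse each line of text into the matrix and the label list
    for line in arrayOLines:
        line = line.strip()               # strip surrounding whitespace, including the newline
        listFromLine = line.split('\t')   # split on tabs; split() with no argument splits on any whitespace
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector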

We need to reload the kNN module; otherwise Python keeps using the previously loaded version.

In [46]:import kNN

In [47]:reload(kNN)

Out[47]:

In [48]:datingDataMat,datingLabels =kNN.file2matrix('datingTestSet2.txt')

In [49]:datingDataMat

Out[49]:

array([[ 4.09200000e+04, 8.32697600e+00, 9.53952000e-01],

[ 1.44880000e+04, 7.15346900e+00, 1.67390400e+00],

[ 2.60520000e+04, 1.44187100e+00, 8.05124000e-01],

...,

[ 2.65750000e+04, 1.06501020e+01, 8.66627000e-01],

[ 4.81110000e+04, 9.13452800e+00, 7.28045000e-01],

[ 4.37570000e+04, 7.88260100e+00, 1.33244600e+00]])

In [50]:datingLabels[0:20]

Out[50]:[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]
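As a cross-check on the parsing logic, here is a minimal Python 3 sketch that reads the same tab-separated layout with np.loadtxt instead of a manual loop. It is not the article's file2matrix(); the in-memory sample holds the first three rows and labels shown in the output above, and the variable names are illustrative.

```python
import io
import numpy as np

# Three rows in the datingTestSet2.txt layout: three features, then an integer label.
sample = (
    "40920\t8.326976\t0.953952\t3\n"
    "14488\t7.153469\t1.673904\t2\n"
    "26052\t1.441871\t0.805124\t1\n"
)

# loadtxt parses every column as float; a file path works in place of StringIO.
table = np.loadtxt(io.StringIO(sample), delimiter='\t')
features = table[:, 0:3]                          # first three columns
class_labels = table[:, -1].astype(int).tolist()  # last column as Python ints

print(features.shape)
print(class_labels)
```

This only works when every column is numeric, which holds for the file used here.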

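With the data imported, we next need to understand what it really means. Beyond browsing the raw values, we usually display the data graphically to get an intuitive view of it.

2.2.2 Analyzing data: creating scatter plots with Matplotlib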

In [60]:import numpy as np

...:import matplotlib

...:import matplotlib.pyplot as plt

In [64]:fig = plt.figure() # create a new figure

...:ax=fig.add_subplot(111) # split the figure into 1 row and 1 column and draw in the first cell

# scatter(X, Y) draws a scatter plot with X as the x coordinates and Y as the y coordinates

...:ax.scatter(datingDataMat[:,1],datingDataMat[:,2])

...:ax.axis([-2,25,-0.2,2.0])

...:plt.ylabel('Liters of ice cream consumed per week')

...:plt.xlabel('Percentage of time spent playing games')

...:plt.show()

Without the class labels, it is hard to see any useful pattern in the plot above. So we redraw it, using color to mark the different classes.

In [65]:fig = plt.figure()

...:ax=fig.add_subplot(111)

...:ax.scatter(datingDataMat[:,1],datingDataMat[:,2],15.0*np.array(datingLabels),15.0*np.array(datingLabels))

...:ax.axis([-2,25,-0.2,2.0])

...:plt.ylabel('Liters of ice cream consumed per week')

...:plt.xlabel('Percentage of time spent playing games')

...:plt.show()

In [66]:fig = plt.figure()

...:ax=fig.add_subplot(111)

...:ax.scatter(datingDataMat[:,0], datingDataMat[:,1], 15.0*np.array(datingLabels),15.0*np.array(datingLabels))

...:plt.ylabel('Percentage of time spent playing games')

...:plt.xlabel('The number of frequent flier miles per year')

...:plt.show()

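This scatter plot uses the first and second columns of the datingDataMat matrix, which hold the features "frequent flier miles earned per year" and "percentage of time spent playing video games".

The plot shows distinct regions for the different classes.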

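2.2.3 Preparing data: normalizing numeric values

Different features have different values and ranges. If we compute distances directly from the raw feature values, features with large ranges dominate the result and features with small ranges barely affect it, so the small-valued features effectively play no role.

Take two samples, {0, 20000, 1.1} and {67, 32000, 0.1}. The distance between them is:

$\sqrt{(0 - 67)^2 + (20000 - 32000)^2 + (1.1 - 0.1)^2}$

In this expression only the second feature has a large effect; the first and third contribute so little that they could almost be ignored.

But the three features are equally important, so frequent flier miles, as just one of three equally weighted features, should not dominate the result like this.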
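To see how lopsided the raw distance is, the following Python 3 sketch computes each feature's share of the squared distance for the two samples quoted above. The variable names are illustrative.

```python
import numpy as np

a = np.array([0.0, 20000.0, 1.1])
b = np.array([67.0, 32000.0, 0.1])

squared = (a - b) ** 2             # per-feature squared differences
distance = np.sqrt(squared.sum())  # raw Euclidean distance
share = squared / squared.sum()    # fraction of the squared distance per feature

print(squared)
print(distance)
print(share)
```

The second feature accounts for almost the entire squared distance, which is the imbalance that normalization is meant to remove.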

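To deal with this, we normalize the values:

for example, rescaling each range to 0 to 1 or to -1 to 1. The following formula converts feature values with any range into values between 0 and 1:

newValue=(oldValue-min)/(max-min)

Here max and min are the largest and smallest values of that feature in the dataset.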
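As a worked example of the formula, here is a minimal Python 3 sketch that applies it column by column with NumPy broadcasting. The function name min_max_normalize and the small sample array are hypothetical, chosen only for illustration.

```python
import numpy as np

def min_max_normalize(data):
    """Rescale every column of `data` to the range 0..1."""
    data = np.asarray(data, dtype=float)
    min_vals = data.min(axis=0)            # per-column minimum
    ranges = data.max(axis=0) - min_vals   # per-column max - min
    # Note: a constant column has range 0 and would divide by zero here.
    return (data - min_vals) / ranges, ranges, min_vals

sample = np.array([[0.0, 10.0],
                   [50.0, 20.0],
                   [100.0, 40.0]])
norm, ranges, min_vals = min_max_normalize(sample)
print(norm)
```

Each column now runs from 0 to 1, so both features carry equal weight in a distance calculation.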

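Rescaling adds some complexity to the classifier, but it is needed to get accurate results.

Next we add a function autoNorm() to kNN.py, which automatically rescales numeric feature values to the range 0 to 1.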

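def autoNorm(dataSet):
    minVals = dataSet.min(0)    # column-wise minimum of each feature
    maxVals = dataSet.max(0)    # column-wise maximum of each feature
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    # Subtract the minimum from every row
    normDataSet = dataSet - tile(minVals, (m, 1))
    # Element-wise division by the range of each feature
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals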

Checking the function's output:

In [99]:import kNN

In [100]:reload(kNN)

Out[100]:

In [101]:normMat,ranges,minVals=kNN.autoNorm(datingDataMat)

In [102]:normMat

Out[102]:

array([[ 0.44832535, 0.39805139, 0.56233353],

[ 0.15873259, 0.34195467, 0.98724416],

[ 0.28542943, 0.06892523, 0.47449629],

...,

[ 0.29115949, 0.50910294, 0.51079493],

[ 0.52711097, 0.43665451, 0.4290048 ],

[ 0.47940793, 0.3768091 , 0.78571804]])

In [103]:ranges

Out[103]:array([ 9.12730000e+04, 2.09193490e+01, 1.69436100e+00])

In [104]:minVals

Out[104]:array([ 0. , 0. , 0.001156])

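2.2.4 Testing the algorithm: validating the classifier as a complete program

We now test how well the classifier works. If its accuracy is good enough, Helen can use this software to handle her dating-site matches.

One of the most important tasks in machine learning is evaluating how accurate an algorithm is.

We typically use 90% of the sample data to train the classifier and the remaining 10% to test it and measure its accuracy. Note that the 10% should be chosen at random.

For a classifier, the error rate is the number of wrong results divided by the total number of test samples. A perfect classifier has an error rate of 0; a classifier with an error rate of 1 never gives a correct result.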
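The random 90/10 split and the error-rate definition can be sketched as follows in Python 3. The helper names holdout_split and error_rate, the seed, and the sample count of 1000 are assumptions made for illustration; they do not come from kNN.py.

```python
import numpy as np

def holdout_split(n_samples, test_ratio=0.10, seed=0):
    """Return (train_indices, test_indices) from a random shuffle."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(n_samples * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def error_rate(predicted, actual):
    """Number of wrong predictions divided by the number of test samples."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    return float(np.mean(predicted != actual))

train_idx, test_idx = holdout_split(1000)
print(len(train_idx), len(test_idx))

# One wrong prediction out of four test samples.
print(error_rate([3, 2, 1, 3], [3, 2, 1, 1]))
```

Shuffling before splitting matters when the rows of a file are ordered in some way; if the rows are already in random order, taking the first 10% is equivalent.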

Here is the test code for the dating-site classifier:

def datingClassTest():
    hoRatio = 0.10      # fraction of the data held out for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # The first numTestVecs rows are test samples; the rest are training samples
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("the classifier came back with:%d,the real anwer is:%d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))

Now we run the classifier test:

In [112]:import kNN

In [113]:reload(kNN)

Out[113]:

In [114]:kNN.datingClassTest()

the classifier came back with:3,the real anwer is:3

the classifier came back with:2,the real anwer is:2

the classifier came back with:1,the real anwer is:1

...

...

the classifier came back with:2,the real anwer is:2

the classifier came back with:1,the real anwer is:1

the classifier came back with:3,the real anwer is:1

the total error rate is: 0.050000

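The classifier's error rate on the dating dataset is 5%, which is a fairly good result.

We can change the variable hoRatio in datingClassTest and the value of k passed to classify0 to see whether the error rate changes as those values change.

This example shows that we can predict the class with an error rate of only 5%. Helen can enter the attributes of an unknown person and let the classifier help her judge how much she would like them: dislike, like somewhat, or like a lot.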

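2.2.5 Using the algorithm: building a complete, usable system

Having tested the classifier on our data, we can now use it to classify people.

With the code below, Helen only needs to find someone on the dating site and enter their information, and the code predicts how much she will like them.

We add this code to kNN.py (the dating-site prediction function):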

def classifyPerson():
    resultList = ['not at all', 'in small dose', 'in large doses']
    percentTats = float(raw_input("percentage of time spent playing video games?"))
    ffMiles = float(raw_input("frequent flier miles earned per year?"))
    iceCream = float(raw_input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])   # same column order as the training data
    # Normalize the input with the training minimums and ranges before classifying
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("You will probably like this person: %s" % resultList[classifierResult - 1])

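The code uses raw_input(), which lets the user type a line of text at a prompt. (In Python 3 this function is named input().)

Let's see how it works in practice: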

In [127]:import kNN

In [128]:reload(kNN)

Out[128]:

In [129]:kNN.classifyPerson()

percentage of time spent playing video games?9

frequent flier miles earned per year?9000

liters of ice cream consumed per year?0.4

You will probably like this person: in small dose

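After we enter the data, the program predicts that Helen would like this person only in small doses, so she may decide to skip this date and move on to the next candidate.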

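2.3 A Handwriting Recognition System

To keep things simple, the system we build recognizes only the digits 0 to 9.

2.3.1 Preparing data: converting images into test vectors

We convert each 32x32 binary image matrix into a 1x1024 vector so that the classifier from the previous sections can process the image data.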
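The flattening step can be illustrated with a small Python 3 sketch that uses a synthetic image instead of a file. The image contents and variable names are made up for illustration; only the 32x32 to 1x1024 layout comes from the article.

```python
import numpy as np

# Synthetic 32x32 image as 32 strings of '0'/'1':
# all zeros except a single 1 at row 2, column 5.
rows = ['0' * 32 for _ in range(32)]
rows[2] = '0' * 5 + '1' + '0' * 26

grid = np.array([[int(ch) for ch in row] for row in rows])  # shape (32, 32)
vec = grid.reshape(1, 1024)                                  # row-major flattening

i, j = 2, 5
print(vec.shape)
print(vec[0, 32 * i + j])   # the pixel at (i, j) lands at index 32*i + j
```

Row-major flattening places pixel (i, j) at position 32*i + j, so rows of the image sit side by side in the vector.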

Add the following code to kNN.py:

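# Prepare data: convert an image into a test vector
def img2vector(filename):
    # Create a 1x1024 NumPy array
    returnVect = zeros((1, 1024))
    # Read the first 32 lines of the file and store the first 32 characters
    # of each line in the NumPy array, then return the array
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect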

Testing the code:

In [19]:import kNN

In [20]:reload(kNN)

Out[20]:

In [21]:testVector = kNN.img2vector(r'E:\ML\ML_source_code\machinelearninginaction\Ch02\digits\testDigits\0_13.txt')

In [22]:testVector[0,0:31]

Out[22]:array([ 0., 0., 0., ..., 0., 0., 0.])

In [23]:testVector[0,32:63]

Out[23]:array([ 0., 0., 0., ..., 0., 0., 0.])

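2.3.2 Testing the algorithm: using k-nearest neighbors to recognize handwritten digits

The os module has a function listdir that lists the file names in a given directory. Make sure the script contains

from os import listdir

The test code for the handwritten digit recognition system is: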

def handwritingClassTest():
    hwLabels = []
    # List the contents of the training directory
    trainingFileList = listdir(r'E:\ML\ML_source_code\mlia\Ch02\digits\trainingDigits')
    # trainingFileList holds 1934 files
    m = len(trainingFileList)
    # Create a 1934x1024 matrix of zeros
    trainingMat = zeros((m, 1024))
    # Parse the class digit from each file name
    for i in range(m):
        # File name to open
        fileNameStr = trainingFileList[i]
        # Split on "." and keep the first part
        fileStr = fileNameStr.split('.')[0]
        # Split on "_", keep the first part, and convert it to int
        classNumstr = int(fileStr.split('_')[0])
        hwLabels.append(classNumstr)
        trainingMat[i, :] = img2vector(r'E:\ML\ML_source_code\mlia\Ch02\digits\trainingDigits/%s' % fileNameStr)
    testFileList = listdir(r'E:\ML\ML_source_code\mlia\Ch02\digits\testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumstr = int(fileStr.split('_')[0])
        # Load the test image from the testDigits directory
        vectorUnderTest = img2vector(r'E:\ML\ML_source_code\mlia\Ch02\digits\testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the read answer is:%d" % (classifierResult, classNumstr))
        # Count the misclassified samples
        if classifierResult != classNumstr:
            errorCount += 1.0
    print("\nthe total number of errors is %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))

Output of the test function:

In [102]:reload(kNN)

Out[102]:

In [103]:kNN.handwritingClassTest()

the classifier came back with: 0, the read answer is:0

the classifier came back with: 0, the read answer is:0

the classifier came back with: 0, the read answer is:0

..

the classifier came back with: 4, the read answer is:4

the classifier came back with: 4, the read answer is:4

the classifier came back with: 9, the read answer is:5

the classifier came back with: 5, the read answer is:5

...

the classifier came back with: 9, the read answer is:9

the classifier came back with: 9, the read answer is:9

the classifier came back with: 9, the read answer is:9

the total number of errors is 15

the total error rate is: 0.015856

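On the handwritten digits dataset, the k-nearest neighbors algorithm has an error rate of about 1.6%.

Note that the k-nearest neighbors algorithm is not very efficient. A k-d tree is essentially an optimized version of k-nearest neighbors that reduces the computational cost.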

References:

Machine Learning in Action (《機器學習實戰》)

http://blog.csdn.net/baoli1008/article/details/50708507

http://www.tuicool.com/articles/i26baaa
