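Clustering is an important concept in data mining. Traditional cluster-analysis methods fall into several families: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. K-Means is a classic algorithm among the partitioning methods.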
I. Overview of K-Means Clustering
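1. Clustering:
A "class" is a set of items that are similar to one another. Clustering partitions a data set into several classes so that the data within each class are as similar as possible, while the data in different classes are as dissimilar as possible. Cluster analysis is based on similarity: patterns in the same cluster are more similar to each other than to patterns in other clusters. Partitioning a data set by clustering is a form of unsupervised learning.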
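2. K-Means:
K-Means is a simple iterative clustering algorithm. It uses distance as the similarity measure to find K classes in a given data set. Each class is described by its cluster center, which is the mean of all the values in that class. Given a data set X of n data points (each with one or more dimensions) and the desired number of classes K, with Euclidean distance as the similarity measure, the clustering objective is to make the sum of squared distances within the classes as small as possible, i.e., to minimize: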
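$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$
where C_k is the set of points assigned to class k and \mu_k is the cluster center of that class.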
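To make the objective concrete, here is a minimal sketch that evaluates the sum of squared Euclidean distances for a given assignment. The function name `sse` and the toy data are assumptions made for illustration; they do not come from the original post.

```python
import numpy as np

def sse(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    # centers[labels] picks, for every point, the center of the class it belongs to.
    diffs = points - centers[labels]
    return float(np.sum(diffs ** 2))

# Hypothetical toy data: four 1-D points split into two classes.
points = np.array([[1.0], [2.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[1.5], [10.5]])

good = sse(points, labels, centers)
# A deliberately poor assignment of the same points, for comparison.
bad = sse(points, np.array([0, 1, 0, 1]), centers)
print(good, bad)
```

A smaller value means the points sit closer to their centers, which is exactly what K-Means tries to achieve.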
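Combining the least-squares method with the Lagrange principle shows that each cluster center is the mean of the data points in its class. For the algorithm to converge, the iterations should continue until the final cluster centers change as little as possible.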
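The least-squares step can be written out explicitly. The following is a sketch that assumes the objective is the sum of squared Euclidean distances; the symbols J, C_k (the points in class k), and \mu_k (the center of class k) are notation assumed here.

```latex
% Objective: sum of squared Euclidean distances to the cluster centers
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

% With the assignments C_k held fixed, set the gradient w.r.t. \mu_k to zero
\frac{\partial J}{\partial \mu_k} = -2 \sum_{x_i \in C_k} (x_i - \mu_k) = 0

% Solving for \mu_k gives the mean of the points in the class
\mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x_i \in C_k} x_i
```

This is why the update step of K-Means replaces each center with the mean of its class.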
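3. K-Means algorithm procedure:
- Randomly select K samples as the initial cluster centers;
- Compute the distance from each sample to every cluster center;
- Assign each sample to the cluster center nearest to it;
- Compute the mean of the samples in each class and use it as the new cluster center;
- Check: if the class centers no longer change, or the maximum number of iterations has been reached, stop; otherwise go back to the second step.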
[Figure: clustering demo 1 (聚類演示-1.png)]
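The five steps of the procedure can be sketched as one general function. This is a minimal illustration rather than a definitive implementation: the function name `kmeans`, its parameters `max_iter` and `seed`, the handling of empty classes, and the toy data are all assumptions introduced here.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-Means sketch; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples at random as the initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # Step 2: distance from every sample to every center, shape (n, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # Step 3: assign each sample to its nearest center.
        labels = dists.argmin(axis=1)
        # Step 4: the new center is the mean of the samples in each class
        # (assumption: an empty class simply keeps its old center).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 5: stop when the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Hypothetical toy data: two well-separated groups of 2-D points.
data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = kmeans(data, k=2)
print(centers)
print(labels)
```

Because the initial centers are random, which group receives label 0 can differ between seeds, but the partition into the two groups should be the same for data this well separated.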
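4. K-Means worked examples
1) Cluster the four points a to d into two classes:
Choose samples a and b as the initial cluster centers, with center values 1 and 2 respectively.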
[Figure: clustering demo 2 (聚類演示-2.png)]
2) Cluster 100 points on the plane into two classes; their x-coordinates range from 0 to 99.
Python code demo:
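"""
Task: cluster 100 points on the plane into two classes.
The x-coordinates run from 0 to 99.
"""
import numpy as np

x = np.linspace(0, 99, 100)
y = np.linspace(0, 99, 100)
k = 2
n = len(x)
# Columns 0 and 1 hold the distances to the two centers; column 2 holds the class label.
dis = np.zeros([n, k + 1])

# 1. Choose the initial cluster centers (here: the first two points)
center1 = np.array([x[0], y[0]])
center2 = np.array([x[1], y[1]])

iter_ = 100  # maximum number of iterations
while iter_ > 0:
    iter_ -= 1  # count down so the loop always terminates
    # 2. Compute the distance from every point to both cluster centers
    for i in range(n):
        dis[i, 0] = np.sqrt((x[i] - center1[0])**2 + (y[i] - center1[1])**2)
        dis[i, 1] = np.sqrt((x[i] - center2[0])**2 + (y[i] - center2[1])**2)
        # 3. Assign the point: store the index of the smaller distance in dis[i, 2]
        dis[i, 2] = np.argmin(dis[i, :2])
    # 4. Compute the new cluster centers
    index1 = dis[:, 2] == 0
    index2 = dis[:, 2] == 1
    center1_new = np.array([x[index1].mean(), y[index1].mean()])
    center2_new = np.array([x[index2].mean(), y[index2].mean()])
    # 5. Check whether the cluster centers have changed
    if np.array_equal(center1, center1_new) and np.array_equal(center2, center2_new):
        # No change: the final cluster centers have been found, so leave the loop
        break
    center1 = center1_new
    center2 = center2_new

# 6. Print the result for verification
print(dis)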
The result is as follows.
The third column is the class each point was assigned to:
[[ 34.64823228 105.3589104 0. ]
[ 33.23401872 103.94469683 0. ]
[ 31.81980515 102.53048327 0. ]
[ 30.40559159 101.11626971 0. ]
[ 28.99137803 99.70205615 0. ]
[ 27.57716447 98.28784258 0. ]
[ 26.1629509 96.87362902 0. ]
[ 24.74873734 95.45941546 0. ]
[ 23.33452378 94.0452019 0. ]
[ 21.92031022 92.63098834 0. ]
[ 20.50609665 91.21677477 0. ]
[ 19.09188309 89.80256121 0. ]
[ 17.67766953 88.38834765 0. ]
[ 16.26345597 86.97413409 0. ]
[ 14.8492424 85.55992052 0. ]
[ 13.43502884 84.14570696 0. ]
[ 12.02081528 82.7314934 0. ]
[ 10.60660172 81.31727984 0. ]
[ 9.19238816 79.90306627 0. ]
[ 7.77817459 78.48885271 0. ]
[ 6.36396103 77.07463915 0. ]
[ 4.94974747 75.66042559 0. ]
[ 3.53553391 74.24621202 0. ]
[ 2.12132034 72.83199846 0. ]
[ 0.70710678 71.4177849 0. ]
[ 0.70710678 70.00357134 0. ]
[ 2.12132034 68.58935778 0. ]
[ 3.53553391 67.17514421 0. ]
[ 4.94974747 65.76093065 0. ]
[ 6.36396103 64.34671709 0. ]
[ 7.77817459 62.93250353 0. ]
[ 9.19238816 61.51828996 0. ]
[ 10.60660172 60.1040764 0. ]
[ 12.02081528 58.68986284 0. ]
[ 13.43502884 57.27564928 0. ]
[ 14.8492424 55.86143571 0. ]
[ 16.26345597 54.44722215 0. ]
[ 17.67766953 53.03300859 0. ]
[ 19.09188309 51.61879503 0. ]
[ 20.50609665 50.20458146 0. ]
[ 21.92031022 48.7903679 0. ]
[ 23.33452378 47.37615434 0. ]
[ 24.74873734 45.96194078 0. ]
[ 26.1629509 44.54772721 0. ]
[ 27.57716447 43.13351365 0. ]
[ 28.99137803 41.71930009 0. ]
[ 30.40559159 40.30508653 0. ]
[ 31.81980515 38.89087297 0. ]
[ 33.23401872 37.4766594 0. ]
[ 34.64823228 36.06244584 0. ]
[ 36.06244584 34.64823228 1. ]
[ 37.4766594 33.23401872 1. ]
[ 38.89087297 31.81980515 1. ]
[ 40.30508653 30.40559159 1. ]
[ 41.71930009 28.99137803 1. ]
[ 43.13351365 27.57716447 1. ]
[ 44.54772721 26.1629509 1. ]
[ 45.96194078 24.74873734 1. ]
[ 47.37615434 23.33452378 1. ]
[ 48.7903679 21.92031022 1. ]
[ 50.20458146 20.50609665 1. ]
[ 51.61879503 19.09188309 1. ]
[ 53.03300859 17.67766953 1. ]
[ 54.44722215 16.26345597 1. ]
[ 55.86143571 14.8492424 1. ]
[ 57.27564928 13.43502884 1. ]
[ 58.68986284 12.02081528 1. ]
[ 60.1040764 10.60660172 1. ]
[ 61.51828996 9.19238816 1. ]
[ 62.93250353 7.77817459 1. ]
[ 64.34671709 6.36396103 1. ]
[ 65.76093065 4.94974747 1. ]
[ 67.17514421 3.53553391 1. ]
[ 68.58935778 2.12132034 1. ]
[ 70.00357134 0.70710678 1. ]
[ 71.4177849 0.70710678 1. ]
[ 72.83199846 2.12132034 1. ]
[ 74.24621202 3.53553391 1. ]
[ 75.66042559 4.94974747 1. ]
[ 77.07463915 6.36396103 1. ]
[ 78.48885271 7.77817459 1. ]
[ 79.90306627 9.19238816 1. ]
[ 81.31727984 10.60660172 1. ]
[ 82.7314934 12.02081528 1. ]
[ 84.14570696 13.43502884 1. ]
[ 85.55992052 14.8492424 1. ]
[ 86.97413409 16.26345597 1. ]
[ 88.38834765 17.67766953 1. ]
[ 89.80256121 19.09188309 1. ]
[ 91.21677477 20.50609665 1. ]
[ 92.63098834 21.92031022 1. ]
[ 94.0452019 23.33452378 1. ]
[ 95.45941546 24.74873734 1. ]
[ 96.87362902 26.1629509 1. ]
[ 98.28784258 27.57716447 1. ]
[ 99.70205615 28.99137803 1. ]
[101.11626971 30.40559159 1. ]
[102.53048327 31.81980515 1. ]
[103.94469683 33.23401872 1. ]
[105.3589104 34.64823228 1. ]]
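In the output above, the first and second values in each row are the distances from that point to the first and second cluster centers, respectively.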
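As a cross-check, the same example can be written without the per-point loop by using NumPy broadcasting. This is a sketch of one possible alternative, not the original author's code; the variable names `points`, `centers`, and `labels` are chosen here for illustration.

```python
import numpy as np

# Same data as the example above: 100 points on the line y = x.
points = np.column_stack([np.linspace(0, 99, 100), np.linspace(0, 99, 100)])
centers = points[:2].copy()  # same initial centers: the first two points

for _ in range(100):  # cap on the number of iterations
    # Distances from all points to both centers at once, shape (100, 2).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    # Class label of each point: index of the nearer center.
    labels = dists.argmin(axis=1)
    # New centers: the mean of the points in each class.
    new_centers = np.array([points[labels == j].mean(axis=0) for j in range(2)])
    if np.array_equal(new_centers, centers):
        break
    centers = new_centers

print(centers)  # the centers end at (24.5, 24.5) and (74.5, 74.5)
print(labels[:50].sum(), labels[50:].sum())  # prints: 0 50
```

The first 50 points fall into class 0 and the last 50 into class 1, which matches the third column of the printed result above.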
I'm still a beginner, so please forgive any mistakes!