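Calculating distance between observations
lims(x = c(-30, 30), y = c(-20, 20))  # used in a ggplot to set the ranges of the x and y axes
dist(two_players)  # dist(data.frame) computes the pairwise distances between all rows (observations) of the data frame
Calling scale(data.frame) before dist() removes the distortion caused by features measured on very different scales, for example one column in kilometres and another in millilitres. scale() standardizes each column: it subtracts the column mean (centering) and divides by the column standard deviation.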
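To see why scaling matters, here is a minimal sketch on made-up data (the obs data frame and its km and ml columns are hypothetical, not from the course data). It only uses base R.

```r
# Hypothetical data: a distance in km and a volume in mL for three observations.
obs <- data.frame(km = c(1, 2, 3), ml = c(1000, 5000, 3000))

# Unscaled: the ml column dominates because its values are far larger.
d_raw <- dist(obs)

# scale() subtracts each column's mean and divides by its standard deviation.
obs_scaled <- scale(obs)
d_scaled <- dist(obs_scaled)

print(round(colMeans(obs_scaled), 10))  # column means after scaling
print(apply(obs_scaled, 2, sd))         # column standard deviations after scaling
print(d_raw)
print(d_scaled)
```

After scaling, both columns contribute on a comparable footing to the distances.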
If the data frame holds categorical values such as YES/NO or LOW/MIDDLE/HIGH, how can dist() be used?
First, library(dummies)
dummy_survey <- dummy.data.frame(job_survey)  # convert each categorical column into 0/1 dummy (indicator) columns
dist_survey <- dist(dummy_survey, method = 'binary')  # then call dist() with the binary method
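As a rough illustration of what the binary method does with dummy columns, the sketch below builds the 0/1 columns by hand in base R instead of using the dummies package. The survey data frame and its column names are invented for the example.

```r
# Hypothetical survey with two categorical answers per respondent.
survey <- data.frame(
  remote = c("YES", "NO", "YES"),
  pay    = c("LOW", "HIGH", "HIGH")
)

# One 0/1 indicator column per category level (hand-rolled dummy encoding).
dummy <- data.frame(
  remote_YES = as.integer(survey$remote == "YES"),
  remote_NO  = as.integer(survey$remote == "NO"),
  pay_LOW    = as.integer(survey$pay == "LOW"),
  pay_HIGH   = as.integer(survey$pay == "HIGH")
)

# Binary distance: among the columns where at least one row has a 1,
# the proportion where only one of the two rows has a 1.
d_bin <- dist(dummy, method = "binary")
print(d_bin)
# Rows 1 and 2 share no answer, so their distance is 1.
# Rows 1 and 3 share one answer (remote = YES): 2 of 3 active columns differ, giving 2/3.
```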
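Possible values of method:
euclidean  Euclidean distance: the square root of the sum of squared differences.
maximum  Chebyshev distance: the largest absolute difference along any coordinate.
manhattan  Manhattan distance: the sum of absolute differences.
canberra  Canberra distance (Lance-Williams).
minkowski  Minkowski distance; the power p must be specified.
binary  Distance for binary (qualitative) variables.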
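The methods above can be compared on a single pair of points. This is a small sketch with two made-up points whose coordinate differences are 3 and 4, so the results are easy to check by hand.

```r
# Two hypothetical points; the coordinate differences are 3 and 4.
pts <- rbind(a = c(1, 2), b = c(4, 6))

d_euclid    <- as.numeric(dist(pts, method = "euclidean"))         # sqrt(3^2 + 4^2) = 5
d_max       <- as.numeric(dist(pts, method = "maximum"))           # max(3, 4) = 4
d_manhattan <- as.numeric(dist(pts, method = "manhattan"))         # 3 + 4 = 7
d_canberra  <- as.numeric(dist(pts, method = "canberra"))          # 3/5 + 4/8 = 1.1
d_minkowski <- as.numeric(dist(pts, method = "minkowski", p = 3))  # (3^3 + 4^3)^(1/3)

print(c(euclidean = d_euclid, maximum = d_max, manhattan = d_manhattan,
        canberra = d_canberra, minkowski_p3 = d_minkowski))
```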
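The distance matrix gives the distance between every pair of observations. The distance from another observation (or group) to an existing group of observations is defined by the linkage method, of which there are three common choices:
- Complete: the resulting distance is based on the maximum, max()
- Single: the resulting distance is based on the minimum, min()
- Average: the resulting distance is based on the average, mean()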
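The three linkage rules can be worked through by hand. In this sketch the points A, B and C are hypothetical: A and B are assumed to already form a group, and we ask how far C is from that group under each rule.

```r
# Hypothetical points: A and B form a group, C is outside it.
pts <- rbind(A = c(0, 0), B = c(0, 3), C = c(4, 0))
d <- as.matrix(dist(pts))

# Distances from C to each member of the group {A, B}: A-C = 4, B-C = 5.
to_C <- d[c("A", "B"), "C"]

complete_link <- max(to_C)   # 5
single_link   <- min(to_C)   # 4
average_link  <- mean(to_C)  # 4.5

print(c(complete = complete_link, single = single_link, average = average_link))
```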
hc_players <- hclust(dist_players, method = "complete")
clusters_k2 <- cutree(hc_players, k = 2)
hclust() performs hierarchical clustering.
cutree(k = ) cuts the tree into k groups and returns the cluster assignment of each observation.
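A minimal end-to-end sketch of hclust() followed by cutree(), using a made-up lineup_demo data frame with two visibly separate groups of points (the data are an assumption for illustration only).

```r
# Hypothetical positions: three points near (0, 0) and three near (10, 10).
lineup_demo <- data.frame(
  x = c(0, 1, 0, 10, 11, 10),
  y = c(0, 0, 1, 10, 10, 11)
)

hc <- hclust(dist(lineup_demo), method = "complete")

# One cluster label per row of lineup_demo.
clusters <- cutree(hc, k = 2)
print(clusters)
print(table(clusters))
```

The returned vector has one entry per observation, so it can be attached to the original data as a new column.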
library(dendextend)
dist_players <- dist(lineup, method = 'euclidean')
hc_players <- hclust(dist_players, method = "complete")
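dend_players <- as.dendrogram(hc_players)  # converts the hclust object into an object of class 'dendrogram', the format that dendextend functions work on
plot(dend_players)  # draws the dendrogram (tree diagram)
dend_20 <- color_branches(dend_players, h = 20)  # color_branches() colors the branches of the dendrogram; h is the height at which the tree is cut, and each resulting cluster gets its own color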
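The conversion step can be checked on toy data. This sketch assumes a made-up pts data frame; the coloring part only runs if dendextend happens to be installed, and plotting is limited to interactive sessions.

```r
# Hypothetical data: two well-separated groups of three points.
pts <- data.frame(
  x = c(0, 1, 0, 10, 11, 10),
  y = c(0, 0, 1, 10, 10, 11)
)

hc <- hclust(dist(pts), method = "complete")
dend <- as.dendrogram(hc)

print(class(dend))                  # class of the converted object
top_height <- attr(dend, "height")  # height of the final merge at the root
print(top_height)

if (interactive()) plot(dend)

# Optional: color the branches below a cut at h = 5 (needs dendextend).
if (requireNamespace("dendextend", quietly = TRUE)) {
  dend_col <- dendextend::color_branches(dend, h = 5)
  if (interactive()) plot(dend_col)
}
```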
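dist_customers <- dist(customers_spend)  # pairwise distances between customers
hc_customers <- hclust(dist_customers, method = "complete")  # hierarchical clustering with hclust()
plot(hc_customers)  # plot the resulting dendrogram
clust_customers <- cutree(hc_customers, h = 15000)  # cut the tree at a given height; h is a linkage distance on the dendrogram's vertical axis, so with complete linkage no two members of a cluster are more than 15000 apart
segment_customers <- mutate(customers_spend, cluster = clust_customers)  # add the cluster assignments returned by cutree() to the original data frame as a new column, cluster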
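To make the meaning of h more concrete, the sketch below cuts a small tree by height and attaches the result as a column. The spend data frame is invented, and base R's transform() stands in for mutate() so the example needs no extra packages.

```r
# Hypothetical spending data: two small spenders and two large spenders.
spend <- data.frame(
  milk    = c(100, 120, 5000, 5200),
  grocery = c(200, 210, 9000, 9100)
)

hc <- hclust(dist(spend), method = "complete")
print(hc$height)  # merge heights, i.e. the values h is compared against

# Merges above h = 1000 are undone, leaving the groups formed below that height.
by_h <- cutree(hc, h = 1000)

# Base-R stand-in for mutate(spend, cluster = by_h).
segmented <- transform(spend, cluster = by_h)
print(segmented)
```

Here the only merge above 1000 is the final one, so cutting at h = 1000 gives the same two groups as k = 2.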
ifelse in ggplot
K-means clustering
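model_km2 <- kmeans(lineup, centers = 2)  # build a k-means model; centers is k, the number of clusters (and so the number of colors in the plot)
clust_km2 <- model_km2$cluster  # extract the cluster assignment vector from the model
lineup_km2 <- mutate(lineup, cluster = clust_km2)  # add the assigned clusters to the original data frame
ggplot(lineup_km2, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()  # scatter plot that shows the grouping
On wrapping cluster in factor(): without it the column is an integer, which ggplot treats as continuous and maps to a color gradient. As a factor it is discrete, so the legend shows the separate classes 1 and 2 rather than a range from 1 to 2.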
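The factor() point can be checked without plotting. This sketch uses a made-up lineup_demo data frame; set.seed() and nstart are additions of the example (k-means starts from random centers), not part of the original notes.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Hypothetical positions: two well-separated groups of three points.
lineup_demo <- data.frame(
  x = c(0, 1, 0, 10, 11, 10),
  y = c(0, 0, 1, 10, 10, 11)
)

model <- kmeans(lineup_demo, centers = 2, nstart = 10)
lineup_demo$cluster <- model$cluster

print(class(lineup_demo$cluster))  # numeric type: a continuous scale in ggplot2

# As a factor the labels are discrete levels, one color each.
cluster_f <- factor(lineup_demo$cluster)
print(levels(cluster_f))
print(table(cluster_f))
```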
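library(purrr)
# Fit k-means for k = 1 to 10 and record the total within-cluster sum of squares
tot_withinss <- map_dbl(1:10, function(k) {
  model <- kmeans(x = lineup, centers = k)
  model$tot.withinss
})
elbow_df <- data.frame(
  k = 1:10,
  tot_withinss = tot_withinss
)
Try many values of k (from 1 to 10), then plot tot_withinss against k and look for the elbow.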
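The same loop can be sketched in base R with vapply() in place of map_dbl(), which avoids the purrr dependency. The demo data are simulated and purely illustrative, and the range of k is shortened to 1 to 6.

```r
set.seed(1)

# Simulated data: two blobs of 20 points each, around (0, 0) and (8, 8).
demo <- data.frame(
  x = c(rnorm(20, mean = 0), rnorm(20, mean = 8)),
  y = c(rnorm(20, mean = 0), rnorm(20, mean = 8))
)

ks <- 1:6
tot_withinss <- vapply(ks, function(k) {
  kmeans(demo, centers = k, nstart = 10)$tot.withinss
}, numeric(1))

elbow_df <- data.frame(k = ks, tot_withinss = tot_withinss)
print(elbow_df)

# The elbow is where adding another cluster stops reducing tot_withinss by much.
if (interactive()) plot(elbow_df$k, elbow_df$tot_withinss, type = "b")
```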
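library(cluster)
pam() plays a similar role to kmeans(): both create a model. pam_k2 <- pam(lineup, k = 2)
kmeans partitions around means and is sensitive to outliers. pam is more robust: it partitions around medoids, which are actual observations.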
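A small sketch of the difference between a mean and a medoid, on invented data with one extreme value. Fitting a single cluster with each method isolates the two notions of "center".

```r
library(cluster)

# Hypothetical data: four ordinary points and one outlier in the last row.
pts <- data.frame(x = c(1, 2, 3, 4, 100), y = c(1, 2, 3, 4, 100))

# One cluster each, to compare the two kinds of center.
km <- kmeans(pts, centers = 1)
pm <- pam(pts, k = 1)

print(km$centers)  # the mean, (22, 22), is pulled toward the outlier
print(pm$medoids)  # the medoid is an actual row of pts
print(pm$id.med)   # row index of the medoid
```

The mean lands far from every ordinary point, while the medoid stays on one of them.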
silhouette()
plot(silhouette(pam_k2))  # draws the silhouette plot, a bar chart of each observation's silhouette width
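To look at the numbers behind the silhouette plot, this sketch runs pam() on a made-up pts data frame and inspects the silhouette object directly.

```r
library(cluster)

# Hypothetical data: two well-separated groups of three points.
pts <- data.frame(
  x = c(0, 1, 0, 10, 11, 10),
  y = c(0, 0, 1, 10, 10, 11)
)

pam_k2 <- pam(pts, k = 2)
sil <- silhouette(pam_k2)

# One row per observation: its cluster, the nearest other cluster, and its width.
print(sil[, c("cluster", "neighbor", "sil_width")])

# The average of the individual widths is what the model stores as avg.width.
avg_by_hand <- mean(sil[, "sil_width"])
print(avg_by_hand)
print(pam_k2$silinfo$avg.width)
```

Widths close to 1 mean an observation sits well inside its cluster; widths near 0 or below suggest it lies between clusters.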
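sil_width <- map_dbl(2:10, function(k) {
  model <- pam(x = customers_spend, k = k)
  model$silinfo$avg.width  # average silhouette width for this k
})
sil_df <- data.frame(
  k = 2:10,
  sil_width = sil_width
)
ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)
Fit the model for a range of k values, then plot the average silhouette width against k as a line chart to choose k. The k with the highest average silhouette width is the usual choice.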
segment_customers %>%
  group_by(cluster) %>%
  summarise_all(mean)  # mean of every column within each cluster
Summarise by cluster to inspect the earlier clustering result.
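For reference, a rough base-R equivalent of the grouped summary, using aggregate() on a small invented data frame (the segmented data and its values are hypothetical).

```r
# Hypothetical segmented data with a cluster column already attached.
segmented <- data.frame(
  milk    = c(100, 120, 5000, 5200),
  grocery = c(200, 210, 9000, 9100),
  cluster = c(1, 1, 2, 2)
)

# Mean of every other column within each cluster.
cluster_means <- aggregate(. ~ cluster, data = segmented, FUN = mean)
print(cluster_means)
# cluster 1: milk 110, grocery 205
# cluster 2: milk 5100, grocery 9050
```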
Case Study: National Occupational mean wage
library(tibble)
rownames_to_column(as.data.frame(oes), var = 'occupation')  # moves the row names of the data frame into a regular column; var = '...' is the name of that new column
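What the call does can be sketched in base R on a tiny matrix. The oes_demo matrix, its occupations and its wage values are made up for illustration; the tibble call only runs if the package is installed.

```r
# Hypothetical wage matrix with occupations stored as row names.
oes_demo <- matrix(
  c(50000, 52000,
    30000, 31000),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Management", "Sales"), c("wage_2015", "wage_2016"))
)

# Base-R equivalent: copy the row names into a column, then drop them.
oes_df <- data.frame(
  occupation = rownames(oes_demo),
  as.data.frame(oes_demo),
  row.names = NULL
)
print(oes_df)

# The tibble version, for comparison.
if (requireNamespace("tibble", quietly = TRUE)) {
  print(tibble::rownames_to_column(as.data.frame(oes_demo), var = "occupation"))
}
```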