《機(jī)器學(xué)習(xí)-實(shí)用案例解析》學(xué)習(xí)筆記
1、數(shù)據(jù)準(zhǔn)備
數(shù)據(jù)下載:https://spamassassin.apache.org/old/publiccorpus/
參考谷歌Gmail服務(wù),他們將郵件特征分為社交特征、內(nèi)容特征、線程特征和標(biāo)簽特征。我們的數(shù)據(jù)中沒有詳細(xì)的時(shí)間戳及無法得知用戶何時(shí)做了何種響應(yīng)。但我們可以測量接收量,因此可以假設(shè)這種單向度量能夠較好地代表數(shù)據(jù)中的社交特征類型。
社交特征。用同一主題郵件的發(fā)送間隔時(shí)間來決定郵件的重要性,很自然的方法就是計(jì)算收件人在收到郵件后過了多久才處理這封郵件,在給定特征集下,這個(gè)平均時(shí)間越短,說明郵件在所屬類型中的重要性越高。
線程特征。匹配線程特征詞項(xiàng),比如“RE:”,線程很活躍,那么就比不活躍的更重要。
內(nèi)容特征。抽取郵件正文中的詞項(xiàng),新來一封郵件當(dāng)它們包含更多的特征詞項(xiàng)時(shí),說明更重要。
標(biāo)簽特征。暫不考慮。
我們只需要正常的郵件數(shù)據(jù),對所有郵件信息按時(shí)間排序,然后將數(shù)據(jù)拆分為訓(xùn)練集和測試集。第一部分用于訓(xùn)練排序算法,第二部分用來測試模型效果。
> library(pacman)
> p_load(chinese.misc,stringr,dplyr,ggplot2)
> easy.ham.files <- dir_or_file("./easy_ham")
> easy.ham2.files <- dir_or_file("./easy_ham_2")
> hard.ham.files <- dir_or_file("./hard_ham")
> hard.ham2.files <- dir_or_file("./hard_ham_2")
>
> emails <- c(easy.ham.files,easy.ham2.files,
+ hard.ham.files,hard.ham2.files) %>% unique()
郵件頭信息:
From:這封郵件來自誰?使用來自該發(fā)件人的郵件量作為社交特征的表征量。
Date:何時(shí)收到這封郵件?作為時(shí)間度量。
Subj:這是一個(gè)活躍線程嗎?如果來自一個(gè)已知線程,那么可以確定其活躍程度以作為線程特征。
正文:郵件內(nèi)容是什么?找到最常出現(xiàn)的詞項(xiàng)作為內(nèi)容特征。
構(gòu)造函數(shù),在讀取時(shí)從每一封郵件中抽取如上內(nèi)容,將半結(jié)構(gòu)化數(shù)據(jù)轉(zhuǎn)換為高度結(jié)構(gòu)化的訓(xùn)練數(shù)據(jù)集。
> pre_fun <- function(string) {
+ string <- str_replace_all(string,"\\s+"," ")
+ string <- tolower(string)
+ string <- str_replace_all(string,"[^a-z]"," ")
+ string <- str_replace_all(string,"\\s+"," ")
+ string <- str_trim(string,side = "both")
+ return(string)
+ }
>
> # 數(shù)據(jù)讀取函數(shù)
> read_fun <- function(f) {
+ if (!str_detect(f,"cmds")) {
+ f.txt <- readr::read_file(f)
+ # 抽取From
+ from <- str_extract_all(f.txt,"From:(.*)") %>% unlist
+ from <- ifelse(length(from>1),from[str_detect(from,"@")],from)
+ # 如果檢測到郵箱地址在<>中,提取
+ if(str_detect(from,"<")) {
+ from <- str_extract(from,"<+(.*?)+>") %>%
+ str_remove_all("<|>")
+ } else {
+ # 如果沒有檢測到尖括號(hào),清除From和括號(hào)中的內(nèi)容
+ from <- str_remove_all(from,"From: |\\(.*?\\)")}
+ # 抽取Date
+ date <- str_extract(f.txt,"Date:(.*)") %>%
+ str_remove("Date: ")
+ # 抽取Subject
+ subject <- str_extract(f.txt,"Subject:(.*)") %>%
+ str_remove("Subject: ")
+ # 按第一個(gè)空行切割,抽取郵件正文
+ message <- str_split_fixed(f.txt,"\n\n",2)
+ message <- message[1,2] %>% pre_fun
+ df <- tibble(from=from,date=date,subject=subject,message=message,id=f)
+ return(df)
+ }
+ }
> dt <- sapply(emails,read_fun) %>%
+ do.call(bind_rows,.) %>% distinct()
> head(dt)
## # A tibble: 6 x 5
## from date subject message id
## <chr> <chr> <chr> <chr> <chr>
## 1 kre@munna~ Thu, 22 A~ Re: New Sequen~ date wed aug from ~ D:/R/data_set/sp~
## 2 steve.bur~ Thu, 22 A~ [zzzzteana] RE~ martin a posted ta~ D:/R/data_set/sp~
## 3 timc@2ubh~ Thu, 22 A~ [zzzzteana] Mo~ man threatens expl~ D:/R/data_set/sp~
## 4 monty@ros~ Thu, 22 A~ [IRR] Klez: Th~ klez the virus tha~ D:/R/data_set/sp~
## 5 Stewart.S~ Thu, 22 A~ Re: [zzzzteana~ in adding cream to~ D:/R/data_set/sp~
## 6 martin@sr~ Thu, 22 A~ Re: [zzzzteana~ i just had to jump~ D:/R/data_set/sp~
> mice::md.pattern(dt)
## from date message id subject
## 6944 1 1 1 1 1 0
## 7 1 1 1 1 0 1
## 0 0 0 0 7 7
subject變量存在7個(gè)缺失值。
另外,我們還需要針對具體變量做更詳細(xì)的檢查。
> # 檢查from中是否都存在@符號(hào)
> table(str_detect(dt$from,"@"))
##
## TRUE
## 6951
說明郵箱中發(fā)件人信息從郵箱格式上看是沒有問題的。
隨機(jī)查看30個(gè)date列的值:
> dt$date[sample(nrow(dt),30)]
## [1] "Tue, 27 Aug 2002 21:36:22 -0400"
## [2] "Fri, 9 Aug 2002 20:09:02 -0700"
## [3] "Thu, 25 Jul 2002 04:56:39 -0400 (EDT)"
## [4] "Sat, 24 Aug 2002 10:57:13 -0400 (EDT)"
## [5] "Fri, 04 Oct 2002 10:03:14 +0300"
## [6] "Mon, 2 Sep 2002 09:33:47 -0400"
## [7] "Mon, 09 Sep 2002 12:29:51 -0400"
## [8] "Thu, 22 Aug 2002 12:39:47 -0300"
## [9] "Wed, 28 Aug 2002 07:45:18 -0700"
## [10] "Tue, 24 Sep 2002 08:00:11 -0000"
## [11] "Sat, 03 Aug 2002 22:31:23 -0700"
## [12] "Mon, 07 Oct 2002 08:00:59 -0000"
## [13] "20 Jul 2002 10:50:58 +1200"
## [14] "Wed, 10 Jul 2002 16:34:42 -0700 (PDT)"
## [15] "Tue, 08 Oct 2002 13:28:56 +0100"
## [16] "Tue, 20 Aug 2002 16:30:38 -0300"
## [17] "Thu, 26 Sep 2002 08:01:56 -0000"
## [18] "Tue, 1 Oct 2002 14:16:16 +0300 (EEST)"
## [19] "Mon, 30 Sep 2002 15:55:47 -0400"
## [20] "Thu, 18 Jul 2002 17:20:20 -0700 (PDT)"
## [21] "Tue, 20 Aug 2002 15:31:17 +0100"
## [22] "03 Oct 2002 21:58:55 -0400"
## [23] "Sun, 01 Dec 2002 18:03:10 -0700"
## [24] "Thu, 18 Jul 2002 13:46:12 -0700 (PDT)"
## [25] "Sun, 29 Sep 2002 08:00:02 -0000"
## [26] "Thu, 26 Sep 2002 15:32:19 -0000"
## [27] "Wed, 31 Jul 2002 16:37:42 +0100"
## [28] "Mon, 12 Aug 2002 09:29:38 +0100"
## [29] "Sat Sep 7 04:38:51 2002"
## [30] "Wed, 09 Oct 2002 08:00:35 -0000"
多抽樣幾次,可以發(fā)現(xiàn)date列的格式比較多,比如
"Sun, 15 Sep 2002 21:22:52 -0400",
"01 Oct 2002 19:22:16 -0700",
"Wed, 17 Jul 2002 20:58:30 -0700 (PDT)",
"Tue, 24 Sep 2002 08:46:08 EDT",
"Tue Sep 10 10:29:19 2002",
需要重新整理成統(tǒng)一的格式。構(gòu)建日期轉(zhuǎn)換函數(shù):
> trans_date <- function(string) {
+
+ string <- str_split(string," ") %>% unlist %>%
+
+ str_remove_all("Sun|Mon|Tue|Wed|Thu|Fri|Sat|,") %>%
+
+ str_remove_all("[+|-](.*)") %>% str_remove_all("\\(.*\\)") %>%
+
+ str_remove_all("[A-Z]{2,}")
+ year <- string[nchar(string)==4]
+ month <- string[nchar(string)==3]
+ day <- string[nchar(string)==1|nchar(string)==2]
+ time <- string[nchar(string)==8]
+ string.new <- paste(day,month,year,time) %>% lubridate::dmy_hms()
+ return(string.new)
+ }
>
> dt$date <- dt$date %>% trans_date
> dt$date[sample(nrow(dt),10)]
## [1] "2002-07-15 03:00:01 UTC" "2002-10-08 08:01:21 UTC"
## [3] "2002-08-29 08:32:08 UTC" "2002-09-25 08:00:22 UTC"
## [5] "2002-10-08 08:01:05 UTC" "2002-09-12 09:05:50 UTC"
## [7] "2002-09-30 22:00:02 UTC" "2002-08-06 16:50:07 UTC"
## [9] "2002-07-10 16:05:42 UTC" "2002-10-08 08:00:31 UTC"
轉(zhuǎn)換后的結(jié)果很規(guī)整,完全符合我們的要求。
> head(dt)
## # A tibble: 6 x 5
## from date subject message id
## <chr> <dttm> <chr> <chr> <chr>
## 1 kre@munn~ 2002-08-22 18:26:25 Re: New Sequ~ date wed aug fr~ D:/R/data_set~
## 2 steve.bu~ 2002-08-22 12:46:18 [zzzzteana] ~ martin a posted~ D:/R/data_set~
## 3 timc@2ub~ 2002-08-22 13:52:38 [zzzzteana] ~ man threatens e~ D:/R/data_set~
## 4 monty@ro~ 2002-08-22 09:15:25 [IRR] Klez: ~ klez the virus ~ D:/R/data_set~
## 5 Stewart.~ 2002-08-22 14:38:22 Re: [zzzztea~ in adding cream~ D:/R/data_set~
## 6 martin@s~ 2002-08-22 14:50:31 Re: [zzzztea~ i just had to j~ D:/R/data_set~
現(xiàn)在數(shù)據(jù)基本轉(zhuǎn)換成了我們需要的樣子,下面繼續(xù)做一些必要的轉(zhuǎn)換。將from和subject全部轉(zhuǎn)換為小寫,并且將整個(gè)數(shù)據(jù)框按date列排序。最后將數(shù)據(jù)拆分為訓(xùn)練集和測試集。
> dt$from <- tolower(dt$from)
> dt$subject <- tolower(dt$subject)
> dt <- arrange(dt,date)
>
> set.seed(123)
> ind <- sample(1:nrow(dt),nrow(dt)*0.8,replace = T)
> train <- dt[ind,]
> test <- dt[-ind,]
2、郵件發(fā)送量權(quán)重計(jì)算策略
來自同一地址(from)的郵件越頻繁,說明該郵件越重要。所以按郵件中的from計(jì)數(shù)來設(shè)計(jì)用于重要性排序的權(quán)重。
> p_load(ggplot2)
> from.weight <- train[,"from"] %>% group_by(from) %>%
+ summarise(freq=n()) %>% arrange(-freq)
> head(from.weight)
## # A tibble: 6 x 2
## from freq
## <chr> <int>
## 1 rssfeeds@example.com 498
## 2 rssfeeds@spamassassin.taint.org 468
## 3 tomwhore@slack.net 112
## 4 garym@canada.com 104
## 5 pudge@perl.org 102
## 6 matthias@egwn.net 86
> # 查看郵件數(shù)量最多的前30個(gè)賬號(hào)
> from.weight %>% top_n(30) %>%
+ ggplot(aes(freq,reorder(from,freq))) +
+ geom_bar(stat = "identity") +
+ theme_bw() +
+ labs(x="接收郵件數(shù)量",y="")
對發(fā)送量做對數(shù)轉(zhuǎn)換。
> from.weight %>%
+ ggplot(aes(x=freq)) +
+ geom_line(aes(y=log10(freq)),col="green") +
+ geom_text(aes(500,2.5,label="對數(shù)變換")) +
+ geom_line(aes(y=log(freq)),col="red") +
+ geom_text(aes(500,6,label="自然對數(shù)變換")) +
+ theme_bw() +
+ labs(x="",y="接收的郵件量")
做對數(shù)變換后曲線會(huì)更平緩,同時(shí),自然對數(shù)變換相對對數(shù)變換程度更小,更能保留原始數(shù)據(jù)的一些差異,所以最終我們選擇自然對數(shù)變換后的值作為發(fā)送量特征的權(quán)重。
但是在做對數(shù)變換時(shí)需要注意的是,如果觀測值為1,轉(zhuǎn)換后就為0,計(jì)算權(quán)重時(shí)0乘以其他任何值都為0。為了避免這種情況,在轉(zhuǎn)換前一般對觀測值都加1。
> # log1p()函數(shù)計(jì)算log(p+1)
> from.weight$freq <- log1p(from.weight$freq)
> # 檢查下變換為的數(shù)據(jù)
> summary(from.weight$freq)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6931 0.6931 1.3863 1.4869 1.7918 6.2126
3、郵件線程活躍度權(quán)重計(jì)算策略
從subject中查找“re:”,然后查找這個(gè)線程里面的其他郵件,并測量其活躍度。在短時(shí)間內(nèi)有更多郵件發(fā)送的線程就更活躍,因此也更重要。
> # 提取包含“re:”的subject,并提取“re:”后面的內(nèi)容作為主題
> threads.train <- train %>% filter(str_detect(subject,"re:"))
>
> extract_subject <- function(string){
+ string <- str_split(string,"re:") %>%
+ unlist %>% .[2] %>% str_trim()
+ return(string)
+ }
> threads.train$subject <- threads.train$subject %>%
+ lapply(extract_subject) %>% unlist
>
> # 分組統(tǒng)計(jì)數(shù)量
> threads.freq <- threads.train %>% group_by(subject) %>%
+ summarise(freq=n()) %>% arrange(freq)
數(shù)據(jù)中存在freq<2的情況,是因?yàn)閿?shù)據(jù)集在采集的時(shí)候存在一部分主題郵件是在采集時(shí)間開始之前發(fā)起的,這時(shí)候主題中也存在“re:”標(biāo)記,但是該線程發(fā)起時(shí)間并不在數(shù)據(jù)集中,所以需要去掉這部分?jǐn)?shù)據(jù)。
> # 線程時(shí)間跨度,即第一封郵件和最后一封郵件之間的時(shí)間間隔
> time_span <- function(df){
+ max.time <- max(df$date)
+ min.time <- min(df$date)
+ threads.span <- difftime(max.time,min.time,units = "secs")
+ df.new <- tibble(subject=df$subject[1],threads.span=threads.span)
+ return(df.new)
+ }
>
> # 將數(shù)據(jù)框按subject拆分
> threads.train.split <- split.data.frame(threads.train,
+ threads.train$subject)
>
> threads.span <- lapply(threads.train.split,time_span) %>%
+ do.call(rbind.data.frame,.)
按主題合并兩個(gè)數(shù)據(jù)框。
> subject.weight <- left_join(threads.freq,threads.span,by="subject") %>%
+ # 轉(zhuǎn)換為數(shù)值型
+ transform(threads.span=as.numeric(threads.span)) %>%
+ filter(freq>=2 & threads.span!=0) %>%
+ mutate(weight=freq/threads.span) %>%
+ # 仿射變換
+ transform(weight=log10(weight)+10) %>%
+ arrange(weight)
> head(subject.weight)
## subject freq threads.span weight
## 1 activebuddy 17 820721053 2.316253
## 2 [zzzzteana] 6 8275106 3.860378
## 3 no matter where you go 4 2672629 4.175121
## 4 [sadev] 7 3325905 4.323188
## 5 [ilug-social] 4 1649403 4.384733
## 6 [razor-users] 16 5174599 4.490243
> summary(subject.weight)
## subject freq threads.span weight
## Length:287 Min. : 2.000 Min. : 16 Min. :2.316
## Class :character 1st Qu.: 3.000 1st Qu.: 15482 1st Qu.:5.675
## Mode :character Median : 5.000 Median : 49344 Median :6.126
## Mean : 7.784 Mean : 3094884 Mean :6.115
## 3rd Qu.: 9.000 3rd Qu.: 145816 3rd Qu.:6.512
## Max. :41.000 Max. :820721053 Max. :9.097
從摘要中可以看到freq平均為7.784,threads.span平均為3094884,這樣計(jì)算的weight將會(huì)很小,平均為2.515118e-06,在做對數(shù)轉(zhuǎn)換時(shí),就會(huì)得到負(fù)值:log10(7.784/3094884)=-5.600825。計(jì)算時(shí)權(quán)重不能為負(fù)值,所以這里進(jìn)行仿射變換,簡單地給所有轉(zhuǎn)換值加10,以保證所有權(quán)重值為正數(shù)。
4、郵件內(nèi)容中高頻詞項(xiàng)的權(quán)重策略
假設(shè)出現(xiàn)在活躍線程郵件主題中的高頻詞比低頻詞和出現(xiàn)在不活躍線程中的詞項(xiàng)更重要。
> p_load(text2vec)
>
> it <- itoken(threads.dt$message,ids = threads.dt$id,progressbar = F)
>
> # 創(chuàng)建訓(xùn)練集詞匯表
> vocab <- create_vocabulary(it)
>
> # 去除停用詞
> stopword <- readr::read_table("D:/R/dict/english_stopword.txt",
+ col_names = F)
>
> # 還是以對數(shù)轉(zhuǎn)換計(jì)算高頻詞的權(quán)重
> term.weight <- anti_join(vocab,stopword,by=c("term"="X1")) %>%
+ mutate(term.weight=log10(term_count)) %>%
+ filter(term.weight>0)
5、訓(xùn)練和測試排序算法
一封郵件的整體權(quán)重(優(yōu)先級(jí))等于前面三種權(quán)重的乘積。當(dāng)收到一封郵件的時(shí)候,我們需要先對其進(jìn)行解析,計(jì)算其權(quán)重,然后對其進(jìn)行優(yōu)先級(jí)排序。
構(gòu)造排序函數(shù):
> get_weight <- function(newemail){
+
+ from.new.n <- left_join(newemail[,1],from.weight,by = "from")
+ from.new <- ifelse(is.na(from.new.n$freq),1,from.new.n$freq)
+
+
+ if (!is.na(newemail$subject) & str_detect(newemail$subject,"re:")) {
+ newemail$subject <- extract_subject(newemail$subject)
+ subject.new.n <- left_join(newemail,subject.weight,by="subject")
+
+ subject.new <- ifelse(is.na(subject.new.n$weight),1,
+ subject.new.n$weight)
+ } else {
+
+ subject.new <- 1
+ }
+
+
+ if (newemail$message!="") {
+
+ msg.weight <- str_split(newemail$message," ") %>% unlist %>%
+ jiebaR::freq() %>% anti_join(stopword,by=c("char"="X1")) %>%
+ filter(char!="")
+
+ if (nrow(msg.weight)!=0) {
+ msg.weight.n <- left_join(msg.weight,term.weight[,c(1,4)],
+ by=c("char"="term")) %>%
+ summarise(msg.new=sum(freq*term.weight,na.rm = T))
+ msg.new <- msg.weight.n$msg.new
+ }
+ } else {
+ msg.new <- 1
+ }
+
+
+ return(prod(from.new,subject.new,msg.new))
+ }
5.1 對訓(xùn)練集進(jìn)行排序
> rank.train <- vector(length = nrow(train))
> for (i in 1:nrow(train)) {
+ rank.train[i] <- get_weight(train[i,])
+ }
>
> train.rank <- tibble(id=train$id,rank=rank.train)
>
> summary(train.rank$rank)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.64 282.39 859.82 2300.37 2702.20 241506.57
> # 檢查排序值的分布
> p1 <- ggplot(train.rank,aes(rank)) +
+ geom_histogram(bins = 1000, fill = "dodgerblue") +
+ geom_vline(xintercept = median(rank.train),size=1) +
+ xlim(c(0,25000)) +
+ theme_bw() +
+ labs(y="")
6、對測試集進(jìn)行排序
> rank <- vector(length = nrow(test))
> for (i in 1:nrow(test)) {
+ rank[i] <- get_weight(test[i,])
+ }
>
> test.rank <- tibble(id=test$id,rank=rank)
>
> summary(test.rank$rank)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.64 237.50 716.02 1913.47 2301.08 36852.14
> # 檢查排序值的分布
> p2 <- ggplot(test.rank,aes(rank)) +
+ geom_histogram(bins = 1000, fill = "red") +
+ geom_vline(xintercept = median(rank),size=1) +
+ xlim(c(0,25000)) +
+ theme_bw() +
+ labs(y="")
對比訓(xùn)練集和測試的排序分布。
> p_load(patchwork)
> p1 + p2 + plot_layout(nrow = 2)
可以看到訓(xùn)練集和測試集的排序分布幾乎一模一樣,都是長尾分布,意味著更多的郵件的優(yōu)先級(jí)排序不高,這也符合常理。
然后檢查一下測試集排序最靠前的20行。
> test[,3] %>% cbind(rank=rank) %>% arrange(-rank) %>% head(20)
## subject rank
## 1 re: apple sauced...again 36852.14
## 2 re: apple sauced...again 36852.14
## 3 sed /s/united states/roman empire/g 33739.40
## 4 re: selling wedded bliss (was re: ouch...) 25141.77
## 5 re: selling wedded bliss (was re: ouch...) 25141.77
## 6 re: new sequences window 22509.49
## 7 [lockergnome windows daily] fraud wipes 21055.61
## 8 [lockergnome windows daily] fraud wipes 21023.90
## 9 [lockergnome windows daily] brilliant mistakes 21004.60
## 10 [lockergnome penguin shell] recursive metaphor 20823.06
## 11 [lockergnome windows daily] cranky beats 20449.50
## 12 re: comrade communism (was re: crony capitalism (was re: sed 20073.10
## 13 [lockergnome windows daily] deeper uplink 20043.92
## 14 bush covers the waterfront 19548.67
## 15 [lockergnome digital media] clever ritual 19499.38
## 16 [lockergnome digital media] clever ritual 19467.67
## 17 [lockergnome windows daily] dignity shakedown 19325.42
## 18 [lockergnome penguin shell] good hearts 19295.45
## 19 [lockergnome windows daily] dignity shakedown 19293.71
## 20 [lockergnome windows daily] sticker courtesy 19132.87
主題中幾乎有一大半是不活躍的郵件,因?yàn)閟ubject中不包含“re:”,也表明排序算法可以將主題之外的其他權(quán)重應(yīng)用到數(shù)據(jù)中。
盡管這種非監(jiān)督的排序算法無法測算其準(zhǔn)確度,但這結(jié)果仍然是很鼓舞人心的。