就是這篇文章,知乎稱“您的帳號由于存在異常行為暫時被知乎反作弊系統限制使用”,然后任憑我申訴多久,都恢復不了了!!!最可惡的是,在你發布文章的時候一點兒提示都沒有,顯示發布成功,但就是擅自刪除,辛苦寫的文章就找不到了?。。?/p>
1.引言
本文數據來自Kaggle中知名的數據集Titanic Machine Learning from Disaster,是利用訓練集訓練模型,來預測測試集中的乘客能否在沉船事件存活。
說明:本文借鑒了 Megan L. Risdal 的文章,在細節處略有改動。
2.數據導入與觀察
加載需要用到的包:
> pkgs <- c("dplyr","ggplot2","ggthemes","scales","mice","randomForest")
> install.packages(pkgs,dependencies = TRUE)
> library(dplyr) # 操作數據的包
> library(ggplot2) # 繪圖包
> library(ggthemes) # ggplot2的主題修改包
> library(scales) # 可視化的包
> library(VIM) # 查看缺失數據的包
> library(mice) # 插補數據的包
> library(randomForest) # 隨機森林
導入并初步觀察數據:
> train <- read.csv(file.choose(),stringsAsFactors = FALSE)
> test <- read.csv(file.choose(),stringsAsFactors = FALSE)
> str(train)
略
> str(test)
略
兩個數據集除Survived字段不同外,其他字段均相同。為了方便數據清洗,我們合并訓練集與測試集。
> full <- bind_rows(train, test)
> str(full)
'data.frame': 1309 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
$ Sex : chr "male" "female" "female" "female" ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : chr "" "C85" "" "C123" ...
$ Embarked : chr "S" "C" "S" "S" ...
> summary(full)
PassengerId Survived Pclass Name Sex Age
Min. : 1 Min. :0.0000 Min. :1.000 Length:1309 Length:1309 Min. : 0.17
1st Qu.: 328 1st Qu.:0.0000 1st Qu.:2.000 Class :character Class :character 1st Qu.:21.00
Median : 655 Median :0.0000 Median :3.000 Mode :character Mode :character Median :28.00
Mean : 655 Mean :0.3838 Mean :2.295 Mean :29.88
3rd Qu.: 982 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:39.00
Max. :1309 Max. :1.0000 Max. :3.000 Max. :80.00
NA's :418 NA's :263
SibSp Parch Ticket Fare Cabin Embarked
Min. :0.0000 Min. :0.000 Length:1309 Min. : 0.000 Length:1309 Length:1309
1st Qu.:0.0000 1st Qu.:0.000 Class :character 1st Qu.: 7.896 Class :character Class :character
Median :0.0000 Median :0.000 Mode :character Median : 14.454 Mode :character Mode :character
Mean :0.4989 Mean :0.385 Mean : 33.295
3rd Qu.:1.0000 3rd Qu.:0.000 3rd Qu.: 31.275
Max. :8.0000 Max. :9.000 Max. :512.329
NA's :1
合并后的數據中,共有1309個觀測,其中訓練集891個,測試集418個,生存情況(Survived)中缺失值418個(正常,需要預測的部分),年齡(Age)中缺失值263個,船票費用(Fare)中缺失值1個。
解釋一下各個變量對應的含義:
3.數據清洗
3.1 觀察姓名變量
我們注意到乘客名字(Name)有一個非常顯著的特點:每個名字當中都包含了具體的稱謂或者頭銜。我們可以將這部分信息提取出來,或許可以作為非常有用的新變量。
> # 乘客姓名提取頭銜
> full$Title <- gsub("(.*, )|(\\..*)", "", full$Name)
> # 按照性別劃分頭銜數量
> table(full$Sex, full$Title)
Capt Col Don Dona Dr Jonkheer Lady Major Master Miss Mlle Mme Mr Mrs Ms Rev Sir the Countess
female 0 0 0 1 1 0 1 0 0 260 2 1 0 197 2 0 0 1
male 1 4 1 0 7 1 0 2 61 0 0 0 757 0 0 8 1 0
> # 對于那些出現頻率較低的頭銜合并為一類
> rare_title <- c("Capt","Col","Don","Dona","Dr", "Jonkheer", "Lady","Major", "Rev", "Sir","the Countess")
> # 數據量不多,我想試試手工調整頭銜
> filter(full,full$Title %in% rare_title)%>%
+ select(Name,Sex,Age,SibSp,Parch,Title)%>%
+ arrange(Title)
Name Sex Age SibSp Parch Title
1 Crosby, Capt. Edward Gifford male 70 1 1 Capt
2 Simonius-Blumer, Col. Oberst Alfons male 56 0 0 Col
3 Weir, Col. John male 60 0 0 Col
4 Gracie, Col. Archibald IV male 53 0 0 Col
5 Astor, Col. John Jacob male 47 1 0 Col
6 Uruchurtu, Don. Manuel E male 40 0 0 Don
7 Oliva y Ocana, Dona. Fermina female 39 0 0 Dona
8 Minahan, Dr. William Edward male 44 2 0 Dr
9 Moraweck, Dr. Ernest male 54 0 0 Dr
10 Pain, Dr. Alfred male 23 0 0 Dr
11 Stahelin-Maeglin, Dr. Max male 32 0 0 Dr
12 Frauenthal, Dr. Henry William male 50 2 0 Dr
13 Brewe, Dr. Arthur Jackson male NA 0 0 Dr
14 Leader, Dr. Alice (Farnham) female 49 0 0 Dr
15 Dodge, Dr. Washington male 53 1 1 Dr
16 Reuchlin, Jonkheer. John George male 38 0 0 Jonkheer
17 Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs Morgan") female 48 1 0 Lady
18 Peuchen, Major. Arthur Godfrey male 52 0 0 Major
19 Butt, Major. Archibald Willingham male 45 0 0 Major
20 Byles, Rev. Thomas Roussel Davids male 42 0 0 Rev
21 Bateman, Rev. Robert James male 51 0 0 Rev
22 Carter, Rev. Ernest Courtenay male 54 1 0 Rev
23 Kirkland, Rev. Charles Leonard male 57 0 0 Rev
24 Harper, Rev. John male 28 0 1 Rev
25 Montvila, Rev. Juozas male 27 0 0 Rev
26 Lahtinen, Rev. William male 30 1 1 Rev
27 Peruschitz, Rev. Joseph Maria male 41 0 0 Rev
28 Duff Gordon, Sir. Cosmo Edmund ("Mr Morgan") male 49 1 0 Sir
29 Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards) female 33 0 0 the Countess
> # 我將Lady調整為Mrs,Sir調整為Mr
> rare_title2 <- c("Capt","Col","Don","Dona","Dr", "Jonkheer","Major", "Rev","the Countess")
> # 對于一些稱呼進行重新指定(按含義) 如mlle, ms指小姐, mme 指女士
> full$Title[full$Title %in% c("Mlle","Ms")]<- "Miss"
> full$Title[full$Title %in% c("Mme","Lady")]<- "Mrs"
> full$Title[full$Title =="Sir"]<- "Mr"
> full$Title[full$Title %in% rare_title2] <- "Rare Title"
> # 查看重新調整后的情況
> table(full$Sex, full$Title)
Master Miss Mr Mrs Rare Title
female 0 264 0 199 3
male 61 0 758 0 24
> # 從乘客姓名中,提取姓氏,作為家庭變量
> full$Surname <- sapply(full$Name, function(x) strsplit(x, split = '[,.]')[[1]][1])
> length(unique(full$Surname))
[1] 875
Megan L. Risdal 的文章中,將乘客姓氏也提取出來,提示可以發掘乘客姓氏之間的聯系,但沒有進行進一步操,我們這里也就不探討了。
3.2 觀察家庭情況
處理完乘客姓名這一變量,我們再考慮衍生一些家庭相關的變量。基于已有變量SubSp和Parch生成家庭人數family size 這一變量。
> # 生成家庭人數變量,包括自己在內
> full$Fsize <- full$SibSp + full$Parch + 1
> # 生成一個家庭變量:格式為姓_家庭人數
> full$Family <- paste(full$Surname, full$Fsize, sep="_")
> # 使用 ggplot2 繪制家庭人數與生存情況之間的關系
> ggplot(full[1:891,], aes(x = Fsize, fill = factor(Survived))) +
+ geom_bar(stat="count", position="dodge") +
+ scale_x_continuous(breaks=c(1:11)) +
+ labs(x = "Family Size") +
+ theme_few()
通過圖形我們可以明顯發現以下特點:
- 個人和家庭人數>4的家庭,存活人數小于死亡人數
- 家庭人數在[2:4]的家庭,存活人數大于死亡人數
因此,我們可以將家庭人數變量進行分段合并,明顯的可以分為3段:個人,小家庭,大家庭,由此生成新變量。
> # 離散化
> full$FsizeD[full$Fsize == 1] <- "single"
> full$FsizeD[full$Fsize < 5 & full$Fsize > 1]<- "small"
> full$FsizeD[full$Fsize > 4] <- "large"
> # 通過馬賽克圖查看家庭規模與生存情況之間的關系
> mosaicplot(table(full$FsizeD,full$Survived), main="Family Size by Survival", shade=TRUE)
顯而易見,個人與大家庭不利于在沉船事故中生存,而小家庭當中生存率相對較高。
3.3 嘗試生成更多變量
在乘客客艙變量Cabin中,也存在一些有價值的信息,如客艙層數deck。
> # 可以看出這一變量有很多缺失值,有單個客戶多個客艙,格式基本為字母+數字
> head(full$Cabin,30)
[1] "" "C85" "" "C123" "" "" "E46"
[8] "" "" "" "G6" "C103" "" ""
[15] "" "" "" "" "" "" ""
[22] "D56" "" "A6" "" "" "" "C23 C25 C27"
[29] "" ""
> # 假設第一個字母即為客艙層數,建立一個層數變量(Deck),取值范圍A - z:
> full$Deck<-factor(sapply(full$Cabin, function(x) strsplit(x, NULL)[[1]][1]))
> summary(full$Deck)
A B C D E F G T NA's
22 65 94 46 41 21 5 1 1014
上面只是對變量進行了基本處理,還有很多可以進一步操作的地方,如有些乘客名下包含很多間房 (如28行, "C23 C25 C27"), 但是這一變量有1014 個缺失值,占比太高了。 后面就不再進一步考慮。
4.處理缺失值
我們先找到所有的缺失數據看看情況:
- 數值型數據:age缺失263個,fare缺失1個
- 字符型數據:cabin缺失1014個,embarked缺失2個
- 因子型數據:deck缺失1014個
> sapply(full,function(x) {sum(is.na(x))})
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket
0 418 0 0 0 263 0 0 0
Fare Cabin Embarked Title Surname Fsize Family FsizeD Deck
1 0 0 0 0 0 0 0 1014
> sapply(full,function(x) {sum(x == "",na.rm=TRUE)})
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket
0 0 0 0 0 0 0 0 0
Fare Cabin Embarked Title Surname Fsize Family FsizeD Deck
0 1014 2 0 0 0 0 0 0
處理缺失值的方法有很多種,考慮到數據集本身較小,樣本數也不多,因而不能直接整行或者整列刪除缺失值樣本。我們考慮對于缺失值較少的,用均值或中位數填補,缺失值較多的通過現有數據和變量進行預估填補。
4.1 登船港口缺失
找到缺失數據在哪一行,觀察情況。
> filter(full,full$Embarked=="")
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare
1 62 1 1 Icard, Miss. Amelie female 38 0 0 113572 80
2 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62 0 0 113572 80
Cabin Embarked Title Surname Fsize Family FsizeD Deck
1 B28 Miss Icard 1 Icard_1 single B
2 B28 Mrs Stone 1 Stone_1 single B
我們可以看到他們支付的票價都是$ 80,艙位等級都是1,我們可以大膽推論有相同艙位等級(passenger class)和票價(Fare)的乘客,也許有著相同的登船港口?
> # 去除缺失值乘客的ID
> embark_fare <- full %>% filter(PassengerId != 62 & PassengerId != 830)
> # 用 ggplot2 繪制embarkment, passenger class, & median fare 三者關系圖
> ggplot(embark_fare, aes(x = Embarked, y = Fare, fill = factor(Pclass))) +
+ geom_boxplot() +
+ geom_hline(aes(yintercept=80),
+ colour="red", linetype="dashed", lwd=2) +
+ scale_y_continuous(labels=dollar_format()) +
+ theme_few()
很明顯,港口C出發的頭等艙支付的票價中位數為80,因此我們可以把兩個缺失值替換為C。
> full$Embarked[c(62, 830)] <- "C"
4.1 票價缺失
找到缺失數據在哪一行,觀察情況。
> filter(full,is.na(full$Fare))
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title Surname
1 1044 NA 3 Storey, Mr. Thomas male 60.5 0 0 3701 NA S Mr Storey
Fsize Family FsizeD Deck
1 1 Storey_1 single <NA>
這是從港口S出發的三等艙乘客,根據上面登船港口缺失值填補的推論,我們可以作圖看看:
> ggplot(full[full$Pclass == "3" & full$Embarked == "S", ],
+ aes(x = Fare)) +
+ geom_density(fill = "#99d6ff", alpha=0.4) +
+ geom_vline(aes(xintercept=median(Fare, na.rm=T)),colour="red", linetype="dashed", lwd=1) +
+ scale_x_continuous(labels=dollar_format()) +
+ theme_few()
從得到的圖形上看,將缺失值用中位數進行替換是合理的。替換數值為$8.05。
> full$Fare[1044] <- median(full[full$Pclass==3&full$Embarked=="S",]$Fare,na.rm=TRUE)
> full$Fare[1044]
[1] 8.05
4.3 年齡缺失
用戶年齡(Age) 中有大量缺失存在,簡單用中位數或均值肯定不妥,這里我們用mice包的隨機插補,將基于其他變量構建一個預測模型對年齡缺失值進行填補。
> # 使變量因子化
> factor_vars <- c("PassengerId","Pclass","Sex","Embarked",
+ "Title","Surname","Family","FsizeD")
> full[factor_vars] <- lapply(full[factor_vars],function(x) as.factor(x))
> # 設置隨機種子
> set.seed(123)
> # 執行多重插補法,剔除一些沒什么用的變量
> mice_mod <- mice(full[, !names(full) %in%
+ c("PassengerId","Name","Ticket","Cabin",
+ "Family","Surname","Survived")],
+ method="rf")
iter imp variable
1 1 Age Deck
1 2 Age Deck
1 3 Age Deck
1 4 Age Deck
1 5 Age Deck
2 1 Age Deck
2 2 Age Deck
2 3 Age Deck
2 4 Age Deck
2 5 Age Deck
3 1 Age Deck
3 2 Age Deck
3 3 Age Deck
3 4 Age Deck
3 5 Age Deck
4 1 Age Deck
4 2 Age Deck
4 3 Age Deck
4 4 Age Deck
4 5 Age Deck
5 1 Age Deck
5 2 Age Deck
5 3 Age Deck
5 4 Age Deck
5 5 Age Deck
> # 保存完整輸出
> mice_output <- complete(mice_mod)
> # 繪制年齡分布圖
> par(mfrow=c(1,2))
> hist(full$Age, freq=F, main="Age: Original Data",
+ col="darkgreen", ylim=c(0,0.04))
> hist(mice_output$Age, freq=F, main="Age: MICE Output",
+ col="lightgreen", ylim=c(0,0.04))
從上面的圖來看,數據填補前后,并沒有發生明顯的偏移,隨機插補應該是有效的,那么下面可以用mice模型的結果對原年齡數據進行替換。
> full$Age <- mice_output$Age
> # 檢查缺失值是否被完全替換了
> sum(is.na(full$Age))
[1] 0
5.特征工程
現在我們知道每一位乘客的年齡,那么基于“婦女與兒童優先”的原則,我們可以生成一些變量,如兒童(Child)和 母親(Mother)。
劃分標準:
- 兒童 : 年齡Age < 18
- 母親 : 女性+年齡 > 18+擁有超過1個子女+頭銜不是'Miss'
> # 首先我們來看年齡與生存情況之間的關系
> ggplot(full[1:891,], aes(Age, fill = factor(Survived))) +
+ geom_histogram() +
+ # 分性別來看,因為我們知道性別對于生存情況有重要影響
+ facet_grid(.~Sex) +
+ theme_few()
生成child變量, 并且基于此劃分兒童child與成人adult。
> full$child[full$Age < 18] <- "Child"
> full$child[full$Age >= 18] <- "Adult"
> # 展示對應人數
> table(full$child, full$Survived)
0 1
Adult 479 270
Child 70 72
下面來生成母親這個變量。
> #增加mother變量
> full$Mother <- "NotMother"
> full$Mother[full$Sex == "female" & full$Parch > 0 & full$Age > 18 & full$Title != "Miss"] <- "Mother"
> table(full$Mother, full$Survived)
0 1
Mother 15 39
NotMother 534 303
將child和mother變量轉化成因子類型
> full$child <- factor(full$child)
> full$Mother <- factor(full$Mother)
至此,所有我們需要的變量都已經生成,并且其中沒有缺失值。 為了保險起見,我們進行二次確認。
> sapply(full,function(x){sum(is.na(x))})
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket
0 418 0 0 0 0 0 0 0
Fare Cabin Embarked Title Surname Fsize Family FsizeD Deck
0 0 0 0 0 0 0 0 1014
child Mother
0 0
> sapply(full,function(x){sum(x=="",na.rm=T)})
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket
0 0 0 0 0 0 0 0 0
Fare Cabin Embarked Title Surname Fsize Family FsizeD Deck
0 1014 0 0 0 0 0 0 0
child Mother
0 0
6.模型設定與預測
在完成上面的工作之后,我們進入到最后一步:預測泰坦尼克號上乘客的生存狀況。 在這里我們使用隨機森林分類算法(The RandomForest Classification Algorithm) 。
第一步,拆分訓練集與測試集
> train <- full[1:891,]
> test <- full[892:1309,]
第二步, 建立模型
> # 設置隨機種子
> set.seed(1234)
> # 建立模型 (注意: 不是所有可用變量全部加入)
> rf_model <- randomForest(factor(Survived) ~ Pclass + Sex + Age + SibSp +Parch +
+ Fare + Embarked + Title + FsizeD + child + Mother,
+ data = train)
> # 顯示模型誤差
> plot(rf_model, ylim=c(0,0.36))
> legend("topright", colnames(rf_model$err.rate), col=1:3, fill=1:3)
黑色那條線表示:整體誤差率(the overall error rate)保持在20% 左右
紅色和綠色線分別表示:遇難與生還的誤差率,紅線保持在10%,綠線保持在30%左右。
我們的模型,在預測死亡情況時更準確一些。
第三步, 查看變量重要性
> importance <- importance(rf_model)
> varImportance <- data.frame(Variables = row.names(importance), Importance = round(importance[ ,"MeanDecreaseGini"],2))
> rankImportance <- varImportance %>%
+ mutate(Rank = paste0("#",dense_rank(desc(Importance))))
> # 作圖
> ggplot(rankImportance, aes(x = reorder(Variables, Importance), y = Importance, fill = Importance)) +
+ geom_bar(stat="identity") +
+ geom_text(aes(x = Variables, y = 0.5, label = Rank), hjust=0, vjust=0.55, size = 4, colour = "red") + labs(x = "Variables") +
+ coord_flip()
我們從圖上可以看出,頭銜和票價對于生存情況影響最大,其次是性別和年齡,而乘客艙位排第五。 而母親和孩子對于生存與否的影響最小,真是有點出乎意料。
第四步, 預測
> # 將模型帶入測試集
> prediction <- predict(rf_model, test)
> # 保存結果
> solution <- data.frame(PassengerID = test$PassengerId, Survived = prediction)
> # 輸出結果到CSV文件格式
> write.csv(solution, file = "rf_mod_Solution.csv", row.names = F)
7.總結
本次數據分析是基于Megan L. Risdal 的文章,算是一次深度模仿,而且對于其中的一些操作還有不太理解的部分,還應該多看多學,爭取早日獨立完成一項數據分析工作。