My first time on Kaggle for hands-on training. First impressions: the interface is polished and the guided tutorials are friendly and well thought out; it is a fairly mature platform.
The datasets are rich too: historical European football scores, US presidential election analysis, programming language usage surveys, human resources analytics, historical plane crash statistics, IMDB movie rating analysis, and some anonymized financial lending and risk data.
I found the Titanic dataset and followed the guide through my first task. The DataCamp course gives a task description for each step; you write code from the hints, submit it, and correct mistakes based on the feedback until you get it right. It feels much like the quest tutorial of a new video game.
Requirements Analysis
There are two sets of Titanic passenger data: a Train set and a Test set. By analyzing the Train features together with the label "Survived", we clean the data, select features, and build a decision model to predict whether each passenger in the Test set survived.
Field Descriptions
VARIABLE DESCRIPTIONS:
survival Survival
(0 = No; 1 = Yes)
pclass Passenger Class
(1 = 1st; 2 = 2nd; 3 = 3rd)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare
cabin Cabin
embarked Port of Embarkation
(C = Cherbourg; Q = Queenstown; S = Southampton)
Python Code
import numpy as np
from sklearn import tree
import pandas as pd
Installing the numpy and scipy libraries from the official source kept failing; I later found a download page that offers many unofficial prebuilt Python wheels:
cmd >> python -m pip install xx.whl >> installed successfully.
Step 1: Import and inspect the data
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)
test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)
# Print summary statistics of the train and test dataframes
print(train.describe())
print(test.describe())
PassengerId Pclass Age SibSp Parch Fare
count 418.000000 418.000000 332.000000 418.000000 418.000000 417.000000
mean 1100.500000 2.265550 30.272590 0.447368 0.392344 35.627188
std 120.810458 0.841838 14.181209 0.896760 0.981429 55.907576
min 892.000000 1.000000 0.170000 0.000000 0.000000 0.000000
25% 996.250000 1.000000 21.000000 0.000000 0.000000 7.895800
50% 1100.500000 3.000000 27.000000 0.000000 0.000000 14.454200
75% 1204.750000 3.000000 39.000000 1.000000 0.000000 31.500000
max 1309.000000 3.000000 76.000000 8.000000 9.000000 512.329200
Notice that Fare and Age contain missing values (their counts are below 418); these must be filled in before training a model.
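To see exactly how many values are missing in each column, a quick check (a minimal sketch using the train and test dataframes loaded above):
# Count the missing values per column
print(train.isnull().sum())
print(test.isnull().sum())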
European gentlemen of that era upheld the "ladies first" tradition, so let's look at how sex affects the target label.
Step 2: Analyze the relationship between Sex and the target label
# Passengers that survived vs passengers that passed away
print(train["Survived"].value_counts())
# As proportions
print(train["Survived"].value_counts(normalize = True))
# Males that survived vs males that passed away
print(train["Survived"][train["Sex"] == 'male'].value_counts())
# Females that survived vs Females that passed away
print(train["Survived"][train["Sex"] == 'female'].value_counts())
# Normalized male survival
print(train["Survived"][train["Sex"] == 'male'].value_counts(normalize = True))
# Normalized female survival
print(train["Survived"][train["Sex"] == 'female'].value_counts(normalize = True))
Output:
0 549
1 342
Name: Survived, dtype: int64
0 0.616162
1 0.383838
Name: Survived, dtype: float64
0 468
1 109
Name: Survived, dtype: int64
1 233
0 81
Name: Survived, dtype: int64
0 0.811092
1 0.188908
Name: Survived, dtype: float64
1 0.742038
0 0.257962
Name: Survived, dtype: float64
Comparing sex with survival, about 19% of males and 74% of females survived. So, as a baseline for the test set, simply predicting that every female survives (and every male does not) should already perform reasonably well.
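The accuracy of this rule on the training set itself can be checked directly (from the counts above it should come out to (468 + 233) / 891 ≈ 0.79); a quick sanity-check sketch, run before Sex is converted to 0/1 in Step 4:
# Predict survival for every female, death for every male,
# and compare with the actual labels
baseline = (train["Sex"] == "female").astype(int)
print((baseline == train["Survived"]).mean())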
We also know that young children had priority boarding the lifeboats.
Step 3: Analyze the relationship between Age and the Survived label
To simplify the statistics, and for the decision tree training later, convert the continuous Age variable into a discrete categorical one:
# Create the column Child and assign to 'NaN'
train["Child"] = float('NaN')
# Assign 1 to passengers under 18, 0 to those 18 or older. Print the new column.
train["Child"][train["Age"] < 18] = 1
train["Child"][train["Age"] >= 18] = 0
print(train)
# Print normalized Survival Rates for passengers under 18
print(train["Survived"][train["Child"] == 1].value_counts(normalize = True))
# Print normalized Survival Rates for passengers 18 or older
print(train["Survived"][train["Child"] == 0].value_counts(normalize = True))
1 0.539823
0 0.460177
Name: Survived, dtype: float64
0 0.618968
1 0.381032
Name: Survived, dtype: float64
About 54% of minors survived, versus 38% of adults.
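The same group-wise rates can be computed more compactly with groupby (an alternative sketch):
# Survived is 0/1, so the mean per group is the survival rate
print(train.groupby("Child")["Survived"].mean())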
Step 4: Data cleaning and format conversion
# Convert the male and female groups to integer form
train.loc[train["Sex"] == "male", "Sex"] = 0
train.loc[train["Sex"] == "female", "Sex"] = 1
# Impute the missing Age values with the mean
train["Age"] = train["Age"].fillna(train["Age"].mean())
# Impute the Embarked variable
train["Embarked"] = train["Embarked"].fillna("S")
# Convert the Embarked classes to integer form
train.loc[train["Embarked"] == 'S', "Embarked"] = 0
train.loc[train["Embarked"] == 'C', "Embarked"] = 1
train.loc[train["Embarked"] == 'Q', "Embarked"] = 2
For the decision tree model to work correctly and efficiently, the data must be cleaned first:
- convert Sex into a 0/1 variable
- fill the missing Age values with the mean
- convert Embarked into a discrete numeric variable
As a rule of thumb, data cleaning and feature selection take up 70%-80% of the total analysis time, and they largely determine whether the predictions can be accurate.
Step 5: Build and train the decision tree model
# Import the Numpy library
import numpy as np
# Import 'tree' from scikit-learn library
from sklearn import tree
# Print the train data to see the available features
print(train)
# Create the target and features numpy arrays: target, features_one
target = train["Survived"].values
features_one = train[["Pclass", "Sex", "Age", "Fare"]].values
# Fit your first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)
# Look at the importance and score of the included features
print(my_tree_one.feature_importances_)
print(my_tree_one.score(features_one, target))
This uses the scientific computing library Numpy and the machine learning library sklearn to fit a model to the selected features.
[ 0.12545743 0.31274009 0.23086653 0.33093596]
0.977553310887
Unexpectedly, the Fare field carries the largest weight in the prediction, about 33%. The accuracy of 97.8%, however, is measured on the training set itself.
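An unpruned tree can nearly memorize its training data, so for an out-of-sample estimate a cross-validation sketch is useful (assuming scikit-learn's model_selection module; not part of the original exercise):
from sklearn.model_selection import cross_val_score
# 5-fold cross-validation on the same features gives a more honest accuracy estimate
scores = cross_val_score(tree.DecisionTreeClassifier(random_state = 1), features_one, target, cv = 5)
print(scores.mean())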
Step 6: Use the trained model to predict the test data
# Clean the test set the same way as the train set: Sex and Embarked to integers, Age imputed
test.loc[test["Sex"] == "male", "Sex"] = 0
test.loc[test["Sex"] == "female", "Sex"] = 1
test["Age"] = test["Age"].fillna(test["Age"].mean())
test["Embarked"] = test["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
# Impute the one missing Fare value with the median
test.loc[152, "Fare"] = test["Fare"].median()
# Extract the features from the test set: Pclass, Sex, Age, and Fare
test_features = test[["Pclass", "Sex", "Age", "Fare"]].values
# Make predictions on the test set and print them
my_prediction = my_tree_one.predict(test_features)
print(my_prediction)
Fill the missing Fare value with the median, apply the same cleaning as on the training set, and use the trained model to predict on the test set.
Step 7: Export the predictions to CSV
# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ["Survived"])
print(my_solution)
# Check that your data frame has 418 entries
print(my_solution.shape)
# Write your solution to a csv file with the name my_solution.csv
my_solution.to_csv("my_solution_one.csv", index_label = ["PassengerId"])
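To double-check that the file matches Kaggle's expected two-column submission format, it can be read back (a quick sketch):
# Read the submission back and inspect the first rows
print(pd.read_csv("my_solution_one.csv").head())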
Supplement 1: Tuning the decision tree parameters
# Create a new array with the added features: features_two
features_two = train[["Pclass","Age","Sex","Fare", "SibSp", "Parch", "Embarked"]].values
# Control overfitting by setting "max_depth" to 10 and "min_samples_split" to 5: my_tree_two
max_depth = 10
min_samples_split = 5
my_tree_two = tree.DecisionTreeClassifier(max_depth = max_depth, min_samples_split = min_samples_split, random_state = 1)
my_tree_two = my_tree_two.fit(features_two, target)
# Print the score of the new decision tree
print(my_tree_two.feature_importances_)
print(my_tree_two.score(features_two, target))
Maybe we can improve the overfitted model by making it less complex. In DecisionTreeClassifier, the depth of the model is controlled by two parameters:
- the max_depth parameter determines when the splitting of the decision tree stops.
- the min_samples_split parameter monitors the number of observations in a bucket; if a certain threshold is not reached (e.g. a minimum of 10 passengers), no further splitting is done.
To guard against possible overfitting, we "prune" the decision tree; the following parameters can be tuned (a small tuning sketch follows the list):
- **max_features**: the number of features considered when searching for the best split cannot exceed this value.
- **max_depth**: (default=None) the maximum depth of the tree; by default nodes are expanded until every leaf holds a single class or falls below min_samples_split.
- **min_samples_split**: the minimum number of samples required to split a node.
- **min_samples_leaf**: the minimum number of samples required at a leaf node.
- **max_leaf_nodes**: (default=None) the maximum number of leaf nodes.
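A minimal tuning sketch (hypothetical candidate values, reusing cross_val_score for an out-of-sample comparison):
from sklearn.model_selection import cross_val_score
# Compare cross-validated accuracy across a few candidate depths
for depth in [3, 5, 10, None]:
    candidate = tree.DecisionTreeClassifier(max_depth = depth, min_samples_split = 5, random_state = 1)
    print(depth, cross_val_score(candidate, features_two, target, cv = 5).mean())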
Supplement 2: Feature engineering -- creating new feature values
Data Science is an art that benefits from a human element. Enter feature engineering: creatively engineering your own features by combining the different existing variables.
While feature engineering is a discipline in itself, too broad to be covered here in detail, you will have a look at a simple example by creating your own new predictive attribute: family_size.
# Create train_two with the newly defined feature
train_two = train.copy()
train_two["family_size"] = train_two["SibSp"] + train_two["Parch"] + 1
print(train_two["family_size"])
# Create a new feature set and add the new feature
features_three = train_two[["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch", "family_size"]].values
# Define the tree classifier, then fit the model
my_tree_three = tree.DecisionTreeClassifier()
my_tree_three = my_tree_three.fit(features_three, target)
# Print the score of this decision tree
print(my_tree_three.score(features_three, target))
print(my_tree_three.feature_importances_)
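Another feature often engineered from this dataset (shown here only as an illustrative sketch, not part of the original exercise) is the passenger's title, pulled out of the Name column:
# Extract the title ("Mr", "Mrs", "Miss", ...) that precedes the period in each name
train_two["Title"] = train_two["Name"].str.extract(r" ([A-Za-z]+)\.", expand = False)
print(train_two["Title"].value_counts())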
Supplement 3: A new model algorithm -- Random Forest
A detailed study of Random Forests would take this tutorial a bit too far. However, since it's an often used machine learning technique, gaining a general understanding in Python won't hurt.
In layman's terms, the Random Forest technique handles the overfitting problem you faced with decision trees. It grows multiple (very deep) classification trees using the training set. At the time of prediction, each tree is used to come up with a prediction and every outcome is counted as a vote. For example, if you have trained 3 trees with 2 saying a passenger in the test set will survive and 1 says he will not, the passenger will be classified as a survivor. This approach of overtraining trees, but having the majority's vote count as the actual classification decision, avoids overfitting.
A random forest grows many decision trees; each tree votes on the prediction, and the majority vote is taken as the final result, which avoids overfitting.
Advantages:
a. Performs well on many datasets; the two sources of randomness (bootstrap sampling of rows and random feature selection at splits) make a random forest hard to overfit
b. Has a clear edge over other algorithms on many current datasets; the same two sources of randomness also give it good noise resistance
c. Handles very high-dimensional data (many features) without requiring feature selection, and adapts well to datasets: it works with both discrete and continuous variables, and no normalization is needed
d. Can produce a proximity matrix Proximities = (p_ij) that measures similarity between samples: p_ij = a_ij / N, where a_ij is the number of times samples i and j land in the same leaf node and N is the number of trees in the forest
e. Uses an unbiased estimate of the generalization error during forest construction
f. Trains quickly and yields a variable importance ranking (of two kinds: based on the increase in the OOB misclassification rate, or on the Gini decrease at splits)
g. Can detect interactions between features during training
h. Is easy to parallelize
i. Is relatively simple to implement
# Import the `RandomForestClassifier`
from sklearn.ensemble import RandomForestClassifier
# We want the Pclass, Age, Sex, Fare, SibSp, Parch, and Embarked variables
features_forest = train[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values
# Building and fitting my_forest
forest = RandomForestClassifier(max_depth = 10, min_samples_split=2, n_estimators = 100, random_state = 1)
my_forest = forest.fit(features_forest, target)
# Print the score of the fitted random forest
print(my_forest.score(features_forest, target))
# Compute predictions on our test set features then print the length of the prediction vector
test_features = test[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values
pred_forest = my_forest.predict(test_features)
print(len(pred_forest))
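The forest's predictions can be written out exactly as in Step 7 (a sketch; the filename is arbitrary):
# Build the same two-column submission frame and save it
forest_solution = pd.DataFrame(pred_forest, PassengerId, columns = ["Survived"])
forest_solution.to_csv("my_solution_forest.csv", index_label = ["PassengerId"])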
Summary:
This exercise walked through the whole data analysis workflow. Because the data quality is high and the business scenario is relatively simple, the model predicts well and we ran into no real-world obstacles.
Also, the modeling and training here leaned entirely on sklearn's ready-made methods. Given time, it would be better to implement the decision tree modeling in plain Python once, to understand the model much more deeply.
It would also be worth learning data visualization, e.g. using matplotlib to get an intuitive view of the data, for example:
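A minimal matplotlib sketch (assuming the cleaned train dataframe from Step 4, where Sex is 0 = male, 1 = female):
import matplotlib.pyplot as plt
# Bar chart of the survival rate per sex
train.groupby("Sex")["Survived"].mean().plot(kind = "bar")
plt.ylabel("survival rate")
plt.show()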