First Try at Kaggle

This was my first hands-on session on Kaggle. First impressions: the interface is clean, the guided tutorials are friendly and thorough, and overall it is a fairly mature platform.
The datasets are also rich: historical European football scores, US presidential election analysis, programming-language usage surveys, human-resources analytics, historical plane-crash statistics, IMDB movie-rating data, plus some anonymized lending and credit-risk records.

1

I found the Titanic dataset and followed the guide through my first task. The DataCamp course provides task instructions; you write code from the hints, submit it, and correct mistakes based on the feedback until it has taught you the material. It feels a lot like the quest tutorial when starting a new game.

2 Requirements Analysis

There are two sets of Titanic passenger records: a Train set and a Test set. By analyzing the Train features together with the label "Survived", we clean the data, select features, and build a decision-tree model to predict whether each passenger in the Test set survived.

Field descriptions
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

Python Code
import numpy as np
from sklearn import tree
import pandas as pd

Installing numpy and scipy from the official source kept failing. I later found a download page offering many unofficial prebuilt Python wheels:
cmd >> python -m pip install xx.whl >> installed successfully.

Step 1: Import and inspect the data

train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)
test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)

# Print summary statistics of the train and test dataframes
print(train.describe())
print(test.describe())
       PassengerId      Pclass         Age       SibSp       Parch        Fare
count   418.000000  418.000000  332.000000  418.000000  418.000000  417.000000
mean   1100.500000    2.265550   30.272590    0.447368    0.392344   35.627188
std     120.810458    0.841838   14.181209    0.896760    0.981429   55.907576
min     892.000000    1.000000    0.170000    0.000000    0.000000    0.000000
25%     996.250000    1.000000   21.000000    0.000000    0.000000    7.895800
50%    1100.500000    3.000000   27.000000    0.000000    0.000000   14.454200
75%    1204.750000    3.000000   39.000000    1.000000    0.000000   31.500000
max    1309.000000    3.000000   76.000000    8.000000    9.000000  512.329200

The summary above (shown here for the test set) reveals that some Fare and Age values are missing; they must be imputed before model training.
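To see exactly which columns need attention, pandas can count the missing values directly; this quick check is my own addition, not part of the DataCamp exercise:

# Count missing values per column in both datasets
print(train.isnull().sum())
print(test.isnull().sum())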

European gentlemen of the era upheld the "ladies first" tradition, so let us look at how sex relates to the target label.

Step 2: Relate the Sex feature to the Survived label
# Passengers that survived vs passengers that passed away
print(train["Survived"].value_counts())

# As proportions
print(train["Survived"].value_counts(normalize = True))

# Males that survived vs males that passed away
print(train["Survived"][train["Sex"] == 'male'].value_counts())

# Females that survived vs Females that passed away
print(train["Survived"][train["Sex"] == 'female'].value_counts())

# Normalized male survival
print(train["Survived"][train["Sex"] == 'male'].value_counts(normalize = True))

# Normalized female survival
print(train["Survived"][train["Sex"] == 'female'].value_counts(normalize = True))

<script.py> output:
    0    549
    1    342
    Name: Survived, dtype: int64
    0    0.616162
    1    0.383838
    Name: Survived, dtype: float64
    0    468
    1    109
    Name: Survived, dtype: int64
    1    233
    0     81
    Name: Survived, dtype: int64
    0    0.811092
    1    0.188908
    Name: Survived, dtype: float64
    1    0.742038
    0    0.257962
    Name: Survived, dtype: float64

Breaking survival down by sex shows that about 19% of males and 74% of females survived. So for the test set, simply predicting that every female survives (and every male does not) should in theory already land around that level of accuracy; this serves as a baseline.
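As a sanity check, that all-females-survive rule can be scored against the training labels directly; the sketch below is my own addition (Sex is still the raw string at this point):

# Predict 1 for every female, 0 for every male
baseline = (train["Sex"] == "female").astype(int)

# Fraction of training passengers this simple rule gets right
print((baseline == train["Survived"]).mean())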

We also know that young children had priority for the lifeboats.

Step 3: Relate the Age feature to the Survived label

To simplify the statistics and the decision-tree training later on, convert the continuous Age variable into a discrete binary one.

# Create the column Child and initialize it to NaN
train["Child"] = float('NaN')

# Assign 1 to passengers under 18, 0 to those 18 or older. Print the new column.
train.loc[train["Age"] < 18, "Child"] = 1
train.loc[train["Age"] >= 18, "Child"] = 0

print(train["Child"])

# Print normalized Survival Rates for passengers under 18
print(train["Survived"][train["Child"] == 1].value_counts(normalize = True))

# Print normalized Survival Rates for passengers 18 or older
print(train["Survived"][train["Child"] == 0].value_counts(normalize = True))
    1    0.539823
    0    0.460177
    Name: Survived, dtype: float64
    0    0.618968
    1    0.381032
    Name: Survived, dtype: float64

About 54% of minors survived versus 38% of adults.

Step 4: Clean the data and convert formats
# Convert the male and female groups to integer form
train.loc[train["Sex"] == "male", "Sex"] = 0
train.loc[train["Sex"] == "female", "Sex"] = 1

# Impute the missing Age values with the median (the tree cannot handle NaN)
train["Age"] = train["Age"].fillna(train["Age"].median())

# Impute the Embarked variable
train["Embarked"] = train["Embarked"].fillna("S")

# Convert the Embarked classes to integer form
train.loc[train["Embarked"] == 'S', "Embarked"] = 0
train.loc[train["Embarked"] == 'C', "Embarked"] = 1
train.loc[train["Embarked"] == 'Q', "Embarked"] = 2

For the decision-tree model to work correctly and efficiently, the data must be cleaned first:

  1. Convert Sex into a 0/1 variable
  2. Impute the missing Age values (the median is used here)
  3. Convert Embarked into a discrete numeric variable

As a rule of thumb, data cleaning and feature selection take 70%~80% of the total analysis time, and they largely determine whether the predictions end up accurate.
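Since the test set will need exactly the same transformations before prediction, it is worth wrapping them up. The helper below is my own sketch (assuming raw column values, with median imputation as above), not part of the tutorial:

def clean(df):
    # Apply the step-4 cleaning to a raw data frame and return a copy
    df = df.copy()
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Fare"] = df["Fare"].fillna(df["Fare"].median())
    df["Embarked"] = df["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
    return df

# Usage: train = clean(train); test = clean(test)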

Step 5: Build and train the decision-tree model
# Import the Numpy library
import numpy as np
# Import 'tree' from scikit-learn library
from sklearn import tree

# Print the train data to see the available features
print(train)

# Create the target and features numpy arrays: target, features_one
target = train["Survived"].values
features_one = train[["Pclass", "Sex", "Age", "Fare"]].values

# Fit your first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)

# Look at the importance and score of the included features
print(my_tree_one.feature_importances_)
print(my_tree_one.score(features_one, target))

Here we use Numpy for scientific computing and sklearn for machine learning to fit a model on the selected features.

[ 0.12545743  0.31274009  0.23086653  0.33093596]
0.977553310887

Unexpectedly, the Fare field carries the largest weight in the prediction, about 33%. The accuracy comes out at 97.8%, though note that this is measured on the training data itself.
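Because that score is computed on the very data the tree was fit to, it says little about generalization. A quick way to estimate out-of-sample accuracy is cross-validation; the snippet below is my own addition using sklearn's cross_val_score:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation gives a far less optimistic accuracy estimate
scores = cross_val_score(tree.DecisionTreeClassifier(), features_one, target, cv=5)
print(scores.mean())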

Step 6: Predict on the test data with the trained model
# The test set needs the same cleaning as train before predicting
test.loc[test["Sex"] == "male", "Sex"] = 0
test.loc[test["Sex"] == "female", "Sex"] = 1
test["Age"] = test["Age"].fillna(test["Age"].median())

# Impute the missing Fare value (row 152) with the median
test.loc[152, "Fare"] = test["Fare"].median()

# Extract the features from the test set: Pclass, Sex, Age, and Fare
test_features = test[["Pclass", "Sex", "Age", "Fare"]].values

# Make your prediction using the test set and print it
my_prediction = my_tree_one.predict(test_features)
print(my_prediction)

Impute the missing Fare value with the median, apply the same Sex and Age preprocessing to the test set as was done for train, then predict on the test set with the trained model.

Step 7: Export the predictions to csv
# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ["Survived"])
print(my_solution)

# Check that your data frame has 418 entries
print(my_solution.shape)

# Write your solution to a csv file with the name my_solution.csv
my_solution.to_csv("my_solution_one.csv", index_label = ["PassengerId"])
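To confirm the file matches Kaggle's expected submission format (a PassengerId column, a Survived column, 418 rows), it can simply be read back; a small check of my own:

# Read the written file back and verify header and row count
check = pd.read_csv("my_solution_one.csv")
print(check.columns.tolist())  # expect ['PassengerId', 'Survived']
print(check.shape)             # expect (418, 2)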

Supplement 1: Tuning the decision-tree parameters
# Create a new array with the added features: features_two
features_two = train[["Pclass","Age","Sex","Fare", "SibSp", "Parch", "Embarked"]].values

#Control overfitting by setting "max_depth" to 10 and "min_samples_split" to 5 : my_tree_two
max_depth = 10
min_samples_split = 5
my_tree_two = tree.DecisionTreeClassifier(max_depth = 10, min_samples_split = 5, random_state = 1)
my_tree_two = my_tree_two.fit(features_two, target)

#Print the score of the new decision tree
print(my_tree_two.feature_importances_)
print(my_tree_two.score(features_two, target))

Maybe we can improve the overfit model by making a less complex model? In DecisionTreeClassifier, the depth of the model is defined by two parameters: the max_depth parameter determines when the splitting up of the decision tree stops, and the min_samples_split parameter monitors the number of observations in a bucket. If a certain threshold is not reached (e.g. a minimum of 10 passengers), no further splitting is done.

To guard against possible overfitting of the decision tree, we need to "prune" it. The following parameters can be tuned:

max_features: the number of features to consider when looking for the best split.
max_depth: (default=None) the maximum depth of the tree; by default nodes are expanded until every leaf holds a single class or fewer than min_samples_split samples.
min_samples_split: the minimum number of samples required to split a node.
min_samples_leaf: the minimum number of samples required at a leaf node.
max_leaf_nodes: (default=None) the maximum number of leaf nodes.
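To see how max_depth trades training fit against complexity, one can sweep it and watch the training score. This is an illustrative sketch of my own (for a real choice, score on held-out data instead):

# Training accuracy rises with depth; an unrestricted tree memorizes the data
for depth in [3, 5, 10, None]:
    t = tree.DecisionTreeClassifier(max_depth = depth, random_state = 1)
    t = t.fit(features_two, target)
    print(depth, t.score(features_two, target))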

Supplement 2: Feature engineering, building new features

Data Science is an art that benefits from a human element. Enter feature engineering: creatively engineering your own features by combining the different existing variables. While feature engineering is a discipline in itself, too broad to be covered here in detail, you will have a look at a simple example by creating your own new predictive attribute: family_size.

# Create train_two with the newly defined feature
train_two = train.copy()
train_two["family_size"] = train_two["SibSp"] + train_two["Parch"] + 1
print(train_two["family_size"])

# Create a new feature set and add the new feature
features_three = train_two[["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch", "family_size"]].values

# Define the tree classifier, then fit the model
my_tree_three = tree.DecisionTreeClassifier()
my_tree_three = my_tree_three.fit(features_three, target)

# Print the score of this decision tree
print(my_tree_three.score(features_three, target))
print(my_tree_three.feature_importances_)

Supplement 3: A new model algorithm, random forest

A detailed study of Random Forests would take this tutorial a bit too far. However, since it's an often used machine learning technique, gaining a general understanding in Python won't hurt.

In layman's terms, the Random Forest technique handles the overfitting problem you faced with decision trees. It grows multiple (very deep) classification trees using the training set. At the time of prediction, each tree is used to come up with a prediction and every outcome is counted as a vote. For example, if you have trained 3 trees with 2 saying a passenger in the test set will survive and 1 says he will not, the passenger will be classified as a survivor. This approach of overtraining trees, but having the majority's vote count as the actual classification decision, avoids overfitting.

A random forest grows many decision trees; each tree votes on the final prediction, and the majority vote becomes the final result, which avoids overfitting.

Advantages:

a. It performs well on datasets; the two injected sources of randomness make a random forest hard to overfit.

b. On many current datasets it holds a clear advantage over other algorithms, and the same two sources of randomness give it good noise resistance.

c. It can handle very high-dimensional data (many features) without explicit feature selection, and it adapts well to different datasets: it handles both discrete and continuous variables, and no normalization is needed.

d. It can produce a proximity matrix (p_ij) that measures similarity between samples: p_ij = a_ij / N, where a_ij is the number of times samples i and j land in the same leaf node and N is the number of trees in the forest.

e. The generalization error is estimated without bias (via out-of-bag samples) while the forest is built.

f. Training is fast, and a variable-importance ranking comes for free (two variants: the increase in OOB misclassification rate, and the decrease in Gini impurity at splits).

g. It can detect interactions between features during training.

h. It parallelizes easily.

i. It is relatively simple to implement.

# Import the `RandomForestClassifier`
from sklearn.ensemble import RandomForestClassifier

# We want the Pclass, Age, Sex, Fare,SibSp, Parch, and Embarked variables
features_forest = train[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values

# Building and fitting my_forest
forest = RandomForestClassifier(max_depth = 10, min_samples_split=2, n_estimators = 100, random_state = 1)
my_forest = forest.fit(features_forest, target)

# Print the score of the fitted random forest
print(my_forest.score(features_forest, target))

# The test set's Embarked column is still strings; convert it as was done for train
test["Embarked"] = test["Embarked"].fillna("S")
test.loc[test["Embarked"] == 'S', "Embarked"] = 0
test.loc[test["Embarked"] == 'C', "Embarked"] = 1
test.loc[test["Embarked"] == 'Q', "Embarked"] = 2

# Compute predictions on our test set features then print the length of the prediction vector
test_features = test[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values
pred_forest = my_forest.predict(test_features)
print(len(pred_forest))
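Point f above mentioned OOB-based importance: because each tree only sees a bootstrap sample, the left-out rows give a built-in validation estimate. sklearn exposes this via oob_score=True; the snippet below is my own addition reusing features_forest and target:

# Out-of-bag samples provide an accuracy estimate without a separate validation set
forest_oob = RandomForestClassifier(n_estimators = 100, oob_score = True, random_state = 1)
forest_oob = forest_oob.fit(features_forest, target)
print(forest_oob.oob_score_)            # OOB accuracy estimate
print(forest_oob.feature_importances_)  # importances from Gini decrease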

Summary:

This exercise walked through the whole data-analysis pipeline. Because the data quality is high and the business scenario is simple, the model predicts well and no real-world obstacles came up.
Also, the modeling and training here all relied on sklearn's ready-made methods. Given time, it would be better to implement the decision-tree construction in Python by hand, which would build a much deeper understanding of the model.
Finally, it would be worth learning data-visualization tools next, for example using matplotlib to get an intuitive view of the data, like this:

[example matplotlib chart]