Knowledge: Using sklearn's Decision Tree Classifier (Grid Search + Cross-Validation)

A worked example of solving a classification problem with sklearn's DecisionTreeClassifier.

Dataset Description

The dataset is stored in a csv file with 11 feature columns and 1 target column. The features include both numeric and string types.

Loading the Data

from sklearn import tree
from sklearn.model_selection import train_test_split
import pandas as pd

in_file = 'titanic_data.csv'
full_data = pd.read_csv(in_file)
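
As a quick sanity check (a minimal sketch, assuming only what the description above states), the frame should have 12 columns: 11 features plus Survived.

# Verify the layout claimed above: 11 feature columns + 1 target column
print(full_data.shape)    # (n_rows, 12)
print(full_data.dtypes)   # shows which features are numeric and which are strings (object)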

Preprocessing the Data

1. Drop rows containing NaN values

full_data = full_data.dropna(axis=0)

2. Split into feature variables and the target variable

out = full_data['Survived']
features = full_data.drop('Survived', axis=1)

3. Convert string-typed features to numeric

features = pd.get_dummies(features)
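
get_dummies one-hot encodes every string (object) column into 0/1 indicator columns. A toy illustration with a made-up Sex column (the values are hypothetical, purely to show the mechanics):

demo = pd.DataFrame({'Sex': ['male', 'female', 'male']})
print(pd.get_dummies(demo))
# Produces indicator columns Sex_female and Sex_male
# (0/1 integers in older pandas, True/False booleans in pandas >= 2.0)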

Splitting into Training and Test Sets

X_train, X_test, y_train, y_test = train_test_split(features, out, test_size=0.2, random_state=0)

# Show the result of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

Defining the Evaluation Metric

def accuracy_score(truth, pred):
    """ Returns the accuracy score for the input truth values and predictions. """

    # Ensure that the number of predictions matches the number of outcomes
    if len(truth) == len(pred):

        # Calculate and return the accuracy as a percentage:
        # the mean of the boolean comparison is the fraction of correct predictions
        return (truth == pred).mean() * 100

    else:
        return 0
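
A quick check with made-up values shows the metric's scale is a percentage: two of three toy predictions match, so the score is about 66.67.

print(accuracy_score(pd.Series([1, 0, 1]), pd.Series([1, 1, 1])))  # 66.66...

Note that this function shadows sklearn.metrics.accuracy_score, which returns a fraction rather than a percentage; the custom version is the one wrapped by make_scorer below.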

Modeling

We build the model two ways: one uses grid search and cross-validation to find the optimal decision-tree parameters and builds a tree with them; the other uses a decision tree with default parameters.

Create a decision tree, using grid search and cross-validation to find the optimal parameters and fit the data

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.metrics import make_scorer
from sklearn.tree import DecisionTreeClassifier

def fit_model_k_fold(X, y):
    """ Performs a grid search over the 'max_depth' and 'criterion'
        parameters for a decision tree classifier trained on the
        input data [X, y]. """

    # Create cross-validation sets from the training data
    # cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)
    k_fold = KFold(n_splits=10)

    # Create a decision tree classifier object
    clf = DecisionTreeClassifier(random_state=80)

    # Candidate values for each parameter in the search
    params = {'max_depth': list(range(1, 21)), 'criterion': ['entropy', 'gini']}

    # Transform 'accuracy_score' into a scoring function using 'make_scorer'
    scoring_fnc = make_scorer(accuracy_score)

    # Create the grid search object
    grid = GridSearchCV(clf, param_grid=params, scoring=scoring_fnc, cv=k_fold)

    # Fit the grid search object to the data to compute the optimal model
    grid = grid.fit(X, y)

    # Return the optimal model after fitting the data
    return grid.best_estimator_
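
The commented-out line hints at an alternative cross-validation strategy. A minimal sketch swapping KFold for ShuffleSplit (our variation, not the original code) that also reads the cross-validated score of the winning combination off the fitted grid:

from sklearn.model_selection import ShuffleSplit

# 10 random 80/20 train/validation splits; note that KFold(n_splits=10)
# above does not shuffle, so its folds follow the row order of the data
cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)
grid = GridSearchCV(DecisionTreeClassifier(random_state=80),
                    param_grid={'max_depth': list(range(1, 21)),
                                'criterion': ['entropy', 'gini']},
                    scoring=make_scorer(accuracy_score),
                    cv=cv_sets)
grid.fit(X_train, y_train)
print("Best CV accuracy: {:.2f}%".format(grid.best_score_))  # mean validation accuracy, in percent
print("Best params: {}".format(grid.best_params_))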

Inspecting the Optimal Parameters

print "k_fold Parameter 'max_depth' is {} for the optimal model.".format(clf.get_params()['max_depth'])
print "k_fold Parameter 'criterion' is {} for the optimal model.".format(clf.get_params()['criterion'])

Create a decision tree with default parameters

def predict_4(X, Y):
    # Fit a decision tree with all parameters left at their defaults
    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(X, Y)
    return clf
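
A quick usage sketch (the variable name default_clf is ours). Left unconstrained, the default tree usually grows much deeper than the tuned max_depth found above:

default_clf = predict_4(X_train, y_train)
print("Default tree grows to depth {}".format(default_clf.tree_.max_depth))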

Prediction

# Predict on the held-out test set with the tuned tree fitted above
y_pred = clf.predict(X_test)
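
With the tuned clf and the default_clf from above, we can compare held-out accuracy using the custom metric (a minimal sketch; the actual numbers depend on the data):

print("Tuned tree accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred)))
print("Default tree accuracy: {:.2f}%".format(accuracy_score(y_test, default_clf.predict(X_test))))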

Plotting the Decision Tree

from IPython.display import Image
import pydotplus

dot_data = tree.export_graphviz(clf, out_file=None,
                                class_names=['0', '1'],
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
[Figure: the rendered decision tree]
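
Image() only renders inline inside a notebook; outside one, a common alternative (assuming the Graphviz binaries are installed on the system) is to write the PNG to disk. Passing feature_names=list(features.columns) to export_graphviz also makes the split nodes readable.

# Persist the rendered tree to a file instead of displaying it inline
graph.write_png('titanic_tree.png')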

The content above comes from the 822 Lab's second knowledge-sharing session, held at 17:30 on May 7, 2017: Titanic Survivor Prediction.
Our 822, our youth.
Friends who love knowledge and love life are all welcome to grow together with the 822 Lab: eat, drink, play, and enjoy knowledge.
