In many Kaggle competitions, winners like to use xgboost and get very good results with it. This post looks at what xgboost is and how to apply it.

Outline of this post:

What is xgboost?
Why use it?
How to apply it?
Learning resources

What is xgboost?

XGBoost: eXtreme Gradient Boosting
Project page: https://github.com/dmlc/xgboost

It is a library, originally developed by Tianqi Chen (http://homes.cs.washington.edu/~tqchen/), that implements scalable, portable and distributed gradient boosting (GBDT, GBRT or GBM) algorithms. It can be installed and used from C++, Python, R, Julia, Java, Scala and Hadoop, and is now developed and maintained by many collaborators.

The algorithm behind XGBoost is the gradient boosting decision tree, which can be used for both classification and regression problems.

So what is Gradient Boosting?

Gradient boosting is one kind of boosting method.

Boosting is a way of combining weak classifiers f_i(x) into a strong classifier F(x).

So boosting has three elements:

A loss function to be optimized:
for example, cross entropy for classification problems and mean squared error for regression problems.
A weak learner to make predictions:
for example, a decision tree.
An additive model:
multiple weak learners are added together to form a strong learner, so that the target loss function is minimized.
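As a small illustration of the first element, the two loss functions named above can be written in a few lines. This is a minimal sketch in plain Python; the function names and the clipping constant `eps` are choices made for this example and are not taken from xgboost.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average squared difference, a common regression loss."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

mse_value = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
bce_value = binary_cross_entropy([1, 0], [0.5, 0.5])
print(mse_value)
print(bce_value)
```

A prediction of 0.5 for every sample gives a cross entropy of log(2), which is a handy reference point: a useful classifier should score below it.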

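Gradient boosting adds new weak learners that try to correct the residuals of all the weak learners before them. The learners are then summed to make the final prediction, which is more accurate than any single learner. It is called "gradient" boosting because a gradient descent procedure is used to minimize the loss when new models are added.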
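The residual-correcting idea can be sketched in plain Python for the squared-error case, where the negative gradient of the loss is simply the residual. This is a toy illustration and not how XGBoost is implemented: the one-split "stump" learner, the helper names and the data below are all assumptions made for the example.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Additive model: start from the mean, then keep adding stumps fit to residuals."""
    base = sum(ys) / len(ys)
    learners = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual y - F(x).
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        learners.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in learners)

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up one-dimensional data, for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 5.2, 4.9, 5.1]
model = gradient_boost(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [model(x) for x in xs])
print(baseline_mse, boosted_mse)
```

Each round shrinks the training error a little; the learning rate controls how much of each new stump is added, which is why a smaller rate needs more rounds.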

Why use xgboost?

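As described above, XGBoost is an implementation of gradient boosting decision trees. In general, gradient boosting implementations are fairly slow, because each tree has to be built and added to the model sequence one at a time.

XGBoost, by contrast, stands out for its computation speed and model performance, and these two points are exactly the goals of the project.

It is fast because of the following design features: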

Parallelization:
tree construction is parallelized across all CPU cores during training.
Distributed Computing:
distributed computing is used to train very large models.
Out-of-Core Computing:
very large datasets that do not fit into memory can be processed out of core.
Cache Optimization of data structures and algorithms:
makes better use of the hardware.

The figure below compares XGBoost with other implementations of gradient boosting and bagged decision trees. It shows XGBoost to be faster than the baseline configurations in R, Python, Spark and H2O.

Another advantage is that the models perform very well on prediction problems. Below are links to a few post-competition write-ups and interviews with Kaggle winners, which show how XGBoost does in practice.

Vlad Sandulescu, Mihai Chiru, 1st place of the KDD Cup 2016 competition. Link to the arxiv paper.
Marios Michailidis, Mathias Müller and HJ van Veen, 1st place of the Dato Truly Native? competition. Link to the Kaggle interview.
Vlad Mironov, Alexander Guschin, 1st place of the CERN LHCb experiment Flavour of Physics competition. Link to the Kaggle interview.

How to apply it?

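Let's start with a simple binary classification problem using xgboost. The dataset below is used to predict whether a patient will develop diabetes within 5 years. The first 8 columns are the input variables, and the last column is the target, with value 0 or 1.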

Dataset description:
https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes

Download the dataset and save it as "pima-indians-diabetes.csv":
https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data

1. Basic usage

Import xgboost and the other packages

from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Separate the input variables from the labels

dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")

X = dataset[:,0:8]
Y = dataset[:,8]
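To make the column slicing concrete, here is a minimal sketch using only the standard library. The three inlined rows follow the same nine-column layout as the CSV file and are included for illustration only; with the real file you would use `loadtxt` as above.

```python
import csv
import io

# A few sample rows in the file's layout: 8 input columns, then the 0/1 label.
sample = """6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
"""

rows = [[float(value) for value in row] for row in csv.reader(io.StringIO(sample))]

X = [row[0:8] for row in rows]  # columns 0..7 are the input variables
Y = [row[8] for row in rows]    # column 8 is the label

print(len(X[0]), Y)
```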

Split the data into a training set and a test set. The training set is used to fit the model, and the test set is used to make predictions and evaluate it.

seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
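What a seeded split does can be sketched with the standard library alone. This is an illustration of the idea, not scikit-learn's actual algorithm: `simple_train_test_split` is a hypothetical helper, and its shuffling and rounding will not reproduce the exact rows that `train_test_split` selects.

```python
import random

def simple_train_test_split(rows, test_size, seed):
    """Shuffle row indices with a fixed seed, then cut off a test portion."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(100))  # stand-in for 100 samples
train, test = simple_train_test_split(rows, test_size=0.33, seed=7)
print(len(train), len(test))  # 67 33
```

Fixing the seed makes the split reproducible, so accuracy numbers can be compared across runs.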

xgboost ships ready-made classifier and regressor wrappers, so the model can be built directly with XGBClassifier.
Documentation for XGBClassifier:
http://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn

model = XGBClassifier()
model.fit(X_train, y_train)

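Make predictions on the test set. With the scikit-learn wrapper, predict returns class labels for a classifier (predict_proba returns the probabilities), so the round below only makes sure the values are plain 0 or 1.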

y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]

Computing the accuracy on the test set gives Accuracy: 77.95%

accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
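For reference, accuracy is just the fraction of predictions that match the true labels. A minimal plain-Python sketch, with made-up labels chosen for this example:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

y_true = [1, 0, 1, 1, 0]
y_hat = [1, 0, 0, 1, 0]  # one mistake out of five
acc = accuracy(y_true, y_hat)
print("Accuracy: %.2f%%" % (acc * 100.0))  # Accuracy: 80.00%
```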

2. Monitoring model performance

xgboost can evaluate the model on the test set while it is training, and can print the score at every step.

All you need to do is change

model = XGBClassifier()
model.fit(X_train, y_train)

into:

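# Note: newer xgboost releases expect early_stopping_rounds and eval_metric
# to be passed to the XGBClassifier constructor rather than to fit().
model = XGBClassifier()
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)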

It will then print the logloss after each tree is added

[31] validation_0-logloss:0.487867
[32] validation_0-logloss:0.487297
[33] validation_0-logloss:0.487562

and report the point at which early stopping kicked in:

Stopping. Best iteration:
[32] validation_0-logloss:0.487297
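The early stopping rule itself is simple: stop once the validation loss has not improved for a given number of rounds, and remember the best round. The sketch below shows that logic in plain Python; the helper name, the loss values and the patience of 3 are made up for illustration, and this is not xgboost's internal code.

```python
def early_stopping(losses, patience):
    """Return (stop_iteration, best_iteration, best_loss) for a loss sequence."""
    best_iter, best_loss = 0, losses[0]
    for i, loss in enumerate(losses):
        if loss < best_loss:
            best_iter, best_loss = i, loss
        elif i - best_iter >= patience:
            # No improvement for `patience` rounds in a row: stop here.
            return i, best_iter, best_loss
    return len(losses) - 1, best_iter, best_loss

# Hypothetical validation losses, one per boosting round.
losses = [0.60, 0.55, 0.52, 0.50, 0.51, 0.505, 0.52, 0.53]
stopped_at, best_iter, best_loss = early_stopping(losses, patience=3)
print(stopped_at, best_iter, best_loss)  # 6 3 0.5
```

The last round in the sequence is never reached: training stops at round 6 because round 3 was the last improvement.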

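3. Plotting feature importance

Another advantage of gradient boosting is that a trained model can report the importance of each feature,
which tells you which variables are worth keeping and which can be dropped.

Two extra imports are needed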

from xgboost import plot_importance
from matplotlib import pyplot

Compared with the earlier code, the only change is two extra lines after fit that plot the feature importance

model.fit(X, Y)
plot_importance(model)
pyplot.show()
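By default, plot_importance ranks features by how many times each one is used to split across all trees (the "weight" importance type). The counting idea can be sketched in plain Python. The tree representation below, a list of the features used at each tree's split nodes, is a hypothetical simplification made for this example.

```python
from collections import Counter

# Hypothetical model: for each tree, the features used at its split nodes.
trees = [
    ["f1", "f5", "f1"],
    ["f1", "f7"],
    ["f5", "f1", "f6"],
]

# "Weight" style importance: how often each feature appears in a split.
importance = Counter(feature for tree in trees for feature in tree)

for feature, count in importance.most_common():
    print(feature, count)
```

A feature that is never used in any split gets no entry at all, which is one signal that it may be safe to drop.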

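4. Parameter tuning

How should the parameters be tuned? Below are commonly recommended values for three hyperparameters. You can start within these ranges, plot the learning curves, and then adjust the parameters to find the best model: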

learning_rate = 0.1 or smaller; the smaller it is, the more weak learners need to be added;

tree depth (max_depth) = 2 to 8;

subsample = 30% to 80% of the training set;

Next, we use GridSearchCV, which makes tuning more convenient:

Hyperparameter combinations worth tuning include:

The number and size of trees (n_estimators and max_depth).
The learning rate and number of trees (learning_rate and n_estimators).
The row and column subsampling rates (subsample, colsample_bytree and colsample_bylevel).
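A grid search tries every combination of the listed values, so the cost grows quickly when parameters are tuned together. The sketch below enumerates a grid in plain Python to make that visible; the candidate values and the fold count are assumptions chosen for this example.

```python
from itertools import product

# Hypothetical grid over the number and size of trees.
param_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [2, 4, 6, 8],
}

names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 10
total_fits = len(combinations) * n_folds  # one model fit per combination per fold

print(len(combinations), total_fits)  # 12 120
print(combinations[0])
```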

Taking the learning rate as an example:

First import these two classes

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

Set the candidate values learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3].
Compared with the original code, the only change is the grid search lines added after model:

model = XGBClassifier()
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
param_grid = dict(learning_rate=learning_rate)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, Y)

Finally, print the best result:

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

This reports that the best learning rate is 0.1:
Best: -0.483013 using {'learning_rate': 0.1}

The code below prints the score for every learning rate:

means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

-0.689650 (0.000242) with: {'learning_rate': 0.0001}
-0.661274 (0.001954) with: {'learning_rate': 0.001}
-0.530747 (0.022961) with: {'learning_rate': 0.01}
-0.483013 (0.060755) with: {'learning_rate': 0.1}
-0.515440 (0.068974) with: {'learning_rate': 0.2}
-0.557315 (0.081738) with: {'learning_rate': 0.3}
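Because the scoring is "neg_log_loss", the scores are negated losses, so the best setting is the one with the highest (least negative) mean score. A minimal sketch that picks the winner from the mean scores printed above:

```python
# Mean neg_log_loss per learning rate, copied from the output above.
mean_scores = {
    0.0001: -0.689650,
    0.001: -0.661274,
    0.01: -0.530747,
    0.1: -0.483013,
    0.2: -0.515440,
    0.3: -0.557315,
}

# Higher is better for a negated loss.
best_rate = max(mean_scores, key=mean_scores.get)
print("Best: %f using learning_rate=%s" % (mean_scores[best_rate], best_rate))
```

The scores get worse on both sides of 0.1: very small rates underfit with the default number of trees, while larger rates overshoot.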

That covers some basic concepts and worked examples for xgboost. Below are some learning resources for reference:

Learning resources:

Tianqi Chen's talk:
https://www.youtube.com/watch?v=Vly8xGnNiWs&feature=youtu.be
Slides:
https://speakerdeck.com/datasciencela/tianqi-chen-xgboost-overview-and-latest-news-la-meetup-talk

Getting started tutorial:
https://xgboost.readthedocs.io/en/latest/

Installation guide:
http://xgboost.readthedocs.io/en/latest/build.html

Examples:
https://github.com/dmlc/xgboost/tree/master/demo

The best resource, of course, is the project's GitHub page:
https://github.com/dmlc/xgboost

References:
http://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/
https://www.zhihu.com/question/37683881
