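0x00 Preface

This is the third article in the "GBDT Source Code Analysis" series. It focuses on how GBDT itself, and ensemble algorithms in general, are implemented in scikit-learn.

0x01 Overview

The ensemble module of scikit-learn contains a wide variety of ensemble models, and all of the source code lives in the sklearn/ensemble folder; the file layout is described in the first article of this series. Within it, GradientBoostingRegressor and GradientBoostingClassifier are the regressor and the classifier built on gradient boosting.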

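0x02 Source Structure Analysis

BaseEnsemble

All models in ensemble derive from the base class BaseEnsemble, defined in sklearn/ensemble/base.py. BaseEnsemble inherits from two parent classes, BaseEstimator and MetaEstimatorMixin. It provides the following methods, most of them private or special methods:

__init__: Constructor taking three parameters: base_estimator, n_estimators and estimator_params.
_validate_estimator: Validates n_estimators and base_estimator, where base_estimator is the base model of the ensemble. In GBDT the base model is a decision tree.
_make_estimator: Creates a copy of base_estimator and configures it with the parameters named in estimator_params.
__len__: Returns the number of estimators in the ensemble.
__getitem__: Returns the i-th estimator in the ensemble.
__iter__: Returns an iterator over all estimators in the ensemble.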
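
To make the three container methods concrete, here is a minimal sketch of the protocol they implement. TinyEnsemble is a hypothetical stand-in written for this article, not scikit-learn code; the only detail borrowed from the library is that fitted sub-estimators are kept in an attribute called estimators_.

```python
class TinyEnsemble:
    """Hypothetical stand-in for an ensemble that stores fitted estimators."""

    def __init__(self, estimators):
        # scikit-learn keeps the fitted sub-estimators in `estimators_`.
        self.estimators_ = list(estimators)

    def __len__(self):
        # Number of estimators in the ensemble.
        return len(self.estimators_)

    def __getitem__(self, index):
        # The index-th estimator in the ensemble.
        return self.estimators_[index]

    def __iter__(self):
        # Iterator over all estimators in the ensemble.
        return iter(self.estimators_)


# Strings stand in for fitted trees to keep the sketch dependency-free.
ensemble = TinyEnsemble(["tree_0", "tree_1", "tree_2"])
count = len(ensemble)
second = ensemble[1]
names = [est for est in ensemble]
```

With these three methods in place, an ensemble can be handled like a read-only sequence of its sub-estimators.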

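gradient_boosting.py

Both GradientBoostingClassifier and GradientBoostingRegressor are implemented in gradient_boosting.py. In fact, the file mainly implements one ensemble of regression trees, the Gradient Boosted Regression Tree, and both the classifier and the regressor are built on top of it.

Note: one point should be made up front. GBDT is by nature a good fit for regression and binary classification, and less so for multiclass classification. In scikit-learn, the GBDT algorithm for a K-class problem actually builds K decision trees in every iteration, one per class. We did not come across this much when studying the theory, and this article does not go into multiclass problems further.

The class that implements the Gradient Boosted Regression Tree is BaseGradientBoosting, the base class of GradientBoostingClassifier and GradientBoostingRegressor. It implements a generic fit method; the difference between regression and classification lies mainly in the loss function, LossFunction.

The rest of this section walks through the contents of gradient_boosting.py.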
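
As a rough illustration of the "K trees per iteration" point, the sketch below computes the K residual vectors that the K trees of one stage would be fitted on. It assumes the usual multinomial deviance setup, in which the target for class k is the indicator 1{y == k} and the negative gradient is that indicator minus the softmax probability. The function names are made up for this example.

```python
import math


def softmax(scores):
    """Convert one row of per-class scores into probabilities."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multiclass_residuals(labels, scores, n_classes):
    """Return residuals[k][i] = 1{labels[i] == k} - p_k(x_i)."""
    probs = [softmax(row) for row in scores]
    residuals = []
    for k in range(n_classes):
        # One-vs-rest target for class k.
        y_k = [1.0 if label == k else 0.0 for label in labels]
        residuals.append([y - p[k] for y, p in zip(y_k, probs)])
    return residuals


labels = [0, 2, 1]
scores = [[0.0, 0.0, 0.0] for _ in labels]  # all classes start with equal score
residuals = multiclass_residuals(labels, scores, n_classes=3)
```

Each of the three lists in residuals would be the regression target of one tree in the current stage.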

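Basic estimator classes

These are a few simple estimator classes, used mainly to compute init_estimator in a LossFunction, that is, the initial prediction. For example, QuantileEstimator(alpha=0.5) means the median is used as the model's initial prediction.

QuantileEstimator: Predicts the alpha-quantile of the training targets.
MeanEstimator: Predicts the mean of the training targets.
LogOddsEstimator: Predicts the log odds of the training targets (suited to binary classification).
ScaledLogOddsEstimator: Predicts scaled log odds (used with the exponential loss).
PriorProbabilityEstimator: Predicts the probability of each class in the training set.
ZeroEstimator: Always predicts 0.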
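
The initial predictions these classes produce are simple statistics of the training targets. The sketch below computes them for toy data. It ignores sample weights, uses the plain median rather than a weighted percentile, and assumes a scale of 0.5 for the scaled log odds; all function names are illustrative only.

```python
import math
import statistics


def mean_init(y):
    # What a MeanEstimator-style initial prediction looks like.
    return statistics.mean(y)


def median_init(y):
    # QuantileEstimator(alpha=0.5) style: the median of the targets.
    return statistics.median(y)


def log_odds_init(y, scale=1.0):
    # y holds 0/1 labels; log(positives / negatives), optionally scaled.
    pos = sum(y)
    neg = len(y) - pos
    return scale * math.log(pos / neg)


def prior_init(y, n_classes):
    # Relative frequency of each class label 0..n_classes-1.
    return [sum(1 for v in y if v == k) / len(y) for k in range(n_classes)]


targets = [1.0, 2.0, 3.0, 10.0]
binary_labels = [1, 1, 1, 0]
class_labels = [0, 0, 1, 2]

initial_mean = mean_init(targets)
initial_median = median_init(targets)
initial_log_odds = log_odds_init(binary_labels)
initial_scaled_log_odds = log_odds_init(binary_labels, scale=0.5)
initial_priors = prior_init(class_labels, n_classes=3)
```

Note how the outlier 10.0 pulls the mean well above the median, which is one reason the robust losses start from the median.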

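LossFunction

LossFunction: The base class of all loss functions. Its main methods are:

__init__: Takes n_classes. For regression and binary classification n_classes is 1; for a K-class problem it is K.
init_estimator: The initial estimator of the LossFunction, one of the basic estimator classes above, used to compute the model's initial prediction. The base class does not implement it and raises NotImplementedError.
negative_gradient: Computes the negative gradient from the predictions and the targets.
update_terminal_regions: Updates the leaf nodes of the tree and the model's current predictions.
_update_terminal_region: Template method for updating a single leaf node of the tree.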

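RegressionLossFunction

RegressionLossFunction: Inherits from LossFunction and is the base class of all regression losses.

LeastSquaresError: init_estimator is MeanEstimator; the negative gradient is the difference between the target y and the prediction pred; it is the only LossFunction that does not need to update leaf values in update_terminal_regions.
LeastAbsoluteError: init_estimator is QuantileEstimator(alpha=0.5); the negative gradient is the sign of the difference between y and pred; suited to robust regression.
HuberLossFunction: A loss for robust regression; init_estimator is QuantileEstimator(alpha=0.5).
QuantileLossFunction: The loss for quantile regression, which allows estimating quantiles of the conditional distribution of the target.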
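
The four regression losses differ mainly in their negative gradient. The sketch below writes each one out for plain Python lists, derived from the textbook definitions of the losses. The Huber threshold is passed in as a hypothetical delta parameter, whereas a real implementation would typically estimate it from the data, and sample weights are left out.

```python
def ls_negative_gradient(y, pred):
    # Least squares: the plain residual y - pred.
    return [yi - pi for yi, pi in zip(y, pred)]


def lad_negative_gradient(y, pred):
    # Least absolute deviation: only the sign of the residual (+1 or -1).
    return [1.0 if yi - pi > 0 else -1.0 for yi, pi in zip(y, pred)]


def huber_negative_gradient(y, pred, delta):
    # Huber: the residual when it is small, clipped to +/- delta otherwise.
    out = []
    for yi, pi in zip(y, pred):
        diff = yi - pi
        if abs(diff) <= delta:
            out.append(diff)
        else:
            out.append(delta if diff > 0 else -delta)
    return out


def quantile_negative_gradient(y, pred, alpha):
    # Quantile loss: alpha above the prediction, -(1 - alpha) otherwise.
    return [alpha if yi > pi else -(1.0 - alpha) for yi, pi in zip(y, pred)]


y = [3.0, -1.0, 10.0]
pred = [2.0, 0.0, 0.0]

ls_grad = ls_negative_gradient(y, pred)
lad_grad = lad_negative_gradient(y, pred)
huber_grad = huber_negative_gradient(y, pred, delta=2.0)
quantile_grad = quantile_negative_gradient(y, pred, alpha=0.9)
```

The third sample has a residual of 10: least squares passes it through unchanged, while the other three losses bound its influence on the next tree.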

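ClassificationLossFunction

ClassificationLossFunction: Inherits from LossFunction and is the base class of all classification losses. It declares two abstract methods: _score_to_proba (template for converting scores to probabilities) and _score_to_decision (template for converting scores to decisions).

BinomialDeviance: Loss for binary classification; init_estimator is LogOddsEstimator.
MultinomialDeviance: Loss for multiclass classification; init_estimator is PriorProbabilityEstimator.
ExponentialLoss: Exponential loss for binary classification, equivalent to the AdaBoost loss; init_estimator is ScaledLogOddsEstimator.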
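
The two abstract methods can be pictured with a small binary example. The sketch below assumes that the binomial deviance maps a raw score to a probability with the logistic sigmoid, and that the exponential loss applies the sigmoid to twice the score. The factor 2 is an assumption worth checking against the source; the function names are illustrative.

```python
import math


def expit(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def binomial_score_to_proba(score):
    # Probability of class 1 is sigmoid(score); class 0 gets the rest.
    p1 = expit(score)
    return [1.0 - p1, p1]


def exponential_score_to_proba(score):
    # Assumed convention for the exponential loss: sigmoid(2 * score).
    p1 = expit(2.0 * score)
    return [1.0 - p1, p1]


def score_to_decision(proba):
    # Pick the class with the highest probability (first one on ties).
    return max(range(len(proba)), key=proba.__getitem__)


neutral = binomial_score_to_proba(0.0)
positive = binomial_score_to_proba(1.0)
positive_exp = exponential_score_to_proba(1.0)
decision = score_to_decision(positive)
```

For the same raw score, the exponential variant yields a more extreme probability than the binomial one.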

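Others

gradient_boosting.py also contains the following items, which the next section analyzes in depth.

VerboseReporter: Controls progress output.
BaseGradientBoosting: The base class that implements the Gradient Boosted Regression Tree.
GradientBoostingClassifier: The gradient boosting classifier.
GradientBoostingRegressor: The gradient boosting regressor.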

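0x03 Analysis of the Main GBDT Source Code

BaseGradientBoosting

Now for a closer look at the base class BaseGradientBoosting. Its main methods are:

__init__: Besides the parameters passed on to the decision trees, there are several new ones: n_estimators, learning_rate, loss, init, alpha, verbose and warm_start. loss selects the LossFunction, learning_rate is the learning rate, and n_estimators is the number of boosting iterations (the number of stages). There is usually a trade-off between learning_rate and n_estimators: a smaller learning rate generally needs more stages. init is an estimator used to compute the initial prediction; by default it is the init_estimator of the chosen LossFunction. warm_start decides whether to reuse the previous fit and add more estimators, or to erase the previous result.
_check_params: Checks that the parameters are valid and initializes model parameters (including the loss in use).
_init_state: Initializes init_estimator and the model state: estimators_, train_score_ and oob_improvement_. These three are arrays that store, for each stage, the estimator, the training score and the out-of-bag improvement.
_clear_state: Clears the model state.
_resize_state: Resizes the state to match the number of estimators.
fit: Trains the gradient boosting model; calls _fit_stages.
_fit_stages: Runs the boosting iterations; calls _fit_stage.
_fit_stage: Trains a single stage.
_decision_function: Omitted.
_staged_decision_function: Omitted.
_apply: Returns, for each sample, the index of the leaf it falls into in every estimator.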

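The fit method

# If warm_start is off, clear the previous model state
if not self.warm_start:
    self._clear_state()

# ...... check that the inputs and parameters are valid
# ...... if the model is not initialized yet, initialize it, fit the initial model and compute the initial predictions;
# ...... if it is already initialized, check n_estimators and resize the model state.
# Boosting loop: call _fit_stages
n_stages = self._fit_stages(X, y, y_pred, sample_weight, random_state, begin_at_stage, monitor, X_idx_sorted)
# ...... if the number of stages actually trained differs from the length of estimators_, trim the related state: estimators_, train_score_ and oob_improvement_
# fit returns self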
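
The control flow around warm_start is easy to lose in the elided parts of fit. The sketch below mimics only that flow with a hypothetical TinyBooster class whose "stages" are placeholder strings. It is not the scikit-learn implementation, and the trained_now_ attribute exists purely to make the behavior observable.

```python
class TinyBooster:
    """Hypothetical model that mimics only the warm_start logic of fit."""

    def __init__(self, n_estimators=3, warm_start=False):
        self.n_estimators = n_estimators
        self.warm_start = warm_start

    def _is_initialized(self):
        return hasattr(self, "estimators_")

    def _clear_state(self):
        if self._is_initialized():
            del self.estimators_

    def fit(self):
        # Without warm_start, forget everything from a previous fit.
        if not self.warm_start:
            self._clear_state()
        if not self._is_initialized():
            self.estimators_ = []
            begin_at_stage = 0
        else:
            # Continue where the previous fit stopped.
            begin_at_stage = len(self.estimators_)
        self.trained_now_ = []  # illustration only: stages trained in this call
        for i in range(begin_at_stage, self.n_estimators):
            self.estimators_.append("stage_%d" % i)
            self.trained_now_.append(i)
        return self


warm = TinyBooster(n_estimators=3, warm_start=True).fit()
warm_first = list(warm.trained_now_)
warm.n_estimators = 5
warm.fit()
warm_second = list(warm.trained_now_)

cold = TinyBooster(n_estimators=3, warm_start=False).fit()
cold.n_estimators = 5
cold.fit()
cold_second = list(cold.trained_now_)
```

With warm_start the second call trains only the two additional stages, while without it all five stages are retrained from scratch.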

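The `_fit_stages` method

# Get the number of samples (why fetch it every time instead of storing it as self.n_samples?)
n_samples = X.shape[0]
# Decide whether to compute out-of-bag (OOB) estimates; this is only done when subsampling is enabled
do_oob = self.subsample < 1.0
# Initialize sample_mask, the array that marks which samples are drawn in a given iteration
sample_mask = np.ones((n_samples, ), dtype=np.bool)
# Number of in-bag samples (those used for training)
n_inbag = max(1, int(self.subsample * n_samples))
# Get the loss object
loss_ = self.loss_
# ...... set up min_weight_leaf, verbose, sparsity and related parameters
# Start the boosting iterations
# begin_at_stage is the first iteration, normally 0; with warm_start it is where the previous model stopped
i = begin_at_stage
# Main loop
for i in range(begin_at_stage, self.n_estimators):
    # If subsample < 1, do_oob is true and the samples are subsampled
    if do_oob:
        # _random_sample_mask is implemented in Cython in _gradient_boosting.pyx; it draws a random sample and produces the in-bag/out-of-bag mask (in-bag samples are True)
        sample_mask = _random_sample_mask(n_samples, n_inbag, random_state)
        # OOB score before this stage is added
        old_oob_score = loss_(y[~sample_mask], y_pred[~sample_mask], sample_weight[~sample_mask])

    # Call _fit_stage to train the trees of the next stage (this runs in every iteration, with or without subsampling)
    y_pred = self._fit_stage(i, X, y, y_pred, sample_weight, sample_mask, random_state, X_idx_sorted, X_csc, X_csr)

    # Track the deviance/loss
    # With do_oob, compute the loss on the training samples and the improvement of the OOB score
    if do_oob:
        # Loss on the in-bag training samples
        self.train_score_[i] = loss_(y[sample_mask], y_pred[sample_mask], sample_weight[sample_mask])
        # Improvement of the loss on the out-of-bag samples
        self.oob_improvement_[i] = (old_oob_score - loss_(y[~sample_mask], y_pred[~sample_mask], sample_weight[~sample_mask]))
    # When subsample is 1
    else:
        self.train_score_[i] = loss_(y, y_pred, sample_weight)

    # If verbose is greater than 0, update the standard output
    if self.verbose > 0:
        verbose_reporter.update(i, self)
    # ...... if a monitor is given, check whether early stopping is needed
# _fit_stages returns i + 1, the total number of iterations (including those done before a warm start)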
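
The subsampling and out-of-bag bookkeeping can be tried out on a toy problem. The sketch below is a pure-Python approximation under stated assumptions: random_sample_mask is a hypothetical replacement for the Cython helper, the loss is mean squared error without sample weights, and the "tree" of the stage is a single leaf that predicts the mean in-bag residual.

```python
import random


def random_sample_mask(n_samples, n_inbag, rng):
    """Boolean mask with exactly n_inbag True entries (the in-bag samples)."""
    inbag = set(rng.sample(range(n_samples), n_inbag))
    return [i in inbag for i in range(n_samples)]


def ls_loss(y, pred):
    """Mean squared error, unweighted."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)


def take(values, mask, keep):
    """Select the entries whose mask value equals keep."""
    return [v for v, m in zip(values, mask) if m == keep]


rng = random.Random(0)
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_pred = [0.0] * len(y)
subsample, learning_rate = 0.5, 0.1

n_inbag = max(1, int(subsample * len(y)))
sample_mask = random_sample_mask(len(y), n_inbag, rng)

# OOB loss before the stage is added.
old_oob_score = ls_loss(take(y, sample_mask, False), take(y_pred, sample_mask, False))

# Stand-in for _fit_stage: one leaf predicting the mean in-bag residual.
inbag_residuals = [yi - pi for yi, pi, m in zip(y, y_pred, sample_mask) if m]
step = learning_rate * sum(inbag_residuals) / len(inbag_residuals)
y_pred = [pi + step for pi in y_pred]  # every sample is updated, in-bag or not

train_score = ls_loss(take(y, sample_mask, True), take(y_pred, sample_mask, True))
oob_improvement = old_oob_score - ls_loss(
    take(y, sample_mask, False), take(y_pred, sample_mask, False)
)
```

A positive oob_improvement means the stage also reduced the loss on samples it never saw during fitting.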

The `_fit_stage` method

# Check the dtype of sample_mask
assert sample_mask.dtype == np.bool
# Get the loss function
loss = self.loss_
# Keep the original targets
original_y = y
# K matters for multiclass problems; it is 1 for regression and binary classification
for k in range(loss.K):
    # For a multiclass problem, build the y values for class k
    if loss.is_multi_class:
        y = np.array(original_y == k, dtype=np.float64)
    # Compute the current negative gradient
    residual = loss.negative_gradient(y, y_pred, k=k, sample_weight=sample_weight)
    # Build a regression tree (in effect, a decision tree model of the negative gradient)
    tree = DecisionTreeRegressor(
        criterion=self.criterion,
        splitter='best',
        max_depth=self.max_depth,
        min_samples_split=self.min_samples_split,
        min_samples_leaf=self.min_samples_leaf,
        min_weight_fraction_leaf=self.min_weight_fraction_leaf,
        min_impurity_split=self.min_impurity_split,
        max_features=self.max_features,
        max_leaf_nodes=self.max_leaf_nodes,
        random_state=random_state,
        presort=self.presort)
    # With subsampling, recompute sample_weight so that out-of-bag samples get weight 0
    if self.subsample < 1.0:
        sample_weight = sample_weight * sample_mask.astype(np.float64)
    # Fit the tree on the negative gradient, using the sparse or dense input as appropriate
    if X_csc is not None:
        tree.fit(X_csc, residual, sample_weight=sample_weight, check_input=False, X_idx_sorted=X_idx_sorted)
    else:
        tree.fit(X, residual, sample_weight=sample_weight, check_input=False, X_idx_sorted=X_idx_sorted)
    # Update the leaf nodes with update_terminal_regions, again on the sparse or dense input (note that this is a LossFunction method)
    if X_csr is not None:
        loss.update_terminal_regions(tree.tree_, X_csr, y, residual, y_pred, sample_weight, sample_mask, self.learning_rate, k=k)
    else:
        loss.update_terminal_regions(tree.tree_, X, y, residual, y_pred, sample_weight, sample_mask, self.learning_rate, k=k)
    # Add the new tree to the ensemble
    self.estimators_[i, k] = tree
# _fit_stage returns the new predictions y_pred; note that y_pred is updated inside loss.update_terminal_regions
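
A single stage can be reproduced by hand on a tiny dataset. The sketch below assumes the least squares loss, so the negative gradient is the residual, and replaces DecisionTreeRegressor with a hypothetical depth-1 stump on one feature. It needs at least two distinct x values to find a split.

```python
def fit_stump(x, target):
    """Fit a depth-1 regression tree on 1-D inputs by minimizing squared error."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [t for xi, t in zip(x, target) if xi <= threshold]
        right = [t for xi, t in zip(x, target) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((t - left_mean) ** 2 for t in left)
        sse += sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return [left_mean if xi <= threshold else right_mean for xi in x]


def fit_stage(x, y, y_pred, learning_rate):
    # Negative gradient of the least squares loss: the residual.
    residual = [yi - pi for yi, pi in zip(y, y_pred)]
    # The tree models the residual, not the target itself.
    stump = fit_stump(x, residual)
    update = predict_stump(stump, x)
    new_pred = [pi + learning_rate * ui for pi, ui in zip(y_pred, update)]
    return stump, new_pred


x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 1.0, 5.0, 5.0]
y_pred = [3.0] * 4  # initial prediction: the mean of y
stump, y_pred_new = fit_stage(x, y, y_pred, learning_rate=0.5)
```

With a learning rate of 0.5 the predictions move exactly halfway from the initial mean toward the targets.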

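The update_terminal_regions method of LossFunction

To deepen our understanding, let us look at what update_terminal_regions does.

# Find the leaf node that each sample falls into
terminal_regions = tree.apply(X)
# Set the region of out-of-bag samples to -1 (they take no part in training)
masked_terminal_regions = terminal_regions.copy()
masked_terminal_regions[~sample_mask] = -1
# Update the value of every leaf; tree.children_left == TREE_LEAF is how leaves are identified. A key point: only leaves are updated here, and only when the LossFunction is LeastSquaresError do the values produced while fitting the tree coincide with the node predictions we actually want.
for leaf in np.where(tree.children_left == TREE_LEAF)[0]:
    # _update_terminal_region is implemented by each concrete loss; the LossFunction base class only provides the template
    self._update_terminal_region(tree, masked_terminal_regions, leaf, X, y, residual, y_pred[:, k], sample_weight)
# Update the predictions. The tree predicts the negative gradient, and the predictions are updated by adding learning_rate * leaf value; both in-bag and out-of-bag predictions are updated here
y_pred[:, k] += (learning_rate * tree.value[:, 0, 0].take(terminal_regions, axis=0))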
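
The effect of the leaf update is easiest to see for the lad loss. The sketch below assumes that the lad leaf update replaces each leaf value with the median of y - pred over the in-bag samples in that leaf, using the plain, unweighted median as a simplification. It also assumes every leaf contains at least one in-bag sample. The function name and the toy data are made up.

```python
import statistics


def lad_update_leaves(terminal_regions, sample_mask, y, y_pred, learning_rate):
    # Out-of-bag samples get region -1 so they do not influence leaf values.
    masked = [leaf if inbag else -1 for leaf, inbag in zip(terminal_regions, sample_mask)]
    leaf_values = {}
    for leaf in sorted(set(masked) - {-1}):
        diffs = [yi - pi for yi, pi, region in zip(y, y_pred, masked) if region == leaf]
        leaf_values[leaf] = statistics.median(diffs)
    # Predictions are updated for every sample, in-bag or out-of-bag.
    new_pred = [
        pi + learning_rate * leaf_values[leaf]
        for pi, leaf in zip(y_pred, terminal_regions)
    ]
    return leaf_values, new_pred


terminal_regions = [1, 1, 1, 2, 2, 2]  # leaf index of each sample
sample_mask = [True, True, False, True, True, True]
y = [1.0, 2.0, 100.0, 10.0, 11.0, 30.0]
y_pred = [5.0] * 6

# Values the tree holds right after fitting on the lad negative gradient:
# the in-bag mean of sign(y - pred) in each leaf.
fitted_values = {}
for leaf in (1, 2):
    signs = [
        1.0 if yi - pi > 0 else -1.0
        for yi, pi, region, inbag in zip(y, y_pred, terminal_regions, sample_mask)
        if region == leaf and inbag
    ]
    fitted_values[leaf] = sum(signs) / len(signs)

leaf_values, new_pred = lad_update_leaves(
    terminal_regions, sample_mask, y, y_pred, learning_rate=0.1
)
```

Before the update the leaves hold averages of plus and minus one; afterwards they hold medians of actual residuals. Internal nodes never receive this second step.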

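When I previously used GBDT for a regression model and inspected a visualization of each tree, I found that with the ls loss (LeastSquaresError), the value at every node of every tree is the target residual at that node (every tree's leaves predict residuals; scaled by the learning rate and added to the initial prediction, they make up the final prediction). With the lad loss, however, only the leaf nodes hold target residuals. The reason is most likely found here:

ls maps to the LossFunction class LeastSquaresError. The value of every node is the target residual at that node, and the leaves need no update. This is because the negative gradient of ls is the difference between the target and the prediction, which is exactly the notion of a residual, so the values of all nodes are what we want.
lad maps to the LossFunction class LeastAbsoluteError. The value of a node is not the target residual there, and the code only updates the leaves at the end, so visualizations can be misleading and node values cannot be read directly as target residuals. In effect, lad training resembles a "binary classification" problem: its negative gradient takes only two values, -1 and 1, indicating whether the prediction is above or below the target, and the tree splits on that.

So unless you modify the source and retrain, an existing GBDT model with a loss other than ls gives no direct way to read the target residual at non-leaf nodes, which makes model analysis somewhat inconvenient.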

Computing feature_importances_

# Initialize an array of length n_features_
total_sum = np.zeros((self.n_features_, ), dtype=np.float64)
# For every stage of the boosting model (in practice one tree, or an array of several trees for multiclass)
for stage in self.estimators_:
    # Average the importance of each feature over all trees of the current stage (a stage has several trees in the multiclass case)
    stage_sum = sum(tree.feature_importances_ for tree in stage) / len(stage)
    # Accumulate the importances of the stages
    total_sum += stage_sum
# Normalize by the number of stages
importances = total_sum / len(self.estimators_)
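
The same averaging can be written without NumPy to check the arithmetic by hand. In the sketch below, each tree is represented only by its importance vector, and the function name and the numbers are made up for illustration.

```python
def ensemble_feature_importances(stages):
    """stages[i][k] is the importance vector of the k-th tree in stage i."""
    n_features = len(stages[0][0])
    total = [0.0] * n_features
    for stage in stages:
        for j in range(n_features):
            # Average over the trees of this stage, then accumulate.
            total[j] += sum(tree[j] for tree in stage) / len(stage)
    # Average over all stages.
    return [t / len(stages) for t in total]


# Two stages with one tree each (regression or binary classification).
stages = [
    [[0.5, 0.5, 0.0]],
    [[1.0, 0.0, 0.0]],
]
importances = ensemble_feature_importances(stages)
```

Because each tree's importances sum to 1, the averaged vector sums to 1 as well.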

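GradientBoostingClassifier

The loss of the GBDT classifier can be deviance or exponential. deviance maps to BinomialDeviance for binary problems and to MultinomialDeviance for multiclass problems, and exponential maps to ExponentialLoss. At predict time the classifier needs one extra step: the scores of the trees for the different classes must be combined before a result can be output. This is part of why GBDT is arguably less suited to multiclass problems.

GradientBoostingRegressor

The loss of the GBDT regressor can be ls, lad, huber or quantile, which map to the loss functions LeastSquaresError, LeastAbsoluteError, HuberLossFunction and QuantileLossFunction respectively.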
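
The extra step at predict time amounts to accumulating one score per class and picking the largest. The sketch below does this for a single sample, assuming the tree outputs of each stage are already known; the function name and the numbers are illustrative.

```python
def predict_class(init_scores, stage_outputs, learning_rate):
    """stage_outputs[i][k] is the output of the k-th tree of stage i for one sample."""
    scores = list(init_scores)
    for outputs in stage_outputs:
        for k, value in enumerate(outputs):
            # Each class accumulates the outputs of its own trees.
            scores[k] += learning_rate * value
    # The predicted class is the one with the highest accumulated score.
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


init_scores = [0.0, 0.0, 0.0]
stage_outputs = [
    [1.0, -0.5, -0.5],  # stage 0: one tree output per class
    [0.5, 0.5, -1.0],   # stage 1
]
predicted, final_scores = predict_class(init_scores, stage_outputs, learning_rate=0.1)
```

Every stage therefore costs K tree evaluations per sample, compared with a single one for regression or binary classification.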

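0xFF Summary

This concludes the three-article series.

Author: cathyxlyl | Jianshu | GitHub

Personal homepage: http://cathyxlyl.github.io/
This article may be reproduced, but the original source and the author must be credited with a hyperlink.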
