Alibaba Cloud Tianchi: Financial Risk Control, Loan Default Prediction (Part 1)

Understanding the Problem

The competition is set in the context of personal credit within financial risk control. Based on a loan applicant's data, participants must predict whether the applicant is likely to default, which in turn determines whether the loan should be approved; this is a typical classification problem. The competition introduces some of the business background of financial risk control and gives newcomers a practical problem to practice on and improve with.

The data here is already prepared for us, but in real risk-control work, defining the features X and the label y takes real expertise.
In short, time has to be split into an observation period, an observation point, and a performance period.
Observation period: determines X, drawing on telecom-operator data, e-commerce data, financial-institution data, third-party data, etc., plus derived features.
Observation point: usually the credit-granting date.
Performance period: determines the label y (its length should be set with the help of roll-rate analysis and vintage analysis).

Competition Overview

The competition asks participants to build a model on the given dataset and predict financial risk. The dataset, visible and downloadable after registration, comes from the loan records of a credit platform: over 1.2 million records with 47 columns of variables, 15 of which are anonymized. To keep the competition fair, 800,000 records are sampled as the training set, 200,000 as test set A, and 200,000 as test set B, and fields such as employmentTitle, purpose, postCode, and title are desensitized.

Data Overview

Generally, the competition page describes every column (except the anonymized features) and explains its meaning; knowing what each column represents helps with understanding the data and the later analysis. Tip: an anonymized feature is a column whose meaning has not been disclosed.
train.csv

  1. id: unique credit identifier assigned to the loan record
  2. loanAmnt: loan amount
  3. term: loan term (years)
  4. interestRate: interest rate of the loan
  5. installment: installment amount
  6. grade: loan grade
  7. subGrade: sub-grade of the loan grade
  8. employmentTitle: employment title
  9. employmentLength: employment length (years)
  10. homeOwnership: home ownership status provided by the borrower at registration
  11. annualIncome: annual income
  12. verificationStatus: verification status
  13. issueDate: month in which the loan was issued
  14. purpose: loan purpose category provided by the borrower in the application
  15. postCode: first 3 digits of the postal code provided by the borrower in the application
  16. regionCode: region code
  17. dti: debt-to-income ratio
  18. delinquency_2years: number of 30+ days past-due delinquency events in the borrower's credit file over the past 2 years
  19. ficoRangeLow: lower bound of the borrower's FICO range at loan origination
  20. ficoRangeHigh: upper bound of the borrower's FICO range at loan origination
  21. openAcc: number of open credit lines in the borrower's credit file
  22. pubRec: number of derogatory public records
  23. pubRecBankruptcies: number of public record bankruptcies
  24. revolBal: total revolving credit balance
  25. revolUtil: revolving line utilization rate, i.e. the amount of credit the borrower is using relative to all available revolving credit
  26. totalAcc: total number of credit lines currently in the borrower's credit file
  27. initialListStatus: initial listing status of the loan
  28. applicationType: indicates whether the loan is an individual application or a joint application with two co-borrowers
  29. earliesCreditLine: month in which the borrower's earliest reported credit line was opened
  30. title: loan title provided by the borrower
  31. policyCode: publicly available policy_code = 1; new products not publicly available policy_code = 2
  32. n-series anonymized features: n0 to n14, processed counting features of borrower behavior

Evaluation Metric

The competition uses AUC as the evaluation metric. AUC (Area Under Curve) is defined as the area enclosed between the ROC curve and the coordinate axes.

為什么用AUC呢?(之前實習的時候,剛開始我就做了一個準確率90%+的模型,興沖沖的拿給領導去看,然后就被告知,這個模型的入模變量有問題,再進行篩選吧)
后邊也知道了不光得看AUC,還得結合KS指標進行評價。

為什么風控模型不拿準確率來衡量呢?為什么要用AUC和KS呢?

Because risk modelling is not like a cat-vs-dog binary classification problem. Credit risk control pursues a balance between risk and profit, so the definition of good and bad is often fuzzy: bad customers bring charge-off losses, but they also bring interest and penalty income. How bad a segment can we accept? That depends on our risk tolerance. In risk control the label y is not black and white (discrete); it may be more reasonable to measure it with a probability distribution (continuous).

Another issue is that class imbalance is very common in risk-control scenarios: the ratio of positive to negative samples often reaches 1:100 or worse. In that case accuracy is unreliable, because predicting every sample as negative already yields very high accuracy. For example, with 95 cats and 5 dogs in a dataset, a classifier that simply labels everything as a cat achieves 95% accuracy. So accuracy is meaningless here.
(References: 客戶層申請評分卡(A卡)模型; 風控模型—區分度評估指標(KS)深入理解應用)

AUC

True positive rate: TPR = \frac{TP}{TP+FN}
False positive rate: FPR = \frac{FP}{FP+TN}

The business goal is a higher TPR ("caught the right ones") and a lower FPR ("caught the wrong ones").

  1. Pick a threshold T; samples scoring below the threshold are predicted bad and samples above it are predicted good, then compute TPR and FPR.
  2. Repeat with different thresholds T to obtain many (TPR, FPR) pairs.
  3. Plot FPR on the x-axis and TPR on the y-axis to get the ROC curve. The area under the curve is the AUC.

Since what we want is a higher TPR and a lower FPR, we can define the following objective (which is exactly the KS statistic):

KS = MAX(|TPR - FPR|)

Because the ROC curve lies above the diagonal, TPR is larger than FPR, so TPR = KS + FPR: the intercept of the slope-1 line through a point on the ROC curve reflects the size of KS. KS therefore measures the separation between the cumulative bad rate (TPR = samples left of the threshold that are predicted bad and truly bad / all truly bad samples) and the cumulative good rate (FPR = samples left of the threshold that are predicted bad but truly good / all truly good samples).
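As a quick illustration (my own sketch, not from the original post), the ROC curve, AUC, and KS can be computed with scikit-learn; y_true and y_prob below are placeholder names for the true labels (1 = bad) and the predicted bad probabilities:

import numpy as np
from sklearn.metrics import roc_curve, auc

# y_true: 1 = bad, 0 = good; y_prob: predicted probability of being bad
# (random data only to make the snippet runnable; replace with real labels and scores)
rng = np.random.RandomState(0)
y_true = rng.binomial(1, 0.2, size=1000)
y_prob = 0.3 * y_true + 0.7 * rng.rand(1000)

fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # sweeps over all thresholds
auc_value = auc(fpr, tpr)                         # area under the ROC curve
ks_value = np.max(tpr - fpr)                      # KS = max vertical gap between TPR and FPR
print('AUC = %.3f, KS = %.3f' % (auc_value, ks_value))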

For more details see the articles by 求是旺在路上; it was only after reading several of his posts that I gained a deeper understanding of risk control.

  1. If we want KS to be as large as possible, the cut-off point needs to be as close to (0, 1) as possible, and AUC will generally increase as well.
  2. For the same KS value there are two candidate points on the KS curve, where TPR and FPR are either both large or both small. Although the usual goal is to catch more bad customers (higher TPR) while wrongly flagging fewer good customers (lower FPR), the two have to be traded off. Which threshold to choose depends on the business objective: higher recall on bad customers, or fewer false alarms on good customers?
  3. Since KS is measured only at the single best-separating point, it is not comprehensive on its own. In practice KS is usually considered together with AUC (or Gini).

A Basic Scorecard Model (Logistic Regression)

Here we use two excellent scorecard-modelling packages: toad and scorecardpy.

If things run slowly locally, you can upload the data to Kaggle or Tianchi's online notebooks.

pip install toad
pip install scorecardpy

References:
https://github.com/ShichenXie/scorecardpy has examples (the examples contain small mistakes, but they do not affect usage)
https://toad.readthedocs.io/en/latest/ also has a Chinese tutorial, though the English version reads better

Load packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from tqdm import tqdm
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor
import warnings
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
warnings.filterwarnings('ignore')
import toad
import scorecardpy as sc

Read the data

data_train =pd.read_csv('../input/fengkong/train.csv', index_col='id')
data_test_a = pd.read_csv('../input/fengkong/testA.csv', index_col='id')

Data Cleaning

Separate the numeric columns from the non-numeric columns (object dtype: dates, strings, etc.)

'''
# non-numeric columns (alternative approach)
s = data_train.apply(lambda x:x.dtype)
tecols = s[s=='object'].index.tolist()
'''
numerical_fea = list(data_train.select_dtypes(exclude=['object']).columns)
category_fea = list(filter(lambda x: x not in numerical_fea,list(data_train.columns)))
label = 'isDefault'
numerical_fea.remove(label)

對于非數值列進行編碼

category_fea
['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']

My initial idea was: use LabelEncoder() (or a manual mapping) for 'grade' and 'subGrade', extract the number from the 'employmentLength' string, and convert 'issueDate' and 'earliesCreditLine' to the difference from the earliest date.

'''
for data in [data_train, data_test_a]:
    data['grade'] = data['grade'].map({'A':1,'B':2,'C':3,'D':4,'E':5,'F':6,'G':7})
    data['subGrade'] = data['subGrade'].map({'A1':1.0,'A2':1.2,'A3':1.4,'A4':1.6,'A5':1.8,
                                       'B1':2.0,'B2':2.2,'B3':2.4,'B4':2.6,'B5':2.8,
                                       'C1':3.0,'C2':3.2,'C3':3.4,'C4':3.6,'C5':3.8,
                                       'D1':4.0,'D2':4.2,'D3':4.4,'D4':4.6,'D5':4.8,
                                       'E1':5.0,'E2':5.2,'E3':5.4,'E4':5.6,'E5':5.8,
                                       'F1':6.0,'F2':6.2,'F3':6.4,'F4':6.6,'F5':6.8,
                                       'G1':7.0,'G2':7.2,'G3':7.4,'G4':7.6,'G5':7.8,  # G sub-grades were missing in the original mapping
                                       })

def employmentLength_to_int(s):
    if pd.isnull(s):
        return s
    else:
        return np.int8(s.split()[0])
    
for data in [data_train, data_test_a]:
    data['employmentLength'].replace(to_replace='10+ years', value='10 years', inplace=True)
    data['employmentLength'].replace('< 1 year', '0 years', inplace=True)
    data['employmentLength'] = data['employmentLength'].apply(employmentLength_to_int)
    
data_train['employmentLength'].value_counts(dropna=False).sort_index()
'''
'''
# convert to datetime format
for data in [data_train, data_test_a]:
    data['issueDate'] = pd.to_datetime(data['issueDate'],format='%Y-%m-%d')
    startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
    #構造時間特征
    data['issueDate'] = data['issueDate'].apply(lambda x: x-startdate).dt.days
data_train['issueDate'].sample(5)
'''
'''
for data in [data_train, data_test_a]:
    data['earliesCreditLine'] = pd.to_datetime(data['earliesCreditLine'])
    startdate = np.min(data['earliesCreditLine'])
    #構造時間特征
    data['earliesCreditLine'] = data['earliesCreditLine'].apply(lambda x: x-startdate).dt.days
data_train['earliesCreditLine'].sample(5)
'''

Later, however, I found something even simpler: TargetEncoder
https://zhuanlan.zhihu.com/p/40231966
https://blog.csdn.net/SHU15121856/article/details/102100689

In short: for each value of a categorical feature c, replace it with the frequency with which the target equals 1:
\frac{\text{number of rows with this value where target}=1}{\text{total number of rows with this value}}
which is simply the bad rate of that category value.

To avoid overfitting, K-fold target encoding splits the samples into K folds; the target encoding for samples in one fold uses the frequencies computed from samples of the same category in the other K-1 folds.

from category_encoders.target_encoder import TargetEncoder
# assumption: `target` is the isDefault label Series; the original post never shows where it is defined
target = data_train.pop(label)
te = TargetEncoder(cols=category_fea)
train = te.fit_transform(data_train, target)
test = te.transform(data_test_a)
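The category_encoders call above is plain mean encoding. Below is a minimal sketch of the K-fold variant described above; the helper name kfold_target_encode is mine and not part of the original post:

from sklearn.model_selection import KFold

def kfold_target_encode(train_df, test_df, col, y, n_splits=5, seed=2020):
    """Out-of-fold mean encoding of one categorical column, to limit target leakage."""
    enc = pd.Series(np.nan, index=train_df.index)
    global_mean = y.mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(train_df):
        # bad rate of each category, computed on the other K-1 folds only
        fold_means = y.iloc[fit_idx].groupby(train_df[col].iloc[fit_idx]).mean()
        enc.iloc[enc_idx] = train_df[col].iloc[enc_idx].map(fold_means).values
    enc = enc.fillna(global_mean)
    # the test set is encoded with category means from the full training data
    test_enc = test_df[col].map(y.groupby(train_df[col]).mean()).fillna(global_mean)
    return enc, test_enc

# usage sketch:
# for col in category_fea:
#     data_train[col], data_test_a[col] = kfold_target_encode(data_train, data_test_a, col, target)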

Data Exploration

Similar to describe(), but toad's detect function is more comprehensive: it reports statistics not only for numeric variables but also for categorical ones.
It shows the data type, size, missing rate, number of unique values, mean, variance, and quantiles (or the most frequent categories for categorical variables).

toad.detect(train)

We can see that policyCode is always 1, so this variable contributes nothing to classification.
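A quick way to confirm this (my own check, not in the original post):

# columns with a single unique value carry no information for the model
constant_cols = [c for c in train.columns if train[c].nunique(dropna=False) <= 1]
print(constant_cols)  # expected to contain 'policyCode'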

特征篩選

  1. Compute IV values
    In scorecardpy: sc.iv(dt, y, x=None, positive='bad|1', order=True)
    In toad: toad.quality(dataframe, target='target', iv_only=False) returns IV (information value), Gini, entropy, and the number of unique values for each feature, sorted by IV in descending order. 'target' is the target variable and 'iv_only' specifies whether to compute IV only.
toad.quality(train, target=target, iv_only=True)
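For comparison, the scorecardpy call might look like this (a sketch; it assumes the label is attached back to the frame as isDefault, since it was popped off earlier):

# IV per variable with scorecardpy
iv_table = sc.iv(train.assign(isDefault=target.values), y='isDefault')
print(iv_table.head())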

Variable Selection

Variables can basically be screened along these dimensions: missing rate, single-value (constant) rate, coefficient of variation, stability (PSI), information value (IV), RF/XGBoost feature importance, linear correlation, multicollinearity, stepwise regression, and p-value significance tests.
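Most of these checks are covered by the two packages below. For the importance-based screening just mentioned, a minimal sketch (my own; it assumes train is fully numeric after the target encoding above and that target is the label Series):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, max_depth=6, n_jobs=-1, random_state=0)
rf.fit(train.fillna(-999), target)  # crude imputation, only for a rough importance ranking
importances = pd.Series(rf.feature_importances_, index=train.columns).sort_values(ascending=False)
print(importances.head(20))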

In scorecardpy: filter by IV, missing rate, and identical-value rate

var_filter(dt, y, x=None, iv_limit=0.02, missing_limit=0.95,
identical_limit=0.95, var_rm=None, var_kp=None,
return_rm_reason=False, positive='bad|1'):

 Params
    ------
    dt: A data frame with both x (predictor/feature) and y 
      (response/label) variables.
    y: Name of y variable.
    x: Name of x variables. Default is NULL. If x is NULL, then all 
      variables except y are counted as x variables.
    iv_limit: The information value of kept variables should>=iv_limit. 
      The default is 0.02.
    missing_limit: The missing rate of kept variables should<=missing_limit. 
      The default is 0.95.
    identical_limit: The identical value rate (excluding NAs) of kept 
      variables should <= identical_limit. The default is 0.95.
    var_rm: Name of force removed variables, default is NULL.
    var_kp: Name of force kept variables, default is NULL.
    return_rm_reason: Logical, default is FALSE.
    positive: Value of positive class, default is "bad|1".
    
    Returns
    ------
    DataFrame
        A data.table with y and selected x variables
    Dict(if return_rm_reason == TRUE)
        A DataFrame with y and selected x variables and 
          a DataFrame with the reason of removed x variable.
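A usage sketch based on the signature above (again assuming the isDefault label is attached to the frame):

# keep variables that pass the IV / missing-rate / identical-value-rate filters
dt_sel = sc.var_filter(train.assign(isDefault=target.values), y='isDefault',
                       iv_limit=0.02, missing_limit=0.95, identical_limit=0.95)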

In toad:

toad.selection.select(dataframe, target='target', empty=0.9, iv=0.02, corr=0.7, return_drop=False, exclude=None):

Performs preliminary feature selection based on missing rate, IV, and correlation with other features. The parameters are:
empty=0.9: features with a missing rate greater than 90% are filtered out;
iv=0.02: features with an IV below 0.02 are removed;
corr=0.7: if the Pearson correlation between two or more features exceeds 0.7, the ones with lower IV are removed;
return_drop=False: if set to True, the function also returns a list of the dropped columns;
exclude=None: a list of features to exclude from the algorithm, typically the ID column and the month column.

這里使用Toad中的函數進行篩選變量:

train_selected, dropped = toad.selection.select(train, target=target, empty=0.9, iv=0.02, corr=0.9, return_drop=True)

print("keep:",train_selected.shape[1],
      "drop empty:",len(dropped['empty']),
      "drop iv:",len(dropped['iv']),
      "drop corr:",len(dropped['corr']))

Output

keep: 15 drop empty: 0 drop iv: 24 drop corr: 6
{'empty': array([], dtype=float64),
 'iv': array(['employmentLength', 'purpose', 'postCode', 'regionCode',
        'delinquency_2years', 'openAcc', 'pubRec', 'pubRecBankruptcies',
        'revolBal', 'totalAcc', 'initialListStatus', 'applicationType',
        'policyCode', 'n0', 'n1', 'n4', 'n5', 'n6', 'n7', 'n8', 'n10',
        'n11', 'n12', 'n13'], dtype=object),
 'corr': array(['n9', 'grade', 'n2.1', 'installment', 'ficoRangeHigh',
        'interestRate'], dtype=object)}

Binning

In scorecardpy: decision-tree binning by default

woebin(dt, y, x=None, 
           var_skip=None, breaks_list=None, special_values=None, 
           stop_limit=0.1, count_distr_limit=0.05, bin_num_limit=8, 
           # min_perc_fine_bin=0.02, min_perc_coarse_bin=0.05, max_num_bin=8, 
           positive="bad|1", no_cores=None, print_step=0, method="tree",
           ignore_const_cols=True, ignore_datetime_cols=True, 
           check_cate_num=True, replace_blank=True, 
           save_breaks_list=None, **kwargs):
 WOE Binning
    ------
    `woebin` generates optimal binning for numerical, factor and categorical 
    variables using methods including tree-like segmentation or chi-square 
    merge. woebin can also customizing breakpoints if the breaks_list or 
    special_values was provided.
    
    The default woe is defined as ln(Distr_Bad_i/Distr_Good_i). If you 
    prefer ln(Distr_Good_i/Distr_Bad_i), please set the argument `positive` 
    as negative value, such as '0' or 'good'. If there is a zero frequency 
    class when calculating woe, the zero will replaced by 0.99 to make the 
    woe calculable.
    
    Params
    ------
    dt: A data frame with both x (predictor/feature) and y (response/label) variables.
    y: Name of y variable.
    x: Name of x variables. Default is None. If x is None, 
      then all variables except y are counted as x variables.
    var_skip: Name of variables that will skip for binning. Defaults to None.
    breaks_list: List of break points, default is None. 
      If it is not None, variable binning will based on the 
      provided breaks.
    special_values: the values specified in special_values 
      will be in separate bins. Default is None.
    count_distr_limit: The minimum percentage of final binning 
      class number over total. Accepted range: 0.01-0.2; default 
      is 0.05.
    stop_limit: Stop binning segmentation when information value 
      gain ratio less than the stop_limit, or stop binning merge 
      when the minimum of chi-square less than 'qchisq(1-stoplimit, 1)'. 
      Accepted range: 0-0.5; default is 0.1.
    bin_num_limit: Integer. The maximum number of binning.
    positive: Value of positive class, default "bad|1".
    no_cores: Number of CPU cores for parallel computation. 
      Defaults None. If no_cores is None, the no_cores will 
      set as 1 if length of x variables less than 10, and will 
      set as the number of all CPU cores if the length of x variables 
      greater than or equal to 10.
    print_step: A non-negative integer. Default is 1. If print_step>0, 
      print variable names by each print_step-th iteration. 
      If print_step=0 or no_cores>1, no message is print.
    method: Optimal binning method, it should be "tree" or "chimerge". 
      Default is "tree".
    ignore_const_cols: Logical. Ignore constant columns. Defaults to True.
    ignore_datetime_cols: Logical. Ignore datetime columns. Defaults to True.
    check_cate_num: Logical. Check whether the number of unique values in 
      categorical columns larger than 50. It might make the binning process slow 
      if there are too many unique categories. Defaults to True.
    replace_blank: Logical. Replace blank values with None. Defaults to True.
    save_breaks_list: The file name to save breaks_list. Default is None.
    
    Returns
    ------
    dictionary
        Optimal or customized binning dataframe.

In toad: chi-square binning by default

toad's binning supports both categorical and numeric variables.
toad.transform.Combiner() is used to train the bins:
1. Initialize: c = toad.transform.Combiner()
2. Train the bins: c.fit(dataframe, y='target', method='chi', min_samples=None, n_bins=None, empty_separate=False)
§ y: target variable;
§ method: the binning method. Supports 'chi' (chi-squared), 'dt' (decision tree), 'kmeans' (k-means), 'quantile' (equal percentiles), and 'step' (equal step width);
§ min_samples: a number or a proportion; the minimum number / proportion of samples required in each bucket;
§ n_bins: minimum number of buckets. If the number is too large, the algorithm returns the maximum number of buckets it can get;
§ empty_separate: whether to separate missing values into their own bucket. If False, missing values are put in the bucket with the closest bad rate.
3. Export the binning result: c.export()
4. Adjust the bins: c.set_rules(dict)
5. Apply the bins and convert to discrete values: c.transform(dataframe, labels=False):
§ labels: whether to convert the data to explanatory labels. Returns 0, 1, 2, ... when False (categorical features are sorted in descending order of proportion), and (-inf, 0], (0, 10], (10, inf) when True.

Note: 1. Remember to exclude unwanted columns, especially the ID column and the timestamp column. 2. Columns with a large number of unique values may take a long time to train.
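Putting these steps together, a minimal sketch (assuming df_with_label is train_selected with the isDefault column attached):

c = toad.transform.Combiner()
# train chi-square bins, requiring at least 5% of the samples in each bucket
c.fit(df_with_label, y='isDefault', method='chi', min_samples=0.05)
bin_rules = c.export()                                # inspect the learned cut points
# c.set_rules({...})                                  # optionally adjust cut points by hand
df_binned = c.transform(df_with_label, labels=True)   # e.g. '(-inf, 0]', '(0, 10]', ...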

Both packages provide a way to adjust the bins:
scorecardpy: sc.woebin(dt_s, y="creditability", breaks_list=breaks_adj)
toad: c.set_rules(dict)

這里采用scorecardpy中的方法進行WOE分箱

train_selected = pd.concat([train_selected, target.rename('isDefault')], axis=1) 
bins = sc.woebin(train_selected, y="isDefault")

Visualize the bins and check their monotonicity:

sc.woebin_plot(bins)

A few examples (woebin_plot figures omitted):
這里需要觀察單調性,然后進行分箱調整(沒有做)。

然后講訓練集和測試集都轉為WOE編碼

train_woe = sc.woebin_ply(train_selected, bins)
# assumption: restrict the test set to the selected columns first (this step is not shown in the original post)
test_a_selected = test[train_selected.columns.drop('isDefault')]
test_a_woe = sc.woebin_ply(test_a_selected, bins)

Model Training

# breaking dt into train and val
train, val = sc.split_df(train_woe, 'isDefault').values()

y_train = train.loc[:,'isDefault']
X_train = train.loc[:,train.columns != 'isDefault']
y_val = val.loc[:,'isDefault']
X_val = val.loc[:,val.columns != 'isDefault']

# logistic regression ------
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(penalty='l1', C=0.9, solver='saga', n_jobs=-1)
lr.fit(X_train, y_train)
# lr.coef_
# lr.intercept_

# predicted probability
train_pred = lr.predict_proba(X_train)[:,1]
val_pred = lr.predict_proba(X_val)[:,1]

Check the AUC and KS on the training and validation sets

train_perf = sc.perf_eva(y_train, train_pred, title = "train")
val_perf = sc.perf_eva(y_val, val_pred, title = "val")

Both KS and AUC are within a reasonable range, and the validation performance is very close to the training performance, which suggests the model is quite stable.

Scorecard

Convert the variable bins into scores:

card = sc.scorecard(bins, lr, xcolumns = X_train.columns)
{'basepoints':      variable  bin  points
 0  basepoints  NaN   488.0,
 'n14':    variable         bin  points
 26      n14  [-inf,1.0)     9.0
 27      n14   [1.0,3.0)     3.0
 28      n14   [3.0,5.0)    -6.0
 29      n14   [5.0,inf)   -12.0,
 'employmentTitle':            variable                  bin  points
 30  employmentTitle      [-inf,200000.0)    -0.0
 31  employmentTitle  [200000.0,240000.0)     1.0
 32  employmentTitle  [240000.0,310000.0)     1.0
 33  employmentTitle       [310000.0,inf)     0.0,
 'earliesCreditLine':             variable                                        bin  points
 0  earliesCreditLine                 [-inf,0.17999999999999997)     6.0
 1  earliesCreditLine  [0.17999999999999997,0.19999999999999996)     2.0
 2  earliesCreditLine  [0.19999999999999996,0.20999999999999996)    -1.0
 3  earliesCreditLine  [0.20999999999999996,0.22999999999999995)    -4.0
 4  earliesCreditLine                  [0.22999999999999995,inf)    -8.0,
 'homeOwnership':         variable         bin  points
 5  homeOwnership  [-inf,1.0)    12.0
 6  homeOwnership   [1.0,2.0)   -13.0
 7  homeOwnership   [2.0,inf)    -3.0,
 'verificationStatus':               variable         bin  points
 8   verificationStatus  [-inf,1.0)     7.0
 9   verificationStatus   [1.0,2.0)    -1.0
 10  verificationStatus   [2.0,inf)    -5.0,
 'revolUtil':      variable          bin  points
 34  revolUtil  [-inf,20.0)     1.0
 35  revolUtil  [20.0,35.0)     0.0
 36  revolUtil  [35.0,55.0)     0.0
 37  revolUtil  [55.0,75.0)    -0.0
 38  revolUtil   [75.0,inf)    -0.0,
 'annualIncome':         variable                 bin  points
 39  annualIncome      [-inf,45000.0)   -14.0
 40  annualIncome   [45000.0,65000.0)    -6.0
 41  annualIncome   [65000.0,75000.0)     0.0
 42  annualIncome  [75000.0,105000.0)     8.0
 43  annualIncome      [105000.0,inf)    20.0,
 'title':    variable         bin  points
 11    title  [-inf,4.0)     0.0
 12    title   [4.0,5.0)    -0.0
 13    title   [5.0,6.0)    -0.0
 14    title  [6.0,20.0)     0.0
 15    title  [20.0,inf)    -0.0,
 'loanAmnt':     variable                bin  points
 44  loanAmnt      [-inf,4000.0)    17.0
 45  loanAmnt   [4000.0,10000.0)    11.0
 46  loanAmnt  [10000.0,16000.0)    -2.0
 47  loanAmnt      [16000.0,inf)    -9.0,
 'n2':    variable         bin  points
 52       n2  [-inf,4.0)     7.0
 53       n2   [4.0,6.0)     3.0
 54       n2   [6.0,9.0)    -3.0
 55       n2   [9.0,inf)   -11.0,
 'issueDate':      variable                                        bin  points
 48  issueDate                 [-inf,0.17999999999999994)    24.0
 49  issueDate  [0.17999999999999994,0.19999999999999996)     3.0
 50  issueDate  [0.19999999999999996,0.21999999999999995)    -4.0
 51  issueDate                  [0.21999999999999995,inf)   -18.0,
 'subGrade':     variable                        bin  points
 16  subGrade                 [-inf,0.1)    64.0
 17  subGrade                  [0.1,0.2)    19.0
 18  subGrade  [0.2,0.30000000000000004)   -13.0
 19  subGrade  [0.30000000000000004,0.4)   -34.0
 20  subGrade                  [0.4,inf)   -54.0,
 'dti':    variable          bin  points
 21      dti  [-inf,14.0)     8.0
 22      dti  [14.0,21.0)     2.0
 23      dti  [21.0,25.0)    -3.0
 24      dti  [25.0,30.0)    -8.0
 25      dti   [30.0,inf)   -14.0,
 'term':    variable         bin  points
 56     term  [-inf,5.0)    11.0
 57     term   [5.0,inf)   -26.0,
 'ficoRangeLow':         variable            bin  points
 58  ficoRangeLow   [-inf,685.0)    -7.0
 59  ficoRangeLow  [685.0,710.0)     0.0
 60  ficoRangeLow  [710.0,740.0)     9.0
 61  ficoRangeLow  [740.0,760.0)    18.0
 62  ficoRangeLow    [760.0,inf)    25.0}

Model Validation

Compute the score for each sample and check the stability (PSI) of the scores between the training and validation sets:

train_data = train_selected.loc[train.index].drop(columns=['isDefault'])
val_data = train_selected.loc[val.index].drop(columns=['isDefault'])
# credit score
train_score = sc.scorecard_ply(train_data, card, print_step=0)
val_score = sc.scorecard_ply(val_data, card, print_step=0)
# psi
sc.perf_psi(
  score = {'train':train_score, 'test':val_score},
  label = {'train':y_train, 'test':y_val}
)

These results look a bit off...

Predicting on the test set

lr2 = LogisticRegression(penalty='l1', C=0.9, solver='saga', n_jobs=-1)
lr2.fit(X_train, y_train)

# predicted probability
test_pred = lr2.predict_proba(test_a_woe)[:,1]
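To produce a submission file, something like the following should work (a sketch; it assumes the required format is an id column plus the predicted isDefault probability):

# data_test_a was read with index_col='id', so the index holds the loan ids
submission = pd.DataFrame({'id': data_test_a.index, 'isDefault': test_pred})
submission.to_csv('submission.csv', index=False)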

The online leaderboard score was 0.7113, fairly close to the 0.7141 obtained on the training set.
