本文采用lending club官網公開數據中2017年Q2部分,數據內容為貸款申請人信息包括申請人的年齡、性別、婚姻狀況、學歷、貸款金額、申請人財產情況等(自變量)和貸款履行情況(因變量)。(使用2017年數據是為了方便與其他人的結果對比)
本文基于對象過去行為和屬性預測其未來是否逾期,流程主要包括處理缺失值、將原始變量進行WOE編碼,通過IV值、相關系數、顯著性依次篩選變量,使用SMOTE解決類別不平衡問題,通過邏輯回歸算法解決二元分類問題(判定貸款申請人是否違約),再計算出每個樣本的評分(為了方便業務使用,類似于芝麻信用分)。
最終結果,auc=0.953, ks=0.802, accuracy_score=0.938 。
完整代碼
1.數據下載與讀取
百度網盤:網盤地址 密碼:let1
先看一下大概數據
In[1]: df.head()
Out[1]:
loan_amnt funded_amnt ... total_bc_limit total_il_high_credit_limit
0 7500.0 7500.0 ... 35000.0 92511.0
1 20000.0 20000.0 ... 22900.0 42517.0
2 12000.0 12000.0 ... 9200.0 30780.0
3 6025.0 6025.0 ... 17600.0 0.0
4 4000.0 4000.0 ... 5000.0 15523.0
查看df的信息
[In]: df.info()
[Out]: <class 'pandas.core.frame.DataFrame'>
RangeIndex: 105453 entries, 0 to 105452
Columns: 145 entries, id to settlement_term
dtypes: float64(107), object(38)
memory usage: 116.7+ MB
可以看到這個數據共有105453行,145列,在列中有107列是float64類型也就是數值變量,38列是object類型
2.數據預處理
2.1 因變量映射
在這份數據中,因變量列是loan_status,但是觀察這一列的值發現并不是二元的
In[46]: df.loan_status.value_counts() # 7個類別
Out[46]:
Current 77347
Fully Paid 19652
Charged Off 4519
Late (31-120 days) 2089
In Grace Period 1083
Late (16-30 days) 598
Default 163
Name: loan_status, dtype: int64
查找業務解釋
Fully Paid:已結清; Current:當前已還款;
Charged Off:壞賬; Late (31-120 days):預期31-120天; Late (16-30 days):預期16-30天
In Grace Period :已逾期但在寬限期類; Default:逾期超過90天
上面只有Fully Paid和Current是未逾期,將7個值映射為{0, 1}
d = {'Current':0,
'Fully Paid':0,
'Charged Off':1,
'Late (31-120 days)':1,
'Late (16-30 days)':1,
'In Grace Period':1,
'Default':1}
df.loan_status = df.loan_status.map(d)
df = df[df['loan_status'].notnull()]
此時再次查看loan_status列
In: df['loan_status'].value_counts(normalize=True)
Out:
0 0.919849
1 0.080151
Name: loan_status, dtype: float64
可以看到已經映射完畢,同時發現未逾期(0)和逾期(1)存在類別不平衡,在后續建模時可以使用2種方法:1.損失函數中使用類別權重 2.使用SMOTE對原始數據類別少的過采樣或者對于原始數據類別多的多次下采樣構建多個判別器bagging
2.1 缺失值處理
我們關注的因變量是loan_status查看這一列發現有2個null,將這兩行去除
df.loan_status.isnull().sum() # out = 2
df = df[df.loan_status.notnull()]
缺失率超過50%的列往往不能帶來什么信息,直接刪除
miss_large_col = \
[k for k,v in dict(df.isnull().sum()/df.shape[0]).items() if v>=0.5]
df = df.drop(miss_large_col,axis=1)
miss_large_col一共有42列,刪除后數據的規模是(105451, 103),查看當前缺失值情況
In[53]: (df.isnull().sum() / df.shape[0]).sort_values(ascending=False)
Out[53]:
mths_since_last_delinq 0.484765
next_pymnt_d 0.229215
il_util 0.126884
mths_since_recent_inq 0.113313
emp_title 0.064314
emp_length 0.063508
num_tl_120dpd_2m 0.050052
mo_sin_old_il_acct 0.025149
mths_since_rcnt_il 0.025149
bc_util 0.011294
percent_bc_gt_75 0.010868
bc_open_to_buy 0.010839
mths_since_recent_bc 0.010270
last_pymnt_d 0.001157
revol_util 0.000711
dti 0.000711
all_util 0.000123
avg_cur_bal 0.000019
out_prncp 0.000000
total_acc 0.000000
initial_list_status 0.000000
其中13列存在缺失值,mths_since_last_delinq這一列缺失率雖然沒有超過0.5但也有0.48,刪掉
df = df.drop(['mths_since_last_delinq'], axis=1)
接下來遍歷每一列,查看那些列中有一種值占比超過0.95
In[58]:
tmp_list = []
for x in df.drop(['loan_status'],axis=1).columns:
if df[x].value_counts(normalize=True).iloc[0] >=0.95:
tmp_list.append((x, df[x].value_counts(normalize=True).iloc[0]))
tmp_list
Out[58]:
[('pymnt_plan', 0.9995637784373785),
('total_rec_late_fee', 0.9741111985661587),
('recoveries', 0.9834235806203829),
('collection_recovery_fee', 0.9834425467752795),
('collections_12_mths_ex_med', 0.9786156603540981),
('policy_code', 1.0),
('acc_now_delinq', 0.9948127566357834),
('chargeoff_within_12_mths', 0.9915979933808119),
('delinq_amnt', 0.9955714028316469),
('num_tl_120dpd_2m', 0.9990516406616553),
('num_tl_30dpd', 0.9965860921186144),
('tax_liens', 0.9542346682345355),
('hardship_flag', 0.9993835999658609),
('disbursement_method', 0.9998198215284825),
('debt_settlement_flag', 0.9971740429204086)]
如果某一屬性所有的樣本都是同一個值,那么這個屬性肯定對是否逾期沒有影響,所以將占比最多的值超過0.95的列刪除。
not_col=[]
for x in df.drop(['loan_status'],axis=1).columns:
if df[x].value_counts(normalize=True).iloc[0] >=0.95:
not_col.append(x)
df = df.drop(not_col,axis=1)
print(df.shape[1]) # out = 88
此時還剩下88列,查看剩下的這些列
df.dtypes.sort_values()
這62列中,loan_status是int64類型,'sub_grade', 'grade', 'initial_list_status', 'int_rate', 'term', 'emp_title', 'application_type', 'emp_length', 'issue_d', 'last_credit_pull_d', 'verification_status', 'purpose', 'title', 'zip_code', 'addr_state', 'next_pymnt_d', 'last_pymnt_d', 'revol_util' 'home_ownership', 'earliest_cr_line'
共20列是object類型,其余67列是float64類型。
查看這20列,對于unique值超過100的emp_title, zip_code這兩個類別型變量直接刪除,int_rate是百分比數字因為%被識別為文本需要轉化為數字,sub_grade是grade信用評級的細分,和grade信息有交疊這里先刪掉(也許用sub_grade比grade更好,需要實驗);emp_length的unique值也可以映射為年份的數值變量,add_state與償還能力無關,刪掉;earliest_cr_line, last_pymnt_d, next_pymnt_d,last_credit_pull_d都是時間類型,可以與當前時間做差值計算時間間隔;revol_util可以轉化為數值變量。
In[68]:
object_col = list(df.select_dtypes(include=['O']).columns)
df.loc[:,object_col].describe().T
Out[68]:
count unique top freq
term 105451 2 36 months 77105
int_rate 105451 65 16.02% 4956
grade 105451 7 C 36880
sub_grade 105451 35 C1 8088
emp_title 98669 38551 Teacher 1999
emp_length 98754 11 10+ years 35438
home_ownership 105451 5 MORTGAGE 52502
verification_status 105451 3 Source Verified 42033
issue_d 105451 3 Jun-2017 38087
purpose 105451 13 debt_consolidation 58557
title 105451 12 Debt consolidation 58564
zip_code 105451 851 112xx 1100
addr_state 105451 49 CA 13751
earliest_cr_line 105451 627 Sep-2004 892
revol_util 105376 1076 0% 468
initial_list_status 105451 2 w 79488
last_pymnt_d 105329 16 Jun-2018 54794
next_pymnt_d 81280 2 Jul-2018 56176
last_credit_pull_d 105451 17 Jun-2018 84157
application_type 105451 2 Individual 98638
對于unique值超過100的emp_title, zip_code這兩個類別型變量直接刪除,int_rate是百分比數字因為%被識別為文本需要轉化為數字,sub_grade是grade信用評級的細分,和grade信息有交疊這里先刪掉(也許用sub_grade比grade更好,需要實驗);emp_length的unique值也可以映射為年份的數值變量,add_state與償還能力無關,刪掉;earliest_cr_line, last_pymnt_d, next_pymnt_d,last_credit_pull_d都是時間類型,可以與當前時間做差值計算時間間隔;revol_util可以轉化為數值變量。操作如下
df = df.drop(['emp_title', 'zip_code', 'sub_grade', 'addr_state'], axis=1)
df['revol_util'] = df['revol_util']\
.map(lambda x: float(x.split('%')[0])/100 if not pd.isnull(x) else x)
df['int_rate'] = df['int_rate']\
.map(lambda x: float(x.split('%')[0])/100 if not pd.isnull(x) else x)
df['emp_length'].unique()
d = {'10+ years':10, '< 1 year':0, '7 years':7,'2 years':2, '1 year':1,
'3 years':3, '9 years':9, '8 years':8, '5 years':5, '6 years':6, '4 years':4}
df['emp_length'] = df['emp_length'].map(d)
再次查看經過處理后的object列
object_col = list(df.select_dtypes(include=['O']).columns)
object_col
df.loc[:,object_col].describe().T
# 依次對剩余列檢查
for ob in object_col:
print(ob, dict(df[ob].value_counts(normalize=True)))
依次查看每一列關于loan_status的分組條形圖
發現home_ownership中{'MORTGAGE': 0.50, 'RENT': 0.39, 'ANY': 4.7415387241467604e-05, 'OWN': 0.11, 'NONE': 1.896615489658704e-05}
‘ANY’和NONE占比太少,用最多的MORTGAGE替換
df.loc[df.home_ownership.isin(['ANY', 'NONE']), 'home_ownership'] = 'MORTGAGE'
for i in object_col:
pvt=pd.pivot_table(df[['loan_status',i]],index=i,columns="loan_status",aggfunc=len)
pvt.plot(kind="bar")
這里只展示幾個列,下面的term關于loan_status的條形圖可以看出,多數選擇36個月,而且選擇36個月的客戶違約率更低。
grade中各個評級對于違約率的影響我們不能直接看出來,那么怎么衡量這個變量對于loan_status有沒有影響,這就需要用到信用卡評分模型中常用的WOE編碼(詳細點擊這篇文章[待完成])。
2.2 缺失值填充
接下來填充缺失值,策略是對于數值類型變量,如果缺失值超過0.05,用-999代替作為一個特征值,沒有超過0.05用中位數填充;對于類別型變量,本項目中沒有缺失值,如果有可以用新的值或者占比最多的值填充。
rate = dict(df.isnull().sum()/df.shape[0])
rate
# 1.對于數值型數據,缺失率超過 0.05, 用 -999代替nan
# 2.對于類別數據,
cate_col = list(df.select_dtypes(include=['O']).columns)#4
num_col = [x for x in df.columns if x not in cate_col and x!='loan_status']#57
d1 = [k for k,v in rate.items() if k in num_col and v>=0.05]
for i in d1:
df[i] = df[i].fillna(-999)
d2 = [x for x in num_col if x not in d1]
for i in d2:
df[i] = df[i].fillna(df[i].median())
df.loc[:,cate_col].isnull().sum()# 類別類型無缺失
3.WOE編碼
3.1 對類別變量進行WOE編碼
def binning_cate(df,col,target):
total = df[target].count()
bad = df[target].sum()
good = total-bad
group = df.groupby([col],as_index=True)
bin_df = pd.DataFrame()
bin_df['total'] = group[target].count()
bin_df['totalrate'] = bin_df['total']/total
bin_df['bad'] = group[target].sum()
bin_df['badrate'] = bin_df['bad']/bin_df['total']
bin_df['good'] = bin_df['total'] - bin_df['bad']
bin_df['goodrate'] = bin_df['good']/bin_df['total']
bin_df['badattr'] = bin_df['bad']/bad
bin_df['goodattr'] = (bin_df['total']-bin_df['bad'])/good
bin_df['woe'] = np.log(bin_df['badattr']/bin_df['goodattr'])
bin_df['bin_iv'] = (bin_df['badattr']-bin_df['goodattr'])*bin_df['woe']
bin_df['iv'] = bin_df['bin_iv'].sum()
return bin_df
cate_bin_df_list = []
for col in cate_col:
bin_df = binning_cate(df, col, 'loan_status')
cate_bin_df_list.append(bin_df)
# 存類別變量名、IV值
cate_iv_df = pd.DataFrame({'col':cate_col, 'iv':[x['iv'].iloc[0] for x in cate_bin_df_list]}).sort_values('iv',ascending=False).reset_index(drop=True)
cate_iv_df
結果是
Out[168]:
col iv
0 purpose inf
1 grade 0.476388
2 verification_status 0.083826
3 initial_list_status 0.022144
4 title 0.018638
5 home_ownership 0.017939
6 term 0.016072
7 issue_d 0.005004
8 application_type 0.000880
purpose的iv值居然是正無窮,顯然不符合常理,出現這種情況是因為purpose變量中某一類的數量太少,我們用查看這一列值的分布,顯然wedding只有一個樣本,將這個樣本刪掉df = df.loc[df.purpose != 'wedding']
。
df['purpose'].value_counts()
Out[169]:
debt_consolidation 58557
credit_card 21261
home_improvement 9222
other 7140
major_purchase 2616
medical 1648
car 1334
vacation 1170
small_business 1034
moving 945
house 453
renewable_energy 70
wedding 1
Name: purpose, dtype: int64
3.2 對數值變量進行WOE編碼
對數值變量進行WOE編碼的方法是對于某一個自變量比如說last_pymnt_d
,使用這個變量和loan_status構建單變量決策樹模型,決策樹節點的分裂區間來對數值變量進行分箱。
In[181]: # 對數值變量分箱, 使用單變量決策樹方法
def tree_split(df,col,target,max_bin,min_binpct,nan_value):
missing_rate = df[df[col]==nan_value].shape[0]/df.shape[0]
if missing_rate < 0.05:
x = np.array(df[col]).reshape(-1,1)
y = np.array(df[target])
tree = DecisionTreeClassifier(max_leaf_nodes=max_bin,min_samples_leaf=min_binpct)
tree.fit(x,y)
threshold = tree.tree_.threshold
threshold = threshold[threshold!=_tree.TREE_UNDEFINED]
split_list = sorted(threshold.tolist())
else:
x = np.array(df[df[col]!=nan_value][col]).reshape(-1,1)
y = np.array(df[df[col]!=nan_value][target])
tree = DecisionTreeClassifier(max_leaf_nodes=max_bin-1,min_samples_leaf=min_binpct)
tree.fit(x,y)
threshold = tree.tree_.threshold
threshold = threshold[threshold!=_tree.TREE_UNDEFINED]
split_list = sorted(threshold.tolist())
split_list.insert(0,nan_value)
return split_list
# 數值型特征的分箱,計算woe,IV
def binning_num(df,col,target,cut):
total = df[target].count()
bad = df[target].sum()
good = total-bad
bucket = pd.cut(df[col],cut)
group = df.groupby(bucket)
bin_df = pd.DataFrame()
bin_df['total'] = group[target].count()
bin_df['totalrate'] = bin_df['total']/total
bin_df['bad'] = group[target].sum()
bin_df['badrate'] = bin_df['bad']/bin_df['total']
bin_df['good'] = bin_df['total'] - bin_df['bad']
bin_df['goodrate'] = bin_df['good']/bin_df['total']
bin_df['badattr'] = bin_df['bad']/bad
bin_df['goodattr'] = (bin_df['total']-bin_df['bad'])/good
bin_df['woe'] = np.log(bin_df['badattr']/bin_df['goodattr'])
bin_df['bin_iv'] = (bin_df['badattr']-bin_df['goodattr'])*bin_df['woe']
bin_df['iv'] = bin_df['bin_iv'].sum()
return bin_df
num_dict={}
for col in num_col:
split_list = tree_split(df,col,'loan_status',5,0.05,-999)
split_list.insert(0,float('-inf'))
split_list.append(float('inf'))
bin_df = binning_num(df,col,'loan_status',split_list)
num_dict.setdefault(col,{})
num_dict[col]['bin_df']=bin_df
num_dict[col]['cut'] = split_list
num_iv_df = pd.DataFrame({'col':num_col,'iv':[num_dict[x]['bin_df']['iv'].iloc[0] for x in num_col]})\
.sort_values('iv',ascending=False).reset_index(drop=True)
num_iv_df.head()
Out[181]:
col iv
0 last_pymnt_d 2.059883
1 total_rec_prncp 1.171917
2 last_pymnt_amnt 0.687479
3 out_prncp 0.567522
4 out_prncp_inv 0.567459
4.變量篩選
4.1 根據IV值篩選
根據業務經驗,將閾值設定為0.03,將大于0.03的變量篩選出來,最后得到32個數值變量、2個類別變量
#根據業務經驗將閾值定在0.03,大于0.03篩選得23個數值型字段,1個類別型字段。
iv_select_num_col = list(num_iv_df[num_iv_df.iv>0.03]['col'])
select_num_dict = {k:v for k,v in num_dict.items() if k in iv_select_num_col}
len(iv_select_num_col)
iv_select_cate_col = list(cate_iv_df[cate_iv_df.iv>0.03]['col'])
len(iv_select_cate_col)
iv_select_df = pd.concat([num_iv_df[num_iv_df.iv>0.03],cate_iv_df[cate_iv_df.iv>0.03]],axis=0).\
sort_values('iv',ascending=False).reset_index(drop=True)
df2 = df.loc[:,iv_select_num_col+iv_select_cate_col+['loan_status']]
df2.shape
4.2 將原始變量轉化為WOE變量
將原始變量轉化為 WOE 變量
woe_list = list(select_num_dict[col]['bin_df']['woe'])
cut = select_num_dict[col]['cut']
df2[col+'_woe'] = pd.cut(df2[col], bins=cut, labels=woe_list)
for col in iv_select_cate_col:
woe_dict = dict([x for x in cate_bin_df_list if x.index.name==col][0]['woe'])
df2[col+'_woe'] = df2[col].map(woe_dict)
df2.head()
df2_woe = df2.loc[:, [x for x in df2.columns if x.find('woe')>0]+['loan_status']]
df2_woe.head()
for col in df2_woe.columns:
df2_woe[col] = df2_woe[col].astype('float64')
此時共有35列,34個自變量,1個因變量
4.3 使用前向逐步回歸發根據相關系數篩選變量
首先選定一個變量,每次加入一個變量,將當前相關系數大于0.7的變量去除
# 根據相關系數去除多重共線性
def forward_corr_delete(data,col_list):
corr_list=[]
corr_list.append(col_list[0])
delete_col=[]
for col in col_list[1:]:
corr_list.append(col)
corr = data.loc[:,corr_list].corr()
corr_tup = [(k,v) for k,v in zip(corr[col].index,corr[col].values)]
corr_value = [v for k,v in corr_tup if k!=col]
if len([x for x in corr_value if abs(x)>=0.65])>0:
delete_col.append(col)
select_corr_col=[x for x in col_list if x not in delete_col]
return select_corr_col
corr_col = [x+'_woe' for x in iv_select_df.col]
select_corr_col = forward_corr_delete(df2_woe,corr_col)
len(select_corr_col)
df2_woe2 = df2_woe.loc[:,select_corr_col+['loan_status']]
df2_woe2.head()
經過篩選,得到了17個變量
4.4 根據方差膨脹因子(VIF)去除多重共線性
在這一步,沒有發現多重共線性
# 根據方差膨脹因子去除共線性
def vif_delete(df,list_corr):
col_list = list_corr.copy()
vifs_matrix = np.matrix(df[col_list])
vifs_list = [variance_inflation_factor(vifs_matrix,i)for i in range(vifs_matrix.shape[1])]
vif_high = [x for x,y in zip(col_list,vifs_list) if y>10]
if len(vif_high)>0:
for col in reversed(vif_high):
col_list.remove(col)
vif_matrix=np.matrix(df[col_list])
vifs = [variance_inflation_factor(vif_matrix,i)for i in range(vif_matrix.shape[1])]
if len([x for x in vifs if x>10])==0:
break
return col_list
vif_select_col = vif_delete(df2_woe2,select_corr_col)
len(vif_select_col)
4.5 根據顯著性篩選變量
使用statistic模塊根據p值做顯著性檢驗,刪除inq_fi_woe
變量
# 顯著性篩選 根據p值
def forward_pvalue_delete(x,y):
col_list = x.columns.tolist()
pvalues_col=[]
for col in col_list:
pvalues_col.append(col)
x_const = sm.add_constant(x.loc[:,pvalues_col])
sm_lr = sm.Logit(y,x_const)
sm_lr = sm_lr.fit()
pvalue = sm_lr.pvalues[col]
if pvalue>=0.5:
pvalues_col.remove(col)
return pvalues_col
# 將數據集分為特征集X和標簽集Y
x = df2_woe2.drop(['loan_status'],axis=1)
y = df2_woe2['loan_status']
# 做顯著性篩選
pvalues_col = forward_pvalue_delete(x,y)
df2_woe3 = df2_woe2.loc[:, pvalues_col+['loan_status']]
5. 建模
使用 sklearn中的邏輯回歸模型作為分類器
5.1 簡單建模,超參數使用默認
x2 = df2_woe3.drop(['loan_status'],axis=1)
y2 = df2_woe3['loan_status']
x_train,x_test,y_train,y_test = train_test_split(x2,y2,test_size=0.2,random_state=2020)
lr_model = LogisticRegression().fit(x_train,y_train)
對使用默認的參數訓練的模型衡量指標,包括auc, ks, 敏感性,特異性,精準性
#繪制roc曲線
def plot_roc(y_label,y_pred):
tpr,fpr,threshold = metrics.roc_curve(y_label,y_pred)
AUC = metrics.roc_auc_score(y_label,y_pred)
fig = plt.figure(figsize=(6,4))
ax = fig.add_subplot(1,1,1)
ax.plot(tpr,fpr,color='blue',label='AUC=%.3f'%AUC)
ax.plot([0,1],[0,1],'r--')
ax.set_xlim(0,1)
ax.set_ylim(0,1)
ax.set_title('ROC')
ax.legend(loc='best')
return plt.show(ax)
#繪制KS曲線
def plot_model_ks(y_label,y_pred):
pred_list = list(y_pred)
label_list = list(y_label)
total_bad = sum(label_list)
total_good = len(label_list)-total_bad
items = sorted(zip(pred_list,label_list),key=lambda x :x[0])
step = (max(pred_list)-min(pred_list))/200
pred_bin = []
good_rate = []
bad_rate = []
ks_list = []
for i in range(1,201):
idx = min(pred_list)+i*step
pred_bin .append(idx)
label_bin = [x[1] for x in items if x[0]<idx]
bad_num = sum(label_bin)
good_num = len(label_bin)-bad_num
goodrate = good_num/total_good
badrate = bad_num/total_bad
ks = abs(goodrate-badrate)
good_rate.append(goodrate)
bad_rate.append(badrate)
ks_list.append(ks)
fig = plt.figure(figsize=(6,4))
ax = fig.add_subplot(1,1,1)
ax.plot(pred_bin,good_rate,color='green',label='good_rate')
ax.plot(pred_bin,bad_rate,color='red',label='bad_rate')
ax.plot(pred_bin,ks_list,color='blue',label='good-bad')
ax.set_title('KS:{:.3f}'.format(max(ks_list)))
ax.legend(loc='best')
return plt.show(ax)
y_pred = lr_model.predict_proba(x_test)[:,1]
plot_roc(y_test,y_pred) #
plot_model_ks(y_test,y_pred)
fpr,tpr,thre=roc_curve(y_test, y_pred)
ks=max(tpr-fpr)
此時的auc=0.950, ks=0.798, roc和ks曲線如下
5.2 使用網格搜索交叉驗證選擇最優參數
In[157]:
#利用交叉驗證和網格搜索
from sklearn.model_selection import GridSearchCV #網格搜索
from sklearn.linear_model import LogisticRegression # 邏輯回歸
from sklearn.model_selection import train_test_split # 測試集與訓練集劃分
#構建網格參數組合
param_test1={"C":[0.01,0.1,1.0,10.0,20.0,30.0,100.0,200.0,300.0,1000.0], #正則化系數
"penalty":["l1","l2"], #正則化參數
"max_iter":[100,200,300,400,500]} #算法收斂的最大迭代次數
gsearch1=GridSearchCV(LogisticRegression(),param_grid=param_test1,cv=10)
gsearch1.fit(x_train,y_train) #訓練模型
gsearch1.best_params_, gsearch1.best_score_ #查看評分最高的參數組合與最佳評分
Out[157]:
({'C': 10.0, 'max_iter': 100, 'penalty': 'l2'}, 0.9728544333807492)
最優的參數是C=10.0, max_iter=100, peanalty=l2(正則化項)
使用最優參數構建分類器,訓練得到的auc和ks并沒有較大提升,說明在本項目里選定了邏輯回歸,改變一個超參數對結果影響不大。
5.3 使用SMOTE解決類別不平衡問題
在當前數據中,逾期的類只占了0.08,有一些不平衡,使用SMOTE算法對少數類進行過采樣生成均衡的數據集,檢驗指標是否有提升。注意:使用SMOTE算法只能對訓練集進行過采樣。
In[237]: y.value_counts(normalize=True)
Out[237]:
0.0 0.919848
1.0 0.080152
# 使用SMOTE算法解決類別不平衡
from imblearn.over_sampling import SMOTE # 導入SMOTE算法模塊
# 處理不平衡數據
smo = SMOTE(random_state=42) # 處理過采樣的方法
x_train2, y_train2 = smo.fit_sample(x_train, y_train)
print('通過SMOTE方法平衡正負樣本后')
n_sample = y_train2.shape[0]
n_pos_sample = y_train2[y_train2 == 0].shape[0]
n_neg_sample = y_train2[y_train2 == 1].shape[0]
print('樣本個數:{}; 正樣本占{:.2%}; 負樣本占{:.2%}'.format(n_sample,
n_pos_sample / n_sample,
n_neg_sample / n_sample))
lr_model_smo = LogisticRegression().fit(x_train2,y_train2)
y_pred_smo = lr_model_smo.predict_proba(x_test)[:,1]
plot_roc(y_test,y_pred_smo)
plot_model_ks(y_test, y_pred_smo)
使用SMOTE算法,得到的auc=0.953, ks=0.802, 說明本項目中將訓練集構造成均衡的數據集有效果。
6.計算每個樣本的評分
在當前的數據集上,每個樣本都是特征+預測得到的逾期概率,為了在業務上有更好的解釋性,需要將概率轉化為信用評分(類似于芝麻信用分)
# 計算基礎分
def cal_scale(score,odds,PDO,model):
B = PDO/np.log(2)
A = score+B*np.log(odds)
base_score = A-B*model.intercept_[0]
return A,B,base_score
A,B,base_score = cal_scale(400,999/1,20,lr_model)
x_test_score = x_test.copy()
for col in x_test_score.columns:
col_coe = coe_dict[col]
x_test_score[col.replace('woe','score')]=x_test_score[col].map(lambda x:round(x*-B*col_coe))
x_test_score['score'] = round(base_score)
for col in [x for x in x_test_score.columns if x.find('_score')>=0]:
x_test_score['score']+=x_test_score[col]
x_test_score['label']=list(y_test)
sns.kdeplot(x_test_score[x_test_score['label']==1].score,shade=True,label='bad')
sns.kdeplot(x_test_score[x_test_score['label']==0].score,shade=True,label='good')
在上圖中可以看出,正負樣本的區分度還是很高的,但是正樣本與負樣本都不是標準的正態分布,說明模型還是有局限性。
7.其他模型、模型融合(待完成)
在業務中因為邏輯回歸模型并行化、訓練速度快、可解釋性強等優點被廣泛使用,但是預測是否逾期是一個很典型的機器學習問題,當然要使用
7.1 LightGBM
7.2 DNN
7.3 模型融合
參考文章:https://zhuanlan.zhihu.com/p/152128764這篇文章的auc只有0.67左右