4. Feature Selection in Machine Learning - Python Code

1. Feature selection ------ sklearn code

1.1 Feature selection ------ variance threshold

Suppress warning messages first:

import warnings
warnings.filterwarnings("ignore")
# Variance threshold method (removes zero-variance features by default)
from sklearn.feature_selection import VarianceThreshold
X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
selector = VarianceThreshold()
X_new = selector.fit_transform(X)
# Inspect the fitted selector
print('X_new:\n',selector.fit_transform(X))
print('get_params:\n',selector.get_params())
print('get_support:\n',selector.get_support())
print('inverse_transform:\n',selector.inverse_transform(X_new))

Output:

X_new:
 [[2 0]
 [1 4]
 [1 1]]
get_params:
 {'threshold': 0.0}
get_support:
 [False  True  True False]
inverse_transform:
 [[0 2 0 0]
 [0 1 4 0]
 [0 1 1 0]]
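
By default the threshold is 0, so only constant features are dropped. A non-zero threshold can be passed as well; the following is a small illustrative sketch (not part of the original example) that keeps only the features of the toy X above whose variance exceeds 0.5:

from sklearn.feature_selection import VarianceThreshold

X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
# Drop every feature whose variance is not above 0.5
selector = VarianceThreshold(threshold=0.5)
X_new = selector.fit_transform(X)
print('get_support:\n', selector.get_support())
print('X_new:\n', X_new)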

1.2 Feature selection ------ univariate feature selection (chi-squared, F-test, mutual information)

1.2.1 What the chi-squared test does

Compute chi-squared stats between each non-negative feature and class. This score can be used to select the n_features features with the highest values for the test chi-squared statistic from X, which must contain only non-negative features.
In other words, it measures the dependence between each feature and the class label, so it can be used to "weed out" the features that are most likely to be independent of the class. For classification it is generally more informative than the variance test.

1.2.2 Time complexity

 O(n_classes * n_features)

1.2.3 The two selector classes in sklearn

* SelectKBest(score_func, k=...): keeps the K highest-scoring features;
* SelectPercentile(score_func, percentile=...): keeps the highest-scoring percentage of the features;

1.2.4 Choosing score_func

Both selectors in 1.2.3 default to the f_classif score function. To use the chi-squared test here, pass chi2 as score_func.

For regression:

f_regression: correlation-based; computes the correlation between each feature and the target, then derives an F value and p-value;
mutual_info_regression: mutual information; (a short sketch of the regression scorers follows the notes below)

For classification:

chi2: chi-squared test;
f_classif: analysis of variance; computes the ANOVA F value (between-group mean square / within-group mean square);
mutual_info_classif: mutual information;

Note: chi2, mutual_info_classif and mutual_info_regression preserve the sparsity of the data.

Note: for more background on f_classif and f_regression, see this link.
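
The examples below all use the classification scorers. As a small illustrative sketch (not from the original article), the regression scorers can be used in exactly the same way; here the diabetes dataset shipped with sklearn is assumed:

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression

X_reg, y_reg = load_diabetes(return_X_y=True)

# Keep the 4 features with the highest F statistic with respect to the continuous target
skb_f = SelectKBest(f_regression, k=4).fit(X_reg, y_reg)
print('f_regression scores:\n', skb_f.scores_)
print('selected columns:\n', skb_f.get_support())

# The same selection with mutual information (no p-values are produced)
skb_mi = SelectKBest(mutual_info_regression, k=4).fit(X_reg, y_reg)
print('mutual_info_regression scores:\n', skb_mi.scores_)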

1.2.5 Two attributes

  • scores_ : array-like, shape=(n_features,). Scores of the features.

  • pvalues_ : array-like, shape=(n_features,). p-values of the feature scores, or None if score_func returned only scores.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest,SelectPercentile
from sklearn.feature_selection import chi2,f_classif,mutual_info_classif
iris = load_iris()
X, y = iris.data, iris.target
print('old_data_shape:\n',X.shape)

# SelectKBest keeps the K highest-scoring features
X_SelectKBest = SelectKBest(chi2, k=2)
X_new_SelectKBest = X_SelectKBest.fit_transform(X, y)
print('new_SelectKBest_data_shape:\n',X_new_SelectKBest.shape)


# SelectPercentile keeps the highest-scoring percentage of the features
# Default score_func: f_classif
# For regression: f_regression, mutual_info_regression
# For classification: chi2, f_classif, mutual_info_classif
X_SelectPercentile = SelectPercentile(chi2,percentile=50)
X_new_SelectPercentile = X_SelectPercentile.fit_transform(X, y)
print('new_SelectPercentile_data_shape:\n',X_new_SelectPercentile.shape)


# Attributes
# scores_ : array-like, shape=(n_features,)
#     Scores of features.

# pvalues_ : array-like, shape=(n_features,)
#     p-values of feature scores, None if `score_func` returned only scores.
print('KBest scores:    \n',X_SelectKBest.scores_)
print('Percentile scores:\n',X_SelectPercentile.scores_)
print('KBest pvalues_:    \n',X_SelectKBest.pvalues_)
print('Percentile pvalues_:\n',X_SelectPercentile.pvalues_)



# Check which two columns were selected
# Comparing with the original data shows that SelectKBest and SelectPercentile both kept the last two columns
print('old_data_10:\n',X[0:10,:])
print('X_new_SelectKBest:\n',X_new_SelectKBest[0:10,:])
print('X_new_SelectPercentile:\n',X_new_SelectPercentile[0:10,:])

Output:

old_data_shape:
(150, 4)
new_SelectKBest_data_shape:
(150, 2)
new_SelectPercentile_data_shape:
(150, 2)
KBest scores:    
[ 10.81782088   3.59449902 116.16984746  67.24482759]
Percentile scores:
[ 10.81782088   3.59449902 116.16984746  67.24482759]
KBest pvalues_:    
[4.47651499e-03 1.65754167e-01 5.94344354e-26 2.50017968e-15]
Percentile pvalues_:
[4.47651499e-03 1.65754167e-01 5.94344354e-26 2.50017968e-15]
old_data_10:
[[5.1 3.5 1.4 0.2]
[4.9 3.  1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5.  3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5.  3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]]
X_new_SelectKBest:
[[1.4 0.2]
[1.4 0.2]
[1.3 0.2]
[1.5 0.2]
[1.4 0.2]
[1.7 0.4]
[1.4 0.3]
[1.5 0.2]
[1.4 0.2]
[1.5 0.1]]
X_new_SelectPercentile:
[[1.4 0.2]
[1.4 0.2]
[1.3 0.2]
[1.5 0.2]
[1.4 0.2]
[1.7 0.4]
[1.4 0.3]
[1.5 0.2]
[1.4 0.2]
[1.5 0.1]]

# Compare different score_func choices: do they all select the same features?
# Note: this is a small sample, so any agreement here may be coincidental; compare them yourself on real project data.
# In principle, different score_func choices can select different features, since they measure different kinds of dependence.
# Classification scorers compared: chi2, f_classif, mutual_info_classif
X_SelectPercentile_chi2 = SelectPercentile(chi2,percentile=50)
X_SelectPercentile_f_classif = SelectPercentile(f_classif,percentile=50)
X_SelectPercentile_mutual_info_classif = SelectPercentile(mutual_info_classif,percentile=50)

X_new_SelectKBest_chi2 = X_SelectPercentile_chi2.fit_transform(X, y)
X_new_SelectKBest_f_classif = X_SelectPercentile_f_classif.fit_transform(X, y)
X_new_SelectKBest_mutual_info_classif = X_SelectPercentile_mutual_info_classif.fit_transform(X, y)

print('chi2 scores:             \n',X_SelectPercentile_chi2.scores_)
print('f_classif scores:         \n',X_SelectPercentile_f_classif.scores_)
print('mutual_info_classif scores:\n',X_SelectPercentile_mutual_info_classif.scores_)

print('chi2 pvalues_:             \n',X_SelectPercentile_chi2.pvalues_)
print('f_classif pvalues_:         \n',X_SelectPercentile_f_classif.pvalues_)
print('mutual_info_classif pvalues_:\n',X_SelectPercentile_mutual_info_classif.pvalues_)

Output:

chi2 scores:             
 [ 10.81782088   3.59449902 116.16984746  67.24482759]
f_classif scores:         
 [ 119.26450218   47.3644614  1179.0343277   959.32440573]
mutual_info_classif scores:
 [0.47980703 0.26374852 0.98914392 0.97626377]
chi2 pvalues_:             
 [4.47651499e-03 1.65754167e-01 5.94344354e-26 2.50017968e-15]
f_classif pvalues_:         
 [1.66966919e-31 1.32791652e-16 3.05197580e-91 4.37695696e-85]
mutual_info_classif pvalues_:
 None
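
The scores above come from different statistics, so their raw magnitudes are not directly comparable. To see which columns each scorer actually kept, get_support() can be inspected; a minimal sketch that assumes the three selectors fitted in the cell above:

# Which columns did each scorer keep? (True = selected)
print('chi2 support:               ', X_SelectPercentile_chi2.get_support())
print('f_classif support:          ', X_SelectPercentile_f_classif.get_support())
print('mutual_info_classif support:', X_SelectPercentile_mutual_info_classif.get_support())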
 

# Mutual information (MIC) via the minepy package
from minepy import MINE
import pandas as pd

def my_mine(x_data, y):
    # Collect the MIC of each feature in a list
    mic_score = []
    m = MINE()
    # Iterate over every column of x_data
    for column in x_data.columns:
        m.compute_score(x_data[column], y)
        mic_score.append(m.mic())

    # Convert to a DataFrame and attach the matching column names
    mic_score = pd.DataFrame(mic_score)
    # Set the index
    mic_score.index = x_data.columns
    # Rename the column
    mic_score = mic_score.rename(columns={0: 'MICSCORE'})
    return mic_score

from sklearn.datasets import load_iris        
iris = load_iris()
x ,y= pd.DataFrame(iris.data).rename(columns={0:'a',1:'b',2:'c',3:'d'}), iris.target
my_mine(x,y)

# Notes:
# The mutual information implemented in sklearn.feature_selection differs slightly from the MIC values here, but not by much.
# The sklearn.feature_selection implementation is recommended.
# mutual_info_classif scores for comparison: [0.50056966 0.24913013 0.99109155 0.99213811]

Output:

    MICSCORE
a   0.642196
b   0.401504
c   0.918296
d   0.918296

1.3 Feature selection ------ Pearson correlation (Pearson Correlation)

The Pearson correlation coefficient is one of the simplest ways to understand the relationship between a feature and the response variable. It measures the linear correlation between variables and takes values in [-1, 1]: -1 means perfect negative correlation, +1 perfect positive correlation, and 0 no linear correlation. Its obvious drawback as a feature-ranking criterion is that it is only sensitive to linear relationships: if the relationship is nonlinear, the Pearson correlation can be close to 0 even when the two variables are in a one-to-one relationship.

# SciPy's pearsonr
import pandas as pd
from scipy.stats import pearsonr

def my_pearsonr(x_data, y_data):
    # Iterate over every column of x_data
    score_p_value = []
    for column in x_data.columns:
        score = pearsonr(x_data[column], y_data)
        score_p_value.append(list(score))
    # Convert to a DataFrame
    score_p_value = pd.DataFrame(score_p_value)
    # Set the index
    score_p_value.index = x_data.columns
    # Rename the columns
    score_p_value = score_p_value.rename(columns={0: 'score', 1: 'p-value'})

    return score_p_value


from sklearn.datasets import load_iris
iris = load_iris()
x, y = pd.DataFrame(iris.data).rename(columns={0:'a',1:'b',2:'c',3:'d'}), iris.target

score_p_value = my_pearsonr(x, y)
score_p_value

Output:


score   p-value
a   0.782561    2.890478e-32
b   -0.419446   9.159985e-08
c   0.949043    4.155478e-76
d   0.956464    4.775002e-81

# The Pearson correlation is insensitive to nonlinear relationships
import numpy as np
x = np.random.uniform(-1, 1, 100000)
print(pearsonr(x, x**2))

Output:

(-0.006741987892572058, 0.03300673581074622)

1.4 Feature selection ------ distance correlation (Distance correlation)

# Preliminaries

# Changing array dimensionality
## np.atleast_xd treats the input as having at least x dimensions, where x can be 1, 2 or 3.
from scipy.spatial.distance import pdist, squareform
import numpy as np
np.atleast_1d([1])
np.atleast_2d([1])
np.atleast_3d([1])
print('np.atleast_1d:',np.atleast_1d([1]))
print('np.atleast_2d:',np.atleast_2d([1]))
print('np.atleast_3d:',np.atleast_3d([1]))

# Euclidean distance
# pdist computes pairwise distances between the rows of a data matrix (many distance metrics are supported)
X = [[1,2,3,4,5],
    [2,4,6,8,10],
    [3,6,9,12,15],
    [1,2,3,4,5]]
x_pdist=pdist(X)
# pdist(X) computes the Euclidean distances between the rows of an m*n data matrix
# and returns a condensed distance vector of length m(m-1)/2, ordered as (1,2),(1,3),...,(1,m),(2,3),...,(m-1,m)
print('Euclidean distances:\n',x_pdist)
# In the square-form matrix, entry (i,j) is the Euclidean distance between rows i and j of the original data
print('Square form:\n',squareform(x_pdist))

# Products of array elements
print(np.prod([1,2]))                      # 1*2
print(np.prod([[1,2],[3,4]]))              # 1*2*3*4
print(np.prod([[1,2],[3,4]],axis=0))       # [1*3, 2*4]
print(np.prod([[1,2],[3,4]],axis=1))       # [1*2, 3*4]

Output:

np.atleast_1d: [1]
np.atleast_2d: [[1]]
np.atleast_3d: [[[1]]]
Euclidean distances:
 [ 7.41619849 14.83239697  0.          7.41619849  7.41619849 14.83239697]
Square form:
 [[ 0.          7.41619849 14.83239697  0.        ]
 [ 7.41619849  0.          7.41619849  7.41619849]
 [14.83239697  7.41619849  0.         14.83239697]
 [ 0.          7.41619849 14.83239697  0.        ]]
2
24
[3 8]
[ 2 12]
from scipy.spatial.distance import pdist, squareform
import numpy as np
def my_distcorr(X, Y):
    """ Compute the distance correlation function
    
    >>> a = [1,2,3,4,5]
    >>> b = np.array([2,4,6,8,10])
    >>> my_distcorr(a, b)
    1.0
    """
    X = np.atleast_1d(X)
    Y = np.atleast_1d(Y)
    if np.prod(X.shape) == len(X):
        X = X[:, None]
    if np.prod(Y.shape) == len(Y):
        Y = Y[:, None]
    X = np.atleast_2d(X)
    Y = np.atleast_2d(Y)
    n = X.shape[0]
    if Y.shape[0] != X.shape[0]:
        raise ValueError('Number of samples must match')
    a = squareform(pdist(X))
    b = squareform(pdist(Y))
    A = a - a.mean(axis=0)[None, :] - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0)[None, :] - b.mean(axis=1)[:, None] + b.mean()
    
    dcov2_xy = (A * B).sum()/float(n * n)
    dcov2_xx = (A * A).sum()/float(n * n)
    dcov2_yy = (B * B).sum()/float(n * n)
    dcor = np.sqrt(dcov2_xy)/np.sqrt(np.sqrt(dcov2_xx) * np.sqrt(dcov2_yy))
    return dcor

a = [1,2,3,4,5]
b = np.array([2,4,6,8,10])
print(my_distcorr(a, b))

# For a pair whose Pearson correlation is close to zero, the distance correlation still captures the dependence
my_distcorr(a, [i*i for i in a])

Output:

1.0
0.9869160440537483
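
As with MIC and the Pearson correlation above, the distance correlation can also be computed per feature against the target. The following is an illustrative sketch (not part of the original article) that assumes my_distcorr defined above and the iris data:

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
x = pd.DataFrame(iris.data).rename(columns={0: 'a', 1: 'b', 2: 'c', 3: 'd'})
y = iris.target

# Distance correlation of each iris feature against the target
dcor_score = pd.Series({col: my_distcorr(x[col].values, y) for col in x.columns},
                       name='DCOR')
print(dcor_score)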

1.5 Feature selection ------ model-based ranking (Model based ranking)

from sklearn.model_selection import cross_val_score, ShuffleSplit
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# Load the Boston housing dataset as an example
# (note: load_boston was removed in scikit-learn 1.2; run this with an older version or substitute another regression dataset)
boston = load_boston()
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]
 
rf = RandomForestRegressor(n_estimators=20, max_depth=4)
scores = []
# Build a model on each feature individually and evaluate it with cross-validation
for i in range(X.shape[1]):
    # note the difference between X[:, i] (1-D) and X[:, i:i+1] (2-D column)
    score = cross_val_score(rf, X[:, i:i+1], Y, scoring="r2",
                            cv=ShuffleSplit(n_splits=5, test_size=0.3),
                            n_jobs=-1)
    scores.append((format(np.mean(score), '.3f'), names[i]))
    
    
print(sorted(scores, reverse=True))

Output:

[('0.665', 'LSTAT'), ('0.545', 'RM'), ('0.394', 'NOX'), ('0.327', 'INDUS'), ('0.320', 'TAX'), ('0.318', 'PTRATIO'), ('0.201', 'CRIM'), ('0.179', 'ZN'), ('0.161', 'RAD'), ('0.124', 'DIS'), ('0.114', 'B'), ('0.078', 'AGE'), ('0.031', 'CHAS')]

1.6 Feature selection ------ recursive feature elimination (RFE)

Attributes:

support_: boolean mask indicating which features were selected.
ranking_: ranking position of each feature; selected features are assigned rank 1, and the worse a feature is, the higher its rank number.

# Without cross-validation
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
# Any estimator that exposes coef_ or feature_importances_ can be used as the base model
estimator = SVR(kernel="linear")
# estimator is the base model; n_features_to_select is the number of features to keep
selector = RFE(estimator, n_features_to_select=5)
selector = selector.fit(X, y)
print('selector.support_:\n',selector.support_)
print('selector.ranking_:\n',selector.ranking_)
print(selector)

Output:

selector.support_:
 [ True  True  True  True  True False False False False False]
selector.ranking_:
 [1 1 1 1 1 6 4 3 2 5]
RFE(estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='linear', max_iter=-1, shrinking=True, tol=0.001, verbose=False),
  n_features_to_select=5, step=1, verbose=0)
  
# With cross-validation
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
# Parameters:
# step: if greater than or equal to 1, `step` is the (integer) number of features to remove at each iteration;
#       if within (0.0, 1.0), `step` is the fraction (rounded down) of features to remove at each iteration.
selector = RFECV(estimator, step=1, cv=6)
selector = selector.fit(X, y)
print('selector.support_:\n',selector.support_)
print('selector.ranking_:\n',selector.ranking_)

Output:

selector.support_:
 [ True  True  True  True  True False False False False False]
selector.ranking_:
 [1 1 1 1 1 6 4 3 2 5]
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                          n_redundant=2, n_repeated=0, n_classes=8,
                          n_clusters_per_class=1, random_state=0)

# Create the RFE object and compute a cross-validated score.
svc = SVC(kernel="linear")
# The "accuracy" scoring is proportional to the number of correct
# classifications
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2),
              scoring='accuracy')
rfecv.fit(X, y)

print("Optimal number of features : %d" % rfecv.n_features_)

# Plot number of features VS. cross-validation scores
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
# (in scikit-learn >= 1.2, grid_scores_ is gone; use rfecv.cv_results_['mean_test_score'] instead)
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()

Output:

Optimal number of features : 3

1.7 Feature selection ------ L1-based feature selection

  • Linear models penalized with the L1 norm produce sparse solutions: most feature coefficients are 0. When the goal is to reduce dimensionality for another classifier, the features with non-zero coefficients can be selected with feature_selection.SelectFromModel.
  • In particular, the sparse estimators commonly used for this purpose are linear_model.Lasso (regression), and linear_model.LogisticRegression and svm.LinearSVC (classification).
  • SelectFromModel is a meta-transformer that can be used with any fitted estimator exposing a coef_ or feature_importances_ attribute. Features whose coef_ or feature_importances_ values fall below the given threshold are removed. Besides setting the threshold numerically, built-in heuristics can be requested with a string argument: "mean", "median", or a multiple of them such as "0.1*mean" (a short sketch using the "median" heuristic follows the first two examples below).
# svm.LinearSVC
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
iris = load_iris()
X, y = iris.data, iris.target
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print('X_shape:',X.shape)
print('X_new.shape:',X_new.shape)

Output:

X_shape: (150, 4)
X_new.shape: (150, 3)

## Logistic regression with an L1/L2 penalty as the base model for feature selection
# For SVMs and logistic regression, the parameter C controls sparsity: the smaller C is, the fewer features are selected. For Lasso, the larger alpha is, the fewer features are selected.
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression as LR
X_new = SelectFromModel(LR(C=0.1)).fit_transform(iris.data, iris.target)
print('X_new.shape:',X_new.shape)

Output:

X_new.shape: (150, 2)
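
The string heuristics for threshold mentioned above ("mean", "median", "0.1*mean") can also be used instead of a number. A minimal illustrative sketch, reusing the iris data and the LinearSVC base model from the first example:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

iris = load_iris()
X, y = iris.data, iris.target

# Keep only the features whose coefficient magnitude is at least the median importance
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True, threshold="median")
print('selected mask:', model.get_support())
print('X_new.shape:', model.transform(X).shape)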


# Use SelectFromModel meta-transformer along with Lasso to select the best couple of features from the Boston dataset.
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# Load the boston dataset.
boston = load_boston()
X, y = boston['data'], boston['target']

# We use the base estimator LassoCV since the L1 norm promotes sparsity of features.
clf = LassoCV()

# Set a minimum threshold of 0.25
sfm = SelectFromModel(clf, threshold=0.25)
sfm.fit(X, y)
n_features = sfm.transform(X).shape[1]

print('X_shape:',X.shape)
print('new_X_shape:',sfm.transform(X).shape)

# Reset the threshold till the number of features equals two.
# Note that the attribute can be set directly instead of repeatedly
# fitting the metatransformer.
while n_features > 2:
    sfm.threshold += 0.1
    X_transform = sfm.transform(X)
    n_features = X_transform.shape[1]

# Plot the selected two features from X.
plt.title(
    "Features selected from Boston using SelectFromModel with "
    "threshold %0.3f." % sfm.threshold)
feature1 = X_transform[:, 0]
feature2 = X_transform[:, 1] 
plt.plot(feature1, feature2, 'r.')
plt.xlabel("Feature number 1")
plt.ylabel("Feature number 2")
plt.ylim([np.min(feature2), np.max(feature2)])
plt.show()

Output:

X_shape: (506, 13)
new_X_shape: (506, 5)

1.8 Feature selection ------ randomized sparse models (Randomized sparse models)

  • A limitation of L1-based sparse models is that, when faced with a group of correlated features, they tend to select only one of them. To mitigate this, randomization techniques can be used: the sparse model is re-estimated many times on a perturbed design matrix, or the data is repeatedly subsampled, and one counts how often each feature gets selected.
  • RandomizedLasso implements this strategy for Lasso, while RandomizedLogisticRegression uses logistic regression and is suited to classification tasks. To obtain stability scores along the full regularization path, lasso_stability_path can be used. Note that these utilities have been removed from recent scikit-learn releases, so the original example at the end of this section requires an older version; the sketch right after this list re-implements the idea with current scikit-learn.
  • For randomized sparse models to be more powerful than the standard F statistic at detecting non-zero features, the ground-truth model needs to be sparse; in other words, only a small fraction of the features should be non-zero.
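
Because those classes are no longer available, the following is a minimal sketch of the same idea in current scikit-learn: fit Lasso repeatedly on random subsamples and count how often each feature is selected. The dataset, the subsample fraction, alpha and the number of repetitions are illustrative choices, not values from the original example; the original stability-path example follows.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Toy data with a sparse ground truth: only a few informative features
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=1.0, random_state=0)

rng = np.random.RandomState(42)
n_repeats, subsample = 100, 0.75
selection_counts = np.zeros(X.shape[1])

for _ in range(n_repeats):
    # Fit Lasso on a random subsample of the rows
    idx = rng.choice(len(X), size=int(subsample * len(X)), replace=False)
    lasso = Lasso(alpha=1.0).fit(X[idx], y[idx])
    selection_counts += (lasso.coef_ != 0)

# Stability score: fraction of subsamples in which each feature was selected
stability = selection_counts / n_repeats
print('features with stability score > 0.5:', np.where(stability > 0.5)[0])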
import matplotlib.pyplot as plt
import numpy as np
from scipy import linalg

from sklearn.linear_model import (RandomizedLasso, lasso_stability_path,
                                  LassoLarsCV)
from sklearn.feature_selection import f_regression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import auc, precision_recall_curve
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.utils.extmath import pinvh


def mutual_incoherence(X_relevant, X_irelevant):
    """Mutual incoherence, as defined by formula (26a) of [Wainwright2006].
    """
    projector = np.dot(np.dot(X_irelevant.T, X_relevant),
                       pinvh(np.dot(X_relevant.T, X_relevant)))
    return np.max(np.abs(projector).sum(axis=1))


for conditioning in (1, 1e-4):
    ###########################################################################
    # Simulate regression data with a correlated design
    n_features = 501
    n_relevant_features = 3
    noise_level = .2
    coef_min = .2
    # The Donoho-Tanner phase transition is around n_samples=25: below we
    # will completely fail to recover in the well-conditioned case
    n_samples = 25
    block_size = n_relevant_features

    rng = np.random.RandomState(42)

    # The coefficients of our model
    coef = np.zeros(n_features)
    coef[:n_relevant_features] = coef_min + rng.rand(n_relevant_features)

    # The correlation of our design: variables correlated by blocs of 3
    corr = np.zeros((n_features, n_features))
    for i in range(0, n_features, block_size):
        corr[i:i + block_size, i:i + block_size] = 1 - conditioning
    corr.flat[::n_features + 1] = 1
    corr = linalg.cholesky(corr)

    # Our design
    X = rng.normal(size=(n_samples, n_features))
    X = np.dot(X, corr)
    # Keep [Wainwright2006] (26c) constant
    X[:n_relevant_features] /= np.abs(
        linalg.svdvals(X[:n_relevant_features])).max()
    X = StandardScaler().fit_transform(X.copy())

    # The output variable
    y = np.dot(X, coef)
    y /= np.std(y)
    # We scale the added noise as a function of the average correlation
    # between the design and the output variable
    y += noise_level * rng.normal(size=n_samples)
    mi = mutual_incoherence(X[:, :n_relevant_features],
                            X[:, n_relevant_features:])

    ###########################################################################
    # Plot stability selection path, using a high eps for early stopping
    # of the path, to save computation time
    alpha_grid, scores_path = lasso_stability_path(X, y, random_state=42,
                                                   eps=0.05)

    plt.figure()
    # We plot the path as a function of alpha/alpha_max to the power 1/3: the
    # power 1/3 scales the path less brutally than the log, and enables to
    # see the progression along the path
    hg = plt.plot(alpha_grid[1:] ** .333, scores_path[coef != 0].T[1:], 'r')
    hb = plt.plot(alpha_grid[1:] ** .333, scores_path[coef == 0].T[1:], 'k')
    ymin, ymax = plt.ylim()
    plt.xlabel(r'$(\alpha / \alpha_{max})^{1/3}$')
    plt.ylabel('Stability score: proportion of times selected')
    plt.title('Stability Scores Path - Mutual incoherence: %.1f' % mi)
    plt.axis('tight')
    plt.legend((hg[0], hb[0]), ('relevant features', 'irrelevant features'),
               loc='best')

    ###########################################################################
    # Plot the estimated stability scores for a given alpha

    # Use 6-fold cross-validation rather than the default 3-fold: it leads to
    # a better choice of alpha:
    # Stop the user warnings outputs- they are not necessary for the example
    # as it is specifically set up to be challenging.
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', UserWarning)
        lars_cv = LassoLarsCV(cv=6).fit(X, y)

    # Run the RandomizedLasso: we use a path going down to .1*alpha_max
    # to avoid exploring the regime in which very noisy variables enter
    # the model
    alphas = np.linspace(lars_cv.alphas_[0], .1 * lars_cv.alphas_[0], 6)
    clf = RandomizedLasso(alpha=alphas, random_state=42).fit(X, y)
    trees = ExtraTreesRegressor(100).fit(X, y)
    # Compare with F-score
    F, _ = f_regression(X, y)

    plt.figure()
    for name, score in [('F-test', F),
                        ('Stability selection', clf.scores_),
                        ('Lasso coefs', np.abs(lars_cv.coef_)),
                        ('Trees', trees.feature_importances_),
                        ]:
        precision, recall, thresholds = precision_recall_curve(coef != 0,
                                                               score)
        plt.semilogy(np.maximum(score / np.max(score), 1e-4),
                     label="%s. AUC: %.3f" % (name, auc(recall, precision)))

    plt.plot(np.where(coef != 0)[0], [2e-4] * n_relevant_features, 'mo',
             label="Ground truth")
    plt.xlabel("Features")
    plt.ylabel("Score")
    # Plot only the 100 first coefficients
    plt.xlim(0, 100)
    plt.legend(loc='best')
    plt.title('Feature selection scores - Mutual incoherence: %.1f'
              % mi)

plt.show()

1.9 Feature selection ------ tree-based feature selection (Tree-based feature selection)

  • Tree-based estimators can be used to compute feature importances, which in turn can be used to discard irrelevant features (a short selection sketch follows the example and figure below):
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000,
                           n_features=10,
                           n_informative=3,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)

# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)

forest.fit(X, y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()

Output:

Feature ranking:
1. feature 1 (0.295902)
2. feature 2 (0.208351)
3. feature 0 (0.177632)
4. feature 3 (0.047121)
5. feature 6 (0.046303)
6. feature 8 (0.046013)
7. feature 7 (0.045575)
8. feature 4 (0.044614)
9. feature 9 (0.044577)
10. feature 5 (0.043912)

(Figure 2019061208.png: bar chart of the forest feature importances with inter-tree variability.)
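
To go from ranking to actual selection, the fitted forest can be passed to SelectFromModel, as mentioned in 1.7. A minimal sketch assuming the forest, X and y from the example just shown:

from sklearn.feature_selection import SelectFromModel

# Keep only the features whose importance is above the mean importance
sfm = SelectFromModel(forest, prefit=True, threshold="mean")
X_selected = sfm.transform(X)
print('X.shape:', X.shape)
print('X_selected.shape:', X_selected.shape)
print('selected feature indices:', np.where(sfm.get_support())[0])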

# Applied to the Olivetti faces dataset
from time import time
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_olivetti_faces
from sklearn.ensemble import ExtraTreesClassifier

# Number of cores to use to perform parallel fitting of the forest model
n_jobs = 1

# Load the faces dataset
data = fetch_olivetti_faces()
X = data.images.reshape((len(data.images), -1))
y = data.target

mask = y < 5  # Limit to 5 classes
X = X[mask]
y = y[mask]

# Build a forest and compute the pixel importances
print("Fitting ExtraTreesClassifier on faces data with %d cores..." % n_jobs)
t0 = time()
forest = ExtraTreesClassifier(n_estimators=1000,
                              max_features=128,
                              n_jobs=n_jobs,
                              random_state=0)

forest.fit(X, y)
print("done in %0.3fs" % (time() - t0))
importances = forest.feature_importances_
importances = importances.reshape(data.images[0].shape)

# Plot pixel importances
plt.matshow(importances, cmap=plt.cm.hot)
plt.title("Pixel importances with forests of trees")
plt.show()
