Data and features determine the upper bound of what machine learning can achieve; models and algorithms only approximate that bound. So what exactly is feature engineering? As the name suggests, it is essentially an engineering activity whose goal is to extract as many useful features as possible from raw data for algorithms and models to use. It is generally understood to cover several aspects; this post works through one of them, data preprocessing.
Reposted from:
[ http://blog.csdn.net/u010472823/article/details/53509658 ]
Data Preprocessing in Practice
Official documentation: [ http://scikit-learn.org/stable/modules/preprocessing.html ]
Country | Age | Salary | Purchased |
---|---|---|---|
France | 44 | 72000 | No |
Spain | 27 | 48000 | Yes |
Germany | 30 | 54000 | No |
Spain | 38 | 61000 | No |
Germany | 40 | | Yes |
France | 35 | 58000 | Yes |
Spain | | 52000 | No |
France | 48 | 79000 | Yes |
Germany | 50 | 83000 | No |
France | 37 | 67000 | Yes |
First, the table above shows that the sample data contains missing values. The usual options are to drop the affected rows or to fill in the gaps. Before filling in missing values, three concepts are worth recalling: mode, mean, and median.
Mode: the value that occurs most often in the data.
Mean: the arithmetic average of the data.
Median: the middle value after the data is sorted.
Which statistic to fill with depends on the scenario.
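As a quick illustrative sketch (not part of the original post), the three statistics can be computed on the non-missing Age values from the table using numpy and scipy:

import numpy as np
from scipy import stats

# Non-missing Age values from the table above
ages = np.array([44, 27, 30, 38, 40, 35, 48, 50, 37])
print(np.mean(ages))     # mean = 38.777..., the value used to fill the missing Age below
print(np.median(ages))   # median = 38.0
print(stats.mode(ages))  # mode; every age appears once here, so the smallest value is reported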
First, we need to import the Imputer class from the sklearn.preprocessing package:
from sklearn.preprocessing import Imputer
# Replace NaN entries with the mean of each column
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])
strategy采用均值策略,填補(bǔ)上述數(shù)據(jù)中的2,3 兩列。axis = 0指列
strategy : string, optional (default="mean")
The imputation strategy.
- If "mean", then replace missing values using the mean along
the axis.
- If "median", then replace missing values using the median along
the axis.
- If "most_frequent", then replace missing using the most frequent
value along the axis.
>>> print(X)
[['France' 44.0 72000.0]
['Spain' 27.0 48000.0]
['Germany' 30.0 54000.0]
['Spain' 38.0 61000.0]
['Germany' 40.0 63777.77777777778]
['France' 35.0 58000.0]
['Spain' 38.77777777777778 52000.0]
['France' 48.0 79000.0]
['Germany' 50.0 83000.0]
['France' 37.0 67000.0]]
這里采用均值的策略補(bǔ)全了缺失數(shù)據(jù)。
Since X[:, 0] (the Country column) and y (Purchased) are categorical, they need to be label-encoded, using LabelEncoder from the sklearn.preprocessing package.
from sklearn.preprocessing import LabelEncoder
>>> X[:,0] = LabelEncoder().fit_transform(X[:,0])
>>> y = LabelEncoder().fit_transform(y)
>>> print(y)
[0 1 0 0 1 1 0 1 0 1]
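Keeping a reference to the encoder, instead of the throwaway instance above, lets you inspect the learned mapping via its classes_ attribute. A sketch of the same encoding step, written to replace the one-liner for X[:, 0]:

le_country = LabelEncoder()
X[:, 0] = le_country.fit_transform(X[:, 0])
print(le_country.classes_)  # ['France' 'Germany' 'Spain'], encoded as 0, 1, 2 (alphabetical order)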
Label encoding is fine for the target variable, but for categorical features it is not enough: the Country column is now encoded as 0-2, which implies an ordering that does not exist. The feature also needs dummy coding, i.e. one-hot encoding, using OneHotEncoder from the sklearn.preprocessing package.
from sklearn.preprocessing import OneHotEncoder
>>> X = OneHotEncoder(categorical_features=[0]).fit_transform(X).toarray()
>>> print(X)
[[ 1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01
7.20000000e+04]
[ 0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01
4.80000000e+04]
[ 0.00000000e+00 1.00000000e+00 0.00000000e+00 3.00000000e+01
5.40000000e+04]
[ 0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01
6.10000000e+04]
[ 0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01
6.37777778e+04]
[ 1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01
5.80000000e+04]
[ 0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01
5.20000000e+04]
[ 1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01
7.90000000e+04]
[ 0.00000000e+00 1.00000000e+00 0.00000000e+00 5.00000000e+01
8.30000000e+04]
[ 1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01
6.70000000e+04]]
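For reference, pandas offers an equivalent one-step dummy coding via get_dummies. A sketch on a made-up toy DataFrame (not the running example above):

import pandas as pd

# Toy frame with the same kind of Country column as the table above
df = pd.DataFrame({'Country': ['France', 'Spain', 'Germany'], 'Age': [44, 27, 30]})
df_encoded = pd.get_dummies(df, columns=['Country'])
print(df_encoded)  # Age plus Country_France / Country_Germany / Country_Spain indicator columns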
For regression we also need to scale the feature values, since Age and Salary live on very different scales.
from sklearn.preprocessing import StandardScaler
sd = StandardScaler().fit(X)
X = sd.transform(X)
print(X)
[[ 1.22474487e+00 -6.54653671e-01 -6.54653671e-01 7.58874362e-01
7.49473254e-01]
[ -8.16496581e-01 -6.54653671e-01 1.52752523e+00 -1.71150388e+00
-1.43817841e+00]
[ -8.16496581e-01 1.52752523e+00 -6.54653671e-01 -1.27555478e+00
-8.91265492e-01]
[ -8.16496581e-01 -6.54653671e-01 1.52752523e+00 -1.13023841e-01
-2.53200424e-01]
[ -8.16496581e-01 1.52752523e+00 -6.54653671e-01 1.77608893e-01
6.63219199e-16]
[ 1.22474487e+00 -6.54653671e-01 -6.54653671e-01 -5.48972942e-01
-5.26656882e-01]
[ -8.16496581e-01 -6.54653671e-01 1.52752523e+00 0.00000000e+00
-1.07356980e+00]
[ 1.22474487e+00 -6.54653671e-01 -6.54653671e-01 1.34013983e+00
1.38753832e+00]
[ -8.16496581e-01 1.52752523e+00 -6.54653671e-01 1.63077256e+00
1.75214693e+00]
[ 1.22474487e+00 -6.54653671e-01 -6.54653671e-01 -2.58340208e-01
2.93712492e-01]]
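StandardScaler standardizes each column to zero mean and unit variance, i.e. x' = (x - mean) / std, which is easy to verify on the scaled array (a quick sketch, not from the original post):

print(X.mean(axis=0))  # each column mean is (numerically) 0
print(X.std(axis=0))   # each column standard deviation is 1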
Other scaling methods include MinMaxScaler(), which rescales each feature to a given range (by default [0, 1]) via (x - min) / (max - min):
>>> from sklearn import preprocessing
>>> min_max_scaler = preprocessing.MinMaxScaler()
>>> X = min_max_scaler.fit_transform(X)
The normalization method Normalizer() rescales each sample (feature vector) to unit length:
>>> normalizer = preprocessing.Normalizer().fit(X) # fit does nothing
>>> normalizer.transform(X)
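Normalizer works row-wise: each sample is divided by its own (by default L2) norm, so every row of the result has length 1. A quick check, continuing the example above (a sketch):

import numpy as np
X_normalized = normalizer.transform(X)
print(np.linalg.norm(X_normalized, axis=1))  # every row norm is 1.0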
That completes the basic data preprocessing; the next steps are model training and prediction.
A few other common operations:
#df之間合并
df = pd.concat([df1,df2])
#查看df的信息
df.info()
#查看各個(gè)維度的統(tǒng)計(jì)數(shù)據(jù),各個(gè)對(duì)象名稱
df.describe()
df.describe(include='o').columns
#統(tǒng)計(jì)某個(gè)維度的個(gè)數(shù)
print train_df['column_name'].value_counts()
#屬性列刪除
df= df.drop(['Name'], axis=1)
#刪除列中重復(fù)數(shù)據(jù),刪除某一列重復(fù)的數(shù)據(jù)
df = df.drop_duplicates()
df = df.drop_duplicates('columns_name')
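A minimal, self-contained demonstration of these pandas calls (the toy DataFrames here are made up for illustration):

import pandas as pd

df1 = pd.DataFrame({'Name': ['A', 'B'], 'Age': [20, 30]})
df2 = pd.DataFrame({'Name': ['B', 'C'], 'Age': [30, 40]})

df = pd.concat([df1, df2])        # 4 rows, including one duplicate ('B', 30)
df.info()                         # dtypes and non-null counts
print(df['Name'].value_counts())  # B appears twice, A and C once
df = df.drop_duplicates()         # removes the duplicated ('B', 30) row
df = df.drop(['Age'], axis=1)     # drop the Age column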