Data and features determine the upper bound of what machine learning can achieve; models and algorithms only approximate that bound. So what exactly is feature engineering? As the name suggests, it is essentially an engineering activity whose goal is to extract as many useful features as possible from raw data for algorithms and models to use. It is generally understood to cover several aspects; this post works through one of them, data preprocessing.
Reposted from:
[ http://blog.csdn.net/u010472823/article/details/53509658 ]
Data Preprocessing in Practice
Official documentation: [ http://scikit-learn.org/stable/modules/preprocessing.html ]
Country | Age | Salary | Purchased |
---|---|---|---|
France | 44 | 72000 | No |
Spain | 27 | 48000 | Yes |
Germany | 30 | 54000 | No |
Spain | 38 | 61000 | No |
Germany | 40 | | Yes |
France | 35 | 58000 | Yes |
Spain | | 52000 | No |
France | 48 | 79000 | Yes |
Germany | 50 | 83000 | No |
France | 37 | 67000 | Yes |
First, the table above shows that the sample data contains missing values. The usual options are to drop the affected rows or to fill in the gaps. Before filling in missing values, three concepts are worth recalling: mode, mean, and median.
Mode: the value that occurs most often in the data.
Mean: the arithmetic average of the data.
Median: the middle value after the data is sorted.
Which statistic to fill with depends on the scenario.
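As a quick illustrative sketch (not part of the original post), the three statistics can be computed on the non-missing Age values from the table using numpy and scipy:

import numpy as np
from scipy import stats

# Non-missing Age values from the table above
ages = np.array([44, 27, 30, 38, 40, 35, 48, 50, 37])
print(np.mean(ages))     # mean = 38.777..., the value used to fill the missing Age below
print(np.median(ages))   # median = 38.0
print(stats.mode(ages))  # mode; every age appears once here, so the smallest value is reported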
First, we need to import the Imputer class from the sklearn.preprocessing package:
from sklearn.preprocessing import Imputer
# Replace NaN entries with the mean of each column
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])
strategy采用均值策略,填補(bǔ)上述數(shù)據(jù)中的2,3 兩列。axis = 0指列
strategy : string, optional (default="mean")
The imputation strategy.
- If "mean", then replace missing values using the mean along
the axis.
- If "median", then replace missing values using the median along
the axis.
- If "most_frequent", then replace missing using the most frequent
value along the axis.
>>> print(X)
[['France' 44.0 72000.0]
['Spain' 27.0 48000.0]
['Germany' 30.0 54000.0]
['Spain' 38.0 61000.0]
['Germany' 40.0 63777.77777777778]
['France' 35.0 58000.0]
['Spain' 38.77777777777778 52000.0]
['France' 48.0 79000.0]
['Germany' 50.0 83000.0]
['France' 37.0 67000.0]]
這里采用均值的策略補(bǔ)全了缺失數(shù)據(jù)。
Since X[:, 0] (the Country column) and y (Purchased) are categorical, they need to be label-encoded, using LabelEncoder from the sklearn.preprocessing package.
from sklearn.preprocessing import LabelEncoder
>>> X[:,0] = LabelEncoder().fit_transform(X[:,0])
>>> y = LabelEncoder().fit_transform(y)
>>> print(y)
[0 1 0 0 1 1 0 1 0 1]
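Keeping a reference to the encoder, instead of the throwaway instance above, lets you inspect the learned mapping via its classes_ attribute. A sketch of the same encoding step, written to replace the one-liner for X[:, 0]:

le_country = LabelEncoder()
X[:, 0] = le_country.fit_transform(X[:, 0])
print(le_country.classes_)  # ['France' 'Germany' 'Spain'], encoded as 0, 1, 2 (alphabetical order)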
Label encoding is fine for the target variable, but for categorical features it is not enough: the Country column is now encoded as 0-2, which implies an ordering that does not exist. The feature also needs dummy coding, i.e. one-hot encoding, using OneHotEncoder from the sklearn.preprocessing package.
from sklearn.preprocessing import OneHotEncoder
>>> X = OneHotEncoder(categorical_features=[0]).fit_transform(X).toarray()
>>> print(X)
[[ 1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01
7.20000000e+04]
[ 0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01
4.80000000e+04]
[ 0.00000000e+00 1.00000000e+00 0.00000000e+00 3.00000000e+01
5.40000000e+04]
[ 0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01
6.10000000e+04]
[ 0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01
6.37777778e+04]
[ 1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01
5.80000000e+04]
[ 0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01
5.20000000e+04]
[ 1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01
7.90000000e+04]
[ 0.00000000e+00 1.00000000e+00 0.00000000e+00 5.00000000e+01
8.30000000e+04]
[ 1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01
6.70000000e+04]]
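For reference, pandas offers an equivalent one-step dummy coding via get_dummies. A sketch on a made-up toy DataFrame (not the running example above):

import pandas as pd

# Toy frame with the same kind of Country column as the table above
df = pd.DataFrame({'Country': ['France', 'Spain', 'Germany'], 'Age': [44, 27, 30]})
df_encoded = pd.get_dummies(df, columns=['Country'])
print(df_encoded)  # Age plus Country_France / Country_Germany / Country_Spain indicator columns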
For regression we also need to scale the feature values, since Age and Salary live on very different scales.
from sklearn.preprocessing import StandardScaler
sd = StandardScaler().fit(X)
X = sd.transform(X)
print(X)
[[ 1.22474487e+00 -6.54653671e-01 -6.54653671e-01 7.58874362e-01
7.49473254e-01]
[ -8.16496581e-01 -6.54653671e-01 1.52752523e+00 -1.71150388e+00
-1.43817841e+00]
[ -8.16496581e-01 1.52752523e+00 -6.54653671e-01 -1.27555478e+00
-8.91265492e-01]
[ -8.16496581e-01 -6.54653671e-01 1.52752523e+00 -1.13023841e-01
-2.53200424e-01]
[ -8.16496581e-01 1.52752523e+00 -6.54653671e-01 1.77608893e-01
6.63219199e-16]
[ 1.22474487e+00 -6.54653671e-01 -6.54653671e-01 -5.48972942e-01
-5.26656882e-01]
[ -8.16496581e-01 -6.54653671e-01 1.52752523e+00 0.00000000e+00
-1.07356980e+00]
[ 1.22474487e+00 -6.54653671e-01 -6.54653671e-01 1.34013983e+00
1.38753832e+00]
[ -8.16496581e-01 1.52752523e+00 -6.54653671e-01 1.63077256e+00
1.75214693e+00]
[ 1.22474487e+00 -6.54653671e-01 -6.54653671e-01 -2.58340208e-01
2.93712492e-01]]
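StandardScaler standardizes each column to zero mean and unit variance, i.e. x' = (x - mean) / std, which is easy to verify on the scaled array (a quick sketch, not from the original post):

print(X.mean(axis=0))  # each column mean is (numerically) 0
print(X.std(axis=0))   # each column standard deviation is 1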
Other scaling methods include MinMaxScaler(), which rescales each feature to a given range (by default [0, 1]) via (x - min) / (max - min):
>>> from sklearn import preprocessing
>>> min_max_scaler = preprocessing.MinMaxScaler()
>>> X = min_max_scaler.fit_transform(X)
The normalization method Normalizer() rescales each sample (feature vector) to unit length:
>>> normalizer = preprocessing.Normalizer().fit(X) # fit does nothing
>>> normalizer.transform(X)
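Normalizer works row-wise: each sample is divided by its own (by default L2) norm, so every row of the result has length 1. A quick check, continuing the example above (a sketch):

import numpy as np
X_normalized = normalizer.transform(X)
print(np.linalg.norm(X_normalized, axis=1))  # every row norm is 1.0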
That completes the basic data preprocessing; the next steps are model training and prediction.
A few other common operations:
#df之間合并
df = pd.concat([df1,df2])
#查看df的信息
df.info()
#查看各個(gè)維度的統(tǒng)計(jì)數(shù)據(jù),各個(gè)對(duì)象名稱
df.describe()
df.describe(include='o').columns
#統(tǒng)計(jì)某個(gè)維度的個(gè)數(shù)
print train_df['column_name'].value_counts()
#屬性列刪除
df= df.drop(['Name'], axis=1)
#刪除列中重復(fù)數(shù)據(jù),刪除某一列重復(fù)的數(shù)據(jù)
df = df.drop_duplicates()
df = df.drop_duplicates('columns_name')
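A minimal, self-contained demonstration of these pandas calls (the toy DataFrames here are made up for illustration):

import pandas as pd

df1 = pd.DataFrame({'Name': ['A', 'B'], 'Age': [20, 30]})
df2 = pd.DataFrame({'Name': ['B', 'C'], 'Age': [30, 40]})

df = pd.concat([df1, df2])        # 4 rows, including one duplicate ('B', 30)
df.info()                         # dtypes and non-null counts
print(df['Name'].value_counts())  # B appears twice, A and C once
df = df.drop_duplicates()         # removes the duplicated ('B', 30) row
df = df.drop(['Age'], axis=1)     # drop the Age column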