Basic steps

Look at the big picture
Get the data
Visualize the data and look for patterns
Prepare the data for machine learning algorithms
Select a model and train it
Fine-tune the model
Present the solution
Maintain the system

Sources of real data

UC Irvine Machine Learning Repository
Kaggle datasets
AWS datasets
Data Portals
opendatamonitor.eu
quandl
This article uses the California housing prices dataset from StatLib, shown in the figure below.

California housing prices

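1. Look at the big picture

The task is to build a model of California housing prices from California census data. The data includes metrics such as the population, median income, and median house price for each block group.
The model should learn from this data and then predict the median house price of any district from the other metrics.

Frame the problem

What is the business objective? How will the system be used?

Your boss tells you that the model's output (the median house price of a district) will be fed to another machine learning system, along with other signals. That downstream system decides whether a district is worth investing in. Getting this decision right matters because it directly affects profit.

What does the current solution look like, if there is one?

Your boss says that district prices are currently estimated by hand: a team of experts gathers up-to-date information about a district (excluding the median house price) and applies complex rules to produce an estimate. This is costly and slow, and the estimates are not very good.

Decide what kind of machine learning problem this is

This is a typical supervised learning problem, because every instance comes with a label: the district's median house price.
It is also a typical regression problem, specifically a multivariate regression problem: several features (population, income, and so on) are used to predict a single value.
Finally, it is a batch learning problem, because the dataset is small enough to fit in memory.

Select a performance measure

A typical performance measure for regression problems is the root mean square error (RMSE):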

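RMSE(X, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2}

m: the number of instances
x^{(i)}: the feature vector of instance i
y^{(i)}: the label of instance i
h: the system's prediction function, also called the hypothesis

Another performance measure is the mean absolute error (MAE, also called the average absolute deviation):

MAE(X, h) = \frac{1}{m} \sum_{i=1}^{m} \left| h(x^{(i)}) - y^{(i)} \right|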
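
As a minimal sketch of the two measures, the snippet below computes them in plain Python on made-up house values (the numbers and variable names are illustrative assumptions, not taken from the dataset). Both prediction sets have the same MAE, but the one with a single large error has twice the RMSE, which shows how RMSE weights large errors more heavily.

```python
import math

def rmse(labels, predictions):
    """Root mean square error: square root of the mean squared difference."""
    m = len(labels)
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(predictions, labels)) / m)

def mae(labels, predictions):
    """Mean absolute error: mean of the absolute differences."""
    m = len(labels)
    return sum(abs(p - y) for p, y in zip(predictions, labels)) / m

# Hypothetical median house values and two sets of predictions.
labels = [200000.0, 150000.0, 300000.0, 250000.0]
even_errors = [210000.0, 140000.0, 310000.0, 240000.0]  # every prediction off by 10,000
one_outlier = [200000.0, 150000.0, 300000.0, 290000.0]  # one prediction off by 40,000

print(rmse(labels, even_errors), mae(labels, even_errors))
print(rmse(labels, one_outlier), mae(labels, one_outlier))
```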

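2. Get the data

https://github.com/ageron/handson-ml/tree/master/datasets

Create the workspace

If you need an isolated working environment, look up how to use virtualenv. Roughly:

# Install virtualenv
pip install --user --upgrade virtualenv
# Create an isolated Python environment so that library versions in different projects do not conflict
virtualenv myenv
# Activate myenv (this is the Windows layout; on Linux/macOS run: source myenv/bin/activate)
myenv/Scripts/activate

In PyCharm you can also create and select a virtualenv when choosing the Python interpreter.

Setting up a virtualenv in PyCharm

For Python environment setup, see the separate Jianshu post; the libraries needed include numpy, pandas, matplotlib, and scikit-learn.
For JupyterLab usage, see the Jianshu post on that topic.
To change the pip index mirror, see the linked article.

pip install --upgrade matplotlib numpy pandas scipy scikit-learn jupyter jupyterlab

Download the data

Download the data archive and extract it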

import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL):
    if not os.path.isdir(HOUSING_PATH):
        os.makedirs(HOUSING_PATH)
    tgz_path = os.path.join(HOUSING_PATH, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=HOUSING_PATH)
    housing_tgz.close()


fetch_housing_data()

Take a look at the data

Eyeball the data

import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

raw_data = load_housing_data()
raw_data.head()

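raw_data.head() shows the first five rows of the data

Each row represents one district. There are ten attributes: longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, ocean_proximity.

Look at the overall structure of the data

raw_data.info()

raw_data.info() shows the data types and basic information

Note that total_bedrooms does not have a value for every row (20433 of 20640), so it needs care during processing.
ocean_proximity is clearly a categorical attribute; the following lists all of its categories.

raw_data["ocean_proximity"].value_counts()

Categories of ocean_proximity

Look at summary statistics for all attributes

raw_data.describe()

Summary statistics for each attribute of raw_data

Visualize the data with matplotlib

import matplotlib.pyplot as plt

# Histograms with 50 bins; figsize is in inches (2000x1000 pixels at 100 dpi)
raw_data.hist(bins=50, figsize=(20, 10))
plt.show()

Histograms of each attribute

Create a test set

Randomly set aside 20% of the data as the test set.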

import numpy as np

def split_train_test(data, test_ratio):
    # Seeding the generator makes the permutation, and therefore the test set, identical on every run.
    np.random.seed(714)
    '''
    numpy.random.shuffle and numpy.random.permutation both randomly reorder an array.
    shuffle works in place: it changes the original array and returns nothing.
    permutation leaves the original untouched and returns a new, shuffled array.
    Here only the row positions are shuffled, not the data itself.
    '''
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    # iloc selects rows by integer position, counting from 0
    return data.iloc[train_indices], data.iloc[test_indices]

train_set, test_set = split_train_test(raw_data, 0.2)
print(len(train_set), "train +", len(test_set), "test")
#16512 train + 4128 test

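This raises a question: if the source data is updated, how do you keep the test set stable?
A common solution is to use each instance's ID to decide whether it belongs in the test set (assuming every instance has a unique, immutable ID). For example, compute a hash of each instance's ID, keep only the last byte, and put the instance in the test set if that byte is less than or equal to 51 (about 20% of 256).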

import hashlib

def test_set_check(identifier, test_ratio, hash):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]

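# Add an index column, starting from 0
raw_data_with_id = raw_data.reset_index()
train_set, test_set = split_train_test_by_id(raw_data_with_id, 0.2, "index")
'''
 A row index is only a stable ID if new data is always appended and no row is ever deleted.
 A better choice is an ID built from attributes that never change, such as longitude and latitude:
 raw_data_with_id["id"] = raw_data["longitude"] * 1000 + raw_data["latitude"]
 train_set, test_set = split_train_test_by_id(raw_data_with_id, 0.2, "id")

Basic hashlib usage (update() takes bytes in Python 3):

import hashlib
md5 = hashlib.md5()
md5.update('how to use md5 in python hashlib?'.encode('utf-8'))
print(md5.hexdigest())
'''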
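
To see why a hash of the ID keeps the test set stable, here is a small self-contained sketch. It is a variation on the function above, not the same code: it hashes the ID's string form, and the ID ranges are made up for illustration. After 500 new instances are added, every old instance stays on the same side of the split.

```python
import hashlib

def in_test_set(identifier, test_ratio=0.2):
    """Return True if the last byte of the MD5 hash of the ID is below 256 * test_ratio."""
    digest = hashlib.md5(str(identifier).encode("utf-8")).digest()
    return digest[-1] < 256 * test_ratio

original_ids = list(range(1000))
updated_ids = list(range(1500))  # the same 1000 IDs plus 500 new ones

test_before = {i for i in original_ids if in_test_set(i)}
test_after = {i for i in updated_ids if in_test_set(i)}

# Every instance that was in the test set before the update is still there,
# and no old instance has moved from the training set into the test set.
print(test_before <= test_after)
print({i for i in test_after if i < 1000} == test_before)
# The share of instances in the test set is close to, but not exactly, test_ratio.
print(len(test_before) / len(original_ids))
```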

Scikit-Learn provides functions for splitting datasets; the simplest is train_test_split

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(raw_data, test_size=0.2, random_state=714)
print(len(train_set), "train +", len(test_set), "test")
# 16512 train + 4128 test

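Stratified sampling (the one to use)

So far the sampling has been purely random. That is fine when the dataset is very large, but with a smaller dataset random sampling can introduce sampling bias, so stratified sampling is preferable: take the right number of instances from each stratum so that the test set is representative.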

import numpy as np

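'''
 Based on the distribution of median_income in the histogram above, add a column income_cat that maps the data to categories 1 to 5
'''
raw_data["income_cat"] = np.ceil(raw_data["median_income"] / 1.5) # ceil rounds up
# where(cond, other): keep values where cond is True, otherwise use other. Assign the result back rather than
# calling where(..., inplace=True) on the selected column, which may not update raw_data in recent pandas versions.
raw_data["income_cat"] = raw_data["income_cat"].where(raw_data["income_cat"] < 5, 5.0)

raw_data["income_cat"].value_counts() / len(raw_data) # proportion of each income category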
#  3.0    0.350581
#  2.0    0.318847
#  4.0    0.176308
#  5.0    0.114438
#  1.0    0.039826
#  Name: income_cat, dtype: float64

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42) # n_splits: number of train/test pairs to generate
for train_index, test_index in split.split(raw_data, raw_data["income_cat"]): # split.split(X, y, groups=None) stratifies X by y
    strat_train_set = raw_data.loc[train_index]
    strat_test_set = raw_data.loc[test_index]
'''
Difference between iloc and loc in pandas:
  iloc indexes by integer position (0, 1, 2, ...). loc indexes by label, that is, by the values in the index, which may themselves be integers.
  split() returns positions; loc works here only because raw_data has the default 0..n-1 index, where label and position coincide.
'''

# Finally, remove the added income_cat column from both sets
for set_ in (strat_train_set, strat_test_set):
    set_.drop(["income_cat"], axis=1, inplace=True)
# drop removes rows by default; pass axis=1 to remove columns. It normally returns a new DataFrame, but with inplace=True it modifies the DataFrame in place and returns None.
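
The idea behind stratified sampling can be sketched in plain Python. This is an illustration of the principle only, not how StratifiedShuffleSplit is implemented; the function names and the toy category counts (chosen to roughly match the proportions printed above) are assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_split(categories, test_ratio, seed):
    """Split positions into train/test so each category keeps its share in the test set."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for position, category in enumerate(categories):
        by_category[category].append(position)
    train, test = [], []
    for positions in by_category.values():
        rng.shuffle(positions)
        n_test = round(len(positions) * test_ratio)
        test.extend(positions[:n_test])
        train.extend(positions[n_test:])
    return train, test

def proportions(categories, positions):
    """Share of each category among the given positions."""
    counts = Counter(categories[p] for p in positions)
    return {c: counts[c] / len(positions) for c in sorted(counts)}

# Hypothetical income categories for 1000 districts.
categories = [1] * 40 + [2] * 320 + [3] * 350 + [4] * 175 + [5] * 115
train, test = stratified_split(categories, test_ratio=0.2, seed=42)

print(proportions(categories, range(len(categories))))  # whole dataset
print(proportions(categories, test))                    # test set
```

Because the test instances are drawn category by category, the two printed dictionaries show the same proportions; a purely random split would only match them approximately.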

3. Visualize the data and explore patterns

Make a copy of the data

housing = strat_train_set.copy()

Visualization

The most intuitive starting point is a scatter plot of longitude against latitude.

housing.plot(kind="scatter", x="longitude", y="latitude")

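Scatter plot of geographic locations

Setting alpha to 0.1 reveals where the data points are densest.

Scatter plot of geographic density

Next, add population and price information: the radius of each circle represents the population, and its color represents the house price (median_house_value).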

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=housing["population"]/100, label="population",
    c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True) 
plt.legend()

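Scatter plot of house prices and population

Patterns visible by eye: house prices are closely related to population density, and proximity to the ocean is also a useful attribute.

Look for correlations

What different correlation coefficients look like

Method 1

# Compute the standard correlation coefficient (Pearson's r) between every pair of numeric attributes.
# numeric_only=True skips the text column ocean_proximity; pandas 2.0 and later raise an error without it.
corr_matrix = housing.corr(numeric_only=True)
# Look at how each attribute correlates with median_house_value; values lie in [-1, 1].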
corr_matrix["median_house_value"].sort_values(ascending=False)

Correlation of each attribute with median_house_value
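
For intuition about what the coefficient measures, here is a hand-written Pearson correlation in plain Python on toy data (the function name and the numbers are illustrative assumptions). The three cases give 1, -1, and 0: the last one is a reminder that the coefficient only captures linear relationships, so a value near 0 does not mean the attributes are unrelated.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
print(pearson_r(xs, [2.0, 4.0, 6.0, 8.0, 10.0]))  # perfectly linear, increasing
print(pearson_r(xs, [10.0, 8.0, 6.0, 4.0, 2.0]))  # perfectly linear, decreasing
print(pearson_r(xs, [4.0, 1.0, 0.0, 1.0, 4.0]))   # y = (x - 3)^2: related, but not linearly
```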

Method 2

from pandas.plotting import scatter_matrix
# Plot the pairwise relationships among these four attributes
attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))

Correlations visualized with pandas.plotting.scatter_matrix

median_income has the strongest correlation with house price, so look at that plot on its own.

housing.plot(kind="scatter", x="median_income",y="median_house_value", alpha=0.1)

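median_income versus median_house_value

Experimenting with attribute combinations

The last thing to do before preparing data for the algorithms is to try out attribute combinations. For example, the total number of rooms in a district is not very useful if you don't know how many households there are; what you really want is the number of rooms per household. Likewise, the total number of bedrooms is not very informative by itself: you probably want to compare it with the number of rooms. The population per household is another interesting combination. Let's create these new attributes: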

housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

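corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)

Correlation of the new attributes with house price

Summary

This round of exploration does not need to be exhaustive; the goal is to start off on the right foot and quickly gain insights that lead to a reasonable prototype. It is an iterative process: once the prototype is up and running, you can analyze its output, gain more insights, and come back to this exploration step.

4. Prepare the data for machine learning algorithms

First, separate the attributes from the labels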

housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

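Data cleaning

Earlier we noticed that some districts are missing the total_bedrooms attribute.
There are three ways to handle this

# Each call below returns a new object and leaves housing unchanged; assign the result to keep it.
# Option 1: drop the corresponding districts
housing.dropna(subset=["total_bedrooms"]) 
# Option 2: drop the whole attribute
housing.drop("total_bedrooms", axis=1) 
# Option 3: fill in a value (0, the mean, the median, and so on)
median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median)

Scikit-Learn provides a handy class for missing values: SimpleImputer (it replaced the Imputer class from sklearn.preprocessing, which was removed in version 0.22)

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
# Medians can only be computed for numeric attributes, so make a copy of the data without the text attribute ocean_proximity
housing_num = housing.drop("ocean_proximity", axis=1)
# fit() computes the median of every attribute and stores the results in the statistics_ instance variable.
imputer.fit(housing_num)
# Use this "trained" imputer to transform the training set, replacing missing values with the learned medians
X = imputer.transform(housing_num)
# The result is a plain NumPy array of transformed features. Put it back into a pandas DataFrame, keeping the original row index.
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)

Handling text and categorical attributes

Most machine learning algorithms prefer to work with numbers, so convert the text to numbers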

from sklearn.preprocessing import OrdinalEncoder

housing_cat = housing[['ocean_proximity']]
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]
'''array([[0.],
       [0.],
       [4.],
       [1.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.]])'''
ordinal_encoder.categories_
# [array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'], dtype=object)]
'''
The problem with this representation is that ML algorithms will assume two nearby values are more similar than two distant values, which is clearly not the case here.
A common fix is to create one binary attribute per category:
    one attribute is 1 when the category is <1H OCEAN (0 otherwise), another is 1 when the category is INLAND (0 otherwise), and so on.
This is called one-hot encoding, because only one attribute is 1 (hot) while the others are 0 (cold).
'''
# OneHotEncoder converts categorical values into one-hot vectors. fit_transform() expects a 2D array; OrdinalEncoder already returns one, so the reshape(-1, 1) below is a harmless no-op
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
housing_cat_1hot # the output is a SciPy sparse matrix
#<16512x5 sparse matrix of type '<class 'numpy.float64'>'
#   with 16512 stored elements in Compressed Sparse Row format>

housing_cat_1hot.toarray() # convert to a dense array; alternatively construct the encoder with OneHotEncoder(sparse_output=False) (sparse=False before scikit-learn 1.2)
'''
array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       ...,
       [0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])
'''
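
To make the encoding concrete, here is a minimal hand-rolled version in plain Python. It is a sketch of the idea, not of OneHotEncoder's internals; the function name and the six sample values are assumptions. As in the categories_ output above, the categories are taken in sorted order.

```python
def one_hot_encode(values):
    """Return (categories, rows): sorted unique categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    column_of = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0.0] * len(categories)
        row[column_of[value]] = 1.0  # exactly one "hot" column per row
        rows.append(row)
    return categories, rows

sample = ["<1H OCEAN", "NEAR OCEAN", "INLAND", "<1H OCEAN", "NEAR BAY", "ISLAND"]
categories, rows = one_hot_encode(sample)
print(categories)
for value, row in zip(sample, rows):
    print(value, row)
```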

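Custom transformers

Although Scikit-Learn provides many useful transformers, you will sometimes need to write your own for tasks such as custom cleanup operations or combining attributes. You want your transformer to work seamlessly with Scikit-Learn components such as pipelines. Because Scikit-Learn relies on duck typing rather than inheritance, all you need to do is create a class and implement three methods: fit() (returning self), transform(), and fit_transform(). You get the last one for free by adding TransformerMixin as a base class. If you also add BaseEstimator as a base class (and avoid *args and **kwargs in the constructor), you get two extra methods, get_params() and set_params(), which are useful for automatic hyperparameter tuning. For example, here is a small transformer class that adds the combined attributes discussed above: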

from sklearn.base import BaseEstimator, TransformerMixin

# column index
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

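Feature scaling

Min-max scaling (normalization)

Values are shifted and rescaled by subtracting the minimum and dividing by the difference between the maximum and the minimum, so that they end up in the range 0 to 1. Scikit-Learn provides a transformer for this, MinMaxScaler. Its feature_range hyperparameter lets you change the range if you don't want 0 to 1.

Standardization

First subtract the mean (so standardized values always have a mean of 0), then divide by the standard deviation so that the resulting distribution has unit variance. Standardization is much less affected by outliers than min-max scaling. Scikit-Learn provides a transformer for this, StandardScaler.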
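
The two scaling methods can be sketched in a few lines of plain Python. The function names and the income values are made up for illustration; the standard deviation here is the population one (dividing by n).

```python
import math

def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Rescale values linearly so the minimum maps to low and the maximum to high."""
    low, high = feature_range
    v_min, v_max = min(values), max(values)
    return [low + (v - v_min) / (v_max - v_min) * (high - low) for v in values]

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

# Hypothetical incomes; the last value is an outlier (for example, a data entry error).
incomes = [2.0, 3.0, 4.0, 5.0, 6.0, 100.0]

print(min_max_scale(incomes))
print(standardize(incomes))
```

With the outlier present, min-max scaling squeezes the five ordinary values into roughly the bottom 4% of the 0 to 1 range, while standardized values are not bound to any fixed range.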

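Transformation pipelines

Data preparation involves many transformation steps that must be executed in the right order. Fortunately, Scikit-Learn provides the Pipeline class to help with such sequences of transformations. Here is a small pipeline for the numeric attributes: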

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer  # replaces sklearn.preprocessing.Imputer, removed in scikit-learn 0.22
from sklearn.compose import ColumnTransformer
'''
The Pipeline constructor takes a list of name/estimator pairs defining the sequence of steps. All but the last estimator must be transformers (that is, they must have a fit_transform() method).
When you call the pipeline's fit() method, it calls fit_transform() on each transformer in turn, passing the output of each call as the input to the next, until it reaches the final estimator, for which it only calls fit().
'''
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
    ])

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(housing)
'''
array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.18684903, -1.34218285,  0.18664186, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.        ,
         0.        ,  0.        ],
       [-1.43579109,  0.99645926,  1.85670895, ...,  0.        ,
         1.        ,  0.        ]])
'''
housing_prepared.shape
# (16512, 16)
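
How a pipeline chains its steps can be illustrated with toy classes in plain Python. This is a sketch of the calling pattern only, assuming made-up transformers (FillMissing, Center) and a simplified ToyPipeline; it is not Scikit-Learn's implementation.

```python
class FillMissing:
    """Toy transformer: learns the median in fit() and fills None values in transform()."""
    def fit(self, values):
        present = sorted(v for v in values if v is not None)
        mid = len(present) // 2
        self.median_ = present[mid] if len(present) % 2 else (present[mid - 1] + present[mid]) / 2
        return self
    def transform(self, values):
        return [self.median_ if v is None else v for v in values]
    def fit_transform(self, values):
        return self.fit(values).transform(values)

class Center:
    """Toy transformer: learns the mean in fit() and subtracts it in transform()."""
    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self
    def transform(self, values):
        return [v - self.mean_ for v in values]
    def fit_transform(self, values):
        return self.fit(values).transform(values)

class ToyPipeline:
    """Chains steps: each step is fitted on the output of the previous one."""
    def __init__(self, steps):
        self.steps = steps  # list of (name, transformer) pairs
    def fit_transform(self, values):
        for _name, step in self.steps:
            values = step.fit_transform(values)
        return values
    def transform(self, values):
        for _name, step in self.steps:
            values = step.transform(values)  # reuse what was learned during fitting
        return values

pipeline = ToyPipeline([("fill", FillMissing()), ("center", Center())])
train_out = pipeline.fit_transform([1.0, None, 3.0, 8.0])
test_out = pipeline.transform([None, 5.0])
print(train_out)
print(test_out)
```

The second call uses transform() rather than fit_transform(), so the new data is filled and centered with the median and mean learned from the training data. The same distinction is why the test set should go through the fitted pipeline's transform() only.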

5. Select a model and train it

Start by training a linear regression model

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

Compute the RMSE

from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
# 68628.19819848922

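A prediction error of 68628 is not satisfactory. This is an example of a model underfitting the training data. When that happens, it means either that the features do not provide enough information to make good predictions, or that the model is not powerful enough.

Try a decision tree model instead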

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)

housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
# 0.0: the model has most likely overfit the training data

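Cross-validation

'''
Split the training set into 10 distinct subsets (folds), then train and evaluate the decision tree model 10 times, each time holding out a different fold for evaluation and training on the other 9. The result is an array of 10 evaluation scores.
'''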
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
tree_rmse_scores
# array([68274.11882883, 66569.14495813, 72556.31339841, 68235.85607159,
#       70706.44616166, 73298.7766776 , 70404.07783425, 71858.98228216,
#       77435.9399421 , 71396.89318558])
"""
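Scikit-Learn's cross-validation expects a utility function (greater is better) rather than a cost function (lower is better),
which is why the scoring is the negative MSE and the code computes -scores before taking the square root.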
"""

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)
'''Scores: [70194.33680785 66855.16363941 72432.58244769 70758.73896782
 71115.88230639 75585.14172901 70262.86139133 70273.6325285
 75366.87952553 71231.65726027]
Mean: 71407.68766037929
Standard deviation: 2439.4345041191004'''
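
The fold bookkeeping behind 10-fold cross-validation can be sketched as follows. This is a simplified illustration that assumes contiguous, unshuffled folds; k_fold_indices is a hypothetical helper, not a Scikit-Learn function.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=16512, k=10))
for train, validation in folds[:3]:
    print(len(train), len(validation))
```

Each instance lands in exactly one validation fold, so every model is evaluated on data it did not see during training.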

Next, try a random forest

from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(random_state=42)
forest_reg.fit(housing_prepared, housing_labels)
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse
# 21933.31414779769
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
'''
Scores: [51646.44545909 48940.60114882 53050.86323649 54408.98730149
 50922.14870785 56482.50703987 51864.52025526 49760.85037653
 55434.21627933 53326.10093303]
Mean: 52583.72407377466
Standard deviation: 2298.353351147122
'''

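Save the model

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package

# my_model stands for any trained model, for example forest_reg
joblib.dump(my_model, "my_model.pkl")
# and later
my_model_loaded = joblib.load("my_model.pkl")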

6. Fine-tune the model

Grid search

The following code searches for the best combination of hyperparameter values for RandomForestRegressor:

from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training 
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
grid_search.best_params_
# {'max_features': 8, 'n_estimators': 30}
# 30 is the largest value of n_estimators that was tried, so you should probably try higher values too, since the score may keep improving as n_estimators grows.
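
The count of 18 combinations and 90 training runs can be checked by expanding the grid by hand. This is a plain-Python sketch; expand_grid is a hypothetical helper written for this illustration.

```python
from itertools import product

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

def expand_grid(grid):
    """Return every hyperparameter combination described by a list of grids."""
    combinations = []
    for sub_grid in grid:
        names = list(sub_grid)
        # Cartesian product of the value lists within one sub-grid
        for values in product(*(sub_grid[name] for name in names)):
            combinations.append(dict(zip(names, values)))
    return combinations

combinations = expand_grid(param_grid)
cv = 5
print(len(combinations), "combinations,", len(combinations) * cv, "training runs")
```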

Randomized search

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
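
What the randomized search samples can be sketched with the standard library. This only imitates the sampling step, under the assumption that random.randrange stands in for scipy.stats.randint; both exclude the upper bound.

```python
import random

def sample_params(rng, n_iter):
    """Draw n_iter random hyperparameter combinations from integer ranges."""
    samples = []
    for _ in range(n_iter):
        samples.append({
            # randrange(low, high) excludes high, like scipy.stats.randint(low, high)
            'n_estimators': rng.randrange(1, 200),
            'max_features': rng.randrange(1, 8),
        })
    return samples

rng = random.Random(42)
candidates = sample_params(rng, n_iter=10)
for params in candidates:
    print(params)
```

Because the upper bound is exclusive, max_features is drawn from 1 to 7 here, so the value 8 that the grid search selected is never tried; raising high to 9 would include it.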

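Ensemble methods

Another way to fine-tune the system is to combine the models that perform best. The combination (ensemble) usually performs better than any single model (just as a random forest performs better than the individual decision trees it relies on), especially when the individual models make different types of errors. To be continued.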
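
A tiny numeric illustration of why combining models can help: the predictions below are made up so that the two models tend to err in opposite directions, and averaging them cancels part of the error. This is an assumption-laden toy, not a result from the housing data, and averaging does not always improve on the best single model.

```python
import math

def rmse(labels, predictions):
    """Root mean square error between labels and predictions."""
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels))

labels = [100.0, 200.0, 300.0, 400.0]
# Hypothetical predictions from two models whose errors often point in opposite directions.
model_a = [110.0, 190.0, 320.0, 390.0]
model_b = [95.0, 215.0, 290.0, 405.0]

# The simplest ensemble: average the two predictions for each instance.
ensemble = [(a + b) / 2 for a, b in zip(model_a, model_b)]

print(rmse(labels, model_a), rmse(labels, model_b), rmse(labels, ensemble))
```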

Analyze the best models and their errors

Evaluate the system on the test set

final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
# 47730.22690385927


A complete machine learning project, from start to finish