This article is a detailed walkthrough of the code in the kernel A study on Regression applied to the Ames dataset.
The kernel introduces itself as follows:
Introduction
This kernel uses a range of techniques to show what Linear Regression can achieve, including preprocessing and regularization (a process of introducing additional information in order to prevent overfitting).
The algorithm workflow
1. Loading the data
(Colleagues who need the dataset can download it from the link on the page.)
Import the toolkits.
matplotlib is the best-known Python plotting library. It can output graphics in many formats and can display charts interactively through various GUI (Graphical User Interface) toolkits. The %matplotlib magic command embeds matplotlib charts directly in the Notebook, or displays them with a specified GUI backend; its argument specifies how the charts are shown. inline means the charts are embedded in the Notebook.
Note: IPython ships with a powerful built-in command system known as magic commands, which makes working in the IPython environment much more convenient. Magic commands start with % or %%: those starting with % are line magics and those starting with %% are cell magics. A line magic only affects the line it appears on, whereas a cell magic must appear on the first line of a cell and applies to the whole cell. MORE TO SEE...
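As a minimal sketch (my own, not copied verbatim from the kernel) of the kind of notebook setup this refers to:
%matplotlib inline              # embed figures in the Notebook (line magic)
import numpy as np              # numerical arrays
import pandas as pd             # DataFrame handling
import matplotlib.pyplot as plt # plotting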
Read in the data.
It can be helpful to open the .csv file in Excel, so that you can later tweak the data by hand and observe how different changes affect the results of the code.
train = pd.read_csv("../input/train.csv")
Set the data display format:
The following sets the number of decimal places shown for floats, keeping three decimals.
pd.set_option('display.float_format', lambda x: '%.3f' % x)
Check for duplicate records and drop the Id column.
Check for duplicates:
idsUnique = len(set(train.Id))
idsTotal = train.shape[0]
idsDupli = idsTotal - idsUnique
print("There are " + str(idsDupli) + " duplicate IDs for " + str(idsTotal) + " total entries")
Here train.shape returns (1460, 80), i.e. the number of rows and columns, and train.shape[0] returns the number of rows.
There are no duplicates in this dataset. If you do need to remove duplicates in another dataset, the recommended tool is DataFrame.drop_duplicates(subset=None, keep='first', inplace=False), which returns a DataFrame with duplicate rows removed, optionally only considering certain columns.
Its parameters are:
subset : column label or sequence of labels, optional. Only consider certain columns for identifying duplicates; by default all of the columns are used.
keep : {‘first’, ‘last’, False}, default ‘first’.first : Drop duplicates except for the first occurrence. last : Drop duplicates except for the last occurrence. False : Drop all duplicates.
inplace : boolean, default False. Whether to drop duplicates in place or to return a copy. If True, the original DataFrame is modified directly; otherwise a pruned copy is returned.
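A minimal sketch of drop_duplicates on a toy DataFrame (hypothetical data, not part of the kernel):
import pandas as pd
df = pd.DataFrame({"Id": [1, 2, 2, 3], "SalePrice": [100, 200, 200, 300]})
deduped = df.drop_duplicates(subset = ["Id"], keep = "first")  # keep the first row for each Id
print(deduped.shape)  # (3, 2): the duplicated Id row is gone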
Remove the Id column:
train.drop("Id", axis = 1, inplace = True)
The parameters mean:
"Id" is the column name.
axis = 1 indicates a column; axis = 0 would indicate a row.
inplace = True: operations that modify an object and return a new one usually offer an optional inplace parameter. When it is set to True (the default is False), the original object is replaced in place.
2. Pre-processing
2.1 Removing outliers (Potential Pitfalls/Outliers)
train = train[train.GrLivArea < 4000] # remove the outlying points on the right
2.2 Taking the log to reduce error
Taking the log evens out how strongly an error affects the predicted price of cheap versus expensive houses.
train.SalePrice = np.log1p(train.SalePrice)
y = train.SalePrice
The log used here is log1p, i.e. log(1 + x) (a small round-trip sketch follows the two questions below).
Two questions:
- What taking the log does:
Small values that are close together are spread further out.
Large values that are spread out are brought closer together.
- What taking log(1 + x) does:
In Naive Bayes, it prevents the probability of a value that has never been seen before from being 0, which would cause a numerical error.
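A small sketch (my own, with made-up numbers) of the log1p transform and its inverse expm1, which is what you would use to map predictions back to the original price scale:
import numpy as np
prices = np.array([35000.0, 160000.0, 750000.0])  # made-up sale prices
log_prices = np.log1p(prices)                      # log(1 + x)
restored = np.expm1(log_prices)                    # exp(x) - 1, the inverse transform
print(np.allclose(prices, restored))               # True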
2.3 Handle missing values
Handle missing values of features that cannot simply be filled with the median, the mean, or the most common value.
The rationale for the replacement values:
Based on the label, judge what a missing value of the feature is most likely to be, and fill that in.
Specifically, look closely at the original dataset:
- If the values form a graded scale (levels such as good/fair/poor; 2/1/0 or Y/N), the lowest grade is usually chosen as the fill value.
- If the values are unordered categories, the most frequent value of that feature is chosen as the fill value (a small sketch of this appears after the example below).
train.loc[:, "Alley"] = train.loc[:, "Alley"].fillna("None")
Here train.loc[:, "Alley"] selects every row of the column "Alley", and .fillna(XX) fills NA cells with XX.
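For the "most frequent value" case above, a hedged sketch (the column name Electrical is only used as an illustration here; the kernel itself hard-codes its chosen replacement values):
# fill a categorical column with its most frequent value (illustration only)
most_common = train["Electrical"].mode()[0]
train.loc[:, "Electrical"] = train.loc[:, "Electrical"].fillna(most_common)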
2.4 Handling special data
2.4.1 Converting numerical features into categories
Some numerical features are in fact categorical values and should be converted into categories. For example, the month number carries no numeric meaning in itself, so it is mapped to the English abbreviation.
train = train.replace({"MSSubClass" : {20 : "SC20", 30 : "SC30", 40 : "SC40", 45 : "SC45", 50 : "SC50", 60 : "SC60", 70 : "SC70", 75 : "SC75", 80 : "SC80", 85 : "SC85", 90 : "SC90", 120 : "SC120", 150 : "SC150", 160 : "SC160", 180 : "SC180", 190 : "SC190"},
"MoSold" : {1 : "Jan", 2 : "Feb", 3 : "Mar", 4 : "Apr", 5 : "May", 6 : "Jun", 7 : "Jul", 8 : "Aug", 9 : "Sep", 10 : "Oct", 11 : "Nov", 12 : "Dec"}})
2.4.2 Converting categorical features into ordered numbers
Convert some categorical features into ordered numbers:
- features with a clear grading, where the order of the values itself carries information, e.g. "BsmtQual" : {"No" : 0, "Po" : 1, "Fa" : 2, "TA": 3, "Gd" : 4, "Ex" : 5};
- categories where a ranking can be told apart fairly clearly, e.g. Alley, where most people prefer a paved surface over gravel, or LotShape, where most people prefer a regular shape. A counterexample is Neighborhood: it does reflect a ranking, since most people like similar neighborhoods, but the ranking is hard to pin down.
train = train.replace({"Alley" : {"Grvl" : 1, "Pave" : 2},
"BsmtCond" : {"No" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5},
……)
3. Create new features
Then we will create new features, in 3 ways :
- Simplifications of existing features
- Combinations of existing features
- Polynomials on the top 10 existing features
The reason for processing the features further, I think, is: 1. it simplifies later computation and keeps the focus on the core features.
3.1 Simplifying features, method 1: Simplifications of existing features
The first way to simplify features is to reduce the number of levels of an existing feature; for example, the code below maps the original 10-level scale (1–10) down to 3 levels (1–3).
train["SimplOverallQual"] = train.OverallQual.replace(
{1 : 1, 2 : 1, 3 : 1, # bad
4 : 2, 5 : 2, 6 : 2, # average
7 : 3, 8 : 3, 9 : 3, 10 : 3 # good
})
3.2 Simplifying features, method 2: Combinations of existing features
The second way to simplify features is to merge several closely related features into one.
A helpful video is Multivariate Linear Regression - Features and Polynomial Regressions - Housing Prices Predicting, given by Andrew Ng, Stanford University. Note that this approach requires extra attention to scaling.
Syntax (example):
train["OverallGrade"] = train["OverallQual"] * train["OverallCond"]
3.3 Simplifying features, method 3: Polynomials on the top 10 existing features
3.3.1 Finding the important features
Find the most important features relative to the target: sort the features by their correlation with SalePrice in descending order.
corr = train.corr()
corr.sort_values(["SalePrice"], ascending = False, inplace = True)
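A short sketch (assumed, not verbatim from the kernel) of how to inspect the resulting ranking:
print(corr.SalePrice.head(10))  # the 10 features most correlated with SalePrice (SalePrice itself comes first, with correlation 1.0)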
3.3.2 Features and Polynomial Regressions
My understanding of this step: first pick out the important features, and then, to fit the model better, apply a Polynomial Regressions treatment to those important features.
There are three situations in which polynomials are used:
- Theoretical need, i.e. the author hypothesizes that the relationship is curved.
- Visual inspection of the variables. Before running the regression analysis, look at univariate or bivariate plots; a simple scatter plot can show whether the relationship is curved.
- Inspection of the residuals. If you fit a linear model to data with a curved relationship, the residual scatter plot will show a block of positive residuals in the middle and a block of negative residuals at one end of the x-axis (the predictor), or vice versa, which indicates that a linear model is not appropriate. Personally, though, I think this is only one of several signs that a linear model does not apply, not a necessary and sufficient condition.
Code example:
train["OverallQual-s2"] = train["OverallQual"] ** 2
train["OverallQual-s3"] = train["OverallQual"] ** 3
train["OverallQual-Sq"] = np.sqrt(train["OverallQual"])
Again, the relevant video is Multivariate Linear Regression - Features and Polynomial Regressions - Housing Prices Predicting, given by Andrew Ng, Stanford University; as before, pay extra attention to scaling when using this approach.
Interestingly, after running the code above and re-ranking the features by how strongly they influence SalePrice, the newly created features enter the top 10 most influential features, which shows that the polynomial terms are meaningful.
4. Further processing after creating features
4.1 Handling the remaining missing data
4.1.1 Separating numerical and categorical features
Separate the numerical and categorical features, excluding the target feature SalePrice.
categorical_features = train.select_dtypes(include = ["object"]).columns
numerical_features = train.select_dtypes(exclude = ["object"]).columns
numerical_features = numerical_features.drop("SalePrice")
Here object is the dtype of the categorical features.
4.1.2 Filling the missing data
Missing values in the numerical features are filled with the median.
train_num = train_num.fillna(train_num.median())
4.2 Take Log
Log-transforming the skewed numerical features lessens the influence of outliers.
Inspired by Alexandru Papiu's script.
As a general rule of thumb, a skewness with an absolute value > 0.5 is considered at least moderately skewed.
from scipy.stats import skew  # skew() comes from scipy.stats
skewness = train_num.apply(lambda x: skew(x))
skewness = skewness[abs(skewness) > 0.5]
skewed_features = skewness.index
train_num[skewed_features] = np.log1p(train_num[skewed_features])
4.3 Create dummy features for categorical values
Create dummy features for categorical values via one-hot encoding.
In regression analysis, a dummy variable (also known as an indicator variable, design variable, Boolean indicator, categorical variable, binary variable, or qualitative variable) takes the value 0 or 1 to indicate the absence or presence of some categorical effect that may influence the outcome. Dummy variables are commonly used for mutually exclusive categories (e.g. smoker/non-smoker).
train_cat = pd.get_dummies(train_cat)  # see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
get_dummies operates on the categorical features. An example of what it does (a runnable toy sketch follows the two points below):
- Before get_dummies, each column is one feature, and every cell in the column holds the original (or preprocessed) value; the feature MSSubClass takes values SC20/SC60/SC70...SC120..., and, say, row 23 holds SC120.
- After get_dummies, each column name becomes one of the original feature's values; for example, the original MSSubClass column expands into SC20/SC60/SC70...SC120... columns. Taking SC120 as an example: since row 23 originally held SC120, row 23 of the new SC120 column becomes 1, while every row that did not hold SC120 becomes 0.
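A toy illustration of get_dummies (hypothetical data, not the actual training rows):
import pandas as pd
toy = pd.DataFrame({"MSSubClass": ["SC20", "SC120", "SC60"]})
print(pd.get_dummies(toy))
# the single column expands into MSSubClass_SC120, MSSubClass_SC20 and MSSubClass_SC60 indicator columns holding 0/1 values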
5. Modeling
5.1 Combining the data and validation
5.1.1 Combining the data
First combine the data: merge the separately processed numerical and categorical features back together.
train = pd.concat([train_num, train_cat], axis = 1)  # see http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.concat.html
5.1.2 Splitting the dataset
Split the dataset into a train set and a validation set. In fact, as the author notes afterwards, there is no need to split the data in advance for cross-validation, since cross-validation does its own splitting.
X_train, X_test, y_train, y_test = train_test_split(train, y, test_size = 0.3, random_state = 0)
train_test_split(train, y, test_size = 0.3, random_state = 0) randomly splits arrays or matrices into train and test subsets.
5.1.3 standardize numerical features
Reason why we need standardization for numerical features: many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
stdSc = StandardScaler()
X_train.loc[:, numerical_features] = stdSc.fit_transform(X_train.loc[:, numerical_features])
X_test.loc[:, numerical_features] = stdSc.transform(X_test.loc[:, numerical_features])
StandardScaler() standardizes features by removing the mean and scaling to unit variance, and returns the standardized data.
In short, fit_transform = fit + transform. fit computes the mean and variance of X_train, and based on that result transform performs the standardization. The mean and variance are stored, so when X_test is processed next, its own mean and variance are not recomputed; the mean and variance of X_train are reused and transform is called directly to standardize it.
Note that standardization (the fit step) must not be applied before the data are split into training and test/validation sets: we do not want StandardScaler to also learn the mean and variance of the test set; the test set should use the same mean and variance as the train set.
5.1.4 Define error measure
Define error measure for official scoring : RMSE
scorer = make_scorer(mean_squared_error, greater_is_better = False)
make_scorer makes a scorer from a performance metric or loss function. The first argument is the metric (here a loss function); the second argument, greater_is_better, indicates whether the first argument is a score function (default True) or a loss function (False).
The author defines two functions, one for the train-set RMSE and one for the test-set RMSE. Personally I think this is a mistake: with cross-validation, the whole dataset that is read in serves as both train and test set across the different iterations, so there is no need to compute a train-set RMSE and a test-set RMSE separately.
def rmse_cv_train(model):
    rmse = np.sqrt(-cross_val_score(model, X_train, y_train, scoring = scorer, cv = 10))
    return(rmse)
def rmse_cv_test(model):
    rmse = np.sqrt(-cross_val_score(model, X_test, y_test, scoring = scorer, cv = 10))
    return(rmse)
Here the scorer passed to cross_val_score is a loss function, so the values it returns are negative; the minus sign before sqrt is there to sign-flip the outcome of the scorer.
If the scorer were a score function instead, the return value would be scores : array of float, shape=(len(list(cv)),), i.e. an array of scores of the estimator for each run of the cross validation.
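A minimal, self-contained sketch (synthetic data, not the kernel's) showing why the minus sign is needed: with greater_is_better = False the scorer reports the negated MSE, so cross_val_score returns non-positive values:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_regression(n_samples = 100, n_features = 5, noise = 10, random_state = 0)
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          scoring = make_scorer(mean_squared_error, greater_is_better = False), cv = 5)
print(neg_mse.max() <= 0)  # True: the scorer flips the sign of the loss
rmse = np.sqrt(-neg_mse)   # flip it back before taking the square root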
BTW, Trust your CV score, and not LB score. The leaderboard score is scored only on a small percentage of the full test set. In some cases, it’s only a few hundred test cases. Your cross-validation score will be much more reliable in general.
5.2 *1. Linear Regression without regularization
lr = LinearRegression()
lr.fit(X_train, y_train)
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)
LinearRegression() is ordinary least squares Linear Regression.
The predictions can be inspected with plots: plot the residuals, and plot the predictions against the actual values.
plt.scatter(y_train_pred, y_train_pred - y_train, c = "blue", marker = "s", label = "Training data") # scatter draws a scatter plot
plt.scatter(y_test_pred, y_test_pred - y_test, c = "lightgreen", marker = "s", label = "Validation data")
plt.scatter(y_train_pred, y_train, c = "blue", marker = "s", label = "Training data")
plt.scatter(y_test_pred, y_test, c = "lightgreen", marker = "s", label = "Validation data")
5.3 *2. Linear Regression with Ridge regularization (L2 penalty)
Regularization is a very useful method to handle collinearity, filter out noise from data, and eventually prevent overfitting. The concept behind regularization is to introduce additional information (bias) to penalize extreme parameter weights. The goal of this learning problem is to find a function that fits or predicts the outcome (label) and minimizes the expected error over all possible inputs and labels.
L1 penalty:
absolute sum of weights
L2 penalty:
Ridge regression is an L2 penalized model where we simply add the squared sum of the weights to our cost function.
For the difference between the L1 and L2 penalties, see the As Regularization/loss function section of the article referenced here; lecture 14-4, General Regularizers (slides 13-28), of Hsuan-Tien Lin's Machine Learning Foundations also explains it in detail.
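As a hedged sketch of the two penalized objectives in a common textbook form (sklearn's exact objectives differ by scaling factors), with $w$ the weights and $\lambda$ the regularization strength:
Ridge (L2): $J(w) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j} w_j^2$
Lasso (L1): $J(w) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j} |w_j|$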
5.3.1 Finding a suitable Ridge Regression model
First search
ridge = RidgeCV(alphas = [0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6, 10, 30, 60])
ridge.fit(X_train, y_train)
alpha = ridge.alpha_ # Estimated regularization parameter.
Here RidgeCV() is Ridge regression with built-in cross-validation.
Second search
Note that after obtaining alpha from the first search, a second search is run with that alpha as the middle of a new grid.
ridge = RidgeCV(
alphas = [alpha * .6, alpha * .65, alpha * .7, alpha * .75, alpha * .8, alpha * .85, alpha * .9, alpha * .95, alpha, alpha * 1.05, alpha * 1.1, alpha * 1.15, alpha * 1.25, alpha * 1.3, alpha * 1.35, alpha * 1.4], cv = 10)
ridge.fit(X_train, y_train)
alpha = ridge.alpha_
Question: why does the second RidgeCV call add the parameter cv = 10 while the first one does not?
5.3.2 Computing predictions with the chosen Ridge model
First check the RMSE. Personally I think that if the RMSE.mean() value here is unsatisfactory, one should step back and fine-tune the earlier stages. What counts as unsatisfactory depends on the range of the original data: for a datum which ranges from 0 to 1000, an RMSE of 0.7 is small, but if the range goes from 0 to 1, it is not that small anymore, so there is no absolute acceptable range.
If the RMSE is unsatisfactory, I would first go back and adjust alpha and the other parameters; if that is still not enough, re-examine the data, e.g. the polynomial terms or the merging and simplification of features. During tuning, change only one parameter, or one tightly related group of parameters, at a time.
print("Ridge RMSE on Training set :", rmse_cv_train(ridge).mean())
print("Ridge RMSE on Test set :", rmse_cv_test(ridge).mean())
If the RMSE is acceptable, compute the predictions.
y_train_rdg = ridge.predict(X_train)
y_test_rdg = ridge.predict(X_test)
5.3.3 Checking the Ridge model results with plots
Plot the residuals (predicted minus actual values), for y_train_rdg and y_test_rdg together.
plt.scatter(y_train_rdg, y_train_rdg - y_train, c = "blue", marker = "s", label = "Training data")
plt.scatter(y_test_rdg, y_test_rdg - y_test, c = "lightgreen", marker = "s", label = "Validation data")
To actually read the residual plots: a good residual plot has the following characteristics:
(1) the residuals are pretty symmetrically distributed, tending to cluster towards the middle of the plot;
(2) they are clustered around the lower single digits of the y-axis (e.g., 0.5 or 1.5, not 30 or 150);
(3) in general there are no clear patterns.
The article Interpreting residual plots to improve your regression describes in some detail what residual plots mean and gives suggestions on how to fix problems.
A concept commonly used to express how well the data fit the chosen model is R-squared. R-squared is a statistical measure of how close the data are to the fitted regression line; it is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. 0% indicates that the model explains none of the variability of the response data around its mean. In general, the higher the R-squared (its range is [0, 1]), the better the model fits your data, although there are important caveats to this guideline discussed in the referenced article.
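A small sketch (assumed, using sklearn's r2_score rather than anything in the original kernel) of how R-squared could be computed for the ridge predictions above:
from sklearn.metrics import r2_score
print("R^2 on training set :", r2_score(y_train, y_train_rdg))
print("R^2 on validation set :", r2_score(y_test, y_test_rdg))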
Specifically, for the residual-plot cases:
- If the plot is Y-axis unbalanced,
the fixes are:
The solution to this is almost always to transform your data (usually by taking the log), typically your response variable.
It’s also possible that your model lacks a variable.
- If the plot shows non-linearity,
note that a somewhat better fit may still show a slight curve (see the example figure in the referenced article); for such a model, if you're getting a quick understanding of the relationship, your straight line is a pretty decent approximation, but if you're going to use this model for prediction and not explanation, the most accurate possible model would probably account for that curve.
The fixes are:
Sometimes patterns like this indicate that a variable needs to be transformed.
If the pattern is actually as clear as these examples, you probably need to create a nonlinear model (it’s not as hard as that sounds).
Or, as always, it’s possible that the issue is a missing variable.
- outliers
Fixes:
It’s possible that this is a measurement or data entry error, where the outlier is just wrong, in which case you should delete it.
It’s possible that what appears to be just a couple outliers is in fact a power distribution. Consider transforming the variable if one of your variables has an asymmetric distribution (that is, it’s not remotely bell-shaped).
If it is indeed a legitimate outlier, you should assess the impact of the outlier.
- Large Y-axis Datapoints
Fixes:
Even though this approach wouldn’t work in the specific example above, it’s almost always worth looking around to see if there’s an opportunity to usefully transform a variable.
If that doesn’t work, though, you probably need to deal with your missing variable problem.
- X-axis Unbalanced
This pattern does not necessarily mean the model predicts poorly; look at the Predicted vs Actual plot. It is possible that the fit is fine (residuals are unbalanced but predictions are accurate); it is also possible that after some fine-tuning the predictive power actually gets worse.
Fixes:
The solution to this is almost always to transform your data, typically an explanatory variable. (Note that the example in the referenced article transforms the response variable, but the same process is helpful here.)
It’s also possible that your model lacks a variable.
Plot the predictions directly, for y_train_rdg and y_test_rdg together.
plt.scatter(y_train_rdg, y_train, c = "blue", marker = "s", label = "Training data")
plt.scatter(y_test_rdg, y_test, c = "lightgreen", marker = "s", label = "Validation data")
Plot the important coefficients. As with other linear models, Ridge will take in its fit method arrays X, y and will store the coefficients w of the linear model in its coef_ member:
coefs = pd.Series(ridge.coef_, index = X_train.columns) # ridge.coef_ holds the weights w
imp_coefs = pd.concat([coefs.sort_values().head(10),
coefs.sort_values().tail(10)])
imp_coefs.plot(kind = "barh")
5.4 * 3. Linear Regression with Lasso regularization (L1 penalty)
LASSO stands for Least Absolute Shrinkage and Selection Operator. It is an alternative regularization method: we replace the squared sum of the weights used by Ridge with the sum of the absolute values of the weights. Unlike L2 regularization, L1 regularization yields sparse feature vectors, i.e. most feature weights are 0. Sparsity is useful in practice, especially for datasets with many dimensions and many irrelevant features.
5.4.1 Finding a suitable Lasso Regression model
As with the Ridge model, finding a suitable Lasso Regression model also takes two searches.
First search
lasso = LassoCV(alphas = [0.0001, 0.0003, 0.0006, 0.001, 0.003, 0.006, 0.01, 0.03, 0.06, 0.1,
0.3, 0.6, 1],
max_iter = 50000, cv = 10)
lasso.fit(X_train, y_train)
alpha = lasso.alpha_
Second search
Note: as with the Ridge model, cv = 10 is passed here; unlike the Ridge searches, the Lasso searches additionally set max_iter = 50000.
lasso = LassoCV(alphas = [alpha * .6, alpha * .65, alpha * .7, alpha * .75, alpha * .8, alpha * .85, alpha * .9, alpha * .95, alpha, alpha * 1.05, alpha * 1.1, alpha * 1.15, alpha * 1.25, alpha * 1.3, alpha * 1.35, alpha * 1.4], max_iter = 50000, cv = 10)
lasso.fit(X_train, y_train)
alpha = lasso.alpha_
5.4.2 Computing predictions with the chosen Lasso model
As with the Ridge model, print the RMSE first.
print("Lasso RMSE on Training set :", rmse_cv_train(lasso).mean())
print("Lasso RMSE on Test set :", rmse_cv_test(lasso).mean())
Compute the predictions:
y_train_las = lasso.predict(X_train)
y_test_las = lasso.predict(X_test)
5.4.3 Checking the Lasso model results with plots
As with the Ridge model, there are three steps:
- plot the residuals (predicted minus actual values), for y_train_las and y_test_las together;
- plot the predictions directly, for y_train_las and y_test_las together;
- plot the important coefficients.
Comparing the Lasso and Ridge models: the Lasso RMSE is better on both the training and the test set. It is worth noting that Lasso uses only about one third of the available features; it also seems to give more weight to the Neighborhood categories, and intuitively the neighborhood indeed plays a key role in a house's sale price.
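A hedged sketch (not verbatim from the kernel) of how the "one third of the features" observation can be checked on the fitted LassoCV model:
import numpy as np
n_used = np.sum(lasso.coef_ != 0)  # features with a non-zero weight
n_total = len(lasso.coef_)         # all features available to the model
print("Lasso picked " + str(n_used) + " features out of " + str(n_total))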
5.5 * 4. Linear Regression with ElasticNet regularization (L1 and L2 penalty)
ElasticNet is a compromise between Ridge and Lasso regression. It has an L1 penalty to generate sparsity and an L2 penalty to overcome some of Lasso's limitations, such as the number of variables it can select (Lasso can't select more features than it has observations, but it's not the case here anyway).
5.5.1 Finding a suitable ElasticNet model
First search
elasticNet = ElasticNetCV(l1_ratio = [0.1, 0.3, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 1],
alphas = [0.0001, 0.0003, 0.0006, 0.001, 0.003, 0.006,
0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6],
max_iter = 50000, cv = 10)
elasticNet.fit(X_train, y_train)
alpha = elasticNet.alpha_
ratio = elasticNet.l1_ratio_
Here l1_ratio is a float between 0 and 1 passed to ElasticNet (scaling between l1 and l2 penalties).
For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
This parameter can be a list, in which case the different values are tested by cross-validation and the one giving the best prediction score is used. Note that a good choice of list of values for l1_ratio is often to put more values close to 1 (i.e. Lasso) and less close to 0 (i.e. Ridge), as in [.1, .5, .7, .9, .95, .99, 1]
Second search (over the ratio)
Keep the alpha obtained earlier for now, take ratio values within a certain range around the ratio found in the first search, and search again.
elasticNet = ElasticNetCV(l1_ratio = [ratio * .85, ratio * .9, ratio * .95, ratio, ratio * 1.05, ratio * 1.1, ratio * 1.15],
alphas = [0.0001, 0.0003, 0.0006, 0.001, 0.003, 0.006, 0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6],
max_iter = 50000, cv = 10)
elasticNet.fit(X_train, y_train)
alpha = elasticNet.alpha_
ratio = elasticNet.l1_ratio_
One detail to note: elasticNet.l1_ratio_ must lie in [0, 1], so it cannot go beyond that range. If the grid pushes it outside, fold it back to the nearest end of the range, i.e. set it to 0 or 1.
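A tiny sketch of the clamp described above (my own illustration, not the kernel's code):
# fold the selected l1_ratio back into [0, 1] if the grid pushed it outside
ratio = min(max(ratio, 0.0), 1.0)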
Third search (over alpha)
Using the ratio already found, take alpha values within a certain range around the alpha from the first search and search a third time.
elasticNet = ElasticNetCV(l1_ratio = ratio,
alphas = [alpha * .6, alpha * .65, alpha * .7, alpha * .75, alpha * .8, alpha * .85, alpha * .9,
alpha * .95, alpha, alpha * 1.05, alpha * 1.1, alpha * 1.15, alpha * 1.25, alpha * 1.3,
alpha * 1.35, alpha * 1.4],
max_iter = 50000, cv = 10)
elasticNet.fit(X_train, y_train)
alpha = elasticNet.alpha_
ratio = elasticNet.l1_ratio_
As in the previous search, note that elasticNet.l1_ratio_ must lie in [0, 1]; if the value goes outside that range, fold it back to the nearest end, i.e. set it to 0 or 1.
5.5.2 Computing predictions with the chosen ElasticNetCV model
As with the Ridge and Lasso models, print the ElasticNetCV RMSE first.
print("ElasticNet RMSE on Training set :", rmse_cv_train(elasticNet).mean())
print("ElasticNet RMSE on Test set :", rmse_cv_test(elasticNet).mean())
Compute the predictions:
y_train_ela = elasticNet.predict(X_train)
y_test_ela = elasticNet.predict(X_test)
5.5.3 Checking the ElasticNetCV model results with plots
As with the Ridge and Lasso models, there are three steps:
- plot the residuals (predicted minus actual values), for y_train_ela and y_test_ela together;
- plot the predictions directly, for y_train_ela and y_test_ela together;
- plot the important coefficients.
Summary: the best L1 ratio found by ElasticNetCV is 1, i.e. it simply uses the Lasso regressor; the model apparently does not need any L2 regularization to overcome L1's drawbacks.
Conclusion
Linear Regression, applied to a carefully cleaned dataset and with well-tuned regularization, gives quite good predictions, much better than using the algorithms that performed well in previous Kaggle competitions.
Appendix
Description of the original training data:
File descriptions
train.csv - the training set
test.csv - the test set
data_description.txt - full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here
sample_submission.csv - a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms
Data fields
Here's a brief version of what you'll find in the data description file.
SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
MSSubClass: The building class (graded; the order is meaningful)
MSZoning: The general zoning classification (categorical; no meaningful order)
LotFrontage: Linear feet of street connected to property (numerical)
LotArea: Lot size in square feet
Street: Type of road access (categorical)
Alley: Type of alley access (categorical)
LotShape: General shape of property (categorical)
LandContour: Flatness of the property (categorical)
Utilities: Type of utilities available (categorical)
LotConfig: Lot configuration (categorical)
LandSlope: Slope of property (categorical; the order is meaningful)
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Number of bedrooms above basement level
Kitchen: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale