Multivariate Linear Regression (Part 1)

Setting Up the Programming Environment

Installing Octave is recommended here; if you already have Matlab installed, that works as well. How to install Octave or Matlab is not covered in this post, so please consult the relevant documentation yourself.

Multiple Features

Previously we studied linear regression with one variable; now we continue with the housing-price example and study linear regression with multiple variables.

We add more features to the housing-price model, such as the number of rooms, the number of floors, and the age of the house. Accordingly, we let $x_1$, $x_2$, $x_3$, and $x_4$ denote the house size, the number of rooms, the number of floors, and the age of the house, respectively.

With these additional features, we also introduce some new notation:

  • $n$: the number of features
  • $x^{(i)}$: the input (features) of the $i$-th training example, i.e., the $i$-th row of the feature matrix
  • $x_j^{(i)}$: the value of feature $j$ in the $i$-th row of the feature matrix

Therefore, the hypothesis for multivariate linear regression is:
  $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$

This formula has $n+1$ parameters and $n$ variables. To simplify it, we introduce $x_0 = 1$ (that is, $x_0^{(i)} = 1$ for every example), so the formula becomes:
  $h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$

Now the formula has $n+1$ parameters and $n+1$ variables, so we can treat both as $(n+1)$-dimensional vectors: $\theta$ is the $(n+1)$-dimensional parameter vector and $x$ is the $(n+1)$-dimensional feature vector. The formula then simplifies to:
  $h_\theta(x) = \theta^T x$
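
As a concrete illustration, here is a minimal Octave sketch of the vectorized hypothesis for a single training example; the parameter and feature values are made up for the example.

```octave
% Hypothetical parameters theta_0..theta_3 and one training example (x_0 = 1 prepended).
theta = [80; 0.1; 25; -3];     % made-up parameter values
x     = [1; 1416; 3; 30];      % x_0 = 1, size, number of rooms, age (made-up values)

% h_theta(x) = theta' * x
h = theta' * x;
printf('Predicted price: %.2f\n', h);
```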

Supplementary Notes
Multiple Features

Linear regression with multiple variables is also known as "multivariate linear regression".

We now introduce notation for equations where we can have any number of input variables.

  • $x_j^{(i)}$ = value of feature $j$ in the $i$-th training example
  • $x^{(i)}$ = the input (features) of the $i$-th training example

Note:

  • m = the number of training examples
  • n = the number of features

The multivariable form of the hypothesis function accommodating these multiple features is as follows:
  $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$

In order to develop intuition about this function, we can think about $\theta_0$ as the basic price of a house, $\theta_1$ as the price per square meter, $\theta_2$ as the price per floor, etc. $x_1$ will be the number of square meters in the house, $x_2$ the number of floors, etc.

Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:
  $h_\theta(x) = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta^T x$

This is a vectorization of our hypothesis function for one training example.
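
Going one step further than the single-example case, the same matrix product can score every training example at once by stacking the examples into a design matrix; a small Octave sketch with made-up numbers:

```octave
% Design matrix: one row per training example, first column is x_0 = 1.
X = [1 2104 5;
     1 1416 3;
     1  852 2];                % made-up sizes and room counts
theta = [80; 0.1; 25];         % made-up parameters

% Predictions for all m examples at once: h is an m x 1 vector.
h = X * theta;
disp(h);
```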

Gradient Descent for Multiple Variables

As with single-variable linear regression, we again construct a cost function $J$:
  $J(\theta_0, \theta_1, \ldots, \theta_n) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

Our goal is the same as in single-variable linear regression: find the set of parameters that minimizes the cost function. In single-variable linear regression we used gradient descent to find those parameters, so in multivariate linear regression we use gradient descent as well.

That is:
  repeat until convergence: $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \ldots, \theta_n)$  (simultaneously for every $j = 0, \ldots, n$)

Working out the partial derivatives gives:
  repeat until convergence: $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$  (simultaneously for every $j = 0, \ldots, n$)
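
A minimal Octave sketch of this update rule, assuming X is the m × (n+1) design matrix (including the column of ones), y is the m × 1 vector of targets, and gradientDescent is simply a name chosen here:

```octave
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
  % Batch gradient descent for multivariate linear regression.
  m = length(y);
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    % Vectorized form of the update; all theta_j are updated simultaneously.
    theta = theta - (alpha / m) * (X' * (X * theta - y));
    % Record the cost J(theta) after this step, for later inspection.
    J_history(iter) = (1 / (2 * m)) * sum((X * theta - y) .^ 2);
  end
end
```

The vectorized line is equivalent to looping over j = 0, …, n, but it updates every parameter in one matrix operation.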

Supplementary Notes
Gradient Descent for Multiple Variables

The gradient descent equation itself is generally the same form; we just have to repeat it for our 'n' features:
  repeat until convergence: $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$  for $j := 0, 1, \ldots, n$

In other words:
  $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$
  $\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_1^{(i)}$
  $\theta_2 := \theta_2 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_2^{(i)}$
  $\cdots$

(Figure: gradient descent with one variable compared with gradient descent with multiple variables.)

Feature Scaling

When there are multiple features, gradient descent converges much faster if the features are all on a similar scale.

Take housing-price prediction as an example again, and suppose we use only two features: the size of the house, ranging from 0 to 2000 square feet, and the number of rooms, ranging from 0 to 5. We then draw a contour plot of the cost function with the two parameters on the horizontal and vertical axes.

In that plot the contour ellipses are very elongated, and the red trajectory shows that gradient descent needs many iterations to converge.

Therefore, to make gradient descent converge faster, we use feature scaling and mean normalization. Feature scaling divides each feature by its range (the maximum value minus the minimum value), so that the new values span a range of roughly 1, i.e. $-1 \le x_{(i)} \le 1$; mean normalization subtracts the feature's mean from each value, so that the new mean is 0. The two are usually implemented together with the following formula:
  $x_n := \frac{x_n - \mu_n}{s_n}$

where $\mu_n$ is the mean of the feature and $s_n$ is its standard deviation (or its range, i.e. max − min).
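
A sketch of both operations in Octave, assuming X holds one feature per column (without the column of ones); featureNormalize is a name chosen here, and the standard deviation is used for $s_n$:

```octave
function [X_norm, mu, sigma] = featureNormalize(X)
  % Subtract each column's mean and divide by its standard deviation.
  mu     = mean(X);             % 1 x n row vector of feature means
  sigma  = std(X);              % 1 x n row vector of feature standard deviations
  X_norm = (X - mu) ./ sigma;   % relies on Octave's automatic broadcasting
end
```

Returning mu and sigma matters because any new example must be normalized with the same values before a prediction is made.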

Supplementary Notes
Feature Scaling

We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.

The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally:
  $-1 \le x_{(i)} \le 1$
or
  $-0.5 \le x_{(i)} \le 0.5$

These aren't exact requirements; we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges, give or take a few.

Two techniques to help with this are feature scaling and mean normalization. Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value for an input variable from the values for that input variable, resulting in a new average value for the input variable of just zero. To implement both of these techniques, adjust your input values as shown in this formula:
  $x_i := \frac{x_i - \mu_i}{s_i}$

where $\mu_i$ is the average of all the values for feature $i$ and $s_i$ is the range of values (max − min), or $s_i$ is the standard deviation.

Note that dividing by the range, or dividing by the standard deviation, give different results.
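
A tiny Octave check of that remark on a made-up feature column:

```octave
x = [0; 1000; 2000];                             % made-up house sizes
by_range = (x - mean(x)) / (max(x) - min(x));    % divide by the range
by_std   = (x - mean(x)) / std(x);               % divide by the standard deviation
disp([by_range by_std]);                         % the two columns differ
```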

Learning Rate α

The number of iterations gradient descent needs to converge varies from model to model, and in practice it is hard to predict in advance how many iterations will be required. Instead, we usually plot the cost function against the number of iterations and use that curve to judge whether gradient descent has converged.

This curve can also support automatic convergence tests. (Note: an automatic convergence test is an algorithmic check of whether gradient descent has converged. Typically you choose a reasonable threshold ε and compare it with the decrease in the cost function J(θ): if J(θ) decreases by less than ε in one iteration, gradient descent is declared to have converged. Choosing ε is quite difficult, however, so in practice we usually still judge convergence by inspecting the curve.)

Every iteration of gradient descent is affected by the learning rate α. If α is too small, gradient descent needs a great many iterations to converge; if α is too large, gradient descent may go wrong: the cost function may not decrease on every iteration, and the algorithm may overshoot the local minimum and fail to converge.
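
One way to see this effect is to run gradient descent with several learning rates and plot J(θ) against the iteration number. A sketch that reuses the hypothetical gradientDescent and featureNormalize functions sketched above, on made-up data:

```octave
% Made-up training data: [size, rooms] and prices.
data = [2104 3 399900; 1600 3 329900; 2400 3 369000; 1416 2 232000];
[X_norm, mu, sigma] = featureNormalize(data(:, 1:2));
X = [ones(size(data, 1), 1) X_norm];     % prepend the column of ones
y = data(:, 3);

alphas = [0.01 0.1 1.0];                 % candidate learning rates
hold on;
for k = 1:numel(alphas)
  [theta, J_history] = gradientDescent(X, y, zeros(3, 1), alphas(k), 50);
  plot(1:50, J_history);                 % one cost curve per learning rate
end
xlabel('Number of iterations'); ylabel('Cost J'); hold off;
```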

Supplementary Notes
Learning Rate

Debugging gradient descent. Make a plot with number of iterations on the x-axis. Now plot the cost function J(θ) over the number of iterations of gradient descent. If J(θ) ever increases, then you probably need to decrease α.

Automatic convergence test. Declare convergence if J(θ) decreases by less than E in one iteration, where E is some small value such as $10^{-3}$. However, in practice it's difficult to choose this threshold value.
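
A sketch of such a test in Octave, assuming J_history is the vector of costs recorded at each iteration (as in the gradientDescent sketch above) and using $10^{-3}$ as the example threshold:

```octave
epsilon = 1e-3;                                    % example convergence threshold
for iter = 2:numel(J_history)
  % Declare convergence once J decreases by less than epsilon in one iteration.
  if J_history(iter - 1) - J_history(iter) < epsilon
    printf('Converged after %d iterations\n', iter);
    break;
  end
end
```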

It has been proven that if learning rate α is sufficiently small, then J(θ) will decrease on every iteration.

To summarize:

  • if α is too small: slow convergence.
  • if α is too large: may not decrease on every iteration and thus may not converge.

Features and Polynomial Regression

We introduced multivariate linear regression above; now we look at polynomial regression, which lets us use the machinery of linear regression to fit very complicated functions, even nonlinear ones.

For example, sometimes we want to fit our data with a quadratic model ($h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2$), and sometimes with a cubic model ($h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3$), and so on.

Usually we need to look at the data first before deciding what form of model to use.

In addition, we can define:

  • $x_2 = x_1^2$
  • $x_3 = x_1^3$
    …

In this way we convert the polynomial regression model back into a linear regression model. (Note: because polynomial regression squares, cubes, etc. the variables $x_i$, feature scaling becomes necessary before running gradient descent.)
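
A sketch in Octave of turning a cubic model into linear regression by building the new feature columns and then scaling them; featureNormalize is the hypothetical helper sketched earlier:

```octave
x1 = [1; 2; 3; 4; 5] * 100;                       % made-up values of the original feature
X_poly = [x1, x1 .^ 2, x1 .^ 3];                  % new features: x2 = x1.^2, x3 = x1.^3
[X_poly, mu, sigma] = featureNormalize(X_poly);   % scaling matters: the columns' ranges differ wildly
X = [ones(size(x1)) X_poly];                      % prepend x0 = 1, then run gradient descent as usual
```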

Supplementary Notes
Features and Polynomial Regression

We can improve our features and the form of our hypothesis function in a couple different ways.

We can combine multiple features into one. For example, we can combine x1 and x2 into a new feature x3 by taking x1 * x2.
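
In Octave this combination is just an element-wise product of the two feature columns (made-up values below):

```octave
x1 = [100; 80; 120];     % e.g. frontage of a plot (made-up values)
x2 = [ 50; 40;  65];     % e.g. depth of a plot (made-up values)
x3 = x1 .* x2;           % combined feature, e.g. area = frontage * depth
```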

Polynomial Regression

Our hypothesis function need not be linear (a straight line) if that does not fit the data well.

We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).

For example, if our hypothesis function is $h_\theta(x) = \theta_0 + \theta_1 x_1$, then we can create additional features based on $x_1$, to get the quadratic function $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2$ or the cubic function $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3$.

In the cubic version, we have created new features $x_2 = x_1^2$ and $x_3 = x_1^3$.

To make it a square root function, we could do:
  $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 \sqrt{x_1}$

One important thing to keep in mind is, if you choose your features this way then feature scaling becomes very important.

e.g. if $x_1$ has range 1–1000, then the range of $x_1^2$ becomes 1–1,000,000 and that of $x_1^3$ becomes 1–1,000,000,000.
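
A quick Octave check of those ranges:

```octave
x1 = (1:1000)';
printf('range of x1.^2: %d to %d\n', min(x1 .^ 2), max(x1 .^ 2));   % 1 to 1000000
printf('range of x1.^3: %d to %d\n', min(x1 .^ 3), max(x1 .^ 3));   % 1 to 1000000000
```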
