Setting Up the Programming Environment
Octave is recommended here; if you already have Matlab installed, that works as well. How to install Octave or Matlab is not covered here; please consult the relevant instructions yourself.
Multiple Features
We previously studied linear regression with a single variable; now we continue with the housing-price example to study linear regression with multiple variables.
As shown in the figure above, we add some features to the housing-price model, such as the number of bedrooms, the number of floors, and the age of the house. We let x_1, x_2, x_3, and x_4 denote the size of the house, the number of bedrooms, the number of floors, and the age of the house, respectively.
With these additional features we also need to introduce some new notation:
- n: the number of features
- x^(i): the i-th training example, i.e. the i-th row of the feature matrix
- x_j^(i): the value of feature j in the i-th training example (the j-th entry of the i-th row of the feature matrix)
The hypothesis for multivariate linear regression is therefore:
h_θ(x) = θ_0 + θ_1x_1 + θ_2x_2 + ··· + θ_nx_n
This formula has n+1 parameters and n variables. To simplify it we introduce x_0 = 1 (that is, x_0^(i) = 1 for every example), and the formula becomes:
h_θ(x) = θ_0x_0 + θ_1x_1 + θ_2x_2 + ··· + θ_nx_n
The formula now has n+1 parameters and n+1 variables, so we can treat the parameters and the variables as (n+1)-dimensional vectors (θ is the (n+1)-dimensional parameter vector and x is the (n+1)-dimensional feature vector), which lets us simplify the formula to:
h_θ(x) = θ^T x
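As a quick illustration (the numbers and variable names below are made up, not from the course), here is a minimal Octave sketch of computing h_θ(x) = θ^T x for a single example, and for a whole training set stacked into a design matrix:

```matlab
% One training example: size, bedrooms, floors, age (illustrative values).
x = [2104; 3; 2; 40];
x = [1; x];                      % prepend x_0 = 1, giving an (n+1)-vector
theta = [80; 0.1; 10; 5; -2];    % an (n+1)-dimensional parameter vector
h = theta' * x;                  % hypothesis h_theta(x) = theta' * x

% For m examples stacked as rows of a design matrix X (with a leading
% column of ones), all predictions are computed at once:
X = [1 2104 3 2 40;
     1 1416 2 1 30];
predictions = X * theta;         % m-vector of h_theta(x^(i))
```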
Supplementary Notes
Multiple Features
Linear regression with multiple variables is also known as "multivariate linear regression".
We now introduce notation for equations where we can have any number of input variables.
- x_j^(i) = value of feature j in the i-th training example
- x^(i) = the input (features) of the i-th training example
Note:
- m = the number of training examples
- n = the number of features
The multivariable form of the hypothesis function accommodating these multiple features is as follows:
hθ(x) = θ0+θ1x1+θ2x2+···+θnxn
In order to develop intuition about this function, we can think about θ0 as the basic price of a house, θ1 as the price per square meter, θ2 as the price per floor, etc. x1 will be the number of square meters in the house, x2 the number of floors, etc.
Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:
h_θ(x) = [θ_0 θ_1 ··· θ_n] [x_0; x_1; ···; x_n] = θ^T x
This is a vectorization of our hypothesis function for one training example.
Gradient Descent for Multiple Variables
As with univariate linear regression, we again define a cost function J:
J(θ_0, θ_1, ..., θ_n) = 1/(2m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))^2
Our goal is the same as in the univariate case: to find the set of parameters that minimizes the cost function. In univariate linear regression we used gradient descent to find those parameters, so for multivariate linear regression we again use gradient descent.
That is:
repeat until convergence {
  θ_j := θ_j − α · ∂/∂θ_j J(θ_0, ..., θ_n)
} (simultaneously updating θ_j for every j = 0, ..., n)
Taking the partial derivatives gives:
repeat until convergence {
  θ_j := θ_j − α · (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x_j^(i)
} (simultaneously updating θ_j for every j = 0, ..., n)
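The update rule above translates almost directly into Octave. Below is a minimal sketch, assuming X is the m×(n+1) design matrix whose first column is all ones, y is the m×1 vector of targets, and gradientDescentMulti is just an illustrative function name:

```matlab
function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
  % Run num_iters iterations of gradient descent, returning the final
  % parameters and the history of the cost J(theta) after each iteration.
  m = length(y);
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    errors = X * theta - y;                       % h_theta(x^(i)) - y^(i) for all i
    theta = theta - (alpha / m) * (X' * errors);  % simultaneous update of every theta_j
    J_history(iter) = sum((X * theta - y) .^ 2) / (2 * m);
  end
end
```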
Supplementary Notes
Gradient Descent for Multiple Variables
The gradient descent equation itself is generally the same form; we just have to repeat it for our 'n' features:
repeat until convergence: {
  θ_0 := θ_0 − α · (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x_0^(i)
  θ_1 := θ_1 − α · (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x_1^(i)
  θ_2 := θ_2 − α · (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x_2^(i)
  ...
}
In other words:
repeat until convergence: {
  θ_j := θ_j − α · (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x_j^(i)   for j := 0, ..., n
}
The following image compares gradient descent with one variable to gradient descent with multiple variables:
Feature Scaling
When we have multiple features, gradient descent converges much faster if the features are all on a similar scale.
Take housing-price prediction again, and suppose we use only two features: the size of the house, which ranges from 0 to 2000 square feet, and the number of bedrooms, which ranges from 0 to 5. We then draw the contour plot of the cost function with these two parameters on the horizontal and vertical axes.
From the plot we can see that the contours are very elongated ellipses, and the red trajectory in the figure shows that gradient descent needs many iterations to converge.
To make gradient descent converge faster, we therefore use feature scaling and mean normalization. Feature scaling divides each feature by its range (the maximum value minus the minimum value), so that the feature's new values lie in a range of roughly 1, i.e. −1 ≤ x_(i) ≤ 1. Mean normalization subtracts the mean of a feature from each of its values, so that the feature's new mean is 0. We usually implement both at once with the formula:
x_n := (x_n − μ_n) / s_n
where μ_n is the mean of the feature and s_n is its standard deviation (or the difference between the maximum and minimum values, i.e. max − min).
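A minimal Octave sketch of this transformation follows (featureNormalize is an illustrative name, and the element-wise division relies on implicit broadcasting, available in Octave and in MATLAB R2016b or later):

```matlab
function [X_norm, mu, sigma] = featureNormalize(X)
  % X holds the raw features only (no column of ones). mu and sigma are
  % returned so the same transform can be applied to new examples later.
  mu = mean(X);                % 1 x n vector of per-feature means
  sigma = std(X);              % 1 x n vector of per-feature standard deviations
  X_norm = (X - mu) ./ sigma;  % (x_n - mu_n) / s_n, column by column
end
```

To scale by the range instead of the standard deviation, sigma = max(X) - min(X) could be used in place of std(X).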
Supplementary Notes
Feature Scaling
We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.
The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally:
-1 ≤ x(i) ≤1
or
-0.5 ≤ x(i) ≤0.5
These aren't exact requirements; we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges, give or take a few.
Two techniques to help with this are feature scaling and mean normalization. Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value for an input variable from the values for that input variable, resulting in a new average value for the input variable of just zero. To implement both of these techniques, adjust your input values as shown in this formula:
x_i := (x_i − μ_i) / s_i
Where μi is the average of all the values for feature (i) and si is the range of values (max - min), or si is the standard deviation.
Note that dividing by the range and dividing by the standard deviation give different results.
Learning Rate α
The number of iterations gradient descent needs in order to converge varies from model to model, and it is hard to tell in advance how many iterations will be required. In practice, we plot the cost function against the number of iterations and use this curve to judge whether gradient descent has converged.
This kind of plot also makes automatic convergence tests possible. (Note: an automatic convergence test is an algorithmic check for convergence; it usually compares the decrease of the cost function J(θ) in one iteration against a chosen threshold ε and declares convergence once the decrease falls below ε. Choosing a good ε is difficult, however, so in practice we usually still judge convergence by inspecting the curve.)
Every iteration of gradient descent is affected by the learning rate α. If α is too small, gradient descent needs a very large number of iterations to converge; if α is too large, gradient descent may go wrong: the cost function may fail to decrease on some iterations, and the algorithm may overshoot the local minimum and fail to converge.
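One practical way to pick α is to run gradient descent with a few candidate values and plot the cost curves side by side. The Octave sketch below assumes X and y are already defined and reuses the gradientDescentMulti sketch from above; the candidate values of α are only examples:

```matlab
alphas = [0.01 0.03 0.1 0.3];                 % illustrative candidate learning rates
num_iters = 400;
figure; hold on;
for k = 1:length(alphas)
  theta0 = zeros(size(X, 2), 1);              % start from theta = 0
  [~, J_history] = gradientDescentMulti(X, y, theta0, alphas(k), num_iters);
  plot(1:num_iters, J_history);               % J(theta) against iteration number
end
xlabel('Number of iterations');
ylabel('Cost J(\theta)');
legend('\alpha = 0.01', '\alpha = 0.03', '\alpha = 0.1', '\alpha = 0.3');
hold off;
```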
Supplementary Notes
Learning Rate
Debugging gradient descent. Make a plot with the number of iterations on the x-axis. Now plot the cost function J(θ) over the number of iterations of gradient descent. If J(θ) ever increases, then you probably need to decrease α.
Automatic convergence test. Declare convergence if J(θ) decreases by less than E in one iteration, where E is some small value such as 10^(−3). However, in practice it is difficult to choose this threshold value.
It has been proven that if learning rate α is sufficiently small, then J(θ) will decrease on every iteration.
To summarize:
- if α is too small: slow convergence.
- if α is too large: may not decrease on every iteration and thus may not converge.
Features and Polynomial Regression
We have covered linear regression with multiple variables; now we look at polynomial regression, which lets us use the machinery of linear regression to fit very complicated functions, even nonlinear ones.
For example, sometimes we may want to fit the data with a quadratic model (h_θ(x) = θ_0 + θ_1x_1 + θ_2x_1^2), and sometimes with a cubic model (h_θ(x) = θ_0 + θ_1x_1 + θ_2x_1^2 + θ_3x_1^3), and so on.
Usually we need to look at the data first and then decide what form of model to use.
In addition, we can define new features:
- x_2 = x_1^2
- x_3 = x_1^3
- ...
which turns these polynomial regression models back into a linear regression model. (Note: because polynomial regression takes squares, cubes, and so on of the variables, it is necessary to apply feature scaling before running gradient descent.)
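A minimal Octave sketch of this construction, reusing the featureNormalize sketch from the feature-scaling section (x1 and its values are purely illustrative):

```matlab
x1 = [1000; 1500; 2000];                        % raw feature, e.g. house size
X_poly = [x1, x1 .^ 2, x1 .^ 3];                % new features x_2 = x_1^2, x_3 = x_1^3
[X_poly, mu, sigma] = featureNormalize(X_poly); % scale each column before gradient descent
X = [ones(size(X_poly, 1), 1), X_poly];         % prepend the x_0 = 1 column
% X can now be used with the same gradient descent as ordinary linear regression.
```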
Supplementary Notes
Features and Polynomial Regression
We can improve our features and the form of our hypothesis function in a couple different ways.
We can combine multiple features into one. For example, we can combine x1 and x2 into a new feature x3 by taking x1 * x2.
Polynomial Regression
Our hypothesis function need not be linear (a straight line) if that does not fit the data well.
We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
For example, if our hypothesis function is h_θ(x) = θ_0 + θ_1x_1, then we can create additional features based on x_1 to get the quadratic function h_θ(x) = θ_0 + θ_1x_1 + θ_2x_1^2 or the cubic function h_θ(x) = θ_0 + θ_1x_1 + θ_2x_1^2 + θ_3x_1^3.
In the cubic version, we have created new features x_2 = x_1^2 and x_3 = x_1^3.
To make it a square root function, we could do:
h_θ(x) = θ_0 + θ_1x_1 + θ_2√x_1
One important thing to keep in mind is that if you choose your features this way, then feature scaling becomes very important.
E.g. if x_1 has range 1–1000, then the range of x_1^2 becomes 1–1,000,000 and that of x_1^3 becomes 1–1,000,000,000.