Preface:
This series follows the Python programming assignments of Stanford's CS231n course, working through the course's main content and some of the mathematical derivations. The course's study materials and code are here:
Videos and slides
Notes
assignment1 starter code
Part 1: Linear Classifiers
- Score function: maps the raw data (the input data, after preprocessing) to a score for each class (the higher the score, the more likely that class).
score function: map the raw data to class scores.
- Loss (cost) function: quantifies the discrepancy between the predicted scores and the true class labels.
loss/cost function: quantify the agreement between the predicted scores and the ground truth labels.
- Optimization: minimize the loss function (with respect to the parameters/weights of the score function).
optimization: minimize the loss function with respect to the parameters of the score function.
- Datasets (images, in this course):
· Training set: used to train the model (i.e., its representational capacity).
· Validation set: used to tune the model's hyperparameters for best performance.
· Test set: used to measure the model's generalization ability.
Suppose we have a training set of images X, a matrix of size [N, D], where N is the number of samples and D is the dimensionality of each sample; xi is the i-th row of X, i.e., the i-th sample.
y is a vector of size [1, N]; yi is the true class of the i-th sample, yi = 1, 2, 3, ..., C.
Suppose we have the linear mapping:
---------------------------------------------> f(xi, W, b) = xiW + b <----------------------------------------
Here W is the weight matrix of size [D, C]; the j-th column of W (1 ≤ j ≤ C) gives the linear mapping of xi onto class j; b is the bias vector of size [1, C]; and f has size [N, C]. (ps: the formulas here are arranged to match the code; all the formulas below are written with the programming in mind)
The value of f(xi, W, b) is the score of sample xi on each of the C classes, and our final goal is to learn W and b such that f matches the ground truth across the dataset, i.e., the true class receives the highest score.
For intuition, here is an example (slightly different from the formula above, but that does not affect understanding):
The result in the figure rates the image as most likely a dog, which means W and b have not been trained well yet.
For the geometric and template interpretations of linear classifiers, see the cs231n notes directly; they are not repeated here.
To simplify computation, we can merge b into W by appending b as the last row of W, so that W has size [D+1, C]. Accordingly, xi gains one extra constant dimension equal to 1, making xi of size [1, D+1] (don't forget this when coding); the mapping above then becomes f(xi, W).
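The bias trick can be verified in a few lines of NumPy (the sizes below are made up for illustration):

```python
import numpy as np

# Toy sizes for illustration: N samples, D features, C classes
N, D, C = 2, 3, 4
rng = np.random.RandomState(0)
X = rng.randn(N, D)
W = rng.randn(D, C)
b = rng.randn(1, C)

# Original mapping: f = xW + b
f1 = X.dot(W) + b

# Bias trick: append a constant 1 to each sample, and b as the last row of W
X_ext = np.hstack([X, np.ones((N, 1))])   # size [N, D+1]
W_ext = np.vstack([W, b])                 # size [D+1, C]
f2 = X_ext.dot(W_ext)

assert np.allclose(f1, f2)                # identical scores
```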
- Data preprocessing (Part 3 explains why preprocessing is needed)
In machine learning, normalizing the input features (here, pixel values in [0, 255]) is very common and necessary, especially for datasets whose dimensions differ greatly in scale. For images, however, mean subtraction alone usually suffices (every pixel dimension already lies in [0, 255]): compute the mean image over the training set, then subtract it from every image (in the training, validation, and test sets alike); normalization and whitening are generally unnecessary. In numpy this is: X -= np.mean(X, axis=0).
1. Multiclass SVM loss
SVM loss: for each image sample, the score of the correct class should be higher than the scores of the incorrect classes by at least Δ (in practice Δ is usually set to 1; we do not treat Δ as a parameter to train, because any change in Δ can be absorbed into a rescaling of W, so it suffices to train W). Here is a figure to aid understanding:
The linear mapping above takes the pixel values of the i-th sample as input and outputs its scores on the C classes as a score vector of size [1, C]. Writing sj = f(xi, W)j for the score of the i-th sample on class j, the multiclass SVM loss is:
------------------------------------------> Li = ∑j≠yi max(0, sj?syi+Δ) <-----------------------------------
From the expression, when syi ≥ sj + Δ for every j ≠ yi, we get Li = 0, meaning the prediction is correct (with margin); otherwise Li > 0, meaning the prediction violates the margin.
We can rewrite Li as:
-------------------------------------> Li = ∑j≠yi max(0, (xiW)j?(xiW)yi+Δ) <-----------------------------
The function max(0, −) above is called the hinge loss. Sometimes max(0, −)² is used instead, called the squared hinge loss SVM (or L2-SVM), which penalizes violated margins more strongly. The specific form can be chosen by cross-validation (in most cases we use the former). (ps: here is a blog post introducing the hinge loss)
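As a quick numeric check of the hinge loss formula (the scores and label below are made up):

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])  # hypothetical scores s for C = 3 classes
y = 0                                # assume class 0 is the correct class
delta = 1.0

margins = np.maximum(0, scores - scores[y] + delta)
margins[y] = 0.0                     # the sum skips j == y_i
L_i = np.sum(margins)                # max(0, 5.1-3.2+1) + max(0, -1.7-3.2+1) = 2.9
```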
2. Regularization
The loss function above has a flaw: W is not unique. If some W makes the loss zero, then so does λW for any λ > 1. To pin down a unique W for classification, we add a regularization penalty R(W), usually the (squared) L2 norm:
-------------------------------------------> R(W) = ∑k∑s (Wk,s)2 <-----------------------------------------
With the penalty term added, the complete loss function is:
------------------------------------------> L = (1/N)∑iLi + λR(W) <----------------------------------------
where λ can be chosen by cross-validation.
The main effect of penalizing the parameters is actually to prevent overfitting and improve the model's generalization ability. Note also that the bias b does not modulate the strength of the input features' influence, so in principle b need not be penalized (since b was merged into W, assignment1 does in fact penalize b as well, but the effect is negligible).
Solving for W later requires the partial derivatives of L with respect to W; we give them here (the derivation is simple; Δ has been replaced by 1):
---------------------------> ?Wyi Li = - xiT(∑j≠yi1(xiWj - xiWyi +1>0)) + 2λWyi <----------------------
----------------------------> ?Wj Li = xiT 1(xiWj - xiWyi +1>0) + 2λWj , (j≠yi) <-----------------------
where 1(·) is the indicator function: 1(expression is true) = 1 and 1(expression is false) = 0.
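The gradient formulas above can be checked numerically; the sketch below implements them for a single sample (regularization term omitted; the sizes and seed are arbitrary) and compares one entry of the analytic gradient against a centered-difference estimate:

```python
import numpy as np

def svm_loss_single(W, x, y, delta=1.0):
    """Hinge loss for one sample (regularization omitted)."""
    s = x.dot(W)
    margins = np.maximum(0, s - s[y] + delta)
    margins[y] = 0.0
    return np.sum(margins)

rng = np.random.RandomState(0)
W = rng.randn(5, 3) * 0.01              # arbitrary small weights, D=5, C=3
x = rng.randn(5)
y = 1

# Analytic gradient, following the formulas above (delta = 1)
s = x.dot(W)
active = (s - s[y] + 1.0 > 0)           # the indicator 1(.) per class
active[y] = False
dW = np.outer(x, active.astype(float))  # wrong-class columns get +x
dW[:, y] = -np.sum(active) * x          # correct-class column gets -x per active margin

# Centered-difference check on one entry of W
h = 1e-5
Wp = W.copy(); Wp[0, 0] += h
Wm = W.copy(); Wm[0, 0] -= h
num = (svm_loss_single(Wp, x, y) - svm_loss_single(Wm, x, y)) / (2 * h)
assert abs(num - dW[0, 0]) < 1e-4
```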
3. Softmax classifier
Softmax is the generalization of binary logistic regression to multiple classes.
The function f stays the same; we replace the hinge loss with the cross-entropy loss, whose per-sample form is as follows (log is natural, log(e) = 1):
--------------------------------------> Li = -log(exp(fyi)/∑j exp(fj)) <--------------------------------------
where the function fj(z) = exp(zj)/∑k exp(zk) is called the softmax function. Its output is in effect a probability distribution of the input sample xi over the C classes, and the expression above is the cross-entropy of that distribution (not the relative entropy, although it may look like it; below I rewrite Li slightly to restore its true form; cross-entropy can be viewed as entropy plus relative entropy).
First recall the cross-entropy from information theory: H(p, q) = -∑x p(x)·log q(x), where p is the true distribution and q the fitted one. Now rewrite Li:
----------------------------------> Li = -∑k pi,klog(exp(fk)/∑j exp(fj)) <-----------------------------------
where pi = [0, 0, ..., 0, 1, 0, ..., 0, 0] has size [1, C] with pi,k = pi[k]; only pi[yi] = 1 and all other entries are 0. How does it look now?
When actually computing the softmax function in code, numeric stability can become a problem (during the computation, exp(fyi) and ∑j exp(fj) can grow very large, and dividing very large numbers is numerically unstable). To avoid this, we can transform the expression as follows:
where the constant C is usually chosen so that logC = -maxj fj, i.e., -logC is the maximum of each row of f.
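A minimal sketch of the shift-by-max trick just described (the scores are made up; without the shift, np.exp would overflow):

```python
import numpy as np

f = np.array([123.0, 456.0, 789.0])   # made-up scores; np.exp(789.0) would overflow
f_shift = f - np.max(f)               # logC = -max_j f_j, as above
prob = np.exp(f_shift) / np.sum(np.exp(f_shift))

assert np.isclose(np.sum(prob), 1.0)  # a valid probability distribution
```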
Now, combining with the penalty term, the total loss function is:
------------------> L = -(1/N)∑i∑k 1(k=yi)·log(exp(fk)/∑j exp(fj)) + λR(W) <------------------
Solving for W later requires the partial derivative of L with respect to W; here is the result first, followed by the derivation:
--------------> ?Wk L = -(1/N)∑i xiT(pi,k - Pk) + 2λWk , where Pk = exp(fk)/∑j exp(fj) <----------
The derivation is as follows:
Here is a figure giving an intuitive feel for how SVM and Softmax differ in computing the loss:
4. Optimization
Optimization is the process of training the parameters (weights and biases) on the training set to minimize the loss function. The hyperparameters (learning rate, penalty factor λ, etc.) are then tuned on the validation set to obtain the best model, and the test set measures the model's generalization ability.
We usually train the parameters with gradient descent combined with backpropagation. For the concrete update rule we use the vanilla update here (the different update rules are covered in detail in Part 3, on neural networks): x += - learning_rate * dx, where x is the parameter being updated.
There are many variants of gradient descent; usually we use mini-batch gradient descent (see the course notes for details).
ps: in the programming assignment you will notice the prompt says stochastic gradient descent (SGD) while the code actually uses mini-batches; so when you hear someone say they optimize with SGD, don't be surprised — in practice they mean mini-batches.
As for backpropagation, it is just the chain rule; we won't expand on it here (see the course notes). In fact we have already given the results: the partial derivatives above. We will return to it in the neural networks part.
Part 2: Python Programming Assignment (linear classifiers)
· The IDE I use is Pycharm.
· For the linear classifier part of Assignment1, we need to complete linear_svm.py, softmax.py, and linear_classifier.py. Afterwards you can use the code in svm.ipynb and softmax.ipynb to debug your model, obtain the best model, and then measure classification accuracy on the test set.
· Assignment1 uses the CIFAR-10 image dataset, which you can also download here.
The code for linear_svm.py is as follows:
__coauthor__ = 'Deeplayer'
# 5.19.2016

import numpy as np


def svm_loss_naive(W, X, y, reg):
    """
    Structured SVM loss function, naive implementation (with loops).
    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength
    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    dW = np.zeros(W.shape)    # initialize the gradient as zero
    # compute the loss and the gradient
    num_classes = W.shape[1]
    num_train = X.shape[0]
    loss = 0.0
    for i in range(num_train):
        scores = X[i].dot(W)
        correct_class_score = scores[y[i]]
        for j in range(num_classes):
            if j == y[i]:
                continue
            margin = scores[j] - correct_class_score + 1    # note delta = 1
            if margin > 0:
                loss += margin
                dW[:, y[i]] += -X[i, :]    # accumulate the correct-class gradient
                dW[:, j] += X[i, :]        # accumulate the wrong-class gradient
    # Right now the loss is a sum over all training examples, but we want it
    # to be an average instead, so we divide by num_train.
    loss /= num_train
    dW /= num_train
    # Add regularization to the loss and gradient.
    loss += 0.5 * reg * np.sum(W * W)
    dW += reg * W
    return loss, dW


def svm_loss_vectorized(W, X, y, reg):
    """
    Structured SVM loss function, vectorized implementation. Inputs and outputs
    are the same as svm_loss_naive.
    """
    loss = 0.0
    dW = np.zeros(W.shape)    # initialize the gradient as zero
    scores = X.dot(W)         # N by C
    num_train = X.shape[0]
    scores_correct = scores[np.arange(num_train), y]             # shape (N,)
    scores_correct = np.reshape(scores_correct, (num_train, 1))  # N by 1
    margins = scores - scores_correct + 1.0                      # N by C
    margins[np.arange(num_train), y] = 0.0
    margins[margins <= 0] = 0.0
    loss += np.sum(margins) / num_train
    loss += 0.5 * reg * np.sum(W * W)
    # compute the gradient
    margins[margins > 0] = 1.0
    row_sum = np.sum(margins, axis=1)                  # shape (N,)
    margins[np.arange(num_train), y] = -row_sum
    dW += np.dot(X.T, margins) / num_train + reg * W   # D by C
    return loss, dW
The code for softmax.py is as follows:
__coauthor__ = 'Deeplayer'
# 5.19.2016

import numpy as np


def softmax_loss_naive(W, X, y, reg):
    """
    Softmax loss function, naive implementation (with loops).
    Inputs and outputs are the same as svm_loss_naive.
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)       # D by C
    dW_each = np.zeros_like(W)
    num_train, dim = X.shape
    num_class = W.shape[1]
    f = X.dot(W)                # N by C
    # Shift the scores for numeric stability
    f_max = np.reshape(np.max(f, axis=1), (num_train, 1))    # N by 1
    prob = np.exp(f - f_max) / np.sum(np.exp(f - f_max), axis=1, keepdims=True)    # N by C
    y_trueClass = np.zeros_like(prob)
    y_trueClass[np.arange(num_train), y] = 1.0
    for i in range(num_train):
        for j in range(num_class):
            loss += -(y_trueClass[i, j] * np.log(prob[i, j]))
            dW_each[:, j] = -(y_trueClass[i, j] - prob[i, j]) * X[i, :]
        dW += dW_each
    loss /= num_train
    loss += 0.5 * reg * np.sum(W * W)
    dW /= num_train
    dW += reg * W
    return loss, dW


def softmax_loss_vectorized(W, X, y, reg):
    """
    Softmax loss function, vectorized version.
    Inputs and outputs are the same as softmax_loss_naive.
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)    # D by C
    num_train, dim = X.shape
    f = X.dot(W)             # N by C
    # Shift the scores for numeric stability
    f_max = np.reshape(np.max(f, axis=1), (num_train, 1))    # N by 1
    prob = np.exp(f - f_max) / np.sum(np.exp(f - f_max), axis=1, keepdims=True)
    y_trueClass = np.zeros_like(prob)
    y_trueClass[np.arange(num_train), y] = 1.0    # N by C
    loss += -np.sum(y_trueClass * np.log(prob)) / num_train + 0.5 * reg * np.sum(W * W)
    dW += -np.dot(X.T, y_trueClass - prob) / num_train + reg * W
    return loss, dW
The code for linear_classifier.py is as follows:
__coauthor__ = 'Deeplayer'
# 5.19.2016

import numpy as np
from linear_svm import *
from softmax import *


class LinearClassifier(object):

    def __init__(self):
        self.W = None

    def train(self, X, y, learning_rate=1e-3, reg=1e-5, num_iters=100,
              batch_size=200, verbose=True):
        """
        Train this linear classifier using stochastic gradient descent.
        Inputs:
        - X: A numpy array of shape (N, D) containing training data; there are N
          training samples each of dimension D.
        - y: A numpy array of shape (N,) containing training labels; y[i] = c
          means that X[i] has label 0 <= c < C for C classes.
        - learning_rate: (float) learning rate for optimization.
        - reg: (float) regularization strength.
        - num_iters: (integer) number of steps to take when optimizing
        - batch_size: (integer) number of training examples to use at each step.
        - verbose: (boolean) If true, print progress during optimization.
        Outputs:
        A list containing the value of the loss function at each training iteration.
        """
        num_train, dim = X.shape
        # assume y takes values 0...K-1 where K is number of classes
        num_classes = np.max(y) + 1
        if self.W is None:
            # lazily initialize W
            self.W = 0.001 * np.random.randn(dim, num_classes)    # D by C
        # Run stochastic gradient descent (mini-batch) to optimize W
        loss_history = []
        for it in range(num_iters):
            # Sampling with replacement is faster than sampling without replacement.
            sample_index = np.random.choice(num_train, batch_size, replace=True)
            X_batch = X[sample_index, :]    # batch_size by D
            y_batch = y[sample_index]       # shape (batch_size,)
            # evaluate loss and gradient
            loss, grad = self.loss(X_batch, y_batch, reg)
            loss_history.append(loss)
            # perform parameter update
            self.W += -learning_rate * grad
            if verbose and it % 100 == 0:
                print('Iteration %d / %d: loss %f' % (it, num_iters, loss))
        return loss_history

    def predict(self, X):
        """
        Use the trained weights of this linear classifier to predict labels for
        data points.
        Inputs:
        - X: D x N array of data. Each column is a D-dimensional point.
        Returns:
        - y_pred: Predicted labels for the data in X. y_pred is a 1-dimensional
          array of length N, and each element is an integer giving the
          predicted class.
        """
        y_pred = np.argmax(np.dot(self.W.T, X), axis=0)    # shape (N,)
        return y_pred

    def loss(self, X_batch, y_batch, reg):
        """
        Compute the loss function and its derivative.
        Subclasses will override this.
        Inputs:
        - X_batch: A numpy array of shape (N, D) containing a minibatch of N
          data points; each point has dimension D.
        - y_batch: A numpy array of shape (N,) containing labels for the minibatch.
        - reg: (float) regularization strength.
        Returns: A tuple containing:
        - loss as a single float
        - gradient with respect to self.W; an array of the same shape as W
        """
        pass


class LinearSVM(LinearClassifier):
    """A subclass that uses the Multiclass SVM loss function."""

    def loss(self, X_batch, y_batch, reg):
        return svm_loss_vectorized(self.W, X_batch, y_batch, reg)


class Softmax(LinearClassifier):
    """A subclass that uses the Softmax + cross-entropy loss function."""

    def loss(self, X_batch, y_batch, reg):
        return softmax_loss_vectorized(self.W, X_batch, y_batch, reg)
Below is the code for tuning the hyperparameters to obtain the best model, along with some results and figures:
1. LinearClassifier_svm_start.py
__coauthor__ = 'Deeplayer'
# 5.20.2016

import numpy as np
import matplotlib.pyplot as plt
from linear_classifier import *
from data_utils import load_CIFAR10

# Load the raw CIFAR-10 data.
cifar10_dir = 'E:/PycharmProjects/ML/CS231n/cifar-10-batches-py'    # you should change this
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
# As a sanity check, we print out the size of the training and test data.
print('Training data shape: ', X_train.shape)      # (50000,32,32,3)
print('Training labels shape: ', y_train.shape)    # (50000,)
print('Test data shape: ', X_test.shape)           # (10000,32,32,3)
print('Test labels shape: ', y_test.shape)         # (10000,)
print()

# Visualize some examples from the dataset.
# We show a few examples of training images from each class.
classes = ['plane', 'car', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(classes)
samples_per_class = 7
for y, cls in enumerate(classes):
    idxs = np.flatnonzero(y_train == y)
    idxs = np.random.choice(idxs, samples_per_class, replace=False)
    for i, idx in enumerate(idxs):
        plt_idx = i * num_classes + y + 1
        plt.subplot(samples_per_class, num_classes, plt_idx)
        plt.imshow(X_train[idx].astype('uint8'))
        plt.axis('off')
        if i == 0:
            plt.title(cls)
plt.show()

# Split the data into train, val, and test sets.
num_training = 49000
num_validation = 1000
num_test = 1000
mask = range(num_training, num_training + num_validation)
X_val = X_train[mask]      # (1000,32,32,3)
y_val = y_train[mask]      # (1000,)
mask = range(num_training)
X_train = X_train[mask]    # (49000,32,32,3)
y_train = y_train[mask]    # (49000,)
mask = range(num_test)
X_test = X_test[mask]      # (1000,32,32,3)
y_test = y_test[mask]      # (1000,)

# Preprocessing 1: reshape the image data into rows
X_train = np.reshape(X_train, (X_train.shape[0], -1))    # (49000,3072)
X_val = np.reshape(X_val, (X_val.shape[0], -1))          # (1000,3072)
X_test = np.reshape(X_test, (X_test.shape[0], -1))       # (1000,3072)
# Preprocessing 2: subtract the mean image
mean_image = np.mean(X_train, axis=0)    # (3072,)
X_train -= mean_image
X_val -= mean_image
X_test -= mean_image
# Visualize the mean image
plt.figure(figsize=(4, 4))
plt.imshow(mean_image.reshape((32, 32, 3)).astype('uint8'))
plt.show()
# Bias trick, extending the data
X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])    # (49000,3073)
X_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))])          # (1000,3073)
X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])       # (1000,3073)

# Use the validation set to tune hyperparameters (regularization strength
# and learning rate).
learning_rates = [1e-7, 5e-5]
regularization_strengths = [5e4, 1e5]
results = {}
best_val = -1      # The highest validation accuracy that we have seen so far.
best_svm = None    # The LinearSVM object that achieved the highest validation rate.
iters = 1500
for lr in learning_rates:
    for rs in regularization_strengths:
        svm = LinearSVM()
        svm.train(X_train, y_train, learning_rate=lr, reg=rs, num_iters=iters)
        Tr_pred = svm.predict(X_train.T)
        acc_train = np.mean(y_train == Tr_pred)
        Val_pred = svm.predict(X_val.T)
        acc_val = np.mean(y_val == Val_pred)
        results[(lr, rs)] = (acc_train, acc_val)
        if best_val < acc_val:
            best_val = acc_val
            best_svm = svm
# print results
for lr, reg in sorted(results):
    train_accuracy, val_accuracy = results[(lr, reg)]
    print('lr %e reg %e train accuracy: %f val accuracy: %f' %
          (lr, reg, train_accuracy, val_accuracy))
print('Best validation accuracy achieved during validation: %f' % best_val)    # around 38.2%

# Visualize the learned weights for each class
w = best_svm.W[:-1, :]    # strip out the bias
w = w.reshape(32, 32, 3, 10)
w_min, w_max = np.min(w), np.max(w)
classes = ['plane', 'car', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']
for i in range(10):
    plt.subplot(2, 5, i + 1)
    # Rescale the weights to be between 0 and 255
    wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)
    plt.imshow(wimg.astype('uint8'))
    plt.axis('off')
    plt.title(classes[i])
plt.show()

# Evaluate the best svm on test set
Ts_pred = best_svm.predict(X_test.T)
test_accuracy = np.mean(y_test == Ts_pred)    # around 37.1%
print('LinearSVM on raw pixels of CIFAR-10 final test set accuracy: %f' % test_accuracy)
Below we visualize some of the raw images, the mean image, and the learned weights:
2. LinearClassifier_softmax_start.py
__coauthor__ = 'Deeplayer'
# 5.20.2016

import numpy as np
from data_utils import load_CIFAR10
from linear_classifier import *


def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000):
    """
    Load the CIFAR-10 dataset from disk and perform preprocessing to prepare
    it for the linear classifier. These are the same steps as we used for the SVM,
    but condensed to a single function.
    """
    # Load the raw CIFAR-10 data
    cifar10_dir = 'E:/PycharmProjects/ML/CS231n/cifar-10-batches-py'    # make a change
    X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
    # subsample the data
    mask = range(num_training, num_training + num_validation)
    X_val = X_train[mask]
    y_val = y_train[mask]
    mask = range(num_training)
    X_train = X_train[mask]
    y_train = y_train[mask]
    mask = range(num_test)
    X_test = X_test[mask]
    y_test = y_test[mask]
    # Preprocessing: reshape the image data into rows
    X_train = np.reshape(X_train, (X_train.shape[0], -1))
    X_val = np.reshape(X_val, (X_val.shape[0], -1))
    X_test = np.reshape(X_test, (X_test.shape[0], -1))
    # subtract the mean image
    mean_image = np.mean(X_train, axis=0)
    X_train -= mean_image
    X_val -= mean_image
    X_test -= mean_image
    # add bias dimension and transform into columns
    X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
    X_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))])
    X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])
    return X_train, y_train, X_val, y_val, X_test, y_test

# Invoke the above function to get our data.
X_train, y_train, X_val, y_val, X_test, y_test = get_CIFAR10_data()

# Use the validation set to tune hyperparameters (regularization strength
# and learning rate).
results = {}
best_val = -1
best_softmax = None
learning_rates = [1e-7, 5e-7]
regularization_strengths = [5e4, 1e4]
iters = 1500
for lr in learning_rates:
    for rs in regularization_strengths:
        softmax = Softmax()
        softmax.train(X_train, y_train, learning_rate=lr, reg=rs, num_iters=iters)
        Tr_pred = softmax.predict(X_train.T)
        acc_train = np.mean(y_train == Tr_pred)
        Val_pred = softmax.predict(X_val.T)
        acc_val = np.mean(y_val == Val_pred)
        results[(lr, rs)] = (acc_train, acc_val)
        if best_val < acc_val:
            best_val = acc_val
            best_softmax = softmax
# Print out results.
for lr, reg in sorted(results):
    train_accuracy, val_accuracy = results[(lr, reg)]
    print('lr %e reg %e train accuracy: %f val accuracy: %f' %
          (lr, reg, train_accuracy, val_accuracy))
# around 38.9%
print('best validation accuracy achieved during cross-validation: %f' % best_val)

# Evaluate the best softmax on test set.
Ts_pred = best_softmax.predict(X_test.T)
test_accuracy = np.mean(y_test == Ts_pred)    # around 37.4%
print('Softmax on raw pixels of CIFAR-10 final test set accuracy: %f' % test_accuracy)
Finally, taking the SVM as an example, here is a comparison of the speed of vectorized vs. non-vectorized implementations:
--> naive_vs_vectorized.py
__coauthor__ = 'Deeplayer'
# 5.20.2016

import time
import numpy as np
from linear_svm import *
from data_utils import load_CIFAR10


def get_CIFAR10_data(num_training=49000, num_dev=500):
    # Load the raw CIFAR-10 data
    cifar10_dir = 'E:/PycharmProjects/ML/CS231n/cifar-10-batches-py'    # make a change
    X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
    mask = range(num_training)
    X_train = X_train[mask]
    mask = np.random.choice(num_training, num_dev, replace=False)
    X_dev = X_train[mask]
    y_dev = y_train[mask]
    X_train = np.reshape(X_train, (X_train.shape[0], -1))
    X_dev = np.reshape(X_dev, (X_dev.shape[0], -1))
    mean_image = np.mean(X_train, axis=0)
    X_dev -= mean_image
    X_dev = np.hstack([X_dev, np.ones((X_dev.shape[0], 1))])
    return X_dev, y_dev

X_dev, y_dev = get_CIFAR10_data()

# generate a random SVM weight matrix of small numbers
W = np.random.randn(3073, 10) * 0.0001

tic = time.time()
loss_naive, grad_naive = svm_loss_naive(W, X_dev, y_dev, 0.00001)
toc = time.time()
print('Naive loss and gradient: computed in %fs' % (toc - tic))         # around 0.198s
tic = time.time()
loss_vectorized, grad_vectorized = svm_loss_vectorized(W, X_dev, y_dev, 0.00001)
toc = time.time()
print('Vectorized loss and gradient: computed in %fs' % (toc - tic))    # around 0.005s
Part 3: Neural Networks
- A neural network is a multi-layer structure built from artificial neurons, whose design is loosely inspired by the human brain; compared with a biological neuron, an artificial neuron is only a very crude model. Below is a side-by-side figure of a biological neuron and its mathematical model:
- From the mathematical model in the figure, an artificial neuron processes its input as follows:
take the inner product of the input x with the weights w ----> feed the result into an activation function ---> output the activation signal
- The perceptron and the sigmoid neuron are two important artificial neurons that carry the key ideas behind neural networks (see Michael Nielsen's Neural Networks and Deep Learning).
First, the sigmoid neuron; here is a figure:
A sigmoid neuron has several inputs x1, x2, x3, ...; each input has a weight w1, w2, ..., and there is a single overall bias b. The output is output = σ(w·x + b), where σ is called the sigmoid function, defined as:
---------------------------------------------> σ(z) = 1/(1 + exp(-z)) <---------------------------------------------
The curve of σ looks like this:
Its shape is a smoothed version of the step function:
The smoothness of σ is the key to its use as an activation function: it means that small changes Δwj in the weights and Δb in the bias produce a small change Δoutput in the neuron's output. In fact, Δoutput is well approximated by:
As the formula shows, Δoutput is a linear function of the changes in the weights and bias. This linearity makes it easy to choose small changes in the weights and biases to achieve any desired small change in the output.
Next, the structure of a neural network (ps: here we mean feedforward networks, which contain no cycles; information always flows forward and is never fed back). A typical network looks like this:
The figure above shows a 3-layer network with two hidden layers; adjacent layers are fully connected. The input layer holds the (preprocessed) image data, i.e., its number of neurons equals the dimensionality of the input image. There may be one or more hidden layers (networks with several are called artificial neural networks, ANN); in fact the last hidden layer can be viewed as a feature vector of the input image. The number of output neurons equals the number of image classes, and the output values can be read as per-class scores.
For classification, depending on the loss function chosen (SVM loss function or softmax loss function), the output layer can also be viewed as an SVM layer or a Softmax layer. Because the activation functions are nonlinear, a neural network is a nonlinear classifier.
---> (ps: the output-layer neurons have no activation function f)
The multi-layer structure gives neural networks very strong expressive power (the deeper the network and the more neurons, the stronger it is); in other words, a neural network can approximate an arbitrary function! For a visual proof, see here. However, the more hidden layers and neurons there are, the more easily the network overfits, in which case we use regularization (L2 regularization, dropout, etc.) to keep overfitting under control.
Next we discuss each component of a neural network in turn:
1. Choosing the activation function
We introduced the sigmoid function above, but in practice it is rarely used because it has serious drawbacks. First look at its derivative:
As the figure shows, the derivative of the sigmoid lies between 0 and 0.25. During backpropagation, σ′ is multiplied into the gradient: the gradient at an earlier layer is a product of factors from the later layers, so the gradients shrink toward 0 as we move toward the input. This is the vanishing gradient problem. To see why the gradient vanishes, consider a simplified 4-layer model with a single neuron per layer:
Here C is the cost function, aj = σ(zj) (note a4 = z4), and zj = wj·aj-1 + bj; we call zj the neuron's weighted input. Consider the gradient ?C/?b1 at the first hidden neuron; we state its expression directly (for the derivation, see here):
We can see that ?C/?b1 will be 1/16 of ?C/?b3 or smaller; this is the essential cause of the vanishing gradient. As a result, earlier hidden layers in a deep network learn more slowly than later ones, increasingly so toward the input, until learning effectively stops.
---> ps: this problem occurs with any activation function, but some activations can alleviate it. On this note, Batch Normalization deserves a mention: it mitigates the vanishing gradient problem to a large extent, bravo!
Beyond this, the sigmoid has two further drawbacks:
First, when the sigmoid's input is very small or very large, its derivative approaches 0, so during backpropagation the gradient approaches 0 as well; the neuron then saturates early and can no longer update effectively.
Second, the sigmoid's output (activation) is always positive. Why is that a problem? Using the 4-layer model above, you will find that during backpropagation the gradient is either uniformly positive or uniformly negative (depending on the sign of ?C/?a4). In other words, all weights w (and the bias b) connected into the same neuron increase together or decrease together. That is problematic, because some weights may need to move in different directions (there is no rigorous proof of this, but it is clearly more reasonable). Hence we usually prefer activation functions whose output is symmetric about 0.
Here are some activation functions that perform better than the sigmoid:
1. Tanh
The tanh neuron replaces the sigmoid with the hyperbolic tangent, defined as:
-------------------------------------> tanh(z) = (exp(z) - exp(-z))/(exp(z) + exp(-z)) <----------------------------------------
This can also be written tanh(z) = 2σ(2z) - 1, so tanh is a rescaled sigmoid. Its advantage over the sigmoid is that its output is symmetric about 0. Its curve looks like this:
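The identity tanh(z) = 2σ(2z) − 1 is easy to verify numerically:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3.0, 3.0, 13)
# tanh is the sigmoid, stretched in z and rescaled to (-1, 1)
assert np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)
```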
2. Rectified Linear Unit (ReLU)
ReLU has become a popular activation function for image recognition in recent years, defined as:
------------------------------------------> f(z) = max(0, z) <------------------------------------------------
Its curve looks like this:
ReLU's advantages are that it does not saturate (for positive inputs), it converges fast (i.e., it quickly approaches an extremum of the cost function), and it is cheap to compute (the function form is very simple). But fast convergence also makes ReLU fragile: if the updates are too aggressive, a neuron can be pushed into the z < 0 region before a good solution is found; its gradient then becomes 0 and it can never update again — it effectively dies. So with ReLU neurons, controlling the learning rate is very important. Also, its output is not zero-centered.
---> ps: in the neural network part of assignment1, we use ReLU as our activation function.
3. Leaky ReLU (LReLU)
Leaky ReLU is a modified ReLU that fixes the dying-ReLU problem, defined as:
-------------------------------------------> f(z) = max(αz, z) <------------------------------------------------
where α is a small positive value (e.g., 0.01). Its curve looks like this:
4. Maxout
Maxout is a generalization of ReLU and LReLU; its formula is:
----------------------------------------------> max(z1, z2), where zk = wk·x + bk <--------------------------------------------------
Note that this doubles the number of parameters.
5. Exponential Linear Units (ELU)
The ELU formula is:
Its curve looks like this:
Besides the advantages of LReLU, ELU has the nice property that its outputs are close to zero-mean; the price is higher computational cost.
---> ps: we normally use only one kind of activation function within a network.
2. Data preprocessing
As in Part 1, suppose we have an image training set X, a matrix of size [N, D], where N is the number of samples and D their dimensionality; xi is the i-th row of X, i.e., the i-th sample. y is a vector of size [1, N]; yi is the true class of the i-th sample, yi = 1, 2, 3, ..., C.
Common preprocessing steps are:
· mean subtraction
· normalization
· principal component analysis (PCA) and whitening
For images we generally only do mean subtraction (benefit 1: natural image data is stationary, i.e., the statistics of every dimension follow the same distribution; subtracting the mean removes the average brightness, and we care about content rather than illumination; benefit 2: it centers the data about 0): X -= np.mean(X, axis=0). Optionally we can further normalize by dividing each dimension by its standard deviation: X /= np.std(X, axis=0). We usually do not whiten, because it is too expensive (it requires the eigendecomposition of the covariance matrix). For details on preprocessing, see UFLDL and the course notes.
---> PS1: there is actually one more preprocessing step, vectorizing the images: an image of size [d1, d2] becomes a vector of size [1, D] with D = d1·d2. We usually don't count it as preprocessing, though.
---> PS2: why preprocess at all? Because preprocessing spreads the data out and speeds up convergence, i.e., it helps us reach a (local) minimum of the cost function faster. For intuition, I drew the figure below (using 2-D data as an example):
The figure uses ReLU neurons as an example, ReLU(wx+b) = max(wx+b, 0); the green and red lines are wx+b = 0. Only the red line actually splits the (uncentered) data, which means only a small fraction of our randomly initialized parameters do useful work, so backpropagation converges slowly. After mean subtraction, most of the lines split the data, and convergence is much faster.
3. Choosing the weight initialization
We usually initialize the weights from a Gaussian with mean 0 and a small standard deviation (e.g., 0.001); in numpy: w = 0.001 * np.random.randn(n). This works for small networks (and is what we use in the assignment1 code).
For deep networks, however, this initialization is poor. Take tanh as the activation: if the standard deviation is set too small, the activations in later layers all shrink toward 0, and the backpropagated gradients become very small too; if we make the standard deviation larger, the neurons saturate and the gradients again approach zero.
To address this we can use variance calibration:
· Practical experience shows that convergence is faster when every neuron's output has a similar distribution, whereas the random initialization above makes the output distributions vary widely across neurons.
· Setting the activation function aside, let a neuron's weighted input be s = ∑i wi·xi; the variances of s and x are then related as follows:
The result shows that for s to have the same distribution as the input x, the variance of w must be 1/n. So the initialization becomes: w = np.random.randn(n) / sqrt(n).
With ReLU as the activation, however, the output distributions across the layers change again; this paper studies the problem and proposes the fix w = np.random.randn(n) * sqrt(2.0/n), which resolves it.
As for the biases, we can simply initialize them to 0.
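The three initialization schemes above can be sketched as follows (the fan-in n is arbitrary):

```python
import numpy as np

n = 4096                                  # hypothetical fan-in
rng = np.random.RandomState(1)

w_small = 0.001 * rng.randn(n)            # small Gaussian (fine for small nets)
w_calib = rng.randn(n) / np.sqrt(n)       # variance 1/n, calibrated for tanh
w_he = rng.randn(n) * np.sqrt(2.0 / n)    # variance 2/n, recommended for ReLU

# The sample variances come out close to 1e-6, 1/n, and 2/n respectively
```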
4. Batch Normalization
Batch Normalization inserts, between each layer's wx+b and f(wx+b), a layer that normalizes wx+b to mean 0 and variance 1 (in the original paper, for stability the authors add two trainable parameters that can scale and shift the data back; I will cover the details in the Assignment2 part). In plain words, it preprocesses the data again at every layer. A figure for intuition:
This method further accelerates convergence, so the learning rate can be raised and training sped up; it also alleviates overfitting to some degree, so Dropout can be omitted or reduced, and the L2 regularization coefficient can be decreased, speeding training up yet again. In short, Batch Normalization reduces our dependence on regularization.
Modern deep networks essentially all use Batch Normalization.
5. Choosing the regularization
Here we continue to use L2 regularization (for L1 regularization and max-norm constraints, see the course notes) to penalize the weights W and control overfitting. In deep networks (such as the convolutional networks covered later, in the Assignment2 part), L2 regularization is also the usual choice, together with Dropout for further control of overfitting. Dropout itself is left for the Assignment2 part.
6. Choosing the loss function
The loss (cost) function consists of a data loss and a regularization loss, L = (1/N)∑i Li + λR(W). The common choices are the SVM hinge loss and the softmax cross-entropy loss (here we only consider datasets where each sample has exactly one correct class; for other classification settings and for regression, see the course notes). We pick the softmax cross-entropy loss as our loss function.
7. Computing gradients with backpropagation
Taking a 3-layer network with ReLU activations and the softmax cross-entropy loss as an example, here is the complete computation of the gradients at every layer (the image is high-resolution; open it in a new tab and zoom, or download it. In the figure, the size of W3 should be [H, C]):
8. Parameter update rules
1) Vanilla update
The simplest update rule, i.e., the standard form of what we commonly call SGD.
2) Momentum update (SGD+Momentum)
This improves on the vanilla update. To understand the momentum technique, picture gradient descent as a ball rolling down into a valley. Momentum modifies gradient descent in two ways to resemble this physical picture. First, it introduces a notion of velocity: the gradient changes the velocity rather than the position directly, just as a physical force changes velocity and only indirectly affects position. Second, momentum introduces a kind of friction term that gradually reduces the velocity. The update rule is:
----------------------------------------------> v --> v' = μv - λ·dx <-------------------------------------------
------------------------------------------------> x --> x' = x + v' <---------------------------------------------
where x is the parameter being updated (W and b), v starts at 0, and μ is a hyperparameter controlling the amount of friction, with values in (0, 1); the most common setting is 0.9 (it can also be chosen by cross-validation, typically from [0.5, 0.9, 0.95, 0.99]).
As the formulas show, we build up velocity by repeatedly adding gradient terms, so as iterations proceed the ball speeds up, which ensures momentum runs faster than standard gradient descent; meanwhile μ guarantees that near the valley floor the speed decays, so we settle at the bottom rather than oscillating back and forth.
---> ps: SGD+Momentum is the most common update rule, and it is what we use here.
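A minimal sketch of the momentum update on a 1-D quadratic bowl (μ and the learning rate are illustrative):

```python
mu = 0.9             # friction hyperparameter, as above
learning_rate = 0.1  # illustrative
x, v = 5.0, 0.0      # parameter and velocity

for _ in range(200):
    dx = 2.0 * x                     # gradient of f(x) = x^2
    v = mu * v - learning_rate * dx  # the gradient changes the velocity...
    x = x + v                        # ...and the velocity changes the position

# x ends up very close to the minimum at 0
```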
3) Nesterov Momentum (SGD+Nesterov Momentum)
An improved version of the momentum update; in practice its convergence is slightly better. To understand Nesterov Momentum, first merge the momentum update rule into one line:
--------------------------------------------> x --> x' = (x + μv) - λ·dx <--------------------------------------
From the formula, (x + μv) is where x is about to go next; yet the gradient is still evaluated as dx, at the old position, while we would rather evaluate it with lookahead as d(x + μv) so that the gradient descends faster. A figure to help (the big red dot is the current position of x):
We can now state the Nesterov Momentum update rule:
------------------------------------------> x_ahead = x + μv <---------------------------------------------
-----------------------------------------> v = μv - λ·dx_ahead <--------------------------------------------
-------------------------------------------------> x = x + v <--------------------------------------------------
In practice we use a slightly modified form, with the following code:
v_prev = v
v = mu * v - learning_rate * dx    # same update as the momentum update
x += -mu * v_prev + (1 + mu) * v   # the new update form
If you want to dig into the mathematics behind Nesterov Momentum, see:
· Advances in Optimizing Recurrent Networks by Yoshua Bengio, Section 3.5.
· Ilya Sutskever's thesis, which contains an exposition of the topic in section 7.2.
8.1. Learning rate decay
In practice, gradually decaying the learning rate as training proceeds is a necessary technique. This is easy to understand with the hill-to-valley analogy again: at the start we are far from the valley and can take large steps, but as we approach the bottom the steps must shrink lest we overshoot the valley.
Common decay schedules:
1) Step decay: after each epoch (1 epoch = N/batch_size iterations), lower the learning rate a bit, α' = kα, where k is e.g. 0.9 or 0.95 and can also be chosen by cross-validation.
2) Exponential decay: α = α0·exp(-kt), where α0, k are hyperparameters and t is the iteration number.
3) 1/t decay: α = α0/(1 + kt), where α0, k are hyperparameters and t is the iteration number.
In practice step decay is the usual choice: it has few hyperparameters and is cheap to compute.
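Step decay is a one-liner per epoch; a sketch with made-up numbers:

```python
learning_rate = 1e-3   # illustrative initial value
k = 0.9                # decay factor per epoch

history = []
for epoch in range(5):
    history.append(learning_rate)
    learning_rate *= k  # step decay: multiply by k after each epoch

# history decays geometrically: 1e-3, 9e-4, 8.1e-4, ...
```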
The discussion so far assumes a single global learning rate. Tuning it is time-consuming and error-prone, so we have long wanted update rules that adapt the learning rate automatically, even down to per-parameter updates. Here are a few common adaptive methods:
1) Adagrad
Adagrad is an adaptive learning rate method proposed by Duchi et al. in Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, implemented as:
# Assume the gradient dx and parameter vector x
cache += dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)
The benefit is that weights receiving large gradients have their effective learning rate reduced, while weights receiving small gradients see their effective learning rate grow over the iterations. Note that the square root is important. The smoothing term eps avoids division by zero and is usually set between 1e-4 and 1e-8.
2)、RMSprop
RMSprop是一種高效但是還未正式發(fā)布的自適應(yīng)調(diào)節(jié)學(xué)習(xí)率的方法,RMSProp方法對Adagrad算法做了一個簡單的優(yōu)化,以減緩它的迭代強度:
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)
Here decay_rate is a hyperparameter, with typical values in [0.9, 0.99, 0.999].
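Again a minimal self-contained run on the same toy objective, with assumed values decay_rate = 0.9, learning_rate = 0.1, and eps = 1e-8 (illustrative, not the course's settings):

```python
import numpy as np

learning_rate, decay_rate, eps = 0.1, 0.9, 1e-8   # assumed values for this toy run
x = np.array([5.0, -3.0])          # minimize f(x) = sum(x**2), so dx = 2*x
cache = np.zeros_like(x)           # leaky (moving-average) sum of squared gradients

for _ in range(300):
    dx = 2 * x
    cache = decay_rate * cache + (1 - decay_rate) * dx**2
    x += -learning_rate * dx / (np.sqrt(cache) + eps)

# Unlike Adagrad, the cache decays, so the effective step size does not
# shrink monotonically to zero; x settles into a small neighborhood of 0.
```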
3)、Adam
Adam looks a bit like RMSProp with momentum, and in practice works slightly better than RMSProp; its simplified code is:
m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
x += - learning_rate * m / (np.sqrt(v) + eps)
The paper recommends eps = 1e-8, beta1 = 0.9, beta2 = 0.999. The full Adam update also includes a bias correction mechanism to compensate for m and v being initialized to zero.
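For completeness, here is a sketch of the full update with that bias correction, following the formulas in the Adam paper (the learning rate 0.05 and the toy objective are assumptions made for this demo):

```python
import numpy as np

learning_rate, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
x = np.array([5.0, -3.0])          # minimize f(x) = sum(x**2), so dx = 2*x
m = np.zeros_like(x)               # first moment estimate
v = np.zeros_like(x)               # second moment estimate

for t in range(1, 1001):           # t starts at 1 for the bias correction
    dx = 2 * x
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * (dx**2)
    m_hat = m / (1 - beta1**t)     # correct the bias toward the zero init
    v_hat = v / (1 - beta2**t)
    x += -learning_rate * m_hat / (np.sqrt(v_hat) + eps)
```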
---> PS: SGD + Nesterov Momentum or Adam are the recommended choices for updating the parameters.
Some other methods:
·Adadelta by Matthew Zeiler
·Unit Tests for Stochastic Optimization
Here is a figure showing how the loss is optimized under the various parameter update methods discussed above:
9. Hyperparameter optimization
When training a neural network we have many hyperparameters to optimize, which is usually done on the validation set. The hyperparameters to optimize here are:
·the initial learning rate
·the learning rate decay factor
·the regularization strength (including the L2 penalty and the dropout ratio)
A single training run of a deep network takes a long time, so it pays to spend some time on hyperparameter search beforehand to find the best settings. The most direct way is to build a worker into the framework that keeps sampling hyperparameters and running the optimization, recording for each setting its state and performance on the validation set after every epoch. In practice, we rarely use n-fold cross-validation to pick these hyperparameters for neural networks; a single fixed validation set is usually enough.
For the initial learning rate, a common search range is learning_rate = 10 ** uniform(-6, 1): train for about 5 epochs, then narrow the range and train for more epochs, finally settling on an initial learning rate, typically around 1e-3. For the regularization coefficient λ, a common search sequence is [0.5, 0.9, 0.95, 0.99].
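The log-space sampling for the initial learning rate can be sketched as follows (the seed and the number of candidates are arbitrary choices for illustration):

```python
import numpy as np

# Sample candidate initial learning rates in log space,
# mirroring learning_rate = 10 ** uniform(-6, 1).
rng = np.random.RandomState(0)
candidates = 10 ** rng.uniform(-6, 1, size=20)
# Each candidate would then be trained for ~5 epochs and ranked by its
# validation accuracy; only the sampling step is shown here.
```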
10. Visual inspection of the training process
1)、Watch the loss curve to judge whether the learning rate you set is good:
In practice the loss does not change as smoothly as in the figure above; it fluctuates. The figure below shows how the loss actually evolves while training on CIFAR-10:
You may notice that the curve wobbles up and down; this is related to the batch size. With a very small batch size the loss fluctuates heavily, while a larger batch size gives a relatively more stable curve.
2)、Watch the training/validation accuracy to judge whether overfitting has occurred:
Part 4: Python programming assignment (2-layer neural network)
· For the neural network part of Assignment1, we need to complete neural_net.py; once it is done, you can use the code in two_layer_net.ipynb (part of which you have to write yourself) to debug your model, tune the hyperparameters, obtain the best model, and finally measure its classification performance on the test set.
· The image dataset used here is still CIFAR-10.
neural_net.py is as follows:
__coauthor__ = 'Deeplayer'
# 6.14.2016

import numpy as np


class TwoLayerNet(object):
    """
    A two-layer fully-connected neural network. The net has an input dimension of
    D, a hidden layer dimension of H, and performs classification over C classes.
    The network has the following architecture:
        input - fully connected layer - ReLU - fully connected layer - softmax
    The outputs of the second fully-connected layer are the scores for each class.
    """
    def __init__(self, input_size, hidden_size, output_size, std=1e-4):
        self.params = {}
        self.params['W1'] = std * np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros((1, hidden_size))
        self.params['W2'] = std * np.random.randn(hidden_size, output_size)
        self.params['b2'] = np.zeros((1, output_size))

    def loss(self, X, y=None, reg=0.0):
        """
        Compute the loss and gradients for a two-layer fully connected neural network.
        """
        # Unpack variables from the params dictionary
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']
        N, D = X.shape

        # Compute the forward pass
        h1 = ReLU(np.dot(X, W1) + b1)      # hidden layer 1  (N,H)
        out = np.dot(h1, W2) + b2          # output layer    (N,C)
        scores = out                       #                 (N,C)
        if y is None:
            return scores

        # Compute the loss, considering numeric stability
        scores_max = np.max(scores, axis=1, keepdims=True)              # (N,1)
        # Compute the class probabilities
        exp_scores = np.exp(scores - scores_max)                        # (N,C)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)  # (N,C)
        # cross-entropy loss and L2-regularization
        correct_logprobs = -np.log(probs[range(N), y])                  # (N,)
        data_loss = np.sum(correct_logprobs) / N
        reg_loss = 0.5 * reg * np.sum(W1*W1) + 0.5 * reg * np.sum(W2*W2)
        loss = data_loss + reg_loss

        # Backward pass: compute gradients
        grads = {}
        # Compute the gradient of scores
        dscores = probs                                                 # (N,C)
        dscores[range(N), y] -= 1
        dscores /= N
        # Backprop into W2 and b2
        dW2 = np.dot(h1.T, dscores)                                     # (H,C)
        db2 = np.sum(dscores, axis=0, keepdims=True)                    # (1,C)
        # Backprop into hidden layer
        dh1 = np.dot(dscores, W2.T)                                     # (N,H)
        # Backprop into ReLU non-linearity
        dh1[h1 <= 0] = 0
        # Backprop into W1 and b1
        dW1 = np.dot(X.T, dh1)                                          # (D,H)
        db1 = np.sum(dh1, axis=0, keepdims=True)                        # (1,H)
        # Add the regularization gradient contribution
        dW2 += reg * W2
        dW1 += reg * W1
        grads['W1'] = dW1
        grads['b1'] = db1
        grads['W2'] = dW2
        grads['b2'] = db2

        return loss, grads

    def train(self, X, y, X_val, y_val, learning_rate=1e-3,
              learning_rate_decay=0.95, reg=1e-5, mu=0.9, num_epochs=10,
              mu_increase=1.0, batch_size=200, verbose=False):
        """
        Train this neural network using stochastic gradient descent.
        Inputs:
        - X: A numpy array of shape (N, D) giving training data.
        - y: A numpy array of shape (N,) giving training labels; y[i] = c means
          that X[i] has label c, where 0 <= c < C.
        - X_val: A numpy array of shape (N_val, D) giving validation data.
        - y_val: A numpy array of shape (N_val,) giving validation labels.
        - learning_rate: Scalar giving learning rate for optimization.
        - learning_rate_decay: Scalar giving factor used to decay the learning
          rate after each epoch.
        - reg: Scalar giving regularization strength.
        - num_epochs: Number of epochs to take when optimizing.
        - batch_size: Number of training examples to use per step.
        - verbose: boolean; if true print progress during optimization.
        """
        num_train = X.shape[0]
        iterations_per_epoch = max(num_train // batch_size, 1)
        # Use SGD (with momentum) to optimize the parameters
        v_W2, v_b2 = 0.0, 0.0
        v_W1, v_b1 = 0.0, 0.0
        loss_history = []
        train_acc_history = []
        val_acc_history = []
        for it in range(1, num_epochs * iterations_per_epoch + 1):
            # Sampling with replacement is faster than sampling without replacement.
            sample_index = np.random.choice(num_train, batch_size, replace=True)
            X_batch = X[sample_index, :]     # (batch_size,D)
            y_batch = y[sample_index]        # (batch_size,)
            # Compute loss and gradients using the current minibatch
            loss, grads = self.loss(X_batch, y=y_batch, reg=reg)
            loss_history.append(loss)
            # Perform parameter update (with momentum)
            v_W2 = mu * v_W2 - learning_rate * grads['W2']
            self.params['W2'] += v_W2
            v_b2 = mu * v_b2 - learning_rate * grads['b2']
            self.params['b2'] += v_b2
            v_W1 = mu * v_W1 - learning_rate * grads['W1']
            self.params['W1'] += v_W1
            v_b1 = mu * v_b1 - learning_rate * grads['b1']
            self.params['b1'] += v_b1
            # Every epoch, check train and val accuracy and decay learning rate.
            if it % iterations_per_epoch == 0:
                # Check accuracy
                epoch = it // iterations_per_epoch
                train_acc = (self.predict(X_batch) == y_batch).mean()
                val_acc = (self.predict(X_val) == y_val).mean()
                train_acc_history.append(train_acc)
                val_acc_history.append(val_acc)
                if verbose:
                    print('epoch %d / %d: loss %f, train_acc: %f, val_acc: %f' %
                          (epoch, num_epochs, loss, train_acc, val_acc))
                # Decay learning rate
                learning_rate *= learning_rate_decay
                # Increase mu
                mu *= mu_increase

        return {
            'loss_history': loss_history,
            'train_acc_history': train_acc_history,
            'val_acc_history': val_acc_history,
        }

    def predict(self, X):
        """
        Inputs:
        - X: A numpy array of shape (N, D) giving N D-dimensional data points to
          classify.
        Returns:
        - y_pred: A numpy array of shape (N,) giving predicted labels for each of
          the elements of X. For all i, y_pred[i] = c means that X[i] is
          predicted to have class c, where 0 <= c < C.
        """
        h1 = ReLU(np.dot(X, self.params['W1']) + self.params['b1'])
        scores = np.dot(h1, self.params['W2']) + self.params['b2']
        y_pred = np.argmax(scores, axis=1)
        return y_pred


def ReLU(x):
    """ReLU non-linearity."""
    return np.maximum(0, x)
After finishing neural_net.py, you should verify that the code is correct (use the code in two_layer_net.ipynb to check); once it passes, we move on to optimizing the hyperparameters.
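A standard way to do that check is to compare the analytic gradients from loss() against centered-difference numerical gradients; the assignment ships a utility for this, and a minimal self-contained sketch of the idea (applied to an illustrative test function rather than the network itself) is:

```python
import numpy as np

def eval_numerical_gradient(f, x, h=1e-5):
    """Centered-difference numerical gradient of a scalar function f at array x."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h                  # evaluate f at (x + h) in this coordinate
        fxph = f(x)
        x[ix] = old - h                  # evaluate f at (x - h)
        fxmh = f(x)
        x[ix] = old                      # restore the original value
        grad[ix] = (fxph - fxmh) / (2 * h)
        it.iternext()
    return grad

# Sanity check on f(x) = sum(x**2), whose analytic gradient is 2*x.
x = np.array([[1.0, -2.0, 3.0], [0.5, -0.25, 2.0]])
num_grad = eval_numerical_gradient(lambda z: np.sum(z**2), x)
rel_err = np.max(np.abs(num_grad - 2*x) /
                 (np.abs(num_grad) + np.abs(2*x) + 1e-8))
```

The same checker can then be pointed at the network by wrapping self.loss(X, y, reg) in a lambda over each parameter array and comparing the result against grads.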
PS: Since this article has reached the length limit, please continue reading at CS231n : Assignment1(續(xù)).