Udacity Deep Learning Course Assignment 1 - notMNIST

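Preface: Udacity offers a free standalone course on deep learning. Although it is free, it keeps to Udacity's usual standard. The course focuses on practical applications of neural networks, so it is a good way to improve hands-on project skills. To strengthen the theory as well, you can study Hinton's machine learning course on Coursera in parallel; it covers the fundamental theory of deep learning in depth and systematically.
The introductory assignment of the course is classifying notMNIST characters. The notMNIST data is a set of image folders organized by character, 10 folders in total: A, B, C, D, E, F, G, H, I and J. Every image is 28*28 pixels. Completing the assignment builds two important foundations: (1) implementing the full workflow of a classification project; (2) normalizing, shuffling and cleaning the data. These are critical preprocessing steps, and they are handled as follows:
Preprocessing methods: Given a dataset, the first step is not to train a model on it right away but to preprocess it. The main preprocessing steps are: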

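(1) Normalization: Image pixel values lie in [0, 255], so raw sample values are generally too large. The image data is therefore normalized with the following formula: x = (x - 128) / 255
Large sample values hurt learning in two further ways. First, parameter training sums over many samples, so large values lead to truncation and to errors from adding large and small numbers together. Second, overly large values interfere with model initialization. At initialization the network's classification should be as random as possible, i.e. the logits should be as small as possible, so that the softmax output is close to uniform. If the sample values are large, however, then even with small random weights w the large x yields large logits, and the softmax output tends toward 1 and 0, which undermines the randomness of the initialization.
(2) Shuffling: The raw samples are sorted into folders by class, but during training the data must be fed in random order; otherwise training is hard to stabilize.
(3) Data cleaning: The training, validation and test sets may contain duplicated data. If there are too many duplicates, the final classification results become less trustworthy. The three sets therefore need to be cleaned so that they do not overlap.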
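The initialization argument in step (1) can be checked numerically. The following is a minimal sketch, not part of the assignment: it assumes one fake flattened 28*28 image, small random weights drawn with a standard deviation of 0.1, and hypothetical helper names `softmax` and `normalize`. It compares the softmax output for raw and for normalized pixel values.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def normalize(pixels):
    """Apply x = (x - 128) / 255, mapping [0, 255] to roughly [-0.5, 0.5]."""
    return (pixels.astype(np.float64) - 128.0) / 255.0

rng = np.random.RandomState(0)
raw = rng.randint(0, 256, size=(28 * 28,))           # one fake flattened image
weights = rng.normal(0.0, 0.1, size=(28 * 28, 10))   # small random weights (assumed scale)

probs_raw = softmax(raw @ weights)                   # logits computed from raw pixels
probs_norm = softmax(normalize(raw) @ weights)       # logits computed from normalized pixels

print("max probability, raw input:       ", probs_raw.max())
print("max probability, normalized input:", probs_norm.max())
```

With raw pixels the largest probability should sit very close to 1, while with normalized pixels the distribution stays much flatter, which is the behaviour the text describes.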
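The assignment is hosted in a Jupyter notebook and consists of several code blocks: downloading the data from the web, extracting it, preprocessing it, training the model, and testing. The sections below explain the blocks and complete the assignment in the same order.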

Code block 1 - Loading modules

from __future__ import print_function
import matplotlib.pyplot as plt  # plotting
import numpy as np  # arrays and matrices
import os
import sys
import tarfile  # archive extraction
from IPython.display import display, Image
from scipy import ndimage
from sklearn.linear_model import LogisticRegression  # logistic regression
from six.moves.urllib.request import urlretrieve  # downloading
from six.moves import cPickle as pickle  # serialization
# Config the matplotlib backend as plotting inline in IPython
%matplotlib inline

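Code block 2 - Downloading the files

Note: Downloading the data from this address with the provided function is very slow, so it is better not to use that function and instead download the data to a local folder yourself. The download URL is:

http://yaroslavvb.com/upload/notMNIST/

Code block 3 - Extracting the files and storing the extracted folder paths

Because the extraction in this code block is too slow, extract the archives with an extraction tool instead and store the paths of the extracted folders for later use. The code below drops the extraction part of the assignment code and keeps only the part that stores the folder paths.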
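For readers who would rather stay in Python, the extraction step that was dropped could look roughly like the sketch below. It is not the assignment's code: `extract_if_needed` is a hypothetical helper, and it assumes the archive's top-level folder has the same name as the archive without its `.tar.gz` suffix.

```python
import os
import tarfile

def extract_if_needed(archive_path, dest_dir="."):
    """Extract a .tar.gz archive unless its root folder already exists.

    Assumes the archive contains one top-level folder named after the
    archive itself (for example demo.tar.gz -> demo/).
    Returns the path of that root folder.
    """
    base = os.path.basename(archive_path)
    root_name = os.path.splitext(os.path.splitext(base)[0])[0]  # strip .tar.gz
    root = os.path.join(dest_dir, root_name)
    if not os.path.isdir(root):
        # Only extract when the folder is missing, so reruns are cheap.
        with tarfile.open(archive_path) as tar:
            tar.extractall(dest_dir)
    return root
```

The returned folder can then be passed to a function such as `maybe_extract` to list the per-class subfolders.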

num_classes = 10
np.random.seed(133)

# Collect the folder path of each class
def maybe_extract(filename, force=False):
  root = os.path.splitext(os.path.splitext(filename)[0])[0]  # remove .tar.gz
  data_folders = [os.path.join(root, d) for d in sorted(os.listdir(root))
    if os.path.isdir(os.path.join(root, d))]
  if len(data_folders) != num_classes:
    raise Exception(
      'Expected %d folders, one per class. Found %d instead.' % (
        num_classes, len(data_folders)))
  return data_folders

# Local folders that hold the notMNIST data
train_filename = 'C:\\ProgramInstall\\PythonCode\\notMNIST_large'  
test_filename = 'C:\\ProgramInstall\\PythonCode\\notMNIST_small'
train_folders = maybe_extract(train_filename)
test_folders = maybe_extract(test_filename)

Problem 1 - Displaying the extracted images

# Problem 1: Display a sample of the images that we just downloaded
nums_image_show = 2  # number of images to show per class
for index_class in range(num_classes):
    # index_class runs from 0 to 9
    imagename_list = os.listdir(train_folders[index_class])
    for imagename in imagename_list[:nums_image_show]:
        # os.path.join keeps the path portable across operating systems
        path = os.path.join(train_folders[index_class], imagename)
        display(Image(filename=path))

The result is shown in the figure below:

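Code block 4 - Loading and normalizing the image data

This code block does three things. (1) It reads the images in each class folder on disk into a 3-D dataset object: the first dimension is the image index and the other two hold the image data. The images are read from disk with scipy's ndimage module. (2) It normalizes the image data with the formula given above. (3) It pickles each ndarray and stores it in the working directory, one .pickle file per class, and returns the paths of the .pickle files as train_datasets and test_datasets.

Note: Packing the data into .pickle files makes it easier to load and process. The raw images are loaded into the object in a loop, and repeating that loop every time the images are needed adds code. A .pickle file only needs to be read once, with no loop.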
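The three steps of this block can be sketched as follows. This is an illustration under stated assumptions rather than the assignment's code: the images are already in memory as 2-D arrays (fake random data here, so no image-reading library is needed), and `build_class_array` and `pickle_class` are hypothetical helper names.

```python
import os
import pickle
import tempfile
import numpy as np

IMAGE_SIZE = 28  # notMNIST images are 28*28

def build_class_array(images):
    """Stack 2-D images into a normalized float32 array of shape (n, 28, 28)."""
    dataset = np.ndarray(shape=(len(images), IMAGE_SIZE, IMAGE_SIZE), dtype=np.float32)
    kept = 0
    for image in images:
        image = np.asarray(image)
        if image.shape != (IMAGE_SIZE, IMAGE_SIZE):
            continue  # skip images with an unexpected size
        dataset[kept] = (image.astype(np.float32) - 128.0) / 255.0
        kept += 1
    return dataset[:kept]

def pickle_class(dataset, path):
    """Write one class array to a .pickle file and return the path."""
    with open(path, "wb") as f:
        pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)
    return path

# Fake data standing in for one class folder: five valid images and one bad one.
rng = np.random.RandomState(0)
fake_images = [rng.randint(0, 256, size=(28, 28)) for _ in range(5)]
fake_images.append(np.zeros((10, 10)))

class_a = build_class_array(fake_images)
path = pickle_class(class_a, os.path.join(tempfile.mkdtemp(), "A.pickle"))

# Reading the file back needs a single load call and no loop.
with open(path, "rb") as f:
    restored = pickle.load(f)
print(restored.shape)  # (5, 28, 28)
```

Skipping wrongly sized images mirrors the idea that a few unreadable files should not abort the whole load.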

Problem 2 - Displaying images read from the pickle files

# Problem 2: Display a sample of the labels and images from the ndarray

# Config the matplotlib backend as plotting inline in IPython
%matplotlib inline
import matplotlib.pyplot as plt

def load_and_displayImage_from_pickle(data_filename_set, NumClass, NumImage):
    if NumImage <= 0:
        print('NumImage <= 0')
        return
    plt.figure('subplot')
    for index, pickle_file in enumerate(data_filename_set):
        with open(pickle_file, 'rb') as f:
            data = pickle.load(f)
        ImageList = data[0:NumImage, :, :]
        # "image" (lowercase) avoids shadowing IPython.display.Image
        for i, image in enumerate(ImageList):
            # NumClass rows (one per class), NumImage columns (images shown per class)
            plt.subplot(NumClass, NumImage, index * NumImage + i + 1)
            plt.imshow(image)
    plt.show()

# Show 10 classes with 5 images each
load_and_displayImage_from_pickle(train_datasets, 10, 5)
load_and_displayImage_from_pickle(test_datasets, 10, 5)

The result is shown in the figure below:

Problem 3 - Checking whether the data is balanced

Balanced data means that the classes have roughly the same number of samples.

def show_sum_of_different_class(data_filename_set):
    plt.figure(1)
    #read .pickle file
    sumofdifferentclass = []
    for pickle_file in data_filename_set:
        with open(pickle_file,'rb') as f:
            data = pickle.load(f)
            print(len(data))
            sumofdifferentclass.append(len(data))

    #show the data
    x = range(10)
    plt.bar(x,sumofdifferentclass)    
    plt.show()

print('train_datasets:\n')    
show_sum_of_different_class(train_datasets)  
print('test_datasets:\n')    
show_sum_of_different_class(test_datasets)

The result is shown in the figure below:

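Code block 5 - Merging the classes and creating the validation set

This block does two things. (1) It merges the data of the different classes. Previously each class had its own data object; to simplify training, the classes are now stored in one large data object that contains samples of all 10 classes, A, B ... J. (2) It takes part of the training set as the validation set.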
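A minimal sketch of the merge and the validation split is given below. It is an illustration, not the assignment's implementation: `merge_classes` is a hypothetical helper, the per-class arrays are assumed to be already loaded, and fake data with three classes is used so the result is easy to check.

```python
import numpy as np

def merge_classes(class_arrays, train_per_class, valid_per_class, seed=0):
    """Merge per-class image arrays into one training and one validation set.

    class_arrays: list of arrays shaped (n_i, 28, 28), one per class.
    Labels are the list positions (0 for the first class, 1 for the next, ...).
    """
    rng = np.random.RandomState(seed)
    train_x, train_y, valid_x, valid_y = [], [], [], []
    for label, images in enumerate(class_arrays):
        # Shuffle within the class so the validation slice is a random sample.
        images = images[rng.permutation(len(images))]
        valid = images[:valid_per_class]
        train = images[valid_per_class:valid_per_class + train_per_class]
        valid_x.append(valid)
        train_x.append(train)
        valid_y.append(np.full(len(valid), label, dtype=np.int32))
        train_y.append(np.full(len(train), label, dtype=np.int32))
    return (np.concatenate(train_x), np.concatenate(train_y),
            np.concatenate(valid_x), np.concatenate(valid_y))

# Three fake classes of ten 28x28 images; every pixel equals the class index.
fake_classes = [np.full((10, 28, 28), c, dtype=np.float32) for c in range(3)]
train_dataset, train_labels, valid_dataset, valid_labels = merge_classes(
    fake_classes, train_per_class=6, valid_per_class=2)
print(train_dataset.shape, valid_dataset.shape)  # (18, 28, 28) (6, 28, 28)
```

Note that the merged arrays are still ordered class by class at this point, which is why a separate shuffling step is needed afterwards.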

Code block 6 - Shuffling the merged data

The previous step only merged the data into one large data object; this step shuffles the data inside it. Only shuffled data gives reasonably stable results when training the model.
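The key point of this step is that images and labels must be shuffled with the same permutation, otherwise the pairing between them is lost. A small sketch, assuming a hypothetical helper `randomize` and fake data in which every pixel equals the image's label:

```python
import numpy as np

def randomize(dataset, labels, seed=None):
    """Return dataset and labels shuffled with one shared permutation."""
    rng = np.random.RandomState(seed)
    permutation = rng.permutation(labels.shape[0])
    return dataset[permutation], labels[permutation]

# Fake merged data: 10 classes, 3 images each, ordered class by class.
labels = np.repeat(np.arange(10), 3)
dataset = np.stack([np.full((28, 28), l, dtype=np.float32) for l in labels])

shuffled_dataset, shuffled_labels = randomize(dataset, labels, seed=133)

# Every image still carries its own label after shuffling.
print(np.array_equal(shuffled_dataset[:, 0, 0], shuffled_labels))  # True
```

Indexing with a permutation returns new arrays, so the original `dataset` and `labels` are left unchanged.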

Problem 4 - Verifying the merged data

'''Problem 4: Convince yourself that the data is still good after shuffling!
'''
# data_set is the dataset; NumImage is the number of images to display
def displayImage_from_dataset(data_set, NumImage):
    if NumImage <= 0:
        print('NumImage <= 0')
        return
    plt.figure('subplot')
    ImageList = data_set[0:NumImage, :, :]
    # "image" (lowercase) avoids shadowing IPython.display.Image
    for index, image in enumerate(ImageList):
        # Lay the images out in rows of 5
        plt.subplot(NumImage // 5 + 1, 5, index + 1)
        plt.imshow(image)
    plt.show()

displayImage_from_dataset(train_dataset, 50)

The result is shown in the figure below, which also confirms that images of different classes are randomly distributed.

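Code block 7 - Saving the different sample sets as a .pickle file

Problem 5 - Data cleaning

In general, the training, validation and test sets share some data, and too much overlap distorts the measured test accuracy. The data therefore needs to be cleaned so that the sets do not intersect.

Note: ndarray data cannot be intersected with set directly, and comparing the arrays in a loop is very slow on large data. The approach below therefore hashes the data first and decides equality by the hash digests. Because the digests are strings, comparing them is much more efficient.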

# Hash the images first
import hashlib

# SHA-256 maps each 2-D image to a digest, so comparing digests tells us
# whether two 2-D arrays are equal
def extract_overlap_hash_where(dataset_1, dataset_2):
    dataset_hash_1 = np.array([hashlib.sha256(img).hexdigest() for img in dataset_1])
    dataset_hash_2 = np.array([hashlib.sha256(img).hexdigest() for img in dataset_2])
    overlap = {}
    for i, hash1 in enumerate(dataset_hash_1):
        duplicates = np.where(dataset_hash_2 == hash1)
        if len(duplicates[0]):
            overlap[i] = duplicates[0]
    return overlap

# Display one overlapping image next to its duplicates
def display_overlap(overlap, source_dataset, target_dataset):
    overlap = {k: v for k, v in overlap.items() if len(v) >= 3}
    item = np.random.choice(list(overlap.keys()))
    imgs = np.concatenate(([source_dataset[item]], target_dataset[overlap[item][0:7]]))
    plt.suptitle(item)
    for i, img in enumerate(imgs):
        plt.subplot(2, 4, i + 1)
        plt.axis('off')
        plt.imshow(img)
    plt.show()

# Data cleaning: drop every image of dataset_1 that also appears in dataset_2
def sanitize(dataset_1, dataset_2, labels_1):
    dataset_hash_1 = np.array([hashlib.sha256(img).hexdigest() for img in dataset_1])
    dataset_hash_2 = np.array([hashlib.sha256(img).hexdigest() for img in dataset_2])
    overlap = []
    for i, hash1 in enumerate(dataset_hash_1):
        duplicates = np.where(dataset_hash_2 == hash1)
        if len(duplicates[0]):
            overlap.append(i)
    return np.delete(dataset_1, overlap, 0), np.delete(labels_1, overlap, None)


overlap_test_train = extract_overlap_hash_where(test_dataset, train_dataset)
print('Number of overlaps:', len(overlap_test_train.keys()))
display_overlap(overlap_test_train, test_dataset, train_dataset)

test_dataset_sanit, test_labels_sanit = sanitize(test_dataset, train_dataset, test_labels)
print('Overlapping images removed from test_dataset: ', len(test_dataset) - len(test_dataset_sanit))

valid_dataset_sanit, valid_labels_sanit = sanitize(valid_dataset, train_dataset, valid_labels)
print('Overlapping images removed from valid_dataset: ', len(valid_dataset) - len(valid_dataset_sanit))

print('Training:', train_dataset.shape, train_labels.shape)
print('Validation:', valid_dataset_sanit.shape, valid_labels_sanit.shape)
print('Testing:', test_dataset_sanit.shape, test_labels_sanit.shape)

pickle_file_sanit = 'notMNIST_sanit.pickle'
try:
    with open(pickle_file_sanit, 'wb') as f:
        # Store the sanitized validation and test sets
        save = {
            'train_dataset': train_dataset,
            'train_labels': train_labels,
            'valid_dataset': valid_dataset_sanit,
            'valid_labels': valid_labels_sanit,
            'test_dataset': test_dataset_sanit,
            'test_labels': test_labels_sanit,
        }
        pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
except Exception as e:
    print('Unable to save data to', pickle_file_sanit, ':', e)
    raise

statinfo = os.stat(pickle_file_sanit)
print('Compressed pickle size:', statinfo.st_size)
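The raw arrays cannot go into a set, but their digests can. As an alternative to scanning with np.where for every image, the digests of the reference set can be put into a Python set so that each lookup takes roughly constant time. The sketch below is one possible variant, not the code from the cited post; `image_digest` and `sanitize_with_set` are hypothetical names, and it assumes both arrays share the same dtype, since the digest is computed over the raw bytes.

```python
import hashlib
import numpy as np

def image_digest(image):
    """SHA-256 hex digest of the raw bytes of one image."""
    return hashlib.sha256(np.ascontiguousarray(image).tobytes()).hexdigest()

def sanitize_with_set(dataset, labels, reference):
    """Drop every image of dataset whose digest also occurs in reference."""
    reference_hashes = {image_digest(img) for img in reference}
    keep = np.array(
        [image_digest(img) not in reference_hashes for img in dataset],
        dtype=bool)
    return dataset[keep], labels[keep]

# Fake data: the first image of dataset duplicates one image of reference.
reference = np.stack([np.full((28, 28), v, dtype=np.float32) for v in (0.1, 0.2, 0.3)])
dataset = np.stack([np.full((28, 28), v, dtype=np.float32) for v in (0.2, 0.5, 0.7)])
labels = np.array([1, 2, 3])

clean_dataset, clean_labels = sanitize_with_set(dataset, labels, reference)
print(clean_dataset.shape, clean_labels)  # (2, 28, 28) [2 3]
```

Like the hash comparison above, this only catches exact duplicates; near-duplicate images with slightly different pixels produce different digests.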

Problem 6 - Training the model

The model is trained with logistic regression.

def train_and_predict(sample_size):
    regr = LogisticRegression()
    X_train = train_dataset[:sample_size].reshape(sample_size,784)
    y_train = train_labels[:sample_size]
    regr.fit(X_train,y_train)
    X_test = test_dataset.reshape(test_dataset.shape[0],28*28)
    y_test = test_labels

    pred_labels = regr.predict(X_test)
    print('Accuracy:', regr.score(X_test, y_test), 'when sample_size=', sample_size)

for sample_size in [50,100,1000,5000,len(train_dataset)]:
    train_and_predict(sample_size)

The answers to the last two problems come from the following blog post: http://www.hankcs.com/ml/notmnist.html

This article is reposted from http://blog.csdn.net/u013698770/article/details/54645326
