AI: Deep Learning for Text Processing

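Another article published alongside this one mentioned BlueDot, a company dedicated to using artificial intelligence to protect people around the world from infectious diseases. It issued a warning about the current outbreak a week before the outbreak drew widespread attention. A week of lead time is precious!

Their AI early-warning system uses deep learning to process text. It crawls large volumes of online news, public statements, and similar sources, collecting hundreds of thousands of pieces of information, and applies natural language processing to them. Today we will talk about how deep learning handles text at a simple level.

Text, a String or Text, is a sequence of characters or a sequence of words. Working at the word level is the most common (we set Chinese aside for now, since understanding and processing Chinese is far more complex than English). A computer is mathematics set in stone, and text processing is in essence statistics set in stone; a model built on statistics in this way can already solve many simple problems. Let's get started.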

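Processing text data

As before, if the raw training data is not already in vector form, we have to vectorize it. There are several ways to vectorize text:

Split the text into words
Split the text into characters
Extract n-grams of words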

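Suppose I say "我喜歡吃火……" ("I like to eat 火..."). Can you guess what I will say next? With a 1-gram, anything could come next, because the next word does not depend on the preceding text. With a 2-gram, the next character might be 把, 柴, or 焰, forming the words 火把 (torch), 火柴 (match), or 火焰 (flame). With a 3-gram, the next character is probably 鍋, forming 吃火鍋 (eat hot pot), which is the more likely continuation. As a first approximation, an n-gram makes each word depend on the previous n-1 words.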
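
The n-gram idea above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the original post: the helper names `word_ngrams` and `next_word_counts` and the three-sentence toy corpus are assumptions made up for the example. It slides a window over the words to extract n-grams, then counts which word follows each word, which is the statistic a 2-gram model relies on.

```python
from collections import Counter, defaultdict


def word_ngrams(text, n):
    """Return all word n-grams in text as tuples (words are split on whitespace)."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def next_word_counts(sentences):
    """For each word, count the words that directly follow it (2-gram statistics)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for prev, nxt in word_ngrams(sentence, 2):
            counts[prev][nxt] += 1
    return counts


# Hypothetical toy corpus, invented for this sketch
corpus = ["I like hot pot", "I like hot tea", "we like hot pot"]

bigrams = word_ngrams(corpus[0], 2)
counts = next_word_counts(corpus)
print(bigrams)
print(counts["hot"].most_common())
```

In this toy corpus "pot" follows "hot" more often than "tea" does, so a 2-gram model would prefer "pot" as the continuation.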

Today we first fill in a hole dug earlier. I said back then that one-hot encoding would be introduced later, and now is the time.

one-hot encoding

import numpy as np


def one_hot():
    samples = ['The cat sat on the mat.', 'The dog ate my homework.']
    token_index = {}
    # Split each sample into words and give every new word the next index
    # (index 0 is left unused)
    for sample in samples:
        for word in sample.split():
            if word not in token_index:
                token_index[word] = len(token_index) + 1
    # {'The': 1, 'cat': 2, 'sat': 3, 'on': 4, 'the': 5, 'mat.': 6, 'dog': 7, 'ate': 8, 'my': 9, 'homework.': 10}
    print(token_index)

    # Only the first max_length words of each sample are encoded
    max_length = 8
    results = np.zeros(shape=(len(samples), max_length, max(token_index.values()) + 1))
    for i, sample in enumerate(samples):
        for j, word in list(enumerate(sample.split()))[:max_length]:
            index = token_index.get(word)
            results[i, j, index] = 1.

    print(results)

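Result

As we can see, this data is not clean: mat and homework are each followed by an English period '.'. Should we show off with a fancy regular expression to match this stray symbol? Of course not. Keras has a built-in method for this.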

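from keras.preprocessing.text import Tokenizer


def keras_one_hot():
    samples = ['The cat sat on the mat.', 'The dog ate my homework.']
    # Keep only the 1000 most common words
    tokenizer = Tokenizer(num_words=1000)
    # Build the word index
    tokenizer.fit_on_texts(samples)
    # Turn each string into a list of integer indices
    sequences = tokenizer.texts_to_sequences(samples)
    print(sequences)
    # One-hot (binary) representation, one row per sample
    one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
    print(one_hot_results)
    word_index = tokenizer.word_index
    print(word_index)
    print('Found %s unique tokens.' % len(word_index))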

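Result

Here num_words limits the vocabulary to the most common words, while max_length above limits how many words of each sample are encoded. Both cap the size of the encoded data. Choosing them well can greatly reduce computation and training time, and sometimes even improves accuracy, so they deserve some attention. We can also see that most entries of the encoded vectors are 0. The encoding is not compact, which wastes a lot of memory. Is there a better way? Yes, there is.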
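
To see how sparse these vectors are, here is a pure-Python sketch that roughly mimics the preprocessing described above. It is an approximation, not the Keras implementation: the helper names `tokenize`, `build_index`, and `binary_rows` are made up for this example, and words are numbered in order of first appearance, whereas the Keras Tokenizer numbers them by frequency. Like the Keras Tokenizer's default settings, it lowercases the text and drops punctuation.

```python
import string


def tokenize(text):
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def build_index(samples):
    """Give each distinct word an integer id starting at 1 (0 stays unused)."""
    index = {}
    for sample in samples:
        for word in tokenize(sample):
            index.setdefault(word, len(index) + 1)
    return index


def binary_rows(samples, index, num_words):
    """One row per sample; column i is 1 if the word with id i occurs in it."""
    rows = []
    for sample in samples:
        row = [0] * num_words
        for word in tokenize(sample):
            i = index.get(word)
            if i is not None and i < num_words:
                row[i] = 1
        rows.append(row)
    return rows


samples = ['The cat sat on the mat.', 'The dog ate my homework.']
index = build_index(samples)
rows = binary_rows(samples, index, num_words=1000)
zero_fraction = sum(row.count(0) for row in rows) / (len(rows) * 1000)
print(index)
print(f"{zero_fraction:.3f} of the entries are zero")
```

With only five distinct words per sentence and 1000 columns, almost every entry is zero, which is the waste that a denser representation avoids.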

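Word embeddings

Also called word vectors. Word embeddings are usually dense and low-dimensional (256, 512, or 1024 dimensions). So what exactly is a word embedding?

The topic of this article is processing text, and text carries meaning; there is nothing we can do with text that has no meaning. Yet the methods we have used so far are really just probability statistics, a purely mechanical computation with no understanding involved (or very little). In reality, "非常好" (very good) and "非常棒" (very great) are close in meaning, and both are the opposite of "非常差" (very bad). So when we convert them into vectors, we want the first two vectors to be close together and both to be far from the third. With that in mind, the figure below should be easy to understand: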

image
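
To make "close" and "far" concrete, here is a small sketch that measures the distance between word vectors. The two-dimensional vectors are hand-picked for illustration and are not learned embeddings, and the function name `euclidean` is an assumption of this sketch.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hand-picked toy vectors (NOT learned embeddings), chosen so that
# the two positive phrases sit near each other
vectors = {
    "very good": [0.9, 0.8],
    "very great": [0.8, 0.9],
    "very bad": [-0.9, -0.8],
}

near = euclidean(vectors["very good"], vectors["very great"])
far = euclidean(vectors["very good"], vectors["very bad"])
print(f"good vs great: {near:.3f}")
print(f"good vs bad:   {far:.3f}")
```

A trained embedding aims for the same pattern, except that the coordinates are learned from data instead of being written by hand.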

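Implementing this yourself might be a bit hard. Fortunately Keras simplifies the problem: Embedding is a built-in layer that learns this mapping. With the concept understood, let's revisit the IMDB problem (predicting the sentiment of movie reviews). The code is simple and reaches roughly 75% accuracy: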

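from keras import preprocessing
from keras.datasets import imdb
from keras.layers import Embedding, Flatten, Dense
from keras.models import Sequential


def imdb_run():
    # Vocabulary size: keep only the 10000 most common words
    max_features = 10000
    # Keep only 20 words of each review
    maxlen = 20
    (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
    # Pad or truncate every review to exactly maxlen integers
    x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
    model = Sequential()
    # Map each word index to an 8-dimensional vector
    model.add(Embedding(max_features, 8, input_length=maxlen))
    # Flatten (samples, maxlen, 8) to (samples, maxlen * 8)
    model.add(Flatten())
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
    model.summary()
    history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2)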
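
Conceptually, an Embedding layer is a lookup table: one row of numbers per word index. The pure-Python sketch below illustrates that idea with the same sizes as the IMDB example (10000 words, 8 dimensions, 20 words per review). It is a simplification under stated assumptions: the table is filled with small random values and never trained, the review is a made-up list of random indices, and the names `table`, `embed`, and `flatten` are invented for this sketch.

```python
import random

random.seed(0)
vocab_size, embedding_dim, maxlen = 10000, 8, 20

# One row per word index; a real layer would adjust these values during training
table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
         for _ in range(vocab_size)]


def embed(sequence):
    """Replace each word index with its row from the table."""
    return [table[i] for i in sequence]


def flatten(vectors):
    """Concatenate the per-word vectors into one long vector."""
    return [x for vec in vectors for x in vec]


# Hypothetical padded review: maxlen random word indices
sequence = [random.randrange(vocab_size) for _ in range(maxlen)]
embedded = embed(sequence)
flat = flatten(embedded)
print(len(embedded), len(embedded[0]), len(flat))  # 20 8 160
```

The flattened vector of 160 numbers is what the final Dense layer receives for each review.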

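Our dataset is a bit small, so what can we do? In the previous post, when working with images, we used a pretrained network. Here we take a similar approach and use pretrained word embeddings. The two most popular word embeddings are GloVe and Word2Vec, and we will introduce each of them at an appropriate time later. Today we use GloVe; the specific steps are written in the code comments. As usual, we look at the results first and leave the code to the end: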

image

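The model overfits quickly. The validation accuracy is close to 60%, which is actually quite decent given that there are only 200 training samples. You may not believe that, so here are two more charts for comparison. The first is without word embeddings: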

image

The validation accuracy is clearly lower. Next is the result with 2000 training samples:

image

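The accuracy is much higher now. Chasing a high number is not our goal; our goal is to show that word embeddings are effective, and we have done that. Now let's look at the code: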

#!/usr/bin/env python3

import os
import time

import matplotlib.pyplot as plt
import numpy as np
from keras.layers import Embedding, Flatten, Dense
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer


def deal():
    # http://mng.bz/0tIo
    imdb_dir = '/Users/renyuzhuo/Documents/PycharmProjects/Data/aclImdb'
    train_dir = os.path.join(imdb_dir, 'train')
    labels = []
    texts = []
    # Read every review and record its label
    for label_type in ['neg', 'pos']:
        dir_name = os.path.join(train_dir, label_type)
        for fname in os.listdir(dir_name):
            if fname[-4:] == '.txt':
                # encoding='utf-8' is an assumption about the dataset files
                with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                    texts.append(f.read())
                if label_type == 'neg':
                    labels.append(0)
                else:
                    labels.append(1)

    # Tokenize all the texts
    # Keep at most 100 words per review
    maxlen = 100
    # Number of training samples
    training_samples = 200
    # Number of validation samples
    validation_samples = 10000
    # Keep only the 10000 most common words
    max_words = 10000
    # Tokenize, as introduced earlier
    tokenizer = Tokenizer(num_words=max_words)
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)
    word_index = tokenizer.word_index
    print('Found %s unique tokens.' % len(word_index))
    # Convert the lists of integers into a 2D tensor
    data = pad_sequences(sequences, maxlen=maxlen)
    labels = np.asarray(labels)
    print('Shape of data tensor:', data.shape)
    print('Shape of label tensor:', labels.shape)
    # Shuffle the data (the files were read as all neg, then all pos)
    indices = np.arange(data.shape[0])
    np.random.shuffle(indices)
    data = data[indices]
    labels = labels[indices]
    # Split into a training set and a validation set
    x_train = data[:training_samples]
    y_train = labels[:training_samples]
    x_val = data[training_samples: training_samples + validation_samples]
    y_val = labels[training_samples: training_samples + validation_samples]

    # Download the word embeddings from https://nlp.stanford.edu/projects/glove
    glove_dir = '/Users/renyuzhuo/Documents/PycharmProjects/Data/glove.6B'
    embeddings_index = {}
    # encoding='utf-8' is an assumption about the GloVe file
    f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8')
    # Build an index that maps each word to its vector representation
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    f.close()
    print('Found %s word vectors.' % len(embeddings_index))

    # Build the embedding matrix
    embedding_dim = 100
    embedding_matrix = np.zeros((max_words, embedding_dim))
    for word, i in word_index.items():
        if i < max_words:
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                embedding_matrix[i] = embedding_vector

    # Build the model
    model = Sequential()
    model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
    model.add(Flatten())
    model.add(Dense(32, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.summary()

    # Load the GloVe matrix into the Embedding layer and freeze it
    model.layers[0].set_weights([embedding_matrix])
    model.layers[0].trainable = False

    # Train the model
    model.compile(optimizer='rmsprop',
                  loss='binary_crossentropy',
                  metrics=['acc'])
    history = model.fit(x_train, y_train,
                        epochs=10,
                        batch_size=32,
                        validation_data=(x_val, y_val))
    model.save_weights('pre_trained_glove_model.h5')

    # Plot the results
    acc = history.history['acc']
    val_acc = history.history['val_acc']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    epochs = range(1, len(acc) + 1)
    plt.plot(epochs, acc, 'bo', label='Training acc')
    plt.plot(epochs, val_acc, 'b', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.show()

    plt.figure()
    plt.plot(epochs, loss, 'bo', label='Training loss')
    plt.plot(epochs, val_loss, 'b', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()
    plt.show()


if __name__ == "__main__":
    time_start = time.time()
    deal()
    time_end = time.time()
    print('Time Used: ', time_end - time_start)
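
The GloVe file read in the script above is plain text: each line holds a word followed by the components of its vector. The sketch below walks through the parsing and embedding-matrix steps on a tiny in-memory sample so it can run without downloading anything. The 4-dimensional vectors, the `word_index` contents, and the helper names `load_vectors` and `build_matrix` are all made up for illustration; the real glove.6B.100d.txt lines carry 100 values each.

```python
import io

# Made-up vectors in the GloVe text layout: "<word> <v1> <v2> ..."
SAMPLE = """the 0.1 0.2 0.3 0.4
cat 0.5 0.6 0.7 0.8
"""


def load_vectors(lines):
    """Map each word to its list of float components."""
    vectors = {}
    for line in lines:
        values = line.split()
        if values:
            vectors[values[0]] = [float(v) for v in values[1:]]
    return vectors


def build_matrix(word_index, vectors, max_words, dim):
    """Row i holds the vector of the word with index i; unknown words stay all zero."""
    matrix = [[0.0] * dim for _ in range(max_words)]
    for word, i in word_index.items():
        if i < max_words and word in vectors:
            matrix[i] = vectors[word]
    return matrix


vectors = load_vectors(io.StringIO(SAMPLE))
word_index = {"the": 1, "cat": 2, "zebra": 3}  # hypothetical tokenizer output
matrix = build_matrix(word_index, vectors, max_words=4, dim=4)
for row in matrix:
    print(row)
```

Row 0 stays zero because index 0 is never assigned to a word, and the row for "zebra" stays zero because the sample has no vector for it.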

This article was first published on the WeChat official account RAIS.
