NLP Even a Monkey Can Understand (NER)

Build a simple model that understands the meaning of certain words in a sentence: named entity recognition (NER).

Load some packages

import glob
import pandas as pd
import tensorflow as tf
from keras import Sequential
from keras.utils import pad_sequences, to_categorical
from keras.preprocessing.text import Tokenizer
from keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense

Load the labels and sentences

The ner folder contains the raw data: each line holds a sentence, followed by the NER label and the POS tag of every word.

Example:
An Iraqi court has sentenced 11 men to death for the massive truck bombings in Baghdad last August that killed more than 100 people .,O B-gpe O O O O O O O O O O O O O B-geo O B-tim O O O O O O O,DT JJ NN VBZ VBN CD NNS TO NN IN DT JJ NN NNS IN NNP JJ NNP WDT VBD JJR IN CD NNS .
files = glob.glob('./ner/*.tags')
data_pd = pd.concat([pd.read_csv(f, header=None, names=['text', 'label', 'pos']) for f in files], ignore_index=True)

print(data_pd.info())
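
Before tokenizing, it is worth a quick sanity check (my addition, not part of the original post) that every sentence really has one NER label and one POS tag per word; otherwise the token alignment below would silently break.

# Sanity check (my addition): one label and one POS tag per word in every row.
n_words = data_pd['text'].str.split(' ').str.len()
n_labels = data_pd['label'].str.split(' ').str.len()
n_pos = data_pd['pos'].str.split(' ').str.len()
print('misaligned rows:', ((n_words != n_labels) | (n_words != n_pos)).sum())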

Vectorize the text

First, tokenize both the text and the labels.

# This class allows to vectorize a text corpus, by turning each text into either a sequence of integers (each integer
# being the index of a token in a dictionary) or into a vector where the coefficient for each token could be binary,
# based on word count, based on tf-idf...
text_tok = Tokenizer(filters='[\\]^\t\n', lower=False, split=' ', oov_token='<OOV>')
pos_tok = Tokenizer(filters='\t\n', lower=False, split=' ', oov_token='<OOV>')
ner_tok = Tokenizer(filters='\t\n', lower=False, split=' ', oov_token='<OOV>')

text_tok.fit_on_texts(data_pd['text'])
pos_tok.fit_on_texts(data_pd['pos'])
ner_tok.fit_on_texts(data_pd['label'])

ner_config = ner_tok.get_config()
text_config = text_tok.get_config()
print(ner_config)
# print(text_config)

This prints the label tokenizer's config; the more frequently a tag occurs, the smaller its index.

{'num_words': None, 'filters': '\t\n', 'lower': False, 'split': ' ', 'char_level': False, 'oov_token': '<OOV>', 'document_count': 62010, 'word_counts': '{"O": 1146068, "B-gpe": 20436, "B-geo": 48876, "B-tim": 26296, "I-tim": 8493, "B-org": 26195, "I-org": 21899, "B-per": 21984, "I-per": 22270, "I-geo": 9512, "B-art": 503, "B-nat": 238, "B-eve": 391, "I-eve": 318, "I-art": 364, "I-gpe": 244, "I-nat": 62}', 'word_docs': '{"B-geo": 31660, "B-gpe": 16565, "B-tim": 22345, "O": 61999, "B-org": 20478, "I-org": 11011, "I-tim": 5526, "B-per": 17499, "I-per": 13805, "I-geo": 7738, "B-art": 425, "B-nat": 211, "B-eve": 361, "I-eve": 201, "I-art": 207, "I-gpe": 224, "I-nat": 50}', 'index_docs': '{"3": 31660, "9": 16565, "4": 22345, "2": 61999, "5": 20478, "8": 11011, "11": 5526, "7": 17499, "6": 13805, "10": 7738, "12": 425, "17": 211, "13": 361, "15": 201, "14": 207, "16": 224, "18": 50}', 'index_word': '{"1": "<OOV>", "2": "O", "3": "B-geo", "4": "B-tim", "5": "B-org", "6": "I-per", "7": "B-per", "8": "I-org", "9": "B-gpe", "10": "I-geo", "11": "I-tim", "12": "B-art", "13": "B-eve", "14": "I-art", "15": "I-eve", "16": "I-gpe", "17": "B-nat", "18": "I-nat"}', 'word_index': '{"<OOV>": 1, "O": 2, "B-geo": 3, "B-tim": 4, "B-org": 5, "I-per": 6, "B-per": 7, "I-org": 8, "B-gpe": 9, "I-geo": 10, "I-tim": 11, "B-art": 12, "B-eve": 13, "I-art": 14, "I-eve": 15, "I-gpe": 16, "B-nat": 17, "I-nat": 18}'}
  • What the tags mean
  • geo = Geographical entity
  • org = Organization
  • per = Person
  • gpe = Geopolitical entity
  • tim = Time indicator
  • art = Artifact
  • eve = Event
  • nat = Natural phenomenon
  • The B- / I- prefixes mark the beginning and the inside of an entity. For example, August 19 is tagged B-tim I-tim: both words belong to the same time expression, so the first word (August) gets the beginning tag B-tim and 19 gets the inside tag I-tim. The small helper below makes this concrete.
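
To make the B-/I- convention concrete, here is a tiny helper (my own illustration, not from the original post) that groups a tagged sentence back into entity spans:

# Group BIO tags into (entity type, entity text) spans -- illustration only.
def bio_to_spans(words, tags):
    spans, current = [], None
    for word, tag in zip(words, tags):
        if tag.startswith('B-'):                       # a new entity starts here
            if current:
                spans.append(current)
            current = (tag[2:], [word])
        elif tag.startswith('I-') and current and current[0] == tag[2:]:
            current[1].append(word)                    # continue the current entity
        else:                                          # 'O' (or a stray I- tag)
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, ' '.join(ws)) for etype, ws in spans]

print(bio_to_spans(['on', 'August', '19', '.'], ['O', 'B-tim', 'I-tim', 'O']))
# [('tim', 'August 19')]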

Once every word has been tokenized, the whole sentence is represented by those token ids, since all the computation is done on numbers.

# eval converts the stored config strings back into dictionaries
text_vocab = eval(text_config['index_word'])
print("Unique words in vocab:", len(text_vocab))

ner_vocab = eval(ner_config['index_word'])
print("Unique NER tags in vocab:", len(ner_vocab))

# Transforms each text in texts to a sequence of integers.
x_tok = text_tok.texts_to_sequences(data_pd['text'])
y_tok = ner_tok.texts_to_sequences(data_pd['label'])

Printing two examples shows that both the label sequence and the sentence have been converted into sequences of integers (they look like plain arrays).

# O B-gpe O O O O O O O O O O O O O B-geo O B-tim O O O O O O O
# [2, 9, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 4, 2, 2, 2, 2, 2, 2, 2]
print(data_pd['label'][0], y_tok[0])

# An Iraqi court has sentenced 11 men to death for the massive truck bombings in Baghdad last August that killed more than 100 people .
# [316, 89, 233, 13, 1112, 494, 240, 7, 248, 12, 2, 913, 1485, 528, 5, 146, 61, 570, 16, 38, 50, 55, 671, 39, 3]
print(data_pd['text'][0], x_tok[0])

Next, pad every input and output sequence to the same length, because TensorFlow expects all examples in a batch to have the same shape.
Sequences longer than max_len are truncated (note that pad_sequences cuts from the front by default; pass truncating='post' to cut from the end), and shorter ones are filled up with a padding token at the end.

max_len = 50
# padding, String, "pre" or "post" (optional, defaults to "pre"): pad either before or after each sequence.
x_pad = pad_sequences(x_tok, padding='post', maxlen=max_len)
y_pad = pad_sequences(y_tok, padding='post', maxlen=max_len)
print(x_pad.shape, y_pad.shape)
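
max_len = 50 is a judgment call; a quick way to check it (my addition, not in the original post) is to look at the sentence-length distribution and count how many sentences would actually be truncated:

# Quick check (my addition): how many sentences exceed max_len?
lengths = data_pd['text'].str.split(' ').str.len()
print(lengths.describe())
print('sentences longer than', max_len, ':', (lengths > max_len).sum())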

Finally, one-hot encode the outputs.
The outputs are class labels, and if the classes were encoded as the plain numbers 1, 2 and 3, the model could learn the spurious relation "class 3 = class 1 + class 2".

# Since there are multiple labels, each label token needs to be one-hot encoded like so:
num_classes = len(ner_vocab) + 1
Y = to_categorical(y_pad, num_classes)
# (62010, 50, 19)
print(Y.shape)

Printing one example shows that B-gpe has been converted into a one-hot vector of length 19,
so every sentence becomes a matrix of shape [50, 19].

# B-gpe => 9 => [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
print('finally converted', ner_vocab.get('9'), '=>', y_pad[0][1], '=>', Y[0][1])

Build the model

vocab_size = len(text_vocab) + 1
embedding_dim = 64
rnn_units = 100
BATCH_SIZE = 90
num_classes = len(ner_vocab) + 1

dropout = 0.2


# None means it is a dynamic shape. It can take any value depending on the batch size you choose.
# num_units in TensorFlow is the number of hidden states, Positive integer, dimensionality of the output space.
# TimeDistributed , This wrapper allows to apply a layer to every temporal slice of an input.
# kernel_initializer Initializer for the kernel weights matrix, used for the linear transformation of the inputs
def build_model_bilstm(vocab_size, embedding_dim, rnn_units, batch_size, classes):
    return Sequential([
        Embedding(vocab_size, embedding_dim, mask_zero=True, batch_input_shape=[batch_size, None]),
        Bidirectional(LSTM(units=rnn_units, return_sequences=True, dropout=dropout, kernel_initializer=tf.keras.initializers.he_normal())),
        TimeDistributed(Dense(rnn_units, activation='relu')),
        Dense(classes, activation='softmax')
    ])


model = build_model_bilstm(vocab_size=vocab_size, embedding_dim=embedding_dim, rnn_units=rnn_units, batch_size=BATCH_SIZE, classes=num_classes)
print(model.summary())
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
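
Because mask_zero=True was set, the Embedding layer emits a mask that lets the downstream layers (and the loss) ignore padded positions. A quick way to see that mask (my addition, not part of the original post):

# Check (my addition): positions holding the padding id 0 are flagged False.
emb_layer = model.layers[0]
mask = emb_layer.compute_mask(x_pad[:BATCH_SIZE])   # True where a real token sits
print(mask.shape)          # (90, 50)
print(mask[0].numpy())     # False at the padded tail of the first sentence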

Train and evaluate the model

X = x_pad
total_sentences = ner_config.get('document_count')
# hold out roughly 20% of the data for testing, rounded to a whole number of
# batches: 62010 / 90 * 0.2 ≈ 137.8 → 138 batches → 138 * 90 = 12420 sentences
test_size = round(total_sentences / BATCH_SIZE * 0.2)
test_size = BATCH_SIZE * test_size

X_train = X[test_size:]
Y_train = Y[test_size:]

X_test = X[0:test_size]
Y_test = Y[0:test_size]

model.fit(X_train, Y_train, batch_size=BATCH_SIZE, epochs=15)

model.evaluate(X_test, Y_test, batch_size=BATCH_SIZE)

The accuracy comes out at 96.34%:

138/138 [==============================] - 2s 4ms/step - loss: 0.0872 - accuracy: 0.9634
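
Token-level accuracy flatters a NER model, because most tokens are plain 'O'. A stricter, entity-level report can be produced with the seqeval package; the sketch below is my addition (it assumes pip install seqeval) and is not part of the original post.

# Entity-level precision/recall/F1 with seqeval (my addition).
from seqeval.metrics import classification_report

pred_ids = tf.argmax(model.predict(X_test, batch_size=BATCH_SIZE), -1).numpy()
true_ids = tf.argmax(Y_test, -1).numpy()

true_tags, pred_tags = [], []
for t_row, p_row in zip(true_ids, pred_ids):
    length = int((t_row != 0).sum())                   # 0 is the padding id
    true_tags.append([ner_vocab[str(i)] for i in t_row[:length]])
    pred_tags.append([ner_vocab.get(str(i), 'O') for i in p_row[:length]])

print(classification_report(true_tags, pred_tags))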

Predict sentences


y_pred = model.predict(X_test, batch_size=BATCH_SIZE)

# convert the predicted one-hot encodings back to class ids
y_pred = tf.argmax(y_pred, -1)
y_pnp = y_pred.numpy()

# convert the ground-truth one-hot encodings back to class ids
y_ground_true = tf.argmax(Y_test, -1)
y_ground_true_pnp = y_ground_true.numpy()

for i in range(10):
    x = 'sentence=> ' + text_tok.sequences_to_texts([X_test[i]])[0]
    ground_true = 'ground_true=> ' + ner_tok.sequences_to_texts([y_ground_true_pnp[i]])[0]
    prediction = 'prediction=> ' + ner_tok.sequences_to_texts([y_pnp[i]])[0]
    template = '|'.join(['{' + str(index) + ': <15}' for index, x in enumerate(x.split(' '))])
    print(template.format(*x.split(' ')))
    print(template.format(*ground_true.split(' ')))
    print(template.format(*prediction.split(' ')))
    print('\n')

Here are two of the printed examples; the predictions look quite accurate.

sentence=>     |An             |Iraqi          |court          |has            |sentenced      |11             |men            |to             |death          |for            |the            |massive        |truck          |bombings       |in             |Baghdad        |last           |August         |that           |killed         |more           |than           |100            |people         |.              |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          
ground_true=>  |O              |B-gpe          |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |B-geo          |O              |B-tim          |O              |O              |O              |O              |O              |O              |O              |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          
prediction=>   |O              |B-gpe          |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |B-geo          |O              |B-tim          |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              



sentence=>     |The            |court          |convicted      |the            |men            |of             |planning       |and            |implementing   |the            |August         |19             |attacks        |on             |the            |Iraqi          |Ministries     |of             |Finance        |and            |Foreign        |Affairs        |.              |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          
ground_true=>  |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |B-tim          |I-tim          |O              |O              |O              |B-gpe          |O              |O              |B-org          |I-org          |I-org          |I-org          |O              |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          |<OOV>          
prediction=>   |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |B-tim          |I-tim          |O              |O              |O              |B-gpe          |B-org          |I-org          |I-org          |I-org          |I-org          |I-org          |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              |O              
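
To tag a brand-new sentence instead of a row of X_test, the same tokenizer and padding must be applied first. One wrinkle: the Embedding layer was built with batch_input_shape=[BATCH_SIZE, None], so the model expects batches of exactly 90 rows. The sketch below is my addition (the example sentence is made up from words seen in training; unknown words map to <OOV>): it tiles the single sentence to fill a batch and keeps only the first row of the output.

import numpy as np

# Tag one new sentence (my addition, not from the original post).
sentence = 'The court convicted the men of planning the attacks in Baghdad .'
seq = pad_sequences(text_tok.texts_to_sequences([sentence]), padding='post', maxlen=max_len)
batch = np.repeat(seq, BATCH_SIZE, axis=0)              # (90, 50): fill the fixed-size batch
pred = model.predict(batch, batch_size=BATCH_SIZE)[0]   # (50, num_classes), first row only

words = sentence.split(' ')
tags = [ner_vocab.get(str(i), 'O') for i in pred.argmax(axis=-1)[:len(words)]]
for word, tag in zip(words, tags):
    print(f'{word:<12} {tag}')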

Link to the bilstm project
