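Sequence Labeling for Chinese Natural Language Processing with Deep Learning

Introduction to Deep Learning

There is already plenty of material on deep learning, so I won't go into it here. This article only covers the general approach to sequence labeling tasks in Chinese NLP.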

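Machine Learning and Deep Learning

Put simply, machine learning is a way of learning a model from samples (that is, data) and then using that model to make predictions.
There are many ML algorithms: Naive Bayes, Decision Tree, Support Vector Machine, Logistic Regression, Conditional Random Field, and so on.
Deep learning, put simply, is a perceptron with multiple hidden layers.
DL also comes in many models, but knowing the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN) is generally enough to begin with (all of them are worth learning eventually; the point is that these two deserve the focus early on).
How they differ: ML is a form of shallow learning in which features are usually designed by hand, whereas DL uses pre-training or unsupervised learning to extract feature representations and then supervised learning to train the predictive model (though not always).
This article is mainly about applying DL to Chinese NLP, so it uses Keras, the simplest and most convenient DL framework to work with. Keras is a high-level framework built on top of two very popular DL frameworks, Theano and TensorFlow.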

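Introduction to NLP

Natural Language Processing (NLP) is divided into Natural Language Understanding (NLU) and Natural Language Generation (NLG). Word segmentation, part-of-speech tagging, named entity recognition, and dependency parsing are the foundational NLP tasks, and each of them can be treated as a sequence labeling task.

Introduction to Sequence Labeling

Introduction to Word Vectors

Word embedding represents a word as a vector of real numbers and is an improvement over the one-hot representation. Its advantages are low dimensionality and the fact that the semantic similarity of two words can be measured conveniently with a mathematical distance. Its drawback is that the model grows large as the vocabulary grows, which led to the character embedding (Char Embedding) approach. Models trained that way are small but lose a good deal of semantic information, which in turn led to research on character vectors that incorporate word segmentation information.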
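To make the contrast with one-hot vectors concrete, here is a minimal sketch in plain Python 3. The three-dimensional dense vectors are made-up values chosen for illustration, not trained embeddings, and `cosine` is a helper name introduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot: every word gets its own axis, so distinct words are always orthogonal.
one_hot = {
    "貓": [1, 0, 0],
    "狗": [0, 1, 0],
    "車": [0, 0, 1],
}

# Dense vectors (hypothetical values): related words can point in similar directions.
dense = {
    "貓": [0.9, 0.8, 0.1],
    "狗": [0.8, 0.9, 0.2],
    "車": [0.1, 0.2, 0.9],
}

print(cosine(one_hot["貓"], one_hot["狗"]))  # 0.0
print(cosine(dense["貓"], dense["狗"]) > cosine(dense["貓"], dense["車"]))  # True
```

With one-hot vectors every pair of distinct words has similarity 0, so the representation carries no notion of relatedness; with dense vectors the distance itself becomes informative.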

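Sequence Labeling for Chinese NLP: CWS

Introduction to CWS

Chinese Word Segmentation (CWS) is the foundation of Chinese NLP. Broadly, there are two approaches: dictionary-based methods and methods based on ML or DL. For the history of CWS, see the article "漫話中文分詞". In short, dictionary-based methods are simple to implement and fast but handle ambiguity and unknown words poorly, while ML- and DL-based methods are more complex to implement and slower but cope better with ambiguity and OOV (Out-Of-Vocabulary) words.
The most widely used dictionary-based method is probably forward maximum matching, and among ML-based approaches CRF performs comparatively well. This article focuses on the DL-based approach, but in practice the two kinds of methods should be combined sensibly.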
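For reference, forward maximum matching can be sketched in a few lines of Python 3. The function name, the `max_len` parameter, and the toy dictionary are all assumptions made for this illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    dictionary word; fall back to a single character if nothing matches."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

toy_dict = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", toy_dict))
# ['研究生', '命', '起源']
```

The output also shows the weakness mentioned above: the greedy match grabs "研究生" and can no longer recover the intended reading "研究 / 生命 / 起源".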

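Tag Set and Evaluation Method

This article uses the tag set B (Begin: the character starts a word), M (Middle: the character is inside a word), E (End: the character ends a word), and S (Single: the character is a word by itself). The training corpus and the scoring tool follow SIGHAN; for details see my other article, "SIGHAN測評中文分詞的方法與指標介紹".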
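As a small illustration of the tag set, the sketch below (Python 3) turns an already segmented sentence into one tag per character. `words_to_tags` is a name introduced for this example, and the sample words are arbitrary.

```python
def words_to_tags(words):
    """Return one B/M/E/S tag per character for a list of segmented words."""
    tags = []
    for word in words:
        if not word:
            continue  # ignore empty strings
        if len(word) == 1:
            tags.append("S")
        else:
            # First character, any number of middle characters, last character
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags

words = ["深度學習", "很", "有趣"]
print("".join(words_to_tags(words)))
# BMMESBE
```

Segmentation then becomes a per-character classification problem: predict the tag string, and the word boundaries follow from the E and S tags.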

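Model

The approach is to train a bi-directional LSTM model that predicts, for each character of a sentence, a probability distribution over the tags, and then to use the Viterbi algorithm to find the best tag sequence. Note that the model listing further down stacks two LSTM layers rather than using a bi-directional one. For segmentation there is no need to add pretrained word vectors; they bring no noticeable improvement.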
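In the implementation that follows, the network never sees a whole sentence at once: each character is presented as a fixed-width window of character indices centered on it (width 7 in the training code), and the network predicts the tag of the center character. The Python 3 sketch below shows that windowing on its own; `context_windows` and `pad_index` are illustrative names, not part of the original code.

```python
def context_windows(indices, pad_index, width=7):
    """Return, for each position, the `width` indices centered on it."""
    half = (width - 1) // 2
    # Pad both ends so that the first and last characters also get full windows
    padded = [pad_index] * half + list(indices) + [pad_index] * half
    return [padded[i:i + width] for i in range(len(indices))]

print(context_windows([10, 11, 12], pad_index=0, width=5))
# [[0, 0, 10, 11, 12], [0, 10, 11, 12, 0], [10, 11, 12, 0, 0]]
```

A sentence of n characters therefore yields n training samples, each of length `width`, which is why the embedding layer in the model is declared with a fixed input length.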

Implementation

Preprocessing

#!/usr/bin/env python
#-*- coding: utf-8 -*-

#Thu Mar  3 11:01:05 CST 2016 by Demobin

import json
import h5py
import string
import codecs

corpus_tags = ['S', 'B', 'M', 'E']

def saveCwsInfo(path, cwsInfo):
    '''Save the segmentation vocabulary and tag probabilities'''
    print('save cws info to %s'%path)
    fd = open(path, 'w')
    (initProb, tranProb), (vocab, indexVocab) = cwsInfo
    j = json.dumps((initProb, tranProb))
    fd.write(j + '\n')
    for char in vocab:
        fd.write(char.encode('utf-8') + '\t' + str(vocab[char]) + '\n')
    fd.close()

def loadCwsInfo(path):
    '''Load the segmentation vocabulary and tag probabilities'''
    print('load cws info from %s'%path)
    fd = open(path, 'r')
    line = fd.readline()
    j = json.loads(line.strip())
    initProb, tranProb = j[0], j[1]
    lines = fd.readlines()
    fd.close()
    vocab = {}
    indexVocab = [0 for i in range(len(lines))]
    for line in lines:
        rst = line.strip().split('\t')
        if len(rst) < 2: continue
        char, index = rst[0].decode('utf-8'), int(rst[1])
        vocab[char] = index
        indexVocab[index] = char
    return (initProb, tranProb), (vocab, indexVocab)

def saveCwsData(path, cwsData):
    '''Save the segmentation training samples'''
    print('save cws data to %s'%path)
    #HDF5 stores large matrices efficiently
    fd = h5py.File(path,'w')
    (X, y) = cwsData
    fd.create_dataset('X', data = X)
    fd.create_dataset('y', data = y)
    fd.close()

def loadCwsData(path):
    '''Load the segmentation training samples'''
    print('load cws data from %s'%path)
    fd = h5py.File(path,'r')
    X = fd['X'][:]
    y = fd['y'][:]
    fd.close()
    return (X, y)

def sent2vec2(sent, vocab, ctxWindows = 5):
    
    charVec = []
    for char in sent:
        if char in vocab:
            charVec.append(vocab[char])
        else:
            charVec.append(vocab['retain-unknown'])
    #Pad both ends so that every character gets a full window
    num = len(charVec)
    pad = int((ctxWindows - 1)/2)
    for i in range(pad):
        charVec.insert(0, vocab['retain-padding'] )
        charVec.append(vocab['retain-padding'] )
    X = []
    for i in range(num):
        X.append(charVec[i:i + ctxWindows])
    return X

def sent2vec(sent, vocab, ctxWindows = 5):
    chars = []
    for char in sent:
        chars.append(char)
    return sent2vec2(chars, vocab, ctxWindows = ctxWindows)

def doc2vec(fname, vocab):
    '''Convert a segmented corpus file into training samples'''

    #Reads the whole file at once; mind the memory use
    fd = codecs.open(fname, 'r', 'utf-8')
    lines = fd.readlines()
    fd.close()

    #Sample set
    X = []
    y = []

    #Tag statistics
    tagSize = len(corpus_tags)
    tagCnt = [0 for i in range(tagSize)]
    tagTranCnt = [[0 for i in range(tagSize)] for j in range(tagSize)]

    #Iterate over lines
    for line in lines:
        #Split on whitespace
        words = line.strip('\n').split()
        #Characters and tags of the current line
        chars = []
        tags = []
        #Iterate over words
        for word in words:
            #Words of two or more characters
            if len(word) > 1:
                #First character of the word
                chars.append(word[0])
                tags.append(corpus_tags.index('B'))
                #Characters in the middle of the word
                for char in word[1:(len(word) - 1)]:
                    chars.append(char)
                    tags.append(corpus_tags.index('M'))
                #Last character of the word
                chars.append(word[-1])
                tags.append(corpus_tags.index('E'))
            #Single-character word
            else:
                chars.append(word)
                tags.append(corpus_tags.index('S'))

        #Context-window representation of each character
        lineVecX = sent2vec2(chars, vocab, ctxWindows = 7)

        #Collect tag statistics
        lineVecY = []
        lastTag = -1
        for tag in tags:
            #Label
            lineVecY.append(tag)
            #lineVecY.append(corpus_tags[tag])
            #Count tag frequency
            tagCnt[tag] += 1
            #Count tag-to-tag transitions
            if lastTag != -1:
                tagTranCnt[lastTag][tag] += 1
            #Remember the previous tag
            lastTag = tag

        X.extend(lineVecX)
        y.extend(lineVecY)

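    #Total number of characters
    charCnt = sum(tagCnt)
    #Total number of transitions
    tranCnt = sum([sum(tag) for tag in tagTranCnt])
    #Initial tag probabilities (overall tag frequencies)
    initProb = []
    for i in range(tagSize):
        initProb.append(tagCnt[i]/float(charCnt))
    #Tag transition probabilities (each count divided by the total number of transitions)
    tranProb = []
    for i in range(tagSize):
        p = []
        for j in range(tagSize):
            p.append(tagTranCnt[i][j]/float(tranCnt))
        tranProb.append(p)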

    return X, y, initProb, tranProb

def genVocab(fname, delimiters = [' ', '\n']):
    '''Build the character vocabulary and its reverse index'''
    #Reads the whole file at once; mind the memory use
    fd = codecs.open(fname, 'r', 'utf-8')
    data = fd.read()
    fd.close()

    vocab = {}
    indexVocab = []
    #Iterate over characters
    index = 0
    for char in data:
        #Delimiters are not added to the vocabulary
        if char not in delimiters and char not in vocab:
            vocab[char] = index
            indexVocab.append(char)
            index += 1

    #Add reserved entries for unknown characters and padding
    vocab['retain-unknown'] = len(vocab)
    vocab['retain-padding'] = len(vocab)
    indexVocab.append('retain-unknown')
    indexVocab.append('retain-padding')
    #Return the vocabulary and the index
    return vocab, indexVocab

def load(fname):
    print 'train from file', fname
    vocab, indexVocab = genVocab(fname)
    X, y, initProb, tranProb = doc2vec(fname, vocab)
    print len(X), len(y), len(vocab), len(indexVocab)
    print initProb
    print tranProb
    return (X, y), (initProb, tranProb), (vocab, indexVocab)

if __name__ == '__main__':
    import os
    #open() does not expand '~', so expand it explicitly
    load(os.path.expanduser('~/work/corpus/icwb2/training/msr_training.utf8'))
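One thing worth noticing in `doc2vec` above: transition counts are divided by the total number of transitions, and the initial probabilities are overall tag frequencies. A textbook HMM would instead normalize each row of the transition table (so that row i is P(next tag | tag i)) and count only sentence-initial tags for the start distribution. The Python 3 sketch below shows that alternative; it is not what the listing does, and `estimate_hmm` is a name introduced here.

```python
def estimate_hmm(tag_sequences, num_tags):
    """Estimate start probabilities and row-normalized transition probabilities.

    tag_sequences: list of sentences, each a list of integer tag ids.
    """
    start = [0] * num_tags
    trans = [[0] * num_tags for _ in range(num_tags)]
    for tags in tag_sequences:
        if not tags:
            continue
        start[tags[0]] += 1
        for prev, cur in zip(tags, tags[1:]):
            trans[prev][cur] += 1

    def normalize(row):
        total = sum(row)
        # Leave all-zero rows as zeros instead of dividing by zero
        return [c / total if total else 0.0 for c in row]

    return normalize(start), [normalize(row) for row in trans]

# Tag ids follow corpus_tags = ['S', 'B', 'M', 'E']
sentences = [[1, 3, 0], [0, 1, 2, 3]]  # "B E S" and "S B M E"
start_p, trans_p = estimate_hmm(sentences, 4)
print(start_p)     # [0.5, 0.5, 0.0, 0.0]
print(trans_p[1])  # [0.0, 0.0, 0.5, 0.5]
```

With row normalization, impossible moves such as B followed by S keep probability 0 while each row of possible moves sums to 1, which is the form the Viterbi recurrence usually assumes.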

Model

#!/usr/bin/env python
#-*- coding: utf-8 -*-

#Thu Mar  3 11:01:05 CST 2016 by Demobin

import numpy as np
import json
import h5py
import codecs

from dataset import cws
from util import viterbi

from sklearn.model_selection import train_test_split

from keras.preprocessing import sequence
from keras.optimizers import SGD, RMSprop, Adagrad
from keras.utils import np_utils
from keras.models import Sequential,Graph, model_from_json
from keras.layers.core import Dense, Dropout, Activation, TimeDistributedDense
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM, GRU, SimpleRNN

from gensim.models import Word2Vec

def train(cwsInfo, cwsData, modelPath, weightPath):

    (initProb, tranProb), (vocab, indexVocab) = cwsInfo
    (X, y) = cwsData

    train_X, test_X, train_y, test_y = train_test_split(X, y , train_size=0.9, random_state=1)

    train_X = np.array(train_X)
    train_y = np.array(train_y)
    test_X = np.array(test_X)
    test_y = np.array(test_y)
    
    outputDims = len(cws.corpus_tags)
    Y_train = np_utils.to_categorical(train_y, outputDims)
    Y_test = np_utils.to_categorical(test_y, outputDims)
    batchSize = 128
    vocabSize = len(vocab) + 1
    wordDims = 100
    maxlen = 7
    hiddenDims = 100

    w2vModel = Word2Vec.load('model/sougou.char.model')
    embeddingDim = w2vModel.vector_size
    embeddingUnknown = [0 for i in range(embeddingDim)]
    embeddingWeights = np.zeros((vocabSize + 1, embeddingDim))
    for word, index in vocab.items():
        if word in w2vModel:
            e = w2vModel[word]
        else:
            e = embeddingUnknown
        embeddingWeights[index, :] = e
    
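    #LSTM
    model = Sequential()
    #mask_zero is off: windows are padded with the 'retain-padding' index, and index 0 is a real character
    model.add(Embedding(output_dim = embeddingDim, input_dim = vocabSize + 1, 
        input_length = maxlen, mask_zero = False, weights = [embeddingWeights]))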
    model.add(LSTM(output_dim = hiddenDims, return_sequences = True))
    model.add(LSTM(output_dim = hiddenDims, return_sequences = False))
    model.add(Dropout(0.5))
    model.add(Dense(outputDims))
    model.add(Activation('softmax'))
    model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')
    
    result = model.fit(train_X, Y_train, batch_size = batchSize, 
                    nb_epoch = 20, validation_data = (test_X,Y_test), show_accuracy=True)
    
    j = model.to_json()
    fd = open(modelPath, 'w')
    fd.write(j)
    fd.close()
    
    model.save_weights(weightPath)

    return model

def loadModel(modelPath, weightPath):

    fd = open(modelPath, 'r')
    j = fd.read()
    fd.close()
    
    model = model_from_json(j)
    
    model.load_weights(weightPath)

    return model


# Infer the tag sequence for an input sentence and return the segmented text
def cwsSent(sent, model, cwsInfo):
    (initProb, tranProb), (vocab, indexVocab) = cwsInfo
    #Nothing to predict for an empty line
    if not sent:
        return ''
    vec = cws.sent2vec(sent, vocab, ctxWindows = 7)
    vec = np.array(vec)
    probs = model.predict_proba(vec)
    #classes = model.predict_classes(vec)

    prob, path = viterbi.viterbi(vec, cws.corpus_tags, initProb, tranProb, probs.transpose())

    #Rebuild space-separated words from the B/M/E/S tag path
    ss = ''
    word = ''
    for i, t in enumerate(path):
        tag = cws.corpus_tags[t]
        #Flush characters left over from a word that never reached E
        if tag in ('S', 'B') and word:
            ss += word + ' '
            word = ''
        word += sent[i]
        if tag in ('S', 'E'):
            ss += word + ' '
            word = ''
    #Flush a trailing word that never reached E
    if word:
        ss += word + ' '

    return ss

def cwsFile(fname, dstname, model, cwsInfo):
    fd = codecs.open(fname, 'r', 'utf-8')
    lines = fd.readlines()
    fd.close()

    fd = open(dstname, 'w')
    for line in lines:
        rst = cwsSent(line.strip(), model, cwsInfo)
        fd.write(rst.encode('utf-8') + '\n')
    fd.close()

def test():
    print 'Loading vocab...'
    cwsInfo = cws.loadCwsInfo('./model/cws.info')
    cwsData = cws.loadCwsData('./model/cws.data')
    print 'Done!'
    print 'Loading model...'
    #model = train(cwsInfo, cwsData, './model/cws.w2v.model', './model/cws.w2v.model.weights')
    #model = loadModel('./model/cws.w2v.model', './model/cws.w2v.model.weights')
    model = loadModel('./model/cws.model', './model/cws.model.weights')
    print 'Done!'
    print '-------------start predict----------------'
    #s = u'為寂寞的夜空畫上一個月亮'
    #print cwsSent(s, model, cwsInfo)
    import os
    #open() does not expand '~', so expand it explicitly
    cwsFile(os.path.expanduser('~/work/corpus/icwb2/testing/msr_test.utf8'), './msr_test.utf8.cws', model, cwsInfo)

if __name__ == '__main__':
    test()
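The file written by `cwsFile` is meant to be scored against the gold segmentation with the SIGHAN tooling mentioned earlier. To show what those word-level metrics measure, here is a simplified per-sentence sketch in Python 3. It is not a replacement for the official scoring tool, and `word_spans` and `prf` are names introduced for this example.

```python
def word_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans

def prf(gold_words, pred_words):
    """Word-level precision, recall and F1 for one sentence."""
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    # A predicted word counts as correct only if both boundaries match
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

gold = ["研究", "生命", "起源"]
pred = ["研究生", "命", "起源"]
print(prf(gold, pred))  # one of three words matches, so each value is about 0.33
```

Comparing spans rather than strings matters: a word is only correct if it occupies the same character positions in both segmentations.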

Viterbi Algorithm

#!/usr/bin/python
# -*- coding: utf-8 -*-

#Thu Jan 28 17:14:03 CST 2016 by Demobin

def _print(hiddenstates, V):
    s = "    " + " ".join(("%7d" % i) for i in range(len(V))) + "\n"
    for i, state in enumerate(hiddenstates):
        s += "%.5s: " % state
        s += " ".join("%.7s" % ("%f" % v[i]) for v in V)
        s += "\n"
    print(s)

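#Standard Viterbi algorithm; the arguments are the observations, the hidden states, and the probability triple
#(initial, transition, and emission probabilities). Note that emit_p[y][t] is indexed by state and time step.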
def viterbi(obs, states, start_p, trans_p, emit_p):

    lenObs = len(obs)
    lenStates = len(states)

    V = [[0.0 for col in range(lenStates)] for row in range(lenObs)]
    path = [[0 for col in range(lenObs)] for row in range(lenStates)]

    #t = 0
    for y in range(lenStates):
        #V[0][y] = start_p[y] * emit_p[y][obs[0]]
        V[0][y] = start_p[y] * emit_p[y][0]
        path[y][0] = y

    #t >= 1
    for t in range(1, lenObs):
        newpath = [[0 for col in range(lenObs)] for row in range(lenStates)]

        for y in range(lenStates):
            prob = -1
            state = 0
            for y0 in range(lenStates):
                #nprob = V[t - 1][y0] * trans_p[y0][y] * emit_p[y][obs[t]]
                nprob = V[t - 1][y0] * trans_p[y0][y] * emit_p[y][t]
                if nprob > prob:
                    prob = nprob
                    state = y0
            #Record the best probability of reaching state y at time t
            V[t][y] = prob
            #Extend the best path that leads into state y
            newpath[y][:t] = path[state][:t]
            newpath[y][t] = y

        path = newpath

    prob = -1
    state = 0
    for y in range(lenStates):
        if V[lenObs - 1][y] > prob:
            prob = V[lenObs - 1][y]
            state = y

    #_print(states, V)
    return prob, path[state]

def example():
    #Hidden states
    hiddenstates = ('Healthy', 'Fever')
    #Observations
    observations = ('normal', 'cold', 'dizzy')

    #Initial probabilities
    #'Healthy': 0.6, 'Fever': 0.4
    start_p = [0.6, 0.4]
    #Transition probabilities
    #'Healthy' : {'Healthy': 0.7, 'Fever': 0.3},
    #'Fever' : {'Healthy': 0.4, 'Fever': 0.6}
    trans_p = [[0.7, 0.3], [0.4, 0.6]]
    #Emission (output / observation) probabilities; viterbi() indexes emit_p by
    #time step, so column t must hold the probabilities for observations[t]
    #'Healthy' : {'normal': 0.5, 'cold': 0.4, 'dizzy': 0.1},
    #'Fever' : {'normal': 0.1, 'cold': 0.3, 'dizzy': 0.6}
    emit_p = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

    return viterbi(observations,
                   hiddenstates,
                   start_p,
                   trans_p,
                   emit_p)

if __name__ == '__main__':
    print(example())
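The listing above multiplies raw probabilities, and a product of several hundred such factors underflows to 0.0 in double precision, after which all paths look equally good. A common remedy is to add log probabilities instead. The Python 3 sketch below is one way to do that with back-pointers; it is a variant written for this article, not the code used above, and unlike `viterbi()` it takes `emit_p[t][y]` (time step first), so no transpose is needed.

```python
import math

def viterbi_log(start_p, trans_p, emit_p):
    """Viterbi decoding in log space.

    start_p[y]    : probability of starting in state y
    trans_p[y0][y]: probability of moving from state y0 to state y
    emit_p[t][y]  : model score for state y at time step t
    Returns (log probability of the best path, best path as state ids).
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    if not emit_p:
        return 0.0, []

    n_states = len(start_p)
    score = [log(start_p[y]) + log(emit_p[0][y]) for y in range(n_states)]
    back = []  # back[t - 1][y] is the best predecessor of state y at time t
    for t in range(1, len(emit_p)):
        new_score, pointers = [], []
        for y in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda y0: score[y0] + log(trans_p[y0][y]))
            new_score.append(score[best_prev] + log(trans_p[best_prev][y])
                             + log(emit_p[t][y]))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Follow the back-pointers from the best final state
    last = max(range(n_states), key=lambda y: score[y])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path

# Same Healthy/Fever example as above, with emissions laid out per time step
logp, path = viterbi_log([0.6, 0.4],
                         [[0.7, 0.3], [0.4, 0.6]],
                         [[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]])
print(path)  # [0, 0, 1], i.e. Healthy, Healthy, Fever
```

Storing one back-pointer per state and time step also avoids copying whole paths at every step, which keeps decoding linear in the sentence length.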

Sequence Labeling for Chinese NLP: POS

Preprocessing

#!/usr/bin/env python
# -*- coding: utf-8 -*-

#Thu Mar  3 11:01:05 CST 2016 by Demobin

import h5py
import json
import codecs

mappings = {
    #People's Daily tag set : 863 tag set
            'w':    'wp',
            't':    'nt',
            'nr':   'nh',
            'nx':   'nz',
            'nn':   'n',
            'nzz':  'n',
            'Ng':   'n',
            'f':    'nd',
            's':    'nl',
            'Vg':   'v',
            'vd':   'v',
            'vn':   'v',
            'vnn':  'v',
            'ad':   'a',
            'an':   'a',
            'Ag':   'a',
            'l':    'i',
            'z':    'a',
            'mq':   'm',
            'Mg':   'm',
            'Tg':   'nt',
            'y':    'u',
            'Yg':   'u',
            'Dg':   'd',
            'Rg':   'r',
            'Bg':   'b',
            'pn':   'p',
        }

tags_863 = {
        'a' :    [0, '形容詞'],
        'b' :    [1, '區別詞'],
        'c' :    [2, '連詞'],
        'd' :    [3, '副詞'],
        'e' :    [4, '嘆詞'],
        'g' :    [5, '語素字'],
        'h' :    [6, '前接成分'],
        'i' :    [7, '習用語'],
        'j' :    [8, '簡稱'],
        'k' :    [9, '后接成分'],
        'm' :    [10, '數詞'],
        'n' :    [11, '名詞'],
        'nd':    [12, '方位名詞'],
        'nh':    [13, '人名'],
        'ni':    [14, '團體、機構、組織的專名'],
        'nl':    [15, '處所名詞'],
        'ns':    [16, '地名'],
        'nt':    [17, '時間名詞'],
        'nz':    [18, '其它專名'],
        'o' :    [19, '擬聲詞'],
        'p' :    [20, '介詞'],
        'q' :    [21, '量詞'],
        'r' :    [22, '代詞'],
        'u' :    [23, '助詞'],
        'v' :    [24, '動詞'],
        'wp':    [25, '標點'],
        'ws':    [26, '字符串'],
        'x' :    [27, '非語素字'],
    }

def genCorpusTags():
    '''Print the joint "tag-position" labels for pasting into corpus_tags'''
    s = ''
    features = ['b', 'm', 'e', 's']
    for tag in tags_863:
        for f in features:
             s += '\'' + tag + '-' + f + '\'' + ','
    print s

corpus_tags = [
        'nh-b','nh-m','nh-e','nh-s','ni-b','ni-m','ni-e','ni-s','nl-b','nl-m','nl-e','nl-s','nd-b','nd-m','nd-e','nd-s','nz-b','nz-m','nz-e','nz-s','ns-b','ns-m','ns-e','ns-s','nt-b','nt-m','nt-e','nt-s','ws-b','ws-m','ws-e','ws-s','wp-b','wp-m','wp-e','wp-s','a-b','a-m','a-e','a-s','c-b','c-m','c-e','c-s','b-b','b-m','b-e','b-s','e-b','e-m','e-e','e-s','d-b','d-m','d-e','d-s','g-b','g-m','g-e','g-s','i-b','i-m','i-e','i-s','h-b','h-m','h-e','h-s','k-b','k-m','k-e','k-s','j-b','j-m','j-e','j-s','m-b','m-m','m-e','m-s','o-b','o-m','o-e','o-s','n-b','n-m','n-e','n-s','q-b','q-m','q-e','q-s','p-b','p-m','p-e','p-s','r-b','r-m','r-e','r-s','u-b','u-m','u-e','u-s','v-b','v-m','v-e','v-s','x-b','x-m','x-e','x-s'
    ]

def savePosInfo(path, posInfo):
    '''Save the POS-tagging vocabulary and tag probabilities'''
    print('save pos info to %s'%path)
    fd = open(path, 'w')
    (initProb, tranProb), (vocab, indexVocab) = posInfo
    j = json.dumps((initProb, tranProb))
    fd.write(j + '\n')
    for char in vocab:
        fd.write(char.encode('utf-8') + '\t' + str(vocab[char]) + '\n')
    fd.close()

def loadPosInfo(path):
    '''Load the POS-tagging vocabulary and tag probabilities'''
    print('load pos info from %s'%path)
    fd = open(path, 'r')
    line = fd.readline()
    j = json.loads(line.strip())
    initProb, tranProb = j[0], j[1]
    lines = fd.readlines()
    fd.close()
    vocab = {}
    indexVocab = [0 for i in range(len(lines))]
    for line in lines:
        rst = line.strip().split('\t')
        if len(rst) < 2: continue
        char, index = rst[0].decode('utf-8'), int(rst[1])
        vocab[char] = index
        indexVocab[index] = char
    return (initProb, tranProb), (vocab, indexVocab)

def savePosData(path, posData):
    '''Save the POS-tagging training samples'''
    print('save pos data to %s'%path)
    #HDF5 stores large matrices efficiently
    fd = h5py.File(path,'w')
    (X, y) = posData
    fd.create_dataset('X', data = X)
    fd.create_dataset('y', data = y)
    fd.close()

def loadPosData(path):
    '''載入分詞訓練輸入樣本'''
    print('load pos data from %s'%path)
    fd = h5py.File(path,'r')
    X = fd['X'][:]
    y = fd['y'][:]
    fd.close()
    return (X, y)

def sent2vec2(sent, vocab, ctxWindows = 5):
    
    charVec = []
    for char in sent:
        if char in vocab:
            charVec.append(vocab[char])
        else:
            charVec.append(vocab['retain-unknown'])
    #Pad both ends so that every character gets a full window
    num = len(charVec)
    pad = int((ctxWindows - 1)/2)
    for i in range(pad):
        charVec.insert(0, vocab['retain-padding'] )
        charVec.append(vocab['retain-padding'] )
    X = []
    for i in range(num):
        X.append(charVec[i:i + ctxWindows])
    return X

def sent2vec(sent, vocab, ctxWindows = 5):
    '''Expects a sentence that is already segmented with spaces'''
    chars = []
    words = sent.split()
    for word in words:
        #Words of two or more characters
        if len(word) > 1:
            #First character of the word
            chars.append(word[0] + '_b')
            #Characters in the middle of the word
            for char in word[1:(len(word) - 1)]:
                chars.append(char + '_m')
            #Last character of the word
            chars.append(word[-1] + '_e')
        #Single-character word
        else:
            chars.append(word + '_s')
    
    return sent2vec2(chars, vocab, ctxWindows = ctxWindows)

def doc2vec(fname, vocab):
    '''Convert a POS-tagged corpus file into training samples'''

    #Reads the whole file at once; mind the memory use
    fd = codecs.open(fname, 'r', 'utf-8')
    lines = fd.readlines()
    fd.close()

    #Sample set
    X = []
    y = []

    #Tag statistics
    tagSize = len(corpus_tags)
    tagCnt = [0 for i in range(tagSize)]
    tagTranCnt = [[0 for i in range(tagSize)] for j in range(tagSize)]

    #Iterate over lines
    for line in lines:
        #Split on whitespace
        words = line.strip('\n').split()
        #Characters and tags of the current line
        chars = []
        tags = []
        #Iterate over "word/tag" tokens
        for word in words:
            #Split on the last '/' so that a word containing '/' stays intact
            rst = word.rsplit('/', 1)
            if len(rst) < 2:
                #Report and skip tokens without a tag
                print word
                continue
            word, tag = rst[0], rst[1]
            #Map People's Daily tags onto the 863 tag set
            if tag not in tags_863:
                tag = mappings[tag]
            #Words of two or more characters
            if len(word) > 1:
                #First character of the word
                chars.append(word[0] + '_b')
                tags.append(corpus_tags.index(tag + '-' + 'b'))
                #Characters in the middle of the word
                for char in word[1:(len(word) - 1)]:
                    chars.append(char + '_m')
                    tags.append(corpus_tags.index(tag + '-' + 'm'))
                #Last character of the word
                chars.append(word[-1] + '_e')
                tags.append(corpus_tags.index(tag + '-' + 'e'))
            #Single-character word
            else:
                chars.append(word + '_s')
                tags.append(corpus_tags.index(tag + '-' + 's'))

        #Context-window representation of each character
        lineVecX = sent2vec2(chars, vocab, ctxWindows = 7)

        #Collect tag statistics
        lineVecY = []
        lastTag = -1
        for tag in tags:
            #Label
            lineVecY.append(tag)
            #lineVecY.append(corpus_tags[tag])
            #Count tag frequency
            tagCnt[tag] += 1
            #Count tag-to-tag transitions
            if lastTag != -1:
                tagTranCnt[lastTag][tag] += 1
            #Remember the previous tag
            lastTag = tag

        X.extend(lineVecX)
        y.extend(lineVecY)

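    #Total number of characters
    charCnt = sum(tagCnt)
    #Total number of transitions
    tranCnt = sum([sum(tag) for tag in tagTranCnt])
    #Initial tag probabilities (overall tag frequencies)
    initProb = []
    for i in range(tagSize):
        initProb.append(tagCnt[i]/float(charCnt))
    #Tag transition probabilities (each count divided by the total number of transitions)
    tranProb = []
    for i in range(tagSize):
        p = []
        for j in range(tagSize):
            p.append(tagTranCnt[i][j]/float(tranCnt))
        tranProb.append(p)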

    return X, y, initProb, tranProb

def vocabAddChar(vocab, indexVocab, index, char):
    if char not in vocab:
        vocab[char] = index
        indexVocab.append(char)
        index += 1
    return index

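def genVocab(fname, delimiters = [' ', '\n']):
    '''Build the vocabulary of position-marked characters and its reverse index'''
    #Reads the whole file at once; mind the memory use
    fd = codecs.open(fname, 'r', 'utf-8')
    lines = fd.readlines()
    fd.close()

    vocab = {}
    indexVocab = []
    #Iterate over lines
    index = 0
    for line in lines:
        words = line.strip().split()
        if len(words) <= 0: continue
        #Iterate over "word/tag" tokens
        for word in words:
            #Split on the last '/' and skip tokens without a tag
            rst = word.rsplit('/', 1)
            if len(rst) < 2: continue
            word, tag = rst[0], rst[1]
            #Words of two or more characters
            if len(word) > 1:
                #First character of the word
                char = word[0] + '_b'
                index = vocabAddChar(vocab, indexVocab, index, char)
                #Characters in the middle of the word
                for char in word[1:(len(word) - 1)]:
                    char = char + '_m'
                    index = vocabAddChar(vocab, indexVocab, index, char)
                #Last character of the word
                char = word[-1] + '_e'
                index = vocabAddChar(vocab, indexVocab, index, char)
            #Single-character word
            else:
                char = word + '_s'
                index = vocabAddChar(vocab, indexVocab, index, char)

    #Add reserved entries for unknown characters and padding
    vocab['retain-unknown'] = len(vocab)
    vocab['retain-padding'] = len(vocab)
    indexVocab.append('retain-unknown')
    indexVocab.append('retain-padding')
    #Return the vocabulary and the index
    return vocab, indexVocab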

def load(fname):
    print 'train from file', fname
    delims = [' ', '\n']
    vocab, indexVocab = genVocab(fname)
    X, y, initProb, tranProb = doc2vec(fname, vocab)
    print len(X), len(y), len(vocab), len(indexVocab)
    print initProb
    print tranProb
    return (X, y), (initProb, tranProb), (vocab, indexVocab)

def test():
    load('../data/pos.train')

if __name__ == '__main__':
    test()
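The core idea of this preprocessing step is that every character gets both a position-marked input feature (such as `夜_b`) and a joint output label (such as `n-b`). The Python 3 sketch below shows that conversion for a single "word/tag" token. It is a simplification of `doc2vec`: the function name is introduced here, and the optional `mapping` argument stands in for the `mappings` table.

```python
def token_to_char_labels(token, mapping=None):
    """Split 'word/tag' into per-character (feature, label) pairs."""
    # Split on the last '/' so that a word containing '/' stays intact
    word, tag = token.rsplit("/", 1)
    if mapping and tag in mapping:
        tag = mapping[tag]
    if len(word) == 1:
        positions = ["s"]
    else:
        positions = ["b"] + ["m"] * (len(word) - 2) + ["e"]
    return [(ch + "_" + p, tag + "-" + p) for ch, p in zip(word, positions)]

print(token_to_char_labels("夜空/n"))
# [('夜_b', 'n-b'), ('空_e', 'n-e')]
print(token_to_char_labels("的/u"))
# [('的_s', 'u-s')]
print(token_to_char_labels("魯迅/nr", {"nr": "nh"}))
# [('魯_b', 'nh-b'), ('迅_e', 'nh-e')]
```

Because the input already carries the word boundaries, the network only has to decide the part of speech; the position suffix on the label keeps the output aligned with the characters.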

Model

#!/usr/bin/env python
#-*- coding: utf-8 -*-

#Thu Mar  3 11:01:05 CST 2016 by Demobin

import numpy as np
import json
import h5py
import codecs

from dataset import pos
from util import viterbi

from sklearn.model_selection import train_test_split

from keras.preprocessing import sequence
from keras.optimizers import SGD, RMSprop, Adagrad
from keras.utils import np_utils
from keras.models import Sequential,Graph, model_from_json
from keras.layers.core import Dense, Dropout, Activation, TimeDistributedDense
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM, GRU, SimpleRNN

from util import pChar

def train(posInfo, posData, modelPath, weightPath):

    (initProb, tranProb), (vocab, indexVocab) = posInfo
    (X, y) = posData

    train_X, test_X, train_y, test_y = train_test_split(X, y , train_size=0.9, random_state=1)

    train_X = np.array(train_X)
    train_y = np.array(train_y)
    test_X = np.array(test_X)
    test_y = np.array(test_y)
    
    outputDims = len(pos.corpus_tags)
    Y_train = np_utils.to_categorical(train_y, outputDims)
    Y_test = np_utils.to_categorical(test_y, outputDims)
    batchSize = 128
    vocabSize = len(vocab) + 1
    wordDims = 100
    maxlen = 7
    hiddenDims = 100

    w2vModel, vectorSize = pChar.load('model/pChar.model')
    embeddingDim = int(vectorSize)
    embeddingUnknown = [0 for i in range(embeddingDim)]
    embeddingWeights = np.zeros((vocabSize + 1, embeddingDim))
    for word, index in vocab.items():
        if word in w2vModel:
            e = w2vModel[word]
        else:
            print word
            e = embeddingUnknown
        embeddingWeights[index, :] = e
    
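    #LSTM
    model = Sequential()
    #mask_zero is off: windows are padded with the 'retain-padding' index, and index 0 is a real character
    model.add(Embedding(output_dim = embeddingDim, input_dim = vocabSize + 1, 
        input_length = maxlen, mask_zero = False, weights = [embeddingWeights]))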
    model.add(LSTM(output_dim = hiddenDims, return_sequences = True))
    model.add(LSTM(output_dim = hiddenDims, return_sequences = False))
    model.add(Dropout(0.5))
    model.add(Dense(outputDims))
    model.add(Activation('softmax'))
    model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')
    
    result = model.fit(train_X, Y_train, batch_size = batchSize, 
                    nb_epoch = 20, validation_data = (test_X,Y_test), show_accuracy=True)
    
    j = model.to_json()
    fd = open(modelPath, 'w')
    fd.write(j)
    fd.close()
    
    model.save_weights(weightPath)

    return model

def loadModel(modelPath, weightPath):

    fd = open(modelPath, 'r')
    j = fd.read()
    fd.close()
    
    model = model_from_json(j)
    
    model.load_weights(weightPath)

    return model


# Infer the tag sequence for a pre-segmented sentence and return "word/tag" pairs
def posSent(sent, model, posInfo):
    (initProb, tranProb), (vocab, indexVocab) = posInfo
    #Nothing to predict for an empty line
    if not sent.strip():
        return ''
    vec = pos.sent2vec(sent, vocab, ctxWindows = 7)
    vec = np.array(vec)
    probs = model.predict_proba(vec)
    #classes = model.predict_classes(vec)

    prob, path = viterbi.viterbi(vec, pos.corpus_tags, initProb, tranProb, probs.transpose())

    ss = ''
    words = sent.split()
    index = -1
    for word in words:
        #Advance to the last character of the word and use its tag for the whole word
        index += len(word)
        ss += word + '/' + pos.tags_863[pos.corpus_tags[path[index]][:-2]][1].decode('utf-8') + ' '
        #ss += word + '/' + pos.corpus_tags[path[index]][:-2] + ' '

    return ss[:-1]

def posFile(fname, dstname, model, posInfo):
    fd = codecs.open(fname, 'r', 'utf-8')
    lines = fd.readlines()
    fd.close()

    fd = open(dstname, 'w')
    for line in lines:
        rst = posSent(line.strip(), model, posInfo)
        fd.write(rst.encode('utf-8') + '\n')
    fd.close()

def test():
    print 'Loading vocab...'
    #(X, y), (initProb, tranProb), (vocab, indexVocab) = pos.load('data/pos.train')
    #posInfo = ((initProb, tranProb), (vocab, indexVocab))
    #posData = (X, y)
    #pos.savePosInfo('./model/pos.info', posInfo)
    #pos.savePosData('./model/pos.data', posData)
    posInfo = pos.loadPosInfo('./model/pos.info')
    posData = pos.loadPosData('./model/pos.data')
    print 'Done!'
    print 'Loading model...'
    #model = train(posInfo, posData, './model/pos.w2v.model', './model/pos.w2v.model.weights')
    model = loadModel('./model/pos.w2v.model', './model/pos.w2v.model.weights')
    #model = loadModel('./model/pos.model', './model/pos.model.weights')
    print 'Done!'
    print '-------------start predict----------------'
    s = u'為 寂寞 的 夜空 畫 上 一個 月亮'
    print posSent(s, model, posInfo)
    #posFile('~/work/corpus/icwb2/testing/msr_test.utf8', './msr_test.utf8.pos', model, posInfo)

if __name__ == '__main__':
    test()

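Sequence Labeling for Chinese NLP: NER

Preprocessing

Model

Sequence Labeling for Chinese NLP: DP

To be continued...
PS: Posting all of the code makes this rather long; I'll tidy it up when I find the time.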

Last edited: 2017.12.03 05:29:34


