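A key step in training a chatbot is training word vectors. Whether the chatbot is generative or retrieval-based, it has to turn text into word vectors. The most popular word-vector training model right now is word2vec, so today your editor Wenwen will walk you through training word vectors on Wikipedia.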
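4. Traditional-to-Simplified Conversion
The previous post covered extracting the articles from the XML dump. The next step is to convert Traditional Chinese characters to Simplified Chinese, and we use the opencc tool for that. First, download opencc: https://bintray.com/package/files/byvoid/opencc/OpenCC
Once the download finishes, just unzip it, then run the conversion with this command:
opencc -i wiki.zh.text -o wiki.zh.jian.text -c t2s.json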
The result looks like this:
Before conversion (Traditional)
After conversion (Simplified)
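If you would rather drive the conversion from a script, here is a minimal sketch that wraps the same opencc command. The helper names `build_opencc_command` and `convert_if_available` are made up for illustration, and the sketch assumes the `opencc` executable is on your PATH; the flags are exactly the ones from the command in this step.

```python
import shutil
import subprocess


def build_opencc_command(src, dst, config="t2s.json"):
    """Build the argument list for the opencc call used in this step."""
    return ["opencc", "-i", src, "-o", dst, "-c", config]


def convert_if_available(src, dst, config="t2s.json"):
    """Run opencc if it is on PATH; return True if it ran, False otherwise."""
    # Assumption: the executable is named "opencc" and is reachable via PATH.
    if shutil.which("opencc") is None:
        return False
    # check=True raises CalledProcessError if opencc exits with an error.
    subprocess.run(build_opencc_command(src, dst, config), check=True)
    return True


if __name__ == "__main__":
    # Print the command instead of running it, so this is safe to try anywhere.
    print(" ".join(build_opencc_command("wiki.zh.text", "wiki.zh.jian.text")))
```

Calling `convert_if_available("wiki.zh.text", "wiki.zh.jian.text")` would then do the same thing as typing the command by hand.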
5. Word Segmentation
Use the jieba tokenizer to segment the articles. The code is as follows:
import codecs

import jieba

# Read the Simplified Chinese corpus (one article per line) and write the
# segmented text, with tokens separated by single spaces.
with codecs.open('wiki.zh.jian.text', 'r', encoding='utf8') as f, \
        codecs.open('wiki.zh.jian.seg.txt', 'w', encoding='utf8') as target:
    print('open files')
    for line_num, line in enumerate(f, start=1):
        print('---- processing', line_num, 'article----------------')
        # jieba.cut returns a generator of tokens; join them with spaces
        line_seg = ' '.join(jieba.cut(line.strip()))
        # write one segmented article per line
        target.write(line_seg + '\n')
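The segmented file is what the training step reads next: one article per line, tokens separated by spaces. Before starting a long training run it can be worth a quick sanity check that the file really has that shape. Below is a minimal sketch using only the standard library; `iter_token_lists`, `corpus_stats` and `demo` are names invented for this example, and the two sample sentences are hand-written stand-ins for wiki.zh.jian.seg.txt.

```python
import io
import os
import tempfile


def iter_token_lists(path):
    """Yield one list of tokens per non-empty line of a segmented file."""
    with io.open(path, "r", encoding="utf8") as f:
        for line in f:
            tokens = line.split()  # split on any whitespace
            if tokens:
                yield tokens


def corpus_stats(path):
    """Return (non-empty line count, total token count) for the file."""
    n_lines = 0
    n_tokens = 0
    for tokens in iter_token_lists(path):
        n_lines += 1
        n_tokens += len(tokens)
    return n_lines, n_tokens


def demo():
    """Run corpus_stats on a tiny hand-made sample written to a temp file."""
    sample = u"数学 是 研究 数量 的 学科\n\n哲学 是 一门 学科\n"
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with io.open(path, "w", encoding="utf8") as f:
            f.write(sample)
        return corpus_stats(path)
    finally:
        os.remove(path)
```

On the real corpus you would call `corpus_stats("wiki.zh.jian.seg.txt")` and check that the line count matches the number of articles processed.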
6. Training the Word Vectors
Now we can train the word vectors. The code is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging
import multiprocessing
import os
import sys

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 4:
        # the script has no module docstring, so print an explicit usage line
        print("usage: python %s <segmented corpus> <model output> <vector output>" % program)
        sys.exit(1)
    inp, outp1, outp2 = sys.argv[1:4]

    # note: gensim 4.0 and later renamed size to vector_size and iter to epochs
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count(), iter=100)

    # trim unneeded model memory = use(much) less RAM
    # model.init_sims(replace=True)
    model.save(outp1)  # full model, can be loaded again later
    model.wv.save_word2vec_format(outp2, binary=False)  # plain-text vectors
Start training with this command:
python train_word2vec_model.py wiki.zh.jian.seg.txt wiki.zh.text.model wiki.zh.text.vector
You should see training begin:
Model training in progress
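While the model trains, it helps to know what the vector file will look like. With binary=False, save_word2vec_format writes plain text: a header line with the vocabulary size and the vector dimension, then one line per word with the word followed by its values. Here is a minimal sketch of reading that format with only the standard library. The function names and the three tiny 2-dimensional vectors are invented for illustration; they are not output from the real wiki.zh.text.vector file.

```python
import io
import math
import os
import tempfile


def load_text_vectors(path):
    """Parse the text word2vec format into ({word: [floats]}, dimension)."""
    vectors = {}
    with io.open(path, "r", encoding="utf8") as f:
        # Header line: "<vocab_size> <dimension>"
        _vocab_size, dim = (int(x) for x in f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip lines that do not match the declared dimension
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def demo():
    """Load a made-up 3-word vector file and compare two pairs of words."""
    sample = u"3 2\n数学 1.0 0.0\n物理 1.0 0.0\n音乐 0.0 1.0\n"
    fd, path = tempfile.mkstemp(suffix=".vector")
    os.close(fd)
    try:
        with io.open(path, "w", encoding="utf8") as f:
            f.write(sample)
        vectors, dim = load_text_vectors(path)
    finally:
        os.remove(path)
    same = cosine(vectors[u"数学"], vectors[u"物理"])
    different = cosine(vectors[u"数学"], vectors[u"音乐"])
    return dim, same, different
```

For everyday use, gensim's own loading functions are the more convenient route; this sketch is only meant to show what is inside the file.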
That's all for today. In the next post, your editor will take you through the results of the word2vec training.