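Foreword
- Attitude determines altitude! Make excellence a habit!
- There is nothing in this world that one round of overtime can't solve; if there is, do two! (- - - 茂強)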
word2vec
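word2vec is famous enough that I won't explain it again here. If it is unfamiliar, look it up on Baidu or Google. Below I go through several implementations.
Preparing the corpus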
python-gensim
A dead-simple approach: training can be done in a single line of code.
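from gensim.models import word2vec

# Written against gensim < 4.0. In gensim 4.x, `size` was renamed to `vector_size`
# and word-level queries go through `model.wv` (for example model.wv["學習"]).
sentences = word2vec.Text8Corpus("C:/traindataw2v.txt")  # load the corpus
# Train the model. The default is sg=0 (CBOW); pass sg=1 for skip-gram. Default window=5.
model = word2vec.Word2Vec(sentences, size=200)
# Get the word vector of "學習"
print("學習:", model["學習"])
# Similarity / relatedness of two words
y1 = model.similarity("不錯", "好")
# List of words related to a given word
y2 = model.most_similar("書", topn=20)  # the 20 most related words
# Analogy: "書" is to "不錯" as "質量" is to ?
print("書-不錯,質量-")
y3 = model.most_similar(['質量', '不錯'], ['書'], topn=3)
# Find the word that does not belong with the others
y4 = model.doesnt_match("書 書籍 教材 很".split())
# Save the model for reuse
model.save("db.model")
# Load it back
model = word2vec.Word2Vec.load("db.model")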
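To make the analogy query (`most_similar` with positive and negative words) less of a black box, here is a minimal pure-Python sketch of roughly what it computes: add the unit vectors of the positive words, subtract those of the negative words, then rank the remaining words by cosine similarity. The `analogy` helper and the 2-D `toy` vectors are hypothetical and hand-picked for illustration; this is not gensim's actual code.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity of two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for w in positive:
        query = [q + x for q, x in zip(query, unit(vectors[w]))]
    for w in negative:
        query = [q - x for q, x in zip(query, unit(vectors[w]))]
    exclude = set(positive) | set(negative)  # never return the query words themselves
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical 2-D vectors, hand-picked for illustration only.
toy = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.9, 0.1],
}
result = analogy(toy, positive=["king", "woman"], negative=["man"])
print(result[0][0])  # queen
```

With real embeddings the dimensions are not interpretable like this, but the ranking step is the same idea.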
That covers the gensim approach.
Now let's look at the parameters.
The defaults are as follows:
sentences=None
size=100
alpha=0.025
window=5
min_count=5
max_vocab_size=None
sample=1e-3
seed=1
workers=3
min_alpha=0.0001
sg=0
hs=0
negative=5
cbow_mean=1
hashfxn=hash
iter=5
null_word=0
trim_rule=None
sorted_vocab=1
batch_words=MAX_WORDS_IN_BATCH
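Surprised by how many parameters there are? Most of them are rarely touched, but the quality of a trained model depends heavily on them, and they are exposed for a reason. Let's go through what each one means.
sentences: the training corpus, one sentence per line, already split into words. Avoid overly long sentences. In short, it looks like the figure above.
sg: the training algorithm. 0 selects CBOW, 1 selects skip-gram.
size: the dimensionality of the trained word vectors.
window: the maximum distance between the current word and the predicted word within a sentence.
alpha: the learning rate, which controls the step size of gradient descent.
seed: seed for the random number generator; it affects how the word vectors are initialized.
min_count: vocabulary cutoff. Words that occur fewer than min_count times are discarded.
max_vocab_size: RAM limit while the vocabulary is being built. If the number of distinct words exceeds this value, the least frequent ones are pruned. None means no limit.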
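sample: threshold that configures random downsampling of high-frequency words (this is subsampling, not negative sampling). The default is 1e-3; the useful range is (0, 1e-5).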
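workers: number of worker threads used for training; the more cores the machine has, the faster training runs.
hs: if 1, hierarchical softmax is used. Hierarchical softmax is an optimization of the output layer: instead of computing probabilities with a full softmax as in the original model, it computes them along a Huffman tree. If 0 (the default) and negative is non-zero, negative sampling is used.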
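negative: if greater than 0, negative sampling is used, and the value is the number of "noise words" drawn. It is usually set between 5 and 20; the default is 5. Setting it to 0 disables negative sampling.
cbow_mean: only relevant for CBOW. If 0, the sum of the context word vectors is used; if 1 (the default), their mean is used.
hashfxn: hash function used to initialize the weights. The default is Python's built-in hash function.
iter: number of iterations over the corpus; the default is 5.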
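trim_rule: vocabulary trimming rule that decides which words are kept and which are dropped. It can be None (min_count is used) or a callable that accepts (word, count, min_count) and returns utils.RULE_DISCARD, utils.RULE_KEEP, or utils.RULE_DEFAULT. The rule is applied only while the vocabulary is being built and is not stored as part of the model.
sorted_vocab: if 1 (the default), words are sorted by descending frequency before word indexes are assigned.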
batch_words: target number of words in each batch handed to a worker thread; the default is 10000. Texts longer than this are truncated.
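The `sample` threshold is easier to reason about with numbers. Below is a small sketch of the subsampling rule from the original word2vec paper, where a word with relative frequency f is discarded with probability 1 - sqrt(sample / f). As far as I know, actual implementations use a closely related but not identical expression, so treat this as an illustration of the idea. The function name and the toy counts are my own assumptions.

```python
import math

def discard_probability(word_count, total_count, sample=1e-3):
    """Paper-style subsampling: P(discard) = 1 - sqrt(sample / f), clipped at 0."""
    f = word_count / total_count  # relative frequency of the word
    return max(0.0, 1.0 - math.sqrt(sample / f))

# Hypothetical counts in a corpus of 10,000 tokens.
total = 10000
counts = {"the": 5000, "book": 100, "embedding": 5}
for word, c in counts.items():
    print(word, round(discard_probability(c, total), 3))
```

Very frequent words are dropped most of the time, while any word whose relative frequency is at or below `sample` is never dropped.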
python-tensorflow
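The implementation on the official site uses the skip-gram approach.
Skip-gram predicts the context from a given input word, whereas CBOW predicts the input word from a given context.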
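The difference between the two is mostly a matter of how training examples are cut out of a sentence. The sketch below builds both kinds of examples from one tokenized sentence; the helper names and the sample sentence are hypothetical, and a real trainer would work on word indices rather than strings.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding tokens[i] itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def skipgram_pairs(tokens, window):
    """(input word, one context word): predict each context word from the center."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, window)]

def cbow_pairs(tokens, window):
    """(context words, center word): predict the center from its whole context."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

sentence = ["this", "book", "is", "good"]
sg = skipgram_pairs(sentence, 1)
cb = cbow_pairs(sentence, 1)
print(sg[:3])  # [('this', 'book'), ('book', 'this'), ('book', 'is')]
print(cb[1])   # (['this', 'is'], 'book')
```

Skip-gram produces one example per (center, context) pair, so it yields more examples per sentence than CBOW.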
First, the data is the same as above.
- Read the data
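import re

words = []
with open("c:/traindatav.txt", "r", encoding="utf-8") as f:
    for line in f:
        text = line.split(" => ")
        if len(text) == 2:
            label = text[0].strip()  # the part before " => " (not used below)
            # strip() removes the trailing newline so the last token is not corrupted;
            # keep tokens that start with a Chinese character and have at least 2 characters
            listsentence = [w for w in text[1].strip().split(" ")
                            if re.match("[\u4e00-\u9fa5]+", w) and len(w) >= 2]
            words.extend(listsentence)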
words holds the tokens, appended in the order in which they appear in the corpus.
- Build the vocabulary
import collections

vocabulary_size = 10000

def build_dataset(words):
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
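vocabulary_size sets how many words the dictionary holds; every other word is mapped to UNK.
Words are selected by frequency here. If you have a better idea, feel free to refine this, for example by trimming both ends: drop words that are too frequent as well as words that are too rare. I chose 10000 words for training.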
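The "trim both ends" idea can be sketched in a few lines. This is only one possible way to do it, not something the original code does: `select_vocabulary`, `drop_top`, and `min_count` are hypothetical names, and the toy word list is made up.

```python
from collections import Counter

def select_vocabulary(words, max_size, min_count=2, drop_top=0):
    """Keep up to max_size words, skipping the drop_top most frequent words
    and any word seen fewer than min_count times."""
    ranked = Counter(words).most_common()      # sorted by descending frequency
    trimmed = ranked[drop_top:]                # cut off the head
    kept = [w for w, c in trimmed if c >= min_count]  # cut off the tail
    return kept[:max_size]

# Hypothetical toy corpus: one dominant word, two useful words, one rare word.
toy_words = ["the"] * 50 + ["quality"] * 5 + ["good"] * 4 + ["rare"]
vocab = select_vocabulary(toy_words, max_size=10, min_count=2, drop_top=1)
print(vocab)  # ['quality', 'good']
```

The resulting list could replace the `most_common(vocabulary_size - 1)` selection, with everything outside it still mapped to UNK.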
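The contents of dictionary are shown in the figure below.
It assigns an index to every word.
The contents of data are shown in the figure below.
Each number is the index of the corresponding word in words, looked up in the dictionary above.
count holds the word frequency statistics, as shown in the figure below.
reverse_dictionary is the inverse of dictionary: it maps an index back to its word.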
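A toy run makes the four return values concrete. The function below is a condensed variant of build_dataset that takes vocabulary_size as an argument (my assumption, so the example stays self-contained); the seven-word corpus is made up.

```python
import collections

def build_dataset(words, vocabulary_size):
    """Condensed variant of the function above, parameterized for a toy run."""
    count = [["UNK", -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = {word: idx for idx, (word, _) in enumerate(count)}
    data = [dictionary.get(word, 0) for word in words]  # 0 is the UNK index
    count[0][1] = data.count(0)                         # number of UNK tokens
    reverse_dictionary = {idx: word for word, idx in dictionary.items()}
    return data, count, dictionary, reverse_dictionary

toy_words = ["book", "good", "book", "quality", "good", "book", "cheap"]
data, count, dictionary, reverse_dictionary = build_dataset(toy_words, vocabulary_size=3)

print(dictionary)          # {'UNK': 0, 'book': 1, 'good': 2}
print(data)                # [1, 2, 1, 0, 2, 1, 0]
print(count)               # [['UNK', 2], ('book', 3), ('good', 2)]
print(reverse_dictionary)  # {0: 'UNK', 1: 'book', 2: 'good'}
```

With a vocabulary of 3, only the two most frequent words get their own index; "quality" and "cheap" both collapse to index 0.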
- Declare the parameters
import numpy as np

batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1       # How many words to consider left and right.
num_skips = 2         # How many times to reuse an input to generate a label.

# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16     # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
num_sampled = 64    # Number of negative examples to sample.
- Build the batch generator for the skip-gram model
import collections
import random

import numpy as np

data_index = 0

def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels
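Here batch = np.ndarray(shape=(batch_size), dtype=np.int32) allocates a 1-D array with 128 entries, and labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32) allocates a 128x1 matrix. buffer holds the indices of the words in the current context window, taken from data. data_index is a global cursor into data (and therefore into words); each call continues where the previous one stopped, so the context window slides smoothly across the corpus from one iteration to the next.
skip_window is the number of words taken from one side (left or right) of the current input word. num_skips is the number of distinct words picked from the whole window as output words for that input word.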
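The interaction of skip_window and num_skips can be shown on a single window. The sketch below uses `random.sample` instead of the rejection loop in generate_batch, which is a swapped-in simplification of mine; `sample_targets` and the example window are hypothetical.

```python
import random

def sample_targets(window_tokens, skip_window, num_skips, rng):
    """Pick num_skips distinct context words from a window of 2*skip_window+1 tokens."""
    span = 2 * skip_window + 1
    assert len(window_tokens) == span
    assert num_skips <= 2 * skip_window  # cannot pick more words than the context has
    context_positions = [p for p in range(span) if p != skip_window]
    chosen = rng.sample(context_positions, num_skips)
    center = window_tokens[skip_window]
    return [(center, window_tokens[p]) for p in chosen]

rng = random.Random(0)  # fixed seed only so that repeated runs give the same pairs
window = ["the", "quality", "is", "very", "good"]  # skip_window = 2, center word "is"
pairs = sample_targets(window, skip_window=2, num_skips=2, rng=rng)
print(pairs)  # two (input, output) pairs, both with "is" as the input word
```

With skip_window=1 and num_skips=2, as in the parameters above, both neighbours of the center word are always used.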
- Build the computation graph
import math

import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.Session)

graph = tf.Graph()
with graph.as_default():
    # Input data.
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

    # Ops and variables pinned to the CPU because of missing GPU implementation
    with tf.device('/cpu:0'):
        # Look up embeddings for inputs.
        embeddings = tf.Variable(
            tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        embed = tf.nn.embedding_lookup(embeddings, train_inputs)

        # Construct the variables for the NCE loss
        nce_weights = tf.Variable(
            tf.truncated_normal([vocabulary_size, embedding_size],
                                stddev=1.0 / math.sqrt(embedding_size)))
        nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

    # Compute the average NCE loss for the batch.
    # tf.nn.nce_loss automatically draws a new sample of the negative labels each
    # time we evaluate the loss.
    loss = tf.reduce_mean(
        tf.nn.nce_loss(weights=nce_weights, biases=nce_biases, inputs=embed,
                       labels=train_labels, num_sampled=num_sampled,
                       num_classes=vocabulary_size))

    # Construct the SGD optimizer using a learning rate of 1.0.
    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

    # Compute the cosine similarity between minibatch examples and all embeddings.
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

    # Add variable initializer.
    init = tf.global_variables_initializer()
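First come the data placeholders, train_inputs [128] and train_labels [128x1], followed by valid_dataset, which holds a random sample of relatively frequent words; its main purpose is to let us compute the nearest neighbours of those words during and after training.
embeddings is the [10000x128] word vector matrix, and embed is the slice of it that corresponds to the words in the current batch. nce_weights is the weight matrix of the NCE loss, initialized with tf.truncated_normal(), which draws from a truncated normal distribution, and nce_biases is the bias term. loss is the loss function, that is, the objective. optimizer is stochastic gradient descent with a step size of 1.0, used to minimize loss. similarity computes, from embeddings, the similarity between the words in valid_dataset and every word in the vocabulary.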
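The similarity computation is just cosine similarity written as a matrix product: once every row has unit length, a dot product between two rows is their cosine. A pure-Python sketch of the same two steps, on a made-up 3-word, 2-dimensional embedding table (the helper names are mine):

```python
import math

def normalize_rows(matrix):
    """Divide each row by its L2 norm, mirroring `embeddings / norm`."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm for x in row])
    return out

def similarity_matrix(valid_rows, all_rows):
    """Dot products of unit rows, mirroring matmul(valid, all, transpose_b=True)."""
    return [[sum(a * b for a, b in zip(v, e)) for e in all_rows]
            for v in valid_rows]

# Hypothetical embedding table: rows 0 and 1 point the same way, row 2 is orthogonal.
embeddings = [[3.0, 4.0], [6.0, 8.0], [-4.0, 3.0]]
normalized = normalize_rows(embeddings)
valid = [normalized[0]]  # pretend word 0 is the only validation word
sim = similarity_matrix(valid, normalized)
print([round(s, 6) for s in sim[0]])
```

Row 0 comes out with similarity close to 1 against itself and against word 1 (same direction, different length) and close to 0 against word 2.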
The network looks roughly like the figure; understanding the principle is enough. The figure is borrowed, too.
- Run the training loop
num_steps = 100001

with tf.Session(graph=graph) as session:
    # We must initialize all variables before we use them.
    init.run()
    print("Initialized")

    average_loss = 0
    for step in range(num_steps):
        batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

        # We perform one update step by evaluating the optimizer op (including it
        # in the list of returned values for session.run())
        _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val

        if step % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print("Average loss at step ", step, ": ", average_loss)
            average_loss = 0

        # Note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8  # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log_str = "Nearest to %s:" % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log_str = "%s %s," % (log_str, close_word)
                print(log_str)
    final_embeddings = normalized_embeddings.eval()
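The training loop is simple. Each iteration calls generate_batch to produce batch_inputs and batch_labels, which are the data fed to the graph, and then runs one optimization step. Every 2000 steps it prints the average loss, and every 10000 steps it computes the top_k = 8 most similar words for each word in valid_dataset. At the end, final_embeddings holds the normalized word vectors.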
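The nearest-neighbour step sorts one row of the similarity matrix in descending order and then skips the first entry, because a word is always most similar to itself. A small pure-Python sketch of that selection; the similarity values and the five-word vocabulary are invented for illustration.

```python
def nearest_neighbors(sim_row, reverse_dictionary, top_k):
    """Words sorted by descending similarity, skipping position 0 of the ranking,
    which is the query word itself (similarity 1.0)."""
    order = sorted(range(len(sim_row)), key=lambda idx: -sim_row[idx])
    return [reverse_dictionary[idx] for idx in order[1:top_k + 1]]

reverse_dictionary = {0: "UNK", 1: "book", 2: "textbook", 3: "novel", 4: "very"}
sim_row = [0.10, 1.00, 0.85, 0.70, -0.20]  # hypothetical similarities to "book"
neighbors = nearest_neighbors(sim_row, reverse_dictionary, top_k=2)
print(neighbors)  # ['textbook', 'novel']
```

This mirrors the slice `[1:top_k + 1]` applied to the argsort result in the loop above.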
- Finally, visualization
def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
    assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
    plt.figure(figsize=(18, 18))  # in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i, :]
        plt.scatter(x, y)
        plt.annotate(label,
                     xy=(x, y),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.savefig(filename)

try:
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
    plot_only = 500
    low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
    labels = [reverse_dictionary[i] for i in range(plot_only)]
    plot_with_labels(low_dim_embs, labels)
except ImportError:
    print("Please install sklearn, matplotlib, and scipy to visualize embeddings.")
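The visualization uses t-SNE, which I won't go into here. If you want the details, see: 數據降維 (data dimensionality reduction). That's all for this part.
Spark implementation of word2vec
For Spark I'll go straight to the code. It is simple, and the official site has a detailed tutorial. I find Spark's API about as user-friendly as an API can get. Spark is also moving toward deep learning and real-time streaming, which I personally see as heading into the mainstream. I'm looking forward to an equally friendly deep learning API.
Enough talk, here is the code.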
import java.io.{File, PrintWriter}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

object WordToVec {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordToVec")
      .setMaster("local")
    val sc = new SparkContext(conf)
    val stopwords = Array("的","是","你","我","他","她","它","和","了","而","有","人","被","做","對","與") // stop words
    val input = sc.textFile("c:/traindataw2v.txt")
      .map(line => line.split(" "))
      .map(_.filter(_.matches("[\u4E00-\u9FA5]+")).toSeq) // keep only tokens made of Chinese characters
      .map(_.filter(!stopwords.contains(_)))              // drop stop words
      .map(_.filter(_.length >= 2))                       // keep tokens with at least 2 characters
    val word2vec = new Word2Vec()
      .setMinCount(2)       // a word must occur at least 2 times to enter the vocabulary
      .setWindowSize(5)     // context window size of 5
      .setVectorSize(50)    // 50-dimensional word vectors
      .setNumIterations(25) // 25 iterations
      .setNumPartitions(3)  // 3 data partitions
      .setSeed(12345)       // seed for the random number generator
    val model = word2vec.fit(input)
    // model.save(sc, "D:/word2vecTmal")
    // val model = Word2VecModel.load(sc,"D:/word2vecTmal")
    val word = model.getVectors.keySet
    // write every word and its vector as "word => v1 v2 ..."
    val writer = new PrintWriter(new File("c:/resultw2v.txt"))
    model.getVectors.foreach(kv => {
      writer.write(kv._1 + " => " + kv._2.mkString(" ") + "\n")
    })
    writer.close()
    val synonyms = model.findSynonyms("很好", 5) // the 5 words most similar to "很好"
    for ((synonym, cosineSimilarity) <- synonyms) {
      println(s"$synonym $cosineSimilarity")
    }
    sc.stop()
  }
}
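The export loop above writes one line per word in the form `word => v1 v2 ...`. If those vectors are needed outside Spark, a few lines of Python are enough to read them back. This is a sketch under assumptions: `parse_vector_line` and `load_vectors` are hypothetical helpers, and the sample values are invented.

```python
def parse_vector_line(line):
    """Split 'word => v1 v2 ...' into (word, [floats])."""
    word, _, values = line.rstrip("\n").partition(" => ")
    return word, [float(x) for x in values.split()]

def load_vectors(lines):
    """Build a word -> vector dict, ignoring lines without the ' => ' separator."""
    return dict(parse_vector_line(line) for line in lines if " => " in line)

# Hypothetical sample lines in the format written by the Spark job.
sample_lines = ["很好 => 0.12 -0.5 0.33\n", "不錯 => 0.1 -0.4 0.3\n"]
vectors = load_vectors(sample_lines)
print(vectors["很好"])  # [0.12, -0.5, 0.33]
```

For the real file, pass an open file object to `load_vectors`. The encoding to use depends on the platform default the JVM wrote with, so check that before assuming UTF-8.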
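Summary
My suggestion: to train word2vec on a single machine, use the first option (gensim); on a cluster, or when the data set is large, use distributed training with Spark. In my experience both give more reliable results than the official TensorFlow implementation, which is directly related to how that implementation works.
That's all from me. Try it yourself; my word is not final, and practice is the best teacher. I'll keep writing about related algorithms, so stay tuned. It's all substance, no filler.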