Skip-gram model with TensorFlow · Mastering TensorFlow 1.x

# Skip-gram model with TensorFlow

Now that we have prepared the training and validation data, let us create a skip-gram model in TensorFlow. We start by defining the hyperparameters:

```py
batch_size = 128
embedding_size = 128
skip_window = 2
n_negative_samples = 64
ptb.skip_window = 2
learning_rate = 1.0
```

* `batch_size` is the number of (target, context) word pairs fed to the algorithm in a single batch
* `embedding_size` is the dimension of the word vector, or embedding, of each word
* `ptb.skip_window` is the number of words to consider as the context of the target word in each direction
* `n_negative_samples` is the number of negative samples generated by the NCE loss function, explained further later in this chapter

Some tutorials, including one in the TensorFlow documentation, use an additional parameter, `num_skips`. In those tutorials, the authors pick `num_skips` (target, context) pairs for each target word. For example, if `skip_window` is 2, the total number of pairs is 4, and if `num_skips` is set to 2, only two of those pairs are randomly selected for training. We use all of the pairs instead, to keep the training exercise simple.

Define the input and output placeholders for the training data and a tensor for the validation data:

```py
inputs = tf.placeholder(dtype=tf.int32, shape=[batch_size])
outputs = tf.placeholder(dtype=tf.int32, shape=[batch_size, 1])
inputs_valid = tf.constant(x_valid, dtype=tf.int32)
```

Define an embedding matrix with as many rows as the vocabulary length and as many columns as the embedding dimension. Each row of this matrix represents the word vector of one word in the vocabulary. Populate the matrix with values sampled uniformly between -1.0 and 1.0.

```py
# define embeddings matrix with vocab_len rows and embedding_size columns
# each row is the vector representation (embedding) of a word
# in the vocabulary
embed_dist = tf.random_uniform(shape=[ptb.vocab_len, embedding_size],
                               minval=-1.0, maxval=1.0)
embed_matrix = tf.Variable(embed_dist, name='embed_matrix')
```

Using this matrix, define an embedding lookup table implemented with `tf.nn.embedding_lookup()`. `tf.nn.embedding_lookup()` takes two arguments: the embedding matrix and the inputs placeholder. The lookup function returns the word vectors for the words in the `inputs` placeholder.

```py
# define the embedding lookup table
# provides the embeddings of the word ids in the input tensor
embed_ltable = tf.nn.embedding_lookup(embed_matrix, inputs)
```

`embed_ltable` can also be interpreted as an embedding layer on top of the input layer. Next, feed the output of the embedding layer to a softmax or a noise-contrastive estimation (NCE) layer. NCE is based on a very simple idea: train a logistic-regression-based binary classifier to learn the parameters from a mixture of true and noisy data.

The TensorFlow documentation describes NCE in further detail: [https://www.tensorflow.org/tutorials/word2vec](https://www.tensorflow.org/tutorials/word2vec#scaling_up_with_noise-contrastive_training).

In summary, models based on the softmax loss are computationally expensive because the probability distribution is computed and normalized over the entire vocabulary. Models based on the NCE loss reduce this to a binary classification problem, that is, telling a true sample apart from noise samples.

The underlying mathematical details of NCE can be found in the following NIPS paper: _Learning word embeddings efficiently with noise-contrastive estimation_, by Andriy Mnih and Koray Kavukcuoglu. The paper is available at the following link: [http://papers.nips.cc/paper/5165-learning-word-embeddings-efficiently-with-noise-contrastive-estimation.pdf](http://papers.nips.cc/paper/5165-learning-word-embeddings-efficiently-with-noise-contrastive-estimation.pdf).

The `tf.nn.nce_loss()` function automatically draws negative samples when it evaluates the loss. Its `num_sampled` parameter specifies how many negative samples to draw, and we set it to `n_negative_samples`.

```py
# define noise-contrastive estimation (NCE) loss layer
nce_dist = tf.truncated_normal(shape=[ptb.vocab_len, embedding_size],
                               stddev=1.0 / tf.sqrt(embedding_size * 1.0))
nce_w = tf.Variable(nce_dist)
nce_b = tf.Variable(tf.zeros(shape=[ptb.vocab_len]))

loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_w,
                                     biases=nce_b,
                                     inputs=embed_ltable,
                                     labels=outputs,
                                     num_sampled=n_negative_samples,
                                     num_classes=ptb.vocab_len))
```

Next, compute the cosine similarity between the samples in the validation set and the embedding matrix:

1. To compute the similarity scores, first compute the L2 norm of each word vector in the embedding matrix and divide each vector by its norm.

    ```py
    # Compute the cosine similarity between validation set samples
    # and all embeddings.
    norm = tf.sqrt(tf.reduce_sum(tf.square(embed_matrix), 1, keep_dims=True))
    normalized_embeddings = embed_matrix / norm
    ```

2. Look up the embeddings, or word vectors, of the samples in the validation set:

    ```py
    embed_valid = tf.nn.embedding_lookup(normalized_embeddings, inputs_valid)
    ```

3. Compute the similarity scores by multiplying the embeddings of the validation set with the embedding matrix.

    ```py
    similarity = tf.matmul(embed_valid, normalized_embeddings, transpose_b=True)
    ```

This gives a tensor of shape (`valid_size`, `vocab_len`). Each row of the tensor holds the similarity scores between one validation word and every word in the vocabulary.

Next, define the SGD optimizer with a learning rate of 0.9 and train for 10 epochs.

```py
n_epochs = 10
learning_rate = 0.9
n_batches = ptb.n_batches_wv()
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
```

For each epoch:

1. Run the optimizer over the whole dataset, batch by batch.

    ```py
    epoch_loss = 0
    ptb.reset_index()
    for step in range(n_batches):
        x_batch, y_batch = ptb.next_batch_sg()
        # reshape the labels to [batch_size, 1], as the outputs placeholder expects
        y_batch = nputil.to2d(y_batch, unit_axis=1)
        feed_dict = {inputs: x_batch, outputs: y_batch}
        _, batch_loss = tfs.run([optimizer, loss], feed_dict=feed_dict)
        epoch_loss += batch_loss
    ```

2. Calculate and print the average loss for the epoch.

    ```py
    epoch_loss = epoch_loss / n_batches
    print('\nAverage loss after epoch ', epoch, ': ', epoch_loss)
    ```

3. At the end of the epoch, compute the similarity scores.

    ```py
    similarity_scores = tfs.run(similarity)
    ```

4. For each word in the validation set, print the five words with the highest similarity scores.

    ```py
    top_k = 5
    for i in range(valid_size):
        # position 0 of the sorted result is skipped: it is normally the word itself
        similar_words = (-similarity_scores[i, :]).argsort()[1:top_k + 1]
        similar_str = 'Similar to {0:}:'.format(ptb.id2word[x_valid[i]])
        for k in range(top_k):
            similar_str = '{0:} {1:},'.format(similar_str,
                                              ptb.id2word[similar_words[k]])
        print(similar_str)
    ```

Finally, after all the epochs are complete, compute the embedding vectors, which can be used further in the learning process:

```py
final_embeddings = tfs.run(normalized_embeddings)
```

The complete training code is as follows:

```py
n_epochs = 10
learning_rate = 0.9
n_batches = ptb.n_batches_wv()
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

with tf.Session() as tfs:
    tf.global_variables_initializer().run()
    for epoch in range(n_epochs):
        epoch_loss = 0
        ptb.reset_index()
        for step in range(n_batches):
            x_batch, y_batch = ptb.next_batch_sg()
            y_batch = nputil.to2d(y_batch, unit_axis=1)
            feed_dict = {inputs: x_batch, outputs: y_batch}
            _, batch_loss = tfs.run([optimizer, loss], feed_dict=feed_dict)
            epoch_loss += batch_loss
        epoch_loss = epoch_loss / n_batches
        print('\nAverage loss after epoch ', epoch, ': ', epoch_loss)

        # print closest words to validation set at end of every epoch
        similarity_scores = tfs.run(similarity)
        top_k = 5
        for i in range(valid_size):
            similar_words = (-similarity_scores[i, :]).argsort()[1:top_k + 1]
            similar_str = 'Similar to {0:}:'.format(ptb.id2word[x_valid[i]])
            for k in range(top_k):
                similar_str = '{0:} {1:},'.format(
                    similar_str, ptb.id2word[similar_words[k]])
            print(similar_str)

    final_embeddings = tfs.run(normalized_embeddings)
```

Here is the output we got after the 1st and the 10th epoch, respectively:

```py
Average loss after epoch 0 : 115.644006802
Similar to we: types, downturn, internal, by, introduce,
Similar to been: said, funds, mcgraw-hill, street, have,
Similar to also: will, she, next, computer, 's,
Similar to of: was, and, milk, dollars, $,
Similar to last: be, october, acknowledging, requested, computer,
Similar to u.s.: plant, increase, many, down, recent,
Similar to an: commerce, you, some, american, a,
Similar to trading: increased, describes, state, companies, in,

Average loss after epoch 9 : 5.56538496033
Similar to we: types, downturn, introduce, internal, claims,
Similar to been: exxon, said, problem, mcgraw-hill, street,
Similar to also: will, she, ssangyong, audit, screens,
Similar to of: seasonal, dollars, motor, none, deaths,
Similar to last: acknowledging, allow, incorporated, joint, requested,
Similar to u.s.: undersecretary, typically, maxwell, recent, increase,
Similar to an: banking, officials, imbalances, americans, manager,
Similar to trading: describes, increased, owners, committee, else,
```

Finally, we ran the model for 5,000 epochs and obtained the following result:

```py
Average loss after epoch 4999 : 2.74216903135
Similar to we: matter, noted, here, classified, orders,
Similar to been: good, precedent, medium-sized, gradual, useful,
Similar to also: introduce, england, index, able, then,
Similar to of: indicator, cleveland, theory, the, load,
Similar to last: dec., office, chrysler, march, receiving,
Similar to u.s.: label, fannie, pressures, squeezed, reflection,
Similar to an: knowing, outlawed, milestones, doubled, base,
Similar to trading: associates, downturn, money, portfolios, go,
```

Try running it further, up to 50,000 epochs, to get even better results.

Similarly, with the text8 dataset we got the following results after 50 epochs:

```py
Average loss after epoch 49 : 5.74381046423
Similar to four: five, three, six, seven, eight,
Similar to all: many, both, some, various, these,
Similar to between: with, through, thus, among, within,
Similar to a: another, the, any, each, tpvgames,
Similar to that: which, however, although, but, when,
Similar to zero: five, three, six, eight, four,
Similar to is: was, are, has, being, busan,
Similar to no: any, only, the, another, trinomial,
```
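
To make the `skip_window` and `num_skips` discussion from the start of this section concrete, here is a minimal pure-Python sketch of how (target, context) pairs could be generated. The function `skipgram_pairs` and the toy word ids are hypothetical and written only for illustration; this is not the batching code behind the `ptb` dataset object used in this chapter.

```python
import random


def skipgram_pairs(word_ids, skip_window, num_skips=None, seed=0):
    """Return (target, context) pairs for a sequence of word ids.

    With num_skips=None, every context word within skip_window positions
    of the target is used (the approach taken in this chapter). Otherwise
    at most num_skips context words are sampled at random per target.
    """
    rng = random.Random(seed)
    pairs = []
    for i, target in enumerate(word_ids):
        lo = max(0, i - skip_window)
        hi = min(len(word_ids), i + skip_window + 1)
        context = [word_ids[j] for j in range(lo, hi) if j != i]
        if num_skips is not None and num_skips < len(context):
            context = rng.sample(context, num_skips)
        pairs.extend((target, c) for c in context)
    return pairs


sentence = [10, 11, 12, 13, 14]  # hypothetical word ids
all_pairs = skipgram_pairs(sentence, skip_window=2)
sampled_pairs = skipgram_pairs(sentence, skip_window=2, num_skips=2)
print(len(all_pairs), len(sampled_pairs))  # prints "14 10"
```

The middle word (id 12) has a full window and contributes four pairs, while words near the edges contribute fewer, which is why the total is 14 rather than 20.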
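
The nearest-word lookup used during training (normalize the embeddings, take dot products, sort, and skip the first hit) can be tried outside TensorFlow. The sketch below uses only the standard library; `top_k_similar` and the toy two-dimensional embeddings are assumptions made for illustration, not part of the chapter's code.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k_similar(embeddings, word_id, top_k):
    """Return the ids of the top_k words most similar to word_id.

    For unit-length vectors, cosine similarity is just the dot product.
    Position 0 of the ranking is dropped on the assumption that it is
    the query word itself (its similarity to itself is 1.0).
    """
    unit = [normalize(v) for v in embeddings]
    query = unit[word_id]
    scores = [sum(a * b for a, b in zip(query, row)) for row in unit]
    ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
    return ranked[1:top_k + 1]


toy_embeddings = [
    [1.0, 0.0],   # id 0
    [0.9, 0.1],   # id 1, almost parallel to id 0
    [0.0, 1.0],   # id 2, orthogonal to id 0
    [-1.0, 0.0],  # id 3, opposite to id 0
]
print(top_k_similar(toy_embeddings, word_id=0, top_k=2))  # prints [1, 2]
```

Skipping position 0 is only safe when no other vector points in exactly the same direction as the query; with trained embeddings this is a reasonable assumption rather than a guarantee.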
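
The claim that NCE turns the softmax problem into binary classification can be illustrated with a small sketch for a single training example. It follows the general form described in the Mnih and Kavukcuoglu paper, where each score is shifted by the log of the expected noise count before the logistic function is applied. The function name, the uniform noise distribution, and the raw score values are assumptions for illustration; this is not how `tf.nn.nce_loss()` is implemented internally.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def nce_loss_single(true_score, noise_scores, noise_prob):
    """NCE loss for one (target, context) pair.

    true_score:   model score of the true context word
    noise_scores: model scores of the k sampled noise words
    noise_prob:   probability of each word under the noise distribution
                  (assumed uniform here, so one value for all words)
    """
    k = len(noise_scores)
    shift = math.log(k * noise_prob)
    # the true word should be classified as "data" ...
    loss = -log_sigmoid(true_score - shift)
    # ... and every sampled word as "noise": log(1 - sigmoid(x)) = log_sigmoid(-x)
    for s in noise_scores:
        loss -= log_sigmoid(-(s - shift))
    return loss


vocab_size = 10000             # hypothetical vocabulary size
noise_prob = 1.0 / vocab_size  # uniform noise distribution (assumption)
noise_scores = [-6.0, -7.0, -8.0]

good_fit = nce_loss_single(2.0, noise_scores, noise_prob)
poor_fit = nce_loss_single(-12.0, noise_scores, noise_prob)
print(good_fit < poor_fit)  # prints True
```

The cost per example depends on the number of noise samples rather than on the vocabulary size, which is the source of the savings over a full softmax.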