Making Predictions with word2vec · TensorFlow Machine Learning Cookbook, Second Edition (Chinese translation)

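# Making Predictions with word2vec

In this recipe, we will use the embedding strategy we learned previously to perform classification.

## Getting ready

Now that we have created and saved CBOW word embeddings, we want to use them to make sentiment predictions on the movie review dataset. In this recipe, we will learn how to load and use pre-trained embeddings, and we will use them to perform sentiment analysis by training a logistic linear model that predicts whether a review is good or bad.

Sentiment analysis is a very hard task, because human language makes it difficult to capture the subtleties and nuances of what is really meant. Sarcasm, jokes, and ambiguous references all make the task exponentially harder. We will create a simple logistic regression on the movie review dataset to see whether we can extract any information from the CBOW embeddings we created and saved in the previous recipe. Since the focus of this recipe is loading and using saved embeddings, we will not pursue more complicated models.

## How to do it...

We will proceed with the recipe as follows:

1. We start by loading the necessary libraries and starting a graph session:

    ```py
    import tensorflow as tf
    import matplotlib.pyplot as plt
    import numpy as np
    import random
    import os
    import pickle
    import string
    import requests
    import collections
    import io
    import tarfile
    import urllib.request
    import text_helpers
    from nltk.corpus import stopwords
    sess = tf.Session()
    ```

2. Now we declare the model parameters. The embedding size must be the same as the one we used to create the CBOW embeddings earlier:

    ```py
    embedding_size = 200
    vocabulary_size = 2000
    batch_size = 100
    max_words = 100
    stops = stopwords.words('english')
    ```

3. We load and transform the text data with the `text_helpers.py` file we created, then split it 80/20 into training and test sets:

    ```py
    texts, target = text_helpers.load_movie_data()

    # Normalize text
    print('Normalizing Text Data')
    texts = text_helpers.normalize_text(texts, stops)

    # Texts must contain at least 3 words
    target = [target[ix] for ix, x in enumerate(texts) if len(x.split()) > 2]
    texts = [x for x in texts if len(x.split()) > 2]

    # Random 80/20 train/test split
    train_indices = np.random.choice(len(target), round(0.8*len(target)), replace=False)
    test_indices = np.array(list(set(range(len(target))) - set(train_indices)))
    texts_train = [x for ix, x in enumerate(texts) if ix in train_indices]
    texts_test = [x for ix, x in enumerate(texts) if ix in test_indices]
    target_train = np.array([x for ix, x in enumerate(target) if ix in train_indices])
    target_test = np.array([x for ix, x in enumerate(target) if ix in test_indices])
    ```

4. We now load the word dictionary that we created when fitting the CBOW embeddings. It is important to load it so that we have exactly the same mapping from words to embedding indices. The snippet uses `data_folder_name`, which must point to the folder where the previous recipe saved its files:

    ```py
    # Assumption: 'temp' is the folder the previous recipe saved into;
    # change it to match your setup.
    data_folder_name = 'temp'
    dict_file = os.path.join(data_folder_name, 'movie_vocab.pkl')
    with open(dict_file, 'rb') as f:
        word_dictionary = pickle.load(f)
    ```

5. We can now use the word dictionary to convert the loaded sentence data into numeric `numpy` arrays:

    ```py
    text_data_train = np.array(text_helpers.text_to_numbers(texts_train, word_dictionary))
    text_data_test = np.array(text_helpers.text_to_numbers(texts_test, word_dictionary))
    ```

6. Since movie reviews have different lengths, we standardize them so that they are all the same length. In our case, we set this to 100 words. Reviews longer than 100 words are truncated, and reviews shorter than 100 words are padded with zeros:

    ```py
    # Append max_words zeros to every review, then keep the first max_words entries
    text_data_train = np.array([x[0:max_words] for x in [y+[0]*max_words for y in text_data_train]])
    text_data_test = np.array([x[0:max_words] for x in [y+[0]*max_words for y in text_data_test]])
    ```

7. Now we declare the model variables and placeholders for the logistic regression:

    ```py
    A = tf.Variable(tf.random_normal(shape=[embedding_size,1]))
    b = tf.Variable(tf.random_normal(shape=[1,1]))

    # Initialize placeholders
    x_data = tf.placeholder(shape=[None, max_words], dtype=tf.int32)
    y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
    ```

8. For TensorFlow to restore our pre-trained embeddings, we must first give the `Saver` method a variable to restore into, so we create an embedding variable with the same shape as the embeddings we will load:

    ```py
    embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    ```

9. Now we put the `embedding_lookup` function on the graph and take the average embedding of all the words in each review:

    ```py
    embed = tf.nn.embedding_lookup(embeddings, x_data)

    # Take average of all word embeddings in documents
    embed_avg = tf.reduce_mean(embed, 1)
    ```

10. Next, we declare the model operation and the loss function, remembering that the loss function already has the sigmoid operation built in:

    ```py
    model_output = tf.add(tf.matmul(embed_avg, A), b)

    # Declare loss function (Cross Entropy loss)
    loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=model_output, labels=y_target))
    ```

11. Now we add prediction and accuracy functions to the graph so that we can evaluate accuracy while the model trains:

    ```py
    prediction = tf.round(tf.sigmoid(model_output))
    predictions_correct = tf.cast(tf.equal(prediction, y_target), tf.float32)
    accuracy = tf.reduce_mean(predictions_correct)
    ```

12. We declare the optimization function and initialize the model variables:

    ```py
    my_opt = tf.train.AdagradOptimizer(0.005)
    train_step = my_opt.minimize(loss)
    init = tf.global_variables_initializer()
    sess.run(init)
    ```

13. Now that we have a randomly initialized embedding, we can tell the `Saver` method to load our earlier CBOW embeddings into the embedding variable. The restore must run after the initializer, otherwise the initializer would overwrite the loaded values:

    ```py
    model_checkpoint_path = os.path.join(data_folder_name,'cbow_movie_embeddings.ckpt')
    saver = tf.train.Saver({"embeddings": embeddings})
    saver.restore(sess, model_checkpoint_path)
    ```

14. Now we can start training. Note that we record the train and test loss and accuracy every 100 generations, and print the model status only every 500 generations:

    ```py
    train_loss = []
    test_loss = []
    train_acc = []
    test_acc = []
    i_data = []
    for i in range(10000):
        rand_index = np.random.choice(text_data_train.shape[0], size=batch_size)
        rand_x = text_data_train[rand_index]
        rand_y = np.transpose([target_train[rand_index]])
        sess.run(train_step, feed_dict={x_data: rand_x, y_target: rand_y})

        # Only record loss and accuracy every 100 generations
        if (i+1) % 100 == 0:
            i_data.append(i+1)
            train_loss_temp = sess.run(loss, feed_dict={x_data: rand_x, y_target: rand_y})
            train_loss.append(train_loss_temp)

            test_loss_temp = sess.run(loss, feed_dict={x_data: text_data_test, y_target: np.transpose([target_test])})
            test_loss.append(test_loss_temp)

            train_acc_temp = sess.run(accuracy, feed_dict={x_data: rand_x, y_target: rand_y})
            train_acc.append(train_acc_temp)

            test_acc_temp = sess.run(accuracy, feed_dict={x_data: text_data_test, y_target: np.transpose([target_test])})
            test_acc.append(test_acc_temp)

        # Print status every 500 generations
        if (i+1) % 500 == 0:
            acc_and_loss = [i+1, train_loss_temp, test_loss_temp, train_acc_temp, test_acc_temp]
            acc_and_loss = [np.round(x,2) for x in acc_and_loss]
            print('Generation # {}. Train Loss (Test Loss): {:.2f} ({:.2f}). Train Acc (Test Acc): {:.2f} ({:.2f})'.format(*acc_and_loss))
    ```

15. This results in the following output:

    ```
    Generation # 500. Train Loss (Test Loss): 0.70 (0.71). Train Acc (Test Acc): 0.52 (0.48)
    Generation # 1000. Train Loss (Test Loss): 0.69 (0.72). Train Acc (Test Acc): 0.56 (0.47)
    ...
    Generation # 9500. Train Loss (Test Loss): 0.69 (0.70). Train Acc (Test Acc): 0.57 (0.55)
    Generation # 10000. Train Loss (Test Loss): 0.70 (0.70). Train Acc (Test Acc): 0.59 (0.55)
    ```

16. Here is the code to plot the train and test loss and accuracy that we recorded every 100 generations:

    ```py
    # Plot loss over time
    plt.plot(i_data, train_loss, 'k-', label='Train Loss')
    plt.plot(i_data, test_loss, 'r--', label='Test Loss', linewidth=4)
    plt.title('Cross Entropy Loss per Generation')
    plt.xlabel('Generation')
    plt.ylabel('Cross Entropy Loss')
    plt.legend(loc='upper right')
    plt.show()

    # Plot train and test accuracy
    plt.plot(i_data, train_acc, 'k-', label='Train Set Accuracy')
    plt.plot(i_data, test_acc, 'r--', label='Test Set Accuracy', linewidth=4)
    plt.title('Train and Test Accuracy')
    plt.xlabel('Generation')
    plt.ylabel('Accuracy')
    plt.legend(loc='lower right')
    plt.show()
    ```

The plot of cross entropy loss per generation is as follows:

![](https://img.kancloud.cn/3d/38/3d38b35cb1e606240766976aa26cf4e7_406x281.png)

Figure 6: Here we observe the train and test loss over 10,000 generations

The train and test accuracy plot for the preceding code is as follows:

![](https://img.kancloud.cn/0b/6c/0b6ce21cb11a1df5bf85d14e4a98000d_406x281.png)

Figure 7: We can observe that the train and test set accuracy slowly improves over the 10,000 generations. It is worth noting that the model performs very poorly and is only slightly better than a random predictor.

## How it works...

We loaded our earlier CBOW embeddings and performed logistic regression on the average embedding of each review. The important technique to note here is how we load model variables from disk into variables that are already initialized in the current model. We also have to remember to store the vocabulary dictionary we created before training the embeddings, and to load it again here. When reusing embeddings, it is essential to have the same mapping from words to embedding indices.

## There's more...

We can see that we reach almost 60% accuracy on the training batches when predicting sentiment (about 55% on the test set). Knowing the meaning behind a word such as `great` is a hard task, for example, because it can be used in a negative or a positive context within a review. To tackle this problem, we would like to somehow create embeddings for the documents themselves and address the sentiment issue that way. Usually, a whole review is positive or a whole review is negative. We can use this to our advantage, and we will see how to do so in the following recipe, Using doc2vec for Sentiment Analysis.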
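
To make the padding and averaging steps of the recipe easier to inspect outside of a TensorFlow graph, here is a minimal, dependency-free sketch in plain Python. The function names `pad_or_truncate` and `average_embedding` and the toy embedding table are hypothetical and exist only for illustration; they are not part of `text_helpers.py` or TensorFlow. The sketch assumes, as in the recipe, that index 0 is used for padding and that padded positions are included in the average.

```python
def pad_or_truncate(indices, max_words, pad_index=0):
    """Return exactly max_words indices: cut long reviews, right-pad short ones."""
    # Same trick as the recipe: append max_words pads, then keep the first max_words.
    return (list(indices) + [pad_index] * max_words)[:max_words]


def average_embedding(indices, embeddings):
    """Average the embedding rows selected by indices (padding rows included)."""
    size = len(embeddings[0])
    totals = [0.0] * size
    for ix in indices:
        for d in range(size):
            totals[d] += embeddings[ix][d]
    return [t / len(indices) for t in totals]


# Toy embedding table: 4 words, 2 dimensions (made-up values).
toy_embeddings = [
    [0.0, 0.0],  # index 0: used here as the padding row
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]

short_review = pad_or_truncate([1, 2], max_words=4)
long_review = pad_or_truncate([3, 2, 1, 3, 2, 1], max_words=4)
short_vector = average_embedding(short_review, toy_embeddings)

print(short_review)   # [1, 2, 0, 0]
print(long_review)    # [3, 2, 1, 3]
print(short_vector)   # [1.0, 1.5]
```

Because the average runs over all `max_words` positions, a short review is pulled toward whatever vector sits at index 0. In this toy table that row is all zeros; in the recipe it is whichever row the restored checkpoint holds for index 0, which may be one reason the averaged features carry so little signal.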