Implementing an RNN for Spam Prediction · TensorFlow Machine Learning Cookbook, Second Edition (Chinese edition)

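# Implementing an RNN for Spam Prediction

To start, we will apply a standard RNN cell to predict a single numerical output: the probability that a message is spam.

## Getting ready

In this recipe, we will implement a standard RNN in TensorFlow to predict whether a text message is spam or ham. We will use the SMS Spam Collection dataset from the UCI ML repository. The architecture we use for prediction feeds a sequence of embedded text into the RNN, and we take the last RNN output as the prediction of spam or ham (1 or 0).

## How to do it

1. We start by loading the libraries needed for this script:

```py
import os
import re
import io
import requests
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from zipfile import ZipFile
```

1. Next, we start a graph session and set the RNN model parameters. We will run through the data for `20` epochs with a batch size of `250`. The maximum length of each text we consider is `25` words; longer texts are cut to `25` words and shorter texts are zero-padded. The RNN will have `10` units. We only consider words that appear at least 10 times in the vocabulary, and each word will be embedded into a trainable vector of size `50`. The dropout keep probability will be a placeholder that we can set to `0.5` during training or `1.0` during evaluation:

```py
sess = tf.Session()
epochs = 20
batch_size = 250
max_sequence_length = 25
rnn_size = 10
embedding_size = 50
min_word_frequency = 10
learning_rate = 0.0005
dropout_keep_prob = tf.placeholder(tf.float32)
```

1. Now we get the SMS text data. First, we check whether it has already been downloaded and, if so, read it from the file. Otherwise, we download the data and save it:

```py
data_dir = 'temp'
data_file = 'text_data.txt'
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

if not os.path.isfile(os.path.join(data_dir, data_file)):
    zip_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
    r = requests.get(zip_url)
    z = ZipFile(io.BytesIO(r.content))
    file = z.read('SMSSpamCollection')
    # Format Data: drop non-ASCII characters and split into rows
    text_data = file.decode()
    text_data = text_data.encode('ascii',errors='ignore')
    text_data = text_data.decode().split('\n')

    # Save data to text file
    with open(os.path.join(data_dir, data_file), 'w') as file_conn:
        for text in text_data:
            file_conn.write("{}\n".format(text))
else:
    # Open data from text file
    text_data = []
    with open(os.path.join(data_dir, data_file), 'r') as file_conn:
        for row in file_conn:
            text_data.append(row)
    # Drop the trailing blank row written when the file was saved
    text_data = text_data[:-1]

# Each row is "label<TAB>message"; separate the labels from the texts
text_data = [x.split('\t') for x in text_data if len(x)>=1]
[text_data_target, text_data_train] = [list(x) for x in zip(*text_data)]
```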
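The last two lines of the loading step split each `label<TAB>message` row into a list of targets and a list of texts. A minimal sketch of that idiom on a few made-up rows follows; the sample messages and the names `rows`, `pairs`, `targets`, and `texts` are illustrative assumptions, not part of the dataset or the recipe:

```python
# Hypothetical sample rows in the same "label<TAB>message" layout as the data file
rows = [
    "ham\tSee you at lunch",
    "spam\tWIN a FREE prize now!!!",
    "",  # blank rows are skipped by the length check below
    "ham\tOk, call me later",
]

# Split each non-empty row on the tab character into [label, message]
pairs = [row.split("\t") for row in rows if len(row) >= 1]

# zip(*pairs) transposes the list of pairs into a tuple of labels
# and a tuple of messages; each is converted back to a list
targets, texts = [list(column) for column in zip(*pairs)]

print(targets)
print(texts)
```

Because the labels and messages are separated in a single transposition, the two lists stay index-aligned, which the later shuffling and splitting steps rely on.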
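1. To reduce the vocabulary size, we clean the input text by removing special characters and extra whitespace, and by converting everything to lowercase:

```py
def clean_text(text_string):
    # Remove runs of characters that are neither whitespace nor word
    # characters, as well as underscores and digits
    text_string = re.sub(r'([^\s\w]|_|[0-9])+', '', text_string)
    # Collapse repeated whitespace into single spaces
    text_string = " ".join(text_string.split())
    text_string = text_string.lower()
    return text_string

# Clean texts
text_data_train = [clean_text(x) for x in text_data_train]
```

> Note that our cleaning step removes special characters. As an alternative, we could replace them with spaces. Ideally, the choice depends on the format of the dataset.

1. Now we process the text with TensorFlow's built-in vocabulary processor function. This converts the text into an appropriate list of indices:

```py
vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(max_sequence_length,
                                                                     min_frequency=min_word_frequency)
text_processed = np.array(list(vocab_processor.fit_transform(text_data_train)))
```

> Note that the functions in `contrib.learn.preprocessing` are currently deprecated (as of the current TensorFlow version, 1.10). The currently suggested replacement, the TensorFlow preprocessing package, only runs in Python 2. Work to move TensorFlow preprocessing to Python 3 is in progress and will replace the preceding two lines. Remember that all current and up-to-date code can be found on this GitHub page: [https://www.github.com/nfmcclure/tensorflow_cookbook](https://www.github.com/nfmcclure/tensorflow_cookbook) and in the Packt repository: [https://github.com/PacktPublishing/TensorFlow-Machine-Learning-Cookbook-Second-Edition](https://github.com/PacktPublishing/TensorFlow-Machine-Learning-Cookbook-Second-Edition).

1. Next, we shuffle the data to randomize its order:

```py
text_processed = np.array(text_processed)
# Note: this encodes ham as 1 and spam as 0
text_data_target = np.array([1 if x=='ham' else 0 for x in text_data_target])
# Apply the same permutation to inputs and targets so they stay aligned
shuffled_ix = np.random.permutation(np.arange(len(text_data_target)))
x_shuffled = text_processed[shuffled_ix]
y_shuffled = text_data_target[shuffled_ix]
```

1. We also split the data into an 80-20 train-test dataset:

```py
ix_cutoff = int(len(y_shuffled)*0.80)
x_train, x_test = x_shuffled[:ix_cutoff], x_shuffled[ix_cutoff:]
y_train, y_test = y_shuffled[:ix_cutoff], y_shuffled[ix_cutoff:]
vocab_size = len(vocab_processor.vocabulary_)
print("Vocabulary Size: {:d}".format(vocab_size))
print("80-20 Train Test split: {:d} -- {:d}".format(len(y_train), len(y_test)))
```

> For this recipe, we will not do any hyperparameter tuning. If the reader goes in this direction, remember to split the dataset into train, test, and validation sets before proceeding. A good option for this is the Scikit-learn function `model_selection.train_test_split()`.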
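The note above recommends a separate validation set when tuning hyperparameters. A minimal sketch of a three-way split built on a single NumPy permutation follows. The 80/10/10 proportions, the `three_way_split` helper, and the toy arrays are assumptions made for illustration; calling Scikit-learn's `train_test_split()` twice would be an alternative.

```python
import numpy as np

def three_way_split(x, y, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle x and y together and cut them into train, validation and test parts."""
    rng = np.random.RandomState(seed)
    ix = rng.permutation(len(y))  # one permutation keeps x and y aligned
    train_end = int(len(y) * train_frac)
    val_end = train_end + int(len(y) * val_frac)
    train_ix, val_ix, test_ix = ix[:train_end], ix[train_end:val_end], ix[val_end:]
    return (x[train_ix], y[train_ix]), (x[val_ix], y[val_ix]), (x[test_ix], y[test_ix])

# Toy stand-ins for the processed texts and targets: row i is [2i, 2i+1], label is i % 2
x_all = np.arange(100).reshape(50, 2)
y_all = np.arange(50) % 2

(x_tr, y_tr), (x_val, y_val), (x_te, y_te) = three_way_split(x_all, y_all)
print(len(y_tr), len(y_val), len(y_te))
```

With this layout, hyperparameters would be chosen on the validation part, and the test part would be used only once for the final estimate.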
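1. Next, we declare the graph placeholders. The `x` input will be a placeholder of size `[None, max_sequence_length]`, that is, the batch size by the maximum number of words allowed per text message. The `y` output placeholder is just an integer, 0 or 1, indicating spam or ham:

```py
x_data = tf.placeholder(tf.int32, [None, max_sequence_length])
y_output = tf.placeholder(tf.int32, [None])
```

1. We now create the embedding matrix and the embedding lookup operation for the `x` input data:

```py
embedding_mat = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0))
embedding_output = tf.nn.embedding_lookup(embedding_mat, x_data)
```

1. We declare the model as follows. First, we initialize the type of RNN cell to use (an RNN of size 10). Then we create the RNN sequence by making it a dynamic RNN. Finally, we add dropout to the RNN output:

```py
cell = tf.nn.rnn_cell.BasicRNNCell(num_units = rnn_size)
output, state = tf.nn.dynamic_rnn(cell, embedding_output, dtype=tf.float32)
output = tf.nn.dropout(output, dropout_keep_prob)
```

> Note that the dynamic RNN allows for variable-length sequences. Even though we use a fixed sequence length in this example, it is usually preferable to use `dynamic_rnn` in TensorFlow for two main reasons. One is that, in practice, the dynamic RNN actually runs faster; the other is that, if we choose, we can run sequences of different lengths through the RNN.

1. Now, to get our predictions, we have to rearrange the RNN output and slice off the last output:

```py
# Reorder from [batch, time, rnn_size] to [time, batch, rnn_size]
output = tf.transpose(output, [1, 0, 2])
# Keep only the output of the final time step
last = tf.gather(output, int(output.get_shape()[0]) - 1)
```

1. To complete the RNN prediction, we convert from the `rnn_size` output to the two category outputs through a fully connected network layer:

```py
weight = tf.Variable(tf.truncated_normal([rnn_size, 2], stddev=0.1))
bias = tf.Variable(tf.constant(0.1, shape=[2]))
# Keep these as unscaled logits: the sparse softmax loss declared next
# applies the softmax internally
logits_out = tf.matmul(last, weight) + bias
```

1. We declare our loss function next. Remember that when using the `sparse_softmax` function in TensorFlow, the targets have to be integer indices (of type `int`) and the logits have to be floats:

```py
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits_out, labels=y_output)
loss = tf.reduce_mean(losses)
```

1. We also need an accuracy function so that we can compare the algorithm on the test and training sets:

```py
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits_out, 1),
                                           tf.cast(y_output, tf.int64)),
                                  tf.float32))
```

1. Next, we create the optimization function and initialize the model variables:

```py
optimizer = tf.train.RMSPropOptimizer(learning_rate)
train_step = optimizer.minimize(loss)
init = tf.global_variables_initializer()
sess.run(init)
```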
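The loss step above notes that the `sparse_softmax` loss takes float logits. That function applies the softmax itself, so it expects unscaled scores rather than probabilities. The following is a plain NumPy sketch, not TensorFlow code, of what changes when probabilities are passed in by mistake; the helper names and the example scores are assumptions for illustration:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sparse_softmax_cross_entropy(logits, labels):
    """Cross-entropy that applies softmax internally and takes integer class labels."""
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(labels)), labels])

# Two confident and correct predictions, expressed as unscaled scores
logits = np.array([[8.0, -8.0], [-6.0, 6.0]])
labels = np.array([0, 1])

# Intended use: unscaled logits in, loss close to zero
loss_from_logits = sparse_softmax_cross_entropy(logits, labels).mean()

# Mistaken use: probabilities in, so softmax is applied twice
loss_from_probs = sparse_softmax_cross_entropy(softmax(logits), labels).mean()

# With two classes, inputs confined to [0, 1] cannot push the loss below log(1 + e^-1)
lower_bound = np.log(1.0 + np.exp(-1.0))

print(loss_from_logits, loss_from_probs, lower_bound)
```

In this sketch the doubly normalized loss stays near 0.313 even for near-perfect predictions, because probabilities in `[0, 1]` can never differ by more than 1.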
1. Now we can start looping through the data and training the model. When looping through the data multiple times, it is good practice to shuffle the data every epoch to prevent over-training:

```py
train_loss = []
test_loss = []
train_accuracy = []
test_accuracy = []
# Start training
for epoch in range(epochs):
    # Shuffle training data
    shuffled_ix = np.random.permutation(np.arange(len(x_train)))
    x_train = x_train[shuffled_ix]
    y_train = y_train[shuffled_ix]
    # Round up so a final partial batch is included, without creating
    # an empty batch when the data divides evenly
    num_batches = int(np.ceil(len(x_train) / batch_size))
    for i in range(num_batches):
        # Select train data
        min_ix = i * batch_size
        max_ix = np.min([len(x_train), ((i+1) * batch_size)])
        x_train_batch = x_train[min_ix:max_ix]
        y_train_batch = y_train[min_ix:max_ix]

        # Run train step
        train_dict = {x_data: x_train_batch, y_output: y_train_batch,
                      dropout_keep_prob:0.5}
        sess.run(train_step, feed_dict=train_dict)

    # Run loss and accuracy for training (on the last training batch)
    temp_train_loss, temp_train_acc = sess.run([loss, accuracy], feed_dict=train_dict)
    train_loss.append(temp_train_loss)
    train_accuracy.append(temp_train_acc)

    # Run Eval Step (no dropout during evaluation)
    test_dict = {x_data: x_test, y_output: y_test, dropout_keep_prob:1.0}
    temp_test_loss, temp_test_acc = sess.run([loss, accuracy], feed_dict=test_dict)
    test_loss.append(temp_test_loss)
    test_accuracy.append(temp_test_acc)
    print('Epoch: {}, Test Loss: {:.2}, Test Acc: {:.2}'.format(epoch+1, temp_test_loss, temp_test_acc))
```

1. This results in output similar to the following:

```py
Vocabulary Size: 933
80-20 Train Test split: 4459 -- 1115
Epoch: 1, Test Loss: 0.59, Test Acc: 0.83
Epoch: 2, Test Loss: 0.58, Test Acc: 0.83
...
Epoch: 19, Test Loss: 0.46, Test Acc: 0.86
Epoch: 20, Test Loss: 0.46, Test Acc: 0.86
```
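The mini-batch slicing inside the training loop can be looked at in isolation. The sketch below uses the training-set size of 4459 from the output and the batch size of 250 from the parameters. The `batch_bounds` helper is a hypothetical name, and rounding up with `math.ceil` is a choice made for this sketch, since it avoids an empty final batch when the number of examples is an exact multiple of the batch size:

```python
import math

def batch_bounds(num_examples, batch_size):
    """Yield (start, end) index pairs that cover num_examples in order."""
    num_batches = math.ceil(num_examples / batch_size)
    for i in range(num_batches):
        start = i * batch_size
        # The last batch may be shorter than batch_size
        end = min(num_examples, (i + 1) * batch_size)
        yield start, end

bounds = list(batch_bounds(4459, 250))
print(len(bounds), bounds[0], bounds[-1])  # 18 (0, 250) (4250, 4459)

# An exact multiple produces no empty trailing batch
print(list(batch_bounds(500, 250)))  # [(0, 250), (250, 500)]
```

Each epoch therefore performs 18 training steps, the last of them on a partial batch of 209 messages.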
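1. The following is the code to plot the training/test loss and accuracy:

```py
epoch_seq = np.arange(1, epochs+1)
plt.plot(epoch_seq, train_loss, 'k--', label='Train Set')
plt.plot(epoch_seq, test_loss, 'r-', label='Test Set')
plt.title('Softmax Loss')
plt.xlabel('Epochs')
plt.ylabel('Softmax Loss')
plt.legend(loc='upper left')
plt.show()

# Plot accuracy over time
plt.plot(epoch_seq, train_accuracy, 'k--', label='Train Set')
plt.plot(epoch_seq, test_accuracy, 'r-', label='Test Set')
plt.title('Test Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='upper left')
plt.show()
```

## How it works

In this recipe, we created an RNN-to-category model to predict whether an SMS text is spam or ham. We achieved about 86% accuracy on the test set. The following are the accuracy and loss plots for both the test and training sets:

![](https://img.kancloud.cn/86/cd/86cd092109029d059f87eea1c0492df8_800x302.png)

Figure 3: Accuracy (left) and loss (right) of the training and test sets

## There's more

It is highly recommended that you go through the training dataset multiple times for sequential data (this is also recommended for non-sequential data). Each pass through the data is called an epoch. It is also very common (and highly recommended) to shuffle the data before each epoch to minimize the effect of data order on training.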
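As a small illustration of shuffling before each epoch, the sketch below reorders inputs and targets with one shared permutation per epoch so that every example keeps its own label. The toy arrays, the fixed seed, and the variable names are assumptions for illustration only:

```python
import numpy as np

rng = np.random.RandomState(42)

# Toy data: input row i is [2i, 2i+1] and its target is i
x = np.arange(12).reshape(6, 2)
y = np.arange(6)

for epoch in range(3):
    # Draw a fresh ordering each epoch and apply it to both arrays
    ix = rng.permutation(len(x))
    x_epoch, y_epoch = x[ix], y[ix]
    # Every input row is still paired with its original target
    aligned = np.array_equal(x_epoch[:, 0] // 2, y_epoch)
    print(epoch + 1, y_epoch, aligned)
```

Shuffling the two arrays with separate permutations would break the pairing between messages and labels, which is why a single index array is reused for both.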