10.7 文本情感分類：使用循環神經網絡 · 動手學深度學習--pytorch

# 10.7 文本情感分類：使用循環神經網絡文本分類是自然語言處理的一個常見任務，它把一段不定長的文本序列變換為文本的類別。本節關注它的一個子問題：使用文本情感分類來分析文本作者的情緒。這個問題也叫情感分析，并有著廣泛的應用。例如，我們可以分析用戶對產品的評論并統計用戶的滿意度，或者分析用戶對市場行情的情緒并用以預測接下來的行情。同搜索近義詞和類比詞一樣，文本分類也屬于詞嵌入的下游應用。在本節中，我們將應用預訓練的詞向量和含多個隱藏層的雙向循環神經網絡，來判斷一段不定長的文本序列中包含的是正面還是負面的情緒。在實驗開始前，導入所需的包或模塊。 ``` python import collections import os import random import tarfile import torch from torch import nn import torchtext.vocab as Vocab import torch.utils.data as Data import sys sys.path.append("..") import d2lzh_pytorch as d2l os.environ["CUDA_VISIBLE_DEVICES"] = "0" device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') DATA_ROOT = "/S1/CSCL/tangss/Datasets" ``` ## 10.7.1 文本情感分類數據我們使用斯坦福的IMDb數據集（Stanford's Large Movie Review Dataset）作為文本情感分類的數據集 [1]。這個數據集分為訓練和測試用的兩個數據集，分別包含25,000條從IMDb下載的關于電影的評論。在每個數據集中，標簽為“正面”和“負面”的評論數量相等。 ### 10.7.1.1 讀取數據首先[下載](http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz)這個數據集到`DATA_ROOT`路徑下，然后解壓。 ``` python fname = os.path.join(DATA_ROOT, "aclImdb_v1.tar.gz") if not os.path.exists(os.path.join(DATA_ROOT, "aclImdb")): print("從壓縮包解壓...") with tarfile.open(fname, 'r') as f: f.extractall(DATA_ROOT) ``` 接下來，讀取訓練數據集和測試數據集。每個樣本是一條評論及其對應的標簽：1表示“正面”，0表示“負面”。 ``` python from tqdm import tqdm # 本函數已保存在d2lzh_pytorch包中方便以后使用 def read_imdb(folder='train', data_root="/S1/CSCL/tangss/Datasets/aclImdb"): data = [] for label in ['pos', 'neg']: folder_name = os.path.join(data_root, folder, label) for file in tqdm(os.listdir(folder_name)): with open(os.path.join(folder_name, file), 'rb') as f: review = f.read().decode('utf-8').replace('\n', '').lower() data.append([review, 1 if label == 'pos' else 0]) random.shuffle(data) return data train_data, test_data = read_imdb('train'), read_imdb('test') ``` ### 10.7.1.2 預處理數據我們需要對每條評論做分詞，從而得到分好詞的評論。這里定義的`get_tokenized_imdb`函數使用最簡單的方法：基于空格進行分詞。 ``` python # 本函數已保存在d2lzh_pytorch包中方便以后使用 def get_tokenized_imdb(data): """ data: list of [string, label] """ def tokenizer(text): return [tok.lower() for tok in text.split(' ')] return [tokenizer(review) for review, _ in data] ``` 現在，我們可以根據分好詞的訓練數據集來創建詞典了。我們在這里過濾掉了出現次數少于5的詞。 ``` python # 本函數已保存在d2lzh_pytorch包中方便以后使用 def get_vocab_imdb(data): tokenized_data = get_tokenized_imdb(data) counter = collections.Counter([tk for st in tokenized_data for tk in st]) return Vocab.Vocab(counter, min_freq=5) vocab = get_vocab_imdb(train_data) '# words in vocab:', len(vocab) ``` 輸出： ``` ('# words in vocab:', 46151) ``` 因為每條評論長度不一致所以不能直接組合成小批量，我們定義`preprocess_imdb`函數對每條評論進行分詞，并通過詞典轉換成詞索引，然后通過截斷或者補0來將每條評論長度固定成500。 ``` python # 本函數已保存在d2lzh_torch包中方便以后使用 def preprocess_imdb(data, vocab): max_l = 500 # 將每條評論通過截斷或者補0，使得長度變成500 def pad(x): return x[:max_l] if len(x) > max_l else x + [0] * (max_l - len(x)) tokenized_data = get_tokenized_imdb(data) features = torch.tensor([pad([vocab.stoi[word] for word in words]) for words in tokenized_data]) labels = torch.tensor([score for _, score in data]) return features, labels ``` ### 10.7.1.3 創建數據迭代器現在，我們創建數據迭代器。每次迭代將返回一個小批量的數據。 ``` python batch_size = 64 train_set = Data.TensorDataset(*preprocess_imdb(train_data, vocab)) test_set = Data.TensorDataset(*preprocess_imdb(test_data, vocab)) train_iter = Data.DataLoader(train_set, batch_size, shuffle=True) test_iter = Data.DataLoader(test_set, batch_size) ``` 打印第一個小批量數據的形狀以及訓練集中小批量的個數。 ``` python for X, y in train_iter: print('X', X.shape, 'y', y.shape) break '#batches:', len(train_iter) ``` 輸出： ``` X torch.Size([64, 500]) y torch.Size([64]) ('#batches:', 391) ``` ## 10.7.2 使用循環神經網絡的模型在這個模型中，每個詞先通過嵌入層得到特征向量。然后，我們使用雙向循環神經網絡對特征序列進一步編碼得到序列信息。最后，我們將編碼的序列信息通過全連接層變換為輸出。具體來說，我們可以將雙向長短期記憶在最初時間步和最終時間步的隱藏狀態連結，作為特征序列的表征傳遞給輸出層分類。在下面實現的`BiRNN`類中，`Embedding`實例即嵌入層，`LSTM`實例即為序列編碼的隱藏層，`Linear`實例即生成分類結果的輸出層。 ``` python class BiRNN(nn.Module): def __init__(self, vocab, embed_size, num_hiddens, num_layers): super(BiRNN, self).__init__() self.embedding = nn.Embedding(len(vocab), embed_size) # bidirectional設為True即得到雙向循環神經網絡 self.encoder = nn.LSTM(input_size=embed_size, hidden_size=num_hiddens, num_layers=num_layers, bidirectional=True) # 初始時間步和最終時間步的隱藏狀態作為全連接層輸入 self.decoder = nn.Linear(4*num_hiddens, 2) def forward(self, inputs): # inputs的形狀是(批量大小, 詞數)，因為LSTM需要將序列長度(seq_len)作為第一維，所以將輸入轉置后 # 再提取詞特征，輸出形狀為(詞數, 批量大小, 詞向量維度) embeddings = self.embedding(inputs.permute(1, 0)) # rnn.LSTM只傳入輸入embeddings，因此只返回最后一層的隱藏層在各時間步的隱藏狀態。 # outputs形狀是(詞數, 批量大小, 2 * 隱藏單元個數) outputs, _ = self.encoder(embeddings) # output, (h, c) # 連結初始時間步和最終時間步的隱藏狀態作為全連接層輸入。它的形狀為 # (批量大小, 4 * 隱藏單元個數)。 encoding = torch.cat((outputs[0], outputs[-1]), -1) outs = self.decoder(encoding) return outs ``` 創建一個含兩個隱藏層的雙向循環神經網絡。 ``` python embed_size, num_hiddens, num_layers = 100, 100, 2 net = BiRNN(vocab, embed_size, num_hiddens, num_layers) ``` ### 10.7.2.1 加載預訓練的詞向量由于情感分類的訓練數據集并不是很大，為應對過擬合，我們將直接使用在更大規模語料上預訓練的詞向量作為每個詞的特征向量。這里，我們為詞典`vocab`中的每個詞加載100維的GloVe詞向量。 ``` python glove_vocab = Vocab.GloVe(name='6B', dim=100, cache=os.path.join(DATA_ROOT, "glove")) ``` 然后，我們將用這些詞向量作為評論中每個詞的特征向量。注意，預訓練詞向量的維度需要與創建的模型中的嵌入層輸出大小`embed_size`一致。此外，在訓練中我們不再更新這些詞向量。 ``` python # 本函數已保存在d2lzh_torch包中方便以后使用 def load_pretrained_embedding(words, pretrained_vocab): """從預訓練好的vocab中提取出words對應的詞向量""" embed = torch.zeros(len(words), pretrained_vocab.vectors[0].shape[0]) # 初始化為0 oov_count = 0 # out of vocabulary for i, word in enumerate(words): try: idx = pretrained_vocab.stoi[word] embed[i, :] = pretrained_vocab.vectors[idx] except KeyError: oov_count += 0 if oov_count > 0: print("There are %d oov words.") return embed net.embedding.weight.data.copy_( load_pretrained_embedding(vocab.itos, glove_vocab)) net.embedding.weight.requires_grad = False # 直接加載預訓練好的, 所以不需要更新它 ``` ### 10.7.2.2 訓練并評價模型這時候就可以開始訓練模型了。 ``` python lr, num_epochs = 0.01, 5 # 要過濾掉不計算梯度的embedding參數 optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr=lr) loss = nn.CrossEntropyLoss() d2l.train(train_iter, test_iter, net, loss, optimizer, device, num_epochs) ``` 輸出： ``` training on cuda epoch 1, loss 0.5759, train acc 0.666, test acc 0.832, time 250.8 sec epoch 2, loss 0.1785, train acc 0.842, test acc 0.852, time 253.3 sec epoch 3, loss 0.1042, train acc 0.866, test acc 0.856, time 253.7 sec epoch 4, loss 0.0682, train acc 0.888, test acc 0.868, time 254.2 sec epoch 5, loss 0.0483, train acc 0.901, test acc 0.862, time 251.4 sec ``` 最后，定義預測函數。 ``` python # 本函數已保存在d2lzh_pytorch包中方便以后使用 def predict_sentiment(net, vocab, sentence): """sentence是詞語的列表""" device = list(net.parameters())[0].device sentence = torch.tensor([vocab.stoi[word] for word in sentence], device=device) label = torch.argmax(net(sentence.view((1, -1))), dim=1) return 'positive' if label.item() == 1 else 'negative' ``` 下面使用訓練好的模型對兩個簡單句子的情感進行分類。 ``` python predict_sentiment(net, vocab, ['this', 'movie', 'is', 'so', 'great']) # positive ``` ``` python predict_sentiment(net, vocab, ['this', 'movie', 'is', 'so', 'bad']) # negative ``` ## 小結 * 文本分類把一段不定長的文本序列變換為文本的類別。它屬于詞嵌入的下游應用。 * 可以應用預訓練的詞向量和循環神經網絡對文本的情感進行分類。 ## 參考文獻 [1] Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011, June). Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1 (pp. 142-150). Association for Computational Linguistics. ----------- > 注：本節除代碼外與原書基本相同，[原書傳送門](https://zh.d2l.ai/chapter_natural-language-processing/sentiment-analysis-rnn.html)