
# 4. Text sequences to TFRecords

Hello everyone! In this tutorial, I will show you how to parse raw text data into TFRecords. I know that many people get stuck on the input processing pipeline, especially when starting their own personal projects, so I really hope it will be useful to some of you.

Flowchart of the tutorial:

![](https://img.kancloud.cn/53/34/5334fa341d36ab8fb52404865ea0f9d6_1056x288.png)

### Dummy IMDB text data

For practice, I selected a few data samples from the Large Movie Review Dataset provided by Stanford University.

### Import useful libraries

```py
from nltk.tokenize import word_tokenize
import tensorflow as tf
import pandas as pd
import pickle
import random
import glob
import nltk
import re

# Download the tokenizer models only if they are not already available
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
```

### Parse the data into TFRecords

```py
def imdb2tfrecords(path_data='datasets/dummy_text/', min_word_frequency=5,
                   max_words_review=700):
    '''
    Process the data and save it in TensorFlow's default file format: tfrecords.

    Args:
        path_data: the path where the imdb data is stored.
        min_word_frequency: the minimum frequency of a word, to keep it in the vocabulary.
        max_words_review: the maximum number of words allowed in a review.
    '''
    # Get the filenames of the positive/negative reviews
    pos_files = glob.glob(path_data + 'pos/*')
    neg_files = glob.glob(path_data + 'neg/*')

    # Concatenate the positive and negative filenames
    filenames = pos_files + neg_files

    # Read all the reviews in the dataset, closing each file after reading
    reviews = []
    for filename in filenames:
        with open(filename, 'r') as f:
            reviews.append(f.read())

    # Remove HTML tags
    reviews = [re.sub(r'<[^>]+>', ' ', review) for review in reviews]

    # Tokenize each review
    reviews = [word_tokenize(review) for review in reviews]

    # Compute the length of each review
    len_reviews = [len(review) for review in reviews]

    # Flatten the nested list
    reviews = [word for review in reviews for word in review]

    # Compute the frequency of each word
    # (pd.value_counts is deprecated in recent pandas versions)
    word_frequency = pd.Series(reviews).value_counts()

    # Keep only the words that appear at least min_word_frequency times
    vocabulary = word_frequency[word_frequency >= min_word_frequency].index.tolist()

    # Add the unknown and end tokens
    extra_tokens = ['Unknown_token', 'End_token']
    vocabulary += extra_tokens

    # Create the word2idx dictionary
    word2idx = {word: i for i, word in enumerate(vocabulary)}

    # Write the vocabulary to disk
    with open(path_data + 'word2idx.pkl', 'wb') as f:
        pickle.dump(word2idx, f)

    def text2tfrecords(filenames, writer, vocabulary, word2idx,
                       max_words_review):
        '''
        Parse each review into its parts and write it to disk as a tfrecord.

        Args:
            filenames: the paths of the review files.
            writer: the writer object for tfrecords.
            vocabulary: list with all the words included in the vocabulary.
            word2idx: dictionary of words and their corresponding indexes.
            max_words_review: the maximum number of words kept per review.
        '''
        # Shuffle the filenames
        random.shuffle(filenames)
        for filename in filenames:
            with open(filename, 'r') as f:
                review = f.read()
            review = re.sub(r'<[^>]+>', ' ', review)
            review = word_tokenize(review)

            # Truncate the review to its last max_words_review words
            review = review[-max_words_review:]

            # Replace each word with its index from word2idx
            review = [word2idx[word] if word in vocabulary else
                      word2idx['Unknown_token'] for word in review]
            indexed_review = review + [word2idx['End_token']]
            sequence_length = len(indexed_review)

            # The label comes from the parent directory name ('pos' or 'neg')
            target = 1 if filename.split('/')[-2] == 'pos' else 0

            # Create a SequenceExample to store our data in
            ex = tf.train.SequenceExample()

            # Add the non-sequential features to our example
            ex.context.feature['sequence_length'].int64_list.value.append(sequence_length)
            ex.context.feature['target'].int64_list.value.append(target)

            # Add the sequential feature
            token_indexes = ex.feature_lists.feature_list['token_indexes']
            for token_index in indexed_review:
                token_indexes.feature.add().int64_list.value.append(token_index)
            writer.write(ex.SerializeToString())

    ##########################################################################
    # Write data to tfrecords. This might take a while.
    ##########################################################################
    # In TensorFlow 2.x this class is available as tf.io.TFRecordWriter
    writer = tf.python_io.TFRecordWriter(path_data + 'dummy.tfrecords')
    text2tfrecords(filenames, writer, vocabulary, word2idx,
                   max_words_review)
    # Close the writer so that all records are flushed to disk
    writer.close()


imdb2tfrecords(path_data='datasets/dummy_text/')
```

### Parse the TFRecords into TF tensors

```py
def parse_imdb_sequence(record):
    '''
    Parse a single imdb tfrecord.

    Args:
        record: a serialized SequenceExample read from the tfrecords file.

    Returns:
        token_indexes: sequence of token indexes present in the review.
        target: the target of the movie review.
        sequence_length: the length of the sequence.
    '''
    # In TensorFlow 2.x these names are available under tf.io
    context_features = {
        'sequence_length': tf.FixedLenFeature([], dtype=tf.int64),
        'target': tf.FixedLenFeature([], dtype=tf.int64),
    }
    sequence_features = {
        'token_indexes': tf.FixedLenSequenceFeature([], dtype=tf.int64),
    }
    context_parsed, sequence_parsed = tf.parse_single_sequence_example(
        record,
        context_features=context_features,
        sequence_features=sequence_features)

    return (sequence_parsed['token_indexes'], context_parsed['target'],
            context_parsed['sequence_length'])
```

If you would like me to add anything to this tutorial, please let me know and I will be happy to improve it further.
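The vocabulary and indexing steps inside `imdb2tfrecords` can be tried out on their own, without TensorFlow, NLTK, or pandas. The following is only a minimal sketch under a few assumptions: plain whitespace splitting stands in for `word_tokenize`, `collections.Counter` stands in for the pandas frequency count, and the helper names `build_word2idx` and `encode_review` are hypothetical and not part of the tutorial's code.

```python
from collections import Counter


# Hypothetical helpers for illustration; they are not part of the tutorial's code.
def build_word2idx(tokenized_reviews, min_word_frequency=2):
    """Map every sufficiently frequent word to an integer index."""
    counts = Counter(word for review in tokenized_reviews for word in review)
    # Keep only the words that appear at least min_word_frequency times
    vocabulary = [word for word, count in counts.most_common()
                  if count >= min_word_frequency]
    # Same extra tokens as in the tutorial
    vocabulary += ['Unknown_token', 'End_token']
    return {word: idx for idx, word in enumerate(vocabulary)}


def encode_review(tokens, word2idx, max_words_review=700):
    """Keep the last max_words_review tokens, index them, append End_token."""
    tokens = tokens[-max_words_review:]
    unknown_idx = word2idx['Unknown_token']
    indexed = [word2idx.get(token, unknown_idx) for token in tokens]
    return indexed + [word2idx['End_token']]


# Assumption: whitespace splitting replaces word_tokenize in this sketch
reviews = [text.split() for text in ['a great great movie', 'a bad movie']]
word2idx = build_word2idx(reviews, min_word_frequency=2)
encoded = encode_review('a truly great movie'.split(), word2idx)
print(word2idx)
print(encoded)
```

In this toy run, `bad` falls below the frequency threshold and is left out of the vocabulary, the unseen word `truly` is mapped to the `Unknown_token` index, and the encoded sequence ends with the `End_token` index, which mirrors what `text2tfrecords` stores in the `token_indexes` feature.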