6. How to Read Data in Batches from TFRecords with TensorFlow Eager · ApacheCN Deep Learning Translation Collection

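# 6. How to Read Data in Batches from TFRecords with TensorFlow Eager

Hello everyone! This tutorial focuses on the input pipeline once again. The task is simple, but I remember getting bogged down in quite a few details when I first started reading data in batches, so I thought I would share my approach here. I really hope some of you find it useful.

Flowchart of the tutorial:

![](https://img.kancloud.cn/ea/ba/eaba0c630d246190ec166647c0bdcdf3_1056x288.png)

We will look at two cases:

+ Input data of variable sequence length: in this case, we pad each batch to its maximum sequence length.
+ Image data

In both cases the data is stored as TFRecords. See chapters 4 and 5 of this tutorial to learn how to convert raw data to TFRecords.

So, let's get straight to coding!

### Import useful libraries

```py
# Import the data visualization library
import matplotlib.pyplot as plt

# Render plots inline in the notebook
%matplotlib inline

# Import TensorFlow and TensorFlow Eager
import tensorflow as tf
import tensorflow.contrib.eager as tfe

# Enable eager mode. Once enabled, it cannot be undone! Run only once.
tfe.enable_eager_execution()
```

## Part 1: Reading data of variable sequence length

The first part of this tutorial shows how to read input data of varying lengths. In our example, we use dummy IMDB reviews from the Large Movie Database. As you can imagine, each review has a different number of words. So when we read a batch of data, we pad the sequences to the maximum sequence length in that batch.

To see how I obtained the sequences of word indexes, along with the labels and sequence lengths, please refer to chapter 4.

### Create a function to parse each TFRecord

```py
def parse_imdb_sequence(record):
    '''
    Script for parsing the IMDB TFRecords.

    Returns:
        token_indexes: sequence of token indexes present in the review.
        target: the target of the movie review.
        sequence_length: the length of the sequence.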
    '''
    context_features = {
        'sequence_length': tf.FixedLenFeature([], dtype=tf.int64),
        'target': tf.FixedLenFeature([], dtype=tf.int64),
    }
    sequence_features = {
        'token_indexes': tf.FixedLenSequenceFeature([], dtype=tf.int64),
    }
    context_parsed, sequence_parsed = tf.parse_single_sequence_example(
        record,
        context_features=context_features,
        sequence_features=sequence_features)

    return (sequence_parsed['token_indexes'], context_parsed['target'],
            context_parsed['sequence_length'])
```

### Create the dataset iterator

As you can see in the function above, after parsing each record we return the sequence of word indexes, the review label and the sequence length. In the `padded_batch` method, we pad only the first element of the record: the sequence of word indexes. The label and the sequence length of each example need no padding, since they are just single numbers. Therefore, `padded_shapes` will be:

+ `[None]` -> pad the sequence to the largest dimension, which is not known yet, hence `None`.
+ `[]` -> no padding for the label.
+ `[]` -> no padding for the sequence length.

```py
# Choose the batch size
batch_size = 2

# Create the dataset from the TFRecords
dataset = tf.data.TFRecordDataset('datasets/dummy_text/dummy.tfrecords')
dataset = dataset.map(parse_imdb_sequence).shuffle(buffer_size=10000)
dataset = dataset.padded_batch(batch_size, padded_shapes=([None],[],[]))
```

### Iterate over the data once

```py
# Note: the dataset is reshuffled on every pass, so the batches
# printed by the three loops below do not line up with each other.
for review, target, sequence_length in tfe.Iterator(dataset):
    print(target)
'''
tf.Tensor([0 1], shape=(2,), dtype=int64)
tf.Tensor([1 0], shape=(2,), dtype=int64)
tf.Tensor([0 1], shape=(2,), dtype=int64)
'''

for review, target, sequence_length in tfe.Iterator(dataset):
    print(review.shape)
'''
(2, 145)
(2, 139)
(2, 171)
'''

for review, target, sequence_length in tfe.Iterator(dataset):
    print(sequence_length)
'''
tf.Tensor([137 151], shape=(2,), dtype=int64)
tf.Tensor([139 171], shape=(2,), dtype=int64)
tf.Tensor([145 124], shape=(2,), dtype=int64)
'''
```

## Part 2: Reading images (and their labels) in batches

In the second part of this tutorial, we read images stored as TFRecords in batches and visualize them. The images are a small subsample of the FER2013 dataset.

### Create a function to parse each record and decode the image

```py
def parser(record):
    '''
    Function to parse a TFRecords example.

    Returns:
        img: decoded image.
        label: the corresponding label of the image.
    '''
    # Define the features you want to parse
    features = {'image': tf.FixedLenFeature((), tf.string),
                'label': tf.FixedLenFeature((), tf.int64)}

    # Parse the example
    parsed = tf.parse_single_example(record, features)

    # Decode the image
    img = tf.image.decode_image(parsed['image'])

    return img, parsed['label']
```

### Create the dataset iterator

```py
# Choose the batch size
batch_size = 5

# Create the dataset from the TFRecords
dataset = tf.data.TFRecordDataset('datasets/dummy_images/dummy.tfrecords')
dataset = dataset.map(parser).shuffle(buffer_size=10000)
dataset = dataset.batch(batch_size)
```

### Iterate over the dataset once and display the images

```py
# Dictionary that stores the correspondence between integer labels and the emotions
emotion_cat = {0:'Angry', 1:'Disgust', 2:'Fear', 3:'Happy', 4:'Sad', 5:'Surprise', 6:'Neutral'}

# Iterate over the dataset once
for image, label in tfe.Iterator(dataset):
    # Create one row of subplots for each batch of images
    f, axarr = plt.subplots(1, int(image.shape[0]), figsize=(14, 6))
    # Plot the images
    for i in range(image.shape[0]):
        axarr[i].imshow(image[i,:,:,0], cmap='gray')
        axarr[i].set_title('Emotion: %s' %emotion_cat[label[i].numpy()])
```

![](https://img.kancloud.cn/31/ab/31abcbf839e73f032ebf9a05bf75172d_818x182.png)

If you would like me to add anything to this tutorial, please let us know. I will do my best to add it!
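
To make the padding step from Part 1 a bit more concrete, here is a minimal pure-Python sketch of what padding a batch to its longest sequence amounts to conceptually. The helper `pad_batch` and the toy token indexes are made up for illustration and are not part of TensorFlow. I am also assuming zero as the padding value, which is what `padded_batch` uses by default for integer tensors.

```python
def pad_batch(sequences, pad_value=0):
    """Pad variable-length sequences to the longest one in the batch.

    Hypothetical helper for illustration only; it mimics the effect of
    padding the first element of each record, not TensorFlow's internals.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    # The original lengths are kept so padded positions can be ignored later.
    lengths = [len(seq) for seq in sequences]
    return padded, lengths


# Two toy "reviews" of different lengths (made-up token indexes).
batch = [[4, 8, 15], [16, 23, 42, 7, 9]]
padded, lengths = pad_batch(batch)
print(padded)   # [[4, 8, 15, 0, 0], [16, 23, 42, 7, 9]]
print(lengths)  # [3, 5]
```

This is also why the parsed records carry the sequence length next to the token indexes: once a batch is padded, the length is the only way to tell real tokens from padding.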