
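# How to Prepare a Photo Caption Dataset for Training a Deep Learning Model

> Original: [https://machinelearningmastery.com/prepare-photo-caption-dataset-training-deep-learning-model/](https://machinelearningmastery.com/prepare-photo-caption-dataset-training-deep-learning-model/)

Automatic photo captioning is a problem where a model must generate a human-readable textual description given a photograph. It is a challenging problem in artificial intelligence that requires both image understanding from the field of computer vision and language generation from the field of natural language processing.

It is now possible to develop your own image caption models using deep learning and freely available datasets of photos and their descriptions.

In this tutorial, you will discover how to prepare photos and textual descriptions ready for developing a deep learning automatic photo caption generation model.

After completing this tutorial, you will know:

* About the Flickr8K dataset, comprised of more than 8,000 photos and up to 5 captions for each photo.
* How to generally load and prepare photo and text data for modeling with deep learning.
* How to specifically encode data for two different types of deep learning models in Keras.

Let's get started.

* **Update Nov/2017**: Fixed small typos in the code in the "_Whole Description Sequence Model_" section. Thanks Moustapha Cheikh and Matthew.
* **Update Feb/2019**: Provided direct links for the Flickr8k_Dataset dataset, as the official site was taken down.

![How to Prepare a Photo Caption Dataset for Training a Deep Learning Model](img/ed876dc6c1e515e527db6e72f03e47ab.jpg)

How to Prepare a Photo Caption Dataset for Training a Deep Learning Model. Photo by [beverlyislike](https://www.flickr.com/photos/beverlyislike/3307325815/), some rights reserved.

## Tutorial Overview

This tutorial is divided into 8 parts; they are:

1. Download the Flickr8K Dataset
2. How to Load Photos
3. Pre-Calculate Photo Features
4. How to Load Descriptions
5. Prepare Description Text
6. Whole Description Sequence Model
7. Word-By-Word Model
8. Progressive Loading

### Python Environment

This tutorial assumes you have a Python 3 SciPy environment installed. You can use Python 2, but you may need to change some of the examples.

You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help with your environment, see this post:

* [How to Setup a Python Environment for Machine Learning and Deep Learning with Anaconda](https://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/)

## Download the Flickr8K Dataset

The Flickr8K dataset is a good dataset to use when getting started with image captioning. It is realistic and relatively small, so you can download it and build models on your workstation using a CPU.

The definitive description of the dataset is in the 2013 paper "[Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics](https://www.jair.org/media/3994/live-3994-7274-jair.pdf)".

The authors describe the dataset as follows:

> We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.
>
> ...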
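>
> The images were chosen from six different Flickr groups, and tend not to contain any well-known people or locations, but were manually selected to depict a variety of scenes and situations.

- [Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics](https://www.jair.org/media/3994/live-3994-7274-jair.pdf), 2013.

The dataset is available for free. You must complete a request form, and the links to the dataset will be emailed to you. I would love to link to them for you, but the email expressly requests: "_Please do not redistribute the dataset_".

You can use the link below to request the dataset:

* [Dataset Request Form](https://illinois.edu/fb/sec/1713398)

Within a short time, you will receive an email that contains links to two files:

* **Flickr8k_Dataset.zip** (1 gigabyte): an archive of all photographs.
* **Flickr8k_text.zip** (2.2 megabytes): an archive of all text descriptions for the photographs.

**UPDATE (Feb/2019)**: The official site seems to have been taken down (although the form still works). Here are some direct download links from my [datasets GitHub repository](https://github.com/jbrownlee/Datasets):

* [Flickr8k_Dataset.zip](https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_Dataset.zip)
* [Flickr8k_text.zip](https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_text.zip)

Download the datasets and unzip them into your current working directory. You will have two directories:

* **Flicker8k_Dataset**: contains 8,092 photographs in JPEG format.
* **Flickr8k_text**: contains a number of files with different sources of descriptions for the photographs.

Next, let's look at how to load the images.

## How to Load Photos

In this section, we will develop some code to load the photos for use with the Keras deep learning library in Python.

The image file names are unique image identifiers. For example, here is a sample of image file names:

```py
990890291_afc72be141.jpg
99171998_7cc800ceef.jpg
99679241_adc853a5c0.jpg
997338199_7343367d7f.jpg
997722733_0cb5439472.jpg
```

Keras provides the _load_img()_ function, which loads an image file directly and returns it as a PIL image object.

```py
from keras.preprocessing.image import load_img
image = load_img('990890291_afc72be141.jpg')
```

The pixel data needs to be converted to a NumPy array for use in Keras. We can use the _img_to_array()_ Keras function to convert the loaded data.

```py
from keras.preprocessing.image import img_to_array
image = img_to_array(image)
```

We may want to use a pre-defined feature extraction model, such as a state-of-the-art deep image classification network trained on ImageNet. The Oxford Visual Geometry Group (VGG) model is popular for this purpose and is available in Keras.

If we decide to use this pre-trained model as a feature extractor in our model, we can preprocess the pixel data for the model by using the _preprocess_input()_ function in Keras, for example:

```py
from keras.applications.vgg16 import preprocess_input

# reshape data into a single sample of an image
image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
# prepare the image for the VGG model
image = preprocess_input(image)
```

We may also want to force the loaded photo to have the pixel dimensions expected by the VGG model, which are 224 x 224 pixels. We can do that in the call to _load_img()_, for example:

```py
image = load_img('990890291_afc72be141.jpg',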
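                 target_size=(224, 224))
```

We may want to extract the unique image identifier from the image filename. We can do that by splitting the filename string on the '.' (period) character and retrieving the first element of the resulting list:

```py
image_id = filename.split('.')[0]
```

We can tie all of this together and develop a function that, given the name of the directory containing the photos, will load and preprocess all of the photos for the VGG model and return them in a dictionary keyed on their unique image identifiers.

```py
from os import listdir
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input

def load_photos(directory):
    images = dict()
    for name in listdir(directory):
        # load an image from file
        filename = directory + '/' + name
        image = load_img(filename, target_size=(224, 224))
        # convert the image pixels to a numpy array
        image = img_to_array(image)
        # reshape data for the model
        image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
        # prepare the image for the VGG model
        image = preprocess_input(image)
        # get image id
        image_id = name.split('.')[0]
        images[image_id] = image
    return images

# load images
directory = 'Flicker8k_Dataset'
images = load_photos(directory)
print('Loaded Images: %d' % len(images))
```

Running this example prints the number of loaded images. It takes a few minutes to run.

```py
Loaded Images: 8091
```

If you do not have the RAM to hold all images (about 5 GB by my estimation), then you can add an if-statement inside the loop to break early after 100 images have been loaded, for example:

```py
if len(images) >= 100:
    break
```

## Pre-Calculate Photo Features

It is possible to use a pre-trained model to extract the features from photos in the dataset and store the features to file.

This is an efficiency: the language part of the model, which turns features extracted from the photo into textual descriptions, can be trained separately from the feature extraction model. The benefit is that the very large pre-trained model does not need to be loaded, held in memory, and used to process each photo while training the language model.

Later, the feature extraction model and language model can be put back together for making predictions on new photos.

In this section, we will extend the photo loading behavior developed in the previous section to load all photos, extract their features using a pre-trained VGG model, and store the extracted features to a new file that can be loaded and used to train the language model.

The first step is to load the VGG model. This model is provided directly in Keras and can be loaded as follows. Note that this will download about 500 megabytes of model weights to your computer, which may take a few minutes.

```py
from keras.applications.vgg16 import VGG16
from keras.layers import Input

# load the model
in_layer = Input(shape=(224, 224, 3))
model = VGG16(include_top=False, input_tensor=in_layer, pooling='avg')
model.summary()
```

This will load the VGG 16-layer model. Setting _include_top=False_ removes the two Dense output layers as well as the classification output layer from the model. The output from the final pooling layer is taken as the features extracted from the image.

Next, we can walk over all images in the directory of images as in the previous section and call the _predict()_ function on the model for each prepared image to get the extracted features. The features can then be stored in a dictionary keyed on the image id.

The complete example is listed below.

```py
from os import listdir
from pickle import dump
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.layers import Input

# extract features from each photo in the directory
def extract_features(directory):
    # load the model
    in_layer = Input(shape=(224, 224, 3))
    model = VGG16(include_top=False, input_tensor=in_layer)
    model.summary()
    # extract features from each photo
    features = dict()
    for name in listdir(directory):
        # load an image from file
        filename = directory + '/' + name
        image = load_img(filename, target_size=(224, 224))
        # convert the image pixels to a numpy array
        image = img_to_array(image)
        # reshape data for the model
        image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
        # prepare the image for the VGG model
        image = preprocess_input(image)
        # get features
        feature = model.predict(image, verbose=0)
        # get image id
        image_id = name.split('.')[0]
        # store feature
        features[image_id] = feature
        print('>%s' % name)
    return features

# extract features from all images
directory = 'Flicker8k_Dataset'
features = extract_features(directory)
print('Extracted Features: %d' % len(features))
# save to file
dump(features, open('features.pkl', 'wb'))
```

The example may take some time to complete, perhaps one hour. After all features are extracted, the dictionary is stored in the file "_features.pkl_" in the current working directory.

These features can then be loaded later and used as input for training a language model. You could experiment with other types of pre-trained models in Keras.

## How to Load Descriptions

It is important to take a moment to talk about the descriptions; there are a number available.

The file _Flickr8k.token.txt_ contains a list of image identifiers (used in the image filenames) and tokenized descriptions. Each image has multiple descriptions.

Below is a sample of the descriptions from the file showing 5 different descriptions for a single image.

```py
1305564994_00513f9a5b.jpg#0 A man in street racer armor be examine the tire of another racer 's motorbike .
1305564994_00513f9a5b.jpg#1 Two racer drive a white bike down a road .
1305564994_00513f9a5b.jpg#2 Two motorist be ride along on their vehicle that be oddly design and color .
1305564994_00513f9a5b.jpg#3 Two person be in a small race car drive by a green hill .
1305564994_00513f9a5b.jpg#4 Two person in race uniform in a street car .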
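```

The file _ExpertAnnotations.txt_ indicates which of the descriptions for each image were written by "_experts_" and which were written by crowdsource workers asked to describe the image.

Finally, the file _CrowdFlowerAnnotations.txt_ provides the frequency of crowd workers indicating whether a caption suits each image. These frequencies can be interpreted probabilistically.

The authors of the paper describe the annotations as follows:

> ... annotators were asked to write sentences that describe the depicted scenes, situations, events and entities (people, animals, other objects). We collected multiple captions for each image because there is a considerable degree of variance in the way many images can be described.

- [Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics](https://www.jair.org/media/3994/live-3994-7274-jair.pdf), 2013.

There are also lists of the photo identifiers to use in a train/test split so that you can compare your results with those reported in the paper.

The first step is to decide which captions to use. The simplest approach is to use the first description for each photograph.

First, we need a function to load the entire annotations file ('_Flickr8k.token.txt_') into memory. Below is a function to do this called _load_doc()_ that, given a filename, will return the document as a string.

```py
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
```

We can see from the sample of the file above that we need only split each line by white space and take the first element as the image identifier and the rest as the image description. For example:

```py
# split line by white space
tokens = line.split()
# take the first token as the image id, the rest as the description
image_id, image_desc = tokens[0], tokens[1:]
```

We can then clean up the image identifier by removing the file extension and the description number.

```py
# remove the file extension and description number from the image id
image_id = image_id.split('.')[0]
```

We can also put the description tokens back together into a string for later processing.

```py
# convert description tokens back to string
image_desc = ' '.join(image_desc)
```

We can put all of this together into a function. Below defines the _load_descriptions()_ function that will take the loaded file, process it line by line, and return a dictionary that maps image identifiers to their first description.

```py
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# extract descriptions for images
def load_descriptions(doc):
    mapping = dict()
    # process lines
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        if len(line) < 2:
            continue
        # take the first token as the image id, the rest as the description
        image_id, image_desc = tokens[0], tokens[1:]
        # remove the file extension and description number from the image id
        image_id = image_id.split('.')[0]
        # convert description tokens back to string
        image_desc = ' '.join(image_desc)
        # store the first description for each image
        if image_id not in mapping:
            mapping[image_id] = image_desc
    return mapping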
filename = 'Flickr8k_text/Flickr8k.token.txt'
doc = load_doc(filename)
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))
```

Running the example prints the number of loaded image descriptions.

```py
Loaded: 8092
```

There are other ways to load descriptions that may turn out to be more accurate for the data. Use the above example as a starting point and let me know what you come up with. Post your approach in the comments below.

## Prepare Description Text

The descriptions are tokenized; this means that each token is comprised of words separated by white space.

It also means that punctuation is separated into its own tokens, such as periods ('.') and the apostrophes of possessives ('s).

It is a good idea to clean up the description text before using it in a model. Some ideas for data cleaning that we can apply include:

* Normalizing the case of all tokens to lowercase.
* Removing all punctuation from tokens.
* Removing all tokens that contain one or fewer characters (after punctuation is removed), e.g. 'a' and hanging 's' characters.

We can implement these simple cleaning operations in a function that cleans each description in the loaded dictionary from the previous section. Below defines the _clean_descriptions()_ function that will clean each loaded description.

```py
import string

# clean description text
def clean_descriptions(descriptions):
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for key, desc in descriptions.items():
        # tokenize
        desc = desc.split()
        # convert to lower case
        desc = [word.lower() for word in desc]
        # remove punctuation from each token
        desc = [w.translate(table) for w in desc]
        # remove hanging 's' and 'a'
        desc = [word for word in desc if len(word)>1]
        # store as string
        descriptions[key] = ' '.join(desc)
```

We can then save the clean text to file for later use by our model. Each line will contain the image identifier followed by the clean description. Below defines the _save_doc()_ function for saving the cleaned descriptions to file.

```py
# save descriptions to file, one per line
def save_doc(descriptions, filename):
    lines = list()
    for key, desc in descriptions.items():
        lines.append(key + ' ' + desc)
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()
```

Putting this all together with the loading of descriptions from the previous section, the complete example is listed below.

```py
import string

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# extract descriptions for images
def load_descriptions(doc):
    mapping = dict()
    # process lines
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        if len(line) < 2:
            continue
        # take the first token as the image id, the rest as the description
        image_id, image_desc = tokens[0], tokens[1:]
        # remove the file extension and description number from the image id
        image_id = image_id.split('.')[0]
        # convert description tokens back to string
        image_desc = ' '.join(image_desc)
        # store the first description for each image
        if image_id not in mapping:
            mapping[image_id] = image_desc
    return mapping

# clean description text
def clean_descriptions(descriptions):
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for key, desc in descriptions.items():
        # tokenize
        desc = desc.split()
        # convert to lower case
        desc = [word.lower() for word in desc]
        # remove punctuation from each token
        desc = [w.translate(table) for w in desc]
        # remove hanging 's' and 'a'
        desc = [word for word in desc if len(word)>1]
        # store as string
        descriptions[key] = ' '.join(desc)

# save descriptions to file, one per line
def save_doc(descriptions, filename):
    lines = list()
    for key, desc in descriptions.items():
        lines.append(key + ' ' + desc)
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

filename = 'Flickr8k_text/Flickr8k.token.txt'
# load descriptions
doc = load_doc(filename)
# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))
# clean descriptions
clean_descriptions(descriptions)
# summarize vocabulary
all_tokens = ' '.join(descriptions.values()).split()
vocabulary = set(all_tokens)
print('Vocabulary Size: %d' % len(vocabulary))
# save descriptions
save_doc(descriptions, 'descriptions.txt')
```

Running the example first loads 8,092 descriptions, cleans them, summarizes the vocabulary of 4,484 unique words, then saves them to a new file called '_descriptions.txt_'.

```py
Loaded: 8092
Vocabulary Size: 4484
```

Open the new file '_descriptions.txt_' in a text editor and review the contents. You should see readable descriptions of photos ready for modeling.

```py
...
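3139118874_599b30b116 two girls pose for picture at christmastime
2065875490_a46b58c12b person is walking on sidewalk and skeleton is on the left inside of fence
2682382530_f9f8fd1e89 man in black shorts is stretching out his leg
3484019369_354e0b88c0 hockey team in red and white on the side of the ice rink
505955292_026f1489f2 boy rides horse
```

The vocabulary is still relatively large. To make modeling easier, especially the first time around, I would recommend further reducing the vocabulary by removing words that appear only once or twice across all descriptions.

## Whole Description Sequence Model

There are many ways to model the caption generation problem.

One naive way is to create a model that outputs the entire textual description in a one-shot manner.

This is a naive model because it puts a heavy burden on the model to interpret the meaning of the photograph, generate words, and then arrange those words into the correct order.

This is similar to the language translation problem addressed with an Encoder-Decoder recurrent neural network, where the entire translated sentence is output one word at a time given an encoding of the input sequence. Here, we would use an encoding of the image to generate the output sentence instead.

The image may be encoded using a pre-trained model used for image classification, such as the VGG model trained on ImageNet mentioned above.

The output of the model would be a probability distribution over each word in the vocabulary. The sequence would be as long as the longest photo description.

The descriptions would, therefore, need to be first integer encoded, where each word in the vocabulary is assigned a unique integer and sequences of words are replaced with sequences of integers. The integer sequences would then need to be one hot encoded to represent the idealized probability distribution over the vocabulary for each word in the sequence.

We can use tools in Keras to prepare the descriptions for this type of model.

The first step is to load the mapping of image identifiers to clean descriptions stored in '_descriptions.txt_'.

```py
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load clean descriptions into memory
def load_clean_descriptions(filename):
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # store
        descriptions[image_id] = ' '.join(image_desc)
    return descriptions

descriptions = load_clean_descriptions('descriptions.txt')
print('Loaded %d' % (len(descriptions)))
```

Running this piece loads the 8,092 photo descriptions into a dictionary keyed on image identifiers. These identifiers can then be used to load each photo file for the corresponding inputs to the model.

```py
Loaded 8092
```

Next, we need to extract all of the description text so that we can encode it.

```py
# extract all text
desc_text = list(descriptions.values())
```

We can use the Keras _Tokenizer_ class to consistently map each word in the vocabulary to an integer. First, the object is created, then it is fit on the description text. The fit tokenizer can later be saved to file for consistent decoding of the predictions back to vocabulary words.

```py
from keras.preprocessing.text import Tokenizer

# prepare tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(desc_text)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
```

Next, we can use the fit tokenizer to encode the photo descriptions into sequences of integers.

```py
# integer encode descriptions
sequences = tokenizer.texts_to_sequences(desc_text)
```

The model will require all output sequences to have the same length for training. We can achieve this by padding all encoded sequences to have the same length as the longest encoded sequence. We can pad the sequences with 0 values after the list of words. Keras provides the _pad_sequences()_ function to pad the sequences.

```py
from keras.preprocessing.sequence import pad_sequences

# pad all sequences to a fixed length
max_length = max(len(s) for s in sequences)
print('Description Length: %d' % max_length)
padded = pad_sequences(sequences, maxlen=max_length, padding='post')
```

Finally, we can one hot encode the padded sequences to have one sparse vector for each word in the sequence. Keras provides the _to_categorical()_ function to perform this operation.

```py
from keras.utils import to_categorical

# one hot encode
y = to_categorical(padded, num_classes=vocab_size)
```

Once encoded, we can ensure that the sequence output data has the right shape for the model.

```py
y = y.reshape((len(descriptions), max_length, vocab_size))
print(y.shape)
```

Putting all of this together, the complete example is listed below.

```py
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load clean descriptions into memory
def load_clean_descriptions(filename):
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # store
        descriptions[image_id] = ' '.join(image_desc)
    return descriptions

descriptions = load_clean_descriptions('descriptions.txt')
print('Loaded %d' % (len(descriptions)))
# extract all text
desc_text = list(descriptions.values())
# prepare tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(desc_text)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# integer encode descriptions
sequences = tokenizer.texts_to_sequences(desc_text)
# pad all sequences to a fixed length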
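max_length = max(len(s) for s in sequences)
print('Description Length: %d' % max_length)
padded = pad_sequences(sequences, maxlen=max_length, padding='post')
# one hot encode
y = to_categorical(padded, num_classes=vocab_size)
y = y.reshape((len(descriptions), max_length, vocab_size))
print(y.shape)
```

Running the example first prints the number of loaded image descriptions (8,092 photos), the dataset vocabulary size (4,485 words), the length of the longest description (28 words), then finally the shape of the data for fitting a prediction model, in the form _[samples, sequence length, features]_.

```py
Loaded 8092
Vocabulary Size: 4485
Description Length: 28
(8092, 28, 4485)
```

As mentioned, outputting the entire sequence may be challenging for the model. We will look at a simpler model in the next section.

## Word-By-Word Model

A simpler model for generating a caption for photographs is to generate one word given both the image as input and the last word generated.

This model would then have to be called recursively to generate each word in the description, with previous predictions as input.

Using the word as input gives the model a forced context for predicting the next word in the sequence.

This is the model used in prior research, such as:

* [Show and Tell: A Neural Image Caption Generator](https://arxiv.org/abs/1411.4555), 2015.

A word embedding layer can be used to represent the input words. Like the feature extraction model for the photos, this too can be pre-trained, either on a large corpus or on the dataset of all descriptions.

The model would take a full sequence of words as input; the length of the sequence would be the maximum length of descriptions in the dataset.

The model must be started with something. One approach is to surround each photo description with special tags that signal the start and end of the description, such as 'STARTDESC' and 'ENDDESC'.

For example, the description:

```py
boy rides horse
```

Would become:

```py
STARTDESC boy rides horse ENDDESC
```

And would be fed to the model with the same image input to result in the following input-output word sequence pairs:

```py
Input (X),                       Output (y)
STARTDESC,                       boy
STARTDESC, boy,                  rides
STARTDESC, boy, rides,           horse
STARTDESC, boy, rides, horse,    ENDDESC
```

The data preparation would begin much the same as was described in the previous section. Each description must be integer encoded. After encoding, the sequences are split into multiple input and output pairs, and only the output word (y) is one hot encoded. This is because the model is only required to predict the probability distribution of one word at a time.

The code is the same up to the point where we calculate the maximum length of sequences.

```py
...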
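descriptions = load_clean_descriptions('descriptions.txt')
print('Loaded %d' % (len(descriptions)))
# extract all text
desc_text = list(descriptions.values())
# prepare tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(desc_text)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# integer encode descriptions
sequences = tokenizer.texts_to_sequences(desc_text)
# determine the maximum sequence length
max_length = max(len(s) for s in sequences)
print('Description Length: %d' % max_length)
```

Next, we split each integer encoded sequence into input and output pairs.

Let's step through a single sequence called _seq_ at index _i_, where _i_ >= 1.

First, we take the first _i_ words (_seq[:i]_) as the input sequence and the word at index _i_ (_seq[i]_) as the output word.

```py
# split into input and output pair
in_seq, out_seq = seq[:i], seq[i]
```

Next, the input sequence is padded to the maximum length of the input sequences. Pre-padding is used (the default) so that new words appear at the end of the sequence, instead of at the beginning of the input.

```py
# pad input sequence
in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
```

The output word is one hot encoded, much like in the previous section.

```py
# encode output sequence
out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
```

We can put all of this together into a complete example to prepare description data for the word-by-word model.

```py
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load clean descriptions into memory
def load_clean_descriptions(filename):
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # store
        descriptions[image_id] = ' '.join(image_desc)
    return descriptions

descriptions = load_clean_descriptions('descriptions.txt')
print('Loaded %d' % (len(descriptions)))
# extract all text
desc_text = list(descriptions.values())
# prepare tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(desc_text)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# integer encode descriptions
sequences = tokenizer.texts_to_sequences(desc_text)
# determine the maximum sequence length
max_length = max(len(s) for s in sequences)
print('Description Length: %d' % max_length)

X, y = list(), list()
for img_no, seq in enumerate(sequences):
    # split one sequence into multiple X,y pairs
    for i in range(1, len(seq)):
        # split into input and output pair
        in_seq, out_seq = seq[:i], seq[i]
        # pad input sequence
        in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
        # encode output sequence
        out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
        # store
        X.append(in_seq)
        y.append(out_seq)

# convert to numpy arrays
X, y = array(X), array(y)
print(X.shape)
print(y.shape)
```

Running the example prints the same statistics, followed by the size of the resulting encoded input and output sequences.

Note that the image inputs must follow exactly the same ordering, with the same photo shown for each example drawn from a single description. One way to do this would be to load the photo and store a copy of it for each example prepared from a single description.

```py
Loaded 8092
Vocabulary Size: 4485
Description Length: 28
(66456, 28)
(66456, 4485)
```

## Progressive Loading

The Flickr8K dataset of photos and descriptions can fit into RAM if you have a lot of RAM (e.g. 8 gigabytes or more), and most modern systems do.

This is fine if you want to fit a deep learning model using the CPU.

Alternately, if you want to fit a model using a GPU, then you will not be able to fit the data into the memory of an average GPU video card.

One solution is to progressively load the photos and descriptions as needed by the model.

Keras supports progressively loaded datasets through the _fit_generator()_ function on the model. A generator is the term used to describe a function that returns batches of samples for the model to train on. This can be as simple as a standalone function, the name of which is passed to the _fit_generator()_ function when fitting the model.

As a reminder, a model is fit for multiple epochs, where one epoch is one pass through the entire training dataset, such as all photos. One epoch is comprised of multiple batches of examples, where the model weights are updated at the end of each batch.

A generator must create and yield one batch of examples. For example, the average sentence length in the dataset is 11 words; that means that each photo will result in 11 examples for fitting the model, and two photos will result in about 22 examples on average. A good default batch size for modern hardware may be 32 examples, so that is about 2-3 photos worth of examples.

We can write a custom generator to load a few photos and return the samples as a single batch.

Let's assume we are working with the word-by-word model described in the previous section, which expects a sequence of words and a prepared image as input and predicts a single word.

Let's design a data generator that, given a loaded dictionary of image identifiers to clean descriptions, a fit tokenizer, and a maximum sequence length, will load one image worth of examples for each batch.

A generator must loop forever and yield each batch of samples. If generators and yield are new concepts for you, consider reading this article:

* [Python Generators](https://wiki.python.org/moin/Generators)

We can loop forever with a while loop and, within it, loop over each image in the image directory. For each image filename, we can load the image and create all of the input-output sequence pairs from the image's description.

Below is the data generator function.

```py
def data_generator(mapping, tokenizer, max_length):
    # loop for ever over images
    directory = 'Flicker8k_Dataset'
    while 1:
        for name in listdir(directory):
            # load an image from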
            # file
            filename = directory + '/' + name
            image, image_id = load_photo(filename)
            # create word sequences
            desc = mapping[image_id]
            in_img, in_seq, out_word = create_sequences(tokenizer, max_length, desc, image)
            yield [[in_img, in_seq], out_word]
```

You could extend it to take the name of the dataset directory as a parameter.

The generator returns an array containing the inputs (X) and output (y) for the model. The input is comprised of an array with two items: the input images and the encoded word sequences. The outputs are one hot encoded words.

You can see that it calls a function named _load_photo()_ to load a single photo and return the pixels and image identifier. This is a simplified version of the photo loading function developed at the beginning of this tutorial.

```py
# load a single photo intended as input for the VGG feature extractor model
def load_photo(filename):
    image = load_img(filename, target_size=(224, 224))
    # convert the image pixels to a numpy array
    image = img_to_array(image)
    # reshape data for the model
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    # prepare the image for the VGG model
    image = preprocess_input(image)[0]
    # get image id
    image_id = filename.split('/')[-1].split('.')[0]
    return image, image_id
```

Another function named _create_sequences()_ is called to create sequences of images, input sequences of words, and output words that we then yield to the caller. This function includes everything discussed in the previous section, and also creates copies of the image pixels, one for each input-output pair created from the photo's description.

```py
# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, max_length, desc, image):
    Ximages, XSeq, y = list(), list(), list()
    vocab_size = len(tokenizer.word_index) + 1
    # integer encode the description
    seq = tokenizer.texts_to_sequences([desc])[0]
    # split one sequence into multiple X,y pairs
    for i in range(1, len(seq)):
        # select
        in_seq, out_seq = seq[:i], seq[i]
        # pad input sequence
        in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
        # encode output sequence
        out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
        # store
        Ximages.append(image)
        XSeq.append(in_seq)
        y.append(out_seq)
    Ximages, XSeq, y = array(Ximages), array(XSeq), array(y)
    return Ximages, XSeq, y
```

Prior to preparing the model that uses the data generator, we must load the clean descriptions, prepare the tokenizer, and calculate the maximum sequence length. All 3 must be passed to _data_generator()_ as parameters.

We use the same _load_clean_descriptions()_ function developed previously and a new _create_tokenizer()_
function that simplifies the creation of the tokenizer.

Tying all of this together, the complete data generator is listed below, ready for use to train a model.

```py
from os import listdir
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load clean descriptions into memory
def load_clean_descriptions(filename):
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # store
        descriptions[image_id] = ' '.join(image_desc)
    return descriptions

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
    lines = list(descriptions.values())
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# load a single photo intended as input for the VGG feature extractor model
def load_photo(filename):
    image = load_img(filename, target_size=(224, 224))
    # convert the image pixels to a numpy array
    image = img_to_array(image)
    # reshape data for the model
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    # prepare the image for the VGG model
    image = preprocess_input(image)[0]
    # get image id
    image_id = filename.split('/')[-1].split('.')[0]
    return image, image_id

# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, max_length, desc, image):
    Ximages, XSeq, y = list(), list(), list()
    vocab_size = len(tokenizer.word_index) + 1
    # integer encode the description
    seq = tokenizer.texts_to_sequences([desc])[0]
    # split one sequence into multiple X,y pairs
    for i in range(1, len(seq)):
        # select
        in_seq, out_seq = seq[:i], seq[i]
        # pad input sequence
        in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
        # encode output sequence
        out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
        # store
        Ximages.append(image)
        XSeq.append(in_seq)
        y.append(out_seq)
    Ximages, XSeq, y = array(Ximages), array(XSeq), array(y)
    return [Ximages, XSeq, y]

# data generator, intended to be used in a call to model.fit_generator()
def data_generator(descriptions, tokenizer, max_length):
    # loop for ever over images
    directory = 'Flicker8k_Dataset'
    while 1:
        for name in listdir(directory):
            # load an image from file
            filename = directory + '/' + name
            image, image_id = load_photo(filename)
            # create word sequences
            desc = descriptions[image_id]
            in_img, in_seq, out_word = create_sequences(tokenizer, max_length, desc, image)
            yield [[in_img, in_seq], out_word]

# load mapping of ids to descriptions
descriptions = load_clean_descriptions('descriptions.txt')
# integer encode sequences of words
tokenizer = create_tokenizer(descriptions)
# pad to fixed length
max_length = max(len(s.split()) for s in list(descriptions.values()))
print('Description Length: %d' % max_length)

# test the data generator
generator = data_generator(descriptions, tokenizer, max_length)
inputs, outputs = next(generator)
print(inputs[0].shape)
print(inputs[1].shape)
print(outputs.shape)
```

A data generator can be tested by calling the [next()](https://docs.python.org/3/library/functions.html#next) function, as the last lines of the listing do:

```py
# test the data generator
generator = data_generator(descriptions, tokenizer, max_length)
inputs, outputs = next(generator)
print(inputs[0].shape)
print(inputs[1].shape)
print(outputs.shape)
```

Running the example prints the shape of the input and output examples for a single batch (e.g. 13 input-output pairs):

```py
(13, 224, 224, 3)
(13, 28)
(13, 4485)
```

The generator can be used to fit the model by calling the _fit_generator()_ function on the model (instead of _fit()_) and passing in the generator.

We must also specify the number of steps or batches per epoch. We could estimate this as (10 x training dataset size), perhaps 70,000 if 7,000 images are used for training.

```py
# define model
# ...
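# fit model
model.fit_generator(data_generator(descriptions, tokenizer, max_length), steps_per_epoch=70000, ...)
```

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Flickr8K Dataset

* [Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics](http://nlp.cs.illinois.edu/HockenmaierGroup/Framing_Image_Description/KCCA.html) (Homepage)
* [Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics](https://www.jair.org/media/3994/live-3994-7274-jair.pdf) (PDF), 2013.
* [Dataset Request Form](https://illinois.edu/fb/sec/1713398)
* [Old Flickr8K Homepage](http://nlp.cs.illinois.edu/HockenmaierGroup/8k-pictures.html)

### API

* [Python Generators](https://wiki.python.org/moin/Generators)
* [Keras Model API](https://keras.io/models/model/)
* [Keras pad_sequences() API](https://keras.io/preprocessing/sequence/#pad_sequences)
* [Keras Tokenizer API](https://keras.io/preprocessing/text/#tokenizer)
* [Keras VGG16 API](https://keras.io/applications/#vgg16)

## Summary

In this tutorial, you discovered how to prepare photos and textual descriptions ready for developing an automatic photo caption generation model.

Specifically, you learned:

* About the Flickr8K dataset, comprised of more than 8,000 photos and up to 5 captions for each photo.
* How to generally load and prepare photo and text data for modeling with deep learning.
* How to specifically encode data for two different types of deep learning models in Keras.

Do you have any questions? Ask your questions in the comments below and I will do my best to answer.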
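
As a possible extension, the description-preparation section suggests shrinking the vocabulary by dropping words that appear only once or twice. A minimal sketch of that idea follows. The function name `reduce_vocabulary`, the `min_count` parameter, and the three-photo sample dictionary are illustrative assumptions and are not part of the tutorial's code; in practice the input would be the dictionary returned by _load_clean_descriptions()_.

```python
from collections import Counter

# hypothetical helper: drop words seen fewer than min_count times in total
def reduce_vocabulary(descriptions, min_count=3):
    # count every word across all descriptions
    counts = Counter(word for desc in descriptions.values() for word in desc.split())
    # keep only the words that meet the frequency threshold
    keep = {word for word, n in counts.items() if n >= min_count}
    reduced = dict()
    for key, desc in descriptions.items():
        # rebuild each description from the kept words, preserving order
        reduced[key] = ' '.join(w for w in desc.split() if w in keep)
    return reduced, keep

# made-up sample standing in for the cleaned Flickr8K descriptions
sample = {
    'img1': 'boy rides horse',
    'img2': 'boy rides bike',
    'img3': 'boy walks dog',
}
reduced, vocab = reduce_vocabulary(sample, min_count=2)
print(sorted(vocab))    # ['boy', 'rides']
print(reduced['img3'])  # boy
```

With a threshold of 3 on the real data, words seen once or twice would be removed. Some descriptions may become very short or empty after filtering, so it may be worth checking for those before saving the result or fitting the tokenizer on it.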