
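# How to Use Small Experiments to Develop a Caption Generation Model in Keras

> Original: [https://machinelearningmastery.com/develop-a-caption-generation-model-in-keras/](https://machinelearningmastery.com/develop-a-caption-generation-model-in-keras/)

Caption generation is a challenging artificial intelligence problem in which a textual description must be generated for a photograph. It requires both methods from computer vision to understand the content of the image and a language model from the field of natural language processing to turn that understanding into the right words. Recently, deep learning methods have achieved state-of-the-art results on examples of this problem.

Developing a caption generation model on your own data can be difficult, mainly because the datasets and models are so large that training takes days. An alternative is to explore model configurations on a small sample of a smaller dataset.

In this tutorial, you will discover how to use a small sample of a standard photo captioning dataset to explore different deep model designs.

After completing this tutorial, you will know:

* How to prepare data for photo captioning modeling.
* How to design a baseline and a test harness to evaluate the skill of models and control for their stochastic nature.
* How to evaluate properties such as model skill, the feature extraction model, and word embeddings in order to lift model skill.

Let's get started.

* **February 2, 2019**: Provided direct links for the Flickr8k_Dataset dataset, as the official site was taken down.

![How to Use Small Experiments to Develop a Caption Generation Model in Keras](img/7d34b218f89d903c2711e5c2dc7e3027.jpg)

How to Use Small Experiments to Develop a Caption Generation Model in Keras. Photo by [Per](https://www.flickr.com/photos/perry-pics/5968641588/), some rights reserved.

## Tutorial Overview

This tutorial is divided into 6 parts; they are:

1. Data Preparation
2. Baseline Caption Generation Model
3. Network Size Parameters
4. Configuring the Feature Extraction Model
5. Word Embedding Models
6. Analysis of Results

### Python Environment

This tutorial assumes you have a Python SciPy environment installed, ideally with Python 3. You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend. The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help with your environment, see this tutorial:

* [How to Setup a Python Environment for Machine Learning and Deep Learning with Anaconda](https://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/)

I recommend running the code on a system with a GPU. You can access GPUs cheaply on Amazon Web Services. Learn how in this tutorial:

* [How to Develop and Evaluate Large Deep Learning Models with Keras on Amazon Web Services](https://machinelearningmastery.com/develop-evaluate-large-deep-learning-models-keras-amazon-web-services/)

Let's dive in.

## Data Preparation

First, we need to prepare the dataset for training the model. We will use the Flickr8K dataset, which comprises over 8,000 photographs and their descriptions. You can download the dataset from here:

* [Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics](http://nlp.cs.illinois.edu/HockenmaierGroup/Framing_Image_Description/KCCA.html).

**UPDATE (April/2019)**: The official site seems to have been taken down (although the form still works). Here are some direct download links from my [datasets GitHub repository](https://github.com/jbrownlee/Datasets):

* [Flickr8k_Dataset.zip](https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_Dataset.zip)
* [Flickr8k_text.zip](https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_text.zip)

Unzip the photographs and descriptions into the _Flicker8k_Dataset_ and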
_Flickr8k_text_ directories, respectively, in your current working directory.

Data preparation has two parts:

1. Prepare Text
2. Prepare Photos

### Prepare Text

The dataset contains multiple descriptions for each photograph, and the description text requires some minimal cleaning. First, we will load the file that contains all of the descriptions.

```py
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

filename = 'Flickr8k_text/Flickr8k.token.txt'
# load descriptions
doc = load_doc(filename)
```

Each photo has a unique identifier. This identifier is used in the photo filename and in the text file of descriptions. Next, we will step through the list of photo descriptions and save the first description for each photo. Below is a function named _load_descriptions()_ that, given the loaded document text, returns a dictionary mapping photo identifiers to descriptions.

```py
# extract descriptions for images
def load_descriptions(doc):
    mapping = dict()
    # process lines
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # skip empty or malformed lines
        if len(tokens) < 2:
            continue
        # take the first token as the image id, the rest as the description
        image_id, image_desc = tokens[0], tokens[1:]
        # remove filename from image id
        image_id = image_id.split('.')[0]
        # convert description tokens back to string
        image_desc = ' '.join(image_desc)
        # store the first description for each image
        if image_id not in mapping:
            mapping[image_id] = image_desc
    return mapping

# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))
```

Next, we need to clean the description text. The descriptions are already tokenized and easy to work with. We will clean the text in the following ways to reduce the size of the vocabulary we need to work with:

* Convert all words to lowercase.
* Remove all punctuation.
* Remove all words that are one character or less in length (e.g. "a").

Below is the _clean_descriptions()_ function that, given the dictionary of image identifiers to descriptions, steps through each description and cleans the text.

```py
import string

def clean_descriptions(descriptions):
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for key, desc in descriptions.items():
        # tokenize
        desc = desc.split()
        # convert to lower case
        desc = [word.lower() for word in desc]
        # remove punctuation from each token
        desc = [w.translate(table) for w in desc]
        # remove hanging 's' and 'a'
        desc = [word for word in desc if len(word)>1]
        # store as string
        descriptions[key] = ' '.join(desc)

# clean descriptions
clean_descriptions(descriptions)
# summarize vocabulary
all_tokens = ' '.join(descriptions.values()).split()
vocabulary = set(all_tokens)
print('Vocabulary Size: %d' % len(vocabulary))
```

Finally, we save the dictionary of image identifiers and descriptions to a new file named _descriptions.txt_, with one image identifier and description per line. Below is the _save_doc()_ function that, given a dictionary mapping identifiers to descriptions and a filename, saves the mapping to file.

```py
# save descriptions to file, one per line
def save_doc(descriptions, filename):
    lines = list()
    for key, desc in descriptions.items():
        lines.append(key + ' ' + desc)
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

# save descriptions
save_doc(descriptions, 'descriptions.txt')
```

Putting this all together, the complete listing is provided below.

```py
import string

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# extract descriptions for images
def load_descriptions(doc):
    mapping = dict()
    # process lines
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # skip empty or malformed lines
        if len(tokens) < 2:
            continue
        # take the first token as the image id, the rest as the description
        image_id, image_desc = tokens[0], tokens[1:]
        # remove filename from image id
        image_id = image_id.split('.')[0]
        # convert description tokens back to string
        image_desc = ' '.join(image_desc)
        # store the first description for each image
        if image_id not in mapping:
            mapping[image_id] = image_desc
    return mapping

def clean_descriptions(descriptions):
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for key, desc in descriptions.items():
        # tokenize
        desc = desc.split()
        # convert to lower case
        desc = [word.lower() for word in desc]
        # remove punctuation from each token
        desc = [w.translate(table) for w in desc]
        # remove hanging 's' and 'a'
        desc = [word for word in desc if len(word)>1]
        # store as string
        descriptions[key] = ' '.join(desc)

# save descriptions to file, one per line
def save_doc(descriptions, filename):
    lines = list()
    for key, desc in descriptions.items():
        lines.append(key + ' ' + desc)
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

filename = 'Flickr8k_text/Flickr8k.token.txt'
# load descriptions
doc = load_doc(filename)
# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))
# clean descriptions
clean_descriptions(descriptions)
# summarize vocabulary
all_tokens = ' '.join(descriptions.values()).split()
vocabulary = set(all_tokens)
print('Vocabulary Size: %d' % len(vocabulary))
# save descriptions
save_doc(descriptions, 'descriptions.txt')
```

Running the example first prints the number of loaded photo descriptions (8,092) and the size of the clean vocabulary (4,484 words).

```py
Loaded: 8092
Vocabulary Size: 4484
```

The clean descriptions are then written to '_descriptions.txt_'. Taking a look at the file, we can see that the descriptions are ready for modeling.

```py
3621647714_fc67ab2617 man is standing on snow with trees and mountains all around him
365128300_6966058139 group of people are rafting on river rapids
2751694538_fffa3d307d man and boy sit in the driver seat
537628742_146f2c24f8 little girl running in field
2320125735_27fe729948 black and brown dog with blue collar goes on alert by soccer ball in the grass
...
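```

### Prepare Photos

We will use a pre-trained model to interpret the content of the photos. There are many models to choose from. In this case, we will use the Oxford Visual Geometry Group or VGG model, which achieved top results in the ImageNet competition in 2014. Learn more about the model here:

* [Very Deep Convolutional Networks for Large-Scale Visual Recognition](http://www.robots.ox.ac.uk/~vgg/research/very_deep/)

Keras provides this pre-trained model directly. Note that the first time you use this model, Keras will download the model weights from the Internet, which are about 500 megabytes. This may take a few minutes depending on your internet connection.

We could use this model as part of a broader image caption model. The problem is that it is a large model, and running each photo through the network every time we want to test a new language model configuration (downstream) is redundant.

Instead, we can pre-compute the "photo features" using the pre-trained model and save them to file. We can then load these features later and feed them into our model as the interpretation of a given photo in the dataset. This is no different from running the photo through the full VGG model; we have simply done it once, in advance.

This is an optimization that makes training our models faster and consumes less memory.

We can load the VGG model in Keras using the VGG16 class. We will load the model without the top; that is, without the layers at the end of the network that interpret the features extracted from the input and turn them into a class prediction. We are not interested in the ImageNet classification of the photos; we will train our own interpretation of the image features.

Keras also provides tools for reshaping the loaded photo into the preferred size for the model (e.g. a 3-channel 224 x 224 pixel image).

Below is a function named _extract_features()_ that, given a directory name, loads each photo, prepares it for VGG, and collects the predicted features from the VGG model. The image features are a 3-dimensional array with the shape (7, 7, 512). The function returns a dictionary mapping image identifiers to image features.

```py
# extract features from each photo in the directory
def extract_features(directory):
    # load the model
    in_layer = Input(shape=(224, 224, 3))
    model = VGG16(include_top=False, input_tensor=in_layer)
    model.summary()
    # extract features from each photo
    features = dict()
    for name in listdir(directory):
        # load an image from file
        filename = directory + '/' + name
        image = load_img(filename, target_size=(224, 224))
        # convert the image pixels to a numpy array
        image = img_to_array(image)
        # reshape data for the model
        image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
        # prepare the image for the VGG model
        image = preprocess_input(image)
        # get features
        feature = model.predict(image, verbose=0)
        # get image id
        image_id = name.split('.')[0]
        # store feature
        features[image_id] = feature
        print('>%s' % name)
    return features
```

We can call this function to prepare the photo data for testing our models, then save the resulting dictionary to a file named '_features.pkl_'.

The complete example is listed below.

```py
from os import listdir
from pickle import dump
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.layers import Input

# extract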
# features from each photo in the directory
def extract_features(directory):
    # load the model
    in_layer = Input(shape=(224, 224, 3))
    model = VGG16(include_top=False, input_tensor=in_layer)
    model.summary()
    # extract features from each photo
    features = dict()
    for name in listdir(directory):
        # load an image from file
        filename = directory + '/' + name
        image = load_img(filename, target_size=(224, 224))
        # convert the image pixels to a numpy array
        image = img_to_array(image)
        # reshape data for the model
        image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
        # prepare the image for the VGG model
        image = preprocess_input(image)
        # get features
        feature = model.predict(image, verbose=0)
        # get image id
        image_id = name.split('.')[0]
        # store feature
        features[image_id] = feature
        print('>%s' % name)
    return features

# extract features from all images
directory = 'Flicker8k_Dataset'
features = extract_features(directory)
print('Extracted Features: %d' % len(features))
# save to file
dump(features, open('features.pkl', 'wb'))
```

Running this data preparation step may take a while depending on your hardware, perhaps an hour on the CPU of a modern workstation.

At the end of the run, you will have the extracted features stored in '_features.pkl_' for later use.

## Baseline Caption Generation Model

In this section, we will define a baseline model for generating captions for photos, and how to evaluate it so that it can be compared to variations on this baseline.

This section is divided into 6 parts:

1. Load Data.
2. Fit Model.
3. Evaluate Model.
4. Complete Example
5. "A" versus "A" Test
6.
   Generate Photo Captions

### 1. Load Data

We are not going to fit the model on all of the caption data, or even on a large sample of it.

In this tutorial, we are interested in quickly testing a suite of different configurations of a caption model to see what works on this data. That means we need the evaluation of one model configuration to happen quickly. Toward this end, we will train the models on 100 photos and captions, then evaluate them on both the training dataset and a new test set of 100 photos and captions.

First, we need to load a pre-defined subset of photos. The provided dataset has separate sets for train, test, and development, which are really just different groups of photo identifiers. We will load the development set and use the first 100 identifiers for training and the second 100 identifiers (i.e. from 100 to 200) as the test set.

The function _load_set()_ below loads a pre-defined set of identifiers, and we will call it with the '_Flickr_8k.devImages.txt_' filename as an argument.

```py
# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)
```

Next, we need to split the set into train and test sets.

We will start by sorting the identifiers so that the split is always consistent across machines and runs, then take the first 100 for training and the next 100 for testing.

The _train_test_split()_ function below creates this split given the loaded set of identifiers as input.

```py
# split a dataset into train/test elements
def train_test_split(dataset):
    # order keys so the split is consistent
    ordered = sorted(dataset)
    # return split dataset as two new sets
    return set(ordered[:100]), set(ordered[100:200])
```

Now, we can load the photo descriptions using the pre-defined set of train or test identifiers.

Below is the function _load_clean_descriptions()_ that loads the cleaned text descriptions from '_descriptions.txt_' for a given set of identifiers and returns a dictionary mapping identifiers to text.

The model we will develop generates a caption for a given photo, and the caption is generated one word at a time. The sequence of previously generated words is provided as input. Therefore, we need a "_first word_" to kick off the generation process and a "_last word_" to signal the end of the caption. We will use the strings '_startseq_' and '_endseq_' for this purpose.

```py
# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # store
            descriptions[image_id] = 'startseq ' + ' '.join(image_desc) + ' endseq'
    return descriptions
```

Next, we can load the photo features for a given dataset.

Below is a function named _load_photo_features()_ that loads the entire set of photo features, then returns the subset of interest for a given set of photo identifiers. This is not very efficient, as the loaded dictionary of all photo features is about 700 megabytes. Nevertheless, it will get us up and running quickly.

Note that if you have a better approach, please share it in the comments below.

```py
# load photo features
def load_photo_features(filename, dataset):
    # load all features
    all_features = load(open(filename, 'rb'))
    # filter features
    features = {k: all_features[k] for k in dataset}
    return features
```

We can pause here and test everything developed so far.

The complete code example is listed below.

```py
from pickle import load

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)

# split a dataset into train/test elements
def train_test_split(dataset):
    # order keys so the split is consistent
    ordered = sorted(dataset)
    # return split dataset as two new sets
    return set(ordered[:100]), set(ordered[100:200])

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # store
            descriptions[image_id] = 'startseq ' + ' '.join(image_desc) + ' endseq'
    return descriptions

# load photo features
def load_photo_features(filename, dataset):
    # load all features
    all_features = load(open(filename, 'rb'))
    # filter features
    features = {k: all_features[k] for k in dataset}
    return features

# load dev set
filename = 'Flickr8k_text/Flickr_8k.devImages.txt'
dataset = load_set(filename)
print('Dataset: %d' % len(dataset))
# train-test split
train, test = train_test_split(dataset)
print('Train=%d, Test=%d' % (len(train), len(test)))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: train=%d, test=%d' % (len(train_descriptions), len(test_descriptions)))
# photo features
train_features = load_photo_features('features.pkl', train)
test_features = load_photo_features('features.pkl', test)
print('Photos: train=%d, test=%d' % (len(train_features), len(test_features)))
```

Running this example first loads the 1,000 photo identifiers in the development dataset. A train and test set is selected and used to filter the set of clean photo descriptions and prepared image features.

We are nearly there.

```py
Dataset: 1000
Train=100, Test=100
Descriptions: train=100, test=100
Photos: train=100, test=100
```

The description text needs to be encoded to numbers before it can be presented to the model as input or compared to the model's predictions.

The first step in encoding the data is to create a consistent mapping from words to unique integer values. Keras provides the Tokenizer class, which can learn this mapping from the loaded description data.

Below is _create_tokenizer()_, which fits a Tokenizer given the loaded photo description text.

```py
# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
    lines = list(descriptions.values())
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
```

We can now encode the text.

Each description will be split into words. The model is given one word and the photo, and generates the next word. Then the first two words of the description are given to the model as input with the photo to generate the next word, and so on. This is how the model will be trained.

For example, the input sequence "_little girl running in field_" would be split into 6 input-output pairs to train the model:

```py
X1,     X2 (text sequence),                              y (word)
photo   startseq,                                        little
photo   startseq, little,                                girl
photo   startseq, little, girl,                          running
photo   startseq, little, girl, running,                 in
photo   startseq, little, girl, running, in,             field
photo   startseq, little, girl, running, in, field,      endseq
```

Later, when the model is used to generate descriptions, the generated words are concatenated and recursively provided as input to generate a caption for an image.

The function below named _create_sequences()_, given the tokenizer, a single clean description, the features for a photo, and the maximum description length, prepares a set of input-output pairs for training the model. Calling this function returns _X1_ and _X2_, the arrays of image data and input sequence data, and the _y_ values for the output words.

The input sequences are integer encoded, and the output words are one hot encoded to represent the probability distribution of the expected word across the whole vocabulary of possible words.

```py
# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, desc, image, max_length):
    Ximages, XSeq, y = list(), list(),list()
    vocab_size = len(tokenizer.word_index) + 1
    # integer encode the description
    seq = tokenizer.texts_to_sequences([desc])[0]
    # split one
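    # sequence into multiple X,y pairs
    for i in range(1, len(seq)):
        # select
        in_seq, out_seq = seq[:i], seq[i]
        # pad input sequence
        in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
        # encode output sequence
        out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
        # store
        Ximages.append(image)
        XSeq.append(in_seq)
        y.append(out_seq)
    # Ximages, XSeq, y = array(Ximages), array(XSeq), array(y)
    return [Ximages, XSeq, y]
```

### 2. Fit Model

We are nearly ready to fit the model.

Parts of the model have already been discussed, but let's reiterate.

The model is based on the paper "[Show and Tell: A Neural Image Caption Generator](https://arxiv.org/abs/1411.4555)", 2015.

The model has three parts:

* **Photo Feature Extractor**. This is a 16-layer VGG model pre-trained on the ImageNet dataset. We have pre-processed the photos with the VGG model (without the top) and will use the extracted features predicted by this model as input.
* **Sequence Processor**. This is a word embedding layer for handling the text input, followed by an LSTM layer. The LSTM output is interpreted by a Dense layer one output at a time.
* **Interpreter (for lack of a better name)**. Both the feature extractor and the sequence processor output a fixed-length sequence of vectors, one vector per time step up to the maximum sequence length. These are concatenated together and processed by an LSTM and a Dense layer before a final prediction is made.

A conservative number of neurons is used in the base model. Specifically, a 128-neuron Dense layer follows the feature extractor; the sequence processor uses a 50-dimensional word embedding followed by a 256-unit LSTM and a 128-neuron Dense layer; and finally a 500-unit LSTM is followed by a 500-neuron Dense layer at the end of the network.

The model predicts a probability distribution across the vocabulary, so a softmax activation function is used and the categorical cross entropy loss function is minimized while fitting the network.

The function _define_model()_ defines the baseline model, given the size of the vocabulary and the maximum length of photo descriptions. The Keras functional API is used to define the model because it provides the flexibility needed to define a model that takes two input streams and combines them.

```py
# define the captioning model
def define_model(vocab_size, max_length):
    # feature extractor (encoder)
    inputs1 = Input(shape=(7, 7, 512))
    fe1 = GlobalMaxPooling2D()(inputs1)
    fe2 = Dense(128, activation='relu')(fe1)
    fe3 = RepeatVector(max_length)(fe2)
    # embedding
    inputs2 = Input(shape=(max_length,))
    emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
    emb3 = LSTM(256, return_sequences=True)(emb2)
    emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
    # merge inputs
    merged = concatenate([fe3, emb4])
    # language model (decoder)
    lm2 = LSTM(500)(merged)
    lm3 = Dense(500, activation='relu')(lm2)
    outputs = Dense(vocab_size, activation='softmax')(lm3)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    plot_model(model,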
               show_shapes=True, to_file='plot.png')
    return model
```

To get a sense of the structure of the model, specifically the shapes of the layers, see the summary listed below.

```py
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
input_1 (InputLayer)             (None, 7, 7, 512)     0
____________________________________________________________________________________________________
input_2 (InputLayer)             (None, 25)            0
____________________________________________________________________________________________________
global_max_pooling2d_1 (GlobalMa (None, 512)           0           input_1[0][0]
____________________________________________________________________________________________________
embedding_1 (Embedding)          (None, 25, 50)        18300       input_2[0][0]
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 128)           65664       global_max_pooling2d_1[0][0]
____________________________________________________________________________________________________
lstm_1 (LSTM)                    (None, 25, 256)       314368      embedding_1[0][0]
____________________________________________________________________________________________________
repeat_vector_1 (RepeatVector)   (None, 25, 128)       0           dense_1[0][0]
____________________________________________________________________________________________________
time_distributed_1 (TimeDistribu (None, 25, 128)       32896       lstm_1[0][0]
____________________________________________________________________________________________________
concatenate_1 (Concatenate)      (None, 25, 256)       0           repeat_vector_1[0][0]
                                                                   time_distributed_1[0][0]
____________________________________________________________________________________________________
lstm_2 (LSTM)                    (None, 500)           1514000     concatenate_1[0][0]
____________________________________________________________________________________________________
dense_3 (Dense)                  (None, 500)           250500      lstm_2[0][0]
____________________________________________________________________________________________________
dense_4 (Dense)                  (None, 366)           183366      dense_3[0][0]
====================================================================================================
Total params: 2,379,094
Trainable params: 2,379,094
Non-trainable params: 0
____________________________________________________________________________________________________
```

We also create a plot to visualize the structure of the network, which helps in understanding the two streams of input.

![Plot of the Baseline Captioning Deep Learning Model](img/eb076553435ca4d2a366c4b5e7d90a61.jpg)

Plot of the Baseline Captioning Deep Learning Model

We will train the model using a data generator. Strictly speaking, this is not needed, given that the captions and extracted photo features can probably fit into memory as a single dataset. Nevertheless, it is good practice for when you come to train the final model on the entire dataset.

A generator yields a result each time it is called. In Keras, it yields a batch of input-output samples that are used to estimate the error gradient and update the model weights.

The function _data_generator()_ defines the data generator, given the loaded dictionary of photo descriptions, the photo features, the tokenizer for integer encoding sequences, and the maximum sequence length in the dataset.

The generator loops forever and keeps yielding batches of input-output pairs when asked. We also have an _n_step_ parameter that lets us tune how many photos' worth of input-output pairs to generate for each batch. The average sequence has about 10 words, that is 10 input-output pairs, so a good batch size might be 30 samples, which is about 2 to 3 photos' worth.

```py
# data generator, intended to be used in a call to model.fit_generator()
def data_generator(descriptions, features, tokenizer, max_length, n_step):
    # loop until we finish training
    while 1:
        # loop over photo identifiers in the dataset
        keys = list(descriptions.keys())
        for i in range(0, len(keys), n_step):
            Ximages, XSeq, y = list(), list(),list()
            for j in range(i, min(len(keys), i+n_step)):
                image_id = keys[j]
                # retrieve photo feature input
                image = features[image_id][0]
                # retrieve text input
                desc = descriptions[image_id]
                # generate input-output pairs
                in_img, in_seq, out_word = create_sequences(tokenizer, desc, image, max_length)
                for k in range(len(in_img)):
                    Ximages.append(in_img[k])
                    XSeq.append(in_seq[k])
                    y.append(out_word[k])
            # yield this batch of samples to the model
            yield [[array(Ximages), array(XSeq)], array(y)]
```

The model can be fit by calling _fit_generator()_ and passing it the data generator along with all of the parameters it needs. When fitting the model, we can also specify the number of batches to run per epoch and the number of epochs.

```py
model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update),
                    steps_per_epoch=n_batches_per_epoch,
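                    epochs=n_epochs, verbose=verbose)
```

For these experiments, we will use 2 images per batch, 50 batches (or 100 images) per epoch, and 50 training epochs. You can experiment with different configurations in your own experiments.

### 3. Evaluate Model

Now that we know how to prepare the data and define a model, we must define a test harness to evaluate a given model.

We will evaluate a model by training it on the dataset, generating descriptions for all photos in the training dataset, evaluating those predictions with a cost function, and then repeating this evaluation process multiple times.

The outcome is a distribution of skill scores for the model that we can summarize by calculating the mean and standard deviation. This is the preferred way to evaluate deep learning models. See this post:

* [How to Evaluate the Skill of Deep Learning Models](https://machinelearningmastery.com/evaluate-skill-deep-learning-models/)

First, we need to be able to generate a description for a photo using a trained model.

This involves passing in the start description token '_startseq_', generating one word, then calling the model recursively with the generated words as input until the end of sequence token '_endseq_' is reached or the maximum description length is reached.

The function below named _generate_desc()_ implements this behavior and generates a textual description given a trained model and a prepared photo as input. It calls the function _word_for_id()_ to map an integer prediction back to a word.

```py
# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
    # seed the generation process
    in_text = 'startseq'
    # iterate over the whole length of the sequence
    for i in range(max_length):
        # integer encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        # pad input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # predict next word
        yhat = model.predict([photo,sequence], verbose=0)
        # convert probability to integer
        yhat = argmax(yhat)
        # map integer to word
        word = word_for_id(yhat, tokenizer)
        # stop if we cannot map the word
        if word is None:
            break
        # append as input for generating the next word
        in_text += ' ' + word
        # stop if we predict the end of the sequence
        if word == 'endseq':
            break
    return in_text
```

We will generate predictions for all photos in the training dataset and in the test dataset.

The function below named _evaluate_model()_ evaluates a trained model against a given dataset of photo descriptions and photo features. The actual and predicted descriptions are collected and evaluated collectively using the corpus BLEU score, which summarizes how close the generated text is to the expected text.

```py
# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
    actual, predicted = list(), list()
    # step over the whole set
    for key, desc in descriptions.items():
        # generate description
        yhat = generate_desc(model, tokenizer, photos[key], max_length)
        # store actual and predicted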
        actual.append([desc.split()])
        predicted.append(yhat.split())
    # calculate BLEU score
    bleu = corpus_bleu(actual, predicted)
    return bleu
```

BLEU scores are used in text translation to evaluate translated text against one or more reference translations. We do in fact have access to multiple reference descriptions for each image that we could compare against, but for simplicity we will use the first description for each photo in the dataset (i.e. the cleaned version).

You can learn more about the BLEU score here:

* [BLEU (bilingual evaluation understudy)](https://en.wikipedia.org/wiki/BLEU) on Wikipedia

The NLTK Python library implements the BLEU score calculation in the [_corpus_bleu()_ function](http://www.nltk.org/api/nltk.translate.html). A higher score close to 1.0 is better; a score closer to zero is worse.

Finally, all we need to do is define, fit, and evaluate the model multiple times in a loop, then report the final average score.

Ideally, we would repeat the experiment 30 times or more, but that would take too long for our small test harness. Instead, we will evaluate the model 3 times. It will be faster, but the mean score will have higher variance.

The model evaluation loop is defined below. At the end of the run, the distributions of BLEU scores for the train and test sets are saved to a file.

```py
# run experiment
train_results, test_results = list(), list()
for i in range(n_repeats):
    # define the model
    model = define_model(vocab_size, max_length)
    # fit model
    model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update),
                        steps_per_epoch=n_batches_per_epoch, epochs=n_epochs, verbose=verbose)
    # evaluate model on training data
    train_score = evaluate_model(model, train_descriptions, train_features, tokenizer, max_length)
    test_score = evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
    # store
    train_results.append(train_score)
    test_results.append(test_score)
    print('>%d: train=%f test=%f' % ((i+1), train_score, test_score))
# save results to file
df = DataFrame()
df['train'] = train_results
df['test'] = test_results
print(df.describe())
df.to_csv(model_name+'.csv', index=False)
```

We parameterize the run as follows, which lets us name each run and save the results to separate files.

```py
# define experiment
model_name = 'baseline1'
verbose = 2
n_epochs = 50
n_photos_per_update = 2
n_batches_per_epoch = int(len(train) / n_photos_per_update)
n_repeats = 3
```

### 4. Complete Example

The complete example is listed below.

```py
from os import listdir
from numpy import array
from numpy import argmax
from pandas import DataFrame
from nltk.translate.bleu_score import corpus_bleu
from pickle import load
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.applications.vgg16 import VGG16
from keras.utils import plot_model
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import LSTM
from keras.layers import RepeatVector
from keras.layers import TimeDistributed
from keras.layers import Embedding
from keras.layers.merge import concatenate
from keras.layers.pooling import GlobalMaxPooling2D

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)

# split a dataset into train/test elements
def train_test_split(dataset):
    # order keys so the split is consistent
    ordered = sorted(dataset)
    # return split dataset as two new sets
    return set(ordered[:100]), set(ordered[100:200])

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # store
            descriptions[image_id] = 'startseq ' + ' '.join(image_desc) + ' endseq'
    return descriptions

# load photo features
def load_photo_features(filename, dataset):
    # load all features
    all_features = load(open(filename, 'rb'))
    # filter features
    features = {k:
                all_features[k] for k in dataset}
    return features

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
    lines = list(descriptions.values())
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, desc, image, max_length):
    Ximages, XSeq, y = list(), list(),list()
    vocab_size = len(tokenizer.word_index) + 1
    # integer encode the description
    seq = tokenizer.texts_to_sequences([desc])[0]
    # split one sequence into multiple X,y pairs
    for i in range(1, len(seq)):
        # select
        in_seq, out_seq = seq[:i], seq[i]
        # pad input sequence
        in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
        # encode output sequence
        out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
        # store
        Ximages.append(image)
        XSeq.append(in_seq)
        y.append(out_seq)
    # Ximages, XSeq, y = array(Ximages), array(XSeq), array(y)
    return [Ximages, XSeq, y]

# define the captioning model
def define_model(vocab_size, max_length):
    # feature extractor (encoder)
    inputs1 = Input(shape=(7, 7, 512))
    fe1 = GlobalMaxPooling2D()(inputs1)
    fe2 = Dense(128, activation='relu')(fe1)
    fe3 = RepeatVector(max_length)(fe2)
    # embedding
    inputs2 = Input(shape=(max_length,))
    emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
    emb3 = LSTM(256, return_sequences=True)(emb2)
    emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
    # merge inputs
    merged = concatenate([fe3, emb4])
    # language model (decoder)
    lm2 = LSTM(500)(merged)
    lm3 = Dense(500, activation='relu')(lm2)
    outputs = Dense(vocab_size, activation='softmax')(lm3)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    plot_model(model, show_shapes=True, to_file='plot.png')
    return model

# data generator, intended to be used in a call to model.fit_generator()
def data_generator(descriptions, features, tokenizer, max_length, n_step):
    # loop until we finish training
    while 1:
        # loop over photo identifiers in the dataset
        keys = list(descriptions.keys())
        for i in range(0, len(keys), n_step):
            Ximages, XSeq, y = list(), list(),list()
            for j in range(i, min(len(keys), i+n_step)):
                image_id = keys[j]
                # retrieve photo feature input
                image = features[image_id][0]
                # retrieve text input
                desc = descriptions[image_id]
                # generate input-output pairs
                in_img, in_seq, out_word = create_sequences(tokenizer, desc, image, max_length)
                for k in range(len(in_img)):
                    Ximages.append(in_img[k])
                    XSeq.append(in_seq[k])
                    y.append(out_word[k])
            # yield this batch of samples to the model
            yield [[array(Ximages), array(XSeq)], array(y)]

# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
    # seed the generation process
    in_text = 'startseq'
    # iterate over the whole length of the sequence
    for i in range(max_length):
        # integer encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        # pad input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # predict next word
        yhat = model.predict([photo,sequence], verbose=0)
        # convert probability to integer
        yhat = argmax(yhat)
        # map integer to word
        word = word_for_id(yhat, tokenizer)
        # stop if we cannot map the word
        if word is None:
            break
        # append as input for generating the next word
        in_text += ' ' + word
        # stop if we predict the end of the sequence
        if word == 'endseq':
            break
    return in_text

# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
    actual, predicted = list(), list()
    # step over the whole set
    for key, desc in descriptions.items():
        # generate description
        yhat = generate_desc(model, tokenizer, photos[key], max_length)
        # store actual and predicted
        actual.append([desc.split()])
        predicted.append(yhat.split())
    # calculate BLEU score
    bleu = corpus_bleu(actual, predicted)
    return bleu

# load dev set
filename = 'Flickr8k_text/Flickr_8k.devImages.txt'
dataset = load_set(filename)
print('Dataset: %d' % len(dataset))
# train-test split
train, test = train_test_split(dataset)
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: train=%d, test=%d' % (len(train_descriptions), len(test_descriptions)))
# photo features
train_features = load_photo_features('features.pkl', train)
test_features = load_photo_features('features.pkl', test)
print('Photos: train=%d, test=%d' % (len(train_features), len(test_features)))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max(len(s.split()) for s in list(train_descriptions.values()))
print('Description Length: %d' % max_length)

# define experiment
model_name = 'baseline1'
verbose = 2
n_epochs = 50
n_photos_per_update = 2
n_batches_per_epoch = int(len(train) / n_photos_per_update)
n_repeats = 3

# run experiment
train_results, test_results = list(), list()
for i in range(n_repeats):
    # define the model
    model = define_model(vocab_size, max_length)
    # fit model
    # note: fit_generator() is deprecated in recent TensorFlow/Keras releases,
    # where model.fit() accepts a generator directly
    model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update), steps_per_epoch=n_batches_per_epoch, epochs=n_epochs, verbose=verbose)
    # evaluate model on training and test data
    train_score = evaluate_model(model, train_descriptions, train_features, tokenizer, max_length)
    test_score = evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
    # store
    train_results.append(train_score)
    test_results.append(test_score)
    print('>%d: train=%f test=%f' % ((i+1), train_score, test_score))
# save results to file
df = DataFrame()
df['train'] = train_results
df['test'] = test_results
print(df.describe())
df.to_csv(model_name+'.csv', index=False)
```

Running the example first prints summary statistics for the loaded training data.

```py
Dataset: 1000
Descriptions: train=100, test=100
Photos: train=100, test=100
Vocabulary Size: 366
Description Length: 25
```

The example takes about 20 minutes on GPU hardware, and longer on CPU hardware. At the end of the run, a mean BLEU of 0.06 is reported on the training set and 0.04 on the test set. The results are stored in _baseline1.csv_.

```py
          train      test
count  3.000000  3.000000
mean   0.060617  0.040978
std    0.023498  0.025105
min    0.042882  0.012101
25%    0.047291  0.032658
50%    0.051701  0.053215
75%    0.069484  0.055416
max    0.087268  0.057617
```

This provides a baseline model for comparison to alternate configurations.

### “A” vs “A” Test

Before we start testing variations of the model, it is important to know whether the test harness is stable. That is, whether the summarized skill of the model over 3 runs is sufficient to control for the stochastic nature of the model.

We can get an idea of this by running the experiment again, in what is called an A vs A test in the field of A/B testing. If we run the same experiment again, we would expect to get about the same result; if we do not, additional repeats may be required to control for the stochastic nature of the method and the dataset.

Below are the results from a second run of the algorithm.

```py
          train      test
count  3.000000  3.000000
mean   0.036902  0.043003
std    0.020281  0.017295
min    0.018522  0.026055
25%    0.026023  0.034192
50%    0.033525  0.042329
75%    0.046093  0.051477
max    0.058660  0.060624
```

We can see that the run achieves broadly similar mean and standard deviation BLEU scores. Specifically, the mean BLEU is 0.04 vs 0.06 on train and 0.04 vs 0.04 on test. The harness is a little noisy, but stable enough for comparison.

Is the model any good?

### Generate Photo Captions

We expect the model to be under-trained and perhaps even under-provisioned, but can it generate any kind of readable text at all?

It is important that the baseline model has some capacity, so that we can relate the BLEU scores of the baseline to an idea of the quality of the descriptions being generated.

Let's train a model and generate some descriptions from the train and test sets as a sanity check. Change the number of repeats to 1 and the name of the run to '_baseline_generate_'.

```py
model_name = 'baseline_generate'
n_repeats = 1
```

Then update the _evaluate_model()_ function to evaluate only the first 5 photos in the dataset and print the descriptions, as follows.

```py
# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
    actual, predicted = list(), list()
    # step over the dataset, stopping after the first 5 photos
    for key, desc in descriptions.items():
        # generate description
        yhat = generate_desc(model, tokenizer, photos[key], max_length)
        # store actual and predicted
        actual.append([desc.split()])
        predicted.append(yhat.split())
        print('Actual:    %s' % desc)
        print('Predicted: %s' % yhat)
        if len(actual) >= 5:
            break
    # calculate BLEU score
    bleu = corpus_bleu(actual, predicted)
    return bleu
```
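To help relate a printed Actual/Predicted pair to a score, the following is a minimal sketch of clipped unigram precision, one of the ingredients of BLEU. It is not the NLTK `corpus_bleu` implementation used in this tutorial; the helper name `clipped_unigram_precision` is hypothetical and introduced only for illustration, and the caption pair is one of the training samples listed in this section.

```python
from collections import Counter

# hypothetical helper for illustration; not part of the tutorial code or NLTK
def clipped_unigram_precision(reference, candidate):
    """Clipped unigram precision of a candidate caption against one reference."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # credit each candidate word at most as often as it occurs in the reference
    matched = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0

actual = 'startseq boy bites hard into treat while he sits outside endseq'
predicted = 'startseq boy boy while while he while outside endseq'
precision = clipped_unigram_precision(actual, predicted)
print('Clipped unigram precision: %.3f' % precision)
```

Repeated words such as "boy boy" are credited only once each, so 6 of the 9 predicted words count as matches here. `corpus_bleu` additionally weighs 2-, 3- and 4-gram matches, so its scores for captions like this are typically much lower than the single-word figure.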
Re-run the example. You should see results for train like the following (specific results will vary given the stochastic nature of the algorithm):

```py
Actual:    startseq boy bites hard into treat while he sits outside endseq
Predicted: startseq boy boy while while he while outside endseq

Actual:    startseq man in field backed by american flags endseq
Predicted: startseq man in in standing city endseq

Actual:    startseq two girls are walking down dirt road in park endseq
Predicted: startseq man walking down down road in endseq

Actual:    startseq girl laying on the tree with boy kneeling before her endseq
Predicted: startseq boy while in up up up water endseq

Actual:    startseq boy in striped shirt is jumping in front of water fountain endseq
Predicted: startseq man is is shirt is on on on on bike endseq
```

You should see results like the following on the test dataset:

```py
Actual:    startseq three people are looking into photographic equipment endseq
Predicted: startseq boy racer on on on on bike endseq

Actual:    startseq boy is leaning on chair whilst another boy pulls him around with rope endseq
Predicted: startseq girl in playing on on on sword endseq

Actual:    startseq black and brown dog jumping in midair near field endseq
Predicted: startseq dog dog running running running and dog in grass endseq

Actual:    startseq dog places his head on man face endseq
Predicted: startseq brown dog dog to to to to to to to ball endseq

Actual:    startseq man in green hat is someplace up high endseq
Predicted: startseq man in up up his waves endseq
```

We can see that the descriptions are not perfect, and some are quite rough, but in general the model generates somewhat readable text. It is a good starting point for improvement.

Next, let's look at some experiments that vary the size or capacity of the different sub-models.

## Network Size Parameters

In this section, we will see how gross variations to the network structure impact model skill.

We will look at the following aspects of model size:

1. Size of the fixed-vector output from the 'encoders'.
2. Size of the sequence encoder model.
3. Size of the language model.

Let's dive in.

### Size of the Fixed-Length Vector

In the baseline model, the photo feature extractor and the text sequence encoder both output a 128-element vector. These vectors are then concatenated to be processed by the language model.

The 128-element vector from each sub-model contains everything known about the input sequence and the photo. We can vary the size of this vector to see whether it impacts model skill.

First, we can halve the size from 128 elements to 64 elements.

```py
# define the captioning model
def define_model(vocab_size, max_length):
    # feature extractor (encoder)
    inputs1 = Input(shape=(7, 7, 512))
    fe1 = GlobalMaxPooling2D()(inputs1)
    fe2 = Dense(64, activation='relu')(fe1)
    fe3 = RepeatVector(max_length)(fe2)
    # embedding
    inputs2 = Input(shape=(max_length,))
    emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
    emb3 = LSTM(256, return_sequences=True)(emb2)
    emb4 = TimeDistributed(Dense(64, activation='relu'))(emb3)
    # merge inputs
    merged = concatenate([fe3, emb4])
    # language model (decoder)
    lm2 = LSTM(500)(merged)
    lm3 = Dense(500, activation='relu')(lm2)
    outputs = Dense(vocab_size, activation='softmax')(lm3)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
```

We will name this model '_size_sm_fixed_vec_'.

```py
model_name = 'size_sm_fixed_vec'
```

Running this experiment produces the following BLEU scores: a clear gain on the training set (0.20 vs 0.06) and perhaps a small gain over baseline on the test set.

```py
          train      test
count  3.000000  3.000000
mean   0.204421  0.063148
std    0.026992  0.003264
min    0.174769  0.059391
25%    0.192849  0.062074
50%    0.210929  0.064757
75%    0.219246  0.065026
max    0.227564  0.065295
```

We can also double the size of the fixed-length vector from 128 to 256 units.

```py
# define the captioning model
def define_model(vocab_size, max_length):
    # feature extractor (encoder)
    inputs1 = Input(shape=(7, 7, 512))
    fe1 = GlobalMaxPooling2D()(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    fe3 = RepeatVector(max_length)(fe2)
    # embedding
    inputs2 = Input(shape=(max_length,))
    emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
    emb3 = LSTM(256, return_sequences=True)(emb2)
    emb4 = TimeDistributed(Dense(256, activation='relu'))(emb3)
    # merge inputs
    merged = concatenate([fe3, emb4])
    # language model (decoder)
    lm2 = LSTM(500)(merged)
    lm3 = Dense(500, activation='relu')(lm2)
    outputs = Dense(vocab_size, activation='softmax')(lm3)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
```

We will name this configuration '_size_lg_fixed_vec_'.

```py
model_name = 'size_lg_fixed_vec'
```

Running this experiment shows BLEU scores suggesting that the model is no better off. It is possible that with more data and/or longer training, we may see a different story.

```py
          train      test
count  3.000000  3.000000
mean   0.023517  0.027813
std    0.009951  0.010525
min    0.012037  0.021737
25%    0.020435  0.021737
50%    0.028833  0.021737
75%    0.029257  0.030852
max    0.029682  0.039966
```

### Sequence Encoder Size

We can call the sub-model that interprets the input sequence of words generated so far the sequence encoder.

First, we can check whether decreasing the representational capacity of the sequence encoder impacts model skill. We can reduce the number of memory units in the LSTM layer from 256 to 128.

```py
# define the captioning model
def define_model(vocab_size, max_length):
    # feature extractor (encoder)
    inputs1 = Input(shape=(7, 7, 512))
    fe1 = GlobalMaxPooling2D()(inputs1)
    fe2 = Dense(128, activation='relu')(fe1)
    fe3 = RepeatVector(max_length)(fe2)
    # embedding
    inputs2 = Input(shape=(max_length,))
    emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
    emb3 = LSTM(128, return_sequences=True)(emb2)
    emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
    # merge inputs
    merged = concatenate([fe3, emb4])
    # language model (decoder)
    lm2 = LSTM(500)(merged)
    lm3 = Dense(500, activation='relu')(lm2)
    outputs = Dense(vocab_size, activation='softmax')(lm3)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model_name = 'size_sm_seq_model'
```

Running this example, we can see a small bump over baseline on both train and test. This might be an artifact of the small training set size.

```py
          train      test
count  3.000000  3.000000
mean   0.074944  0.053917
std    0.014263  0.013264
min    0.066292  0.039142
25%    0.066713  0.048476
50%    0.067134  0.057810
75%    0.079270  0.061304
max    0.091406  0.064799
```

Going the other way, we can increase the number of LSTM layers from one to two and see whether that makes a dramatic difference.

```py
# define the captioning model
def define_model(vocab_size, max_length):
    # feature extractor (encoder)
    inputs1 = Input(shape=(7, 7, 512))
    fe1 = GlobalMaxPooling2D()(inputs1)
    fe2 = Dense(128, activation='relu')(fe1)
    fe3 = RepeatVector(max_length)(fe2)
    # embedding
    inputs2 = Input(shape=(max_length,))
    emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
    emb3 = LSTM(256, return_sequences=True)(emb2)
    emb4 = LSTM(256, return_sequences=True)(emb3)
    emb5 = TimeDistributed(Dense(128, activation='relu'))(emb4)
    # merge inputs
    merged = concatenate([fe3, emb5])
    # language model (decoder)
    lm2 = LSTM(500)(merged)
    lm3 = Dense(500, activation='relu')(lm2)
    outputs = Dense(vocab_size, activation='softmax')(lm3)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model_name = 'size_lg_seq_model'
```

Running this experiment shows a decent bump in BLEU on both the train and test sets.

```py
          train      test
count  3.000000  3.000000
mean   0.094937  0.096970
std    0.022394  0.079270
min    0.069151  0.046722
25%    0.087656  0.051279
50%    0.106161  0.055836
75%    0.107830  0.122094
max    0.109499  0.188351
```

We can also try to increase the representational capacity of the word embedding by doubling it from 50 dimensions to 100 dimensions.

```py
# define the captioning model
def define_model(vocab_size, max_length):
    # feature extractor (encoder)
    inputs1 = Input(shape=(7, 7, 512))
    fe1 = GlobalMaxPooling2D()(inputs1)
    fe2 = Dense(128, activation='relu')(fe1)
    fe3 = RepeatVector(max_length)(fe2)
    # embedding
    inputs2 = Input(shape=(max_length,))
    emb2 = Embedding(vocab_size, 100, mask_zero=True)(inputs2)
    emb3 = LSTM(256, return_sequences=True)(emb2)
    emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
    # merge inputs
    merged = concatenate([fe3, emb4])
    # language model (decoder)
    lm2 = LSTM(500)(merged)
    lm3 = Dense(500, activation='relu')(lm2)
    outputs = Dense(vocab_size, activation='softmax')(lm3)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model_name = 'size_em_seq_model'
```

We see a large movement on the training dataset, but perhaps little movement on the test dataset.

```py
          train      test
count  3.000000  3.000000
mean   0.112743  0.050935
std    0.017136  0.006860
min    0.096121  0.043741
25%    0.103940  0.047701
50%    0.111759  0.051661
75%    0.121055  0.054533
max    0.130350  0.057404
```

### Size of Language Model

We can refer to the model that learns from the concatenated sequence and photo feature input as the language model. It is responsible for generating words.

First, we can look at the impact on model skill of cutting the LSTM and dense layers from 500 to 256 neurons.

```py
# define the captioning model
def define_model(vocab_size, max_length):
    # feature extractor (encoder)
    inputs1 = Input(shape=(7, 7, 512))
    fe1 = GlobalMaxPooling2D()(inputs1)
    fe2 = Dense(128, activation='relu')(fe1)
    fe3 = RepeatVector(max_length)(fe2)
    # embedding
    inputs2 = Input(shape=(max_length,))
    emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
    emb3 = LSTM(256, return_sequences=True)(emb2)
    emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
    # merge inputs
    merged = concatenate([fe3, emb4])
    # language model (decoder)
    lm2 = LSTM(256)(merged)
    lm3 = Dense(256, activation='relu')(lm2)
    outputs = Dense(vocab_size, activation='softmax')(lm3)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model_name = 'size_sm_lang_model'
```

We can see that this has only a small, slightly positive effect on BLEU for both the training and test datasets, again likely related to the small size of the dataset.

```py
          train      test
count  3.000000  3.000000
mean   0.063632  0.056059
std    0.018521  0.009064
min    0.045127  0.048916
25%    0.054363  0.050961
50%    0.063599  0.053005
75%    0.072884  0.059630
max    0.082169  0.066256
```

We can also look at the impact of doubling the capacity of the language model by adding a second LSTM layer of the same size.

```py
# define the captioning model
def define_model(vocab_size, max_length):
    # feature extractor (encoder)
    inputs1 = Input(shape=(7, 7, 512))
    fe1 = GlobalMaxPooling2D()(inputs1)
    fe2 = Dense(128, activation='relu')(fe1)
    fe3 = RepeatVector(max_length)(fe2)
    # embedding
    inputs2 = Input(shape=(max_length,))
    emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
    emb3 = LSTM(256, return_sequences=True)(emb2)
    emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
    # merge inputs
    merged = concatenate([fe3, emb4])
    # language model (decoder)
    lm2 = LSTM(500, return_sequences=True)(merged)
    lm3 = LSTM(500)(lm2)
    lm4 = Dense(500, activation='relu')(lm3)
    outputs = Dense(vocab_size, activation='softmax')(lm4)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model_name = 'size_lg_lang_model'
```

Again, we see minor movements in BLEU, perhaps an artifact of noise and dataset size. The improvement on the test dataset may be a good sign. This might be a change worth exploring.

```py
          train      test
count  3.000000  3.000000
mean   0.043838  0.067658
std    0.037580  0.045813
min    0.017990  0.015757
25%    0.022284  0.050252
50%    0.026578  0.084748
75%    0.056763  0.093608
max    0.086948  0.102469
```

Tuning model size on a much smaller dataset is challenging.

## Configure the Feature Extraction Model

The use of the pre-trained VGG16 model provides some additional points of configuration.

The baseline model removes the top from the VGG model and applies a global max pooling layer to the extracted features, which are then encoded as a 128-element vector.

In this section, we will look at the following modifications to the baseline model:

1. Using a global average pooling layer after the VGG model.
2. Not using any global pooling.

### Global Average Pooling

We can replace the GlobalMaxPooling2D layer with a GlobalAveragePooling2D layer to achieve average pooling.

Global average pooling was developed to reduce overfitting in image classification problems, but it may offer some benefit in interpreting the features extracted from the image.

For more on global average pooling, see the paper:

* [Network In Network](https://arxiv.org/abs/1312.4400), 2013.

The updated _define_model()_ function and experiment name are listed below.

```py
# define the captioning model
def define_model(vocab_size, max_length):
    # feature extractor (encoder)
    inputs1 = Input(shape=(7, 7, 512))
    fe1 = GlobalAveragePooling2D()(inputs1)
    fe2 = Dense(128, activation='relu')(fe1)
    fe3 = RepeatVector(max_length)(fe2)
    # embedding
    inputs2 = Input(shape=(max_length,))
    emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
    emb3 = LSTM(256, return_sequences=True)(emb2)
    emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
    # merge inputs
    merged = concatenate([fe3, emb4])
    # language model (decoder)
    lm2 = LSTM(500)(merged)
    lm3 = Dense(500, activation='relu')(lm2)
    outputs = Dense(vocab_size, activation='softmax')(lm3)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model

model_name = 'fe_avg_pool'
```

The results suggest a dramatic improvement on the training dataset, which may be a sign of overfitting. We also see a small lift in test skill. This might be a change worth exploring.

```py
          train      test
count  3.000000  3.000000
mean   0.834627  0.060847
std    0.083259  0.040463
min    0.745074  0.017705
25%    0.797096  0.042294
50%    0.849118  0.066884
75%    0.879404  0.082418
max    0.909690  0.097952
```

### No Pooling

We can remove the GlobalMaxPooling2D layer, flatten the 3D photo features, and feed them directly into a Dense layer.

I would not expect this to be a good model design, but it is worth testing this assumption.

```py
# define the captioning model
def define_model(vocab_size, max_length):
    # feature extractor (encoder)
    inputs1 = Input(shape=(7, 7, 512))
    fe1 = Flatten()(inputs1)
    fe2 = Dense(128, activation='relu')(fe1)
    fe3 = RepeatVector(max_length)(fe2)
    # embedding
    inputs2 = Input(shape=(max_length,))
    emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
    emb3 = LSTM(256, return_sequences=True)(emb2)
    emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
    # merge inputs
    merged = concatenate([fe3, emb4])
    # language model (decoder)
    lm2 = LSTM(500)(merged)
    lm3 = Dense(500, activation='relu')(lm2)
    outputs = Dense(vocab_size, activation='softmax')(lm3)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model_name = 'fe_flat'
```

Surprisingly, we see roughly baseline-level skill on the training data and a large lift on the test data. This is surprising to me and may warrant further investigation.

```py
          train      test
count  3.000000  3.000000
mean   0.055988  0.135231
std    0.017566  0.079714
min    0.038605  0.044177
25%    0.047116  0.106633
50%    0.055627  0.169089
75%    0.064679  0.180758
max    0.073731  0.192428
```

We can try repeating this experiment with more capacity for interpreting the extracted photo features. A new Dense layer with 500 neurons is added after the Flatten layer.

```py
# define the captioning model
def define_model(vocab_size, max_length):
    # feature extractor (encoder)
    inputs1 = Input(shape=(7, 7, 512))
    fe1 = Flatten()(inputs1)
    fe2 = Dense(500, activation='relu')(fe1)
    fe3 = Dense(128, activation='relu')(fe2)
    fe4 = RepeatVector(max_length)(fe3)
    # embedding
    inputs2 = Input(shape=(max_length,))
    emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
    emb3 = LSTM(256, return_sequences=True)(emb2)
    emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
    # merge inputs
    merged = concatenate([fe4, emb4])
    # language model (decoder)
    lm2 = LSTM(500)(merged)
    lm3 = Dense(500, activation='relu')(lm2)
    outputs = Dense(vocab_size, activation='softmax')(lm3)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model_name = 'fe_flat2'
```

This results in a less impressive change and perhaps worse BLEU results on the test dataset.

```py
          train      test
count  3.000000  3.000000
mean   0.060126  0.029487
std    0.030300  0.013205
min    0.031235  0.020850
25%    0.044359  0.021887
50%    0.057483  0.022923
75%    0.074572  0.033805
max    0.091661  0.044688
```

## Word Embedding Models

A key part of the model is the sequence learning model, which must interpret the sequence of words generated so far for a photo.

At the input to this sub-model is the word embedding. A good way to improve the word embedding, rather than learning it from scratch as part of the model (as in the baseline model), is to use pre-trained word embeddings.

In this section, we will explore the impact of using a pre-trained word embedding on the model. Specifically:

1. Trained Word2Vec Model
2. Trained Word2Vec Model + Fine Tuning

### Trained word2vec Embedding

An efficient learning algorithm for pre-training a word embedding from a corpus of text is the word2vec algorithm.

You can learn more about the word2vec algorithm here:

* [Word2Vec Google Code Project](https://code.google.com/archive/p/word2vec/)

We can use this algorithm to train a new standalone set of word vectors using the cleaned photo descriptions in the dataset.

The [Gensim library](https://radimrehurek.com/gensim/models/word2vec.html) provides access to an implementation of the algorithm that we can use to pre-train the embedding.

First, we must load the clean photo descriptions for the training dataset, as before.

Next, we can fit the word2vec model on the clean descriptions. Note that fitting on all of the clean descriptions would include many more descriptions than those used in the training dataset; a fairer model for these experiments should only be trained on the descriptions in the training dataset, which is what the code below does.

Once fit, we can save the words and word vectors to an ASCII file, perhaps for later inspection or visualization.

```py
# train word2vec model
# note: gensim 4.x renamed size to vector_size and wv.vocab to wv.key_to_index
lines = [s.split() for s in train_descriptions.values()]
model = Word2Vec(lines, size=100, window=5, workers=8, min_count=1)
# summarize vocabulary size in model
words = list(model.wv.vocab)
print('Vocabulary size: %d' % len(words))
# save model in ASCII (word2vec) format
filename = 'custom_embedding.txt'
model.wv.save_word2vec_format(filename, binary=False)
```

The word embedding is saved to the file '_custom_embedding.txt_'.

Now, we can load the embedding into memory, retrieve only the word vectors for the words in our vocabulary, then save them to a new file.

```py
# load the whole embedding into memory
# note: the first line of the word2vec text format is a header (vocabulary size
# and dimensions); it is read here as if it were a word and dropped later by
# the vocabulary filter
embedding = dict()
file = open('custom_embedding.txt')
for line in file:
    values = line.split()
    word = values[0]
    coefs = asarray(values[1:], dtype='float32')
    embedding[word] = coefs
file.close()
print('Embedding Size: %d' % len(embedding))
# summarize vocabulary
all_tokens = ' '.join(train_descriptions.values()).split()
vocabulary = set(all_tokens)
print('Vocabulary Size: %d' % len(vocabulary))
# get the vectors for words in our vocab
cust_embedding = dict()
for word in vocabulary:
    # check if word in embedding
    if word not in embedding:
        continue
    cust_embedding[word] = embedding[word]
print('Custom Embedding %d' % len(cust_embedding))
# save
dump(cust_embedding, open('word2vec_embedding.pkl', 'wb'))
print('Saved Embedding')
```

The complete example is listed below.

```py
# prepare word vectors for captioning model
from numpy import asarray
from pickle import dump
from gensim.models import Word2Vec

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)

# split a dataset into train/test elements
def train_test_split(dataset):
    # order keys so the split is consistent
    ordered = sorted(dataset)
    # return split dataset as two new sets
    return set(ordered[:100]), set(ordered[100:200])

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # skip empty lines (for example a trailing newline)
        if len(tokens) < 1:
            continue
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # store
            descriptions[image_id] = 'startseq ' + ' '.join(image_desc) + ' endseq'
    return descriptions

# load dev set
filename = 'Flickr8k_text/Flickr_8k.devImages.txt'
dataset = load_set(filename)
print('Dataset: %d' % len(dataset))
# train-test split
train, test = train_test_split(dataset)
print('Train=%d, Test=%d' % (len(train), len(test)))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))

# train word2vec model
# note: gensim 4.x renamed size to vector_size and wv.vocab to wv.key_to_index
lines = [s.split() for s in train_descriptions.values()]
model = Word2Vec(lines, size=100, window=5, workers=8, min_count=1)
# summarize vocabulary size in model
words = list(model.wv.vocab)
print('Vocabulary size: %d' % len(words))
# save model in ASCII (word2vec) format
filename = 'custom_embedding.txt'
model.wv.save_word2vec_format(filename, binary=False)

# load the whole embedding into memory
# note: the header line of the word2vec text format is read as an entry too
embedding = dict()
file = open('custom_embedding.txt')
for line in file:
    values = line.split()
    word = values[0]
    coefs = asarray(values[1:], dtype='float32')
    embedding[word] = coefs
file.close()
print('Embedding Size: %d' % len(embedding))

# summarize vocabulary
all_tokens = ' '.join(train_descriptions.values()).split()
vocabulary = set(all_tokens)
print('Vocabulary Size: %d' % len(vocabulary))

# get the vectors for words in our vocab
cust_embedding = dict()
for word in vocabulary:
    # check if word in embedding
    if word not in embedding:
        continue
    cust_embedding[word] = embedding[word]
print('Custom Embedding %d' % len(cust_embedding))

# save
dump(cust_embedding, open('word2vec_embedding.pkl', 'wb'))
print('Saved Embedding')
```

Running this example will create a new dictionary mapping of words to word vectors, stored in the file '_word2vec_embedding.pkl_'.

```py
Dataset: 1000
Train=100, Test=100
Descriptions: train=100
Vocabulary size: 365
Embedding Size: 366
Vocabulary Size: 365
Custom Embedding 365
Saved Embedding
```

Next, we can load this embedding and use the word vectors as the fixed weights in an Embedding layer.

The _load_embedding()_ function below loads the custom word2vec embedding and returns a new Embedding layer for use in the model.

```py
# load a word embedding
def load_embedding(tokenizer, vocab_size,
                   max_length):
    # load the pre-trained word vectors
    embedding = load(open('word2vec_embedding.pkl', 'rb'))
    dimensions = 100
    trainable = False
    # create a weight matrix for words in training docs
    weights = zeros((vocab_size, dimensions))
    # walk words in order of tokenizer vocab to ensure vectors are in the right index
    for word, i in tokenizer.word_index.items():
        if word not in embedding:
            continue
        weights[i] = embedding[word]
    layer = Embedding(vocab_size, dimensions, weights=[weights], input_length=max_length, trainable=trainable, mask_zero=True)
    return layer
```

We can use it in our model by calling the function directly from our _define_model()_ function.

```py
# define the captioning model
def define_model(tokenizer, vocab_size, max_length):
    # feature extractor (encoder)
    inputs1 = Input(shape=(7, 7, 512))
    fe1 = GlobalMaxPooling2D()(inputs1)
    fe2 = Dense(128, activation='relu')(fe1)
    fe3 = RepeatVector(max_length)(fe2)
    # embedding
    inputs2 = Input(shape=(max_length,))
    emb2 = load_embedding(tokenizer, vocab_size, max_length)(inputs2)
    emb3 = LSTM(256, return_sequences=True)(emb2)
    emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
    # merge inputs
    merged = concatenate([fe3, emb4])
    # language model (decoder)
    lm2 = LSTM(500)(merged)
    lm3 = Dense(500, activation='relu')(lm2)
    outputs = Dense(vocab_size, activation='softmax')(lm3)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model_name = 'seq_w2v_fixed'
```

We can see some lift on the training dataset and perhaps no real notable change on the test dataset.

```py
          train      test
count  3.000000  3.000000
mean   0.096780  0.047540
std    0.055073  0.008445
min    0.033511  0.038340
25%    0.078186  0.043840
50%    0.122861  0.049341
75%    0.128414  0.052140
max    0.133967  0.054939
```

### Trained word2vec Embedding With Fine Tuning

We can repeat the previous experiment and allow the model to tune the word vectors while fitting the model.

The updated _load_embedding()_ function that permits the embedding layer to be fine-tuned is listed below.

```py
# load a word embedding
def load_embedding(tokenizer, vocab_size, max_length):
    # load the pre-trained word vectors
    embedding = load(open('word2vec_embedding.pkl', 'rb'))
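    dimensions = 100
    trainable = True
    # create a weight matrix for words in training docs
    weights = zeros((vocab_size, dimensions))
    # walk words in order of tokenizer vocab to ensure vectors are in the right index
    for word, i in tokenizer.word_index.items():
        if word not in embedding:
            continue
        weights[i] = embedding[word]
    layer = Embedding(vocab_size, dimensions, weights=[weights], input_length=max_length, trainable=trainable, mask_zero=True)
    return layer

model_name = 'seq_w2v_tuned'
```

Again, we do not see much difference from using these pre-trained word embedding vectors over the baseline model.

```py
          train      test
count  3.000000  3.000000
mean   0.065297  0.042712
std    0.080194  0.007697
min    0.017675  0.034593
25%    0.019003  0.039117
50%    0.020332  0.043641
75%    0.089108  0.046772
max    0.157885  0.049904
```

## Analysis of Results

We have performed a few experiments on a very small sample (1.6%) of the Flickr8k training dataset of 8,000 photos.

It is possible that the sample is too small, that the models were not trained for long enough, and that 3 repeats of each model result in too much variance. These aspects can also be evaluated by designing experiments such as:

1. Does model skill scale with the size of the dataset?
2. Do more epochs result in better skill?
3. Do more repeats result in a skill with less variance?

Nevertheless, we have some ideas on how we might configure a model for the fuller dataset.

Below is a summary of the mean results from the experiments performed in this tutorial.

It is helpful to review a graph of the results. If we had more repeats, a box and whisker plot of each distribution of scores might be a good visualization. Here we use a simple bar graph. Remember that larger BLEU scores are better.

Results on the training dataset:

![Bar Chart of Experiment vs Model Skill on the Training Dataset](img/06296eb24404348507d2d48f948c0313.jpg)

Bar Chart of Experiment vs Model Skill on the Training Dataset

Results on the test dataset:

![Bar Chart of Experiment vs Model Skill on the Test Dataset](img/e69be5d66733a6d148d48cf818a04539.jpg)

Bar Chart of Experiment vs Model Skill on the Test Dataset

From just looking at the mean results on the test dataset, we can suggest:

* Perhaps pooling is not required after the photo feature extractor (fe_flat at 0.135231).
* Perhaps average pooling offers an advantage over max pooling after the photo feature extractor (fe_avg_pool at 0.060847).
* Perhaps a smaller fixed-length vector after the sub-models is a good idea (size_sm_fixed_vec at 0.063148).
* Perhaps adding more layers to the language model offers some benefit (size_lg_lang_model at 0.067658).
* Perhaps adding more layers to the sequence model offers some benefit (size_lg_seq_model at 0.096970).

I would also suggest exploring combinations of these suggestions.

We can also review the distribution of results.

Below is some code to load the saved results from each experiment and create a box and whisker plot of the results on the train and test sets for review.

```py
from os import listdir
from pandas import read_csv
from pandas import DataFrame
from matplotlib import pyplot
# load all .csv results into a dataframe
train, test = DataFrame(), DataFrame()
directory = 'results'
for name in listdir(directory):
    if not name.endswith('csv'):
        continue
    filename = directory + '/' + name
    data = read_csv(filename, header=0)
    experiment = name.split('.')[0]
    train[experiment] = data['train']
    test[experiment] = data['test']
# plot results on train
train.boxplot(vert=False)
pyplot.show()
# plot results on test
test.boxplot(vert=False)
pyplot.show()
```

Distribution of results on the training dataset.

![Box and Whisker Plot of Experiment vs Model Skill on the Training Dataset](img/85c1ebb9ffbbcd5c80fc50b1c3b07ef9.jpg)

Box and Whisker Plot of Experiment vs Model Skill on the Training Dataset

Distribution of results on the test dataset.

![Box and Whisker Plot of Experiment vs Model Skill on the Test Dataset](img/7c30315b0ae45ddd0d650b0ed6c36d67.jpg)

Box and Whisker Plot of Experiment vs Model Skill on the Test Dataset

A review of these distributions suggests:

* The spread on the flat results is large; perhaps going with average pooling might be safer.
* The spread on the larger language model is large and skewed in the wrong/risky direction.
* The spread on the larger sequence model is large and skewed in the right direction.
* There may be some benefit in a smaller fixed-length vector size.

I would expect that increasing the repeats to 5, 10, or 30 would tighten up these distributions somewhat.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Papers

* [Show and Tell: A Neural Image Caption Generator](https://arxiv.org/abs/1411.4555), 2015.
* [Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](https://arxiv.org/abs/1502.03044), 2016.
* [Network In Network](https://arxiv.org/abs/1312.4400), 2013.

### Related Caption Projects

* [caption_generator: An image captioning project](https://github.com/anuragmishracse/caption_generator)
* [Keras Image Caption](https://github.com/LemonATsu/Keras-Image-Caption)
* [Neural Image Captioning (NIC)](https://github.com/oarriaga/neural_image_captioning)
* [Keras deep learning for image caption retrieval](https://deeplearningmania.quora.com/Keras-deep-learning-for-image-caption-retrieval)
* [DataLab Cup 2: Image Captioning](http://www.cs.nthu.edu.tw/~shwu/courses/ml/competitions/02_Image-Caption/02_Image-Caption.html)

### Other

* [Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics](http://nlp.cs.illinois.edu/HockenmaierGroup/Framing_Image_Description/KCCA.html).

### API

* [Keras VGG16 API](https://keras.io/applications/#vgg16)
* [Gensim word2vec API](https://radimrehurek.com/gensim/models/word2vec.html)

## Summary

In this tutorial, you discovered how you can use a small sample of the photo captioning dataset to explore different model designs.

Specifically, you learned:

* How to prepare data for photo captioning modeling.
* How to design a baseline and test harness to evaluate the skill of models and control for their stochastic nature.
* How to evaluate properties like model skill, the feature extraction model, and word embeddings in order to lift model skill.

What experiments can you think up? What else have you tried? What are the best results you can get on the train and test datasets?

Let me know in the comments below.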