How to Develop a Deep Learning Photo Caption Generator from Scratch · Machine Learning Mastery blog post translations

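# How to Develop a Deep Learning Photo Caption Generator from Scratch

> Original: [https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/](https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/)

#### Develop a deep learning model to automatically describe photographs in Python with Keras, step by step.

Caption generation is a challenging artificial intelligence problem where a textual description must be generated for a given photograph. It requires both methods from computer vision to understand the content of the image and a language model from the field of natural language processing to turn that understanding into words in the right order. Recently, deep learning methods have achieved state-of-the-art results on examples of this problem.

What is most impressive about these methods is that a single end-to-end model can be defined to predict a caption, given a photo, instead of requiring sophisticated data preparation or a pipeline of specifically designed models.

In this tutorial, you will discover how to develop a photo captioning deep learning model from scratch.

After completing this tutorial, you will know:

* How to prepare photo and text data for training a deep learning model.
* How to design and train a deep learning caption generation model.
* How to evaluate a trained caption generation model and use it to caption entirely new photographs.

**Note**: This is an excerpt from "[Deep Learning for Natural Language Processing](https://machinelearningmastery.com/deep-learning-for-nlp/)". Take a look if you want more step-by-step tutorials on getting the most out of deep learning methods when working with text data.

Let's get started.

* **Update Nov/2017**: Added a note about a bug introduced in Keras 2.1.0 and 2.1.1 that impacts the code in this tutorial.
* **Update Dec/2017**: Fixed a typo in a function name when explaining how to save descriptions to file, thanks Minel.
* **Update Apr/2018**: Added a new section that shows how to train the model using progressive loading for workstations with minimal RAM.
* **Update Feb/2019**: Provided direct links to the Flickr8k_Dataset dataset, as the official site was taken down.

![How to Develop a Deep Learning Caption Generation Model in Python from Scratch](img/7c3093e713bfc0f44e9aa591c5ae3415.jpg)

How to Develop a Deep Learning Caption Generation Model in Python from Scratch. Photo by [Living in Monrovia](https://www.flickr.com/photos/livinginmonrovia/8069637650/), some rights reserved.

## Tutorial Overview

This tutorial is divided into 7 parts; they are:

1. Photo and Caption Dataset
2. Prepare Photo Data
3. Prepare Text Data
4. Develop Deep Learning Model
5. Train With Progressive Loading (**NEW**)
6. Evaluate Model
7. Generate New Captions

### Python Environment

This tutorial assumes you have a Python SciPy environment installed, ideally with Python 3. You must have Keras (2.2 or higher) installed with either the TensorFlow or Theano backend. The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help with your environment, see this tutorial:

* [How to Setup a Python Environment for Machine Learning and Deep Learning with Anaconda](https://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/)

I recommend running the code on a system with a GPU. You can access GPUs cheaply on Amazon Web Services. Learn how in this tutorial:

* [How to Setup Amazon AWS EC2 GPUs to Train Keras Deep Learning Models (step-by-step)](https://machinelearningmastery.com/develop-evaluate-large-deep-learning-models-keras-amazon-web-services/)

Let's dive in.

## Photo and Caption Dataset

The Flickr8K dataset is a good dataset to use when getting started with image captioning. It is realistic and relatively small, so you can download it and build models on your workstation using a CPU.

The definitive description of the dataset is in the 2013 paper "[Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics](https://www.jair.org/media/3994/live-3994-7274-jair.pdf)". The authors describe the dataset as follows:

> We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.
>
> ...
>
> The images were chosen from six different Flickr groups, and tend not to contain any well-known people or locations, but were manually selected to depict a variety of scenes and situations.

- [Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics](https://www.jair.org/media/3994/live-3994-7274-jair.pdf), 2013.

The dataset is available for free. You must complete a request form and the links to the dataset will be emailed to you. I would love to link to them for you, but the email expressly requests: "_Please do not redistribute the dataset_".

You can use the link below to request the dataset:

* [Dataset Request Form](https://illinois.edu/fb/sec/1713398)

Within a short time, you will receive an email that contains links to two files:

* **Flickr8k_Dataset.zip** (1 Gigabyte): an archive of all photographs.
* **Flickr8k_text.zip** (2.2 Megabytes): an archive of all text descriptions for the photographs.

**UPDATE (Feb/2019)**: The official site seems to have been taken down (although the form still works). Here are some direct download links from my [datasets GitHub repository](https://github.com/jbrownlee/Datasets):

* [Flickr8k_Dataset.zip](https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_Dataset.zip)
* [Flickr8k_text.zip](https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_text.zip)

Download the datasets and unzip them into your current working directory. You will have two directories:

* **Flicker8k_Dataset**: contains 8,092 photographs in JPEG format.
* **Flickr8k_text**: contains a number of files holding different sources of descriptions for the photographs.

The dataset has a pre-defined training dataset (6,000 images), development dataset (1,000 images), and test dataset (1,000 images).

One measure that can be used to evaluate the skill of the model is the BLEU score. For reference, below are some ball-park BLEU scores for skillful models when evaluated on the test dataset (taken from the 2017 paper "[Where to put the Image in an Image Caption Generator](https://arxiv.org/abs/1703.09137)"):

* BLEU-1: 0.401 to 0.578.
* BLEU-2: 0.176 to 0.390.
* BLEU-3: 0.099 to 0.260.
* BLEU-4: 0.059 to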
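0.170.

We will describe the BLEU metric in more detail later when we evaluate our model.

Next, let's look at how to load the images.

## Prepare Photo Data

We will use a pre-trained model to interpret the content of the photos. There are many models to choose from. In this case, we will use the Oxford Visual Geometry Group, or VGG, model that won the ImageNet competition in 2014. Learn more about the model here:

* [Very Deep Convolutional Networks for Large-Scale Visual Recognition](http://www.robots.ox.ac.uk/~vgg/research/very_deep/)

Keras provides this pre-trained model directly. Note that the first time you use this model, Keras will download the model weights from the Internet, which are about 500 Megabytes. This may take a few minutes depending on your internet connection.

We could use this model as part of a broader image caption model. The problem is that it is a large model, and running each photo through the network every time we want to test a new language model configuration (downstream) is redundant.

Instead, we can pre-compute the "photo features" using the pre-trained model and save them to file. We can then load these features later and feed them into our model as the interpretation of a given photo in the dataset. This is no different from running the photo through the full VGG model; we have just done it once in advance.

This is an optimization that will make training our models faster and consume less memory.

We can load the VGG model in Keras using the VGG16 class. We will drop the last layer of the loaded model, as this is the layer used to predict a classification for a photo. We are not interested in classifying images, but we are interested in the internal representation of the photo right before a classification is made. These are the "features" that the model has extracted from the photo. In the code below, this is done by building a new model whose output is the second-to-last layer, which behaves the same way regardless of whether the installed Keras version lets you modify the layer list in place.

Keras also provides tools for reshaping the loaded photo into the preferred size for the model (e.g. a 3-channel 224 x 224 pixel image).

Below is a function named _extract_features()_ that, given a directory name, will load each photo, prepare it for VGG, and collect the predicted features from the VGG model. The image features are a 1-dimensional 4,096 element vector.

The function returns a dictionary of image identifier to image features.

```py
# extract features from each photo in the directory
def extract_features(directory):
    # load the model
    model = VGG16()
    # re-structure the model: use the second-to-last layer (4,096 features)
    # as the output instead of the final classification layer
    model = Model(inputs=model.inputs, outputs=model.layers[-2].output)
    # summarize
    model.summary()
    # extract features from each photo
    features = dict()
    for name in listdir(directory):
        # load an image from file
        filename = directory + '/' + name
        image = load_img(filename, target_size=(224, 224))
        # convert the image pixels to a numpy array
        image = img_to_array(image)
        # reshape data for the model
        image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
        # prepare the image for the VGG model
        image = preprocess_input(image)
        # get features
        feature = model.predict(image, verbose=0)
        # get image id
        image_id = name.split('.')[0]
        # store feature
        features[image_id] = feature
        print('>%s' % name)
    return features
```

We can call this function to prepare the photo data for testing our models, then save the resulting dictionary to a file named "_features.pkl_".

The complete example is listed below.

```py
from os import listdir
from pickle import dump
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.models import Model

# extract features from each photo in the directory
def extract_features(directory):
    # load the model
    model = VGG16()
    # re-structure the model: use the second-to-last layer (4,096 features)
    # as the output instead of the final classification layer
    model = Model(inputs=model.inputs, outputs=model.layers[-2].output)
    # summarize
    model.summary()
    # extract features from each photo
    features = dict()
    for name in listdir(directory):
        # load an image from file
        filename = directory + '/' + name
        image = load_img(filename, target_size=(224, 224))
        # convert the image pixels to a numpy array
        image = img_to_array(image)
        # reshape data for the model
        image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
        # prepare the image for the VGG model
        image = preprocess_input(image)
        # get features
        feature = model.predict(image, verbose=0)
        # get image id
        image_id = name.split('.')[0]
        # store feature
        features[image_id] = feature
        print('>%s' % name)
    return features

# extract features from all images
directory = 'Flicker8k_Dataset'
features = extract_features(directory)
print('Extracted Features: %d' % len(features))
# save to file
dump(features, open('features.pkl', 'wb'))
```

Running this data preparation step may take a while depending on your hardware, perhaps one hour on the CPU of a modern workstation.

At the end of the run, you will have the extracted features stored in '_features.pkl_' for later use. This file will be about 127 Megabytes in size.

## Prepare Text Data

The dataset contains multiple descriptions for each photograph, and the text of the descriptions requires some minimal cleaning.

If you are new to cleaning text data, see this post:

* [How to Clean Text for Machine Learning with Python](https://machinelearningmastery.com/clean-text-machine-learning-python/)

First, we will load the file containing all of the descriptions.

```py
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

filename = 'Flickr8k_text/Flickr8k.token.txt'
# load descriptions
doc = load_doc(filename)
```

Each photo has a unique identifier. This identifier is used in the photo filename and in the text file of descriptions.

Next, we will step through the list of photo descriptions. Below defines a function _load_descriptions()_ that, given the loaded document text, will return a dictionary of photo identifiers to descriptions. Each photo identifier maps to a list of one or more textual descriptions.

```py
# extract descriptions for images
def load_descriptions(doc):
    mapping = dict()
    # process lines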
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        if len(line) < 2:
            continue
        # take the first token as the image id, the rest as the description
        image_id, image_desc = tokens[0], tokens[1:]
        # remove filename from image id
        image_id = image_id.split('.')[0]
        # convert description tokens back to string
        image_desc = ' '.join(image_desc)
        # create the list if needed
        if image_id not in mapping:
            mapping[image_id] = list()
        # store description
        mapping[image_id].append(image_desc)
    return mapping

# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))
```

Next, we need to clean the description text. The descriptions are already tokenized and easy to work with.

We will clean the text in the following ways in order to reduce the size of the vocabulary of words we will need to work with:

* Convert all words to lowercase.
* Remove all punctuation.
* Remove all words that are one character or less in length (e.g. 'a').
* Remove all words with numbers in them.

Below defines the _clean_descriptions()_ function that, given the dictionary of image identifiers to descriptions, steps through each description and cleans the text.

```py
import string

def clean_descriptions(descriptions):
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for key, desc_list in descriptions.items():
        for i in range(len(desc_list)):
            desc = desc_list[i]
            # tokenize
            desc = desc.split()
            # convert to lower case
            desc = [word.lower() for word in desc]
            # remove punctuation from each token
            desc = [w.translate(table) for w in desc]
            # remove hanging 's' and 'a'
            desc = [word for word in desc if len(word)>1]
            # remove tokens with numbers in them
            desc = [word for word in desc if word.isalpha()]
            # store as string
            desc_list[i] = ' '.join(desc)

# clean descriptions
clean_descriptions(descriptions)
```

Once cleaned, we can summarize the size of the vocabulary.

Ideally, we want a vocabulary that is both expressive and as small as possible. A smaller vocabulary results in a smaller model that trains faster.

For reference, we can transform the clean descriptions into a set and print its size to get an idea of the size of our dataset vocabulary.

```py
# convert the loaded descriptions into a vocabulary of words
def to_vocabulary(descriptions):
    # build a set of all words used in the descriptions
    all_desc = set()
    for key in descriptions.keys():
        for d in descriptions[key]:
            all_desc.update(d.split())
    return all_desc

# summarize vocabulary
vocabulary = to_vocabulary(descriptions)
print('Vocabulary Size: %d' % len(vocabulary))
```

Finally, we can save the dictionary of image identifiers and descriptions to a new file named _descriptions.txt_, with one image identifier and description per line.

Below defines the _save_descriptions()_ function that, given a dictionary containing the mapping of identifiers to descriptions and a filename, saves the mapping to file.

```py
# save descriptions to file, one per line
def save_descriptions(descriptions, filename):
    lines = list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(key + ' ' + desc)
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

# save descriptions
save_descriptions(descriptions, 'descriptions.txt')
```

Putting this all together, the complete listing is provided below.

```py
import string

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# extract descriptions for images
def load_descriptions(doc):
    mapping = dict()
    # process lines
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        if len(line) < 2:
            continue
        # take the first token as the image id, the rest as the description
        image_id, image_desc = tokens[0], tokens[1:]
        # remove filename from image id
        image_id = image_id.split('.')[0]
        # convert description tokens back to string
        image_desc = ' '.join(image_desc)
        # create the list if needed
        if image_id not in mapping:
            mapping[image_id] = list()
        # store description
        mapping[image_id].append(image_desc)
    return mapping

def clean_descriptions(descriptions):
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for key, desc_list in descriptions.items():
        for i in range(len(desc_list)):
            desc = desc_list[i]
            # tokenize
            desc = desc.split()
            # convert to lower case
            desc = [word.lower() for word in desc]
            # remove punctuation from each token
            desc = [w.translate(table) for w in desc]
            # remove hanging 's' and 'a'
            desc = [word for word in desc if len(word)>1]
            # remove tokens with numbers in them
            desc = [word for word in desc if word.isalpha()]
            # store as string
            desc_list[i] = ' '.join(desc)

# convert the loaded descriptions into a vocabulary of words
def to_vocabulary(descriptions):
    # build a set of all words used in the descriptions
    all_desc = set()
    for key in descriptions.keys():
        for d in descriptions[key]:
            all_desc.update(d.split())
    return all_desc

# save descriptions to file, one per line
def save_descriptions(descriptions, filename):
    lines = list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(key + ' ' + desc)
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

filename = 'Flickr8k_text/Flickr8k.token.txt'
# load descriptions
doc = load_doc(filename)
# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))
# clean descriptions
clean_descriptions(descriptions)
# summarize vocabulary
vocabulary = to_vocabulary(descriptions)
print('Vocabulary Size: %d' % len(vocabulary))
# save to file
save_descriptions(descriptions, 'descriptions.txt')
```

Running the example first prints the number of loaded photo descriptions (8,092) and the size of the clean vocabulary (8,763 words).

```py
Loaded: 8,092
Vocabulary Size: 8,763
```

Finally, the clean descriptions are written to '_descriptions.txt_'.

Taking a look at the file, we can see that the descriptions are ready for modeling. The order of descriptions in your file may vary.

```py
2252123185_487f21e336 bunch on people are seated in stadium
2252123185_487f21e336 crowded stadium is full of people watching an event
2252123185_487f21e336 crowd of people fill up packed stadium
2252123185_487f21e336 crowd sitting in an indoor stadium
2252123185_487f21e336 stadium full of people watch game
...
```

## Develop Deep Learning Model

In this section, we will define the deep learning model and fit it on the training dataset.

This section is divided into the following parts:

1. Loading Data.
2. Defining the Model.
3. Fitting the Model.
4. Complete Example.

### Loading Data

First, we must load the prepared photo and text data so that we can use it to fit the model.

We are going to train on all of the photos and captions in the training dataset. While training, we are going to monitor the performance of the model on the development dataset and use that performance to decide when to save models to file.

The train and development datasets have been predefined in the _Flickr_8k.trainImages.txt_ and _Flickr_8k.devImages.txt_ files respectively, which both contain lists of photo file names. From these file names, we can extract the photo identifiers and use these identifiers to filter photos and descriptions for each set.

The function _load_set()_ below will load a pre-defined set of identifiers given the train or development set's filename.

```py
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)
```

Now, we can load the photos and descriptions using the pre-defined set of train or development identifiers.

Below is the function _load_clean_descriptions()_ that loads the cleaned text descriptions from '_descriptions.txt_' for a given set of identifiers and returns a dictionary of identifiers to lists of text descriptions.

The model we will develop will generate a caption given a photo, and the caption will be generated one word at a time. The sequence of previously generated words will be provided as input. Therefore, we will need a '_first word_' to kick off the generation process and a '_last word_' to signal the end of the caption.

We will use the strings '_startseq_' and '_endseq_' for this purpose. These tokens are added to the descriptions as they are loaded. It is important to do this now, before we encode the text, so that the tokens are also encoded correctly.

```py
# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # skip empty lines
        if len(tokens) < 1:
            continue
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # create list
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap description in tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            # store
            descriptions[image_id].append(desc)
    return descriptions
```

Next, we can load the photo features for a given dataset.

Below defines a function named _load_photo_features()_ that loads the entire set of photo features, then returns the subset of interest for a given set of photo identifiers.

This is not very efficient; nevertheless, it will get us up and running quickly.

```py
# load photo features
def load_photo_features(filename, dataset):
    # load all features
    all_features = load(open(filename, 'rb'))
    # filter features
    features = {k: all_features[k] for k in dataset}
    return features
```

We can pause here and test everything developed so far.

The complete code example is listed below.

```py
from pickle import load

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # skip empty lines
        if len(tokens) < 1:
            continue
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # create list
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap description in tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            # store
            descriptions[image_id].append(desc)
    return descriptions

# load photo features
def load_photo_features(filename, dataset):
    # load all features
    all_features = load(open(filename, 'rb'))
    # filter features
    features = {k: all_features[k] for k in dataset}
    return features

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# photo features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))
```

Running this example first loads the 6,000 photo identifiers in the training dataset. These identifiers are then used to filter and load the cleaned description text and the pre-computed photo features.

We are nearly there.

```py
Dataset: 6,000
Descriptions: train=6,000
Photos: train=6,000
```
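To make the filter-and-wrap step above concrete without needing the real dataset files, here is a minimal sketch that applies the same idea to a tiny in-memory example. The identifiers and captions are invented for illustration, and `parse_set()` and `wrap_descriptions()` are hypothetical helper names that mirror, but are not, the tutorial's `load_set()` and `load_clean_descriptions()`.

```py
# Toy illustration: keep only the descriptions whose photo identifier is in a
# given set, and wrap each one in 'startseq' / 'endseq' tokens.
# All data below is made up; the helper names are hypothetical.

def parse_set(doc):
    """Return the set of identifiers from a newline-separated list of file names."""
    return {line.split('.')[0] for line in doc.split('\n') if line.strip()}

def wrap_descriptions(doc, dataset):
    """Map each identifier in `dataset` to its list of wrapped descriptions."""
    descriptions = {}
    for line in doc.split('\n'):
        tokens = line.split()
        if not tokens:
            continue  # skip empty lines
        image_id, words = tokens[0], tokens[1:]
        if image_id in dataset:
            wrapped = 'startseq ' + ' '.join(words) + ' endseq'
            descriptions.setdefault(image_id, []).append(wrapped)
    return descriptions

# stand-ins for the contents of a train-images file and descriptions.txt
train_doc = 'photo_a.jpg\nphoto_b.jpg\n'
desc_doc = ('photo_a dog runs on grass\n'
            'photo_a brown dog playing\n'
            'photo_c child on swing\n'
            'photo_b two men climbing')

train_ids = parse_set(train_doc)
train_desc = wrap_descriptions(desc_doc, train_ids)
for key in sorted(train_desc):
    print(key, train_desc[key])
```

In this toy run, `photo_c` is dropped because it is not in the training set, which is the same reason the real run reports 6,000 descriptions for 6,000 training identifiers.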
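The description text needs to be encoded as numbers before it can be presented to the model as input or compared to the model's predictions.

The first step in encoding the data is to create a consistent mapping from words to unique integer values. Keras provides the _Tokenizer_ class that can learn this mapping from the loaded description data.

Below defines _to_lines()_ to convert the dictionary of descriptions into a list of strings, and the _create_tokenizer()_ function that fits a Tokenizer given the loaded photo description text.

```py
# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        for d in descriptions[key]:
            all_desc.append(d)
    return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
    lines = to_lines(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
```

We can now encode the text.

Each description will be split into words. The model will be provided one word and the photo, and will generate the next word. Then the first two words of the description will be provided to the model as input, along with the image, to generate the next word. This is how the model will be trained.

For example, the input sequence "_little girl running in field_" would be split into 6 input-output pairs to train the model:

```py
X1,      X2 (text sequence),                             y (word)
photo    startseq,                                       little
photo    startseq, little,                               girl
photo    startseq, little, girl,                         running
photo    startseq, little, girl, running,                in
photo    startseq, little, girl, running, in,            field
photo    startseq, little, girl, running, in, field,     endseq
```

Later, when the model is used to generate descriptions, the generated words will be concatenated and recursively provided as input to generate a caption for an image.

The function below named _create_sequences()_, given the tokenizer, a maximum sequence length, and the dictionaries of all descriptions and photos, will transform the data into input-output pairs for training the model. There are two input arrays to the model: one for photo features and one for the encoded text. There is one output for the model, which is the encoded next word in the text sequence.

The input text is encoded as integers, which will be fed to a word embedding layer. The photo features will be fed directly to another part of the model. The model will output a prediction, which is a probability distribution over all words in the vocabulary.

The output data will therefore be a one-hot encoded version of each word, representing an idealized probability distribution with a value of 0 at every word position except the actual word's position, which has a value of 1.

```py
# create sequences of images, input sequences and output words for an image
# note: vocab_size is read from module scope, so it must be defined before calling
def create_sequences(tokenizer, max_length, descriptions, photos):
    X1, X2, y = list(), list(), list()
    # walk through each image identifier
    for key, desc_list in descriptions.items():
        # walk through each description for the image
        for desc in desc_list:
            # encode the sequence
            seq = tokenizer.texts_to_sequences([desc])[0]
            # split one sequence into multiple X,y pairs
            for i in range(1, len(seq)):
                # split into input and output pair
                in_seq, out_seq = seq[:i], seq[i]
                # pad input sequence
                in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                # encode output sequence
                out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                # store
                X1.append(photos[key][0])
                X2.append(in_seq)
                y.append(out_seq)
    return array(X1), array(X2), array(y)
```

We will need to calculate the maximum number of words in the longest description. A short helper function named _max_length()_ is defined below.

```py
# calculate the length of the description with the most words
def max_length(descriptions):
    lines = to_lines(descriptions)
    return max(len(d.split()) for d in lines)
```

We now have enough to load the data for the training and development datasets and transform the loaded data into input-output pairs for fitting a deep learning model.

### Defining the Model

We will define a deep learning model based on the "_merge-model_" described by Marc Tanti, et al. in their 2017 papers:

* [Where to put the Image in an Image Caption Generator](https://arxiv.org/abs/1703.09137), 2017.
* [What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?](https://arxiv.org/abs/1708.02043), 2017.

For a gentle introduction to this architecture, see the post:

* [Caption Generation with the Inject and Merge Architectures for the Encoder-Decoder Model](https://machinelearningmastery.com/caption-generation-inject-merge-architectures-encoder-decoder-model/)

The authors provide a nice schematic of the model, reproduced below.

![Schematic of the Merge Model For Image Captioning](img/a5a04b56f81f1075fd690ba33b5bc864.jpg)

Schematic of the Merge Model for Image Captioning

We will describe the model in three parts:

* **Photo Feature Extractor**. This is a 16-layer VGG model pre-trained on the ImageNet dataset. We have pre-processed the photos with the VGG model (without the output layer) and will use the extracted features predicted by this model as input.
* **Sequence Processor**. This is a word embedding layer for handling the text input, followed by a Long Short-Term Memory (LSTM) recurrent neural network layer.
* **Decoder** (for lack of a better name). Both the feature extractor and the sequence processor output a fixed-length vector. These are merged together and processed by a Dense layer to make a final prediction.

The Photo Feature Extractor model expects input photo features to be a vector of 4,096 elements. These are processed by a Dense layer to produce a 256 element representation of the photo.

The Sequence Processor model expects input sequences with a pre-defined length (34 words), which are fed into an Embedding layer that uses a mask to ignore padded values. This is followed by an LSTM layer with 256 memory units.

Both input models produce a 256 element vector. Further, both input models use regularization in the form of 50% dropout. This is to reduce overfitting the training dataset, as this model configuration learns very fast.

The Decoder model merges the vectors from both input models using an addition operation. This is then fed to a Dense layer of 256 neurons and then to a final output Dense layer that makes a softmax prediction over the entire output vocabulary for the next word in the sequence.

The function below named _define_model()_ defines and returns the model ready to be fit.

```py
# define the captioning model
def define_model(vocab_size, max_length):
    # feature extractor model
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # decoder model
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    # summarize model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model
```

To get a sense of the structure of the model, specifically the shapes of the layers, see the summary listed below.

```py
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
input_2 (InputLayer)             (None, 34)            0
____________________________________________________________________________________________________
input_1 (InputLayer)             (None, 4096)          0
____________________________________________________________________________________________________
embedding_1 (Embedding)          (None, 34, 256)       1940224     input_2[0][0]
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 4096)          0           input_1[0][0]
____________________________________________________________________________________________________
dropout_2 (Dropout)              (None, 34, 256)       0           embedding_1[0][0]
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 256)           1048832     dropout_1[0][0]
____________________________________________________________________________________________________
lstm_1 (LSTM)                    (None, 256)           525312      dropout_2[0][0]
____________________________________________________________________________________________________
add_1 (Add)                      (None, 256)           0           dense_1[0][0]
                                                                   lstm_1[0][0]
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 256)           65792       add_1[0][0]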
____________________________________________________________________________________________________
dense_3 (Dense)                  (None, 7579)          1947803     dense_2[0][0]
====================================================================================================
Total params: 5,527,963
Trainable params: 5,527,963
Non-trainable params: 0
____________________________________________________________________________________________________
```

We also create a plot to visualize the structure of the network, which better helps in understanding the two streams of input.

![Plot of the Caption Generation Deep Learning Model](img/3a9ec93ec57895a672f3fd9adac0be96.jpg)

Plot of the Caption Generation Deep Learning Model

### Fitting the Model

Now that we know how to define the model, we can fit it on the training dataset.

The model learns fast and quickly overfits the training dataset. For this reason, we will monitor the skill of the trained model on the holdout development dataset. When the skill of the model on the development dataset improves at the end of an epoch, we will save the whole model to file.

At the end of the run, we can then use the saved model with the best skill on the development dataset as our final model.

We can do this by defining a _ModelCheckpoint_ in Keras and specifying that it monitor the minimum loss on the validation dataset and save the model to a file that has both the training and validation loss in the filename.

```py
# define checkpoint callback
filepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
```

We can then specify the checkpoint in the call to _fit()_ via the _callbacks_ argument. We must also specify the development dataset in _fit()_ via the _validation_data_ argument.

We will only fit the model for 20 epochs, but given the amount of training data, each epoch may take 30 minutes on modern hardware.

```py
# fit model
model.fit([X1train, X2train], ytrain, epochs=20, verbose=2, callbacks=[checkpoint], validation_data=([X1test, X2test], ytest))
```

### Complete Example

The complete example for fitting the model on the training data is listed below.

```py
from numpy import array
from pickle import load
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.utils import plot_model
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.layers import Dropout
from keras.layers.merge import add
from keras.callbacks import ModelCheckpoint

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # skip empty lines
        if len(tokens) < 1:
            continue
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # create list
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap description in tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            # store
            descriptions[image_id].append(desc)
    return descriptions

# load photo features
def load_photo_features(filename, dataset):
    # load all features
    all_features = load(open(filename, 'rb'))
    # filter features
    features = {k: all_features[k] for k in dataset}
    return features

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        for d in descriptions[key]:
            all_desc.append(d)
    return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
    lines = to_lines(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# calculate the length of the description with the most words
def max_length(descriptions):
    lines = to_lines(descriptions)
    return max(len(d.split()) for d in lines)

# create sequences of images, input sequences and output words for an image
# note: vocab_size is read from module scope, so it must be defined before calling
def create_sequences(tokenizer, max_length, descriptions, photos):
    X1, X2, y = list(), list(), list()
    # walk through each image identifier
    for key, desc_list in descriptions.items():
        # walk through each description for the image
        for desc in desc_list:
            # encode the sequence
            seq = tokenizer.texts_to_sequences([desc])[0]
            # split one sequence into multiple X,y pairs
            for i in range(1, len(seq)):
                # split into input and output pair
                in_seq, out_seq = seq[:i], seq[i]
                # pad input sequence
                in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                # encode output sequence
                out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                # store
                X1.append(photos[key][0])
                X2.append(in_seq)
                y.append(out_seq)
    return array(X1), array(X2), array(y)

# define the captioning model
def define_model(vocab_size, max_length):
    # feature extractor model
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # decoder model
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    # summarize model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model

# train dataset

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# photo features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
# (this rebinds the name max_length from the helper function to an integer)
max_length = max_length(train_descriptions)
print('Description Length: %d' % max_length)
# prepare sequences
X1train, X2train, ytrain = create_sequences(tokenizer, max_length, train_descriptions, train_features)

# dev dataset

# load dev set (the variables below are named "test" but hold the development data)
filename = 'Flickr8k_text/Flickr_8k.devImages.txt'
test = load_set(filename)
print('Dataset: %d' % len(test))
# descriptions
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: test=%d' % len(test_descriptions))
# photo features
test_features = load_photo_features('features.pkl', test)
print('Photos: test=%d' % len(test_features))
# prepare sequences
X1test, X2test, ytest = create_sequences(tokenizer, max_length, test_descriptions, test_features)

# fit model

# define the model
model = define_model(vocab_size, max_length)
# define checkpoint callback
filepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
# fit model
model.fit([X1train, X2train], ytrain, epochs=20, verbose=2, callbacks=[checkpoint], validation_data=([X1test, X2test], ytest))
```

Running the example first prints a summary of the loaded training and development datasets.

```py
Dataset: 6,000
Descriptions: train=6,000
Photos: train=6,000
Vocabulary Size: 7,579
Description Length: 34
Dataset: 1,000
Descriptions: test=1,000
Photos: test=1,000
```

After the summary of the model, we can get an idea of the total number of training and validation (development) input-output pairs.

```py
Train on 306,404 samples, validate on 50,903 samples
```

The model then runs, saving the best model to .h5 files along the way.

On my run, the best validation results were saved to the file:

* _model-ep002-loss3.245-val_loss3.612.h5_

This model was saved at the end of epoch 2 with a loss of 3.245 on the training dataset and a loss of 3.612 on the development dataset.

Your specific results will vary.

Let me know what you get in the comments below.

If you ran the example on AWS, copy the model file back to your current working directory. If you need help with commands on AWS, see the post:

* [10 Command Line Recipes for Deep Learning on Amazon Web Services](https://machinelearningmastery.com/command-line-recipes-deep-learning-amazon-web-services/)

Did you get an error like:

```py
Memory Error
```

If so, see the next section.

## Train With Progressive Loading

**Note**: If you had no problems in the previous section, please skip this section. This section is for those who do not have enough memory to train the model as described in the previous section (e.g. you cannot use AWS EC2 for whatever reason).

The training of the caption model does assume you have a lot of RAM.

The code in the previous section is not memory efficient and assumes you are running on a large EC2 instance with 32GB or 64GB of RAM. If you are running the code on a workstation with 8GB of RAM, you will not be able to train the model.

A workaround is to use progressive loading. This was discussed in detail in the second-to-last section, titled
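"_Progressive Loading_", of the post:

* [How to Prepare a Photo Caption Dataset for Training a Deep Learning Model](https://machinelearningmastery.com/prepare-photo-caption-dataset-training-deep-learning-model/)

I recommend reading that section before continuing.

If you want to use progressive loading to train this model, this section will show you how.

The first step is to define a function that we can use as the data generator.

We will keep things very simple and have the data generator yield one photo's worth of data per batch. This will be all of the sequences generated for a photo and its set of descriptions.

The function _data_generator()_ below will be the data generator and takes the loaded textual descriptions, photo features, tokenizer, and maximum length. Here, I assume that you can fit this training data in memory, which I believe 8GB of RAM should be more than capable of.

How does this work? Read the post mentioned above, which introduces data generators.

```py
# data generator, intended to be used in a call to model.fit_generator()
def data_generator(descriptions, photos, tokenizer, max_length):
    # loop for ever over images
    while 1:
        for key, desc_list in descriptions.items():
            # retrieve the photo feature
            photo = photos[key][0]
            in_img, in_seq, out_word = create_sequences(tokenizer, max_length, desc_list, photo)
            yield [[in_img, in_seq], out_word]
```

You can see that we are calling the _create_sequences()_ function to create a batch's worth of data for a single photo rather than for an entire dataset. This means that we must update the _create_sequences()_ function to remove the "iterate over all descriptions" for-loop.

The updated function is as follows:

```py
# create sequences of images, input sequences and output words for an image
# note: vocab_size is read from module scope, so it must be defined before calling
def create_sequences(tokenizer, max_length, desc_list, photo):
    X1, X2, y = list(), list(), list()
    # walk through each description for the image
    for desc in desc_list:
        # encode the sequence
        seq = tokenizer.texts_to_sequences([desc])[0]
        # split one sequence into multiple X,y pairs
        for i in range(1, len(seq)):
            # split into input and output pair
            in_seq, out_seq = seq[:i], seq[i]
            # pad input sequence
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
            # encode output sequence
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            # store
            X1.append(photo)
            X2.append(in_seq)
            y.append(out_seq)
    return array(X1), array(X2), array(y)
```

We now have pretty much everything we need.

Note that this is a very basic data generator. The big memory saving it offers is that the unrolled sequences of train and test data are not held in memory prior to fitting the model; these samples (e.g. the results from _create_sequences()_) are created as needed, one photo at a time.

Some off-the-cuff ideas for further improving this data generator include:

* Randomize the order of photos each epoch.
* Work with a list of photo ids and load text and photo data as needed to cut back on memory even further.
* Yield more than one photo's worth of samples per batch.

I have experimented with these variations myself in the past. Let me know if you do, and how you go, in the comments.

You can sanity check a data generator by calling it directly, as follows:

```py
# test the data generator
generator = data_generator(train_descriptions, train_features, tokenizer, max_length)
inputs, outputs = next(generator)
print(inputs[0].shape)
print(inputs[1].shape)
print(outputs.shape)
```

Running this sanity check will show what one batch's worth of sequences looks like, in this case 47 samples to train on for the first photo.

```py
(47, 4096)
(47, 34)
(47, 7579)
```

Finally, we can use the _fit_generator()_ function on the model to train the model with this data generator.

In this simple example, we will discard the loading of the development dataset and model checkpointing, and simply save the model after each training epoch. You can then go back and load and evaluate each saved model after training to find the one with the lowest loss, which you can then use in the next section.

The code to train the model with the data generator is as follows:

```py
# train the model, run epochs manually and save after each epoch
epochs = 20
steps = len(train_descriptions)
for i in range(epochs):
    # create the data generator
    generator = data_generator(train_descriptions, train_features, tokenizer, max_length)
    # fit for one epoch
    model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    # save model
    model.save('model_' + str(i) + '.h5')
```

That's it. You can now train the model using progressive loading and save a lot of RAM. This may also be a lot slower.

The complete updated example with progressive loading (use of the data generator) for training the caption generation model is listed below.

```py
from numpy import array
from pickle import load
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.utils import plot_model
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.layers import Dropout
from keras.layers.merge import add
from keras.callbacks import ModelCheckpoint

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document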
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # create list
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap description in tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            # store
            descriptions[image_id].append(desc)
    return descriptions

# load photo features
def load_photo_features(filename, dataset):
    # load all features
    all_features = load(open(filename, 'rb'))
    # filter features
    features = {k: all_features[k] for k in dataset}
    return features

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        all_desc.extend(descriptions[key])
    return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
    lines = to_lines(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# calculate the length of the description with the most words
def max_length(descriptions):
    lines = to_lines(descriptions)
    return max(len(d.split()) for d in lines)

# create sequences of images, input sequences and output words for an image
# note: relies on the module-level vocab_size variable
def create_sequences(tokenizer, max_length, desc_list, photo):
    X1, X2, y = list(), list(), list()
    # walk through each description for the image
    for desc in desc_list:
        # encode the sequence
        seq = tokenizer.texts_to_sequences([desc])[0]
        # split one sequence into multiple X,y pairs
        for i in range(1, len(seq)):
            # split into input and output pair
            in_seq, out_seq = seq[:i], seq[i]
            # pad input sequence
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
            # encode output sequence
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            # store
            X1.append(photo)
            X2.append(in_seq)
            y.append(out_seq)
    return array(X1), array(X2), array(y)

# define the
# captioning model
def define_model(vocab_size, max_length):
    # feature extractor model
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # decoder model
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    # summarize model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model

# data generator, intended to be used in a call to model.fit_generator()
def data_generator(descriptions, photos, tokenizer, max_length):
    # loop for ever over images
    while 1:
        for key, desc_list in descriptions.items():
            # retrieve the photo feature
            photo = photos[key][0]
            in_img, in_seq, out_word = create_sequences(tokenizer, max_length, desc_list, photo)
            yield [[in_img, in_seq], out_word]

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# photo features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max_length(train_descriptions)
print('Description Length: %d' % max_length)

# define the model
model = define_model(vocab_size, max_length)
# train the model, run epochs manually and save after each epoch
epochs = 20
steps = len(train_descriptions)
for i in range(epochs):
    # create the data generator
    generator = data_generator(train_descriptions, train_features, tokenizer, max_length)
    # fit for one epoch
    model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    # save model
    model.save('model_' + str(i) + '.h5')
```

Perhaps evaluate each saved model and choose the final model with the lowest loss on a holdout dataset. The next section may help with this.

Did you use this new addition to the tutorial? How did you go?

## Evaluate Model

Once the model is fit, we can evaluate the skill of its predictions on the holdout test dataset.

We will evaluate a model by generating descriptions for all photos in the test dataset and evaluating those predictions with a standard cost function.

First, we need to be able to generate a description for a photo using a trained model.

This involves passing in the start description token '_startseq_', generating one word, then calling the model recursively with the generated words as input until the end of sequence token '_endseq_' is reached or the maximum description length is reached.

The function below named _generate_desc()_ implements this behavior and generates a textual description given a trained model and a prepared photo as input. It calls the function _word_for_id()_ in order to map an integer prediction back to a word.

```py
# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
    # seed the generation process
    in_text = 'startseq'
    # iterate over the whole length of the sequence
    for i in range(max_length):
        # integer encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        # pad input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # predict next word
        yhat = model.predict([photo,sequence], verbose=0)
        # convert probability to integer
        yhat = argmax(yhat)
        # map integer to word
        word = word_for_id(yhat, tokenizer)
        # stop if we cannot map the word
        if word is None:
            break
        # append as input for generating the next word
        in_text += ' ' + word
        # stop if we predict the end of the sequence
        if word == 'endseq':
            break
    return in_text
```

We will generate predictions for all photos in the test dataset (the same function can also be applied to the training dataset).

The function below named _evaluate_model()_ will evaluate a trained model against a given dataset of photo descriptions and photo features. The actual and predicted descriptions are collected and evaluated collectively using the corpus BLEU score, which summarizes how close the generated text is to the expected text.

```py
# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
    actual, predicted = list(), list()
    # step over the whole set
    for key, desc_list in descriptions.items():
        # generate description
        yhat = generate_desc(model, tokenizer, photos[key], max_length)
        # store actual and predicted
        references = [d.split() for d in desc_list]
        actual.append(references)
        predicted.append(yhat.split())
    # calculate BLEU score
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))
```

BLEU scores are used in text translation for evaluating translated text against one or more reference translations.

Here, we compare each generated description against all of the reference descriptions for the photograph. We then calculate BLEU scores for 1, 2, 3 and 4 cumulative n-grams.

You can learn more about the BLEU score here:

* [A Gentle Introduction to Calculating the BLEU Score for Text in Python](https://machinelearningmastery.com/calculate-bleu-score-for-text-python/)

The [NLTK Python library implements the BLEU score calculation](http://www.nltk.org/api/nltk.translate.html) in the _corpus_bleu()_ function. A higher score close to 1.0 is better, a score closer to zero is worse.

We can put all of this together with the functions from the previous section for loading the data. We first need to load the training dataset in order to prepare a Tokenizer so that we can encode generated words as input sequences for the model. It is critical that we encode the generated words using exactly the same encoding scheme as was used when training the model.

We then use these functions to load the test dataset.

The complete example is listed below.

```py
from numpy import argmax
from pickle import load
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from nltk.translate.bleu_score import corpus_bleu

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # create list
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap description in tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            # store
            descriptions[image_id].append(desc)
    return descriptions

# load photo features
def load_photo_features(filename, dataset):
    # load all features
    all_features = load(open(filename, 'rb'))
    # filter features
    features = {k: all_features[k] for k in dataset}
    return features

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        all_desc.extend(descriptions[key])
    return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
    lines = to_lines(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# calculate the length of the description with the most words
def max_length(descriptions):
    lines = to_lines(descriptions)
    return max(len(d.split()) for d in lines)

# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
    # seed the generation process
    in_text = 'startseq'
    # iterate over the whole length of the sequence
    for i in range(max_length):
        # integer encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        # pad input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # predict next word
        yhat = model.predict([photo,sequence], verbose=0)
        # convert probability to integer
        yhat = argmax(yhat)
        # map integer to word
        word = word_for_id(yhat, tokenizer)
        # stop if we cannot map the word
        if word is None:
            break
        # append as input for generating the next word
        in_text += ' ' + word
        # stop if we predict the end of the
        # sequence
        if word == 'endseq':
            break
    return in_text

# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
    actual, predicted = list(), list()
    # step over the whole set
    for key, desc_list in descriptions.items():
        # generate description
        yhat = generate_desc(model, tokenizer, photos[key], max_length)
        # store actual and predicted
        references = [d.split() for d in desc_list]
        actual.append(references)
        predicted.append(yhat.split())
    # calculate BLEU score
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))

# prepare tokenizer on train set

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max_length(train_descriptions)
print('Description Length: %d' % max_length)

# prepare test set

# load test set
filename = 'Flickr8k_text/Flickr_8k.testImages.txt'
test = load_set(filename)
print('Dataset: %d' % len(test))
# descriptions
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: test=%d' % len(test_descriptions))
# photo features
test_features = load_photo_features('features.pkl', test)
print('Photos: test=%d' % len(test_features))

# load the model
filename = 'model-ep002-loss3.245-val_loss3.612.h5'
model = load_model(filename)
# evaluate model
evaluate_model(model, test_descriptions,
    test_features, tokenizer, max_length)
```

Running the example prints the BLEU scores.

We can see that the scores sit at or close to the top of the expected range for a skillful model on the problem. The chosen model configuration is by no means optimized.

```py
BLEU-1: 0.579114
BLEU-2: 0.344856
BLEU-3: 0.252154
BLEU-4: 0.131446
```

## Generate New Captions

Now that we know how to develop and evaluate a caption generation model, how can we use it?

Almost everything we need to generate captions for entirely new photographs is in the model file.

We also need the Tokenizer for encoding generated words for the model while generating a sequence, and the maximum length of input sequences used when we defined the model (e.g. 34).

We can hard code the maximum sequence length. For the encoding of text, we can create the tokenizer and save it to a file so that we can load it quickly whenever we need it without needing the entire Flickr8K dataset. An alternative would be to use our own vocabulary file and a mapping-to-integers function during training.

We can create the Tokenizer as before and save it as a pickle file _tokenizer.pkl_. The complete example is listed below.

```py
from keras.preprocessing.text import Tokenizer
from pickle import dump

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # create list
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap description in tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            # store
            descriptions[image_id].append(desc)
    return descriptions

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        all_desc.extend(descriptions[key])
    return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
    lines = to_lines(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
# save the tokenizer
dump(tokenizer, open('tokenizer.pkl', 'wb'))
```

We can now load the tokenizer whenever we need it without having to load the entire annotated training dataset.

Now, let's generate a description for a new photograph.

Below is a new photograph that I chose randomly on Flickr (available under a permissive license).

![Photo of a dog at the beach.](img/1036583bcaf100d850a94df4e70324d4.jpg)

Photo of a dog at the beach. Photo by [bambe1964](https://www.flickr.com/photos/bambe1964/7837618434/), some rights reserved.

We will generate a description for it using our model.

Download the photograph and save it to your local directory with the filename “_example.jpg_”.

First, we must load the Tokenizer from _tokenizer.pkl_ and define the maximum length of the sequence to generate, which is needed for padding inputs.

```py
# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))
# pre-define the max sequence length (from training)
max_length = 34
```

Then we must load the model, as before.

```py
# load the model
model = load_model('model-ep002-loss3.245-val_loss3.612.h5')
```

Next, we must load the photo we wish to describe and extract the features.

We could do this by re-defining the model and adding the VGG-16 model to it, or we can use the VGG model to predict the features and use them as inputs to our existing model. We will do the latter and use a modified version of the _extract_features()_ function used during data preparation, adapted to work on a single photo.

```py
# extract features from a single photo
def extract_features(filename):
    # load the model
    model = VGG16()
    # re-structure the model: use the output of the second-to-last layer
    # (the 4,096-element vector) instead of the final classification layer
    model = Model(inputs=model.inputs, outputs=model.layers[-2].output)
    # load the photo
    image = load_img(filename, target_size=(224, 224))
    # convert the image pixels to a numpy array
    image = img_to_array(image)
    # reshape data for the model
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    # prepare the image for the VGG model
    image = preprocess_input(image)
    # get features
    feature = model.predict(image, verbose=0)
    return feature

# load and prepare the photograph
photo = extract_features('example.jpg')
```

We can then generate a description using the _generate_desc()_
function defined when evaluating the model.

The complete example for generating a description for an entirely new standalone photograph is listed below.

```py
from pickle import load
from numpy import argmax
from keras.preprocessing.sequence import pad_sequences
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.models import Model
from keras.models import load_model

# extract features from a single photo
def extract_features(filename):
    # load the model
    model = VGG16()
    # re-structure the model: use the output of the second-to-last layer
    # (the 4,096-element vector) instead of the final classification layer
    model = Model(inputs=model.inputs, outputs=model.layers[-2].output)
    # load the photo
    image = load_img(filename, target_size=(224, 224))
    # convert the image pixels to a numpy array
    image = img_to_array(image)
    # reshape data for the model
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    # prepare the image for the VGG model
    image = preprocess_input(image)
    # get features
    feature = model.predict(image, verbose=0)
    return feature

# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
    # seed the generation process
    in_text = 'startseq'
    # iterate over the whole length of the sequence
    for i in range(max_length):
        # integer encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        # pad input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # predict next word
        yhat = model.predict([photo,sequence], verbose=0)
        # convert probability to integer
        yhat = argmax(yhat)
        # map integer to word
        word = word_for_id(yhat, tokenizer)
        # stop if we cannot map the word
        if word is None:
            break
        # append as input for generating the next word
        in_text += ' ' + word
        # stop if we predict the end of the sequence
        if word == 'endseq':
            break
    return in_text

# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))
# pre-define the max sequence length (from training)
max_length = 34
# load the model
model = load_model('model-ep002-loss3.245-val_loss3.612.h5')
# load and prepare the photograph
photo = extract_features('example.jpg')
# generate description
description = generate_desc(model, tokenizer, photo, max_length)
print(description)
```

In this case, the description generated was as follows:

```py
startseq dog is running across the beach endseq
```

You could remove the start and end tokens and you would have the basis for a nice automatic photo captioning model.

It's like living in the future, guys! It still completely blows my mind that we can do this. Wow.

## Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

* **Alternate Pre-Trained Photo Models**. A small 16-layer VGG model was used for feature extraction. Consider exploring larger models that offer better performance on the ImageNet dataset, such as Inception.
* **Smaller Vocabulary**. A larger vocabulary of nearly eight thousand words was used in the development of the model. Many of the words supported may be misspellings or only used once in the entire dataset. Refine the vocabulary and reduce its size, perhaps by half.
* **Pre-trained Word Vectors**. The model learned the word vectors as part of fitting the model. Better performance may be achieved by using word vectors either pre-trained on the training dataset or trained on a much larger corpus of text, such as news articles or Wikipedia.
* **Tune Model**. The configuration of the model was not tuned on the problem. Explore alternate configurations and see if you can achieve better performance.

Did you try any of these extensions? Share your results in the comments below.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Caption Generation Papers

* [Show and Tell: A Neural Image Caption Generator](https://arxiv.org/abs/1411.4555), 2015.
* [Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](https://arxiv.org/abs/1502.03044), 2015.
* [Where to put the Image in an Image Caption Generator](https://arxiv.org/abs/1703.09137), 2017.
* [What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?](https://arxiv.org/abs/1708.02043), 2017.
* [Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures](https://arxiv.org/abs/1601.03896), 2016.

### Flickr8K Dataset

* [Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics](http://nlp.cs.illinois.edu/HockenmaierGroup/Framing_Image_Description/KCCA.html) (Homepage)
* [Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics](https://www.jair.org/media/3994/live-3994-7274-jair.pdf), (PDF) 2013.
* [Dataset Request Form](https://illinois.edu/fb/sec/1713398)
* [Old Flickr8K Homepage](http://nlp.cs.illinois.edu/HockenmaierGroup/8k-pictures.html)

### API

* [Keras Model API](https://keras.io/models/model/)
* [Keras pad_sequences() API](https://keras.io/preprocessing/sequence/#pad_sequences)
* [Keras Tokenizer API](https://keras.io/preprocessing/text/#tokenizer)
* [Keras VGG16 API](https://keras.io/applications/#vgg16)
* [Gensim word2vec API](https://radimrehurek.com/gensim/models/word2vec.html)
* [nltk.translate package API
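Documentation](http://www.nltk.org/api/nltk.translate.html)

## Summary

In this tutorial, you discovered how to develop a photo captioning deep learning model from scratch.

Specifically, you learned:

* How to prepare photo and text data ready for training a deep learning model.
* How to design and train a deep learning caption generation model.
* How to evaluate a trained caption generation model and use it to caption entirely new photographs.

Do you have any questions? Ask your questions in the comments below and I will do my best to answer.

**Note**: This is an excerpt chapter from: “[Deep Learning for Natural Language Processing](https://machinelearningmastery.com/deep-learning-for-nlp/)”. Take a look, if you want more step-by-step tutorials on getting the most out of deep learning methods when working with text data.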
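
As a small follow-up to the caption generation step, the generated text still carries the _startseq_ and _endseq_ markers. The sketch below shows one way they could be stripped before display. The helper name `clean_caption` and its parameters are hypothetical and not part of the tutorial code; it assumes the caption is a plain space-separated string such as the one returned by _generate_desc()_.

```python
# hypothetical helper (not part of the tutorial code): strip the start and
# end markers from a generated caption, assuming a space-separated string
def clean_caption(text, start_token='startseq', end_token='endseq'):
    words = text.split()
    # drop the leading start marker, if present
    if words and words[0] == start_token:
        words = words[1:]
    # drop the trailing end marker, if present
    if words and words[-1] == end_token:
        words = words[:-1]
    return ' '.join(words)

print(clean_caption('startseq dog is running across the beach endseq'))
# prints: dog is running across the beach
```

Only the first and last words are checked, so a caption cut off at the maximum length (with no _endseq_) is still handled without error.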