The aim of this project is to build a model able to produce captions starting from images. The model has an encoder-decoder architecture: the encoder is a pretrained ResNet-50, which extracts the feature vector later used as input to the decoder; the decoder is an RNN, specifically an LSTM, chosen to perform well in the presence of long captions.
The model is then trained on Microsoft COCO, where images are annotated by humans. Given an image, the model generates a caption for it, like in the picture below.
Example caption: "A red and white bus parked next to a building."

This project was developed as part of the Udacity Computer Vision Nanodegree and is presented here with some improvements.
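As a rough illustration of the encoder-decoder described above, here is a minimal PyTorch sketch; the class names, layer choices, and hyperparameters (embed_size, hidden_size) are illustrative assumptions, not necessarily the exact code found in model.py.

```python
# Minimal sketch of the CNN-RNN captioning model (illustrative, not the project's exact code).
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet50(pretrained=True)      # pretrained encoder
        for param in resnet.parameters():
            param.requires_grad_(False)                # keep the CNN frozen
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.embed = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        features = self.backbone(images)               # (batch, 2048, 1, 1)
        features = features.view(features.size(0), -1)
        return self.embed(features)                    # (batch, embed_size)

class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Teacher forcing: the image feature is fed as the first "word" of the sequence.
        embeddings = self.embed(captions[:, :-1])
        inputs = torch.cat((features.unsqueeze(1), embeddings), dim=1)
        hiddens, _ = self.lstm(inputs)
        return self.fc(hiddens)                        # word scores at every time step
```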
The project is divided into four main Python notebooks:
0.Dataset.ipynb : sets up the COCO API
1.Preliminaries.ipynb : creates the required vocabulary by preprocessing the COCO dataset
2.Training.ipynb : trains the CNN-RNN model
3.Inference.ipynb : uses the trained model to generate captions for images in the test dataset
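To make the inference step more concrete, below is a hedged sketch of greedy decoding with an LSTM decoder like the one sketched above; the function name and the decoder/vocabulary attributes (decoder.lstm, decoder.fc, vocab.idx2word) are assumptions for illustration, not the notebook's exact API.

```python
# Illustrative greedy decoding loop for caption generation (assumed names).
import torch

def greedy_caption(decoder, features, vocab, max_len=20):
    """Generate a caption from an image feature vector of shape (1, embed_size)."""
    words = []
    inputs = features.unsqueeze(1)                       # (1, 1, embed_size)
    states = None
    with torch.no_grad():
        for _ in range(max_len):
            hiddens, states = decoder.lstm(inputs, states)   # one LSTM step
            scores = decoder.fc(hiddens.squeeze(1))          # (1, vocab_size)
            predicted = scores.argmax(dim=1)                 # greedy choice of the next word
            word = vocab.idx2word[predicted.item()]
            if word == "<end>":
                break
            if word != "<start>":
                words.append(word)
            inputs = decoder.embed(predicted).unsqueeze(1)   # feed the chosen word back in
    return " ".join(words)
```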
In addition to the notebooks above, there are three Python scripts (.py):
model.py : Defines the convolutional neural network (encoder) and LSTM (decoder) architecture alongside the forward functions.
data_load.py : Helper functions to perform data loading.
vocabulary.py : Script to implement the Vocabulary class.
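As an indication of what the Vocabulary class does, here is a minimal sketch; the special tokens and method signatures are assumptions, not necessarily those used in vocabulary.py.

```python
# Minimal word <-> index vocabulary, built from the COCO caption annotations (illustrative).
class Vocabulary:
    def __init__(self):
        self.word2idx = {}
        self.idx2word = {}
        self.idx = 0
        for token in ("<start>", "<end>", "<unk>"):    # special tokens assumed by the decoder
            self.add_word(token)

    def add_word(self, word):
        if word not in self.word2idx:
            self.word2idx[word] = self.idx
            self.idx2word[self.idx] = word
            self.idx += 1

    def __call__(self, word):
        # Unknown words fall back to the <unk> token.
        return self.word2idx.get(word, self.word2idx["<unk>"])

    def __len__(self):
        return len(self.word2idx)
```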
The Microsoft Common Objects in COntext (MS COCO) dataset is a large-scale dataset for scene understanding. The dataset is commonly used to train and benchmark object detection, segmentation, and captioning algorithms. More information about the COCO dataset is available on the official website.
NOTE: The following steps are needed to initialize the COCO API, which is called inside the notebooks.
- Clone this repo: https://github.com/cocodataset/cocoapi
git clone https://github.com/cocodataset/cocoapi.git
- Set up the COCO API (also described in the readme here)
cd cocoapi/PythonAPI
make
cd ..
- Download some specific data from here: http://cocodataset.org/#download (described below)
- Under Annotations, download:
- 2014 Train/Val annotations [241MB] (extract captions_train2014.json and captions_val2014.json, and place at locations cocoapi/annotations/captions_train2014.json and cocoapi/annotations/captions_val2014.json, respectively)
- 2014 Testing Image info [1MB] (extract image_info_test2014.json and place at location cocoapi/annotations/image_info_test2014.json)
- Under Images, download:
- 2014 Train images [83K/13GB] (extract the train2014 folder and place at location cocoapi/images/train2014/)
- 2014 Val images [41K/6GB] (extract the val2014 folder and place at location cocoapi/images/val2014/)
- 2014 Test images [41K/6GB] (extract the test2014 folder and place at location cocoapi/images/test2014/)
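Once the annotations and images are in place, the installation can be sanity-checked with a short snippet like the one below; the path assumes the cocoapi folder layout described above.

```python
# Quick check that the COCO API and the caption annotations are set up correctly.
from pycocotools.coco import COCO

coco = COCO("cocoapi/annotations/captions_train2014.json")

# Pick one image id and print its human-written captions.
img_id = list(coco.imgs.keys())[0]
ann_ids = coco.getAnnIds(imgIds=img_id)
for ann in coco.loadAnns(ann_ids):
    print(ann["caption"])
```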
Possible future improvements:
- Use the validation set to guide the search of hyperparameters
- Use a bigger CNN encoder
- Train the model longer
- Implement more complex models like Oscar: Object-Semantics Aligned Pre-training for Vision-and-Language Tasks
LICENSE: This project is licensed under the terms of the MIT license.