This project implements a deep learning pipeline for image captioning that combines CLIP as a vision encoder with GPT-2 as a language decoder. CLIP visual features are projected into GPT-2's input embedding space so the language model can generate natural language descriptions of images. The model is fine-tuned on the Microsoft COCO dataset, which contains over 82,000 training images and 40,000 validation images.
The best-performing model achieves high BERTScore metrics:
| Metric | Value |
|---|---|
| Precision | 0.912 |
| Recall | 0.902 |
| F1-Score | 0.906 |
Example captions generated by the model:

- "A car is parked in front of a car dealer."
- "A view of a city street at night."
- "A man holding a stuffed animal in his hand."
- "A cat that is laying down on a bed."
The model architecture is a hybrid design that combines CLIP (as a visual encoder) and GPT-2 (as a language decoder). The main idea is to project the visual embeddings extracted from an image into the input embedding space of GPT-2, allowing the language model to generate captions conditioned on visual content.
- **CLIP Encoder**: The input image is preprocessed with `CLIPProcessor` and encoded by `CLIPModel` into a 512-dimensional vector of high-level semantic features.

- **Projection Module**: A simple MLP (a linear layer followed by a `Tanh` activation) projects the CLIP embedding into a sequence of GPT-2-compatible embeddings, which are treated as visual tokens. In the default configuration, 5 visual tokens are generated, each of dimensionality 768.

- **GPT-2 Decoder**: The visual tokens are concatenated with the text tokens (caption inputs) at the embedding level, and the combined sequence is passed through GPT-2 to predict the caption. During training, the loss is computed only on the text tokens, ignoring the visual prefix. The full flow is sketched below.
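For concreteness, here is a minimal, self-contained sketch of the three components using the Hugging Face `transformers` API. The image path and caption are placeholders, and the repository's `ClipCaptionModel` may organize this differently:

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer

# 1) CLIP encoder: image -> 512-dimensional feature vector
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
image = Image.open("example.jpg").convert("RGB")  # placeholder image path
pixel_values = processor(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    clip_embedding = clip.get_image_features(pixel_values=pixel_values)  # shape (1, 512)

# 2) Projection module: linear layer + Tanh, reshaped into 5 visual tokens of size 768
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
visual_tokens_length, gpt2_dim = 5, gpt2.config.n_embd
projection = nn.Sequential(nn.Linear(512, visual_tokens_length * gpt2_dim), nn.Tanh())
visual_tokens = projection(clip_embedding).view(1, visual_tokens_length, gpt2_dim)

# 3) GPT-2 decoder: concatenate visual tokens with caption token embeddings
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
caption_ids = tokenizer("A cat that is laying down on a bed.", return_tensors="pt").input_ids
caption_embeds = gpt2.transformer.wte(caption_ids)
inputs_embeds = torch.cat([visual_tokens, caption_embeds], dim=1)

# The loss is computed only on the text tokens: the visual prefix is labelled -100,
# which the Hugging Face cross-entropy loss ignores.
labels = torch.cat([torch.full((1, visual_tokens_length), -100), caption_ids], dim=1)
outputs = gpt2(inputs_embeds=inputs_embeds, labels=labels)
print(outputs.loss)
```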
The training process is fully configurable through a JSON configuration file (`training_config.json`). This keeps the pipeline flexible: models, paths, and hyperparameters can be changed without modifying the core codebase. The available keys are listed below, followed by an example configuration file.
| Key | Description | Recommended Value |
|---|---|---|
| `clip_config.model` | Name of the pretrained CLIP model | `"openai/clip-vit-base-patch32"` |
| `clip_config.image_size` | Resize dimensions for input images `[H, W]` (optional; default is `[224, 224]`) | `[224, 224]` |
| `gpt2_config.model` | Name of the pretrained GPT-2 model | `"gpt2"` |
| `gpt2_config.visual_tokens_length` | Number of visual prefix tokens passed to GPT-2 | `5` |
| `gpt2_config.text_tokens_max_length` | Maximum number of tokens for the caption text | `10` |
| `gpt2_config.n_layers_to_freeze` | Number of GPT-2 attention layers to freeze (optional; default is `None`, i.e., all layers are trainable) | `10` |
| `data_paths.train_captions_path` | Path to the training captions JSON file | filename generated by `coco_captions.py` with `split = train` |
| `data_paths.val_captions_path` | Path to the validation captions JSON file | filename generated by `coco_captions.py` with `split = val` |
| `training_config.subset_ratio` | Ratio of the COCO dataset to use for each epoch (optional; default is `1.0`, i.e., the full dataset) | `0.4` |
| `training_config.batch_size` | Batch size for training and validation | `64` |
| `training_config.num_epochs` | Number of training epochs | `10` |
| `training_config.learning_rate` | Learning rate for the optimizer | `1e-4` |
| `training_config.enable_GPU` | Whether to use the GPU if available (optional; default is `False`) | `true` |
| `training_config.trained_model_path` | File path to save the best-performing model | — |
| `training_config.monitoring_plots_path` | File path to save the training curves image | — |
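For reference, a `training_config.json` following the keys above might look like this. The nesting is inferred from the dotted key names, and the file paths are placeholders rather than the actual filenames produced by `coco_captions.py`:

```json
{
  "clip_config": {
    "model": "openai/clip-vit-base-patch32",
    "image_size": [224, 224]
  },
  "gpt2_config": {
    "model": "gpt2",
    "visual_tokens_length": 5,
    "text_tokens_max_length": 10,
    "n_layers_to_freeze": 10
  },
  "data_paths": {
    "train_captions_path": "data/coco_captions_train.json",
    "val_captions_path": "data/coco_captions_val.json"
  },
  "training_config": {
    "subset_ratio": 0.4,
    "batch_size": 64,
    "num_epochs": 10,
    "learning_rate": 1e-4,
    "enable_GPU": true,
    "trained_model_path": "experiments/best_caption_model.pkl",
    "monitoring_plots_path": "figs/training_curves.png"
  }
}
```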
- **Load Configuration**: Load the training configuration from a `.json` file using `pydantic` validation (a possible schema is sketched after this list).

- **Initialize Pretrained Models**: Load the CLIP and GPT-2 models from Hugging Face using the names specified in the configuration.

- **Prepare Dataset and Dataloaders** (see the dataset sketch below):
  - Tokenize captions with the GPT-2 tokenizer.
  - Preprocess images using `CLIPProcessor`.
  - Generate image embeddings using `CLIPModel`.
  - Generate batches using the PyTorch `DataLoader`.

- **Build the Caption Model**:
  - Instantiate the `ClipCaptionModel`, which includes a projection layer to map CLIP embeddings to GPT-2 token space.
  - Instantiate the optimizer and the lists used to track losses and metrics.

- **Train the Model**: For each epoch:
  - Compute the loss between generated and true tokens.
  - Backpropagate gradients.
  - Perform optimization steps.

- **Validate on Validation Data**:
  - Evaluate model performance on the validation set.
  - Compute the validation loss and BERTScore (Precision, Recall, F1).

- **Save Best Model**: Save the model when it achieves the lowest validation loss, with a timestamped filename prefix (the training/validation/saving loop is sketched below). The model is saved as a `.pkl` file with the following content:
  - `caption_model_state_dict`: the state dictionary of the caption model.
  - `clip_config`: the configuration of the CLIP model.
  - `gpt2_config`: the configuration of the GPT-2 model.
  - `training_config`: the configuration of the training process.
  - `train_loss`: the mean training loss for that epoch.
  - `validation_loss`: the mean validation loss for that epoch (i.e., the lowest validation loss achieved so far).
  - `validation_bert_precision`: the BERTScore precision on the validation set.
  - `validation_bert_recall`: the BERTScore recall on the validation set.
  - `validation_bert_f1_score`: the BERTScore F1-score on the validation set.

- **Log and Visualize Metrics**:
  - Save plots of training/validation loss and BERT metrics per epoch. Examples can be found in the `figs` folder.
  - Logs are written to both console and file.
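To make the **Load Configuration** step concrete, here is a hypothetical sketch of a `pydantic` schema for `training_config.json`; the actual classes in the repository may be named and structured differently:

```python
import json
from typing import List, Optional
from pydantic import BaseModel

# Hypothetical schema mirroring the configuration table above.
class CLIPConfig(BaseModel):
    model: str
    image_size: List[int] = [224, 224]

class GPT2Settings(BaseModel):
    model: str
    visual_tokens_length: int
    text_tokens_max_length: int
    n_layers_to_freeze: Optional[int] = None  # None -> all layers trainable

class DataPaths(BaseModel):
    train_captions_path: str
    val_captions_path: str

class TrainingSettings(BaseModel):
    subset_ratio: float = 1.0
    batch_size: int
    num_epochs: int
    learning_rate: float
    enable_GPU: bool = False
    trained_model_path: str
    monitoring_plots_path: str

class PipelineConfig(BaseModel):
    clip_config: CLIPConfig
    gpt2_config: GPT2Settings
    data_paths: DataPaths
    training_config: TrainingSettings

# Loading raises a ValidationError if the file does not match the schema.
with open("training_config.json") as f:
    config = PipelineConfig(**json.load(f))
```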
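For the **Prepare Dataset and Dataloaders** step, a rough sketch might look like the following. The class name, field names, and the choice to compute CLIP embeddings inside `__getitem__` are illustrative assumptions, not the repository's exact implementation:

```python
import json
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import CLIPModel, CLIPProcessor, GPT2Tokenizer
from PIL import Image

class CocoCaptionDataset(Dataset):
    """Hypothetical dataset: one (image embedding, tokenized caption) pair per item."""

    def __init__(self, samples, text_tokens_max_length=10):
        # `samples` is assumed to be a list of {"image_path": ..., "caption": ...}
        # entries, e.g. loaded from the JSON file produced by coco_captions.py.
        self.samples = samples
        self.processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
        self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        self.tokenizer.pad_token = self.tokenizer.eos_token  # GPT-2 has no pad token
        self.max_len = text_tokens_max_length

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        # Preprocess the image with CLIPProcessor and encode it with CLIPModel.
        image = Image.open(sample["image_path"]).convert("RGB")
        pixel_values = self.processor(images=image, return_tensors="pt").pixel_values
        with torch.no_grad():
            image_embedding = self.clip.get_image_features(pixel_values=pixel_values)[0]
        # Tokenize the caption with the GPT-2 tokenizer, padded/truncated to max_len.
        tokens = self.tokenizer(
            sample["caption"],
            max_length=self.max_len,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        return {
            "image_embedding": image_embedding,            # 512-d CLIP feature
            "input_ids": tokens.input_ids.squeeze(0),      # caption token ids
            "attention_mask": tokens.attention_mask.squeeze(0),
            "caption_text": sample["caption"],             # raw caption for BERTScore
        }

# Batches are produced with the standard PyTorch DataLoader (placeholder path below).
with open("data/coco_captions_train.json") as f:
    train_samples = json.load(f)
train_loader = DataLoader(CocoCaptionDataset(train_samples), batch_size=64, shuffle=True)
```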
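Finally, the training / validation / best-model-saving steps could look roughly like this. Here `caption_model`, `optimizer`, the dataloaders, the `generate_captions` helper, and the batch keys are assumed names rather than the repository's actual API, and the `bert-score` package is used as one way to compute the BERTScore metrics:

```python
import pickle
import time
import torch
from bert_score import score as bert_score

best_val_loss = float("inf")
for epoch in range(config.training_config.num_epochs):
    # --- Training: compute loss on text tokens, backpropagate, optimize ---
    caption_model.train()
    train_losses = []
    for batch in train_loader:
        optimizer.zero_grad()
        loss = caption_model(batch["image_embedding"], batch["input_ids"],
                             batch["attention_mask"]).loss
        loss.backward()
        optimizer.step()
        train_losses.append(loss.item())

    # --- Validation: loss + BERTScore between generated and reference captions ---
    caption_model.eval()
    val_losses, generated, references = [], [], []
    with torch.no_grad():
        for batch in val_loader:
            val_losses.append(
                caption_model(batch["image_embedding"], batch["input_ids"],
                              batch["attention_mask"]).loss.item()
            )
            generated.extend(generate_captions(caption_model, batch))  # hypothetical helper
            references.extend(batch["caption_text"])
    precision, recall, f1 = bert_score(generated, references, lang="en")

    # --- Save the best model as a .pkl checkpoint with a timestamped filename prefix ---
    val_loss = sum(val_losses) / len(val_losses)
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        checkpoint = {
            "caption_model_state_dict": caption_model.state_dict(),
            "clip_config": config.clip_config,
            "gpt2_config": config.gpt2_config,
            "training_config": config.training_config,
            "train_loss": sum(train_losses) / len(train_losses),
            "validation_loss": val_loss,
            "validation_bert_precision": precision.mean().item(),
            "validation_bert_recall": recall.mean().item(),
            "validation_bert_f1_score": f1.mean().item(),
        }
        with open(f"{time.strftime('%Y%m%d_%H%M%S')}_best_model.pkl", "wb") as f:
            pickle.dump(checkpoint, f)
```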
- Clone the repository:

  ```bash
  git clone https://github.com/Lahdhirim/CV-image-captioning-clip-gpt2.git
  cd CV-image-captioning-clip-gpt2
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
There are two main execution modes:

- Training:

  ```bash
  python main.py train
  ```

- Inference:

  ```bash
  python main.py inference
  ```