## Overview
This repository focuses on multi-modal learning, integrating image processing and natural language understanding. The project generates descriptive captions for images by combining deep learning techniques from vision and language modeling.
## Features

- End-to-end text-to-image and image-to-text generation.
- ResNet18 for image feature extraction.
- A custom Transformer-based model for text generation.
- Extensive data augmentation to improve model generalization.
- Evaluation metrics: BLEU score, Cross-Entropy Loss, and accuracy.
## Table of Contents

- Project Architecture
- Dataset
- Setup Instructions
- Training and Validation
- Results
- Future Improvements
- Contributors
## Project Architecture

The project architecture consists of the following components:

- **Image Encoder:** A pretrained ResNet18 extracts visual features from each image; the features are projected into an embedding space of dimension `embedding_dim`.
- **Text Encoder:** Tokenized captions are embedded by a custom embedding layer that maps vocabulary tokens into the same embedding space as the image features.
- **Fusion Layer:** Combines the image and text embeddings; fully connected layers integrate both modalities.
- **Output Decoder:** Generates a sequence of tokens as the text caption, evaluated using Cross-Entropy Loss and BLEU score.
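The README does not include the model code itself, so the following is only a minimal PyTorch sketch of how these four components could be wired together. Class and parameter names (`CaptionModel`, `hidden_dim`, etc.) are illustrative assumptions, and the simple fully connected decoder here stands in for the project's Transformer-based text generator.

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    """Illustrative wiring of image encoder, text encoder, fusion layer, and decoder."""

    def __init__(self, vocab_size, embedding_dim=256, hidden_dim=512):
        super().__init__()
        # Image Encoder: pretrained ResNet18 with its classification head removed.
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])   # -> (B, 512, 1, 1)
        self.img_proj = nn.Linear(512, embedding_dim)              # project into embedding space

        # Text Encoder: maps vocabulary tokens into the same embedding space.
        self.token_emb = nn.Embedding(vocab_size, embedding_dim)

        # Fusion Layer: fully connected layers that integrate both modalities.
        self.fusion = nn.Sequential(
            nn.Linear(embedding_dim * 2, hidden_dim),
            nn.ReLU(),
        )

        # Output Decoder: predicts a distribution over the vocabulary per position.
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, T) token ids
        img_emb = self.img_proj(self.cnn(images).flatten(1))       # (B, embedding_dim)
        txt_emb = self.token_emb(captions)                         # (B, T, embedding_dim)

        # Broadcast the image embedding across caption positions and fuse.
        img_emb = img_emb.unsqueeze(1).expand(-1, txt_emb.size(1), -1)
        fused = self.fusion(torch.cat([img_emb, txt_emb], dim=-1))
        return self.decoder(fused)                                 # (B, T, vocab_size) logits
```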
## Dataset

This project uses the Flickr8k dataset:

- **Image directory:** contains 8,000 images.
- **Caption file:** each image is annotated with five captions.
- **Augmentation:** images are augmented with random flips, rotations, and color jittering to increase dataset variability (see the sketch below).
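The augmentation itself is implemented in `augmented.py`; the pipeline below only illustrates the listed augmentations using torchvision, with assumed parameter values.

```python
from torchvision import transforms

# Illustrative augmentation pipeline: random flips, rotations, and color
# jittering. The specific parameter values are assumptions, not the ones
# used in augmented.py.
train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),                    # ResNet18-friendly input size
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
```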
## Setup Instructions

1. Clone this repository:

   ```bash
   git clone https://github.com/STiFLeR7/Multi-Modal-Learning-for-Image-and-Text-Analysis
   cd Multi-Modal-Learning-for-Image-and-Text-Analysis
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Run the Python files:

   ```bash
   python augmented.py   # run data augmentation
   python train.py       # train the model
   python validate.py    # validate the model
   ```
## Training and Validation

### Training

The training process involves:

1. Cross-Entropy Loss for token predictions.
2. Gradient clipping to prevent exploding gradients.
3. Checkpointing to save the best model based on validation loss.
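These steps live in `train.py`; the loop below is only a sketch of how they commonly fit together in PyTorch. The model interface (`model(images, caption_inputs)` returning per-token logits) and all names are assumptions for illustration.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, optimizer, num_epochs, device="cuda"):
    """Sketch: cross-entropy training with gradient clipping and checkpointing."""
    criterion = nn.CrossEntropyLoss()
    best_val_loss = float("inf")

    for epoch in range(num_epochs):
        model.train()
        for images, captions in train_loader:
            images, captions = images.to(device), captions.to(device)

            # 1. Cross-Entropy Loss: predict each next token from the image
            #    and the preceding caption tokens.
            logits = model(images, captions[:, :-1])
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             captions[:, 1:].reshape(-1))

            optimizer.zero_grad()
            loss.backward()
            # 2. Gradient clipping to prevent exploding gradients.
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()

        # 3. Checkpointing: keep the weights with the lowest validation loss.
        model.eval()
        val_loss, n_batches = 0.0, 0
        with torch.no_grad():
            for images, captions in val_loader:
                images, captions = images.to(device), captions.to(device)
                logits = model(images, captions[:, :-1])
                val_loss += criterion(logits.reshape(-1, logits.size(-1)),
                                      captions[:, 1:].reshape(-1)).item()
                n_batches += 1
        val_loss /= max(n_batches, 1)
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), "best_model.pth")
```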
### Validation

Evaluation metrics include:

1. Validation loss: monitors overfitting.
2. BLEU score: evaluates sequence-to-sequence quality.
3. Accuracy: measures token-level prediction correctness.
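`validate.py` may compute these differently; the helper below only illustrates one way to obtain a corpus BLEU score (via NLTK) and a simple token-level accuracy, and the accuracy definition used here is an assumption.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def caption_metrics(references, hypotheses):
    """Illustrative BLEU and token-level accuracy for generated captions.

    references: one list per image of tokenized reference captions
                (Flickr8k provides five references per image).
    hypotheses: one tokenized predicted caption per image.
    """
    # Corpus-level BLEU with smoothing, since captions are short sequences.
    bleu = corpus_bleu(references, hypotheses,
                       smoothing_function=SmoothingFunction().method1)

    # Token-level accuracy: fraction of predicted tokens matching the first
    # reference at the same position (a simple, assumed definition).
    correct = total = 0
    for refs, hyp in zip(references, hypotheses):
        ref = refs[0]
        for i, tok in enumerate(hyp):
            total += 1
            correct += int(i < len(ref) and ref[i] == tok)
    return bleu, correct / max(total, 1)

# Toy usage with tokenized captions:
refs = [[["a", "man", "sitting", "on", "a", "park", "bench"]]]
hyps = [["a", "man", "on", "a", "bench"]]
print(caption_metrics(refs, hyps))
```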
## Results

| Image | Predicted Caption | Ground Truth Caption |
|---|---|---|
| ![]() | A man sitting on a bench in a park. | A person relaxing on a park bench. |
| ![]() | A group of people enjoying snow. | People hiking a snowy mountain. |

### Metrics

- Training Loss: 4.5278
- Validation BLEU Score: 0.6543
- Validation Accuracy: 83.45%
## Future Improvements

1. Implement Transformer-based decoders for more accurate caption generation.
2. Experiment with larger datasets such as COCO for better generalization.
3. Add beam search decoding for caption generation (see the sketch below).
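Beam search is not part of the current codebase; the function below sketches a generic version that could sit on top of a step-wise decoder. The `step_fn` interface (returning log-probabilities for candidate next tokens given a partial caption, with the encoded image assumed to be closed over) is purely illustrative.

```python
def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=20):
    """Generic beam search over token sequences (illustrative sketch).

    step_fn(tokens) is assumed to return a dict mapping candidate next tokens
    to their log-probabilities given the partial sequence `tokens`.
    """
    beams = [([start_token], 0.0)]   # (tokens, cumulative log-probability)
    completed = []

    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end_token:
                completed.append((tokens, score))
                continue
            for tok, logp in step_fn(tokens).items():
                candidates.append((tokens + [tok], score + logp))
        if not candidates:
            break
        # Keep only the beam_width highest-scoring partial captions.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]

    # Beams that never emitted the end token still count as candidates.
    completed.extend(b for b in beams if b[0][-1] != end_token)

    # Length-normalize so longer captions are not unfairly penalized.
    best = max(completed, key=lambda c: c[1] / max(len(c[0]), 1))
    return best[0]
```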
## Contributors

STiFLeR7 - Lead Developer, Researcher & Developer @ NIMS | AI/ML/DL | Tech Lead at CudaBit