
Multi-Modal Learning for Image and Text Analysis


Overview

This repository focuses on multi-modal learning, integrating image processing and natural language understanding. The project generates descriptive captions for images by combining deep learning techniques for vision and language.

Features

End-to-end image-to-text generation (image captioning).

Use of ResNet18 for image feature extraction.

A custom Transformer-based model for text generation.

Extensive data augmentation to improve model generalization.

Evaluation metrics: BLEU Score, Cross-Entropy Loss, and Accuracy.

Table of Contents

  1. Project Architecture

  2. Dataset

  3. Setup Instructions

  4. Training and Validation

  5. Results

  6. Future Improvements

  7. Contributors

Project Architecture

The project architecture consists of the following components (a minimal model sketch follows the list):

  1. Image Encoder:

    Pretrained ResNet18 extracts visual features from images.

    Features are projected into an embedding space of dimension embedding_dim.

  2. Text Encoder:

    Tokenizes the captions and embeds the tokens as feature vectors.

    A custom embedding layer maps vocabulary tokens into the same embedding space as the image features.

  3. Fusion Layer:

    Combines image and text embeddings for feature fusion.

    Fully connected layers integrate both modalities.

  4. Output Decoder:

    Generates a sequence of tokens as text captions.

    Evaluated using Cross-Entropy Loss and BLEU Score.
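
The sketch below shows one way these four components could fit together in PyTorch. The class name CaptionModel, the vocab_size, embedding_dim, and layer-count defaults, and the use of Transformer layers as the decoder are illustrative assumptions, not the repository's exact implementation.

```python
# Minimal sketch of the four components above, assembled in PyTorch.
# Class names, vocab_size, embedding_dim, and layer counts are illustrative
# assumptions, not the repository's exact implementation.
import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size=5000, embedding_dim=256, num_layers=2, num_heads=4):
        super().__init__()
        # 1. Image Encoder: pretrained ResNet18 with its classifier removed,
        #    projected into the shared embedding space.
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.image_encoder = nn.Sequential(*list(resnet.children())[:-1])  # (B, 512, 1, 1)
        self.image_proj = nn.Linear(512, embedding_dim)

        # 2. Text Encoder: embed caption token ids into the same space.
        self.token_embedding = nn.Embedding(vocab_size, embedding_dim)

        # 3. Fusion Layer: fully connected layer over concatenated image and
        #    text embeddings.
        self.fusion = nn.Linear(2 * embedding_dim, embedding_dim)

        # 4. Output Decoder: Transformer layers with a causal mask predict the
        #    next token at each position.
        layer = nn.TransformerEncoderLayer(d_model=embedding_dim, nhead=num_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.output_head = nn.Linear(embedding_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, T) token ids
        img = self.image_proj(self.image_encoder(images).flatten(1))  # (B, E)
        txt = self.token_embedding(captions)                          # (B, T, E)
        img = img.unsqueeze(1).expand(-1, txt.size(1), -1)            # broadcast over T
        fused = self.fusion(torch.cat([img, txt], dim=-1))            # (B, T, E)
        mask = nn.Transformer.generate_square_subsequent_mask(txt.size(1))
        hidden = self.decoder(fused, mask=mask)
        return self.output_head(hidden)                               # (B, T, vocab_size)
```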

Dataset


This project uses the Flickr8k dataset:

Image Directory: Contains 8,000 images.

Caption File: Each image is annotated with five captions.

Augmentation: Images are augmented with random flips, rotations, and color jittering to increase dataset variability (a transform sketch follows).
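
A minimal torchvision pipeline matching the augmentations listed above might look as follows; the rotation range, jitter strengths, image size, and normalization statistics are assumed values, not necessarily those used in augmented.py.

```python
# Sketch of the augmentation pipeline described above, using torchvision.
# Rotation range, jitter strengths, image size, and normalization statistics
# are assumed values.
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),                       # standard ResNet18 input size
    transforms.RandomHorizontalFlip(p=0.5),              # random flips
    transforms.RandomRotation(degrees=15),               # random rotations
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2),              # color jittering
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],     # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
```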

Setup Instructions

  1. Clone this repository:

     git clone https://github.com/STiFLeR7/Multi-Modal-Learning-for-Image-and-Text-Analysis
     cd Multi-Modal-Learning-for-Image-and-Text-Analysis

  2. Install dependencies:

     pip install -r requirements.txt

  3. Run the Python files:

     python augmented.py   # run data augmentation
     python train.py       # train the model
     python validate.py    # validate the model

Training and Validation

Training


The training process involves the following (a minimal loop sketch follows the list):

1. Cross-Entropy Loss for token predictions.
2. Gradient Clipping to prevent exploding gradients.
3. Checkpointing to save the best model based on validation loss.
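
A minimal sketch of such a loop is shown below. The Adam optimizer, learning rate, clip norm, padding-token id, and checkpoint file name are assumptions rather than the repository's exact settings, and the model is expected to follow the interface of the architecture sketch above.

```python
# Training-loop sketch: cross-entropy loss, gradient clipping, and
# validation-loss checkpointing. Optimizer, learning rate, clip norm,
# padding id, and file name are assumptions.
import torch
import torch.nn as nn

@torch.no_grad()
def evaluate_loss(model, loader, criterion, device):
    model.eval()
    total, count = 0.0, 0
    for images, captions in loader:
        images, captions = images.to(device), captions.to(device)
        logits = model(images, captions[:, :-1])
        total += criterion(logits.reshape(-1, logits.size(-1)),
                           captions[:, 1:].reshape(-1)).item()
        count += 1
    return total / max(count, 1)

def train(model, train_loader, val_loader, epochs=20, lr=1e-4,
          device="cuda" if torch.cuda.is_available() else "cpu"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss(ignore_index=0)   # assume 0 is the padding token
    best_val_loss = float("inf")

    for epoch in range(epochs):
        model.train()
        for images, captions in train_loader:
            images, captions = images.to(device), captions.to(device)
            # Predict each token from the previous ones (teacher forcing).
            logits = model(images, captions[:, :-1])
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             captions[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            # Gradient clipping to prevent exploding gradients.
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()

        # Checkpointing: keep the weights with the lowest validation loss.
        val_loss = evaluate_loss(model, val_loader, criterion, device)
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), "best_model.pth")
```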

Validation

Evaluation metrics include the following (a BLEU scoring sketch follows the list):

1. Validation Loss: Monitors overfitting.
2. BLEU Score: Evaluates the quality of generated sequences against the reference captions.
3. Accuracy: Measures the fraction of correctly predicted tokens.
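
The sketch below computes a corpus-level BLEU score with NLTK, assuming captions have already been tokenized; the smoothing choice is an assumption and may differ from what validate.py uses.

```python
# Sketch of corpus-level BLEU scoring with NLTK; the smoothing function and
# the tokenized-caption interface are assumptions.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_score(references, hypotheses):
    """references: one list of reference token lists per image (e.g. 5 captions);
    hypotheses: one predicted token list per image."""
    smooth = SmoothingFunction().method1   # avoid zero scores on short captions
    return corpus_bleu(references, hypotheses, smoothing_function=smooth)

# Example usage with already-tokenized captions:
refs = [[["a", "man", "rides", "a", "bike"], ["a", "person", "on", "a", "bicycle"]]]
hyps = [["a", "man", "on", "a", "bike"]]
print(f"BLEU: {bleu_score(refs, hyps):.4f}")
```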

Results

| Image | Predicted Caption | Ground Truth Caption |
|-------|-------------------|----------------------|
| 35506150_cbdb630f4f | A man sitting on a bench in a park. | A person relaxing on a park bench. |
| 57417274_d55d34e93e | A group of people enjoying snow. | People hiking a snowy mountain. |

Metrics

Training Loss: 4.5278

Validation BLEU Score: 0.6543

Validation Accuracy: 83.45%

Future Improvements

1. Implement Transformer-based decoders for more accurate caption generation.

2. Experiment with larger datasets like COCO for better generalization.

3. Add Beam Search Decoding for generating captions (a sketch of the idea follows).
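
As a sketch of that idea, the following beam search decodes a caption from the hypothetical CaptionModel defined in the architecture sketch above; the beam width, maximum length, and BOS/EOS token ids are assumptions.

```python
# Beam search decoding sketch over the hypothetical CaptionModel above.
# Beam width, maximum length, and BOS/EOS token ids are assumed values.
import torch
import torch.nn.functional as F

@torch.no_grad()
def beam_search(model, image, beam_width=3, max_len=20, bos_id=1, eos_id=2):
    model.eval()
    # Each beam is a (token_sequence, cumulative_log_prob) pair.
    beams = [([bos_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos_id:          # finished beams are kept as-is
                candidates.append((tokens, score))
                continue
            captions = torch.tensor([tokens])
            logits = model(image.unsqueeze(0), captions)      # (1, T, vocab)
            log_probs = F.log_softmax(logits[0, -1], dim=-1)  # next-token scores
            top_lp, top_ids = log_probs.topk(beam_width)
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((tokens + [idx], score + lp))
        # Keep only the beam_width highest-scoring sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(t[-1] == eos_id for t, _ in beams):
            break
    return beams[0][0]   # best-scoring token sequence
```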

Contributors

STiFLeR7 - Lead Developer, Researcher & Developer @ NIMS | AI/ML/DL | Tech Lead at CudaBit
