This project implements an adaptive attention mechanism for image captioning, inspired by the 'Show, Attend and Tell' paper. It combines ResNet50 and LSTM with a sentinel gate to dynamically balance focus between visual features and language context.

rahul-vinay/ShowAttendTell

ShowAttendAndTell: Image Caption Generation with Adaptive Attention

This project implements an adaptive attention mechanism for image captioning, inspired by the "Show, Attend and Tell" paper. It dynamically balances focus between visual features and language context, achieving a baseline BLEU score of ~18.5 on the Flickr8k dataset.
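The balancing described above can be illustrated with a small sketch of a sentinel gate. This is not the repository's code: it is plain Python rather than PyTorch so it runs without dependencies, the function names `softmax` and `adaptive_context` are hypothetical, and the attention scores are passed in directly, whereas a trained model would compute them from the LSTM state and the ResNet50 region features. It assumes one common formulation, in which the sentinel competes with the image regions inside a single softmax.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_context(region_feats, region_scores, sentinel, sentinel_score):
    """Mix image-region features with a sentinel (language-context) vector.

    region_feats:   k feature vectors, each a list of d floats
    region_scores:  k unnormalised attention scores, one per region
    sentinel:       d floats summarising what the decoder already knows
    sentinel_score: unnormalised score for the sentinel
    Returns (context, beta); beta is the weight placed on the sentinel.
    """
    # The sentinel is appended as an extra "region" before normalising.
    weights = softmax(list(region_scores) + [sentinel_score])
    beta = weights[-1]
    d = len(sentinel)
    context = [beta * sentinel[j] for j in range(d)]
    for w, feat in zip(weights[:-1], region_feats):
        for j in range(d):
            context[j] += w * feat[j]
    return context, beta


# Toy example: two regions, two-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0]]
sentinel = [0.5, 0.5]

ctx_visual, beta_visual = adaptive_context(regions, [4.0, 0.0], sentinel, -4.0)
ctx_lang, beta_lang = adaptive_context(regions, [0.0, 0.0], sentinel, 8.0)
print(beta_visual, ctx_visual)  # small beta: context dominated by region 0
print(beta_lang, ctx_lang)      # beta near 1: context close to the sentinel
```

A beta near 0 means the next word is driven mostly by the image; a beta near 1 means the decoder relies on language context, which is typical for function words such as "of" or "the".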


📂 Project Overview

  • Objective: Generate captions by focusing on relevant image regions while dynamically incorporating language context.

  • Model: Combines a ResNet50-based encoder, an LSTM decoder, and adaptive attention with a sentinel gate.

  • Dataset: Flickr8k, with preprocessing for tokenization, padding, and vocabulary creation.

  • Evaluation: BLEU scores to measure caption quality.
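As a rough illustration of the preprocessing named in the Dataset entry, the sketch below tokenizes captions, builds a vocabulary, and pads encoded captions to a fixed length. The special tokens, the regex tokenizer, and the helper names (`tokenize`, `build_vocab`, `encode`) are assumptions made for this example; the repository may use different choices, for instance an NLTK tokenizer.

```python
import re
from collections import Counter

# Special tokens (assumed names, not taken from the repository).
PAD, START, END, UNK = "<pad>", "<start>", "<end>", "<unk>"


def tokenize(caption):
    """Lowercase and split a caption into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", caption.lower())


def build_vocab(captions, min_freq=1):
    """Map each token seen at least min_freq times to an integer id."""
    counts = Counter(tok for c in captions for tok in tokenize(c))
    vocab = {tok: i for i, tok in enumerate([PAD, START, END, UNK])}
    for tok, n in sorted(counts.items()):
        if n >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def encode(caption, vocab, max_len):
    """Encode a caption as exactly max_len ids (max_len must be >= 2).

    Long captions are truncated so that the end token is always kept;
    short ones are right-padded with the pad id.
    """
    tokens = tokenize(caption)[: max_len - 2]
    ids = [vocab[START]]
    ids += [vocab.get(t, vocab[UNK]) for t in tokens]
    ids.append(vocab[END])
    return ids + [vocab[PAD]] * (max_len - len(ids))


captions = ["A dog runs.", "A dog jumps over a log."]
vocab = build_vocab(captions)
print(encode("A dog runs.", vocab, 8))
print(encode("A cat runs.", vocab, 8))  # "cat" is unseen, so it maps to <unk>
```

Raising `min_freq` shrinks the vocabulary by sending rare words to `<unk>`, which is a common trade-off on a small dataset such as Flickr8k.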


📊 Results

  • BLEU Score: ~18.5 (baseline).
  • Demonstrated ability to generate grammatically correct captions, with room for improvement on complex scenes.
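For readers unfamiliar with the metric, here is a minimal, dependency-free sketch of corpus-level BLEU (clipped n-gram precision up to 4-grams, geometric mean, brevity penalty). It is a hand-rolled illustration, not the scoring code behind the number above: the project lists NLTK for BLEU, and this README does not say which BLEU order or smoothing was used. The sketch returns values in [0, 1]; a figure like ~18.5 is presumably reported on a 0-100 scale.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Unsmoothed corpus BLEU.

    references: one entry per image, each a list of reference token lists
    hypotheses: one generated token list per image
    """
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for refs, hyp in zip(references, hypotheses):
        hyp_len += len(hyp)
        # Use the reference length closest to the hypothesis (ties: shorter).
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            max_ref = Counter()
            for r in refs:
                max_ref |= ngrams(r, n)      # element-wise maximum
            clipped = hyp_counts & max_ref   # element-wise minimum
            matched[n - 1] += sum(clipped.values())
            total[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # no smoothing: any empty n-gram order gives zero
    log_prec = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)


refs = [[["a", "dog", "runs", "on", "grass"]]]
exact = corpus_bleu(refs, [["a", "dog", "runs", "on", "grass"]])
close = corpus_bleu(refs, [["a", "dog", "runs", "on", "the", "grass"]])
print(round(100 * exact, 1), round(100 * close, 1))
```

Because the score depends on exact n-gram overlap, a fluent caption that paraphrases the references can still score low, which is one reason the README lists METEOR and CIDEr as future directions.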

πŸ” Future Directions

  • Scale to larger datasets (e.g., Flickr30k, MS COCO).
  • Explore metrics like METEOR or CIDEr for contextual evaluation.
  • Integrate advanced spatial encodings for improved scene understanding.

🛠 Technologies

  • Python, PyTorch, Google Colab
  • ResNet50, LSTM, Adaptive Attention
  • BLEU Scoring, NLTK

📄 For More Details

For more detail, see the presentation (SAT - Adaptive Attention) and the project report (SAT Report).


📬 Contact

Developed by Rahul Vinay
Reach out: rvinay102000@gmail.com
