This repository contains an implementation of the AlignVLM paper, which proposes a novel method for vision-language alignment.
This is not the official implementation of the AlignVLM paper. There is a slight difference between the method proposed in the paper and this implementation, described below.
The ALIGN module from the paper (Section 3.1) can be written as follows, where F is the vision embeddings, W1 and W2 are the weights of the linear layers in ALIGN, and E_text is the LLM's text embedding matrix:

(1) P_vocab = softmax(LayerNorm(W2 * LayerNorm(W1 * F)))

(2) F_prime_align = P_vocab^T * E_text

(3) H_input = concat(F_prime_align, E_text(x))
This implementation differs in step (3). Instead of concatenating F_prime_align and E_text, it uses interleaved sequence packing of image and text tokens, as suggested by SmolVLM.
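For reference, here is a minimal PyTorch sketch of steps (1) and (2). The class name, argument names, and shape conventions are my own assumptions for illustration and do not mirror the exact code in this repo.

```python
import torch
import torch.nn as nn

class AlignConnector(nn.Module):
    """Sketch of the ALIGN module: maps vision features to convex
    combinations of the LLM's text embeddings (steps (1) and (2) above)."""

    def __init__(self, vision_dim: int, text_dim: int, vocab_size: int):
        super().__init__()
        self.w1 = nn.Linear(vision_dim, text_dim)   # W1
        self.norm1 = nn.LayerNorm(text_dim)
        self.w2 = nn.Linear(text_dim, vocab_size)   # W2
        self.norm2 = nn.LayerNorm(vocab_size)

    def forward(self, vision_feats, text_embeddings):
        # vision_feats:    (batch, num_patches, vision_dim)  -> F
        # text_embeddings: (vocab_size, text_dim)            -> E_text
        h = self.norm1(self.w1(vision_feats))
        p_vocab = torch.softmax(self.norm2(self.w2(h)), dim=-1)  # step (1)
        # Step (2): with these shapes, the matmul directly yields per-patch
        # weighted averages over the text embedding table.
        f_align = p_vocab @ text_embeddings  # (batch, num_patches, text_dim)
        return f_align
```

In this repo, the resulting aligned image tokens are then interleaved with the text token embeddings (SmolVLM-style packing in place of step (3)) before being fed to the LLM.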
This repo provides:
- Implementation of the ALIGN module from the AlignVLM paper
- An example of how the ALIGN module can be integrated into a VLM, using SmolVLM from Hugging Face
- Easy-to-use training and evaluation scripts
- Utilities to download and process the datasets
After training on just 300k image-caption pairs, the results on the COCO Captions dataset are shown below. Considering that SmolVLM was trained on roughly 3 million images and 30 million QA pairs (see The Cauldron), while the ALIGN module here was trained on only 300k images, these results give an indication that:
- If the ALIGN module is used as the connector in a VLM and trained on large-scale data, it has the potential to be a viable connector.
- If a VLM is trained from scratch with the ALIGN module as the connector, it might surpass the performance of VLMs with other connector modules.
Here is a visualization of the results:
Use requirements.txt to set up the dependencies.
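For example, with pip:

pip install -r requirements.txt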
Utilities in dataset_utils can help you download and prepare the datasets. The following datasets were used for training:
Dataset | Number of samples used | Link |
---|---|---|
Pixmo Cap | 100k | https://huggingface.co/datasets/allenai/pixmo-cap |
Localized Narratives | 200k | https://huggingface.co/datasets/HuggingFaceM4/the_cauldron/viewer/localized_narratives?views%5B%5D=localized_narratives |
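As a rough illustration (not the actual dataset_utils API), a 100k subset of Pixmo Cap could be pulled with the Hugging Face datasets library; the split name and sampling logic below are assumptions:

```python
from datasets import load_dataset

# Rough sketch only: dataset_utils in this repo may download and preprocess differently.
ds = load_dataset("allenai/pixmo-cap", split="train")
subset = ds.shuffle(seed=42).select(range(100_000))
print(subset[0])  # each row holds an image URL and its caption (exact field names may vary)
```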
- Prepare your dataset in the required format and place it at the path specified in the config.
- Run the training script:

python train.py --config configs/train_config.yaml
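The real schema of configs/train_config.yaml is defined by this repo; the snippet below is only a hypothetical illustration of the kind of fields such a config might hold (every key and value here is an assumption, not the repo's actual schema):

```yaml
# Hypothetical example -- keys are illustrative, not the repo's actual schema.
dataset_path: data/pixmo_cap_100k            # prepared dataset location
base_model: HuggingFaceTB/SmolVLM-Instruct   # SmolVLM checkpoint to build on
batch_size: 32
learning_rate: 1.0e-4
num_epochs: 1
output_dir: checkpoints/alignvlm
```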
You can use eval.py with configs/eval_config.yaml to evaluate the models. Both zero-shot and few-shot evaluation are possible.
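By analogy with the training command (the exact flags may differ):

python eval.py --config configs/eval_config.yaml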
I plan to make the checkpoints and datasets used available via Hugging Face.