LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport

Kyeongha Rho¹* · Hyeongkeun Lee¹* · Valentio Iverson² · Joon Son Chung¹
¹KAIST   ²University of Waterloo
*Equal contribution

This repository contains the official implementation of LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport, presented at ICASSP 2025. Our method introduces a novel approach to audio-visual captioning, achieving state-of-the-art performance on AudioCaps.

This implementation is designed to run on NVIDIA GPUs and uses GPU acceleration for efficient training and inference.

Abstract

Fig. 1. (a) Overview of the proposed LAVCap framework. (b) Detail of the Optimal Transport Fusion module.

Automated audio captioning is a task that generates textual descriptions for audio content, and recent studies have explored using visual information to enhance captioning quality. However, current methods often fail to effectively fuse audio and visual data, missing important semantic cues from each modality. To address this, we introduce LAVCap, a large language model (LLM)-based audio-visual captioning framework that effectively integrates visual information with audio to improve audio captioning performance. LAVCap employs an optimal transport-based alignment loss to bridge the modality gap between audio and visual features, enabling more effective semantic extraction. Additionally, we propose an optimal transport attention module that enhances audio-visual fusion using an optimal transport assignment map. Combined with the optimal training strategy, experimental results demonstrate that each component of our framework is effective. LAVCap outperforms existing state-of-the-art methods on the AudioCaps dataset, without relying on large datasets or post-processing.
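
The optimal transport components above can be summarised concretely. The snippet below is only a minimal sketch, not the repository's implementation: it computes a generic entropic (Sinkhorn) transport plan between audio and visual token features and uses it as a soft audio-to-visual assignment map. The tensor shapes, the regulariser eps, the iteration count, and the fusion line are all illustrative assumptions.

import torch
import torch.nn.functional as F

# Minimal sketch (not the repository's code): entropic OT between two token sets.
def sinkhorn_assignment(audio, visual, eps=0.05, n_iters=50):
    """Entropic optimal-transport plan between audio [Na, d] and visual [Nv, d] tokens."""
    a = F.normalize(audio, dim=-1)                 # L2-normalise audio tokens
    v = F.normalize(visual, dim=-1)                # L2-normalise visual tokens
    cost = 1.0 - a @ v.t()                         # cosine cost matrix [Na, Nv]
    K = torch.exp(-cost / eps)                     # Gibbs kernel
    mu = torch.full((a.size(0),), 1.0 / a.size(0), device=a.device)   # uniform audio marginal
    nu = torch.full((v.size(0),), 1.0 / v.size(0), device=v.device)   # uniform visual marginal
    r = torch.ones_like(mu)
    for _ in range(n_iters):                       # Sinkhorn-Knopp scaling iterations
        c = nu / (K.t() @ r)
        r = mu / (K @ c)
    return r.unsqueeze(1) * K * c.unsqueeze(0)     # transport plan [Na, Nv]

# Illustrative usage: Na=32 audio tokens, Nv=50 visual tokens, d=768 (assumed sizes).
audio_tok, visual_tok = torch.randn(32, 768), torch.randn(50, 768)
plan = sinkhorn_assignment(audio_tok, visual_tok)
attn = plan / plan.sum(dim=1, keepdim=True)        # row-normalise into attention weights
fused_audio = audio_tok + attn @ visual_tok        # attend over visual tokens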

Prepare Environment & Dataset

Create conda environment

conda env create -f lavcap.yaml
conda activate lavcap

Download and organize the AudioCaps dataset in the structure shown below inside the dataset folder. The preprocessed AudioCaps dataset is available here: Train set & Test set. The visual features for each video, to be placed in the visual_feature folder, can be extracted using the ViT-L/14 pretrained weights or downloaded here (a rough extraction sketch follows the directory layout below). In addition, download the pretrained models used in this project: CED as the audio encoder (to be placed in the pretrained_weights/ced folder) and LLaMA-2 as the LLM backbone (to be placed in the pretrained_weights/Video-LLaMA-2-7B-Finetuned folder). The CED model is available here, and the LLaMA-2 model can be downloaded from this link.

LAVCap/
├── dataset/
│   ├── audiocaps/
│   │   ├── train/
│   │   │   ├── frames/
│   │   │   └── waveforms/
│   │   └── test/
│   │       ├── frames/
│   │       └── waveforms/
│   ├── visual_feature/
│   │   ├── train/
│   │   └── test/
│   ├── train.json
│   ├── test.json
│   ├── test_coco.json
│   └── torch2iid.json
│
├── pretrained_weights/
│   ├── ced/
│   └── Video-LLaMA-2-7B-Finetuned/
│
...
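
Since precomputed visual features are provided, the sketch below is only a rough reference for how per-frame ViT-L/14 (CLIP) features could be reproduced; the model id, the .jpg frame layout, and the output file naming are assumptions rather than the repository's exact pipeline.

import glob, os
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def extract_clip_features(frame_dir, out_path):
    """Encode every frame in frame_dir and save a [num_frames, 768] feature tensor."""
    frames = sorted(glob.glob(os.path.join(frame_dir, "*.jpg")))   # assumed frame format
    images = [Image.open(f).convert("RGB") for f in frames]
    inputs = processor(images=images, return_tensors="pt").to(device)
    feats = model.get_image_features(**inputs)                     # projected CLIP embeddings
    torch.save(feats.cpu(), out_path)

# Hypothetical example (paths are illustrative, not the repo's exact naming):
# extract_clip_features("dataset/audiocaps/train/frames/<video_id>",
#                       "dataset/visual_feature/train/<video_id>.pt")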

Training

To train LAVCap, run the command below:

python train.py --cfg-path configs/config_best.yaml
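
For scoring generated captions, the test_coco.json file in the dataset folder suggests COCO-style evaluation. The snippet below is a hedged sketch using the standard pycocoevalcap toolkit; it assumes test_coco.json follows the COCO caption annotation format and that predictions have been written to a COCO-style results file named results.json, neither of which is confirmed by this repository.

from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

coco = COCO("dataset/test_coco.json")     # ground-truth captions (assumed COCO format)
coco_res = coco.loadRes("results.json")   # generated captions (assumed results file)
evaluator = COCOEvalCap(coco, coco_res)
evaluator.evaluate()                      # BLEU, METEOR, ROUGE-L, CIDEr, SPICE
for metric, score in evaluator.eval.items():
    print(f"{metric}: {score:.4f}")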

Citation

If you find this repo helpful, please consider citing:

@article{rho2025lavcap,
  title={LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport},
  author={Rho, Kyeongha and Lee, Hyeongkeun and Iverson, Valentio and Chung, Joon Son},
  journal={arXiv preprint arXiv:2501.09291},
  year={2025}
}

Acknowledgement

This repo is built upon the frameworks of SALMONN, LOAE, and AVCap.
