¹KAIST  ²University of Waterloo
*Equal Contribution
[arXiv]
This repository contains the official implementation of LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport, presented at ICASSP 2025. Our method introduces a novel approach to audio-visual captioning, achieving state-of-the-art performance on AudioCaps.
This implementation is specifically designed to run on NVIDIA GPUs and leverages GPU acceleration for efficient training and inference.

Fig. 1. (a) Overview of the proposed LAVCap Framework. (b) Detail of the Optimal Transport Fusion module.
Automated audio captioning is a task that generates textual descriptions for audio content, and recent studies have explored using visual information to enhance captioning quality. However, current methods often fail to effectively fuse audio and visual data, missing important semantic cues from each modality. To address this, we introduce LAVCap, a large language model (LLM)-based audio-visual captioning framework that effectively integrates visual information with audio to improve audio captioning performance. LAVCap employs an optimal transport-based alignment loss to bridge the modality gap between audio and visual features, enabling more effective semantic extraction. Additionally, we propose an optimal transport attention module that enhances audio-visual fusion using an optimal transport assignment map. Together with an optimal training strategy, experimental results demonstrate that each component of our framework is effective. LAVCap outperforms existing state-of-the-art methods on the AudioCaps dataset, without relying on large datasets or post-processing.
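For intuition only, below is a minimal PyTorch sketch of an optimal-transport-style audio-visual alignment: Sinkhorn iterations turn a cosine-distance cost between audio and visual tokens into a soft assignment map, and the resulting transport cost serves as an alignment objective. The feature shapes, entropic regularization, and iteration count are illustrative assumptions, not the exact LAVCap implementation.

# Illustrative sketch (not the official LAVCap code): entropic OT alignment
# between audio and visual token features via Sinkhorn iterations.
import torch

def sinkhorn(cost, eps=0.1, iters=50):
    # cost: (Na, Nv) pairwise cost between audio and visual tokens
    K = torch.exp(-cost / eps)                       # Gibbs kernel
    a = torch.full((cost.size(0),), 1.0 / cost.size(0))  # uniform audio marginal
    b = torch.full((cost.size(1),), 1.0 / cost.size(1))  # uniform visual marginal
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)       # transport plan T (Na x Nv)

# Toy audio/visual token features (Na x D and Nv x D), L2-normalized.
audio  = torch.nn.functional.normalize(torch.randn(32, 256), dim=-1)
visual = torch.nn.functional.normalize(torch.randn(16, 256), dim=-1)

cost = 1.0 - audio @ visual.T            # cosine distance as the ground cost
plan = sinkhorn(cost)                    # soft audio-visual assignment map
ot_alignment_loss = (plan * cost).sum()  # transport cost used as an alignment loss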
Create conda environment
conda env create -f lavcap.yaml
conda activate lavcap
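Because training and inference are GPU-accelerated, you may want to confirm that PyTorch inside the new environment can see a CUDA device. This one-liner is only a suggested sanity check, not part of the official setup:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"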
Please download and organize the AudioCaps dataset in the required structure within the dataset folder. The preprocessed AudioCaps dataset is available here: Train set & Test set. The visual features for each video, to be placed in the visual_feature folder, can be extracted using the ViT-L/14 pretrained weights and can be downloaded here. In addition, download the pretrained models used in this project, including CED as the audio encoder (to be placed in the pretrained_weights/ced folder) and LLaMA-2 as the foundational backbone (to be placed in the pretrained_weights/Video-LLaMA-2-7B-Finetuned folder). The CED model is available here, and the LLaMA-2 model can be downloaded from this link.
LAVCap/
├── dataset/
│   └── audiocaps/
│       ├── train/
│       │   ├── frames/
│       │   └── waveforms/
│       ├── test/
│       │   ├── frames/
│       │   └── waveforms/
│       ├── visual_feature/
│       │   ├── train/
│       │   └── test/
│       ├── train.json
│       ├── test.json
│       ├── test_coco.json
│       └── torch2iid.json
│
├── pretrained_weights/
│   ├── ced/
│   └── Video-LLaMA-2-7B-Finetuned/
│
...
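Once everything is downloaded, a small script like the one below can confirm that the layout matches the tree above. It is an illustrative helper, not part of the repository, and assumes it is run from the LAVCap/ root:

# Illustrative check that the expected folders and files from the tree above exist.
from pathlib import Path

expected = [
    "dataset/audiocaps/train/frames",
    "dataset/audiocaps/train/waveforms",
    "dataset/audiocaps/test/frames",
    "dataset/audiocaps/test/waveforms",
    "dataset/audiocaps/visual_feature/train",
    "dataset/audiocaps/visual_feature/test",
    "dataset/audiocaps/train.json",
    "dataset/audiocaps/test.json",
    "dataset/audiocaps/test_coco.json",
    "dataset/audiocaps/torch2iid.json",
    "pretrained_weights/ced",
    "pretrained_weights/Video-LLaMA-2-7B-Finetuned",
]
for path in expected:
    status = "ok" if Path(path).exists() else "missing"
    print(f"{status:8s} {path}")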
To train LAVCap, run the command below:
python train.py --cfg-path configs/config_best.yaml
If you find this repo helpful, please consider citing:
@article{rho2025lavcap,
  title={LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport},
  author={Rho, Kyeongha and Lee, Hyeongkeun and Iverson, Valentio and Chung, Joon Son},
  journal={arXiv preprint arXiv:2501.09291},
  year={2025}
}
This repo is built upon the framework of SALMONN, LOAE and AVCap.