¹KAIST  ²University of Waterloo
*Equal Contribution
[arXiv]
This repository contains the official implementation of LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport, presented at ICASSP 2025. Our method introduces a novel approach to audio-visual captioning, achieving state-of-the-art performance on AudioCaps.
This implementation is specifically designed to run on NVIDIA GPUs and leverages GPU acceleration for efficient training and inference.

Fig. 1. (a) Overview of the proposed LAVCap Framework. (b) Detail of the Optimal Transport Fusion module.
Automated audio captioning is a task that generates textual descriptions for audio content, and recent studies have explored using visual information to enhance captioning quality. However, current methods often fail to effectively fuse audio and visual data, missing important semantic cues from each modality. To address this, we introduce LAVCap, a large language model (LLM)-based audio-visual captioning framework that effectively integrates visual information with audio to improve audio captioning performance. LAVCap employs an optimal transport-based alignment loss to bridge the modality gap between audio and visual features, enabling more effective semantic extraction. Additionally, we propose an optimal transport attention module that enhances audio-visual fusion using an optimal transport assignment map. Together with an optimal training strategy, experimental results demonstrate that each component of our framework is effective. LAVCap outperforms existing state-of-the-art methods on the AudioCaps dataset, without relying on large datasets or post-processing.
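For intuition only, below is a minimal PyTorch sketch of an optimal-transport-style audio-visual alignment: Sinkhorn iterations turn a cosine-distance cost between audio and visual tokens into a soft assignment map, and the resulting transport cost serves as an alignment objective. The feature shapes, entropic regularization, and iteration count are illustrative assumptions, not the exact LAVCap implementation.

# Illustrative sketch (not the official LAVCap code): entropic OT alignment
# between audio and visual token features via Sinkhorn iterations.
import torch

def sinkhorn(cost, eps=0.1, iters=50):
    # cost: (Na, Nv) pairwise cost between audio and visual tokens
    K = torch.exp(-cost / eps)                       # Gibbs kernel
    a = torch.full((cost.size(0),), 1.0 / cost.size(0))  # uniform audio marginal
    b = torch.full((cost.size(1),), 1.0 / cost.size(1))  # uniform visual marginal
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)       # transport plan T (Na x Nv)

# Toy audio/visual token features (Na x D and Nv x D), L2-normalized.
audio  = torch.nn.functional.normalize(torch.randn(32, 256), dim=-1)
visual = torch.nn.functional.normalize(torch.randn(16, 256), dim=-1)

cost = 1.0 - audio @ visual.T            # cosine distance as the ground cost
plan = sinkhorn(cost)                    # soft audio-visual assignment map
ot_alignment_loss = (plan * cost).sum()  # transport cost used as an alignment loss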
Create conda environment
conda env create -f lavcap.yaml
conda activate lavcap
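Because training and inference are GPU-accelerated, you may want to confirm that PyTorch inside the new environment can see a CUDA device. This one-liner is only a suggested sanity check, not part of the official setup:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"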
Please download and organize the AudioCaps dataset in the required structure within the dataset folder. The preprocessed AudioCaps dataset is available here: Train set & Test set. The visual features for each video, to be placed in the visual_feature folder, can be extracted using the ViT-L/14 pretrained weights and can be downloaded here. In addition, download the pretrained models used in this project, including CED as the audio encoder (to be placed in the pretrained_weights/ced folder) and LLaMA-2 as the foundational backbone (to be placed in the pretrained_weights/Video-LLaMA-2-7B-Finetuned folder). The CED model is available here, and the LLaMA-2 model can be downloaded from this link.
LAVCap/
├── dataset/
│   └── audiocaps/
│       ├── train/
│       │   ├── frames/
│       │   └── waveforms/
│       ├── test/
│       │   ├── frames/
│       │   └── waveforms/
│       ├── visual_feature/
│       │   ├── train/
│       │   └── test/
│       ├── train.json
│       ├── test.json
│       ├── test_coco.json
│       └── torch2iid.json
│
├── pretrained_weights/
│   ├── ced/
│   └── Video-LLaMA-2-7B-Finetuned/
│
...
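Once everything is downloaded, a small script like the one below can confirm that the layout matches the tree above. It is an illustrative helper, not part of the repository, and assumes it is run from the LAVCap/ root:

# Illustrative check that the expected folders and files from the tree above exist.
from pathlib import Path

expected = [
    "dataset/audiocaps/train/frames",
    "dataset/audiocaps/train/waveforms",
    "dataset/audiocaps/test/frames",
    "dataset/audiocaps/test/waveforms",
    "dataset/audiocaps/visual_feature/train",
    "dataset/audiocaps/visual_feature/test",
    "dataset/audiocaps/train.json",
    "dataset/audiocaps/test.json",
    "dataset/audiocaps/test_coco.json",
    "dataset/audiocaps/torch2iid.json",
    "pretrained_weights/ced",
    "pretrained_weights/Video-LLaMA-2-7B-Finetuned",
]
for path in expected:
    status = "ok" if Path(path).exists() else "missing"
    print(f"{status:8s} {path}")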
To train LAVCap, run the command below:
python train.py --cfg-path configs/config_best.yaml
If you find this repo helpful, please consider citing:
@article{rho2025lavcap,
  title={LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport},
  author={Rho, Kyeongha and Lee, Hyeongkeun and Iverson, Valentio and Chung, Joon Son},
  journal={arXiv preprint arXiv:2501.09291},
  year={2025}
}
This repo is built upon the framework of SALMONN, LOAE and AVCap.