LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos
If our project helps you, please give us a star ⭐ and cite our paper!
[🌐 Project Page] [📖 Paper] [🤗 LongVALE Dataset (Hugging face)] [📊 LongVALE Dataset (Baidu drive)]
- 27/02/2025, 🔥LongVALE has been accepted to CVPR 2025.
TODO
- Release the annotation files of LongVALE.
- Release the extracted features (video, audio, speech) of LongVALE.
- Release the LongVALE-LLM model with training and evaluation code.
- Release inference demo on your own videos.
- Release pipeline code for automatic generation of high-quality omni-modality fine-grained annotations for multi-modal long videos.
- We propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning.
- We present LongVALE, the first-ever Vision-Audio-Language-Event understanding benchmark comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality long videos.
- We build LongVALE-LLM, which for the first time equips video large language models (LLMs) with omni-modality fine-grained temporal video understanding.
We recommend setting up a conda environment for the project:
conda create --name=longvale python=3.10
conda activate longvale
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
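A quick optional check that the environment is set up as expected (a minimal sketch, not part of the official instructions):

```python
# Optional sanity check: confirm the pinned PyTorch build sees the GPU and that
# flash-attn was built successfully.
import torch

print("torch:", torch.__version__)            # expect 2.1.2
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn is not installed")
```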
Split | Download | # Videos | # Omni-modal Events | Video Duration |
---|---|---|---|---|
Training set | 🤗 link | 7,240 | 91,863 | 473.8 hrs |
Evaluation set | 🤗 link | 1,171 | 13,867 | 75.6 hrs |
[Note] The JSON files include the video id (YouTube id), video duration, and the timestamps and detailed captions of each omni-modal event. You can download the raw videos from YouTube using the provided video ids.
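For reference, below is a minimal sketch of downloading the raw videos with the yt-dlp CLI. The annotation file name and the `video_id` field are assumptions; check the released JSON files for the actual schema.

```python
# Hedged sketch: download raw videos by YouTube id listed in a LongVALE annotation
# file. The JSON field name "video_id" and the file name are assumptions -- check
# the released annotation files for the actual schema. Requires `yt-dlp` on PATH.
import json
import os
import subprocess

os.makedirs("raw_videos", exist_ok=True)

with open("longvale_train.json") as f:          # hypothetical file name
    annotations = json.load(f)

for item in annotations:
    vid = item["video_id"]                      # assumed key for the YouTube id
    url = f"https://www.youtube.com/watch?v={vid}"
    subprocess.run(
        ["yt-dlp", "-f", "mp4", "-o", f"raw_videos/{vid}.mp4", url],
        check=False,                            # skip videos that are no longer available
    )
```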
Tuning Stage | Download | # Videos | # QA Dialogues | Data Source |
---|---|---|---|---|
Omni boundary perception | 🤗 longvale-sft-bp-7k | 7,240 | 7,240 | LongVALE |
 | 🤗 longvale-sft-bp-154k | ~141K | ~154K | LongVALE + VTimeLLM_stage2 |
Omni instruction tuning | 🤗 longvale-sft-it-25k | 7,240 | ~25.4K | LongVALE |
 | 🤗 longvale-sft-it-61k | - | ~61.4K | LongVALE + VTimeLLM_stage3 |
Modality | Encoder | Download checkpoint | Download features |
---|---|---|---|
Visual frames | CLIP | ViT-L/14 | Training / Evaluation |
Audio | BEATs | BEATs_iter3_plus_AS20K | Training / Evaluation |
Speech | Whisper | whisper-large-v2 | Training / Evaluation |
[Note] You can also extract the features yourself using the scripts provided in ./preprocess. The raw videos can be downloaded from this link (Baidu drive, pwd: i6s7). Since the copyright remains with the original video owners, please download and use the videos under the CC BY-NC-SA 4.0 license.
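For illustration, here is a minimal sketch of per-frame CLIP ViT-L/14 visual feature extraction. The frame rate, paths, and output layout are assumptions; defer to the scripts in ./preprocess for the exact procedure and feature format the model expects.

```python
# Hedged sketch: extract per-frame CLIP ViT-L/14 features for one video, roughly
# mirroring what the ./preprocess scripts are expected to do. The ~1 fps sampling
# rate, file paths, and output format are assumptions, not the official recipe.
import os
import torch
import decord  # pip install decord
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
model = CLIPVisionModelWithProjection.from_pretrained(
    "openai/clip-vit-large-patch14"
).to(device).eval()

vr = decord.VideoReader("raw_videos/example.mp4")          # hypothetical path
fps = vr.get_avg_fps()
indices = list(range(0, len(vr), max(int(fps), 1)))        # ~1 frame per second (assumed)
frames = vr.get_batch(indices).asnumpy()                   # (T, H, W, 3) uint8

with torch.no_grad():
    inputs = processor(images=list(frames), return_tensors="pt").to(device)
    feats = model(**inputs).image_embeds                   # (T, 768) projected CLIP features

os.makedirs("features", exist_ok=True)
torch.save(feats.cpu(), "features/example_clip.pt")
```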
For evaluation instructions, please refer to eval.md.
If you want to train the model yourself, please refer to train.md for training instructions.
We are grateful for the following awesome projects: VTimeLLM
If you find our project useful for your research, please consider citing:
@inproceedings{geng2025longvale,
title={Longvale: Vision-audio-language-event benchmark towards time-aware omni-modal perception of long videos},
author={Geng, Tiantian and Zhang, Jinrui and Wang, Qingni and Wang, Teng and Duan, Jinming and Zheng, Feng},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={18959--18969},
year={2025}
}