LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos
If our project helps you, please give us a star ⭐ and cite our paper!
[🌐 Project Page] [📖 Paper] [🤗 LongVALE Dataset (Hugging face)] [📊 LongVALE Dataset (Baidu drive)]
- 27/02/2025, 🔥LongVALE has been accepted to CVPR 2025.
TODO
- Release the annotation files of LongVALE.
- Release the extracted features (video, audio, speech) of LongVALE.
- Release the LongVALE-LLM model with training and evaluation code.
- Release inference demo on your own videos.
- Release pipeline code for automatic generation of high-quality omni-modality fine-grained annotations for multi-modal long videos.
- We propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning.
- We present LongVALE, the first-ever Vision-Audio-Language-Event understanding benchmark comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality long videos.
- We build LongVALE-LLM, which for the first time equips video large language models (LLMs) with omni-modality fine-grained temporal video understanding.
We recommend setting up a conda environment for the project:
conda create --name=longvale python=3.10
conda activate longvale
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
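A quick optional check that the environment is set up as expected (a minimal sketch, not part of the official instructions):

```python
# Optional sanity check: confirm the pinned PyTorch build sees the GPU and that
# flash-attn was built successfully.
import torch

print("torch:", torch.__version__)            # expect 2.1.2
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn is not installed")
```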
Split | Download | # Videos | # Omni-modal Events | Video Duration |
---|---|---|---|---|
Training set | 🤗 link | 7,240 | 91,863 | 473.8 hrs |
Evaluation set | 🤗 link | 1,171 | 13,867 | 75.6 hrs |
[Note] The JSON files include the video id (YouTube id), video duration, and the timestamps and detailed captions of each omni-modal event. You can download the raw videos from YouTube using the provided video ids.
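For reference, below is a minimal sketch of downloading the raw videos with the yt-dlp CLI. The annotation file name and the `video_id` field are assumptions; check the released JSON files for the actual schema.

```python
# Hedged sketch: download raw videos by YouTube id listed in a LongVALE annotation
# file. The JSON field name "video_id" and the file name are assumptions -- check
# the released annotation files for the actual schema. Requires `yt-dlp` on PATH.
import json
import os
import subprocess

os.makedirs("raw_videos", exist_ok=True)

with open("longvale_train.json") as f:          # hypothetical file name
    annotations = json.load(f)

for item in annotations:
    vid = item["video_id"]                      # assumed key for the YouTube id
    url = f"https://www.youtube.com/watch?v={vid}"
    subprocess.run(
        ["yt-dlp", "-f", "mp4", "-o", f"raw_videos/{vid}.mp4", url],
        check=False,                            # skip videos that are no longer available
    )
```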
Tuning Stage | Download | # Videos | # QA Dialogues | Data Source |
---|---|---|---|---|
Omni boundary perception | 🤗 longvale-sft-bp-7k | 7,240 | 7,240 | LongVALE |
 | 🤗 longvale-sft-bp-154k | ~141K | ~154K | LongVALE + VTimeLLM_stage2 |
Omni instruction tuning | 🤗 longvale-sft-it-25k | 7,240 | ~25.4K | LongVALE |
 | 🤗 longvale-sft-it-61k | - | ~61.4K | LongVALE + VTimeLLM_stage3 |
Modality | Encoder | Download checkpoint | Download features |
---|---|---|---|
Visual frames | CLIP | ViT-L/14 | Training / Evaluation |
Audio | BEATs | BEATs_iter3_plus_AS20K | Training / Evaluation |
Speech | Whisper | whisper-large-v2 | Training / Evaluation |
[Note] You can also extract the features yourself using the scripts provided in ./preprocess. The raw videos can be downloaded from this link (Baidu drive, pwd: i6s7). Since the copyright remains with the original video owners, please download and use the videos under the CC BY-NC-SA 4.0 license.
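For illustration, here is a minimal sketch of per-frame CLIP ViT-L/14 visual feature extraction. The frame rate, paths, and output layout are assumptions; defer to the scripts in ./preprocess for the exact procedure and feature format the model expects.

```python
# Hedged sketch: extract per-frame CLIP ViT-L/14 features for one video, roughly
# mirroring what the ./preprocess scripts are expected to do. The ~1 fps sampling
# rate, file paths, and output format are assumptions, not the official recipe.
import os
import torch
import decord  # pip install decord
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
model = CLIPVisionModelWithProjection.from_pretrained(
    "openai/clip-vit-large-patch14"
).to(device).eval()

vr = decord.VideoReader("raw_videos/example.mp4")          # hypothetical path
fps = vr.get_avg_fps()
indices = list(range(0, len(vr), max(int(fps), 1)))        # ~1 frame per second (assumed)
frames = vr.get_batch(indices).asnumpy()                   # (T, H, W, 3) uint8

with torch.no_grad():
    inputs = processor(images=list(frames), return_tensors="pt").to(device)
    feats = model(**inputs).image_embeds                   # (T, 768) projected CLIP features

os.makedirs("features", exist_ok=True)
torch.save(feats.cpu(), "features/example_clip.pt")
```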
For evaluation instructions, please refer to eval.md.
If you want to train the model yourself, please refer to train.md for training instructions.
We are grateful for the following awesome projects: VTimeLLM
If you find our project useful for your research, please consider citing:
@inproceedings{geng2025longvale,
title={Longvale: Vision-audio-language-event benchmark towards time-aware omni-modal perception of long videos},
author={Geng, Tiantian and Zhang, Jinrui and Wang, Qingni and Wang, Teng and Duan, Jinming and Zheng, Feng},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={18959--18969},
year={2025}
}