
LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos (CVPR 2025)


Tiantian Geng, Jinrui Zhang, Qingni Wang, Teng Wang, Jinming Duan, Feng Zheng

If our project helps you, please give us a star ⭐ and cite our paper!

[🌐 Project Page] [📖 Paper] [🤗 LongVALE Dataset (Hugging Face)] [📊 LongVALE Dataset (Baidu drive)]

News

  • 27/02/2025: 🔥 LongVALE has been accepted to CVPR 2025.

TODO

  • Release the annotation files of LongVALE.
  • Release the extracted features (video, audio, speech) of LongVALE.
  • Release the LongVALE-LLM model with training and evaluation code.
  • Release an inference demo for running on your own videos.
  • Release the pipeline code for automatically generating high-quality, fine-grained omni-modal annotations for multi-modal long videos.

👀 Overview

  • We propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning.
  • We present LongVALE, the first-ever Vision-Audio-Language Event understanding benchmark, comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality long videos.
  • We build LongVALE-LLM to enable, for the first time, omni-modal fine-grained temporal video understanding in video large language models (LLMs).

Requirements

We recommend setting up a conda environment for the project:

conda create --name=longvale python=3.10
conda activate longvale
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
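
After installation, it may help to confirm that the CUDA build of PyTorch is active, since flash-attn requires it. A minimal sanity check in Python; nothing below is LongVALE-specific:

# check_env.py -- verify the CUDA-enabled PyTorch install
import torch

print("torch:", torch.__version__)                # expect 2.1.2
print("cuda runtime:", torch.version.cuda)        # expect 11.8 for the cu118 wheels
print("cuda available:", torch.cuda.is_available())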

Dataset

Annotation files of training and evaluation sets

Split            Download   # Videos   # Omni-modal Events   Video Duration
Training set     🤗 link    7,240      91,863                473.8 hrs
Evaluation set   🤗 link    1,171      13,867                75.6 hrs

[Note] The JSON files include the video id (YouTube id), the video duration, and the timestamps and detailed captions of each omni-modal event. You can download the raw videos from YouTube using the provided video ids.
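
For reference, a sketch of how such an annotation file might be read in Python. The file name and key names (vid, duration, events, timestamps, caption) are assumptions based on the description above, not a confirmed schema; adjust them to match the downloaded JSON:

# load_annotations.py -- sketch of iterating a LongVALE annotation file
import json

with open("longvale_train.json") as f:            # hypothetical file name
    videos = json.load(f)

for video in videos:
    print(video["vid"], video["duration"])        # YouTube id and duration (assumed keys)
    for event in video["events"]:                 # per-event boundaries and captions (assumed keys)
        print(event["timestamps"], event["caption"])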

LongVALE-based dialogue data for LongVALE-LLM training

Tuning Stage               Download                  # Videos   # QA Dialogues   Data Source
Omni boundary perception   🤗 longvale-sft-bp-7k     7,240      7,240            LongVALE
                           🤗 longvale-sft-bp-154k   ~141K      ~154K            LongVALE + VTimeLLM_stage2
Omni instruction tuning    🤗 longvale-sft-it-25k    7,240      ~25.4K           LongVALE
                           🤗 longvale-sft-it-61k    -          ~61.4K           LongVALE + VTimeLLM_stage3

Extracted features of LongVALE

Modality        Encoder   Download checkpoint      Download features
Visual frames   CLIP      ViT-L/14                 Training / Evaluation
Audio           BEATs     BEATs_iter3_plus_AS20K   Training / Evaluation
Speech          Whisper   whisper-large-v2         Training / Evaluation

[Note] You can also extract the features yourself using the provided scripts in ./preprocess. The raw videos can be downloaded from this link (Baidu drive, pwd: i6s7). Since the copyright remains with the original video owners, please use the downloaded videos under the CC BY-NC-SA 4.0 license.
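
As one possible way to fetch the raw videos by id, the sketch below uses the third-party yt-dlp package (pip install yt-dlp). The id list and the output layout are illustrative assumptions:

# download_videos.py -- sketch: fetch raw videos from YouTube by id with yt-dlp
from yt_dlp import YoutubeDL

video_ids = ["<youtube_id>"]                      # placeholder; take ids from the annotation files

opts = {
    "format": "mp4",                              # prefer a single mp4 stream
    "outtmpl": "raw_videos/%(id)s.%(ext)s",       # save as raw_videos/<video_id>.mp4
}
with YoutubeDL(opts) as ydl:
    ydl.download([f"https://www.youtube.com/watch?v={vid}" for vid in video_ids])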

Evaluation

For evaluation instructions, please refer to eval.md.

Training

If you want to train the model yourself, please refer to train.md for training instructions.

Acknowledgement

We are grateful for the following awesome projects: VTimeLLM

Citation

If you find our project useful for your research, please consider citing:

@inproceedings{geng2025longvale,
  title={Longvale: Vision-audio-language-event benchmark towards time-aware omni-modal perception of long videos},
  author={Geng, Tiantian and Zhang, Jinrui and Wang, Qingni and Wang, Teng and Duan, Jinming and Zheng, Feng},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={18959--18969},
  year={2025}
}
