🚀 Introducing OVBench
OVBench is a benchmark tailored for real-time video understanding:
- Memory, Perception, and Prediction of Temporal Contexts: Questions are framed around the present state of entities, requiring models to memorize past context, perceive the present, and predict future states as the stream unfolds.
- Dynamic Spatio-temporal Interaction: The benchmark demands precise real-time interactions with video content, where actions, objects, and events must be understood in the context of their spatial and temporal relationships.
- Contextual Awareness at Specific Moments: Real-time questions are contextual, changing depending on the timestamp at which they are asked, and therefore require a deep understanding of how the temporal context evolves.
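To make this timestamp dependence concrete, here is a minimal, hypothetical illustration; the field names below are our own and not the actual OVBench schema. The same question asked at two different moments of a stream must be answered from the state at that moment.

```python
# Hypothetical illustration only: these field names are NOT the OVBench schema.
# The same question, asked at different timestamps of a streaming video,
# must be answered from the temporal context available at that moment.
example_queries = [
    {
        "video": "kitchen_stream.mp4",
        "query_time": 12.0,  # seconds into the stream
        "question": "What is the person currently holding?",
        "answer": "a knife",
    },
    {
        "video": "kitchen_stream.mp4",
        "query_time": 45.0,
        "question": "What is the person currently holding?",
        "answer": "a frying pan",  # the state has changed by t = 45 s
    },
]
```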
🏗️ Pyramid Memory Bank
To tackle the challenges of infinite video streams, we propose a multi-layered Pyramid Memory Bank that balances spatial and temporal information:
- Spatial Anchors: The lower layers retain high-resolution features to preserve fine-grained spatial cues, capturing keyframes as "spatial anchors" with a lower sampling rate.
- Progressive Abstraction: As the layers progress, spatial resolution decays while the temporal sampling rate grows proportionally, yielding increasingly abstract representations that capture both short- and long-term patterns.
- Dynamic Eviction: A dynamic eviction mechanism detects temporally redundant frames via feature similarity and applies pooling for spatial compression, improving storage efficiency.
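Below is a minimal conceptual sketch of such a pyramid memory bank, not the released implementation: each level trades spatial resolution for temporal density, and once a level is full, the more temporally redundant of two adjacent frames (measured here by cosine similarity) is evicted. The class name, level sizes, strides, and exact eviction rule are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

class PyramidMemoryBank:
    """Conceptual sketch of a multi-level memory bank: lower levels keep a few
    high-resolution "spatial anchors"; higher levels keep many low-resolution,
    temporally dense features. Not the released implementation."""

    def __init__(self, capacities=(4, 16, 64), resolutions=(16, 8, 4), strides=(16, 4, 1)):
        # capacities: max stored frames per level (temporal density grows per level)
        # resolutions: spatial size per level (resolution decays per level)
        # strides: temporal sampling stride per level (lower levels sample less often)
        self.levels = [
            {"capacity": c, "resolution": r, "stride": s, "frames": []}
            for c, r, s in zip(capacities, resolutions, strides)
        ]
        self.frame_idx = 0

    def _evict(self, frames):
        # Dynamic eviction: drop the later of the two most similar adjacent
        # frames, i.e. the most temporally redundant one.
        sims = [
            F.cosine_similarity(a.flatten(), b.flatten(), dim=0)
            for a, b in zip(frames[:-1], frames[1:])
        ]
        idx = int(torch.stack(sims).argmax())
        frames.pop(idx + 1)

    def insert(self, feat):
        # feat: (C, H, W) feature map of the newest frame.
        for level in self.levels:
            if self.frame_idx % level["stride"] != 0:
                continue
            # Spatial compression via average pooling to the level's resolution.
            compressed = F.adaptive_avg_pool2d(feat, level["resolution"])
            level["frames"].append(compressed)
            if len(level["frames"]) > level["capacity"]:
                self._evict(level["frames"])
        self.frame_idx += 1
```

With these made-up settings, the bottom level stores only every 16th frame at 16x16 resolution, while the top level stores every frame at 4x4, so total memory stays bounded regardless of stream length.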
🎯 Offline-to-Online Learning Paradigm
A novel training strategy designed for online video streams:
- Interleaved Dialogue Tuning: Combines offline video data with online instruction tuning in a dialogue format.
- Progressive Learning: Bridges offline and online video understanding, enhancing real-time adaptability.
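One way to read "interleaved dialogue tuning" is as a data pipeline that mixes offline video QA samples with online, streaming-style dialogue turns in a single training stream. The sketch below only illustrates that idea; the function name and mixing ratio are assumptions, not the paper's actual recipe.

```python
import itertools
import random

def interleave_offline_online(offline_samples, online_samples, online_ratio=0.5, seed=0):
    # Illustrative sketch: yield a mixed stream of offline video QA samples and
    # online, streaming-style dialogue samples. The ratio is an assumption.
    rng = random.Random(seed)
    offline_it = itertools.cycle(offline_samples)
    online_it = itertools.cycle(online_samples)
    while True:
        yield next(online_it) if rng.random() < online_ratio else next(offline_it)

# Example usage: draw one mixed training batch.
# stream = interleave_offline_online(offline_data, online_data)
# batch = [next(stream) for _ in range(8)]
```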
- Model checkpoint upload
- A more interactive demo
See our leaderboard here
Evaluation of Existing Models on OVBench Using lmms_eval
- Environment Setup: Ensure that all dependencies required by lmms_eval are properly installed.
- Perform a global search for the placeholder `/path_to_your` in the `lmms-eval-ovbench` directory and replace it with the corresponding paths on your local system (see the sketch after this list).
- Execute the script `lmms-eval-ovbench/scripts/eval_models/eval_internvl2-8B.sh` to initiate the benchmark evaluation.
- Since the video data used in this benchmark consists of both image sequences and video clips, use `lmms-eval-ovbench/llava/video_utils.py` to read the video data correctly.
- You may refer to the implementation of the `load_video` function in `lmms-eval-ovbench/lmms_eval/models/internvl2.py` as a guideline; integrate this function into your custom model as needed to enable compatibility with the lmms_eval evaluation framework.
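The placeholder replacement in the second step can be scripted. The snippet below is a minimal sketch (run it from the repository root and keep a backup); NEW_ROOT is only an example path that you should change.

```python
from pathlib import Path

PLACEHOLDER = "/path_to_your"   # placeholder used throughout lmms-eval-ovbench
NEW_ROOT = "/data/ovbench"      # example only: set this to your local path

for path in Path("lmms-eval-ovbench").rglob("*"):
    # Only rewrite text-like files; skip directories and binaries.
    if not path.is_file() or path.suffix not in {".py", ".sh", ".json", ".yaml", ".yml"}:
        continue
    text = path.read_text(encoding="utf-8")
    if PLACEHOLDER in text:
        path.write_text(text.replace(PLACEHOLDER, NEW_ROOT), encoding="utf-8")
        print(f"updated {path}")
```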
Email xinhaoli00@outlook.com with your result.json or open an issue in this repo.
To launch the demo, use the following script:
bash gradio_demo.sh
To install the necessary dependencies, use the following commands:
conda create -n your_env python=3.9
conda activate your_env
pip install -r requirements.txt
The anno_data file provides the paths for different types of datasets:
"coin_sl_train": {
"annotation": "Path to the annotations json file.",
"data_root": "your data path",
},
...
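As a quick sanity check after filling in these placeholders, you can verify that every annotation file and data root actually exists. The snippet below is a sketch and assumes the registry is a single JSON file named anno_data.json, which may not match the repo's actual file name.

```python
import json
from pathlib import Path

# Assumption: the dataset registry is a single JSON file shaped like the snippet above.
with open("anno_data.json", encoding="utf-8") as f:
    datasets = json.load(f)

for name, cfg in datasets.items():
    anno_ok = Path(cfg["annotation"]).is_file()
    root_ok = Path(cfg["data_root"]).is_dir()
    print(f"{name}: annotation {'ok' if anno_ok else 'MISSING'}, "
          f"data_root {'ok' if root_ok else 'MISSING'}")
```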
We support the LLaVA and VideoChat2-IT data reading formats; refer to those projects for the specific data JSON formats.
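For reference, a LLaVA-style entry commonly looks roughly like the following; the exact keys this repo expects may differ, so treat it as an illustrative sketch rather than the authoritative schema.

```python
# Illustrative LLaVA-style conversation entry (keys may differ in this repo).
llava_style_entry = {
    "id": "0001",
    "video": "videos/coin/clip_0001.mp4",
    "conversations": [
        {"from": "human", "value": "<video>\nWhat is the person doing right now?"},
        {"from": "gpt", "value": "They are tightening the front wheel of a bicycle."},
    ],
}
```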
🔄 Online SFT Data Download
For the construction format of the online data, please refer to VideoChatOnline-IT.
| Benchmark | Result |
| --- | --- |
| OVBench | 54.9 |
| VideoMME | Short: 65.8, Medium: 50.2, Long: 47.1, Avg: 54.4 |
| MVBench | 65.2 |
| EgoSchema | 54.7 |
| MLVU | 60.8 |
| LongVideoBench | 54.1 |
To run the training, execute the following bash commands for different stages:
#Offline SFT:
bash shell/online_4b/videochat_online_4b_stage1_ft.sh
#Online & Offline Joint SFT:
bash shell/online_4b/videochat_online_4b_stage2_ft.sh
📊 Evaluation on OVBench
#Sliding Window Setting:
bash shell/eval/online_bench_sliding_window.sh
#Streaming Setting:
bash shell/eval/online_bench_stream.sh