🔥 A plug-and-play method (TIM) that gives the MRoPE of Qwen2.5-VL absolute time understanding 🔥
🔥 A state-of-the-art video sampling strategy (TASS) for video MLLMs 🔥
This repository is the official implementation of the paper "DATE: Dynamic Absolute Time Enhancement for Long Video Understanding". We propose TIM, a plug-and-play method that gives the MRoPE of Qwen2.5-VL a real understanding of absolute time. We also introduce TASS (Time-Aware Sampling Strategy), a state-of-the-art video sampling method that improves video understanding by selecting frames according to their relevance to the input query.
- [2025.09.12] 🔥🔥🔥 The paper is available on arXiv!
- Patch the original Qwen2.5-VL code by overriding two methods:
transformers.models.qwen2_5_vl.processing_qwen2_5_vl.Qwen2_5_VLProcessor.__call__ = date_processing_qwen2_5_vl__call__
transformers.models.qwen2_5_vl.modeling_qwen2_5_vl.Qwen2_5_VLForConditionalGeneration.get_rope_index = date_get_rope_index
- Get `timestamps` when loading the video, and pass them to the Qwen2.5-VL processor:
inputs = processor(text=[text_prompt], videos=[video], padding=True, return_tensors="pt", timestamps=timestamps)
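How the `timestamps` are produced is not spelled out in the steps above. As a minimal sketch, assuming they are per-frame times in seconds, one way to derive them for uniformly sampled frames is shown below. The helper `uniform_frame_timestamps` and its parameters are hypothetical and not part of this repository; check `demo.py` for the format the patched processor actually expects.

```python
def uniform_frame_timestamps(total_frames, native_fps, num_frames):
    """Pick uniformly spaced frame indices and return their times in seconds.

    Hypothetical helper for illustration only; not part of this repository.
    """
    if total_frames <= 0 or native_fps <= 0 or num_frames <= 0:
        raise ValueError("total_frames, native_fps and num_frames must be positive")

    # Never request more frames than the video contains.
    num_frames = min(num_frames, total_frames)

    if num_frames == 1:
        indices = [0]
    else:
        # Spread indices evenly from the first frame to the last frame.
        last = total_frames - 1
        indices = [round(i * last / (num_frames - 1)) for i in range(num_frames)]

    # Absolute time of each frame: frame index divided by the native frame rate.
    timestamps = [idx / native_fps for idx in indices]
    return indices, timestamps


# Example: a 10-second clip at 30 fps, sampled down to 4 frames.
indices, timestamps = uniform_frame_timestamps(
    total_frames=300, native_fps=30.0, num_frames=4
)
```

The resulting `indices` select which decoded frames go into `video`, and the matching `timestamps` list would then be passed through the `timestamps=` argument shown above.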
- Generate a better caption based on the input query (optional):
from utils.query import chatcompletions
caption = chatcompletions(question=question)
- Get scores for each frame:
from utils.clip_sim import process_video_question
scores, timestamps, ids, embedding = process_video_question(video_path, caption, fps=2)
- Use TASS to sample frames:
from utils.sampling import tass_sampling
final_timestamps, final_indices = tass_sampling(timestamps, scores, topk=None, max_frames=256)
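For intuition about what the sampling step consumes and produces, here is a plain top-k baseline. It is not the TASS algorithm: it simply keeps the highest-scoring frames and restores temporal order. The function `topk_baseline_sampling` is hypothetical, and the sketch assumes `timestamps` and `scores` are equal-length sequences with one entry per candidate frame. The return order mirrors the `tass_sampling` call above.

```python
def topk_baseline_sampling(timestamps, scores, max_frames):
    """Keep the max_frames highest-scoring frames, returned in temporal order.

    A naive baseline for illustration only; NOT the TASS algorithm.
    """
    if len(timestamps) != len(scores):
        raise ValueError("timestamps and scores must have the same length")
    if max_frames <= 0:
        raise ValueError("max_frames must be positive")

    k = min(max_frames, len(scores))

    # Rank frame indices by descending score; ties go to the earlier frame.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))

    # Restore temporal order so the model sees frames chronologically.
    final_indices = sorted(ranked[:k])
    final_timestamps = [timestamps[i] for i in final_indices]
    return final_timestamps, final_indices


# Example: five candidate frames sampled at 2 fps, keep the best three.
demo_timestamps = [0.0, 0.5, 1.0, 1.5, 2.0]
demo_scores = [0.1, 0.9, 0.3, 0.8, 0.2]
kept_timestamps, kept_indices = topk_baseline_sampling(
    demo_timestamps, demo_scores, max_frames=3
)
```

A pure top-k selection like this tends to pick clusters of near-duplicate frames around a single high-scoring moment, which is one reason to prefer a dedicated sampling strategy over the baseline.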
- Install the environment according to the official documentation of Qwen2.5-VL.
- Modify the input settings in `demo.py` (lines 200-202).