This is the official repository for the Q-ViD paper, "Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering", available at https://arxiv.org/pdf/2402.10698.pdf
Q-ViD performs video QA by relying on an **open** instruction-aware vision-language model to generate video frame descriptions that are relevant to the task at hand. It uses question-dependent captioning instructions as input to InstructBLIP to generate useful frame captions. Subsequently, a QA instruction that combines the captions, question, options, and a task description is passed to an LLM-based reasoning module to perform video QA. Compared with prior works based on more complex architectures or **closed** GPT models, Q-ViD achieves either the best or second-best overall average accuracy across all evaluated benchmarks.
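The sketch below illustrates this two-stage idea (question-dependent captioning, then caption-based QA) using the Hugging Face InstructBLIP and Flan-T5 APIs rather than the LAVIS-based code in this repo; the checkpoint names, frame sampling, and generation settings are illustrative assumptions, not the exact Q-ViD implementation.

```python
# Minimal sketch of the Q-ViD pipeline described above. Model names, frame sampling,
# and generation settings are assumptions for illustration; the repo uses LAVIS.
import torch
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          InstructBlipForConditionalGeneration, InstructBlipProcessor)

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Question-dependent frame captioning with an open instruction-aware VLM.
cap_processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-flan-t5-xl")
cap_model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-flan-t5-xl").to(device)

def caption_frames(frames, question):
    """Caption each sampled frame (PIL image) with an instruction built from the question."""
    cap_prompt = f"Provide a detailed description of the image related to the {question}"
    captions = []
    for frame in frames:
        inputs = cap_processor(images=frame, text=cap_prompt, return_tensors="pt").to(device)
        out = cap_model.generate(**inputs, max_new_tokens=60)
        captions.append(cap_processor.batch_decode(out, skip_special_tokens=True)[0])
    return captions

# 2) QA instruction: captions + question + options + task description go to an LLM reasoner
#    (a stand-alone Flan-T5 here, as a stand-in for the reasoning module used in the paper).
qa_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")
qa_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl").to(device)

def answer(captions, question, options):
    task = ("Considering the information presented in the captions, select the correct "
            "answer in one letter (A,B,C,D,E) from the options.")
    opts = " ".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    prompt = f"Captions: {' '.join(captions)} Question: {question} Options: {opts} {task}"
    inputs = qa_tokenizer(prompt, return_tensors="pt", truncation=True).to(device)
    out = qa_model.generate(**inputs, max_new_tokens=5)
    return qa_tokenizer.decode(out[0], skip_special_tokens=True)
```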
Q-ViD relies on InstructBLIP. Its original trimmed checkpoints, compatible with the LAVIS library, can be downloaded from the following links:
| Hugging Face Model | Checkpoint |
|---|---|
| InstructBLIP-flan-t5-xl | Download |
| InstructBLIP-flan-t5-xxl | Download |
- (Optional) Creating conda environment
conda create -n Q-ViD python=3.8
conda activate Q-ViD
- Build from source
cd Q-ViD
pip install -e .
We test Q-ViD on five video QA benchmarks, which can be downloaded from the following links:
After downloading, the original annotation files (json, jsonl, csv) from each dataset have to be preprocessed into a unified JSON format using a preprocessing script. Afterwards, update the paths to the video folders and annotation files in the corresponding config files for each dataset.
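As a rough illustration of this step, the snippet below converts a CSV annotation file into a single JSON file. The column names (video_id, question, a0-a4, answer), output fields, and paths are hypothetical; the actual schema is defined by the preprocessing script in this repo.

```python
# Hypothetical sketch of the preprocessing step: CSV annotations -> one unified JSON file.
import csv
import json

def csv_to_json(csv_path, json_path):
    samples = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            samples.append({
                "video": row["video_id"],      # assumed column/field names
                "question": row["question"],
                "options": [row[k] for k in ("a0", "a1", "a2", "a3", "a4") if k in row],
                "answer": row["answer"],
            })
    with open(json_path, "w") as f:
        json.dump(samples, f, indent=2)

csv_to_json("nextqa/val.csv", "nextqa/val.json")  # example paths, adjust to your setup
```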
To run Q-ViD, use the following scripts, modified for each dataset as shown below. They define the corresponding prompts, number of frames, and batch size; note that we change the batch size per dataset according to our computational limitations.
**NExT-QA**
result_dir="YOUR_PATH"
exp_name='nextqa_infer'
ckpt='Q-ViD/qvid_checkpoints/instruct_blip_flanxxl_trimmed.pth'
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run --nproc_per_node=4 evaluate.py \
--cfg-path Q-ViD/lavis/projects/qvid/nextqa_eval.yaml \
--options run.output_dir=${result_dir}${exp_name} \
model.model_type=flant5xxl \
model.cap_prompt="Provide a detailed description of the image related to the" \
model.qa_prompt="Considering the information presented in the captions, select the correct answer in one letter (A,B,C,D,E) from the options." \
datasets.nextqa.vis_processor.eval.n_frms=64 \
run.batch_size_eval=2 \
model.task='qvh_freeze_loc_freeze_qa_vid' \
model.finetuned=${ckpt} \
run.task='videoqa'
**STAR**
result_dir="YOUR_PATH"
exp_name='star_infer'
ckpt='Q-ViD/qvid_checkpoints/instruct_blip_flanxxl_trimmed.pth'
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run --nproc_per_node=4 evaluate.py \
--cfg-path Q-ViD/lavis/projects/qvid/star_eval.yaml \
--options run.output_dir=${result_dir}${exp_name} \
model.model_type=flant5xxl \
model.cap_prompt="Provide a detailed description of the image related to the" \
model.qa_prompt="Considering the information presented in the captions, select the correct answer in one letter (A,B,C,D) from the options." \
datasets.star.vis_processor.eval.n_frms=64 \
run.batch_size_eval=2 \
model.task='qvh_freeze_loc_freeze_qa_vid' \
model.finetuned=${ckpt} \
run.task='videoqa'
**How2QA**
result_dir="YOUR_PATH"
exp_name='how2qa_infer'
ckpt='Q-ViD/qvid_checkpoints/instruct_blip_flanxxl_trimmed.pth'
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run --nproc_per_node=4 evaluate.py \
--cfg-path Q-ViD/lavis/projects/qvid/how2qa_eval.yaml \
--options run.output_dir=${result_dir}${exp_name} \
model.model_type=flant5xxl \
model.cap_prompt="Provide a detailed description of the image related to the" \
model.qa_prompt="Considering the information presented in the captions, select the correct answer in one letter (A,B,C,D) from the options." \
datasets.how2qa.vis_processor.eval.n_frms=64 \
run.batch_size_eval=2 \
model.task='qvh_freeze_loc_freeze_qa_vid' \
model.finetuned=${ckpt} \
run.task='videoqa'
**TVQA**
result_dir="YOUR_PATH"
exp_name='tvqa_infer'
ckpt='Q-ViD/qvid_checkpoints/instruct_blip_flanxxl_trimmed.pth'
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run --nproc_per_node=4 evaluate.py \
--cfg-path Q-ViD/lavis/projects/qvid/tvqa_eval.yaml \
--options run.output_dir=${result_dir}${exp_name} \
model.model_type=flant5xxl \
model.cap_prompt="Provide a detailed description of the image related to the" \
model.qa_prompt="Considering the information presented in the captions, select the correct answer in one letter (A,B,C,D,E) from the options." \
datasets.tvqa.vis_processor.eval.n_frms=64 \
run.batch_size_eval=1 \
model.task='qvh_freeze_loc_freeze_qa_vid' \
model.finetuned=${ckpt} \
run.task='videoqa'
**IntentQA**
result_dir="YOUR_PATH"
exp_name='intentqa_infer'
ckpt='Q-ViD/qvid_checkpoints/instruct_blip_flanxxl_trimmed.pth'
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run --nproc_per_node=4 evaluate.py \
--cfg-path Q-ViD/lavis/projects/qvid/intentqa_eval.yaml \
--options run.output_dir=${result_dir}${exp_name} \
model.model_type=flant5xxl \
model.cap_prompt="Provide a detailed description of the image related to the" \
model.qa_prompt="Considering the information presented in the captions, select the correct answer in one letter (A,B,C,D,E) from the options." \
datasets.intentqa.vis_processor.eval.n_frms=64 \
run.batch_size_eval=2 \
model.task='qvh_freeze_loc_freeze_qa_vid' \
model.finetuned=${ckpt} \
run.task='videoqa'
If Q-ViD was useful for your work, please cite our paper:
@misc{romero2024questioninstructed,
title={Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering},
author={David Romero and Thamar Solorio},
year={2024},
eprint={2402.10698},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
We thank the developers of LAVIS, BLIP-2, and SeViLA for their public code releases.