This is the official repository for the Q-ViD paper, "Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering", available at https://arxiv.org/pdf/2402.10698.pdf
Q-ViD performs video QA by relying on an **open** instruction-aware vision-language model to generate video frame descriptions that are relevant to the task at hand. It uses question-dependent captioning instructions as input to InstructBLIP to generate useful frame captions. Subsequently, a QA instruction that combines the captions, question, options, and a task description is passed to an LLM-based reasoning module to perform video QA. Compared with prior works based on more complex architectures or **closed** GPT models, Q-ViD achieves either the best or second-best overall average accuracy across all evaluated benchmarks.
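The sketch below illustrates this two-stage idea (question-dependent captioning, then caption-based QA) using the Hugging Face InstructBLIP and Flan-T5 APIs rather than the LAVIS-based code in this repo; the checkpoint names, frame sampling, and generation settings are illustrative assumptions, not the exact Q-ViD implementation.

```python
# Minimal sketch of the Q-ViD pipeline described above. Model names, frame sampling,
# and generation settings are assumptions for illustration; the repo uses LAVIS.
import torch
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          InstructBlipForConditionalGeneration, InstructBlipProcessor)

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Question-dependent frame captioning with an open instruction-aware VLM.
cap_processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-flan-t5-xl")
cap_model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-flan-t5-xl").to(device)

def caption_frames(frames, question):
    """Caption each sampled frame (PIL image) with an instruction built from the question."""
    cap_prompt = f"Provide a detailed description of the image related to the {question}"
    captions = []
    for frame in frames:
        inputs = cap_processor(images=frame, text=cap_prompt, return_tensors="pt").to(device)
        out = cap_model.generate(**inputs, max_new_tokens=60)
        captions.append(cap_processor.batch_decode(out, skip_special_tokens=True)[0])
    return captions

# 2) QA instruction: captions + question + options + task description go to an LLM reasoner
#    (a stand-alone Flan-T5 here, as a stand-in for the reasoning module used in the paper).
qa_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")
qa_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl").to(device)

def answer(captions, question, options):
    task = ("Considering the information presented in the captions, select the correct "
            "answer in one letter (A,B,C,D,E) from the options.")
    opts = " ".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    prompt = f"Captions: {' '.join(captions)} Question: {question} Options: {opts} {task}"
    inputs = qa_tokenizer(prompt, return_tensors="pt", truncation=True).to(device)
    out = qa_model.generate(**inputs, max_new_tokens=5)
    return qa_tokenizer.decode(out[0], skip_special_tokens=True)
```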
Q-ViD relies on InstructBLIP. Its original trimmed checkpoints, compatible with the LAVIS library, can be downloaded from the following links:
| Hugging Face Model | Checkpoint |
|---|---|
| InstructBLIP-flan-t5-xl | Download |
| InstructBLIP-flan-t5-xxl | Download |
- (Optional) Creating conda environment
conda create -n Q-ViD python=3.8
conda activate Q-ViD
- Build from source
cd Q-ViD
pip install -e .
We test Q-ViD on five video QA benchmarks, which can be downloaded from the following links:
After downloading, the original annotation files (json, jsonl, csv) from each dataset have to be preprocessed into a unified JSON format using a preprocessing script. Afterwards, update the paths to the video folders and annotation files in the corresponding config files for each dataset.
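As a rough illustration of this step, the snippet below converts a CSV annotation file into a single JSON file. The column names (video_id, question, a0-a4, answer), output fields, and paths are hypothetical; the actual schema is defined by the preprocessing script in this repo.

```python
# Hypothetical sketch of the preprocessing step: CSV annotations -> one unified JSON file.
import csv
import json

def csv_to_json(csv_path, json_path):
    samples = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            samples.append({
                "video": row["video_id"],      # assumed column/field names
                "question": row["question"],
                "options": [row[k] for k in ("a0", "a1", "a2", "a3", "a4") if k in row],
                "answer": row["answer"],
            })
    with open(json_path, "w") as f:
        json.dump(samples, f, indent=2)

csv_to_json("nextqa/val.csv", "nextqa/val.json")  # example paths, adjust to your setup
```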
To run Q-ViD, use the following scripts, modified for each dataset as shown below. They define the corresponding prompts, number of frames, and batch size; note that we change the batch size per dataset according to our computational limitations.
**NExT-QA**
result_dir="YOUR_PATH"
exp_name='nextqa_infer'
ckpt='Q-ViD/qvid_checkpoints/instruct_blip_flanxxl_trimmed.pth'
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run --nproc_per_node=4 evaluate.py \
--cfg-path Q-ViD/lavis/projects/qvid/nextqa_eval.yaml \
--options run.output_dir=${result_dir}${exp_name} \
model.model_type=flant5xxl \
model.cap_prompt="Provide a detailed description of the image related to the" \
model.qa_prompt="Considering the information presented in the captions, select the correct answer in one letter (A,B,C,D,E) from the options." \
datasets.nextqa.vis_processor.eval.n_frms=64 \
run.batch_size_eval=2 \
model.task='qvh_freeze_loc_freeze_qa_vid' \
model.finetuned=${ckpt} \
run.task='videoqa'
**STAR**
result_dir="YOUR_PATH"
exp_name='star_infer'
ckpt='Q-ViD/qvid_checkpoints/instruct_blip_flanxxl_trimmed.pth'
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run --nproc_per_node=4 evaluate.py \
--cfg-path Q-ViD/lavis/projects/qvid/star_eval.yaml \
--options run.output_dir=${result_dir}${exp_name} \
model.model_type=flant5xxl \
model.cap_prompt="Provide a detailed description of the image related to the" \
model.qa_prompt="Considering the information presented in the captions, select the correct answer in one letter (A,B,C,D) from the options." \
datasets.star.vis_processor.eval.n_frms=64 \
run.batch_size_eval=2 \
model.task='qvh_freeze_loc_freeze_qa_vid' \
model.finetuned=${ckpt} \
run.task='videoqa'
**How2QA**
result_dir="YOUR_PATH"
exp_name='how2qa_infer'
ckpt='Q-ViD/qvid_checkpoints/instruct_blip_flanxxl_trimmed.pth'
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run --nproc_per_node=4 evaluate.py \
--cfg-path Q-ViD/lavis/projects/qvid/how2qa_eval.yaml \
--options run.output_dir=${result_dir}${exp_name} \
model.model_type=flant5xxl \
model.cap_prompt="Provide a detailed description of the image related to the" \
model.qa_prompt="Considering the information presented in the captions, select the correct answer in one letter (A,B,C,D) from the options." \
datasets.how2qa.vis_processor.eval.n_frms=64 \
run.batch_size_eval=2 \
model.task='qvh_freeze_loc_freeze_qa_vid' \
model.finetuned=${ckpt} \
run.task='videoqa'
**TVQA**
result_dir="YOUR_PATH"
exp_name='tvqa_infer'
ckpt='Q-ViD/qvid_checkpoints/instruct_blip_flanxxl_trimmed.pth'
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run --nproc_per_node=4 evaluate.py \
--cfg-path Q-ViD/lavis/projects/qvid/tvqa_eval.yaml \
--options run.output_dir=${result_dir}${exp_name} \
model.model_type=flant5xxl \
model.cap_prompt="Provide a detailed description of the image related to the" \
model.qa_prompt="Considering the information presented in the captions, select the correct answer in one letter (A,B,C,D,E) from the options." \
datasets.tvqa.vis_processor.eval.n_frms=64 \
run.batch_size_eval=1 \
model.task='qvh_freeze_loc_freeze_qa_vid' \
model.finetuned=${ckpt} \
run.task='videoqa'
**IntentQA**
result_dir="YOUR_PATH"
exp_name='intentqa_infer'
ckpt='Q-ViD/qvid_checkpoints/instruct_blip_flanxxl_trimmed.pth'
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run --nproc_per_node=4 evaluate.py \
--cfg-path Q-ViD/lavis/projects/qvid/intentqa_eval.yaml \
--options run.output_dir=${result_dir}${exp_name} \
model.model_type=flant5xxl \
model.cap_prompt="Provide a detailed description of the image related to the" \
model.qa_prompt="Considering the information presented in the captions, select the correct answer in one letter (A,B,C,D,E) from the options." \
datasets.intentqa.vis_processor.eval.n_frms=64 \
run.batch_size_eval=2 \
model.task='qvh_freeze_loc_freeze_qa_vid' \
model.finetuned=${ckpt} \
run.task='videoqa'
If Q-ViD was useful for your work, please cite our paper:
@misc{romero2024questioninstructed,
title={Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering},
author={David Romero and Thamar Solorio},
year={2024},
eprint={2402.10698},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
We thank the developers of LAVIS, BLIP-2, and SeViLA for their public code releases.