This repository is the official implementation of the paper *How Can Objects Help Video-Language Understanding?*. We propose ObjectMLLM, a Multimodal Large Language Model with enhanced spatial understanding capability in videos. ObjectMLLM takes visual embeddings, video frame captions, and object bounding boxes as inputs. We explore two approaches to adapting bounding boxes to MLLMs: a textual representation and an embedding projector.
```bash
git clone git@github.com:brown-palm/ObjectMLLM.git
cd ObjectMLLM
conda create -n objectmllm python=3.10
conda activate objectmllm
# You may want to change the PyTorch CUDA version in requirements.txt
pip install -r requirements.txt
```
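Before continuing, an optional sanity check (not part of the repo's own instructions) is to confirm that the installed PyTorch build matches your CUDA setup and sees your GPUs:

```bash
# Optional sanity check: PyTorch version, CUDA build, and GPU visibility
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
nvidia-smi --query-gpu=name,memory.total --format=csv
```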
`flash-attn` is only used for ObjectMLLM with the VideoLLaMA2 backbone. To install `flash-attn`, run

```bash
pip install flash-attn==2.5.9 --no-build-isolation
```
If the build is very slow, an alternative is to install a pre-built `flash-attn` wheel. We recommend installing a wheel compatible with your machine from flash-attention-prebuild-wheels. For example,

```bash
pip install https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.0.8/flash_attn-2.5.9+cu118torch2.4-cp310-cp310-linux_x86_64.whl
```
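Either installation route can be verified with a quick import check (optional, not part of the repo's instructions):

```bash
# Optional: confirm flash-attn imports against the installed PyTorch/CUDA build
python -c "import flash_attn; print(flash_attn.__version__)"
```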
You can use our preprocessed datasets; for convenience, simply run our download script.
```bash
cd llama3
bash scripts/download/download_dataset.sh
```
This will download the datasets via `gdown` and unzip them to `llama3/data`.
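If you want to verify the download, the expected result is one folder per dataset under `llama3/data`; a quick, illustrative check:

```bash
# Illustrative check, run from inside llama3/: one subfolder per dataset is expected
ls data
du -sh data/*
```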
If you want to run our experiments with the embedding projector, you will also need the raw files of our detected and tracked object bounding boxes on CLEVRER-MC, Perception Test, STAR, NExT-QA, and IntentQA. For convenience, you can use our download script.
```bash
cd llama3
bash scripts/download/download_bbox.sh
```
This will download the bounding boxes and unzip them to `llama3/data/{dataset_name}/bbox`.
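A quick, illustrative way to confirm the raw box files landed where the projector experiments expect them:

```bash
# Illustrative check, run from inside llama3/: each dataset should now have a bbox/ folder
ls -d data/*/bbox
```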
Please download the LLaMA3-8B base model from Hugging Face. In particular, our implementation requires the `original/` checkpoint. You can store the base model anywhere on your device and set its path as `model_path` in our training and evaluation scripts.
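One convenient way to fetch the model is the Hugging Face CLI. The sketch below assumes the gated `meta-llama/Meta-Llama-3-8B` repository and a placeholder local path; request access on Hugging Face first, and substitute the path you will later pass as `model_path`:

```bash
# Sketch: download LLaMA3-8B (including the original/ checkpoint) to a local directory
pip install -U "huggingface_hub[cli]"
huggingface-cli login                               # token for an account with Llama 3 access
huggingface-cli download meta-llama/Meta-Llama-3-8B --local-dir /path/to/llama3-8b
ls /path/to/llama3-8b/original                      # the checkpoint used by this repo
```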
To train ObjectMLLM with textual bounding box representation, we provide training scripts in `llama3/scripts/train_textual`. You can run them inside the `llama3` folder. For example, to train a model with all three modalities as input on Perception Test, run
```bash
cd llama3
bash scripts/train_textual/ptest_vis_cap_box.sh
```
To train ObjectMLLM with the bounding box projector, the scripts are in `llama3/scripts/train_projector`. For example,
```bash
cd llama3
bash scripts/train_projector/ptest_box_projector.sh
```
In each script, `--max_seq_len` specifies the maximum context window length required by the model. Training tasks with context window lengths below 1500 run on GPUs with 40G memory (e.g., A6000). Tasks with context lengths above 1500 require GPUs with 80G memory (e.g., A100 or H100). Depending on the number of available GPUs, you may also need to adjust `--nproc_per_node` and `--accum_iter`.
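A common rule of thumb when changing the GPU count is to keep the effective batch size, i.e. `nproc_per_node × per-GPU batch size × accum_iter`, unchanged. The numbers below are purely illustrative and are not taken from the released scripts:

```bash
# Illustrative arithmetic only (not values from the released scripts):
#   effective_batch = nproc_per_node * per_gpu_batch * accum_iter
# A run tuned for 8 GPUs with --accum_iter 1 keeps the same effective batch
# on 4 GPUs by doubling gradient accumulation:
#   --nproc_per_node 8 --accum_iter 1   ->   --nproc_per_node 4 --accum_iter 2
```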
We release our LLaMA3 checkpoints trained on each dataset. You can use our script to download them.
```bash
cd llama3
bash scripts/download/download_checkpoint.sh
```
This will download and unzip the checkpoints to `llama3/checkpoint_released`.
Our evaluation scripts are in `llama3/scripts/eval`. They are executable on a single 40G GPU (e.g., A6000). You may increase `--nproc_per_node` to use multiple GPUs and speed up evaluation.
To evaluate our released checkpoints, please set `CHECKPOINT_DIR=checkpoint_released`. To evaluate checkpoints reproduced by our training scripts, set `CHECKPOINT_DIR=checkpoint`.
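Putting this together, a run might look like the sketch below. The script name is hypothetical (chosen to mirror the training script above), and `CHECKPOINT_DIR` may be a variable inside the script rather than an environment variable, so check the script header before copying this:

```bash
cd llama3
# Hypothetical example: evaluate the released Perception Test checkpoint.
# The exact script name and how CHECKPOINT_DIR is set may differ; see scripts/eval/.
CHECKPOINT_DIR=checkpoint_released bash scripts/eval/ptest_vis_cap_box.sh
```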
You can use our preprocessed datasets; for convenience, simply run our download script.
```bash
cd videollama2
bash scripts/download/download_dataset.sh
```
This will download the datasets via `gdown` and unzip them to `videollama2/data`.
To develop or evaluate ObjectMLLM with the VideoLLaMA2 backbone, you will also need the raw videos from the CLEVRER, Perception Test, STAR, and NExT-QA datasets. Please download them as needed. Note that IntentQA shares the same video source as NExT-QA.
To train ObjectMLLM with the VideoLLaMA2 backbone, we provide training scripts in `videollama2/scripts/train`. You can run them inside the `videollama2` folder. For example, to train a model on Perception Test, run
```bash
cd videollama2
bash scripts/train/ptest_vis_box.sh
```
In each script, you need to set `YOUR_VIDEO_DIR` to the path of your downloaded dataset videos. All the experiments run on GPUs with 40G memory. You may adjust `ARG_NPROC_PER_NODE` according to the number of available GPUs.
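Before launching, it can save time to confirm that the directory you plan to plug into `YOUR_VIDEO_DIR` actually contains the raw videos; the path and file extensions below are placeholders:

```bash
# Illustrative check on the raw video directory before editing YOUR_VIDEO_DIR in the script
VIDEO_DIR=/path/to/raw/videos
find "$VIDEO_DIR" -type f \( -name "*.mp4" -o -name "*.avi" -o -name "*.webm" \) | wc -l
```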
We release our VideoLLaMA2 checkpoints trained on CLEVRER-MC, Perception Test, STAR, NExT-QA, and IntentQA. You can use our script to download them.
```bash
cd videollama2
bash scripts/download/download_checkpoint.sh
```
This will download and unzip the checkpoints to `videollama2/checkpoint_released`.
Our evaluation scripts are in `videollama2/scripts/eval`. They are executable on a single 24G GPU (e.g., GeForce RTX 3090).
To evaluate our released checkpoints, please set `CHECKPOINT_DIR=checkpoint_released`. To evaluate checkpoints reproduced by our training scripts, set `CHECKPOINT_DIR=checkpoint`.
This repo is built upon Vamos and VideoLLaMA2.
```bibtex
@misc{tang2025objectmllm,
    title={How Can Objects Help Video-Language Understanding?},
    author={Zitian Tang and Shijie Wang and Junho Cho and Jaewook Yoo and Chen Sun},
    year={2025},
    eprint={2504.07454},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
```
ObjectMLLM is released under the MIT License.