One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory, ICCV 2025
Official PyTorch code for TrajViT, an efficient video tokenization paradigm and video transformer encoder. TrajViT tokenizes video with panoptic sub-object trajectories, surpassing traditional space-time patch tokenization by a large margin on video understanding tasks while using 10x fewer tokens.
- support joint training on image data
- support attentive probing evaluations on action classification and localization tasks
- release model checkpoints
- instructions for using the Panda-70M training set
The required packages consist of two parts: one for generating trajectories from videos and one for training the video ViT encoder. Below are detailed steps to set up both parts in a single conda environment.
# create and activate conda environment
conda create --name trajvit python=3.10
conda activate trajvit
# go to traj_gen folder to install trajectory generation packages
cd traj_gen/
pip install -e ".[dev]"
# go back to repository folder to install the remaining packages
cd ..
pip install -r requirements.txt
# download sam2-small checkpoint
wget -P checkpoints/ https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_small.pt
In your .bashrc file, set the environment variables:
export SL_EXP_DIR="<path-to-trajvit-repo>/results"
export SL_DATA_DIR="<path-to-trajvit-repo>/data"
These variables are accessed by the yaml files in the configs/ directory and the shell scripts in entry/run.sh.
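As a quick sanity check before launching any jobs, you can verify the variables are visible to Python. The snippet below is a minimal illustration (not part of the codebase) and only assumes the two variables above have been exported:

```python
import os

# Read the environment variables that the configs and entry/run.sh rely on.
exp_dir = os.environ.get("SL_EXP_DIR")
data_dir = os.environ.get("SL_DATA_DIR")

if exp_dir is None or data_dir is None:
    raise RuntimeError("Please export SL_EXP_DIR and SL_DATA_DIR (e.g. in your .bashrc).")

# Example: the pretrain config expects annotation files under ${SL_DATA_DIR}/metadata.
print("experiment outputs:", exp_dir)
print("annotation files:  ", os.path.join(data_dir, "metadata"))
```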
[Optional] Our codebase supports using wandb to monitor training. If you want to use wandb, you will need to set it up following this very short instruction, and also set wandb.enable in the configs to True.
We provide demo code in [demo.py] that demonstrates how to run inference on a video with our model. Simply run
python demo.py --video_path example/example.mp4
to get the result. If you want to visualize the generated trajectories, you can additionally pass the argument:
python demo.py --video_path example/example.mp4 --visualize_tracks
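If you want to run the demo over a folder of videos instead of a single file, a small wrapper like the sketch below works; it only relies on the --video_path (and optional --visualize_tracks) arguments shown above, and the folder path is just a placeholder:

```python
import subprocess
from pathlib import Path

video_dir = Path("example")  # hypothetical folder containing .mp4 files

for video in sorted(video_dir.glob("*.mp4")):
    # Invoke the provided demo script for each video; append --visualize_tracks
    # to the argument list if you also want the trajectory visualization.
    subprocess.run(
        ["python", "demo.py", "--video_path", str(video)],
        check=True,
    )
```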
It is recommended to save data annotation files under ${SL_DATA_DIR}. For example, the config file configs/pretrain.yaml assumes ${SL_DATA_DIR}/metadata is the directory containing all data annotation files. Below, we use the MSRVTT dataset as a demonstration.
Depending on the dataset you want to train or evaluate on, you will download videos from different sources. For MSRVTT, you can download the videos with:
wget -P data/videodata/ https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip
unzip data/videodata/MSRVTT.zip -d data/videodata/
The annotation file is in JSON format and can be loaded as a list of dictionaries. Each dictionary is {'image': path_to_image, 'caption': image_caption} for an image-text dataset and {'video': path_to_video, 'caption': video_caption} for a video-text dataset. We already provide train and eval annotation files for the MSRVTT dataset in the [data/metadata/] folder.
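For reference, the sketch below builds and saves a small annotation file in this format for a hypothetical custom video-text dataset (the file names, paths, and captions are placeholders):

```python
import json

# A video-text annotation file: a list of {'video': ..., 'caption': ...} dicts.
annotations = [
    {"video": "videodata/my_dataset/clip_0001.mp4", "caption": "a dog catches a frisbee"},
    {"video": "videodata/my_dataset/clip_0002.mp4", "caption": "a person slices a tomato"},
]

with open("data/metadata/my_new_dataset_train.json", "w") as f:
    json.dump(annotations, f, indent=2)

# Loading it back yields a list of dictionaries, as expected by the training code.
with open("data/metadata/my_new_dataset_train.json") as f:
    loaded = json.load(f)
print(len(loaded), loaded[0]["caption"])
```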
In configs/pretrain.yaml, add the name and paths of your annotation file under the available_corpus entry. For example: my_new_dataset: [path_to_json, path_to_video_directory, video].
It is recommended to pre-generate trajectories for all data and save them to disk, so you won't need to regenerate trajectories for every model inference. The trajectory generation script takes a data annotation JSON file as input. For example, to generate trajectories for the MSRVTT training split, run:
cd traj_gen
python training/generate_traj.py --json_path ../data/metadata/msrvtt_train.json --video_dir ../data/videodata/MSRVTT/videos/all/ --use_key_frame
The eval split can be generated similarly. We also support splitting the dataset into several chunks that can be run separately to speed up trajectory generation. The command is:
python training/generate_traj.py --split 1 --total_split 10 --json_path ../data/metadata/msrvtt_train.json --video_dir ../data/videodata/MSRVTT/videos/all/ --use_key_frame
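The exact split assignment is handled inside generate_traj.py; conceptually, --split i with --total_split N processes roughly the i-th of N equal chunks of the annotation list, as in this sketch (the chunking details in the actual script may differ):

```python
import json

def get_chunk(json_path, split, total_split):
    """Return the slice of annotations one worker would process.

    Mirrors the idea behind --split/--total_split; the real script's
    chunking logic may differ in details.
    """
    with open(json_path) as f:
        annotations = json.load(f)
    chunk_size = (len(annotations) + total_split - 1) // total_split  # ceiling division
    start = (split - 1) * chunk_size  # --split appears to be 1-indexed in the example above
    return annotations[start:start + chunk_size]

# e.g. the portion handled by `--split 1 --total_split 10`
chunk = get_chunk("../data/metadata/msrvtt_train.json", split=1, total_split=10)
print(f"this worker handles {len(chunk)} videos")
```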
Launch pre-training with the following command. This assumes running on 8 GPUs.
cd entry
bash run.sh --task pretrain --train_corpus msrvtt_train --exp_name debug_exp --ngpus 8 --nnode 1 --model trajvit --log_wandb
The full set of command-line arguments you can specify can be seen in entry/run.sh. --model can be either trajvit or the baseline vit3d.
For zero-shot retrieval evaluation, run
cd entry
bash run.sh --task zero_shot_eval --test_corpus msrvtt_test --model trajvit --ckpt <path-to-model-checkpoint> --log_wandb
We will update the repo soon with evaluation scripts on more tasks.
This code uses resources from singularity, transformers, ALBEF, ClipBERT, and frozen. The code is implemented in PyTorch. We thank the authors for open-sourcing their awesome projects.
If you found our paper or code useful, please cite:
@article{zheng2025one,
title={One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory},
author={Zheng, Chenhao and Zhang, Jieyu and Salehi, Mohammadreza and Gao, Ziqi and Iyengar, Vishnu and Kobori, Norimasa and Kong, Quan and Krishna, Ranjay},
journal={arXiv preprint arXiv:2505.23617},
year={2025}
}