We focus on fine-grained spatiotemporal grounding in egocentric videos.
Specifically, we conduct a quantitative and systematic analysis of the discrepancies between egocentric and exocentric videos, revealing key challenges such as shorter object durations, sparser trajectories, smaller object sizes, and larger positional shifts.
To address the absence of related datasets, we develop an automatic data annotation pipeline and propose EgoMask, the first pixel-level benchmark for fine-grained spatiotemporal grounding in egocentric videos, as well as EgoMask-Train, a large-scale training dataset to facilitate model development.
Experiments demonstrate that existing state-of-the-art models struggle with egocentric videos but significantly improve when fine-tuned on our training data, while retaining performance on exocentric benchmarks.
Comparison of video grounding tasks. We propose the first pixel-level benchmark for fine-grained spatiotemporal grounding in egocentric videos.
- [2025-08-04] We release the evaluation code and fine-tuning code.
- [2025-08-01] We release the paper and data.
We provide the annotations on Hugging Face.
See dataset/README.md for details.
Statistics of EgoMask-Train and a comparison with existing egocentric training datasets.
Statistics of EgoMask and a comparison with existing exocentric benchmarks.
Note:
- Total Duration (%): the percentage of the video during which the object appears.
- Mask Area (%): the average area of the annotated mask relative to the frame size, which reveals the object's size.
- # Traj: the number of the object's continuous trajectories throughout the video, where a trajectory is defined as one consecutive appearance in the video.
- Avg. Traj. Length (%): the average trajectory duration relative to the whole video.
- Disappear. Ratio (%): the mean ratio of disappearance duration to trajectory duration. Together, Avg. Traj. Length and Disappear. Ratio reveal the sparsity of the continuous trajectories.
- Adj. Mask IoU (%): the positional shift across adjacent frames, measured as the IoU of the target object's masks (a minimal sketch for computing these statistics follows this list).
- Expr.: short for expression.
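Below is a minimal, unofficial sketch of how these statistics can be computed for one object from its per-frame binary masks. The variable names and the exact pairing used for Disappear. Ratio are our assumptions, not the released evaluation code:

```python
# Unofficial sketch of the statistics above. `masks` is a hypothetical list
# of HxW boolean arrays, one per frame; an all-False mask means the object
# is absent in that frame.
import numpy as np

def object_statistics(masks):
    present = np.array([m.any() for m in masks])  # per-frame visibility
    n = len(masks)

    # Total Duration (%): fraction of frames in which the object appears.
    total_duration = 100.0 * present.sum() / n

    # Mask Area (%): mean mask area over frame size, on visible frames only.
    areas = [100.0 * m.mean() for m, p in zip(masks, present) if p]
    mask_area = float(np.mean(areas)) if areas else 0.0

    # # Traj / Avg. Traj. Length (%): consecutive runs of visible frames.
    padded = np.concatenate(([False], present, [False]))
    starts = np.flatnonzero(~padded[:-1] & padded[1:])   # run begins
    ends = np.flatnonzero(padded[:-1] & ~padded[1:])     # run ends (exclusive)
    num_traj = len(starts)
    avg_traj_len = 100.0 * float(np.mean(ends - starts)) / n if num_traj else 0.0

    # Disappear. Ratio (%): approximated here as total gap time between
    # trajectories over total trajectory time (our assumption on the pairing).
    if num_traj > 1:
        disappear = 100.0 * (starts[1:] - ends[:-1]).sum() / (ends - starts).sum()
    else:
        disappear = 0.0

    # Adj. Mask IoU (%): mean IoU of the object's masks in adjacent frames.
    ious = [np.logical_and(a, b).sum() / np.logical_or(a, b).sum()
            for a, b in zip(masks[:-1], masks[1:])
            if np.logical_or(a, b).any()]
    adj_iou = 100.0 * float(np.mean(ious)) if ious else 0.0

    return {"total_duration": total_duration, "mask_area": mask_area,
            "num_traj": num_traj, "avg_traj_len": avg_traj_len,
            "disappear_ratio": disappear, "adj_mask_iou": adj_iou}
```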
Prepare benchmark data
- Download this repo.
git clone https://github.com/LaVi-Lab/EgoMask.git
- Download the dataset annotations and make sure the benchmark path is dataset/egomask (a Python alternative to the CLI command is sketched after these steps).
cd EgoMask
hf download XuuuXYZ/EgoMask --repo-type dataset --local-dir dataset
- Follow dataset/README.md to access the Ego4D dataset and process the benchmark images.
cd dataset
bash preprocess/process_refego.sh
bash preprocess/process_egotracks_for_benchmark.sh
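If you prefer Python over the hf CLI, the same snapshot can be fetched with huggingface_hub, equivalent to the download command in the second step above:

```python
# Equivalent to: hf download XuuuXYZ/EgoMask --repo-type dataset --local-dir dataset
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="XuuuXYZ/EgoMask",
    repo_type="dataset",
    local_dir="dataset",
)
```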
Please follow this repo to set up the environment & download models.
Then add the checkpoint paths and config paths in scripts/eval_groundedsam2.sh and run the following scripts.
# bash scripts/eval_groundedsam2.sh [DATASET_TYPE]
bash scripts/eval_groundedsam2.sh long
bash scripts/eval_groundedsam2.sh mid
bash scripts/eval_groundedsam2.sh short
Please follow this repo to set up the environment & download models. Then run the following scripts.
# bash scripts/eval_videolisa.sh [DATASET_TYPE] [MODEL_PATH]
bash scripts/eval_videolisa.sh long ZechenBai/VideoLISA-3.8B
bash scripts/eval_videolisa.sh mid ZechenBai/VideoLISA-3.8B
bash scripts/eval_videolisa.sh short ZechenBai/VideoLISA-3.8B
Please follow this repo to set up the environment & download models. Then run the following scripts.
# bash scripts/eval_sa2va.sh [DATASET_TYPE] [MODEL_PATH]
bash scripts/eval_sa2va.sh long ByteDance/Sa2VA-4B
bash scripts/eval_sa2va.sh mid ByteDance/Sa2VA-4B
bash scripts/eval_sa2va.sh short ByteDance/Sa2VA-4B
We use VideoLISA-3.8B as our base model.
- Follow VideoLISA to set up the environment and prepare data. The data structure should be as follows (a quick layout check is sketched after the tree):
videolisa_data/
├─ LISA_data/
│  ├─ ade20k/
│  ├─ coco/
│  ├─ cocostuff/
│  ├─ llava_dataset/
│  ├─ mapillary/
│  ├─ reason_seg/
│  ├─ refer_seg/
│  └─ vlpart/
├─ mevis/
├─ ReasonVOS/
├─ ref-davis/
├─ egomask-train/
├─ ref-youtube-vos/
├─ ReVOS/
└─ YTVOS/
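As a quick sanity check before launching fine-tuning, you can verify that the layout above is in place. The directory names below are taken from the tree; the root location is an assumption to adjust:

```python
# Optional sanity check for the data layout shown above.
import os

ROOT = "videolisa_data"  # assumption: change to your actual data root
EXPECTED = [
    "LISA_data/ade20k", "LISA_data/coco", "LISA_data/cocostuff",
    "LISA_data/llava_dataset", "LISA_data/mapillary", "LISA_data/reason_seg",
    "LISA_data/refer_seg", "LISA_data/vlpart",
    "mevis", "ReasonVOS", "ref-davis", "egomask-train",
    "ref-youtube-vos", "ReVOS", "YTVOS",
]

missing = [d for d in EXPECTED if not os.path.isdir(os.path.join(ROOT, d))]
print("Missing:", missing if missing else "none - layout looks complete")
```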
- Use utils/ and train_joint.py from src/training/videolisa to replace the original directory and file.
- Copy scripts/FT_videolisa.sh to the downloaded code repo and then run this script.
We use Sa2VA-4B as our base model.
- Follow Sa2VA to set up the environment, download models, and prepare data. The data structure should be as follows:
sa2va_data/
└─ video_datas/
   ├─ rvos/
   ├─ mevis/
   ├─ revos/
   └─ egomask-train/
- Copy the files in datasets, models, and configs from src/training/sa2va/projects/llava_sam2 to the corresponding original directories.
- Change the model paths or config paths in projects/llava_sam2/sa2va_4b_FT_with_egomask-train.py if necessary.
- Run the following scripts.
# Training Script
bash tools/dist.sh train projects/llava_sam2/configs/sa2va_4b_FT_with_egomask-train.py 8
# Convert the trained model to Hugging Face format
python projects/llava_sam2/hf/convert_to_hf.py projects/llava_sam2/configs/sa2va_4b_FT_with_egomask-train.py --pth-model [PATH_TO_PTH_MODEL] --save-path [PATH_TO_SAVE_FOLDER]
We would like to thank the following works for their contributions to the open-source codebase and community!
- EgoTracks, RefEgo: the datasets we use.
- Grounded-SAM2, Sa2VA, VideoLISA: We refer to these works for the data processing and evaluation setup.
If you find our EgoMask useful for your research, please consider giving this repository a star and citing our paper as follows:
@article{liang2025finegrained,
title={Fine-grained Spatiotemporal Grounding on Egocentric Videos},
author={Shuo Liang and Yiwu Zhong and Zi-Yuan Hu and Yeyao Tao and Liwei Wang},
journal={arXiv preprint arXiv:2508.00518},
year={2025},
}