Authors: Erhang Zhang#, Junyi Ma#, Yin-Dong Zheng, Yixuan Zhou, Hesheng Wang*
EgoLoc is a vision-language model (VLM)-based framework that localizes hand-object contact and separation timestamps in egocentric videos in a zero-shot manner. Our approach extends the traditional scope of temporal action localization (TAL) to a finer level, which we define as temporal interaction localization (TIL).
Read our paper (accepted at IROS 2025).
We greatly appreciate Yuchen Xie for helping organize our repository and develop the VDA-based version.
We provide two demo videos from the EgoPAT3D-DT dataset for quick experimentation.
conda create -n egoloc python=3.10 -y && conda activate egoloc && \
git clone https://github.com/IRMVLab/EgoLoc.git && cd EgoLoc && \
pip install -r requirements.txt
Grounded-SAM Dependency Installation (Mandatory)
git clone --recursive https://github.com/IDEA-Research/Grounded-Segment-Anything.git
If you plan to use CUDA (recommended for speed) outside Docker, set:
export AM_I_DOCKER=False
export BUILD_WITH_CUDA=True # ensures CUDA kernels are compiled
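Before compiling the CUDA extensions below, it can help to confirm that PyTorch actually sees your GPU (this assumes requirements.txt pulled in a CUDA-enabled PyTorch build; it is only an optional sanity check):

```bash
# Optional: verify that PyTorch detects the GPU and reports a CUDA version
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
```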
# 4-A Segment Anything (SAM)
python -m pip install -e Grounded-Segment-Anything/segment_anything
# 4-B Grounding DINO
pip install --no-build-isolation -e Grounded-Segment-Anything/GroundingDINO
# Diffusers (for prompt-based image generation; optional but handy)
pip install --upgrade 'diffusers[torch]'
git submodule update --init --recursive
cd Grounded-Segment-Anything/grounded-sam-osx
bash install.sh # compiles custom ops
cd ../.. # return to project root
git clone https://github.com/xinyu1205/recognize-anything.git
pip install -r recognize-anything/requirements.txt
pip install -e recognize-anything/
pip install opencv-python pycocotools matplotlib onnxruntime onnx ipykernel
These are needed for COCO-format mask export, ONNX export, and Jupyter notebooks.
cd Grounded-Segment-Anything
# Grounding DINO (Swin-T, object-grounded captions)
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
# Segment Anything (ViT-H)
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
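Both checkpoints are large (several hundred MB to a few GB), so a quick size check helps catch truncated downloads before running the demos:

```bash
# Tiny file sizes here usually indicate an interrupted download -- re-run the wget above
ls -lh groundingdino_swint_ogc.pth sam_vit_h_4b8939.pth
```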
Step 8: Download the BERT backbone (for text embeddings). Run this inside the Grounded-Segment-Anything repo:
git clone https://huggingface.co/google-bert/bert-base-uncased
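If the cloned bert-base-uncased folder contains only small LFS pointer files instead of the actual weights, installing Git LFS before cloning usually resolves it (a general Git LFS step, not an EgoLoc-specific requirement):

```bash
# Git LFS is needed so the Hugging Face clone fetches the real weight files
git lfs install
git clone https://huggingface.co/google-bert/bert-base-uncased
```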
Quick installation (extra steps for the 3D demo)
# ---- inside the EgoLoc root --------------------------------------------------
# 1) external repos
git clone https://github.com/geopavlakos/hamer.git
git clone https://github.com/DepthAnything/Video-Depth-Anything.git
# 2) python packages
# Install HaMeR dependencies (Note: MANO model is NOT required for this installation)
# Install VDA dependencies
pip install opencv-python matplotlib scipy tqdm
# 3) Get Video-Depth-Anything checkpoint
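Video-Depth-Anything publishes its checkpoints on Hugging Face; the repository name and filename below are assumptions for the small (vits) encoder used by the 3D demo, so verify the exact URL against the VDA README before downloading:

```bash
# Assumed checkpoint location -- confirm the URL in the Video-Depth-Anything README
mkdir -p Video-Depth-Anything/checkpoints
wget -P Video-Depth-Anything/checkpoints \
  https://huggingface.co/depth-anything/Video-Depth-Anything-Small/resolve/main/video_depth_anything_vits.pth
```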
- If you encounter a module error regarding segment_anything, add an __init__.py file inside ./Grounded-Segment-Anything/segment_anything/ with the following content:
  from .segment_anything import SamPredictor, sam_model_registry
- If you encounter a bug, please do not hesitate to open a PR.
We provide both 2D and 3D demos for you to test out.
We provide several example videos to demonstrate how our 2D version of EgoLoc performs in a closed-loop setup. To run the demo:
python egoloc_2D_demo.py \
--video_path ./video1.mp4 \
--output_dir output \
--config Grounded-Segment-Anything/GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
--grounded_checkpoint Grounded-Segment-Anything/groundingdino_swint_ogc.pth \
--sam_checkpoint Grounded-Segment-Anything/sam_vit_h_4b8939.pth \
--bert_base_uncased_path Grounded-Segment-Anything/bert-base-uncased/ \
--text_prompt hand \
--box_threshold 0.3 \
--text_threshold 0.25 \
--device cuda \
--credentials auth.env \
--action "Grasping the object" \
--grid_size 3 \
--max_feedbacks 1
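The --credentials argument points to a small env file holding your OpenAI API key. A minimal sketch is shown below; the exact variable name expected by the demo script is an assumption, so check egoloc_2D_demo.py if authentication fails:

```bash
# auth.env -- variable name assumed; adjust to whatever the demo script reads
OPENAI_API_KEY=sk-...
```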
The temporal interaction localization results will be saved in the output directory.
| Video | Contact Frame | Separation Frame |
|---|---|---|
| video1 | ![]() | ![]() |
| video2 | ![]() | ![]() |
Note: Due to inherent randomness in VLM-based reasoning, EgoLoc may produce slightly different results on different runs.
We also provide our newest 3D version of EgoLoc, which uses 3D hand velocities for adaptive sampling. Video Depth Anything (VDA) is used here to synthesize pseudo-depth observations, removing the reliance on RGB-D cameras and enabling more flexible deployment. To run the demo:
python egoloc_3D_demo.py \
--video_path video3.mp4 \
--output_dir output \
--device cuda \
--credentials auth.env \
--encoder vits \
--grid_size 3
The temporal interaction localization results will be saved in the output directory.
| Video | Pseudo Depth | Contact Frame | Separation Frame |
|---|---|---|---|
| video3 | ![]() | ![]() | ![]() |
Note: Due to inherent randomness in VLM-based reasoning, EgoLoc may produce slightly different results on different runs.
Here are some key arguments you can adjust when running EgoLoc. For file paths related to GroundedSAM, please refer to its original repository.
- video_path: Path to the input egocentric video
- output_dir: Directory to save the output frames and results
- text_prompt: Prompt used for hand grounding (e.g., "hand")
- box_threshold: Threshold for hand box grounding confidence
- grid_size: Grid size for image tiling used in VLM prompts (see the sketch after this list)
- max_feedbacks: Number of feedback iterations
- credentials: File containing your OpenAI API key
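To make the grid_size idea concrete, here is an illustrative sketch (not the EgoLoc implementation) of tiling sampled frames into a grid_size x grid_size mosaic with per-tile indices, so a VLM can refer to individual timestamps within a single prompt image:

```python
# Illustrative sketch only: tile sampled frames into a grid_size x grid_size mosaic.
import cv2
import numpy as np

def tile_frames(frames, grid_size=3, cell_size=(320, 240)):
    """Arrange up to grid_size**2 frames into one labeled mosaic image."""
    w, h = cell_size
    canvas = np.zeros((grid_size * h, grid_size * w, 3), dtype=np.uint8)
    for idx, frame in enumerate(frames[: grid_size ** 2]):
        r, c = divmod(idx, grid_size)
        cell = cv2.resize(frame, (w, h))
        # Label each tile with its index so the VLM can name a specific frame.
        cv2.putText(cell, str(idx), (10, 30), cv2.FONT_HERSHEY_SIMPLEX,
                    1.0, (0, 255, 0), 2)
        canvas[r * h:(r + 1) * h, c * w:(c + 1) * w] = cell
    return canvas

# Usage: uniformly sample 9 frames from a demo video and save the mosaic.
cap = cv2.VideoCapture("video1.mp4")
total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
frames = []
for t in np.linspace(0, total - 1, 9, dtype=int):
    cap.set(cv2.CAP_PROP_POS_FRAMES, int(t))
    ok, frame = cap.read()
    if ok:
        frames.append(frame)
cap.release()
cv2.imwrite("mosaic.png", tile_frames(frames, grid_size=3))
```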
We plan to release a full version of EgoLoc and additional benchmarks soon. In the future, we will also show:
- How to integrate EgoLoc with state-of-the-art hand motion forecasting frameworks like MMTwin
- How to deploy EgoLoc in robotic manipulation tasks
But for now, feel free to explore the demos and try EgoLoc on your own videos!
If you find EgoLoc useful in your research, please consider citing:
@article{zhang2025zero,
title={Zero-Shot Temporal Interaction Localization for Egocentric Videos},
author={Zhang, Erhang and Ma, Junyi and Zheng, Yin-Dong and Zhou, Yixuan and Wang, Hesheng},
journal={arXiv preprint arXiv:2506.03662},
year={2025}
}
- Add support for 3D hand motion analysis (within 2 weeks)
- Extend to long untrimmed videos (before IROS 2025)
- Improve efficiency of the feedback loop mechanism (before IROS 2025)
We appreciate your interest and patience!
Copyright 2025, IRMV Lab, SJTU.
This project is free software made available under the MIT License. For more details see the LICENSE file.