Authors: Erhang Zhang#, Junyi Ma#, Yin-Dong Zheng, Yixuan Zhou, Hesheng Wang*
EgoLoc is a vision-language model (VLM)-based framework that localizes hand-object contact and separation timestamps in egocentric videos in a zero-shot manner. Our approach extends the traditional scope of temporal action localization (TAL) to a finer level, which we define as temporal interaction localization (TIL).
Read our paper (accepted at IROS 2025).
We greatly appreciate Yuchen Xie for helping organize our repository and develop the VDA-based version.
We provide two demo videos from the EgoPAT3D-DT dataset for quick experimentation.
conda create -n egoloc python=3.10 -y && conda activate egoloc && \
git clone https://github.com/IRMVLab/EgoLoc.git && cd EgoLoc && \
pip install -r requirements.txt
Grounded-SAM Dependency Installation (Mandatory)
git clone --recursive https://github.com/IDEA-Research/Grounded-Segment-Anything.git
If you plan to use CUDA (recommended for speed) outside Docker, set:
export AM_I_DOCKER=False
export BUILD_WITH_CUDA=True # ensures CUDA kernels are compiled
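Before compiling the CUDA extensions below, it can help to confirm that PyTorch actually sees your GPU (this assumes requirements.txt pulled in a CUDA-enabled PyTorch build; it is only an optional sanity check):

```bash
# Optional: verify that PyTorch detects the GPU and reports a CUDA version
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
```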
# 4-A Segment Anything (SAM)
python -m pip install -e Grounded-Segment-Anything/segment_anything
# 4-B Grounding DINO
pip install --no-build-isolation -e Grounded-Segment-Anything/GroundingDINO
# Diffusers (for prompt-based image generation; optional but handy)
pip install --upgrade 'diffusers[torch]'
git submodule update --init --recursive
cd Grounded-Segment-Anything/grounded-sam-osx
bash install.sh # compiles custom ops
cd ../.. # return to project root
git clone https://github.com/xinyu1205/recognize-anything.git
pip install -r recognize-anything/requirements.txt
pip install -e recognize-anything/
pip install opencv-python pycocotools matplotlib onnxruntime onnx ipykernel
These are needed for COCO-format mask export, ONNX export, and Jupyter notebooks.
cd Grounded-Segment-Anything
# Grounding DINO (Swin-T, object-grounded captions)
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
# Segment Anything (ViT-H)
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
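Both checkpoints are large (several hundred MB to a few GB), so a quick size check helps catch truncated downloads before running the demos:

```bash
# Tiny file sizes here usually indicate an interrupted download -- re-run the wget above
ls -lh groundingdino_swint_ogc.pth sam_vit_h_4b8939.pth
```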
Step 8: Download the BERT backbone (for text embeddings). Run this inside the Grounded-Segment-Anything repo:
git clone https://huggingface.co/google-bert/bert-base-uncased
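If the cloned bert-base-uncased folder contains only small LFS pointer files instead of the actual weights, installing Git LFS before cloning usually resolves it (a general Git LFS step, not an EgoLoc-specific requirement):

```bash
# Git LFS is needed so the Hugging Face clone fetches the real weight files
git lfs install
git clone https://huggingface.co/google-bert/bert-base-uncased
```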
Quick installation (extra steps for the 3D demo)
# ---- inside the EgoLoc root --------------------------------------------------
# 1) external repos
git clone https://github.com/geopavlakos/hamer.git
git clone https://github.com/DepthAnything/Video-Depth-Anything.git
# 2) python packages
# Install HaMeR dependencies (Note: MANO model is NOT required for this installation)
# Install VDA dependencies
pip install opencv-python matplotlib scipy tqdm
# 3) Get Video-Depth-Anything checkpoint
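Video-Depth-Anything publishes its checkpoints on Hugging Face; the repository name and filename below are assumptions for the small (vits) encoder used by the 3D demo, so verify the exact URL against the VDA README before downloading:

```bash
# Assumed checkpoint location -- confirm the URL in the Video-Depth-Anything README
mkdir -p Video-Depth-Anything/checkpoints
wget -P Video-Depth-Anything/checkpoints \
  https://huggingface.co/depth-anything/Video-Depth-Anything-Small/resolve/main/video_depth_anything_vits.pth
```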
- If you encounter a module error regarding segment_anything, add an __init__.py file inside ./Grounded-Segment-Anything/segment_anything/ with the following content:
  from .segment_anything import SamPredictor, sam_model_registry
- If you encounter a bug, please do not hesitate to open a PR.
We provide both 2D and 3D demos for you to test out.
We provide several example videos to demonstrate how our 2D version of EgoLoc performs in a closed-loop setup. To run the demo:
python egoloc_2D_demo.py \
--video_path ./video1.mp4 \
--output_dir output \
--config Grounded-Segment-Anything/GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
--grounded_checkpoint Grounded-Segment-Anything/groundingdino_swint_ogc.pth \
--sam_checkpoint Grounded-Segment-Anything/sam_vit_h_4b8939.pth \
--bert_base_uncased_path Grounded-Segment-Anything/bert-base-uncased/ \
--text_prompt hand \
--box_threshold 0.3 \
--text_threshold 0.25 \
--device cuda \
--credentials auth.env \
--action "Grasping the object" \
--grid_size 3 \
--max_feedbacks 1
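The --credentials argument points to a small env file holding your OpenAI API key. A minimal sketch is shown below; the exact variable name expected by the demo script is an assumption, so check egoloc_2D_demo.py if authentication fails:

```bash
# auth.env -- variable name assumed; adjust to whatever the demo script reads
OPENAI_API_KEY=sk-...
```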
The temporal interaction localization results will be saved in the output directory.
| Video | Contact Frame | Separation Frame |
|---|---|---|
| video1 | ![]() | ![]() |
| video2 | ![]() | ![]() |
Note: Due to inherent randomness in VLM-based reasoning, EgoLoc may produce slightly different results on different runs.
We also provide our newest 3D version of EgoLoc, which uses 3D hand velocities for adaptive sampling. Video Depth Anything (VDA) is used here to synthesize pseudo-depth observations, removing the reliance on RGB-D cameras and enabling more flexible deployment. To run the demo:
python egoloc_3D_demo.py \
--video_path video3.mp4 \
--output_dir output \
--device cuda \
--credentials auth.env \
--encoder vits \
--grid_size 3
The temporal interaction localization results will be saved in the output directory.
| Video | Pseudo Depth | Contact Frame | Separation Frame |
|---|---|---|---|
| video3 | ![]() | ![]() | ![]() |
Note: Due to inherent randomness in VLM-based reasoning, EgoLoc may produce slightly different results on different runs.
Here are some key arguments you can adjust when running EgoLoc. For file paths related to GroundedSAM, please refer to its original repository.
- video_path: Path to the input egocentric video
- output_dir: Directory to save the output frames and results
- text_prompt: Prompt used for hand grounding (e.g., "hand")
- box_threshold: Threshold for hand box grounding confidence
- grid_size: Grid size for image tiling used in VLM prompts (see the sketch after this list)
- max_feedbacks: Number of feedback iterations
- credentials: File containing your OpenAI API key
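To make the grid_size idea concrete, here is an illustrative sketch (not the EgoLoc implementation) of tiling sampled frames into a grid_size x grid_size mosaic with per-tile indices, so a VLM can refer to individual timestamps within a single prompt image:

```python
# Illustrative sketch only: tile sampled frames into a grid_size x grid_size mosaic.
import cv2
import numpy as np

def tile_frames(frames, grid_size=3, cell_size=(320, 240)):
    """Arrange up to grid_size**2 frames into one labeled mosaic image."""
    w, h = cell_size
    canvas = np.zeros((grid_size * h, grid_size * w, 3), dtype=np.uint8)
    for idx, frame in enumerate(frames[: grid_size ** 2]):
        r, c = divmod(idx, grid_size)
        cell = cv2.resize(frame, (w, h))
        # Label each tile with its index so the VLM can name a specific frame.
        cv2.putText(cell, str(idx), (10, 30), cv2.FONT_HERSHEY_SIMPLEX,
                    1.0, (0, 255, 0), 2)
        canvas[r * h:(r + 1) * h, c * w:(c + 1) * w] = cell
    return canvas

# Usage: uniformly sample 9 frames from a demo video and save the mosaic.
cap = cv2.VideoCapture("video1.mp4")
total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
frames = []
for t in np.linspace(0, total - 1, 9, dtype=int):
    cap.set(cv2.CAP_PROP_POS_FRAMES, int(t))
    ok, frame = cap.read()
    if ok:
        frames.append(frame)
cap.release()
cv2.imwrite("mosaic.png", tile_frames(frames, grid_size=3))
```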
We plan to release a full version of EgoLoc and additional benchmarks soon. In the future, we will also show:
- How to integrate EgoLoc with state-of-the-art hand motion forecasting frameworks like MMTwin
- How to deploy EgoLoc in robotic manipulation tasks
But for now, feel free to explore the demos and try EgoLoc on your own videos!
If you find EgoLoc useful in your research, please consider citing:
@article{zhang2025zero,
title={Zero-Shot Temporal Interaction Localization for Egocentric Videos},
author={Zhang, Erhang and Ma, Junyi and Zheng, Yin-Dong and Zhou, Yixuan and Wang, Hesheng},
journal={arXiv preprint arXiv:2506.03662},
year={2025}
}
- Add support for 3D hand motion analysis (within 2 weeks)
- Extend to long untrimmed videos (before IROS 2025)
- Improve efficiency of the feedback loop mechanism (before IROS 2025)
We appreciate your interest and patience!
Copyright 2025, IRMV Lab, SJTU.
This project is free software made available under the MIT License. For more details see the LICENSE file.