This repository contains a PyTorch implementation of our ICML-23 paper:
SpotEM: Efficient Video Search for Episodic Memory
Santhosh Kumar Ramakrishnan¹, Ziad Al-Halah², Kristen Grauman¹,³
¹The University of Texas at Austin, ²University of Utah, ³FAIR, Meta AI
Project website: https://vision.cs.utexas.edu/projects/spotem/
The goal in episodic memory (EM) is to search a long egocentric video to answer a natural language query (e.g., "where did I leave my purse?"). Existing EM methods exhaustively extract expensive fixed-length clip features to look everywhere in the video for the answer, which is infeasible for long wearable camera videos that span hours or even days. We propose SpotEM, an approach to achieve efficiency for a given EM method while maintaining good accuracy. SpotEM consists of three key ideas: a novel clip selector that learns to identify promising video regions to search conditioned on the language query; a set of low-cost semantic indexing features that capture the context of rooms, objects, and interactions that suggest where to look; and distillation losses that address optimization issues arising from end-to-end joint training of the clip selector and EM model. Our experiments on 200+ hours of video from the Ego4D EM Natural Language Queries benchmark and three different EM models demonstrate the effectiveness of our approach: computing only 10%-25% of the clip features, we preserve 84%-95%+ of the original EM model's accuracy.
- Install miniforge following the instructions here.
- Create a new mamba / conda environment and activate it.
  ```bash
  mamba create -n spotem python=3.10
  mamba activate spotem
  ```
- Clone the repository and define the `SPOTEM_ROOT` variable.
  ```bash
  git clone ... <download_path>
  export SPOTEM_ROOT=<download_path>
  ```
- Install requirements.
  ```bash
  pip install -r requirements.txt
  ```
- Download NLTK resources.
  ```bash
  python -c "import nltk; nltk.download('punkt_tab')"
  ```
- Download the Ego4D dataset from here and place it in `data/`.
- Download video features (see the example invocation after this list). Valid feature types: `clip`, `imagenet`, `room`, `interaction`, `object`, `egovlp`, `egovlp+imagenet`, `egovlp+rio`, `internvideo`, `internvideo+imagenet` and `internvideo+rio`.
  ```bash
  python utils/download_features.py --feature_types <feature_type_1> <feature_type_2> ...
  ```
- Run the preprocessing script using:
  ```bash
  python VSLNet/utils/prepare_ego4d_dataset.py \
      --input_train_split data/nlq_train.json \
      --input_val_split data/nlq_val.json \
      --input_test_split data/nlq_test_unannotated.json \
      --output_save_path data/dataset/nlq_official_v1
  ```
- Download pretrained models for the EgoVLP, InternVideo, and ReLER methods. SpotEM's RIO feature encoders are also available: the room, interaction, and object encoders.
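As a concrete instance of the feature download step above, the call below fetches the `internvideo+rio` and `imagenet` features in one go. The choice of feature types here is purely illustrative; download whichever of the valid types your experiments need.

```bash
# Illustrative example: download two of the valid feature types listed above.
# Substitute any other valid feature type names as needed.
python utils/download_features.py --feature_types internvideo+rio imagenet
```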
The training scripts for each model are provided in `scripts/training/<em_method_type>/`, where `<em_method_type>` can be `internvideo`, `reler`, or `egovlp`. All training scripts expect `SPOTEM_ROOT` to be set as mentioned in the installation instructions. All scripts also require the GPU ids to be specified as the first argument. Some scripts optionally require the feature sampler efficiency to be specified as the second argument. Here is an example of training a random sampling baseline with the `internvideo` method, 4 GPUs, and 50% sampling efficiency.
```bash
# Copy script to the required experiment path
cp $SPOTEM_ROOT/scripts/training/internvideo/random/train.sh <EXPERIMENT_DIR>
cd <EXPERIMENT_DIR>
bash train.sh "0,1,2,3" 0.5
```
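The same pattern should apply to the other EM methods. For instance, assuming the `egovlp` scripts follow the same `random/train.sh` layout and argument convention (not verified here), a single-GPU run at 25% sampling efficiency would look like:

```bash
# Illustrative variant: assumes scripts/training/egovlp/random/train.sh exists
# and takes the same arguments (GPU ids, then sampling efficiency).
cp $SPOTEM_ROOT/scripts/training/egovlp/random/train.sh <EXPERIMENT_DIR>
cd <EXPERIMENT_DIR>
bash train.sh "0" 0.25
```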
Note that the models are evaluated on the validation split during the training process. For evaluating existing checkpoints, please refer to the VSLNet and RELER repositories for instructions.
Please cite our work if you find this repository useful.
```
@inproceedings{ramakrishnan2023spotem,
  title={SpotEM: Efficient Video Search for Episodic Memory},
  author={Ramakrishnan, Santhosh Kumar and Al-Halah, Ziad and Grauman, Kristen},
  booktitle={International Conference on Machine Learning},
  pages={28618--28636},
  year={2023},
  organization={PMLR}
}
```
Our code is based on publicly available implementations of VSLNet and RELER.