State-of-the-art (Papers with Code)
🚨 Update (22nd July 2025): Instructions for custom datasets have been added!
🔔 Update (16th July 2025): Code has been updated with instructions!
- 🎯 Highlights
- 📜 Abstract
- 🧠 Architecture
- 🛠️ Installation instructions
- 📊 Inference code: Reproduce 30-shot SOTA results in Few-shot COCO
- 🔍 Custom dataset
- 📚 Citation
- 💡 Training-Free: No fine-tuning, no prompt engineering—just a reference image.
- 🖼️ Reference-Based: Segment new objects using just a few examples.
- 🔥 SOTA Performance: Outperforms previous training-free approaches on COCO, PASCAL VOC, and Cross-Domain FSOD.
The performance of image segmentation models has historically been constrained by the high cost of collecting large-scale annotated data. The Segment Anything Model (SAM) alleviates this original problem through a promptable, semantics-agnostic segmentation paradigm, yet still requires manual visual prompts or complex domain-dependent prompt-generation rules to process a new image. Towards reducing this new burden, our work investigates the task of object segmentation when provided with, alternatively, only a small set of reference images. Our key insight is to leverage strong semantic priors, as learned by foundation models, to identify corresponding regions between a reference and a target image. We find that correspondences enable automatic generation of instance-level segmentation masks for downstream tasks and instantiate our ideas via a multi-stage, training-free method incorporating (1) memory bank construction, (2) representation aggregation and (3) semantic-aware feature matching. Our experiments show significant improvements on segmentation metrics, leading to state-of-the-art performance on COCO FSOD (36.8% nAP), PASCAL VOC Few-Shot (71.2% nAP50) and outperforming existing training-free approaches on the Cross-Domain FSOD benchmark (22.4% nAP).
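To make the three stages concrete, here is a heavily simplified, self-contained sketch of the idea, with random vectors standing in for DINOv2 features of reference instances and of candidate regions (e.g. SAM2 mask proposals) in a target image. This is only an illustration of the matching principle, not the repository's implementation:

```python
# Conceptual sketch of the three training-free stages; not the actual code in
# no_time_to_train/. Random vectors stand in for DINOv2 features here.
import numpy as np

def build_memory_bank(reference_feats):
    """(1) Memory bank construction: store per-class reference features."""
    return {cls: np.stack(feats) for cls, feats in reference_feats.items()}

def aggregate(memory_bank):
    """(2) Representation aggregation: one L2-normalised prototype per class."""
    prototypes = {}
    for cls, feats in memory_bank.items():
        p = feats.mean(axis=0)
        prototypes[cls] = p / np.linalg.norm(p)
    return prototypes

def match(target_feats, prototypes):
    """(3) Semantic-aware feature matching: cosine-score each candidate region
    (e.g. the feature of a SAM2 mask proposal) against every class prototype."""
    t = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    return {cls: t @ p for cls, p in prototypes.items()}

rng = np.random.default_rng(0)
refs = {"bird": [rng.normal(size=256) for _ in range(3)],
        "boat": [rng.normal(size=256) for _ in range(3)]}
proposals = rng.normal(size=(5, 256))  # features of 5 candidate masks in a target image
print(match(proposals, aggregate(build_memory_bank(refs))))
```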
git clone https://github.com/miquel-espinosa/no-time-to-train.git
cd no-time-to-train
We will create a conda environment with the required packages.
conda env create -f environment.yml
conda activate no-time-to-train
We will install SAM2 and DINOv2 from source.
pip install -e .
cd dinov2
pip install -e .
cd ..
Please download the COCO dataset and place it in `data/coco`.
We will download the exact SAM2 and DINOv2 checkpoints used in the paper. (Note, however, that SAM2.1 checkpoints are already available and might perform better.)
mkdir -p checkpoints/dinov2
cd checkpoints
wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_large.pt
cd dinov2
wget https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_pretrain.pth
cd ../..
Define useful variables and create a folder for results:
CONFIG=./no_time_to_train/new_exps/coco_fewshot_10shot_Sam2L.yaml
CLASS_SPLIT="few_shot_classes"
RESULTS_DIR=work_dirs/few_shot_results
SHOTS=30
SEED=33
GPUS=4
mkdir -p $RESULTS_DIR
FILENAME=few_shot_${SHOTS}shot_seed${SEED}.pkl
Sample the few-shot reference annotations:
python no_time_to_train/dataset/few_shot_sampling.py \
--n-shot $SHOTS \
--out-path ${RESULTS_DIR}/${FILENAME} \
--seed $SEED \
--dataset $CLASS_SPLIT
Step 1: fill the memory bank with features of the reference instances:
python run_lightening.py test --config $CONFIG \
--model.test_mode fill_memory \
--out_path ${RESULTS_DIR}/memory.ckpt \
--model.init_args.model_cfg.memory_bank_cfg.length $SHOTS \
--model.init_args.dataset_cfgs.fill_memory.memory_pkl ${RESULTS_DIR}/${FILENAME} \
--model.init_args.dataset_cfgs.fill_memory.memory_length $SHOTS \
--model.init_args.dataset_cfgs.fill_memory.class_split $CLASS_SPLIT \
--trainer.logger.save_dir ${RESULTS_DIR}/ \
--trainer.devices $GPUS
Step 2: post-process the memory bank:
python run_lightening.py test --config $CONFIG \
--model.test_mode postprocess_memory \
--model.init_args.model_cfg.memory_bank_cfg.length $SHOTS \
--ckpt_path ${RESULTS_DIR}/memory.ckpt \
--out_path ${RESULTS_DIR}/memory_postprocessed.ckpt \
--trainer.devices 1
Step 3: run inference on the COCO validation set:
python run_lightening.py test --config $CONFIG \
--ckpt_path ${RESULTS_DIR}/memory_postprocessed.ckpt \
--model.init_args.test_mode test \
--model.init_args.model_cfg.memory_bank_cfg.length $SHOTS \
--model.init_args.model_cfg.dataset_name $CLASS_SPLIT \
--model.init_args.dataset_cfgs.test.class_split $CLASS_SPLIT \
--trainer.logger.save_dir ${RESULTS_DIR}/ \
--trainer.devices $GPUS
If you'd like to see inference results online (as they are computed), uncomment lines 1746-1749 in `no_time_to_train/models/Sam2MatchingBaseline_noAMG.py`. Adjust the score threshold parameter `score_thr` as needed to see more or fewer segmented instances.
Images will be saved in `results_analysis/few_shot_classes/`. The image on the left shows the ground truth; the image on the right shows the instances segmented by our training-free method.
Note that in this example we are using the `few_shot_classes` split, so we should only expect to see segmented instances of the classes in this split (not all classes in COCO).
After running all images in the validation set, you should obtain:
BBOX RESULTS:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.368
SEGM RESULTS:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.342
We provide instructions for running our pipeline on a custom dataset. Annotations must always be in COCO format.
TL;DR: To see how to run the full pipeline on a custom dataset, check `scripts/matching_cdfsod_pipeline.sh` together with the example scripts for the CD-FSOD datasets (e.g. `scripts/dior_fish.sh`).
Let's imagine we want to detect boats⛵ and birds🐦 in a custom dataset. To use our method we will need:
- At least 1 annotated reference image for each class (i.e. 1 reference image for boat and 1 reference image for bird)
- Multiple target images to find instances of our desired classes.
We have prepared a toy script that creates a custom dataset from COCO images for a 1-shot setting:
python scripts/make_custom_dataset.py
This will create a custom dataset with the following folder structure:
data/my_custom_dataset/
├── annotations/
│ ├── custom_references.json
│ ├── custom_targets.json
│ └── references_visualisations/
│ ├── bird_1.jpg
│ └── boat_1.jpg
└── images/
├── 429819.jpg
├── 101435.jpg
└── (all target and reference images)
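For reference, `custom_references.json` (and `custom_targets.json`) follow the standard COCO annotation layout. A minimal sketch of that structure is below; the ids, box and polygon values are illustrative only:

```python
# Minimal sketch of a COCO-style reference annotation file.
# All ids, boxes and polygon coordinates below are illustrative only.
import json

custom_references = {
    "images": [
        {"id": 1, "file_name": "429819.jpg", "width": 640, "height": 480},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 16,                    # e.g. bird
            "bbox": [120, 80, 200, 150],          # [x, y, width, height]
            "area": 30000,
            "iscrowd": 0,
            "segmentation": [[120, 80, 320, 80, 320, 230, 120, 230]],  # polygon(s)
        },
    ],
    "categories": [
        {"id": 9, "name": "boat"},
        {"id": 16, "name": "bird"},
    ],
}

print(json.dumps(custom_references, indent=2))
```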
Reference images visualisation (1-shot):
| 1-shot Reference Image for BIRD 🐦 | 1-shot Reference Image for BOAT ⛵ |
|---|---|
| ![]() | ![]() |
We also provide a script to generate instance-level segmentation masks with SAM. This is useful if you only have bounding-box annotations available for the reference images.
# Download sam_h checkpoint. Feel free to use more recent checkpoints (note: code might need to be adapted)
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth -O checkpoints/sam_vit_h_4b8939.pth
# Run automatic instance segmentation from ground truth bounding boxes.
python no_time_to_train/dataset/sam_bbox_to_segm_batch.py \
--input_json data/my_custom_dataset/annotations/custom_references.json \
--image_dir data/my_custom_dataset/images \
--sam_checkpoint checkpoints/sam_vit_h_4b8939.pth \
--model_type vit_h \
--device cuda \
--batch_size 8 \
--visualize
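For context, the script above essentially performs box-prompted SAM segmentation for every ground-truth box. A minimal single-image sketch of that technique with the `segment_anything` API (the box coordinates are illustrative; this is not the script's exact code):

```python
# Minimal sketch of box-prompted SAM segmentation (illustrative only; see
# no_time_to_train/dataset/sam_bbox_to_segm_batch.py for the batched version).
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="checkpoints/sam_vit_h_4b8939.pth").to("cuda")
predictor = SamPredictor(sam)

image = np.array(Image.open("data/my_custom_dataset/images/429819.jpg").convert("RGB"))
predictor.set_image(image)

box = np.array([100, 50, 400, 300])  # ground-truth box in XYXY pixels (illustrative)
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
print(masks.shape, scores)  # (1, H, W) boolean mask and its predicted IoU
```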
Reference images with instance-level segmentation masks (generated by SAM from the ground-truth bounding boxes, 1-shot):
Visualisations of the generated segmentation masks are saved in `data/my_custom_dataset/annotations/custom_references_with_SAM_segm/references_visualisations/`.
| 1-shot Reference Image for BIRD 🐦 (automatically segmented with SAM) | 1-shot Reference Image for BOAT ⛵ (automatically segmented with SAM) |
|---|---|
| ![]() | ![]() |
Convert the reference annotations from COCO format to the pickle format expected by the memory-filling step:
python no_time_to_train/dataset/coco_to_pkl.py \
data/my_custom_dataset/annotations/custom_references_with_segm.json \
data/my_custom_dataset/annotations/custom_references_with_segm.pkl \
1
First, define useful variables and create a folder for the results. For correct visualisation of labels, class names should be ordered by category id as they appear in the json file. E.g. `bird` has category id `16` and `boat` has category id `9`; thus `CAT_NAMES=boat,bird`.
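If you are unsure of the ordering, a quick way to check it (assuming a standard COCO `categories` list in the annotation file):

```python
# Print class names ordered by category id, to build CAT_NAMES in the right order.
import json

with open("data/my_custom_dataset/annotations/custom_references_with_segm.json") as f:
    categories = json.load(f)["categories"]

print(",".join(c["name"] for c in sorted(categories, key=lambda c: c["id"])))
# With the toy dataset above this prints: boat,bird
```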
DATASET_NAME=my_custom_dataset
DATASET_PATH=data/my_custom_dataset
CAT_NAMES=boat,bird
CATEGORY_NUM=2
SHOT=1
YAML_PATH=no_time_to_train/pl_configs/matching_cdfsod_template.yaml
PATH_TO_SAVE_CKPTS=./tmp_ckpts/my_custom_dataset
mkdir -p $PATH_TO_SAVE_CKPTS
Run step 1:
python run_lightening.py test --config $YAML_PATH \
--model.test_mode fill_memory \
--out_path $PATH_TO_SAVE_CKPTS/$DATASET_NAME\_$SHOT\_refs_memory.pth \
--model.init_args.dataset_cfgs.fill_memory.root $DATASET_PATH/images \
--model.init_args.dataset_cfgs.fill_memory.json_file $DATASET_PATH/annotations/custom_references_with_segm.json \
--model.init_args.dataset_cfgs.fill_memory.memory_pkl $DATASET_PATH/annotations/custom_references_with_segm.pkl \
--model.init_args.dataset_cfgs.fill_memory.memory_length $SHOT \
--model.init_args.dataset_cfgs.fill_memory.cat_names $CAT_NAMES \
--model.init_args.model_cfg.dataset_name $DATASET_NAME \
--model.init_args.model_cfg.memory_bank_cfg.length $SHOT \
--model.init_args.model_cfg.memory_bank_cfg.category_num $CATEGORY_NUM \
--trainer.devices 1
Run step 2:
python run_lightening.py test --config $YAML_PATH \
--model.test_mode postprocess_memory \
--ckpt_path $PATH_TO_SAVE_CKPTS/$DATASET_NAME\_$SHOT\_refs_memory.pth \
--out_path $PATH_TO_SAVE_CKPTS/$DATASET_NAME\_$SHOT\_refs_memory_postprocessed.pth \
--model.init_args.model_cfg.dataset_name $DATASET_NAME \
--model.init_args.model_cfg.memory_bank_cfg.length $SHOT \
--model.init_args.model_cfg.memory_bank_cfg.category_num $CATEGORY_NUM \
--trainer.devices 1
If `ONLINE_VIS` is set to `True`, prediction results will be saved in `results_analysis/my_custom_dataset/` and displayed as they are computed. Note that running with online visualisation is much slower. Feel free to change the score threshold `VIS_THR` to see more or fewer segmented instances.
Run step 3:
ONLINE_VIS=True
VIS_THR=0.4
python run_lightening.py test --config $YAML_PATH \
--model.test_mode test \
--ckpt_path $PATH_TO_SAVE_CKPTS/$DATASET_NAME\_$SHOT\_refs_memory_postprocessed.pth \
--model.init_args.model_cfg.dataset_name $DATASET_NAME \
--model.init_args.model_cfg.memory_bank_cfg.length $SHOT \
--model.init_args.model_cfg.memory_bank_cfg.category_num $CATEGORY_NUM \
--model.init_args.model_cfg.test.imgs_path $DATASET_PATH/images \
--model.init_args.model_cfg.test.online_vis $ONLINE_VIS \
--model.init_args.model_cfg.test.vis_thr $VIS_THR \
--model.init_args.dataset_cfgs.test.root $DATASET_PATH/images \
--model.init_args.dataset_cfgs.test.json_file $DATASET_PATH/annotations/custom_targets.json \
--model.init_args.dataset_cfgs.test.cat_names $CAT_NAMES \
--trainer.devices 1
Performance metrics (with the exact same parameters as commands above) should be:
BBOX RESULTS:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.478
SEGM RESULTS:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.458
Visual results are saved in `results_analysis/my_custom_dataset/`. Note that our method also handles negative images, i.e. images that do not contain any instances of the desired classes.
Click images to enlarge ⬇️
| Target image with boats ⛵ (left GT, right predictions) | Target image with birds 🐦 (left GT, right predictions) |
|---|---|
| ![]() | ![]() |

| Target image with boats and birds ⛵🐦 (left GT, right predictions) | Target image without boats or birds 🚫 (left GT, right predictions) |
|---|---|
| ![]() | ![]() |
If you use this work, please cite us:
@article{espinosa2025notimetotrain,
title={No time to train! Training-Free Reference-Based Instance Segmentation},
author={Miguel Espinosa and Chenhongyi Yang and Linus Ericsson and Steven McDonagh and Elliot J. Crowley},
journal={arXiv preprint arXiv:2507.02798},
year={2025},
primaryclass={cs.CV}
}