Official implementation of:
📄 Low-Rank Prompt Adaptation for Open-Vocabulary Object Detection
👨‍💻 Zekun Zhang*, Vu Quang Truong*, Minh Hoai (*equal contribution)
🎯 Accepted at ICCV 2025 MMFM Workshop
Method overview.
We propose a low-rank prompt enhancer module that adapts open-vocabulary object detectors (OVDs) such as GroundingDINO without changing their backbone or head. The enhancer:
- Is lightweight and parameter-efficient
- Learns to improve prompts from only a few labeled images
- Integrates easily into Grounded SAM 2 for unseen object instance segmentation (UOIS)
- ✅ Improves GroundingDINO across multiple OVD datasets
- ✅ Outperforms LoRA, LoSA, BitFit, Prompt Tuning, Res-Tuning and full fine-tuning
- ✅ Enables Grounded SAM 2 to achieve SOTA on UOIS with only 50 box-labeled images
This repository builds on:
- GroundingDINO: https://github.com/IDEA-Research/GroundingDINO
- SAM2: https://github.com/facebookresearch/sam2
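A minimal installation sketch, assuming both dependencies support editable installs as described in their own READMEs (exact steps and CUDA/PyTorch prerequisites may differ; check each repository):

```bash
# Clone and install the two upstream dependencies in editable mode
git clone https://github.com/IDEA-Research/GroundingDINO
pip install -e GroundingDINO

git clone https://github.com/facebookresearch/sam2
pip install -e sam2
```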
You will need to manually download all datasets, extract them, and place them at the same directory level as this repository. The expected structure looks like this:
```
root_dir/
├── PromptAdaptOVD/
├── EgoPER/
├── MSCOCO2017/
├── RarePlanes/
├── PTG/          # EgoPER
├── OIH_VIS/      # HOIST
├── odinw_13/
├── OCID/
├── HouseCat6D/
└── ...
```
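As a quick sanity check, a loop like the following (folder names taken from the tree above) verifies the layout from inside `root_dir/`:

```bash
# Report which expected dataset folders are present next to PromptAdaptOVD/
for d in PromptAdaptOVD EgoPER MSCOCO2017 RarePlanes PTG OIH_VIS odinw_13 OCID HouseCat6D; do
  [ -d "$d" ] && echo "ok:      $d" || echo "missing: $d"
done
```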
This repository only uses the annotated subset of Scenes100. You must ensure that the folder `PromptAdaptOVD/images/annotated/` contains all the annotated images and their metadata. If this folder is missing, Scenes100 experiments will not run correctly.
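For example, a quick check that the folder is populated:

```bash
# Should print a non-zero count of annotated images and metadata files
find PromptAdaptOVD/images/annotated -type f | wc -l
```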
You can download our pretrained enhancer weights here:
➡️ Download Model Weights (Hugging Face)
Place all contents of the extracted weights folder into `PromptAdaptOVD/scripts/groundingdino_baseline/`.
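For example, assuming the download is an archive named `weights.zip` (the archive name here is hypothetical; use whatever the Hugging Face page provides):

```bash
# Extract the pretrained enhancer weights and copy them next to the scripts
unzip weights.zip -d weights_tmp/
cp -r weights_tmp/* PromptAdaptOVD/scripts/groundingdino_baseline/
```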
To train an enhancer:

```bash
cd scripts/groundingdino_baseline
bash train_enhancer.sh rank type
```
To evaluate a trained enhancer:

```bash
cd scripts/groundingdino_baseline
bash eval_enhancer.sh rank type
```
Here, `rank` is the rank of the enhancer and `type` is the feature attention method, which can be `both`, `image`, or `text`.
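For example, to train and evaluate an enhancer with rank 16 and attention over both image and text features (r=16 is the configuration reported in the results tables below):

```bash
cd scripts/groundingdino_baseline
bash train_enhancer.sh 16 both   # train with rank=16, type=both
bash eval_enhancer.sh 16 both    # evaluate the same configuration
```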
Please check the `scripts/groundingdino_baseline` folder for the scripts of the other methods (e.g., LoRA, LoSA, Res-Tuning).
Method | Params % | Scenes100 | EgoPER | HOIST | OV-COCO | RarePlanes | Avg. |
---|---|---|---|---|---|---|---|
Base Model | 0% | 30.84 | 24.83 | 17.47 | 19.97 | 41.54 | 26.04 |
Res-Tuning | 0.06% | 48.59 | 68.05 | 39.61 | 38.04 | 57.36 | 50.33 |
BitFit | 0.06% | 55.55 | 67.00 | 37.37 | 45.00 | 49.09 | 50.80 |
LoRA | 0.68% | 55.74 | 67.36 | 37.66 | 44.76 | 52.02 | 51.51 |
Ours (r=16) | 0.04% | 56.16 | 68.05 | 38.69 | 42.61 | 52.92 | 51.68 |
👉 Our enhancer outperforms all parameter-efficient baselines on average while training the smallest fraction of parameters (0.04%).
Method | Training Images | Overlap F | Boundary F | % ≥ 75 |
---|---|---|---|---|
UCN | 280,000 | 59.4 | 36.5 | 48.0 |
UOAIS-Net | 53,450 | 67.9 | 62.3 | 73.1 |
MSMFormer | 53,450 | 70.5 | 64.9 | 75.3 |
MSMFormer + Refinement | 53,450 | 66.3 | 54.8 | 52.8 |
UOIS-SAM | 5,345 | 79.9 | 72.5 | 78.3 |
Ours (r=16) | 50 | 77.2 | 73.7 | 74.0 |
Method | Input | Training Images | Overlap F | Boundary F | % ≥ 75 |
---|---|---|---|---|---|
UCN | RGB | 280,000 | 45.0 | 22.5 | 48.4 |
UOAIS-Net | RGB | 53,450 | 60.3 | 52.8 | 81.2 |
MSMFormer | RGB | 53,450 | 67.3 | 57.6 | 80.4 |
MSMFormer + Refinement | RGB | 53,450 | 66.7 | 54.9 | 71.3 |
UOIS-SAM | RGB | 5,345 | 70.0 | 66.2 | 84.8 |
Ours (r=16) | RGB | 50 | 82.7 | 78.9 | 89.7 |
📌 All methods above use RGB-only input. Our approach uses only 50 images with box annotations, yet remains competitive with methods trained on thousands of images with mask annotations.
If you find our work useful, please cite:
```bibtex
@inproceedings{zhang2025lowrank,
  title     = {Low-Rank Prompt Adaptation for Open-Vocabulary Object Detection},
  author    = {Zekun Zhang and Vu Quang Truong and Minh Hoai},
  booktitle = {ICCV Workshop on Multi-modal Foundation Models (MMFM)},
  year      = {2025}
}
```
- 📧 Vu Quang Truong: vuquang27102001@gmail.com
- 📧 Zekun Zhang: zekzhang@cs.stonybrook.edu