Official implementation of:
📄 Low-Rank Prompt Adaptation for Open-Vocabulary Object Detection
👨‍💻 Zekun Zhang*, Vu Quang Truong*, Minh Hoai (*equal contribution)
🎯 Accepted at ICCV 2025 MMFM Workshop
Method overview.
We propose a low-rank prompt enhancer module that adapts open-vocabulary object detectors (OVDs) such as GroundingDINO without changing their backbone or head. The enhancer:
- Is lightweight and parameter-efficient
- Learns to improve prompts from only a few labeled images
- Integrates easily into Grounded SAM 2 for unseen object instance segmentation (UOIS)
- ✅ Improves GroundingDINO across multiple OVD datasets
- ✅ Outperforms LoRA, LoSA, BitFit, Prompt Tuning, Res-Tuning and full fine-tuning
- ✅ Enables Grounded SAM 2 to achieve SOTA on UOIS with only 50 box-labeled images
This repository builds on:
- GroundingDINO: https://github.com/IDEA-Research/GroundingDINO
- SAM2: https://github.com/facebookresearch/sam2
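A minimal installation sketch, assuming both dependencies support editable installs as described in their own READMEs (exact steps and CUDA/PyTorch prerequisites may differ; check each repository):

```bash
# Clone and install the two upstream dependencies in editable mode
git clone https://github.com/IDEA-Research/GroundingDINO
pip install -e GroundingDINO

git clone https://github.com/facebookresearch/sam2
pip install -e sam2
```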
You will need to manually download all datasets, extract them, and place them at the same directory level as this repository. The expected structure looks like this:
```
root_dir/
├── PromptAdaptOVD/
├── EgoPER/
├── MSCOCO2017/
├── RarePlanes/
├── PTG/          # EgoPER
├── OIH_VIS/      # HOIST
├── odinw_13/
├── OCID/
├── HouseCat6D/
└── ...
```
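As a quick sanity check, a loop like the following (folder names taken from the tree above) verifies the layout from inside `root_dir/`:

```bash
# Report which expected dataset folders are present next to PromptAdaptOVD/
for d in PromptAdaptOVD EgoPER MSCOCO2017 RarePlanes PTG OIH_VIS odinw_13 OCID HouseCat6D; do
  [ -d "$d" ] && echo "ok:      $d" || echo "missing: $d"
done
```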
This repository only uses the annotated subset of Scenes100. You must ensure that the folder `PromptAdaptOVD/images/annotated/` contains all the annotated images and their metadata. If this folder is missing, Scenes100 experiments will not run correctly.
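For example, a quick check that the folder is populated:

```bash
# Should print a non-zero count of annotated images and metadata files
find PromptAdaptOVD/images/annotated -type f | wc -l
```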
You can download our pretrained enhancer weights here:
➡️ Download Model Weights (Hugging Face)
Place all contents of the extracted weights folder into `PromptAdaptOVD/scripts/groundingdino_baseline/`.
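For example, assuming the download is an archive named `weights.zip` (the archive name here is hypothetical; use whatever the Hugging Face page provides):

```bash
# Extract the pretrained enhancer weights and copy them next to the scripts
unzip weights.zip -d weights_tmp/
cp -r weights_tmp/* PromptAdaptOVD/scripts/groundingdino_baseline/
```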
To train an enhancer:

```bash
cd scripts/groundingdino_baseline
bash train_enhancer.sh rank type
```
To evaluate a trained enhancer:

```bash
cd scripts/groundingdino_baseline
bash eval_enhancer.sh rank type
```
Here, `rank` is the rank of the enhancer and `type` is the feature attention method, which can be `both`, `image`, or `text`.
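For example, to train and evaluate an enhancer with rank 16 and attention over both image and text features (r=16 is the configuration reported in the results tables below):

```bash
cd scripts/groundingdino_baseline
bash train_enhancer.sh 16 both   # train with rank=16, type=both
bash eval_enhancer.sh 16 both    # evaluate the same configuration
```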
Please check the `scripts/groundingdino_baseline` folder for the scripts of the other methods (e.g., LoRA, LoSA, Res-Tuning).
Method | Params % | Scenes100 | EgoPER | HOIST | OV-COCO | RarePlanes | Avg. |
---|---|---|---|---|---|---|---|
Base Model | 0% | 30.84 | 24.83 | 17.47 | 19.97 | 41.54 | 26.04 |
Res-Tuning | 0.06% | 48.59 | 68.05 | 39.61 | 38.04 | 57.36 | 50.33 |
BitFit | 0.06% | 55.55 | 67.00 | 37.37 | 45.00 | 49.09 | 50.80 |
LoRA | 0.68% | 55.74 | 67.36 | 37.66 | 44.76 | 52.02 | 51.51 |
Ours (r=16) | 0.04% | 56.16 | 68.05 | 38.69 | 42.61 | 52.92 | 51.68 |
👉 Our enhancer outperforms all parameter-efficient baselines on average while training the smallest fraction of parameters (0.04%).
Method | Training Images | Overlap F | Boundary F | % ≥ 75 |
---|---|---|---|---|
UCN | 280,000 | 59.4 | 36.5 | 48.0 |
UOAIS-Net | 53,450 | 67.9 | 62.3 | 73.1 |
MSMFormer | 53,450 | 70.5 | 64.9 | 75.3 |
MSMFormer + Refinement | 53,450 | 66.3 | 54.8 | 52.8 |
UOIS-SAM | 5,345 | 79.9 | 72.5 | 78.3 |
Ours (r=16) | 50 | 77.2 | 73.7 | 74.0 |
Method | Input | Training Images | Overlap F | Boundary F | % ≥ 75 |
---|---|---|---|---|---|
UCN | RGB | 280,000 | 45.0 | 22.5 | 48.4 |
UOAIS-Net | RGB | 53,450 | 60.3 | 52.8 | 81.2 |
MSMFormer | RGB | 53,450 | 67.3 | 57.6 | 80.4 |
MSMFormer + Refinement | RGB | 53,450 | 66.7 | 54.9 | 71.3 |
UOIS-SAM | RGB | 5,345 | 70.0 | 66.2 | 84.8 |
Ours (r=16) | RGB | 50 | 82.7 | 78.9 | 89.7 |
📌 All methods above use RGB-only input. Our approach uses only 50 images with box annotations, yet remains competitive with methods trained on thousands of images with mask annotations.
If you find our work useful, please cite:
```bibtex
@inproceedings{zhang2025lowrank,
  title     = {Low-Rank Prompt Adaptation for Open-Vocabulary Object Detection},
  author    = {Zekun Zhang and Vu Quang Truong and Minh Hoai},
  booktitle = {ICCV Workshop on Multi-modal Foundation Models (MMFM)},
  year      = {2025}
}
```
- 📧 Vu Quang Truong: vuquang27102001@gmail.com
- 📧 Zekun Zhang: zekzhang@cs.stonybrook.edu