This is the official repository for "Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation".
Evaluation radar results:
Last updated on 2025/04/09.
First Type
- [CVPR 2022] Grounded Language-Image Pre-training [Paper][Code]
- [CVPR 2022] RegionCLIP: Region-based Language-Image Pretraining [Paper][Code]
- [ECCV 2022] Open Vocabulary Object Detection with Pseudo Bounding-Box Labels [Paper][Code]
- [NeurIPS 2022] DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection [Paper]
- [ECCV 2022] Simple Open-Vocabulary Object Detection with Vision Transformers [Paper][Code]
- [NeurIPS 2023] Scaling Open-Vocabulary Object Detection [Paper][Code]
- [CVPR 2023] DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment [Paper]
- [CVPR 2024] DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection [Paper]
- [ECCV 2024] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection [Paper][Code]
- [CVPR 2024] YOLO-World: Real-Time Open-Vocabulary Object Detection [Paper][Code]
- [arXiv] OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion [Paper][Code]
Second Type
- [ECCV 2022] Detecting Twenty-thousand Classes using Image-level Supervision [Paper][Code]
- [ICLR 2023] Learning Object-Language Alignments for Open-Vocabulary Object Detection [Paper][Code]
- [CVPR 2022] Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model [Paper][Code]
- [ECCV 2022] Open-Vocabulary DETR with Conditional Matching [Paper][Code]
- [ICLR 2022] Open-vocabulary Object Detection via Vision and Language Knowledge Distillation [Paper][Code]
- [CVPR 2022] Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation [Paper][Code]
- [ECCV 2022] PromptDet: Towards Open-vocabulary Detection using Uncurated Images [Paper][Code]
- [CVPR 2023] Aligning Bag of Regions for Open-Vocabulary Object Detection [Paper][Code]
- [NeurIPS 2023] CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection [Paper][Code]
- [CVPR 2023] CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching [Paper][Code]
- [ICCV 2023] Distilling DETR with Visual-Linguistic Knowledge for Open-Vocabulary Object Detection [Paper][Code]
- [arXiv] DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection [Paper][Code]
- [ICCV 2023] EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment [Paper]
- [ICLR 2023] F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models [Paper][Code]
- [ICML 2023] Multi-Modal Classifiers for Open-Vocabulary Object Detection [Paper][Code]
- [CVPR 2023] Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection [Paper][Code]
- [arXiv] Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection [Paper]
- [CVPR 2023] Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers [Paper][Code]
- [CVPR 2024] Taming Self-Training for Open-Vocabulary Object Detection [Paper][Code]
- [ICLR 2024] CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction [Paper][Code]
- [WACV 2024] LP-OVOD: Open-Vocabulary Object Detection by Linear Probing [Paper][Code]
- [ICLR 2022] Language-driven Semantic Segmentation [Paper][Code]
- [CVPR 2024] CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation [Paper][Code]
- [CVPR 2023] Side Adapter Network for Open-Vocabulary Semantic Segmentation [Paper][Code]
- [ECCV 2022] A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model [Paper][Code]
- [ICML 2023] Open-Vocabulary Universal Image Segmentation with MaskCLIP [Paper][Code]
- [ICCV 2023] Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network [Paper][Code]
- [NeurIPS 2023] Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP [Paper][Code]
- [NeurIPS 2023] Learning Mask-aware CLIP Representations for Zero-Shot Segmentation [Paper][Code]
- [CVPR 2024] SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation [Paper][Code]
- [CVPR 2024] Open-Vocabulary Segmentation with Semantic-Assisted Calibration [Paper][Code]
- [CVPR 2024] Transferable and Principled Efficiency for Open-Vocabulary Segmentation [Paper][Code]
- [CVPR 2024] Open-Vocabulary Semantic Segmentation with Image Embedding Balancing [Paper][Code]
- [CVPR 2022] Decoupling Zero-Shot Semantic Segmentation [Paper][Code]
- [CVPR 2023] ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation [Paper][Code]
- [CVPR 2023] Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation [Paper][Code]
- [ICML 2024] Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation [Paper][Code]
- [ICML 2024] SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation [Paper][Code]
- [CVPR 2023] Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs [Paper][Code]
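A common building block across the open-vocabulary methods listed above is classifying region (or pixel) features by cosine similarity against CLIP text embeddings of the class names. A minimal NumPy sketch of that matching step (illustrative only — function and parameter names are ours, and the temperature value mirrors CLIP's learned logit scale of roughly 100, i.e. a temperature of about 0.01, not any specific method's setting):

```python
import numpy as np

def open_vocab_logits(region_feats, text_embeds, temperature=0.01):
    """Cosine-similarity logits between visual features and class-name
    text embeddings, as in CLIP-style open-vocabulary classification.

    region_feats: (num_regions, dim) visual features
    text_embeds:  (num_classes, dim) text embeddings of class names
    Returns:      (num_regions, num_classes) logits
    """
    # L2-normalize both sides so the dot product is cosine similarity.
    r = region_feats / np.linalg.norm(region_feats, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    return (r @ t.T) / temperature
```

Adding a new class at test time then amounts to appending one more text embedding — no retraining of the visual side is needed, which is what makes these detectors and segmenters "open-vocabulary".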
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
COCO 2014 Detection | 2014 | 80 | 83,000 | 41,000 | Box mAP | Project |
COCO 2017 Detection | 2017 | 80 | 118,287 | 5,000 | Box mAP | Project |
PASCAL VOC | 2012 | 20 | 5,717 | 5,823 | Box mAP | Project |
LVIS | 2019 | 1203 | 100,170 | 19,809 | Box mAP | Project |
ODinW | 2022 | 314 | 132,413 | 20,070 | Box mAP | Project |
Objects365 | 2019 | 365 | 600,000 | 38,000 | Box mAP | Project |
Stanford Dogs | 2011 | 120 | 12,000 | 8,580 | Box mAP | Project |
CUB-200-2011 | 2011 | 200 | 5,994 | 5,794 | Box mAP | Project |
Cityscapes | 2016 | 8 | 2,975 | 500 | Box mAP | Project |
Foggy Cityscapes | 2018 | 8 | 2,975 | 500 | Box mAP | Project |
WaterColor | 2018 | 6 | 1,000 | - | Box mAP | Project |
Comic | 2018 | 6 | 1,000 | - | Box mAP | Project |
KITTI | 2012 | 1 | 7,481 | - | Box mAP | Project |
Sim10K | 2016 | 1 | 10,000 | - | Box mAP | Project |
VOC-C | 2019 | 20 | 543,115 | 553,185 | Box mAP | Project |
COCO-C | 2019 | 80 | 11,237,265 | 475,000 | Box mAP | Project |
Cityscapes-C | 2019 | 8 | 282,625 | 47,500 | Box mAP | Project |
CrowdHuman | 2018 | 1 | 15,000 | 4,370 | Box mAP | Project |
OCHuman | 2019 | 1 | - | 2,500 | Box mAP | Project |
WiderPerson | 2019 | 1 | 7,891 | 1,000 | Box mAP | Project |
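All detection benchmarks in the table above report Box mAP: per-class average precision at a box-IoU threshold (COCO-style mAP further averages over thresholds 0.50:0.95), averaged across classes. A minimal sketch of the two ingredients — box IoU and a non-interpolated AP — not the official evaluation code of any benchmark:

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def average_precision(scored_matches, num_gt):
    """Non-interpolated AP for one class.

    scored_matches: list of (confidence, is_true_positive) pairs, where
    a detection is a true positive if it matches an unclaimed ground-truth
    box with IoU above the threshold.
    """
    scored_matches.sort(key=lambda m: -m[0])
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for _, hit in scored_matches:
        tp += hit
        fp += not hit
        precision = tp / (tp + fp)
        recall = tp / num_gt
        # Accumulate area under the precision-recall curve.
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap
```

Official scorers (e.g. pycocotools) additionally interpolate precision and handle crowd regions and per-class matching rules, so numbers from this sketch will not exactly match reported Box mAP.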
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
COCO-Stuff | 2018 | 172 | 118k | 20k | mIoU | Project |
PASCAL VOC 2012 | 2012 | 20 | 1,464 | 1,449 | mIoU | Project |
PASCAL Context | 2014 | 459 | 4,998 | 5,105 | mIoU | Project |
Cityscapes | 2016 | 19 | 2,975 | 500 | mIoU | Project |
ADE20k | 2017 | 150 | 25,574 | 2,000 | mIoU | Project |
MESS* | 2023 | - | - | - | mIoU | Project |
PASCAL-Part | 2023 | 116 | 8,432 | 851 | mIoU | Project |
ADE20k-Part-234 | 2023 | 234 | 7,348 | 1,017 | mIoU | Project |
PASCAL-5i** | 2015 | 20 | - | - | mIoU, FB-IoU | Project |
COCO-20i** | 2014 | 80 | - | - | mIoU, FB-IoU | Project |
FSS-1000 | 2020 | 1000 | 5,200 | 2,400 | mIoU, FB-IoU | Project |
OCHuman | 2019 | 1 | - | 2,231 | AP, AP50, AP75 | Project |
CIS | 2023 | 1 | - | 459 | AP, AP50, AP75 | Project |
COCO-OCC | 2021 | 80 | - | 1,005 | AP, AP50, AP75 | Project |
CamVid | 2008 | 11 | 467 | 233 | mIoU | Project |
UAVid | 2018 | 9 | 200 | 100 | mIoU | Project |
UDD6 | 2018 | 12 | 205 | 45 | mIoU | Project |
*The benchmark includes a wide range of domain-specific datasets.
**The benchmark has different training and testing sets under various settings.
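Most segmentation benchmarks above report mIoU: per-class intersection-over-union between predicted and ground-truth label maps, averaged over the classes present. A minimal NumPy sketch (illustrative, not any benchmark's official scorer; `ignore_index=255` is a common but dataset-dependent convention for unlabeled pixels):

```python
import numpy as np

def miou(pred, gt, num_classes, ignore_index=255):
    """Mean IoU over classes from per-pixel label maps of equal shape."""
    valid = gt != ignore_index
    pred, gt = pred[valid], gt[valid]
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```

Benchmark scorers typically accumulate the intersection and union counts over the whole test set via a confusion matrix before dividing, rather than averaging per-image IoUs, so evaluate dataset-wide when comparing to published numbers.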
If you find our work helpful for your research, please consider citing the following BibTeX entry.
@article{feng2025vision,
title={Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation},
author={Feng, Yongchao and Liu, Yajie and Yang, Shuai and Cai, Wenrui and Zhang, Jinqing and Zhan, Qiqi and Huang, Ziyue and Yan, Hongxi and Wan, Qiao and Liu, Chenguang and others},
journal={arXiv preprint arXiv:2504.09480},
year={2025}
}