Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation

This is the repository for the paper "Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation".

Evaluation Radar Results:

Models

Last updated on 2025/04/09

VLM Detection Models

First Type

  • [CVPR 2022] Grounded Language-Image Pre-training [Paper][Code]
  • [CVPR 2022] RegionCLIP: Region-based Language-Image Pretraining [Paper][Code]
  • [ECCV 2022] Open Vocabulary Object Detection with Pseudo Bounding-Box Labels [Paper][Code]
  • [NeurIPS 2022] DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection [Paper]
  • [ECCV 2022] Simple Open-Vocabulary Object Detection with Vision Transformers [Paper][Code]
  • [NeurIPS 2023] Scaling Open-Vocabulary Object Detection [Paper][Code]
  • [CVPR 2023] DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment [Paper]
  • [CVPR 2024] DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection [Paper]
  • [ECCV 2024] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection [Paper][Code]
  • [CVPR 2024] YOLO-World: Real-Time Open-Vocabulary Object Detection [Paper][Code]
  • [arXiv] OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion [Paper][Code]

Second Type

  • [ECCV 2022] Detecting Twenty-thousand Classes using Image-level Supervision [Paper][Code]

  • [ICLR 2023] Learning Object-Language Alignments for Open-Vocabulary Object Detection [Paper][Code]

  • [CVPR 2022] Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model [Paper][Code]

  • [ECCV 2022] Open-Vocabulary DETR with Conditional Matching [Paper][Code]

  • [ICLR 2022] Open-vocabulary Object Detection via Vision and Language Knowledge Distillation [Paper][Code]

  • [CVPR 2022] Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation [Paper][Code]

  • [ECCV 2022] PromptDet: Towards Open-vocabulary Detection using Uncurated Images [Paper][Code]

  • [CVPR 2023] Aligning Bag of Regions for Open-Vocabulary Object Detection [Paper][Code]

  • [NeurIPS 2023] CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection [Paper][Code]

  • [CVPR 2023] CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching [Paper][Code]

  • [ICCV 2023] Distilling DETR with Visual-Linguistic Knowledge for Open-Vocabulary Object Detection [Paper][Code]

  • [arXiv] DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection [Paper][Code]

  • [ICCV 2023] EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment [Paper]

  • [ICLR 2023] F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models [Paper][Code]

  • [ICML 2023] Multi-Modal Classifiers for Open-Vocabulary Object Detection [Paper][Code]

  • [CVPR 2023] Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection [Paper][Code]

  • [arXiv] Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection [Paper]

  • [CVPR 2023] Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers [Paper][Code]

  • [CVPR 2024] Taming Self-Training for Open-Vocabulary Object Detection [Paper][Code]

  • [ICLR 2024] CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction [Paper][Code]

  • [WACV 2024] LP-OVOD: Open-Vocabulary Object Detection by Linear Probing [Paper][Code]

VLM Segmentation Models

  • [ICLR 2022] Language-driven Semantic Segmentation [Paper][Code]
  • [CVPR 2024] CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation [Paper][Code]
  • [CVPR 2023] Side Adapter Network for Open-Vocabulary Semantic Segmentation [Paper][Code]
  • [ECCV 2022] A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model [Paper][Code]
  • [ICML 2023] Open-Vocabulary Universal Image Segmentation with MaskCLIP [Paper][Code]
  • [ICCV 2023] Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network [Paper][Code]
  • [NeurIPS 2023] Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP [Paper][Code]
  • [NeurIPS 2023] Learning Mask-aware CLIP Representations for Zero-Shot Segmentation [Paper][Code]
  • [CVPR 2024] SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation [Paper][Code]
  • [CVPR 2024] Open-Vocabulary Segmentation with Semantic-Assisted Calibration [Paper][Code]
  • [CVPR 2024] Transferable and Principled Efficiency for Open-Vocabulary Segmentation [Paper][Code]
  • [CVPR 2024] Open-Vocabulary Semantic Segmentation with Image Embedding Balancing [Paper][Code]
  • [CVPR 2022] Decoupling Zero-Shot Semantic Segmentation [Paper][Code]
  • [CVPR 2023] ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation [Paper][Code]
  • [CVPR 2023] Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation [Paper][Code]
  • [ICML 2024] Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation [Paper][Code]
  • [ICML 2024] SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation [Paper][Code]
  • [CVPR 2023] Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs [Paper][Code]

Datasets

Datasets for Detection

| Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
|---|---|---|---|---|---|---|
| COCO 2014 Detection | 2014 | 80 | 83,000 | 41,000 | Box mAP | Project |
| COCO 2017 Detection | 2017 | 80 | 118,287 | 5,000 | Box mAP | Project |
| PASCAL VOC | 2012 | 20 | 5,717 | 5,823 | Box mAP | Project |
| LVIS | 2019 | 1203 | 100,170 | 19,809 | Box mAP | Project |
| ODinW | 2022 | 314 | 132,413 | 20,070 | Box mAP | Project |
| Objects365 | 2019 | 365 | 600,000 | 38,000 | Box mAP | Project |
| Stanford Dogs | 2011 | 120 | 12,000 | 8,580 | Box mAP | Project |
| CUB-200-2011 | 2011 | 200 | 5,994 | 5,794 | Box mAP | Project |
| Cityscapes | 2016 | 8 | 2,975 | 500 | Box mAP | Project |
| Foggy Cityscapes | 2018 | 8 | 2,975 | 500 | Box mAP | Project |
| WaterColor | 2018 | 6 | 1,000 | - | Box mAP | Project |
| Comic | 2018 | 6 | 1,000 | - | Box mAP | Project |
| KITTI | 2012 | 1 | 7,481 | - | Box mAP | Project |
| Sim10K | 2016 | 1 | 10,000 | - | Box mAP | Project |
| VOC-C | 2019 | 20 | 543,115 | 553,185 | Box mAP | Project |
| COCO-C | 2019 | 80 | 11,237,265 | 475,000 | Box mAP | Project |
| Cityscapes-C | 2019 | 8 | 282,625 | 47,500 | Box mAP | Project |
| CrowdHuman | 2018 | 1 | 15,000 | 4,370 | Box mAP | Project |
| OCHuman | 2019 | 1 | - | 2,500 | Box mAP | Project |
| WiderPerson | 2019 | 1 | 7,891 | 1,000 | Box mAP | Project |
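All detection benchmarks above report Box mAP, which is built on box-level intersection-over-union (IoU) between predicted and ground-truth boxes. As a minimal sketch of that underlying IoU computation (illustrative code, not the evaluation scripts used in this repository):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes are disjoint).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = sum of areas minus the overlap.
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

mAP then thresholds this IoU (e.g. at 0.5, or averaged over 0.5:0.95 in COCO style) to decide which predictions count as true positives before averaging precision over recall levels and classes.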

Datasets for Segmentation

| Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
|---|---|---|---|---|---|---|
| COCO-Stuff | 2018 | 172 | 118k | 20k | mIoU | Project |
| PASCAL VOC 2012 | 2012 | 20 | 1,464 | 1,449 | mIoU | Project |
| PASCAL Context | 2014 | 459 | 4,998 | 5,105 | mIoU | Project |
| Cityscapes | 2016 | 19 | 2,975 | 500 | mIoU | Project |
| ADE20k | 2017 | 150 | 25,574 | 2,000 | mIoU | Project |
| MESS* | 2023 | - | - | - | mIoU | Project |
| PASCAL-Part | 2023 | 116 | 8,432 | 851 | mIoU | Project |
| ADE20k-Part-234 | 2023 | 234 | 7,348 | 1,017 | mIoU | Project |
| PASCAL-5i** | 2015 | 20 | - | - | mIoU, FB-IoU | Project |
| COCO-20i** | 2014 | 80 | - | - | mIoU, FB-IoU | Project |
| FSS-1000 | 2020 | 1000 | 5,200 | 2,400 | mIoU, FB-IoU | Project |
| OCHuman | 2019 | 1 | - | 2,231 | AP, AP50, AP75 | Project |
| CIS | 2023 | 1 | - | 459 | AP, AP50, AP75 | Project |
| COCO-OCC | 2021 | 80 | - | 1,005 | AP, AP50, AP75 | Project |
| CamVid | 2008 | 11 | 467 | 233 | mIoU | Project |
| UAVid | 2018 | 9 | 200 | 100 | mIoU | Project |
| UDD6 | 2018 | 12 | 205 | 45 | mIoU | Project |

*The benchmark includes a wide range of domain-specific datasets.

**The benchmark has different training and testing sets under various settings.
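Most segmentation benchmarks above report mIoU: per-class intersection-over-union, averaged over classes. A minimal sketch over flattened integer label maps (illustrative code, not this repository's evaluation scripts; classes absent from both prediction and ground truth are skipped rather than averaged in):

```python
def mean_iou(pred, gt, num_classes):
    """mIoU over flattened label maps, ignoring classes absent from both."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:  # class c appears in prediction or ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious)
```

FB-IoU, used by the few-shot benchmarks (PASCAL-5i, COCO-20i, FSS-1000), is the same idea restricted to two classes: foreground and background.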

BibTeX

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@article{feng2025vision,
  title={Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation},
  author={Feng, Yongchao and Liu, Yajie and Yang, Shuai and Cai, Wenrui and Zhang, Jinqing and Zhan, Qiqi and Huang, Ziyue and Yan, Hongxi and Wan, Qiao and Liu, Chenguang and others},
  journal={arXiv preprint arXiv:2504.09480},
  year={2025}
}
