
X-SAM

From Segment Anything to Any Segmentation

Hao Wang1,2, Limeng Qiao3, Zequn Jie3, Zhijian Huang1, Chengjian Feng3,

Qingfang Zheng1, Lin Ma3, Xiangyuan Lan2📧, Xiaodan Liang1,2📧

1 Sun Yat-sen University, 2 Pengcheng Lab, 3 Meituan Inc

📧 Corresponding authors.

🔥 Updates

  • 2025-07-24: We release the Demo of X-SAM.

🚀 Introduction

This project provides the official PyTorch implementation of X-SAM.

  • X-SAM is a novel unified segmentation MLLM that offers superior performance on all image segmentation benchmarks.

  • X-SAM integrates SAM into MLLMs via a unified formulation adapted to all image segmentation tasks, extending SAM's capability from segment anything to any segmentation.

  • X-SAM co-trains on multiple data sources via an effective multi-stage training strategy, achieving robust performance across all tasks (a minimal co-training sketch follows this list).
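
The multi-dataset co-training mentioned above can be illustrated with a minimal, hypothetical PyTorch sketch: it mixes two toy data sources and samples them with per-source weights, which is one common way to balance heterogeneous segmentation datasets. The dataset names, sizes, and weights below are placeholders, not X-SAM's actual data pipeline or training strategy.

```python
# Hypothetical co-training sketch (not the official X-SAM training code).
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

# Toy stand-ins for different data sources (e.g. generic vs. referring segmentation).
source_a = TensorDataset(torch.randn(1000, 4), torch.randint(0, 2, (1000,)))
source_b = TensorDataset(torch.randn(200, 4), torch.randint(0, 2, (200,)))

mixed = ConcatDataset([source_a, source_b])

# Up-weight the smaller source so both are sampled at a comparable rate.
weights = torch.cat([
    torch.full((len(source_a),), 1.0 / len(source_a)),
    torch.full((len(source_b),), 1.0 / len(source_b)),
])
sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
loader = DataLoader(mixed, batch_size=8, sampler=sampler)

for inputs, labels in loader:
    pass  # one co-training step over the mixed sources would go here
```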

This project provides awesome code for segmentation MLLMs:

  • Training code for segmentation MLLMs.
  • Evaluation code for all image segmentation benchmarks.
  • Visualization code for segmentation MLLMs.
  • Training code for LLaVA-based MLLMs (based on XTuner).
  • Evaluation code for all VLM benchmarks (based on VLMEvalKit).

If you have any questions, please feel free to open an issue.

📄 Abstract

The Segment Anything Model (SAM) has emerged as a pivotal advancement in computer vision, particularly within the context of visual-prompt-driven segmentation. However, SAM is constrained by intrinsic limitations in multi-mask prediction and category-specific image segmentation tasks. Concurrently, Large Language Models (LLMs) have exhibited remarkable proficiency in comprehensive knowledge representation across a wide range of domains, yet they inherently lack the capacity for pixel-level perceptual understanding. To bridge these complementary gaps, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that seamlessly integrates SAM with LLMs, thereby extending SAM's capabilities from segment anything to any segmentation. Specifically, we introduce a novel approach for integrating SAM with MLLMs, which facilitates more advanced dense, pixel-level perceptual comprehension within MLLMs. Furthermore, we propose a new segmentation paradigm, termed Visual GrounDed (VGD) segmentation, which empowers MLLMs with visual grounded, pixel-wise interpretative capabilities. To enable effective training of MLLMs on diverse data sources, we devise a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal pixel-level visual understanding.
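
As a rough illustration of the kind of integration the abstract describes, here is a hypothetical, minimal PyTorch sketch (not the official X-SAM architecture): dense features from a SAM-style image encoder are projected into the LLM token space, and the hidden state of an assumed <SEG> token is used to decode a coarse mask. All module names, dimensions, and the toy Transformer stand-in for the LLM are illustrative assumptions.

```python
# Hypothetical sketch of wiring a SAM-style encoder, an LLM, and a mask decoder.
import torch
import torch.nn as nn

class ToySegMLLM(nn.Module):
    def __init__(self, vision_dim=256, llm_dim=1024, vocab_size=32000):
        super().__init__()
        # Stand-in for a SAM-style ViT image encoder producing dense features.
        self.image_encoder = nn.Conv2d(3, vision_dim, kernel_size=16, stride=16)
        # Projector mapping visual features into the LLM token space.
        self.projector = nn.Linear(vision_dim, llm_dim)
        # Stand-in for the LLM: a small Transformer over visual + text tokens.
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Mask decoder: dot product between the <SEG> token state and per-pixel features.
        self.seg_proj = nn.Linear(llm_dim, vision_dim)

    def forward(self, image, text_ids, seg_token_index):
        feats = self.image_encoder(image)                               # (B, C, H/16, W/16)
        vis_tokens = self.projector(feats.flatten(2).transpose(1, 2))   # (B, HW, D)
        txt_tokens = self.text_embed(text_ids)                          # (B, T, D)
        hidden = self.llm(torch.cat([vis_tokens, txt_tokens], dim=1))
        # Hidden state at the assumed <SEG> token position (offset by the visual tokens).
        seg_hidden = hidden[:, vis_tokens.size(1) + seg_token_index]    # (B, D)
        query = self.seg_proj(seg_hidden)                               # (B, C)
        return torch.einsum("bc,bchw->bhw", query, feats)               # coarse mask logits

if __name__ == "__main__":
    model = ToySegMLLM()
    image = torch.randn(1, 3, 224, 224)
    text_ids = torch.randint(0, 32000, (1, 12))
    print(model(image, text_ids, seg_token_index=5).shape)  # torch.Size([1, 14, 14])
```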

🔍 Overview

📊 Benchmark Results

Please refer to the benchmark results for more details.

✅ TODO

  • Release the demo.
  • Release the model weights.
  • Release the code and instructions for the demo.
  • Release the code for evaluation on all segmentation benchmarks.
  • Release the code for evaluation on all VLM benchmarks.
  • Release the code for training LLaVA-based MLLMs.
  • Release the code for training X-SAM (once the repository reaches 500+ 🌟).

😊 Acknowledgements

This project references several excellent open-source repositories: XTuner, VLMEvalKit, and Sa2VA. Thanks for their wonderful work and contributions to the community.

📌 Citation

If you find X-SAM helpful for your research or applications, please consider giving us a star 🌟 and citing it using the following BibTeX entry.
