Hao Wang1,2, Limeng Qiao3, Zequn Jie3, Zhijian Huang1, Chengjian Feng3,
Qingfang Zheng1, Lin Ma3, Xiangyuan Lan2📧, Xiaodan Liang1,2📧
1 Sun Yat-sen University, 2 Pengcheng Lab, 3 Meituan Inc
📧 Corresponding author.
2025-07-24: We release the Demo of X-SAM.
This project provides the official PyTorch implementation of X-SAM.
- X-SAM is a novel unified segmentation MLLM that offers superior performance on all image segmentation benchmarks.
- X-SAM integrates SAM into MLLMs via a unified formulation adapted to all image segmentation tasks, extending SAM's capability from segment anything to any segmentation (a minimal integration sketch follows this list).
- X-SAM co-trains on multiple data sources via an effective multi-stage training strategy, achieving robust performance across all tasks.
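For a concrete picture of the integration described above, here is a minimal PyTorch-style sketch of one way SAM image features and LLM hidden states could feed a shared mask decoder. It is an illustration under assumed interfaces, not the official X-SAM implementation: the module names, the `seg_token_mask` convention, and the tensor shapes are placeholders.

```python
# Illustrative sketch only -- NOT the official X-SAM code.
# It shows one plausible way to route SAM image features and LLM hidden
# states into a mask decoder, following the high-level description above.
import torch
import torch.nn as nn


class UnifiedSegMLLM(nn.Module):
    def __init__(self, sam_encoder: nn.Module, llm: nn.Module, mask_decoder: nn.Module,
                 llm_dim: int = 4096, seg_dim: int = 256):
        super().__init__()
        self.sam_encoder = sam_encoder    # dense image features (e.g. a SAM ViT)
        self.llm = llm                    # multimodal LLM backbone (placeholder interface)
        self.mask_decoder = mask_decoder  # lightweight pixel decoder
        # project LLM hidden states at segmentation-token positions into decoder space
        self.seg_proj = nn.Linear(llm_dim, seg_dim)

    def forward(self, image: torch.Tensor, input_ids: torch.Tensor,
                seg_token_mask: torch.Tensor) -> torch.Tensor:
        # 1) dense visual features from the SAM encoder
        vis_feats = self.sam_encoder(image)                 # (B, C, H, W)
        # 2) the LLM consumes the image + instruction; positions marked by
        #    seg_token_mask act as segmentation queries (assumed convention)
        hidden = self.llm(input_ids=input_ids, images=image).last_hidden_state
        queries = self.seg_proj(hidden[seg_token_mask])     # (num_queries, seg_dim)
        # 3) the mask decoder turns each query into a binary mask
        masks = self.mask_decoder(vis_feats, queries)       # (num_queries, H, W)
        return masks
```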
This project provides awesome code for segmentation MLLMs:
- Training code for segmentation MLLMs.
- Evaluation code for all image segmentation benchmarks.
- Visualization code for segmentation MLLMs (a minimal mask-overlay sketch follows this list).
- Training code for LLaVA-based MLLMs (based on XTuner).
- Evaluation code for all VLM benchmarks (based on VLMEvalKit).
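As a companion to the visualization item above, the following is a minimal, generic sketch of overlaying predicted masks on an image. It is not code from this repository; it assumes masks arrive as a boolean array of shape (N, H, W) aligned with an RGB image in the 0-255 range.

```python
# Generic mask-overlay helper for visualizing segmentation predictions.
# Not taken from this repository; assumes `masks` is a boolean array of
# shape (N, H, W) aligned with `image` of shape (H, W, 3) in [0, 255].
import numpy as np
import matplotlib.pyplot as plt


def show_masks(image: np.ndarray, masks: np.ndarray, alpha: float = 0.5) -> None:
    overlay = image.astype(np.float32).copy()
    rng = np.random.default_rng(0)
    for mask in masks:
        color = rng.uniform(0, 255, size=3)  # random color per mask
        overlay[mask] = (1 - alpha) * overlay[mask] + alpha * color
    plt.imshow(overlay.astype(np.uint8))
    plt.axis("off")
    plt.show()
```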
If you have any questions, please feel free to open an issue.
The Segment Anything Model (SAM) has emerged as a pivotal advancement in computer vision, particularly within the context of visual-prompt-driven segmentation. However, SAM is constrained by intrinsic limitations in multi-mask prediction and category-specific image segmentation tasks. Concurrently, Large Language Models (LLMs) have exhibited remarkable proficiency in comprehensive knowledge representation across a wide range of domains, yet they inherently lack the capacity for pixel-level perceptual understanding. To bridge these complementary gaps, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that seamlessly integrates SAM with LLMs, thereby extending SAM's capabilities from segment anything to any segmentation. Specifically, we introduce a novel approach for integrating SAM with MLLMs, which facilitates more advanced dense, pixel-level perceptual comprehension within MLLMs. Furthermore, we propose a new segmentation paradigm, termed Visual GrounDed (VGD) segmentation, which empowers MLLMs with visual grounded, pixel-wise interpretative capabilities. To enable effective training of MLLMs on diverse data sources, we devise a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal pixel-level visual understanding.
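The unified training strategy mentioned above boils down to drawing batches from heterogeneous data sources within a single loop. The snippet below is a hedged sketch of such weighted dataset mixing; the sampling scheme, source names, and weights are placeholders rather than the actual X-SAM training configuration.

```python
# Hedged sketch of co-training over multiple data sources with weighted
# random sampling; source names and weights are placeholders, not the
# actual X-SAM training configuration.
import random
from torch.utils.data import DataLoader, Dataset


def mixed_loader(datasets: dict[str, Dataset], weights: dict[str, float],
                 batch_size: int = 8):
    """Yield (source_name, batch) pairs, sampling sources by weight."""
    loaders = {name: iter(DataLoader(ds, batch_size=batch_size, shuffle=True))
               for name, ds in datasets.items()}
    names = list(loaders)
    probs = [weights[n] for n in names]
    while True:
        name = random.choices(names, weights=probs, k=1)[0]
        try:
            batch = next(loaders[name])
        except StopIteration:  # restart an exhausted source and keep sampling
            loaders[name] = iter(DataLoader(datasets[name],
                                            batch_size=batch_size, shuffle=True))
            batch = next(loaders[name])
        yield name, batch


# Example usage (names are illustrative):
# for step, (source, batch) in zip(range(max_steps), mixed_loader(data, w)):
#     loss = model(batch, task=source)
```

Weighted sampling of this kind is one common way to keep small data sources (e.g. referring segmentation) from being drowned out by large ones during co-training.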
Please refer to the benchmark results for more details.
- Release the Demo.
- Release the weights.
- Release the code and instructions for demo.
- Release the code for evaluation on all segmentation benchmarks.
- Release the code for evaluation on all VLM benchmarks.
- Release the code for training LLaVA-based MLLMs.
- Release the code for training X-SAM (once the repository reaches more than 500 🌟).
This project references several excellent open-source repositories: XTuner, VLMEvalKit, and Sa2VA. Thanks for their wonderful work and contributions to the community.
If you find X-SAM helpful for your research or applications, please consider giving us a star 🌟 and citing it using the following BibTeX entry.