
ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models

This repo is the official implementation of "ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models".

We also recommend our related works Seg-Zero and VisionReasoner.

Paper: ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models
HuggingFace Daily: TO BE UPDATED
Data: 🤗 ViSurf_multi_non_object_7300_size840
Model: 🤗 Visurf-7B-Best-on-gRefCOCO 🤗 Visurf-7B-NoThink-Best-on-gRefCOCO
Related Links: VisionReasoner [code], Seg-Zero [code]

Overview of ViSurf:

ViSurf (Visual Supervised-and-Reinforcement Fine-Tuning) is a unified post-training paradigm that integrates the strengths of both SFT and RLVR within a single stage.

News

[Oct. 12th, 2025] 🔥 ViSurf is coming! We have released the code and training data.

Contents

Installation
Inference
Evaluation
Training
Citation
Acknowledgement

Installation

git clone https://github.com/dvlab-research/ViSurf.git
cd ViSurf
conda create -n visionreasoner python=3.12
conda activate visionreasoner
pip install -e .
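
After installation, you can run a quick sanity check. This is a minimal sketch, not part of the repo, and it assumes the editable install pulls in PyTorch and transformers as dependencies:

# A minimal environment sanity check (assumes torch and transformers were
# installed by "pip install -e ."): prints versions and CUDA availability.
import torch
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)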

Inference

Download the pretrained model using the following commands:

mkdir pretrained_models
cd pretrained_models
git lfs install
git clone https://huggingface.co/Ricky06662/Visurf-7B-Best-on-gRefCOCO

Tip

If you encounter issues with connecting to Hugging Face, consider using export HF_ENDPOINT=https://hf-mirror.com.
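
If git lfs is inconvenient, the huggingface_hub library offers an equivalent download. This is a sketch, not the repo's workflow; it assumes huggingface_hub is installed and it honors the HF_ENDPOINT mirror above:

# Alternative to "git lfs" cloning: download the released checkpoint with
# huggingface_hub. Respects HF_ENDPOINT if you exported the mirror above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Ricky06662/Visurf-7B-Best-on-gRefCOCO",
    local_dir="pretrained_models/Visurf-7B-Best-on-gRefCOCO",
)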

Then run inference using:

python inference_scripts/inference_visurf.py

The default question is

"I want to rest, where should I sit?"

You will see the thinking process in the command line, like:

"The question seems to be asking where to sit, but the image only shows a kitchen counter with food and flowers."

The mask will be saved in the inference_scripts folder. In this case, there is no related object.

You can also try finding objects in the image by:

python inference_scripts/inference_visurf.py --text "I want to cook food, what can I use?"

You will see the thinking process in the command line, like:

"The question asks what kitchen tools or ingredients are visible that could be used for cooking."

The mask will be saved in the inference_scripts folder.

You can also provide your own image_path and text by:

python inference_scripts/inference_visurf.py --image_path "your_image_path" --text "your question text"
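
If you prefer to call the model from your own code instead of the provided script, the sketch below may help. It assumes the released checkpoint follows the standard Qwen2.5-VL interface in recent transformers versions; the official inference script additionally handles ViSurf-specific prompting and mask post-processing, so treat this only as a starting point.

# A sketch of programmatic inference, assuming the checkpoint exposes the
# standard Qwen2.5-VL interface (transformers >= 4.49). It does not reproduce
# the ViSurf prompt format or the mask post-processing of the official script.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_path = "pretrained_models/Visurf-7B-Best-on-gRefCOCO"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

image = Image.open("your_image_path")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "I want to cook food, what can I use?"},
]}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)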

Evaluation

Evaluation Data: 🤗 gRefCOCO val

We recommend using VisionReasoner for evaluating ViSurf.

Note

In ViSurf, the best results on different benchmarks are obtained with different checkpoints. We only release the best checkpoint on gRefCOCO. If you care about the performance, we suggest evaluating and comparing the values in your own environment.

Training

1. ViSurf Training

Training Data: 🤗 ViSurf 7300
Download the dataset using this script:

python training_scripts/download_dataset.py
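
To peek at the data before training, you can also load it with the datasets library. This is a sketch, not the repo's download script; the repo id below is inferred from the data link at the top of this README, so adjust it if download_dataset.py pulls from elsewhere:

# Inspect the training data with the datasets library. The repo id is an
# assumption inferred from the data link above.
from datasets import load_dataset

ds = load_dataset("Ricky06662/ViSurf_multi_non_object_7300_size840")
print(ds)                     # split names and sizes
print(ds[next(iter(ds))][0])  # first example of the first split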

Tip

Try resizing the image and re-calculating the corresponding bbox/point coordinates if you have lower GPU memory, as in the sketch below. Remember to change the corresponding resize_size in evaluation and inference.
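
A minimal sketch of such a resize follows. It is not from the repo; the [x1, y1, x2, y2] bbox convention and the original resize_size of 840 are assumptions based on the dataset name:

# Resize an image to a smaller square and rescale its bbox accordingly.
# The [x1, y1, x2, y2] bbox convention and the 840 original size are assumptions.
from PIL import Image

def resize_sample(image: Image.Image, bbox, target_size: int = 560):
    w, h = image.size
    resized = image.resize((target_size, target_size), Image.BILINEAR)
    sx, sy = target_size / w, target_size / h
    x1, y1, x2, y2 = bbox
    return resized, [x1 * sx, y1 * sy, x2 * sx, y2 * sy]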

Download the pretrained base model using the following commands:

mkdir pretrained_models
cd pretrained_models
git lfs install
git clone https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct

(Optional) Start Ray in advance by:

ray start --head  # or ray start --head --port xxxx

Start training using this script:

bash training_scripts/qwen2_5vl_visurf_nonobj_7300.sh

You can try changing the following hyper-parameters if you have large GPU memory.

worker.actor.micro_batch_size_per_device_for_update=1 or 2 or 4 or 8 or 16 \
worker.actor.micro_batch_size_per_device_for_experience=1 or 2 or 4 or 8 or 16 \

If your GPU has less memory, you can change the following configs. The numbers depend on your GPU memory.

worker.rollout.tensor_parallel_size=[your number between 1-4]
worker.rollout.gpu_memory_utilization=[your number between 0-1]
worker.rollout.n=[your number between 2-32]

2. Merge Checkpoint in Hugging Face Format

python3 training_scripts/model_merger.py --local_dir [path_to_your_actor_checkpoint]
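
A quick way to confirm the merged directory is a standard Hugging Face checkpoint (a sketch; the path below is a placeholder):

# Verify that the merged checkpoint loads as a regular Hugging Face model dir.
# "path_to_your_merged_checkpoint" is a placeholder.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("path_to_your_merged_checkpoint")
print(config.model_type)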

Build Your Own Training Data (Optional)

Please refer to Seg-Zero if you want to build your own dataset.

Citation

@article{liu2025visurf,
  title        = {ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models},
  author       = {Liu, Yuqi and Chen, Liangyu and Liu, Jiazhen and Zhu, Mingkang and Zhong, Zhisheng and Yu, Bei and Jia, Jiaya},
  journal      = {arXiv preprint arXiv:2503.06520},
  year         = {2025}
}


@article{liu2025segzero,
  title        = {Seg-Zero: Reasoning-Chain Guided  Segmentation via Cognitive Reinforcement},
  author       = {Liu, Yuqi and Peng, Bohao and Zhong, Zhisheng and Yue, Zihao and Lu, Fanbin and Yu, Bei and Jia, Jiaya},
  journal      = {arXiv preprint arXiv:2503.06520},
  year         = {2025}
}

@article{liu2025visionreasoner,
  title        = {VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning},
  author       = {Liu, Yuqi and Qu, Tianyuan and Zhong, Zhisheng and Peng, Bohao and Liu, Shu and Yu, Bei and Jia, Jiaya},
  journal      = {arXiv preprint arXiv:2505.12081},
  year         = {2025}
}

Acknowledgement

We would like to thank the following repos for their great work:

