
ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models

This repo is the official implementation of "ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models".

We also recommend our related works Seg-Zero and VisionReasoner.

Paper: ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models
HuggingFace Daily: TO BE UPDATED
Data: 🤗 ViSurf_multi_non_object_7300_size840
Model: 🤗 Visurf-7B-Best-on-gRefCOCO 🤗 Visurf-7B-NoThink-Best-on-gRefCOCO
Related Links: VisionReasoner [code], Seg-Zero [code]

Overview of ViSurf:

ViSurf (Visual Supervised-and-Reinforcement Fine-Tuning) is a unified post-training paradigm that integrates the strengths of both SFT and RLVR within a single stage.

News

[Oct. 12th, 2025] 🔥 ViSurf is coming! We have released the code and training data.

Contents

Installation
Inference
Evaluation
Training
Citation
Acknowledgement

Installation

git clone https://github.com/dvlab-research/ViSurf.git
cd ViSurf
conda create -n visionreasoner python=3.12
conda activate visionreasoner
pip install -e .
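
After installation, you can run a quick sanity check. This is a minimal sketch, not part of the repo, and it assumes the editable install pulls in PyTorch and transformers as dependencies:

# A minimal environment sanity check (assumes torch and transformers were
# installed by "pip install -e ."): prints versions and CUDA availability.
import torch
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)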

Inference

Download the pretrained model using the following commands:

mkdir pretrained_models
cd pretrained_models
git lfs install
git clone https://huggingface.co/Ricky06662/Visurf-7B-Best-on-gRefCOCO

Tip

If you encounter issues with connecting to Hugging Face, consider using export HF_ENDPOINT=https://hf-mirror.com.
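
If git lfs is inconvenient, the huggingface_hub library offers an equivalent download. This is a sketch, not the repo's workflow; it assumes huggingface_hub is installed and it honors the HF_ENDPOINT mirror above:

# Alternative to "git lfs" cloning: download the released checkpoint with
# huggingface_hub. Respects HF_ENDPOINT if you exported the mirror above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Ricky06662/Visurf-7B-Best-on-gRefCOCO",
    local_dir="pretrained_models/Visurf-7B-Best-on-gRefCOCO",
)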

Then run inference using:

python inference_scripts/inference_visurf.py

The default question is

"I want to rest, where should I sit?"

You will see the thinking process in the command line, like:

"The question seems to be asking where to sit, but the image only shows a kitchen counter with food and flowers."

The mask will be saved in the inference_scripts folder. In this case, there is no related object.

You can also try finding objects in the image by:

python inference_scripts/inference_visurf.py --text "I want to cook food, what can I use?"

You will see the thinking process in the command line, like:

"The question asks what kitchen tools or ingredients are visible that could be used for cooking."

The mask will be saved in the inference_scripts folder.

You can also provide your own image_path and text by:

python inference_scripts/inference_visurf.py --image_path "your_image_path" --text "your question text"
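
If you prefer to call the model from your own code instead of the provided script, the sketch below may help. It assumes the released checkpoint follows the standard Qwen2.5-VL interface in recent transformers versions; the official inference script additionally handles ViSurf-specific prompting and mask post-processing, so treat this only as a starting point.

# A sketch of programmatic inference, assuming the checkpoint exposes the
# standard Qwen2.5-VL interface (transformers >= 4.49). It does not reproduce
# the ViSurf prompt format or the mask post-processing of the official script.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_path = "pretrained_models/Visurf-7B-Best-on-gRefCOCO"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

image = Image.open("your_image_path")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "I want to cook food, what can I use?"},
]}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)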

Evaluation

Evaluation Data: 🤗 gRefCOCO val

We recommend using VisionReasoner for evaluating ViSurf.

Note

In ViSurf, the best results on different benchmarks are obtained with different checkpoints. We only release the best checkpoint on gRefCOCO. If you care about the performance, we suggest evaluating and comparing the values in your own environment.

Training

1. ViSurf Training

Training Data: 🤗 ViSurf 7300
Download the dataset using this script:

python training_scripts/download_dataset.py
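
To peek at the data before training, you can also load it with the datasets library. This is a sketch, not the repo's download script; the repo id below is inferred from the data link at the top of this README, so adjust it if download_dataset.py pulls from elsewhere:

# Inspect the training data with the datasets library. The repo id is an
# assumption inferred from the data link above.
from datasets import load_dataset

ds = load_dataset("Ricky06662/ViSurf_multi_non_object_7300_size840")
print(ds)                     # split names and sizes
print(ds[next(iter(ds))][0])  # first example of the first split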

Tip

Try resizing the image and re-calculating the corresponding bbox/point coordinates if you have lower GPU memory, as in the sketch below. Remember to change the corresponding resize_size in evaluation and inference.
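
A minimal sketch of such a resize follows. It is not from the repo; the [x1, y1, x2, y2] bbox convention and the original resize_size of 840 are assumptions based on the dataset name:

# Resize an image to a smaller square and rescale its bbox accordingly.
# The [x1, y1, x2, y2] bbox convention and the 840 original size are assumptions.
from PIL import Image

def resize_sample(image: Image.Image, bbox, target_size: int = 560):
    w, h = image.size
    resized = image.resize((target_size, target_size), Image.BILINEAR)
    sx, sy = target_size / w, target_size / h
    x1, y1, x2, y2 = bbox
    return resized, [x1 * sx, y1 * sy, x2 * sx, y2 * sy]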

Download the pretrained base model using the following commands:

mkdir pretrained_models
cd pretrained_models
git lfs install
git clone https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct

(Optional) Start Ray in advance by:

ray start --head  # or ray start --head --port xxxx

Start training using this script:

bash training_scripts/qwen2_5vl_visurf_nonobj_7300.sh

You can try changing the following hyper-parameters if you have large GPU memory.

worker.actor.micro_batch_size_per_device_for_update=1 or 2 or 4 or 8 or 16 \
worker.actor.micro_batch_size_per_device_for_experience=1 or 2 or 4 or 8 or 16 \

If your GPU has less memory, you can change the following configs. The numbers depend on your GPU memory.

worker.rollout.tensor_parallel_size=[your number between 1-4]
worker.rollout.gpu_memory_utilization=[your number between 0-1]
worker.rollout.n=[your number between 2-32]

2. Merge Checkpoint in Hugging Face Format

python3 training_scripts/model_merger.py --local_dir [path_to_your_actor_checkpoint]
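
A quick way to confirm the merged directory is a standard Hugging Face checkpoint (a sketch; the path below is a placeholder):

# Verify that the merged checkpoint loads as a regular Hugging Face model dir.
# "path_to_your_merged_checkpoint" is a placeholder.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("path_to_your_merged_checkpoint")
print(config.model_type)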

Build Your Own Training Data (Optional)

Please refer to Seg-Zero if you want to build your own dataset.

Citation

@article{liu2025visurf,
  title        = {ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models},
  author       = {Liu, Yuqi and Chen, Liangyu and Liu, Jiazhen and Zhu, Mingkang and Zhong, Zhisheng and Yu, Bei and Jia, Jiaya},
  journal      = {arXiv preprint arXiv:2503.06520},
  year         = {2025}
}


@article{liu2025segzero,
  title        = {Seg-Zero: Reasoning-Chain Guided  Segmentation via Cognitive Reinforcement},
  author       = {Liu, Yuqi and Peng, Bohao and Zhong, Zhisheng and Yue, Zihao and Lu, Fanbin and Yu, Bei and Jia, Jiaya},
  journal      = {arXiv preprint arXiv:2503.06520},
  year         = {2025}
}

@article{liu2025visionreasoner,
  title        = {VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning},
  author       = {Liu, Yuqi and Qu, Tianyuan and Zhong, Zhisheng and Peng, Bohao and Liu, Shu and Yu, Bei and Jia, Jiaya},
  journal      = {arXiv preprint arXiv:2505.12081},
  year         = {2025}
}

Acknowledgement

We would like to thank the following repos for their great work:

