
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning


Authors: Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, Wenhu Chen 

🔥News

  • [2025/4/22] We release the dataset 🤗 ViRL39K. It covers a comprehensive collection of 39K queries spanning eight categories, and provides fine-grained model-capability annotations for data selection.

Overview

overview

Abstract Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models. For instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow-thinking, we introduce Forced Rethinking, which appends a textual rethinking trigger to the end of initial rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances state-of-the-art scores on MathVista, MathVerse, and MathVision to 80.3%, 61.8%, and 43.9%, respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with GPT-o1. Our empirical results demonstrate the effectiveness of our approaches.

Release Progress

  • models.
  • data.
  • inference and evaluation code.
  • training code.

Dataset

ViRL39K lays the foundation for our RL training. It has the following merits:

  • high-quality and verifiable: the QAs undergo rigorous filtering and quality control, removing problematic queries or ones that cannot be verified by rules.
  • covering comprehensive topics and categories: from grade-school problems to broader STEM and social topics; reasoning with charts, diagrams, tables, documents, spatial relationships, etc.
  • with fine-grained model-capability annotations: it tells you what queries to use when training models at different scales.
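
If you want to inspect the data, here is a minimal loading sketch. It assumes ViRL39K is hosted on the Hugging Face Hub under TIGER-Lab/ViRL39K with a standard datasets layout; the split and field names are assumptions, so check the dataset card for the exact ones.

from datasets import load_dataset

# Assumed repo id and split; see the ViRL39K dataset card for the exact names.
virl39k = load_dataset("TIGER-Lab/ViRL39K", split="train")
print(virl39k[0])  # inspect one QA record, e.g. its category and capability annotations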

RL-ed Models

  • VL-Rethinker-7B: undergoes the proposed SSR and Forced Rethinking training from Qwen2.5-VL-7B-Instruct.
  • VL-Rethinker-72B: undergoes the proposed SSR and Forced Rethinking training from Qwen2.5-VL-72B-Instruct.

We are training a 32B model and further enhancing these models. Stay tuned!

Performance

See our website or paper for a detailed performance report.

Selective Sample Replay (SSR)

Training 72B models on publicly collected queries reveals "vanishing advantages": rapid saturation in large models drastically reduces the number of effective training samples. The concurrent work DAPO made a similar observation on LLMs.

DAPO combats this by filtering out ineffective queries for gradient stability. Different from this gradient perspective, our method, Selective Sample Replay (SSR), takes an active learning perspective. Borrowing the spirit of Prioritized Experience Replay, SSR re-arranges training samples based on their informativeness: examples with high advantages, which lie near the model's capability limits (i.e., correct responses to queries the model likely fails), are particularly informative. This active selection focuses training on the samples most likely to contribute to model improvement, thereby improving training efficiency.

The implementation of SSR is simple; see active_sampling() in openrlhf/trainer/ppo_utils/replay_buffer.py. Here is pseudocode for the key idea of SSR.

effective_qas = rule_out_zero(candidates)        # drop candidates whose advantage is zero
p = normalize_adv(effective_qas, alpha=1)        # selection probabilities from advantage magnitudes
selection = np.random.choice(np.arange(len(effective_qas)), size=size, p=p)  # sample the replay batch

Note: for different scenarios, e.g., on-policy or off-policy, the choices of candidates and size can differ.
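
For a self-contained picture, below is a minimal sketch of this selection step. It makes two assumptions beyond the pseudocode above: a candidate is kept iff its advantage is non-zero, and selection probabilities are proportional to the absolute advantage raised to the power alpha (PER-style prioritization). The helper name select_replay_samples is ours; see active_sampling() in replay_buffer.py for the actual implementation.

import numpy as np

def select_replay_samples(advantages, size, alpha=1.0, rng=None):
    # Sketch of SSR selection: sample indices with probability proportional to |advantage|^alpha.
    if rng is None:
        rng = np.random.default_rng()
    advantages = np.asarray(advantages, dtype=np.float64)
    effective = np.nonzero(advantages != 0)[0]          # rule_out_zero: drop zero-advantage samples
    weights = np.abs(advantages[effective]) ** alpha    # assumed prioritization rule
    p = weights / weights.sum()                         # normalize_adv: weights -> probabilities
    return rng.choice(effective, size=size, p=p)        # replay `size` samples (with replacement)

# Toy example: replay 4 samples from a batch of 5 advantages
print(select_replay_samples([0.0, 0.9, -0.4, 0.0, 0.6], size=4))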

Inference

Our models are built on top of the Qwen2.5-VL family, so we include a simple use case here and refer readers to the standard inference procedure of Qwen2.5-VL.

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "TIGER-Lab/VL-Rethinker-7B", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
# processor = AutoProcessor.from_pretrained("TIGER-Lab/VL-Rethinker-7B")


min_pixels = 256*28*28
max_pixels = 1280*28*28
processor = AutoProcessor.from_pretrained("TIGER-Lab/VL-Rethinker-7B", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Important Notes:

Based on the training configurations of the VL-Rethinker family, it's recommended to:

  • Prompt:

    append \n\nPlease reason step by step, and put your final answer within \\boxed{} after the user queries (see the sketch below).

  • Resolutions:

    min_pixels = 256*28*28
    max_pixels = 1280*28*28
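
Putting both recommendations together, here is a minimal sketch. It reuses the message format from the inference example above; the question text is a hypothetical placeholder.

SUFFIX = "\n\nPlease reason step by step, and put your final answer within \\boxed{}"

question = "What is the total price shown on the receipt?"  # hypothetical user query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": question + SUFFIX},
        ],
    }
]
# Pair this with a processor created using min_pixels = 256*28*28 and max_pixels = 1280*28*28.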
    

🚀Quick Start

The proposed algorithm is implemented with the OpenRLHF framework.

Installations

Please see the installation instructions.

Evaluation

Our models can be evaluated like Qwen2.5-VL using lmms_eval.

Here we provide an alternative evaluation approach. It offers the following benefits:

  • Fast: batch inference with vLLM finishes 1K queries on 8 A800 GPUs within 30 minutes.
  • Convenient: evaluation without time-consuming API calls; judgements made by our rule-based functions align with LLM judges.
  • Train-Test Aligned: the evaluation reuses the correctness judgement from training to minimize the gap between training and test-time evaluation.

The evaluation is integrated with the OpenRLHF framework.

bash ./scripts/eval_7b.sh [benchmark] [modelname] [modelpath]

Note: for MMMU-Val we cannot reproduce the reported Qwen2.5-VL results with lmms_eval, VLMEvalKit, or our native evaluation. We would greatly appreciate any insights into the correct means of reproducing them.

Training

Run the following.

bash ./scripts/train_vlm_multi.sh

Acknowledgement

This project adapts from OpenRLHF and LMM-R1, released under the Apache License 2.0. Thanks for their open-source contributions!

Citation

If you find this work useful, please cite our work:

@article{vl-rethinker,
      title={VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning},
      author = {Wang, Haozhe and Qu, Chao and Huang, Zuming and Chu, Wei and Lin, Fangzhen and Chen, Wenhu},
      journal={arXiv preprint arXiv:2504.08837},
      year={2025}
}
