Our initially released model is maintained as a legacy checkpoint: OpenVLThinker-v1.0, together with our initial exploratory blog post.
Authors: Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang
Our study investigates whether R1-like reasoning capabilities can be successfully integrated into large vision-language models (LVLMs) and assesses their impact on challenging multimodal reasoning tasks. We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and reinforcement learning (RL) to further improve model generalization.
As an early result, we present OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision.
OpenVLThinker is iteratively trained in two main stages: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL). The instructions for replicating the training process are located in their respective subdirectories.
This process is managed using the LLaMA-Factory framework. For complete setup and training instructions, please refer to the SFT README: ➡️ SFT Training Instructions
This process is based on the EasyR1 framework. For detailed steps on running the two-stage RL training, please see the RL README: ➡️ RL Training Instructions
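Conceptually, each self-improvement round first distills reasoning traces into lightweight SFT data, fine-tunes on them, and then applies RL on top of the fine-tuned checkpoint. The sketch below is hypothetical pseudocode of that loop: the stub functions are placeholders standing in for the LLaMA-Factory and EasyR1 runs documented in the READMEs above, not the actual training entry points.

```python
# Conceptual sketch of the iterative SFT -> RL loop.
# All functions are hypothetical stubs; the real stages are driven by
# LLaMA-Factory (SFT) and EasyR1 (RL) as described in the linked READMEs.

def generate_sft_data(model_path: str) -> str:
    """Distill reasoning traces from the current model into an SFT dataset."""
    return f"{model_path}-distilled-traces.json"  # placeholder path

def run_sft(model_path: str, sft_data: str) -> str:
    """Supervised fine-tuning stage (in practice: LLaMA-Factory)."""
    return f"{model_path}-sft"  # placeholder checkpoint name

def run_rl(model_path: str) -> str:
    """Reinforcement learning stage (in practice: EasyR1)."""
    return f"{model_path}-rl"  # placeholder checkpoint name

def self_improve(base_model: str, iterations: int = 2) -> str:
    model = base_model
    for _ in range(iterations):
        data = generate_sft_data(model)
        model = run_sft(model, data)
        model = run_rl(model)
    return model

if __name__ == "__main__":
    print(self_improve("Qwen2.5-VL-7B-Instruct"))
```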
Our model has been evaluated on several challenging benchmarks:
- Math reasoning: MathVista, MathVerse, MathVision
- General reasoning: MMMU-Pro, EMMA
- Perception: HallusionBench
Install the necessary packages:

```bash
pip install qwen_vl_utils
pip install mathruler
```
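`mathruler` provides rule-based grading of math answers. Assuming its `mathruler.grader` module (which, in the versions we have used, exposes `extract_boxed_content` and `grade_answer`), a minimal sketch of checking a boxed answer looks like this:

```python
# Minimal sketch of rule-based answer checking with mathruler.
# Assumes the mathruler.grader module; verify against your installed version.
from mathruler.grader import extract_boxed_content, grade_answer

response = r"... therefore the answer is \boxed{\frac{1}{2}}."
predicted = extract_boxed_content(response)  # extracts "\frac{1}{2}"
is_correct = grade_answer(predicted, "1/2")  # True if judged equivalent
print(is_correct)
```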
We provide two evaluation scripts to handle different answer formats:

- For OpenVLThinker evaluation:

  ```bash
  python evaluation/eval_openvlthinker.py --dataset mathvista
  ```

- For Qwen2.5-VL evaluation:

  ```bash
  python evaluation/eval_qwen.py --dataset mathvista
  ```
An optional `--cuda` argument can be used to specify the GPU device (e.g., `--cuda 0`). The evaluation results, including a detailed JSON report, will be saved in the `./evaluation/outputs` directory.
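For reference, here is a minimal standalone inference sketch using `transformers` and `qwen_vl_utils` (the evaluation scripts above handle this end to end). The model ID, image path, and prompt are placeholders, not the exact released names; it assumes a recent `transformers` with Qwen2.5-VL support.

```python
# Minimal inference sketch (assumes transformers >= 4.49 with Qwen2.5-VL support;
# the model ID and inputs below are placeholders, not the exact released names).
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "ydeng9/OpenVLThinker-7B"  # placeholder: substitute the released checkpoint
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/figure.png"},
        {"type": "text", "text": "What is the value of x in the figure?"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=2048)
trimmed = output_ids[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```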
Evaluation supports the following `--dataset` values:

- Math reasoning: `mathvista`, `mathverse`, `mathvision`
- EMMA: `emma-math`, `emma-chem`, `emma-code`, `emma-physics`
- MMMU: `mmmu-pro-vision`, `mmmu-pro-4`, `mmmu-pro-10`
- Perception: `hallusionbench`
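To sweep every supported benchmark in one run, a small driver along these lines can wrap the evaluation script (a hypothetical convenience wrapper, not part of the repository):

```python
# Hypothetical convenience wrapper: run the evaluation script over all
# supported datasets sequentially on one GPU (not part of the repository).
import subprocess

DATASETS = [
    "mathvista", "mathverse", "mathvision",
    "emma-math", "emma-chem", "emma-code", "emma-physics",
    "mmmu-pro-vision", "mmmu-pro-4", "mmmu-pro-10",
    "hallusionbench",
]

for dataset in DATASETS:
    subprocess.run(
        ["python", "evaluation/eval_openvlthinker.py",
         "--dataset", dataset, "--cuda", "0"],
        check=True,  # stop on the first failing benchmark
    )
```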
Due to the free-form nature of the MathVerse benchmark, we use GPT-4V to verify the model's responses. After generating the output file with the command above, run the verification script:
```bash
python evaluation/verify_mathverse_gpt4.py \
    --responses_file ./evaluation/outputs/mathverse_OpenVLThinker-v1.2.json
```

Note: this requires an `OPENAI_API_KEY` to be set in your environment variables.
```bibtex
@misc{deng2025openvlthinker,
      title={OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement},
      author={Yihe Deng and Hritik Bansal and Fan Yin and Nanyun Peng and Wei Wang and Kai-Wei Chang},
      year={2025},
      eprint={2503.17352},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.17352},
}
```
We thank LLaMA-Factory and EasyR1 for open-sourcing the model training frameworks that we used in this work.