Our initially released model is maintained as a legacy checkpoint: OpenVLThinker-v1.0, together with our initial exploratory blog post.
Authors: Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang
Our study investigates whether R1-like reasoning capabilities can be successfully integrated into large vision-language models (LVLMs) and assesses their impact on challenging multimodal reasoning tasks. We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and reinforcement learning (RL) to further improve model generalization.
As an early result, we present OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision.
OpenVLThinker is iteratively trained in two main stages: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL). The instructions for replicating the training process are located in their respective subdirectories.
This process is managed using the LLaMA-Factory framework. For complete setup and training instructions, please refer to the SFT README: ➡️ SFT Training Instructions
This process is based on the EasyR1 framework. For detailed steps on running the two-stage RL training, please see the RL README: ➡️ RL Training Instructions
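Conceptually, each self-improvement round first distills reasoning traces into lightweight SFT data, fine-tunes on them, and then applies RL on top of the fine-tuned checkpoint. The sketch below is hypothetical pseudocode of that loop: the stub functions are placeholders standing in for the LLaMA-Factory and EasyR1 runs documented in the READMEs above, not the actual training entry points.

```python
# Conceptual sketch of the iterative SFT -> RL loop.
# All functions are hypothetical stubs; the real stages are driven by
# LLaMA-Factory (SFT) and EasyR1 (RL) as described in the linked READMEs.

def generate_sft_data(model_path: str) -> str:
    """Distill reasoning traces from the current model into an SFT dataset."""
    return f"{model_path}-distilled-traces.json"  # placeholder path

def run_sft(model_path: str, sft_data: str) -> str:
    """Supervised fine-tuning stage (in practice: LLaMA-Factory)."""
    return f"{model_path}-sft"  # placeholder checkpoint name

def run_rl(model_path: str) -> str:
    """Reinforcement learning stage (in practice: EasyR1)."""
    return f"{model_path}-rl"  # placeholder checkpoint name

def self_improve(base_model: str, iterations: int = 2) -> str:
    model = base_model
    for _ in range(iterations):
        data = generate_sft_data(model)
        model = run_sft(model, data)
        model = run_rl(model)
    return model

if __name__ == "__main__":
    print(self_improve("Qwen2.5-VL-7B-Instruct"))
```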
Our model has been evaluated on several challenging benchmarks:
- Math reasoning: MathVista, MathVerse, MathVision
- General reasoning: MMMU-Pro, EMMA
- Perception: HallusionBench
Install the necessary packages:

```bash
pip install qwen_vl_utils
pip install mathruler
```
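`mathruler` provides rule-based grading of math answers. Assuming its `mathruler.grader` module (which, in the versions we have used, exposes `extract_boxed_content` and `grade_answer`), a minimal sketch of checking a boxed answer looks like this:

```python
# Minimal sketch of rule-based answer checking with mathruler.
# Assumes the mathruler.grader module; verify against your installed version.
from mathruler.grader import extract_boxed_content, grade_answer

response = r"... therefore the answer is \boxed{\frac{1}{2}}."
predicted = extract_boxed_content(response)  # extracts "\frac{1}{2}"
is_correct = grade_answer(predicted, "1/2")  # True if judged equivalent
print(is_correct)
```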
We provide two evaluation scripts to handle different answer formats:

- For OpenVLThinker evaluation:

  ```bash
  python evaluation/eval_openvlthinker.py --dataset mathvista
  ```

- For Qwen2.5-VL evaluation:

  ```bash
  python evaluation/eval_qwen.py --dataset mathvista
  ```
An optional `--cuda` argument can be used to specify the GPU device (e.g., `--cuda 0`). The evaluation results, including a detailed JSON report, will be saved in the `./evaluation/outputs` directory.
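For reference, here is a minimal standalone inference sketch using `transformers` and `qwen_vl_utils` (the evaluation scripts above handle this end to end). The model ID, image path, and prompt are placeholders, not the exact released names; it assumes a recent `transformers` with Qwen2.5-VL support.

```python
# Minimal inference sketch (assumes transformers >= 4.49 with Qwen2.5-VL support;
# the model ID and inputs below are placeholders, not the exact released names).
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "ydeng9/OpenVLThinker-7B"  # placeholder: substitute the released checkpoint
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/figure.png"},
        {"type": "text", "text": "What is the value of x in the figure?"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=2048)
trimmed = output_ids[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```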
Evaluation supports the following `--dataset` values:

- Math reasoning: `mathvista`, `mathverse`, `mathvision`
- EMMA: `emma-math`, `emma-chem`, `emma-code`, `emma-physics`
- MMMU: `mmmu-pro-vision`, `mmmu-pro-4`, `mmmu-pro-10`
- Perception: `hallusionbench`
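To sweep every supported benchmark in one run, a small driver along these lines can wrap the evaluation script (a hypothetical convenience wrapper, not part of the repository):

```python
# Hypothetical convenience wrapper: run the evaluation script over all
# supported datasets sequentially on one GPU (not part of the repository).
import subprocess

DATASETS = [
    "mathvista", "mathverse", "mathvision",
    "emma-math", "emma-chem", "emma-code", "emma-physics",
    "mmmu-pro-vision", "mmmu-pro-4", "mmmu-pro-10",
    "hallusionbench",
]

for dataset in DATASETS:
    subprocess.run(
        ["python", "evaluation/eval_openvlthinker.py",
         "--dataset", dataset, "--cuda", "0"],
        check=True,  # stop on the first failing benchmark
    )
```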
Due to the free-form nature of the MathVerse benchmark, we use GPT-4V to verify the model's responses. After generating the output file with the command above, run the verification script:
```bash
python evaluation/verify_mathverse_gpt4.py \
    --responses_file ./evaluation/outputs/mathverse_OpenVLThinker-v1.2.json
```

Note: this requires an `OPENAI_API_KEY` to be set in your environment variables.
```bibtex
@misc{deng2025openvlthinker,
      title={OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement},
      author={Yihe Deng and Hritik Bansal and Fan Yin and Nanyun Peng and Wei Wang and Kai-Wei Chang},
      year={2025},
      eprint={2503.17352},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.17352},
}
```
We thank LLaMA-Factory and EasyR1 for open-sourcing the model training frameworks that we used in this work.