OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

🤗 Models | 🤗 Data | 📄 Paper

We maintain our initially released model here as a legacy model, OpenVLThinker-v1.0, together with our initial exploratory blog.

Authors: Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang

Our study investigates whether R1-like reasoning capabilities can be successfully integrated into large vision-language models (LVLMs) and assesses their impact on challenging multimodal reasoning tasks. Our approach iteratively alternates supervised fine-tuning (SFT) on lightweight training data with reinforcement learning (RL) to further improve model generalization.

As an early result, we present OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision.

Training

OpenVLThinker is iteratively trained in two main stages: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL). The instructions for replicating the training process are located in their respective subdirectories.
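For intuition only, the overall loop can be sketched in a few lines of Python-style pseudocode. Every function below is a placeholder rather than an actual entry point of this repository; the real training code lives in the SFT and RL subdirectories linked below.

def supervised_fine_tune(model, sft_data):
    # Placeholder: run LLaMA-Factory SFT on lightweight reasoning traces.
    return model

def reinforcement_learning(model):
    # Placeholder: run EasyR1-based RL to further improve generalization.
    return model

def collect_reasoning_traces(model):
    # Placeholder: curate/distill reasoning traces for the next SFT round.
    return []

def iterative_training(base_model, num_iterations):
    model = base_model
    for _ in range(num_iterations):
        sft_data = collect_reasoning_traces(model)     # lightweight SFT data
        model = supervised_fine_tune(model, sft_data)  # stage 1: SFT
        model = reinforcement_learning(model)          # stage 2: RL
    return model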

1. Supervised Fine-Tuning (SFT)

This process is managed using the LLaMA-Factory framework. For complete setup and training instructions, please refer to the SFT README: ➡️ SFT Training Instructions
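As a rough illustration only (the configuration path is a placeholder; the SFT README specifies the actual setup), a LLaMA-Factory run is typically launched through its CLI:

llamafactory-cli train <path/to/sft_config.yaml>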

2. Reinforcement Learning (RL)

This process is based on the EasyR1 framework. For detailed steps on running the two-stage RL training, please see the RL README: ➡️ RL Training Instructions
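As with SFT, the exact launch scripts are defined in the RL README; schematically, each RL stage is started by running the corresponding EasyR1-based script (the path below is a placeholder):

bash <path/to/rl_stage_script>.sh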

Evaluation

Our model has been evaluated on several challenging benchmarks:

  • Math reasoning: MathVista, MathVerse, MathVision
  • General reasoning: MMMU-Pro, EMMA
  • Perception: HallusionBench

Necessary packages

pip install qwen_vl_utils
pip install mathruler

Run Evaluation

We provide two evaluation scripts to handle different answer formats:

  1. For OpenVLThinker evaluation:
     python evaluation/eval_openvlthinker.py --dataset mathvista
  2. For Qwen2.5-VL evaluation:
     python evaluation/eval_qwen.py --dataset mathvista

An optional --cuda argument can be used to specify the GPU device (e.g., --cuda 0). The evaluation results, including a detailed JSON report, will be saved in the ./evaluation/outputs directory.
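The saved report can be inspected with a few lines of Python. This is a minimal sketch: the file name follows the same <dataset>_<model>.json pattern as the MathVerse example below, and the per-example "correct" field is an assumption, so adjust it to the actual schema written by the evaluation scripts.

import json

# Placeholder path and field name; adjust to the actual report in ./evaluation/outputs.
with open("./evaluation/outputs/mathvista_OpenVLThinker-v1.2.json") as f:
    results = json.load(f)

# Assumes a list of per-example records with a boolean "correct" field.
accuracy = sum(bool(r.get("correct")) for r in results) / len(results)
print(f"{len(results)} examples, accuracy = {accuracy:.2%}")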

Datasets

Evaluation supports the following datasets:

  • mathvista
  • mathverse
  • mathvision
  • EMMA (emma-math, emma-chem, emma-code, emma-physics)
  • MMMU (mmmu-pro-vision, mmmu-pro-4, mmmu-pro-10)
  • hallusionbench
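For example, to sweep the three math benchmarks with OpenVLThinker using only the --dataset flag described above:

for d in mathvista mathverse mathvision; do python evaluation/eval_openvlthinker.py --dataset $d; done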

Special Case: MathVerse Evaluation

Due to the free-form nature of the MathVerse benchmark, we use GPT-4V to verify the model's responses. After generating the output file with the command above, run the verification script:

python evaluation/verify_mathverse_gpt4.py \
    --responses_file ./evaluation/outputs/mathverse_OpenVLThinker-v1.2.json 

Note: This requires an OPENAI_API_KEY to be set in your environment variables.
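For example:

export OPENAI_API_KEY=<your_api_key>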

Citation

@misc{deng2025openvlthinker,
      title={OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement}, 
      author={Yihe Deng and Hritik Bansal and Fan Yin and Nanyun Peng and Wei Wang and Kai-Wei Chang},
      year={2025},
      eprint={2503.17352},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.17352}, 
}

Acknowledgments

We thank LLaMA-Factory and EasyR1 for open-sourcing the model training frameworks that we used in this work.
