DeepSeek-R1 demonstrates outstanding reasoning ability across a wide range of problems, with the exception of vision-related ones. Although there have been several efforts to build a multimodal R1, they all focus on enhancing the reasoning ability of existing vision-language (VL) models.
We built this project to explore how to transfer reasoning ability from R1 (an LLM) to Visual R1 (an MLLM).
Stage 1: We equip the textual R1 with visual perception by bridging a visual encoder to the LLM through an adapter, yielding a preliminary version of Visual R1 (see the architecture sketch below).
Stage 2: We enhance Visual R1's multimodal ability while preserving its slow-thinking ability on text, training on a mixture of multimodal data, general text data, and textual reasoning data.
Stage 3: We use multimodal reasoning data so that Visual R1 quickly acquires multimodal reasoning ability by transferring its textual reasoning ability.
Stage 4: We further strengthen Visual R1's multimodal reasoning ability.
Through these four stages, we transfer the reasoning ability of R1 to Visual R1, which can reason over both text and images.
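As a concrete illustration of Stage 1, the sketch below shows one way a visual encoder can be bridged to an R1-distilled LLM through a small MLP adapter. This is a minimal, hypothetical sketch rather than our actual implementation: the class name `VisualR1Sketch`, the choice of a CLIP vision encoder, and the two-layer adapter design are all illustrative assumptions.

```python
# Minimal sketch (NOT the project's actual code) of bridging a visual encoder
# to a textual R1 backbone via an adapter. Model choices are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel


class VisualR1Sketch(nn.Module):
    def __init__(self,
                 llm_name: str = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
                 vision_name: str = "openai/clip-vit-large-patch14"):
        super().__init__()
        # Visual encoder that turns an image into a sequence of patch features.
        self.vision_encoder = CLIPVisionModel.from_pretrained(vision_name)
        # Textual R1 backbone whose reasoning ability we want to transfer.
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        # Adapter: a small MLP projecting visual features into the LLM embedding space.
        vision_dim = self.vision_encoder.config.hidden_size
        llm_dim = self.llm.config.hidden_size
        self.adapter = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        # Encode the image and project its patch tokens into the LLM embedding space.
        patch_feats = self.vision_encoder(pixel_values=pixel_values).last_hidden_state
        visual_embeds = self.adapter(patch_feats)
        # Embed the text prompt and prepend the projected visual tokens.
        text_embeds = self.llm.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)
        visual_mask = torch.ones(visual_embeds.shape[:2],
                                 dtype=attention_mask.dtype,
                                 device=attention_mask.device)
        attention_mask = torch.cat([visual_mask, attention_mask], dim=1)
        return self.llm(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
```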
TODO: Report the performance of our Visual R1.
We leverage DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, and DeepSeek-R1-Distill-Qwen-32B as LLM backbones.
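For reference, the snippet below is a minimal sketch of loading one of these distilled backbones with Hugging Face Transformers (the 7B checkpoint is shown; the 1.5B and 32B variants load the same way). The prompt and generation settings are illustrative and not part of our training configuration.

```python
# Load an R1-distilled backbone and run a quick text-only generation check.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Illustrative prompt; the distilled models emit their chain of thought before the answer.
messages = [{"role": "user", "content": "What is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```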