ACTIVE-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

¹Zhejiang University, ²Ant Group

📄 Paper | 🌐 Project Page | 💾 Model Weights

🚀 Overview

(Figure: framework overview.)

📖 Description

We propose ACTIVE-O3, a purely reinforcement-learning-based training framework built on top of GRPO, designed to equip MLLMs with active perception capabilities. We further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 on both general open-world tasks, such as small-object and dense-object grounding, and domain-specific scenarios, including small-object detection in remote sensing and autonomous driving, as well as fine-grained interactive segmentation. Experimental results demonstrate that ACTIVE-O3 significantly enhances active perception compared to Qwen-VL2.5-CoT. For example, Figure 1 shows zero-shot reasoning on the V* benchmark, where ACTIVE-O3 successfully identifies the number on the traffic light by zooming in on the relevant region, while Qwen2.5-VL fails to do so. Moreover, across all downstream tasks, ACTIVE-O3 consistently improves performance under fixed computational budgets. We hope this work provides a simple codebase and evaluation protocol to facilitate future research on active perception for MLLMs.
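At inference time, active perception amounts to a two-stage query: the model first proposes a region worth inspecting, then answers from a magnified crop of that region. The sketch below is only a conceptual illustration of that loop, not the repository's implementation; model_query and parse_bbox are hypothetical helpers introduced here for clarity.

# Conceptual sketch of a zoom-in active-perception step (illustrative only;
# `model_query` and `parse_bbox` are hypothetical helpers, not repo code).
import re
from PIL import Image

def parse_bbox(reply: str):
    # Hypothetical parser: pull the first four integers out of the model's reply.
    nums = [int(n) for n in re.findall(r"-?\d+", reply)[:4]]
    return tuple(nums) if len(nums) == 4 else None

def active_perception_answer(image: Image.Image, question: str, model_query):
    # Stage 1: ask the policy where to look.
    reply = model_query(image, question + "\nReturn a box (x1, y1, x2, y2) to zoom into.")
    bbox = parse_bbox(reply)
    if bbox is None:
        return model_query(image, question)  # no usable box: answer directly
    # Stage 2: magnify the proposed region and answer from the crop.
    crop = image.crop(bbox).resize(image.size)
    return model_query(crop, question)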

🚩 Plan

  • Release the weights.
  • Release the inference demo.
  • Release the dataset.
  • Release the training scripts.
  • Release the evaluation scripts.

🛠️ Getting Started

📝 Set up Environment

# build environment
conda create -n activeo3 python=3.10
conda activate activeo3

# install packages
pip install torch==2.5.1 torchvision==0.20.1
pip install flash-attn --no-build-isolation  # requires torch to be installed first
pip install transformers==4.51.3
pip install "qwen-omni-utils[decord]"        # quotes keep the shell from globbing the extras
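
To confirm the pinned versions installed and that CUDA is visible, an optional sanity check (not part of the repository's setup instructions):

# optional sanity check
python -c "import torch, transformers; print(torch.__version__, transformers.__version__, torch.cuda.is_available())"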

🔍 Demo

# run demo
python demo/activeo3_demo_vstar.py
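
For scripted inference outside the packaged demo, the following is a minimal sketch of loading the released weights with Hugging Face transformers. The hub id "aim-uofa/ActiveO3", the image path, and the question are placeholders assumed for illustration; see demo/activeo3_demo_vstar.py for the actual entry point.

# Minimal inference sketch; checkpoint id, image path, and prompt are placeholders.
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

model_id = "aim-uofa/ActiveO3"  # hypothetical hub id; substitute the released weights path
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image_path = "demo/images/example.jpg"  # placeholder image
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image_path},
        {"type": "text", "text": "What number is shown on the traffic light?"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[Image.open(image_path)], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])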

🎫 License

For academic usage, this project is licensed under the 2-clause BSD License. For commercial inquiries, please contact Chunhua Shen.

🖊️ Citation

If you find this work helpful for your research, please cite:

@article{zhu2025active,
  title={Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO},
  author={Zhu, Muzhi and Zhong, Hao and Zhao, Canyu and Du, Zongze and Huang, Zheng and Liu, Mingyu and Chen, Hao and Zou, Cheng and Chen, Jingdong and Yang, Ming and others},
  journal={arXiv preprint arXiv:2505.21457},
  year={2025}
}
