ACTIVE-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

¹Zhejiang University, ²Ant Group

📄 Paper | 🌐 Project Page | 💾 Model Weights

🚀 Overview

(Figure: framework overview.)

📖 Description

We propose ACTIVE-O3, a purely reinforcement-learning-based training framework built on top of GRPO, designed to equip MLLMs with active perception capabilities. We further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 on both general open-world tasks, such as small-object and dense-object grounding, and domain-specific scenarios, including small-object detection in remote sensing and autonomous driving, as well as fine-grained interactive segmentation. Experimental results demonstrate that ACTIVE-O3 significantly enhances active perception compared to Qwen-VL2.5-CoT. For example, Figure 1 shows zero-shot reasoning on the V* benchmark, where ACTIVE-O3 successfully identifies the number on the traffic light by zooming in on the relevant region, while Qwen2.5-VL fails to do so. Moreover, across all downstream tasks, ACTIVE-O3 consistently improves performance under fixed computational budgets. We hope this work provides a simple codebase and evaluation protocol to facilitate future research on active perception for MLLMs.
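At inference time, active perception amounts to a two-stage query: the model first proposes a region worth inspecting, then answers from a magnified crop of that region. The sketch below is only a conceptual illustration of that loop, not the repository's implementation; model_query and parse_bbox are hypothetical helpers introduced here for clarity.

# Conceptual sketch of a zoom-in active-perception step (illustrative only;
# `model_query` and `parse_bbox` are hypothetical helpers, not repo code).
import re
from PIL import Image

def parse_bbox(reply: str):
    # Hypothetical parser: pull the first four integers out of the model's reply.
    nums = [int(n) for n in re.findall(r"-?\d+", reply)[:4]]
    return tuple(nums) if len(nums) == 4 else None

def active_perception_answer(image: Image.Image, question: str, model_query):
    # Stage 1: ask the policy where to look.
    reply = model_query(image, question + "\nReturn a box (x1, y1, x2, y2) to zoom into.")
    bbox = parse_bbox(reply)
    if bbox is None:
        return model_query(image, question)  # no usable box: answer directly
    # Stage 2: magnify the proposed region and answer from the crop.
    crop = image.crop(bbox).resize(image.size)
    return model_query(crop, question)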

🚩 Plan

  • Release the weights.
  • Release the inference demo.
  • Release the dataset.
  • Release the training scripts.
  • Release the evaluation scripts.

🛠️ Getting Started

📝 Set up Environment

# build environment
conda create -n activeo3 python=3.10
conda activate activeo3

# install packages
pip install torch==2.5.1 torchvision==0.20.1
pip install flash-attn --no-build-isolation  # requires torch to be installed first
pip install transformers==4.51.3
pip install "qwen-omni-utils[decord]"        # quotes keep the shell from globbing the extras
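
To confirm the pinned versions installed and that CUDA is visible, an optional sanity check (not part of the repository's setup instructions):

# optional sanity check
python -c "import torch, transformers; print(torch.__version__, transformers.__version__, torch.cuda.is_available())"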

🔍 Demo

# run demo
python demo/activeo3_demo_vstar.py
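
For scripted inference outside the packaged demo, the following is a minimal sketch of loading the released weights with Hugging Face transformers. The hub id "aim-uofa/ActiveO3", the image path, and the question are placeholders assumed for illustration; see demo/activeo3_demo_vstar.py for the actual entry point.

# Minimal inference sketch; checkpoint id, image path, and prompt are placeholders.
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

model_id = "aim-uofa/ActiveO3"  # hypothetical hub id; substitute the released weights path
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image_path = "demo/images/example.jpg"  # placeholder image
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image_path},
        {"type": "text", "text": "What number is shown on the traffic light?"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[Image.open(image_path)], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])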

🎫 License

For academic usage, this project is licensed under the 2-clause BSD License. For commercial inquiries, please contact Chunhua Shen.

🖊️ Citation

If you find this work helpful for your research, please cite:

@article{zhu2025active,
  title={Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO},
  author={Zhu, Muzhi and Zhong, Hao and Zhao, Canyu and Du, Zongze and Huang, Zheng and Liu, Mingyu and Chen, Hao and Zou, Cheng and Chen, Jingdong and Yang, Ming and others},
  journal={arXiv preprint arXiv:2505.21457},
  year={2025}
}
