We introduce an adaptive multi-agent framework designed to enhance collaborative reasoning through both model-level training and system-level coordination.
📄 Paper
🌐 Project Page
🧠 Model: M1-32B
📂 Dataset: M500
- Overview
- Configure Environment
- Install Requirements
- Configure Model and Dataset
- Train and Inference
- Contributors
- Citation
Multi-agent systems (MAS) built on large language models (LLMs) offer a promising path toward solving complex, real-world tasks that single-agent systems often struggle to manage. While recent advancements in test-time scaling (TTS) have significantly improved single-agent performance on challenging reasoning tasks, how to effectively scale collaboration and reasoning in MAS remains an open question. In this work, we introduce an adaptive multi-agent framework designed to enhance collaborative reasoning through both model-level training and system-level coordination. We construct M500, a high-quality dataset containing 500 multi-agent collaborative reasoning traces, and fine-tune Qwen2.5-32B-Instruct on this dataset to produce M1-32B, a model optimized for multi-agent collaboration. To further enable adaptive reasoning, we propose a novel CEO agent that dynamically manages the discussion process, guiding agent collaboration and adjusting reasoning depth for more effective problem-solving. Evaluated in an open-source MAS across a range of tasks, including general understanding, mathematical reasoning, and coding, our system significantly outperforms strong baselines. For instance, M1-32B achieves a 12% improvement on GPQA-Diamond, 41% on AIME2024, and 10% on MBPP-Sanitized, matching the performance of state-of-the-art models like DeepSeek-R1 on some tasks. These results highlight the importance of both learned collaboration and adaptive coordination in scaling multi-agent reasoning.
Authors: Can Jin, Hongwu Peng, Qixin Zhang, Yujin Tang, Dimitris N. Metaxas, Tong Che
Note that the results of s1.1-32B are obtained without using budget forcing.
Adding environment variables to ~/.bashrc:
export MY_HOME=
export MY_PROJECT=MAS-TTS
export MY_OUTPUT=
export OPENAI_API_KEY=
export TOGETHER_API_KEY=
export TOGETHER_BASE_URL=
export DEEPSEEK_BASE_URL=
export DEEPSEEK_API_KEY=
export WANDB_USER_NAME=
export WANDB_API_KEY=
export HUGGING_FACE_HUB_TOKEN=
export HUGGING_FACE_USER_NAME=
source ~/.bashrc
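After reloading the shell, a quick check that the variables are actually set can save debugging later (a minimal sanity check; secrets are printed as lengths only):

```bash
# Confirm the project variables are set
echo "MY_PROJECT=$MY_PROJECT"
echo "MY_OUTPUT=$MY_OUTPUT"
# Keys should be non-empty; print lengths only to avoid leaking secrets
echo "OPENAI_API_KEY length: ${#OPENAI_API_KEY}"
echo "HUGGING_FACE_HUB_TOKEN length: ${#HUGGING_FACE_HUB_TOKEN}"
```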
Installing requirements for the MAS (Agentverse):
# Create and activate conda environment
conda create -n agent python=3.11
conda activate agent
cd $MY_PROJECT
pip install -e .
python -m spacy download en_core_web_sm
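A quick import test helps confirm the environment is usable (this assumes the package installed by `pip install -e .` is importable as `agentverse`):

```bash
# Sanity check: the spaCy model and the MAS package should both load
python -c "import spacy; spacy.load('en_core_web_sm'); print('spaCy OK')"
python -c "import agentverse; print('agentverse OK')"
```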
Installing requirements for Model Training (LLaMA-Factory):
conda create -n llamafactory python=3.11
conda activate llamafactory
cd $MY_PROJECT/LLaMA-Factory
pip install -e ".[torch,metrics]"
pip install vllm deepspeed flash-attn wandb
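Likewise, a short check of the training environment (the import names below are the standard ones for the pip packages installed above):

```bash
# Confirm the core training dependencies import and a GPU is visible
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
python -c "import vllm, deepspeed, flash_attn; print('vllm / deepspeed / flash-attn OK')"
```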
Download the M1-32B model and M500 dataset:
- M1-32B: https://huggingface.co/Can111/m1-32b
- M500: https://huggingface.co/datasets/Can111/m500
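One way to fetch both is the Hugging Face CLI (a sketch: it assumes `huggingface_hub` is installed and `HUGGING_FACE_HUB_TOKEN` is set as above; the target directories are just examples):

```bash
# Download the fine-tuned model and the training dataset from the Hugging Face Hub
huggingface-cli download Can111/m1-32b --local-dir $MY_HOME/models/m1-32b
huggingface-cli download Can111/m500 --repo-type dataset --local-dir $MY_HOME/data/m500
```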
If you want to configure new models and datasets, please refer to the following instructions:
Configure new models (e.g., gpt-4o-mini, o3-mini, deepseek-chat, deepseek-reasoner, Qwen):
# local models
Add the model config in `agentverse/llms/__init__.py`.
# remote models
Add the model config in `agentverse/llms/openai.py`.
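For a local model such as M1-32B, the MAS needs an OpenAI-compatible endpoint to call. One common setup (a sketch, not the repository's exact script; the port and parallelism settings are examples) is to serve the checkpoint with vLLM and point the model config at that endpoint:

```bash
# Serve M1-32B behind vLLM's OpenAI-compatible API server (example settings)
python -m vllm.entrypoints.openai.api_server \
    --model Can111/m1-32b \
    --served-model-name m1-32b \
    --tensor-parallel-size 4 \
    --port 8000
```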
Configure new datasets or tasks (e.g., AIME 2024, MATH-500, GPQA-Diamond):
# data
Add the data files in the `data` folder.
# tasks
Register the tasks in `dataloader/__init__.py`.
# configs
Add the task configs in `agentverse/tasks/tasksolving`.
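Putting the three steps together, adding a new benchmark could look like the following (file names and folder layout are illustrative; mirror an existing task in the repo):

```bash
# 1. Place the raw data under the `data` folder (paths are illustrative)
mkdir -p data/aime2024
cp /path/to/aime2024.jsonl data/aime2024/

# 2. Register the new task in `dataloader/__init__.py`, following the existing entries

# 3. Start the task config from an existing one under `agentverse/tasks/tasksolving`
cp -r agentverse/tasks/tasksolving/<existing_task> agentverse/tasks/tasksolving/aime2024
```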
For more detailed instructions, please refer to the Agentverse repository.
Train the model on the M500 dataset:
bash run/train.sh
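Training presumably runs in the `llamafactory` environment created above, so activate it first (a minimal sketch of the full sequence):

```bash
# Activate the training environment and launch fine-tuning on M500
conda activate llamafactory
cd $MY_PROJECT
bash run/train.sh
```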
Run inference with the model in the MAS on a task:
bash run/inference.sh
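Inference goes through the MAS, so it presumably uses the `agent` environment created for Agentverse (again a minimal sketch):

```bash
# Activate the MAS environment and run multi-agent inference
conda activate agent
cd $MY_PROJECT
bash run/inference.sh
```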
The MAS framework is built on the Agentverse repository.
Consider citing our paper if you find our work useful.
@article{jin2025two,
title={Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning},
author={Jin, Can and Peng, Hongwu and Zhang, Qixin and Tang, Yujin and Metaxas, Dimitris N and Che, Tong},
journal={arXiv preprint arXiv:2504.09772},
year={2025}
}