
A comprehensive framework for benchmarking single and multi-agent systems across a wide range of tasks—evaluating performance, accuracy, and efficiency with built-in visualization and tool integration.


MASArena 🏟️

Python 3.11+ · License: MIT · Documentation · Ask DeepWiki

Layered Architecture · Stack & Swap · Built for Scale

MASArena Architecture

🌟 Core Features

  • 🧱 Modular Design: Swap agents, tools, datasets, prompts, and evaluators with ease.
  • 📦 Built-in Benchmarks: Single/multi-agent datasets for direct comparison.
  • 📊 Visual Debugging: Inspect interactions, accuracy, and tool use.
  • 🤖 Automated Workflow Optimization: Optimize agent workflows with LLM-driven evolutionary algorithms.
  • 🔧 Tool Support: Manage tool selection via pluggable wrappers.
  • 🧩 Easy Extensions: Add agents via subclassing—no core changes (see the sketch below this list).
  • 📂 Paired Datasets & Evaluators: Add new benchmarks with minimal effort.
  • 🔍 Failure Attribution: Identify failure causes and responsible agents.
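
To make the "Easy Extensions" point concrete, below is a minimal, self-contained sketch of the subclassing pattern. The names used here (AgentSystem, run, EchoAgent) are illustrative placeholders, not necessarily MASArena's actual extension API; consult the documentation for the real base class and registration hooks.

# Illustrative sketch of the subclass-based extension pattern.
# AgentSystem and run() are hypothetical stand-ins, not MASArena's real API.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class AgentSystem:  # stand-in for the framework's base agent class
    name: str = "base"
    config: dict[str, Any] = field(default_factory=dict)

    def run(self, problem: dict[str, Any]) -> dict[str, Any]:
        raise NotImplementedError

class EchoAgent(AgentSystem):  # a new agent: subclass and override run()
    def run(self, problem: dict[str, Any]) -> dict[str, Any]:
        # A real agent would call an LLM and/or tools here.
        return {"answer": f"echo: {problem.get('question', '')}"}

if __name__ == "__main__":
    print(EchoAgent(name="echo_agent").run({"question": "What is 2 + 2?"}))

In the actual framework, a new agent additionally needs to be registered so it can be selected by name when running benchmarks (like supervisor_mas or swarm in the Quick Start below); the exact hook is described in the documentation.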

🎬 Demo

See MASArena in action! This demo showcases the framework's visualization capabilities:

visualization.mp4

🚀 Quick Start

1. Setup

We recommend using uv for dependency and virtual environment management.

# Install dependencies
uv sync

# Activate the virtual environment
source .venv/bin/activate

2. Configure Environment Variables

Create a .env file in the project root and set the following:

OPENAI_API_KEY=your_openai_api_key
MODEL_NAME=gpt-4o-mini
OPENAI_API_BASE=https://api.openai.com/v1
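
To confirm these values are picked up before launching a benchmark, you can load the file yourself. The snippet below uses python-dotenv purely as a local sanity check; whether MASArena itself reads the .env file this way is an assumption.

# check_env.py: optional sanity check that .env values are readable (illustrative only)
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory
for key in ("OPENAI_API_KEY", "MODEL_NAME", "OPENAI_API_BASE"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")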

3. Running Benchmarks

# Run a standard benchmark (e.g., math with supervisor_mas agent)
./run_benchmark.sh math supervisor_mas 10

# Run the AFlow optimizer on the humaneval benchmark
./run_benchmark.sh humaneval single_agent 10 "" "" aflow

  • Supported benchmarks:
    • Math: math, aime
    • Code: humaneval, mbpp
    • Reasoning: drop, bbh, mmlu_pro, ifeval
  • Supported agent systems:
    • Single Agent: single_agent
    • Multi-Agent: supervisor_mas, swarm, agentverse, chateval, evoagent, jarvis, metagpt

📚 Documentation

For comprehensive guides, tutorials, and API references, visit our complete documentation.

✅ TODOs

  • Add asynchronous support for model calls
  • Implement failure detection in MAS workflows
  • Add more benchmarks emphasizing tool usage
  • Improve configuration for MAS and tool integration
  • Integrate multiple tools (e.g., browser, video, audio, Docker) into the current evaluation framework
  • Optimize the framework's tool management architecture to decouple MCP tool invocation from local tool invocation
  • Implement more benchmark evaluations that require tool usage (e.g., WebArena, SWE-bench)
  • Reimplement the dynamic architecture paper on top of the benchmark framework

🙌 Contributing

We warmly welcome contributions from the community!

📋 For detailed contribution guidelines, testing procedures, and development setup, please see CONTRIBUTING.md.

You can contribute in many ways:

  • 🧠 New Agent Systems (MAS): Add novel single- or multi-agent systems to expand the diversity of strategies and coordination models.

  • 📊 New Benchmark Datasets: Bring in domain-specific or task-specific datasets (e.g., reasoning, planning, tool-use, collaboration) to broaden the scope of evaluation.

  • 🛠 New Tools & Toolkits: Extend the framework's tool ecosystem by integrating domain tools (e.g., search, calculators, code editors) and improving tool selection strategies.

  • ⚙️ Improvements & Utilities: Help with performance optimization, failure handling, asynchronous processing, or new visualizations.

Quick Start for Contributors

  1. Fork and Clone: Fork the repository and clone it locally
  2. Setup Environment: Install dependencies with pip install -r requirements.txt
  3. Run Tests: Execute pytest tests/ to ensure everything works
  4. Make Changes: Implement your feature with corresponding tests
  5. Submit PR: Create a pull request with a clear description

Our automated CI/CD pipeline will run tests on every pull request to ensure code quality and reliability.
