A comprehensive benchmarking suite for evaluating and comparing different agent frameworks using the any-agent library.
This project provides tools and methodologies to evaluate the performance, cost, and accuracy of various agent frameworks (OpenAI, LangChain, LlamaIndex, etc.) across standardized tasks. The suite generates quantitative results and visualizations to help users make informed decisions about which framework best suits their needs.
- Standardized Tasks: Evaluate agents on a variety of tasks from simple Q&A to complex reasoning
- Comprehensive Metrics: Measure performance across dimensions including:
  - Accuracy and correctness
  - Token usage and cost
  - Execution time
  - Tool usage patterns
- Multiple Frameworks: Test across all major agent frameworks (OpenAI, LangChain, LlamaIndex, AutoGen, etc.)
- Visualization Tools: Generate charts and reports comparing framework performance
- Fair Comparison: Use the same underlying models across different frameworks
```bash
# Clone the repository
git clone https://github.com/Asfandyar1213/agent-benchmark-suite.git
cd agent-benchmark-suite

# For Linux/Mac
./install.sh

# For Windows
.\install.bat
```
Before running benchmarks, set up your API keys:
```bash
# For OpenAI-based benchmarks
export OPENAI_API_KEY=your_api_key_here

# For other providers as needed
export ANTHROPIC_API_KEY=your_api_key_here
```
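As a quick sanity check before launching a run, you can confirm that the keys are visible to your Python environment. This is a minimal standard-library sketch; adjust the key list to the providers you actually plan to benchmark:

```python
import os

# Keys for the providers you intend to benchmark; extend as needed.
required_keys = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]

missing = [key for key in required_keys if not os.environ.get(key)]
if missing:
    raise SystemExit(f"Missing API keys: {', '.join(missing)}")
print("All required API keys are set.")
```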
```bash
# Run a benchmark across all frameworks
python -m benchmark_suite run --task all --framework all

# Run a specific task on specific frameworks
python -m benchmark_suite run --task qa_capitals --framework openai,langchain

# Generate visualizations from results
python -m benchmark_suite visualize --results-file data/results/benchmark_results_20250525_123456.csv

# List available tasks
python -m benchmark_suite list-available-tasks

# List available frameworks
python -m benchmark_suite list-available-frameworks
```
You can also run benchmarks using the provided example script:
```bash
# Run the example benchmark comparing OpenAI and LangChain
python examples/run_benchmark.py
```
Alternatively, configure and run benchmarks programmatically from Python:

```python
from benchmark_suite.config import BenchmarkConfig, FrameworkConfig
from benchmark_suite.runner import BenchmarkRunner
from benchmark_suite.visualizers import generate_report
from any_agent import AgentFramework

# Configure the benchmark
config = BenchmarkConfig(
    name="My Benchmark",
    description="Comparing frameworks on QA tasks",
    tasks=["qa_capitals"],
    frameworks=[
        FrameworkConfig(framework=AgentFramework.OPENAI, model_id="gpt-4-turbo"),
        FrameworkConfig(framework=AgentFramework.LANGCHAIN, model_id="gpt-4-turbo"),
    ],
    runs_per_task=3,  # Run each task 3 times for statistical significance
)

# Run the benchmark
runner = BenchmarkRunner(config)
results = runner.run()

# Generate visualizations
generate_report(results, output_dir="my_benchmark_results/report")
```
The benchmark suite includes tasks across five key categories:
- Question Answering: Simple factual questions (e.g., "What is the capital of France?")
- Tool Usage: Tasks requiring effective use of tools (e.g., searching for weather information)
- Multi-step Reasoning: Complex problems requiring multiple reasoning steps (e.g., multi-step math problems)
- Instruction Following: Evaluating adherence to specific instructions (e.g., formatting requirements)
- Multi-turn Dialogues: Conversations requiring context maintenance (e.g., customer service scenarios)
```
agent-benchmark-suite/
├── data/
│   ├── results/              # Benchmark results
│   └── tasks/                # Task definitions (JSON files)
├── examples/                 # Example scripts
├── src/
│   ├── any_agent/            # Unified agent framework interface
│   │   ├── evaluation/       # Evaluation utilities
│   │   ├── implementations/  # Framework-specific implementations
│   │   ├── tools/            # Tool implementations
│   │   └── tracing/          # Execution tracing
│   └── benchmark_suite/
│       ├── evaluators/       # Evaluation modules
│       ├── tasks/            # Task loading and management
│       └── visualizers/      # Visualization tools
└── tests/                    # Test files (to be implemented)
```
- Python 3.11+
- API keys for various LLM providers (OpenAI, Anthropic, etc.)
- Dependencies:
  - pandas
  - matplotlib
  - seaborn
  - pydantic
  - typer
  - rich
The benchmark suite currently supports the following agent frameworks:
- OpenAI Assistants API
- LangChain
- LlamaIndex
- AutoGen
- Semantic Kernel
- Haystack
- Anthropic Claude
The benchmarking suite generates comprehensive HTML reports with visualizations comparing frameworks across different metrics:
- Accuracy Comparison: Bar charts showing performance by task type
- Cost Comparison: Framework costs for the same tasks
- Execution Time: How long each framework takes to complete the same tasks
- Tool Usage Patterns: Analysis of how different frameworks utilize tools
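For custom analysis beyond the generated report, the results CSV can also be loaded directly with pandas. The sketch below makes assumptions about the CSV schema: the column names (framework, task_id, accuracy) are placeholders, so inspect the file and adjust them to match what your results actually contain:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Path produced by a benchmark run (see the CLI example above).
results = pd.read_csv("data/results/benchmark_results_20250525_123456.csv")

# NOTE: "framework", "task_id", and "accuracy" are assumed column names;
# check results.columns to confirm what your version of the suite writes.
summary = results.groupby(["framework", "task_id"])["accuracy"].mean().unstack()
summary.plot(kind="bar", figsize=(10, 5), title="Mean accuracy by framework and task")
plt.tight_layout()
plt.savefig("accuracy_by_framework.png")
```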
Create a new JSON file in the `data/tasks` directory:
```json
{
  "id": "my_custom_task",
  "name": "My Custom Task",
  "description": "Description of the task",
  "type": "QUESTION_ANSWERING",
  "prompt": "Your task prompt here",
  "expected_output": "Expected answer (optional)",
  "tools": ["search_web"],
  "evaluation_criteria": [
    {"criteria": "Evaluation criterion 1", "points": 1},
    {"criteria": "Evaluation criterion 2", "points": 1}
  ]
}
```
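Once the file is saved, the new task can be referenced by its id like any built-in task, either on the CLI (`--task my_custom_task`) or via the Python API shown earlier. A minimal sketch, assuming task JSON files placed in data/tasks are discovered automatically:

```python
from benchmark_suite.config import BenchmarkConfig, FrameworkConfig
from benchmark_suite.runner import BenchmarkRunner
from any_agent import AgentFramework

# Smoke-test the custom task on a single framework before a full run.
config = BenchmarkConfig(
    name="Custom Task Check",
    description="Single run of my_custom_task",
    tasks=["my_custom_task"],  # matches the "id" field in the JSON file
    frameworks=[FrameworkConfig(framework=AgentFramework.OPENAI, model_id="gpt-4-turbo")],
    runs_per_task=1,
)
results = BenchmarkRunner(config).run()
```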
To add support for a new framework, implement a new framework adapter in the `src/any_agent/implementations` directory, following the structure of the existing framework-specific implementations.
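As a rough illustration only: the class and method names below are assumptions, not the actual any_agent adapter interface, so mirror the existing adapters rather than this sketch when writing a real one.

```python
# Hypothetical adapter skeleton -- the class name, constructor arguments, and
# run() signature are assumptions, not the real any_agent interface. Check the
# existing adapters in src/any_agent/implementations/ and copy their structure.

class MyFrameworkAgent:
    """Wraps a new agent framework behind the suite's common interface."""

    def __init__(self, model_id: str, tools: list | None = None):
        self.model_id = model_id
        self.tools = tools or []
        # Set up the underlying framework's client or agent object here.

    def run(self, prompt: str) -> str:
        # Execute the prompt with the wrapped framework and return the final
        # answer as text so the evaluators and tracing can treat it uniformly.
        raise NotImplementedError
```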
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.