Agent Performance Benchmarking Suite


A comprehensive benchmarking suite for evaluating and comparing different agent frameworks using the any-agent library.

πŸ“‹ Overview

This project provides tools and methodologies to evaluate the performance, cost, and accuracy of various agent frameworks (OpenAI, LangChain, LlamaIndex, etc.) across standardized tasks. The suite generates quantitative results and visualizations to help users make informed decisions about which framework best suits their needs.

✨ Features

  • Standardized Tasks: Evaluate agents on a variety of tasks from simple Q&A to complex reasoning
  • Comprehensive Metrics: Measure performance across dimensions including:
    • Accuracy and correctness
    • Token usage and cost
    • Execution time
    • Tool usage patterns
  • Multiple Frameworks: Test across major agent frameworks (OpenAI, LangChain, LlamaIndex, AutoGen, and more)
  • Visualization Tools: Generate charts and reports comparing framework performance
  • Fair Comparison: Use the same underlying models across different frameworks

πŸš€ Installation

# Clone the repository
git clone https://github.com/Asfandyar1213/agent-benchmark-suite.git
cd agent-benchmark-suite

# For Linux/Mac
./install.sh

# For Windows
.\install.bat

Environment Setup

Before running benchmarks, set up your API keys:

# For OpenAI-based benchmarks
export OPENAI_API_KEY=your_api_key_here

# For other providers as needed
export ANTHROPIC_API_KEY=your_api_key_here
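
If you want to confirm that the keys are actually visible before starting a long run, a small optional check like the one below works. This script is not part of the suite; it only inspects the environment, and you can extend the list with whichever providers you plan to benchmark.

# Optional pre-flight check (a hypothetical helper, not shipped with the suite)
import os
import sys

required = ["OPENAI_API_KEY"]  # add "ANTHROPIC_API_KEY", etc. as needed

missing = [name for name in required if not os.environ.get(name)]
if missing:
    sys.exit(f"Missing API keys: {', '.join(missing)}")
print("All required API keys are set.")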

πŸ“Š Usage

Command Line Interface

# Run a benchmark across all frameworks
python -m benchmark_suite run --task all --framework all

# Run a specific task on specific frameworks
python -m benchmark_suite run --task qa_capitals --framework openai,langchain

# Generate visualizations from results
python -m benchmark_suite visualize --results-file data/results/benchmark_results_20250525_123456.csv

# List available tasks
python -m benchmark_suite list-available-tasks

# List available frameworks
python -m benchmark_suite list-available-frameworks

Example Script

You can also run benchmarks using the provided example script:

# Run the example benchmark comparing OpenAI and LangChain
python examples/run_benchmark.py

Programmatic Usage

from benchmark_suite.config import BenchmarkConfig, FrameworkConfig
from benchmark_suite.runner import BenchmarkRunner
from any_agent import AgentFramework

# Configure the benchmark
config = BenchmarkConfig(
    name="My Benchmark",
    description="Comparing frameworks on QA tasks",
    tasks=["qa_capitals"],
    frameworks=[
        FrameworkConfig(framework=AgentFramework.OPENAI, model_id="gpt-4-turbo"),
        FrameworkConfig(framework=AgentFramework.LANGCHAIN, model_id="gpt-4-turbo"),
    ],
    runs_per_task=3,  # Run each task 3 times to smooth out run-to-run variance
)

# Run the benchmark
runner = BenchmarkRunner(config)
results = runner.run()

# Generate visualizations
from benchmark_suite.visualizers import generate_report
generate_report(results, output_dir="my_benchmark_results/report")

πŸ“ Task Categories

The benchmark suite includes tasks across five key categories:

  1. Question Answering: Simple factual questions (e.g., "What is the capital of France?")
  2. Tool Usage: Tasks requiring effective use of tools (e.g., searching for weather information)
  3. Multi-step Reasoning: Complex problems requiring multiple reasoning steps (e.g., multi-step math problems)
  4. Instruction Following: Evaluating adherence to specific instructions (e.g., formatting requirements)
  5. Multi-turn Dialogues: Conversations requiring context maintenance (e.g., customer service scenarios)

πŸ“ Project Structure

agent-benchmark-suite/
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ results/     # Benchmark results
β”‚   └── tasks/       # Task definitions (JSON files)
β”œβ”€β”€ examples/        # Example scripts
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ any_agent/   # Unified agent framework interface
β”‚   β”‚   β”œβ”€β”€ evaluation/    # Evaluation utilities
β”‚   β”‚   β”œβ”€β”€ implementations/  # Framework-specific implementations
β”‚   β”‚   β”œβ”€β”€ tools/        # Tool implementations
β”‚   β”‚   └── tracing/      # Execution tracing
β”‚   └── benchmark_suite/
β”‚       β”œβ”€β”€ evaluators/    # Evaluation modules
β”‚       β”œβ”€β”€ tasks/         # Task loading and management
β”‚       └── visualizers/   # Visualization tools
└── tests/           # Test files (to be implemented)

βš™οΈ Requirements

  • Python 3.11+
  • API keys for various LLM providers (OpenAI, Anthropic, etc.)
  • Dependencies:
    • pandas
    • matplotlib
    • seaborn
    • pydantic
    • typer
    • rich

πŸ”„ Supported Frameworks

The benchmark suite currently supports the following agent frameworks:

  • OpenAI Assistants API
  • LangChain
  • LlamaIndex
  • AutoGen
  • Semantic Kernel
  • Haystack
  • Anthropic Claude

πŸ“ˆ Example Report

The benchmarking suite generates comprehensive HTML reports with visualizations comparing frameworks across different metrics (a sketch for building a custom chart from the raw results CSV follows the list):

  • Accuracy Comparison: Bar charts showing performance by task type
  • Cost Comparison: Framework costs for the same tasks
  • Execution Time: Performance benchmarks across frameworks
  • Tool Usage Patterns: Analysis of how different frameworks utilize tools
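
If the built-in report does not cover a view you need, the raw results CSV written by a run (see the visualize command above) can be loaded directly with the suite's own dependencies, pandas and seaborn. The column names used below (framework, accuracy) are assumptions about the CSV layout, not a documented schema; check the header of your own results file first.

# A minimal sketch of a custom chart, assuming 'framework' and 'accuracy' columns
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("data/results/benchmark_results_20250525_123456.csv")

# Mean accuracy per framework across all tasks and runs
summary = df.groupby("framework", as_index=False)["accuracy"].mean()

sns.barplot(data=summary, x="framework", y="accuracy")
plt.title("Mean accuracy by framework")
plt.tight_layout()
plt.savefig("accuracy_by_framework.png")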

πŸ” Extending the Suite

Adding New Tasks

Create a new JSON file in the data/tasks directory:

{
  "id": "my_custom_task",
  "name": "My Custom Task",
  "description": "Description of the task",
  "type": "QUESTION_ANSWERING",
  "prompt": "Your task prompt here",
  "expected_output": "Expected answer (optional)",
  "tools": ["search_web"],
  "evaluation_criteria": [
    {"criteria": "Evaluation criterion 1", "points": 1},
    {"criteria": "Evaluation criterion 2", "points": 1}
  ]
}

Adding New Frameworks

Implement a new framework adapter in the src/any_agent/implementations directory.
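
The adapter interface is defined inside src/any_agent/implementations and is not documented in this README, so the sketch below is only a rough illustration of the shape such an adapter usually takes. The class and method names (MyFrameworkAgent, AgentResult, run) are assumptions for illustration, not the actual any_agent API; match whatever the existing implementations in that directory do.

# Hypothetical adapter sketch; align names and signatures with the real base class
from dataclasses import dataclass, field
from typing import Any

@dataclass
class AgentResult:
    # Normalized result container so the runner can score every framework the same way
    output: str
    tokens_used: int = 0
    metadata: dict[str, Any] = field(default_factory=dict)

class MyFrameworkAgent:
    """Wraps a new framework behind a common run(prompt) interface."""

    def __init__(self, model_id: str, tools: list[Any] | None = None):
        self.model_id = model_id
        self.tools = tools or []
        # Initialize the underlying framework's client/agent here

    def run(self, prompt: str) -> AgentResult:
        # Call the underlying framework, then normalize its answer and token
        # usage into AgentResult so the evaluators and visualizers can use it
        raise NotImplementedError("Wire up the target framework here")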

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.
