A comprehensive benchmarking suite for evaluating and comparing different agent frameworks using the any-agent library.
This project provides tools and methodologies to evaluate the performance, cost, and accuracy of various agent frameworks (OpenAI, LangChain, LlamaIndex, etc.) across standardized tasks. The suite generates quantitative results and visualizations to help users make informed decisions about which framework best suits their needs.
- Standardized Tasks: Evaluate agents on a variety of tasks from simple Q&A to complex reasoning
- Comprehensive Metrics: Measure performance across dimensions including:
  - Accuracy and correctness
  - Token usage and cost
  - Execution time
  - Tool usage patterns
- Multiple Frameworks: Test across all major agent frameworks (OpenAI, LangChain, LlamaIndex, AutoGen, etc.)
- Visualization Tools: Generate charts and reports comparing framework performance
- Fair Comparison: Use the same underlying models across different frameworks
```bash
# Clone the repository
git clone https://github.com/Asfandyar1213/agent-benchmark-suite.git
cd agent-benchmark-suite

# For Linux/Mac
./install.sh

# For Windows
.\install.bat
```
Before running benchmarks, set up your API keys:
```bash
# For OpenAI-based benchmarks
export OPENAI_API_KEY=your_api_key_here

# For other providers as needed
export ANTHROPIC_API_KEY=your_api_key_here
```
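As a quick sanity check before launching a run, you can confirm that the keys are visible to your Python environment. This is a minimal standard-library sketch; adjust the key list to the providers you actually plan to benchmark:

```python
import os

# Keys for the providers you intend to benchmark; extend as needed.
required_keys = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]

missing = [key for key in required_keys if not os.environ.get(key)]
if missing:
    raise SystemExit(f"Missing API keys: {', '.join(missing)}")
print("All required API keys are set.")
```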
```bash
# Run a benchmark across all frameworks
python -m benchmark_suite run --task all --framework all

# Run a specific task on specific frameworks
python -m benchmark_suite run --task qa_capitals --framework openai,langchain

# Generate visualizations from results
python -m benchmark_suite visualize --results-file data/results/benchmark_results_20250525_123456.csv

# List available tasks
python -m benchmark_suite list-available-tasks

# List available frameworks
python -m benchmark_suite list-available-frameworks
```
You can also run benchmarks using the provided example script:
```bash
# Run the example benchmark comparing OpenAI and LangChain
python examples/run_benchmark.py
```
Alternatively, configure and run benchmarks programmatically from Python:

```python
from benchmark_suite.config import BenchmarkConfig, FrameworkConfig
from benchmark_suite.runner import BenchmarkRunner
from benchmark_suite.visualizers import generate_report
from any_agent import AgentFramework

# Configure the benchmark
config = BenchmarkConfig(
    name="My Benchmark",
    description="Comparing frameworks on QA tasks",
    tasks=["qa_capitals"],
    frameworks=[
        FrameworkConfig(framework=AgentFramework.OPENAI, model_id="gpt-4-turbo"),
        FrameworkConfig(framework=AgentFramework.LANGCHAIN, model_id="gpt-4-turbo"),
    ],
    runs_per_task=3,  # Run each task 3 times for statistical significance
)

# Run the benchmark
runner = BenchmarkRunner(config)
results = runner.run()

# Generate visualizations
generate_report(results, output_dir="my_benchmark_results/report")
```
The benchmark suite includes tasks across five key categories:
- Question Answering: Simple factual questions (e.g., "What is the capital of France?")
- Tool Usage: Tasks requiring effective use of tools (e.g., searching for weather information)
- Multi-step Reasoning: Complex problems requiring multiple reasoning steps (e.g., multi-step math problems)
- Instruction Following: Evaluating adherence to specific instructions (e.g., formatting requirements)
- Multi-turn Dialogues: Conversations requiring context maintenance (e.g., customer service scenarios)
```
agent-benchmark-suite/
├── data/
│   ├── results/              # Benchmark results
│   └── tasks/                # Task definitions (JSON files)
├── examples/                 # Example scripts
├── src/
│   ├── any_agent/            # Unified agent framework interface
│   │   ├── evaluation/       # Evaluation utilities
│   │   ├── implementations/  # Framework-specific implementations
│   │   ├── tools/            # Tool implementations
│   │   └── tracing/          # Execution tracing
│   └── benchmark_suite/
│       ├── evaluators/       # Evaluation modules
│       ├── tasks/            # Task loading and management
│       └── visualizers/      # Visualization tools
└── tests/                    # Test files (to be implemented)
```
- Python 3.11+
- API keys for various LLM providers (OpenAI, Anthropic, etc.)
- Dependencies:
  - pandas
  - matplotlib
  - seaborn
  - pydantic
  - typer
  - rich
The benchmark suite currently supports the following agent frameworks:
- OpenAI Assistants API
- LangChain
- LlamaIndex
- AutoGen
- Semantic Kernel
- Haystack
- Anthropic Claude
The benchmarking suite generates comprehensive HTML reports with visualizations comparing frameworks across different metrics:
- Accuracy Comparison: Bar charts showing performance by task type
- Cost Comparison: Framework costs for the same tasks
- Execution Time: How long each framework takes to complete the same tasks
- Tool Usage Patterns: Analysis of how different frameworks utilize tools
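For custom analysis beyond the generated report, the results CSV can also be loaded directly with pandas. The sketch below makes assumptions about the CSV schema: the column names (framework, task_id, accuracy) are placeholders, so inspect the file and adjust them to match what your results actually contain:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Path produced by a benchmark run (see the CLI example above).
results = pd.read_csv("data/results/benchmark_results_20250525_123456.csv")

# NOTE: "framework", "task_id", and "accuracy" are assumed column names;
# check results.columns to confirm what your version of the suite writes.
summary = results.groupby(["framework", "task_id"])["accuracy"].mean().unstack()
summary.plot(kind="bar", figsize=(10, 5), title="Mean accuracy by framework and task")
plt.tight_layout()
plt.savefig("accuracy_by_framework.png")
```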
Create a new JSON file in the `data/tasks` directory:
```json
{
  "id": "my_custom_task",
  "name": "My Custom Task",
  "description": "Description of the task",
  "type": "QUESTION_ANSWERING",
  "prompt": "Your task prompt here",
  "expected_output": "Expected answer (optional)",
  "tools": ["search_web"],
  "evaluation_criteria": [
    {"criteria": "Evaluation criterion 1", "points": 1},
    {"criteria": "Evaluation criterion 2", "points": 1}
  ]
}
```
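Once the file is saved, the new task can be referenced by its id like any built-in task, either on the CLI (`--task my_custom_task`) or via the Python API shown earlier. A minimal sketch, assuming task JSON files placed in data/tasks are discovered automatically:

```python
from benchmark_suite.config import BenchmarkConfig, FrameworkConfig
from benchmark_suite.runner import BenchmarkRunner
from any_agent import AgentFramework

# Smoke-test the custom task on a single framework before a full run.
config = BenchmarkConfig(
    name="Custom Task Check",
    description="Single run of my_custom_task",
    tasks=["my_custom_task"],  # matches the "id" field in the JSON file
    frameworks=[FrameworkConfig(framework=AgentFramework.OPENAI, model_id="gpt-4-turbo")],
    runs_per_task=1,
)
results = BenchmarkRunner(config).run()
```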
To add support for a new framework, implement a new framework adapter in the `src/any_agent/implementations` directory, following the structure of the existing framework-specific implementations.
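As a rough illustration only: the class and method names below are assumptions, not the actual any_agent adapter interface, so mirror the existing adapters rather than this sketch when writing a real one.

```python
# Hypothetical adapter skeleton -- the class name, constructor arguments, and
# run() signature are assumptions, not the real any_agent interface. Check the
# existing adapters in src/any_agent/implementations/ and copy their structure.

class MyFrameworkAgent:
    """Wraps a new agent framework behind the suite's common interface."""

    def __init__(self, model_id: str, tools: list | None = None):
        self.model_id = model_id
        self.tools = tools or []
        # Set up the underlying framework's client or agent object here.

    def run(self, prompt: str) -> str:
        # Execute the prompt with the wrapped framework and return the final
        # answer as text so the evaluators and tracing can treat it uniformly.
        raise NotImplementedError
```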
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.