Paper • Blog • AGI Inc • Leaderboard
Build, evaluate, and level up your AI agents for the real web.
AGI SDK is a toolkit for building and evaluating AI browser agents in real-world environments.
It powers REAL Bench, the first high-fidelity benchmark for AI agents navigating modern websites like Amazon, DoorDash, Airbnb, and more.
- Train agents to browse and interact with real apps
- Benchmark agents with robust, standardized tasks
- Submit to the leaderboard and see how your agents stack up!
TL;DR: Go from "idea" to "benchmarked agent" in <60 seconds
# Install the SDK
pip install agisdk
# Install Playwright browser dependencies
playwright install --force
# Set your LLM API key (for evaluation)
export OPENAI_API_KEY="your-api-key" # any supported provider key works
Supports OpenAI, Anthropic, OpenRouter, and custom models!
On Apple Silicon, run brew install --cask playwright first.
Here's a minimal example to get you started for benchmarking an AI agent on the REAL Bench environment:
from agisdk import REAL
harness = REAL.harness(
model="gpt-4o", # any LLM tag
task_type="omnizon", # Amazon-like store
headless=False # watch it click in real-time!
)
print(harness.run())
Need more control? See the full examples.
- Full-stack web replicas of top real-world apps (Amazon, Uber, Gmail, Airbnb, etc.)
- Robust agent API: Observations, Actions, Memory, Errors
- Leaderboard integration (REAL Bench)
- Customizable harness: plug your own agents
- Multi-model support: OpenAI, Anthropic, OpenRouter, or your own model
- Parallel evaluation for faster experiments
Check out the README.md in the example folder. There are three examples of custom agents in the example directory:
- example/starter.py: A simple example to get you started
- example/custom.py: A more complex example with a custom agent
- example/nova.py: For running custom agents which already have browsers running (in this case, Amazon NovaAct)
Additionally, there is a hackable example in example/hackable.py, which can be configured for better performance and used as a starting point.
If you want to develop locally, you can install from source:
# Clone the repository
git clone https://github.com/agi-inc/agisdk.git
cd agisdk
# Install in development mode
pip install -e .
The AGI SDK includes high-fidelity, fully-deterministic websites for agents to explore. These are modern web stack sites (React + Next.js) with rich functionality for core user flows, realistic mock data, and consistent behavior for testing and evaluation.
The benchmark includes these environments:
| App Clone | Task Prefix | Example Use Case |
|---|---|---|
| Amazon → Omnizon | webclones.omnizon-* | Buy a laptop, find a gift |
| DoorDash → DashDish | webclones.dashdish-* | Order dinner |
|  | webclones.fly-unified-* | Book a flight |
| Airbnb → Staynb | webclones.staynb-* | Reserve accommodation |
| Google Calendar → GoCalendar | webclones.gocalendar-* | Schedule a meeting |
| Gmail → GoMail | webclones.gomail-* | Compose an email |
| OpenTable → OpenDining | webclones.opendining-* | Book a restaurant |
| LinkedIn → NetworkIn | webclones.networkin-* | Accept a connection |
| Uber → Udriver | webclones.udriver-* | Book a ride |
| UpWork → TopWork | webclones.topwork-* | Find a freelance gig |
| Zillow → Zilloft | webclones.zilloft-* | Browse houses |
Each task comes with human-written goals designed to stress-test agent capabilities.
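As a small illustration of the task-prefix convention, the sketch below builds a task name from a task type and a numeric ID, then selects names with a glob pattern such as webclones.omnizon-*. The helpers `task_name` and `select_tasks` are hypothetical and not part of the SDK, and the list of names is made up for the example.

```python
from fnmatch import fnmatchcase


def task_name(task_type: str, task_id: int) -> str:
    """Build a name such as 'webclones.omnizon-1' from its parts."""
    return f"webclones.{task_type}-{task_id}"


def select_tasks(names: list[str], pattern: str) -> list[str]:
    """Keep the names that match a glob-style prefix pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Illustrative names only; the real task list comes from the benchmark.
names = [task_name("omnizon", 1), task_name("omnizon", 2), task_name("dashdish", 1)]
print(select_tasks(names, "webclones.omnizon-*"))
```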
To use models from other providers, set their respective API keys:
# For Anthropic models (like sonnet-3.7)
export ANTHROPIC_API_KEY="your-anthropic-api-key"
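A missing key usually surfaces only once a run has started, so it can help to check for it up front. The sketch below is one possible pre-flight check, not SDK behavior: the helper names are hypothetical, the tag-to-provider mapping is a heuristic based on the model tags shown in this README, and the variable name OPENROUTER_API_KEY is an assumption.

```python
import os


def required_key(model: str) -> str:
    """Guess which environment variable a model tag needs (heuristic)."""
    if model.startswith("openrouter/"):
        return "OPENROUTER_API_KEY"  # assumed name, not confirmed by this README
    if "sonnet" in model:
        return "ANTHROPIC_API_KEY"
    return "OPENAI_API_KEY"


def check_key(model: str, env=os.environ) -> None:
    """Raise early if the key for the chosen model is not set."""
    name = required_key(model)
    if not env.get(name):
        raise RuntimeError(f"Set {name} before running model {model!r}")
```

Calling `check_key("sonnet-3.7")` before starting the harness would then fail fast with a readable message.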
Your agent gets access to the following observation structure:
{
'chat_messages': [...], # History of chat messages
'goal': "...", # Text description of the goal
'goal_object': [...], # Structured goal object with text and images
'open_pages_urls': [...], # List of open page URLs
'active_page_index': 0, # Index of the active page
'url': "...", # Current URL
'screenshot': np.array(...), # Screenshot as numpy array
'dom_object': {...}, # DOM structure
'axtree_object': {...}, # Accessibility tree
'extra_element_properties': {...}, # Additional element properties
'focused_element_bid': "...", # ID of the focused element
'last_action': "...", # Last action performed
'last_action_error': "...", # Error from last action (if any)
'elapsed_time': 0.0, # Time elapsed in the episode
'browser': {...} # Playwright browser object (for direct control)
}
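To show how an agent might consume this structure, here is a minimal sketch that turns the text fields of an observation into a short prompt preamble. The function `describe_observation` is hypothetical, and the example observation is hand-made with only the fields the function reads; a real observation carries all the keys listed above.

```python
def describe_observation(obs: dict) -> str:
    """Render the text fields of an observation as a short prompt preamble."""
    lines = [
        f"Goal: {obs['goal']}",
        f"Current URL: {obs['url']}",
        f"Open pages: {len(obs['open_pages_urls'])} (active index {obs['active_page_index']})",
        f"Last action: {obs['last_action'] or 'none'}",
    ]
    # Only mention the error when the previous action actually failed.
    if obs.get("last_action_error"):
        lines.append(f"Last action error: {obs['last_action_error']}")
    return "\n".join(lines)


# A hand-made observation containing only the fields used above.
example_obs = {
    "goal": "Buy a laptop",
    "url": "https://example.com/cart",
    "open_pages_urls": ["https://example.com/cart"],
    "active_page_index": 0,
    "last_action": "click('element_id')",
    "last_action_error": "",
}
print(describe_observation(example_obs))
```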
Actions are specified as strings in the format of function calls. Here are some commonly used actions:
# Navigation
"goto('https://www.google.com')"
"go_back()"
"go_forward()"
# Interaction
"click('element_id')"
"fill('input_id', 'text to enter')"
"press('Enter')"
# Communication
"send_msg_to_user('I found the answer: $42.99')"
# Reporting infeasible tasks
"report_infeasible('The requested item is out of stock')"
The harness function accepts the following parameters:
REAL.harness(
# Agent configuration (provide one of these)
model="gpt-4o", # OpenAI models
# model="sonnet-3.7", # Anthropic models
# model="openrouter/deepseek/deepseek-chat-v3-0324", # OpenRouter models (with openrouter/ prefix)
# agentargs=MyAgentArgs(), # Or provide your own agent arguments instead of model
# Task selection (provide one of these or don't provide any to run all tasks)
task_name="webclones.omnizon-1", # Specific task to run
task_type="omnizon", # Run all tasks of this type
task_id=1, # Run specific task ID within a type
# Browser configuration
headless=False, # False shows the browser window; True runs without one
max_steps=25, # Maximum number of steps
browser_dimensions=(1280, 720), # Browser window dimensions
# Observation options
use_html=False, # Include HTML in observations
use_axtree=True, # Include accessibility tree
use_screenshot=True, # Include screenshots
# Leaderboard submission
leaderboard=False, # Whether to submit to leaderboard
run_id="my_unique_id", # Unique ID for the submission
# Execution options
num_workers=4, # Number of parallel workers
use_cache=True, # Use cached results when available
cache_only=False, # Only use cached results
force_refresh=False, # Force re-running tasks
# Output options
results_dir="./results" # Where to store results
)
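The "provide one of these" comments above can be checked before calling the harness. The sketch below is one possible wrapper under stated assumptions: `build_harness_kwargs` is a hypothetical helper, and its rules (exactly one of model or agentargs, task_id only together with task_type, task_name and task_type not combined) are a reading of those comments rather than documented SDK behavior.

```python
def build_harness_kwargs(model=None, agentargs=None, task_name=None,
                         task_type=None, task_id=None, **options) -> dict:
    """Collect keyword arguments, rejecting combinations the comments rule out."""
    if (model is None) == (agentargs is None):
        raise ValueError("provide exactly one of model or agentargs")
    if task_id is not None and task_type is None:
        raise ValueError("task_id selects a task within a type, so task_type is required")
    if task_name is not None and task_type is not None:
        raise ValueError("provide task_name or task_type, not both")
    kwargs = {"model": model, "agentargs": agentargs, "task_name": task_name,
              "task_type": task_type, "task_id": task_id, **options}
    # Drop unset entries so the harness falls back to its own defaults.
    return {key: value for key, value in kwargs.items() if value is not None}


print(build_harness_kwargs(model="gpt-4o", task_type="omnizon", task_id=1, headless=True))
```

The result could then be passed along as `REAL.harness(**kwargs)`.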
We welcome contributions of all kinds:
- Feature requests? Open an Issue
- Bug reports? Create a ticket
- Improve REAL tasks? Join our Project Board
- Submit code? Fork + PR. We love clean commits!
Let's build the future of agents together.
- Join our Discord (coming soon!)
- Follow AGI Inc. on LinkedIn
Because your agents deserve better than toy environments.
Because the real web is messy, and that's where the magic happens.
Because the future is agentic, and it starts here.