A professional Web Agent evaluation tool built with Python and Playwright, providing a complete solution for evaluating web automation agents. It supports multiple agent types, batch testing, CDP-based screenshots, and automatic task validation.
This framework is a modular Web Agent evaluation system with the following core components:
- Controller Module - Unified evaluation workflow orchestration and session management
- Web Environment - Playwright-based browser interaction management
- Agent Interface - Support for multiple agent types (Human, Terminal, UITARS)
- Batch Evaluation - Parallel execution of multiple tasks with various export formats
- Task Validation - Intelligent task completion validation system
- CDP Screenshot Technology - Uses Chrome DevTools Protocol to avoid DOM focus loss
- Non-intrusive Screenshots - Screenshot process doesn't affect page state or user interactions
- Smart Coordinate Mapping - Automatic handling of device scale factor coordinate transformations
- Precise Scroll Control - Support for coordinate-based precise scrolling operations
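The coordinate mapping mentioned above can be sketched as a pair of pure functions; the names and signatures here are illustrative, not the framework's actual API. The idea is that CDP screenshots are captured in device pixels, while Playwright's mouse API works in CSS pixels, so clicks predicted on a screenshot must be scaled by the device scale factor:

```python
# Hypothetical sketch of device-scale-factor coordinate mapping;
# function names are illustrative, not the framework's actual API.
def css_to_device_pixels(x: float, y: float, scale_factor: float) -> tuple:
    """Map CSS-pixel coordinates to device pixels (screenshot space)."""
    return round(x * scale_factor), round(y * scale_factor)

def device_to_css_pixels(x: float, y: float, scale_factor: float) -> tuple:
    """Inverse mapping: device pixels (e.g. a model's click prediction on a
    screenshot) back to CSS pixels for the browser's mouse API."""
    return x / scale_factor, y / scale_factor
```

On a display with a scale factor of 2.0, a click predicted at (200, 100) on the screenshot maps back to (100, 50) in CSS pixels.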
- HumanAgent - Human interaction agent, opens browser and waits for manual operations
- TerminalAgent - Terminal interaction agent, controls operations via command line
- UITARSAgent - Intelligent AI agent with multi-turn conversation and history management
- Parallel Execution - Support for multi-task parallel processing
- Multiple Export Formats - JSON, CSV, HTML, Excel, etc.
- Failure Retry - Automatic retry of failed tasks
- Detailed Reports - Generate comprehensive evaluation reports and statistics
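The failure-retry behavior can be sketched as a small async helper; this is an assumed simplification, not the framework's actual implementation (the real batch controller also handles timeouts and parallel workers):

```python
import asyncio
from typing import Awaitable, Callable

# Illustrative sketch of failure-retry logic; the real batch controller
# also manages timeouts, parallelism, and result aggregation.
async def run_with_retry(task: Callable[[], Awaitable[bool]],
                         max_retries: int = 2) -> tuple:
    """Run an async task, retrying on failure.

    Returns (success, attempts): attempts counts every run including retries.
    """
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        if await task():
            return True, attempts
    return False, attempts
```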
# Clone the project
git clone <repository-url>
cd Agent_Eval
# Install Python dependencies
pip install -r requirements.txt
# Install Playwright browsers
playwright install
# Human agent (default) - opens browser for manual interaction
python main.py single "Click the search button"
# Terminal agent - command line interaction control
python main.py single "Fill out the form" --agent terminal
# With specific URL
python main.py single "Navigate to products page" --url https://example.com
# Headless mode
python main.py single "Click the menu button" --headless
# UITARS AI agent
python main.py single "Complete the login process" --agent uitars
# Run batch evaluation
python main.py batch eval_data/human_evaluation_config.json
python main.py batch eval_data/terminal_evaluation_config.json
python main.py batch eval_data/uitars_eval.json
# Create configuration template
python main.py create-config my_batch_config.json
# Start task annotation tool
python start_annotation.py
# Use configuration consolidator
python config_consolidator.py
{
"batch_name": "web_ui_evaluation",
"description": "Web UI component interaction testing",
"html_files_directory": "eval_data",
"output_directory": "eval_results",
"html_files": [
{
"file_id": "login_page",
"file_path": "login.html",
"tasks": [
{
"task_id": "login_test",
"description": "Login to the system using username 'admin' and password 'password'",
"success_criteria": [
"Successfully navigate to dashboard page",
"Display welcome message"
],
"timeout": 300,
"max_steps": 5
}
]
}
],
"batch_settings": {
"parallel_execution": true,
"max_parallel_workers": 3,
"continue_on_failure": true,
"export_formats": ["json", "html", "excel"]
}
}
Agent_Eval/
├── agent_eval/                  # Core framework package
│   ├── controller/              # Controller module
│   │   └── evaluation_controller.py
│   ├── environment/             # Web environment module
│   │   └── web_environment.py
│   ├── agent/                   # Agent module
│   │   ├── base_agent.py        # Base agent interface
│   │   ├── human_agent.py       # Human interaction agent
│   │   ├── terminal_agent.py    # Terminal interaction agent
│   │   └── uitars_agent.py      # UITARS AI agent
│   ├── batch/                   # Batch evaluation module
│   │   ├── batch_controller.py  # Batch controller
│   │   ├── batch_config.py      # Configuration management
│   │   └── batch_aggregator.py  # Result aggregation
│   └── validation/              # Validation module
│       └── task_completion_validator.py
├── config/                      # Configuration files
│   └── default_config.py
├── eval_data/                   # Test data and configurations
│   ├── *.html                   # Test pages
│   └── *_config.json            # Evaluation configurations
├── Eval_dataset/                # Large dataset
├── logs/                        # Logs and results
├── *_eval_results/              # Various evaluation results
├── annotation_workflow.py       # Annotation workflow
├── task_annotation_tool.py      # Task annotation tool
├── config_consolidator.py       # Configuration consolidator
├── main.py                      # Unified entry point
├── start_annotation.py          # Annotation tool launcher
└── requirements.txt             # Dependencies list
# Page navigation
await env.launch_webpage(url)
# Screenshots and state
screenshot = await env.get_screenshot()
page_info = await env.get_page_info()
# Interaction operations
await env.click(x, y)
await env.input_text(text)
await env.scroll(x, y, direction, amount)
await env.drag(start_x, start_y, end_x, end_y)
# Basic prediction interface
action = await agent.predict(screenshot, task_description)
# History management (UITARS Agent)
agent.add_to_history(step_info)
conversation = agent.get_conversation_history()
# Load and run
config = load_batch_config("config.json")
results = await batch_controller.run_batch_evaluation(config)
# Export results
await batch_controller.export_results(["json", "excel", "html"])
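For a feel of what multi-format export involves, here is a hedged standard-library sketch covering the JSON and CSV cases; the framework's `batch_aggregator` handles the real export (including HTML and Excel), and the field names below are illustrative:

```python
import csv
import io
import json

# Hedged sketch of multi-format export; the framework's batch_aggregator
# is the real implementation. Result field names are illustrative.
def export_results(results: list, fmt: str) -> str:
    """Serialize a list of per-task result dicts to the requested format."""
    if fmt == "json":
        return json.dumps(results, indent=2)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=list(results[0].keys()))
        writer.writeheader()
        writer.writerows(results)
        return buf.getvalue()
    raise ValueError(f"unsupported format: {fmt}")
```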
HumanAgent
- Purpose: Manual testing and benchmark establishment
- Features: Opens browser, waits for manual task completion
- Use Cases: Complex task validation, user experience testing

TerminalAgent
- Purpose: Programmatic control and debugging
- Features: Controls browser operations via terminal input
- Use Cases: Automation script development, precise operation control

UITARSAgent
- Purpose: Intelligent automated testing
- Features: Visual understanding-based intelligent operations
- Use Cases: Large-scale automated testing, intelligent regression testing
- JSON: Detailed structured data
- CSV: Tabular data for easy analysis
- HTML: Visual reports with screenshots
- Excel: Multi-sheet detailed reports
- Success Rate: Task completion percentage
- Execution Time: Average and total execution time
- Step Statistics: Operation step analysis
- Error Analysis: Failure reason classification
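The metrics above reduce to simple aggregations over per-task results. A minimal sketch, assuming each result is a dict with `success`, `duration`, and `steps` keys (field names are illustrative):

```python
from statistics import mean

# Sketch of the report metrics: success rate, timing, and step statistics,
# computed over illustrative per-task result dicts.
def summarize(results: list) -> dict:
    return {
        "success_rate": sum(r["success"] for r in results) / len(results),
        "avg_time": mean(r["duration"] for r in results),
        "total_time": sum(r["duration"] for r in results),
        "avg_steps": mean(r["steps"] for r in results),
    }
```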
- Automatic task completion status validation
- Support for multiple validation criteria
- Intelligent result judgment
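A drastically simplified sketch of criteria-based validation: each success criterion from the config is checked against the final page text. The real `task_completion_validator` is more sophisticated; this only illustrates the shape of the check:

```python
# Simplified sketch of criteria-based validation: each success criterion
# is matched against the final page text. The framework's
# task_completion_validator is the real, more capable implementation.
def validate_task(page_text: str, success_criteria: list) -> dict:
    """Return per-criterion matches and an overall pass/fail flag."""
    results = {c: c.lower() in page_text.lower() for c in success_criteria}
    return {"passed": all(results.values()), "criteria": results}
```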
- Visual task configuration
- Browser-integrated annotation
- Automatic configuration generation
- Playwright Installation Issues
  playwright install --force
- Permission Issues
  # Windows
  Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
  # Linux/Mac
  chmod +x main.py
- Browser Launch Failures
  - Check system dependencies
  - Try headless mode
  - Check detailed logs
# Check log files
tail -f logs/evaluation.log
- Fork the project
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Playwright - Powerful browser automation tool
- Pydantic - Data validation and settings management
- Loguru - Elegant logging
Contact: For questions or suggestions, please submit an Issue or Pull Request.