Web Agent Evaluation Framework

A professional Web Agent evaluation framework built with Python and Playwright, providing a complete solution for evaluating web automation agents. It supports multiple agent types, parallel batch testing, and automatic task-completion validation.

🎯 Project Overview

This framework is a modular Web Agent evaluation system with the following core components:

  • 🎮 Controller Module - Unified evaluation workflow orchestration and session management
  • 🌐 Web Environment - Playwright-based browser interaction management
  • 🤖 Agent Interface - Support for multiple agent types (Human, Terminal, UITARS)
  • 📊 Batch Evaluation - Parallel execution of multiple tasks with various export formats
  • ✅ Task Validation - Intelligent task completion validation system

✨ Key Features

🚀 Performance Optimizations

  • CDP Screenshot Technology - Uses Chrome DevTools Protocol to avoid DOM focus loss (see the sketch after this list)
  • Non-intrusive Screenshots - Screenshot process doesn't affect page state or user interactions
  • Smart Coordinate Mapping - Automatic handling of device scale factor coordinate transformations
  • Precise Scroll Control - Support for coordinate-based precise scrolling operations
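
For illustration, here is how a CDP screenshot can be taken through Playwright without touching page focus; this is a minimal sketch of the general technique, not the framework's own implementation:

# Sketch of the general CDP screenshot technique via Playwright; this is
# not the framework's own code.
import asyncio
import base64

from playwright.async_api import async_playwright

async def cdp_screenshot(url: str) -> bytes:
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)

        # Capture through the DevTools protocol; unlike element-level
        # screenshot approaches, this leaves focus and page state alone.
        cdp = await page.context.new_cdp_session(page)
        result = await cdp.send("Page.captureScreenshot", {"format": "png"})

        await browser.close()
        return base64.b64decode(result["data"])

asyncio.run(cdp_screenshot("https://example.com"))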

🎯 Diverse Agent Support

  • HumanAgent - Human interaction agent, opens browser and waits for manual operations
  • TerminalAgent - Terminal interaction agent, controls operations via command line
  • UITARSAgent - Intelligent AI agent with multi-turn conversation and history management

📊 Powerful Batch Evaluation

  • Parallel Execution - Support for multi-task parallel processing
  • Multiple Export Formats - JSON, CSV, HTML, Excel, etc.
  • Failure Retry - Automatic retry of failed tasks
  • Detailed Reports - Generate comprehensive evaluation reports and statistics

🚀 Quick Start

1. Environment Setup

# Clone the project
git clone <repository-url>
cd Agent_Eval

# Install Python dependencies
pip install -r requirements.txt

# Install Playwright browsers
playwright install

2. Basic Usage

Single Task Evaluation

# Human agent (default) - opens browser for manual interaction
python main.py single "Click the search button"

# Terminal agent - command line interaction control
python main.py single "Fill out the form" --agent terminal

# With specific URL
python main.py single "Navigate to products page" --url https://example.com

# Headless mode
python main.py single "Click the menu button" --headless

# UITARS AI agent
python main.py single "Complete the login process" --agent uitars

Batch Evaluation

# Run batch evaluation
python main.py batch eval_data/human_evaluation_config.json
python main.py batch eval_data/terminal_evaluation_config.json
python main.py batch eval_data/uitars_eval.json

# Create configuration template
python main.py create-config my_batch_config.json

Task Configuration Annotation Tools

# Start task annotation tool
python start_annotation.py

# Use configuration consolidator
python config_consolidator.py

3. Configuration Examples

Basic Task Configuration

{
  "batch_name": "web_ui_evaluation",
  "description": "Web UI component interaction testing",
  "html_files_directory": "eval_data",
  "output_directory": "eval_results",
  "html_files": [
    {
      "file_id": "login_page",
      "file_path": "login.html",
      "tasks": [
        {
          "task_id": "login_test",
          "description": "Login to the system using username 'admin' and password 'password'",
          "success_criteria": [
            "Successfully navigate to dashboard page",
            "Display welcome message"
          ],
          "timeout": 300,
          "max_steps": 5
        }
      ]
    }
  ],
  "batch_settings": {
    "parallel_execution": true,
    "max_parallel_workers": 3,
    "continue_on_failure": true,
    "export_formats": ["json", "html", "excel"]
  }
}
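
Before launching a long batch, the configuration can be sanity-checked with the standard library; a minimal sketch (the required keys are read off the example above, not taken from a documented schema):

# Quick sanity check of a batch configuration; key names follow the
# example above rather than a documented schema.
import json

with open("my_batch_config.json") as f:
    config = json.load(f)

for key in ("batch_name", "html_files", "batch_settings"):
    assert key in config, f"missing required key: {key}"

for html_file in config["html_files"]:
    for task in html_file["tasks"]:
        print(html_file["file_id"], task["task_id"], task.get("timeout", 300))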

๐Ÿ“ Project Structure

Agent_Eval/
├── agent_eval/                    # Core framework package
│   ├── controller/               # Controller module
│   │   └── evaluation_controller.py
│   ├── environment/              # Web environment module
│   │   └── web_environment.py
│   ├── agent/                    # Agent module
│   │   ├── base_agent.py        # Base agent interface
│   │   ├── human_agent.py       # Human interaction agent
│   │   ├── terminal_agent.py    # Terminal interaction agent
│   │   └── uitars_agent.py      # UITARS AI agent
│   ├── batch/                    # Batch evaluation module
│   │   ├── batch_controller.py  # Batch controller
│   │   ├── batch_config.py      # Configuration management
│   │   └── batch_aggregator.py  # Result aggregation
│   └── validation/               # Validation module
│       └── task_completion_validator.py
├── config/                       # Configuration files
│   └── default_config.py
├── eval_data/                    # Test data and configurations
│   ├── *.html                   # Test pages
│   └── *_config.json           # Evaluation configurations
├── Eval_dataset/                 # Large dataset
├── logs/                         # Logs and results
├── *_eval_results/              # Various evaluation results
├── annotation_workflow.py       # Annotation workflow
├── task_annotation_tool.py      # Task annotation tool
├── config_consolidator.py       # Configuration consolidator
├── main.py                      # 🎯 Unified entry point
├── start_annotation.py          # Annotation tool launcher
└── requirements.txt             # Dependencies list

🔧 Core API Interfaces

Web Environment Interface

# Page navigation
await env.launch_webpage(url)

# Screenshots and state
screenshot = await env.get_screenshot()
page_info = await env.get_page_info()

# Interaction operations
await env.click(x, y)
await env.input_text(text)
await env.scroll(x, y, direction, amount)
await env.drag(start_x, start_y, end_x, end_y)
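
Put together, one evaluation step might look like the sketch below; the method calls are the ones listed above, while creating and closing the environment object is left out because it depends on web_environment.py:

# Illustrative flow built from the calls above. Constructing and tearing
# down the environment object is framework-specific and omitted here.
async def run_step(env):
    await env.launch_webpage("https://example.com")

    screenshot = await env.get_screenshot()    # CDP-based, non-intrusive
    page_info = await env.get_page_info()

    await env.click(120, 240)                  # device scale handled internally
    await env.input_text("hello")
    await env.scroll(120, 240, "down", 300)    # direction/amount encoding assumed
    await env.drag(100, 400, 300, 400)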

Agent Interface

# Basic prediction interface
action = await agent.predict(screenshot, task_description)

# History management (UITARS Agent)
agent.add_to_history(step_info)
conversation = agent.get_conversation_history()
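
New agent types plug in behind the same predict call; a hedged sketch of a custom agent (the BaseAgent class name and the dict action format are assumptions; see base_agent.py for the real contract):

# Hypothetical custom agent. BaseAgent and the returned action format
# are assumptions; base_agent.py defines the real interface.
from agent_eval.agent.base_agent import BaseAgent

class CenterClickAgent(BaseAgent):
    """Toy agent that always clicks the centre of a 1280x720 viewport."""

    async def predict(self, screenshot, task_description):
        # A real agent would inspect the screenshot before acting.
        return {"action": "click", "x": 640, "y": 360}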

Batch Evaluation Interface

# Load and run
config = load_batch_config("config.json")
results = await batch_controller.run_batch_evaluation(config)

# Export results
await batch_controller.export_results(["json", "excel", "html"])
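
The calls above compose into a complete run; a minimal end-to-end sketch (the import paths and the BatchController constructor are assumptions inferred from the project layout):

# End-to-end batch run. Import paths and the BatchController constructor
# are assumptions based on the project structure shown above.
import asyncio

from agent_eval.batch.batch_config import load_batch_config
from agent_eval.batch.batch_controller import BatchController

async def main():
    config = load_batch_config("eval_data/uitars_eval.json")
    controller = BatchController()
    results = await controller.run_batch_evaluation(config)
    await controller.export_results(["json", "html"])
    return results

asyncio.run(main())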

🎮 Agent Types Explained

1. HumanAgent

  • Purpose: Manual testing and establishing human baselines
  • Features: Opens browser, waits for manual task completion
  • Use Cases: Complex task validation, user experience testing

2. TerminalAgent

  • Purpose: Programmatic control and debugging
  • Features: Controls browser operations via terminal input
  • Use Cases: Automation script development, precise operation control

3. UITARSAgent

  • Purpose: Intelligent automated testing
  • Features: Plans actions from visual understanding of screenshots
  • Use Cases: Large-scale automated testing, intelligent regression testing

📊 Evaluation Results and Reports

Result Formats

  • JSON: Detailed structured data
  • CSV: Tabular data for easy analysis
  • HTML: Visual reports with screenshots
  • Excel: Multi-sheet detailed reports

Key Metrics

  • Success Rate: Task completion percentage
  • Execution Time: Average and total execution time
  • Step Statistics: Operation step analysis
  • Error Analysis: Failure reason classification
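
These metrics can be recomputed from the JSON export; an illustrative sketch (the result schema used here, a list of task records with success and duration fields, is an assumption rather than the documented format):

# Recompute headline metrics from an exported JSON report. The schema
# (a list of task records with "success" and "duration" fields) is an
# assumption, not the documented format.
import json

with open("eval_results/batch_results.json") as f:
    tasks = json.load(f)

succeeded = sum(1 for t in tasks if t.get("success"))
total_time = sum(t.get("duration", 0) for t in tasks)
print(f"success rate: {succeeded / len(tasks):.1%} ({succeeded}/{len(tasks)})")
print(f"avg time per task: {total_time / len(tasks):.1f}s")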

🛠️ Advanced Features

Task Validation System

  • Automatic task completion status validation
  • Support for multiple validation criteria
  • Intelligent result judgment
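
As a rough outline of how criterion checks can work, the sketch below treats each success_criteria string as a substring to find in the final page text; the real validator in task_completion_validator.py is more sophisticated, so this is purely illustrative:

# Naive criterion check: each criterion must appear in the final page
# text. Purely illustrative; task_completion_validator.py is the real one.
def validate(page_text: str, success_criteria: list[str]) -> bool:
    return all(c.lower() in page_text.lower() for c in success_criteria)

print(validate("Welcome, admin! Dashboard loaded.",
               ["Dashboard", "Welcome"]))  # True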

Configuration Annotation Tools

  • Visual task configuration
  • Browser-integrated annotation
  • Automatic configuration generation

๐Ÿ” Troubleshooting

Common Issues

  1. Playwright Installation Issues

    playwright install --force
  2. Permission Issues

    # Windows
    Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
    
    # Linux/Mac
    chmod +x main.py
  3. Browser Launch Failures

    • Check system dependencies
    • Try headless mode
    • Check detailed logs

Logging and Debugging

# Check log files
tail -f logs/evaluation.log

๐Ÿค Contributing

  1. Fork the project
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Playwright - Powerful browser automation tool
  • Pydantic - Data validation and settings management
  • Loguru - Elegant logging

📧 Contact: For questions or suggestions, please submit an Issue or Pull Request.
