
VeriGUI: Verifiable Long-Chain Multi-Domain GUI Dataset

Note: This work is still in progress, and additional data will be included in a future version.


🌟 Updates

  • [Jul 21, 2025] πŸ”₯ We have released the first batch of 130 Web task trajectories!

πŸ“– Overview

Recent studies have delved into constructing autonomous agents capable of performing complex Graphical User Interface (GUI)-based computer tasks, with the potential to revolutionize human-computer interaction. Despite encouraging results, existing efforts mainly focus on short-term interactions and rely on outcome-only verification, thereby limiting their scalability in real-world GUI applications that demand long-horizon task decomposition and execution.

In this work, we introduce VeriGUI, a novel verifiable long-chain GUI dataset designed to facilitate the development and evaluation of generalist GUI agents operating in realistic computer environments. Our dataset emphasizes two critical dimensions:

  • (1) πŸ”— Long-chain complexity, with tasks decomposed into a sequence of interdependent subtasks spanning hundreds of steps, explicitly designed to allow any subtask to serve as a valid starting point;
  • (2) βœ… Subtask-level verifiability, which enables diverse exploration strategies within each subtask while ensuring that each subtask-level goal remains verifiable and consistent.

The dataset consists of GUI task trajectories spanning both desktop and web, annotated by human experts. Extensive experiments on VeriGUI using various agents with different foundation models reveal significant performance gaps in handling long-horizon tasks, highlighting the need for more robust planning and decision-making capabilities in GUI agents.

The VeriGUI dataset consists of various GUI tasks spanning both desktop and web.

✨ Key Features

πŸ”— Long-Chain Complexity

  • Tasks require 2-15 interdependent subtasks with hundreds of GUI actions
  • Complex workflows spanning multiple applications and web pages
  • Realistic task dependencies that require adaptive reasoning and planning
  • Tasks mirror real-world computer usage patterns

βœ… Subtask-Level Verifiability

  • Fine-grained evaluation at each intermediate subtask, not just final outcomes
  • Verifiable goals for each subtask while supporting diverse exploration strategies
  • Open-ended interaction within subtasks: agents can choose different paths to achieve the same goal
  • Detailed supervision signals for better error diagnosis and agent improvement

🌐 Multi-Environment Coverage

  • Web environments: Various websites, online services, and web applications
  • Desktop environments: Office software, operating systems, and professional tools (TODO)
  • Cross-platform task transitions and interactions

πŸ§‘β€πŸŽ¨ Human-Expert Annotation

  • All trajectories carefully created and annotated by human experts
  • High-quality task instructions and subtask-level annotations
  • Verified task feasibility and realistic workflow patterns

An overview of the VeriGUI dataset.

πŸš€ Installation

# For evaluation only
pip install openai tqdm

# For running agents
pip install openai tqdm "camel-ai[all]" browser-use

πŸ€– Running Agents

We provide example agents in the agents directory. You can run an agent with the following command:

python agents/some_agent.py
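For instance, a minimal web agent built on browser-use, similar in spirit to agents/browseruse.py, might look like the sketch below. The task string and model name are placeholders, and the exact browser-use API may vary across versions.

```python
# Minimal browser-use agent sketch (placeholder task/model; API may vary by version).
import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI  # assumes a LangChain chat model is accepted

async def main() -> None:
    agent = Agent(
        task="Find the number of Web task trajectories in the first VeriGUI batch.",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    history = await agent.run()    # run the browsing loop until the agent finishes
    print(history.final_result())  # the agent's final answer, if any

asyncio.run(main())
```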

πŸ“Š Evaluation

The VeriGUI dataset is located at data/veriGUI.json. Its format is described below:

[
  {
    "id": "1",              // index id
    "name": "V1_3",         // name of the task
    "type": "global",       // type of the task: "global" or "causal"
    "instruction": "xxxxx", // instruction for the task
    "answer": "xxxxx"       // expected answer for the task, in JSON format
  },
  ...
]
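For reference, a minimal Python snippet for loading the dataset and inspecting a task; the data/veriGUI.json path follows the project structure shown further below.

```python
# Load the VeriGUI tasks and inspect the first entry.
import json

with open("data/veriGUI.json", encoding="utf-8") as f:
    tasks = json.load(f)

task = tasks[0]
print(task["id"], task["name"], task["type"])
print(task["instruction"])
```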

The evaluation script evaluate.py scores agent predictions using LLM-as-a-judge. It expects a JSON file in the following format:

[
  {
    "id": "1",              // index id
    "name": "V1_3",         // name of the task
    "type": "global",       // type of the task: "global" or "causal"
    "instruction": "xxxxx", // instruction for the task
    "answer": "xxxxx",      // expected answer for the task, in JSON format
    "prediction": "xxxxx",  // agent's predicted result
    "nsteps": 10            // number of steps taken by the agent
  },
  ...
]
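One way to produce this file is to copy each dataset record and attach your agent's output. Here is a minimal sketch, where run_agent is a hypothetical stand-in for your actual agent call:

```python
# Build the prediction file expected by evaluate.py.
import json

def run_agent(instruction: str) -> tuple[str, int]:
    # Hypothetical stand-in: replace with a real agent call
    # (e.g. one of the agents under agents/).
    return "no answer", 0

with open("data/veriGUI.json", encoding="utf-8") as f:
    tasks = json.load(f)

records = []
for task in tasks:
    prediction, nsteps = run_agent(task["instruction"])
    records.append({**task, "prediction": prediction, "nsteps": nsteps})

with open("veriGUI_prediction.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```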

With this file, you can run the evaluation script to get the performance of the agent:

python evaluate.py --input_file veriGUI_prediction.json --output_file output.json
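Under the hood, LLM-as-a-judge means asking a strong model to compare each prediction against the expected answer. The snippet below is a simplified illustration of that idea, not the actual evaluate.py implementation; the judge prompt and model name are placeholders.

```python
# Simplified LLM-as-a-judge sketch; evaluate.py's real prompt and scoring differ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(instruction: str, answer: str, prediction: str) -> int:
    """Ask a judge model whether the prediction matches the expected answer."""
    prompt = (
        f"Task instruction:\n{instruction}\n\n"
        f"Expected answer:\n{answer}\n\n"
        f"Agent prediction:\n{prediction}\n\n"
        "Reply with only 1 if the prediction matches the expected answer, else 0."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return int(resp.choices[0].message.content.strip())
```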

Then, you can use calc_avg.py to calculate the average score of the evaluation results:

python calc_avg.py --input_file output.json
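As a rough sketch of what that averaging step might look like, assuming each record in output.json carries a numeric score field (the actual calc_avg.py schema may differ):

```python
# Hypothetical averaging sketch; assumes a "score" field per record.
import json

with open("output.json", encoding="utf-8") as f:
    records = json.load(f)

scores = [r["score"] for r in records]
print(f"average score: {sum(scores) / len(scores):.4f} over {len(scores)} tasks")
```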

πŸ—‚οΈ Project Structure

The project is organized as follows:

VeriGUI/
β”œβ”€β”€ agents/                 # Agent implementations
β”‚   └── browseruse.py       # Browser-use agent example
β”œβ”€β”€ data/                   # Dataset files
β”‚   └── veriGUI.json        # Main dataset
β”œβ”€β”€ evaluated/              # Evaluation results
β”œβ”€β”€ predictions/            # Model predictions
β”œβ”€β”€ evaluate.py             # Evaluation script
β”œβ”€β”€ batch_evaluate.py       # Batch evaluation
β”œβ”€β”€ calc_avg.py             # Calculate averages
└── utils.py                # Utility functions

πŸ’» Visualization Tool

Usage

  • Open VeriGUI.2077ai.org
  • Select the corresponding task data folder
  • View the visualization results

Features

  • Interactive event timeline visualization
  • Support for various event types (MOUSE_DRAG, MOUSE_UP, TAB_CHANGE, etc.)
  • Video playback synchronization
  • Jump to specific actions functionality

πŸŽ“ Citation

If you find VeriGUI useful in your research, please cite our paper:

@article{verigui2025,
  title={VeriGUI: Verifiable Long-Chain GUI Dataset},
  author={Shunyu Liu and Minghao Liu and Huichi Zhou and Zhenyu Cui and Yang Zhou and Yuhao Zhou and Wendong Fan and Ge Zhang and Jiajun Shi and Weihao Xuan and Jiaxing Huang and Shuang Luo and Fang Wu and Heli Qi and Qingcheng Zeng and Ziqi Ren and Jialiang Gao and Jindi Lv and Junjie Wang and Aosong Feng and Heng Zhou and Wangchunshu Zhou and Zhenfei Yin and Wenlong Zhang and Guohao Li and Wenhao Yu and Irene Li and Lei Ma and Lei Bai and Qunshu Lin and Mingli Song and Dacheng Tao},
  journal={arXiv preprint arXiv:2508.04026},
  year={2025}
}

πŸ“ž Contact

For questions, suggestions, or collaborations, please feel free to open an issue on GitHub.

πŸ‘₯ Contributors

We thank all contributors who have helped make VeriGUI possible. Special thanks to the research team and community members who provided valuable feedback and improvements.

πŸ“„ License

This project is licensed under the Apache 2.0 License.


🌟 Star us on GitHub if you find VeriGUI helpful! 🌟
