Note
This work is still in progress and additional data will be included in a future version.
- Updates
- Overview
- Key Features
- Installation
- Running Agents
- Evaluation
- Project Structure
- Visualize Tool
- Citation
- Contact
- Contributors
- License
[Jul 21, 2025]
We have released the first batch of 130 web task trajectories!
Recent studies have delved into constructing autonomous agents capable of performing complex Graphical User Interface (GUI)-based computer tasks, with the potential to revolutionize human-computer interaction. Despite encouraging results, existing efforts mainly focus on short-term interactions and rely on outcome-only verification, thereby limiting their scalability in real-world GUI applications that demand long-horizon task decomposition and execution.
In this work, we introduce VeriGUI, a novel verifiable long-chain GUI dataset designed to facilitate the development and evaluation of generalist GUI agents operating in realistic computer environments. Our dataset emphasizes two critical dimensions:
- (1) Long-chain complexity, with tasks decomposed into a sequence of interdependent subtasks spanning hundreds of steps, explicitly designed to allow any subtask to serve as a valid starting point;
- (2) Subtask-level verifiability, which enables diverse exploration strategies within each subtask while ensuring that each subtask-level goal remains verifiable and consistent.
The dataset consists of GUI task trajectories spanning both desktop and web, annotated by human experts. Extensive experiments on VeriGUI using various agents with different foundation models reveal significant performance gaps in handling long-horizon tasks, highlighting the need for more robust planning and decision-making capabilities in GUI agents.
- Tasks require 2-15 interdependent subtasks with hundreds of GUI actions
- Complex workflows spanning multiple applications and web pages
- Realistic task dependencies that require adaptive reasoning and planning
- Tasks mirror real-world computer usage patterns
- Fine-grained evaluation at each intermediate subtask, not just final outcomes
- Verifiable goals for each subtask while supporting diverse exploration strategies
- Open-ended interaction within subtasks - agents can choose different paths to achieve the same goal
- Detailed supervision signals for better error diagnosis and agent improvement
- Web environments: Various websites, online services, and web applications
- Desktop environments: Office software, operating systems, and professional tools (TODO)
- Cross-platform task transitions and interactions
- All trajectories carefully created and annotated by human experts
- High-quality task instructions and subtask-level annotations
- Verified task feasibility and realistic workflow patterns
# For evaluation only
pip install openai tqdm
# For running agents
pip install openai tqdm camel-ai[all] browser-use
We provide some example agents under the agents directory. You can run these agents by executing the following command:
python agents/some_agent.py
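For orientation, below is a minimal sketch of what such an agent script could look like. It is not one of the provided agents: the file name my_agent.py, the run_agent helper, the gpt-4o model choice, and the raw_outputs.json output file are all illustrative assumptions. It simply sends each task instruction to an OpenAI chat model and stores the answers, whereas the real agents in agents/ (e.g. browseruse.py) interact with actual GUI environments.

```python
# my_agent.py -- illustrative sketch only; the provided agents in agents/
# (e.g. browseruse.py) use their own frameworks and entry points.
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def run_agent(instruction: str) -> str:
    """Ask a chat model to answer a single VeriGUI task instruction."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": instruction}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    with open("data/veriGUI.json", "r", encoding="utf-8") as f:
        tasks = json.load(f)

    # Collect one output per task. A single LLM call is counted as one step
    # here; a real GUI agent would record its actual number of actions.
    outputs = {
        task["id"]: {"prediction": run_agent(task["instruction"]), "nsteps": 1}
        for task in tasks
    }

    with open("raw_outputs.json", "w", encoding="utf-8") as f:
        json.dump(outputs, f, ensure_ascii=False, indent=2)
```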
The VeriGUI dataset is located at veriGUI.json. The format of the dataset is described in detail below.
[
    {
        "id": "1",              // index id
        "name": "V1_3",         // name of the task
        "type": "global",       // type of the task, global or causal
        "instruction": "xxxxx", // instruction for the task
        "answer": "xxxxx"       // expected answer for the task, in JSON format
    },
    ......
]
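For a quick sanity check, the dataset can be loaded with the standard json module. The snippet below is a small sketch: the data/veriGUI.json path follows the project structure shown later, and parsing the answer field with json.loads assumes it is stored as a JSON-encoded string, as the comment above indicates.

```python
import json

# Path follows the project structure below; adjust if your copy lives elsewhere.
with open("data/veriGUI.json", "r", encoding="utf-8") as f:
    tasks = json.load(f)

# Tasks come in two types, "global" and "causal".
global_tasks = [t for t in tasks if t["type"] == "global"]
causal_tasks = [t for t in tasks if t["type"] == "causal"]
print(f"{len(tasks)} tasks: {len(global_tasks)} global, {len(causal_tasks)} causal")

first = tasks[0]
print(first["id"], first["name"], first["instruction"][:80])
expected = json.loads(first["answer"])  # assumes the answer is a JSON-encoded string
print(expected)
```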
The evaluation script evaluate.py can be used to evaluate the performance of agents using LLM-as-a-judge. It expects a JSON file in the following format:
[
    {
        "id": "1",              // index id
        "name": "V1_3",         // name of the task
        "type": "global",       // type of the task, global or causal
        "instruction": "xxxxx", // instruction for the task
        "answer": "xxxxx",      // expected answer for the task, in JSON format
        "prediction": "xxxxx",  // agent's predicted result
        "nsteps": 10            // number of steps taken by the agent
    },
    ......
]
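One way to assemble this file, assuming your agent has saved a mapping from task id to its prediction and step count (the raw_outputs.json file and its layout are illustrative assumptions matching the agent sketch above), is to copy each dataset entry and attach the prediction and nsteps fields:

```python
import json

with open("data/veriGUI.json", "r", encoding="utf-8") as f:
    tasks = json.load(f)

# Hypothetical agent output: {task_id: {"prediction": str, "nsteps": int}}
with open("raw_outputs.json", "r", encoding="utf-8") as f:
    outputs = json.load(f)

records = []
for task in tasks:
    out = outputs.get(task["id"])
    if out is None:
        continue  # skip tasks the agent did not attempt
    records.append({**task, "prediction": out["prediction"], "nsteps": out["nsteps"]})

with open("veriGUI_prediction.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```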
With this file, you can run the evaluation script to get the performance of the agent:
python evaluate.py --input_file veriGUI_prediction.json --output_file output.json
Then, you can use calc_avg.py to calculate the average score of the evaluation results:
python calc_avg.py --input_file output.json
The directory structure of the project is defined as follows:
agent-workflow-devkit/
├── agents/              # Agent implementations
│   └── browseruse.py    # Browser-use agent example
├── data/                # Dataset files
│   └── veriGUI.json     # Main dataset
├── evaluated/           # Evaluation results
├── predictions/         # Model predictions
├── evaluate.py          # Evaluation script
├── batch_evaluate.py    # Batch evaluation
├── calc_avg.py          # Calculate averages
└── utils.py             # Utility functions
- Open VeriGUI.2077ai.org
- Select the corresponding task data folder
- View the visualization results
- Interactive event timeline visualization
- Support for various event types (MOUSE_DRAG, MOUSE_UP, TAB_CHANGE, etc.)
- Video playback synchronization
- Jump to specific actions functionality
If you find VeriGUI useful in your research, please cite our paper:
@article{verigui2025,
  title={VeriGUI: Verifiable Long-Chain GUI Dataset},
  author={Shunyu Liu and Minghao Liu and Huichi Zhou and Zhenyu Cui and Yang Zhou and Yuhao Zhou and Wendong Fan and Ge Zhang and Jiajun Shi and Weihao Xuan and Jiaxing Huang and Shuang Luo and Fang Wu and Heli Qi and Qingcheng Zeng and Ziqi Ren and Jialiang Gao and Jindi Lv and Junjie Wang and Aosong Feng and Heng Zhou and Wangchunshu Zhou and Zhenfei Yin and Wenlong Zhang and Guohao Li and Wenhao Yu and Irene Li and Lei Ma and Lei Bai and Qunshu Lin and Mingli Song and Dacheng Tao},
  journal={arXiv preprint arXiv:2508.04026},
  year={2025}
}
For questions, suggestions, or collaborations, please feel free to:
- Issues: GitHub Issues
We thank all contributors who have helped make VeriGUI possible. Special thanks to the research team and community members who provided valuable feedback and improvements.
This project is licensed under the Apache 2.0 License.
Star us on GitHub if you find VeriGUI helpful!