🌐 Website | 📑 Paper | 🤗 Dataset | 🤖 Model | 🔧 Tool | 🎮 Model Demo
- 2025-08-13: We released our paper and project page. Check it out!
OpenCUA is a comprehensive open-source framework for scaling CUA data and foundation models, consisting of:
- AgentNet: the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications and websites;
- AgentNetTool: an annotation infrastructure that seamlessly captures human computer-use demonstrations;
- AgentNetBench: an offline evaluator that benchmarks model-predicted low-level actions against ground-truth trajectories;
- OpenCUA Models: end-to-end computer-use foundation models that can produce executable actions in computer environments, with strong planning and grounding capabilities.
With the help of the OpenCUA framework, our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-32B achieves an average success rate of 34.8% on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models.
To align with our training infrastructure, we have modified the model in two places:
- 1. Multimodal Rotary Position Embedding (M-RoPE) has been replaced with 1D RoPE.
- 2. The model uses the same tokenizer and chat template as Kimi-VL.
- Do not use the default transformers or vLLM classes to load the model. If you train the model, make sure the tokenizer and chat template stay aligned.
First, install the required transformers dependencies:
```bash
conda create -n opencua python=3.10
conda activate opencua
pip install -r requirement.txt
```
Download the model weights from Hugging Face:
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="xlangai/OpenCUA-7B",
    local_dir="OpenCUA-7B",
    local_dir_use_symlinks=False,
)
```
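Because the checkpoint ships its own model and tokenizer code (see the notes on M-RoPE and the Kimi-VL tokenizer above), loading goes through `trust_remote_code=True` rather than the default classes. Below is a minimal loading sketch under that assumption; the exact Auto class the repo resolves to may differ, and `huggingface_inference.py` remains the reference path.

```python
# Minimal loading sketch. Assumption: the OpenCUA checkpoint registers custom
# model and tokenizer classes, so trust_remote_code=True is required on both;
# the concrete class that AutoModel resolves to depends on the repo's config.
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "OpenCUA-7B"  # the local_dir used in snapshot_download above

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # bf16 keeps single-GPU memory use reasonable
    device_map="auto",
    trust_remote_code=True,
)
```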
You can run the five grounding examples in OpenCUA/model/inference/huggingface_inference.py:
```bash
cd ./model/inference/
python huggingface_inference.py
```
OpenCUAAgent is developed in the OSWorld environment on top of the OpenCUA models. It iteratively perceives the environment via screenshots, produces reflective long CoT as its inner monologue, and predicts the next action to execute. By default, OpenCUAAgent uses 3 images in total and the L2 CoT format.
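Conceptually, the loop looks like the sketch below (illustration only, not the actual OSWorld agent code; `env`, `call_model`, `build_messages`, and `parse_action` are hypothetical stand-ins):

```python
# Schematic of the perceive -> reason -> act loop (illustration only; the actual
# agent lives in OSWorld's run_multienv_opencua.py). env, call_model,
# build_messages, and parse_action are hypothetical stand-ins.
MAX_IMAGES = 3   # the agent keeps only the 3 most recent screenshots in context
MAX_STEPS = 100  # matches the --max_steps setting used below

def run_episode(env, call_model, build_messages, parse_action, instruction):
    history = []  # (screenshot, thought, action) tuples from earlier steps
    for _ in range(MAX_STEPS):
        screenshot = env.get_screenshot()                      # perceive
        recent = history[-(MAX_IMAGES - 1):]                   # bounded image context
        response = call_model(build_messages(instruction, recent, screenshot))
        thought, action = parse_action(response)               # long CoT + next action
        if action == "terminate":                              # the agent decides to stop
            break
        env.execute(action)                                    # act, e.g. a pyautogui call
        history.append((screenshot, thought, action))
    return history
```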
Command for running OpenCUA-7B and OpenCUA-32B in OSWorld:
```bash
python run_multienv_opencua.py \
    --headless \
    --observation_type screenshot \
    --model OpenCUA-32B \
    --result_dir ./results \
    --test_all_meta_path evaluation_examples/test_all_no_gdrive.json \
    --max_steps 100 \
    --num_envs 30 \
    --coordinate_type qwen25
```
OpenCUA models achieve strong performance on OSWorld-Verified. OpenCUA-32B achieves the best results among all open-source models, with an average success rate of 34.8%, outperforming prior baselines by large margins and narrowing the gap to the proprietary Claude models.
| Model | 15 Steps | 50 Steps | 100 Steps |
|---|---|---|---|
| Proprietary | | | |
| OpenAI CUA | 26.0 | 31.3 | 31.4 |
| Seed 1.5-VL | 27.9 | — | 34.1 |
| Claude 3.7 Sonnet | 27.1 | 35.8 | 35.9 |
| Claude 4 Sonnet | 31.2 | 43.9 | 41.5 |
| Open-Source | | | |
| Qwen 2.5-VL-32B-Instruct | 3.0 | — | 3.9 |
| Qwen 2.5-VL-72B-Instruct | 4.4 | — | 5.0 |
| Kimi-VL-A3B | 9.7 | — | 10.3 |
| UI-TARS-72B-DPO | 24.0 | 25.8 | 27.1 |
| UI-TARS-1.5-7B | 24.5 | 27.3 | 27.4 |
| OpenCUA-7B (Ours) | 24.3 | 27.9 | 26.6 |
| OpenCUA-32B (Ours) | 29.7 | 34.1 | 34.8 |
OpenCUA scores are the mean of 3 independent runs.
| Model | OSWorld-G | ScreenSpot-V2 | ScreenSpot-Pro |
|---|---|---|---|
| Qwen2.5-VL-7B | 31.4 | 88.8 | 27.6 |
| Qwen2.5-VL-32B | 46.5 | 87.0 | 39.4 |
| UI-TARS-72B | 57.1 | 90.3 | 38.1 |
| OpenCUA-A3B | 48.6 | 91.4 | 28.5 |
| OpenCUA-7B | 45.7 | 88.5 | 23.7 |
| OpenCUA-2.5-7B | 55.3 | 92.3 | 50.0 |
| OpenCUA-2.5-32B | 59.6 | 93.4 | 55.3 |
| Model | Coordinate Actions | Content Actions | Function Actions | Average |
|---|---|---|---|---|
| Qwen2.5-VL-7B | 50.7 | 40.8 | 3.1 | 48.0 |
| Qwen2.5-VL-32B | 66.6 | 47.2 | 41.5 | 64.8 |
| Qwen2.5-VL-72B | 67.2 | 52.6 | 50.5 | 67.0 |
| OpenAI CUA | 71.7 | 57.3 | 80.0 | 73.1 |
| OpenCUA-2.5-7B | 79.0 | 62.0 | 44.3 | 75.2 |
| OpenCUA-2.5-32B | 81.9 | 66.1 | 55.7 | 79.1 |
AgentNet is the first large-scale desktop computer-use agent trajectory dataset, containing 22.6K human-annotated computer-use tasks across Windows, macOS, and Ubuntu systems.
👉 AgentNet Huggingface Dataset
Download the dataset here:
```bash
pip install -U huggingface_hub
huggingface-cli download xlangai/AgentNet --repo-type dataset --local-dir ./AgentNet
```
Use the following commands to unzip the files (for example, the Ubuntu data):
```bash
cd path_to_your_zip_files

# Merge all the zips
zip -s 0 images.zip --out images-full.zip

# Unzip
unzip images-full.zip -d path_to_your_target_dir
```
Collecting computer-use agent training data requires 3 steps:
- Record human computer-use demonstrations via AgentNetTool;
- Preprocess the demonstrations using Action Reduction & State-Action Matching;
- Synthesize reflective long CoT for each step.
Our AgentNetTool is a cross-platform GUI recorder that runs unobtrusively on annotators’ machines. It captures synchronized screen video, mouse/keyboard events, and accessibility trees, then provides an in-browser UI for reviewing, trimming, and submitting demonstrations. AgentNetTool is available on Windows, macOS, and Ubuntu.
Raw demonstrations can contain thousands of low-level events that are too dense for model training.
The DataProcessor module (`./data/data-process/`) performs two key steps:
- Action Reduction — merges granular signals into concise, semantically meaningful PyAutoGUI actions (e.g., collapsing mouse moves → click, coalescing scrolls, grouping key-press sequences into text or hotkeys).
- State–Action Matching — aligns every reduced action with the last visually distinct frame before the action begins, avoiding future-information leakage and yielding compact state–action pairs.
These processed trajectories underlie all downstream training and evaluation.
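As a toy illustration of the Action Reduction step (the event and action formats below are simplified stand-ins, not the actual DataProcessor schema):

```python
# Toy illustration of Action Reduction: a burst of raw key events is grouped into
# one "write" action, and a mouse move + press/release pair becomes one "click".
# The event dicts are simplified; the real DataProcessor uses its own schema.
raw_events = [
    {"type": "mouse_move", "x": 400, "y": 300},
    {"type": "mouse_down", "x": 512, "y": 384, "button": "left"},
    {"type": "mouse_up",   "x": 512, "y": 384, "button": "left"},
    {"type": "key_press", "key": "h"},
    {"type": "key_press", "key": "i"},
]

def reduce_actions(events):
    actions, text = [], ""
    for e in events:
        if e["type"] == "key_press" and len(e["key"]) == 1:
            text += e["key"]                       # coalesce typing into one write
            continue
        if text:
            actions.append(f"pyautogui.write({text!r})")
            text = ""
        if e["type"] == "mouse_up":                # move + down + up -> single click
            actions.append(f"pyautogui.click(x={e['x']}, y={e['y']})")
    if text:
        actions.append(f"pyautogui.write({text!r})")
    return actions

print(reduce_actions(raw_events))
# ['pyautogui.click(x=512, y=384)', "pyautogui.write('hi')"]
```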
To boost robustness and interpretability, we augment each trajectory with reflective long Chain-of-Thought (CoT) reasoning.
The CoTGenerator pipeline (`./data/cot-generator/`) synthesizes step-level reflections that:
- reflect on the previous action,
- explain why an action is chosen given the current observation and history,
- note potential alternative actions, and
- forecast the expected next state.
Empirically, models trained with these rich CoTs scale better with data and generalize across unseen applications.
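For intuition, a single CoT-augmented step might look roughly like the following (the field names are hypothetical; see `./data/cot-generator/` for the actual record format):

```python
# Hypothetical example of one CoT-augmented step; field names are illustrative
# only -- see ./data/cot-generator/ for the actual record format.
step = {
    "observation": "screenshots/step_012.png",
    "reflection": "The previous click opened the Format menu as intended.",
    "thought": (
        "The task asks to bold the selected text. The Format menu is open and "
        "the 'Bold' entry is visible, so clicking it is the most direct action; "
        "pressing Ctrl+B would also work as an alternative."
    ),
    "expected_state": "The selected text becomes bold and the menu closes.",
    "action": "pyautogui.click(x=642, y=318)",
}
```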
AgentNetBench (`./AgentNetBench/`) provides a realistic offline evaluator for OS agent trajectories. It compares model-predicted low-level actions (`click`, `moveTo`, `write`, `press`, `scroll`, `terminate`, etc.) against ground-truth human actions and reports detailed metrics.
👉 See AgentNetBench/README.md for usage instructions.
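For intuition, an offline evaluator of this kind boils down to comparisons like the sketch below (the tolerance and matching rules here are illustrative, not AgentNetBench's exact scoring logic):

```python
# Simplified sketch of offline action matching: a predicted action "matches" the
# ground-truth one if the action type agrees and, for coordinate actions, the
# predicted point lies within a pixel tolerance. Illustrative only -- the real
# AgentNetBench scoring rules and tolerances may differ.
import math

def actions_match(pred, gold, coord_tol=50):
    if pred["type"] != gold["type"]:
        return False
    if "x" in gold:  # coordinate action such as click or moveTo
        dist = math.hypot(pred["x"] - gold["x"], pred["y"] - gold["y"])
        return dist <= coord_tol
    if "text" in gold:  # content action such as write or press
        return pred.get("text", "").strip() == gold["text"].strip()
    return True  # function actions (e.g. terminate) match on type alone

pred = {"type": "click", "x": 505, "y": 390}
gold = {"type": "click", "x": 512, "y": 384}
print(actions_match(pred, gold))  # True: within the 50-pixel tolerance
```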
- vLLM Support
  - Actively working with the vLLM team to add support for OpenCUA models.
  - Workaround: for now, use the standard `transformers` library as shown in the examples above.
  - Will update this section once vLLM support becomes available.
- Training Code
  - OpenCUA models are developed based on the training infrastructure of the Kimi Team.
  - Currently developing the training pipeline based on open-source infrastructure.
We thank Yu Su, Caiming Xiong, and the anonymous reviewers for their insightful discussions and valuable feedback. We are grateful to Moonshot AI for providing training infrastructure and annotated data. We also sincerely appreciate Hao Yang, Zhengtao Wang, and Yanxu Chen from the Kimi Team for their strong infrastructure support and helpful guidance. The development of our tool is based on the open-source projects DuckTrack and OpenAdapt; we are very grateful for their commitment to the open-source community. Finally, we extend our deepest thanks to all annotators for their tremendous effort and contributions to this project.
OpenCUA is intended for research and educational purposes only.
- The model, dataset, tool, and code may not be used for any purpose or activity that violates applicable laws or regulations in any jurisdiction
- Use for illegal, unethical, or harmful activities is strictly prohibited
- The authors, contributors, and copyright holders are not responsible for any illegal, unethical, or harmful use of the Software, nor for any direct or indirect damages resulting from such use
- Use of the "OpenCUA" name, logo, or trademarks does not imply any endorsement or affiliation unless separate written permission is obtained
- Users are solely responsible for ensuring their use complies with applicable laws and regulations
If you use OpenCUA in your research, please cite our work:
```bibtex
@misc{wang2025opencuaopenfoundationscomputeruse,
      title={OpenCUA: Open Foundations for Computer-Use Agents},
      author={Xinyuan Wang and Bowen Wang and Dunjie Lu and Junlin Yang and Tianbao Xie and Junli Wang and Jiaqi Deng and Xiaole Guo and Yiheng Xu and Chen Henry Wu and Zhennan Shen and Zhuokai Li and Ryan Li and Xiaochuan Li and Junda Chen and Boyuan Zheng and Peihang Li and Fangyu Lei and Ruisheng Cao and Yeqiao Fu and Dongchan Shin and Martin Shin and Jiarui Hu and Yuyan Wang and Jixuan Chen and Yuxiao Ye and Danyang Zhang and Dikang Du and Hao Hu and Huarong Chen and Zaida Zhou and Haotian Yao and Ziwei Chen and Qizheng Gu and Yipu Wang and Heng Wang and Diyi Yang and Victor Zhong and Flood Sung and Y. Charles and Zhilin Yang and Tao Yu},
      year={2025},
      eprint={2508.09123},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.09123},
}
```