OS AI Computer Use

Demo video: 2025-10-19.22.26.02.mov

For End Users

Want to use OS AI without coding? Download the latest release for your platform:

Download Latest Release

Available for:

  • macOS (Intel + Apple Silicon)
  • Windows (x64)
  • Linux (x64)
  • Web

New to OS AI? Read the User Guide for installation and setup instructions.

Key Features:

  • 🤖 Interact with Claude AI for computer automation
  • 🔒 Secure API key storage in system keychain
  • 💬 Chat-based interface with visual feedback
  • 📊 Real-time cost tracking
  • 🎨 Cross-platform Flutter UI

For Developers


Local agent for desktop automation. It currently integrates Anthropic Computer Use (Claude) but is architected to be provider‑agnostic: the LLM layer is abstracted behind LLMClient, so OpenAI Computer Use (and others) can be added with minimal changes.
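
As a rough illustration of that abstraction (the class and method names below are a hypothetical sketch, not the repository's actual LLMClient API):

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One tool invocation requested by the model (e.g. a computer-use action)."""
    name: str
    arguments: dict


class LLMClient(ABC):
    """Provider-agnostic LLM interface; concrete clients wrap Anthropic, OpenAI, etc."""

    @abstractmethod
    def send(self, messages: list[dict], tools: list[dict]) -> list[ToolCall]:
        """Send the conversation plus tool schemas; return the tool calls the model requested."""


class AnthropicClient(LLMClient):
    def send(self, messages, tools):
        # Call the Anthropic Messages API here and translate tool_use blocks
        # into provider-neutral ToolCall objects.
        ...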

What this project is:

  • A provider‑agnostic Computer Use agent with a stable tool interface
  • An OS‑agnostic execution layer using ports/drivers (macOS and Windows today)
  • A CLI you can bundle into a single executable for local use

What it is not (yet):

  • A remote SaaS; this is a local agent
  • A finished set of drivers for every OS/desktop (Linux Wayland has limits for synthetic input)

Highlights:

  • Smooth mouse movement, clicks, drag‑and‑drop with easing and timing controls
  • Reliable keyboard input (robust Enter on macOS), hotkeys and hold sequences
  • Screenshots (Quartz on macOS or PyAutoGUI fallback), on‑disk saving and base64 tool_result
  • Detailed logs and running cost estimation per iteration and total
  • Multiple chats
  • Image upload
  • Voice input
  • AI API agnostic

See provider architecture in docs/architecture-universal-llm.md, OS ports/drivers in docs/os-architecture.md, and packaging notes in docs/ci-packaging.md.

Installation & Setup

Requirements:

  • macOS 13+ or Windows 10/11
  • Python 3.12+
  • Anthropic API key: ANTHROPIC_API_KEY (for now; OpenAI planned)

Install:

# (optional) create and activate venv
python -m venv .venv && source .venv/bin/activate

# install dependencies
make install

# (optional) install local packages in editable mode (mono-repo dev)
make dev-install

macOS permissions (for GUI automation):

make macos-perms  # opens System Settings → Privacy & Security panels

Grant permissions to Terminal/iTerm and your venv Python under: Accessibility, Input Monitoring, Screen Recording.


Quick start

Complete the Installation & Setup steps above: install dependencies and, on macOS, grant the GUI automation permissions.

Run the agent (CLI):

export ANTHROPIC_API_KEY=sk-ant-...
python main.py --provider anthropic --debug --task "Open Safari, search for 'macOS automation', scroll, make a screenshot"

CLI Examples

# 1) Open Chrome, search in Google, take a screenshot
python main.py --provider anthropic --task "Open Chrome, focus the address bar, type google.com, search for 'computer use AI', open first result, scroll down and take a screenshot"

# 2) Copy/paste workflow in a text editor
python main.py --provider anthropic --task "Open TextEdit, create a new document, type 'Hello world!', select all and copy, create another document and paste"

# 3) Window management + hotkeys
python main.py --provider anthropic --task "Open System Settings, search for 'Privacy', navigate to Privacy & Security, disable GEO"

# 4) Precise drag operations
python main.py --provider anthropic --task "In Finder, open Downloads, switch to icon view, drag the first file to Desktop"

Useful make targets:

make install                     # install top-level dependencies
make test                        # unit tests
RUN_CURSOR_TESTS=1 make itest    # GUI integration tests (macOS; requires permissions)
make itest-local-keyboard        # run keyboard harness
make itest-local-click           # run click/drag harness

Development Mode

For development with backend + frontend (Flutter UI):

1. Install dependencies

# (optional) create and activate venv
python -m venv .venv && source .venv/bin/activate

# install Python dependencies
make install

# install local packages in editable mode for mono-repo dev
make dev-install

2. Start the backend

# Set your API key
export ANTHROPIC_API_KEY=sk-ant-...

# (optional) enable debug mode
export OS_AI_BACKEND_DEBUG=1

# Start backend on 127.0.0.1:8765
os-ai-backend

# Or run directly via Python module
# python -m os_ai_backend.app

Backend environment variables (optional):

  • OS_AI_BACKEND_HOST - host address (default: 127.0.0.1)
  • OS_AI_BACKEND_PORT - port number (default: 8765)
  • OS_AI_BACKEND_DEBUG - enable debug logging (default: 0)
  • OS_AI_BACKEND_TOKEN - authentication token (optional)
  • OS_AI_BACKEND_CORS_ORIGINS - allowed CORS origins (default: http://localhost,http://127.0.0.1)
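
A minimal sketch of reading these variables with the defaults listed above (the helper function itself is illustrative, not the backend's actual code):

import os

def load_backend_config() -> dict:
    """Collect backend settings from the environment, falling back to the documented defaults."""
    return {
        "host": os.environ.get("OS_AI_BACKEND_HOST", "127.0.0.1"),
        "port": int(os.environ.get("OS_AI_BACKEND_PORT", "8765")),
        "debug": os.environ.get("OS_AI_BACKEND_DEBUG", "0") == "1",
        "token": os.environ.get("OS_AI_BACKEND_TOKEN"),  # optional auth token
        "cors_origins": os.environ.get(
            "OS_AI_BACKEND_CORS_ORIGINS", "http://localhost,http://127.0.0.1"
        ).split(","),
    }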

Backend endpoints:

  • GET /healthz - health check
  • WS /ws - WebSocket for JSON-RPC commands
  • POST /v1/files - file upload
  • GET /v1/files/{file_id} - file download
  • GET /metrics - metrics snapshot
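
For a quick smoke test against a running backend, something like the following can work (a sketch; the JSON-RPC method name "ping" is hypothetical, so check the backend source for the real command names):

import asyncio
import json

import requests
import websockets  # pip install websockets

BASE = "http://127.0.0.1:8765"

def check_health() -> bool:
    """Hit the health endpoint and report whether the backend responds."""
    return requests.get(f"{BASE}/healthz", timeout=5).ok

async def send_rpc() -> None:
    """Open the WebSocket endpoint and send one JSON-RPC request."""
    async with websockets.connect("ws://127.0.0.1:8765/ws") as ws:
        await ws.send(json.dumps({
            "jsonrpc": "2.0",
            "id": 1,
            "method": "ping",   # hypothetical method name
            "params": {},
        }))
        print(await ws.recv())

if __name__ == "__main__":
    print("healthy:", check_health())
    asyncio.run(send_rpc())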

3. Start the frontend (in a new terminal)

cd frontend_flutter

# Install Flutter dependencies
flutter pub get

# Run on macOS
flutter run -d macos

# Or run on other platforms
# flutter run -d chrome   # web
# flutter run -d windows  # Windows

Frontend config (in code):

  • Default backend WebSocket: ws://127.0.0.1:8765/ws
  • Default REST base: http://127.0.0.1:8765

See frontend_flutter/README.md for more details on the Flutter app architecture and features.


Features

  • Smooth mouse motion: easing, distance‑based durations
  • Clicks with modifiers: pass modifiers: "cmd+shift" to click/down/up actions
  • Drag control: hold_before_ms, hold_after_ms, steps, step_delay
  • Keyboard input: key, hold_key; robust Enter on macOS via Quartz
  • Screenshots: Quartz (macOS) or PyAutoGUI fallback; optional downscale for model display
  • Logging and cost: per‑iteration and total usage/cost with 429 retry logic
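
To illustrate the distance-based easing idea described above (a simplified sketch built on PyAutoGUI, which the drivers use as a fallback; the speed parameter is hypothetical, not a project setting):

import math
import pyautogui

def smooth_move(x: int, y: int, speed_px_per_s: float = 1500.0) -> None:
    """Move the cursor to (x, y) with a duration proportional to the travel distance."""
    cx, cy = pyautogui.position()
    distance = math.hypot(x - cx, y - cy)
    duration = max(0.1, distance / speed_px_per_s)   # never snap instantly
    pyautogui.moveTo(x, y, duration=duration, tween=pyautogui.easeInOutQuad)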

Supported Platforms

  • OS‑agnostic execution: core depends only on OS ports; drivers are loaded per OS (see docs/os-architecture.md).
  • macOS (supported):
    • Full driver set with overlay (AppKit), robust Enter (Quartz), screenshots (Quartz/PyAutoGUI), sounds (NSSound).
    • Integration tests available; requires Accessibility, Input Monitoring, Screen Recording.
    • Single‑file CLI bundle via make build-macos-bundle.
  • Windows (implemented, not yet integration‑tested):
    • Drivers for mouse/keyboard/screen via PyAutoGUI; overlay/sound are no‑ops baseline.
    • Unit contract tests exist; for GUI tests use a self‑hosted Windows runner (see docs/windows-integration-testing.md).
    • Single‑file CLI bundle via make build-windows-bundle (build on Windows).
  • Linux: not provided out‑of‑the‑box. X11 can support synthetic input (XTest), while Wayland often restricts it. Contributions welcome.

Configuration (config/settings.py)

Key options (partial list):

  • Coordinates/calibration
    • COORD_X_SCALE, COORD_Y_SCALE, COORD_X_OFFSET, COORD_Y_OFFSET
    • Post‑move correction: POST_MOVE_VERIFY, POST_MOVE_TOLERANCE_PX, POST_MOVE_CORRECTION_DURATION
  • Screenshots
    • SCREENSHOT_MODE (native|downscale)
    • VIRTUAL_DISPLAY_ENABLED, VIRTUAL_DISPLAY_WIDTH_PX, VIRTUAL_DISPLAY_HEIGHT_PX
    • SCREENSHOT_FORMAT (PNG|JPEG), SCREENSHOT_JPEG_QUALITY
  • Overlay
    • PREMOVE_HIGHLIGHT_ENABLED, PREMOVE_HIGHLIGHT_DEFAULT_DURATION, PREMOVE_HIGHLIGHT_RADIUS, colors
  • Model/tool
    • MODEL_NAME, COMPUTER_TOOL_TYPE, COMPUTER_BETA_FLAG, MAX_TOKENS
    • ALLOW_PARALLEL_TOOL_USE

See config/settings.py for the full list and inline comments.
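
As a sketch of how the calibration scale/offset options above might combine when mapping model coordinates to screen pixels (illustrative only; the function below is hypothetical, not the repository's implementation):

def model_to_screen(x: float, y: float,
                    x_scale: float = 1.0, y_scale: float = 1.0,
                    x_offset: float = 0.0, y_offset: float = 0.0) -> tuple[int, int]:
    """Map a coordinate from the model's (possibly downscaled) view to physical screen pixels."""
    return round(x * x_scale + x_offset), round(y * y_scale + y_offset)

# Example: with a 2x downscaled screenshot, scale factors of 2.0 map
# model-space (640, 360) back to screen-space (1280, 720).
assert model_to_screen(640, 360, x_scale=2.0, y_scale=2.0) == (1280, 720)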


Tool input (API)

The agent expects blocks with action and parameters:

  • Mouse movement
{"action":"mouse_move","coordinate":[x,y],"coordinate_space":"auto|screen|model","duration":0.35,"tween":"linear"}
  • Clicks
{"action":"left_click","coordinate":[x,y],"modifiers":"cmd+shift"}
  • Key press / hold
{"action":"key","key":"cmd+l"}
{"action":"hold_key","key":"ctrl+shift+t"}
  • Drag‑and‑drop
{
  "action":"left_click_drag",
  "start":[x1,y1],
  "end":[x2,y2],
  "modifiers":"shift",
  "hold_before_ms":80,
  "hold_after_ms":80,
  "steps":4,
  "step_delay":0.02
}
  • Scroll
{"action":"scroll","coordinate":[x,y],"scroll_direction":"down|up|left|right","scroll_amount":3}
  • Typing
{"action":"type","text":"Hello, world!"}
  • Screenshot
{"action":"screenshot"}

Responses are returned as a list of tool_result content blocks (text/image). Screenshots are base64‑encoded.
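
A condensed, illustrative dispatcher for such blocks (a hypothetical structure built directly on PyAutoGUI rather than the project's OS drivers; the image tool_result follows Anthropic's base64 content-block convention):

import base64
import io

import pyautogui

def handle_tool_input(block: dict) -> list[dict]:
    """Route one tool-input block to the OS layer and build tool_result content blocks."""
    action = block["action"]
    if action == "mouse_move":
        x, y = block["coordinate"]
        pyautogui.moveTo(x, y, duration=block.get("duration", 0.35))
        return [{"type": "text", "text": f"moved to ({x}, {y})"}]
    if action == "screenshot":
        buf = io.BytesIO()
        pyautogui.screenshot().save(buf, format="PNG")
        return [{"type": "image", "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": base64.b64encode(buf.getvalue()).decode("ascii"),
        }}]
    raise ValueError(f"unsupported action: {action}")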


Tests

Unit tests (no real GUI):

make test

Integration (real OS tests, macOS; Windows via self‑hosted runner):

export RUN_CURSOR_TESTS=1
make itest

If macOS blocks automation, tests are skipped. Grant permissions with make macos-perms and retry.

Windows integration testing options are described in docs/windows-integration-testing.md.


Flutter integration

Recommended setup: Flutter as pure UI, local Python service:

  • Transport: WebSocket + JSON‑RPC for chat/commands, REST for files
  • Streams: screenshots (JPEG/PNG), logs, events
  • Example notes: docs/flutter.md

To run backend + frontend in development mode, see the Development Mode section above.

Note: project code and docs use English.


Contributing

  • Fork → feature branch → PR
  • Code style: readable, explicit names, avoid deep nesting
  • Tests: add unit tests and integration tests when applicable
  • Before PR:
make test
RUN_CURSOR_TESTS=1 make itest   # optional if GUI interactions changed
  • Commit messages: clear and atomic

Architecture, packaging and testing docs:

  • OS Ports & Drivers: docs/os-architecture.md
  • Packaging & CI: docs/ci-packaging.md
  • Windows integration testing: docs/windows-integration-testing.md
  • Code style: CODE_STYLE.md
  • Contributing: CONTRIBUTING.md

Packaging (single executable bundles):

  • macOS: make build-macos-bundle → dist/agent_core/agent_core
  • Windows: make build-windows-bundle → dist/agent_core/agent_core.exe (build on Windows)

License

Apache License 2.0. Preserve NOTICE when distributing.

  • See LICENSE and NOTICE at repository root.

Troubleshooting

  • Cursor/keyboard don’t work (macOS): grant permissions in System Settings → Privacy & Security (Accessibility, Input Monitoring, Screen Recording) for Terminal and current Python.
  • Integration tests skipped: restart terminal, ensure same interpreter (which python, python -c 'import sys; print(sys.executable)').
  • Screenshots empty/missing overlay: enable Screen Recording; check screenshot mode settings.

Contact

Issues/PR in this repository. Attribution is listed in NOTICE.