Minimal AI agent that controls Android devices using OpenAI’s computer-use-preview model.
- Connects to a running Android emulator.
- Captures full-screen device screenshots.
- Scales down the screenshots for OpenAI model compatibility.
- Sends screenshots and user instructions to OpenAI’s computer-use-preview model.
- Receives structured actions (click, scroll, type, keypress, wait, drag).
- Rescales model outputs back to real device coordinates.
- Executes the actions on the device.
- Repeats until you type `exit`.
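The capture, scale and rescale steps above can be sketched in Node.js as follows. This is a minimal illustration, not the repository's code. The names `captureScreenshot`, `readPngSize`, `modelSize` and `toDevice` are hypothetical, and so is the 1024-pixel `MAX_MODEL_SIDE` limit. The sketch only computes the downscaled size. Actually resizing the PNG would need an image library, which is not shown.

```javascript
const { execFile } = require("node:child_process");
const { promisify } = require("node:util");

const run = promisify(execFile);

// Hypothetical limit: longest side (in pixels) of the image sent to the model.
const MAX_MODEL_SIDE = 1024;

// Capture a full-resolution PNG straight from the device, without a temp file.
async function captureScreenshot(serial) {
  const { stdout } = await run("adb", ["-s", serial, "exec-out", "screencap", "-p"], {
    encoding: "buffer",
    maxBuffer: 64 * 1024 * 1024, // full-resolution PNGs can be several MB
  });
  return stdout;
}

// Read width/height from the PNG IHDR chunk (big-endian, bytes 16-23).
function readPngSize(png) {
  const signature = "89504e470d0a1a0a";
  if (png.subarray(0, 8).toString("hex") !== signature || png.toString("ascii", 12, 16) !== "IHDR") {
    throw new Error("Not a PNG screenshot");
  }
  return { width: png.readUInt32BE(16), height: png.readUInt32BE(20) };
}

// Size of the downscaled image the model sees; never upscale small screens.
function modelSize(device, maxSide = MAX_MODEL_SIDE) {
  const scale = Math.min(1, maxSide / Math.max(device.width, device.height));
  return { width: Math.round(device.width * scale), height: Math.round(device.height * scale) };
}

// Map a point in model (downscaled) pixels back to device pixels, kept on-screen.
function toDevice(point, model, device) {
  const clamp = (v, max) => Math.min(Math.max(v, 0), max - 1);
  return {
    x: clamp(Math.round((point.x * device.width) / model.width), device.width),
    y: clamp(Math.round((point.y * device.height) / model.height), device.height),
  };
}
```

Each axis is mapped with its own `device / model` ratio rather than one shared scale factor. This keeps taps accurate even after the downscaled width has been rounded to a whole pixel.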
- Install dependencies: `npm install`
- Create a `.env` file with your OpenAI API key: `echo "OPENAI_API_KEY=your-api-key" > .env`
- Make sure Android Debug Bridge (ADB) is available in your system PATH: `adb version`
- (Optional) Start your Android emulator manually: `emulator -avd Your_AVD_Name`
- Run the agent: `node index.js --avd=Your_AVD_Name`

If no `--avd` is provided, the agent will try to connect to the first running device.
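One way to select the device is to parse `adb devices` and take the first online entry. When `--avd` is given, each emulator can instead be asked for its AVD name with `adb emu avd name`. This is a sketch under those assumptions. `parseAdbDevices`, `avdNameOf` and `pickDevice` are hypothetical helpers, and the agent's real connection logic may differ.

```javascript
const { execFile } = require("node:child_process");
const { promisify } = require("node:util");

const run = promisify(execFile);

// Return serials whose state is "device" (online). The header line and
// offline/unauthorized entries are filtered out by the state check.
function parseAdbDevices(output) {
  return output
    .split(/\r?\n/)
    .map((line) => line.trim().split(/\s+/))
    .filter(([serial, state]) => serial && state === "device")
    .map(([serial]) => serial);
}

// Ask a running emulator for its AVD name via the emulator console.
async function avdNameOf(serial) {
  const { stdout } = await run("adb", ["-s", serial, "emu", "avd", "name"]);
  return stdout.split(/\r?\n/)[0].trim(); // first line is the name, then "OK"
}

// Hypothetical selection rule: match avdName if given, else the first online device.
async function pickDevice(avdName) {
  const { stdout } = await run("adb", ["devices"]);
  const serials = parseAdbDevices(stdout);
  if (!avdName) return serials[0] ?? null;
  for (const serial of serials) {
    if (serial.startsWith("emulator-") && (await avdNameOf(serial)) === avdName) {
      return serial;
    }
  }
  return null; // caller could launch `emulator -avd <name>` here
}
```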
- Captures screenshots directly from the device (`adb exec-out screencap -p`).
- Dynamically scales screenshots for OpenAI compatibility.
- Maps model-generated actions (click, scroll, drag, type, keypress, wait) back to real device coordinates.
- Connects automatically to a running emulator or launches it if needed.
- Presents the device screen to the model as if it were embedded in a browser page, because computer-use-preview has no Android environment.
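In practice, the "browser page" framing means declaring a `browser` environment in the `computer_use_preview` tool while sending Android screenshots. The sketch below builds a Responses API request body for a user turn. `buildRequest` and its parameters are hypothetical. The request shape follows OpenAI's computer-use guide, not this project's `openai.js`. The display size is assumed to be the downscaled screenshot size.

```javascript
// Hypothetical helper: build a user-turn request body for computer-use-preview.
function buildRequest({ instruction, screenshotPng, width, height, previousResponseId }) {
  const body = {
    model: "computer-use-preview",
    tools: [
      {
        type: "computer_use_preview",
        display_width: width, // size of the downscaled screenshot, not the device
        display_height: height,
        environment: "browser", // no Android option, so present the screen as a page
      },
    ],
    input: [
      {
        role: "user",
        content: [
          { type: "input_text", text: instruction },
          { type: "input_image", image_url: `data:image/png;base64,${screenshotPng.toString("base64")}` },
        ],
      },
    ],
    truncation: "auto", // OpenAI's guide requires "auto" truncation for this model
  };
  if (previousResponseId) body.previous_response_id = previousResponseId; // chain turns
  return body;
}
```

With the official `openai` package, this body would be passed to `client.responses.create(body)`. Later turns can chain onto the previous response through `previous_response_id`.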
| Flag | Description |
|---|---|
| `--avd=AVD_NAME` | Select the emulator device by AVD name. |
| `--instructions=FILENAME` | Load user instructions from a text file. |
| `--record` | Save every screenshot to a folder for later review or video creation. |
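These three flags map naturally onto Node's built-in `util.parseArgs`, which is available from Node.js 18.3. This is a hedged sketch: `parseCliFlags` is a hypothetical name, and the project may parse `process.argv` differently.

```javascript
const { parseArgs } = require("node:util");

// Parse the three documented flags; unknown flags throw (parseArgs strict mode).
function parseCliFlags(argv) {
  const { values } = parseArgs({
    args: argv,
    options: {
      avd: { type: "string" },
      instructions: { type: "string" },
      record: { type: "boolean" },
    },
  });
  return {
    avd: values.avd ?? null, // null: fall back to the first running device
    instructions: values.instructions ?? null,
    record: values.record ?? false,
  };
}

// Usage: const flags = parseCliFlags(process.argv.slice(2));
```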
- Start your emulator: `emulator -avd Pixel_5_API_34`
- Run the agent: `node index.js --avd=Pixel_5_API_34`
- Run with an instructions file: `node index.js --avd=Pixel_5_API_34 --instructions=example.txt`
Example `example.txt`:
Open Chrome
Search for "Loadmill"
Scroll down
Go back to the home screen
exit
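Here is a sketch of how an instructions file like this might be read. It hypothetically assumes one instruction per line, with blank lines skipped and an `exit` line ending the session. `parseInstructions` and `loadInstructions` are illustrative names, not the project's API.

```javascript
const { readFile } = require("node:fs/promises");

// Hypothetical format: one instruction per line; "exit" stops reading.
function parseInstructions(text) {
  const steps = [];
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    if (!line) continue; // skip blank lines
    if (line.toLowerCase() === "exit") break; // same word that ends an interactive session
    steps.push(line);
  }
  return steps;
}

async function loadInstructions(path) {
  return parseInstructions(await readFile(path, "utf8"));
}
```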
- Node.js 18 or higher
- A running Android emulator (AVD)
- Android Debug Bridge (ADB) installed and available in system PATH
- OpenAI Tier 3 access for the computer-use-preview model
Note: Your OpenAI account must be Tier 3 to access the computer-use-preview model. Learn more: OpenAI Computer Use Preview
| File | Responsibility |
|---|---|
| `index.js` | Manages user input, the OpenAI conversation, and the main loop. |
| `device.js` | ADB device connection, screenshot capture, and screen-size management. |
| `actions.js` | Executes model actions on the device (tap, swipe, drag, type, keypress). |
| `openai.js` | Sends requests to OpenAI and manages API responses. |
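The role described for `actions.js` can be sketched as a translation from computer-use-preview actions to `adb shell input` commands. The sketch makes several assumptions:

- Coordinates and scroll deltas are already rescaled to device pixels.
- The key-name table is hypothetical, including mapping Escape to Back.
- A drag is approximated by a slow swipe between the first and last path points.

`actionToInputArgs` and `executeAction` are illustrative names.

```javascript
const { execFile } = require("node:child_process");
const { promisify } = require("node:util");

const run = promisify(execFile);

// Hypothetical key-name mapping; unknown keys fall back to KEYCODE_<NAME>.
const KEYCODES = {
  ENTER: "KEYCODE_ENTER",
  BACKSPACE: "KEYCODE_DEL",
  ESC: "KEYCODE_BACK", // assumption: treat Escape as Android Back
  TAB: "KEYCODE_TAB",
  SPACE: "KEYCODE_SPACE",
  HOME: "KEYCODE_HOME",
};

// Single-quote a string for the device shell that `adb shell` hands arguments to.
const shellQuote = (s) => `'${s.replace(/'/g, `'\\''`)}'`;

// Translate one action into a list of `input` argument vectors.
function actionToInputArgs(action) {
  switch (action.type) {
    case "click":
      return [["input", "tap", action.x, action.y]];
    case "scroll": {
      // Scrolling content down means swiping the finger up, so move against the delta.
      const { x, y, scroll_x: dx, scroll_y: dy } = action;
      return [["input", "swipe", x, y, Math.max(0, x - dx), Math.max(0, y - dy), 300]];
    }
    case "drag": {
      const start = action.path[0];
      const end = action.path[action.path.length - 1];
      return [["input", "swipe", start.x, start.y, end.x, end.y, 800]]; // slow swipe as a stand-in
    }
    case "type":
      // `input text` reads "%s" as a space.
      return [["input", "text", shellQuote(action.text.replace(/ /g, "%s"))]];
    case "keypress":
      return action.keys.map((k) => {
        const name = k.toUpperCase();
        return ["input", "keyevent", KEYCODES[name] ?? `KEYCODE_${name}`];
      });
    case "wait":
      return []; // nothing to send; the caller sleeps before the next screenshot
    default:
      throw new Error(`Unsupported action: ${action.type}`);
  }
}

async function executeAction(serial, action) {
  for (const args of actionToInputArgs(action)) {
    await run("adb", ["-s", serial, "shell", ...args.map(String)]);
  }
}
```

The typed text is quoted because `adb shell` passes its arguments to the device shell as a single command line. Without quoting, characters such as `;` or `&` in the text would be interpreted by that shell.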
If you run the agent with the `--record` flag, it saves all screenshots to a folder such as:
`droid-cua-recording-1715098765432/`
You can convert the frames into a video using `ffmpeg`:
ffmpeg -framerate 1 -pattern_type glob -i 'droid-cua-recording-*/frame_*.png' \
-vf "pad=ceil(iw/2)*2:ceil(ih/2)*2" \
-c:v libx264 -pix_fmt yuv420p session.mp4
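Below is a minimal sketch of the recording side. It hypothetically assumes that frames are written with zero-padded names (`frame_00001.png`, `frame_00002.png`, and so on). `createRecorder` is an illustrative name, and the project's actual file naming may differ.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical recorder: one timestamped folder per session, zero-padded frames.
function createRecorder(baseDir = ".") {
  const dir = path.join(baseDir, `droid-cua-recording-${Date.now()}`);
  fs.mkdirSync(dir, { recursive: true });
  let index = 0;
  return {
    dir,
    save(pngBuffer) {
      index += 1;
      const file = path.join(dir, `frame_${String(index).padStart(5, "0")}.png`);
      fs.writeFileSync(file, pngBuffer);
      return file;
    },
  };
}
```

Zero-padding matters for the `ffmpeg` glob. Glob matches are typically sorted lexicographically, so unpadded names would place `frame_10.png` before `frame_2.png`. Note also that `droid-cua-recording-*` matches every recording folder on disk. If several sessions exist, point the pattern at a single folder.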