A proof of concept for an AI-based screen reader and browser assistant.
This project turns natural-language voice commands into fully automated browser actions using the open-source browser-use library.
Read more in this blog post.
Hold CTRL, speak an instruction, release the key and the agent will:
- Transcribe your speech with OpenAI Whisper
- Launch a Chromium browser (Playwright) locally
- Let the LLM (GPT-4.1 by default, or OpenAI's computer-use model) reason about the task
- Click, type and scroll until it fulfils the goal
- Speak back the result ☺︎
- Repeat the process until the user says "exit" (sketched below)
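Put together, the loop is simple. A minimal sketch, with the four helpers injected as callables (all names here are illustrative stand-ins for the real `voice_io.py` / `agent_providers/` code, not the actual `main.py`):

```python
from typing import Callable

def control_loop(
    record: Callable[[], bytes],         # push-to-talk capture (hold CTRL)
    transcribe: Callable[[bytes], str],  # Whisper STT
    run_agent: Callable[[str], str],     # browser-use / computer-use step
    speak: Callable[[str], None],        # TTS playback (ESC skips)
) -> None:
    """Repeat record -> transcribe -> act -> speak until the user says 'exit'."""
    while True:
        instruction = transcribe(record())
        if instruction.strip().lower() == "exit":
            break
        speak(run_agent(instruction))
```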
```bash
$ python main.py --voice --start-url "https://google.com"
```
- 🔊 Push-to-talk – hold CTRL to record, release to send
- 🖱️ Autonomous web control powered by browser-use or OpenAI's computer-use and Playwright
- 🦜 OpenAI GPT-4.1 by default, or OpenAI's computer-use model (configurable)
- 💬 Speaks every step and the final answer (text fallback when `--voice` is off)
- 🔄 Conversation history – the agent remembers previous steps and uses them to reason about the current task
- Playback caching – the agent caches the synthesized audio for repeated messages to avoid duplicate API calls (see the sketch after this list)
- Skip playback – press ESC to skip the current audio
- 🔌 Pluggable architecture – swap agent and STT/TTS providers via environment variables
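The playback cache mentioned above can be as simple as keying synthesized audio by a hash of the message. A minimal sketch (the function name and cache location are illustrative, not the project's actual implementation):

```python
import hashlib
from pathlib import Path
from typing import Callable

CACHE_DIR = Path(".tts_cache")  # illustrative location

def cached_audio(text: str, synthesize: Callable[[str], bytes]) -> Path:
    """Return an audio file for `text`, calling the TTS engine at most once.

    `synthesize` is any text -> audio-bytes callable (e.g. an OpenAI tts-1
    request); a repeated message is served from disk instead.
    """
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.mp3"
    if not path.exists():
        path.write_bytes(synthesize(text))
    return path
```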
This project is built around modular providers that you can mix-and-match at run-time.
| Provider | Description |
|---|---|
| `browser-use` (default) | Launches a local Playwright-controlled Chromium using the open-source browser-use library. |
| `computer-use` | Also uses a local Playwright-controlled Chromium, but with OpenAI's computer-use model (currently in preview). |

Select the implementation with the `AGENT_PROVIDER` environment variable.
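A switch like this usually boils down to a small factory. A sketch of how `AGENT_PROVIDER` might be dispatched (the module and class names are assumptions for illustration, not the repository's actual ones):

```python
import os

def make_agent_provider():
    """Instantiate the agent backend named by the AGENT_PROVIDER variable."""
    name = os.getenv("AGENT_PROVIDER", "browser-use")
    if name == "browser-use":
        # hypothetical module/class names, for illustration only
        from agent_providers.browser_use_agent import BrowserUseAgent
        return BrowserUseAgent()
    if name == "computer-use":
        from agent_providers.computer_use_agent import ComputerUseAgent
        return ComputerUseAgent()
    raise ValueError(f"Unknown AGENT_PROVIDER: {name!r}")
```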
Speech-to-Text (STT) and Text-to-Speech (TTS) engines are configured independently, so you can combine them freely:
| Provider | STT engine | TTS engine |
|---|---|---|
| `openai` | Whisper | `tts-1` (multiple voices) |
| `system` | OS default dictation | OS default synthesis (e.g. `say` on macOS, `espeak` on Linux) |

Configure these via `VOICE_STT_PROVIDER` and `VOICE_TTS_PROVIDER`.
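For reference, the `system` TTS row above amounts to shelling out to the OS synthesizer. A minimal sketch, assuming `say`/`espeak` are available on the PATH:

```python
import platform
import subprocess

def system_speak(text: str) -> None:
    """Speak `text` with the OS default synthesizer:
    `say` on macOS, `espeak` elsewhere (e.g. Linux)."""
    cmd = ["say", text] if platform.system() == "Darwin" else ["espeak", text]
    subprocess.run(cmd, check=True)
```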
```bash
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# one-time browser download (~120 MB)
playwright install
```
Save your keys in a `.env` file or export them in the shell:
```bash
OPENAI_API_KEY=sk-…  # required when using OpenAI LLM or audio endpoints

# --- Voice provider configuration ---
# Choose which engines to use for speech-to-text (STT) and text-to-speech (TTS).
# Supported providers: `openai`, `system`
VOICE_STT_PROVIDER=openai
VOICE_TTS_PROVIDER=openai

# Optional – only read when the provider is *openai*
VOICE_OPENAI_TRANSCRIPTION_MODEL=whisper-1
VOICE_OPENAI_TTS_MODEL=tts-1
VOICE_OPENAI_VOICE=alloy

# --- Agent provider configuration ---
# Choose which agent to use for executing browser actions. Currently supported:
# - `browser-use` (default) – uses the browser-use + Playwright implementation.
# - `computer-use` – uses OpenAI's computer-use model (currently in preview).
AGENT_PROVIDER=browser-use  # or computer-use
```
The `main.py` helper reads these variables early during start-up (via `python-dotenv`) and instantiates the correct providers. Neither the providers themselves nor `voice_io.py` depend on environment variables – it's all wired up in one place for clarity.
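That wiring looks roughly like this (a simplified sketch; `load_dotenv` and the variable names come from the project, the rest is illustrative):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # read .env from the project root, if present

# Provider names are resolved once, here, and handed to the constructors –
# nothing else in the codebase reads the environment.
stt_provider = os.getenv("VOICE_STT_PROVIDER", "openai")
tts_provider = os.getenv("VOICE_TTS_PROVIDER", "openai")
agent_provider = os.getenv("AGENT_PROVIDER", "browser-use")
```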
```bash
python main.py [--voice] [--start-url URL] [--debug]
```
- `--voice` – Enable microphone input + audio output (requires speakers)
- `--start-url` – Load a page before the first step (default: Bing)
- `--debug` – Raise exceptions instead of friendly messages
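These flags map onto a straightforward `argparse` setup. A sketch (the exact default start URL is an assumption):

```python
import argparse

parser = argparse.ArgumentParser(description="AI screen reader / browser assistant")
parser.add_argument("--voice", action="store_true",
                    help="enable microphone input + audio output")
parser.add_argument("--start-url", default="https://bing.com",  # assumed default
                    help="page to load before the first step")
parser.add_argument("--debug", action="store_true",
                    help="raise exceptions instead of friendly messages")
args = parser.parse_args()
```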
Example (text-only):
```
$ python main.py
Type your instructions (or 'exit' to quit):
› Open CNN.com and give me a brief summary of the latest news
```
```
a11y-agent/
├── main.py            # CLI & control loop
├── voice_io.py        # Speech-to-text + text-to-speech (push-to-talk added)
├── speech_providers/  # Pluggable STT/TTS engines (OpenAI, System, …)
├── agent_providers/   # Pluggable agent implementations (browser-use, …)
├── requirements.txt   # Python dependencies
└── README.md          # This file
```
- First run is slow ➜ Playwright downloads the browser; subsequent runs are fast.
- "Playwright not installed" ➜ run `pip install playwright && playwright install`.
- Voice I/O fails with an API-key error ➜ make sure `OPENAI_API_KEY` is exported in the shell or present in a `.env` file in the project root. The key is loaded early via `python-dotenv`.
- System TTS fails on macOS ➜ make sure that `Spoken Content` is enabled in `System Settings > Accessibility` (test in Terminal: `say "Hello, world!"`).
- macOS "process is not trusted" warning ➜ grant Accessibility permission:
  - Keep the script running so macOS shows the prompt, or open System Settings ▸ Privacy & Security ▸ Accessibility.
  - Click "+" and add the Terminal/iTerm/VS Code app you use to run the program. Ensure the toggle is enabled.
  - Re-launch the terminal and run the script again.
- Microphone not detected ➜ make sure `sounddevice` & `pynput` have the necessary OS permissions (see the same Privacy panel above).
- GPT-4.1 too expensive? Replace the model via `browser_use.llm.ChatOpenAI(model="gpt-3.5-turbo")` inside `main.py` (see the sketch below).
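Assuming a recent browser-use release (the `Agent`/`ChatOpenAI` usage below follows its quickstart and may differ across versions), the swap looks roughly like this:

```python
import asyncio

from browser_use import Agent
from browser_use.llm import ChatOpenAI  # import path as referenced above

agent = Agent(
    task="Open CNN.com and give me a brief summary of the latest news",
    llm=ChatOpenAI(model="gpt-3.5-turbo"),  # cheaper model than the default
)
asyncio.run(agent.run())
```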
Ideas, improvements, PRs — all welcome. If you want to help make this project better, faster, or more flexible, open an issue or submit a pull request.
MIT — use freely, modify openly, and share widely. See the LICENSE file for details.
© 2025 — Jan Mittelman