Vision-Enabled Web Browsing Agent

Vision-Enabled Web Browsing Agent is a Python project that implements a web automation agent driven by annotated browser screenshots and language model predictions. The agent leverages state-of-the-art browser automation and large language models (LLMs) to simulate human-like browsing behavior by controlling the mouse and keyboard.

Overview

This project integrates several powerful libraries and tools:

Playwright: For browser automation and interaction.
LangChain and LangGraph: To integrate LLMs (via OpenAI Chat models) and manage the conversation flow using a state graph.
PIL (Pillow): For processing and displaying screenshots.
Custom JavaScript (mark_page.js): To annotate pages with bounding boxes that highlight actionable UI elements.

The agent works by taking a screenshot of the browser, identifying interactive elements, and then deciding which action to take next (e.g., clicking, typing, scrolling) based on LLM predictions. This simulation of human behavior is designed to help avoid detection (e.g., CAPTCHAs) during web automation.

Features

Vision-Enabled Navigation:
The agent captures screenshots of the browser, overlays bounding boxes on UI elements, and uses these annotations to inform action decisions.
LLM-Powered Decision Making:
Integrates ChatGPT (via OpenAI’s API) to analyze the current page state and decide on the next best action.
Tool Execution:
Supports a variety of actions:
- Click: Simulate mouse clicks on identified UI elements.
- Type Text: Enter text with realistic, human-like typing delays.
- Scroll: Scroll the window or specific elements.
- Wait: Pause execution to mimic human reading time.
- Go Back: Navigate back in browser history.
- Navigate to Google: Quickly jump to google.com.
State Graph Management:
Uses LangGraph to manage and transition between different agent states and actions.
Human-Like Typing Simulation:
The type_text function simulates realistic typing by introducing delays between keystrokes, reducing the risk of automated detection.

Requirements

Python 3.8 or higher
Playwright
LangChain and LangGraph
Pillow
Other dependencies as listed in requirements.txt

You will also need an OpenAI API key, which should be available as the environment variable OPENAI_API_KEY.

Installation

Clone the repository:

git clone https://github.com/DonGuillotine/langraph_browser_agent.git
cd langraph_browser_agent

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install the Python dependencies:
```
pip install -r requirements.txt
```
Install the Playwright browsers:
```
playwright install
```
Set the OpenAI API Key:

You can either set the environment variable manually:
```
export OPENAI_API_KEY="your-api-key"  # On Windows use set instead of export
```
Or simply run the script; it will prompt you for the API key if it isn’t set.

Usage

The main entry point is the main() function in your project. When executed, the agent will launch a Chromium browser window, navigate to a starting page (e.g., google.com), and execute actions based on the LLM’s predictions.

To run the agent:

python main.py

While running, the agent will display:

Annotated browser screenshots (using PIL)
A console log of the actions taken

Example prompt for the agent might be:
"Find me the latest news on CNN"

Project Structure

main.py
The main script that sets up the agent, defines the state graph, and orchestrates the LLM-driven actions.
mark_page.js
A JavaScript file used by Playwright to annotate the current page with bounding boxes for interactive elements.
requirements.txt
Lists all the required Python packages for the project.
README.md
This file.

Customization

Typing Simulation:
The type_text function simulates human-like typing. You can adjust the delay parameters (or use randomized delays) to further mimic natural typing speeds.
LLM Prompt and Output Parsing:
The agent uses a prompt (pulled from wfh/web-voyager) and a custom parser to interpret LLM responses. Feel free to modify these to better suit your needs.
Tool Configuration:
You can add, remove, or modify tools in the tools dictionary in main.py to change the agent’s behavior.

Contributing

Contributions are welcome! If you have suggestions for improvements, bug fixes, or new features, please open an issue or submit a pull request.

When contributing, please ensure:

Your code follows the existing style.
Any new features are accompanied by relevant tests or documentation updates.

License

This project is licensed under the MIT License.

Acknowledgments

Contact

For questions, suggestions, or contributions, please open an issue in this repository or contact infect3dlab@gmail.com.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Vision-Enabled Web Browsing Agent

Table of Contents

Overview

Features

Requirements

Installation

Usage

Project Structure

Customization

Contributing

License

Acknowledgments

Contact

About

Uh oh!

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.gitignore		.gitignore
README.md		README.md
main.py		main.py
mark_page.js		mark_page.js
requirements.txt		requirements.txt

DonGuillotine/langraph_browser_agent

Folders and files

Latest commit

History

Repository files navigation

Vision-Enabled Web Browsing Agent

Table of Contents

Overview

Features

Requirements

Installation

Usage

Project Structure

Customization

Contributing

License

Acknowledgments

Contact

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Uh oh!

Languages