Ever wanted to automate real browser actions just by describing what you want? Meet talk2browser, a LangGraph-powered agent that turns prompts into real-time web actions and reusable test scripts.
Hi everyone! 👋 I'm excited to share talk2browser, which uses LangGraph's agent orchestration to build a self-improving browser automation system. Inspired by the open-source Browser-Use project, it takes natural-language tasks, executes real browser actions, and generates reusable test scripts along the way.
🔗 LangGraph Implementation
talk2browser showcases advanced LangGraph patterns:
- Agent State Management: Complex browser workflows with conditional transitions, using an AgentState TypedDict
- Dynamic Tool Registration: 25+ browser automation tools registered automatically as LangGraph tools via decorators
- Multi-Step Orchestration: Planning → Execution → Script Generation phases with state persistence
- Self-Improving Workflows: Action recording and replay for iterative improvement
- Vision Integration: YOLOv11-based UI element detection, with results injected into the LLM context
- Sensitive Data Handling: Secure credential management via environment-variable injection
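To illustrate what environment-variable injection of credentials can look like, here is a minimal sketch. It assumes a hypothetical `<secret>NAME</secret>` placeholder syntax and a hypothetical `inject_secrets` helper; neither is talk2browser's actual API. The idea is that the task text and the LLM only ever see the placeholder, and the real value is read from the environment just before a tool runs.

```python
import os
import re
from typing import Mapping, Optional

# Hypothetical placeholder syntax: the LLM only ever sees <secret>NAME</secret>;
# the real value is substituted just before a browser tool executes.
SECRET_PATTERN = re.compile(r"<secret>([A-Za-z0-9_]+)</secret>")


def inject_secrets(text: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace <secret>NAME</secret> placeholders with values from the environment.

    Raises KeyError for an unset variable, so a missing credential fails loudly
    instead of typing an empty string into a login form.
    """
    source = os.environ if env is None else env

    def lookup(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in source:
            raise KeyError(f"environment variable {name!r} is not set")
        return source[name]

    return SECRET_PATTERN.sub(lookup, text)


if __name__ == "__main__":
    # Example: the agent chose to fill the login field with a placeholder.
    print(inject_secrets("<secret>GITHUB_USER</secret>", {"GITHUB_USER": "octocat"}))
```

A side benefit of this design (in the sketch) is that recorded actions and generated scripts can keep the placeholder, so credentials never land in replay files.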
✨ Key Features
| Feature | Description |
|---|---|
| 🗣️ Natural Language Control | Plain-English commands for web app testing and automation |
| 📝 Multi-Framework Scripts | Auto-generates Playwright, Cypress, and Selenium code from recorded actions |
| 👁️ Vision Integration | YOLOv11-based UI element detection with bounding-box coordinates |
| 🔐 Secure Data Handling | Environment-based credential management with SecretStr support |
| 📊 PDF Report Generation | Comprehensive documentation output with screenshots and structured data |
| ♻️ Repeatable Execution | JSON action recording for consistent replay across repeated runs |
| 🎯 Element Detection | Smart CSS/XPath selector resolution with hash-based element mapping |
| 🔧 Quality Assurance | Full mypy, flake8, and black compliance, enforced by an automated CI/CD pipeline |
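As a rough picture of how "recorded actions → test script" generation can work, the sketch below turns a list of recorded JSON actions into a Playwright (sync Python API) script. The action schema (`tool` and `args` keys), the `TEMPLATES` table, and `to_playwright` are all assumptions for illustration; talk2browser's real recording format and generators may differ.

```python
import json
from typing import Any, Dict, List

# Hypothetical recorded-action format: one JSON object per tool call.
RECORDED: List[Dict[str, Any]] = json.loads("""[
  {"tool": "navigate", "args": {"url": "https://github.com/trending"}},
  {"tool": "click", "args": {"selector": "text=Sign in"}},
  {"tool": "fill", "args": {"selector": "#login_field", "value": "octocat"}}
]""")

# One line of Playwright (sync Python API) per recorded tool name.
TEMPLATES: Dict[str, str] = {
    "navigate": "page.goto({url})",
    "click": "page.click({selector})",
    "fill": "page.fill({selector}, {value})",
}


def to_playwright(actions: List[Dict[str, Any]]) -> str:
    """Render recorded actions as the source text of a standalone Playwright script."""
    lines = [
        "from playwright.sync_api import sync_playwright",
        "",
        "with sync_playwright() as p:",
        "    browser = p.chromium.launch(headless=False)",
        "    page = browser.new_page()",
    ]
    for action in actions:
        template = TEMPLATES.get(action["tool"])
        if template is None:
            # Keep unknown tools visible as comments instead of silently dropping them.
            lines.append(f"    # TODO: unsupported action {action['tool']!r}")
            continue
        # json.dumps yields a double-quoted, escaped string literal that Python also accepts.
        quoted = {name: json.dumps(value) for name, value in action["args"].items()}
        lines.append("    " + template.format(**quoted))
    lines.append("    browser.close()")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_playwright(RECORDED))
```

Cypress or Selenium output would follow the same pattern in this sketch, just with a different preamble and template table.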
🧠 Agent Architecture
The LangGraph agent uses a two-node graph with conditional routing:
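```python
from typing import Annotated, Dict, List, TypedDict

from langchain_core.messages import BaseMessage
from langgraph.graph import START, StateGraph
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode


class AgentState(TypedDict):
    messages: Annotated[List[BaseMessage], add_messages]
    next: str  # For LangGraph routing
    element_map: Dict[str, str]  # Element hash to xpath mapping
    vision: dict  # Optional vision metadata for LLM context


# Agent workflow: chatbot -> tools -> chatbot (or END)
# Excerpt from the agent class: self._chatbot, self._route_tools and TOOLS
# are defined elsewhere in the project.
graph = StateGraph(AgentState)
graph.add_node("agent", self._chatbot)
graph.add_node("tools", ToolNode(TOOLS))
graph.add_conditional_edges("agent", self._route_tools)
# Edges implied by the workflow comment above (not shown in the original excerpt):
graph.add_edge(START, "agent")
graph.add_edge("tools", "agent")
```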
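The post passes `self._route_tools` to `add_conditional_edges` but does not show its body. A common way to write such a router, sketched here as an assumption rather than talk2browser's actual code, is to go to the `tools` node whenever the latest AI message requested tool calls and to end the run otherwise. To stay runnable without LangGraph installed, the sketch uses a hypothetical `StubAIMessage` class and returns `"__end__"`, the string value of LangGraph's `END` constant.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# LangGraph's END constant is the string "__end__"; redefined here so the
# sketch runs without langgraph installed.
END = "__end__"


@dataclass
class StubAIMessage:
    """Hypothetical stand-in for langchain_core's AIMessage (only the fields used here)."""

    content: str
    tool_calls: List[Dict[str, Any]] = field(default_factory=list)


def route_tools(state: Dict[str, Any]) -> str:
    """Pick the next node: "tools" if the last message requested tool calls, else END."""
    messages = state.get("messages", [])
    if not messages:
        return END
    # An AIMessage's tool_calls list is empty when the model answered directly.
    if getattr(messages[-1], "tool_calls", None):
        return "tools"
    return END
```

In a real graph this would be wired as `graph.add_conditional_edges("agent", route_tools)`. LangGraph also ships a prebuilt `tools_condition` in `langgraph.prebuilt` that performs essentially the same check.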
The agent keeps context across browser sessions and learns from previous runs through the ActionService, which records every tool call with its execution time, arguments, result, and any error.
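ActionService itself is not shown in the post. The sketch below illustrates the general idea under stated assumptions: a hypothetical `ActionRecorder` decorator wraps async tool functions, logs arguments, result, error, and duration for every call, and dumps the log to JSON; a hypothetical `replay` function then re-runs the successful calls in order. Names and the JSON layout are illustrative only.

```python
import asyncio
import functools
import json
import time
from typing import Any, Awaitable, Callable, Dict, List

ToolFn = Callable[..., Awaitable[Any]]


class ActionRecorder:
    """Hypothetical stand-in for ActionService: logs every call to a wrapped async tool."""

    def __init__(self) -> None:
        self.actions: List[Dict[str, Any]] = []

    def record(self, fn: ToolFn) -> ToolFn:
        @functools.wraps(fn)  # preserves name, docstring and signature for tool registration
        async def wrapper(*args: Any, **kwargs: Any) -> Any:
            entry: Dict[str, Any] = {"tool": fn.__name__, "args": list(args), "kwargs": dict(kwargs)}
            start = time.perf_counter()
            try:
                result = await fn(*args, **kwargs)
                entry["result"] = result
                return result
            except Exception as exc:
                entry["error"] = repr(exc)
                raise
            finally:
                # Runs on success and on failure, so failed calls are logged too.
                entry["duration_s"] = round(time.perf_counter() - start, 4)
                self.actions.append(entry)

        return wrapper

    def to_json(self) -> str:
        # default=str keeps non-JSON-serializable results from breaking the dump.
        return json.dumps(self.actions, indent=2, default=str)


async def replay(actions: List[Dict[str, Any]], tools: Dict[str, ToolFn]) -> List[Any]:
    """Re-run each successful recorded call, in order, against a name -> tool table."""
    results = []
    for entry in actions:
        if "error" in entry:
            continue  # skip calls that failed during recording
        results.append(await tools[entry["tool"]](*entry["args"], **entry["kwargs"]))
    return results


recorder = ActionRecorder()


@recorder.record
async def navigate(url: str) -> str:
    await asyncio.sleep(0)  # stand-in for a real Playwright page.goto(url)
    return f"navigated to {url}"


if __name__ == "__main__":
    asyncio.run(navigate("https://github.com/trending"))
    log = json.loads(recorder.to_json())
    print(asyncio.run(replay(log, {"navigate": navigate})))
```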
Note: The system registers 25+ tools, covering navigation, clicking, form filling, screenshot capture, PDF generation, and script creation.
🚀 Quick Example
Here's how to automate GitHub trending analysis:
```python
import asyncio

from talk2browser.agent.agent import BrowserAgent


async def main() -> None:
    # Prepare a test scenario
    task = """Go to https://github.com/trending.
Extract information about the top 10 trending repositories, including:
- Repository name, owner, description, language, stars, forks, URL
Create a comprehensive PDF report and generate a Playwright script."""

    async with BrowserAgent(headless=False, info_mode=True) as agent:
        response = await agent.run(task)
        print("Agent response:", response)


if __name__ == "__main__":
    asyncio.run(main())
```
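🛠️ Technical Architecture
Tool Registration System
```python
from langchain_core.tools import tool


@tool
@resolve_hash_args  # project decorator for hash-based element resolution
async def click(selector: str, *, timeout: int = 5000) -> str:
    """Click on an element matching the CSS selector."""
    # Automatic tool registration with LangGraph
    # Hash-based element resolution
    # Error handling and logging
    ...  # implementation omitted in the original post
```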
State Management
```python
from langchain_core.messages import AIMessage, HumanMessage, ToolMessage

# Agent maintains persistent state across tool calls
state = {
    "messages": [HumanMessage, AIMessage, ToolMessage],  # instances of these message types
    "next": "tools",  # or "agent" or END
    "element_map": {"#abc123": "xpath=//button[@id='submit']"},
    "vision": {"detections": [...], "image_path": "..."},
}
```
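The post does not show how `resolve_hash_args` works. Judging only from its name and from the `element_map` entry in the state (which maps short hashes such as `#abc123` to XPath selectors), one plausible shape is a decorator that swaps a known hash for its stored selector before the tool runs. Everything below (`ELEMENT_MAP`, `register_element`, the 6-character SHA-1 prefix, the stand-in `click` body) is a hypothetical sketch, not the project's code.

```python
import asyncio
import functools
import hashlib
from typing import Any, Awaitable, Callable, Dict

# Hypothetical registry: short element hash -> full selector (cf. element_map in the state).
ELEMENT_MAP: Dict[str, str] = {}


def register_element(xpath: str) -> str:
    """Store an XPath under a short, stable hash key and return the key."""
    key = "#" + hashlib.sha1(xpath.encode()).hexdigest()[:6]
    ELEMENT_MAP[key] = f"xpath={xpath}"
    return key


def resolve_hash_args(fn: Callable[..., Awaitable[Any]]) -> Callable[..., Awaitable[Any]]:
    """Swap a `selector` argument that is a known hash for its stored selector."""

    @functools.wraps(fn)  # keeps the signature visible to tool-schema generation
    async def wrapper(*args: Any, **kwargs: Any) -> Any:
        if "selector" in kwargs:
            kwargs["selector"] = ELEMENT_MAP.get(kwargs["selector"], kwargs["selector"])
        elif args:
            # Unknown values (e.g. ordinary CSS selectors) pass through unchanged.
            args = (ELEMENT_MAP.get(args[0], args[0]),) + args[1:]
        return await fn(*args, **kwargs)

    return wrapper


@resolve_hash_args
async def click(selector: str, *, timeout: int = 5000) -> str:
    # Stand-in body: a real tool would call Playwright's page.click(selector, timeout=timeout).
    return f"clicked {selector}"


if __name__ == "__main__":
    key = register_element("//button[@id='submit']")
    print(key, asyncio.run(click(key)))
```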
🤝 Community Questions
I'd love to hear from the LangChain community:
- What real-world automation workflows could benefit from natural language control? (e.g., E2E testing, data extraction, monitoring)
- How do you currently approach multi-step browser automation with state persistence across actions?
- What LangGraph patterns have you found most effective for conditional routing and error recovery in agent workflows?
- How do you handle dynamic web content and element detection in your automation projects?
- What's your experience with integrating computer vision (YOLO, OCR) into LangChain/LangGraph workflows?
- How do you manage sensitive data and credentials in production automation systems?
- What testing frameworks would you most want to see supported for script generation?
⚠️ What to Watch Out For
- Vision/YOLOv11 Integration: Optional. Requires a YOLOv11 model file and additional setup; not needed for core browser automation.
- Script Summarization: Planned. AI-powered summaries of generated automation scripts are on the roadmap but not yet implemented.
- PDF Generation: Fully supported. Generates comprehensive PDF reports with execution details and screenshots.
- Manual Action Override: Partially implemented. Human-in-the-loop manual override is available for some actions, and broader coverage is in active development.
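For the optional vision path, the post only says that YOLOv11 detections are injected into the LLM context, and the state example holds `detections` plus an `image_path`. A minimal sketch of turning such detections into prompt text might look like this; the detection keys (`label`, `confidence`, `bbox` as `[x1, y1, x2, y2]` pixel coordinates), the 0.5 threshold, and the `vision_context` name are assumptions, not talk2browser's actual schema.

```python
from typing import Any, Dict, List


def vision_context(vision: Dict[str, Any], min_confidence: float = 0.5) -> str:
    """Summarize UI-element detections as plain text for the LLM prompt (hypothetical schema)."""
    detections: List[Dict[str, Any]] = vision.get("detections", [])
    kept = [d for d in detections if d.get("confidence", 0.0) >= min_confidence]
    if not kept:
        return "No UI elements detected."
    # Sort top-to-bottom, then left-to-right, roughly matching reading order.
    kept.sort(key=lambda d: (d["bbox"][1], d["bbox"][0]))
    lines = [f"Detected {len(kept)} UI elements in {vision.get('image_path', 'screenshot')}:"]
    for i, d in enumerate(kept, start=1):
        x1, y1, x2, y2 = d["bbox"]
        # Box center: a convenient click target for coordinate-based actions.
        cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
        lines.append(
            f"{i}. {d['label']} (conf {d['confidence']:.2f}) "
            f"center=({cx}, {cy}) bbox=[{x1}, {y1}, {x2}, {y2}]"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    demo = {"image_path": "shot.png", "detections": [
        {"label": "button", "confidence": 0.91, "bbox": [100, 200, 180, 230]},
    ]}
    print(vision_context(demo))
```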
🔮 Future Roadmap
- PDF Script Documentation: Generate comprehensive PDF reports for generated test scripts, with execution details and screenshots
- Script Summarization: AI-powered summaries of generated automation scripts, highlighting key actions and validation points
- Enhanced Manual Action Override: Improved human-in-the-loop support for manual intervention during automation
- Performance Optimization: Faster element detection and action execution
- Error Handling: Better recovery from browser automation failures
- Test Coverage: Expanded unit and integration test suite
🛠️ Technical Stack
- LangGraph: Agent orchestration and state management
- Playwright: Browser automation engine behind the 25+ registered tools
- Claude 3 Opus/Haiku: Natural language reasoning and planning
- YOLOv11: Computer vision for UI element detection
- Python 3.10+: Core implementation, fully type-annotated and checked with mypy
- Pydantic: Data validation and settings management
Looking for feedback, use cases, and contributions! What browser automation challenges could this help solve for your projects? 🤔
Feel free to star ⭐ the repo if you find this interesting!
🏷️ Tags
#langgraph #browser-automation #playwright #ai-agents #test-automation #natural-language #python #claude #computer-vision #pdf-generation