This repository was archived by the owner on Jul 3, 2025. It is now read-only.
After reading through the source code, I understand that the current workflow is as follows:
1. The system captures a screenshot of the page and retrieves element information.
2. The screenshot and element information are sent to the LLM.
3. The LLM analyzes the screenshot and decides which element to interact with, returning that element's index.
4. The controller uses the element index to look up the element's coordinates in the known list of elements.
5. The system uses these coordinates to perform the action.
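To check my understanding, here is a minimal sketch of the index-to-coordinates step described above. It is not the project's actual code: `InteractiveElement`, `BrowserState`, and `resolve_click` are hypothetical names, and the bounding-box fields are an assumption about what the element information contains.

```python
from dataclasses import dataclass, field


@dataclass
class InteractiveElement:
    """Hypothetical record for one element reported by the browser."""
    index: int
    tag: str
    x: float       # left edge of the bounding box, in pixels
    y: float       # top edge of the bounding box, in pixels
    width: float
    height: float

    def center(self) -> tuple[float, float]:
        # Click in the middle of the bounding box.
        return (self.x + self.width / 2, self.y + self.height / 2)


@dataclass
class BrowserState:
    """Hypothetical snapshot sent to the LLM: screenshot plus element list."""
    screenshot: bytes
    interactive_elements: list[InteractiveElement] = field(default_factory=list)


def resolve_click(state: BrowserState, element_index: int) -> tuple[float, float]:
    """Translate the index chosen by the LLM into click coordinates."""
    by_index = {el.index: el for el in state.interactive_elements}
    return by_index[element_index].center()


state = BrowserState(
    screenshot=b"",
    interactive_elements=[
        InteractiveElement(index=0, tag="a", x=10, y=10, width=50, height=20),
        InteractiveElement(index=1, tag="button", x=100, y=200, width=80, height=40),
    ],
)
# Suppose the LLM answered with element index 1.
click_point = resolve_click(state, 1)
```

The point of the sketch is that the LLM never produces coordinates itself; everything depends on the element list being complete.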
If the browser cannot return element information, then even when the LLM correctly identifies the target element from the screenshot, the system cannot translate the LLM's decision (an element index) into actual coordinates, because:
- `state.interactive_elements` will be empty.
- When the LLM returns an element index, the `click_element` method cannot find the corresponding element or its coordinates.
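The failure can be illustrated with a small sketch. `find_click_target` is a hypothetical stand-in for the lookup that `click_element` performs, and the dictionary layout is an assumption, not the real data structure.

```python
def find_click_target(interactive_elements: list[dict], element_index: int) -> tuple[int, int]:
    """Hypothetical lookup: return the (x, y) of the element with the given index."""
    for element in interactive_elements:
        if element["index"] == element_index:
            return (element["x"], element["y"])
    # Nothing to map the index to, so no coordinates can be produced.
    raise LookupError(f"no interactive element with index {element_index}")


# The browser returned no element information for this page.
interactive_elements: list[dict] = []

try:
    find_click_target(interactive_elements, 3)
    lookup_failed = False
except LookupError:
    # The LLM's choice was valid visually, but it cannot be acted on.
    lookup_failed = True
```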
Is there any consideration for using a pure visual mode to drive the workflow in the future?
In my previous experiments, I found that the element indices provided by the browser are unreliable: key elements are easily missed. For that reason, I believe a pure visual mode, either as an option or as the primary mode, is a promising direction.
Additionally, I would like to know if there is any intention to develop this feature.
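To make the request concrete, here is a rough sketch of what I mean by a pure visual mode: the LLM returns a point on the screenshot and the system clicks there, with no element list involved. Everything here is an assumption on my side, including the function name `to_viewport_point` and the convention that the model answers with coordinates normalized to the range 0 to 1.

```python
def to_viewport_point(
    norm_x: float,
    norm_y: float,
    viewport_width: int,
    viewport_height: int,
) -> tuple[int, int]:
    """Convert a normalized screenshot point from the LLM into viewport pixels.

    Assumes the screenshot covers exactly the visible viewport.
    """
    if not (0.0 <= norm_x <= 1.0 and 0.0 <= norm_y <= 1.0):
        raise ValueError("normalized coordinates must lie within [0, 1]")
    return (round(norm_x * viewport_width), round(norm_y * viewport_height))


# Suppose the LLM pointed at the horizontal middle, a quarter of the way down.
x, y = to_viewport_point(0.5, 0.25, viewport_width=1280, viewport_height=720)
```

The resulting point could then be passed to whatever mouse-click primitive the browser driver exposes, so the action no longer depends on `state.interactive_elements`.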