Regarding the Use of a Pure Visual Mode to Drive Workflows

After reading through the source code, I understand that the current workflow is as follows:

- The system captures a screenshot of the page and retrieves element information.
- The screenshot and element information are sent to the LLM.
- The LLM analyzes the screenshot and decides which element to interact with (returns the element index).
- The controller retrieves the coordinates of the element from the known list of elements based on the element index.
- The system uses these coordinates to perform the action.
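To make sure I am reading the flow correctly, here is a minimal sketch of the index-to-coordinates step as I understand it. The `Element` class and `resolve_click_point` function are hypothetical names of my own, not the project's actual API, and clicking the center of the bounding box is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical record for one interactive element reported by the browser."""
    index: int
    x: float       # left edge of the bounding box, in pixels
    y: float       # top edge of the bounding box, in pixels
    width: float
    height: float


def resolve_click_point(elements: list[Element], index: int) -> tuple[float, float]:
    """Translate the element index chosen by the LLM into a click point.

    Returns the center of the matching element's bounding box and raises
    LookupError if no element with that index is known.
    """
    for element in elements:
        if element.index == index:
            return (element.x + element.width / 2, element.y + element.height / 2)
    raise LookupError(f"no element with index {index}")


if __name__ == "__main__":
    known = [Element(index=0, x=10, y=20, width=100, height=40)]
    print(resolve_click_point(known, 0))
```

The point of the sketch is that the LLM never produces coordinates itself; everything depends on the element list being complete.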

If the browser cannot return element information, the system cannot translate the LLM's decision (an element index) into actual coordinates, even when the LLM correctly identifies the target element from the screenshot. This is because:

- state.interactive_elements will be empty.
- When the LLM returns an element index, the click_element method will not be able to find the corresponding element and its coordinates.
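The failure can be illustrated with a small stand-alone sketch. Here `click_by_index` is a hypothetical stand-in for the real `click_element` method, and modeling `interactive_elements` as a dictionary from index to coordinates is my assumption, not the project's actual data structure.

```python
def click_by_index(
    interactive_elements: dict[int, tuple[float, float]], index: int
) -> tuple[float, float]:
    """Hypothetical stand-in for click_element: look up coordinates by index."""
    if index not in interactive_elements:
        # With an empty mapping, every index chosen by the LLM ends up here.
        raise LookupError(f"element index {index} not found")
    return interactive_elements[index]


if __name__ == "__main__":
    try:
        # The browser returned no element information, so the mapping is empty.
        click_by_index({}, 3)
    except LookupError as exc:
        print(f"cannot act on the LLM's decision: {exc}")
```

Even if the LLM picked the right target visually, the lookup has nothing to resolve against, so the action cannot be performed.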

Is there any consideration for using a pure visual mode to drive the workflow in the future?

In my previous experiments, I found that element extraction from the browser is unreliable: key elements are easily missed, so the LLM has no index to refer to for them. I therefore believe that a pure visual mode, possibly as the primary mode, is a promising direction.
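As a rough illustration of what I mean by a pure visual mode, the LLM could return a position on the screenshot directly, and the controller would only need to map it to viewport pixels. This is a sketch under my own assumptions: the normalized `(nx, ny)` output format and the `to_viewport_point` helper are hypothetical and do not exist in the project.

```python
def to_viewport_point(
    nx: float, ny: float, viewport_width: int, viewport_height: int
) -> tuple[int, int]:
    """Map normalized screenshot coordinates (0.0 to 1.0) to viewport pixels.

    Assumes the LLM returns a point relative to the screenshot, with (0, 0)
    at the top-left corner and (1, 1) at the bottom-right corner.
    """
    if not (0.0 <= nx <= 1.0 and 0.0 <= ny <= 1.0):
        raise ValueError(f"normalized coordinates out of range: ({nx}, {ny})")
    # Clamp so that a value of exactly 1.0 still lands inside the viewport.
    x = min(int(nx * viewport_width), viewport_width - 1)
    y = min(int(ny * viewport_height), viewport_height - 1)
    return x, y


if __name__ == "__main__":
    print(to_viewport_point(0.5, 0.25, 1280, 720))
```

With this approach no element list is needed at all, at the cost of depending entirely on how accurately the model can localize targets in the image.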

I would also like to know whether there are concrete plans to develop this feature.

Regarding the Use of a Pure Visual Mode to Drive Workflows #29

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Regarding the Use of a Pure Visual Mode to Drive Workflows #29

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions