This repository was archived by the owner on Jul 3, 2025. It is now read-only.

Regarding the Use of a Pure Visual Mode to Drive Workflows #29

@Explorer1092

After reading through the source code, I understand that the current workflow is as follows:

  • The system captures a screenshot of the page and retrieves element information.
  • The screenshot and element information are sent to the LLM.
  • The LLM analyzes the screenshot and decides which element to interact with (returns the element index).
  • The controller retrieves the coordinates of the element from the known list of elements based on the element index.
  • The system uses these coordinates to perform the action.
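
The index-to-coordinates step described above can be sketched roughly as follows. This is only an illustration of the idea, not the project's actual code: the `Element` and `State` classes, their fields, and the `resolve_click` helper are hypothetical names chosen for the example; only `interactive_elements` is taken from the source.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Element:
    # Hypothetical record for one interactive element reported by the browser.
    index: int
    x: float
    y: float


@dataclass
class State:
    # Hypothetical page state: a screenshot plus the indexed element list.
    screenshot: bytes = b""
    interactive_elements: List[Element] = field(default_factory=list)


def resolve_click(state: State, element_index: int) -> Tuple[float, float]:
    """Translate the element index chosen by the LLM into page coordinates."""
    for element in state.interactive_elements:
        if element.index == element_index:
            return (element.x, element.y)
    # The index cannot be translated if the browser did not report the element.
    raise LookupError(f"no interactive element with index {element_index}")
```

With a populated list, `resolve_click(State(interactive_elements=[Element(3, 10.0, 20.0)]), 3)` gives the stored coordinates. With an empty list, the same call raises `LookupError`, no matter what the model saw in the screenshot.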

If the browser cannot return element information, the system cannot translate the LLM's decision (an element index) into actual coordinates, even when the LLM correctly identifies the target element from the screenshot. The reasons are:

  • state.interactive_elements will be empty.
  • When the LLM returns an element index, the click_element method will not be able to find the corresponding element and its coordinates.

Is there any consideration for using a pure visual mode to drive the workflow in the future?

In my previous experiments, I found that the element indices provided by the browser are unreliable: key elements are easily missing from the list. I therefore believe that a pure visual mode, either as an option or as the primary mode, is a promising direction.
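
To make the suggestion concrete, here is a minimal sketch of one way a visual path could sit alongside the index-based one. It rests on several assumptions that are not part of the current code: that the model can be asked to return a click point as normalized (0 to 1) screenshot coordinates, and that elements are available as `(index, x, y)` tuples. The names `to_pixels` and `choose_click_point` are hypothetical. It also shows the visual point as a fallback rather than as the primary mode, purely to keep the example short.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[int, int]


def to_pixels(norm_x: float, norm_y: float, width: int, height: int) -> Point:
    """Convert a normalized (0..1) point from the model into viewport pixels."""
    if not (0.0 <= norm_x <= 1.0 and 0.0 <= norm_y <= 1.0):
        raise ValueError("normalized coordinates must lie in [0, 1]")
    # Clamp so that exactly 1.0 still maps to the last valid pixel.
    px = min(int(norm_x * width), width - 1)
    py = min(int(norm_y * height), height - 1)
    return (px, py)


def choose_click_point(
    elements: Sequence[Tuple[int, int, int]],
    element_index: Optional[int],
    visual_point: Optional[Tuple[float, float]],
    viewport: Tuple[int, int],
) -> Optional[Point]:
    """Use the index lookup when it resolves; otherwise use the visual point."""
    if element_index is not None:
        for index, x, y in elements:
            if index == element_index:
                return (x, y)
    if visual_point is not None:
        return to_pixels(visual_point[0], visual_point[1], viewport[0], viewport[1])
    # Neither the element list nor the model supplied a usable target.
    return None
```

For example, with an empty element list, `choose_click_point([], 4, (0.25, 0.5), (1280, 720))` resolves to a pixel position from the screenshot alone, so the action no longer depends on the browser reporting the element.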

I would also like to know whether there are any plans to develop this feature.
