[Feature Request] Add icon descriptions in visual prompt of interactive elements detection #24

Open
@dandansamax

Description

Required prerequisites

  • I have searched the Issue Tracker and confirmed that this hasn't already been reported. (+1 or comment there if it has.)

Motivation

The current object detection visual prompt (GroundingDINO) only finds the bounding box of each icon. We want a semantic description for each icon as well, to help the agent understand the UI.

Solution

A first step could be to use a VLLM to generate a description for each icon after the object detection pass.
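
A rough sketch of how this step might look, not tied to the project's actual code. It assumes the detector returns pixel boxes as (left, top, right, bottom) and that `describe` is a hypothetical callable wrapping whichever model ends up being used. The image is a plain list of pixel rows here only to keep the example dependency-free; `IconAnnotation`, `crop`, and `describe_icons` are illustrative names.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Assumed box format: (left, top, right, bottom) in pixels, right/bottom exclusive.
Box = Tuple[int, int, int, int]
# Hypothetical stand-in for a real image type: rows of pixel values.
Image = Sequence[Sequence[int]]


@dataclass
class IconAnnotation:
    """A detected icon box paired with its generated description."""
    box: Box
    description: str


def crop(image: Image, box: Box) -> List[List[int]]:
    """Cut the region covered by `box` out of `image`."""
    left, top, right, bottom = box
    return [list(row[left:right]) for row in image[top:bottom]]


def describe_icons(
    image: Image,
    boxes: Sequence[Box],
    describe: Callable[[List[List[int]]], str],
) -> List[IconAnnotation]:
    """Run the (hypothetical) describer on each detected icon crop."""
    annotations = []
    for box in boxes:
        patch = crop(image, box)
        annotations.append(IconAnnotation(box=box, description=describe(patch)))
    return annotations
```

Passing the describer in as a callable keeps the detection output and the description model decoupled, so the model could be swapped without touching the cropping logic.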

Additional context

No response
