Commit 1e65d64

Update readme with vision information (#135)
* Update readme with vision information
* fix notes
* reduce links for API reference
1 parent 3aa80a6 commit 1e65d64

1 file changed: +8 -8 lines changed

README.md

Lines changed: 8 additions & 8 deletions
@@ -22,20 +22,18 @@
 - [Intro](#intro)
 - [Getting Started](#getting-started)
 - [API Reference](#api-reference)
-- [Stagehand()](#stagehand)
 - [act()](#act)
 - [extract()](#extract)
 - [observe()](#observe)
-- [page and context](#page-and-context)
-- [log()](#log)
 - [Model Support](#model-support)
 - [How It Works](#how-it-works)
 - [Roadmap](#roadmap)
 - [Contributing](#contributing)
 - [Acknowledgements](#acknowledgements)
 - [License](#license)
 
-> [!NOTE] > `Stagehand` is currently available as an early release, and we're actively seeking feedback from the community. Please join our [Slack community](https://join.slack.com/t/stagehand-dev/shared_invite/zt-2tdncfgkk-fF8y5U0uJzR2y2_M9c9OJA) to stay updated on the latest developments and provide feedback.
+> [!NOTE]
+> `Stagehand` is currently available as an early release, and we're actively seeking feedback from the community. Please join our [Slack community](https://join.slack.com/t/stagehand-dev/shared_invite/zt-2tdncfgkk-fF8y5U0uJzR2y2_M9c9OJA) to stay updated on the latest developments and provide feedback.
 
 ## Intro

@@ -173,7 +171,7 @@ This constructor is used to create an instance of Stagehand.
 
 - `action`: a `string` describing the action to perform, e.g., `"search for 'x'"`.
 - `modelName`: (optional) an `AvailableModel` string to specify the model to use.
-- `useVision`: (optional) a `boolean` or `"fallback"` to determine if vision-based processing should be used.
+- `useVision`: (optional) a `boolean` or `"fallback"` to determine if vision-based processing should be used. Defaults to `"fallback"`.
 
 - **Returns:**

@@ -222,6 +220,7 @@ If you are looking for a specific element, you can also pass in an instruction t
 - **Arguments:**
 
 - `instruction`: a `string` providing instructions for the observation.
+- `useVision`: (optional) a `boolean` or `"fallback"` to determine if vision-based processing should be used. Defaults to `"fallback"`.
 
 - **Returns:**
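
The `useVision` argument documented in the two hunks above accepts `true`, `false`, or `"fallback"`, and the README's Vision section describes `"fallback"` as retrying with vision when no element is identified. The sketch below is one way to picture that control flow; it is not Stagehand's implementation. `UseVision`, `planAttempts`, `locateWithFallback`, and the `locate` callback are hypothetical names introduced for illustration, and `locate` is synchronous only to keep the example self-contained (a real LLM call would be asynchronous).

```typescript
// Hypothetical sketch: none of these names come from the Stagehand codebase.
type UseVision = boolean | "fallback";

// Returns the ordered attempts as "use a screenshot?" flags.
// "fallback" tries text-only first and retries with vision.
function planAttempts(useVision: UseVision = "fallback"): boolean[] {
  if (useVision === "fallback") return [false, true];
  return [useVision];
}

// `locate` stands in for the LLM call that picks an element; it returns
// an element id, or null when no element could be identified.
function locateWithFallback(
  locate: (withVision: boolean) => string | null,
  useVision: UseVision = "fallback",
): { elementId: string; usedVision: boolean } | null {
  for (const withVision of planAttempts(useVision)) {
    const elementId = locate(withVision);
    if (elementId !== null) {
      return { elementId, usedVision: withVision };
    }
  }
  return null;
}

// Usage: a stand-in locator that only succeeds when given a screenshot.
const visionOnly = (withVision: boolean) => (withVision ? "el-7" : null);
console.log(locateWithFallback(visionOnly)); // succeeds on the vision retry
console.log(locateWithFallback(visionOnly, false)); // null: vision is never tried
```

Separating the attempt plan from the retry loop keeps the default (`"fallback"`) in one place, so the same plan could serve both `act()` and `observe()`.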

@@ -295,9 +294,6 @@ The SDK has two major phases:
 ### DOM processing
 
 Stagehand uses a combination of techniques to prepare the DOM.
-Stagehand only uses text input as of this version, but the release of `gpt-4o` incorporating vision is attractive.
-
-\*_[update before release_]\*
 
 The DOM Processing steps look as follows:

@@ -316,6 +312,10 @@ While LLMs will continue to get bigger context windows and improve latency, givi
 
 ![](./docs/media/chunks.png)
 
+### Vision
+
+The `act()` and `observe()` methods can take a `useVision` flag. If this is set to `true`, the LLM will be provided with an annotated screenshot of the current page to identify which elements to act on. This is useful for complex DOMs that the LLM has a hard time reasoning about, even after processing and chunking. By default, this flag is set to `"fallback"`, which means that if the LLM fails to successfully identify a single element, Stagehand will retry the attempt using vision.
+
 ### LLM analysis
 
 Now we have a list of candidate elements and a way to select them. We can present those elements with additional context to the LLM for extraction or action. While untested on a large scale, presenting a "numbered list of elements" guides the model to not treat the context as a full DOM, but as a list of related but independent elements to operate on.
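
As a rough illustration of the "numbered list of elements" idea, the sketch below renders candidates as indexed lines and maps a model's numeric reply back to one candidate. The `Candidate` shape, the line format, and both function names are assumptions made for this example, not Stagehand's actual prompt format.

```typescript
// Hypothetical candidate shape; the real SDK may track different fields.
interface Candidate {
  tag: string;
  text: string;
  selector: string;
}

// Render candidates as independent, indexed lines rather than nested DOM.
function formatCandidates(candidates: Candidate[]): string {
  return candidates
    .map((c, i) => `${i}: <${c.tag}> ${c.text}`)
    .join("\n");
}

// Map the model's reply (expected to be a bare index) back to a candidate.
// Returns null for non-numeric or out-of-range replies.
function resolveChoice(reply: string, candidates: Candidate[]): Candidate | null {
  const trimmed = reply.trim();
  if (!/^\d+$/.test(trimmed)) return null;
  return candidates[Number(trimmed)] ?? null;
}

const candidates: Candidate[] = [
  { tag: "input", text: "Search", selector: "#q" },
  { tag: "button", text: "Submit", selector: "#go" },
];
console.log(formatCandidates(candidates));
// 0: <input> Search
// 1: <button> Submit
console.log(resolveChoice("1", candidates)?.selector); // #go
```

Because each line stands alone, the model only has to return an index, which is easy to validate before acting on the page.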
