Update readme with vision information (#135)

pkiv · web-flow · commit 1e65d6490f5d · 2024-10-28T04:44:32.000-07:00
* Update readme with vision information

* fix notes

* reduce links for API reference
diff --git a/README.md b/README.md
@@ -22,20 +22,18 @@
 - [Intro](#intro)
 - [Getting Started](#getting-started)
 - [API Reference](#api-reference)
-  - [Stagehand()](#stagehand)
   - [act()](#act)
   - [extract()](#extract)
   - [observe()](#observe)
-  - [page and context](#page-and-context)
-  - [log()](#log)
 - [Model Support](#model-support)
 - [How It Works](#how-it-works)
 - [Roadmap](#roadmap)
 - [Contributing](#contributing)
 - [Acknowledgements](#acknowledgements)
 - [License](#license)
 
-> [!NOTE] > `Stagehand` is currently available as an early release, and we're actively seeking feedback from the community. Please join our [Slack community](https://join.slack.com/t/stagehand-dev/shared_invite/zt-2tdncfgkk-fF8y5U0uJzR2y2_M9c9OJA) to stay updated on the latest developments and provide feedback.
+> [!NOTE]
+> `Stagehand` is currently available as an early release, and we're actively seeking feedback from the community. Please join our [Slack community](https://join.slack.com/t/stagehand-dev/shared_invite/zt-2tdncfgkk-fF8y5U0uJzR2y2_M9c9OJA) to stay updated on the latest developments and provide feedback.
 
 ## Intro
 
@@ -173,7 +171,7 @@ This constructor is used to create an instance of Stagehand.
 
   - `action`: a `string` describing the action to perform, e.g., `"search for 'x'"`.
   - `modelName`: (optional) an `AvailableModel` string to specify the model to use.
-  - `useVision`: (optional) a `boolean` or `"fallback"` to determine if vision-based processing should be used.
+  - `useVision`: (optional) a `boolean` or `"fallback"` to determine if vision-based processing should be used. Defaults to `"fallback"`.
 
 - **Returns:**
 
@@ -222,6 +220,7 @@ If you are looking for a specific element, you can also pass in an instruction t
 - **Arguments:**
 
   - `instruction`: a `string` providing instructions for the observation.
+  - `useVision`: (optional) a `boolean` or `"fallback"` to determine if vision-based processing should be used. Defaults to `"fallback"`.
 
 - **Returns:**
 
@@ -295,9 +294,6 @@ The SDK has two major phases:
 ### DOM processing
 
 Stagehand uses a combination of techniques to prepare the DOM.
-Stagehand only uses text input as of this version, but the release of `gpt-4o` incorporating vision is attractive.
-
-\*_[update before release_]\*
 
 The DOM Processing steps look as follows:
 
@@ -316,6 +312,10 @@ While LLMs will continue to get bigger context windows and improve latency, givi
 
 ![](./docs/media/chunks.png)
 
+### Vision
+
+The `act()` and `observe()` methods can take a `useVision` flag. If this is set to `true`, the LLM will be provided with an annotated screenshot of the current page to identify which elements to act on. This is useful for complex DOMs that the LLM has a hard time reasoning about, even after processing and chunking. By default, this flag is set to `"fallback"`, which means that if the LLM fails to identify a single element, Stagehand will retry the attempt using vision.
+
 ### LLM analysis
 
 Now we have a list of candidate elements and a way to select them. We can present those elements with additional context to the LLM for extraction or action. While untested at a large scale, presenting a "numbered list of elements" guides the model to not treat the context as a full DOM, but as a list of related but independent elements to operate on.
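
Reviewer note: the `useVision` semantics documented in this change (`true`, `false`, and the default `"fallback"`) can be read as a small retry policy. The sketch below is only an illustration of that policy under stated assumptions, not Stagehand's implementation: `UseVision`, `Locator`, `LocateResult`, and `locateWithVisionPolicy` are hypothetical names, and the locator is a synchronous stub standing in for the real LLM call.

```typescript
// Hypothetical names for illustration only; these are not Stagehand internals.
type UseVision = boolean | "fallback";

// A locator tries to identify a single element.
// It returns an element id on success, or null on failure.
type Locator = (withVision: boolean) => string | null;

interface LocateResult {
  elementId: string | null;
  usedVision: boolean;
  attempts: number;
}

function locateWithVisionPolicy(
  locate: Locator,
  useVision: UseVision = "fallback", // mirrors the documented default
): LocateResult {
  // true: start with the annotated screenshot.
  // false or "fallback": start with the text-only (DOM) attempt.
  const firstWithVision = useVision === true;
  const first = locate(firstWithVision);

  // Stop after one attempt on success, or when no fallback was requested.
  if (first !== null || useVision !== "fallback") {
    return { elementId: first, usedVision: firstWithVision, attempts: 1 };
  }

  // "fallback": the text-only attempt failed, so retry once with vision.
  const second = locate(true);
  return { elementId: second, usedVision: true, attempts: 2 };
}

// Usage: a stub locator that only succeeds when vision is enabled.
const visionOnly: Locator = (withVision) => (withVision ? "element-7" : null);
const fallbackResult = locateWithVisionPolicy(visionOnly);
const textOnlyResult = locateWithVisionPolicy(visionOnly, false);
const visionFirstResult = locateWithVisionPolicy(visionOnly, true);
```

With this stub, the default policy needs two attempts and ends up using vision, `false` gives up after one text-only attempt, and `true` succeeds on the first attempt. Whether the real retry happens once or several times is not stated in the README text, so the single retry here is an assumption.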