46 commits
76cc39b
allow free modelclientoptions
sameelarif Aug 29, 2025
28f4b6e
Merge branch 'main' into sameel/stg-692-azurebedrock-api-integration-…
sameelarif Sep 3, 2025
1a415bb
Merge branch 'main' into sameel/stg-692-azurebedrock-api-integration-…
sameelarif Sep 9, 2025
04fb315
change zod ver to working build
sameelarif Sep 10, 2025
9daa584
add playwright arguments to agent (#1066)
tkattkat Sep 10, 2025
f6f05b0
[docs] add info on not needing project id in browserbase session para…
chrisreadsf Sep 11, 2025
c886544
Export aisdk (#1058)
chrisreadsf Sep 15, 2025
87505a3
docs: update fingerprint settings to reflect the new session create c…
Kylejeong2 Sep 15, 2025
8eccd56
send client options on every request
sameelarif Sep 15, 2025
3c39a05
[docs] export aisdk (#1074)
chrisreadsf Sep 16, 2025
bf2d0e7
Fix zod peer dependency support (#1032)
miguelg719 Sep 16, 2025
7f38b3a
add stagehand agent to api (#1077)
tkattkat Sep 16, 2025
3a0dc58
add playwright screenshot option for browserbase env (#1070)
derekmeegan Sep 17, 2025
b7be89e
add webbench, chrome-based OS world, and ground truth to web voyager …
filip-michalsky Sep 18, 2025
df76f7a
Fix python installation instructions (#1087)
rsbryan Sep 19, 2025
b9c8102
update xpath in `observe_vantechjournal` (#1088)
seanmcguire12 Sep 20, 2025
536f366
Fix session create logs on api (#1089)
miguelg719 Sep 21, 2025
8ff5c5a
Improve failed act logs (#1090)
miguelg719 Sep 21, 2025
569e444
[docs] add aisdk workaround before npm release + add versions to work…
chrisreadsf Sep 22, 2025
8c0fd01
pass stagehand, instead of stagehandPage to agent (#1082)
tkattkat Sep 22, 2025
dc2d420
img diff algo for screenshots (#1072)
filip-michalsky Sep 23, 2025
c6a752d
test bedrock file
filip-michalsky Sep 23, 2025
2931804
add azure test file
filip-michalsky Sep 23, 2025
f89b13e
Eval metadata (#1092)
miguelg719 Sep 23, 2025
2f3b8b9
fix bedrock test
sameelarif Sep 24, 2025
be8b7a4
Merge branch 'main' into sameel/stg-692-azurebedrock-api-integration-…
sameelarif Sep 25, 2025
27c722c
Update pnpm-lock.yaml
sameelarif Sep 25, 2025
108de3c
update evals cli docs (#1096)
miguelg719 Sep 26, 2025
69c3d93
better modelclientoption api handling
sameelarif Sep 26, 2025
467dade
dont override region
sameelarif Sep 26, 2025
0735ca3
fix bedrock example
sameelarif Sep 26, 2025
e0e6b30
adding support for new claude 4.5 sonnet agent model (#1099)
Kylejeong2 Sep 29, 2025
76b44ae
lint
sameelarif Sep 30, 2025
18937ee
read aws creds from client options obj
sameelarif Sep 30, 2025
889cb6c
properly convert custom / mcp tools to anthropic cua format (#1103)
tkattkat Oct 1, 2025
a99aa48
Add current date and page url to agent context (#1102)
miguelg719 Oct 1, 2025
a1ad06c
Additional agent logging (#1104)
miguelg719 Oct 1, 2025
0af4acf
update evals cli docs (#1096)
miguelg719 Sep 26, 2025
c762944
adding support for new claude 4.5 sonnet agent model (#1099)
Kylejeong2 Sep 29, 2025
4bd7412
properly convert custom / mcp tools to anthropic cua format (#1103)
tkattkat Oct 1, 2025
ce07cfa
Add current date and page url to agent context (#1102)
miguelg719 Oct 1, 2025
06ae0e6
Additional agent logging (#1104)
miguelg719 Oct 1, 2025
9fe40fd
fix system prompt
miguelg719 Oct 2, 2025
938b51c
remove dup log
miguelg719 Oct 2, 2025
607b4c3
pass modelClientOptions for stagehand agent
miguelg719 Oct 2, 2025
adec13c
Merge branch 'main' into sameel/stg-692-azurebedrock-api-integration-…
sameelarif Oct 3, 2025
5 changes: 5 additions & 0 deletions .changeset/curly-boats-push.md
@@ -0,0 +1,5 @@
---
"@browserbasehq/stagehand-evals": patch
---

improve evals screenshot service - add image-hash diffing for screenshots and switch to intercepting screenshots from the agent
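The image-hash diffing this changeset describes can be sketched roughly as follows. This is a hypothetical illustration, not the actual Stagehand evals code: `average_hash`, `hamming`, and `dedupe` are invented names, and real screenshots would be decoded images rather than raw pixel grids. The idea is to keep a screenshot only when its perceptual hash differs enough from the last kept one.

```python
# Hypothetical sketch of image-hash-based screenshot deduplication.
# Pixel grids stand in for decoded grayscale screenshots.

def average_hash(pixels: list[list[int]]) -> int:
    """Hash a grayscale grid: one bit per pixel, set if above the mean."""
    flat = [p for row in pixels for p in row]
    avg = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > avg else 0)
    return bits

def hamming(a: int, b: int) -> int:
    """Count differing bits between two hashes."""
    return bin(a ^ b).count("1")

def dedupe(frames: list[list[list[int]]], threshold: int = 2) -> list[int]:
    """Return indices of frames whose hash moved past the threshold."""
    kept: list[int] = []
    last_hash = None
    for i, frame in enumerate(frames):
        h = average_hash(frame)
        if last_hash is None or hamming(h, last_hash) > threshold:
            kept.append(i)
            last_hash = h
    return kept
```

For two identical frames followed by a visibly different one, only the first and third survive — repeated captures of an unchanged page are dropped.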
5 changes: 5 additions & 0 deletions .changeset/dark-crabs-repair.md
@@ -0,0 +1,5 @@
---
"@browserbasehq/stagehand-evals": minor
---

added web voyager ground truth (optional), added web bench, and subset of OSWorld evals which run on a browser
5 changes: 5 additions & 0 deletions .changeset/few-frogs-smoke.md
@@ -0,0 +1,5 @@
---
"@browserbasehq/stagehand": patch
---

Pass stagehand object to agent instead of stagehand page
5 changes: 5 additions & 0 deletions .changeset/fifty-windows-throw.md
@@ -0,0 +1,5 @@
---
"@browserbasehq/stagehand": patch
---

Fix logging for stagehand agent
5 changes: 5 additions & 0 deletions .changeset/icy-toes-obey.md
@@ -0,0 +1,5 @@
---
"@browserbasehq/stagehand": patch
---

Add playwright arguments to agent execute response
5 changes: 5 additions & 0 deletions .changeset/loud-waves-think.md
@@ -0,0 +1,5 @@
---
"@browserbasehq/stagehand": patch
---

adds support for stagehand agent in the api
5 changes: 5 additions & 0 deletions .changeset/many-rats-punch.md
@@ -0,0 +1,5 @@
---
"@browserbasehq/stagehand": patch
---

Fix for zod peer dependency support
5 changes: 5 additions & 0 deletions .changeset/purple-squids-know.md
@@ -0,0 +1,5 @@
---
"@browserbasehq/stagehand": patch
---

Fixed info logs on api session create
5 changes: 5 additions & 0 deletions .changeset/short-mirrors-switch.md
@@ -0,0 +1,5 @@
---
"@browserbasehq/stagehand": patch
---

patch custom tool support in anthropic cua client
5 changes: 5 additions & 0 deletions .changeset/tasty-candles-retire.md
@@ -0,0 +1,5 @@
---
"@browserbasehq/stagehand": patch
---

Improve failed act error logs
5 changes: 5 additions & 0 deletions .changeset/upset-ghosts-shout.md
@@ -0,0 +1,5 @@
---
"@browserbasehq/stagehand": patch
---

Add current page and date context to agent
1 change: 1 addition & 0 deletions .gitignore
@@ -1,3 +1,4 @@
CLAUDE.md
node_modules/
/test-results/
/playwright-report/
3 changes: 0 additions & 3 deletions CHANGELOG.md
@@ -233,15 +233,13 @@
We're thrilled to announce the release of Stagehand 2.0, bringing significant improvements to make browser automation more powerful, faster, and easier to use than ever before.

### 🚀 New Features

- **Introducing `stagehand.agent`**: A powerful new way to integrate SOTA Computer use models or Browserbase's [Open Operator](https://operator.browserbase.com) into Stagehand with one line of code! Perfect for multi-step workflows and complex interactions. [Learn more](https://docs.stagehand.dev/concepts/agent)
- **Lightning-fast `act` and `extract`**: Major performance improvements to make your automations run significantly faster.
- **Enhanced Logging**: Better visibility into what's happening during automation with improved logging and debugging capabilities.
- **Comprehensive Documentation**: A completely revamped documentation site with better examples, guides, and best practices.
- **Improved Error Handling**: More descriptive errors and better error recovery to help you debug issues faster.

### 🛠️ Developer Experience

- **Better TypeScript Support**: Enhanced type definitions and better IDE integration
- **Better Error Messages**: Clearer, more actionable error messages to help you debug faster
- **Improved Caching**: More reliable action caching for better performance
@@ -502,7 +500,6 @@
- [#316](https://github.com/browserbase/stagehand/pull/316) [`902e633`](https://github.com/browserbase/stagehand/commit/902e633e126a58b80b757ea0ecada01a7675a473) Thanks [@kamath](https://github.com/kamath)! - rename browserbaseResumeSessionID -> browserbaseSessionID

- [#296](https://github.com/browserbase/stagehand/pull/296) [`f11da27`](https://github.com/browserbase/stagehand/commit/f11da27a20409c240ceeea2003d520f676def61a) Thanks [@kamath](https://github.com/kamath)! - - Deprecate fields in `init` in favor of constructor options

- Deprecate `initFromPage` in favor of `browserbaseResumeSessionID` in constructor
- Rename `browserBaseSessionCreateParams` -> `browserbaseSessionCreateParams`

20 changes: 4 additions & 16 deletions docs/configuration/browser.mdx
@@ -114,7 +114,7 @@ stagehand = Stagehand(
apiKey: process.env.BROWSERBASE_API_KEY,
projectId: process.env.BROWSERBASE_PROJECT_ID,
browserbaseSessionCreateParams: {
projectId: process.env.BROWSERBASE_PROJECT_ID!, // Optional: automatically set if given in environment variable or by Stagehand parameter
proxies: true,
region: "us-west-2",
timeout: 3600, // 1 hour session timeout
@@ -124,17 +124,11 @@
blockAds: true,
solveCaptchas: true,
recordSession: false,
os: "windows", // Valid: "windows" | "mac" | "linux" | "mobile" | "tablet"
viewport: {
width: 1920,
height: 1080,
},
fingerprint: {
browsers: ["chrome", "edge"],
devices: ["desktop"],
operatingSystems: ["windows", "macos"],
locales: ["en-US", "en-GB"],
httpVersion: 2,
},
},
userMetadata: {
userId: "automation-user-123",
@@ -149,7 +143,7 @@ stagehand = Stagehand(
api_key=os.getenv("BROWSERBASE_API_KEY"),
project_id=os.getenv("BROWSERBASE_PROJECT_ID"),
browserbase_session_create_params={
"project_id": os.getenv("BROWSERBASE_PROJECT_ID"), # Optional: automatically set if given in environment or by Stagehand parameter
"proxies": True,
"region": "us-west-2",
"timeout": 3600, # 1 hour session timeout
@@ -159,17 +153,11 @@
"block_ads": True,
"solve_captchas": True,
"record_session": False,
"os": "windows", # "windows" | "mac" | "linux" | "mobile" | "tablet"
"viewport": {
"width": 1920,
"height": 1080,
},
"fingerprint": {
"browsers": ["chrome", "edge"],
"devices": ["desktop"],
"operating_systems": ["windows", "macos"],
"locales": ["en-US", "en-GB"],
"http_version": 2,
},
},
"user_metadata": {
"user_id": "automation-user-123",
116 changes: 100 additions & 16 deletions docs/configuration/evals.mdx
@@ -25,33 +25,114 @@ Evaluations help you understand how well your automation performs, which models

Evaluations help you systematically test and improve your automation workflows. Stagehand provides both built-in evaluations and tools to create your own.


We have two types of evals:
1. **Deterministic Evals** - These include unit tests, integration tests, and E2E tests that can be run without any LLM inference.
2. **LLM-based Evals** - These are evals that test the underlying functionality of Stagehand's AI primitives.


### Evals CLI
![Evals CLI](/media/evals-cli.png)

<Tip>
To run evals, you'll need to clone the [Stagehand repo](https://github.com/browserbase/stagehand) and set up the CLI.

We recommend using [Braintrust](https://www.braintrust.dev/docs/) to help visualize evals results and metrics.
</Tip>

The Stagehand CLI provides a powerful interface for running evaluations. You can run specific evals, categories, or external benchmarks with customizable settings.

Evals are grouped into:
1. **Act Evals** - These are evals that test the functionality of the `act` method.
2. **Extract Evals** - These are evals that test the functionality of the `extract` method.
3. **Observe Evals** - These are evals that test the functionality of the `observe` method.
4. **Combination Evals** - These are evals that test the functionality of the `act`, `extract`, and `observe` methods together.
5. **Experimental Evals** - These are experimental custom evals that test the functionality of the stagehand primitives.
6. **Agent Evals** - These are evals that test the functionality of `agent`.
7. **(NEW) External Benchmarks** - Run external benchmarks like WebBench, GAIA, WebVoyager, OnlineMind2Web, and OSWorld.

#### Installation

<Steps>
<Step title="Install Dependencies">
```bash
# From the stagehand root directory
pnpm install
```
</Step>

<Step title="Build the CLI">
```bash
pnpm run build:cli
```
</Step>

<Step title="Verify Installation">
```bash
evals help
```
</Step>
</Steps>

#### CLI Commands and Options

##### Basic Commands

```bash
# Run all evals
evals run all

# Run specific category
evals run act
evals run extract
evals run observe
evals run agent

# Run specific eval
evals run extract/extract_text

# List available evals
evals list
evals list --detailed

# Configure defaults
evals config
evals config set env browserbase
evals config set trials 5
```

##### Command Options

- **`-e, --env`**: Environment (`local` or `browserbase`)
- **`-t, --trials`**: Number of trials per eval (default: 3)
- **`-c, --concurrency`**: Max parallel sessions (default: 10)
- **`-m, --model`**: Model override
- **`-p, --provider`**: Provider override
- **`--api`**: Use Stagehand API instead of SDK
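The `-t`/`--trials` and `-c`/`--concurrency` options above interact in a standard way: the runner fans out evals × trials runs while capping how many sessions execute in parallel. A minimal sketch of that scheduling, assuming a semaphore-bounded asyncio runner (names like `run_eval` are illustrative, not the real CLI internals):

```python
import asyncio

# Hypothetical sketch of trials x concurrency scheduling for an eval runner.

async def run_eval(name: str, trial: int) -> tuple[str, int]:
    # Stand-in for launching a browser session and running one eval trial.
    await asyncio.sleep(0)
    return (name, trial)

async def run_all(evals: list[str], trials: int, concurrency: int):
    sem = asyncio.Semaphore(concurrency)  # -c / --concurrency cap

    async def bounded(name: str, trial: int):
        async with sem:
            return await run_eval(name, trial)

    # -t / --trials: every eval runs `trials` times
    tasks = [bounded(e, t) for e in evals for t in range(trials)]
    return await asyncio.gather(*tasks)

results = asyncio.run(run_all(["act/click", "extract/text"], trials=3, concurrency=10))
```

With 2 evals and 3 trials, 6 runs are scheduled, never more than 10 at once.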

##### Running External Benchmarks

The CLI supports several industry-standard benchmarks:

```bash
# WebBench with filters
evals run benchmark:webbench -l 10 -f difficulty=easy -f category=READ

# GAIA benchmark
evals run b:gaia -s 100 -l 25 -f level=1

# WebVoyager
evals run b:webvoyager -l 50

# OnlineMind2Web
evals run b:onlineMind2Web

# OSWorld
evals run b:osworld -f source=Mind2Web
```
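The repeated `-f key=value` flags in the benchmark commands above suggest a simple accumulation model: each flag adds one entry to a filter map. A hedged sketch of that parsing (illustrative only — `parse_filters` is not the actual CLI code):

```python
# Hypothetical sketch of parsing repeated -f key=value benchmark filters.

def parse_filters(args: list[str]) -> dict[str, str]:
    """Collect every -f key=value pair into a dict; other flags are ignored."""
    filters: dict[str, str] = {}
    it = iter(args)
    for arg in it:
        if arg == "-f":
            key, _, value = next(it).partition("=")
            filters[key] = value
    return filters

parse_filters(["-l", "10", "-f", "difficulty=easy", "-f", "category=READ"])
```

For the WebBench invocation above, this would yield `difficulty` and `category` entries while leaving `-l` to a separate limit option.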

#### Configuration Files

You can view the specific evals in [`evals/tasks`](https://github.com/browserbase/stagehand/tree/main/evals/tasks). Each eval is grouped into eval categories based on [`evals/evals.config.json`](https://github.com/browserbase/stagehand/blob/main/evals/evals.config.json).


#### Viewing eval results
@@ -65,7 +146,7 @@ You can use the Braintrust UI to filter by model/eval and aggregate results acro

### Deterministic Evals

To run deterministic evals, you can run `npm run e2e` from within the Stagehand repo. This will test the functionality of Playwright within Stagehand to make sure it's working as expected.

These tests are in [`evals/deterministic`](https://github.com/browserbase/stagehand/tree/main/evals/deterministic) and test on both Browserbase browsers and local headless Chromium browsers.

@@ -139,10 +220,13 @@ Update `evals/evals.config.json`:
<Step title="Run Your Evaluation">
```bash
# Test your custom evaluation
evals run custom_task_name

# Run the entire custom category
evals run custom

# Run with specific settings
evals run custom_task_name -e browserbase -t 5 -m gpt-4o
```
</Step>
</Steps>
43 changes: 33 additions & 10 deletions docs/configuration/models.mdx
@@ -156,47 +156,70 @@ stagehand = Stagehand(
## Custom LLM Integration

<Note>
Only [LiteLLM compatible providers](https://docs.litellm.ai/docs/providers) are available in Python. Some may require extra setup.
</Note>

Integrate any LLM with Stagehand using custom clients. The only requirement is **structured output support** for consistent automation behavior.

### Vercel AI SDK
The [Vercel AI SDK](https://sdk.vercel.ai/providers/ai-sdk-providers) is a popular library for interacting with LLMs. You can use any of the providers supported by the Vercel AI SDK to create a client for your model, **as long as they support structured outputs**.

Vercel AI SDK supports providers for OpenAI, Anthropic, and Google, along with support for **Amazon Bedrock** and **Azure OpenAI**. For a full list, see the [Vercel AI SDK providers page](https://sdk.vercel.ai/providers/ai-sdk-providers).

To get started, you'll need to install the `ai` package (version 4) and the provider you want to use (version 1 - both need to be compatible with LanguageModelV1). For example, to use Amazon Bedrock, you'll need to install the `@ai-sdk/amazon-bedrock` package.

You'll also need to import the [Vercel AI SDK external client](https://github.com/browserbase/stagehand/blob/main/lib/llm/aisdk.ts) through Stagehand to create a client for your model.

<Tabs>
<Tab title="npm">
```bash
npm install ai@4 @ai-sdk/amazon-bedrock@1
```
</Tab>

<Tab title="pnpm">
```bash
pnpm install ai@4 @ai-sdk/amazon-bedrock@1
```
</Tab>

<Tab title="yarn">
```bash
yarn add ai@4 @ai-sdk/amazon-bedrock@1
```
</Tab>
</Tabs>

<Note>
The `AISdkClient` is not yet available via the Stagehand npm package. For now, install Stagehand from its git repository to access the `AISdkClient` (it will be included in the npm package in an upcoming release).
</Note>

<Tabs>
<Tab title="npm">
```bash
npm install @browserbasehq/stagehand@git+https://github.com/browserbase/stagehand.git
```
</Tab>

<Tab title="pnpm">
```bash
pnpm install @browserbasehq/stagehand@git+https://github.com/browserbase/stagehand.git
```
</Tab>

<Tab title="yarn">
```bash
yarn add @browserbasehq/stagehand@git+https://github.com/browserbase/stagehand.git
```
</Tab>
</Tabs>

```ts
// Install/import the provider you want to use.
// For example, to use Azure OpenAI, import { createAzure } from '@ai-sdk/azure';
import { bedrock } from "@ai-sdk/amazon-bedrock";
// @ts-ignore
import { AISdkClient } from "@browserbasehq/stagehand";

const stagehand = new Stagehand({
llmClient: new AISdkClient({