Benchforce is a flexible, extensible framework for evaluating text- and voice-based agents in real time or in multi-leg scenarios. It supports:
- Multiple interaction modes (text, real-time speech, multi-turn dialogues)
- Third-party and custom models (OpenAI, Google Gemini, xAI Grok, Anthropic Claude, Meta Llama, etc.)
- Streaming audio/text with fine-grained function-call instrumentation
- Comprehensive packet logging (JSONL transcripts, full `.wav` recordings)
- Built-in metrics (accuracy, latency, etc.) with easy plug-in of custom metrics
Clone the repository:

```bash
git clone git@github.com:SalesforceAIResearch/benchforce.git
cd benchforce
```
Edit `config.yaml` to choose (see the sketch after this list):
- environment (e.g. `"appointments_management"`)
- entries (`[-1]` = all tasks, or a list of task indices)
- metrics (e.g. `["accuracy"]`)
- agent models and voice settings
- threading, max turns, noise, etc.
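For orientation, here is a minimal sketch of what `config.yaml` might contain. The field names and layout are assumptions inferred from the options above, not the authoritative schema; consult the `config.yaml` shipped with the repo.

```yaml
# Hypothetical sketch -- field names inferred from the option list above
environment: "appointments_management"
entries: [-1]          # -1 = run all tasks; otherwise a list of indices, e.g. [10, 11, 12]
metrics: ["accuracy"]
# agent model / voice settings, threading, max turns, noise, etc. are also set here
```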
Copy `.env.example` to `.env` and fill in the relevant API keys, such as `ELEVENLABS_API_KEY` or `OPENAI_API_KEY`.
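For example, after `cp .env.example .env`, your `.env` might contain the following (placeholder values; the two key names come from above, and other providers will need their own keys):

```bash
# .env -- values are placeholders
OPENAI_API_KEY=sk-...
ELEVENLABS_API_KEY=...
```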
Create a Python environment:

```bash
uv venv --python 3.12
source .venv/bin/activate
uv pip install -r requirements.txt
```
Run the evaluation:

```bash
python run.py --agent openai
```
High-level results are displayed in a text box:

```
┌──────────────────────────────────────────────┐
│ Evaluation results:                          │
│ metric     parameter          value          │
│ accuracy   Total Test Cases   3              │
│ accuracy   Accuracy           100% (3/3)     │
└──────────────────────────────────────────────┘
```
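The `--agent` flag presumably accepts identifiers for the other supported model providers as well; the exact names depend on the repo's agent registry, so treat these as illustrative:

```bash
# Hypothetical -- valid --agent values depend on the repo's agent registry
python run.py --agent gemini
python run.py --agent claude
```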
Re-run a previous run to fill in any gaps:

```bash
# Re-run the given runner_id with the current setup in config.yaml.
# The runner_id must still have its files on disk from a previous run.
python run.py --runner_id SOME_EXISTING_RUNNER_ID
```
Specify entries to run (overrides `config.yaml`):

```bash
python run.py --entry_nums 10 11 12
```
Verbose logging:

```bash
LOG_LEVEL=DEBUG python run.py
```
Per-logger verbose logging (logger names are uppercased):

```bash
LOG_LEVEL_DEEPGRAMSTT=DEBUG \
LOG_LEVEL_ELEVENLABSTTS=DEBUG \
python run.py --skip_upload
```