This is the official repository for the paper "UserBench: An Interactive Gym Environment for User-Centric Agents".

UserBench is an evaluation environment for testing language models on multi-turn travel-planning tasks. This open-source implementation provides a framework for evaluating how well language models understand user preferences, make appropriate function calls, and provide personalized recommendations in travel-planning scenarios.
```bash
# Clone the repository
git clone https://github.com/SalesforceAIResearch/UserBench.git
cd UserBench

# Install dependencies
pip install -r requirements.txt

# Set up your API keys (choose one or more)
export OPENAI_API_KEY="your-openai-key-here"
```
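If you want to confirm a key is actually visible to Python before running anything, a quick check works; only `OPENAI_API_KEY` comes from the step above, so add whichever other provider variables your setup uses:

```python
import os

# Prints MISSING if the key exported above is not visible to this process.
for key in ("OPENAI_API_KEY",):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")
```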
The `eval.py` script is the main evaluation tool for TravelGym. It supports various model types:
```bash
# OpenAI API
python eval.py \
    --model_name gpt-4o \
    --port 8000 \
    --max_turns 20 \
    --pass_k 1 \
    --temperature 0.0 \
    --envs travel22 travel33 travel44 \
    --save_name travel_gpt4o_eval
```
First, start your vLLM server:

```bash
# Start the vLLM server (example)
vllm serve Qwen/Qwen3-8B \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 4 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --chat-template tool_template/hermes.jinja \
    --port 8000
```

Make sure tool calling is enabled when serving the model.
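To sanity-check that tool calls are actually being parsed before launching a full run, you can probe the OpenAI-compatible endpoint directly. This is a minimal sketch, assuming the server above is listening on port 8000; the `search_flights` tool here is purely illustrative and not part of TravelGym:

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the api_key value is unused locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "search_flights",  # hypothetical tool, only for this probe
        "description": "Search for flights between two cities.",
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string"},
                "destination": {"type": "string"},
            },
            "required": ["origin", "destination"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Find a flight from SFO to JFK."}],
    tools=tools,
)
# A non-empty tool_calls list means the tool-call parser is wired up correctly.
print(resp.choices[0].message.tool_calls)
```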
Then evaluate:

```bash
python eval.py \
    --model_name Qwen/Qwen3-8B \
    --port 8000 \
    --max_turns 20 \
    --pass_k 1 \
    --temperature 0.0 \
    --envs travel22 travel33 travel44 \
    --save_name travel_qwen3_eval
```
```bash
# DeepSeek API
python eval.py \
    --model_name deepseek-chat \
    --port 8000 \
    --max_turns 20 \
    --temperature 0.0 \
    --envs travel22 travel33 travel44 \
    --save_name travel_deepseek_eval
```
```bash
# Google Gemini
python eval.py \
    --model_name gemini-2.5-pro \
    --port 8000 \
    --max_turns 20 \
    --temperature 0.0 \
    --envs travel22 travel33 travel44 \
    --save_name travel_gemini_eval
```
Note that the environments being evaluated can also be customized; see the `travelgym/data` directory for details.
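For example, one way to discover which environment names are available is to scan that directory. This is a hypothetical sketch that assumes one data file per environment; check the actual layout of `travelgym/data` in the repo:

```python
from pathlib import Path

# Assumed layout: one data file per environment, named after it.
data_dir = Path("travelgym/data")
envs = sorted({p.stem for p in data_dir.iterdir() if p.is_file()})
print(envs)  # e.g. ['travel22', 'travel33', 'travel44']
```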
| Parameter | Description | Default |
|---|---|---|
| `--model_name` | Model name or path | Required |
| `--port` | Port for the vLLM server (local models) | `8000` |
| `--max_turns` | Maximum conversation turns | `20` |
| `--pass_k` | Number of attempts per task | `1` |
| `--temperature` | Model temperature | `0.0` |
| `--envs` | Evaluation environments | `travel22 travel33 travel44` |
| `--save_name` | Prefix for the result file | model name |
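Since `--pass_k` controls the number of attempts per task, scores can be aggregated pass@k-style: a task counts as solved if any of its k attempts succeeds. An illustrative aggregation (the repo's actual metric code may differ):

```python
# attempt_results[i] holds the k pass/fail outcomes for task i.
def pass_at_k(attempt_results: list[list[bool]]) -> float:
    return sum(any(attempts) for attempts in attempt_results) / len(attempt_results)

print(pass_at_k([[False, True], [False, False], [True, True]]))  # 0.666...
```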
TravelGym is built on the Gymnasium framework with the following key components:

- `travel_env.py`: Main environment with conversation and evaluation logic
- `prompts.py` / `prompt_async.py`: LLM prompts and response generation
- `task_data.py`: Dataset loading and management
- `utils.py`: Utility functions for evaluation
Configuration:

- Centralized configuration management
- Model-specific settings
- Environment parameters

Dataset:

- Structured scenarios with user preferences
- Ground-truth annotations
- Multiple difficulty levels (`travel22`, `travel33`, `travel44`)
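Conceptually, a scenario record bundles a user goal, implicit preferences, and ground-truth choices. The shape below is purely illustrative; the field names are assumptions, not the repo's actual schema (see `travelgym/data` for that):

```python
# Hypothetical scenario record, for illustration only.
scenario = {
    "scenario_id": "travel22-0001",
    "user_goal": "Book a round trip from Chicago to Paris in May.",
    "preferences": [
        {"aspect": "budget", "detail": "under $1,200 total", "revealed": False},
        {"aspect": "flights", "detail": "prefers direct flights", "revealed": False},
    ],
    "ground_truth": {"flight": "option_3", "hotel": "option_1"},
}
```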
You can customize evaluation by modifying `eval.py` or creating your own scripts:
```python
import travelgym

# Create a custom configuration
config = travelgym.TravelGymConfig(
    max_steps=15,
    temperature=0.2,
    data_mode="single",
    verbose=True,
)

# Initialize the environment
env = travelgym.TravelEnv(config)

# Run the evaluation loop
observation, info = env.reset()
for step in range(config.max_steps):
    # Your model logic here
    action = your_model.generate(observation["feedback"])
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
```
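The `your_model` placeholder above can be anything with a `generate` method. A minimal sketch backed by an OpenAI-compatible endpoint (the `OpenAIModel` wrapper is illustrative, not part of TravelGym):

```python
from openai import OpenAI

class OpenAIModel:
    """Thin wrapper so the evaluation loop can call .generate(feedback)."""

    def __init__(self, model_name: str, base_url: str | None = None):
        # base_url=None uses the OpenAI API; point it at a vLLM server otherwise.
        self.client = OpenAI(base_url=base_url)
        self.model_name = model_name

    def generate(self, feedback: str) -> str:
        resp = self.client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": feedback}],
            temperature=0.0,
        )
        return resp.choices[0].message.content

your_model = OpenAIModel("gpt-4o")
```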
```bibtex
@article{qian2025userbench,
  title={UserBench: An Interactive Gym Environment for User-Centric Agents},
  author={Qian, Cheng and Liu, Zuxin and Prabhakar, Akshara and Liu, Zhiwei and Zhang, Jianguo and Chen, Haolin and Ji, Heng and Yao, Weiran and Heinecke, Shelby and Savarese, Silvio and Xiong, Caiming and Wang, Huan},
  journal={arXiv preprint arXiv:2507.22034},
  year={2025}
}
```