This is the official repository for the paper "UserBench: An Interactive Gym Environment for User-Centric Agents".

UserBench is an evaluation environment for testing language models on multi-turn travel-planning tasks. This open-source implementation provides a framework for evaluating how well language models understand user preferences, make appropriate function calls, and provide personalized recommendations in travel-planning scenarios.
```bash
# Clone the repository
git clone https://github.com/SalesforceAIResearch/UserBench.git
cd UserBench

# Install dependencies
pip install -r requirements.txt

# Set up your API keys (choose one or more)
export OPENAI_API_KEY="your-openai-key-here"
```
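If you want to confirm a key is actually visible to Python before running anything, a quick check works; only `OPENAI_API_KEY` comes from the step above, so add whichever other provider variables your setup uses:

```python
import os

# Prints MISSING if the key exported above is not visible to this process.
for key in ("OPENAI_API_KEY",):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")
```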
The `eval.py` script is the main evaluation tool for TravelGym. It supports various model types:
```bash
# OpenAI API
python eval.py \
    --model_name gpt-4o \
    --port 8000 \
    --max_turns 20 \
    --pass_k 1 \
    --temperature 0.0 \
    --envs travel22 travel33 travel44 \
    --save_name travel_gpt4o_eval
```
First, start your vLLM server:

```bash
# Start the vLLM server (example)
vllm serve Qwen/Qwen3-8B \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 4 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --chat-template tool_template/hermes.jinja \
    --port 8000
```

Make sure tool calling is enabled when serving the model.
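To sanity-check that tool calls are actually being parsed before launching a full run, you can probe the OpenAI-compatible endpoint directly. This is a minimal sketch, assuming the server above is listening on port 8000; the `search_flights` tool here is purely illustrative and not part of TravelGym:

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the api_key value is unused locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "search_flights",  # hypothetical tool, only for this probe
        "description": "Search for flights between two cities.",
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string"},
                "destination": {"type": "string"},
            },
            "required": ["origin", "destination"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Find a flight from SFO to JFK."}],
    tools=tools,
)
# A non-empty tool_calls list means the tool-call parser is wired up correctly.
print(resp.choices[0].message.tool_calls)
```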
Then evaluate:

```bash
python eval.py \
    --model_name Qwen/Qwen3-8B \
    --port 8000 \
    --max_turns 20 \
    --pass_k 1 \
    --temperature 0.0 \
    --envs travel22 travel33 travel44 \
    --save_name travel_qwen3_eval
```
```bash
# DeepSeek API
python eval.py \
    --model_name deepseek-chat \
    --port 8000 \
    --max_turns 20 \
    --temperature 0.0 \
    --envs travel22 travel33 travel44 \
    --save_name travel_deepseek_eval
```
```bash
# Google Gemini
python eval.py \
    --model_name gemini-2.5-pro \
    --port 8000 \
    --max_turns 20 \
    --temperature 0.0 \
    --envs travel22 travel33 travel44 \
    --save_name travel_gemini_eval
```
Note that the environments being evaluated can also be customized; see the `travelgym/data` directory for details.
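For example, one way to discover which environment names are available is to scan that directory. This is a hypothetical sketch that assumes one data file per environment; check the actual layout of `travelgym/data` in the repo:

```python
from pathlib import Path

# Assumed layout: one data file per environment, named after it.
data_dir = Path("travelgym/data")
envs = sorted({p.stem for p in data_dir.iterdir() if p.is_file()})
print(envs)  # e.g. ['travel22', 'travel33', 'travel44']
```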
| Parameter | Description | Default |
|---|---|---|
| `--model_name` | Model name or path | Required |
| `--port` | Port for the vLLM server (local models) | `8000` |
| `--max_turns` | Maximum conversation turns | `20` |
| `--pass_k` | Number of attempts per task | `1` |
| `--temperature` | Model temperature | `0.0` |
| `--envs` | Evaluation environments | `travel22 travel33 travel44` |
| `--save_name` | Prefix for the result file | model name |
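Since `--pass_k` controls the number of attempts per task, scores can be aggregated pass@k-style: a task counts as solved if any of its k attempts succeeds. An illustrative aggregation (the repo's actual metric code may differ):

```python
# attempt_results[i] holds the k pass/fail outcomes for task i.
def pass_at_k(attempt_results: list[list[bool]]) -> float:
    return sum(any(attempts) for attempts in attempt_results) / len(attempt_results)

print(pass_at_k([[False, True], [False, False], [True, True]]))  # 0.666...
```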
TravelGym is built on the Gymnasium framework with the following key components:

- `travel_env.py`: Main environment with conversation and evaluation logic
- `prompts.py` / `prompt_async.py`: LLM prompts and response generation
- `task_data.py`: Dataset loading and management
- `utils.py`: Utility functions for evaluation
Configuration:

- Centralized configuration management
- Model-specific settings
- Environment parameters

Dataset:

- Structured scenarios with user preferences
- Ground-truth annotations
- Multiple difficulty levels (`travel22`, `travel33`, `travel44`)
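Conceptually, a scenario record bundles a user goal, implicit preferences, and ground-truth choices. The shape below is purely illustrative; the field names are assumptions, not the repo's actual schema (see `travelgym/data` for that):

```python
# Hypothetical scenario record, for illustration only.
scenario = {
    "scenario_id": "travel22-0001",
    "user_goal": "Book a round trip from Chicago to Paris in May.",
    "preferences": [
        {"aspect": "budget", "detail": "under $1,200 total", "revealed": False},
        {"aspect": "flights", "detail": "prefers direct flights", "revealed": False},
    ],
    "ground_truth": {"flight": "option_3", "hotel": "option_1"},
}
```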
You can customize evaluation by modifying `eval.py` or creating your own scripts:
```python
import travelgym

# Create a custom configuration
config = travelgym.TravelGymConfig(
    max_steps=15,
    temperature=0.2,
    data_mode="single",
    verbose=True,
)

# Initialize the environment
env = travelgym.TravelEnv(config)

# Run the evaluation loop
observation, info = env.reset()
for step in range(config.max_steps):
    # Your model logic here
    action = your_model.generate(observation["feedback"])
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
```
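The `your_model` placeholder above can be anything with a `generate` method. A minimal sketch backed by an OpenAI-compatible endpoint (the `OpenAIModel` wrapper is illustrative, not part of TravelGym):

```python
from openai import OpenAI

class OpenAIModel:
    """Thin wrapper so the evaluation loop can call .generate(feedback)."""

    def __init__(self, model_name: str, base_url: str | None = None):
        # base_url=None uses the OpenAI API; point it at a vLLM server otherwise.
        self.client = OpenAI(base_url=base_url)
        self.model_name = model_name

    def generate(self, feedback: str) -> str:
        resp = self.client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": feedback}],
            temperature=0.0,
        )
        return resp.choices[0].message.content

your_model = OpenAIModel("gpt-4o")
```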
```bibtex
@article{qian2025userbench,
  title={UserBench: An Interactive Gym Environment for User-Centric Agents},
  author={Qian, Cheng and Liu, Zuxin and Prabhakar, Akshara and Liu, Zhiwei and Zhang, Jianguo and Chen, Haolin and Ji, Heng and Yao, Weiran and Heinecke, Shelby and Savarese, Silvio and Xiong, Caiming and Wang, Huan},
  journal={arXiv preprint arXiv:2507.22034},
  year={2025}
}
```