A benchmark framework for evaluating Large Language Models in multi-agent social deduction games
Strategy Bench is designed to test the capabilities personal AI will need when representing us in messy, multi-party settings and making decisions based on incomplete information. The benchmark uses strategic games in which agents take actions, communicate with each other, and build world models, with a focus on scenarios that combine asymmetric information, multi-party interactions, and deceptive behavior.
This provides a crisp, high-signal way to evaluate how well AI agents can handle the complex social and strategic reasoning required when representing human interests in real-world scenarios.
- Secret Hitler - Political strategy with hidden roles, policies, and presidential powers
- Among Us - Spatial deduction with tasks, emergency meetings, and impostor kills
- Avalon - Quest-based team building with hidden roles and assassination
- Spyfall - Question-and-answer based location deduction
- Werewolf - Classic social deduction with night/day phases and special roles
- Sheriff of Nottingham - Bluffing and negotiation with goods inspection
# Clone repository
git clone https://github.com/yourusername/strategy-bench.git
cd strategy-bench
# Install package
pip install -e .
# Option 1: Environment variable
export OPENROUTER_API_KEY="your_key_here"
# Option 2: Create .env file
echo "OPENROUTER_API_KEY=your_key_here" > .env
# List all available games
python scripts/sdb_cli.py list
# Play any game with default settings
python scripts/sdb_cli.py play secret_hitler
python scripts/sdb_cli.py play among_us
python scripts/sdb_cli.py play avalon
# Customize game settings
python scripts/sdb_cli.py play secret_hitler --num-players 7 --model anthropic/claude-4.5-sonnet
python scripts/sdb_cli.py play werewolf --num-players 8 --temperature 0.9
# Use random agents for testing
python scripts/sdb_cli.py play spyfall --agent-type random --num-players 6
# Run Secret Hitler with 7 players
python examples/run_secret_hitler.py
# Run Among Us with 6 players
python examples/run_among_us.py
# Run Avalon with 5 players
python examples/run_avalon.py
# Run Spyfall
python examples/run_spyfall.py
# Run Werewolf
python examples/run_werewolf.py
# Run Sheriff of Nottingham
python examples/run_sheriff.py
from sdb.environments.secret_hitler import SecretHitlerEnv, SecretHitlerConfig
from sdb.agents.llm.openrouter_agent import OpenRouterAgent
from sdb.logging.game_logger import GameLogger
# Create agents
agents = [
    OpenRouterAgent(
        player_id=i,
        model="openai/gpt-5",
        temperature=0.7,
        memory_capacity=50,
    )
    for i in range(5)
]
# Configure game
config = SecretHitlerConfig(n_players=5, log_private_info=True)
logger = GameLogger(output_dir="experiments/my_games", log_private=True)
# Run game
env = SecretHitlerEnv(agents=agents, config=config, logger=logger)
result = env.run()
print(f"Winner: {result.winner}")
print(f"Reason: {result.win_reason}")
print(f"Rounds: {result.num_rounds}")
All games can be configured via YAML files in the configs/ directory:
- configs/secret_hitler.yaml - Secret Hitler settings
- configs/among_us.yaml - Among Us settings
- configs/avalon.yaml - Avalon settings
- configs/spyfall.yaml - Spyfall settings
- configs/werewolf.yaml - Werewolf settings
- configs/sheriff.yaml - Sheriff of Nottingham settings
Example configuration:
# Game configuration
n_players: 7
n_impostors: 2
max_task_rounds: 50
discussion_rounds: 2
# LLM Agent Settings
agent:
  model: "anthropic/claude-4.5-sonnet"
  temperature: 0.8
  memory_capacity: 35
# Logging
logging:
  enabled: true
  output_dir: "experiments/my_game"
  log_private: true # Save private info (roles, observations) - recommended for analysis
Secret Hitler: Political strategy game where Liberals try to enact 5 Liberal policies or assassinate Hitler, while Fascists try to enact 6 Fascist policies or elect Hitler as Chancellor.
Features:
- Full game mechanics (nomination, voting, legislative session, presidential powers)
- Veto power (unlocks after 5 Fascist policies)
- Discussion phases (public deliberation before votes)
- Memory integration (agents remember all discussions and events)
- Belief tracking (agents build models of other players)
Among Us: Spatial deduction game with impostors and crewmates on a spaceship.
Features:
- Two-phase round resolution (deterministic, order-independent)
- Spatial map system (14 rooms with corridor connections)
- Movement and vent systems
- Task completion and progress tracking
- Emergency meetings and body reporting
- Discussion and voting mechanics
- Kill cooldowns and proper ejection handling
Avalon: Quest-based team building with hidden roles.
Features:
- Team proposal and voting system
- Quest success/fail mechanics
- Assassination phase for Evil team
- Special roles (Merlin, Assassin)
- Pre-proposal discussion
- Proper round and proposal tracking
Spyfall: Question-and-answer based location deduction.
Features:
- Turn-based Q&A system
- Location-based questioning
- Spy final guess mechanic
- Accusation and voting system
- Time limits and turn tracking
Werewolf: Classic social deduction with night/day phases.
Features:
- Night phase (Werewolf kills, Doctor saves, Seer investigates)
- Day phase (Discussion and voting)
- Majority voting system
- Special role powers
- Proper phase transitions
Sheriff of Nottingham: Bluffing and negotiation game with goods inspection.
Features:
- Market phase (card drawing)
- Loading phase (bag preparation)
- Declaration phase
- Negotiation phase (multi-round)
- Inspection phase with penalties
- Royal goods and contraband
Strategy Bench uses OpenRouter to access multiple LLM providers.
# Use cheaper models for bulk experiments
agent = OpenRouterAgent(
    player_id=0,
    model="openai/gpt-5",
    temperature=0.7,
    max_tokens=1024,  # Limit response length
)
social-deduction-bench/
├── sdb/                          # Main package
│   ├── core/                     # Base classes & interfaces
│   │   ├── base_env.py           # BaseEnvironment class
│   │   └── types.py              # Core types (Action, Observation, etc.)
│   ├── agents/                   # Agent implementations
│   │   └── llm/                  # LLM agents (OpenRouter)
│   │       └── openrouter_agent.py
│   ├── environments/             # Game implementations
│   │   ├── secret_hitler/        # Secret Hitler game
│   │   ├── among_us/             # Among Us game
│   │   ├── avalon/               # Avalon game
│   │   ├── spyfall/              # Spyfall game
│   │   ├── werewolf/             # Werewolf game
│   │   └── sheriff/              # Sheriff of Nottingham
│   ├── memory/                   # Memory & belief tracking
│   ├── logging/                  # Logging system
│   └── llm_interface/            # LLM API clients
├── examples/                     # Example scripts
│   ├── run_secret_hitler.py
│   ├── run_among_us.py
│   ├── run_avalon.py
│   ├── run_spyfall.py
│   ├── run_werewolf.py
│   └── run_sheriff.py
├── configs/                      # YAML configuration files
├── experiments/                  # Experiment outputs
└── tests/                        # Unit tests
All games generate JSONL logs with detailed event information:
import json
# Load game log
with open("experiments/my_game/game_xyz.jsonl") as f:
    events = [json.loads(line) for line in f]
# Filter specific events
discussions = [e for e in events if e["event_type"] == "DISCUSSION"]
votes = [e for e in events if e["event_type"] == "VOTE_CAST"]
actions = [e for e in events if e["event_type"] == "PLAYER_ACTION"]
# Analyze game flow
for event in events:
    print(f"{event['timestamp']}: {event['event_type']}")
Common event types across all games:
- GAME_START - Game initialization
- PHASE_CHANGE - Phase transitions
- PLAYER_ACTION - Player actions
- DISCUSSION - Discussion statements
- VOTE_CAST - Vote events
- ELECTION_RESULT - Voting outcomes
- PLAYER_ELIMINATED - Player eliminations
- GAME_END - Game conclusion
- ERROR - Error events with error codes
Agents maintain:
- Short-term memory: Recent events and observations
- Belief tracking: Probabilistic models of other players
- Discussion memory: All public statements from all players
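The three kinds of memory listed above can be pictured with a small standalone sketch. This is not the code in sdb/memory/; the class name `AgentMemory`, its methods, and the odds-based belief update are assumptions made for illustration only.

```python
from collections import deque


class AgentMemory:
    """Hypothetical container for the three memory types (illustration only)."""

    def __init__(self, capacity, other_players):
        # Short-term memory: a bounded buffer that drops the oldest event first.
        self.short_term = deque(maxlen=capacity)
        # Discussion memory: every public statement, never truncated.
        self.discussion = []
        # Belief tracking: probability that each other player is adversarial.
        # The uniform 0.5 prior is an assumption of this sketch.
        self.suspicion = {pid: 0.5 for pid in other_players}

    def observe(self, event):
        self.short_term.append(event)

    def hear(self, speaker, text):
        self.discussion.append((speaker, text))

    def update_belief(self, pid, likelihood_ratio):
        """Bayesian update in odds form: posterior odds = prior odds * ratio."""
        # Clamp to avoid division by zero when a belief saturates.
        p = min(max(self.suspicion[pid], 1e-6), 1 - 1e-6)
        odds = p / (1 - p) * likelihood_ratio
        self.suspicion[pid] = odds / (1 + odds)
```

A bounded `deque` mirrors the `memory_capacity` setting: once the buffer is full, each new observation evicts the oldest one, while the discussion record keeps growing.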
Every game event is logged:
- Player actions and reasoning
- Public discussions
- Private information (role assignments, investigations)
- Game state transitions
- Error events with detailed error codes
- LLM agents are completely game-agnostic
- All game-specific prompts and actions are in environment folders
- Observations include an instruction field with formatted context
- Hybrid memory: agents maintain short-term memory, games provide full history
- Structured error codes (e.g., INVALID_TARGET_ID, KILL_ON_COOLDOWN)
- Last error tracking for agent self-correction
- JSON parse retries with increased token limits
- Fallback to safe actions on failures
- Phase-based action gating
- Explicit action choices provided to agents
- Player directory with ID-to-name mapping
- Prevention of duplicate votes and invalid actions
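The error-handling items above (JSON parse retries with larger token limits, last-error feedback, explicit action choices, and a safe fallback) might fit together roughly as follows. This is a sketch under stated assumptions, not the framework's implementation: `choose_action`, the `ask_llm` callable, the doubling of the token budget, and the codes `INVALID_JSON` and `ILLEGAL_ACTION` are all hypothetical.

```python
import json


def choose_action(ask_llm, legal_actions, safe_action, max_retries=2, base_tokens=256):
    """Ask for a JSON action, retrying on failure, then fall back.

    ask_llm(max_tokens, last_error) -> str is a hypothetical callable that
    wraps the model call. Returns (action, last_error).
    """
    last_error = None
    for attempt in range(max_retries + 1):
        # Each retry doubles the token budget (truncated JSON is a common
        # failure) and passes the previous error code back to the agent.
        raw = ask_llm(base_tokens * (2 ** attempt), last_error)
        try:
            action = json.loads(raw)
        except json.JSONDecodeError:
            last_error = "INVALID_JSON"  # hypothetical error code
            continue
        # Action gating: only choices offered for the current phase are legal.
        if not isinstance(action, dict) or action.get("type") not in legal_actions:
            last_error = "ILLEGAL_ACTION"  # hypothetical error code
            continue
        return action, None
    # Every attempt failed: return the safe action and the last error seen.
    return {"type": safe_action}, last_error
```

Passing `last_error` back into the next prompt is what allows self-correction: the agent is told why its previous reply was rejected instead of being asked the same question again.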
Among Us implements deterministic, order-independent action resolution:
- Snapshot all positions at round start
- Resolve kills based on pre-move positions
- Apply movements for survivors
- Process body reports and meetings
This prevents order-dependent kill failures and ensures fair gameplay.
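The four steps above can be sketched in a few lines. The names `Kill` and `resolve_round` and the data shapes are assumptions for illustration; the actual Among Us environment may be structured differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kill:
    killer: int
    target: int


def resolve_round(positions, moves, kills):
    """Resolve one round; returns (new_positions, bodies, reporters).

    positions: dict of player_id -> room at round start
    moves:     dict of player_id -> destination room (missing = stay)
    kills:     list of Kill attempts submitted this round
    """
    # Step 1: snapshot positions so no later step sees a partial update.
    snapshot = dict(positions)

    # Step 2: a kill succeeds only if killer and target shared a room in
    # the snapshot, whatever either of them intended to do this round.
    bodies = {}
    for kill in kills:
        if kill.killer not in snapshot or kill.target not in snapshot:
            continue
        if kill.killer != kill.target and snapshot[kill.killer] == snapshot[kill.target]:
            bodies[kill.target] = snapshot[kill.target]

    # Step 3: only survivors move; a body stays in the room where it fell.
    new_positions = {
        pid: moves.get(pid, room)
        for pid, room in snapshot.items()
        if pid not in bodies
    }

    # Step 4: survivors who end the round in a room with a body may report it.
    body_rooms = set(bodies.values())
    reporters = sorted(pid for pid, room in new_positions.items() if room in body_rooms)
    return new_positions, bodies, reporters
```

Because kills are checked against the snapshot and not against live positions, a target cannot escape by moving earlier in the same round, and reordering the `kills` list does not change the outcome.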
Games track discussion rounds properly:
- Each player speaks once per round
- Rounds advance when all alive players have spoken
- Duplicate detection prevents repetitive statements
- Phase automatically advances after configured rounds
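A minimal sketch of this bookkeeping follows. The class `DiscussionTracker` is hypothetical, and it assumes duplicate detection means rejecting a statement the same player has already made, compared after lowercasing and collapsing whitespace; the framework may use a different rule.

```python
class DiscussionTracker:
    """Hypothetical tracker for discussion rounds (illustration only)."""

    def __init__(self, alive_players, max_rounds):
        self.alive = set(alive_players)
        self.max_rounds = max_rounds
        self.round = 1
        self.spoken = set()    # players who have spoken in the current round
        self.seen = set()      # (player, normalized statement) pairs
        self.finished = False  # True once the phase should advance

    def submit(self, player_id, statement):
        """Record a statement; return True if it was accepted."""
        if self.finished or player_id not in self.alive or player_id in self.spoken:
            return False
        key = (player_id, " ".join(statement.lower().split()))
        if key in self.seen:
            return False  # repetitive statement rejected
        self.seen.add(key)
        self.spoken.add(player_id)
        # The round advances only when every alive player has spoken.
        if self.spoken == self.alive:
            self.spoken.clear()
            if self.round >= self.max_rounds:
                self.finished = True
            else:
                self.round += 1
        return True
```

In this sketch a rejected duplicate does not count as speaking, so the round stays open until that player contributes something new.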
Different voting mechanics per game:
- Secret Hitler: Simple majority for governments
- Werewolf: Majority required for elimination
- Avalon: Team approval requires majority, quest voting is anonymous
- Among Us: Plurality voting for ejections
- Spyfall: Majority required for accusations
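The difference between majority and plurality rules can be made concrete with two small helpers. These functions are illustrative assumptions and not part of the sdb API; in particular, treating "majority" as strictly more than half of eligible voters and returning no winner on a plurality tie are choices made for this sketch.

```python
from collections import Counter


def majority_winner(votes, n_voters):
    """Return the option backed by strictly more than half of n_voters, else None.

    votes: dict of voter_id -> chosen option
    """
    if not votes:
        return None
    option, count = Counter(votes.values()).most_common(1)[0]
    return option if count * 2 > n_voters else None


def plurality_winner(votes):
    """Return the option with the most votes, or None on a tie or empty ballot."""
    tally = Counter(votes.values()).most_common(2)
    if not tally:
        return None
    if len(tally) == 2 and tally[0][1] == tally[1][1]:
        return None  # tie for first place: nobody is selected
    return tally[0][0]
```

With five voters split 2-1-1-1, `plurality_winner` selects the option with two votes, while `majority_winner` returns `None` because no option has three.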
Each game includes an example script in the examples/ directory. To test:
# Test Secret Hitler
python examples/run_secret_hitler.py
# Test Among Us
python examples/run_among_us.py
# Test Avalon
python examples/run_avalon.py
# Test Spyfall
python examples/run_spyfall.py
# Test Werewolf
python examples/run_werewolf.py
# Test Sheriff of Nottingham
python examples/run_sheriff.py
Logs will be saved to experiments/<game_name>/ by default.
- Architecture: See ARCHITECTURE.md for technical details
- API Reference: See inline docstrings in source code
- Game Rules: Each game environment has its own rules.py file
- Fix Logs: See AMONG_US_FIXES.md and SHERIFF_ALL_FIXES.md for detailed implementation notes
Contributions are welcome! To add a new game:
- Create game directory in sdb/environments/your_game/
- Implement required files:
  - env.py - Main environment (inherit from BaseEnvironment)
  - state.py - Game state management
  - config.py - Configuration dataclass
  - types.py - Game-specific types
  - rules.py - Rule validation functions
- Add configuration YAML in configs/your_game.yaml
- Create example script in examples/run_your_game.py
- Write tests
- Submit PR
See ARCHITECTURE.md for a detailed implementation guide.
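As a rough idea of what config.py and rules.py might contain for a new game, here is a standalone sketch. It does not use the BaseEnvironment interface, which is not shown here; `YourGameConfig`, `validate_vote`, and the codes `VOTER_NOT_ALIVE` and `DUPLICATE_VOTE` are hypothetical, and only `INVALID_TARGET_ID` is a code named earlier in this README.

```python
from dataclasses import dataclass


# config.py (hypothetical): a dataclass that validates its own fields.
@dataclass
class YourGameConfig:
    n_players: int = 5
    discussion_rounds: int = 2

    def __post_init__(self):
        if self.n_players < 3:
            raise ValueError("n_players must be at least 3")
        if self.discussion_rounds < 0:
            raise ValueError("discussion_rounds cannot be negative")


# rules.py (hypothetical): pure functions that return an error code or None.
def validate_vote(voter, target, alive, already_voted):
    """Return None if the vote is legal, otherwise a structured error code."""
    if voter not in alive:
        return "VOTER_NOT_ALIVE"
    if voter in already_voted:
        return "DUPLICATE_VOTE"
    if target not in alive:
        return "INVALID_TARGET_ID"
    return None
```

Keeping rule checks as pure functions that return error codes makes them easy to unit test without constructing a full environment.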
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
This framework builds upon:
- Secret Hitler - Original game by Mike Boxleiter, Tommy Maranges, Max Temkin
- AmongAgents - Among Us implementation
- AvalonBench & Strategist (ICLR 2025) - Avalon with search agents
- Spyfall - Question-based deduction mechanics
- Werewolf Arena (Google) - Werewolf implementation
- Sheriff of Nottingham - Bluffing and negotiation mechanics
For technical architecture details, see ARCHITECTURE.md
