EternisAI/strategy-bench

Strategy Bench

CC BY-NC 4.0

A benchmark framework for evaluating Large Language Models in multi-agent social deduction games

What is Strategy Bench?

Strategy Bench tests the capabilities a personal AI will need when it represents us in messy, multi-party settings and has to decide on incomplete information. The benchmark uses strategic games in which agents take actions, communicate with each other, and build world models, with a focus on scenarios that combine asymmetric information, multi-party interaction, and deceptive behavior.

The result is a focused, high-signal way to evaluate how well AI agents handle the social and strategic reasoning required to represent human interests in real-world scenarios.

Supported Games

  • Secret Hitler - Political strategy with hidden roles, policies, and presidential powers
  • Among Us - Spatial deduction with tasks, emergency meetings, and impostor kills
  • Avalon - Quest-based team building with hidden roles and assassination
  • Spyfall - Question-and-answer based location deduction
  • Werewolf - Classic social deduction with night/day phases and special roles
  • Sheriff of Nottingham - Bluffing and negotiation with goods inspection

Quick Start

1. Installation

# Clone repository
git clone https://github.com/EternisAI/strategy-bench.git
cd strategy-bench

# Install package
pip install -e .

2. Set up API Keys

# Option 1: Environment variable
export OPENROUTER_API_KEY="your_key_here"

# Option 2: Create .env file
echo "OPENROUTER_API_KEY=your_key_here" > .env
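
If you want to confirm from Python that the key is visible before starting a game, the following is a minimal standard-library sketch. It is not part of Strategy Bench: `load_env_file` and `require_api_key` are hypothetical helper names, and the framework may already load `.env` on its own.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=VALUE lines from a file into os.environ (hypothetical helper)."""
    env_path = Path(path)
    if not env_path.exists():
        return {}
    loaded = {}
    for raw in env_path.read_text().splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        # Variables already exported in the shell take precedence over the file.
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded


def require_api_key(name="OPENROUTER_API_KEY"):
    """Return the key or fail early with a readable message."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; export it or add it to .env")
    return key
```

Calling `load_env_file()` followed by `require_api_key()` at the top of a script fails fast if neither option above was applied.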

3. Run Your First Game

Using the CLI (Recommended)

# List all available games
python scripts/sdb_cli.py list

# Play any game with default settings
python scripts/sdb_cli.py play secret_hitler
python scripts/sdb_cli.py play among_us
python scripts/sdb_cli.py play avalon

# Customize game settings
python scripts/sdb_cli.py play secret_hitler --num-players 7 --model anthropic/claude-4.5-sonnet
python scripts/sdb_cli.py play werewolf --num-players 8 --temperature 0.9

# Use random agents for testing
python scripts/sdb_cli.py play spyfall --agent-type random --num-players 6

Using Example Scripts

# Run Secret Hitler with 7 players
python examples/run_secret_hitler.py

# Run Among Us with 6 players
python examples/run_among_us.py

# Run Avalon with 5 players
python examples/run_avalon.py

# Run Spyfall
python examples/run_spyfall.py

# Run Werewolf
python examples/run_werewolf.py

# Run Sheriff of Nottingham
python examples/run_sheriff.py

Using Python API

from sdb.environments.secret_hitler import SecretHitlerEnv, SecretHitlerConfig
from sdb.agents.llm.openrouter_agent import OpenRouterAgent
from sdb.logging.game_logger import GameLogger

# Create agents
agents = [
    OpenRouterAgent(
        player_id=i,
        model="openai/gpt-5",
        temperature=0.7,
        memory_capacity=50
    )
    for i in range(5)
]

# Configure game
config = SecretHitlerConfig(n_players=5, log_private_info=True)
logger = GameLogger(output_dir="experiments/my_games", log_private=True)

# Run game
env = SecretHitlerEnv(agents=agents, config=config, logger=logger)
result = env.run()

print(f"Winner: {result.winner}")
print(f"Reason: {result.win_reason}")
print(f"Rounds: {result.num_rounds}")

Configuration

All games can be configured via YAML files in the configs/ directory:

  • configs/secret_hitler.yaml - Secret Hitler settings
  • configs/among_us.yaml - Among Us settings
  • configs/avalon.yaml - Avalon settings
  • configs/spyfall.yaml - Spyfall settings
  • configs/werewolf.yaml - Werewolf settings
  • configs/sheriff.yaml - Sheriff of Nottingham settings

Example configuration:

# Game configuration
n_players: 7
n_impostors: 2
max_task_rounds: 50
discussion_rounds: 2

# LLM Agent Settings
agent:
  model: "anthropic/claude-4.5-sonnet"
  temperature: 0.8
  memory_capacity: 35

# Logging
logging:
  enabled: true
  output_dir: "experiments/my_game"
  log_private: true  # Save private info (roles, observations) - recommended for analysis
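
As an illustration of how such a file could be consumed, here is a small sketch that maps the parsed YAML onto dataclasses and rejects an impossible impostor count. It assumes the file has already been parsed into a dict (for example with PyYAML's `yaml.safe_load`). `ExampleGameSettings`, `AgentSettings`, and `settings_from_dict` are hypothetical names, not the framework's own config classes.

```python
from dataclasses import dataclass, field


@dataclass
class AgentSettings:
    # Defaults copied from the example configuration above.
    model: str = "anthropic/claude-4.5-sonnet"
    temperature: float = 0.8
    memory_capacity: int = 35


@dataclass
class ExampleGameSettings:  # hypothetical; the real config classes may differ
    n_players: int = 7
    n_impostors: int = 2
    max_task_rounds: int = 50
    discussion_rounds: int = 2
    agent: AgentSettings = field(default_factory=AgentSettings)

    def __post_init__(self):
        # There must be at least one impostor and at least one crewmate.
        if not 0 < self.n_impostors < self.n_players:
            raise ValueError("n_impostors must be between 1 and n_players - 1")


def settings_from_dict(raw):
    """Build settings from a dict shaped like the YAML example."""
    raw = dict(raw)  # copy so the caller's dict is not mutated
    agent = AgentSettings(**raw.pop("agent", {}))
    raw.pop("logging", None)  # logging options are ignored in this sketch
    return ExampleGameSettings(agent=agent, **raw)


# Stand-in for the result of parsing the YAML file.
parsed = {
    "n_players": 7,
    "n_impostors": 2,
    "max_task_rounds": 50,
    "discussion_rounds": 2,
    "agent": {"model": "anthropic/claude-4.5-sonnet", "temperature": 0.8, "memory_capacity": 35},
    "logging": {"enabled": True, "output_dir": "experiments/my_game", "log_private": True},
}
settings = settings_from_dict(parsed)
```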

Game-Specific Features

Secret Hitler

Political strategy game where Liberals try to enact 5 Liberal policies or assassinate Hitler, while Fascists try to enact 6 Fascist policies or elect Hitler as Chancellor once 3 Fascist policies have been enacted.

Features:

  • Full game mechanics (nomination, voting, legislative session, presidential powers)
  • Veto power (unlocks after 5 Fascist policies)
  • Discussion phases (public deliberation before votes)
  • Memory integration (agents remember all discussions and events)
  • Belief tracking (agents build models of other players)

Among Us

Spatial deduction game with impostors and crewmates on a spaceship.

Features:

  • Two-phase round resolution (deterministic, order-independent)
  • Spatial map system (14 rooms with corridor connections)
  • Movement and vent systems
  • Task completion and progress tracking
  • Emergency meetings and body reporting
  • Discussion and voting mechanics
  • Kill cooldowns and proper ejection handling

Avalon

Quest-based team building with hidden roles.

Features:

  • Team proposal and voting system
  • Quest success/fail mechanics
  • Assassination phase for Evil team
  • Special roles (Merlin, Assassin)
  • Pre-proposal discussion
  • Proper round and proposal tracking

Spyfall

Question-and-answer based location deduction.

Features:

  • Turn-based Q&A system
  • Location-based questioning
  • Spy final guess mechanic
  • Accusation and voting system
  • Time limits and turn tracking

Werewolf

Classic social deduction with night/day phases.

Features:

  • Night phase (Werewolf kills, Doctor saves, Seer investigates)
  • Day phase (Discussion and voting)
  • Majority voting system
  • Special role powers
  • Proper phase transitions

Sheriff of Nottingham

Bluffing and negotiation game with goods inspection.

Features:

  • Market phase (card drawing)
  • Loading phase (bag preparation)
  • Declaration phase
  • Negotiation phase (multi-round)
  • Inspection phase with penalties
  • Royal goods and contraband

Supported Models

Strategy Bench uses OpenRouter to access multiple LLM providers.

Cost Optimization

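# Use cheaper models for bulk experiments
from sdb.agents.llm.openrouter_agent import OpenRouterAgent

agent = OpenRouterAgent(
    player_id=0,
    model="openai/gpt-5",  # Swap in a lower-cost OpenRouter model ID for large batches
    temperature=0.7,
    max_tokens=1024,  # Limit response length to cap per-call cost
)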

Project Structure

strategy-bench/
├── sdb/                        # Main package
│   ├── core/                   # Base classes & interfaces
│   │   ├── base_env.py        # BaseEnvironment class
│   │   └── types.py           # Core types (Action, Observation, etc.)
│   ├── agents/                 # Agent implementations
│   │   └── llm/               # LLM agents (OpenRouter)
│   │       └── openrouter_agent.py
│   ├── environments/           # Game implementations
│   │   ├── secret_hitler/     # Secret Hitler game
│   │   ├── among_us/          # Among Us game
│   │   ├── avalon/            # Avalon game
│   │   ├── spyfall/           # Spyfall game
│   │   ├── werewolf/          # Werewolf game
│   │   └── sheriff/           # Sheriff of Nottingham
│   ├── memory/                # Memory & belief tracking
│   ├── logging/               # Logging system
│   └── llm_interface/         # LLM API clients
├── examples/                   # Example scripts
│   ├── run_secret_hitler.py
│   ├── run_among_us.py
│   ├── run_avalon.py
│   ├── run_spyfall.py
│   ├── run_werewolf.py
│   └── run_sheriff.py
├── configs/                    # YAML configuration files
├── experiments/                # Experiment outputs
└── tests/                      # Unit tests

Analyzing Results

View Game Logs

All games generate JSONL logs with detailed event information:

import json

# Load game log
with open("experiments/my_game/game_xyz.jsonl") as f:
    events = [json.loads(line) for line in f]

# Filter specific events
discussions = [e for e in events if e["event_type"] == "DISCUSSION"]
votes = [e for e in events if e["event_type"] == "VOTE_CAST"]
actions = [e for e in events if e["event_type"] == "PLAYER_ACTION"]

# Analyze game flow
for event in events:
    print(f"{event['timestamp']}: {event['event_type']}")
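
For a quick overview of a game, events can be tallied by type. The sketch below uses an in-memory sample so that it runs on its own. The `event_type` and `timestamp` fields come from the example above, while `player_id` is an assumed field name that may differ in real logs.

```python
import json
from collections import Counter


def summarize_events(lines):
    """Count events by type and votes by player from JSONL lines."""
    events = [json.loads(line) for line in lines if line.strip()]
    by_type = Counter(e["event_type"] for e in events)
    # "player_id" is an assumption about the log schema.
    votes_by_player = Counter(
        e.get("player_id") for e in events if e["event_type"] == "VOTE_CAST"
    )
    return by_type, votes_by_player


# Illustrative sample, not real benchmark output.
sample_lines = [
    '{"timestamp": 0, "event_type": "GAME_START"}',
    '{"timestamp": 1, "event_type": "DISCUSSION", "player_id": 0}',
    '{"timestamp": 2, "event_type": "VOTE_CAST", "player_id": 0}',
    '{"timestamp": 3, "event_type": "VOTE_CAST", "player_id": 1}',
    '{"timestamp": 4, "event_type": "GAME_END"}',
]
by_type, votes_by_player = summarize_events(sample_lines)
```

An open file object can be passed in place of `sample_lines`, since iterating over a file yields one line at a time.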

Log Event Types

Common event types across all games:

  • GAME_START - Game initialization
  • PHASE_CHANGE - Phase transitions
  • PLAYER_ACTION - Player actions
  • DISCUSSION - Discussion statements
  • VOTE_CAST - Vote events
  • ELECTION_RESULT - Voting outcomes
  • PLAYER_ELIMINATED - Player eliminations
  • GAME_END - Game conclusion
  • ERROR - Error events with error codes

Key Features

1. Memory-Aware Agents

Agents maintain:

  • Short-term memory: Recent events and observations
  • Belief tracking: Probabilistic models of other players
  • Discussion memory: All public statements from all players
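
To make the first two ideas concrete, here is a minimal sketch of a bounded short-term memory and a single Bayesian belief update. It assumes `memory_capacity` caps the number of retained events. `ShortTermMemory` and `update_belief` are hypothetical names, and the framework's actual memory and belief modules may work differently.

```python
from collections import deque


class ShortTermMemory:
    """Keeps only the most recent `capacity` events (hypothetical class)."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self._events = deque(maxlen=capacity)

    def add(self, event):
        self._events.append(event)

    def recent(self, n=None):
        items = list(self._events)
        if n is None:
            return items
        return items[max(0, len(items) - n):]


def update_belief(prior, likelihood_if_evil, likelihood_if_good):
    """Bayes' rule: P(evil | observation) from P(evil) and the two likelihoods."""
    numerator = prior * likelihood_if_evil
    denominator = numerator + (1.0 - prior) * likelihood_if_good
    return prior if denominator == 0 else numerator / denominator


memory = ShortTermMemory(capacity=3)
for turn in range(5):
    memory.add(f"event-{turn}")

# An observation three times as likely from an evil player raises the belief.
posterior = update_belief(prior=0.4, likelihood_if_evil=0.6, likelihood_if_good=0.2)
```

With a capacity of 3, only the last three of the five events survive, which is why the list above pairs short-term memory with a game-provided history.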

2. Comprehensive Logging

Every game event is logged:

  • Player actions and reasoning
  • Public discussions
  • Private information (role assignments, investigations)
  • Game state transitions
  • Error events with detailed error codes

3. Generic Agent Design

  • LLM agents are completely game-agnostic
  • All game-specific prompts and actions are in environment folders
  • Observations include an instruction field with formatted context
  • Hybrid memory: agents maintain short-term memory, games provide full history

4. Robust Error Handling

  • Structured error codes (e.g., INVALID_TARGET_ID, KILL_ON_COOLDOWN)
  • Last error tracking for agent self-correction
  • JSON parse retries with increased token limits
  • Fallback to safe actions on failures
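
The retry-and-fallback pattern described above might look roughly like the sketch below. It is an illustration only: `parse_action_with_retries`, the `call_model` callback signature, and the `JSON_PARSE_ERROR` and `MISSING_ACTION_TYPE` codes are hypothetical. Of the names used here, only the general idea of structured error codes comes from this README.

```python
import json


def parse_action_with_retries(call_model, prompt, safe_action,
                              max_tokens=256, max_attempts=3):
    """Ask the model for a JSON action, doubling the token limit after each failure."""
    last_error = None
    tokens = max_tokens
    for _ in range(max_attempts):
        # The previous error is passed back so the model can self-correct.
        raw = call_model(prompt, tokens, last_error)
        try:
            action = json.loads(raw)
        except json.JSONDecodeError:
            last_error = "JSON_PARSE_ERROR"  # hypothetical error code
        else:
            if isinstance(action, dict) and "type" in action:
                return action
            last_error = "MISSING_ACTION_TYPE"  # hypothetical error code
        tokens *= 2  # truncated output is a common cause of invalid JSON
    return safe_action


def fake_model(prompt, max_tokens, last_error):
    """Stand-in for an LLM call: output is truncated when the limit is small."""
    full = '{"type": "vote", "target": 3}'
    return full if max_tokens >= 512 else full[:10]


result = parse_action_with_retries(fake_model, "Choose an action", {"type": "skip"})
fallback = parse_action_with_retries(lambda p, t, e: "not json", "Choose", {"type": "skip"})
```

Here the first attempt fails on truncated output, the second succeeds with the doubled limit, and a model that never returns valid JSON ends up with the safe action.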

5. Action Validation

  • Phase-based action gating
  • Explicit action choices provided to agents
  • Player directory with ID-to-name mapping
  • Prevention of duplicate votes and invalid actions
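
A simplified sketch of phase-based gating and duplicate-vote prevention follows. The phase names, the `ALLOWED_ACTIONS` table, and all error codes except `INVALID_TARGET_ID` are assumptions made for illustration. Each game's own rules module defines the real checks.

```python
# Hypothetical phase-to-action table; real games define their own.
ALLOWED_ACTIONS = {
    "discussion": {"speak"},
    "voting": {"vote", "skip"},
}


def validate_action(phase, player_id, action, alive, already_voted):
    """Return None when the action is valid, otherwise a structured error code."""
    action_type = action.get("type")
    if action_type not in ALLOWED_ACTIONS.get(phase, set()):
        return "ACTION_NOT_ALLOWED_IN_PHASE"  # hypothetical code
    if action_type == "vote":
        if player_id in already_voted:
            return "DUPLICATE_VOTE"  # hypothetical code
        if action.get("target") not in alive:
            return "INVALID_TARGET_ID"
    return None


alive = {0, 1, 2}
already_voted = {1}
ok = validate_action("voting", 0, {"type": "vote", "target": 2}, alive, already_voted)
wrong_phase = validate_action("discussion", 0, {"type": "vote", "target": 2}, alive, already_voted)
duplicate = validate_action("voting", 1, {"type": "vote", "target": 2}, alive, already_voted)
bad_target = validate_action("voting", 0, {"type": "vote", "target": 9}, alive, already_voted)
```

Returning a code instead of raising keeps the game loop running and lets the error be fed back to the agent on its next turn.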

Advanced Features

Two-Phase Resolution (Among Us)

Among Us implements deterministic, order-independent action resolution:

  1. Snapshot all positions at round start
  2. Resolve kills based on pre-move positions
  3. Apply movements for survivors
  4. Process body reports and meetings

This prevents order-dependent kill failures and ensures fair gameplay.
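
The four steps can be sketched as a single function. This is a simplified model under stated assumptions: a kill succeeds whenever killer and victim share a room in the snapshot, cooldowns and vents are ignored, and step 4 is reduced to listing survivors who end the round next to a body. `resolve_round` and its data shapes are hypothetical, not the environment's actual code.

```python
def resolve_round(positions, alive, kills, moves):
    """positions: player -> room; kills: killer -> victim; moves: player -> new room."""
    # 1. Snapshot all positions at round start.
    snapshot = dict(positions)

    # 2. Resolve kills against the snapshot, so a victim cannot escape by
    #    moving in the same round and processing order does not matter.
    killed = set()
    for killer, victim in kills.items():
        if killer in alive and victim in alive and snapshot[killer] == snapshot[victim]:
            killed.add(victim)
    survivors = set(alive) - killed

    # 3. Apply movements for survivors only.
    new_positions = dict(snapshot)
    for player, room in moves.items():
        if player in survivors:
            new_positions[player] = room

    # 4. Record bodies and the survivors who could report them.
    bodies = {victim: snapshot[victim] for victim in killed}
    can_report = sorted(p for p in survivors if new_positions[p] in bodies.values())
    return new_positions, survivors, bodies, can_report


# Room names are illustrative. Player 0 kills player 1 while both try to move.
new_positions, survivors, bodies, can_report = resolve_round(
    positions={0: "cafeteria", 1: "cafeteria", 2: "admin"},
    alive={0, 1, 2},
    kills={0: 1},
    moves={0: "medbay", 1: "admin", 2: "cafeteria"},
)
```

Player 1's move to another room is discarded because the kill was decided on pre-move positions, and player 2 walks into the room containing the body.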

Discussion Rounds (Multiple Games)

Games with discussion phases enforce the following round rules:

  • Each player speaks once per round
  • Rounds advance when all alive players have spoken
  • Duplicate detection prevents repetitive statements
  • Phase automatically advances after configured rounds
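
A minimal sketch of a tracker that enforces these rules is shown below. `DiscussionTracker` and its error codes are hypothetical, and duplicate detection is assumed to mean an exact match after lowercasing and whitespace normalization, which may be stricter or looser than what the games actually do.

```python
class DiscussionTracker:
    """Tracks who has spoken and advances rounds (hypothetical class)."""

    def __init__(self, alive_players, max_rounds):
        self.alive = set(alive_players)
        self.max_rounds = max_rounds
        self.round = 1
        self.spoken = set()
        self.seen_statements = set()
        self.finished = False

    def submit(self, player_id, statement):
        """Return None if the statement is accepted, otherwise an error code."""
        if self.finished:
            return "DISCUSSION_OVER"
        if player_id not in self.alive:
            return "PLAYER_NOT_ALIVE"
        if player_id in self.spoken:
            return "ALREADY_SPOKE_THIS_ROUND"
        normalized = " ".join(statement.lower().split())
        if normalized in self.seen_statements:
            return "DUPLICATE_STATEMENT"
        self.seen_statements.add(normalized)
        self.spoken.add(player_id)
        # The round advances once every alive player has spoken.
        if self.spoken == self.alive:
            self.spoken.clear()
            if self.round >= self.max_rounds:
                self.finished = True  # the phase ends after the configured rounds
            else:
                self.round += 1
        return None


tracker = DiscussionTracker(alive_players=[0, 1], max_rounds=2)
first = tracker.submit(0, "I trust player 1")
again = tracker.submit(0, "Let me add something")
repeat = tracker.submit(1, "i  trust PLAYER 1")
second = tracker.submit(1, "I suspect player 0")
```

After these four calls the tracker has rejected one repeat speaker and one duplicate statement, and has moved on to round 2.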

Voting Systems

Different voting mechanics per game:

  • Secret Hitler: Simple majority for governments
  • Werewolf: Majority required for elimination
  • Avalon: Team approval requires a majority; quest votes are anonymous
  • Among Us: Plurality voting for ejections
  • Spyfall: Majority required for accusations
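
The difference between majority and plurality voting can be shown in a few lines. This sketch assumes that a majority means strictly more than half of all eligible voters and that a tied plurality produces no winner. Both are assumptions about tie and threshold handling, and each game may implement them differently.

```python
from collections import Counter


def majority_winner(votes, n_voters):
    """Return the target with more than half of all eligible votes, else None."""
    if not votes:
        return None
    target, count = Counter(votes.values()).most_common(1)[0]
    return target if count > n_voters / 2 else None


def plurality_winner(votes):
    """Return the target with the most votes; a tie for first yields None."""
    ranked = Counter(votes.values()).most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# voter -> target; player 2 receives two of five votes.
votes = {0: 2, 1: 2, 2: 3, 3: 4, 4: 0}
by_majority = majority_winner(votes, n_voters=5)
by_plurality = plurality_winner(votes)
```

With two of five votes, player 2 is ejected under plurality rules but survives under majority rules, which is why the same ballot can end differently in Among Us and Werewolf.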

Testing Games

Each game includes an example script in the examples/ directory. To test:

# Test Secret Hitler
python examples/run_secret_hitler.py

# Test Among Us
python examples/run_among_us.py

# Test Avalon
python examples/run_avalon.py

# Test Spyfall
python examples/run_spyfall.py

# Test Werewolf
python examples/run_werewolf.py

# Test Sheriff of Nottingham
python examples/run_sheriff.py

Logs will be saved to experiments/<game_name>/ by default.

Documentation

  • Architecture: See ARCHITECTURE.md for technical details
  • API Reference: See inline docstrings in source code
  • Game Rules: Each game environment has its own rules.py file
  • Fix Logs: See AMONG_US_FIXES.md and SHERIFF_ALL_FIXES.md for detailed implementation notes

Contributing

Contributions are welcome! To add a new game:

  1. Create game directory in sdb/environments/your_game/
  2. Implement required files:
    • env.py - Main environment (inherit from BaseEnvironment)
    • state.py - Game state management
    • config.py - Configuration dataclass
    • types.py - Game-specific types
    • rules.py - Rule validation functions
  3. Add configuration YAML in configs/your_game.yaml
  4. Create example script in examples/run_your_game.py
  5. Write tests
  6. Submit PR

See ARCHITECTURE.md for detailed implementation guide.

License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Acknowledgments

This framework builds upon:

  • Secret Hitler - Original game by Mike Boxleiter, Tommy Maranges, Max Temkin
  • AmongAgents - Among Us implementation
  • AvalonBench & Strategist (ICLR 2025) - Avalon with search agents
  • Spyfall - Question-based deduction mechanics
  • Werewolf Arena (Google) - Werewolf implementation
  • Sheriff of Nottingham - Bluffing and negotiation mechanics

For technical architecture details, see ARCHITECTURE.md
