
Orak 🎮


Orak (오락) is a foundational benchmark for evaluating Large Language Model (LLM) agents in diverse popular video games. Please check out our paper and the leaderboard for more details!

*The name Orak comes from 오락 (orak), the native Korean word meaning "game".

by Dongmin Park^1*, Minkyu Kim^1*, Beongjun Choi^1*, Junhyuck Kim^1, Keon Lee^1, Jonghyun Lee^1, Inkyu Park^1, Byeong-Uk Lee^1, Jaeyoung Hwang^1, Jaewoo Ahn^1,2, Ameya S. Mahabaleshwarkar^3, Bilal Kartal^3, Pritam Biswas^3, Yoshi Suhara^3, Kangwook Lee^1,4, Jaewoong Cho^1.

^1 KRAFTON AI, ^2 Seoul National University, ^3 NVIDIA, ^4 University of Wisconsin-Madison


Table of Contents

  1. Features
  2. Project Structure
  3. Installation
  4. Evaluation
  5. Agent Module Study
  6. Bonus: Freeform Gameplay with Claude
  7. Submission Guideline

Features

  • Cover most game genres with 12 popular titles — see full game list
  • Enable plug-and-play studies of agentic modules via the Model Context Protocol (MCP) interface
  • Support analysis of both LLMs and VLMs on textual and visual game states
  • Easily integrate new environments, models, and custom agents with a config-driven setup — see script customization

Project Structure

Game List

| Action | Adventure | RPG | Simulation | Strategy | Puzzle |
| --- | --- | --- | --- | --- | --- |
| Street Fighter III | Ace Attorney | Pokémon Red | Minecraft | StarCraft II | Baba Is You |
| Super Mario | Her Story | Darkest Dungeon | Stardew Valley | Slay the Spire | 2048 |

Core Modules

MCP structure description
  • mcp_agent_client/: Manages interaction between the agent modules in mcp_agent_servers and the game environment.
    • Defines an MCP client responsible for managing connections between agent servers and game servers.
    • Implements API functions for multiple LLM instances (e.g., OpenAI's GPT-4o, Meta's Llama-3.2-1B-Instruct).
    • Provides game-independent play logic with a main configuration for managing hyperparameters related to game execution.
  • mcp_agent_servers/: Implementations of LLM/SLM-based gaming agents.
    • Implements servers (MCP tools) that communicate with the mcp_agent_client to support plug-and-play agentic modules.
    • Defines prompts for each game and agent.
    • Compatible with platform-provided client LLMs (e.g., Claude Desktop) without relying on the APIs defined in mcp_agent_client.
  • mcp_game_servers/: Collection of supported game environments.
    • Implements servers (MCP tools) that communicate with the mcp_agent_client to deliver and update game states (see the minimal server sketch after this list).
    • Defines environment implementations for each supported game.
    • Compatible with platform-provided client LLMs (e.g., Claude Desktop) without relying on the APIs defined in mcp_agent_client.
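
To make the server side concrete, below is a minimal, hypothetical sketch of an Orak-style game server exposing MCP tools via the official MCP Python SDK (FastMCP). The server name, tool names, and state format are illustrative assumptions, not Orak's actual interface.

```python
# Minimal sketch (assumed interface, not Orak's actual code): a game server
# exposing two MCP tools, one to read the game state and one to apply an action.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("example-game-server")  # hypothetical server name

@mcp.tool()
def get_game_state() -> str:
    """Return the current textual game state for the agent client."""
    return "stage=1; hp=100; position=(0, 0)"  # placeholder state

@mcp.tool()
def send_action(action: str) -> str:
    """Apply an agent-chosen action to the running game and report the outcome."""
    # A real server would forward this to the game process and read back the result.
    return f"applied: {action}"

if __name__ == "__main__":
    mcp.run()  # serves over stdio, so any MCP client (including Claude Desktop) can connect
```

An MCP client such as mcp_agent_client, or a platform-provided client like Claude Desktop, can then discover and call these tools during play.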

Installation

1. Game Setup

Each game must be set up individually following the instructions in docs/setup_{game}.md. Note that six games (Ace Attorney, Her Story, Darkest Dungeon, Stardew Valley, Slay the Spire, and Baba Is You) require a one-time purchase, typically priced between $9.99 and $24.99. The other six games are free to play.

2. Python Environment

We support both an MCP script (based on a uv environment) and Python scripts (based on conda environments). Both invoke the same game environment and produce identical gameplay results.

MCP version

uv installation

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex" # windows
curl -LsSf https://astral.sh/uv/install.sh | sh # macos/linux

virtual environment creation

uv venv --python 3.10
.venv\Scripts\activate # windows
source .venv/bin/activate # macos/linux
uv pip install -e .

default for most games

uv pip install -r requirements/base.txt

for some games with extra dependencies (e.g., supermario)

uv pip install -r requirements/base.txt
uv pip install -r requirements/{game}.txt
Python script version

default for most games

conda create -n orak python=3.10
conda activate orak
pip install -r requirements/base.txt

for some games with extra dependencies (e.g., supermario)

conda create -n orak python=3.10
conda activate orak
pip install -r requirements/base.txt
pip install -r requirements/{game}.txt

3. API Key Setup

To use commercial API-based LLMs (from OpenAI, Anthropic, Google, or DeepSeek), create a key file under src/mcp_agent_servers/keys/ as follows:

API key details
  • OpenAI
    • create src/mcp_agent_servers/keys/openai-key/key.env and add your API key (as a plain-text string starting with sk-***)
  • Anthropic
    • create src/mcp_agent_servers/keys/anthropic-key/key.env and add your API key (as a plain-text string starting with sk-ant-***)
  • Google (the current version targets Vertex AI)
    • create src/mcp_agent_servers/keys/google-key/gemini_gcp.json and add your GCP JSON service account key
  • DeepSeek
    • create src/mcp_agent_servers/keys/deepseek-key/key.env and add your API key (as a plain-text string starting with sk-***)
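
If you prefer to script this setup, a small helper along the following lines creates the layout described above. The helper itself and the OPENAI_API_KEY environment variable are our own illustrative assumptions, not part of Orak:

```python
# Illustrative helper (not part of Orak): writes the OpenAI key file in the
# layout described above. Sourcing the key from OPENAI_API_KEY is an assumption.
import os
from pathlib import Path

key_dir = Path("src/mcp_agent_servers/keys/openai-key")
key_dir.mkdir(parents=True, exist_ok=True)
(key_dir / "key.env").write_text(os.environ["OPENAI_API_KEY"])
print(f"wrote {key_dir / 'key.env'}")
```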

Evaluation

Leaderboard (Single-player)

  1. Setup the Game Environment
    Follow the Game Environment Setup Guide to configure the required environment.

  2. Launch the Game
    Ensure the game is running and ready for interaction. Some games (e.g., supermario and 2048) are launched automatically. Note: some games require minor manual setup after launch, so please check the corresponding setup file in docs/setup_{game}.md before running.

  3. Run the Gaming Agent

  • MCP version
    bash scripts/leaderboard/mcp/{game}.sh
  • python script version
    bash scripts/leaderboard/python/{game}.sh
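
Under the hood, these scripts drive the game-independent play logic described in Core Modules above. The following is a heavily hedged sketch of that loop; the classes, method names, and termination condition are all assumptions for illustration, with real calls going over MCP rather than direct method calls:

```python
# Hypothetical sketch of the client-side play loop (not Orak's actual API).
# GameServer and Agent are stand-in stubs for the MCP game server and agent module.
class GameServer:
    def get_game_state(self) -> str:
        return "stage=1; hp=100"
    def send_action(self, action: str) -> str:
        return f"applied: {action}"

class Agent:
    def choose_action(self, state: str) -> str:
        return "move_right"  # a real agent module would query an LLM here

def play(server: GameServer, agent: Agent, max_steps: int = 10) -> None:
    for step in range(max_steps):
        state = server.get_game_state()      # read the current game state
        action = agent.choose_action(state)  # LLM/agent module picks an action
        print(step, server.send_action(action))

play(GameServer(), Agent())
```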

Battle Arena (Two-players)

  • MCP version
    bash scripts/arena/mcp/{game}.sh
  • python script version
    bash scripts/arena/python/{game}.sh

We describe the evaluation metrics for each game in docs/eval_metrics.md.

Run Your Custom Script

You can easily customize the run script by specifying <Game, LLM, Agent Module, Input Type>. This enables studies of agentic strategies and input state types across all games — see the available agent list.

  • MCP version

    uv run ./scripts/mcp_play_game.py \
       --config ./src/mcp_agent_client/configs/{game}/config.yaml \
          env.input_modality={input_modality} \
          agent.llm_name={model} \
          agent.agent_type={agent} \
          agent.prompt_path=mcp_agent_servers.{game}.prompts.{input_modality}.{agent}

    Replace {game}, {model}, {agent}, and {input_modality} with the names of the ones you want to run. You can also customize the configuration by changing <Game, LLM, Agent Module, Input Type> in ./src/mcp_agent_client/configs/{game}/config.yaml — see config details.

  • Python script version

    python scripts/play_game.py --config {config_path} \
       env.input_modality={input_modality} \
       agent.llm_name={model} \
       agent.agent_type={agent} \
       agent.prompt_path=mcp_agent_servers.{game}.prompts.{input_modality}.{agent}

    Replace {game}, {model}, {agent}, and {input_modality} with the names of the ones you want to run. You can also customize the configuration by changing <Game, LLM, Agent Module, Input Type> in ./src/mcp_agent_client/configs/{game}/config.yaml — see config details. A sketch of how these dotted overrides compose is shown below.
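
The override syntax above matches the OmegaConf dotlist style. That Orak actually parses overrides this way is an assumption on our part, suggested only by the syntax, and the field values below are illustrative:

```python
# Hedged sketch: illustrates OmegaConf-style dotlist overrides. That Orak uses
# OmegaConf, and all field values below, are assumptions for illustration only.
from omegaconf import OmegaConf

base = OmegaConf.create({
    "env": {"input_modality": "text"},
    "agent": {"llm_name": "gpt-4o", "agent_type": "zeroshot"},
})

overrides = OmegaConf.from_dotlist([
    "env.input_modality=image",
    "agent.agent_type=reflection",
])

cfg = OmegaConf.merge(base, overrides)
print(cfg.agent.agent_type)  # -> reflection
```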

Bonus: Freeform Gameplay with Claude via MCP

Our MCP interface supports fully free-form, open-ended gameplay beyond standard evaluation with static agentic strategies. The LLM can decide when and how to use different tools and prompts during gameplay. For example, you can simply prompt Claude with "Play the {game} by yourself. {some instructions to use the mcp tools}", which allows Claude to take full control of gameplay decisions and tool usage. Below are video examples of Claude actively playing Ace Attorney and Baba Is You — see the Claude gameplay guideline for more details.


Claude playing Ace Attorney

Claude playing Baba Is You

Submission Guideline

You can submit your own LLM backbones and agentic strategies to our repo; please check out the guideline in docs/submission_guidline.md. We also welcome contributions that add new games: please open a PR or reach out to dongmin.park@krafton.com, and we will credit your contribution in README.md.

Citation

@article{park2025orak,
  title     = {Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games},
  author    = {Park, Dongmin and Kim, Minkyu and Choi, Beongjun and Kim, Junhyuck and Lee, Keon and Lee, Jonghyun and Park, Inkyu and Lee, Byeong-Uk and Hwang, Jaeyoung and Ahn, Jaewoo and Mahabaleshwarkar, Ameya S. and Kartal, Bilal and Biswas, Pritam and Suhara, Yoshi and Lee, Kangwook and Cho, Jaewoong},
  year      = {2025},
  eprint    = {2506.03610},
  archivePrefix = {arXiv},
  note      = {arXiv:2506.03610}
}
