LVLM-Playground is a benchmark to evaluate Large Vision Language Models (LVLMs) on game-playing tasks, assessing their perception, reasoning, and decision-making across six classic games. This repository provides tools to run experiments, analyze performance, and visualize results. For further details, please refer to our paper here.
- [2025.Mar] LVLM-Playground is released! 🚀
- [2025.Feb] LVLM-Playground has been accepted to ICLR 2025! 🎉 Check the paper here.
- Clone the Repository:

  ```bash
  git clone https://github.com/xinke-wang/LVLM-Playground.git
  cd LVLM-Playground
  ```
- Set up a Conda Environment:

  ```bash
  conda create -n playground python=3.11 -y
  conda activate playground
  pip install -r requirements.txt
  ```
- Install Stockfish for Chess:

  To run experiments on Chess, you need to install Stockfish.
  - With `sudo` privileges, install Stockfish via your package manager:

    ```bash
    sudo apt-get install stockfish
    ```
  - Alternatively, you can download the latest Stockfish binary from stockfishchess.org.
  - Extract the binary and place the `stockfish` executable in your system PATH or the project directory.
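  After installation, a quick check confirms the binary is discoverable (a minimal sketch using only the Python standard library):

  ```python
  import shutil

  # Verify the stockfish binary is on PATH before running Chess experiments.
  path = shutil.which('stockfish')
  print(path or 'stockfish not found -- install it or add it to PATH')
  ```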
- Download Pre-generated Benchmark Data:

  To facilitate reproducibility of the experiments in our paper, we provide pre-generated benchmark data. You can download it by running the following commands:

  ```bash
  wget https://universityofadelaide.box.com/shared/static/9xx4brpiipqmmyomau2v522frtijx930.zip -O benchmark.zip
  unzip benchmark.zip -d .
  ```
  After unzipping, you should have the following directory structure:

  ```
  LVLM-Playground
  ├── benchmark
  │   ├── perceive
  │   │   ├── chess
  │   │   │   ├── 0000000.jpg
  │   │   │   ├── 0000001.jpg
  │   │   │   ├── ...
  │   │   │   └── annotation.json
  │   │   ├── gomoku
  │   │   ├── minesweeper
  │   │   ├── reversi
  │   │   ├── sudoku
  │   │   └── tictactoe
  │   ├── qa
  │   └── rule
  ```
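  The schema of `annotation.json` is not documented above, so before writing any parsing code it is worth peeking at one entry (a minimal sketch; the chess path comes from the tree above):

  ```python
  import json

  # Load the perception annotations for chess and inspect a single entry
  # to discover the schema before writing any parsing code.
  with open('benchmark/perceive/chess/annotation.json') as f:
      annotations = json.load(f)

  if isinstance(annotations, list):
      print(len(annotations), 'entries; first:', annotations[0])
  else:
      key = next(iter(annotations))
      print(len(annotations), 'keys; sample:', key, '->', annotations[key])
  ```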
- Generating a Custom Benchmark (Optional):

  Alternatively, you can generate a new benchmark dataset by running the following command:

  ```bash
  python generate_benchmark.py
  ```

  You can modify the `configs/base.py` file to customize the benchmark generation process:
  ```python
  # configs/base.py
  benchmark_setting = dict(
      games=['tictactoe', 'gomoku', 'minesweeper', 'reversi', 'sudoku', 'chess'],
      sample_size=2000,
      e2e_round=100,
      offline_task=['perceive', 'qa', 'rule'],
      benchmark_path='benchmark'
  )
  ```
  - Adjust `games` to include or exclude specific games.
  - Modify `sample_size` to control the number of samples per game.
  - Change `benchmark_path` to specify the output directory.
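  For example, a trimmed-down configuration for a quick run might look like this (a sketch; the values are illustrative, not recommendations):

  ```python
  # configs/base.py -- illustrative variant: two games, fewer samples,
  # and a separate output directory
  benchmark_setting = dict(
      games=['tictactoe', 'sudoku'],
      sample_size=500,
      e2e_round=100,
      offline_task=['perceive', 'qa', 'rule'],
      benchmark_path='benchmark_small'
  )
  ```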
Once the data is ready, run experiments using:

```bash
python run.py --exp-recipe configs/recipe/base.py --agent-cfg configs/agents/internvl/internvl2-1b.py
```
`--exp-recipe` specifies the experiment settings, and `--agent-cfg` specifies the agent configuration. If you are using a commercial model (e.g., OpenAI, Google, Anthropic) as the agent, ensure the necessary API keys are set as environment variables (e.g., `OPENAI_API_KEY`, `GOOGLE_API_KEY`). The framework automatically resumes an experiment after unexpected termination, as long as you keep the same experiment name in the experiment recipe config (`configs/recipe/base.py`).
We provide several pre-defined agent configurations in the `configs/agents` directory, including three widely used commercial APIs (Gemini, Claude, and ChatGPT) as well as open-source models supported by LMDeploy. You can find the pre-set configurations in `configs/agents` and modify them to customize the LVLM settings.
You can customize the experiment settings by modifying the configuration file `configs/recipe/base.py`:
```python
# configs/recipe/base.py
name = 'standard'
save_path = 'experiments'
tasks = ['perceive', 'qa', 'rule', 'e2e']
games = ['tictactoe', 'reversi', 'gomoku', 'minesweeper', 'sudoku', 'chess']
```
- Adjust `tasks` to include or exclude specific tasks.
- Modify `games` to specify the games to evaluate.
- Change `save_path` to specify the output directory.
- Set `name` to identify the experiment.
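For instance, a quick smoke-test recipe might restrict the run to the offline perception and QA tasks on a single game (a sketch; the name `smoke_test` is illustrative):

```python
# configs/recipe/base.py -- illustrative quick-run variant
name = 'smoke_test'            # results land under experiments/smoke_test
save_path = 'experiments'
tasks = ['perceive', 'qa']     # skip the slower rule and e2e tasks for speed
games = ['tictactoe']          # a single fast game
```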
Once you have run the experiments, the results will be saved in the `experiments` directory under the name specified in the experiment recipe. You can evaluate the results using:

```bash
python evaluate.py --exp-path experiments/standard/gpt4o.json
```

Evaluation results will be saved by default in the `evaluation_results/` directory.
To visualize the evaluation results, generate a radar chart comparing LVLMs across tasks:

```bash
python plot_radar.py
```

This will automatically create a radar chart (`radar_chart.pdf`) in the current directory, illustrating performance differences.
To compare with the results in our paper, you can download the evaluation files from here and place them in the `evaluation_results` directory. This includes the evaluation results for GPT-4o-240806, Gemini-1.5pro, Claude-3.5-sonnet, Qwen2-vl-7b, DeepSeek-vl-7b, Phi3-vl, LLaVA-1.6-7b, and InternVL2-8b.
To evaluate a customized LVLM, follow these steps:
- Implement Your Model:

  Implement your model by inheriting the `BaseAgent` class in `agents/base_agent.py`, and registering it with `AGENT_REGISTRY`. You may use the following template:

  ```python
  from playground.agents import BaseAgent
  from playground.registry import AGENT_REGISTRY


  @AGENT_REGISTRY.register('custom_single')
  class CustomAgentSingleStep(BaseAgent):
      def __init__(self, agent_cfg):
          super().__init__(agent_cfg)
          # Initialize your model, API, or configuration here
          pass

      def get_decision(self, screenshot_path: str, prompt: str):
          # Implement logic to process the screenshot and prompt, return a decision
          pass
  ```
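  For a quick end-to-end check, the sketch below registers an agent that returns a fixed reply. The name `dummy_single` and the returned string are illustrative only; the reply format each game and task actually expects is defined by the framework's prompts, so replace the placeholder with real model output.

  ```python
  from playground.agents import BaseAgent
  from playground.registry import AGENT_REGISTRY


  @AGENT_REGISTRY.register('dummy_single')  # illustrative name
  class DummyAgent(BaseAgent):
      """Returns a fixed reply; useful only for smoke-testing the pipeline."""

      def __init__(self, agent_cfg):
          super().__init__(agent_cfg)

      def get_decision(self, screenshot_path: str, prompt: str):
          # A real agent would send the screenshot and prompt to an LVLM here.
          # This fixed string is a placeholder, not a valid move for every game.
          return '(1, 1)'
  ```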
- Configure Your Model in `configs/agents`:

  Create or modify a configuration file (e.g., `configs/agents/custom_agent.py`) to define your model's settings. Example:

  ```python
  lmm_agent = dict(
      agent='custom_single',
      model='your_model_name',
      max_tokens=512,
      image_size=512,
      backend_config=None,
      general_config=None,
      name='custom_agent'
  )
  ```
  Ensure the `agent` field matches the registered name in the `AGENT_REGISTRY`. After defining and configuring your model, follow the standard steps to run experiments (`python run.py`), evaluate results (`python evaluate.py`), and visualize performance (`python plot_radar.py`).
We acknowledge the authors of the following repositories for providing the game UIs and search-based AI implementations:
If you have any questions or suggestions, please feel free to open an issue or contact us via email: Xinyu Wang (xinyu.wang02@adelaide.edu.au).
If you find this repository useful for your research, please consider citing our paper:
```bibtex
@inproceedings{wang2025large,
    title={Are Large Vision Language Models Good Game Players?},
    author={Wang, Xinyu and Zhuang, Bohan and Wu, Qi},
    booktitle={International Conference on Learning Representations},
    year={2025}
}
```