
👾 Are Large Vision Language Models Good Game Players?

🎮 The University of Adelaide 🕹️ Zhejiang University
License: MIT

LVLM-Playground is a benchmark to evaluate Large Vision Language Models (LVLMs) on game-playing tasks, assessing their perception, reasoning, and decision-making across six classic games. This repository provides tools to run experiments, analyze performance, and visualize results. For further details, please refer to our paper here.

News

[2025.Mar] LVLM-Playground is released! 🚀
[2025.Feb] LVLM-Playground has been accepted to ICLR 2025! 🎉 Check the paper here.

Installation

  1. Clone the Repository:

    git clone https://github.com/xinke-wang/LVLM-Playground.git
    cd LVLM-Playground
  2. Setup a Conda Environment

    conda create -n playground python=3.11 -y
    conda activate playground
    pip install -r requirements.txt
  3. Install Stockfish for Chess

    To run experiments on Chess, you need to install Stockfish.

    • With sudo privileges, install Stockfish via your package manager:

      sudo apt-get install stockfish
    • Alternatively, you can download the latest Stockfish binary from stockfishchess.org.

    • Extract the binary and place the stockfish executable in your system PATH or the project directory.
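
Either way, it is worth confirming that the engine is reachable before running the Chess experiments. A minimal check (a sketch, not part of the repository) is:

import shutil
import subprocess

# Locate the stockfish binary on PATH, falling back to a copy in the project directory.
stockfish = shutil.which('stockfish') or './stockfish'

# Send the standard UCI handshake and print the engine banner.
result = subprocess.run([stockfish], input='uci\nquit\n',
                        capture_output=True, text=True, timeout=10)
print(result.stdout.splitlines()[0])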

Data Preparation

  1. Download Pre-generated Benchmark Data:

To facilitate reproducibility of the experiments in our paper, we provide pre-generated benchmark data. You can download it by running the following commands:

wget https://universityofadelaide.box.com/shared/static/9xx4brpiipqmmyomau2v522frtijx930.zip -O benchmark.zip
unzip benchmark.zip -d .

After unzipping, you should have the following directory structure:

LVLM-Playground
├── benchmark
│   ├── perceive
│   │   ├── chess
│   │   │   ├── 0000000.jpg
│   │   │   ├── 0000001.jpg
│   │   │   ├── ...
│   │   │   └── annotation.json
│   │   ├── gomoku
│   │   ├── minesweeper
│   │   ├── reversi
│   │   ├── sudoku
│   │   └── tictactoe
│   ├── qa
│   └── rule
  2. Generating a Custom Benchmark (Optional):

Alternatively, you can generate a new benchmark dataset by running the following command:

python generate_benchmark.py

You can modify the configs/base.py file to customize the benchmark generation process.

# configs/base.py
benchmark_setting = dict(
   games=['tictactoe', 'gomoku', 'minesweeper', 'reversi', 'sudoku', 'chess'],
   sample_size=2000,
   e2e_round=100,
   offline_task=['perceive', 'qa', 'rule'],
   benchmark_path='benchmark'
)
  • Adjust games to include or exclude specific games.
  • Modify sample_size to control the number of samples per game.
  • Change benchmark_path to specify the output directory.
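
For instance, a quick smoke-test configuration that only generates the Tic-Tac-Toe benchmark with fewer samples could look like the following (the keys are those of configs/base.py above; the values are illustrative):

# configs/base.py -- illustrative smoke-test values
benchmark_setting = dict(
   games=['tictactoe'],                      # generate data for a single game only
   sample_size=200,                          # fewer samples per game for a quick run
   e2e_round=100,
   offline_task=['perceive', 'qa', 'rule'],
   benchmark_path='benchmark_small'          # write to a separate output directory
)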

Running Experiments

Once the data is ready, run experiments using:

python run.py --exp-recipe configs/recipe/base.py --agent-cfg configs/agents/internvl/internvl2-1b.py

--exp-recipe specifies the experiment settings, and --agent-cfg specifies the agent configuration. If you are using a commercial model (e.g., OpenAI, Google, Anthropic) as the agent, ensure the necessary API keys are set as environment variables (e.g., OPENAI_API_KEY, GOOGLE_API_KEY). The framework can automatically resume an experiment after unexpected termination, as long as you keep the same experiment name in the experiment recipe config (configs/recipe/base.py).
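
Before launching a run with a commercial API agent, you can quickly confirm that the relevant keys are visible to the process (the Anthropic key name below follows the provider's usual convention; check the agent config you use):

import os

# Print whether each provider key is set in the current environment.
for key in ('OPENAI_API_KEY', 'GOOGLE_API_KEY', 'ANTHROPIC_API_KEY'):
    print(f'{key}: {"set" if os.environ.get(key) else "missing"}')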

We provide several pre-defined agent configurations in the configs/agents directory, including three widely used commercial APIs (Gemini, Claude, and ChatGPT) as well as open-source models supported by LMDeploy. You can modify these pre-set configurations to customize the LVLM settings.

You can customize the experiment settings by modifying the configuration file configs/recipe/base.py.

# configs/recipe/base.py
name = 'standard'
save_path = 'experiments'
tasks = ['perceive', 'qa', 'rule', 'e2e']
games = ['tictactoe', 'reversi', 'gomoku', 'minesweeper', 'sudoku', 'chess']
  • Adjust tasks to include or exclude specific tasks.
  • Modify games to specify the games to evaluate.
  • Change save_path to specify the output directory.
  • Set name to identify the experiment.
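
For example, a recipe that evaluates only the perception task on two games could look like the following (illustrative values; keys as in configs/recipe/base.py above). Pass it to run.py via --exp-recipe configs/recipe/perceive_only.py.

# configs/recipe/perceive_only.py -- illustrative, not shipped with the repository
name = 'perceive_only'
save_path = 'experiments'
tasks = ['perceive']
games = ['tictactoe', 'sudoku']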

Evaluating Results

Once you have run the experiments, the results will be saved in the experiments directory under the name specified in the experiment recipe. You can evaluate the results using:

python evaluate.py --exp-path experiments/standard/gpt4o.json

Evaluation results will be saved by default in the evaluation_results/ directory.
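
If an experiment produced results for several agents, a small wrapper can evaluate them all in one pass (a sketch; it assumes one JSON result file per agent under experiments/<name>/, as in the example above):

import glob
import subprocess

# Evaluate every result file produced by the 'standard' experiment.
for result in sorted(glob.glob('experiments/standard/*.json')):
    subprocess.run(['python', 'evaluate.py', '--exp-path', result], check=True)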

Visualizing Results

To visualize the evaluation results, generate a radar chart comparing LVLMs across tasks:

python plot_radar.py

This will automatically create a radar chart (radar_chart.pdf) in the current directory, illustrating performance differences.

To compare with the results in our paper, you can download the evaluation files from here and place them in the evaluation_results directory. This includes the evaluation results for GPT-4o-240806, Gemini-1.5pro, Claude-3.5-sonnet, Qwen2-vl-7b, DeepSeek-vl-7b, Phi3-vl, LLaVA-1.6-7b, and InternVL2-8b.

Evaluating Customized Models

To evaluate a customized LVLM, follow these steps:

  1. Implement Your Model:

    Implement your model by inheriting the BaseAgent class in agents/base_agent.py and registering it with AGENT_REGISTRY. You may use the following template (a filled-in example sketch is provided after these steps):

    from playground.agents import BaseAgent
    from playground.registry import AGENT_REGISTRY
    
    @AGENT_REGISTRY.register('custom_single')
    class CustomAgentSingleStep(BaseAgent):
    
       def __init__(self, agent_cfg):
          super().__init__(agent_cfg)
          # Initialize your model, API, or configuration here
          pass
    
       def get_decision(self, screenshot_path: str, prompt: str):
          # Implement logic to process the screenshot and prompt, return a decision
          pass
  2. Configure Your Model in configs/agents:

    Create or modify a configuration file (e.g., configs/agents/custom_agent.py) to define your model’s settings. Example:

    lmm_agent = dict(
       agent='custom_single',
       model='your_model_name',
       max_tokens=512,
       image_size=512,
       backend_config=None,
       general_config=None,
       name='custom_agent'
    )

    Ensure the agent field matches the registered name in the AGENT_REGISTRY. After defining and configuring your model, follow the standard steps to run experiments (python run.py), evaluate results (python evaluate.py), and visualize performance (python plot_radar.py).
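
For reference, a minimal filled-in version of the template might look like the sketch below. It assumes an OpenAI-compatible chat endpoint and sends the screenshot as a base64-encoded image; the client, the way agent_cfg fields are read, and the expected return format are all illustrative and should be adapted to your model (compare with the built-in agents for the exact interfaces).

import base64

from openai import OpenAI  # illustrative client choice; any OpenAI-compatible endpoint works

from playground.agents import BaseAgent
from playground.registry import AGENT_REGISTRY


@AGENT_REGISTRY.register('custom_single')
class CustomAgentSingleStep(BaseAgent):

    def __init__(self, agent_cfg):
        super().__init__(agent_cfg)
        # Adjust field access if agent_cfg is not a plain dict in your setup.
        self.model = agent_cfg.get('model', 'your_model_name')
        self.max_tokens = agent_cfg.get('max_tokens', 512)
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def get_decision(self, screenshot_path: str, prompt: str):
        # Encode the screenshot and ask the model for a move given the task prompt.
        with open(screenshot_path, 'rb') as f:
            image_b64 = base64.b64encode(f.read()).decode('utf-8')
        response = self.client.chat.completions.create(
            model=self.model,
            max_tokens=self.max_tokens,
            messages=[{
                'role': 'user',
                'content': [
                    {'type': 'text', 'text': prompt},
                    {'type': 'image_url',
                     'image_url': {'url': f'data:image/jpeg;base64,{image_b64}'}},
                ],
            }],
        )
        # Return the raw text reply; check the built-in agents for the exact
        # format the framework expects when parsing decisions.
        return response.choices[0].message.content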

Acknowledgements

We acknowledge the authors of the following repositories for providing the game UIs and search-based AI implementations:

Contact

If you have any questions or suggestions, please feel free to open an issue or contact Xinyu Wang via email at xinyu.wang02@adelaide.edu.au.

Citation

If you find this repository useful for your research, please consider citing our paper:

@inproceedings{wang2025large,
  title={Are Large Vision Language Models Good Game Players?},
  author={Wang, Xinyu and Zhuang, Bohan and Wu, Qi},
  booktitle={International Conference on Learning Representations},
  year={2025}
}
