Skip to content

freddiev4/pokeshadowbench

Repository files navigation

PokeShadowBench

Dataset Summary

PokeShadowBench contains silhouette images from the "Who's That Pokémon?" segments from the Pokémon anime series. Each entry includes:

  • A silhouette image of a Pokémon
  • The Pokédex number
  • The name of the Pokémon

The dataset is focused on the Indigo League, and includes 61 Pokemon.

Benchmark Overview

The "Who's That Pokémon?" benchmark evaluates multimodal language models on their ability to recognize Pokémon from silhouette images, replicating the classic challenge from the anime.

How the Benchmark Works

Task

Given a Pokémon silhouette image, predict the correct Pokémon species.

Evaluation Metrics

  • Accuracy: Percentage of correct predictions (1 attempt)

Model Performance

Overall Results

Without Thinking / Reasoning: accuracy chart

With Thinking / Reasoning: accuracy chart with thinking

Individual Results

Without Thinking / Reasoning (click to expand)
Pokemon o4-Mini GPT-4.1 GPT-4o Claude Opus Claude Sonnet 4 Claude 3.7 Sonnet Claude 3.5 Sonnet Gemini 2.5 Pro google/gemini-2.5-flash-preview-05-20
abra
aerodactyl
alakazam
arbok
arcanine
bellsprout
bulbasaur
butterfree
caterpie
charmander
clefairy
cloyster
cubone
diglett
ditto
eevee
exeggcute
farfetchd
fearow
gastly
gengar
geodude
gloom
growlithe
haunter
hitmonchan
horsea
ivysaur
jigglypuff
jynx
kabutops
kangaskhan
koffing
krabby
magikarp
magmar
magnemite
metapod
moltres
mr
nidoran♂
onix
paras
pidgeotto
pikachu
ponyta
primeape
psyduck
raichu
raticate
sandshrew
scyther
seaking
seel
slowbro
snorlax
squirtle
venonat
vileplume
vulpix
wartortle
With Thinking / Reasoning Results (click to expand)
Pokemon o4-Mini GPT-4.1 GPT-4o Claude Opus Claude Sonnet 4 Claude 3.7 Sonnet Claude 3.5 Sonnet Gemini 2.5 Pro google/gemini-2.5-flash-preview-05-20
abra
aerodactyl
alakazam
arbok
arcanine
bellsprout
bulbasaur
butterfree
caterpie
charmander
clefairy
cloyster
cubone
diglett
ditto
eevee
exeggcute
farfetchd
fearow
gastly
gengar
geodude
gloom
growlithe
haunter
hitmonchan
horsea
ivysaur
jigglypuff
jynx
kabutops
kangaskhan
koffing
krabby
magikarp
magmar
magnemite
metapod
moltres
mr
nidoran♂
onix
paras
pidgeotto
pikachu
ponyta
primeape
psyduck
raichu
raticate
sandshrew
scyther
seaking
seel
slowbro
snorlax
squirtle
venonat
vileplume
vulpix
wartortle

Individual Predictions

See raw results here: https://github.com/freddiev4/pokeshadowbench/tree/main/results

Setup

  1. Install dependencies
pip install -r requirements.txt
  1. Set your API keys in your environment
export OPENAI_API_KEY=<your_openai_api_key>
export ANTHROPIC_API_KEY=<your_anthropic_api_key>
export GEMINI_API_KEY=<your_gemini_api_key>

Usage

Basic Usage

python src/evaluate_llms.py

Test Specific Prompts Only (prompts.yaml)

python src/evaluate_llms.py --prompts default indigo_hint think_and_reflect

With Custom Prompts File

python src/evaluate_llms.py --prompts-file my_prompts.yaml

Enable Thinking Models

python src/evaluate_llms.py --with-thinking

Sequential Processing

If you want to test models one at a time, or need to debug a specific model, you can run the script in sequential mode.

python evaluate_llms.py --sequential

YAML Configuration

Edit the prompts.yaml file with your prompt variations:

prompts:
  default:
    name: "Default Who's That Pokemon"
    prompt: "Let's play a game called \"Who's that Pokemon?\". You will be given a silhouette of a Pokemon. Your job is to guess the Pokemon name. Respond with ONLY the Pokemon name, nothing else."

About

How good are LLM's at "Who's that Pokemon?"

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages