PokeShadowBench

Dataset Summary

PokeShadowBench contains silhouette images from the "Who's That Pokémon?" segments from the Pokémon anime series. Each entry includes:

A silhouette image of a Pokémon
The Pokédex number
The name of the Pokémon

The dataset is focused on the Indigo League, and includes 61 Pokemon.

Benchmark Overview

The "Who's That Pokémon?" benchmark evaluates multimodal language models on their ability to recognize Pokémon from silhouette images, replicating the classic challenge from the anime.

How the Benchmark Works

Task

Given a Pokémon silhouette image, predict the correct Pokémon species.

Evaluation Metrics

Accuracy: Percentage of correct predictions (1 attempt)

Model Performance

Overall Results

Without Thinking / Reasoning:

With Thinking / Reasoning:

Individual Results

Without Thinking / Reasoning (click to expand)

Pokemon	o4-Mini	GPT-4.1	GPT-4o	Claude Opus	Claude Sonnet 4	Claude 3.7 Sonnet	Claude 3.5 Sonnet	Gemini 2.5 Pro	google/gemini-2.5-flash-preview-05-20
abra	❌	❌	❌	❌	❌	❌	❌	❌	❌
aerodactyl	❌	❌	❌	❌	❌	✅	✅	❌	✅
alakazam	❌	❌	❌	✅	✅	✅	❌	❌	❌
arbok	❌	❌	❌	❌	❌	❌	❌	❌	❌
arcanine	❌	❌	❌	❌	❌	❌	❌	✅	✅
bellsprout	❌	✅	❌	❌	❌	❌	❌	❌	✅
bulbasaur	✅	✅	✅	❌	❌	❌	✅	✅	❌
butterfree	✅	✅	✅	✅	✅	✅	✅	✅	✅
caterpie	❌	❌	✅	❌	❌	❌	❌	❌	❌
charmander	✅	✅	✅	❌	✅	✅	✅	✅	✅
clefairy	❌	❌	❌	❌	❌	❌	❌	❌	❌
cloyster	❌	❌	❌	✅	❌	❌	❌	❌	❌
cubone	❌	❌	✅	✅	❌	❌	❌	❌	✅
diglett	❌	❌	❌	✅	❌	✅	❌	❌	❌
ditto	✅	✅	✅	✅	✅	✅	✅	✅	✅
eevee	✅	✅	✅	❌	❌	❌	❌	✅	✅
exeggcute	✅	❌	❌	❌	❌	❌	❌	❌	❌
farfetchd	❌	❌	❌	❌	❌	❌	❌	❌	❌
fearow	❌	❌	❌	❌	❌	❌	❌	❌	❌
gastly	❌	❌	❌	❌	❌	❌	❌	❌	❌
gengar	❌	❌	❌	❌	✅	✅	❌	✅	✅
geodude	✅	✅	✅	✅	❌	❌	❌	✅	❌
gloom	✅	❌	❌	❌	❌	❌	❌	❌	❌
growlithe	✅	✅	✅	❌	✅	❌	❌	✅	❌
haunter	✅	❌	❌	✅	❌	❌	❌	✅	❌
hitmonchan	❌	❌	❌	❌	❌	❌	❌	❌	❌
horsea	✅	✅	✅	❌	❌	❌	❌	✅	✅
ivysaur	❌	❌	❌	✅	❌	❌	✅	❌	❌
jigglypuff	✅	✅	✅	✅	✅	✅	✅	❌	✅
jynx	✅	✅	✅	❌	❌	❌	❌	❌	❌
kabutops	✅	❌	❌	❌	❌	✅	✅	❌	❌
kangaskhan	❌	❌	❌	❌	❌	❌	❌	❌	❌
koffing	✅	❌	❌	✅	❌	❌	❌	✅	✅
krabby	❌	❌	✅	✅	❌	✅	✅	✅	✅
magikarp	✅	❌	❌	❌	❌	❌	❌	✅	✅
magmar	❌	❌	❌	❌	❌	❌	❌	❌	❌
magnemite	✅	✅	✅	❌	❌	❌	✅	❌	❌
metapod	❌	❌	❌	❌	❌	❌	❌	❌	❌
moltres	❌	❌	❌	❌	❌	❌	❌	❌	❌
mr	❌	❌	❌	❌	❌	❌	❌	❌	❌
nidoran♂	❌	❌	❌	❌	❌	❌	❌	❌	❌
onix	✅	✅	❌	❌	❌	✅	✅	✅	✅
paras	✅	✅	❌	❌	❌	❌	❌	❌	❌
pidgeotto	❌	✅	❌	❌	❌	❌	❌	❌	✅
pikachu	✅	✅	✅	✅	✅	✅	✅	✅	✅
ponyta	✅	❌	✅	✅	❌	❌	❌	✅	❌
primeape	❌	❌	❌	❌	❌	❌	❌	❌	❌
psyduck	✅	✅	✅	✅	✅	❌	❌	✅	✅
raichu	❌	✅	✅	✅	❌	✅	✅	✅	✅
raticate	❌	✅	✅	✅	❌	❌	❌	✅	❌
sandshrew	❌	✅	❌	❌	❌	❌	❌	❌	❌
scyther	❌	❌	❌	❌	❌	✅	❌	✅	❌
seaking	❌	❌	❌	❌	❌	❌	❌	❌	❌
seel	❌	❌	❌	❌	❌	❌	✅	❌	✅
slowbro	❌	❌	✅	❌	❌	❌	❌	❌	❌
snorlax	✅	✅	✅	✅	✅	❌	❌	✅	✅
squirtle	✅	✅	✅	✅	✅	✅	✅	✅	✅
venonat	✅	✅	✅	✅	✅	✅	✅	❌	❌
vileplume	❌	❌	❌	❌	✅	❌	❌	❌	❌
vulpix	❌	✅	✅	❌	❌	❌	❌	✅	❌
wartortle	❌	❌	❌	❌	❌	❌	❌	❌	❌

With Thinking / Reasoning Results (click to expand)

Pokemon	o4-Mini	GPT-4.1	GPT-4o	Claude Opus	Claude Sonnet 4	Claude 3.7 Sonnet	Claude 3.5 Sonnet	Gemini 2.5 Pro	google/gemini-2.5-flash-preview-05-20
abra	❌	❌	❌	❌	❌	❌	❌	❌	❌
aerodactyl	❌	❌	❌	❌	❌	❌	❌	❌	✅
alakazam	❌	❌	❌	✅	❌	❌	❌	❌	❌
arbok	❌	❌	❌	❌	❌	❌	❌	❌	❌
arcanine	❌	✅	❌	❌	❌	❌	❌	✅	✅
bellsprout	✅	✅	✅	❌	❌	❌	❌	❌	❌
bulbasaur	❌	✅	❌	✅	❌	✅	✅	✅	✅
butterfree	✅	✅	✅	✅	✅	✅	✅	✅	✅
caterpie	❌	❌	✅	❌	❌	❌	❌	❌	❌
charmander	✅	✅	✅	✅	✅	✅	✅	✅	✅
clefairy	❌	❌	❌	❌	❌	❌	❌	❌	❌
cloyster	✅	❌	❌	❌	❌	❌	❌	❌	❌
cubone	❌	❌	❌	✅	❌	❌	❌	❌	✅
diglett	❌	❌	❌	❌	❌	❌	❌	❌	❌
ditto	✅	✅	✅	✅	✅	✅	✅	✅	✅
eevee	✅	✅	✅	❌	✅	❌	❌	✅	✅
exeggcute	✅	❌	❌	❌	❌	❌	❌	❌	❌
farfetchd	❌	❌	❌	❌	❌	❌	❌	❌	❌
fearow	❌	❌	❌	❌	❌	❌	❌	❌	❌
gastly	❌	❌	❌	❌	❌	❌	❌	❌	❌
gengar	✅	❌	❌	❌	✅	✅	✅	✅	❌
geodude	✅	✅	✅	❌	❌	❌	❌	✅	✅
gloom	❌	❌	❌	❌	❌	❌	❌	❌	❌
growlithe	✅	✅	✅	❌	✅	❌	❌	❌	✅
haunter	✅	❌	❌	❌	❌	❌	❌	❌	❌
hitmonchan	❌	❌	❌	❌	❌	❌	❌	❌	❌
horsea	✅	✅	✅	✅	❌	❌	❌	❌	✅
ivysaur	❌	❌	❌	❌	❌	❌	❌	❌	❌
jigglypuff	✅	✅	✅	✅	✅	✅	✅	❌	✅
jynx	❌	✅	✅	❌	❌	❌	❌	❌	✅
kabutops	✅	❌	❌	❌	❌	✅	✅	❌	❌
kangaskhan	❌	❌	❌	❌	❌	❌	❌	❌	❌
koffing	✅	❌	❌	✅	❌	❌	❌	✅	❌
krabby	✅	❌	✅	✅	❌	✅	✅	✅	❌
magikarp	❌	❌	❌	❌	❌	❌	❌	✅	✅
magmar	❌	❌	❌	❌	❌	❌	❌	❌	❌
magnemite	✅	✅	✅	❌	❌	❌	✅	❌	❌
metapod	❌	❌	❌	❌	❌	❌	❌	❌	❌
moltres	✅	❌	❌	❌	❌	❌	❌	❌	❌
mr	❌	❌	❌	❌	❌	❌	❌	❌	❌
nidoran♂	❌	❌	❌	❌	❌	❌	❌	❌	❌
onix	✅	✅	❌	❌	❌	✅	✅	✅	✅
paras	❌	✅	❌	❌	❌	❌	❌	❌	❌
pidgeotto	❌	❌	❌	❌	❌	❌	❌	❌	❌
pikachu	✅	✅	✅	✅	✅	✅	✅	✅	✅
ponyta	❌	❌	✅	❌	❌	✅	❌	✅	❌
primeape	❌	❌	❌	❌	❌	❌	❌	✅	✅
psyduck	✅	✅	✅	✅	✅	❌	❌	✅	✅
raichu	❌	✅	✅	❌	❌	✅	✅	✅	✅
raticate	✅	✅	✅	✅	❌	❌	❌	✅	✅
sandshrew	❌	✅	❌	❌	❌	❌	❌	❌	❌
scyther	❌	❌	❌	❌	❌	✅	❌	✅	❌
seaking	❌	❌	❌	❌	❌	❌	❌	❌	❌
seel	❌	❌	❌	❌	❌	❌	✅	❌	❌
slowbro	✅	❌	✅	❌	❌	❌	❌	❌	❌
snorlax	❌	✅	✅	❌	✅	❌	✅	✅	✅
squirtle	✅	✅	✅	✅	✅	✅	✅	❌	✅
venonat	✅	✅	✅	✅	✅	✅	✅	❌	❌
vileplume	❌	❌	❌	❌	❌	❌	❌	❌	❌
vulpix	❌	✅	✅	❌	❌	❌	❌	✅	❌
wartortle	❌	❌	❌	❌	❌	❌	❌	❌	❌

Individual Predictions

See raw results here: https://github.com/freddiev4/pokeshadowbench/tree/main/results

Setup

Install dependencies

pip install -r requirements.txt

Set your API keys in your environment

export OPENAI_API_KEY=<your_openai_api_key>
export ANTHROPIC_API_KEY=<your_anthropic_api_key>
export GEMINI_API_KEY=<your_gemini_api_key>

Usage

Basic Usage

python src/evaluate_llms.py

Test Specific Prompts Only (`prompts.yaml`)

python src/evaluate_llms.py --prompts default indigo_hint think_and_reflect

With Custom Prompts File

python src/evaluate_llms.py --prompts-file my_prompts.yaml

Enable Thinking Models

python src/evaluate_llms.py --with-thinking

Sequential Processing

If you want to test models one at a time, or need to debug a specific model, you can run the script in sequential mode.

python evaluate_llms.py --sequential

YAML Configuration

Edit the prompts.yaml file with your prompt variations:

prompts:
  default:
    name: "Default Who's That Pokemon"
    prompt: "Let's play a game called \"Who's that Pokemon?\". You will be given a silhouette of a Pokemon. Your job is to guess the Pokemon name. Respond with ONLY the Pokemon name, nothing else."

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
assets		assets
results		results
scripts		scripts
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
analyze_pokemon_accuracy.py		analyze_pokemon_accuracy.py
generate_graph.py		generate_graph.py
generate_table.py		generate_table.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

PokeShadowBench

Dataset Summary

Benchmark Overview

How the Benchmark Works

Task

Evaluation Metrics

Model Performance

Overall Results

Individual Results

Individual Predictions

Setup

Usage

Basic Usage

Test Specific Prompts Only (`prompts.yaml`)

With Custom Prompts File

Enable Thinking Models

Sequential Processing

YAML Configuration

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

freddiev4/pokeshadowbench

Folders and files

Latest commit

History

Repository files navigation

PokeShadowBench

Dataset Summary

Benchmark Overview

How the Benchmark Works

Task

Evaluation Metrics

Model Performance

Overall Results

Individual Results

Individual Predictions

Setup

Usage

Basic Usage

Test Specific Prompts Only (prompts.yaml)

With Custom Prompts File

Enable Thinking Models

Sequential Processing

YAML Configuration

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Test Specific Prompts Only (`prompts.yaml`)

Packages