Reasoning about Physical Transformation in Number · Length · Volume · Size
This repository hosts a large-scale benchmark that probes whether current vision-language models (VLMs) understand quantity invariance—the Piagetian concept of conservation—across four physical domains:
| Domain | Example transformation | Also referred to as…* |
|---|---|---|
| Number | spreading coins | — |
| Length | sliding / rotating straws | — |
| Volume | pouring liquid | Liquid |
| Size | reshaping clay | Solid (legacy name) |
*Early drafts used “Solid” for the Size domain; both labels refer to the same domain in the raw CSV and are unified here as Size.
The benchmark contains 192 short videos (4 domains × 48 variants) evaluated under a full-factorial design of
3 frame counts × 3 frame-selection methods × 4 prompt styles (36 conditions per video), for a total of 192 × 36 = 6 912 multimodal QA items.
We report results for 36 models in total (34 publicly released + 2 proprietary).
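As a sanity check on those counts, the sketch below enumerates the factorial design; the level names are taken from the factor table further down, and the loop is purely illustrative rather than the actual item-generation code.

```python
from itertools import product

videos = 4 * 48                                   # 192 source videos (4 domains × 48 variants)
frame_counts = [3, 8, 16]
frame_selection = ["Uniform", "Human", "SeViLA"]
prompt_styles = ["Direct", "Sequential", "CoT", "Continuous"]

# Every video is paired with every (frame count, selection method, prompt style) combination.
conditions = list(product(frame_counts, frame_selection, prompt_styles))
print(len(conditions))            # 36 conditions per video
print(videos * len(conditions))   # 6912 multimodal QA items
```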
Repository layout:

```text
Conservation/
├─ Code/                        # Re-run analyses & figures
│  ├─ boxplot_frames.py
│  ├─ boxplot_prompts.py
│  ├─ export_model_tables.py
│  └─ ...
├─ Data/
│  ├─ Raw_Data/
│  │  └─ Raw Data.csv           # 0/1 accuracy matrix (6 912 × 36)
│  └─ Significance/             # Paired-t statistics
│     ├─ Average_accuracy.txt
│     ├─ Extraction_Method.txt
│     ├─ Frame_Number.txt
│     └─ Prompt.txt
├─ Figures/                     # Camera-ready graphics (PDF)
│  ├─ sec_label_heatmap.pdf
│  ├─ Paired_mapping_*.pdf
│  └─ ...
└─ README.md                    # You are here
```
All videos and QA JSONs are released on HuggingFace:
https://huggingface.co/datasets/<YOUR_ORG>/Conservation-v1
SHA-256 checksums are provided in `Data/checksums.txt` for integrity verification.
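To verify a download, something like the following works; it assumes `Data/checksums.txt` follows the standard `sha256sum` layout of one `<hex digest>  <relative path>` pair per line, which is an assumption about the file format rather than a documented fact.

```python
import hashlib
from pathlib import Path

def sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Assumed line format: "<hex digest>  <relative path>" (sha256sum style).
for line in Path("Data/checksums.txt").read_text().splitlines():
    expected, _, rel_path = line.strip().partition("  ")
    status = "OK" if sha256(Path(rel_path)) == expected else "MISMATCH"
    print(f"{rel_path}: {status}")
```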
```bash
python -m venv venv
source venv/bin/activate              # Windows: venv\Scripts\activate
pip install pandas numpy scipy matplotlib seaborn

python Code/boxplot_frames.py         # frame-count effect
python Code/boxplot_prompts.py        # prompt effect
python Code/export_model_tables.py    # exports CSVs to Data/Processed/
```

The box-plot scripts write their PDFs to `Figures/`, and the export script writes its CSVs to `Data/Processed/`. All scripts accept `--csv` and `--outdir` flags; see the docstrings for details.
Each row of `Raw Data.csv` is one QA item, with the following columns:

| Column | Description |
|---|---|
| `Your_index` | Question slug (`conservation_####`) |
| `QuestionID` | Internal ID of the QA system |
| `Sec. Label` | Number, Length, Volume, Size (alias “Solid”) |
| `Frame Number` | 3, 8, or 16 frames |
| `Reference` | Prompt style (clean, workflow, description, instruction) |
| `Extraction Method` | Uniform, Human, SeViLA |
| `Parameter-1 … -5` | Controlled visual factors (object count, distance, colour, etc.) |
| `<Model-Name>` | 36 model columns; 1 = correct, 0 = incorrect, blank = no answer |
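A minimal loading sketch, assuming the path and column names above; treating blank cells as missing answers (excluded from the mean) rather than as incorrect is a choice made here purely for illustration.

```python
import pandas as pd

df = pd.read_csv("Data/Raw_Data/Raw Data.csv")

# Metadata columns as listed in the table above; everything else is a model column.
meta_cols = ["Your_index", "QuestionID", "Sec. Label", "Frame Number",
             "Reference", "Extraction Method"] + [f"Parameter-{i}" for i in range(1, 6)]
model_cols = [c for c in df.columns if c not in meta_cols]

# Blank cells (no answer) are parsed as NaN and skipped by mean();
# counting them as incorrect instead is an equally defensible convention.
overall = df[model_cols].mean().sort_values(ascending=False)
print(overall.head(10))

# Accuracy broken down by conservation domain.
print(df.groupby("Sec. Label")[model_cols].mean())
```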
| Factor | Levels (values) | Purpose |
|---|---|---|
| Prompt | Direct • Sequential • CoT • Continuous | Linguistic scaffolding |
| Frame Count | 3 • 8 • 16 | Temporal resolution |
| Frame Selection | Uniform • Human • SeViLA | Key‑frame picking strategy |
| Evaluation metric | Accuracy (binary) | Per‑question correctness |
| Statistics | Mean ± SD, paired t-tests (see Significance/) | Significance across 34 public models |
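The paired comparisons released in `Data/Significance/` can be approximated roughly as follows; which factor is compared, how items are aggregated per model, and the two levels chosen are illustrative assumptions, not the exact procedure behind the published numbers.

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("Data/Raw_Data/Raw Data.csv")

meta_cols = ["Your_index", "QuestionID", "Sec. Label", "Frame Number",
             "Reference", "Extraction Method"] + [f"Parameter-{i}" for i in range(1, 6)]
model_cols = [c for c in df.columns if c not in meta_cols]

# Mean accuracy per (prompt style, model); "Reference" holds the prompt style.
per_model = df.groupby("Reference")[model_cols].mean()

# Paired t-test between the first two prompt styles, paired across models.
a, b = per_model.index[:2]
t_stat, p_val = stats.ttest_rel(per_model.loc[a], per_model.loc[b])
print(f"{a} vs. {b}: t = {t_stat:.3f}, p = {p_val:.4f}")
```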
Typical uses include:
- Benchmarking new VLMs against 36 baselines on physical-reasoning tasks.
- Prompt / sampling research: how language or frame choice changes performance.
- Cognitive comparison with developmental stages of human conservation.
- Classroom demos: visualise classic failure modes of state‑of‑the‑art models.
Pull requests and issues are welcome, especially for:

- new model results (`results_<model>.csv`)
- additional visualisations or metrics
- bug fixes in the analysis scripts
Last updated: 2025‑05‑16
- Code: MIT License (see `LICENSE-MIT`).
- Dataset: CC BY-NC 4.0 (non-commercial research only). Commercial use requires explicit written permission from the authors.