Reasoning about Physical Transformation in Number · Length · Volume · Size
This repository hosts a large-scale benchmark that probes whether current vision-language models (VLMs) understand quantity invariance—the Piagetian concept of conservation—across four physical domains:
| Domain | Example transformation | Also referred to as…* |
|---|---|---|
| Number | spreading coins | — |
| Length | sliding / rotating straws | — |
| Volume | pouring liquid | Liquid |
| Size | reshaping clay | Solid (legacy name) |
*Early drafts used “Solid” for the Size domain; both labels refer to the same domain in the raw CSV and are unified here as Size.
The benchmark contains 192 short videos (4 domains × 48 variants) evaluated under a full-factorial design of
3 frame counts × 3 frame-selection methods × 4 prompt styles (36 conditions per video), for a total of 192 × 36 = 6 912 multimodal QA items.
We report results for 36 models in total (34 publicly released + 2 proprietary).
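As a sanity check on those counts, the sketch below enumerates the factorial design; the level names are taken from the factor table further down, and the loop is purely illustrative rather than the actual item-generation code.

```python
from itertools import product

videos = 4 * 48                                   # 192 source videos (4 domains × 48 variants)
frame_counts = [3, 8, 16]
frame_selection = ["Uniform", "Human", "SeViLA"]
prompt_styles = ["Direct", "Sequential", "CoT", "Continuous"]

# Every video is paired with every (frame count, selection method, prompt style) combination.
conditions = list(product(frame_counts, frame_selection, prompt_styles))
print(len(conditions))            # 36 conditions per video
print(videos * len(conditions))   # 6912 multimodal QA items
```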
Repository layout:

```text
Conservation/
├─ Code/                        # Re-run analyses & figures
│  ├─ boxplot_frames.py
│  ├─ boxplot_prompts.py
│  ├─ export_model_tables.py
│  └─ ...
├─ Data/
│  ├─ Raw_Data/
│  │  └─ Raw Data.csv           # 0/1 accuracy matrix (6 912 × 36)
│  └─ Significance/             # Paired-t statistics
│     ├─ Average_accuracy.txt
│     ├─ Extraction_Method.txt
│     ├─ Frame_Number.txt
│     └─ Prompt.txt
├─ Figures/                     # Camera-ready graphics (PDF)
│  ├─ sec_label_heatmap.pdf
│  ├─ Paired_mapping_*.pdf
│  └─ ...
└─ README.md                    # You are here
```
All videos and QA JSONs are released on HuggingFace:
https://huggingface.co/datasets/<YOUR_ORG>/Conservation-v1
SHA-256 checksums are provided in `Data/checksums.txt` for integrity verification.
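To verify a download, something like the following works; it assumes `Data/checksums.txt` follows the standard `sha256sum` layout of one `<hex digest>  <relative path>` pair per line, which is an assumption about the file format rather than a documented fact.

```python
import hashlib
from pathlib import Path

def sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Assumed line format: "<hex digest>  <relative path>" (sha256sum style).
for line in Path("Data/checksums.txt").read_text().splitlines():
    expected, _, rel_path = line.strip().partition("  ")
    status = "OK" if sha256(Path(rel_path)) == expected else "MISMATCH"
    print(f"{rel_path}: {status}")
```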
```bash
python -m venv venv
source venv/bin/activate              # Windows: venv\Scripts\activate
pip install pandas numpy scipy matplotlib seaborn

python Code/boxplot_frames.py         # frame-count effect
python Code/boxplot_prompts.py        # prompt effect
python Code/export_model_tables.py    # exports CSVs to Data/Processed/
```

The box-plot scripts write their PDFs to `Figures/`, and the export script writes its CSVs to `Data/Processed/`. All scripts accept `--csv` and `--outdir` flags; see the docstrings for details.
Each row of `Raw Data.csv` is one QA item, with the following columns:

| Column | Description |
|---|---|
| `Your_index` | Question slug (`conservation_####`) |
| `QuestionID` | Internal ID of the QA system |
| `Sec. Label` | Number, Length, Volume, Size (alias “Solid”) |
| `Frame Number` | 3, 8, or 16 frames |
| `Reference` | Prompt style (clean, workflow, description, instruction) |
| `Extraction Method` | Uniform, Human, SeViLA |
| `Parameter-1 … -5` | Controlled visual factors (object count, distance, colour, etc.) |
| `<Model-Name>` | 36 model columns; 1 = correct, 0 = incorrect, blank = no answer |
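A minimal loading sketch, assuming the path and column names above; treating blank cells as missing answers (excluded from the mean) rather than as incorrect is a choice made here purely for illustration.

```python
import pandas as pd

df = pd.read_csv("Data/Raw_Data/Raw Data.csv")

# Metadata columns as listed in the table above; everything else is a model column.
meta_cols = ["Your_index", "QuestionID", "Sec. Label", "Frame Number",
             "Reference", "Extraction Method"] + [f"Parameter-{i}" for i in range(1, 6)]
model_cols = [c for c in df.columns if c not in meta_cols]

# Blank cells (no answer) are parsed as NaN and skipped by mean();
# counting them as incorrect instead is an equally defensible convention.
overall = df[model_cols].mean().sort_values(ascending=False)
print(overall.head(10))

# Accuracy broken down by conservation domain.
print(df.groupby("Sec. Label")[model_cols].mean())
```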
| Factor | Levels (values) | Purpose |
|---|---|---|
| Prompt | Direct • Sequential • CoT • Continuous | Linguistic scaffolding |
| Frame Count | 3 • 8 • 16 | Temporal resolution |
| Frame Selection | Uniform • Human • SeViLA | Key‑frame picking strategy |
| Evaluation metric | Accuracy (binary) | Per‑question correctness |
| Statistics | Mean ± SD, paired t-tests (see Significance/) | Significance across 34 public models |
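The paired comparisons released in `Data/Significance/` can be approximated roughly as follows; which factor is compared, how items are aggregated per model, and the two levels chosen are illustrative assumptions, not the exact procedure behind the published numbers.

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("Data/Raw_Data/Raw Data.csv")

meta_cols = ["Your_index", "QuestionID", "Sec. Label", "Frame Number",
             "Reference", "Extraction Method"] + [f"Parameter-{i}" for i in range(1, 6)]
model_cols = [c for c in df.columns if c not in meta_cols]

# Mean accuracy per (prompt style, model); "Reference" holds the prompt style.
per_model = df.groupby("Reference")[model_cols].mean()

# Paired t-test between the first two prompt styles, paired across models.
a, b = per_model.index[:2]
t_stat, p_val = stats.ttest_rel(per_model.loc[a], per_model.loc[b])
print(f"{a} vs. {b}: t = {t_stat:.3f}, p = {p_val:.4f}")
```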
Typical uses include:
- Benchmarking new VLMs against 36 baselines on physical-reasoning tasks.
- Prompt / sampling research: how language or frame choice changes performance.
- Cognitive comparison with developmental stages of human conservation.
- Classroom demos: visualise classic failure modes of state‑of‑the‑art models.
Pull requests and issues are welcome, especially for:

- new model results (`results_<model>.csv`)
- additional visualisations or metrics
- bug fixes in the analysis scripts
Last updated: 2025‑05‑16
- Code: MIT License (see `LICENSE-MIT`).
- Dataset: CC BY-NC 4.0 (non-commercial research only). Commercial use requires explicit written permission from the authors.