🧩 Conservation Benchmark for Vision-Language Models

Reasoning about Physical Transformation in Number · Length · Volume · Size

This repository hosts a large-scale benchmark that probes whether current vision-language models (VLMs) understand quantity invariance—the Piagetian concept of conservation—across four physical domains:

| Domain | Example transformation    | Also referred to as…* |
|--------|---------------------------|-----------------------|
| Number | spreading coins           |                       |
| Length | sliding / rotating straws |                       |
| Volume | pouring liquid            | Liquid                |
| Size   | reshaping clay            | Solid (legacy name)   |

*Early drafts used “Solid” for the Size domain; both names refer to the same label in the raw CSV and are unified here as Size.

The benchmark contains 192 short videos (4 domains × 48 variants) evaluated under a full-factorial design
(3 frame counts × 3 frame-selection methods × 4 prompt styles), i.e. 192 × 3 × 3 × 4 = 6 912 multimodal QA items.

We report results for 36 models in total (34 publicly released + 2 proprietary).


Repository Layout

Conservation/
├─ Code/                       # Re-run analyses & figures
│  ├─ boxplot_frames.py
│  ├─ boxplot_prompts.py
│  ├─ export_model_tables.py
│  └─ ...
├─ Data/
│  ├─ Raw_Data/
│  │   └─ Raw Data.csv        # 0/1 accuracy matrix (6 912 × 36)
│  └─ Significance/           # Paired-t statistics
│      ├─ Average_accuracy.txt
│      ├─ Extraction_Method.txt
│      ├─ Frame_Number.txt
│      └─ Prompt.txt
├─ Figures/                    # Camera-ready graphics (PDF)
│  ├─ sec_label_heatmap.pdf
│  ├─ Paired_mapping_*.pdf
│  └─ ...
└─ README.md                   # You are here

Getting the Dataset

All videos and QA JSONs are released on Hugging Face:

https://huggingface.co/datasets/<YOUR_ORG>/Conservation-v1

SHA‑256 checksums are provided in Data/checksums.txt for integrity verification.
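
A minimal verification sketch in Python (a sketch only: it assumes each line of checksums.txt uses the common "<sha256>  <relative path>" layout; adjust the parsing if the manifest differs):

import hashlib
from pathlib import Path

def verify_checksums(manifest="Data/checksums.txt"):
    """Compare the SHA-256 digest of each listed file against the manifest."""
    for line in Path(manifest).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, _, relpath = line.partition("  ")
        relpath = relpath.strip()
        digest = hashlib.sha256(Path(relpath).read_bytes()).hexdigest()
        print("OK  " if digest == expected.strip() else "FAIL", relpath)

verify_checksums()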


Quick Start: Reproducing the Paper Figures

1 Create environment

python -m venv venv
source venv/bin/activate            # Windows: venv\Scripts\activate
pip install pandas numpy scipy matplotlib seaborn

2 Generate plots

python Code/boxplot_frames.py       # frame‑count effect
python Code/boxplot_prompts.py      # prompt effect

PDFs appear in Figures/.

3 Export per‑model tables

python Code/export_model_tables.py  # outputs CSVs to Data/Processed/

All scripts accept --csv and --outdir flags; see docstrings for details.
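
For example, to rebuild the per-model tables from an explicit CSV path (paths taken from the layout above):

python Code/export_model_tables.py --csv "Data/Raw_Data/Raw Data.csv" --outdir Data/Processed/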


Column Schema of Raw Data.csv

| Column | Description |
|--------|-------------|
| Your_index | Question slug (conservation_####) |
| QuestionID | Internal ID of the QA system |
| Sec. Label | Number, Length, Volume, or Size (alias “Solid”) |
| Frame Number | 3, 8, or 16 sampled frames |
| Reference | Prompt style (clean, workflow, description, instruction) |
| Extraction Method | Uniform, Human, or SeViLA |
| Parameter-1 … -5 | Controlled visual factors (object count, distance, colour, etc.) |
| <Model-Name> | 36 columns; 1 = correct, 0 = incorrect, blank = no answer |
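
As a quick sanity check on this schema, the pandas sketch below computes each model's mean accuracy per domain. The metadata column names follow the table above; the expanded Parameter-1 … Parameter-5 spellings are an assumption, so match them to the actual CSV header:

import pandas as pd

df = pd.read_csv("Data/Raw_Data/Raw Data.csv")

# Fold the legacy "Solid" label into "Size" (see the domain table above).
df["Sec. Label"] = df["Sec. Label"].replace({"Solid": "Size"})

# Treat every non-metadata column as a model column.
meta = {"Your_index", "QuestionID", "Sec. Label", "Frame Number",
        "Reference", "Extraction Method"} | {f"Parameter-{i}" for i in range(1, 6)}
model_cols = [c for c in df.columns if c not in meta]

# Blank cells load as NaN and are skipped by mean(), so unanswered
# items are excluded rather than scored as incorrect.
per_domain = df.groupby("Sec. Label")[model_cols].mean()
print(per_domain.round(3))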

Experimental Factors & Metrics

| Factor | Levels (values) | Purpose |
|--------|-----------------|---------|
| Prompt | Direct • Sequential • CoT • Continuous | Linguistic scaffolding |
| Frame Count | 3 • 8 • 16 | Temporal resolution |
| Frame Selection | Uniform • Human • SeViLA | Key‑frame picking strategy |
| Evaluation metric | Accuracy (binary) | Per‑question correctness |
| Statistics | Mean ± SD, paired t-tests (see Significance/) | Significance across 34 public models |
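
The paired t-tests under Data/Significance/ can be reproduced in spirit with scipy; a sketch comparing two prompt styles, paired over per-model mean accuracies, is below. Two caveats: the level strings "Direct" and "CoT" follow the factor table above (the CSV may instead store the schema's clean/workflow/description/instruction labels), and this runs over all model columns, whereas the reported statistics use the 34 public models:

import pandas as pd
from scipy import stats

df = pd.read_csv("Data/Raw_Data/Raw Data.csv")
meta = {"Your_index", "QuestionID", "Sec. Label", "Frame Number",
        "Reference", "Extraction Method"} | {f"Parameter-{i}" for i in range(1, 6)}
model_cols = [c for c in df.columns if c not in meta]

# One mean-accuracy value per model under each prompt style ("Reference").
acc = df.groupby("Reference")[model_cols].mean()

# Pair the same models under two prompt styles and test the difference.
t, p = stats.ttest_rel(acc.loc["Direct"], acc.loc["CoT"])
print(f"t = {t:.3f}, p = {p:.4g}")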

Typical Use Cases

  1. Benchmarking new VLMs against 36 baselines on physical‑reasoning tasks.
  2. Prompt / sampling research: how language or frame choice changes performance.
  3. Cognitive comparison with developmental stages of human conservation.
  4. Classroom demos: visualise classic failure modes of state‑of‑the‑art models.

Contributing

Pull requests and issues are welcome—especially for

  • new model results (results_<model>.csv)
  • additional visualisations or metrics
  • bug fixes in the analysis scripts

Last updated: 2025‑05‑16


License

  • Code — MIT License (see LICENSE-MIT).
  • Dataset — CC BY‑NC 4.0 (non‑commercial research only).
    Commercial use requires explicit written permission from the authors.
