This repository contains a comprehensive framework for benchmarking the physics reasoning capabilities of Vision-Language Models (VLMs). It provides an end-to-end solution, from programmatic generation of physics scenarios and their ground-truth solutions to a multi-faceted evaluation pipeline that assesses not just accuracy but the quality of a model's reasoning.
The primary goal is to provide a robust, reproducible, and interpretable framework to systematically evaluate the physics reasoning abilities of VLMs.
This project is designed to:
- Go Beyond Accuracy: Move past simple right/wrong scoring by analyzing the quality and physical correctness of a model's step-by-step reasoning.
- Ensure Interpretability: Through detailed failure analysis, identify specific conceptual gaps (e.g., misunderstanding "momentum conservation") in different models.
- Standardize Evaluation: Offer a unified benchmark with diverse, programmatically generated scenarios across multiple physics domains, complete with ground-truth solutions and visualizations.
- Analyze Model Capabilities: Enable deep dives into model performance, including their adaptability to problem difficulty, the impact of quantization, and the effectiveness of various prompting strategies.
- Diverse Physics Environments:
- Projectile Motion: 2D trajectory problems with gravity and air resistance.
- Collision Dynamics: Elastic and inelastic collisions.
- Mechanics: Levers, pulleys, and inclined planes.
- Fluid Dynamics: Scenarios based on the continuity equation and Bernoulli's principle.
- Automated Content Generation:
- Scenario Generator: Programmatically creates hundreds of unique problems with varying difficulty levels.
- Ground Truth Engine: Calculates high-precision solutions for every scenario.
- Scene Renderer: Automatically generates visualizations for each problem.
- Multi-Faceted Evaluation Metrics:
- Physics Accuracy: Quantitative comparison against ground-truth solutions.
- Reasoning Quality: NLP-based analysis of explanations for logical consistency and correct application of physics principles.
- Adaptability: Performance measurement across different domains and difficulty levels.
- Advanced Prompt Engineering:
- Includes strategies like Chain-of-Thought (CoT), Few-Shot, and Socratic prompting to elicit detailed reasoning from models.
- Comprehensive Failure Analysis:
- Automatically categorizes errors into Conceptual, Computational, and Perceptual types.
- Identifies specific physics principles that are challenging for each model.
- Resource-Aware Design:
- Features a `QuantizationPipeline` for efficient evaluation of large models (e.g., 4-bit and 8-bit quantization) on consumer-grade or cloud GPUs such as the T4 (see the sketch after this list).
- Tracks computational metrics such as GPU memory and CPU usage.
- Publication-Ready Outputs:
- Automatically generates leaderboards, performance tables, and a rich set of visualizations (radar charts, heatmaps, learning curves) to facilitate research and analysis.
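The `QuantizationPipeline` itself lives in the repository's model code; as a rough illustration of what 4-bit loading of a large VLM typically involves, the sketch below uses Hugging Face Transformers with bitsandbytes. The checkpoint name and settings are placeholder assumptions, not the project's actual configuration.

```python
# Illustrative 4-bit loading sketch (placeholder checkpoint and settings;
# not the repository's QuantizationPipeline).
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # fit a 7B-scale VLM on a single T4
    bnb_4bit_quant_type="nf4",             # normal-float 4-bit weights
    bnb_4bit_compute_dtype=torch.float16,  # fp16 compute suits T4-class GPUs
)

model_id = "llava-hf/llava-1.5-7b-hf"      # placeholder open-source VLM
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# One of the computational metrics the framework tracks: peak GPU memory.
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```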
The repository is organized as follows:

```text
physic_vlm_benchmark/
│
├── README.md              # Project overview, usage instructions
├── requirements.txt       # Dependencies
├── run.py                 # Main entrypoint to run experiments
│
├── configs/               # Experiment and model configuration files
│   ├── default.yaml
│   ├── models.yaml
│   └── environments.yaml
│
├── environments/          # Physics simulation environments
│   ├── projectile.py
│   ├── collisions.py
│   └── ...                # (Other physics domains)
│
├── models/                # Wrappers for different VLMs
│   ├── base_model.py      # Abstract base class
│   └── ...                # (Wrappers for LLaMA, Qwen, etc.)
│
├── evaluation/            # Evaluation logic, metrics, and pipeline
│   ├── metrics.py
│   └── evaluator.py
│
├── utils/                 # Helper functions (logging, plotting)
│
├── experiments/           # Scripts for specific experiments
│   ├── runner.py          # Orchestrates experiments
│   └── ablation_study.py
│
├── results/               # Stores outputs (JSON, CSV, plots)
│
└── logs/                  # Log files from experiment runs
```
This framework includes a benchmark suite of over 400 procedurally generated scenarios distributed across four core physics domains. Every scenario carries a difficulty level (easy, medium, or hard), a ground-truth solution, and a visualization.
| Environment | Description | Key Concepts | Scenarios |
|---|---|---|---|
| Projectile Motion | 2D trajectory prediction with gravity and air resistance. | Kinematics, Vector Components, Energy Conservation | 100 |
| Collision Dynamics | 1D elastic and inelastic collisions between objects. | Momentum & Energy Conservation, Coefficient of Restitution | 100 |
| Mechanics | Analysis of levers, pulleys, and inclined planes. | Torque, Mechanical Advantage, Static Equilibrium, Friction | 100 |
| Fluid Dynamics | Pipe flow and pressure change problems. | Continuity Equation, Bernoulli's Principle | 100 |
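To make the ground-truth solutions concrete, the sketch below computes the analytic answers for a drag-free projectile scenario; the repository's Ground Truth Engine also handles air resistance, which generally requires numerical integration. The function and field names here are illustrative, not the project's API.

```python
# Minimal sketch of a projectile ground-truth calculation (drag-free case,
# launched from ground level); names are illustrative only.
import math

def projectile_ground_truth(speed: float, angle_deg: float, g: float = 9.81) -> dict:
    """Return range, peak height, and time of flight for level ground."""
    theta = math.radians(angle_deg)
    vx, vy = speed * math.cos(theta), speed * math.sin(theta)
    t_flight = 2 * vy / g                 # time until the projectile returns to y = 0
    return {
        "range_m": vx * t_flight,         # horizontal distance travelled
        "max_height_m": vy**2 / (2 * g),  # apex of the parabolic arc
        "time_of_flight_s": t_flight,
    }

print(projectile_ground_truth(speed=20.0, angle_deg=45.0))
```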
The framework is model-agnostic. The `VLMWrapper` class provides a unified interface to integrate a broad range of open-source VLMs. Baseline models used include:
- Gemma2-27B-Vision
- Qwen2.5-VL-7B
- LLaMA-3.2-Vision-11B
- DeepSeek-VL-1.3B
Summary of baseline performances with an overall score combining Physics Accuracy and Reasoning Quality:
| Rank | Model | Size (B) | Overall Score | 95% CI | Physics Accuracy | Reasoning Quality | Computational Efficiency | Success Rate | Avg. Inference (s) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen-VL-7B | 7 | 0.815 | [0.800, 0.830] | 0.85 | 0.78 | 0.92 | 0.89 | 2.3 |
| 2 | LLaMA-Vision-11B | 11 | 0.765 | [0.750, 0.780] | 0.79 | 0.83 | 0.68 | 0.84 | 3.7 |
| 3 | Gemma-27B | 27 | 0.750 | [0.735, 0.765] | 0.88 | 0.85 | 0.45 | 0.91 | 8.2 |
| 4 | DeepSeek-VL-1.3B | 1.3 | 0.700 | [0.685, 0.715] | 0.72 | 0.69 | 0.95 | 0.78 | 1.1 |
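How the overall score and its confidence interval are aggregated is defined in the evaluation code; one standard way to obtain a 95% CI over per-scenario scores is a percentile bootstrap, sketched below on synthetic data. Whether the repository uses this exact procedure is an assumption, and all names and numbers are illustrative.

```python
# Percentile-bootstrap 95% CI over per-scenario scores (illustrative only).
import numpy as np

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Return the mean score and a percentile-bootstrap CI for that mean."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# Toy example: 400 synthetic per-scenario overall scores clipped to [0, 1].
toy_scores = np.clip(np.random.default_rng(1).normal(0.8, 0.15, 400), 0.0, 1.0)
mean, (lo, hi) = bootstrap_ci(toy_scores)
print(f"overall = {mean:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```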
Clone the repository and install dependencies:
```bash
git clone https://github.com/prnvpwr2612/Interpretable-Physics-Reasoning-and-Performance-Taxonomy-in-Vision-Language-Models.git
cd Interpretable-Physics-Reasoning-and-Performance-Taxonomy-in-Vision-Language-Models
pip install -r requirements.txt
```
Launch the full benchmark pipeline via the notebook `notebook/Code.ipynb` by running its cells sequentially; the cells are grouped into phases covering setup, scenario generation, evaluation, ablation studies, and result visualization.
- Implement a custom wrapper subclassing `BaseModelWrapper` if needed (a minimal sketch follows this list).
- Register your new model in the evaluation pipeline.
- Run evaluation notebooks or experiment scripts to benchmark.
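The sketch below shows the shape such a wrapper might take. The repository's actual `BaseModelWrapper` (in `models/base_model.py`) defines the real interface; the abstract method and constructor used here are assumptions for illustration only.

```python
# Hypothetical wrapper sketch; the real interface is defined in models/base_model.py.
from abc import ABC, abstractmethod

class BaseModelWrapper(ABC):          # stand-in for the repository's base class
    @abstractmethod
    def generate(self, image, prompt: str) -> str:
        """Answer one physics question about one rendered scene."""

class MyVLMWrapper(BaseModelWrapper):
    def __init__(self, checkpoint: str):
        self.checkpoint = checkpoint  # e.g. a Hugging Face model id
        # Load the processor/model here, optionally through 4/8-bit quantization.

    def generate(self, image, prompt: str) -> str:
        # Run inference on one (image, prompt) pair and return the raw text answer.
        return "placeholder answer"   # replace with the model's real output
```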
Key directories:

- `/environments`: Physics domain logic, scenario generation, and ground-truth calculation.
- `/models`: Model wrappers and the quantization pipeline.
- `/evaluation`: Core evaluation logic, metrics, prompting, and inference.
- `/results`: Generated leaderboards, plots, and JSON output.
If you use this framework, please cite:
```bibtex
@misc{prnvpwr2612-physics-benchmark-2025,
  author       = {Pranav Pawar and Kavish Shah and Hadi Gala},
  title        = {Interpretable Physics Reasoning and Performance Taxonomy in Vision-Language Models},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/prnvpwr2612/Interpretable-Physics-Reasoning-and-Performance-Taxonomy-in-Vision-Language-Models}}
}
```