This repository contains a comprehensive framework for benchmarking the physics reasoning capabilities of Vision-Language Models (VLMs). It provides an end-to-end solution, from programmatic generation of physics scenarios and their ground-truth solutions to a multi-faceted evaluation pipeline that assesses not just accuracy but the quality of a model's reasoning.
The primary goal is to provide a robust, reproducible, and interpretable framework to systematically evaluate the physics reasoning abilities of VLMs.
This project is designed to:
- Go Beyond Accuracy: Move past simple right/wrong scoring by analyzing the quality and physical correctness of a model's step-by-step reasoning.
- Ensure Interpretability: Through detailed failure analysis, identify specific conceptual gaps (e.g., misunderstanding "momentum conservation") in different models.
- Standardize Evaluation: Offer a unified benchmark with diverse, programmatically generated scenarios across multiple physics domains, complete with ground-truth solutions and visualizations.
- Analyze Model Capabilities: Enable deep dives into model performance, including their adaptability to problem difficulty, the impact of quantization, and the effectiveness of various prompting strategies.
- Diverse Physics Environments:
- Projectile Motion: 2D trajectory problems with gravity and air resistance.
- Collision Dynamics: Elastic and inelastic collisions.
- Mechanics: Levers, pulleys, and inclined planes.
- Fluid Dynamics: Scenarios based on the continuity equation and Bernoulli's principle.
- Automated Content Generation:
- Scenario Generator: Programmatically creates hundreds of unique problems with varying difficulty levels.
- Ground Truth Engine: Calculates high-precision solutions for every scenario.
- Scene Renderer: Automatically generates visualizations for each problem.
- Multi-Faceted Evaluation Metrics:
- Physics Accuracy: Quantitative comparison against ground-truth solutions.
- Reasoning Quality: NLP-based analysis of explanations for logical consistency and correct application of physics principles.
- Adaptability: Performance measurement across different domains and difficulty levels.
- Advanced Prompt Engineering:
- Includes strategies like Chain-of-Thought (CoT), Few-Shot, and Socratic prompting to elicit detailed reasoning from models.
- Comprehensive Failure Analysis:
- Automatically categorizes errors into Conceptual, Computational, and Perceptual types.
- Identifies specific physics principles that are challenging for each model.
- Resource-Aware Design:
- Features a `QuantizationPipeline` for efficient evaluation of large models (e.g., 4-bit and 8-bit quantization) on consumer-grade or cloud GPUs such as the T4 (see the sketch after this list).
- Tracks computational metrics such as GPU memory and CPU usage.
- Publication-Ready Outputs:
- Automatically generates leaderboards, performance tables, and a rich set of visualizations (radar charts, heatmaps, learning curves) to facilitate research and analysis.
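The `QuantizationPipeline` itself lives in the repository's model code; as a rough illustration of what 4-bit loading of a large VLM typically involves, the sketch below uses Hugging Face Transformers with bitsandbytes. The checkpoint name and settings are placeholder assumptions, not the project's actual configuration.

```python
# Illustrative 4-bit loading sketch (placeholder checkpoint and settings;
# not the repository's QuantizationPipeline).
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # fit a 7B-scale VLM on a single T4
    bnb_4bit_quant_type="nf4",             # normal-float 4-bit weights
    bnb_4bit_compute_dtype=torch.float16,  # fp16 compute suits T4-class GPUs
)

model_id = "llava-hf/llava-1.5-7b-hf"      # placeholder open-source VLM
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# One of the computational metrics the framework tracks: peak GPU memory.
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```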
The repository is organized as follows:

```text
physic_vlm_benchmark/
│
├── README.md              # Project overview, usage instructions
├── requirements.txt       # Dependencies
├── run.py                 # Main entrypoint to run experiments
│
├── configs/               # Experiment and model configuration files
│   ├── default.yaml
│   ├── models.yaml
│   └── environments.yaml
│
├── environments/          # Physics simulation environments
│   ├── projectile.py
│   ├── collisions.py
│   └── ...                # (Other physics domains)
│
├── models/                # Wrappers for different VLMs
│   ├── base_model.py      # Abstract base class
│   └── ...                # (Wrappers for LLaMA, Qwen, etc.)
│
├── evaluation/            # Evaluation logic, metrics, and pipeline
│   ├── metrics.py
│   └── evaluator.py
│
├── utils/                 # Helper functions (logging, plotting)
│
├── experiments/           # Scripts for specific experiments
│   ├── runner.py          # Orchestrates experiments
│   └── ablation_study.py
│
├── results/               # Stores outputs (JSON, CSV, plots)
│
└── logs/                  # Log files from experiment runs
```
This framework includes a benchmark suite of over 400 procedurally generated scenarios distributed across four core physics domains. Every scenario carries a difficulty level (easy, medium, or hard), a ground-truth solution, and a visualization.
| Environment | Description | Key Concepts | Scenarios |
|---|---|---|---|
| Projectile Motion | 2D trajectory prediction with gravity and air resistance. | Kinematics, Vector Components, Energy Conservation | 100 |
| Collision Dynamics | 1D elastic and inelastic collisions between objects. | Momentum & Energy Conservation, Coefficient of Restitution | 100 |
| Mechanics | Analysis of levers, pulleys, and inclined planes. | Torque, Mechanical Advantage, Static Equilibrium, Friction | 100 |
| Fluid Dynamics | Pipe flow and pressure change problems. | Continuity Equation, Bernoulli's Principle | 100 |
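To make the ground-truth solutions concrete, the sketch below computes the analytic answers for a drag-free projectile scenario; the repository's Ground Truth Engine also handles air resistance, which generally requires numerical integration. The function and field names here are illustrative, not the project's API.

```python
# Minimal sketch of a projectile ground-truth calculation (drag-free case,
# launched from ground level); names are illustrative only.
import math

def projectile_ground_truth(speed: float, angle_deg: float, g: float = 9.81) -> dict:
    """Return range, peak height, and time of flight for level ground."""
    theta = math.radians(angle_deg)
    vx, vy = speed * math.cos(theta), speed * math.sin(theta)
    t_flight = 2 * vy / g                 # time until the projectile returns to y = 0
    return {
        "range_m": vx * t_flight,         # horizontal distance travelled
        "max_height_m": vy**2 / (2 * g),  # apex of the parabolic arc
        "time_of_flight_s": t_flight,
    }

print(projectile_ground_truth(speed=20.0, angle_deg=45.0))
```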
The framework is model-agnostic. The `VLMWrapper` class provides a unified interface to integrate a broad range of open-source VLMs. Baseline models used include:
- Gemma2-27B-Vision
- Qwen2.5-VL-7B
- LLaMA-3.2-Vision-11B
- DeepSeek-VL-1.3B
Summary of baseline performances with an overall score combining Physics Accuracy and Reasoning Quality:
| Rank | Model | Size (B) | Overall Score | 95% CI | Physics Accuracy | Reasoning Quality | Computational Efficiency | Success Rate | Avg. Inference (s) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen-VL-7B | 7 | 0.815 | [0.800, 0.830] | 0.85 | 0.78 | 0.92 | 0.89 | 2.3 |
| 2 | LLaMA-Vision-11B | 11 | 0.765 | [0.750, 0.780] | 0.79 | 0.83 | 0.68 | 0.84 | 3.7 |
| 3 | Gemma-27B | 27 | 0.750 | [0.735, 0.765] | 0.88 | 0.85 | 0.45 | 0.91 | 8.2 |
| 4 | DeepSeek-VL-1.3B | 1.3 | 0.700 | [0.685, 0.715] | 0.72 | 0.69 | 0.95 | 0.78 | 1.1 |
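How the overall score and its confidence interval are aggregated is defined in the evaluation code; one standard way to obtain a 95% CI over per-scenario scores is a percentile bootstrap, sketched below on synthetic data. Whether the repository uses this exact procedure is an assumption, and all names and numbers are illustrative.

```python
# Percentile-bootstrap 95% CI over per-scenario scores (illustrative only).
import numpy as np

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Return the mean score and a percentile-bootstrap CI for that mean."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# Toy example: 400 synthetic per-scenario overall scores clipped to [0, 1].
toy_scores = np.clip(np.random.default_rng(1).normal(0.8, 0.15, 400), 0.0, 1.0)
mean, (lo, hi) = bootstrap_ci(toy_scores)
print(f"overall = {mean:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```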
Clone the repository and install dependencies:
```bash
git clone https://github.com/prnvpwr2612/Interpretable-Physics-Reasoning-and-Performance-Taxonomy-in-Vision-Language-Models.git
cd Interpretable-Physics-Reasoning-and-Performance-Taxonomy-in-Vision-Language-Models
pip install -r requirements.txt
```
Launch the full benchmark pipeline via the notebook `notebook/Code.ipynb` by running its cells sequentially; the cells are grouped into phases covering setup, scenario generation, evaluation, ablation studies, and result visualization.
- Implement a custom wrapper subclassing `BaseModelWrapper` if needed (a minimal sketch follows this list).
- Register your new model in the evaluation pipeline.
- Run evaluation notebooks or experiment scripts to benchmark.
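The sketch below shows the shape such a wrapper might take. The repository's actual `BaseModelWrapper` (in `models/base_model.py`) defines the real interface; the abstract method and constructor used here are assumptions for illustration only.

```python
# Hypothetical wrapper sketch; the real interface is defined in models/base_model.py.
from abc import ABC, abstractmethod

class BaseModelWrapper(ABC):          # stand-in for the repository's base class
    @abstractmethod
    def generate(self, image, prompt: str) -> str:
        """Answer one physics question about one rendered scene."""

class MyVLMWrapper(BaseModelWrapper):
    def __init__(self, checkpoint: str):
        self.checkpoint = checkpoint  # e.g. a Hugging Face model id
        # Load the processor/model here, optionally through 4/8-bit quantization.

    def generate(self, image, prompt: str) -> str:
        # Run inference on one (image, prompt) pair and return the raw text answer.
        return "placeholder answer"   # replace with the model's real output
```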
Key directories:

- `/environments`: Physics domain logic, scenario generation, and ground-truth calculation.
- `/models`: Model wrappers and the quantization pipeline.
- `/evaluation`: Core evaluation logic, metrics, prompting, and inference.
- `/results`: Generated leaderboards, plots, and JSON output.
If you use this framework, please cite:
```bibtex
@misc{prnvpwr2612-physics-benchmark-2025,
  author       = {Pranav Pawar and Kavish Shah and Hadi Gala},
  title        = {Interpretable Physics Reasoning and Performance Taxonomy in Vision-Language Models},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/prnvpwr2612/Interpretable-Physics-Reasoning-and-Performance-Taxonomy-in-Vision-Language-Models}}
}
```