
Interpretable Physics Reasoning and Performance Taxonomy in Vision-Language Models

License: MIT · Python 3.9+ · PyTorch · Transformers


Overview

This repository contains a comprehensive framework for benchmarking the physics reasoning capabilities of Vision-Language Models (VLMs). It provides an end-to-end solution, from programmatic generation of physics scenarios and their ground-truth solutions to a multi-faceted evaluation pipeline that assesses not just accuracy but the quality of a model's reasoning.


🎯 The Goal of This Framework

The primary goal is to provide a robust, reproducible, and interpretable framework to systematically evaluate the physics reasoning abilities of VLMs.

This project is designed to:

  • Go Beyond Accuracy: Move past simple right/wrong scoring by analyzing the quality and physical correctness of a model's step-by-step reasoning.
  • Ensure Interpretability: Through detailed failure analysis, identify specific conceptual gaps (e.g., misunderstanding "momentum conservation") in different models.
  • Standardize Evaluation: Offer a unified benchmark with diverse, programmatically generated scenarios across multiple physics domains, complete with ground-truth solutions and visualizations.
  • Analyze Model Capabilities: Enable deep dives into model performance, including their adaptability to problem difficulty, the impact of quantization, and the effectiveness of various prompting strategies.

✨ Features

  • Diverse Physics Environments:
    • Projectile Motion: 2D trajectory problems with gravity and air resistance.
    • Collision Dynamics: Elastic and inelastic collisions.
    • Mechanics: Levers, pulleys, and inclined planes.
    • Fluid Dynamics: Scenarios based on the continuity equation and Bernoulli's principle.
  • Automated Content Generation:
    • Scenario Generator: Programmatically creates hundreds of unique problems with varying difficulty levels.
    • Ground Truth Engine: Calculates high-precision solutions for every scenario.
    • Scene Renderer: Automatically generates visualizations for each problem.
  • Multi-Faceted Evaluation Metrics:
    • Physics Accuracy: Quantitative comparison against ground-truth solutions.
    • Reasoning Quality: NLP-based analysis of explanations for logical consistency and correct application of physics principles.
    • Adaptability: Performance measurement across different domains and difficulty levels.
  • Advanced Prompt Engineering:
    • Includes strategies like Chain-of-Thought (CoT), Few-Shot, and Socratic prompting to elicit detailed reasoning from models (a CoT template sketch follows after this list).
  • Comprehensive Failure Analysis:
    • Automatically categorizes errors into Conceptual, Computational, and Perceptual types.
    • Identifies specific physics principles that are challenging for each model.
  • Resource-Aware Design:
    • Features a QuantizationPipeline for efficient evaluation of large models (e.g., 4-bit and 8-bit quantization) on consumer-grade or cloud GPUs (e.g., T4).
    • Tracks computational metrics like GPU memory and CPU usage.
  • Publication-Ready Outputs:
    • Automatically generates leaderboards, performance tables, and a rich set of visualizations (radar charts, heatmaps, learning curves) to facilitate research and analysis.
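
To make the prompting strategies concrete, here is a minimal sketch of a Chain-of-Thought template in Python. The wording, the placeholder fields (scenario_description, question), and the ANSWER: convention are illustrative assumptions; the framework's actual templates live in its evaluation code.

# Illustrative Chain-of-Thought template; field names and wording are assumptions,
# not the framework's actual prompts.
COT_TEMPLATE = (
    "You are shown an image of a physics scenario.\n"
    "Scenario: {scenario_description}\n"
    "Question: {question}\n\n"
    "Think step by step: identify the relevant physics principles, write the\n"
    "governing equations, substitute the known quantities, and finish with a\n"
    "single line of the form 'ANSWER: <value> <unit>'."
)

def build_cot_prompt(scenario_description: str, question: str) -> str:
    # Fill the template for one scenario before sending it to a VLM.
    return COT_TEMPLATE.format(
        scenario_description=scenario_description,
        question=question,
    )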

📂 Repository Structure

physic_vlm_benchmark/
│
├── README.md               # Project overview, usage instructions
├── requirements.txt        # Dependencies
├── run.py                  # Main entrypoint to run experiments
│
├── configs/                # Experiment and model configuration files
│   ├── default.yaml
│   ├── models.yaml
│   └── environments.yaml
│
├── environments/           # Physics simulation environments
│   ├── projectile.py
│   ├── collisions.py
│   └── ...                 # (Other physics domains)
│
├── models/                 # Wrappers for different VLMs
│   ├── base_model.py       # Abstract base class
│   └── ...                 # (Wrappers for LLaMA, Qwen, etc.)
│
├── evaluation/             # Evaluation logic, metrics, and pipeline
│   ├── metrics.py
│   └── evaluator.py
│
├── utils/                  # Helper functions (logging, plotting)
│
├── experiments/            # Scripts for specific experiments
│   ├── runner.py           # Orchestrates experiments
│   └── ablation_study.py
│
├── results/                # Stores outputs (JSON, CSV, plots)
│
└── logs/                   # Log files from experiment runs

📊 Datasets, Models, and Evaluation Results

Supported Physics Domains (Environments)

This framework includes a benchmark suite of over 400 procedurally generated scenarios distributed across four core physics domains, each with difficulty levels (easy, medium, hard), ground-truth solutions, and visualizations.

| Environment | Description | Key Concepts | Scenarios |
|---|---|---|---|
| Projectile Motion | 2D trajectory prediction with gravity and air resistance. | Kinematics, Vector Components, Energy Conservation | 100 |
| Collision Dynamics | 1D elastic and inelastic collisions between objects. | Momentum & Energy Conservation, Coefficient of Restitution | 100 |
| Mechanics | Analysis of levers, pulleys, and inclined planes. | Torque, Mechanical Advantage, Static Equilibrium, Friction | 100 |
| Fluid Dynamics | Pipe flow and pressure change problems. | Continuity Equation, Bernoulli's Principle | 100 |
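
To make the ground-truth solutions concrete: for an easy projectile scenario without air resistance, the reference answer has a closed form. The function below is a minimal sketch of such a calculation; the drag-free assumption and the returned field names are illustrative, not the exact behavior of the framework's Ground Truth Engine.

import math

def projectile_ground_truth(speed: float, angle_deg: float, g: float = 9.81) -> dict:
    # Closed-form solution for a drag-free launch from ground level:
    # time of flight, horizontal range, and maximum height.
    angle = math.radians(angle_deg)
    vx, vy = speed * math.cos(angle), speed * math.sin(angle)
    t_flight = 2 * vy / g
    return {
        "time_of_flight_s": t_flight,
        "range_m": vx * t_flight,
        "max_height_m": vy ** 2 / (2 * g),
    }

# Example: 20 m/s at 45 degrees gives a range of roughly 40.8 m.
print(projectile_ground_truth(20.0, 45.0))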

Supported Models

The framework is model-agnostic. The VLMWrapper class provides a unified interface to integrate a broad range of open-source VLMs. Baseline models used include:

  • Gemma2-27B-Vision
  • Qwen2.5-VL-7B
  • LLaMA-3.2-Vision-11B
  • DeepSeek-VL-1.3B

Evaluation Results

Summary of baseline model performance; the Overall Score combines Physics Accuracy and Reasoning Quality:

Baseline Model Performance Summary

| Rank | Model | Size (B) | Overall Score | 95% CI | Physics Accuracy | Reasoning Quality | Computational Efficiency | Success Rate | Avg. Inference (s) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen-VL-7B | 7 | 0.815 | [0.800, 0.830] | 0.85 | 0.78 | 0.92 | 0.89 | 2.3 |
| 2 | LLaMA-Vision-11B | 11 | 0.765 | [0.750, 0.780] | 0.79 | 0.83 | 0.68 | 0.84 | 3.7 |
| 3 | Gemma-27B | 27 | 0.750 | [0.735, 0.765] | 0.88 | 0.85 | 0.45 | 0.91 | 8.2 |
| 4 | DeepSeek-VL-1.3B | 1.3 | 0.700 | [0.685, 0.715] | 0.72 | 0.69 | 0.95 | 0.78 | 1.1 |
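
The Physics Accuracy column reflects quantitative agreement with the ground-truth solutions. As one illustration of how a numeric answer can be scored against ground truth, the sketch below uses a relative-error tolerance; the framework's actual metric lives in evaluation/metrics.py and may be defined differently.

def physics_accuracy(predicted: float, ground_truth: float, rel_tol: float = 0.05) -> float:
    # Score 1.0 if the prediction is within rel_tol of the ground truth;
    # otherwise decay linearly with the relative error (the tolerance is an assumption).
    rel_error = abs(predicted - ground_truth) / max(abs(ground_truth), 1e-9)
    return 1.0 if rel_error <= rel_tol else max(0.0, 1.0 - rel_error)

print(physics_accuracy(40.0, 40.8))  # within 5% of ground truth -> 1.0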

🏗️ Quickstart

1. Installation

Clone the repository and install dependencies:

git clone https://github.com/prnvpwr2612/Interpretable-Physics-Reasoning-and-Performance-Taxonomy-in-Vision-Language-Models.git
cd Interpretable-Physics-Reasoning-and-Performance-Taxonomy-in-Vision-Language-Models
pip install -r requirements.txt

2. Running the Full Benchmark

Launch the full benchmark pipeline from the notebook notebook/Code.ipynb. Its cells are grouped into sequential phases covering setup, scenario generation, evaluation, ablation, and result visualization.

3. Evaluating a New Model

  • Implement a custom wrapper subclassing BaseModelWrapper if needed (a minimal sketch follows after this list).
  • Register your new model in the evaluation pipeline.
  • Run evaluation notebooks or experiment scripts to benchmark.
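
A minimal sketch of such a wrapper is shown below. The class name BaseModelWrapper follows the step above, but the abstract methods and their signatures are defined in models/base_model.py, so the generate_answer signature here is an assumption for illustration only.

from models.base_model import BaseModelWrapper  # abstract base class in this repository

class MyVLMWrapper(BaseModelWrapper):
    # Hypothetical wrapper: match the abstract methods actually declared in
    # models/base_model.py; the method below is only an illustrative placeholder.
    def __init__(self, model_name: str):
        self.model_name = model_name
        # Load the model and processor here (e.g., via Hugging Face Transformers).

    def generate_answer(self, image, prompt: str) -> str:
        # Run inference on one (image, prompt) pair and return the raw text answer.
        raise NotImplementedError("Plug in your model's inference call here.")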

🛠️ Framework Architecture

  • /environments: Physics domain logic, scenario generation, ground-truth calculation.
  • /models: Model wrappers and quantization pipeline (a 4-bit loading sketch follows after this list).
  • /evaluation: Core evaluation logic, metrics, prompting, and inference.
  • /results: Generated leaderboards, plots, and JSON output.
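
As referenced in the /models entry above, large checkpoints can be evaluated in 4-bit precision on a T4-class GPU. The snippet below is a minimal sketch of that loading pattern using Hugging Face Transformers with bitsandbytes; it is not the repository's QuantizationPipeline, and the model ID is only a placeholder example.

import torch
from transformers import AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig

# 4-bit NF4 quantization config (requires the bitsandbytes package and a CUDA GPU).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "llava-hf/llava-1.5-7b-hf"  # placeholder example, not necessarily a benchmarked model
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the available GPU(s)
)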

🖊️ Citation

If you use this framework, please cite:

@misc{prnvpwr2612-physics-benchmark-2025,
  author       = {Pranav Pawar and Kavish Shah and Hadi Gala},
  title        = {Interpretable Physics Reasoning and Performance Taxonomy in Vision-Language Models},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/prnvpwr2612/Interpretable-Physics-Reasoning-and-Performance-Taxonomy-in-Vision-Language-Models}}
}
