
Awesome Visual Spatial Reasoning

Contributors:

Hao Ju, Songsong Yu, Lianjie Jia, Rundi Cui, Yuhan Wu, Binghao Ran, Zhang Zaibin, Zhang Fuxi, Zhipeng Zhang, Lin Song, Yuxin Chen🌟, Yanwei Li

🌟 Project Lead

News and Updates

  • Publish a preprint survey article on visual spatial reasoning tasks.

  • Open-source the evaluation toolkit.

  • Open-source the evaluation data for visual spatial reasoning tasks.

  • Release comprehensive evaluation results for mainstream models on visual spatial reasoning.

  • ✍️🦾💼25.6.28 - Added the "Datasets" section.

  • 🏃🏃‍♀️🏃‍♂️25.6.16 - The "Awesome Visual Spatial Reasoning" project is now live!

  • 👏🕮💻25.6.12 - Surveyed the field and collected 100 relevant works.

  • 🙋‍♀️🙋‍♂️🙋25.6.10 - We launched a survey project on visual spatial reasoning.

Contributing

We welcome contributions to this repository! If you'd like to contribute, please follow these steps:

  • Fork the repository.
  • Create a new branch with your changes.
  • Submit a pull request with a clear description of your changes.

You can also open an issue if you have anything to add or discuss.

Please feel free to contact us (SongsongYu203@163.com).

Overview

The research community is increasingly focused on the visual spatial reasoning (VSR) abilities of Vision-Language Models (VLMs). Yet, the field lacks a clear overview of its evolution and a standardized benchmark for evaluation. Current assessment methods are disparate and lack a common toolkit. This project aims to fill that void. We are developing a unified, comprehensive, and diverse evaluation toolkit, along with an accompanying survey paper. We are actively seeking collaboration and discussion with fellow experts to advance this initiative.
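
As a sketch of what such a unified toolkit might look like, the snippet below outlines a minimal common evaluation interface in Python. All names here (VSRSample, VSRBenchmark, evaluate) are hypothetical illustrations, not the project's actual API.

```python
# Minimal sketch of a unified VSR evaluation interface.
# Hypothetical API for illustration, not this project's released toolkit.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class VSRSample:
    image_path: str   # input image (or first frame of a video clip)
    question: str     # spatial question, e.g. "Is the mug left of the laptop?"
    answer: str       # gold answer used for scoring


class VSRBenchmark:
    """Wraps one benchmark (e.g. VSI-Bench, BLINK) behind a common interface."""

    def __init__(self, name: str, samples: Iterable[VSRSample]):
        self.name = name
        self.samples = list(samples)

    def evaluate(self, model: Callable[[str, str], str]) -> float:
        """`model` maps (image_path, question) to a predicted answer string."""
        correct = sum(
            model(s.image_path, s.question).strip().lower() == s.answer.lower()
            for s in self.samples
        )
        return correct / max(1, len(self.samples))
```

With a shared interface like this, any VLM wrapper can be scored on every benchmark with the same metric, which is the kind of standardization a unified toolkit aims to provide.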

Task Explanation

Visual spatial understanding is a key task at the intersection of computer vision and cognitive science. It aims to enable intelligent agents (such as robots and AI systems) to parse spatial relationships in their environment from visual inputs (images, videos, etc.), forming an abstract cognition of the physical world. In Embodied Intelligence, it is the foundation of the "perception-decision-action" loop: only by understanding attributes such as object positions, distances, sizes, and orientations can agents navigate environments, manipulate objects, or interact with humans.
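
To make the task concrete, the toy sketch below derives the kind of qualitative relations such benchmarks ask about (left/right, above/below, nearer/farther) from made-up object detections with 2D box centers and estimated depths. Real VSR systems must infer these relations from raw pixels; everything in this example is illustrative.

```python
# Toy illustration of the relations a VSR system must recover.
# Detections (labels, coordinates, depths) are made-up example values.
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    cx: float     # box center x in pixels (smaller = further left)
    cy: float     # box center y in pixels (smaller = higher in the image)
    depth: float  # estimated metric depth in meters


def spatial_relations(a: Detection, b: Detection) -> list[str]:
    """Produce qualitative spatial statements about a relative to b."""
    return [
        f"{a.label} is to the {'left' if a.cx < b.cx else 'right'} of {b.label}",
        f"{a.label} is {'above' if a.cy < b.cy else 'below'} {b.label}",
        f"{a.label} is {'nearer than' if a.depth < b.depth else 'farther than'} {b.label}",
    ]


mug = Detection("the mug", cx=120, cy=300, depth=0.8)
laptop = Detection("the laptop", cx=400, cy=280, depth=1.1)
for relation in spatial_relations(mug, laptop):
    print(relation)  # e.g. "the mug is to the left of the laptop"
```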

Timeline

Visual Spatial Intelligence: A Survey

Table of Contents

To help the community quickly get an overview of visual spatial reasoning, we first categorize the literature by input modality into Single Image, Monocular Video, and Multi-View Images. We also survey other input modalities, such as point clouds, and specific applications, such as embodied robotics; these are temporarily grouped under "Others" and will be organized in more detail in the future.
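
A minimal sketch of this taxonomy as a data structure (hypothetical names, for illustration only):

```python
# Hypothetical encoding of the survey's input-modality taxonomy.
from enum import Enum


class Modality(Enum):
    SINGLE_IMAGE = "Single Image"
    MONOCULAR_VIDEO = "Monocular Video"
    MULTI_VIEW_IMAGES = "Multi-View Images"
    OTHERS = "Others"  # point clouds, embodied robotics, etc.


# Papers indexed by modality; the example entries mirror rows from the tables below.
papers: dict[Modality, list[str]] = {m: [] for m in Modality}
papers[Modality.SINGLE_IMAGE].append("SpatialVLM (CVPR, 24-01)")
papers[Modality.MONOCULAR_VIDEO].append("Thinking in Space (CVPR, 25-01)")
```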

Papers

Single Image

Title Venue Date Code Stars Benchmark Illustration
R2D3: Imparting Spatial Reasoning by Reconstructing 3D Scenes from 2D Images ARXIV -- -- -- R2D3 img
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics ARXIV 25-06 link -- RefSpatial-Bench img
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces ARXIV 25-06 link Star count VeBrain-600k img
SVQA-R1: Reinforcing Spatial Reasoning in MLLMs via View-Consistent Reward Optimization ARXIV 25-06 -- -- SVQA-R1 img
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models -- 25-06 link Star count OmniSpatial img
Can Multimodal Large Language Models Understand Spatial Relations ARXIV 25-05 link Star count SpatialMQA img
SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning -- 25-05 -- -- SSR-CoT img
Spatial-LLaVA: Enhancing Large Language Models with Spatial Referring Expressions for Visual Understanding ARXIV 25-05 link -- SUNSPOT img
Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning? ARXIV 25-05 link -- OSR-Bench img
SITE: Towards Spatial Intelligence Thorough Evaluation ARXIV 25-05 link Star count SITE img
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning ARXIV 25-05 -- -- TallyQA, V*, InfographicVQA, MVBench img
Improved Visual-Spatial Reasoning via R1-Zero-Like Training ARXIV 25-04 link Star count VSI-100K img
Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation ARXIV 25-04 link Star count COMFORT++, 3DSRBench img
SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning ARXIV 25-04 link -- -- img
Vision language models are unreliable at trivial spatial cognition ARXIV 25-04 -- -- TableTest img
SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data ARXIV 25-04 -- -- VSR, What's Up, 3DSRBench, RealWorldQA img
NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving ARXIV 25-04 link Star count NuScenes-SpatialQA img
Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models -- 25-03 link -- -- img
MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse ARXIV 25-03 link Star count MetaSpatial img
Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models ARXIV 25-03 link Star count SRBench img
Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space ARXIV 25-03 link Star count Open3DVQA img
AutoSpatial: Visual-Language Reasoning for Social Robot Navigation through Efficient Spatial Reasoning Learning ARXIV 25-03 link Star count AutoSpatial img
Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas ARXIV 25-03 link Star count -- img
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning? ARXIV 25-03 link Star count LEGO-Puzzles img
Visual Agentic AI for Spatial Reasoning with a Dynamic API ARXIV 25-02 link Star count -- img
iVISPAR — An Interactive Visual-Spatial Reasoning Benchmark for VLMs ARXIV 25-02 link Star count iVISPAR img
Visual Agentic AI for Spatial Reasoning with a Dynamic API ARXIV 25-02 link Star count Q-Spatial Bench, VSI-Bench img
Defining and Evaluating Visual Language Models’ Basic Spatial Abilities: A Perspective from Psychometrics ARXIV 25-02 -- -- BSA-Tests img
Mitigating Hallucinations in Multimodal Spatial Relations through Constraint-Aware Prompting NAACL 25-02 link Star count ARO, GQA, MMRel img
Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference under Ambiguities ICLR 25-01 link -- COMFORT img
ROBOSPATIAL: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics CVPR 25-01 -- -- -- img
Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models CVPR 25-01 -- -- -- img
Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model CVPR 25-01 link -- -- img
SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models CVPR 25-01 link -- -- img
SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language CVPR 25-01 link Star count SpatialBench img
SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning ARXIV 25-01 -- -- -- img
ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning CVPR 25-01 link -- ReasoningGD img
Imagine while Reasoning in Space: Multimodal Visualization-of-Thought ARXIV 25-01 -- -- LEC23, WMS+24, LZZ+24, RDT+24 img
LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations ARXIV 24-12 link Star count SpaceSGG img
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark ARXIV 24-12 link -- 3DSRBench img
SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation ACL 24-12 link Star count SPHERE img
TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation ARXIV 24-11 -- -- -- img
ROOT: VLM-based System for Indoor Scene Understanding and Beyond ARXIV 24-11 link Star count SceneVLM img
An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models EMNLP 2024 24-11 link Star count Spatial-MM, GQA-spatial img
GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks ARXIV 24-11 link Star count GEOBench-VLM img
Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Spatial Reasoning NIPS 24-10 -- -- What's Up, COCO-spatial, GQA-spatial img
Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models ARXIV 24-09 link Star count Q-Spatial Bench img
Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning? ARXIV 24-09 link Star count SVAT img
Understanding Depth and Height Perception in Large Visual-Language Models CVPRW 24-08 link Star count GeoMeter img
Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model -- 24-08 -- -- ScanQA, OpenEQA’s episodic memory subset, EgoSchema, R2R, SQA3D img
VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs ARXIV 24-07 link Star count VSP img
SpatialBot: Precise Spatial Understanding with Vision Language Models ICRA 24-06 link Star count SpatialBench img
SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models ARXIV 24-06 link Star count SpatialRGPT-Bench img
TOPVIEWRS: Vision-Language Models as Top-View Spatial Reasoners ARXIV 24-06 link -- -- img
Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models NIPS 24-06 link Star count SpatialEval img
EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models ARXIV 24-06 link Star count EmbSpatial-Bench img
GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs NEURIPS 2024 WORKSHOP 24-06 -- -- GSR-BENCH img
RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics CORL2024 24-06 link Star count RoboPoint img
Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning ARXIV 24-05 -- -- RoomSpace, bAbI, StepGame, SpartQA, SpaRTUN img
RAG-Guided Large Language Models for Visual Spatial Description with Adaptive Hallucination Corrector ACMMM24 24-05 -- -- VSD img
Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models NIPS 24-04 link Star count VoT img
BLINK: Multimodal Large Language Models Can See but Not Perceive ECCV 24-04 link Star count BLINK img
Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning CVPR2024 24-04 link Star count KITTI-360 img
Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning -- 24-03 link Star count Visual-CoT img
Can Transformers Capture Spatial Relations between Objects? ICLR 24-03 link Star count SRP img
SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors NEURIPS2024 24-03 link Star count NOCS, RT-1, BridgeData V2, YCBInEOAT img
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities CVPR 24-01 link Star count -- img
LLaVA-VSD: Large Language-and-Vision Assistant for Visual Spatial Description ACM MM 24-01 -- -- -- img
Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis ARXIV 24-01 link Star count Proximity-110K img
Improving Vision-and-Language Reasoning via Spatial Relations Modeling WACV 23-11 -- -- -- img
3D-Aware Visual Question Answering about Parts, Poses and Occlusions NIPS 23-10 link Star count Super-CLEVR-3D img
Things not Written in Text: Exploring Spatial Commonsense from Visual Signals ACL2022 22-03 link Star count -- img

Monocular Video

Title Venue Date Code Stars Benchmark Illustration
OpenEQA: Embodied Question Answering in the Era of Foundation Models CVPR -- link Star count OpenEQA img
Spatial Understanding from Videos: Structured Prompts Meet Simulation Data ARXIV 25-06 link Star count -- img
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction ARXIV 25-05 link Star count VSTiBench img
3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model ARXIV 25-05 link Star count 3DMEM-Bench img
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence ARXIV 25-05 link Star count Spatial-MLLM-120k img
Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames ARXIV 25-05 -- -- DISJOINT-3DQA img
Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors ARXIV 25-05 link Star count VSI-Bench img
SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models ARXIV 25-05 -- -- -- img
Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts ARXIV 25-05 link Star count -- img
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning ARXIV 25-04 link Star count -- img
Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning ARXIV 25-04 link Star count Embodied-R img
SpaceR: Reinforcing MLLMs in Video Spatial Reasoning ARXIV 25-04 link Star count VSI-Bench, STI-Bench, SPAR-Bench img
Towards Understanding Camera Motions in Any Video ARXIV 25-04 link Star count CameraBench img
EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining ARXIV 25-03 link Star count EgoMCQ... img
STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding? ARXIV 25-03 link Star count STI-Bench img
ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos ARXIV 25-03 link Star count Ego-ST img
From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D ARXIV 25-03 link Star count SPAR-Bench img
ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models ARXIV 25-03 link Star count STKit img
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces CVPR 25-01 link Star count VSI-Bench img
M3: 3D-Spatial Multimodal Memory ICLR 25-01 link Star count -- img
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding CVPR 25-01 -- -- -- img
Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering ICLR 25-01 link Star count DynSuperCLEVR img
Does Spatial Cognition Emerge in Frontier Models? ICLR 24-10 link Star count SPACE img
Explore until Confident: Efficient Exploration for Embodied Question Answering ARXIV 24-03 link Star count -- img
EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI CVPR 24-01 -- -- -- img

Multi-View Images

Title Venue Date Code Stars Benchmark Illustration
InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models ARXIV 25-06 -- -- InternSpatial img
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with MultiModal Large Language Models ARXIV 25-05 link Star count -- img
MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence ARXIV 25-05 link Star count MMSI-Bench img
ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models ARXIV 25-05 link Star count ViewSpatial-Bench img
Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs ARXIV 25-04 link Star count All-Angles-Bench img
MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs ARXIV 25-03 -- -- Cubify Anything VQA img
CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models CVPR 25-01 -- -- CoSpace img
SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models ARXIV 24-12 -- -- -- img
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection CVPR2025 24-12 -- -- CLIPort, OmniGibson, RLBench img
SPARE3D: A Dataset for SPAtial REasoning on Three-View Line Drawings CVPR2020 20-03 link Star count SPARE3D img

Others

Title Venue Date Code Stars Benchmark Illustration
SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence ARXIV 25-06 link Star count SpaCE-10 img
Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics ARXIV 25-06 -- -- Robot-R1 Bench img
Struct2D: A Perception-Guided Framework for Spatial Reasoning in Large Multimodal Models ARXIV 25-06 link Star count -- img
SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding ARXIV 25-05 link Star count SpatialScore img
A Light and Smart Wearable Platform with Multimodal Foundation Model for Enhanced Spatial Reasoning in People with Blindness and Low Vision ECCV 25-05 -- -- LVSQA img
ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation ARXIV 25-05 link -- ManipBench img
InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning ARXIV 25-05 link Star count -- img
Universal Visuo-Tactile Video Understanding for Embodied Interaction ARXIV 25-05 -- -- -- img
MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents ARXIV 25-05 link Star count -- img
LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding ARXIV 25-05 -- -- Scan2Cap, ScanQA, ScanRefer, Multi3DRefer, Chat4D img
Scaling and Beyond: Advancing Spatial Reasoning in MLLMs Requires New Recipes ARXIV 25-04 -- -- -- img
A Survey of Large Language Model-Powered Spatial Intelligence Across Scales: Advances in Embodied Agents, Smart Cities, and Earth Science ARXIV 25-04 -- -- -- img
Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness ARXIV 25-04 link Star count SQA3D, ScanQA img
Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D ARXIV 25-04 link Star count SR3D, NR3D, ScanRefer img
3DAxisPrompt: Promoting the 3D Grounding and Reasoning in GPT-4o -- 25-03 -- -- -- img
EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks ARXIV 25-03 -- -- -- img
FoREST: Frame of Reference Evaluation in Spatial Reasoning Tasks ARXIV 25-02 link Star count FoREST img
A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards ICRA 25-02 link Star count -- img
Space-Aware Instruction Tuning: Dataset and Benchmark for Guide Dog Robots Assisting the Visually Impaired ARXIV 25-02 link Star count SA-Bench img
VL-Nav: Real-time Vision-Language Navigation with Spatial Reasoning ARXIV 25-02 -- -- -- img
SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation ARXIV 25-02 link Star count -- img
PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding ARXIV 25-01 link Star count PhysBench img
3D-Mem: 3D Scene Memory for Embodied Exploration and Reasoning CVPR 25-01 link Star count -- img
PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability CVPR 25-01 -- -- -- img
Evaluating and enhancing spatial cognition abilities of large language models IJGIS 25-01 link Star count -- img
EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents ARXIV 25-01 -- -- EmbodiedEval img
Social-LLaVA: Enhancing Robot Navigation through Human-Language Reasoning in Social Spaces ARXIV 25-01 link -- Social-LLaVA img
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model ARXIV 25-01 link Star count SpatialVLA img
OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints CVPR2025 25-01 link Star count OmniManip img
Synthetic Vision: Training Vision-Language Models to Understand Physics -- 24-12 -- -- -- img
Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning -- 24-12 link Star count Emma-X img
Do Multimodal Language Models Really Understand Direction? A Benchmark for Compass Direction Reasoning ARXIV 24-12 -- -- -- img
GPT-4V(ision) for Robotics: Multimodal Task Planning From Human Demonstration IEEE ROBOTICS AND AUTOMATION LETTERS 24-11 -- -- -- img
Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities ICLR 24-10 link -- COMFORT img
I Know About “Up”! Enhancing Spatial Reasoning in Visual Language Models Through 3D Reconstruction ARXIV 24-07 -- -- -- img
GRASP: A Grid-Based Benchmark for Evaluating Commonsense Spatial Reasoning ARXIV 24-07 link Star count -- img
RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics ARXIV 24-06 link Star count RoboPoint img
SpaRC and SpaRP: Spatial Reasoning Characterization and Path Generation for Understanding Spatial Reasoning Capability of Large Language Models ACL 24-06 link Star count SpaRP img
Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning CVPR 24-04 link Star count -- img
Agent3D-Zero: An Agent for Zero-shot 3D Understanding ECCV 24-03 -- -- -- img
Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning WACV 24-03 -- -- ScanQA, SQA3D, ALFRED img
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction ECCV 24-02 link Star count -- img
BAT: Learning to Reason about Spatial Sounds with Large Language Models ARXIV 24-02 link Star count -- img
Beyond Lines and Circles: Unveiling the Geometric Reasoning Gap in Large Language Models ARXIV 24-02 -- -- Euclidea img
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning CVPR 24-01 link Star count -- img
Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark AAAI 24-01 -- -- StepGame img
Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models CVPR 24-01 link Star count NuInstruct img

Datasets

Title Venue Date Download-Link Citation Input Type Illustration
Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference under Ambiguities ICLR 25-11 link 0 Image img
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models ARXIV 25-06 link 0 Image img
SpatialScore Towards Unified Evaluation for Multimodal Spatial Understanding ARXIV 25-05 link 0 Image img
Can Multimodal Large Language Models Understand Spatial Relations ARXIV 25-05 link -- Image img
Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning? ARXIV 25-05 link 0 Image img
SITE: Towards Spatial Intelligence Thorough Evaluation ARXIV 25-05 link 0 Image/Multi-view Image/Video img
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction ARXIV 25-05 link 0 Video img
MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence ARXIV 25-05 link 1 Multi-view img
ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models ARXIV 25-05 link -- Multi-view img
Improved Visual-Spatial Reasoning via R1-Zero-Like Training ARXIV 25-04 link 9 -- img
Towards Understanding Camera Motions in Any Video ARXIV 25-04 link 1 Video img
Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs ARXIV 25-04 link 3 multi-view img
Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models ARXIV 25-03 link 5 Image img
Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space ARXIV 25-03 link 4 Image img
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning? ARXIV 25-03 link 8 Multi-view img
STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding? ARXIV 25-03 -- 7 Video img
ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos ARXIV 25-03 link 5 Video img
From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D ARXIV 25-03 link 2 Image img
Space-Aware Instruction Tuning: Dataset and Benchmark for Guide Dog Robots Assisting the Visually Impaired ARXIV 25-02 link 1 Image img
VADAR: Visual Agentic AI for Spatial Reasoning with a Dynamic API CVPR 25-02 link 5 image img
PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding ARXIV 25-01 link 22 Image/Video img
Social-LLaVA: Enhancing Robot Navigation through Human-Language Reasoning in Social Spaces ARXIV 25-01 link 2 Image img
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model ARXIV 25-01 link -- Video img
Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning ARXIV 24-12 link -- Video img
LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations ARXIV 24-12 link 2 image img
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark ARXIV 24-12 link 12 Image img
SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation ACL 24-12 link 3 Image img
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces ARXIV 24-12 link 86 Video img
ROOT: VLM-based System for Indoor Scene Understanding and Beyond ARXIV 24-11 -- 2 image img
An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models EMNLP 2024 24-11 link 9 Image img
GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks ARXIV 24-11 link 7 Image img
OpenEQA: Embodied Question Answering in the Era of Foundation Models CVPR 24-11 link 149 Video img
Does Spatial Cognition Emerge in Frontier Models? ICLR 24-10 link 19 Video img
Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models ARXIV 24-09 link 15 Image img
Understanding Depth and Height Perception in Large Visual-Language Models CVPRW 24-08 link 0 2D/3D Image img
VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs ARXIV 24-07 link 4 Image img
SpaRC and SpaRP: Spatial Reasoning Characterization and Path Generation for Understanding Spatial Reasoning Capability of Large Language Models ACL 24-06 link 4 text img
SpatialBot: Precise Spatial Understanding with Vision Language Models ICRA 24-06 link 43 Image img
SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models ARXIV 24-06 link 104 Image, point cloud img
Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models NEURIPS 24-06 link 61 text only/image only/image-text img
EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models ACL 2024 SHORT 24-06 link 22 Image img
Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering ICLR2025 24-06 link 5 Video img
Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning IJCAI 2024 24-05 link 9 Multi-view img
Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models NEURIPS 24-04 link 33 image img
BLINK: Multimodal Large Language Models Can See but Not Perceive ECCV 24-04 link 180 Image/Multi-view Image img
Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning NEURIPS 24-03 link 74 image img
Can Transformers Capture Spatial Relations between Objects? ICLR 24-03 link 6 Image img
Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models CVPR 24-01 link -- Video img
Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis ARXIV 24-01 link 2 Image img
3D-Aware Visual Question Answering about Parts, Poses and Occlusions NIPS 23-10 link 14 Image img
SQA3D: Situated Question Answering in 3D Scenes ICLR 22-10 link 161 point cloud img
ScanQA: 3D Question Answering for Spatial Scene Understanding CVPR 21-12 link 234 point cloud img
SPARE3D: A Dataset for SPAtial REasoning on Three-View Line Drawings CVPR2020 20-03 link 23 multi-view img

Acknowledgements
