Contributors:
Hao Ju, Songsong Yu, Lianjie Jia, Rundi Cui, Yuhan Wu, Binghao Ran, Zhang Zaibin, Zhang Fuxi, Zhipeng Zhang, Lin Song, Yuxin Chen🌟, Yanwei Li
🌟 Project Lead
- Preprint a survey article on visual spatial reasoning tasks.
- Open-source evaluation toolkit.
- Open-source evaluation data for visual spatial reasoning tasks.
- Release comprehensive evaluation results of mainstream models in visual spatial reasoning.
- ✍️🦾💼 25.6.28 - Compiled the "Datasets" section.
- 🏃🏃♀️🏃♂️ 25.6.16 - The "Awesome Visual Spatial Reasoning" project is now live!
- 👏🕮💻 25.6.12 - We surveyed the field and collected 100 relevant works.
- 🙋♀️🙋♂️🙋 25.6.10 - We launched this survey project on visual spatial reasoning.
We welcome contributions to this repository! If you'd like to contribute, please follow these steps:
- Fork the repository.
- Create a new branch with your changes.
- Submit a pull request with a clear description of your changes.
You can also open an issue if you have anything to add or comment on.
Please feel free to contact us (SongsongYu203@163.com).
The research community is increasingly focused on the visual spatial reasoning (VSR) abilities of Vision-Language Models (VLMs). Yet, the field lacks a clear overview of its evolution and a standardized benchmark for evaluation. Current assessment methods are disparate and lack a common toolkit. This project aims to fill that void. We are developing a unified, comprehensive, and diverse evaluation toolkit, along with an accompanying survey paper. We are actively seeking collaboration and discussion with fellow experts to advance this initiative.
Visual spatial understanding is a key task at the intersection of computer vision and cognitive science. It aims to enable intelligent agents (such as robots and AI systems) to parse spatial relationships in the environment through visual inputs (images, videos, etc.), forming an abstract cognition of the physical world. In Embodied Intelligence, it serves as the foundation for agents to achieve the "perception-decision-action" loop—only by understanding attributes like object positions, distances, sizes, and orientations in space can intelligent agents navigate environments, manipulate objects, or interact with humans.
To help the community quickly grasp visual spatial reasoning, we first categorize the literature by input modality into Single Image, Monocular Video, and Multi-View Images. We also survey other input modalities, such as point clouds, as well as specific applications like embodied robotics; these are temporarily grouped under "Others," and we will provide a more detailed categorization in the future.
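To illustrate how this taxonomy might be reflected in an evaluation toolkit, the minimal Python sketch below encodes the four input-modality categories as an enum and a simple record type for an evaluation sample. All names here (`InputModality`, `EvalSample`) and the sample fields are hypothetical, shown only as one possible way to organize data by modality; they do not describe the actual toolkit's API.

```python
from dataclasses import dataclass
from enum import Enum


class InputModality(Enum):
    """Input-modality categories used to organize visual spatial reasoning work."""
    SINGLE_IMAGE = "single_image"
    MONOCULAR_VIDEO = "monocular_video"
    MULTI_VIEW_IMAGES = "multi_view_images"
    OTHERS = "others"  # e.g., point clouds, embodied-robotics applications


@dataclass
class EvalSample:
    """A minimal evaluation record grouped by input modality (hypothetical schema)."""
    modality: InputModality
    visual_inputs: list[str]  # paths or URLs to images / video frames
    question: str             # spatial reasoning query posed to the VLM
    answer: str               # ground-truth answer


# Example: a multi-view spatial reasoning sample.
sample = EvalSample(
    modality=InputModality.MULTI_VIEW_IMAGES,
    visual_inputs=["view_front.jpg", "view_side.jpg"],
    question="Which object is closer to the camera in the front view?",
    answer="the chair",
)
print(sample.modality.value)  # -> "multi_view_images"
```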