OmniSpatial diagnoses the limits of today's vision-language models (VLMs) on higher-order spatial cognition.
It spans 50 fine-grained tasks grouped into 4 dimensions—dynamic reasoning, complex spatial logic, spatial interaction and perspective-taking—covering 1.3K samples and 1.5K question-answer pairs.
Mengdi Jia *, Zekun Qi *, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang and Li Yi.
Dimension | Example Skills | % of QA |
---|---|---|
Dynamic Reasoning | motion prediction, temporal ordering, manipulation planning | 27 % |
Complex Spatial Logic | geometric transformations, pattern completion | 16 % |
Spatial Interaction | collision checking, path planning, traffic analysis | 20 % |
Perspective Taking | egocentric ↔ allocentric transforms, hypothetical views | 37 % |
- Real-world diversity: Internet images, driving-test frames, HOI4D videos, IQ tests
- Manual QA: multi-round human annotation, no templates
- Challenging: SOTA VLMs top out at 56.3 % vs. human 92.6 % accuracy
- Plug-and-play toolkit: unified evaluation scripts for open-source, closed-source, and reasoning models
- Research improvements: PointGraph (scene-graph reasoning) and Spatial CoT (novel-view chain-of-thought)
Environment
# create conda env
conda create -n omnispatial python=3.12 -y
conda activate omnispatial
# clone repo
git clone https://github.com/qizekun/OmniSpatial.git
cd OmniSpatial
Install dependencies
Open-source VLMs
pip install torch==2.5.1 torchvision==0.20.1 transformers==4.49.0 qwen-vl-utils[decord]==0.0.8 triton accelerate timm ninja
MAX_JOBS=64 pip install -v flash-attn --no-build-isolation
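The flash-attn build is the step that most often fails. A quick sanity check like the one below (our suggestion, not part of the repo) confirms that PyTorch sees a GPU and that flash_attn imports cleanly before you launch a local VLM run:

```python
# Optional sanity check: verify the GPU stack and the FlashAttention build.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn missing; rerun the MAX_JOBS=... install command above")
```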
Closed-source (API) VLMs
pip install openai==1.81.0
export OPENAI_API_KEY="sk-..."
# optional: export OPENAI_API_BASE="https://api.openai.com/v1"
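For reference, the closed-source evaluation boils down to chat-completion calls with an image attached. The snippet below is only an illustrative sketch of that pattern; the image path and question text are made up, and `api_eval.py` handles the real prompting and answer parsing:

```python
# Sketch of a single multiple-choice query to a closed-source VLM via the OpenAI API.
import base64
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ.get("OPENAI_API_BASE"),  # None falls back to the default endpoint
)

# Hypothetical image path; any benchmark image works the same way.
with open("dataset/Dynamic_Reasoning/example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Which option is correct? A. 2.7s  B. 14.7s  C. 25.7s  D. 3.9s"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```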
Download dataset
# optional mirror if huggingface.co is unreachable: export HF_ENDPOINT="https://hf-mirror.com"
mkdir -p dataset
huggingface-cli download --resume-download qizekun/OmniSpatial --local-dir dataset --repo-type dataset
find dataset/ -name '*.zip' -exec unzip -o {} -d dataset/ \;
rm -f dataset/*.zip && rm -rf dataset/__MACOSX
The dataset is downloaded to dataset/, and the structure is as follows:
dataset/
├── Complex_Logic/
├── Dynamic_Reasoning/
├── Perspective_Taking/
├── Spatial_Interaction/
└── data.json
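A quick way to inspect the annotations, assuming data.json is a flat list of records shaped like the example in the Annotation format section below (adjust if the schema differs):

```python
# Peek at the benchmark annotations and one sample question.
import json
from collections import Counter

with open("dataset/data.json") as f:
    data = json.load(f)  # assumed to be a list of QA records

print("QA pairs:", len(data))
print("per dimension:", Counter(item["task_type"] for item in data))

sample = data[0]
print(sample["question"])
for i, option in enumerate(sample["options"]):
    marker = "*" if i == sample["answer"] else " "  # "answer" is a 0-based option index
    print(f"  {marker} {chr(65 + i)}. {option}")
```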
Run evaluation
# Example: GPT-4.1 via OpenAI API
python api_eval.py --model_id gpt-4.1 --prompt_type manual_cot --eval_type re
# Example: local Qwen-VL-2.5-3B
cd vlm_eval
python qwenvl_eval.py --model_id Qwen/Qwen2.5-VL-3B-Instruct --prompt_type manual_cot --eval_type re
# Example: parallel evaluation
cd vlm_eval
python parallel_eval.py --model qwenvl --model_id Qwen/Qwen2.5-VL-3B-Instruct --group 8 --visible_nodes 0,1,2,3,4,5,6,7
Results are written to result/{model_id}.json.
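If you want to re-aggregate scores yourself, something like the sketch below works. The field names ("answer", "prediction", "task_type") are assumptions about the output schema, so check the actual file and adapt them:

```python
# Post-hoc per-dimension accuracy from an evaluation output file (assumed schema).
import json
from collections import defaultdict

with open("result/gpt-4.1.json") as f:
    records = json.load(f)

correct = defaultdict(int)
total = defaultdict(int)
for r in records:
    total[r["task_type"]] += 1
    correct[r["task_type"]] += int(r["prediction"] == r["answer"])

for task, n in sorted(total.items()):
    print(f"{task:24s} {100.0 * correct[task] / n:5.1f}%  ({n} questions)")
print(f"{'Overall':24s} {100.0 * sum(correct.values()) / sum(total.values()):5.1f}%")
```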
Rank | Model | Overall ↑ | Dyn. | Interact | Logic | Persp. |
---|---|---|---|---|---|---|
🥇 | o3-2025-04-16 | 56.3 | 70.9 | 65.3 | 35.4 | 53.6 |
🥈 | Gemini-2.5-Pro-05-06 | 55.2 | 68.2 | 67.7 | 39.4 | 44.6 |
🥉 | Gemini-2.5-Flash-Thinking-05-20 | 53.2 | 69.3 | 64.0 | 35.5 | 45.7 |
… | Human (upper-bound) | 92.6 | 96.7 | 95.0 | 89.6 | 96.1 |
The full table is available on our homepage.
OmniSpatial’s 50 fine-grained tasks span dynamic motion prediction, geometric logic, real-world traffic and object-interaction analysis, map-level navigation planning, and egocentric, hypothetical, and allocentric perspective-taking over counting, size, direction, order, and distance. Together they form a single benchmark that comprehensively probes spatial reasoning, multimodal perception, and decision-making across both 2D and 3D scenes.
Dataset | Images / Clips | QA Pairs | #Tasks | License |
---|---|---|---|---|
OmniSpatial | 1,387 | 1,533 | 50 | CC BY-NC 4.0 |
- Sources: Web crawls (MIT-licensed or CC images), driving-test banks, HOI4D, MME
- Annotation format: JSON with fields:
{ "id": "0_0", "question": "How long will it take for the moving car closest to the camera that captured this image to reach it if it's going at 10 m/s?", "options": ["2.7s", "14.7s", "25.7s", "3.9s"], "answer": 0, "task_type": "Dynamic_Reasoning", "sub_task_type": "Motion_Analysis" }
- Multiple-choice; accuracy is averaged over 5 seeds to reduce randomness
- Evaluation Types (passed via --eval_type):
  - direct: Direct answer extraction
  - re: Regular-expression pattern matching (a minimal sketch of this extraction step follows this list)
  - json: JSON format parsing
  - llm: Using an LLM (GPT-4.1-mini) as judge
- Prompt Types (passed via --prompt_type):
  - none: No system prompt
  - zeroshot_cot: Zero-shot chain-of-thought
  - manual_cot: Manual chain-of-thought
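As referenced above, the re evaluation type pulls a choice letter out of the model's free-form response. A minimal version of that extraction step (not necessarily the repo's exact pattern) looks like this:

```python
# Toy answer extraction for multiple-choice responses: take the last
# standalone A-D letter the model commits to and map it to an option index.
import re

def extract_choice(response: str) -> int | None:
    matches = re.findall(r"\b([A-D])\b", response.upper())
    return ord(matches[-1]) - ord("A") if matches else None

print(extract_choice("Let's think step by step... the car needs 2.7s, so the answer is A."))  # 0
```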
Trick | What it does |
---|---|
PointGraph | Builds a scene graph from SAM masks, their centers, and bounding boxes |
Spatial CoT | Generates novel views via InstantMesh & Zero123++ |
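To make the PointGraph row concrete, here is a toy sketch (not the repo's implementation) of turning detected objects, e.g. SAM masks reduced to centers and boxes, into a textual scene description that can be prepended to the VLM prompt:

```python
# Toy "point graph": serialize object centers/boxes and simple pairwise relations as text.
from dataclasses import dataclass

@dataclass
class Obj:
    name: str
    box: tuple[float, float, float, float]  # x0, y0, x1, y1 in pixels

    @property
    def center(self) -> tuple[float, float]:
        x0, y0, x1, y1 = self.box
        return ((x0 + x1) / 2, (y0 + y1) / 2)

def scene_graph_prompt(objects: list[Obj]) -> str:
    lines = [f"{o.name}: center=({o.center[0]:.0f}, {o.center[1]:.0f}), box={o.box}" for o in objects]
    # pairwise left/right relations from center x-coordinates
    for a in objects:
        for b in objects:
            if a is not b and a.center[0] < b.center[0]:
                lines.append(f"{a.name} is left of {b.name}")
    return "Detected objects:\n" + "\n".join(lines)

print(scene_graph_prompt([Obj("car", (40, 120, 220, 260)), Obj("pedestrian", (300, 100, 360, 280))]))
```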
cd pointgraph
pip install -r requirements.txt
# Example: OpenAI API
python api_eval.py --model_id gpt-4.1 --eval_type re
# Example: local Qwen-VL-2.5-3B
python vlm_eval.py --model_id Qwen/Qwen2.5-VL-3B-Instruct --eval_type direct
# Example: parallel evaluation
python parallel_eval.py --model_id gpt-4.1 --group 8 --visible_nodes 0,1,2,3,4,5,6,7
cd spatialcot
pip install -r requirements.txt
# Example: OpenAI API
python api_eval.py configs/instant-mesh-large.yaml --model_id gpt-4.1
# Example: local Qwen-VL-2.5-3B
python vlm_eval.py configs/instant-mesh-large.yaml --model_id Qwen/Qwen2.5-VL-3B-Instruct
If you find OmniSpatial useful, please cite:
@article{omnispatial25,
title = {OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models},
author = {Mengdi Jia and Zekun Qi and Shaochen Zhang and Wenyao Zhang and Xinqiang Yu and Jiawei He and He Wang and Li Yi},
journal = {arXiv preprint arXiv:2506.03135},
year = {2025}
}
- Code — MIT License
- Data — CC BY-NC 4.0 (non-commercial research only)
Please check individual images for additional constraints.
This project builds upon open-source efforts including Segment Anything, InstantMesh, Qwen-VL, InternVL, and the many contributors of the open-source VLM community. Special thanks to our annotator team for their meticulous QA work.