OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

OmniSpatial diagnoses the limits of today's vision-language models (VLMs) on higher-order spatial cognition.
It spans 50 fine-grained tasks grouped into four dimensions (dynamic reasoning, complex spatial logic, spatial interaction, and perspective taking), covering roughly 1.4K images and video clips and 1.5K question-answer pairs.

Mengdi Jia *, Zekun Qi *, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang and Li Yi.

Project Page · Paper (PDF) · Hugging Face

(Teaser figure)

🌟 Highlights

| Dimension | Example Skills | % of QA |
| --- | --- | --- |
| Dynamic Reasoning | motion prediction, temporal ordering, manipulation planning | 27% |
| Complex Spatial Logic | geometric transformations, pattern completion | 16% |
| Spatial Interaction | collision checking, path planning, traffic analysis | 20% |
| Perspective Taking | egocentric ↔ allocentric transforms, hypothetical views | 37% |
  • Real-world diversity: Internet images, driving-test frames, HOI4D videos, IQ tests
  • Manual QA: multi-round human annotation, no templates
  • Challenging: SOTA VLMs top out at 56.3% accuracy vs. 92.6% for humans
  • Plug-and-play toolkit: unified evaluation scripts for open-source, closed-source, and reasoning models
  • Research improvements: PointGraph (scene-graph reasoning) and Spatial CoT (novel-view chain-of-thought)

🚀 Quick Start

Environment
# create conda env
conda create -n omnispatial python=3.12 -y
conda activate omnispatial

# clone repo
git clone https://github.com/qizekun/OmniSpatial.git
cd OmniSpatial
Install dependencies

Open-source VLMs

pip install torch==2.5.1 torchvision==0.20.1 transformers==4.49.0 qwen-vl-utils[decord]==0.0.8 triton accelerate timm ninja
MAX_JOBS=64 pip install -v flash-attn --no-build-isolation

Closed-source (API) VLMs

pip install openai==1.81.0
export OPENAI_API_KEY="sk-..."
# optional: export OPENAI_API_BASE="https://api.openai.com/v1"
Download dataset
# export HF_ENDPOINT="https://hf-mirror.com"
mkdir -p dataset
huggingface-cli download --resume-download qizekun/OmniSpatial --local-dir dataset --repo-type dataset
find dataset/ -name '*.zip' -exec unzip -o {} -d dataset/ \;
rm -f dataset/*.zip && rm -rf dataset/__MACOSX

The dataset is downloaded to dataset/ with the following structure:

dataset/
├── Complex_Logic/
├── Dynamic_Reasoning/
├── Perspective_Taking/
├── Spatial_Interaction/
└── data.json
Run evaluation
# Example: GPT-4.1 via OpenAI API
python api_eval.py --model_id gpt-4.1 --prompt_type manual_cot --eval_type re

# Example: local Qwen2.5-VL-3B-Instruct
cd vlm_eval
python qwenvl_eval.py --model_id Qwen/Qwen2.5-VL-3B-Instruct --prompt_type manual_cot --eval_type re

# Example: parallel evaluation
cd vlm_eval
python parallel_eval.py --model qwenvl --model_id Qwen/Qwen2.5-VL-3B-Instruct --group 8 --visible_nodes 0,1,2,3,4,5,6,7

Results are written to result/{model_id}.json.
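
The per-question schema of result/{model_id}.json depends on the chosen --eval_type, so the scoring sketch below is only an illustration: it assumes a hypothetical dict mapping question ids to predicted option indices (not the toolkit's actual output format) and checks it against the answer field in dataset/data.json, grouping accuracy by dimension.

import json
from collections import defaultdict

# Hedged sketch: `predictions` maps question id -> predicted option index.
# This is a hypothetical format; the real result/{model_id}.json may differ.
def score(predictions: dict[str, int], data_path: str = "dataset/data.json") -> None:
    with open(data_path) as f:
        samples = json.load(f)          # assumed: a list of records as in Dataset Details
    correct, total = defaultdict(int), defaultdict(int)
    for sample in samples:
        qid, dim = sample["id"], sample["task_type"]
        if qid not in predictions:
            continue
        total[dim] += 1
        correct[dim] += int(predictions[qid] == sample["answer"])
    for dim in sorted(total):
        print(f"{dim}: {correct[dim] / total[dim]:.3f} over {total[dim]} QA")
    print(f"Overall: {sum(correct.values()) / sum(total.values()):.3f}")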


📊 Leaderboard

| Rank | Model | Overall ↑ | Dyn. | Interact. | Logic | Persp. |
| --- | --- | --- | --- | --- | --- | --- |
| 🥇 | o3-2025-04-16 | 56.3 | 70.9 | 65.3 | 35.4 | 53.6 |
| 🥈 | Gemini-2.5-Pro-05-06 | 55.2 | 68.2 | 67.7 | 39.4 | 44.6 |
| 🥉 | Gemini-2.5-Flash-Thinking-05-20 | 53.2 | 69.3 | 64.0 | 35.5 | 45.7 |
| – | Human (upper bound) | 92.6 | 96.7 | 95.0 | 89.6 | 96.1 |

The full table is available on our project page.


🗃️ Dataset Details

OmniSpatial’s 50 fine-grained tasks span dynamic motion prediction, geometric logic, real-world traffic and object-interaction analysis, map-level navigation planning, and egocentric, hypothetical, and allocentric perspective taking over counting, size, direction, order, and distance. Together they form a single benchmark that comprehensively probes spatial reasoning, multimodal perception, and decision-making across both 2D and 3D scenes.

(Task taxonomy figure)

| Benchmark | Images / Clips | QA Pairs | # Tasks | License |
| --- | --- | --- | --- | --- |
| OmniSpatial | 1,387 | 1,533 | 50 | CC BY-NC 4.0 |
  • Sources: Web crawls (MIT-licensed or CC images), driving-test banks, HOI4D, MME
  • Annotation format: JSON with the following fields (a minimal loading sketch follows the example):
    {
      "id": "0_0",
      "question": "How long will it take for the moving car closest to the camera that captured this image to reach it if it's going at 10 m/s?",
      "options": ["2.7s", "14.7s", "25.7s", "3.9s"],
      "answer": 0,
      "task_type": "Dynamic_Reasoning",
      "sub_task_type": "Motion_Analysis"
    }
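
As a minimal loading sketch: the snippet below assumes dataset/data.json is a JSON list of records with exactly the fields shown above; if the toolkit stores them differently, adjust accordingly.

import json
from collections import Counter

# Load the annotations and summarize the task taxonomy.
with open("dataset/data.json") as f:
    samples = json.load(f)                                 # assumed: a list of records

print(f"{len(samples)} QA pairs")
print(Counter(s["task_type"] for s in samples))            # the 4 dimensions
print(Counter(s["sub_task_type"] for s in samples))        # fine-grained tasks

sample = samples[0]
ground_truth = sample["options"][sample["answer"]]         # "answer" indexes into "options"
print(sample["question"], "->", ground_truth)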

🔍 Evaluation Protocol

  • Multiple-choice; accuracy is averaged over 5 seeds to reduce randomness
  • Evaluation Types:
    • direct: Direct answer extraction
    • re: Regular-expression pattern matching (see the sketch after this list)
    • json: JSON format parsing
    • llm: LLM-as-judge scoring with GPT-4.1-mini
  • Prompt Types:
    • none: No system prompt
    • zeroshot_cot: Zero-shot chain-of-thought
    • manual_cot: Manual chain-of-thought
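
To illustrate the re evaluation type, here is a hedged sketch of regex-based answer extraction; extract_choice and its patterns are illustrative and may differ from what the evaluation scripts actually use.

import re

# Sketch of regex-based option extraction for --eval_type re.
# Illustrative patterns only; the repository's scripts may use different ones.
def extract_choice(response: str, num_options: int = 4) -> int | None:
    letters = "ABCD"[:num_options]
    patterns = [
        rf"[Aa]nswer\s*(?:is|:)?\s*\(?([{letters}])\)?",   # "Answer: B", "the answer is (B)"
        rf"\(([{letters}])\)",                             # any standalone "(B)"
        rf"\b([{letters}])\b\s*[).]?\s*$",                 # a trailing bare letter
    ]
    for pattern in patterns:
        matches = re.findall(pattern, response.strip())
        if matches:
            return letters.index(matches[-1])              # keep the last match
    return None

print(extract_choice("Counting the lanes, the answer is (B)."))   # -> 1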

🛠️ Extensions

| Trick | What it does |
| --- | --- |
| PointGraph | Builds a scene graph from SAM masks, their centers, and bounding boxes |
| Spatial CoT | Generates novel views via InstantMesh & Zero123++ |

PointGraph

cd pointgraph
pip install -r requirements.txt

# Example: openai api
python api_eval.py --model_id gpt-4.1 --eval_type re

# Example: local Qwen2.5-VL-3B-Instruct
python vlm_eval.py --model_id Qwen/Qwen2.5-VL-3B-Instruct --eval_type direct

# Example: parallel evaluation
python parallel_eval.py --model_id gpt-4.1 --group 8 --visible_nodes 0,1,2,3,4,5,6,7
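
The actual PointGraph pipeline lives in pointgraph/; purely as an illustration of the scene-graph idea (not the repository's implementation), the toy sketch below turns hypothetical detected objects with bounding boxes into a textual list of pairwise spatial relations that could be appended to a VLM prompt.

# Toy illustration only, not the PointGraph implementation.
def box_center(box):                                        # box = (x1, y1, x2, y2)
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)

def scene_graph_text(objects: dict[str, tuple[float, float, float, float]]) -> str:
    names, lines = list(objects), []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            (ax, ay), (bx, by) = box_center(objects[a]), box_center(objects[b])
            horizontal = "left of" if ax < bx else "right of"
            vertical = "above" if ay < by else "below"      # image y grows downward
            lines.append(f"{a} is {horizontal} and {vertical} {b}")
    return "; ".join(lines)

print(scene_graph_text({"car": (10, 40, 60, 90), "person": (70, 20, 90, 80)}))
# -> car is left of and below person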

Spatial CoT

cd spatialcot
pip install -r requirements.txt

# Example: openai api
python api_eval.py configs/instant-mesh-large.yaml --model_id gpt-4.1

# Example: local Qwen2.5-VL-3B-Instruct
python vlm_eval.py configs/instant-mesh-large.yaml --model_id Qwen/Qwen2.5-VL-3B-Instruct

📜 Citation

If you find OmniSpatial useful, please cite:

@article{omnispatial25,
  title   = {OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models},
  author  = {Mengdi Jia and Zekun Qi and Shaochen Zhang and Wenyao Zhang and Xinqiang Yu and Jiawei He and He Wang and Li Yi},
  journal = {arXiv preprint arXiv:2506.03135},
  year    = {2025}
}

📄 License

  • Code — MIT License
  • Data — CC BY-NC 4.0 (non-commercial research only)
    Please check individual images for additional constraints.

🙏 Acknowledgements

This project builds upon open-source efforts including Segment Anything, InstantMesh, Qwen-VL, InternVL, and the many contributors of the open-source VLM community. Special thanks to our annotator team for their meticulous QA work.
