SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence

Ziyang Gong¹*, Wenhao Li²*, Oliver Ma³, Songyuan Li⁴, Jiayi Ji⁵, Xue Yang¹, Gen Luo³, Junchi Yan¹, Rongrong Ji²

¹ Shanghai Jiao Tong University, ² Xiamen University,
³ Shanghai AI Lab, ⁴ Sun Yat-sen University, ⁵ National University of Singapore

* Equal contribution

🧠 What is SpaCE-10?

SpaCE-10 is a compositional spatial intelligence benchmark for evaluating Multimodal Large Language Models (MLLMs) in indoor environments. Our contributions are as follows:

🧬 We define an Atomic Capability Pool, proposing 10 atomic spatial capabilities.
🔗 Based on the composition of different atomic capabilities, we design 8 compositional QA types.
📈 The SpaCE-10 benchmark contains 5,000+ QA pairs.
🏠 All QA pairs come from 811 indoor scenes (ScanNet++, ScanNet, 3RScan, ARKitScenes).
🌍 SpaCE-10 spans both 2D and 3D MLLM evaluations and can be seamlessly adapted to MLLMs that accept 3D scan input.

🔥🔥🔥 News

[2025/07/12] Adjusted some SpaCE-10 QAs and added the RemyxAI models' performance to the leaderboard.
[2025/06/11] Scans for 3D MLLMs and our manually collected 3D snapshots are coming soon.
[2025/06/10] Evaluation code is released; see the Environment and Evaluation sections below.
[2025/06/09] We have released the benchmark for 2D MLLMs on Hugging Face.
[2025/06/09] The SpaCE-10 paper is released on arXiv!

Performance Leaderboard - Single-Choice

🎉 LLaVA-OneVision-72B ranks first among all tested models.

🎉 GPT-4o achieves the best score among the tested closed-source models.

A large gap still exists between humans and models in compositional spatial intelligence.

Single-Choice vs. Double-Choice

Capability Score Ranking - Single-Choice

Environment

SpaCE-10 evaluation is built on lmms-eval, so the environment setup follows that of lmms-eval.

git clone https://github.com/Cuzyoung/SpaCE-10.git
cd SpaCE-10
uv venv dev --python=3.10
source dev/bin/activate
uv pip install -e .
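
After running the steps above, a quick sanity check can confirm that the environment is usable before launching an evaluation. The following is a minimal sketch, not part of the repository: the function name `check_space10_env` is hypothetical, and it assumes the editable install exposes a Python package importable as `lmms_eval`.

```shell
# Hypothetical helper (not shipped with SpaCE-10): checks that a virtual
# environment is active and that the assumed package name lmms_eval imports.
check_space10_env() {
  # An activated venv (e.g. via `source dev/bin/activate`) sets VIRTUAL_ENV.
  if [ -z "${VIRTUAL_ENV:-}" ]; then
    echo "no virtual environment active"
    return 1
  fi
  # Assumption: `uv pip install -e .` makes `lmms_eval` importable.
  if ! python -c "import lmms_eval" 2>/dev/null; then
    echo "lmms_eval not importable"
    return 1
  fi
  echo "environment looks ready"
}
```

Calling `check_space10_env` from the repository root prints one status line and returns a non-zero exit code if either check fails.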

Evaluation

Take InternVL2.5-8B as an example:

cd lmms-eval/run_bash
bash internvl2.5-8b.sh

Note that each new model requires its own environment to be installed before it can be tested.
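
One way to keep model environments separate is to create one virtual environment per run script. The sketch below assumes a naming convention (`env-<script name>`) and two helper functions, `env_name_for` and `setup_model_env`, that are illustrative only and do not exist in the repository. The model-specific dependencies are not listed in this README, so that step is left as a comment.

```shell
# Hypothetical convention: derive an environment name from a run script,
# e.g. internvl2.5-8b.sh -> env-internvl2.5-8b
env_name_for() {
  printf 'env-%s\n' "$(basename "$1" .sh)"
}

# Create and activate a dedicated environment for one model.
# Run from the repository root so that `uv pip install -e .` finds the project.
setup_model_env() {
  env_dir="$(env_name_for "$1")"
  uv venv "$env_dir" --python=3.10 || return 1
  . "$env_dir/bin/activate" || return 1
  uv pip install -e . || return 1
  # Install the model's own dependencies here (not specified in this README).
}
```

For example, `setup_model_env internvl2.5-8b.sh` would create and activate `env-internvl2.5-8b`, after which the run script can be launched as shown in the Evaluation section.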

Citation

@article{gong2025space10,
  title={SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence},
  author={Ziyang Gong and Wenhao Li and Oliver Ma and Songyuan Li and Jiayi Ji and Xue Yang and Gen Luo and Junchi Yan and Rongrong Ji},
  journal={arXiv preprint arXiv:2506.07966},
  year={2025}
}
