InternSR is an open-source, PyTorch-based toolbox for studying the spatial reasoning capabilities of Large Vision-Language Models (LVLMs).
- High-quality Challenging Benchmarks
InternSR supports our latest challenging benchmarks, built with high-quality human annotations, for evaluating the spatial capabilities of LVLMs across different input modalities and scenarios.
- Easy to Use
The evaluation pipeline is built upon VLMEvalKit, inheriting its one-command convenience for running different models and benchmarks.
- Focus on Vision-based Embodied Spatial Intelligence
Currently, InternSR focuses on spatial reasoning from ego-centric raw visual observations. It therefore requires no 3D inputs and supports commonly used 2D LVLMs, while highlighting applications in embodied interaction. We plan to support more models and benchmarks along this line.
- [2025/07] - InternSR v0.1.0 released.
- 🏠 Introduction
- 🔥 News
- 📚 Getting Started
- 📦 Overview of Benchmark and Model Zoo
- 👥 Contribute
- 🔗 Citation
- 📄 License
- 👏 Acknowledgements
Please refer to the documentation for a quick start with InternSR, from installation to evaluating the supported models.
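As an illustration of the one-command workflow inherited from VLMEvalKit, the sketch below wraps a single evaluation run in a small Python helper. The entry script name (`run.py`), the `--model`/`--data` flags, and the identifier spellings (`QwenVL2.5-7B`, `OST-Bench`) are assumptions for illustration only; please check the documentation for the exact names.

```python
# Minimal sketch of the "one-command" evaluation workflow.
# NOTE: the script name, flags, and identifiers below are assumptions for
# illustration; consult the InternSR documentation for the exact interface.
import subprocess

def evaluate(model: str, benchmark: str) -> None:
    """Launch a single VLMEvalKit-style evaluation run as a subprocess."""
    subprocess.run(
        ["python", "run.py", "--model", model, "--data", benchmark],
        check=True,  # raise if the evaluation process exits with an error
    )

if __name__ == "__main__":
    # Example: evaluate an open-source LVLM on one supported benchmark.
    evaluate(model="QwenVL2.5-7B", benchmark="OST-Bench")
```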
Benchmark | Focus | Supported Models | Input Modality | Data Scale |
---|---|---|---|---|
MMScan | Spatial Understanding | InternVL, LLaVA, QwenVL, Proprietary Models, LLaVA-3D | ego-centric videos | 300k |
OST-Bench | Online Spatio-temporal Reasoning | InternVL, LLaVA, QwenVL, Proprietary Models | ego-centric videos | 10k |
MMSI-Bench | Multi-image Spatial Reasoning | InternVL, LLaVA, QwenVL, Proprietary Models | multi-view images | 1k |
EgoExo-Bench | Ego-Exo Cross-view Spatial Reasoning | InternVL, LLaVA, QwenVL, Proprietary Models | ego-exo cross-view videos | 7k |
Models | OST-Bench | MMSI-Bench | EgoExo-Bench | MMScan |
---|---|---|---|---|
GPT-4o | 51.19 | 30.3 | 38.5 | 43.97 |
GPT-4.1 | 50.96 | 30.9 | - | - |
Claude-3.7-sonnet | - | 30.2 | 32.8 | - |
QwenVL2.5-7B | 41.07 | 25.9 | 32.8 | 39.53 |
QwenVL2.5-32B | 47.33 | - | 39.7 | - |
QwenVL2.5-72B | - | 30.7 | 44.7 | - |
InternVL2.5-8B | 47.94 | 28.7 | - | 39.36 |
InternVL2.5-38B | - | - | - | 46.02 |
InternVL2.5-78B | 47.94 | 28.5 | - | - |
InternVL3-8B | - | 25.7 | 31.3 | 44.97 |
InternVL3-38B | - | - | - | - |
LLaVA-OneVision-7B | 34.92 | - | 29.5 | 39.36 |
LLaVA-OneVision-72B | 44.59 | 28.4 | - | - |
LLaVA-3D | - | - | - | 46.35* |
Note:
- Different `transformers` versions may cause output variations of up to ±3% in score for the same model.
- For more detailed results, please refer to the original repositories/papers of these works.
- \* denotes evaluating the model with pose and depth as additional inputs and 3D bounding boxes as prompts on MMScan.
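Because `transformers` version differences alone can shift scores by up to ±3%, it can help to log the exact library versions alongside each result. Below is a minimal, generic reproducibility sketch; it is not part of the InternSR API.

```python
# Record the library versions used for a run so that score differences caused
# by environment drift (e.g. a different `transformers` release) are traceable.
# Generic reproducibility snippet, not part of the InternSR API.
import torch
import transformers

def environment_info() -> dict:
    """Collect the package versions most likely to affect evaluation scores."""
    return {
        "torch": torch.__version__,
        "transformers": transformers.__version__,
        "cuda": torch.version.cuda,  # None on CPU-only builds
    }

if __name__ == "__main__":
    for name, version in environment_info().items():
        print(f"{name}: {version}")
```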
We appreciate all contributions to improve InternSR. Please refer to our contribution guide for detailed instructions. To add support for new models and benchmarks based on VLMEvalKit, you can also refer to the VLMEvalKit guidelines.
If you find our work helpful, please cite:
@misc{internsr2025,
title = {{InternSR: InternRobotics'} open-source toolbox for vision-based embodied spatial intelligence.},
author = {InternSR Contributors},
howpublished={\url{https://github.com/InternRobotics/InternSR}},
year = {2025}
}
If you use specific pretrained models or benchmarks, please also cite the original papers involved in our work. Related BibTeX entries for our papers are provided below.
Related Work BibTeX
@misc{mmsibench,
title = {{MMSI-Bench: A} Benchmark for Multi-Image Spatial Intelligence},
author = {Yang, Sihan and Xu, Runsen and Xie, Yiman and Yang, Sizhe and Li, Mo and Lin, Jingli and Zhu, Chenming and Chen, Xiaochen and Duan, Haodong and Yue, Xiangyu and Lin, Dahua and Wang, Tai and Pang, Jiangmiao},
year = {2025},
booktitle={arXiv},
}
@misc{ostbench,
title = {{OST-Bench: Evaluating} the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding},
author = {Wang, Liuyi and Xia, Xinyuan and Zhao, Hui and Wang, Hanqing and Wang, Tai and Chen, Yilun and Liu, Chengju and Chen, Qijun and Pang, Jiangmiao},
year = {2025},
booktitle={arXiv},
}
@misc{egoexobench,
title = {{EgoExoBench: A} Benchmark for First- and Third-person View Video Understanding in MLLMs},
author = {He, Yuping and Huang, Yifei and Chen, Guo and Pei, Baoqi and Xu, Jilan and Lu, Tong and Pang, Jiangmiao},
year = {2025},
booktitle={arXiv},
}
@inproceedings{mmscan,
title={{MMScan: A} Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations},
author={Lyu, Ruiyuan and Lin, Jingli and Wang, Tai and Yang, Shuai and Mao, Xiaohan and Chen, Yilun and Xu, Runsen and Huang, Haifeng and Zhu, Chenming and Lin, Dahua and Pang, Jiangmiao},
year={2024},
booktitle={Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track},
}
@inproceedings{embodiedscan,
title={{EmbodiedScan: A} Holistic Multi-Modal 3D Perception Suite Towards Embodied AI},
author={Wang, Tai and Mao, Xiaohan and Zhu, Chenming and Xu, Runsen and Lyu, Ruiyuan and Li, Peisen and Chen, Xiao and Zhang, Wenwei and Chen, Kai and Xue, Tianfan and Liu, Xihui and Lu, Cewu and Lin, Dahua and Pang, Jiangmiao},
year={2024},
booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
}
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
- VLMEvalKit: The evaluation code for OST-Bench, MMSI-Bench, and EgoExo-Bench is based on VLMEvalKit.