InternSR is an open-source, PyTorch-based toolbox for studying the spatial reasoning capabilities of Large Vision-Language Models (LVLMs).
- High-quality Challenging Benchmarks
InternSR supports our latest challenging benchmarks, built with high-quality human annotations, for evaluating the spatial capabilities of LVLMs across different input modalities and scenarios.
- Easy to Use
The evaluation pipeline is built upon VLMEvalKit, inheriting its one-command convenience for running different models and benchmarks.
- Focus on Vision-based Embodied Spatial Intelligence
Currently, InternSR focuses on spatial reasoning from ego-centric raw visual observations. It therefore requires no 3D inputs and supports commonly used 2D LVLMs, while highlighting applications in embodied interaction. We plan to support more models and benchmarks along this line.
- [2025/07] - InternSR v0.1.0 released.
- 🏠 Introduction
- 🔥 News
- 📚 Getting Started
- 📦 Overview of Benchmark and Model Zoo
- 👥 Contribute
- 🔗 Citation
- 📄 License
- 👏 Acknowledgements
Please refer to the documentation for a quick start with InternSR, from installation to evaluating the supported models.
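As an illustration of the one-command workflow inherited from VLMEvalKit, the sketch below wraps a single evaluation run in a small Python helper. The entry script name (`run.py`), the `--model`/`--data` flags, and the identifier spellings (`QwenVL2.5-7B`, `OST-Bench`) are assumptions for illustration only; please check the documentation for the exact names.

```python
# Minimal sketch of the "one-command" evaluation workflow.
# NOTE: the script name, flags, and identifiers below are assumptions for
# illustration; consult the InternSR documentation for the exact interface.
import subprocess

def evaluate(model: str, benchmark: str) -> None:
    """Launch a single VLMEvalKit-style evaluation run as a subprocess."""
    subprocess.run(
        ["python", "run.py", "--model", model, "--data", benchmark],
        check=True,  # raise if the evaluation process exits with an error
    )

if __name__ == "__main__":
    # Example: evaluate an open-source LVLM on one supported benchmark.
    evaluate(model="QwenVL2.5-7B", benchmark="OST-Bench")
```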
Benchmark | Focus | Supported Models | Input Modality | Data Scale |
---|---|---|---|---|
MMScan | Spatial Understanding | InternVL, LLaVA, QwenVL, Proprietary Models, LLaVA-3D | ego-centric videos | 300k |
OST-Bench | Online Spatio-temporal Reasoning | InternVL, LLaVA, QwenVL, Proprietary Models | ego-centric videos | 10k |
MMSI-Bench | Multi-image Spatial Reasoning | InternVL, LLaVA, QwenVL, Proprietary Models | multi-view images | 1k |
EgoExo-Bench | Ego-Exo Cross-view Spatial Reasoning | InternVL, LLaVA, QwenVL, Proprietary Models | ego-exo cross-view videos | 7k |
Models | OST-Bench | MMSI-Bench | EgoExo-Bench | MMScan |
---|---|---|---|---|
GPT-4o | 51.19 | 30.3 | 38.5 | 43.97 |
GPT-4.1 | 50.96 | 30.9 | - | - |
Claude-3.7-sonnet | - | 30.2 | 32.8 | - |
QwenVL2.5-7B | 41.07 | 25.9 | 32.8 | 39.53 |
QwenVL2.5-32B | 47.33 | - | 39.7 | - |
QwenVL2.5-72B | - | 30.7 | 44.7 | - |
InternVL2.5-8B | 47.94 | 28.7 | - | 39.36 |
InternVL2.5-38B | - | - | - | 46.02 |
InternVL2.5-78B | 47.94 | 28.5 | - | - |
InternVL3-8B | - | 25.7 | 31.3 | 44.97 |
InternVL3-38B | - | - | - | - |
LLaVA-OneVision-7B | 34.92 | - | 29.5 | 39.36 |
LLaVA-OneVision-72B | 44.59 | 28.4 | - | - |
LLaVA-3D | - | - | - | 46.35* |
Note:
- Different `transformers` versions may cause output variations of up to ±3% in score for the same model.
- For more detailed results, please refer to the original repositories/papers of these works.
- \* denotes evaluating the model with pose and depth as additional inputs and 3D bounding boxes as prompts on MMScan.
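Because `transformers` version differences alone can shift scores by up to ±3%, it can help to log the exact library versions alongside each result. Below is a minimal, generic reproducibility sketch; it is not part of the InternSR API.

```python
# Record the library versions used for a run so that score differences caused
# by environment drift (e.g. a different `transformers` release) are traceable.
# Generic reproducibility snippet, not part of the InternSR API.
import torch
import transformers

def environment_info() -> dict:
    """Collect the package versions most likely to affect evaluation scores."""
    return {
        "torch": torch.__version__,
        "transformers": transformers.__version__,
        "cuda": torch.version.cuda,  # None on CPU-only builds
    }

if __name__ == "__main__":
    for name, version in environment_info().items():
        print(f"{name}: {version}")
```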
We appreciate all contributions to improve InternSR. Please refer to our contribution guide for detailed instructions. To add support for new models and benchmarks based on VLMEvalKit, you can also refer to the VLMEvalKit guidelines.
If you find our work helpful, please cite:
@misc{internsr2025,
title = {{InternSR: InternRobotics'} open-source toolbox for vision-based embodied spatial intelligence.},
author = {InternSR Contributors},
howpublished={\url{https://github.com/InternRobotics/InternSR}},
year = {2025}
}
If you use specific pretrained models or benchmarks, please also cite the original papers involved in our work. Related BibTeX entries for our papers are provided below.
Related Work BibTeX
@misc{mmsibench,
title = {{MMSI-Bench: A} Benchmark for Multi-Image Spatial Intelligence},
author = {Yang, Sihan and Xu, Runsen and Xie, Yiman and Yang, Sizhe and Li, Mo and Lin, Jingli and Zhu, Chenming and Chen, Xiaochen and Duan, Haodong and Yue, Xiangyu and Lin, Dahua and Wang, Tai and Pang, Jiangmiao},
year = {2025},
booktitle={arXiv},
}
@misc{ostbench,
title = {{OST-Bench: Evaluating} the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding},
author = {Wang, Liuyi and Xia, Xinyuan and Zhao, Hui and Wang, Hanqing and Wang, Tai and Chen, Yilun and Liu, Chengju and Chen, Qijun and Pang, Jiangmiao},
year = {2025},
booktitle={arXiv},
}
@misc{egoexobench,
title = {{EgoExoBench: A} Benchmark for First- and Third-person View Video Understanding in MLLMs},
author = {He, Yuping and Huang, Yifei and Chen, Guo and Pei, Baoqi and Xu, Jilan and Lu, Tong and Pang, Jiangmiao},
year = {2025},
booktitle={arXiv},
}
@inproceedings{mmscan,
title={{MMScan: A} Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations},
author={Lyu, Ruiyuan and Lin, Jingli and Wang, Tai and Yang, Shuai and Mao, Xiaohan and Chen, Yilun and Xu, Runsen and Huang, Haifeng and Zhu, Chenming and Lin, Dahua and Pang, Jiangmiao},
year={2024},
booktitle={Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track},
}
@inproceedings{embodiedscan,
title={{EmbodiedScan: A} Holistic Multi-Modal 3D Perception Suite Towards Embodied AI},
author={Wang, Tai and Mao, Xiaohan and Zhu, Chenming and Xu, Runsen and Lyu, Ruiyuan and Li, Peisen and Chen, Xiao and Zhang, Wenwei and Chen, Kai and Xue, Tianfan and Liu, Xihui and Lu, Cewu and Lin, Dahua and Pang, Jiangmiao},
year={2024},
booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
}
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
- VLMEvalKit: The evaluation code for OST-Bench, MMSI-Bench, and EgoExo-Bench is based on VLMEvalKit.