InternRobotics/InternSR

🏠 Introduction

InternSR is an open-source, PyTorch-based toolbox for studying the spatial reasoning capabilities of large vision-language models (LVLMs).

Highlights

  • High-quality Challenging Benchmarks

InternSR supports our latest challenging benchmarks, built with high-quality human annotations, for evaluating the spatial capabilities of LVLMs across different input modalities and scenarios.

  • Easy to Use

The evaluation pipeline is built upon VLMEvalKit and inherits its one-command convenience for switching between models and benchmarks.

  • Focus on Vision-based Embodied Spatial Intelligence

Currently, InternSR focuses on spatial reasoning from ego-centric raw visual observations. It therefore requires no 3D inputs and works with commonly used 2D LVLMs, while highlighting applications in embodied interaction. We plan to support more models and benchmarks along this line.

🔥 News

  • [2025/07] - InternSR v0.1.0 released.

📋 Table of Contents

  • 🏠 Introduction
  • 🔥 News
  • 📚 Getting Started
  • 📦 Overview of Benchmark and Model Zoo
  • 🏆 Leaderboard
  • 👥 Contribute
  • 🔗 Citation
  • 📄 License
  • 👏 Acknowledgement

📚 Getting Started

Please refer to the documentation for a quick start with InternSR, from installation to evaluating the supported models.
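
As a rough illustration, evaluation follows the VLMEvalKit pattern of pairing a model key with a benchmark or input. The snippet below is a minimal sketch assuming the VLMEvalKit Python API (supported_VLM and generate); the model key and image path are hypothetical placeholders, and the exact entry points and supported keys are listed in the documentation.

# Minimal sketch of the VLMEvalKit-style API that InternSR builds on.
# The model key and image path are illustrative assumptions; consult the
# InternSR docs for the exact keys and evaluation entry points.
from vlmeval.config import supported_VLM

model = supported_VLM['GPT4o']()  # hypothetical model key
response = model.generate([
    'frames/ego_view_000.jpg',  # hypothetical ego-centric frame
    'Which object is closer to the camera, the chair or the table?',
])
print(response)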

📦 Overview of Benchmark and Model Zoo

| Benchmark | Focus | Method | Input Modality | Data Scale |
|---|---|---|---|---|
| MMScan | Spatial Understanding | InternVL, LLaVA, QwenVL, Proprietary Models, LLaVA-3D | ego-centric videos | 300k |
| OST-Bench | Online Spatio-temporal Reasoning | InternVL, LLaVA, QwenVL, Proprietary Models | ego-centric videos | 10k |
| MMSI-Bench | Multi-image Spatial Reasoning | InternVL, LLaVA, QwenVL, Proprietary Models | multi-view images | 1k |
| EgoExo-Bench | Ego-Exo Cross-view Spatial Reasoning | InternVL, LLaVA, QwenVL, Proprietary Models | ego-exo cross-view videos | 7k |

🏆 Leaderboard

| Models | OST-Bench | MMSI-Bench | EgoExo-Bench | MMScan |
|---|---|---|---|---|
| GPT-4o | 51.19 | 30.3 | 38.5 | 43.97 |
| GPT-4.1 | 50.96 | 30.9 | - | - |
| Claude-3.7-sonnet | - | 30.2 | 32.8 | - |
| QwenVL2.5-7B | 41.07 | 25.9 | 32.8 | 39.53 |
| QwenVL2.5-32B | 47.33 | - | 39.7 | - |
| QwenVL2.5-72B | - | 30.7 | 44.7 | - |
| InternVL2.5-8B | 47.94 | 28.7 | - | 39.36 |
| InternVL2.5-38B | - | - | - | 46.02 |
| InternVL2.5-78B | 47.94 | 28.5 | - | - |
| InternVL3-8B | - | 25.7 | 31.3 | 44.97 |
| InternVL3-38B | - | - | - | - |
| LLaVA-OneVision-7B | 34.92 | - | 29.5 | 39.36 |
| LLaVA-OneVision-72B | 44.59 | 28.4 | - | - |
| LLaVA-3D | - | - | - | 46.35* |

Note :

  • Different versions of the transformers library may cause score variations of up to ±3% for the same model.
  • For more detailed results, please refer to the original repositories/papers of these works.
  • * denotes evaluation on MMScan with pose and depth as additional inputs and 3D bounding boxes as prompts.

👥 Contribute

We appreciate all contributions to improve InternSR. Please refer to our contribution guide for detailed instructions. To add support for new models and benchmarks based on VLMEvalKit, users can also refer to the guidelines from VLMEvalKit.

🔗 Citation

If you find our work helpful, please cite:

@misc{internsr2025,
    title = {{InternSR: InternRobotics'} open-source toolbox for vision-based embodied spatial intelligence.},
    author = {InternSR Contributors},
    howpublished={\url{https://github.com/InternRobotics/InternSR}},
    year = {2025}
}

If you use specific pretrained models or benchmarks, please also cite the original papers involved in our work. Related BibTeX entries for our papers are provided below.

Related Work BibTeX
@misc{mmsibench,
    title = {{MMSI-Bench: A} Benchmark for Multi-Image Spatial Intelligence},
    author = {Yang, Sihan and Xu, Runsen and Xie, Yiman and Yang, Sizhe and Li, Mo and Lin, Jingli and Zhu, Chenming and Chen, Xiaochen and Duan, Haodong and Yue, Xiangyu and Lin, Dahua and Wang, Tai and Pang, Jiangmiao},
    year = {2025},
    booktitle={arXiv},
}
@misc{ostbench,
    title = {{OST-Bench: Evaluating} the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding},
    author = {Wang, Liuyi and Xia, Xinyuan and Zhao, Hui and Wang, Hanqing and Wang, Tai and Chen, Yilun and Liu, Chengju and Chen, Qijun and Pang, Jiangmiao},
    year = {2025},
    booktitle={arXiv},
}
@misc{egoexobench,
    title = {{EgoExoBench: A} Benchmark for First- and Third-person View Video Understanding in MLLMs},
    author = {He, Yuping and Huang, Yifei and Chen, Guo and Pei, Baoqi and Xu, Jilan and Lu, Tong and Pang, Jiangmiao},
    year = {2025},
    booktitle={arXiv},
}
@inproceedings{mmscan,
    title={{MMScan: A} Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations},
    author={Lyu, Ruiyuan and Lin, Jingli and Wang, Tai and Yang, Shuai and Mao, Xiaohan and Chen, Yilun and Xu, Runsen and Huang, Haifeng and Zhu, Chenming and Lin, Dahua and Pang, Jiangmiao},
    year={2024},
    booktitle={Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track},
}
@inproceedings{embodiedscan,
    title={{EmbodiedScan: A} Holistic Multi-Modal 3D Perception Suite Towards Embodied AI},
    author={Wang, Tai and Mao, Xiaohan and Zhu, Chenming and Xu, Runsen and Lyu, Ruiyuan and Li, Peisen and Chen, Xiao and Zhang, Wenwei and Chen, Kai and Xue, Tianfan and Liu, Xihui and Lu, Cewu and Lin, Dahua and Pang, Jiangmiao},
    year={2024},
    booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
}

📄 License

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

👏 Acknowledgement

  • VLMEvalKit: The evaluation code for OST-Bench, MMSI-Bench, and EgoExo-Bench is based on VLMEvalKit.
