When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios [arXiv]
Kele Shao*,1,2, Keda Tao*,1,2, Kejia Zhang3, Sicheng Feng2,4, Mu Cai5, Yuzhang Shang6, Haoxuan You7, Can Qin8, Yang Sui9, Huan Wang†,2
1Zhejiang University, 2Westlake University, 3Xiamen University, 4National University of Singapore, 5University of Wisconsin-Madison, 6University of Central Florida, 7Columbia University, 8Salesforce AI Research, 9Rice University
* Equal Contribution. † Corresponding Author (wanghuan@westlake.edu.cn).
For questions, suggestions, or collaboration opportunities, please feel free to reach out:
✉️ Email: shaokele@gmail.com or KD.TAO.CT@outlook.com
- [2025.07.29] The v1 survey is now published! We've also initialized the repository.
Motivation. Top: image, video, and audio data scale in their representation dimensions, leading to a corresponding increase in the number of tokens. Bottom: even top-performing MLLMs struggle to meet real-world demands, because the number of tokens for multimodal information, especially video, vastly exceeds that of text. Token compression is therefore crucial to address this limitation.
- Release a web page for easily finding relevant research papers.
- Release a download tool.
- Release an easy-to-use pull request tool.
Please consider citing our paper in your publications if our findings help your research.
@article{token_compression_survey,
title={When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios},
author={Shao, Kele and Tao, Keda and Zhang, Kejia and Feng, Sicheng and Cai, Mu and Shang, Yuzhang and You, Haoxuan and Qin, Can and Sui, Yang and Wang, Huan},
journal={arXiv preprint arXiv:2507.20198},
year={2025}
}
We welcome contributions to this survey! Please follow these guidelines:
- Fork the repository
- Create a feature branch
- Add relevant papers with proper formatting (see the example entry after the badge legend below)
- Submit a pull request with a clear description
Badge color legend:
- red for arXiv papers
- blue for conference/journal papers
- white for GitHub repositories
- purple for research areas
- green for categories
- yellow for training cost
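To make the "proper formatting" step concrete, here is a minimal sketch of what a new paper entry could look like, assuming the repository uses shields.io badges colored according to the legend above. The paper title, arXiv ID, and GitHub URL are placeholders, and the exact badge labels and layout are assumptions rather than the repository's required format.

```markdown
<!-- Hypothetical entry: title, IDs, and links are placeholders -->
- **An Example Paper on Visual Token Pruning**
  [![arXiv](https://img.shields.io/badge/arXiv-2507.00000-red)](https://arxiv.org/abs/2507.00000)
  [![GitHub](https://img.shields.io/badge/GitHub-Code-white)](https://github.com/example/example-repo)
  ![Area](https://img.shields.io/badge/Image-purple)
  ![Category](https://img.shields.io/badge/Token_Pruning-green)
  ![Training](https://img.shields.io/badge/Training--Free-yellow)
```

Keeping one paper per entry with explicit badges makes the list easy to scan and pull requests easy to review.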
Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective
EffiVLM-Bench: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Vision-Language Models
VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization
CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models
Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering
DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs
PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models
Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping
InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models
DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference
Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?
Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More
AdaFV: Rethinking of Visual-Language alignment for VLM acceleration
Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance
What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph
ST3: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
VisionZip: Longer is Better but Not Necessary in Vision Language Models
[CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction
ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
Efficient Multi-modal Large Language Models via Visual Token Grouping
Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration
FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression
AdaCM2: Adaptive Cross-Modality Memory Reduction
Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers
ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
NVLM: Open Frontier-Class Multimodal LLMs
Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs
Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models
HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models
TokenPacker: Efficient Visual Projector for Multimodal LLM
LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference
VoCo-LLaMA: Towards Vision Compression with Large Language Models
DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models
Matryoshka Multimodal Models
Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
LaCo: Layer-wise Compression for Efficient MLLMs
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices
Honeybee: Locality-enhanced Projector for Multimodal LLM
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
PuMer: Pruning and Merging Tokens for Efficient Vision Language Models
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs
DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding
METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding
FlexSelect: Flexible Token Selection for Efficient Long Video Understanding
HoliTom: Holistic Token Merging for Fast Video Large Language Models
AdaTP: Attention-Debiased Token Pruning for Video Large Language Models
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models
Seed1.5-VL Technical Report
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
QG-VTC: Question-Guided Visual Token Compression in MLLMs for Efficient VQA
InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression
FastVID: Dynamic Density Pruning for Fast Video Large Language Models
Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing
Token-Efficient Long Video Understanding for Multimodal LLMs
MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference
Qwen2.5-VL Technical Report
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
Zero-shot 3D Question Answering via Voxel-based Dynamic Token Compression
ToSA: Token Merging with Spatial Awareness
PruneVid: Visual Token Pruning for Efficient Video Large Language Models
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models
AdaCM2: Adaptive Cross-Modality Memory Reduction
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Video Instruction Tuning with Synthetic Data
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
LLaVA-OneVision: Easy Visual Task Transfer
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
LongVLM: Efficient Long Video Understanding via Large Language Models
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Qwen2.5-Omni Technical Report
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction
OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia
Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models
SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval
Large Language Models are Strong Audio-Visual Speech Recognition Learners
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Qwen2-Audio Technical Report
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model
SpeechVerse: A Large-scale Generalizable Audio Language Model
An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
SALMONN: Towards Generic Hearing Abilities for Large Language Models
Connecting Speech Encoder and Large Language Model for ASR
Prompting Large Language Models with Speech Recognition Abilities
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
This project is licensed under the MIT License - see the LICENSE file for details.
This repository is inspired by Awesome-Efficient-Reasoning-Models, Awesome-Efficient-LLM, and Awesome-Context-Engineering.
Thanks to these contributors for their excellent work!