When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios [arXiv]
Kele Shao*,1,2, Keda Tao*,1,2, Kejia Zhang3, Sicheng Feng2,4, Mu Cai5, Yuzhang Shang6, Haoxuan You7, Can Qin8, Yang Sui9, Huan Wang†,2
1Zhejiang University, 2Westlake University, 3Xiamen University, 4National University of Singapore, 5University of Wisconsin-Madison, 6University of Central Florida, 7Columbia University, 8Salesforce AI Research, 9Rice University
* Equal Contribution. † Corresponding Author (wanghuan@westlake.edu.cn).
For questions, suggestions, or collaboration opportunities, please feel free to reach out:
✉️ Email: shaokele@gmail.com or KD.TAO.CT@outlook.com
- [2025.07.29] The v1 survey is now published! We've also initialized the repository.
Motivation. Top: image, video, and audio data scale in their representation dimensions, leading to a corresponding increase in the number of tokens. Bottom: even top-performing MLLMs struggle to meet real-world demands, because the number of tokens for multimodal information, especially video, vastly exceeds that of text. Token compression is therefore crucial to address this limitation.
- Release a web page for easily finding relevant research papers.
- Release a download tool.
- Release an easy-to-use pull request tool.
Please consider citing our paper in your publications if our findings help your research.
@article{token_compression_survey,
title={When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios},
author={Shao, Kele and Tao, Keda and Zhang, Kejia and Feng, Sicheng and Cai, Mu and Shang, Yuzhang and You, Haoxuan and Qin, Can and Sui, Yang and Wang, Huan},
journal={arXiv preprint arXiv:2507.20198},
year={2025}
}
We welcome contributions to this survey! Please follow these guidelines:
- Fork the repository
- Create a feature branch
- Add relevant papers with proper formatting (see the example entry after the badge legend below)
- Submit a pull request with a clear description
Badge color legend:
- red for arXiv papers
- blue for conference/journal papers
- white for GitHub repositories
- purple for research areas
- green for categories
- yellow for training cost
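To make the "proper formatting" step concrete, here is a minimal sketch of what a new paper entry could look like, assuming the repository uses shields.io badges colored according to the legend above. The paper title, arXiv ID, and GitHub URL are placeholders, and the exact badge labels and layout are assumptions rather than the repository's required format.

```markdown
<!-- Hypothetical entry: title, IDs, and links are placeholders -->
- **An Example Paper on Visual Token Pruning**
  [![arXiv](https://img.shields.io/badge/arXiv-2507.00000-red)](https://arxiv.org/abs/2507.00000)
  [![GitHub](https://img.shields.io/badge/GitHub-Code-white)](https://github.com/example/example-repo)
  ![Area](https://img.shields.io/badge/Image-purple)
  ![Category](https://img.shields.io/badge/Token_Pruning-green)
  ![Training](https://img.shields.io/badge/Training--Free-yellow)
```

Keeping one paper per entry with explicit badges makes the list easy to scan and pull requests easy to review.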
Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective
EffiVLM-Bench: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Vision-Language Models
VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization
CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models
Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering
DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs
PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models
Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping
InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models
DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference
Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?
Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More
AdaFV: Rethinking of Visual-Language alignment for VLM acceleration
Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance
What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph
ST3: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
VisionZip: Longer is Better but Not Necessary in Vision Language Models
[CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction
ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
Efficient Multi-modal Large Language Models via Visual Token Grouping
Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration
FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression
AdaCM2: Adaptive Cross-Modality Memory Reduction
Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers
ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
NVLM: Open Frontier-Class Multimodal LLMs
Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs
Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models
HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models
TokenPacker: Efficient Visual Projector for Multimodal LLM
LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference
VoCo-LLaMA: Towards Vision Compression with Large Language Models
DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models
Matryoshka Multimodal Models
Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
LaCo: Layer-wise Compression for Efficient MLLMs
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices
Honeybee: Locality-enhanced Projector for Multimodal LLM
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
PuMer: Pruning and Merging Tokens for Efficient Vision Language Models
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs
DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding
METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding
FlexSelect: Flexible Token Selection for Efficient Long Video Understanding
HoliTom: Holistic Token Merging for Fast Video Large Language Models
AdaTP: Attention-Debiased Token Pruning for Video Large Language Models
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models
Seed1.5-VL Technical Report
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
QG-VTC: Question-Guided Visual Token Compression in MLLMs for Efficient VQA
InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression
FastVID: Dynamic Density Pruning for Fast Video Large Language Models
Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing
Token-Efficient Long Video Understanding for Multimodal LLMs
MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference
Qwen2.5-VL Technical Report
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
Zero-shot 3D Question Answering via Voxel-based Dynamic Token Compression
ToSA: Token Merging with Spatial Awareness
PruneVid: Visual Token Pruning for Efficient Video Large Language Models
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models
AdaCM2: Adaptive Cross-Modality Memory Reduction
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Video Instruction Tuning with Synthetic Data
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
LLaVA-OneVision: Easy Visual Task Transfer
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
LongVLM: Efficient Long Video Understanding via Large Language Models
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Qwen2.5-Omni Technical Report
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction
OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia
Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models
SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval
Large Language Models are Strong Audio-Visual Speech Recognition Learners
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Qwen2-Audio Technical Report
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model
SpeechVerse: A Large-scale Generalizable Audio Language Model
An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
SALMONN: Towards Generic Hearing Abilities for Large Language Models
Connecting Speech Encoder and Large Language Model for ASR
Prompting Large Language Models with Speech Recognition Abilities
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
This project is licensed under the MIT License - see the LICENSE file for details.
This repository is inspired by Awesome-Efficient-Reasoning-Models, Awesome-Efficient-LLM, and Awesome-Context-Engineering.
Thanks to these contributors for their excellent work!