A curated, up-to-date list of awesome papers on efficient LLM inference research.
In the AGI era, efficient inference for LLMs is critical to unlocking scalable and accessible applications. While LLMs deliver powerful capabilities, their substantial compute and memory requirements pose significant deployment challenges, particularly in resource-constrained environments. Research on optimization techniques such as model pruning, quantization, and knowledge distillation yields streamlined LLMs that retain high performance while demanding far fewer resources. These advances broaden the range of practical applications and improve accessibility, enabling LLMs to be used across diverse platforms and use cases.
If you find interesting work or projects, please reach out via GitHub issues or by email: withhaotian [at] gmail [dot] com.
This list focuses only on efficient inference for LLMs. If you are interested in edge AI computing and systems, please refer to awesome-edge-AI-papers.
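To make one of these techniques concrete, below is a minimal sketch of symmetric per-channel int8 weight quantization in PyTorch. It is a toy illustration rather than the method of any specific paper listed here, and the tensor shapes and function names are assumptions.

```python
import torch

def quantize_per_channel_int8(weight: torch.Tensor):
    """Symmetric per-output-channel int8 quantization of a 2-D weight matrix.

    Returns the int8 weights plus the per-channel scales needed to dequantize.
    """
    # One scale per output channel (row), chosen so the max magnitude maps to 127.
    max_abs = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    scale = max_abs / 127.0
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

if __name__ == "__main__":
    w = torch.randn(4096, 4096)              # a hypothetical linear-layer weight
    q, s = quantize_per_channel_int8(w)
    err = (dequantize(q, s) - w).abs().mean()
    print(f"int8 storage, mean abs error: {err:.5f}")
```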
This project is licensed under the GPL-3.0 license - see the LICENSE file for details.
- [arXiv'24] On-Device Language Models: A Comprehensive Review - [PDF] [Code]
- [arXiv'24] A Survey of Small Language Models - [PDF]
- [arXiv'24] Small Language Models: Survey, Measurements, and Insights - [PDF] [Code] [Demo]
- [arXiv'24] A Survey of Resource-efficient LLM and Multimodal Foundation Models - [PDF] [Code]
- [arXiv'24] A Survey on Model Compression for Large Language Models - [PDF]
- [arXiv'24] OpenELM: An Efficient Language Model Family with Open Training and Inference Framework - [PDF] [Code] [HuggingFace]
- [arXiv'24] Fox-1 Technical Report - [PDF] [HuggingFace]
- [arXiv'24] TinyLlama: An Open-Source Small Language Model - [PDF] [Code]
- [arXiv'24] MobileVLM V2: Faster and Stronger Baseline for Vision Language Model - [PDF] [Code]
- [arXiv'24] The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits - [PDF]
- [arXiv'24] Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone - [PDF]
- [arXiv'24] MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT - [PDF] [Code]
- [arXiv'24] BLADE: Enhancing Black-box Large Language Models with Small Domain-Specific Models - [PDF]
- [arXiv'24] Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules - [PDF] [Code]
- [arXiv'24] LayerShuffle: Enhancing Robustness in Vision Transformers by Randomizing Layer Execution Order - [PDF]
- [ICLR'25] CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences - [PDF] [Code]
- [arXiv'25] DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs - [PDF]
- [arXiv'25] ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference - [PDF]
- [ICASSP'25] DynamicAttention: Dynamic KV Cache for Disaggregate LLM Inference - [PDF]
- [arXiv'24] vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving - [PDF]
- [arXiv'24] SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget Allocation - [PDF] [Code]
- [arXiv'24] XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference - [PDF]
- [arXiv'24] Squeezed Attention: Accelerating Long Context Length LLM Inference - [PDF] [Code]
- [arXiv'24] LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference - [PDF]
- [NIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models - [PDF] [Code]
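Many of the KV-cache papers above (H2O, SqueezeAttention, LazyLLM, and others) share the idea of keeping only the most useful cache entries under a fixed memory budget. The sketch below illustrates score-based eviction using an accumulated-attention ("heavy-hitter") score; it is a simplified illustration under assumed tensor shapes, not a faithful reimplementation of any particular method.

```python
import torch

def evict_kv_heavy_hitters(keys, values, attn_score, budget, keep_recent=32):
    """Keep at most `budget` cached tokens per head.

    keys, values : [num_heads, seq_len, head_dim] cached keys / values
    attn_score   : [num_heads, seq_len] accumulated attention each cached
                   token has received from later queries ("heavy-hitter" score)
    keep_recent  : number of most recent tokens that are never evicted
    """
    num_heads, seq_len, head_dim = keys.shape
    if seq_len <= budget:
        return keys, values, attn_score

    scores = attn_score.clone()
    scores[:, -keep_recent:] = float("inf")                # protect the recent window
    keep_idx = scores.topk(budget, dim=1).indices.sort(dim=1).values  # keep original order

    gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, head_dim)
    return (keys.gather(1, gather_idx),
            values.gather(1, gather_idx),
            attn_score.gather(1, keep_idx))
```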
- [ACL'25] CQIL: Inference Latency Optimization with Concurrent Computation of Quasi-Independent Layers - [PDF] [Code]
- [arXiv'25] Prompt-based Depth Pruning of Large Language Models - [PDF]
- [arXiv'25] Layer by Layer: Uncovering Hidden Representations in Language Models - [PDF]
- [arXiv'25] Balcony: A Lightweight Approach to Dynamic Inference of Generative Language Models - [PDF]
- [arXiv'25] A Sliding Layer Merging Method for Efficient Depth-Wise Pruning in LLMs - [PDF]
- [AAAI'25] AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference - [PDF] [Code]
- [ICLR'25] Streamlining Redundant Layers to Compress Large Language Models - [PDF] [Code]
- [EMNLP'24] FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping - [PDF]
- [arXiv'24] Dynamic layer selection in decoder-only transformers - [PDF]
- [arXiv'24] Not All Layers of LLMs Are Necessary During Inference - [PDF]
- [arXiv'24] Hierarchical Skip Decoding for Efficient Autoregressive Text Generation - [PDF]
- [arXiv'24] Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy - [PDF]
- [arXiv'24] ShortGPT: Layers in Large Language Models are More Redundant Than You Expect - [PDF]
- [arXiv'24] A deeper look at depth pruning of LLMs - [PDF]
- [arXiv'24] SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks - [PDF]
- [ACL'24] LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding - [PDF] [Code]
- [arXiv'23] Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE - [PDF]
- [arXiv'23] SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference - [PDF]
- [arXiv'23] Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding - [PDF]
- [ICML'23] Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time - [PDF] [Code]
- [ICML'23] EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism - [PDF] [Code]
- [NIPS'22] Confident Adaptive Language Modeling - [PDF] [Code]
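A recurring pattern in the layer-skipping and early-exit papers above is a per-layer test of whether the current hidden state is already confident enough to stop. The sketch below illustrates confidence-based early exit with hypothetical `layers` / `norm` / `lm_head` modules; it is an illustration, not any paper's reference implementation.

```python
import torch

@torch.no_grad()
def early_exit_forward(hidden, layers, norm, lm_head, threshold=0.9):
    """Run decoder layers sequentially and stop once the next-token
    distribution at the last position is sufficiently peaked.

    hidden  : [batch, seq, dim] hidden states after the embedding layer
    layers  : list of decoder layers, each assumed to map hidden -> hidden
    norm    : final layer norm applied before the LM head
    lm_head : projection from hidden size to vocabulary logits
    """
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        # Confidence probe: max softmax probability of the next token.
        probs = torch.softmax(lm_head(norm(hidden[:, -1])), dim=-1)
        if probs.max(dim=-1).values.min() >= threshold:  # every batch element confident
            return hidden, i + 1                         # exited after i + 1 layers
    return hidden, len(layers)                           # no early exit
```

Probing the full vocabulary head at every layer is itself costly, which is why published methods typically rely on cheaper confidence signals, trained exit heads, or a small set of candidate exit depths.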
- [ICLR'25] SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration - [PDF] [Code]
- [arXiv'25] QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache - [PDF]
- [arXiv'25] DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting - [PDF] [Code]
- [arXiv'25] RASD: Retrieval-Augmented Speculative Decoding - [PDF]
- [NIPS'24] Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting - [PDF] [Code]
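The speculative-decoding entries above all follow a draft-then-verify loop: a small drafter proposes several tokens, and the target model checks them in a single forward pass. Below is a minimal sketch of the greedy (exact-match) variant with hypothetical `draft_model` / `target_model` callables that map token ids to logits; the sampling-based acceptance rule used in most of these papers is more involved.

```python
import torch

@torch.no_grad()
def speculative_step(prefix, draft_model, target_model, k=4):
    """One greedy draft-and-verify step.

    prefix       : [1, seq] token ids generated so far
    draft_model  : cheap model, callable(ids) -> logits of shape [1, seq, vocab]
    target_model : large model with the same interface
    Returns the prefix extended by at least one new token.
    """
    # 1) Draft k tokens autoregressively with the cheap model.
    draft = prefix
    for _ in range(k):
        next_tok = draft_model(draft)[:, -1].argmax(dim=-1, keepdim=True)
        draft = torch.cat([draft, next_tok], dim=1)

    # 2) Verify all drafted tokens with ONE forward pass of the target model.
    target_logits = target_model(draft)
    # Target predictions for every drafted position plus one position beyond.
    target_pred = target_logits[:, prefix.size(1) - 1:].argmax(dim=-1)  # [1, k + 1]
    drafted = draft[:, prefix.size(1):]                                 # [1, k]

    # 3) Accept the longest agreeing prefix, then take the target's own token
    #    at the first mismatch (or its bonus token if everything matched).
    agree = (target_pred[:, :k] == drafted)[0].long()
    n_accept = int(agree.cumprod(dim=0).sum().item())
    new_tokens = torch.cat([drafted[:, :n_accept],
                            target_pred[:, n_accept:n_accept + 1]], dim=1)
    return torch.cat([prefix, new_tokens], dim=1)
```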
- [arXiv'25] 2SSP: A Two-Stage Framework for Structured Pruning of LLMs - [PDF]
- [arXiv'24] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration - [PDF] [Code]
- [arXiv'24] Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation - [PDF]
- [NIPS'23] LLM-Pruner: On the Structural Pruning of Large Language Models - [PDF] [Code]
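The pruning entries above (e.g., LLM-Pruner, 2SSP) remove whole structural units such as hidden channels or blocks rather than individual weights. The sketch below shows simple magnitude-based structured pruning of an MLP block's hidden channels; real methods use gradient- or activation-aware importance scores, and the layer interface here is an assumption.

```python
import torch
import torch.nn as nn

def prune_ffn_channels(up_proj: nn.Linear, down_proj: nn.Linear, keep_ratio=0.75):
    """Structured pruning of an MLP block: drop the lowest-magnitude hidden
    channels of up_proj (rows) and the matching columns of down_proj.
    """
    hidden = up_proj.out_features
    n_keep = max(1, int(hidden * keep_ratio))

    # Importance of each hidden channel: L2 norm of its incoming weights.
    importance = up_proj.weight.norm(dim=1)
    keep = importance.topk(n_keep).indices.sort().values

    new_up = nn.Linear(up_proj.in_features, n_keep, bias=up_proj.bias is not None)
    new_down = nn.Linear(n_keep, down_proj.out_features, bias=down_proj.bias is not None)
    with torch.no_grad():
        new_up.weight.copy_(up_proj.weight[keep])
        if up_proj.bias is not None:
            new_up.bias.copy_(up_proj.bias[keep])
        new_down.weight.copy_(down_proj.weight[:, keep])
        if down_proj.bias is not None:
            new_down.bias.copy_(down_proj.bias)
    return new_up, new_down
```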
- [OSDI'24] ServerlessLLM: Low-Latency Serverless Inference for Large Language Models - [PDF] [Code]
- [arXiv'24] LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management - [PDF]
- [arXiv'24] Efficiently Serving LLM Reasoning Programs with Certaindex - [PDF]
- [arXiv'23] AutoDroid: LLM-powered Task Automation in Android - [PDF] [Code]
- [EuroSys'23] Tabi: An Efficient Multi-Level Inference System for Large Language Models - [PDF]
- [OSDI'23] Efficient Memory Management for Large Language Model Serving with PagedAttention - [PDF] [Project]
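On the serving-system side, PagedAttention-style schedulers manage the KV cache in fixed-size blocks drawn from a shared pool, so sequences allocate memory on demand. The toy allocator below conveys the block-table idea in plain Python; it is not vLLM's implementation or API.

```python
class BlockAllocator:
    """Toy paged KV-cache allocator: fixed-size blocks drawn from a shared pool."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}          # sequence id -> list of physical block ids

    def append_token(self, seq_id: int, seq_len: int) -> None:
        """Reserve space for one more token of `seq_id` (current length seq_len)."""
        table = self.block_tables.setdefault(seq_id, [])
        if seq_len % self.block_size == 0:      # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; evict or preempt a sequence")
            table.append(self.free_blocks.pop())

    def release(self, seq_id: int) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))


# Usage: two sequences of different lengths share a 16-block pool.
alloc = BlockAllocator(num_blocks=16, block_size=4)
for t in range(10):
    alloc.append_token(seq_id=0, seq_len=t)
for t in range(3):
    alloc.append_token(seq_id=1, seq_len=t)
print(alloc.block_tables)   # e.g. {0: [15, 14, 13], 1: [12]}
```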
- [arXiv'24] MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases - [PDF]
- [EdgeFM'24] Large Language Models on Mobile Devices: Measurements, Analysis, and Insights - [PDF]
- [arXiv'24] Toward Scalable Generative AI via Mixture of Experts in Mobile Edge Networks - [PDF]