A curated list of research works on efficient on-device AI systems, methods, and applications for mobile and edge devices.
Note: Some of these works target inference acceleration on cloud/server infrastructure with far greater computational resources, but I include them here when their techniques could potentially generalize to on-device inference use cases.
- [MLSys 2025] MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices
- [MLSys 2025] TurboAttention: Efficient Attention Approximation for High Throughputs LLMs
- [ASPLOS 2023] FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks
- [NeurIPS 2022] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (the underlying tiling idea is sketched after this list)
- [arXiv 2025] HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs with Heterogeneous AI Accelerators
- [ASPLOS 2025] Fast On-device LLM Inference with NPUs
- [arXiv 2024] PowerInfer-2: Fast Large Language Model Inference on a Smartphone
- [arXiv 2025] HALO: Hardware-Aware Quantization with Low Critical-Path-Delay Weights for LLM Acceleration
- [ISCA 2025] MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization
- [MLSys 2024] AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration
- [ISCA 2023] OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization
- [MLSys 2025] TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives
- [ASPLOS 2024] SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile
- [ASPLOS 2024] SoD2: Statically Optimizing Dynamic Deep Neural Network Execution
- [MICRO 2022] GCD2: A Globally Optimizing Compiler for Mapping DNNs to Mobile DSPs
- [PLDI 2021] DNNFusion: Accelerating Deep Neural Networks Execution with Advanced Operator Fusion
- [MobiSys 2025] ARIA: Optimizing Vision Foundation Model Inference on Heterogeneous Mobile Processors for Augmented Reality
- [PPoPP 2024] Shared Memory-contention-aware Concurrent DNN Execution for Diversely Heterogeneous SoCs
- [MobiSys 2024] Pantheon: Preemptible Multi-DNN Inference on Mobile Edge GPUs
- [MobiCom 2024] Perceptual-Centric Image Super-Resolution using Heterogeneous Processors on Mobile Devices
- [SenSys 2023] Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU
- [MobiSys 2023] NN-Stretch: Automatic Neural Network Branching for Parallel Inference on Heterogeneous Multi-Processors
- [ATC 2023] Decentralized Application-Level Adaptive Scheduling for Multi-Instance DNNs on Open Mobile Devices
- [IPSN 2023] PointSplit: Towards On-device 3D Object Detection with Heterogeneous Low-power Accelerators
- [SenSys 2022] BlastNet: Exploiting Duo-Blocks for Cross-Processor Real-Time DNN Inference
- [MobiSys 2022] Band: Coordinated Multi-DNN Inference on Heterogeneous Mobile Processors
- [MobiSys 2022] CoDL: Efficient CPU-GPU Co-execution for Deep Learning Inference on Mobile Devices
- [RTSS 2024] FLEX: Adaptive Task Batch Scheduling with Elastic Fusion in Multi-Modal Multi-View Machine Perception
- [MobiCom 2024] Panopticus: Omnidirectional 3D Object Detection on Resource-constrained Edge Devices
- [MobiSys 2023] OmniLive: Super-Resolution Enhanced 360° Video Live Streaming for Mobile Devices
- [MobiSys 2023] HarvNet: Resource-Optimized Operation of Multi-Exit Deep Neural Networks on Energy Harvesting Devices
- [MobiCom 2022] NeuLens: Spatial-based Dynamic Acceleration of Convolutional Neural Networks on Edge
- [MobiCom 2021] Flexible high-resolution object detection on edge devices with tunable latency
- [ASPLOS 2025] Nazar: Monitoring and Adapting ML Models on Mobile Devices
- [SenSys 2024] AdaShadow: Responsive Test-time Model Adaptation in Non-stationary Mobile Environments
- [SenSys 2023] EdgeFM: Leveraging Foundation Model for Open-set Learning on the Edge
- [MobiCom 2023] Cost-effective On-device Continual Learning over Memory Hierarchy with Miro
- [MobiCom 2023] AdaptiveNet: Post-deployment Neural Architecture Adaptation for Diverse Edge Environments
- [MobiSys 2023] ElasticTrainer: Speeding Up On-Device Training with Runtime Elastic Tensor Selection
- [SenSys 2023] On-NAS: On-Device Neural Architecture Search on Memory-Constrained Intelligent Embedded Systems
- [MobiCom 2022] Mandheling: Mixed-Precision On-Device DNN Training with DSP Offloading
- [MobiSys 2022] Memory-Efficient DNN Training on Mobile Devices
- [MobiCom 2024] MELTing point: Mobile Evaluation of Language Transformers [code]
- [SenSys 2023] nnPerf: Demystifying DNN Runtime Inference Latency on Mobile Platforms
- [MobiSys 2021] nn-Meter: Towards Accurate Latency Prediction of Deep-Learning Model Inference on Diverse Edge Devices
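For readers new to the attention-acceleration entries above (FlashAttention, MAS-Attention, TurboAttention, FLAT), the snippet below is a minimal NumPy sketch of the block-wise, online-softmax trick those works build on: keys/values are streamed tile by tile and the softmax numerator/denominator are rescaled on the fly, so the full L x L score matrix is never materialized. It is not code from any of the listed papers; the function name, block size, and single-head setup are illustrative assumptions.

```python
import numpy as np

def tiled_attention(q, k, v, block=64):
    """Single-head attention computed block-by-block over keys/values,
    using online-softmax rescaling so no L x L score matrix is stored."""
    L, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(v, dtype=np.float64)
    row_max = np.full(L, -np.inf)   # running max score per query row
    row_sum = np.zeros(L)           # running softmax denominator per row
    for start in range(0, L, block):
        kb = k[start:start + block]          # (B, d) tile of keys
        vb = v[start:start + block]          # (B, d) tile of values
        s = (q @ kb.T) * scale               # (L, B) partial score tile
        new_max = np.maximum(row_max, s.max(axis=1))
        # Rescale previously accumulated numerator/denominator to the new max.
        correction = np.exp(row_max - new_max)
        p = np.exp(s - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ vb
        row_max = new_max
    return out / row_sum[:, None]

# Sanity check against the naive dense computation.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 32)) for _ in range(3))
s = (q @ k.T) / np.sqrt(32)
p = np.exp(s - s.max(axis=1, keepdims=True))
ref = (p / p.sum(axis=1, keepdims=True)) @ v
assert np.allclose(tiled_attention(q, k, v), ref, atol=1e-6)
```

The rescaling step is what lets each key/value tile pass through a small on-chip buffer, keeping the working set at O(L·d) rather than O(L²), which is the property the memory-constrained edge variants above optimize further.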
MLSys 2025
- [MLSys 2025] MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices
- [MLSys 2025] Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking
- [MLSys 2025] TurboAttention: Efficient Attention Approximation for High Throughputs LLMs
- [MLSys 2025] SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
- [MLSys 2025] LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers
ASPLOS 2025
- Fast On-device LLM Inference with NPUs
- Energy-aware Scheduling and Input Buffer Overflow Prevention for Energy-harvesting Systems
- Generalizing Reuse Patterns for Efficient DNN on Microcontrollers
- Nazar: Monitoring and Adapting ML Models on Mobile Devices
EuroSys 2025
- Flex: Fast, Accurate DNN Inference on Low-Cost Edges Using Heterogeneous Accelerator Execution
- T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge
SOSP 2025
MobiSys 2025
- ARIA: Optimizing Vision Foundation Model Inference on Heterogeneous Mobile Processors for Augmented Reality
MobiCom 2025
Preprint 2025
- HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs with Heterogeneous AI Accelerators