This list highlights academic work on running AI models efficiently on resource-constrained mobile devices, including (1) edge devices (e.g., NVIDIA Jetson), (2) smartphones (e.g., Snapdragon/Exynos SoCs), and (3) microcontrollers for energy-harvesting or batteryless IoT devices, with a primary focus on research targeting edge devices and smartphones. This repo references Awesome-On-Device-AI-Systems by Jeho Lee.
## Inference using a single processor

### General DNN inference
- [MLSys 2024] AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration (Paper)
- LLM; Desktop & edge devices; GPU (see the quantization sketch after this list)
- [MLSys 2025] MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices (Paper)
- Attention-based NN; Edge devices; NPU
- [MLSys 2025] Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking (Paper)
- LLM; Smartphones; NPU (Simulation)
- [ASPLOS 2025] Fast On-device LLM Inference with NPUs (llm.npu) (Paper)
- LLM; Smartphones; NPU
- [IEEE TMC 2025] NeuroBalancer: Balancing System Frequencies With Punctual Laziness for Timely and Energy-Efficient DNN Inferences (Paper)
- CNN; Smartphones; GPU
- [ASPLOS 2024] SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile (Paper)
- CNN, Transformer, and LLM; Smartphones; GPU
- [MobiCom 2024] FlexNN: Efficient and Adaptive DNN Inference on Memory-Constrained Edge Devices (Paper)
- CNN; Smartphones; CPU
- [MobiCom 2024] Mobile Foundation Model as Firmware (Paper)
- Foundation model; Edge devices & smartphones; CPU or GPU
- [MobiSys 2024] Empowering In-Browser Deep Learning Inference on Edge Devices with Just-in-Time Kernel Optimization (Paper)
- Transformer; Smartphones & laptops; (Web)GPU
- [ASPLOS 2023] STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining (Paper)
- NLP (BERT); Edge devices; CPU or GPU
- [MobiCom 2022] Romou: Rapidly Generate High-Performance Tensor Kernels for Mobile GPUs (Paper)
- CNN; Smartphones; GPU
- [MobiCom 2022] NeuLens: Spatial-based Dynamic Acceleration of Convolutional Neural Networks on Edge (Paper)
- CNN; Edge devices; GPU
- [MICRO 2022] GCD2: A Globally Optimizing Compiler for Mapping DNNs to Mobile DSPs (Paper)
- CNN and GAN; Smartphones; DSP (NPU)
- [MobiCom 2021] AsyMo: Scalable and Efficient Deep-Learning Inference on Asymmetric Mobile CPUs (Paper)
- CNN and RNN; Smartphones; CPU
- [PLDI 2021] DNNFusion: Accelerating Deep Neural Networks Execution with Advanced Operator Fusion (Paper)
- CNN and Transformer; Smartphones; CPU or GPU
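Several entries above revolve around low-bit weight quantization. As a concrete illustration, below is a minimal, hypothetical sketch in the spirit of activation-aware weight quantization (AWQ): per-input-channel scales derived from activation magnitudes protect salient weights during low-bit rounding, and the inverse scale is folded into the preceding operator. The function name, the fixed `alpha`, and the normalization are illustrative assumptions, not the paper's implementation (AWQ searches `alpha` per layer).

```python
import numpy as np

def awq_style_quantize(W, act_scale, n_bits=4, alpha=0.5):
    """Hypothetical sketch: scale salient input channels (guided by mean
    activation magnitude, assumed positive) before symmetric low-bit
    rounding, then fold the inverse scale into the previous layer."""
    s = np.power(act_scale, alpha)            # per-input-channel saliency scale
    s /= np.sqrt(s.max() * s.min())           # keep scales centered around 1
    Ws = W * s                                # (out, in) * (in,): scale columns
    qmax = 2 ** (n_bits - 1) - 1
    step = np.abs(Ws).max(axis=1, keepdims=True) / qmax + 1e-12  # per-row step
    Wq = np.clip(np.round(Ws / step), -qmax - 1, qmax)           # int-range values
    return Wq * step, s                       # dequantized weights, channel scales

# At inference time: y ~= (x / s) @ W_deq.T, with the x / s division
# fused into the operator that produces x.
```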
### Application-specific optimization
- [MobiSys 2025] ARIA: Optimizing Vision Foundation Model Inference on Heterogeneous Mobile Processors for Augmented Reality (To Appear)
- Vision foundation model for augmented reality; Smartphones
- [AAAI 2025] E4: Energy-Efficient DNN Inference for Edge Video Analytics Via Early Exiting and DVFS (Paper)
- Video analytics; Edge devices; GPU (see the early-exit sketch after this list)
- [MobiCom 2024] Panopticus: Omnidirectional 3D Object Detection on Resource-constrained Edge Devices (Paper)
- 3D object detection; Edge devices; GPU
- [MobiSys 2023] OmniLive: Super-Resolution Enhanced 360° Video Live Streaming for Mobile Devices (Paper)
- Video super-resolution; Smartphones; GPU
- [IEEE TMC 2023] NAWQ-SR: A Hybrid-Precision NPU Engine for Efficient On-Device Super-Resolution (Paper)
- Single-image super-resolution; Smartphones; NPU
- [EuroSys 2022] LiteReconfig: Cost and Content Aware Reconfiguration of Video Object Detection Systems for Mobile GPUs (Paper)
- Video analytics; Edge devices; GPU
- [UbiComp 2022] Efficient On-Device Visual Question Answering (Paper)
- Visual question answering; Edge devices & smartphones; GPU
- [MobiCom 2021] Flexible High-Resolution Object Detection on Edge Devices with Tunable Latency (Paper)
- Object detection; Edge devices; GPU
- [MobiCom 2020] NEMO: Enabling Neural-enhanced Video Streaming on Commodity Mobile Devices (Paper)
- Video super-resolution; Smartphones; GPU
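To make the early-exit idea used by some of these systems (e.g., E4 above) concrete, here is a minimal, hedged sketch of multi-exit inference: a lightweight head after each backbone stage decides whether the later, more expensive stages can be skipped. The `stages`/`heads` structure and the fixed confidence threshold are illustrative assumptions, not any paper's actual pipeline (E4 additionally couples exit decisions with DVFS, omitted here).

```python
import torch

@torch.no_grad()
def early_exit_infer(stages, heads, x, threshold=0.9):
    """Run backbone stages in order; return as soon as an exit head is
    confident enough (batch size 1 assumed for simplicity)."""
    assert len(stages) == len(heads) > 0
    for stage, head in zip(stages, heads):
        x = stage(x)                          # next chunk of the backbone
        probs = head(x).softmax(dim=-1)       # cheap auxiliary classifier
        conf, pred = probs.max(dim=-1)
        if conf.item() >= threshold:          # confident: skip later stages
            return pred.item(), conf.item()
    return pred.item(), conf.item()           # final exit (no early stop)
```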
## Inference using heterogeneous processors

### General DNN inference
- [EuroSys 2025] Flex: Fast, Accurate DNN Inference on Low-Cost Edges Using Heterogeneous Accelerator Execution (Paper)
- Smartphones; CPU + GPU + NPU (TPU/DSP)
- [arXiv 2025] HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs with Heterogeneous AI Accelerators (Paper)
- LLM; Smartphones; CPU + GPU + NPU
- [arXiv 2024] PowerInfer-2: Fast Large Language Model Inference on a Smartphone (Paper)
- LLM; Smartphones; CPU + NPU
- [IEEE TMC 2024] Thermal-Aware Scheduling for Deep Learning on Mobile Devices with NPU (Paper)
- CNN; Smartphones; GPU + NPU
- [ICDE 2023] EdgeNN: Efficient Neural Network Inference for CPU-GPU Integrated Edge Devices (Paper)
- CNN; Edge devices; CPU + GPU
- [ATC 2023] Decentralized Application-Level Adaptive Scheduling for Multi-Instance DNNs on Open Mobile Devices (Paper)
- CNN; Smartphones; CPU + GPU
- [IPSN 2021] Efficient Execution of Deep Neural Networks on Mobile Devices with NPU (Paper)
- CNN; Smartphones; CPU + NPU
- [EuroSys 2019] µLayer: Low Latency On-Device Inference Using Cooperative Single-Layer Acceleration and Processor-Friendly Quantization (Paper)
- CNN; Smartphones; CPU + GPU (see the layer-mapping sketch after this list)
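A recurring ingredient in this group (e.g., µLayer above) is deciding, per layer, which processor should execute it. The following is a small, hypothetical dynamic-programming sketch of that mapping problem, assuming offline-profiled per-layer latencies and inter-processor transfer costs; real systems such as µLayer or CoDL additionally split individual layers across processors, which is not modeled here.

```python
def map_layers(latency, transfer):
    """latency[l][p]: profiled time of layer l on processor p (assumed input).
    transfer[p][q]: cost of moving activations from p to q (0 when p == q).
    Returns (total latency, processor index per layer) for a linear DNN."""
    procs = range(len(latency[0]))
    # best[p] = (cost of layers[0..l] with layer l on p, mapping so far)
    best = {p: (latency[0][p], [p]) for p in procs}
    for l in range(1, len(latency)):
        nxt = {}
        for q in procs:
            cost, path = min(
                (best[p][0] + transfer[p][q] + latency[l][q], best[p][1])
                for p in procs
            )
            nxt[q] = (cost, path + [q])
        best = nxt
    return min(best.values())
```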
### Application-specific optimization
- [MobiCom 2024] Perceptual-Centric Image Super-Resolution using Heterogeneous Processors on Mobile Devices (Paper)
- Single-image super-resolution; Smartphones; GPU + NPU
- [ICDE 2024] COUPLE: Orchestrating Video Analytics on Heterogeneous Mobile Processors (Paper)
- Video analytics (object detection); Smartphones; GPU + DSP (NPU)
- [IPSN 2023] PointSplit: Towards On-device 3D Object Detection with Heterogeneous Low-power Accelerators (Paper)
- 3D object detection; Edge devices; GPU + NPU
- [MobiCom 2019] MobiSR: Efficient On-Device Super-Resolution through Heterogeneous Mobile Processors (Paper)
- Single-image super-resolution; Smartphones; CPU + GPU + DSP (NPU)
- [INFOCOM 2024] Galaxy: A Resource-Efficient Collaborative Edge AI System for In-situ Transformer Inference (Paper)
- Transformer; Multiple edge devices; CPU + GPU
## Multi-DNN inference

### General DNN inference
- [IEEE TMC 2024] SwapNet: Efficient Swapping for DNN Inference on Edge AI Devices Beyond the Memory Budget (Paper)
- CNN; Edge devices; GPU
- [MobiSys 2024] Pantheon: Preemptible Multi-DNN Inference on Mobile Edge GPUs (Paper)
- Edge devices; GPU (see the scheduling sketch after this list)
- [MICRO 2023] Sparse-DySta: Sparsity-Aware Dynamic and Static Scheduling for Sparse Multi-DNN Workloads (Paper)
- CNN, Attention-based NN, and NLP; From smartphones to data centers; NPU (Simulation)
- [SenSys 2023] Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU (Paper)
- Edge devices; GPU
- [HPCA 2021] Layerweaver: Maximizing Resource Utilization of Neural Processing Units via Layer-Wise Scheduling (Paper)
- NPU (Simulation)
- [MobiCom 2021] LegoDNN: Block-grained Scaling of Deep Neural Networks for Mobile Vision (Paper)
- Edge devices & smartphones; CPU or GPU
- [PerCom 2021] MASA: Responsive Multi-DNN Inference on the Edge (Paper)
- CNN; Edge devices (Raspberry Pi); CPU
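Most systems in this group boil down to deciding which DNN's next piece of work gets the accelerator. As a rough illustration (loosely inspired by layer-granularity preemption in systems like Pantheon, but not their implementation), here is a hedged sketch of an earliest-deadline-first scheduler that re-arbitrates after every layer; `requests` and `run_layer` are hypothetical stand-ins for profiled models and a kernel launcher.

```python
import heapq

def edf_layer_scheduler(requests, run_layer):
    """requests: list of {'deadline': float, 'layers': [callable, ...]}.
    Executes one layer at a time and re-picks the earliest-deadline
    request after each layer, approximating preemption at layer
    boundaries without killing in-flight kernels."""
    ready = [(r["deadline"], i) for i, r in enumerate(requests)]
    heapq.heapify(ready)
    progress = [0] * len(requests)
    while ready:
        deadline, i = heapq.heappop(ready)
        run_layer(requests[i]["layers"][progress[i]])  # one layer on the GPU
        progress[i] += 1
        if progress[i] < len(requests[i]["layers"]):
            heapq.heappush(ready, (deadline, i))       # back into the queue
```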
### Application-specific optimization
- [MobiCom 2020] Heimdall: Mobile GPU Coordination Platform for Augmented Reality Applications (Paper)
- Augmented reality; Smartphones; GPU
## Concurrent DNN execution on heterogeneous processors

### General DNN inference
- [PPoPP 2024] Shared Memory-contention-aware Concurrent DNN Execution for Diversely Heterogeneous SoCs (HaX-CoNN) (Paper)
- Edge devices; GPU + DLA (NPU) (see the mapping sketch after this list)
- [SEC 2024] Elastic Execution of Multi-Tenant DNNs on Heterogeneous Edge MPSoCs (Paper)
- Smartphones; CPU + GPU + DSP (NPU)
- [MobiSys 2023] NN-Stretch: Automatic Neural Network Branching for Parallel Inference on Heterogeneous Multi-Processors (Paper)
- Smartphones; CPU + GPU + DSP (NPU)
- [MobiSys 2022] Band: Coordinated Multi-DNN Inference on Heterogeneous Mobile Processors (Paper)
- Smartphones; CPU + GPU + DSP + NPU
- [MobiSys 2022] CoDL: Efficient CPU-GPU Co-execution for Deep Learning Inference on Mobile Devices (Paper)
- Smartphones; CPU + GPU
- [SenSys 2022] BlastNet: Exploiting Duo-Blocks for Cross-Processor Real-Time DNN Inference (Paper)
- Edge devices; CPU + GPU
- [ACM TACO 2021] SLO-Aware Inference Scheduler for Heterogeneous Processors in Edge Platforms (Paper)
- Smartphones, edge devices, and desktop computers; CPU + GPU + DSP (NPU)
- [RTSS 2019] Pipelined Data-Parallel CPU/GPU Scheduling for Multi-DNN Real-Time Inference (Paper)
- Edge devices (NVIDIA TX2) and desktop computers (Intel x86 Xeon); CPU + GPU
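When several DNNs share a heterogeneous SoC, the core problem these systems tackle is assigning (and re-assigning) models to processors under contention. Below is a deliberately simplified, hypothetical greedy sketch of that assignment, assuming offline-profiled latencies; real schedulers such as Band or HaX-CoNN also model operator-level fallbacks and shared-memory contention, which this ignores.

```python
def greedy_assign(dnns, procs, latency):
    """latency[d][p]: profiled runtime of DNN d on processor p (assumed).
    Longest-job-first greedy: each DNN goes to the processor whose queue
    would finish it earliest. Returns {dnn: processor}."""
    finish = {p: 0.0 for p in procs}          # per-processor queue length
    plan = {}
    for d in sorted(dnns, key=lambda d: -min(latency[d].values())):
        p = min(procs, key=lambda p: finish[p] + latency[d][p])
        plan[d] = p
        finish[p] += latency[d][p]
    return plan
```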
## Inference on microcontrollers (energy-harvesting / batteryless devices)

- [SenSys 2025] Lupe: Integrating the Top-down Approach with DNN Execution on Ultra-Low-Power Devices (To Appear)
- Ultra-low-power MCU (MSP430 series)
- [SenSys 2024] Intermittent Inference: Trading a 1% Accuracy Loss for a 1.9x Throughput Speedup (Paper)
- High-performance MCU (ARM Cortex-M series) (see the checkpointing sketch after this list)
- [SenSys 2024] Fast-Inf: Ultra-Fast Embedded Intelligence on the Batteryless Edge (Paper)
- Ultra-low-power MCU (MSP430 series)
- [MobiSys 2023] HarvNet: Resource-Optimized Operation of Multi-Exit Deep Neural Networks on Energy Harvesting Devices (Paper)
- Ultra-low-power MCU (MSP430 series)
- [ASPLOS 2023] Space-Efficient TREC for Enabling Deep Learning on Microcontrollers (Paper)
- High-performance MCU (ARM Cortex-M series)
- [MLSys 2021] MicroNets: Neural Network Architectures for Deploying TinyML Applications on Commodity Microcontrollers (Paper)
- High-performance MCU (ARM Cortex-M series)
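The batteryless systems above share one mechanism: progress must survive power failures, so inference state is checkpointed to non-volatile memory between steps. The sketch below is a purely conceptual, hypothetical illustration of layer-granularity checkpointing (a file stands in for FRAM, and Python for MCU C); the cited papers use far finer-grained and more carefully engineered schemes.

```python
import os
import pickle

def intermittent_infer(layers, x, ckpt="act.ckpt"):
    """Checkpoint the activation and layer index after every layer so a
    power failure resumes from the last completed layer instead of
    restarting the whole network."""
    start = 0
    if os.path.exists(ckpt):                  # power came back: resume
        with open(ckpt, "rb") as f:
            start, x = pickle.load(f)
    for i in range(start, len(layers)):
        x = layers[i](x)
        with open(ckpt, "wb") as f:           # persist progress (FRAM stand-in)
            pickle.dump((i + 1, x), f)
    os.remove(ckpt)                           # inference complete
    return x
```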