A curated list of Deep Reinforcement Learning (DRL) research and projects on:
- Resource scheduling / placement in cloud environments
- Service autoscaling (microservice scaling‑in / scaling‑out / scaling‑up / scaling‑down)
- Cloud‑level scheduling decisions involving both task placement and dynamic resource allocation
## Contents

- [Introduction](#introduction)
- [Research Papers](#research-papers)
- [Open-Source Projects](#open-source-projects)
- [Datasets](#datasets)
- [Tools & Frameworks](#tools--frameworks)
- [Contributing](#contributing)
- [References](#references)
- [License](#license)
## Introduction

With the increasing complexity of cloud environments and the dynamic nature of workloads, traditional scheduling algorithms often fall short in optimizing resource allocation. Deep Reinforcement Learning (DRL) offers a promising approach by learning optimal scheduling policies through environment interaction.
This repository collects significant works and tools that leverage DRL for:
- Resource scheduling / placement
- Service autoscaling (microservice scaling‑in / scaling‑out / scaling‑up / scaling‑down)
- Cloud‑level scheduling decisions involving both placement and autoscaling
## Research Papers

### Resource Scheduling / Placement
**Batch Jobs Load Balancing Scheduling in Cloud Computing Using Distributional Reinforcement Learning**
Authors: Tiangang Li, Shi Ying, Yishi Zhao, Jianga Shang
Publication: TPDS-2024
Summary: This work tackles dynamic cluster load balancing for batch job scheduling in cloud computing. It highlights limitations of standard deep RL approaches (such as DQN), which ignore the full value distribution of returns and struggle with time-varying jobs and resources. The authors propose a distributional reinforcement learning solution that uses quantile regression to model the value distribution of the cumulative return, capturing environmental stochasticity. They formulate scheduling as a multi-objective optimization problem, develop a custom training environment, and introduce a novel distributional-RL-based scheduling algorithm. Evaluated on real Alibaba cluster traces (v2018 and v2020), the algorithm outperforms baselines in load balancing, instance creation success rate, and average task completion time, demonstrating strong scalability.
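For readers unfamiliar with the technique, the quantile-regression core of this approach can be sketched compactly. Below is a minimal PyTorch version of a QR-DQN-style quantile Huber loss; the quantile count and tensor shapes are illustrative assumptions, not the paper's exact configuration.

```python
import torch

N_QUANTILES = 32  # number of quantiles approximating the return distribution
taus = (torch.arange(N_QUANTILES, dtype=torch.float32) + 0.5) / N_QUANTILES

def quantile_huber_loss(pred, target, kappa=1.0):
    """Quantile Huber loss between predicted and target return quantiles.

    pred:   (batch, N_QUANTILES) quantiles predicted for the taken action
    target: (batch, N_QUANTILES) Bellman-target quantiles (already detached)
    """
    # Pairwise TD errors: td[b, i, j] = target_j - pred_i.
    td = target.unsqueeze(1) - pred.unsqueeze(2)
    huber = torch.where(td.abs() <= kappa,
                        0.5 * td.pow(2),
                        kappa * (td.abs() - 0.5 * kappa))
    # Asymmetric weight |tau_i - 1{td < 0}| pulls each quantile to its level.
    weight = (taus.view(1, -1, 1) - (td.detach() < 0).float()).abs()
    return (weight * huber / kappa).mean(dim=2).sum(dim=1).mean()
```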
**DRL-Cloud: DRL-based Resource Provisioning and Task Scheduling for Cloud Service Providers**
Authors: Mingxi Cheng, Ji Li, Shahin Nazarian
Publication: ASP-DAC-2018
Summary: This paper addresses energy cost minimization for large-scale Cloud Service Providers (CSPs). It identifies limitations in prior Resource Provisioning (RP) and Task Scheduling (TS) approaches, including scalability issues and neglect of crucial task dependencies. The authors propose DRL-Cloud, a novel Deep Reinforcement Learning (DRL) system featuring a two-stage RP-TS processor based on deep Q-learning. The system autonomously learns optimal long-term decisions by adapting to dynamic factors such as user request patterns and electricity prices. Using standard DRL techniques (target network, experience replay, exploration-exploitation), DRL-Cloud achieves significantly improved energy cost efficiency, a low task reject rate, low runtime, and fast convergence. Evaluations demonstrate dramatic improvements: up to 320% higher energy cost efficiency compared to state-of-the-art methods while maintaining lower reject rates, and up to 144% runtime reduction versus a fast round-robin baseline in a large-scale scenario (5,000 servers, 200,000 tasks).
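As background, here is a minimal sketch of the standard deep-Q machinery the summary mentions (target network, experience replay, epsilon-greedy exploration); the state and action dimensions are hypothetical placeholders, not the paper's two-stage RP-TS design.

```python
import random
from collections import deque
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 16, 8, 0.99  # illustrative sizes only

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())  # target starts as a copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)  # stores (s, a, r, s2, done) float tensors

def select_action(state, epsilon):
    # Epsilon-greedy exploration over placement/provisioning actions.
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return q_net(state).argmax().item()

def train_step(batch_size=64):
    if len(replay) < batch_size:
        return
    s, a, r, s2, done = map(torch.stack, zip(*random.sample(replay, batch_size)))
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # bootstrap from the slowly-updated target network
        target = r + GAMMA * (1 - done) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```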
(Additional DRL-based resource scheduling papers can be listed here.)
### Service Autoscaling
**StatuScale: Status-aware and Elastic Scaling Strategy for Microservice Applications**
Authors: Linfeng Wen, Minxian Xu, Sukhpal Singh Gill, Muhammad Hilman, Satish Narayana Srirama, Kejiang Ye, Chengzhong Xu
Publication: TAAS-2025
Summary: Microservice architecture has transformed traditional monolithic applications into lightweight components, and scaling these components is more efficient than scaling whole servers. However, scaling microservices still faces challenges from unexpected spikes or bursts of requests, which are difficult to detect and can degrade performance instantaneously. To address this challenge and ensure the performance of microservice-based applications, the authors propose StatuScale, a status-aware and elastic scaling framework built around a load status detector that selects appropriate elastic scaling strategies for differentiated resource scheduling in vertical scaling. Additionally, StatuScale employs a horizontal scaling controller that uses comprehensive evaluation and resource reduction to manage the number of replicas for each microservice. The paper also introduces a novel metric, the correlation factor, to evaluate resource usage efficiency. The approach is validated with Kubernetes and realistic Alibaba traces: it reduces average response time by 8.59% to 12.34% in the Sock-Shop application and by 7.30% to 11.97% in the Hotel-Reservation application, decreases service level objective violations, and uses resources more efficiently than the baselines.
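As a rough illustration of the status-aware idea (not StatuScale's actual detector), the sketch below classifies a sliding window of CPU samples as stable or bursty and picks a vertical-scaling step accordingly; the thresholds and scaling factors are invented for illustration.

```python
from statistics import mean, pstdev

def classify_load(window):
    """Label a sliding window of CPU-utilization samples as stable or bursty."""
    volatility = pstdev(window) / (mean(window) + 1e-9)  # coefficient of variation
    return "burst" if volatility > 0.3 else "stable"

def vertical_scale(cpu_limit, window):
    if classify_load(window) == "burst":
        return cpu_limit * 1.5            # react aggressively to absorb the spike
    return max(window[-1] * 1.2,          # otherwise track usage with headroom
               cpu_limit * 0.9)

print(vertical_scale(2.0, [0.8, 0.9, 2.4, 3.1]))  # bursty window -> 3.0 cores
```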
**DRPC: Distributed Reinforcement Learning Approach for Scalable Resource Provisioning in Container-Based Clusters**
Authors: Haoyu Bai, Minxian Xu, Kejiang Ye, Rajkumar Buyya, Chengzhong Xu
Publication: TSC-2024
Summary: Microservices have transformed monolithic applications into lightweight, self-contained, and isolated components, establishing themselves as a dominant paradigm for application development and deployment in public clouds such as Google and Alibaba. Autoscaling is an efficient strategy for managing the resources allocated to microservice replicas, but the dynamic and intricate dependencies within microservice chains complicate the management of scaled microservices, and centralized autoscaling can hit scalability limits in large microservice-based clusters. To address these challenges, the authors propose a distributed resource provisioning approach based on the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, which enables effective autoscaling decisions while decentralizing responsibilities from a central node to distributed nodes. On a realistic testbed with real traces, the approach reduces average response time by 15% and the number of failed requests by 24% compared with state-of-the-art approaches, validating improved scalability as the number of requests increases.
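For context, the sketch below shows the TD3 target computation (target policy smoothing plus clipped double-Q) that the paper's distributed provisioner builds on; the network sizes and state/action dimensions are illustrative assumptions, not DRPC's architecture.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, GAMMA = 12, 4, 0.99  # illustrative sizes only
actor_target = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                             nn.Linear(64, ACTION_DIM), nn.Tanh())
critic1_target = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64),
                               nn.ReLU(), nn.Linear(64, 1))
critic2_target = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64),
                               nn.ReLU(), nn.Linear(64, 1))

def td3_target(reward, next_state, done, noise_std=0.2, noise_clip=0.5):
    """Compute the TD3 Bellman target for a batch of transitions."""
    with torch.no_grad():
        # Target policy smoothing: perturb the target action with clipped noise.
        noise = torch.randn(next_state.size(0), ACTION_DIM) * noise_std
        noise = noise.clamp(-noise_clip, noise_clip)
        next_action = (actor_target(next_state) + noise).clamp(-1.0, 1.0)
        sa = torch.cat([next_state, next_action], dim=1)
        # Clipped double-Q: take the smaller of the two critic estimates.
        q = torch.min(critic1_target(sa), critic2_target(sa)).squeeze(1)
        return reward + GAMMA * (1 - done) * q
```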
**FIRM: An Intelligent Fine-grained Resource Management Framework for SLO-Oriented Microservices**
Authors: Haoran Qiu, Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer (University of Illinois at Urbana-Champaign)
Publication: OSDI-2020
Summary: User-facing latency-sensitive web services are composed of numerous distributed, intercommunicating microservices that promise to simplify software development and operation. However, multiplexing compute resources across microservices remains challenging in production because contention for shared resources can cause latency spikes that violate the service-level objectives (SLOs) of user requests. This paper presents FIRM, an intelligent fine-grained resource management framework for predictable sharing of resources across microservices that drives up overall utilization. FIRM leverages online telemetry data and machine-learning methods to adaptively (a) detect and localize the microservices that cause SLO violations, (b) identify the low-level resources in contention, and (c) mitigate SLO violations via dynamic reprovisioning. Experiments across four microservice benchmarks demonstrate that FIRM reduces SLO violations by up to 16x while cutting the overall requested CPU limit by up to 62%, and improves performance predictability by reducing tail latencies by up to 11x.
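The detect-localize-mitigate loop described above can be pictured schematically as follows; the helper logic and metric names are hypothetical placeholders, not FIRM's actual telemetry pipeline.

```python
def firm_style_loop(telemetry, slo_latency_ms=200):
    """Yield (service, resource, new_allocation) reprovisioning proposals."""
    for service, metrics in telemetry.items():
        if metrics["p99_latency_ms"] <= slo_latency_ms:
            continue                       # (a) SLO satisfied, nothing to do
        # (b) treat the most saturated low-level resource as the contention culprit
        resource = max(("cpu", "memory", "llc", "io"),
                       key=lambda r: metrics.get(f"{r}_util", 0.0))
        # (c) mitigate by reprovisioning that resource with extra headroom
        yield service, resource, metrics[f"{resource}_util"] * 1.25

telemetry = {"frontend": {"p99_latency_ms": 350, "cpu_util": 0.95, "memory_util": 0.60}}
print(list(firm_style_loop(telemetry)))    # [('frontend', 'cpu', 1.1875)]
```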
**Burst-Aware Predictive Autoscaling for Containerized Microservices**
Authors: Muhammad Abdullah, Waheed Iqbal, Josep Lluis Berral, Jorda Polo, and David Carrera
Publication: TSC-2022
Summary: (to be added)
**AWARE: Automate Workload Autoscaling with Reinforcement Learning in Production Cloud Systems**
Authors: Haoran Qiu and Weichao Mao (University of Illinois at Urbana-Champaign); Chen Wang, Hubertus Franke, and Alaa Youssef (IBM Research); Zbigniew T. Kalbarczyk, Tamer Başar, and Ravishankar K. Iyer (University of Illinois at Urbana-Champaign)
Publication: ATC-2023
Summary: Workload autoscaling is widely used in public and private cloud systems to maintain stable service performance and save resources, yet setting optimal resource limits and dynamically scaling each workload at runtime remains challenging. Reinforcement learning (RL) has recently been applied to various systems tasks, including resource management. The authors first characterize state-of-the-art RL approaches for workload autoscaling in a public cloud and point out that a large gap remains in bringing RL advances to production systems. They then propose AWARE, an extensible framework for deploying and managing RL-based agents in production systems. AWARE leverages meta-learning and bootstrapping to (a) automatically and quickly adapt to different workloads and (b) provide safe and robust RL exploration, and it exposes a common OpenAI Gym-like RL interface so agent developers can easily integrate with different systems tasks. In the workload-autoscaling case study, AWARE adapts a learned autoscaling policy to new workloads 5.5x faster than an existing transfer-learning-based approach and provides stable online policy serving with less than 3.6% reward degradation; with bootstrapping, it achieves 47.5% and 39.2% higher CPU and memory utilization while reducing SLO violations by a factor of 16.9x during policy training.
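To give a feel for the Gym-like interface the paper describes, here is a toy autoscaling environment written against the Gymnasium API; the observation features, load model, and reward shaping are assumptions for illustration, not AWARE's actual interface.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class AutoscaleEnv(gym.Env):
    """Toy environment: choose a replica count that tracks a fluctuating load."""

    def __init__(self, max_replicas=16):
        self.observation_space = spaces.Box(0.0, 1.0, shape=(2,), dtype=np.float32)
        self.action_space = spaces.Discrete(3)  # 0: scale in, 1: hold, 2: scale out
        self.max_replicas = max_replicas

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.replicas, self.load, self.t = 4, 0.5, 0
        return self._obs(), {}

    def step(self, action):
        self.replicas = int(np.clip(self.replicas + (action - 1), 1, self.max_replicas))
        self.load = float(np.clip(self.load + self.np_random.normal(0, 0.05), 0.05, 1.0))
        utilization = self.load * self.max_replicas / self.replicas
        # Reward staying near a target utilization; penalize overload (SLO risk).
        reward = -abs(utilization - 0.7) - (2.0 if utilization > 1.0 else 0.0)
        self.t += 1
        return self._obs(), reward, False, self.t >= 200, {}

    def _obs(self):
        return np.array([self.load, self.replicas / self.max_replicas], dtype=np.float32)
```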
**DeepScaler: Holistic Autoscaling for Microservices Based on Spatiotemporal GNN with Adaptive Graph Learning**
Authors: Chunyang Meng, Shijie Song, Haogang Tong, Maolin Pan, Yang Yu
Publication: ASE-2023
Summary: (to be added)
**HRA: An Intelligent Holistic Resource Autoscaling Framework for Multi-service Applications**
Authors: Chunyang Meng, Jingwan Tong, Maolin Pan, Yang Yu
Publication: ICWS-2022
Summary: (to be added)
**DeepScaling: Microservices Autoscaling for Stable CPU Utilization in Large Scale Cloud Systems**
Authors: Ziliang Wang, Shiyi Zhu, Jianguo Li, Wei Jiang, K. K. Ramakrishnan, Yangfei Zheng, Meng Yan, Xiaohong Zhang, Alex X. Liu
Publication: SoCC-2022
Summary: (to be added)
**DeepScaling: Autoscaling Microservices With Stable CPU Utilization for Large Scale Production Cloud Systems**
Authors: Ziliang Wang, Shiyi Zhu, Jianguo Li, Wei Jiang, K. K. Ramakrishnan, Meng Yan
Publication: ToN-2024 (IEEE/ACM Transactions on Networking)
Summary: (to be added)
### Cloud- and Edge-Level Scheduling

**Dependent Task Offloading for Edge Computing based on Deep Reinforcement Learning**
Authors: Jin Wang, Jia Hu, Geyong Min, Wenhan Zhan, Albert Y. Zomaya, and Nektarios Georgalas
Publication: TC-2022
Summary: (to be added)
(Template for new entries: Title / Authors / Publication / Summary.)
## Open-Source Projects

(Open-source DRL scheduling/autoscaling implementations will be listed here.)
## Datasets

(Datasets for DRL training and evaluation: job traces, metrics logs, cloud simulators, etc.)
- alibaba/clusterdata – Alibaba Cluster Trace Program, cluster data collected from production clusters in Alibaba for cluster management research.
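A quick-start sketch for loading these traces with pandas, assuming the cluster-trace-v2018 batch_task schema; verify the column names against the trace documentation in the repository before relying on them.

```python
import pandas as pd

# The trace CSVs ship without a header row; these names follow the v2018
# batch_task schema as documented in the repo (an assumption to double-check).
columns = ["task_name", "instance_num", "job_name", "task_type",
           "status", "start_time", "end_time", "plan_cpu", "plan_mem"]
tasks = pd.read_csv("batch_task.csv", header=None, names=columns)

# Example: task durations and per-job requested CPU, typical inputs for
# DRL scheduling experiments.
tasks["duration"] = tasks["end_time"] - tasks["start_time"]
print(tasks.groupby("job_name")["plan_cpu"].sum().head())
```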
## Tools & Frameworks

(Tools and DRL frameworks that facilitate building and testing scheduling/autoscaling agents.)
## Contributing

Contributions welcome!
Please follow these steps:
- Fork this repository.
- Create a branch for your addition (paper, dataset, project, etc.).
- Add your entry including title, link, authors, and a brief summary.
- Submit a pull request; maintainers will review and merge it.
## References

Below are GitHub repositories and related works that inspired this collection:
- pkoperek/drl-cloud-management-list – A list of DRL cloud management papers.
We sincerely thank the maintainers and authors of these repositories for their foundational contributions and for providing the community a reference point to build upon.
## License

This repository is licensed under the MIT License. See the LICENSE file for details.