A curated list of research papers, code, and tools applying deep reinforcement learning (DRL) to cloud/microservice resource scheduling and autoscaling.

Awesome DRL Cloud Scheduling

A curated list of Deep Reinforcement Learning (DRL) research and projects on:

  • Resource scheduling / placement in cloud environments
  • Service autoscaling (horizontal scaling out/in and vertical scaling up/down of microservices)
  • Cloud‑level scheduling decisions involving both task placement and dynamic resource allocation

Contents

  • Introduction
  • Research Papers
  • Open‑Source Projects
  • Datasets
  • Tools & Frameworks
  • Contributing
  • References
  • License

Introduction

With the increasing complexity of cloud environments and the dynamic nature of workloads, traditional scheduling algorithms often fall short in optimizing resource allocation. Deep Reinforcement Learning (DRL) offers a promising approach by learning optimal scheduling policies through environment interaction.
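The interaction loop behind such DRL schedulers can be made concrete with a toy example: an agent places each arriving task on one of two servers and is rewarded for keeping load balanced. Everything below (two servers, uniform task sizes, the balance reward) is a deliberately simplified illustration, not the setup of any particular paper:

```python
import random

class ToySchedulingEnv:
    """Toy cloud-scheduling MDP: place each arriving task on one of two
    servers; reward is the negative load imbalance after placement.
    All names and dynamics are illustrative assumptions."""

    def __init__(self, n_tasks=20, seed=0):
        self.n_tasks = n_tasks
        self.rng = random.Random(seed)

    def reset(self):
        self.loads = [0.0, 0.0]
        self.remaining = self.n_tasks
        return tuple(self.loads)

    def step(self, action):
        task_size = self.rng.uniform(0.5, 1.5)        # stochastic workload
        self.loads[action] += task_size
        self.remaining -= 1
        reward = -abs(self.loads[0] - self.loads[1])  # balance objective
        done = self.remaining == 0
        return tuple(self.loads), reward, done

# A DRL agent would learn the policy; here a greedy "join shortest queue"
# heuristic stands in to show the interaction loop.
env = ToySchedulingEnv()
state, done, total = env.reset(), False, 0.0
while not done:
    action = 0 if state[0] <= state[1] else 1
    state, reward, done = env.step(action)
    total += reward
```

A learned policy replaces the `action = ...` line; the rest of the loop is exactly what DRL training code iterates over.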

This repository collects significant research papers, open‑source projects, datasets, and tools that apply DRL to these problems.

Research Papers

Resource Scheduling

  • Batch Jobs Load Balancing Scheduling in Cloud Computing Using Distributional Reinforcement Learning
    Authors: Tiangang Li, Shi Ying, Yishi Zhao, Jianga Shang
    Publication: TPDS-2024
    Summary: This work tackles dynamic cluster load balancing for batch job scheduling in cloud computing. It highlights limitations of standard deep RL approaches (such as DQN), which ignore the full value distribution and struggle with time-varying jobs and resources. The authors propose a distributional reinforcement learning solution that uses quantile regression to model the value distribution of the cumulative return, capturing environmental stochasticity. They formulate scheduling as a multi-objective optimization problem, develop a custom training environment, and introduce a novel distributional-RL-based scheduling algorithm. Evaluated on real Alibaba cluster traces (v2018 and v2020), the algorithm outperforms baselines in load balancing, instance creation success rate, and average task completion time, demonstrating strong scalability.
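The quantile-regression idea this line of work relies on can be sketched with the standard QR-DQN quantile Huber loss: instead of regressing a single expected return, the network predicts a set of return quantiles. This is a generic illustration of the technique, not the paper's actual code:

```python
import numpy as np

def quantile_huber_loss(pred_quantiles, target_samples, kappa=1.0):
    """Quantile-regression loss in the style of QR-DQN.
    pred_quantiles: (N,) predicted return quantiles.
    target_samples: (M,) Bellman target samples.
    Generic sketch of the technique; shapes and defaults are assumptions."""
    N = len(pred_quantiles)
    taus = (np.arange(N) + 0.5) / N                         # quantile midpoints
    u = target_samples[None, :] - pred_quantiles[:, None]   # pairwise TD errors
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    weight = np.abs(taus[:, None] - (u < 0).astype(float))  # asymmetric weight
    return (weight * huber).mean()
```

Minimizing this loss pushes the predicted quantiles toward the target return distribution, which is what lets the scheduler account for workload stochasticity rather than just the mean return.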

  • DRL‑Cloud: DRL‑based Resource Provisioning and Task Scheduling for Cloud Service Providers
    Authors: Mingxi Cheng, Ji Li, Shahin Nazarian
    Publication: ASP-DAC-2018
    Summary: This paper addresses energy cost minimization for large-scale Cloud Service Providers (CSPs). It identifies limitations in prior Resource Provisioning (RP) and Task Scheduling (TS) approaches, including scalability issues and neglect of crucial task dependencies. The authors propose DRL-Cloud, a novel Deep Reinforcement Learning (DRL) system featuring a two-stage RP-TS processor based on deep Q-learning. This system autonomously learns optimal long-term decisions by adapting to dynamic factors like user request patterns and electricity prices. Utilizing standard DRL techniques (target network, experience replay, exploration-exploitation), DRL-Cloud achieves significantly improved energy cost efficiency, low task reject rate, low runtime, and fast convergence. Evaluations demonstrate dramatic improvements: up to 320% higher energy cost efficiency compared to state-of-the-art methods while maintaining lower reject rates, and up to 144% runtime reduction versus a fast round-robin baseline in a large-scale scenario (5,000 servers, 200,000 tasks).
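The "standard DRL techniques" named above (target network, experience replay, exploration-exploitation) can be sketched in a few lines. Function names and shapes below are our illustrative assumptions, not DRL-Cloud's actual implementation:

```python
import random
from collections import deque

import numpy as np

GAMMA = 0.99
replay = deque(maxlen=10_000)   # experience replay buffer (sampled i.i.d. for updates)

def td_targets(rewards, dones, next_states, target_q):
    """Deep Q-learning targets y = r + gamma * max_a' Q_target(s', a'),
    with bootstrapping cut at terminal transitions.
    `target_q(states) -> (B, n_actions)` is the frozen target network."""
    next_best = target_q(next_states).max(axis=1)
    return rewards + GAMMA * (1.0 - dones) * next_best

def epsilon_greedy(q_values, eps, rng=random):
    """Explore with probability eps, otherwise act greedily."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return int(np.argmax(q_values))
```

In a two-stage RP-TS design like the one described, each stage would run this kind of update on its own Q-network, with the target network refreshed periodically for stability.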

  • (Additional DRL-based resource scheduling papers can be listed here.)

Autoscaling

  • StatuScale: Status-aware and Elastic Scaling Strategy for Microservice Applications
    Authors: Linfeng Wen, Minxian Xu, Sukhpal Singh Gill, Muhammad Hilman, Satish Narayana Srirama, Kejiang Ye, Chengzhong Xu
    Publication: TAAS-2025
    Summary: Microservice architecture has transformed traditional monolithic applications into lightweight components. Scaling these lightweight microservices is more efficient than scaling servers. However, scaling microservices still faces challenges resulting from unexpected spikes or bursts of requests, which are difficult to detect and can degrade performance instantaneously. To address this challenge and ensure the performance of microservice-based applications, we propose a status-aware and elastic scaling framework called StatuScale, which is based on a load status detector that can select appropriate elastic scaling strategies for differentiated resource scheduling in vertical scaling. Additionally, StatuScale employs a horizontal scaling controller that utilizes comprehensive evaluation and resource reduction to manage the number of replicas for each microservice. We also present a novel metric named the correlation factor to evaluate resource usage efficiency. Finally, we use Kubernetes, an open-source container orchestration and management platform, and realistic traces from Alibaba to validate our approach. The experimental results demonstrate that the proposed framework can reduce the average response time in the Sock-Shop application by 8.59% to 12.34% and in the Hotel-Reservation application by 7.30% to 11.97%, decrease service level objective violations, and offer better performance in resource usage compared to baselines.

  • DRPC: Distributed Reinforcement Learning Approach for Scalable Resource Provisioning in Container-Based Clusters
    Authors: Haoyu Bai, Minxian Xu, Kejiang Ye, Rajkumar Buyya, Chengzhong Xu
    Publication: TSC-2024
    Summary: Microservices have transformed monolithic applications into lightweight, self-contained, and isolated application components, establishing themselves as a dominant paradigm for application development and deployment in public clouds such as Google and Alibaba. Autoscaling emerges as an efficient strategy for managing resources allocated to microservices’ replicas. However, the dynamic and intricate dependencies within microservice chains present challenges to the effective management of scaled microservices. Additionally, the centralized autoscaling approach can encounter scalability issues, especially in the management of large-scale microservice-based clusters. To address these challenges and enhance scalability, we propose an innovative distributed resource provisioning approach for microservices based on the Twin Delayed Deep Deterministic Policy Gradient algorithm. This approach enables effective autoscaling decisions and decentralizes responsibilities from a central node to distributed nodes. Comparative results with state-of-the-art approaches, obtained from a realistic testbed and traces, indicate that our approach reduces the average response time by 15% and the number of failed requests by 24%, validating improved scalability as the number of requests increases.
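The Twin Delayed Deep Deterministic Policy Gradient (TD3) critic target that DRPC builds on can be sketched compactly: clipped noise smooths the target action, and the minimum over twin target critics counters overestimation. This is a generic sketch of the algorithm, with illustrative names, not DRPC's code:

```python
import numpy as np

def td3_target(reward, done, next_action, noise_std, noise_clip,
               q1_target, q2_target, gamma=0.99, rng=None):
    """TD3 critic target: add clipped Gaussian noise to the target action
    (target policy smoothing), then bootstrap from the minimum of the twin
    target critics (clipped double-Q learning)."""
    rng = rng or np.random.default_rng(0)
    noise = np.clip(rng.normal(0.0, noise_std, size=next_action.shape),
                    -noise_clip, noise_clip)
    smoothed = next_action + noise
    q_min = np.minimum(q1_target(smoothed), q2_target(smoothed))
    return reward + gamma * (1.0 - done) * q_min
```

In a distributed setting like DRPC's, each node's agent can compute such targets locally for its own microservice, which is what removes the central-controller bottleneck.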

  • FIRM: An Intelligent Fine-grained Resource Management Framework for SLO-Oriented Microservices
    Authors: Haoran Qiu, Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer, University of Illinois at Urbana-Champaign
    Publication: OSDI-2020
    Summary: User-facing latency-sensitive web services include numerous distributed, intercommunicating microservices that promise to simplify software development and operation. However, multiplexing of compute resources across microservices is still challenging in production because contention for shared resources can cause latency spikes that violate the service-level objectives (SLOs) of user requests. This paper presents FIRM, an intelligent fine-grained resource management framework for predictable sharing of resources across microservices to drive up overall utilization. FIRM leverages online telemetry data and machine-learning methods to adaptively (a) detect/localize microservices that cause SLO violations, (b) identify low-level resources in contention, and (c) take actions to mitigate SLO violations via dynamic reprovisioning. Experiments across four microservice benchmarks demonstrate that FIRM reduces SLO violations by up to 16x while reducing the overall requested CPU limit by up to 62%. Moreover, FIRM improves performance predictability by reducing tail latencies by up to 11x.

  • Burst-Aware Predictive Autoscaling for Containerized Microservices
    Authors: Muhammad Abdullah, Waheed Iqbal, Josep Lluis Berral, Jorda Polo, and David Carrera
    Publication: TSC-2022
    Summary:

  • AWARE: Automate Workload Autoscaling with Reinforcement Learning in Production Cloud Systems
    Authors: Haoran Qiu and Weichao Mao, University of Illinois at Urbana-Champaign; Chen Wang, Hubertus Franke, and Alaa Youssef, IBM Research; Zbigniew T. Kalbarczyk, Tamer Başar, and Ravishankar K. Iyer, University of Illinois at Urbana-Champaign
    Publication: ATC-2023
    Summary: Workload autoscaling is widely used in public and private cloud systems to maintain stable service performance and save resources. However, it remains challenging to set the optimal resource limits and dynamically scale each workload at runtime. Reinforcement learning (RL) has recently been proposed and applied in various systems tasks, including resource management. In this paper, we first characterize the state-of-the-art RL approaches for workload autoscaling in a public cloud and point out that there is still a large gap in taking the RL advances to production systems. We then propose AWARE, an extensible framework for deploying and managing RL-based agents in production systems. AWARE leverages meta-learning and bootstrapping to (a) automatically and quickly adapt to different workloads, and (b) provide safe and robust RL exploration. AWARE provides a common OpenAI Gym-like RL interface to agent developers for easy integration with different systems tasks. We illustrate the use of AWARE in the case of workload autoscaling. Our experiments show that AWARE adapts a learned autoscaling policy to new workloads 5.5x faster than the existing transfer-learning-based approach and provides stable online policy-serving performance with less than 3.6% reward degradation. With bootstrapping, AWARE helps achieve 47.5% and 39.2% higher CPU and memory utilization while reducing SLO violations by a factor of 16.9x during policy training.

  • DeepScaler: Holistic Autoscaling for Microservices Based on Spatiotemporal GNN with Adaptive Graph Learning
    Authors: Chunyang Meng, Shijie Song, Haogang Tong, Maolin Pan, Yang Yu
    Publication: ASE-2023
    Summary:

  • HRA: An Intelligent Holistic Resource Autoscaling Framework for Multi-service Applications
    Authors: Chunyang Meng, Jingwan Tong, Maolin Pan, Yang Yu
    Publication: ICWS-2022
    Summary:

  • DeepScaling: microservices autoscaling for stable CPU utilization in large scale cloud systems
    Authors: Ziliang Wang, Shiyi Zhu, Jianguo Li, Wei Jiang, K. K. Ramakrishnan, Yangfei Zheng, Meng Yan, Xiaohong Zhang, Alex X. Liu
    Publication: SoCC-2022
    Summary:

  • DeepScaling: Autoscaling Microservices With Stable CPU Utilization for Large Scale Production Cloud Systems
    Authors: Ziliang Wang; Shiyi Zhu; Jianguo Li; Wei Jiang; K. K. Ramakrishnan; Meng Yan
    Publication: IEEE/ACM Transactions on Networking-2024
    Summary:

Edge Computing

  • (DRL-based edge computing scheduling papers can be listed here.)

Paper Template

  • Title
    Authors: Author list
    Publication: Pub-info
    Summary:

Open‑Source Projects

(Open‑source DRL scheduling/autoscaling implementations will be listed here.)


Datasets

(Datasets for DRL training and evaluation — job traces, metrics logs, cloud simulators, etc.)

  • alibaba/clusterdata – Alibaba Cluster Trace Program, cluster data collected from production clusters in Alibaba for cluster management research.
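Working with these traces usually starts with parsing the headerless CSV files into typed records. The sketch below assumes the batch-task column layout documented for the v2018 trace (task name, instance count, job name, type, status, start/end timestamps, planned CPU and memory); verify the column order against the repository's schema files before relying on it:

```python
import csv
import io

# Assumed column order for batch_task.csv in cluster-trace-v2018;
# check against alibaba/clusterdata's schema documentation.
BATCH_TASK_COLS = ["task_name", "instance_num", "job_name", "task_type",
                   "status", "start_time", "end_time", "plan_cpu", "plan_mem"]

def load_batch_tasks(fileobj):
    """Yield dicts for terminated batch tasks with numeric fields parsed."""
    for row in csv.DictReader(fileobj, fieldnames=BATCH_TASK_COLS):
        if row["status"] != "Terminated":
            continue
        yield {
            "job": row["job_name"],
            "duration_s": int(row["end_time"]) - int(row["start_time"]),
            "plan_cpu": float(row["plan_cpu"]),  # per trace docs, 100 = one core
            "plan_mem": float(row["plan_mem"]),  # normalized memory request
        }

# Tiny in-memory sample standing in for the real (multi-GB) trace file.
sample = io.StringIO(
    "task_1,1,j_42,1,Terminated,100,160,50,0.2\n"
    "task_2,1,j_42,1,Failed,100,120,50,0.2\n"
)
tasks = list(load_batch_tasks(sample))
```

Records parsed this way (job id, duration, resource request) are the usual raw material for building DRL training environments from the trace.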

Tools & Frameworks

(Tools and DRL frameworks that facilitate building and testing scheduling/autoscaling agents.)


Contributing

Contributions welcome!

Please follow these steps:

  1. Fork this repository.
  2. Create a branch for your addition (paper, dataset, project, etc.).
  3. Add your entry including title, link, authors, and a brief summary.
  4. Submit a pull request—maintainers will review and merge.

References

This collection was inspired by several GitHub repositories and related works. We sincerely thank their maintainers and authors for their foundational contributions and for providing the community a reference point to build upon.


License

This repository is licensed under the MIT License. See the LICENSE file for details.
