Skip to content

IntelligentDDS/CloudHeal

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CloudHeal: A Lightweight Online Learning Based Self-Healing Framework for Cloud-Native Systems

With the increasing scale and complexity of cloud-native systems, faults have become common and impact the end-user experience. Therefore, it is necessary to propose efficient self-healing mechanisms to ensure reliability and reduce service disruptions. However, traditional self-healing approaches, including rule-based and offline machine learning methods, suffer from inefficiency and limited adaptability. In this paper, we propose CloudHeal, a lightweight online learning-based self-healing framework that dynamically optimizes fault mitigation strategies for cloud-native environments. CloudHeal leverages Spectrum-Based Fault Localization (SBFL) and feedback-enhanced PageRank for root cause identification, significantly reducing unnecessary self-healing actions. To support adaptive decision-making, CloudHeal employs a context-aware online learning algorithm that continuously refines fault recovery strategies based on real-time system feedback. Experimental evaluations on microservice benchmarks demonstrate that CloudHeal achieves a 94.8% fault recovery rate, outperforming existing fault self-healing methods. Additionally, the Root Cause Analyzer in CloudHeal improves fault localization accuracy by 16.9% compared to state-of-the-art approaches, further enhancing self-healing effectiveness. These results highlight CloudHeal’s potential in advancing self-healing capabilities for large-scale cloud-native systems.

cloudsh

Figure 1. Overview of CloudHeal.

Quickly Start

1️⃣ Deployment of Online Boutique.

2️⃣ Replace the configurations in sh/action.py.

deploy_nodes = [
    "x.x.x.x",
]

3️⃣ Deployment of tempo, Prometheus and Loki.

4️⃣ Replace the configurations in utils.py.

METRIC_ENDPOINT = "http://x.x.x.x:xxxx/"
TRACE_ENDPOINT = "http://x.x.x.x:xxxx/"
LOG_ENDPOINT = "http://x.x.x.x:xxxx/"

5️⃣ Run the code

python sh_linucb.py

File Content

.
├── data # Some experimental data
├── log
├── query # Get traces and metrics of Online Boutique
│   ├── log
│   ├── log_query.py
│   ├── metric_query.py
│   └── trace_query.py
├── rca  # Root Cause Analyzer
│   ├── detector.py # Anomaly Detector
│   ├── pagerank.py	# Feedback-driven PageRank
│   ├── preprocess.py
│   └── sbfl.py			# Spectrum-Based Fault Localization
├── sh	# Self-Healing Engine
│   ├── action.py		
│   ├── mab.py
│   ├── model
│   └── yaml
├── sh_linucb.py # Main code
└── utils.py    

About

The official implementation of CloudHeal (SRDS 2025)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%