This repo provides the official code for "Subgraphs as First-Class Citizens in Incident Management for Large-Scale Online Systems: An Evolution-Aware Framework" (TSE 2025).
This work extends our previous publication, "Graph based Incident Extraction and Diagnosis in Large-Scale Online Systems", presented at the 37th IEEE/ACM International Conference on Automated Software Engineering (ASE 2022).
- ./data contains the simulation environment dataset and the open-source datasets that help with understanding and reproducing each step of GEM.
- ./src contains the implementation of GEM extracted for reproduction.
- ./demo contains IPython notebooks with examples showing how each step of GEM is performed. Their order is as follows:
  - anomaly_detection_and_impact_extraction.ipynb contains code for telemetry data anomaly detection and impact extraction.
  - data_labelling.ipynb contains code for data labelling using fault injection records.
  - feature_engineering.ipynb contains code for feature engineering.
  - incident_detection.ipynb contains code for training and testing the graph neural network based model for incident detection on the simulation environment dataset.
  - incident_diagnosis_using_edge_clues.ipynb contains code for incident diagnosis on the simulation environment dataset using edge clues.
  - incident_diagnosis_using_node_clues_with_continual_optimization_OB.ipynb contains code for incident diagnosis on dataset OB using node clues with continual optimization.
  - incident_diagnosis_using_node_clues_with_continual_optimization_AIOPS2021.ipynb contains code for incident diagnosis on dataset AIOPS2021 using node clues with continual optimization.
The GEM framework has different requirements for incident detection and incident diagnosis components. Install the appropriate dependencies based on your use case:
The following two requirements files for incident detection are tested on Python 3.7 and Python 3.8 (recommended), respectively.
pip install -r requirements_for_incident_detection_py37.txt
pip install -r requirements_for_incident_detection_py38.txt
The following requirements file for incident diagnosis is tested on Python 3.8.
pip install -r requirements_for_incident_diagnosis.txt
The GEM framework follows a sequential workflow for incident management in large-scale online systems. Follow these steps in order:
Start by detecting anomalies in raw telemetry data and extracting their impact:
import pandas as pd
import numpy as np
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import networkx as nx
import pickle
# Load monitoring data
all_data = pd.read_csv("./data/calling_relationships_monitoring.csv")
# Run anomaly detection and impact extraction
# See demo/anomaly_detection_and_impact_extraction.ipynb for detailed implementation
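As a rough illustration of what an anomaly detection pass over one metric series can look like, the sketch below flags points that deviate from a trailing-window baseline by more than k standard deviations. This is a generic k-sigma rule, not necessarily the detector used in the notebook; the function name, the window size, and the threshold are assumptions made for the example.

```python
from statistics import mean, pstdev


def ksigma_anomalies(series, window=5, k=3.0):
    """Return indices whose value deviates from the trailing-window mean
    by more than k standard deviations of that window."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        # Use a small floor so that a perfectly flat baseline does not
        # turn every tiny fluctuation into a division-like blow-up.
        if abs(series[i] - mu) > k * max(sigma, 1e-9):
            anomalies.append(i)
    return anomalies


# Hypothetical latency samples (milliseconds) with one obvious spike.
latency_ms = [10, 11, 10, 12, 11, 10, 95, 11, 10, 12]
print(ksigma_anomalies(latency_ms))  # prints [6]
```

Only the spike at index 6 is reported. The points after it are not flagged because the spike itself inflates the standard deviation of the following windows, which is a known limitation of this simple rule.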
Label the data using historical incident reports or fault injection records:
# Load raw topologies
with open('./data/raw_topoloies.pkl', 'rb') as f:
    Topologies = pickle.load(f)
# Perform data labelling
# See demo/data_labelling.ipynb for detailed implementation
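One simple way to label topologies from fault injection records is to match them by time overlap. The sketch below assumes each topology and each injection record carries a start and end timestamp; the field names (start, end, target, is_incident) are hypothetical and may not match the columns of injected_faults.csv or the structure of the pickled topologies.

```python
def label_topologies(topologies, injections):
    """Attach a root-cause label to each topology whose time window
    overlaps a fault injection window."""
    labelled = []
    for topo in topologies:
        label = None
        for inj in injections:
            # Two intervals overlap when each starts before the other ends.
            if topo["start"] < inj["end"] and inj["start"] < topo["end"]:
                label = inj["target"]
                break
        labelled.append(
            {**topo, "is_incident": label is not None, "root_cause": label}
        )
    return labelled


# Hypothetical records with integer timestamps.
topologies = [
    {"id": "t1", "start": 100, "end": 160},
    {"id": "t2", "start": 300, "end": 360},
]
injections = [{"target": "cartservice", "start": 150, "end": 200}]

labelled = label_topologies(topologies, injections)
for item in labelled:
    print(item["id"], item["is_incident"], item["root_cause"])
```

Topologies that overlap no injection window are kept and marked as non-incidents, so they can still serve as negative examples.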
Preprocess the data and create feature vectors for incident detection:
# Feature engineering for graph-based incident detection
# See demo/feature_engineering.ipynb for detailed implementation
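To give a feel for this step, the sketch below collapses each metric time series on an edge into a few summary statistics and concatenates them into a fixed-length vector. The chosen statistics and the helper names are assumptions for illustration; the notebook may derive different features.

```python
from statistics import mean, pstdev


def summarize_series(values):
    """Collapse one metric time series into a fixed-length feature vector."""
    return [mean(values), pstdev(values), max(values), min(values)]


def edge_features(edge_info, metric_names):
    """Concatenate per-metric summaries, in a fixed metric order, for every edge."""
    features = {}
    for edge, metrics in edge_info.items():
        vector = []
        for name in metric_names:
            vector.extend(summarize_series(metrics[name]))
        features[edge] = vector
    return features


# Hypothetical edge metrics in the same shape as the 'edge_info' field of a topology.
edge_info = {
    "node1_node2": {"response_time": [10, 20, 30], "error_rate": [0, 0, 1]},
}
features = edge_features(edge_info, ["response_time", "error_rate"])
print(len(features["node1_node2"]))  # prints 8 (4 statistics x 2 metrics)
```

Fixing the metric order matters: every edge must produce its values in the same positions so that the vectors are comparable across edges.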
Train and test the graph neural network model for incident detection:
import sys
sys.path.append('./src')
from incident_detection import callSpatioDevNet
import torch
from torch_geometric.data import Data, DataLoader
# Load training and test data
with open('./data/train_cases.pkl', 'rb') as f:
    train_cases = pickle.load(f)
with open('./data/test_cases.pkl', 'rb') as f:
    test_cases = pickle.load(f)
# Initialize and train the model
model = callSpatioDevNet.callSpatioDevNet(
    num_epochs=100,
    batch_size=32,
    lr=1e-3,
    hidden_dim=20
)
# Train the model
model.fit(train_cases)
# Test the model
# predict returns the anomaly scores and the extracted features
outputs, features = model.predict(test_cases)
Perform incident diagnosis using either edge clues or node clues:
from incident_diagnosis import incident_diagnosis
# Diagnose incidents using edge clues
# See demo/incident_diagnosis_using_edge_clues.ipynb for detailed implementation
# For Online Boutique dataset
# See demo/incident_diagnosis_using_node_clues_with_continual_optimization_OB.ipynb
# For AIOPS2021 dataset
# See demo/incident_diagnosis_using_node_clues_with_continual_optimization_AIOPS2021.ipynb
For comprehensive examples and detailed implementations, refer to the Jupyter notebooks in the ./demo directory:
- Anomaly Detection: demo/anomaly_detection_and_impact_extraction.ipynb
- Data Labelling: demo/data_labelling.ipynb
- Feature Engineering: demo/feature_engineering.ipynb
- Incident Detection: demo/incident_detection.ipynb
- Incident Diagnosis (Edge Clues): demo/incident_diagnosis_using_edge_clues.ipynb
- Incident Diagnosis (Node Clues - OB): demo/incident_diagnosis_using_node_clues_with_continual_optimization_OB.ipynb
- Incident Diagnosis (Node Clues - AIOPS2021): demo/incident_diagnosis_using_node_clues_with_continual_optimization_AIOPS2021.ipynb
The framework works with the following data files:
- calling_relationships_monitoring.csv: Monitoring data for service call relationships
- injected_faults.csv: Records of injected faults for training
- platform_faults.csv: Platform-level fault information
- raw_topoloies.pkl: Raw topology data
- issue_topoloies.pkl: Processed issue topology data
- train_cases.pkl / test_cases.pkl: Preprocessed training and testing datasets
- AIOPS2021.pkl / OB.pkl: Specific datasets for evaluation
Pre-trained models are available in the ./demo directory:
- FinalModel_OnlineBoutique.pt: Trained model for the Online Boutique dataset
Below is the detailed API documentation for GEM.
A graph neural network-based model for incident detection using spatio-temporal features.
Import:
from src.incident_detection import callSpatioDevNet
Constructor:
model = callSpatioDevNet(
    name='SpatioDevNetPackage',
    num_epochs=10,
    batch_size=32,
    lr=1e-3,
    input_dim=None,
    hidden_dim=20,
    edge_attr_len=60,
    global_fea_len=2,
    num_layers=2,
    edge_module='linear',
    act=True,
    pooling='attention',
    is_bilinear=False,
    nonlinear_scorer=False,
    head=4,
    aggr='mean',
    concat=False,
    dropout=0.4,
    weight_decay=1e-2,
    loss_func='focal_loss',
    seed=None,
    gpu=None,
    ipython=True,
    details=True
)
Key Parameters:
- name (str): Model identifier for saving/loading
- num_epochs (int): Number of training epochs
- batch_size (int): Training batch size
- lr (float): Learning rate for optimization
- input_dim (int): Input feature dimension
- hidden_dim (int): Hidden layer dimension
- edge_attr_len (int): Edge attribute length
- global_fea_len (int): Global feature length
- num_layers (int): Number of GNN layers
- edge_module (str): Edge processing module ('linear' or 'lstm')
- pooling (str): Graph pooling method ('attention', 'max', 'mean', 'add')
- loss_func (str): Loss function ('focal_loss', 'dev_loss', 'cross_entropy')
- dropout (float): Dropout rate for regularization
- seed (int): Random seed for reproducibility
- gpu (int): GPU device ID
Methods:
fit(datalist, valid_list=None, log_step=20, patience=10, valid_proportion=0.0, early_stop_fscore=None)
Train the incident detection model.
Parameters:
- datalist (list): Training data as PyTorch Geometric Data objects
- valid_list (list, optional): Validation dataset
- log_step (int): Logging frequency during training
- patience (int): Early stopping patience
- valid_proportion (float): Proportion of data for validation split
- early_stop_fscore (float, optional): F-score threshold for early stopping
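The patience parameter usually refers to patience-based early stopping: training halts once the validation score has failed to improve for a given number of consecutive epochs. The sketch below shows that general rule on a list of per-epoch validation scores. Whether fit applies exactly this rule is an assumption, and the function name is hypothetical.

```python
def early_stopping_epoch(valid_scores, patience=10):
    """Return (best_epoch, last_epoch_run) for a sequence of validation scores,
    stopping after `patience` consecutive epochs without improvement."""
    best_score, best_epoch, stale = float("-inf"), -1, 0
    for epoch, score in enumerate(valid_scores):
        if score > best_score:
            # New best score: remember it and reset the stale counter.
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return best_epoch, epoch  # stopped early
    return best_epoch, len(valid_scores) - 1  # ran all epochs


# Hypothetical validation F-scores per epoch.
scores = [0.5, 0.6, 0.7, 0.65, 0.66, 0.64, 0.8]
print(early_stopping_epoch(scores, patience=3))   # prints (2, 5)
print(early_stopping_epoch(scores, patience=10))  # prints (6, 6)
```

With a patience of 3 the run stops at epoch 5 and never sees the later improvement at epoch 6, which is the usual trade-off between training time and the risk of stopping too soon.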
predict(datalist)
Predict anomaly scores for input data.
Parameters:
- datalist (list): Input data as PyTorch Geometric Data objects
Returns:
- outputs (numpy.ndarray): Anomaly scores
- features (numpy.ndarray): Extracted features
Perform prediction with cold start using k-nearest neighbors.
Parameters:
- datalist (list): Input data
- n_neighbors (int): Number of neighbors for KNN
Returns:
- knn_preds (list): KNN predictions
- knn_pred_proba (list): KNN prediction probabilities
- knn (object): Trained KNN classifier
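For readers unfamiliar with k-nearest-neighbor prediction, the sketch below classifies a query feature vector by majority vote among its closest labelled examples and reports the vote share as a probability. It is a plain-Python illustration of the general technique, assuming feature vectors are already available; it does not reproduce the classifier object returned by this method, and the function name is hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, n_neighbors=3):
    """Return (label, vote_share) by majority vote among the nearest neighbors."""
    distances = sorted(
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(feature, query))), label)
        for feature, label in zip(train_features, train_labels)
    )
    nearest = distances[:n_neighbors]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / len(nearest)


# Hypothetical 2-D features: label 0 = normal, label 1 = incident.
train_features = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]]
train_labels = [0, 0, 0, 1, 1]

print(knn_predict(train_features, train_labels, [0.2, 0.1]))
print(knn_predict(train_features, train_labels, [5, 5.5]))
```

The first query sits among the normal examples and gets a unanimous vote. The second has only two incident neighbors nearby, so its third neighbor is a normal example and the vote share drops to two thirds.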
Load a pre-trained model.
Parameters:
- model_file (str, optional): Path to model file
Find optimal threshold using binary search for best F1-score.
Parameters:
- labels (array): True labels
- scores (array): Prediction scores
Returns:
- results (tuple): Precision, recall, F-score metrics
- threshold (float): Optimal threshold
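The idea of picking a score threshold that maximizes the F1-score can be sketched as follows. Note the substitution: this example scans every distinct score exhaustively rather than using the binary search described above, because an exhaustive scan is easier to verify on small inputs. The function name is hypothetical.

```python
def best_f1_threshold(labels, scores):
    """Try every distinct score as a threshold and keep the one with the highest F1.
    Returns ((precision, recall, f1), threshold)."""
    best = (0.0, 0.0, 0.0)
    best_threshold = None
    for threshold in sorted(set(scores)):
        preds = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(preds, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(preds, labels) if p and y == 0)
        fn = sum(1 for p, y in zip(preds, labels) if not p and y == 1)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        if f1 > best[2]:
            best, best_threshold = (precision, recall, f1), threshold
    return best, best_threshold


# Hypothetical labels (1 = incident) and anomaly scores.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

(precision, recall, f1), threshold = best_f1_threshold(labels, scores)
print(threshold, precision, recall, round(f1, 3))  # prints 0.35 0.75 1.0 0.857
```

At threshold 0.35 all three incidents are caught at the cost of one false positive, which gives a better F1 than the stricter threshold 0.8 that misses one incident.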
Import:
from src.incident_diagnosis import incident_diagnosis
Calculate edge weights from topology information.
Parameters:
- topology (dict): Network topology with node and edge information
- clue_tag (str): Metric tag for weight calculation
- statistics (dict, optional): Statistical normalization parameters
Returns:
- weight (dict): Calculated edge weights
get_anomaly_graph(topology, node_clue_tags=[], edge_clue_tags=[], a=None, get_edge_weight=None, edge_backward_factor=0.3)
Construct anomaly graph from topology and clues.
Parameters:
- topology (dict): Network topology data
- node_clue_tags (list): Node-level clue tags
- edge_clue_tags (list): Edge-level clue tags
- a (dict, optional): Rescaling factors for clues
- get_edge_weight (function): Edge weight calculation function
- edge_backward_factor (float): Backward edge weight factor
Returns:
- anomaly_graph (networkx.DiGraph): Constructed anomaly graph
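One plausible reading of edge_backward_factor is that every call edge also gets a reverse edge whose weight is the forward weight scaled by that factor, so that a random walk can move back towards callers. The sketch below builds such a graph as a nested dictionary. This interpretation, the function name, and the service names are assumptions; the actual construction in get_anomaly_graph may differ.

```python
def build_anomaly_graph(edges, edge_backward_factor=0.3):
    """Return {source: {target: weight}} with a scaled reverse edge per call edge.

    `edges` maps (caller, callee) tuples to a non-negative anomaly weight.
    """
    graph = {}
    for (caller, callee), weight in edges.items():
        graph.setdefault(caller, {})
        graph.setdefault(callee, {})
        # Forward edge: anomaly evidence flows from caller to callee.
        graph[caller][callee] = graph[caller].get(callee, 0.0) + weight
        # Reverse edge: a weaker path back towards the caller.
        graph[callee][caller] = (
            graph[callee].get(caller, 0.0) + edge_backward_factor * weight
        )
    return graph


# Hypothetical call edges with anomaly weights.
edges = {("frontend", "cart"): 1.0, ("cart", "redis"): 2.0}
graph = build_anomaly_graph(edges)
for source, targets in graph.items():
    print(source, targets)
```

A factor below 1 keeps the walk biased in the call direction while still preventing leaf services from becoming dead ends.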
root_cause_localization(case, node_clue_tags, edge_clue_tags, a, get_edge_weight=None, edge_backward_factor=0.3)
Localize root cause using PageRank on anomaly graph.
Parameters:
- case (dict): Incident case data
- node_clue_tags (list): Node clue tags
- edge_clue_tags (list): Edge clue tags
- a (dict): Clue rescaling factors
- get_edge_weight (function, optional): Edge weight function
- edge_backward_factor (float): Backward propagation factor
Returns:
- root_cause (str): Identified root cause node
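To illustrate how PageRank can rank root cause candidates, the sketch below runs a power-iteration PageRank over a small weighted graph stored as a nested dictionary and picks the top-ranked node. It is a generic textbook implementation, not the code behind root_cause_localization. The graph, the service names, and the damping value of 0.85 (the conventional default) are assumptions, and no claim is made about how the DAMPING constant of this repo maps onto it.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank on {source: {target: weight}} with uniform teleport."""
    nodes = list(graph)
    teleport = 1.0 / len(nodes)
    rank = {n: teleport for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport for n in nodes}
        for source in nodes:
            out_weight = sum(graph[source].values())
            if out_weight == 0:
                # Dangling node: spread its mass uniformly over all nodes.
                for n in nodes:
                    new_rank[n] += damping * rank[source] * teleport
                continue
            for target, weight in graph[source].items():
                new_rank[target] += damping * rank[source] * weight / out_weight
        rank = new_rank
    return rank


# Hypothetical anomaly graph: forward call edges weigh 1.0, reverse edges 0.3.
graph = {
    "frontend": {"cart": 1.0, "checkout": 1.0},
    "cart": {"frontend": 0.3, "redis": 1.0},
    "checkout": {"frontend": 0.3, "redis": 1.0},
    "redis": {"cart": 0.3, "checkout": 0.3},
}
rank = pagerank(graph)
root_cause = max(rank, key=rank.get)
print(root_cause)  # prints redis
```

The shared downstream dependency collects rank from both of its callers, so it ends up ahead of the services that merely propagate the anomaly.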
Generate explanations for incident diagnosis.
Parameters:
- case (dict): Incident case data
- target (str): Target field to explain
Returns:
- explanation (list): Sorted explanation features with power scores
optimize(case, node_clue_tags, edge_clue_tags, a, get_edge_weight, edge_backward_factor, historical_incident_topologies, init_clue_tag, range_a=5, num_trials=100)
Optimize clue weights using historical data and Optuna.
Parameters:
- case (dict): Current incident case
- node_clue_tags (list): Node clue tags
- edge_clue_tags (list): Edge clue tags
- a (dict): Initial clue weights
- get_edge_weight (function): Edge weight calculation function
- edge_backward_factor (float): Backward edge factor
- historical_incident_topologies (list): Historical incident data
- init_clue_tag (str): Initial clue tag
- range_a (float): Optimization range for weights
- num_trials (int): Number of optimization trials
Returns:
- node_clue_tags (list): Updated node clue tags
- a (dict): Optimized clue weights
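The general idea of tuning clue weights against historical incidents can be sketched with a plain random search. Note the substitution: this example does not use Optuna, and it scores a candidate root cause by a weighted sum of node clue values instead of running the full diagnosis pipeline. The case structure (node_clues, root_cause) and all function names are hypothetical.

```python
import random


def top_node(case, a):
    """Return the node with the largest weighted sum of its clue values."""
    def score(node):
        return sum(a[tag] * value for tag, value in case["node_clues"][node].items())
    return max(case["node_clues"], key=score)


def accuracy(cases, a):
    """Fraction of historical cases whose top-scored node is the labelled root cause."""
    return sum(top_node(case, a) == case["root_cause"] for case in cases) / len(cases)


def random_search(cases, tags, range_a=5.0, num_trials=100, seed=0):
    """Sample weights uniformly in [0, range_a] and keep the most accurate set."""
    rng = random.Random(seed)
    best_a = {tag: 1.0 for tag in tags}  # start from equal weights
    best_acc = accuracy(cases, best_a)
    for _ in range(num_trials):
        candidate = {tag: rng.uniform(0.0, range_a) for tag in tags}
        acc = accuracy(cases, candidate)
        if acc > best_acc:
            best_a, best_acc = candidate, acc
    return best_a, best_acc


# Hypothetical historical cases; equal weights misdiagnose the second one.
cases = [
    {"node_clues": {"A": {"cpu_usage": 0.9, "memory_usage": 0.1},
                    "B": {"cpu_usage": 0.2, "memory_usage": 0.7}},
     "root_cause": "A"},
    {"node_clues": {"A": {"cpu_usage": 0.3, "memory_usage": 0.6},
                    "B": {"cpu_usage": 0.6, "memory_usage": 0.1}},
     "root_cause": "B"},
]
tags = ["cpu_usage", "memory_usage"]
baseline = accuracy(cases, {tag: 1.0 for tag in tags})
best_a, best_acc = random_search(cases, tags)
print(baseline, best_acc)
```

Because the search only replaces the current weights when accuracy strictly improves, the result is never worse than the starting point. A sampler such as the one Optuna provides would typically need fewer trials than uniform sampling.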
eval(historical_incident_topologies, node_clue_tags, edge_clue_tags, a, get_edge_weight, edge_backward_factor)
Evaluate diagnosis performance on historical data.
Parameters:
- historical_incident_topologies (list): Historical incident cases
- node_clue_tags (list): Node clue tags
- edge_clue_tags (list): Edge clue tags
- a (dict): Clue weights
- get_edge_weight (function): Edge weight function
- edge_backward_factor (float): Backward edge factor
Returns:
- reward (float): Accuracy reward
- punishment (float): Regularization punishment
topology = {
    'nodes': ['node1', 'node2', ...],
    'edge_info': {
        'node1_node2': {
            'metric1': [values...],
            'metric2': [values...]
        }
    },
    'node_info': {
        'node1': {
            'metric1': [values...],
            'metric2': [values...]
        }
    },
    'root_cause': 'node_id'  # for labeled data
}
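Before feeding a topology into the diagnosis functions, it can help to check that it is internally consistent. The sketch below is a hypothetical helper, not part of GEM. It assumes that edge keys follow the "source_target" pattern suggested by the 'node1_node2' example and that root_cause, when present, must name a known node.

```python
def validate_topology(topology):
    """Return a list of problems; an empty list means the structure looks consistent."""
    problems = []
    nodes = set(topology.get("nodes", []))
    # Assumed key convention, inferred from the example: "<source>_<target>".
    valid_edge_keys = {f"{u}_{v}" for u in nodes for v in nodes if u != v}
    for edge in topology.get("edge_info", {}):
        if edge not in valid_edge_keys:
            problems.append(f"edge_info key {edge!r} does not join two known nodes")
    for node in topology.get("node_info", {}):
        if node not in nodes:
            problems.append(f"node_info key {node!r} is not listed in nodes")
    root_cause = topology.get("root_cause")
    if root_cause is not None and root_cause not in nodes:
        problems.append(f"root_cause {root_cause!r} is not listed in nodes")
    return problems


# Hypothetical, minimal topology with concrete values.
topology = {
    "nodes": ["node1", "node2"],
    "edge_info": {"node1_node2": {"metric1": [0.1, 0.2]}},
    "node_info": {"node1": {"metric1": [1.0, 2.0]}},
    "root_cause": "node1",
}
print(validate_topology(topology))  # prints []
```

Matching edge keys against all node pairs, instead of splitting on the underscore, keeps the check correct even when node names themselves contain underscores.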
from torch_geometric.data import Data
data = Data(
    x=node_features,        # Node feature matrix
    edge_index=edge_index,  # Edge connectivity
    edge_attr=edge_attr,    # Edge features
    global_x=global_features,  # Global features
    y=label,                # Graph label
    batch=batch_vector      # Batch assignment
)
- DAMPING = 0.1: PageRank damping factor for root cause localization
# Incident Detection
from src.incident_detection import callSpatioDevNet
model = callSpatioDevNet(
    num_epochs=100,
    batch_size=32,
    lr=1e-3,
    hidden_dim=20
)
model.fit(train_data)
scores, features = model.predict(test_data)
# Incident Diagnosis
from src.incident_diagnosis.incident_diagnosis import (
    get_weight_from_edge_info,
    root_cause_localization,
    optimize
)
# Diagnose root cause
root_cause = root_cause_localization(
    case=incident_case,
    node_clue_tags=['cpu_usage', 'memory_usage'],
    edge_clue_tags=['response_time', 'error_rate'],
    a={'cpu_usage': 1.0, 'memory_usage': 0.8},
    get_edge_weight=get_weight_from_edge_info
)
- Ensure you have the correct Python version (3.7+)
- Install PyTorch with appropriate CUDA support if using GPU
- For torch_geometric installation issues, refer to the official documentation
- Make sure all data files are in the correct ./data directory
If you find this work useful, please cite our paper:
@article{he2025subgraphs,
title={Subgraphs as First-Class Citizens in Incident Management for Large-Scale Online Systems: An Evolution-Aware Framework},
author={He, Zilong and Chen, Pengfei and Luo, Yu and Yan, Qiuyu and Chen, Hongyang and Yu, Guangba and Li, Fangyuan and Li, Xiaoyun and Zheng, Zibin},
journal={IEEE Transactions on Software Engineering},
year={2025},
publisher={IEEE}
}
@inproceedings{DBLP:conf/kbse/HeCLYCYL22,
author = {Zilong He and
Pengfei Chen and
Yu Luo and
Qiuyu Yan and
Hongyang Chen and
Guangba Yu and
Fangyuan Li},
title = {Graph based Incident Extraction and Diagnosis in Large-Scale Online
Systems},
booktitle = {37th {IEEE/ACM} International Conference on Automated Software Engineering,
{ASE} 2022, Rochester, MI, USA, October 10-14, 2022},
pages = {48:1--48:13},
publisher = {{ACM}},
year = {2022},
url = {https://doi.org/10.1145/3551349.3556904},
doi = {10.1145/3551349.3556904},
timestamp = {Thu, 22 Jun 2023 07:45:51 +0200},
biburl = {https://dblp.org/rec/conf/kbse/HeCLYCYL22.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}