GEM

This repository provides the official code for "Subgraphs as First-Class Citizens in Incident Management for Large-Scale Online Systems: An Evolution-Aware Framework" (TSE 2025).

📖 Introduction

This work extends our previous publication, "Graph based Incident Extraction and Diagnosis in Large-Scale Online Systems", presented at the 37th IEEE/ACM International Conference on Automated Software Engineering (ASE 2022).

🏗️ Project Structure

./data contains the simulation environment dataset and the open-source datasets, provided to help you understand and reproduce each step of GEM.
./src contains the implementation of GEM, extracted for reproduction.
./demo contains IPython notebooks that show how each step of GEM is performed. They are meant to be run in the following order:
- anomaly_detection_and_impact_extraction.ipynb contains code for telemetry data anomaly detection and impact extraction.
- data_labelling.ipynb contains code for data labelling using fault injection records.
- feature_engineering.ipynb contains code for feature engineering.
- incident_detection.ipynb contains code for the graph neural networks based model training and testing for incident detection on the simulation environment dataset.
- incident_diagnosis_using_edge_clues.ipynb contains code for the incident diagnosis on the simulation environment dataset using edge clues.
- incident_diagnosis_using_node_clues_with_continual_optimization_OB.ipynb contains code for the incident diagnosis on dataset OB using node clues with continual optimization.
- incident_diagnosis_using_node_clues_with_continual_optimization_AIOPS2021.ipynb contains code for the incident diagnosis on dataset AIOPS2021 using node clues with continual optimization.

🚀 Usage

⚙️ Prerequisites

The GEM framework has different requirements for incident detection and incident diagnosis components. Install the appropriate dependencies based on your use case:

🔍 For Incident Detection

The following two requirements files are tested on Python 3.7 and Python 3.8 (recommended), respectively. Install the one that matches your Python version.

pip install -r requirements_for_incident_detection_py37.txt

pip install -r requirements_for_incident_detection_py38.txt

🩺 For Incident Diagnosis

The following dependency is tested on Python 3.8.

pip install -r requirements_for_incident_diagnosis.txt
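The choice of requirements file can be sketched as a small helper. This is only an illustration of the version matrix stated above; `pick_requirements` and the component names "detection" and "diagnosis" are hypothetical and not part of the repository.

```python
import sys

def pick_requirements(component, version=None):
    """Return the requirements file tested for a component and Python version.

    Returns None for combinations this README does not list as tested.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    if component == "detection":
        if major_minor == (3, 7):
            return "requirements_for_incident_detection_py37.txt"
        if major_minor == (3, 8):
            return "requirements_for_incident_detection_py38.txt"
    elif component == "diagnosis":
        if major_minor == (3, 8):
            return "requirements_for_incident_diagnosis.txt"
    else:
        raise ValueError("component must be 'detection' or 'diagnosis'")
    return None  # untested combination

# Which file to pass to `pip install -r` for the running interpreter, if any.
chosen = pick_requirements("detection")
```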

⚡ Quick Start

The GEM framework follows a sequential workflow for incident management in large-scale online systems. Follow these steps in order:

Step 1: Anomaly Detection and Impact Extraction

Start by detecting anomalies in raw telemetry data and extracting their impact:

import pandas as pd
import numpy as np
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import networkx as nx
import pickle

# Load monitoring data
all_data = pd.read_csv("./data/calling_relationships_monitoring.csv")

# Run anomaly detection and impact extraction
# See demo/anomaly_detection_and_impact_extraction.ipynb for detailed implementation
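To give a feel for this step without opening the notebook, here is a minimal sketch that flags anomalous points with a k-sigma rule and keeps only the call edges that show anomalies. It is an assumption-laden stand-in, not the method used in the notebook: the function names, the baseline window, and the sample response times are all hypothetical.

```python
from statistics import mean, stdev

def k_sigma_anomalies(series, baseline_len, k=3.0):
    """Return indices after the baseline that deviate more than k sigma from it."""
    baseline = series[:baseline_len]
    mu = mean(baseline)
    sigma = stdev(baseline) or 1e-9  # avoid a zero-width band on constant baselines
    return [i for i in range(baseline_len, len(series))
            if abs(series[i] - mu) > k * sigma]

def impacted_edges(edge_series, baseline_len, k=3.0):
    """Map each call edge to its anomalous time indices, dropping healthy edges."""
    impact = {}
    for edge, series in edge_series.items():
        hits = k_sigma_anomalies(series, baseline_len, k)
        if hits:
            impact[edge] = hits
    return impact

# Hypothetical response times (ms) per call edge; the first six points are the baseline.
edge_series = {
    ("frontend", "cart"): [100, 102, 98, 101, 99, 100, 250, 260],
    ("frontend", "ads"): [50, 51, 49, 50, 52, 50, 51, 49],
}
impact = impacted_edges(edge_series, baseline_len=6)
```

Here only the frontend-to-cart edge is kept, because its last two points fall far outside the baseline band.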

Step 2: Data Labelling

Label the data using historical incident reports or fault injection records:

# Load raw topologies
with open('./data/raw_topoloies.pkl', 'rb') as f:
    Topologies = pickle.load(f)

# Perform data labelling
# See demo/data_labelling.ipynb for detailed implementation
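Labelling with fault injection records can be pictured as a time-window join. The sketch below assumes each topology carries `start` and `end` timestamps and each fault record has `target`, `start`, and `end` fields; these field names are hypothetical, and the notebook may use a different matching rule.

```python
def label_topologies(topologies, faults):
    """Mark a topology as an incident if an injected fault overlaps it in time
    and targets one of its nodes."""
    labelled = []
    for topo in topologies:
        record = dict(topo, is_incident=False, root_cause=None)
        for fault in faults:
            overlaps = topo["start"] <= fault["end"] and fault["start"] <= topo["end"]
            if overlaps and fault["target"] in topo["nodes"]:
                record.update(is_incident=True, root_cause=fault["target"])
                break  # keep the first matching fault
        labelled.append(record)
    return labelled

# Hypothetical inputs: two extracted topologies and one injected fault.
topologies = [
    {"nodes": ["frontend", "cart"], "start": 0, "end": 60},
    {"nodes": ["frontend", "ads"], "start": 100, "end": 160},
]
faults = [{"target": "cart", "start": 30, "end": 90}]
labelled = label_topologies(topologies, faults)
```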

Step 3: Feature Engineering

Preprocess the data and create feature vectors for incident detection:

# Feature engineering for graph-based incident detection
# See demo/feature_engineering.ipynb for detailed implementation
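One plausible ingredient of this step is turning a variable-length metric series into a fixed-length, scaled vector. The sketch below assumes that idea and uses a target length of 60, which matches the default `edge_attr_len` of the model constructor; the helper names and the scaling choice are hypothetical and may differ from the notebook.

```python
def to_fixed_length(values, length=60, pad_value=0.0):
    """Keep the most recent `length` points, left-padding shorter series."""
    if length <= 0:
        raise ValueError("length must be positive")
    tail = list(values)[-length:]
    return [pad_value] * (length - len(tail)) + tail

def min_max_scale(values):
    """Scale values into [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def edge_feature(values, length=60):
    """Scale a metric series, then pad or truncate it to a fixed length."""
    return to_fixed_length(min_max_scale(values), length)

# A short series is scaled first and then left-padded with zeros.
feature = edge_feature([10, 20, 30], length=5)
```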

Step 4: Incident Detection

Train and test the graph neural network model for incident detection:

import sys
sys.path.append('./src')
from incident_detection import callSpatioDevNet
import torch
from torch_geometric.data import Data, DataLoader

# Load training and test data
with open('./data/train_cases.pkl', 'rb') as f:
    train_cases = pickle.load(f)
    
with open('./data/test_cases.pkl', 'rb') as f:
    test_cases = pickle.load(f)

# Initialize and train the model
model = callSpatioDevNet.callSpatioDevNet(
    num_epochs=100,
    batch_size=32,
    lr=1e-3,
    hidden_dim=20
)

# Train the model
model.fit(train_cases)

# Test the model: predict() returns anomaly scores and extracted features
scores, features = model.predict(test_cases)

Step 5: Incident Diagnosis

Perform incident diagnosis using either edge clues or node clues:

Using Edge Clues

from incident_diagnosis import incident_diagnosis

# Diagnose incidents using edge clues
# See demo/incident_diagnosis_using_edge_clues.ipynb for detailed implementation

Using Node Clues with Continual Optimization

# For Online Boutique dataset
# See demo/incident_diagnosis_using_node_clues_with_continual_optimization_OB.ipynb

# For AIOPS2021 dataset
# See demo/incident_diagnosis_using_node_clues_with_continual_optimization_AIOPS2021.ipynb

📚 Detailed Examples

For comprehensive examples and detailed implementations, refer to the Jupyter notebooks in the ./demo directory:

Anomaly Detection: demo/anomaly_detection_and_impact_extraction.ipynb
Data Labelling: demo/data_labelling.ipynb
Feature Engineering: demo/feature_engineering.ipynb
Incident Detection: demo/incident_detection.ipynb
Incident Diagnosis (Edge Clues): demo/incident_diagnosis_using_edge_clues.ipynb
Incident Diagnosis (Node Clues - OB): demo/incident_diagnosis_using_node_clues_with_continual_optimization_OB.ipynb
Incident Diagnosis (Node Clues - AIOPS2021): demo/incident_diagnosis_using_node_clues_with_continual_optimization_AIOPS2021.ipynb

🏷️ Data Files

The framework works with the following data files:

calling_relationships_monitoring.csv: Monitoring data for service call relationships
injected_faults.csv: Records of injected faults for training
platform_faults.csv: Platform-level fault information
raw_topoloies.pkl: Raw topology data
issue_topoloies.pkl: Processed issue topology data
train_cases.pkl / test_cases.pkl: Preprocessed training and testing datasets
AIOPS2021.pkl / OB.pkl: Specific datasets for evaluation

🤖 Model Files

Pre-trained models are available in the ./demo directory:

FinalModel_OnlineBoutique.pt: Trained model for Online Boutique dataset

📋 API Reference

Below is the detailed API documentation for GEM.

🔍 Incident Detection Module

`callSpatioDevNet`

A graph neural network-based model for incident detection using spatio-temporal features.

Import:

from src.incident_detection.callSpatioDevNet import callSpatioDevNet

Constructor:

model = callSpatioDevNet(
    name='SpatioDevNetPackage',
    num_epochs=10,
    batch_size=32,
    lr=1e-3,
    input_dim=None,
    hidden_dim=20,
    edge_attr_len=60,
    global_fea_len=2,
    num_layers=2,
    edge_module='linear',
    act=True,
    pooling='attention',
    is_bilinear=False,
    nonlinear_scorer=False,
    head=4,
    aggr='mean',
    concat=False,
    dropout=0.4,
    weight_decay=1e-2,
    loss_func='focal_loss',
    seed=None,
    gpu=None,
    ipython=True,
    details=True
)

Key Parameters:

name (str): Model identifier for saving/loading
num_epochs (int): Number of training epochs
batch_size (int): Training batch size
lr (float): Learning rate for optimization
input_dim (int): Input feature dimension
hidden_dim (int): Hidden layer dimension
edge_attr_len (int): Edge attribute length
global_fea_len (int): Global feature length
num_layers (int): Number of GNN layers
edge_module (str): Edge processing module ('linear' or 'lstm')
pooling (str): Graph pooling method ('attention', 'max', 'mean', 'add')
loss_func (str): Loss function ('focal_loss', 'dev_loss', 'cross_entropy')
dropout (float): Dropout rate for regularization
seed (int): Random seed for reproducibility
gpu (int): GPU device ID

Methods:

`fit(datalist, valid_list=None, log_step=20, patience=10, valid_proportion=0.0, early_stop_fscore=None)`

Train the incident detection model.

Parameters:

datalist (list): Training data as PyTorch Geometric Data objects
valid_list (list, optional): Validation dataset
log_step (int): Logging frequency during training
patience (int): Early stopping patience
valid_proportion (float): Proportion of data for validation split
early_stop_fscore (float, optional): F-score threshold for early stopping

`predict(datalist)`

Predict anomaly scores for input data.

Parameters:

datalist (list): Input data as PyTorch Geometric Data objects

Returns:

outputs (numpy.ndarray): Anomaly scores
features (numpy.ndarray): Extracted features

`cold_start_predict(datalist, n_neighbors=3)`

Perform prediction with cold start using k-nearest neighbors.

Parameters:

datalist (list): Input data
n_neighbors (int): Number of neighbors for KNN

Returns:

knn_preds (list): KNN predictions
knn_pred_proba (list): KNN prediction probabilities
knn (object): Trained KNN classifier
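The k-nearest-neighbors idea behind this method can be sketched in plain Python. This is not the repository's implementation: it assumes the neighbors are searched among feature vectors of already-labelled cases, and `knn_predict` and its inputs are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_features, train_labels, query, n_neighbors=3):
    """Return (majority label, vote share) among the nearest labelled vectors."""
    distances = sorted(
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(feat, query))), label)
        for feat, label in zip(train_features, train_labels)
    )
    k = min(n_neighbors, len(distances))
    votes = Counter(label for _, label in distances[:k])
    label, count = votes.most_common(1)[0]
    return label, count / k

# Hypothetical 2-D features: three normal cases (0) and two incidents (1).
train_features = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5]]
train_labels = [0, 0, 0, 1, 1]
label, proba = knn_predict(train_features, train_labels, query=[5, 6])
```

The query sits next to the two incident cases, so two of its three nearest neighbors vote for label 1.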

`load(model_file=None)`

Load a pre-trained model.

Parameters:

model_file (str, optional): Path to model file

Utility Functions

`bf_search(labels, scores)`

Find optimal threshold using binary search for best F1-score.

Parameters:

labels (array): True labels
scores (array): Prediction scores

Returns:

results (tuple): Precision, recall, F-score metrics
threshold (float): Optimal threshold
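Threshold selection for the best F-score can be illustrated with a short sketch. Note that it scans every distinct score exhaustively rather than using the search strategy of `bf_search`, and only mirrors the documented return shape of metrics plus threshold; `search_threshold` and the sample data are hypothetical.

```python
def metrics_at(labels, scores, threshold):
    """Precision, recall and F1 when scores >= threshold are predicted positive."""
    preds = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def search_threshold(labels, scores):
    """Try every distinct score as a threshold and keep the one with the best F1."""
    best_metrics, best_threshold = (0.0, 0.0, 0.0), None
    for threshold in sorted(set(scores)):
        current = metrics_at(labels, scores, threshold)
        if current[2] > best_metrics[2]:
            best_metrics, best_threshold = current, threshold
    return best_metrics, best_threshold

labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
(precision, recall, f1), threshold = search_threshold(labels, scores)
```

On this toy data the best threshold is 0.35, which recovers all three positives at the cost of one false positive.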

🩺 Incident Diagnosis Module

Core Functions

Import:

from src.incident_diagnosis import incident_diagnosis

`get_weight_from_edge_info(topology, clue_tag, statistics=None)`

Calculate edge weights from topology information.

Parameters:

topology (dict): Network topology with node and edge information
clue_tag (str): Metric tag for weight calculation
statistics (dict, optional): Statistical normalization parameters

Returns:

weight (dict): Calculated edge weights

`get_anomaly_graph(topology, node_clue_tags=[], edge_clue_tags=[], a=None, get_edge_weight=None, edge_backward_factor=0.3)`

Construct anomaly graph from topology and clues.

Parameters:

topology (dict): Network topology data
node_clue_tags (list): Node-level clue tags
edge_clue_tags (list): Edge-level clue tags
a (dict, optional): Rescaling factors for clues
get_edge_weight (function): Edge weight calculation function
edge_backward_factor (float): Backward edge weight factor

Returns:

anomaly_graph (networkx.DiGraph): Constructed anomaly graph

`root_cause_localization(case, node_clue_tags, edge_clue_tags, a, get_edge_weight=None, edge_backward_factor=0.3)`

Localize root cause using PageRank on anomaly graph.

Parameters:

case (dict): Incident case data
node_clue_tags (list): Node clue tags
edge_clue_tags (list): Edge clue tags
a (dict): Clue rescaling factors
get_edge_weight (function, optional): Edge weight function
edge_backward_factor (float): Backward propagation factor

Returns:

root_cause (str): Identified root cause node
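The PageRank-based localization described here can be approximated by a small personalized PageRank in plain Python. This is a sketch under assumptions, not the repository's code: the walk follows call edges with their anomaly weight, adds reverse edges scaled by a backward factor of 0.3, and restarts according to per-node anomaly scores. The functions, the `alpha` parameter, and the example values are hypothetical.

```python
def build_walk_graph(edge_weights, backward_factor=0.3):
    """Forward edges keep their weight; reverse edges get a scaled-down copy."""
    walk = {}
    for (u, v), w in edge_weights.items():
        walk.setdefault(u, {})
        walk.setdefault(v, {})
        walk[u][v] = walk[u].get(v, 0.0) + w
        walk[v][u] = walk[v].get(u, 0.0) + backward_factor * w
    return walk

def personalized_pagerank(walk, node_scores, alpha=0.85, iterations=100):
    """Power iteration; alpha is the probability of following an edge."""
    nodes = list(walk)
    total = sum(node_scores.get(n, 0.0) for n in nodes)
    if total:
        restart = {n: node_scores.get(n, 0.0) / total for n in nodes}
    else:
        restart = {n: 1.0 / len(nodes) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1 - alpha) * restart[n] for n in nodes}
        for u in nodes:
            out = sum(walk[u].values())
            if out == 0:  # dangling node: redistribute its mass via the restart vector
                for n in nodes:
                    nxt[n] += alpha * rank[u] * restart[n]
            else:
                for v, w in walk[u].items():
                    nxt[v] += alpha * rank[u] * w / out
        rank = nxt
    return rank

# Hypothetical case: the call into cart is strongly anomalous, cart's call to db is not.
edge_weights = {("frontend", "cart"): 0.9, ("cart", "db"): 0.1, ("frontend", "ads"): 0.1}
node_scores = {"cart": 0.8, "frontend": 0.3, "db": 0.1, "ads": 0.0}
rank = personalized_pagerank(build_walk_graph(edge_weights), node_scores)
root_cause = max(rank, key=rank.get)
```

The node with the highest rank is reported as the root cause candidate; in this toy case that is cart.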

`explain(case, target='root_cause')`

Generate explanations for incident diagnosis.

Parameters:

case (dict): Incident case data
target (str): Target field to explain

Returns:

explanation (list): Sorted explanation features with power scores

`optimize(case, node_clue_tags, edge_clue_tags, a, get_edge_weight, edge_backward_factor, historical_incident_topologies, init_clue_tag, range_a=5, num_trials=100)`

Optimize clue weights using historical data and Optuna.

Parameters:

case (dict): Current incident case
node_clue_tags (list): Node clue tags
edge_clue_tags (list): Edge clue tags
a (dict): Initial clue weights
get_edge_weight (function): Edge weight calculation function
edge_backward_factor (float): Backward edge factor
historical_incident_topologies (list): Historical incident data
init_clue_tag (str): Initial clue tag
range_a (float): Optimization range for weights
num_trials (int): Number of optimization trials

Returns:

node_clue_tags (list): Updated node clue tags
a (dict): Optimized clue weights

`eval(historical_incident_topologies, node_clue_tags, edge_clue_tags, a, get_edge_weight, edge_backward_factor)`

Evaluate diagnosis performance on historical data.

Parameters:

historical_incident_topologies (list): Historical incident cases
node_clue_tags (list): Node clue tags
edge_clue_tags (list): Edge clue tags
a (dict): Clue weights
get_edge_weight (function): Edge weight function
edge_backward_factor (float): Backward edge factor

Returns:

reward (float): Accuracy reward
punishment (float): Regularization punishment
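The interplay between evaluating on historical incidents and optimizing clue weights can be sketched as follows. This sketch swaps Optuna for plain random search and swaps the PageRank-based diagnosis for a toy weighted sum, so it only illustrates the reward-minus-punishment objective. All function names, the penalty value, and the simplified `node_info` layout (one pre-aggregated score per clue tag) are assumptions.

```python
import random

def diagnose(case, a):
    """Toy stand-in for diagnosis: pick the node with the largest weighted clue sum."""
    def score(node):
        return sum(w * case["node_info"][node].get(tag, 0.0) for tag, w in a.items())
    return max(case["nodes"], key=score)

def evaluate(history, a, penalty=0.01):
    """Reward is diagnosis accuracy; punishment is an L1 penalty on the weights."""
    reward = sum(diagnose(c, a) == c["root_cause"] for c in history) / len(history)
    punishment = penalty * sum(abs(w) for w in a.values())
    return reward, punishment

def random_search(history, a, range_a=5, num_trials=100, seed=0):
    """Sample weights in [0, range_a] and keep the best reward minus punishment."""
    rng = random.Random(seed)
    best_a = dict(a)
    reward, punishment = evaluate(history, best_a)
    best_value = reward - punishment
    for _ in range(num_trials):
        candidate = {tag: rng.uniform(0, range_a) for tag in a}
        reward, punishment = evaluate(history, candidate)
        if reward - punishment > best_value:
            best_a, best_value = candidate, reward - punishment
    return best_a, best_value

# Hypothetical history in which the cpu clue points at the true root cause.
history = [
    {"nodes": ["a", "b"], "root_cause": "a",
     "node_info": {"a": {"cpu": 0.9, "mem": 0.1}, "b": {"cpu": 0.2, "mem": 0.8}}},
    {"nodes": ["a", "b"], "root_cause": "b",
     "node_info": {"a": {"cpu": 0.1, "mem": 0.9}, "b": {"cpu": 0.8, "mem": 0.3}}},
]
best_a, best_value = random_search(history, {"cpu": 0.0, "mem": 1.0})
```

Starting from weights that trust only the misleading mem clue, the search ends with a larger weight on cpu than on mem.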

📊 Data Structures

Topology Format

topology = {
    'nodes': ['node1', 'node2', ...],
    'edge_info': {
        'node1_node2': {
            'metric1': [values...],
            'metric2': [values...]
        }
    },
    'node_info': {
        'node1': {
            'metric1': [values...],
            'metric2': [values...]
        }
    },
    'root_cause': 'node_id'  # for labeled data
}
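A quick structural check can catch malformed topologies before they reach the diagnosis functions. The sketch below assumes edge keys follow the `<source>_<target>` pattern suggested by the example above; `validate_topology` is a hypothetical helper, not part of the repository.

```python
def validate_topology(topology):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    nodes = set(topology.get("nodes", []))
    for node in topology.get("node_info", {}):
        if node not in nodes:
            problems.append(f"node_info refers to unknown node {node!r}")
    # Enumerate node pairs instead of splitting on '_' because node names may contain it.
    valid_edge_keys = {f"{u}_{v}" for u in nodes for v in nodes}
    for edge in topology.get("edge_info", {}):
        if edge not in valid_edge_keys:
            problems.append(f"edge_info key {edge!r} does not join two known nodes")
    root_cause = topology.get("root_cause")
    if root_cause is not None and root_cause not in nodes:
        problems.append(f"root_cause {root_cause!r} is not in nodes")
    return problems

topology = {
    "nodes": ["node1", "node2"],
    "edge_info": {"node1_node2": {"metric1": [0.1, 0.2]}},
    "node_info": {"node1": {"metric1": [1.0, 2.0]}},
    "root_cause": "node3",  # deliberately wrong to show a reported problem
}
problems = validate_topology(topology)
```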

PyTorch Geometric Data Format

from torch_geometric.data import Data

data = Data(
    x=node_features,      # Node feature matrix
    edge_index=edge_index, # Edge connectivity
    edge_attr=edge_attr,   # Edge features
    global_x=global_features, # Global features
    y=label,              # Graph label
    batch=batch_vector    # Batch assignment
)
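The `edge_index` argument holds edge connectivity as two parallel rows of node indices (sources and targets). The following sketch derives it from a topology dictionary using plain lists, assuming edge keys follow the `<source>_<target>` pattern; `build_edge_index` is a hypothetical helper, and the conversion to a tensor is left as a comment so the example runs without PyTorch.

```python
def build_edge_index(topology):
    """Return [sources, targets] as integer node indices, one column per edge."""
    index = {node: i for i, node in enumerate(topology["nodes"])}
    sources, targets = [], []
    for u in topology["nodes"]:
        for v in topology["nodes"]:
            if f"{u}_{v}" in topology["edge_info"]:
                sources.append(index[u])
                targets.append(index[v])
    return [sources, targets]

topology = {
    "nodes": ["a", "b", "c"],
    "edge_info": {"a_b": {}, "b_c": {}},
}
edge_index = build_edge_index(topology)
# With PyTorch installed: torch.tensor(edge_index, dtype=torch.long) has shape [2, num_edges].
```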

🔧 Global Configuration

DAMPING = 0.1: PageRank damping factor for root cause localization

💡 Example Usage

# Incident Detection
from src.incident_detection.callSpatioDevNet import callSpatioDevNet

model = callSpatioDevNet(
    num_epochs=100,
    batch_size=32,
    lr=1e-3,
    hidden_dim=20
)
model.fit(train_data)
scores, features = model.predict(test_data)

# Incident Diagnosis
from src.incident_diagnosis.incident_diagnosis import (
    get_weight_from_edge_info,
    root_cause_localization,
    optimize
)

# Diagnose root cause
root_cause = root_cause_localization(
    case=incident_case,
    node_clue_tags=['cpu_usage', 'memory_usage'],
    edge_clue_tags=['response_time', 'error_rate'],
    a={'cpu_usage': 1.0, 'memory_usage': 0.8},
    get_edge_weight=get_weight_from_edge_info
)

🛠️ Troubleshooting

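- Check your Python version: the requirements files are tested on Python 3.7 and 3.8 for incident detection and on Python 3.8 for incident diagnosis.
- Install PyTorch with the appropriate CUDA support if you use a GPU.
- For torch_geometric installation issues, refer to the official PyTorch Geometric documentation.
- Make sure all data files are in the ./data directory.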

📝 Citation

If you find this work useful, please cite our paper:

@article{he2025subgraphs,
  title={Subgraphs as First-Class Citizens in Incident Management for Large-Scale Online Systems: An Evolution-Aware Framework},
  author={He, Zilong and Chen, Pengfei and Luo, Yu and Yan, Qiuyu and Chen, Hongyang and Yu, Guangba and Li, Fangyuan and Li, Xiaoyun and Zheng, Zibin},
  journal={IEEE Transactions on Software Engineering},
  year={2025},
  publisher={IEEE}
}

@inproceedings{DBLP:conf/kbse/HeCLYCYL22,
  author       = {Zilong He and
                  Pengfei Chen and
                  Yu Luo and
                  Qiuyu Yan and
                  Hongyang Chen and
                  Guangba Yu and
                  Fangyuan Li},
  title        = {Graph based Incident Extraction and Diagnosis in Large-Scale Online
                  Systems},
  booktitle    = {37th {IEEE/ACM} International Conference on Automated Software Engineering,
                  {ASE} 2022, Rochester, MI, USA, October 10-14, 2022},
  pages        = {48:1--48:13},
  publisher    = {{ACM}},
  year         = {2022},
  url          = {https://doi.org/10.1145/3551349.3556904},
  doi          = {10.1145/3551349.3556904},
  timestamp    = {Thu, 22 Jun 2023 07:45:51 +0200},
  biburl       = {https://dblp.org/rec/conf/kbse/HeCLYCYL22.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
