Aegis - Cloud-Native AIOps Framework for Kubernetes

Aegis is an event-driven, cloud-native automated operations system that runs on Kubernetes. It automatically responds to abnormal cluster states by connecting alerts to standard operating procedures (SOPs), which shortens failure response time and reduces manual operational work. Built on custom resources (CRDs) and a workflow engine (e.g., Argo Workflows), Aegis closes the loop from alert reception through rule matching, template rendering, and workflow execution to status feedback. Additional features include AI-HPC cluster diagnostics and periodic node health checks.

Core Capabilities

Automated Cluster Operations

Aegis defines several Kubernetes CRDs:

AegisAlert: Represents alert resources, including alert type, status, and affected objects.
AegisAlertOpsRule: Defines rules for alert-triggered workflows. It specifies matching conditions for AegisAlert resources (type, status, and labels) and references the AegisOpsTemplate to run when an alert matches.
AegisOpsTemplate: Contains Argo Workflow templates for automated operations.

Aegis converts alert messages from external sources into AegisAlert resources. Messages from sources such as AlertManager, Datadog, and Zabbix can be parsed by an AI-based parser. Each alert is matched against the AegisAlertOpsRule resources, and a match triggers the rendering and execution of the referenced AegisOpsTemplate.

Unified alert access: Supports alerts from AlertManager and a default custom format via webhook.
Event-driven response: Alerts are converted into AegisAlert objects to drive workflows.
Automated execution: Integrates with Argo Workflows to execute complex operational tasks.
Custom rules and scripts: AegisCli is used to define rules, generate templates, and build images.
Full lifecycle tracking: Every alert’s processing progress can be tracked via CRD status fields.
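
The matching step described above can be sketched in Go. This is only an illustration under stated assumptions: the Alert, Condition, and Rule types below are hypothetical simplifications, not the actual CRD types from the Aegis API, and the real controller may match differently (for example, on additional fields).

```go
package main

import "fmt"

// Alert is a hypothetical, simplified stand-in for an AegisAlert.
type Alert struct {
	Type   string
	Status string
	Labels map[string]string
}

// Condition is a hypothetical stand-in for one matching condition of a rule.
type Condition struct {
	Type   string
	Status string
	Labels map[string]string // every listed label must be present on the alert
}

// Rule is a hypothetical stand-in for an AegisAlertOpsRule:
// a set of conditions plus the name of the template to run.
type Rule struct {
	Conditions []Condition
	Template   string
}

// matches reports whether the alert satisfies a single condition.
func matches(a Alert, c Condition) bool {
	if a.Type != c.Type || a.Status != c.Status {
		return false
	}
	for key, want := range c.Labels {
		if got, ok := a.Labels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// selectTemplates returns the template of every rule that has
// at least one condition matching the alert.
func selectTemplates(a Alert, rules []Rule) []string {
	var selected []string
	for _, r := range rules {
		for _, c := range r.Conditions {
			if matches(a, c) {
				selected = append(selected, r.Template)
				break
			}
		}
	}
	return selected
}

func main() {
	rules := []Rule{{
		Conditions: []Condition{{Type: "NodeHasEmergencyEvent", Status: "Firing"}},
		Template:   "nodehasemergencyevent",
	}}
	alert := Alert{Type: "NodeHasEmergencyEvent", Status: "Firing"}
	fmt.Println(selectTemplates(alert, rules))
}
```

In this sketch a condition without labels matches on type and status alone, which mirrors the rule example later in this README.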

Cluster Diagnosis (Experimental)

Defines diagnostic objects through the AegisDiagnosis CRD, supporting LLM-based summary diagnosis. Currently supported types:

Node
Pod

Planned support for:

Argo Workflow
PytorchJob

Cluster Health Checks (Experimental)

Supports standardized node and cluster inspection via AegisNodeHealthCheck and AegisClusterHealthCheck CRDs. Custom inspection scripts can be defined and executed from the Pod perspective for enhanced inspection flexibility.

Note: Compared to node-problem-detector (NPD), Aegis supports inspections from within pods—useful in AI/HPC scenarios where simulated production environments are required.

Build and Deploy

Build Image

docker build -t aegis:test -f Dockerfile .

Deploy Aegis Services

# Install CRDs
kubectl apply -f manifests/install/

# Deploy Aegis Controller
kubectl apply -f deploy/ -n monitoring

Alert Source Integration

Aegis exposes three API endpoints for parsing alert messages:

/ai/alert: AIAlertParser, which uses an LLM to parse various alert messages into the unified Aegis format.
/alertmanager/alert: Alertmanager HTTP POST format.
/default/alert: Custom JSON format for external system integrations.

Alertmanager

According to Alertmanager documentation, the following config sends all alerts to Aegis:

global:
  resolve_timeout: 5m
inhibit_rules:
- equal:
    - alertname
  source_matchers:
    - severity = critical
  target_matchers:
    - severity =~ warning
receivers:
- name: Aegis
  webhook_configs:
    - url: http://aegis.monitoring:8080/alertmanager/alert
route:
  group_by: [alertname]
  group_interval: 5m
  group_wait: 0s
  receiver: Aegis
  repeat_interval: 12h
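
To illustrate what arrives at the webhook URL, here is a minimal Go sketch that decodes an Alertmanager webhook payload and maps each entry to a simplified alert. The payload fields (status, alerts, labels, fingerprint) come from Alertmanager's webhook format. The aegisAlert type and the mapping itself (alertname label to type, node label to node, capitalized status) are assumptions made for illustration, not the conversion Aegis actually performs.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// webhookMessage mirrors the subset of the Alertmanager webhook payload used here.
type webhookMessage struct {
	Status string         `json:"status"`
	Alerts []webhookAlert `json:"alerts"`
}

// webhookAlert is one alert inside the webhook payload.
type webhookAlert struct {
	Status      string            `json:"status"`
	Labels      map[string]string `json:"labels"`
	Fingerprint string            `json:"fingerprint"`
}

// aegisAlert is a hypothetical, trimmed-down target type for this sketch.
type aegisAlert struct {
	Type        string
	Status      string
	Node        string
	Fingerprint string
}

// capitalize upper-cases the first byte, e.g. "firing" -> "Firing".
func capitalize(s string) string {
	if s == "" {
		return s
	}
	return strings.ToUpper(s[:1]) + s[1:]
}

// convert decodes a webhook payload and maps every alert in it.
// The label names used for the mapping are assumptions.
func convert(payload []byte) ([]aegisAlert, error) {
	var msg webhookMessage
	if err := json.Unmarshal(payload, &msg); err != nil {
		return nil, fmt.Errorf("decode webhook payload: %w", err)
	}
	out := make([]aegisAlert, 0, len(msg.Alerts))
	for _, a := range msg.Alerts {
		out = append(out, aegisAlert{
			Type:        a.Labels["alertname"],
			Status:      capitalize(a.Status),
			Node:        a.Labels["node"],
			Fingerprint: a.Fingerprint,
		})
	}
	return out, nil
}

func main() {
	payload := []byte(`{
		"status": "firing",
		"alerts": [{
			"status": "firing",
			"labels": {"alertname": "NodeOutOfDiskSpace", "node": "node1"},
			"fingerprint": "5f972974ccf1ee9b"
		}]
	}`)
	alerts, err := convert(payload)
	if err != nil {
		fmt.Println("convert failed:", err)
		return
	}
	for _, a := range alerts {
		fmt.Printf("%+v\n", a)
	}
}
```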

Custom Alert Format

Custom JSON structure in Go:

type Alert struct {
	AlertSourceType AlertSourceType
	Type            string              `json:"type"`
	Status          string              `json:"status"`
	InvolvedObject  AlertInvolvedObject `json:"involvedObject"`
	Details         map[string]string   `json:"details"`
	FingerPrint     string              `json:"fingerprint"`
}

type AlertInvolvedObject struct {
	Kind      string `json:"kind"`
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
	Node      string `json:"node"`
}

Send alerts via curl:

curl http://aegis.monitoring:8080/default/alert -d '{
    "type": "NodeOutOfDiskSpace",
    "status": "Firing",
    "involvedObject": {
        "Kind": "Node",
        "Name": "node1"
    },
    "details": {
        "startAt": "2022021122",
        "node": "node1"
    },
    "fingerprint": "5f972974ccf1ee9b"
}'
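
As a rough sketch of the receiving side, the Go snippet below decodes such a payload into the struct shown earlier. AlertSourceType is assumed to be a string type here, and the check that type and status are non-empty is a hypothetical validation step, not something Aegis is confirmed to do.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// AlertSourceType is assumed to be a string for this sketch.
type AlertSourceType string

// Alert and AlertInvolvedObject repeat the custom format defined above.
type Alert struct {
	AlertSourceType AlertSourceType
	Type            string              `json:"type"`
	Status          string              `json:"status"`
	InvolvedObject  AlertInvolvedObject `json:"involvedObject"`
	Details         map[string]string   `json:"details"`
	FingerPrint     string              `json:"fingerprint"`
}

type AlertInvolvedObject struct {
	Kind      string `json:"kind"`
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
	Node      string `json:"node"`
}

// decodeAlert parses a request body and applies a minimal, hypothetical check.
func decodeAlert(body []byte) (Alert, error) {
	var a Alert
	if err := json.Unmarshal(body, &a); err != nil {
		return Alert{}, fmt.Errorf("decode alert: %w", err)
	}
	if a.Type == "" || a.Status == "" {
		return Alert{}, errors.New("alert needs both type and status")
	}
	return a, nil
}

func main() {
	body := []byte(`{
		"type": "NodeOutOfDiskSpace",
		"status": "Firing",
		"involvedObject": {"Kind": "Node", "Name": "node1"},
		"details": {"startAt": "2022021122", "node": "node1"},
		"fingerprint": "5f972974ccf1ee9b"
	}`)
	a, err := decodeAlert(body)
	if err != nil {
		fmt.Println("invalid alert:", err)
		return
	}
	fmt.Println(a.Type, a.InvolvedObject.Kind, a.InvolvedObject.Name, a.Details["node"])
}
```

Go's encoding/json prefers an exact key match but also accepts a case-insensitive one, so "Kind" and "kind" both populate the Kind field when decoding.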

Install Ops Rules

Example: Cordon a node when a `NodeHasEmergencyEvent` alert is triggered.

Define SOP

The basic shell script:

kubectl cordon $node

Define Ops Rule

Link the NodeHasEmergencyEvent alert to the SOP using a templated workflow ({{.node}} will be dynamically rendered based on the alert):

---
apiVersion: aegis.io/v1alpha1
kind: AegisAlertOpsRule
metadata:
  name: nodehasemergencyevent
spec:
  alertConditions:
  - type: NodeHasEmergencyEvent
    status: Firing
  opsTemplate:
    kind: AegisOpsTemplate
    apiVersion: aegis.io/v1alpha1
    namespace: monitoring
    name: nodehasemergencyevent
---
apiVersion: aegis.io/v1alpha1
kind: AegisOpsTemplate
metadata:
  name: nodehasemergencyevent
spec:
  manifest: |
    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    spec:
      serviceAccountName: aegis-workflow
      ttlSecondsAfterFinished: 60
      entrypoint: start
      templates:
      - name: start
        retryStrategy:
          limit: 1
        container:
          image: bitnami/kubectl:latest
          command:
          - /bin/bash
          - -c
          - |
            kubectl cordon {{.node}}
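
The {{.node}} placeholder uses Go template syntax. The sketch below shows how such a manifest could be rendered with the standard text/template package, assuming the alert's details map is passed as the template data. Whether Aegis renders templates exactly this way, and with which data, is an assumption; the render function is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// manifest is a shortened fragment of the workflow template above.
const manifest = `command:
- /bin/bash
- -c
- |
  kubectl cordon {{.node}}
`

// render fills the placeholders in tmpl from the given key/value pairs.
// missingkey=error makes rendering fail instead of emitting "<no value>"
// when the alert lacks a key the template needs.
func render(tmpl string, details map[string]string) (string, error) {
	t, err := template.New("ops").Option("missingkey=error").Parse(tmpl)
	if err != nil {
		return "", fmt.Errorf("parse template: %w", err)
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, details); err != nil {
		return "", fmt.Errorf("render template: %w", err)
	}
	return buf.String(), nil
}

func main() {
	out, err := render(manifest, map[string]string{"node": "dev1"})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Print(out)
}
```

Failing on a missing key is a deliberate choice in this sketch: a cordon command rendered without a node name is better rejected before a workflow is created.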

Deploy Rule

kubectl apply -f rule.yaml

Check resources:

kubectl get aegisalertopsrule
kubectl get aegisopstemplate

Trigger Automated Ops

Send a test alert to Aegis to trigger the workflow:

curl -X POST http://127.0.0.1:8080/default/alert -d '{
    "type": "NodeHasEmergencyEvent",
    "status": "Firing",
    "involvedObject": {
        "Kind": "Node",
        "Name": "dev1"
    },
    "details": {
        "startAt": "2022021122",
        "node": "dev1"
    },
    "fingerprint": "5f972974ccf1ee9b"
}'

Watch the alert lifecycle:

$ kubectl -n monitoring get aegisalert --watch | grep default
default-nodehasemergencyevent-9njt4                NodeHasEmergencyEvent           Node         dev1     Firing     1                                   
default-nodehasemergencyevent-9njt4                NodeHasEmergencyEvent           Node         dev1     Firing     1       Triggered       Pending     0s
default-nodehasemergencyevent-9njt4                NodeHasEmergencyEvent           Node         dev1     Firing     1       Triggered       Running     0s
default-nodehasemergencyevent-9njt4                NodeHasEmergencyEvent           Node         dev1     Firing     1       Triggered       Succeeded   11s

Check the corresponding workflow and logs:

$ kubectl -n monitoring get workflow | grep default-nodehasemergencyevent-9njt4
default-nodehasemergencyevent-9njt4-s82rh            Succeeded   79s

$ kubectl -n monitoring get pods | grep default-nodehasemergencyevent-9njt4
default-nodehasemergencyevent-9njt4-s82rh-start-4152452869                    0/2     Completed   0               89s

$ kubectl -n monitoring logs default-nodehasemergencyevent-9njt4-s82rh-start-4152452869
node/dev1 cordoned
