Aegis is an event-driven, cloud-native automated operations system running on Kubernetes. It automatically responds to and handles abnormal states in the cluster. By connecting alerts to standard operating procedures (SOPs), it significantly improves operational efficiency and failure response speed. Through the integration of custom resources (CRDs) and a workflow engine (e.g., Argo Workflows), Aegis forms a complete closed loop covering alert reception, rule matching, template rendering, workflow execution, and status feedback. Additional features include AI/HPC cluster diagnostics and periodic node health checks.
- Core Capabilities
- Build and Deploy
- Alert Source Integration
- Install Ops Rules
- Trigger Automated Ops
- Typical Scenario Examples
Aegis defines several Kubernetes CRDs:
- AegisAlert: Represents alert resources, including alert type, status, and affected objects.
- AegisAlertOpsRule: Defines rules for alert-triggered workflows. It specifies the matching conditions (type, status, and labels) for AegisAlert resources and references the AegisOpsTemplate to execute.
- AegisOpsTemplate: Contains the Argo Workflow template for the automated operation.
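To make the model concrete, an AegisAlert object created from an incoming alert might look roughly like the sketch below. The spec field names are illustrative assumptions; the authoritative schema is the CRD definition installed from manifest/install/.

apiVersion: aegis.io/v1alpha1
kind: AegisAlert
metadata:
  # Alert objects are generated by the controller; the name here is illustrative
  name: default-nodeoutofdiskspace-example
  namespace: monitoring
spec:
  # Alert type and status as reported by the alert source (illustrative field names)
  type: NodeOutOfDiskSpace
  status: Firing
  # The Kubernetes object the alert refers to
  involvedObject:
    kind: Node
    name: node1
  # Free-form key/value details carried over from the alert
  details:
    node: node1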
Aegis converts alert messages from different sources (such as AlertManager, Datadog, and Zabbix; arbitrary formats can be parsed with the help of AI) into AegisAlert resources. These alerts are matched against AegisAlertOpsRule resources, and a match triggers the rendering and execution of the referenced AegisOpsTemplate.
- Unified alert access: Supports alerts from AlertManager and a default custom format via webhook.
- Event-driven response: Alerts are converted into AegisAlert objects to drive workflows.
- Automated execution: Integrates with Argo Workflows to execute complex operational tasks.
- Custom rules and scripts: Managed via AegisCli to define rules, generate templates, and build images.
- Full lifecycle tracking: Every alert’s processing progress can be tracked via CRD status fields.
Defines diagnostic objects through the AegisDiagnosis CRD, supporting LLM-based summary diagnosis. Currently supported types:
- Node
- Pod
Planned support for:
- Argo Workflow
- PytorchJob
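For example, requesting a diagnosis of a single node could be expressed roughly as follows. This is only a sketch: the spec layout (the object kind and name) is an assumption for illustration, not the documented schema.

apiVersion: aegis.io/v1alpha1
kind: AegisDiagnosis
metadata:
  name: diagnose-dev1
  namespace: monitoring
spec:
  # Hypothetical fields: the object to diagnose
  object:
    kind: Node
    name: dev1

The diagnosis result, including the LLM-generated summary, would then be reported back through the object's status, consistent with the CRD-status-based tracking described above.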
Supports standardized node and cluster inspection via the AegisNodeHealthCheck and AegisClusterHealthCheck CRDs. Custom inspection scripts can be defined and executed from the Pod perspective for enhanced inspection flexibility.
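A node health check with a custom inspection script might be declared roughly as follows. Again, this is a hypothetical sketch: the spec layout (node selection and the embedded script) is assumed for illustration.

apiVersion: aegis.io/v1alpha1
kind: AegisNodeHealthCheck
metadata:
  name: node-basic-check
  namespace: monitoring
spec:
  # Hypothetical fields: the node to inspect and the script to run
  node: dev1
  script: |
    #!/bin/bash
    # Runs from a pod on the target node, so it sees the same
    # environment a production workload would see
    df -h /
    nvidia-smi || echo "no GPU detected"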
Note: Compared to node-problem-detector (NPD), Aegis supports inspections from within pods, which is useful in AI/HPC scenarios where a simulated production environment is required.
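Build and deploy Aegis as follows:
# Build the Aegis image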
docker build -t aegis:test -f Dockerfile .
# Install CRDs
kubectl apply -f manifest/install/
# Deploy Aegis Controller
kubectl apply -f manifest/deploy/ -n monitoring
Aegis exposes three APIs for parsing alert messages:
- /ai/alert: AIAlertParser; uses an LLM to parse alert messages in various formats into the unified Aegis format.
- /alertmanager/alert: accepts the Alertmanager HTTP POST (webhook) format.
- /alert: accepts a custom JSON format for external system integrations.
According to the Alertmanager documentation, the following configuration sends all alerts to Aegis:
global:
  resolve_timeout: 5m
inhibit_rules:
  - equal:
      - alertname
    source_matchers:
      - severity = critical
    target_matchers:
      - severity =~ warning
receivers:
  - name: Aegis
    webhook_configs:
      - url: http://aegis.monitoring:8080/alertmanager/alert
route:
  group_by: [alertname]
  group_interval: 5m
  group_wait: 0s
  receiver: Aegis
  repeat_interval: 12h
The custom JSON format corresponds to the following Go structures:
type Alert struct {
    AlertSourceType AlertSourceType
    Type            string              `json:"type"`
    Status          string              `json:"status"`
    InvolvedObject  AlertInvolvedObject `json:"involvedObject"`
    Details         map[string]string   `json:"details"`
    FingerPrint     string              `json:"fingerprint"`
}

type AlertInvolvedObject struct {
    Kind      string `json:"kind"`
    Name      string `json:"name"`
    Namespace string `json:"namespace"`
    Node      string `json:"node"`
}
Send alerts via curl:
curl http://aegis.monitoring:8080/default/alert -d '{
"type": "NodeOutOfDiskSpace",
"status": "Firing",
"involvedObject": {
"Kind": "Node",
"Name": "node1"
},
"details": {
"startAt": "2022021122",
"node": "node1"
},
"fingerprint": "5f972974ccf1ee9b"
}'
The SOP in this example is a basic shell script that cordons the affected node:
kubectl cordon $node
Link the NodeHasEmergencyEvent alert to the SOP using a templated workflow ({{.node}} will be dynamically rendered based on the alert):
---
apiVersion: aegis.io/v1alpha1
kind: AegisAlertOpsRule
metadata:
  name: nodehasemergencyevent
spec:
  alertConditions:
    - type: NodeHasEmergencyEvent
      status: Firing
  opsTemplate:
    kind: AegisOpsTemplate
    apiVersion: aegis.io/v1alpha1
    namespace: monitoring
    name: nodehasemergencyevent
---
apiVersion: aegis.io/v1alpha1
kind: AegisOpsTemplate
metadata:
  name: nodehasemergencyevent
spec:
  manifest: |
    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    spec:
      serviceAccountName: aegis-workflow
      ttlSecondsAfterFinished: 60
      entrypoint: start
      templates:
        - name: start
          retryStrategy:
            limit: 1
          container:
            image: bitnami/kubectl:latest
            command:
              - /bin/bash
              - -c
              - |
                kubectl cordon {{.node}}
Save the manifests above as rule.yaml and apply them:
kubectl apply -f rule.yaml
Check resources:
kubectl get aegisalertopsrule
kubectl get aegisopstemplate
Send a test alert to Aegis to trigger the workflow:
curl -X POST http://127.0.0.1:8080/default/alert -d '{
"type": "NodeHasEmergencyEvent",
"status": "Firing",
"involvedObject": {
"Kind": "Node",
"Name": "dev1"
},
"details": {
"startAt": "2022021122",
"node": "dev1"
},
"fingerprint": "5f972974ccf1ee9b"
}'
Watch the alert lifecycle:
$ kubectl -n monitoring get aegisalert --watch | grep default
default-nodehasemergencyevent-9njt4 NodeHasEmergencyEvent Node dev1 Firing 1
default-nodehasemergencyevent-9njt4 NodeHasEmergencyEvent Node dev1 Firing 1 Triggered Pending 0s
default-nodehasemergencyevent-9njt4 NodeHasEmergencyEvent Node dev1 Firing 1 Triggered Running 0s
default-nodehasemergencyevent-9njt4 NodeHasEmergencyEvent Node dev1 Firing 1 Triggered Succeeded 11s
Check the corresponding workflow and logs:
$ kubectl -n monitoring get workflow | grep default-nodehasemergencyevent-9njt4
default-nodehasemergencyevent-9njt4-s82rh Succeeded 79s
$ kubectl -n monitoring get pods | grep default-nodehasemergencyevent-9njt4
default-nodehasemergencyevent-9njt4-s82rh-start-4152452869 0/2 Completed 0 89s
$ kubectl -n monitoring logs default-nodehasemergencyevent-9njt4-s82rh-start-4152452869
node/dev1 cordoned