A comprehensive framework for testing system resilience through rack-level failure simulation in Kubernetes clusters using Argo Workflows. The project validates rack resiliency by verifying that the system can continue operating during rack-level failures in a Kubernetes (HPC) cluster environment.
This repository is organized as follows:
- Automation_Scripts - Scripts for automating chaos testing and workflow execution
- Critical_Services - Definitions and configurations for critical services to be tested
- Logs - Log files generated during test runs
- k8s_cluster - Kubernetes cluster configuration files
This framework is designed to test application resilience in Kubernetes environments by simulating different types of failures at the rack level. Key features include:
- Node Failure Simulation: Simulates single node failures by cordoning and tainting nodes.
- Rack Failure Simulation: Simulates failures of entire racks/zones of nodes.
- Automated Recovery: Automatically recovers nodes after a configurable downtime.
- Detailed Monitoring: Provides comprehensive logging of cluster state before, during, and after failures.
- Zone-aware Testing: Ensures proper testing of multi-zone resilience.
- Argo Workflow Integration: All tests are orchestrated using Argo Workflows for reliability and reproducibility.
The framework uses the following components:
- Kubernetes API: For node cordoning, tainting, and monitoring.
- Argo Workflows: Orchestrates the chaos testing workflows.
- Python Scripts: Handle the actual simulation logic.
- Custom Docker Container: Encapsulates all dependencies.
- RBAC: Provides necessary permissions for node operations.
The system works by:
- Identifying target nodes based on zone/rack assignment
- Cordoning and tainting nodes to simulate failures
- Monitoring application behavior during the simulated failure
- Uncordoning and untainting nodes to simulate recovery
- Verifying proper application recovery
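The kubectl operations behind one failure/recovery cycle look roughly like the sketch below. The node name is a placeholder and the taint effect is an assumption; the only detail taken from this README is the `simulated-failure` taint key used by the cleanup commands later on.

```bash
# Rough sketch of a single-node failure/recovery cycle (node name is a placeholder)
NODE=<node-name>

# Simulate failure: stop scheduling new pods and apply a failure taint
kubectl cordon "${NODE}"
kubectl taint nodes "${NODE}" simulated-failure=true:NoSchedule   # effect is an assumption

# ... observe application behavior for the configured downtime ...

# Simulate recovery: remove the taint and allow scheduling again
kubectl taint nodes "${NODE}" simulated-failure-
kubectl uncordon "${NODE}"
```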
- Vagrant installed on your system
- kubectl configured for your cluster
- Docker for building the simulation container
- Argo Workflows installed in your cluster
- Proper RBAC permissions for node management
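A quick way to confirm the tooling is in place before running any workflow (output will vary by environment):

```bash
# Verify the client tooling is installed
vagrant --version
kubectl version --client
docker --version

# Verify the cluster is reachable
kubectl cluster-info
```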
Install Vagrant in your working directory. For detailed cluster setup instructions, refer to the k8s_cluster directory in the repository.
Label your nodes with zones to simulate racks:
```bash
# Example: Label nodes with zones
python3 label-nodes-by-zones.py
```
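If you prefer to label a node by hand, something like the following works; the node names are placeholders and the `topology.kubernetes.io/zone` label key is an assumption (the script may use a different key):

```bash
# Manually assign nodes to simulated racks (illustrative names and label key)
kubectl label node <worker-node-1> topology.kubernetes.io/zone=rack-a --overwrite
kubectl label node <worker-node-2> topology.kubernetes.io/zone=rack-b --overwrite

# Confirm the zone labels
kubectl get nodes -L topology.kubernetes.io/zone
```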
Install Argo Workflows in your cluster:
```bash
# Create namespace
kubectl create namespace argo

# Install Argo Workflows
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.4.0/install.yaml

# Install the Argo CLI
curl -sLO https://github.com/argoproj/argo-workflows/releases/download/v3.4.0/argo-linux-amd64.gz
gunzip argo-linux-amd64.gz
chmod +x argo-linux-amd64
sudo mv argo-linux-amd64 /usr/local/bin/argo
```
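To confirm the installation succeeded before running any tests, check that the Argo components are up and the CLI responds:

```bash
# The workflow-controller and argo-server pods should reach Running
kubectl get pods -n argo

# The CLI should report the installed version
argo version
```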
Check the Automation_Scripts directory for the available test workflows and scripts.
### Node Failure Simulation
To simulate a single node failure:
```bash
# Apply the node failure workflow from the Automation_Scripts directory
argo submit -n argo <path-to-workflow-yaml> --watch
```
This will:
- Run a health check
- Simulate a node failure (cordon + taint)
- Wait for the specified time
- Recover the node
- Run a final health check
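Besides `--watch`, the run can be inspected afterwards with the standard Argo CLI commands (the workflow name is whatever `argo submit` reported):

```bash
# List workflows in the argo namespace
argo list -n argo

# Inspect the DAG status and step logs of a specific run
argo get -n argo <workflow-name>
argo logs -n argo <workflow-name>
```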
To simulate a rack/zone failure:
```bash
# Using Argo CLI
argo submit -n argo <path-to-rack-workflow-yaml> --watch
```
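Before submitting, it can help to confirm which nodes belong to the target rack/zone; the label key and zone name here are assumptions and should match whatever the labeling step used:

```bash
# List the nodes in one simulated rack (illustrative zone name and label key)
kubectl get nodes -l topology.kubernetes.io/zone=rack-a
```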
You can monitor your chaos tests in several ways:
Access the Argo UI to monitor workflow progress:
```bash
# Port forward the Argo UI
kubectl port-forward svc/argo-server -n argo 2746:2746

# Access in browser
# https://localhost:2746
```
Workflow logs are visible in the Argo UI: under the argo namespace, open the workflow and go to its LOGS tab. To open the Argo UI, run the following in the host terminal:

```bash
kubectl -n argo port-forward svc/argo-server 2746:2746
```

Then go to https://localhost:2746/
Logs are also stored on the master-m003 node in the argo-logs directory, with a subfolder per DAG template; each subfolder contains a rack_resilience_simulation.log file.
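A hedged example of pulling those logs off the master node; the exact parent path of argo-logs and the subfolder names depend on your setup:

```bash
# List the per-template log folders and tail one of them (paths are illustrative)
ssh master-m003 'ls argo-logs/'
ssh master-m003 'tail -n 50 argo-logs/<dag-template>/rack_resilience_simulation.log'
```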
Sample log files generated during previous runs are stored in the Logs directory of the repository.
During health checks, the system logs detailed information:
- Node Status:
  - Basic node information with `kubectl get nodes -o wide`
  - Enhanced status with cordoned state and taints
  - Warning indicators for problem states
- Pod Information:
  - All pods with `kubectl get pods -o wide --all-namespaces`
  - Pod distribution by node
  - Service-specific pod details
  - Zone distribution analysis
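If you want to reproduce the pod-distribution-by-node view outside of a health check, one way to approximate it is:

```bash
# Count pods per node across all namespaces
kubectl get pods --all-namespaces --no-headers -o custom-columns=NODE:.spec.nodeName \
  | sort | uniq -c | sort -rn
```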
If a chaos test is interrupted, nodes might be left cordoned or tainted. To clean up:
```bash
# Script to clean up ALL nodes at once
for NODE in $(kubectl get nodes -o name); do
  NODE=${NODE#node/}   # strip the "node/" prefix so the commands below get a plain node name
  kubectl uncordon "${NODE}"
  kubectl taint nodes "${NODE}" simulated-failure- 2>/dev/null || true
  echo "Cleaned up ${NODE}"
done
```
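Afterwards you can confirm nothing is left cordoned or tainted:

```bash
# All nodes should show Ready (not Ready,SchedulingDisabled)
kubectl get nodes

# Print any remaining taints per node; the taints column should be empty
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'
```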
To restart all deployments (which will restart all pods):
```bash
# Restart all deployments in every namespace
# (the namespace must be passed through, otherwise the restart targets only the current namespace)
kubectl get deployments --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' \
  | while read -r ns name; do kubectl rollout restart deployment "${name}" -n "${ns}"; done

# Or restart specific simulation services from the Critical_Services directory
kubectl rollout restart deployment <service-name>
```
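You can then wait for the restart to complete before re-running any checks:

```bash
# Block until the restarted deployment is fully rolled out (name is a placeholder)
kubectl rollout status deployment <service-name>
```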
If nodes are stuck in an unschedulable state:
```bash
# Check node status
kubectl get nodes

# Manually uncordon a node
kubectl uncordon <node-name>

# Remove taints
kubectl taint nodes <node-name> simulated-failure-
```
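To see exactly which taints a stuck node carries before removing them:

```bash
# Print the taints currently set on a node (name is a placeholder)
kubectl get node <node-name> -o jsonpath='{.spec.taints}{"\n"}'
```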
If you see errors about insufficient permissions:
```bash
# Check RBAC binding
kubectl describe clusterrolebinding argo-node-admin

# Verify the service account exists
kubectl get sa -n argo argo-workflow
```
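A direct way to test the permissions themselves is `kubectl auth can-i`, impersonating the workflow service account referenced above:

```bash
# These should print "yes" if the RBAC binding grants node management
kubectl auth can-i patch nodes --as=system:serviceaccount:argo:argo-workflow
kubectl auth can-i get nodes --as=system:serviceaccount:argo:argo-workflow
```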
If services aren't properly distributed across zones after recovery:
```bash
# Restart services to trigger rebalancing
kubectl rollout restart deployments

# Check distribution
kubectl get pods -o wide
```
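To see the distribution per zone rather than per node, list nodes with their zone label (again assuming the label key used during setup) and cross-check against the pod placement above:

```bash
# Show each node's zone label alongside its status (label key is an assumption)
kubectl get nodes -L topology.kubernetes.io/zone
```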