- Overview
- How It Works
- Features
- Prerequisites
- Kestra Plugins
- Workflow Configuration
- Chaos Test Types
- Input Parameters
- Usage
- Results & Evaluation
- Troubleshooting
- Contributing
A-CERT (Advanced Chaos Engineering Resilience Testing) is a chaos testing tool built on Kestra workflows that automates the chaos engineering process for microservices end to end. It tests application resilience by introducing controlled failures and measuring system behavior under stress.
The tool combines Chaos Mesh for chaos injection, Artillery for load testing, and Kubernetes for container orchestration.
A-CERT follows a structured approach to chaos testing:
- Source Code Acquisition: Clones your microservice repository from GitHub
- Container Building: Builds a Docker image from your application
- Registry Push: Pushes the image to Docker Hub for deployment
- Kubernetes Deployment: Deploys your microservice to a Kubernetes cluster
- Service Exposure: Creates Kubernetes services for external access
- Baseline Testing: Runs initial load tests to establish performance baselines
- Chaos Injection: Systematically introduces various types of failures
- Performance Monitoring: Continuously monitors application behavior during chaos
- Results Analysis: Evaluates test results against predefined thresholds
- Pass/Fail Determination: Provides clear certification of application resilience
- Multi-Type Chaos Testing: CPU stress, memory stress, network latency, packet loss, and pod killing
- Configurable Test Modes: Support for different targeting modes (one, all, fixed, percentage-based)
- Automated CI/CD Integration: Seamless integration with existing development workflows
- Real-time Performance Monitoring: Live tracking of response times and error rates
- Customizable Thresholds: Define your own acceptance criteria for resilience
- Comprehensive Reporting: Detailed test results with pass/fail analysis
- Scalable Load Testing: Configurable virtual user simulation with Artillery
- Cloud-Native Architecture: Built for Kubernetes environments
- Kubernetes Cluster (v1.19+)
- Chaos Mesh installed on the cluster
- Docker Hub Account for image registry
- Kestra Instance (v0.15+)
- Microservice with Dockerfile in repository root
- Health check endpoint (/health) for readiness probes
- Application listening on port 3000
- GitHub repository with public access
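To make the application prerequisites concrete, here is a minimal sketch of a Deployment and Service that satisfy them: a readiness probe on /health and a container listening on port 3000. The resource name and the `app: sample-microservice` label are taken from the troubleshooting commands in this README. The image reference, replica count, and probe timings are placeholders chosen for illustration, and this is not necessarily the manifest A-CERT applies.

```yaml
# Illustrative only: image, replicas, and probe timings are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-microservice
  labels:
    app: sample-microservice
spec:
  replicas: 2                      # assumed; more than one pod makes the "all" and percentage modes meaningful
  selector:
    matchLabels:
      app: sample-microservice
  template:
    metadata:
      labels:
        app: sample-microservice   # chaos experiments select pods by this label
    spec:
      containers:
        - name: sample-microservice
          image: your-username/your-microservice:latest   # placeholder image pushed to Docker Hub
          ports:
            - containerPort: 3000  # application port required by the prerequisites
          readinessProbe:
            httpGet:
              path: /health        # health check endpoint required by the prerequisites
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: sample-microservice
spec:
  selector:
    app: sample-microservice
  ports:
    - port: 3000
      targetPort: 3000
```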
A-CERT utilizes the following Kestra plugins:
- io.kestra.plugin.core.flow.WorkingDirectory: Workspace management
- io.kestra.plugin.core.flow.ForEach: Iterative task execution
- io.kestra.plugin.core.flow.Subflow: Workflow composition
- io.kestra.plugin.core.flow.Switch: Conditional execution
- io.kestra.plugin.core.flow.Sleep: Timing control

- io.kestra.plugin.git.Clone: Git repository operations
- io.kestra.plugin.docker.Build: Docker image building
- io.kestra.plugin.docker.Run: Container execution
- io.kestra.plugin.kubernetes.kubectl.Apply: Kubernetes resource management
- io.kestra.plugin.scripts.shell.Commands: Shell script execution
- io.kestra.plugin.scripts.runner.docker.Docker: Containerized script execution
The tool consists of two main workflows:
- Namespace: github.clone
- Purpose: Orchestrates the complete chaos testing pipeline
- Key Tasks: Repository cloning, image building, deployment, test execution
- Namespace: acert
- Purpose: Executes individual chaos experiments
- Key Tasks: Chaos injection, load testing, result collection
CPU Stress:
- Injects high CPU load on target pods
- Configurable worker threads and load percentage
- Tests application behavior under compute pressure
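For the CPU stress test, a Chaos Mesh `StressChaos` resource along the following lines would express the worker and load settings. This is a sketch, not the manifest A-CERT generates: the resource name is hypothetical, the `default` namespace and `app: sample-microservice` label are borrowed from the troubleshooting commands, and the values mirror the `cpu_workers`, `cpu_load`, and `tests_duration` inputs.

```yaml
# Sketch of a CPU stress experiment; metadata.name is a made-up example.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-example
  namespace: default
spec:
  mode: one                        # chaos_mode
  selector:
    namespaces:
      - default
    labelSelectors:
      app: sample-microservice
  stressors:
    cpu:
      workers: 1                   # cpu_workers
      load: 80                     # cpu_load, percent of CPU each worker tries to occupy
  duration: "60s"                  # tests_duration
```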
Memory Stress:
- Creates memory pressure on target containers
- Configurable memory size and worker count
- Validates memory management and garbage collection
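Memory stress uses the same `StressChaos` kind with a `memory` stressor. The sketch below assumes the same hypothetical selector as the other examples and uses the default `memory_size` of 128MB; the worker count is an illustrative value.

```yaml
# Sketch of a memory stress experiment; metadata.name is a made-up example.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress-example
  namespace: default
spec:
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: sample-microservice
  stressors:
    memory:
      workers: 1                   # number of memory-consuming workers (assumed value)
      size: "128MB"                # memory_size
  duration: "60s"
```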
Network Latency:
- Introduces artificial network delays
- Configurable latency, jitter, and correlation
- Tests timeout handling and user experience
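Network latency maps onto a Chaos Mesh `NetworkChaos` resource with the `delay` action. The sketch below uses the default `network_latency_ms` of 1000; the jitter and correlation values are illustrative, since this README does not list inputs for them, and the name and selector are assumptions as in the other examples.

```yaml
# Sketch of a network delay experiment; metadata.name is a made-up example.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-latency-example
  namespace: default
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: sample-microservice
  delay:
    latency: "1000ms"              # network_latency_ms
    jitter: "0ms"                  # assumed: no variation around the base latency
    correlation: "0"               # assumed: each packet's delay is independent of the previous one
  duration: "60s"
```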
Packet Loss:
- Simulates packet loss scenarios
- Configurable loss percentage and correlation
- Validates retry mechanisms and error handling
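Packet loss uses the same `NetworkChaos` kind with the `loss` action. The sketch below uses the default `network_loss_percent` of 25; the correlation value, name, and selector are assumptions.

```yaml
# Sketch of a packet loss experiment; metadata.name is a made-up example.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: packet-loss-example
  namespace: default
spec:
  action: loss
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: sample-microservice
  loss:
    loss: "25"                     # network_loss_percent
    correlation: "0"               # assumed: losses are independent of one another
  duration: "60s"
```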
Pod Kill:
- Terminates pods to simulate failures
- Tests service recovery and failover mechanisms
- Validates high availability configurations
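Pod killing corresponds to a Chaos Mesh `PodChaos` resource with the `pod-kill` action. The kill is a one-off event, so the sketch sets no duration; recovery relies on the Deployment recreating the pod. The name and selector are again assumptions.

```yaml
# Sketch of a pod kill experiment; metadata.name is a made-up example.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: default
spec:
  action: pod-kill
  mode: one                        # kill a single matching pod
  selector:
    namespaces:
      - default
    labelSelectors:
      app: sample-microservice
```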
Parameter | Type | Description |
---|---|---|
github_repo_url | STRING | GitHub repository URL |
docker_hub_username | STRING | Docker Hub username |
docker_hub_password | STRING | Docker Hub password |
chaos_test | MULTISELECT | Chaos tests to execute |
tests_duration | INT | Test duration in seconds |
chaos_mode | SELECT | Test targeting mode |
Parameter | Type | Description | Default |
---|---|---|---|
network_latency_ms | INT | Network latency in ms | 1000 |
network_loss_percent | INT | Packet loss percentage | 25 |
memory_size | STRING | Memory stress size | "128MB" |
cpu_workers | INT | CPU stress workers | 1 |
threshold_error_rate | FLOAT | Max acceptable error rate | 0.05 |
threshold_response_time | INT | Max response time in ms | 500 |
arrival_rate | INT | Virtual users per second | 10 |
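To show how the load-related inputs relate to Artillery, here is a minimal Artillery test script using `tests_duration` and the default `arrival_rate`. It is a sketch under assumptions: the target URL presumes the service is reachable as `sample-microservice` on port 3000, and the single scenario that requests /health is illustrative rather than the scenario A-CERT actually runs.

```yaml
# Illustrative Artillery script; target and scenario are assumptions.
config:
  target: "http://sample-microservice:3000"
  phases:
    - duration: 60                 # tests_duration, in seconds
      arrivalRate: 10              # arrival_rate: new virtual users started per second
scenarios:
  - name: health-check
    flow:
      - get:
          url: "/health"
```

With these values Artillery starts roughly 600 virtual users over the run (10 per second for 60 seconds), each issuing one request.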
- one: Target single pod
- all: Target all pods
- fixed: Target specific number of pods
- fixed-percent: Target percentage of pods
- random-max-percent: Random percentage up to maximum
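These mode names match the `mode` field of Chaos Mesh experiments. In Chaos Mesh, `one` and `all` need no further setting, while `fixed`, `fixed-percent`, and `random-max-percent` read an accompanying `value` field, which the `chaos_mode_value` input presumably feeds. A sketch of a percentage-based experiment, with a hypothetical name and selector:

```yaml
# Sketch: target half of the matching pods; metadata.name is a made-up example.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-half-example
  namespace: default
spec:
  action: pod-kill
  mode: fixed-percent              # chaos_mode
  value: "50"                      # chaos_mode_value, given as a string
  selector:
    namespaces:
      - default
    labelSelectors:
      app: sample-microservice
```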
1. On your host machine, run the following command to grant cluster-admin permissions to the default service account in the default namespace:
kubectl create clusterrolebinding default-admin --clusterrole=cluster-admin --serviceaccount=default:default
2. Install Chaos Mesh on the cluster by following the Helm installation guide: https://chaos-mesh.org/docs/production-installation-using-helm/
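For reference, the `kubectl create clusterrolebinding` command in step 1 creates the object below. Keeping it as a manifest can be convenient if you prefer to review or version the binding before applying it with `kubectl apply -f`.

```yaml
# Declarative equivalent of the clusterrolebinding command in step 1.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: default-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: default                  # the default service account...
    namespace: default             # ...in the default namespace
```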
# Trigger the main workflow with minimal configuration
inputs:
  github_repo_url: "https://github.com/your-org/your-microservice.git"
  docker_hub_username: "your-username"
  docker_hub_password: "your-password"
  chaos_test: ["pod_kill", "cpu_stress"]
  tests_duration: 60
  chaos_mode: "one"
# Comprehensive chaos testing configuration
inputs:
  github_repo_url: "https://github.com/your-org/your-microservice.git"
  docker_hub_username: "your-username"
  docker_hub_password: "your-password"
  chaos_test: ["cpu_stress", "memory_stress", "network_latency", "pod_kill"]
  tests_duration: 120
  chaos_mode: "fixed-percent"
  chaos_mode_value: "50"
  cpu_workers: 4
  cpu_load: 80
  memory_size: "256MB"
  network_latency_ms: 500
  threshold_error_rate: 0.02
  threshold_response_time: 300
  arrival_rate: 20
# Example GitHub Actions integration
- name: Run Chaos Tests
  run: |
    curl -X POST "${KESTRA_URL}/api/v1/executions/acert/github.clone" \
      -H "Content-Type: application/json" \
      -d @chaos-test-config.json
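The step above is only a fragment of a workflow file. A complete GitHub Actions workflow around it might look like the sketch below. The workflow name, the manual trigger, and the `KESTRA_URL` repository secret are assumptions, and the request itself is copied unchanged from the step above; the request format your Kestra version expects for inputs has not been verified here.

```yaml
# Hypothetical workflow file, e.g. .github/workflows/chaos-tests.yml
name: chaos-tests
on:
  workflow_dispatch:               # assumed: run on demand from the Actions tab
jobs:
  chaos:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4  # makes chaos-test-config.json available to curl
      - name: Run Chaos Tests
        env:
          KESTRA_URL: ${{ secrets.KESTRA_URL }}   # assumed secret holding the Kestra base URL
        run: |
          curl -X POST "${KESTRA_URL}/api/v1/executions/acert/github.clone" \
            -H "Content-Type: application/json" \
            -d @chaos-test-config.json
```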
A-CERT evaluates the following metrics:
- Response Time: Average, P95, and P99 response times
- Error Rate: Percentage of failed requests
- Success Rate: Percentage of successful responses
- Throughput: Requests per second handled
- Availability: Service uptime during chaos
Tests are evaluated against configurable thresholds:
✅ PASSED: Response time < 500ms AND Error rate < 5%
❌ FAILED: Response time ≥ 500ms OR Error rate ≥ 5%
A-CERT Chaos Test Results Analysis
=====================================
Test: cpu-stress
➤ Total Requests: 1200
➤ Successful Responses: 1180
➤ Failed Users: 0
➤ Timeout Errors: 20
➤ Success Rate: 98.33%
➤ Avg Response Time: 245 ms
➤ P95 Response Time: 380 ms
➤ P99 Response Time: 450 ms
➤ Error Rate: 0.016667
✅ PASSED
All tests passed thresholds!
# Check Dockerfile exists in repository root
# Ensure base images are accessible
# Verify Docker Hub credentials
# Verify Chaos Mesh is installed
kubectl get pods -n chaos-mesh
# Check pod labels match selectors
kubectl get pods -l app=sample-microservice
# Verify service is accessible
kubectl get svc sample-microservice
# Check pod readiness
kubectl get pods -l app=sample-microservice
# Check workflow execution logs
kestra logs flow acert github.clone <execution-id>
# Verify Kubernetes resources
kubectl get all -l app=sample-microservice
# Check chaos experiments
kubectl get networkchaos,podchaos,stresschaos -n default
We welcome contributions to A-CERT! Please follow these guidelines:
- Fork the repository
- Create a feature branch
- Implement your changes
- Test thoroughly
- Submit a pull request
# Clone repository
git clone https://github.com/your-org/a-cert.git
# Setup Kestra development environment
# Install required plugins
# Configure test cluster
This project is licensed under the MIT License - see the LICENSE file for details.
- Chaos Mesh team for the excellent chaos engineering platform
- Artillery team for the powerful load testing framework
- Kestra team for the flexible workflow orchestration
- Kubernetes community for the robust container platform
A-CERT: Making your microservices antifragile, one chaos test at a time! 🌪️