Feature: add dynamic interval support. #652

Open · wants to merge 15 commits into master

33 changes: 32 additions & 1 deletion README.md
@@ -67,6 +67,35 @@ Remember that `chaoskube` by default kills any pod in all your namespaces, inclu

`chaoskube` provides a simple HTTP endpoint that can be used to check that it is running. This can be used for [Kubernetes liveness and readiness probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/). By default, this listens on port 8080. To disable, pass `--metrics-address=""` to `chaoskube`.
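
A minimal sketch of such a check in Go, assuming the default port 8080 mentioned above; the `/healthz` path is an assumption and may differ in your deployment:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Probe chaoskube's HTTP endpoint to confirm the process is up.
	// Port 8080 is the documented default; the /healthz path is assumed.
	resp, err := http.Get("http://localhost:8080/healthz")
	if err != nil {
		log.Fatalf("chaoskube health endpoint unreachable: %v", err)
	}
	defer resp.Body.Close()
	fmt.Printf("health endpoint returned %s\n", resp.Status)
}
```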

## Dynamic Interval

`chaoskube` supports a dynamic interval feature that automatically adjusts the time between pod terminations based on the number of candidate pods in your cluster. This keeps the level of chaos appropriate for both small and large environments.

### How it works

With dynamic interval enabled, `chaoskube` calculates the interval between pod terminations using the following formula:

```
interval = totalWorkingMinutes / (podCount * factor)
```

Where:
- `totalWorkingMinutes` = 10 days * 8 hours * 60 minutes = 4800 minutes (we assume that every pod should be killed once over two working weeks)
- `podCount` is the number of candidate pods that match the configured filters
- `factor` is the configurable dynamic interval factor

The dynamic interval factor lets you control the aggressiveness of the terminations:

- With `factor = 1.0`: Standard interval calculation
- With `factor > 1.0`: More aggressive terminations (shorter intervals)
- With `factor < 1.0`: Less aggressive terminations (longer intervals)

### Example scenarios

- Small cluster (100 pods, factor 1.0): interval = 48 minutes
- Small cluster (100 pods, factor 1.5): interval = 32 minutes
- Small cluster (100 pods, factor 2.0): interval = 24 minutes
- Large cluster (1500 pods, factor 1.0): interval = 3 minutes
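
These numbers follow directly from the formula above. For reference, here is a minimal Go sketch of the calculation, mirroring the 4800-minute working window and the one-minute lower bound used by the implementation in this pull request:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// dynamicInterval reproduces the README formula:
// interval = totalWorkingMinutes / (podCount * factor),
// rounded to the nearest minute with a one-minute floor.
func dynamicInterval(podCount int, factor float64) time.Duration {
	const totalWorkingMinutes = 10 * 8 * 60 // two working weeks of 8-hour days = 4800 minutes
	raw := float64(totalWorkingMinutes) / (float64(podCount) * factor)
	minutes := int(math.Max(1, math.Round(raw)))
	return time.Duration(minutes) * time.Minute
}

func main() {
	fmt.Println(dynamicInterval(100, 1.0))  // 48m0s
	fmt.Println(dynamicInterval(100, 1.5))  // 32m0s
	fmt.Println(dynamicInterval(1500, 1.0)) // 3m0s
}
```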

## Filtering targets

However, you can limit the search space of `chaoskube` by providing label, annotation, and namespace selectors, pod name include/exclude patterns, as well as a minimum age setting.
@@ -200,6 +229,8 @@ Use `UTC`, `Local` or pick a timezone name from the [(IANA) tz database](https:/
| Option | Environment | Description | Default |
| -------------------------- | ---------------------------------- | -------------------------------------------------------------------- | -------------------------- |
| `--interval` | `CHAOSKUBE_INTERVAL` | interval between pod terminations | 10m |
| `--dynamic-interval` | `CHAOSKUBE_DYNAMIC_INTERVAL` | enable dynamic interval calculation | false |
| `--dynamic-factor` | `CHAOSKUBE_DYNAMIC_FACTOR` | factor to adjust the dynamic interval | 1.0 |
| `--labels` | `CHAOSKUBE_LABELS` | label selector to filter pods by | (matches everything) |
| `--annotations` | `CHAOSKUBE_ANNOTATIONS` | annotation selector to filter pods by | (matches everything) |
| `--kinds` | `CHAOSKUBE_KINDS` | owner's kind selector to filter pods by | (all kinds) |
@@ -244,4 +275,4 @@ This project wouldn't be where it is with the ideas and help of several awesome

## Contributing

Feel free to create issues or submit pull requests.
202 changes: 176 additions & 26 deletions chaoskube/chaoskube.go
@@ -4,6 +4,7 @@ import (
"context"
"errors"
"fmt"
"math"
"regexp"
"time"

@@ -74,6 +75,11 @@ type Chaoskube struct {
Notifier notifier.Notifier
// namespace scope for the Kubernetes client
ClientNamespaceScope string

// Dynamic interval configuration
DynamicInterval bool
DynamicIntervalFactor float64
BaseInterval time.Duration
}

var (
@@ -97,51 +103,160 @@
// * a logger implementing logrus.FieldLogger to send log output to
// * what specific terminator to use to imbue chaos on victim pods
// * whether to enable/disable dry-run mode
func New(client kubernetes.Interface, labels, annotations, kinds, namespaces, namespaceLabels labels.Selector, includedPodNames, excludedPodNames *regexp.Regexp, excludedWeekdays []time.Weekday, excludedTimesOfDay []util.TimePeriod, excludedDaysOfYear []time.Time, timezone *time.Location, minimumAge time.Duration, logger log.FieldLogger, dryRun bool, terminator terminator.Terminator, maxKill int, notifier notifier.Notifier, clientNamespaceScope string, dynamicInterval bool, dynamicIntervalFactor float64, baseInterval time.Duration) *Chaoskube {
broadcaster := record.NewBroadcaster()
broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{Interface: client.CoreV1().Events(clientNamespaceScope)})
recorder := broadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "chaoskube"})

return &Chaoskube{
Client: client,
Labels: labels,
Annotations: annotations,
Kinds: kinds,
Namespaces: namespaces,
NamespaceLabels: namespaceLabels,
IncludedPodNames: includedPodNames,
ExcludedPodNames: excludedPodNames,
ExcludedWeekdays: excludedWeekdays,
ExcludedTimesOfDay: excludedTimesOfDay,
ExcludedDaysOfYear: excludedDaysOfYear,
Timezone: timezone,
MinimumAge: minimumAge,
Logger: logger,
DryRun: dryRun,
Terminator: terminator,
EventRecorder: recorder,
Now: time.Now,
MaxKill: maxKill,
Notifier: notifier,
ClientNamespaceScope: clientNamespaceScope,
DynamicInterval: dynamicInterval,
DynamicIntervalFactor: dynamicIntervalFactor,
BaseInterval: baseInterval,
}
}

// CalculateDynamicInterval calculates a dynamic interval based on current pod count
func (c *Chaoskube) CalculateDynamicInterval(ctx context.Context) time.Duration {

// Get total number of pods
listOptions := metav1.ListOptions{LabelSelector: c.Labels.String()}
podList, err := c.Client.CoreV1().Pods(c.ClientNamespaceScope).List(ctx, listOptions)

if err != nil {
c.Logger.WithField("err", err).Error("failed to get list of pods, using base interval")
return c.BaseInterval
}

pods, err := filterByNamespaces(podList.Items, c.Namespaces)
if err != nil {
c.Logger.WithField("err", err).Error("failed to filterByNamespaces, using base interval")
return c.BaseInterval
}

pods, err = filterPodsByNamespaceLabels(ctx, pods, c.NamespaceLabels, c.Client)
if err != nil {
c.Logger.WithField("err", err).Error("failed to filterPodsByNamespaceLabels, using base interval")
return c.BaseInterval
}

pods, err = filterByKinds(pods, c.Kinds)
if err != nil {
c.Logger.WithField("err", err).Error("failed to filterByKinds, using base interval")
return c.BaseInterval
}

pods = filterByAnnotations(pods, c.Annotations)

podCount := len(pods)

// Add debug logging for pod details
logger := c.Logger
// Check if debug logging is enabled
// We need to handle both *log.Logger and *log.Entry types that implement FieldLogger
debugEnabled := false
switch l := logger.(type) {
case *log.Logger:
debugEnabled = l.Level >= log.DebugLevel
case *log.Entry:
debugEnabled = l.Logger.Level >= log.DebugLevel
}

if debugEnabled {
c.Logger.Debug("Listing candidate pods for dynamic interval calculation:")
for i, pod := range pods {
c.Logger.WithFields(log.Fields{
"index": i,
"name": pod.Name,
"namespace": pod.Namespace,
"labels": pod.Labels,
"phase": pod.Status.Phase,
}).Debug("candidate pod")
}
}

// Guard against division by zero: all pods could have been filtered out
if podCount == 0 {
c.Logger.WithField("podCount", 0).Info("no pods found, using base interval")
return c.BaseInterval
}
// As a simple reference, we assume that every pod should be killed once during 10 working days (9-17h)
totalWorkingMinutes := 10 * 8 * 60

// Calculate raw interval in minutes
// Higher pod counts = shorter intervals, lower pod counts = longer intervals
rawIntervalMinutes := float64(totalWorkingMinutes) / (float64(podCount) * c.DynamicIntervalFactor)

// Round to nearest minute and ensure minimum of 1 minute
minutes := int(math.Max(1, math.Round(rawIntervalMinutes)))
roundedInterval := time.Duration(minutes) * time.Minute

// Provide detailed logging about the calculation
c.Logger.WithFields(log.Fields{
"podCount": podCount,
"totalWorkMinutes": totalWorkingMinutes,
"factor": c.DynamicIntervalFactor,
"rawIntervalMins": rawIntervalMinutes,
"roundedInterval": roundedInterval,
}).Info("calculated dynamic interval")

return roundedInterval
}

// Run continuously picks and terminates a victim pod at a given interval
// described by channel next. It returns when the given context is canceled.
func (c *Chaoskube) Run(ctx context.Context, next <-chan time.Time) {
for {
// If dynamic interval is enabled, calculate new interval before terminating victims
var waitDuration time.Duration
if c.DynamicInterval {
waitDuration = c.CalculateDynamicInterval(ctx)
metrics.CurrentIntervalSeconds.Set(waitDuration.Seconds())
}

if err := c.TerminateVictims(ctx); err != nil {
c.Logger.WithField("err", err).Error("failed to terminate victim")
metrics.ErrorsTotal.Inc()
}

c.Logger.Debug("sleeping...")
metrics.IntervalsTotal.Inc()

// Use the appropriate waiting mechanism
if c.DynamicInterval {
select {
case <-time.After(waitDuration):
// Continue to next iteration
case <-ctx.Done():
return
}
} else {
// Use original fixed interval from ticker
select {
case <-next:
case <-ctx.Done():
return
}
}
}
}
@@ -218,28 +333,63 @@ func (c *Chaoskube) Candidates(ctx context.Context) ([]v1.Pod, error) {
if err != nil {
return nil, err
}
c.Logger.WithFields(log.Fields{
"count": len(podList.Items),
}).Debug("Initial pod count after API list")

pods, err := filterByNamespaces(podList.Items, c.Namespaces)
if err != nil {
return nil, err
}
c.Logger.WithFields(log.Fields{
"count": len(pods),
}).Debug("Pod count after namespace filtering")

pods, err = filterPodsByNamespaceLabels(ctx, pods, c.NamespaceLabels, c.Client)
if err != nil {
return nil, err
}
c.Logger.WithFields(log.Fields{
"count": len(pods),
}).Debug("Pod count after namespace labels filtering")

pods, err = filterByKinds(pods, c.Kinds)
if err != nil {
return nil, err
}
c.Logger.WithFields(log.Fields{
"count": len(pods),
}).Debug("Pod count after kinds filtering")

pods = filterByAnnotations(pods, c.Annotations)
c.Logger.WithFields(log.Fields{
"count": len(pods),
}).Debug("Pod count after annotations filtering")

pods = filterByPhase(pods, v1.PodRunning)
c.Logger.WithFields(log.Fields{
"count": len(pods),
}).Debug("Pod count after phase filtering")

pods = filterTerminatingPods(pods)
c.Logger.WithFields(log.Fields{
"count": len(pods),
}).Debug("Pod count after terminating pods filtering")

pods = filterByMinimumAge(pods, c.MinimumAge, c.Now())
c.Logger.WithFields(log.Fields{
"count": len(pods),
}).Debug("Pod count after minimum age filtering")

pods = filterByPodName(pods, c.IncludedPodNames, c.ExcludedPodNames)
c.Logger.WithFields(log.Fields{
"count": len(pods),
}).Debug("Pod count after pod name filtering")

pods = filterByOwnerReference(pods)
c.Logger.WithFields(log.Fields{
"count": len(pods),
}).Debug("Final pod count after owner reference filtering")

return pods, nil
}