@mselim00
Contributor

Issue #, if available:

Description of changes:
Some improvements on top of #692

Switches to gauging that the instance has started shutting down by sending requests directly to kubelet's /healthz endpoint, instead of waiting for a pod exec to be killed with exit code 143. This makes the test more deterministic in two ways:

  1. A pod can exit with a 143 for various unrelated reasons, whereas a connection refused from kubelet confirms that kubelet has stopped listening on its serving port and was therefore most probably killed.
  2. Since the pod could also be killed well before kubelet, directly checking that kubelet is unresponsive ensures that the second pod cannot be created until a new boot is complete. In the prior version, pods were sometimes left in ContainerCreating while the node was rebooting if they were created early enough in the shutdown cycle.

Also switches from assigning pods to nodes directly to letting the scheduler place them. This ensures we're not bypassing any health checks the scheduler provides.

New sample happy path output:

```
2025/09/27 19:06:14 Starting quick test suite...
=== RUN   TestGracefulReboot
=== RUN   TestGracefulReboot/graceful-reboot
=== RUN   TestGracefulReboot/graceful-reboot/Node_gracefully_reboots
    graceful_reboot_test.go:73: Pod termination-canary-1758999974 is running on node ip-172-31-42-36.us-west-2.compute.internal
    graceful_reboot_test.go:91: Node ip-172-31-42-36.us-west-2.compute.internal is responding to /healthz
    graceful_reboot_test.go:95: Rebooting underlying instance i-079f0dd190956bc92 for node ip-172-31-42-36.us-west-2.compute.internal...
    graceful_reboot_test.go:104: Successfully triggered reboot of instance i-079f0dd190956bc92, waiting for kubelet to become unresponsive...
    graceful_reboot_test.go:135: Node ip-172-31-42-36.us-west-2.compute.internal has become unresponsive, waiting for the node to become schedulable again...
    graceful_reboot_test.go:154: Node ip-172-31-42-36.us-west-2.compute.internal became ready and schedulable within 2m5.207167609s!
--- PASS: TestGracefulReboot (132.33s)
    --- PASS: TestGracefulReboot/graceful-reboot (132.33s)
        --- PASS: TestGracefulReboot/graceful-reboot/Node_gracefully_reboots (132.26s)
PASS
ok      github.com/aws/aws-k8s-tester/test/cases/disruptive     132.365s
```

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@ndbaker1 ndbaker1 merged commit ecaddc7 into aws:main Sep 29, 2025
18 of 20 checks passed
@mselim00 mselim00 deleted the reboot-test branch September 30, 2025 17:09