fix: make reboot test more deterministic #693

mselim00 · 2025-09-27T19:14:46Z

Issue #, if available:

Description of changes:
Some improvements on top of #692

Switches to probing kubelet's /healthz endpoint directly, instead of waiting for a pod exec to be killed with exit code 143, to detect that the instance has started shutting down. This makes the test more deterministic in two ways:

- A pod can exit with a 143 for various unrelated reasons, but a connection refused from kubelet confirms that kubelet has stopped listening on its serving port and was therefore most likely killed.
- Since the pod could also be killed well before kubelet, directly checking that kubelet is unresponsive ensures that the second pod cannot be created until a new boot is complete. In the prior version, pods were sometimes left in ContainerCreating while the node was rebooting if they were created early enough in the shutdown cycle.

Also switches from assigning pods to nodes directly to letting the scheduler place them. This ensures we're not bypassing any health checks the scheduler provides.

New sample happy path output:

2025/09/27 19:06:14 Starting quick test suite...
=== RUN   TestGracefulReboot
=== RUN   TestGracefulReboot/graceful-reboot
=== RUN   TestGracefulReboot/graceful-reboot/Node_gracefully_reboots
    graceful_reboot_test.go:73: Pod termination-canary-1758999974 is running on node ip-172-31-42-36.us-west-2.compute.internal
    graceful_reboot_test.go:91: Node ip-172-31-42-36.us-west-2.compute.internal is responding to /healthz
    graceful_reboot_test.go:95: Rebooting underlying instance i-079f0dd190956bc92 for node ip-172-31-42-36.us-west-2.compute.internal...
    graceful_reboot_test.go:104: Successfully triggered reboot of instance i-079f0dd190956bc92, waiting for kubelet to become unresponsive...
    graceful_reboot_test.go:135: Node ip-172-31-42-36.us-west-2.compute.internal has become unresponsive, waiting for the node to become schedulable again...
    graceful_reboot_test.go:154: Node ip-172-31-42-36.us-west-2.compute.internal became ready and schedulable within 2m5.207167609s!
--- PASS: TestGracefulReboot (132.33s)
    --- PASS: TestGracefulReboot/graceful-reboot (132.33s)
        --- PASS: TestGracefulReboot/graceful-reboot/Node_gracefully_reboots (132.26s)
PASS
ok      github.com/aws/aws-k8s-tester/test/cases/disruptive     132.365s

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

fix: make reboot test more deterministic

0a89940

ndbaker1 approved these changes Sep 29, 2025


ndbaker1 merged commit ecaddc7 into aws:main Sep 29, 2025
18 of 20 checks passed

mselim00 deleted the reboot-test branch September 30, 2025 17:09
