Conversation

Contributor
@mselim00 mselim00 commented Sep 26, 2025

Issue #, if available:

Description of changes:
Adds a new test case that reboots the instance, intended to help sanity check that reboots are graceful. It's quite difficult to determine the status of an instance after a reboot from AWS APIs alone (the RebootInstances API itself is asynchronous, and EC2 status checks and SSM agent connectivity status are reconciled sparsely, so they sometimes miss state changes). Ultimately, something must be executed on the node to determine whether or not it's running; this test uses a pod to do so, to keep things as Kubernetes-oriented as possible.

The test is based on the assumption that a pod exec requires kubelet responsiveness, and therefore an exit code of 143 from a command executing within a pod decisively indicates that the node is shutting down. This is a bit of a simplification, since any SIGTERM would lead to this state, but given the timing and the presumed clean state of the instance, it's taken to mean the reboot is starting. After this, a second pod is created; it follows from the prior state that this pod should not start running until after the node boots, since the kubelet was already non-responsive or evicting existing pods.
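
For illustration, a minimal sketch of the exit-code-143 check using client-go is below. The pod name, namespace, the `sleep infinity` command, and the use of `CodeExitError` to read the exit code are assumptions about how such a check could be written, not necessarily the exact code in this PR:

```go
// Hypothetical sketch: exec a blocking command in the canary pod and treat
// exit code 143 (128 + SIGTERM) as the signal that the node began shutting down.
package reboot

import (
	"context"
	"errors"
	"fmt"
	"io"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
	utilexec "k8s.io/client-go/util/exec"
)

func waitForSIGTERM(ctx context.Context, cfg *rest.Config, client kubernetes.Interface, namespace, pod string) error {
	// Exec a command that blocks forever; it only returns once the kubelet kills it.
	req := client.CoreV1().RESTClient().Post().
		Resource("pods").Namespace(namespace).Name(pod).SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Command: []string{"sleep", "infinity"},
			Stdout:  true,
			Stderr:  true,
		}, scheme.ParameterCodec)

	executor, err := remotecommand.NewSPDYExecutor(cfg, "POST", req.URL())
	if err != nil {
		return err
	}

	err = executor.StreamWithContext(ctx, remotecommand.StreamOptions{
		Stdout: io.Discard,
		Stderr: io.Discard,
	})
	if err == nil {
		return errors.New("exec exited cleanly before any SIGTERM was observed")
	}

	// Non-zero exit codes surface as a CodeExitError; 143 means the container got SIGTERM.
	var exitErr utilexec.CodeExitError
	if errors.As(err, &exitErr) && exitErr.Code == 143 {
		return nil // the node has started shutting down
	}
	return fmt.Errorf("exec ended without SIGTERM: %w", err)
}
```

The boot-detection pod then only needs to reach Running, which cannot happen until the kubelet is responsive again after the reboot.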

Sample happy path output:

2025/09/26 23:36:11 Starting quick test suite...
=== RUN   TestGracefulReboot
=== RUN   TestGracefulReboot/graceful-reboot
=== RUN   TestGracefulReboot/graceful-reboot/Node_gracefully_reboots
    graceful_reboot_test.go:98: Node ip-172-31-42-36.us-west-2.compute.internal corresponds to EC2 instance: i-079f0dd190956bc92
    graceful_reboot_test.go:116: Started exec into pod termination-canary-1758929771
    graceful_reboot_test.go:106: Rebooting instance i-079f0dd190956bc92 to test graceful reboot...
    graceful_reboot_test.go:113: Successfully initiated reboot of instance i-079f0dd190956bc92, waiting for pod termination-canary-1758929771 to terminate...
    graceful_reboot_test.go:123: Pod termination-canary-1758929771 was terminated
    graceful_reboot_test.go:129: Waiting up to 5 minutes for node ip-172-31-42-36.us-west-2.compute.internal to become schedulable again
    graceful_reboot_test.go:146: Node ip-172-31-42-36.us-west-2.compute.internal became ready and schedulable within 2m5.597110799s!
=== NAME  TestGracefulReboot/graceful-reboot
    graceful_reboot_test.go:153: Successfully cleaned up pod termination-canary-1758929771
    graceful_reboot_test.go:159: Successfully cleaned up pod boot-detection-1758929771
--- PASS: TestGracefulReboot (132.70s)
    --- PASS: TestGracefulReboot/graceful-reboot (132.70s)
        --- PASS: TestGracefulReboot/graceful-reboot/Node_gracefully_reboots (132.64s)
PASS
ok      github.com/aws/aws-k8s-tester/test/cases/disruptive     132.725s

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@mselim00 mselim00 requested a review from ndbaker1 September 26, 2025 22:37
@mselim00 mselim00 force-pushed the reboot-test branch 4 times, most recently from 7d38b5b to bfa6444 on September 26, 2025 22:56
Comment on lines 126 to 115
// Attempt to execute a blocking command in the pod until we get a 143, which would indicate a SIGTERM.
// This is a reliable way to check termination since it requires a direct response from the kubelet
var execOut, execErr bytes.Buffer
Contributor

How reliable exactly? I just want to know whether this is deterministic or highly probable.

Contributor Author

@mselim00 mselim00 Sep 26, 2025

For now I want to say highly probable, but I think deterministic given a clean environment. The main risk, I think, is that the node starts rebooting before the exec. I've moved the reboot into the background with a 1-second delay to make that even less likely, but we can make this part completely deterministic if we see it flake in the future.

Contributor Author

Go 1.25 seems to validate against non-test goroutines in a test context, so I went back to doing this serially. The RebootInstances API call is async and reboots take time, so it's still very unlikely that the exec happens too late even when doing this serially. Can revise as needed though.
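
As a rough illustration of why the serial ordering is still safe, here is a sketch of the reboot call with the AWS SDK for Go v2; the helper name and error handling are assumptions, not the test's actual code:

```go
// Hypothetical sketch: RebootInstances only queues the reboot and returns
// immediately, so an exec started just before this call is very unlikely to
// miss the shutdown even when everything runs serially.
package reboot

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

func rebootInstance(ctx context.Context, instanceID string) error {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return err
	}

	// Asynchronous request: EC2 acknowledges the reboot and performs it later.
	_, err = ec2.NewFromConfig(cfg).RebootInstances(ctx, &ec2.RebootInstancesInput{
		InstanceIds: []string{instanceID},
	})
	return err
}
```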

@mselim00 mselim00 force-pushed the reboot-test branch 7 times, most recently from 5c57096 to 2372094 on September 26, 2025 23:53
@mselim00 mselim00 force-pushed the reboot-test branch 3 times, most recently from a6f9693 to c7c9f86 on September 27, 2025 00:32
@mselim00 mselim00 merged commit d899392 into aws:main Sep 27, 2025
9 of 10 checks passed
@mselim00 mselim00 deleted the reboot-test branch September 27, 2025 00:53