[WIP] Fix issue 251 peer control nodes #255

mark-scott-jr-dell · 2025-03-26T22:10:48Z

Why we need this PR

Unit test to show problem with Issue #251 as well as a potential fix.

Please feel free to disregard as much of or as little of this as you want, hopefully the unit test refactoring is seen as useful, but I believe it sufficiently highlights the core issue. I separated it out into multiple commits so you can look at branch history and see the individual updates prior to the squash commit that will happen to main. This is intended so you can see a delta of the actual meat & potatoes of the code fix itself which was relatively small and could be pulled in separately in the worst case scenario.

Note: I created an official account for my Dell contributions, still the same Mark who opened the issue 251 originally though.

Changes made

Update selfnoderemediation_controller_test.go in order to add control plane node configurability. To make this work right I tried to refactor it based on my understanding of ginkgo/gomega best practices, which fixed some overall issues with setup & teardown that I would see occasionally as unit test instabilities.
Add a new unit test, which shows the Remediation doesn't occur when node can contact peer control plane nodes even if they consider it unhealthy #251 scenario that we discovered in the lab, and fails
Implement a possible fix (if I'm understanding the intent right), and show that all new and existing unit tests still work.

Which issue(s) this PR fixes

Fixes #251

Test plan

Added new unit tests.

…#251 exists and is a real problem.

…y and getWorkerPeersResponse for issue medik8s#251

…mmand.

openshift-ci · 2025-03-26T22:10:56Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mark-scott-jr-dell
Once this PR has been reviewed and has the lgtm label, please assign razo7 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2025-03-26T22:10:59Z

Hi @mark-scott-jr-dell. Thanks for your PR.

I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

slintes

Sorry for the late first review, we are pretty busy...

slintes · 2025-04-01T08:56:27Z

pkg/apicheck/check.go

+		//canOtherControlPlanesBeReached := c.canOtherControlPlanesBeReached()
+		peersResponse = c.getPeersResponse(peers.ControlPlane)
+
+		// MES: This does not appear to have any actual relevance.  To me, it appears that all the necessary


IsControlPlaneHealthy() not being relevant is a bold statement ;)

However, before going on with a more detailed review, I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.

cc @mshitrit

I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.

+1

I think this change significantly changes current logic.
Couple of things I've noticed:

In the new code for CP nodes we completely ignore feedback of worker nodes. for most use case worker nodes can accurately report the status of the CP nodes and even though I expect the CP peers to report the same I'm not sure that ignoring the worker peers would be the best option.

diagnostic logic (i.e isDiagnosticsPassed()) is removed, which means the node can be falsely considered healthy for some use cases

IsControlPlaneHealthy() not being relevant is a bold statement ;)

Haha I did say "does not APPEAR to have any actual relevence" to be fair, based on my observations. I definitely left room in there for me to be wrong 😂.

However, before going on with a more detailed review, I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.

This would help a lot, my actual code changes were based on how I understood the expected flow to go, I attempted to interpret this based on the intention I saw in the code. My goal was to attempt not to change far too much and keep behaviors the same today since I personally don't know all the intentions, nor did I find it documented in detail anywhere (correct me if I'm wrong to be sure!).

I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.

+1

I think this change significantly changes current logic. Couple of things I've noticed:

In the new code for CP nodes we completely ignore feedback of worker nodes. for most use case worker nodes can accurately report the status of the CP nodes and even though I expect the CP peers to report the same I'm not sure that ignoring the worker peers would be the best option.

diagnostic logic (i.e isDiagnosticsPassed()) is removed, which means the node can be falsely considered healthy for some use cases

I'll look back at this later today to respond, especially w.r.t isDiagnosticsPassed, but I did spend some time walking through the flows and found multiple checks that basically referenced the same data multiple times, so I was attempting to simplify so it was clear what the code was doing. I felt it was unclear visually what was actually going on.

Ultimately, if we use the updated unit test just to prove out the core issue, I'm still good - our goal was to prove it so that it could be fixed, since it's a PITA to get logs in that case due to the nature of the cluster status at that point, so I personally picked the stretch goal of creating the unit test which would be better for the long term (in theory).

@mshitrit Not sure if you've had a chance to think about the logic flows here, but, if I'm reading things correctly, an update like this?

Attempt to get control plane responses

Attempt to get worker responses

Some combination of these should say that the node is healthy. Also add in isDiagnosticPassed

I'm willing to implement it and push a new PR, just want to be sure that I use the flow that you have in mind.

We are looking to pull the latest build whenever this is merged to main, and get some needed CVE fixes and other things, so I'd love to drive this to a close ASAP. I thought I had posted this message a week and a half ago but I guess it went into the ether.

Hi I think that pretty close, writing down both current and what I understand is the desired flow:

Attempt to get worker responses

If a worker return that response

else (assuming it's a control plane)

get a control plane response

Some combination of these should say that the node is healthy. Also add in isDiagnosticPassed

IIUC the fix is aiming for this flow:

Attempt to get control plane responses

Attempt to get worker responses

If a worker Some combination of these should say that the node is healthy (CP response is only relevant for some use cases of a worker node healthy response otherwise it can be ignored)

else Some combination of these should say that the node is healthy. Also add in isDiagnosticPassed

mshitrit · 2025-04-01T10:37:57Z

pkg/apicheck/check.go

 }

-func (c *ApiConnectivityCheck) getWorkerPeersResponse() peers.Response {
+func (c *ApiConnectivityCheck) getPeersResponse(role peers.Role) peers.Response {


I like this refactoring 👍

mark-scott-jr-dell added 4 commits March 26, 2025 16:57

Added new failing test to showcase bug ticket

0fc80b8

This test intentionally breaks, because it proves that defect medik8s…

450e787

…#251 exists and is a real problem.

Fixing breaking unit test with proposed changes to IsConsideredHealth…

06b4be6

…y and getWorkerPeersResponse for issue medik8s#251

gofmt updates to watchdog_test after running the local "make test" co…

eb2c8c8

…mmand.

openshift-ci bot added the do-not-merge/work-in-progress label Mar 26, 2025

openshift-ci bot requested review from clobrano and slintes March 26, 2025 22:10

openshift-ci bot added the needs-ok-to-test label Mar 26, 2025

slintes reviewed Apr 1, 2025

View reviewed changes

mshitrit reviewed Apr 1, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Fix issue 251 peer control nodes #255

[WIP] Fix issue 251 peer control nodes #255

mark-scott-jr-dell commented Mar 26, 2025 •

edited

Loading

openshift-ci bot commented Mar 26, 2025

openshift-ci bot commented Mar 26, 2025

slintes left a comment

slintes Apr 1, 2025

slintes Apr 1, 2025

mshitrit Apr 1, 2025

mark-scott-jr-dell Apr 2, 2025

mark-scott-jr-dell Apr 2, 2025

mark-scott-jr-dell Apr 15, 2025

mshitrit Apr 16, 2025 •

edited

Loading

mshitrit Apr 1, 2025

[WIP] Fix issue 251 peer control nodes #255

Are you sure you want to change the base?

[WIP] Fix issue 251 peer control nodes #255

Conversation

mark-scott-jr-dell commented Mar 26, 2025 • edited Loading

Why we need this PR

Changes made

Which issue(s) this PR fixes

Test plan

openshift-ci bot commented Mar 26, 2025

openshift-ci bot commented Mar 26, 2025

slintes left a comment

Choose a reason for hiding this comment

slintes Apr 1, 2025

Choose a reason for hiding this comment

slintes Apr 1, 2025

Choose a reason for hiding this comment

mshitrit Apr 1, 2025

Choose a reason for hiding this comment

mark-scott-jr-dell Apr 2, 2025

Choose a reason for hiding this comment

mark-scott-jr-dell Apr 2, 2025

Choose a reason for hiding this comment

mark-scott-jr-dell Apr 15, 2025

Choose a reason for hiding this comment

mshitrit Apr 16, 2025 • edited Loading

Choose a reason for hiding this comment

mshitrit Apr 1, 2025

Choose a reason for hiding this comment

mark-scott-jr-dell commented Mar 26, 2025 •

edited

Loading

mshitrit Apr 16, 2025 •

edited

Loading