Skip to content

[WIP] Fix issue 251 peer control nodes #255

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

mark-scott-jr-dell
Copy link

@mark-scott-jr-dell mark-scott-jr-dell commented Mar 26, 2025

Why we need this PR

Unit test to show problem with Issue #251 as well as a potential fix.

Please feel free to disregard as much of or as little of this as you want, hopefully the unit test refactoring is seen as useful, but I believe it sufficiently highlights the core issue. I separated it out into multiple commits so you can look at branch history and see the individual updates prior to the squash commit that will happen to main. This is intended so you can see a delta of the actual meat & potatoes of the code fix itself which was relatively small and could be pulled in separately in the worst case scenario.

Note: I created an official account for my Dell contributions, still the same Mark who opened the issue 251 originally though.

Changes made

  • Update selfnoderemediation_controller_test.go in order to add control plane node configurability. To make this work right I tried to refactor it based on my understanding of ginkgo/gomega best practices, which fixed some overall issues with setup & teardown that I would see occasionally as unit test instabilities.
  • Add a new unit test, which shows the Remediation doesn't occur when node can contact peer control plane nodes even if they consider it unhealthy #251 scenario that we discovered in the lab, and fails
  • Implement a possible fix (if I'm understanding the intent right), and show that all new and existing unit tests still work.

Which issue(s) this PR fixes

Fixes #251

Test plan

  • Added new unit tests.

Copy link
Contributor

openshift-ci bot commented Mar 26, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mark-scott-jr-dell
Once this PR has been reviewed and has the lgtm label, please assign razo7 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

openshift-ci bot commented Mar 26, 2025

Hi @mark-scott-jr-dell. Thanks for your PR.

I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Member

@slintes slintes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late first review, we are pretty busy...

//canOtherControlPlanesBeReached := c.canOtherControlPlanesBeReached()
peersResponse = c.getPeersResponse(peers.ControlPlane)

// MES: This does not appear to have any actual relevance. To me, it appears that all the necessary
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IsControlPlaneHealthy() not being relevant is a bold statement ;)

However, before going on with a more detailed review, I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.

+1

I think this change significantly changes current logic.
Couple of things I've noticed:

  • In the new code for CP nodes we completely ignore feedback of worker nodes. for most use case worker nodes can accurately report the status of the CP nodes and even though I expect the CP peers to report the same I'm not sure that ignoring the worker peers would be the best option.
  • diagnostic logic (i.e isDiagnosticsPassed()) is removed, which means the node can be falsely considered healthy for some use cases

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IsControlPlaneHealthy() not being relevant is a bold statement ;)

Haha I did say "does not APPEAR to have any actual relevence" to be fair, based on my observations. I definitely left room in there for me to be wrong 😂.

However, before going on with a more detailed review, I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.

This would help a lot, my actual code changes were based on how I understood the expected flow to go, I attempted to interpret this based on the intention I saw in the code. My goal was to attempt not to change far too much and keep behaviors the same today since I personally don't know all the intentions, nor did I find it documented in detail anywhere (correct me if I'm wrong to be sure!).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.

+1

I think this change significantly changes current logic. Couple of things I've noticed:

  • In the new code for CP nodes we completely ignore feedback of worker nodes. for most use case worker nodes can accurately report the status of the CP nodes and even though I expect the CP peers to report the same I'm not sure that ignoring the worker peers would be the best option.
  • diagnostic logic (i.e isDiagnosticsPassed()) is removed, which means the node can be falsely considered healthy for some use cases

I'll look back at this later today to respond, especially w.r.t isDiagnosticsPassed, but I did spend some time walking through the flows and found multiple checks that basically referenced the same data multiple times, so I was attempting to simplify so it was clear what the code was doing. I felt it was unclear visually what was actually going on.

Ultimately, if we use the updated unit test just to prove out the core issue, I'm still good - our goal was to prove it so that it could be fixed, since it's a PITA to get logs in that case due to the nature of the cluster status at that point, so I personally picked the stretch goal of creating the unit test which would be better for the long term (in theory).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mshitrit Not sure if you've had a chance to think about the logic flows here, but, if I'm reading things correctly, an update like this?

  1. Attempt to get control plane responses
  2. Attempt to get worker responses
  3. Some combination of these should say that the node is healthy. Also add in isDiagnosticPassed

I'm willing to implement it and push a new PR, just want to be sure that I use the flow that you have in mind.

We are looking to pull the latest build whenever this is merged to main, and get some needed CVE fixes and other things, so I'd love to drive this to a close ASAP. I thought I had posted this message a week and a half ago but I guess it went into the ether.

Copy link
Member

@mshitrit mshitrit Apr 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi I think that pretty close, writing down both current and what I understand is the desired flow:

  1. Attempt to get worker responses
  2. If a worker return that response
  3. else (assuming it's a control plane)
    1. get a control plane response
    2. Some combination of these should say that the node is healthy. Also add in isDiagnosticPassed

IIUC the fix is aiming for this flow:

  1. Attempt to get control plane responses
  2. Attempt to get worker responses
  3. If a worker Some combination of these should say that the node is healthy (CP response is only relevant for some use cases of a worker node healthy response otherwise it can be ignored)
  4. else Some combination of these should say that the node is healthy. Also add in isDiagnosticPassed

}

func (c *ApiConnectivityCheck) getWorkerPeersResponse() peers.Response {
func (c *ApiConnectivityCheck) getPeersResponse(role peers.Role) peers.Response {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this refactoring 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remediation doesn't occur when node can contact peer control plane nodes even if they consider it unhealthy
3 participants