[WIP] Fix issue 251 peer control nodes #255
base: main
Conversation
…#251 exists and is a real problem.
…y and getWorkerPeersResponse for issue medik8s#251
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: mark-scott-jr-dell
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Hi @mark-scott-jr-dell. Thanks for your PR. I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Sorry for the late first review, we are pretty busy...
```go
//canOtherControlPlanesBeReached := c.canOtherControlPlanesBeReached()
peersResponse = c.getPeersResponse(peers.ControlPlane)

// MES: This does not appear to have any actual relevance. To me, it appears that all the necessary
```
IsControlPlaneHealthy() not being relevant is a bold statement ;)
However, before going on with a more detailed review, I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.
cc @mshitrit
I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.
+1
I think this change significantly alters the current logic.
A couple of things I've noticed:
- In the new code for CP nodes we completely ignore the feedback of worker nodes. For most use cases worker nodes can accurately report the status of the CP nodes, and even though I expect the CP peers to report the same, I'm not sure that ignoring the worker peers is the best option.
- Diagnostic logic (i.e. isDiagnosticsPassed()) is removed, which means the node can be falsely considered healthy in some use cases.
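To make the second point concrete, here is a minimal sketch (purely illustrative, not the project's actual code) of the kind of gate that isDiagnosticsPassed() provides: peer feedback alone can miss local failures, so the diagnostics result acts as a veto on the peers' verdict.

```go
// Illustrative sketch only: even if peers report (or cannot disprove) health,
// a failing local diagnostic should prevent the node from being marked healthy.
func isNodeHealthy(peersSayHealthy, diagnosticsPassed bool) bool {
	return peersSayHealthy && diagnosticsPassed
}
```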
IsControlPlaneHealthy() not being relevant is a bold statement ;)
Haha, to be fair I did say "does not APPEAR to have any actual relevance", based on my observations. I definitely left room in there for me to be wrong 😂.
However, before going on with a more detailed review, I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.
This would help a lot. My actual code changes were based on how I understood the expected flow; I tried to interpret it from the intent I saw in the code. My goal was not to change too much and to keep behaviors the same as today, since I personally don't know all the intentions, nor did I find them documented in detail anywhere (correct me if I'm wrong to be sure!).
I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.
+1
I think this change significantly alters the current logic. A couple of things I've noticed:
- In the new code for CP nodes we completely ignore the feedback of worker nodes. For most use cases worker nodes can accurately report the status of the CP nodes, and even though I expect the CP peers to report the same, I'm not sure that ignoring the worker peers is the best option.
- Diagnostic logic (i.e. isDiagnosticsPassed()) is removed, which means the node can be falsely considered healthy in some use cases.
I'll look back at this later today to respond, especially w.r.t. isDiagnosticsPassed, but I did spend some time walking through the flows and found multiple checks that basically referenced the same data more than once, so I was attempting to simplify things to make it clear what the code was doing. I felt it was visually unclear what was actually going on.
Ultimately, if we use the updated unit test just to prove out the core issue, I'm still good. Our goal was to prove the issue so it could be fixed, since it's a PITA to get logs in that case due to the nature of the cluster status at that point. I personally picked the stretch goal of creating the unit test, which should be better for the long term (in theory).
@mshitrit Not sure if you've had a chance to think about the logic flows here, but if I'm reading things correctly, would an update like this work?
- Attempt to get control plane responses
- Attempt to get worker responses
- Some combination of these should say that the node is healthy. Also add in isDiagnosticsPassed()
I'm willing to implement it and push a new PR, just want to be sure that I use the flow that you have in mind.
We are looking to pull the latest build whenever this is merged to main, and get some needed CVE fixes and other things, so I'd love to drive this to a close ASAP. I thought I had posted this message a week and a half ago but I guess it went into the ether.
Hi, I think that's pretty close. Writing down both the current flow and what I understand to be the desired flow:
- Attempt to get worker responses
- If a worker, return that response
- Else (assuming it's a control plane):
  - Get a control plane response
  - Some combination of these should say that the node is healthy. Also add in isDiagnosticsPassed()
IIUC the fix is aiming for this flow:
- Attempt to get control plane responses
- Attempt to get worker responses
- If a worker: some combination of these should say that the node is healthy (the CP response is only relevant for some use cases of a worker node's healthy response; otherwise it can be ignored)
- Else: some combination of these should say that the node is healthy. Also add in isDiagnosticsPassed()
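Putting that desired flow into code, a rough sketch might look like the following. This is only one possible reading; helper and field names such as isControlPlane(), Response.IsHealthy, and the peers.Worker role constant are assumptions used for illustration, not necessarily the project's actual API.

```go
// Sketch of the desired flow described above (illustrative only).
func (c *ApiConnectivityCheck) isConsideredHealthy() bool {
	// Ask worker peers first; for a worker node their verdict decides.
	workerResp := c.getPeersResponse(peers.Worker)
	if !c.isControlPlane() {
		return workerResp.IsHealthy
	}

	// Control plane node: also ask control plane peers, combine both
	// verdicts, and keep the diagnostics gate so the node cannot be
	// falsely reported healthy when peer feedback is inconclusive.
	cpResp := c.getPeersResponse(peers.ControlPlane)
	return (workerResp.IsHealthy || cpResp.IsHealthy) && c.isDiagnosticsPassed()
}
```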
```diff
 }
 
-func (c *ApiConnectivityCheck) getWorkerPeersResponse() peers.Response {
+func (c *ApiConnectivityCheck) getPeersResponse(role peers.Role) peers.Response {
```
I like this refactoring 👍
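For context, a sketch of how call sites can use the role-parameterized helper (the peers.Worker constant is assumed here as the counterpart to peers.ControlPlane; illustrative only):

```go
// Before the refactoring, the check was hard-wired to worker peers:
//   workerPeersResponse := c.getWorkerPeersResponse()
//
// After the refactoring, the same helper serves either peer role:
workerPeersResponse := c.getPeersResponse(peers.Worker)
controlPlanePeersResponse := c.getPeersResponse(peers.ControlPlane)
```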
Why we need this PR
A unit test to show the problem behind issue #251, as well as a potential fix.
Please feel free to disregard as much or as little of this as you want; hopefully the unit test refactoring is seen as useful, and I believe it sufficiently highlights the core issue. I separated the work into multiple commits so you can look at the branch history and see the individual updates prior to the squash commit that will happen to main. This is intended so you can see a delta of the actual meat and potatoes of the code fix itself, which was relatively small and could be pulled in separately in the worst case scenario.
Note: I created an official account for my Dell contributions; I'm still the same Mark who originally opened issue #251.
Changes made
Which issue(s) this PR fixes
Fixes #251
Test plan