-
Notifications
You must be signed in to change notification settings - Fork 45
[MGS] Add endpoints for host phase 1 flash hashing #8593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: john/sp-sim-host-flash-hashing
Are you sure you want to change the base?
[MGS] Add endpoints for host phase 1 flash hashing #8593
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! I just have a question about the sp_component_hash_firmware_start
endpoint
// The SP (reasonably!) returns a `HashInProgress` error if we try | ||
// to start hashing while hashing is being calculated, but we're | ||
// presenting an idempotent "start hashing if it isn't started" | ||
// endpoint instead. Swallow that error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible for the hashing to fail after it has reported it's in progress? If that happens we might still get a HttpResponseUpdatedNoContent
even though the hashing failed no? What would the consequences of this be?
if sp.start_host_flash_hash(firmware_slot)
returns HfError::HashInProgress
, would it make sense to call sp.get_host_flash_hash(firmware_slot)
on a loop with a timeout until we get an Ok(), or another error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible for the hashing to fail after it has reported it's in progress?
Yeah, definitely, but it would have a different error code.
If that happens we might still get a
HttpResponseUpdatedNoContent
even though the hashing failed no?
No, I don't think so - the only error we turn into HttpResponseUpdatedNoContent
is HfError::HashInProgress
; any other error turns into an SpCommunicationFailed
in the second arm of the match.
If the SP gets stuck and returns HashInProgress
indefinitely, we'd keep returning HttpResponseUpdatedNoContent
from this method, but presumably a client will be polling the get
endpoint with a timeout. Which gets to your second point!
if
sp.start_host_flash_hash(firmware_slot)
returnsHfError::HashInProgress
, would it make sense to callsp.get_host_flash_hash(firmware_slot)
on a loop with a timeout until we get an Ok(), or another error?
Maybe - I definitely considered this! At some level someone has to do exactly that, and it's a question of who:
a) Put it in MGS - tempting, but now MGS has to have (or accept from its client) a timeout for that loop.
b) Expose these two endpoints as-is from MGS and make Nexus do the looping + timeout
I think I slightly prefer b, just because of a bias to have MGS do as little as possible in principle? Nexus already has to deal with looping and timeouts for all kinds of update-related things, so adding one more seems better than putting one in MGS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but presumably a client will be polling the get endpoint with a timeout. Which gets to your second point!
Ah! ok, that's the bit of information I was missing here. Can we add a comment explaining this?
I think I slightly prefer b, just because of a bias to have MGS do as little as possible in principle? Nexus already has to deal with looping and timeouts for all kinds of update-related things, so adding one more seems better than putting one in MGS.
Yeah, that makes sense to me as well
// The SP (reasonably!) returns a `HashInProgress` error if we try | ||
// to start hashing while hashing is being calculated, but we're | ||
// presenting an idempotent "start hashing if it isn't started" | ||
// endpoint instead. Swallow that error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but presumably a client will be polling the get endpoint with a timeout. Which gets to your second point!
Ah! ok, that's the bit of information I was missing here. Can we add a comment explaining this?
I think I slightly prefer b, just because of a bias to have MGS do as little as possible in principle? Nexus already has to deal with looping and timeouts for all kinds of update-related things, so adding one more seems better than putting one in MGS.
Yeah, that makes sense to me as well
This builds on #8584. The main point of this PR is adding two new MGS endpoints ("start" and "get status" for the async host phase1 hashing operation); the bulk of the diff is adding a test that exercises the simulator to show the expected behavior of these endpoints.