Skip to content

[MGS] Add endpoints for host phase 1 flash hashing #8593

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: john/sp-sim-host-flash-hashing
Choose a base branch
from

Conversation

jgallagher
Copy link
Contributor

This builds on #8584. The main point of this PR is adding two new MGS endpoints ("start" and "get status" for the async host phase1 hashing operation); the bulk of the diff is adding a test that exercises the simulator to show the expected behavior of these endpoints.

Copy link
Contributor

@karencfv karencfv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! I just have a question about the sp_component_hash_firmware_start endpoint

Comment on lines +562 to +565
// The SP (reasonably!) returns a `HashInProgress` error if we try
// to start hashing while hashing is being calculated, but we're
// presenting an idempotent "start hashing if it isn't started"
// endpoint instead. Swallow that error.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for the hashing to fail after it has reported it's in progress? If that happens we might still get a HttpResponseUpdatedNoContent even though the hashing failed no? What would the consequences of this be?

if sp.start_host_flash_hash(firmware_slot) returns HfError::HashInProgress, would it make sense to call sp.get_host_flash_hash(firmware_slot) on a loop with a timeout until we get an Ok(), or another error?

Copy link
Contributor Author

@jgallagher jgallagher Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for the hashing to fail after it has reported it's in progress?

Yeah, definitely, but it would have a different error code.

If that happens we might still get a HttpResponseUpdatedNoContent even though the hashing failed no?

No, I don't think so - the only error we turn into HttpResponseUpdatedNoContent is HfError::HashInProgress; any other error turns into an SpCommunicationFailed in the second arm of the match.

If the SP gets stuck and returns HashInProgress indefinitely, we'd keep returning HttpResponseUpdatedNoContent from this method, but presumably a client will be polling the get endpoint with a timeout. Which gets to your second point!

if sp.start_host_flash_hash(firmware_slot) returns HfError::HashInProgress, would it make sense to call sp.get_host_flash_hash(firmware_slot) on a loop with a timeout until we get an Ok(), or another error?

Maybe - I definitely considered this! At some level someone has to do exactly that, and it's a question of who:

a) Put it in MGS - tempting, but now MGS has to have (or accept from its client) a timeout for that loop.
b) Expose these two endpoints as-is from MGS and make Nexus do the looping + timeout

I think I slightly prefer b, just because of a bias to have MGS do as little as possible in principle? Nexus already has to deal with looping and timeouts for all kinds of update-related things, so adding one more seems better than putting one in MGS.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but presumably a client will be polling the get endpoint with a timeout. Which gets to your second point!

Ah! ok, that's the bit of information I was missing here. Can we add a comment explaining this?

I think I slightly prefer b, just because of a bias to have MGS do as little as possible in principle? Nexus already has to deal with looping and timeouts for all kinds of update-related things, so adding one more seems better than putting one in MGS.

Yeah, that makes sense to me as well

Comment on lines +562 to +565
// The SP (reasonably!) returns a `HashInProgress` error if we try
// to start hashing while hashing is being calculated, but we're
// presenting an idempotent "start hashing if it isn't started"
// endpoint instead. Swallow that error.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but presumably a client will be polling the get endpoint with a timeout. Which gets to your second point!

Ah! ok, that's the bit of information I was missing here. Can we add a comment explaining this?

I think I slightly prefer b, just because of a bias to have MGS do as little as possible in principle? Nexus already has to deal with looping and timeouts for all kinds of update-related things, so adding one more seems better than putting one in MGS.

Yeah, that makes sense to me as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants