[MGS] Add endpoints for host phase 1 flash hashing #8593

jgallagher · 2025-07-14T20:54:35Z

This builds on #8584. The main point of this PR is adding two new MGS endpoints ("start" and "get status" for the async host phase 1 hashing operation); the bulk of the diff is a test that exercises the simulator to show the expected behavior of these endpoints.

karencfv

Looks good to me! I just have a question about the sp_component_hash_firmware_start endpoint.

karencfv · 2025-07-15T00:22:27Z

gateway/src/http_entrypoints.rs

+            // The SP (reasonably!) returns a `HashInProgress` error if we try
+            // to start hashing while hashing is being calculated, but we're
+            // presenting an idempotent "start hashing if it isn't started"
+            // endpoint instead. Swallow that error.


Is it possible for the hashing to fail after it has reported it's in progress? If that happens, we might still get an HttpResponseUpdatedNoContent even though the hashing failed, no? What would the consequences of this be?

If sp.start_host_flash_hash(firmware_slot) returns HfError::HashInProgress, would it make sense to call sp.get_host_flash_hash(firmware_slot) in a loop with a timeout until we get an Ok() or another error?

Is it possible for the hashing to fail after it has reported it's in progress?

Yeah, definitely, but it would have a different error code.

If that happens, we might still get an HttpResponseUpdatedNoContent even though the hashing failed, no?

No, I don't think so - the only error we turn into HttpResponseUpdatedNoContent is HfError::HashInProgress; any other error turns into an SpCommunicationFailed in the second arm of the match.
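A minimal sketch of the mapping described above, not the code from this PR. The `HfError` enum here is a simplified stand-in (only `HashInProgress` is named in this thread; `Other` is hypothetical), and `StartResponse` and `map_start_result` are illustrative names standing in for the real HTTP response and error types:

```rust
/// Simplified stand-in for the SP's host-flash error type. Only
/// `HashInProgress` is named in this thread; `Other` is hypothetical.
#[derive(Debug, Clone, PartialEq, Eq)]
enum HfError {
    HashInProgress,
    Other(String),
}

/// Hypothetical stand-in for the outcome of the "start hashing" endpoint.
#[derive(Debug, PartialEq, Eq)]
enum StartResponse {
    /// Corresponds to returning `HttpResponseUpdatedNoContent`.
    UpdatedNoContent,
    /// Corresponds to the `SpCommunicationFailed` error arm.
    SpCommunicationFailed(String),
}

/// Idempotent "start hashing if it isn't started": success and
/// `HashInProgress` both map to the same response; every other error
/// is surfaced to the caller.
fn map_start_result(result: Result<(), HfError>) -> StartResponse {
    match result {
        Ok(()) | Err(HfError::HashInProgress) => StartResponse::UpdatedNoContent,
        Err(other) => StartResponse::SpCommunicationFailed(format!("{:?}", other)),
    }
}
```

Under these assumptions, a hashing failure that surfaces as an error from the start call never collapses into the success response; only the "already running" case does.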

If the SP gets stuck and returns HashInProgress indefinitely, we'd keep returning HttpResponseUpdatedNoContent from this method, but presumably a client will be polling the get endpoint with a timeout. Which gets to your second point!

If sp.start_host_flash_hash(firmware_slot) returns HfError::HashInProgress, would it make sense to call sp.get_host_flash_hash(firmware_slot) in a loop with a timeout until we get an Ok() or another error?

Maybe - I definitely considered this! At some level someone has to do exactly that, and it's a question of who:

a) Put it in MGS. Tempting, but now MGS has to have (or accept from its client) a timeout for that loop.
b) Expose these two endpoints as-is from MGS and make Nexus do the looping and the timeout.

I think I slightly prefer b, just because of a bias to have MGS do as little as possible in principle? Nexus already has to deal with looping and timeouts for all kinds of update-related things, so adding one more seems better than putting one in MGS.
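For illustration, option (b) on the client side might look roughly like the following. This is a sketch under stated assumptions, not Nexus code: `HashStatus`, `PollError`, and `poll_hash_until_done` are hypothetical names, the "get status" call is modeled as a closure, and the hash is an opaque byte vector.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical shape of what the "get status" endpoint reports.
#[derive(Debug, Clone, PartialEq, Eq)]
enum HashStatus {
    InProgress,
    Done(Vec<u8>),
}

/// Hypothetical client-side error for the polling loop.
#[derive(Debug, PartialEq, Eq)]
enum PollError {
    /// The hash was still in progress when the deadline passed.
    TimedOut,
    /// The status call itself failed.
    Failed(String),
}

/// Poll `get_status` every `interval` until it reports a finished hash,
/// returns an error, or `timeout` elapses.
fn poll_hash_until_done<F>(
    mut get_status: F,
    timeout: Duration,
    interval: Duration,
) -> Result<Vec<u8>, PollError>
where
    F: FnMut() -> Result<HashStatus, String>,
{
    let deadline = Instant::now() + timeout;
    loop {
        match get_status() {
            Ok(HashStatus::Done(hash)) => return Ok(hash),
            Ok(HashStatus::InProgress) => {}
            Err(e) => return Err(PollError::Failed(e)),
        }
        if Instant::now() >= deadline {
            return Err(PollError::TimedOut);
        }
        thread::sleep(interval);
    }
}
```

With the timeout owned by the caller, an SP that reports "in progress" indefinitely shows up as `TimedOut` on the client rather than requiring MGS to pick a deadline.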

karencfv · 2025-07-15T21:42:21Z

but presumably a client will be polling the get endpoint with a timeout. Which gets to your second point!

Ah! OK, that's the bit of information I was missing here. Can we add a comment explaining this?

I think I slightly prefer b, just because of a bias to have MGS do as little as possible in principle? Nexus already has to deal with looping and timeouts for all kinds of update-related things, so adding one more seems better than putting one in MGS.

Yeah, that makes sense to me as well.

jgallagher added 5 commits July 14, 2025 10:36

add mgs endpoints for host flash hashing

5a33179

custom error type for hash response

6c36953

return hash status instead of custom errors

d3e4313

flesh out test_host_phase1_hashing()

6a01f8e

test cleanup

b6724da

karencfv reviewed Jul 15, 2025

karencfv approved these changes Jul 15, 2025
