Skip to content

Support Bundles: Chunked upload to Sled Agent #8559

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 14, 2025
Merged

Conversation

smklein
Copy link
Collaborator

@smklein smklein commented Jul 9, 2025

Support bundle collection in Nexus previously transferred the entire bundle as a streamed file from Nexus -> Sled Agent. Unfortunately, dropshot has upper bounds for streaming file sizes, and as we hit the 2 GiB boundary, we noticed that collection started to fail during transfers.

This PR modifies how Nexus transfers bundles to Sled Agents: it uses an incremental upload scheme, transferring
up to 1 GiB at a time, at a requested offset. This should allow for incremental transfer of files which are much larger than 2 GiB, and makes it easier for us to "re-do" failed operations later.

(Changing behavior on partial failure would be a welcome change, but is out-of-scope for this PR, which tries to only modify the behavior from a "oneshot" upload to a "start, transfer, finalize" API).

Added tests for:

  • Chunked files in the bg task ("are the requests split successfully")
  • Chunked files in the bundle storage implementation ("are we putting things where we should")
  • New lifetime quirks ("does start_creation restart the file? What about finalize after completion?")

Fixes #8556

@smklein smklein force-pushed the support-bundles-multipart branch 4 times, most recently from 80b04a6 to cbc04e3 Compare July 10, 2025 17:48
@david-crespo
Copy link
Contributor

The doc comments in the OpenAPI definition are already clean in this regard, but I want to caution you against using the word "multi-part" in anything user-facing because it already has a specific meaning in the context of file uploads:

https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Methods/POST#multipart_form_submission

@smklein
Copy link
Collaborator Author

smklein commented Jul 10, 2025

If it's any consolation, this is an internal API. I saw the multipart form submission interface, and candidly skipped it for simplicity (I don't want to deal with "delimiters" in support bundles; I just want to make requests at offsets).

EDIT: I'll change the PR name; this isn't really a term used in the change anyway

@smklein smklein changed the title Support Bundles: Multi-part upload Support Bundles: Chunked upload Jul 10, 2025
@smklein smklein changed the title Support Bundles: Chunked upload Support Bundles: Chunked upload to Sled Agent Jul 10, 2025
@@ -494,8 +494,12 @@ impl BundleCollection {

// Create the zipfile as a temporary file
let mut zipfile = tokio::fs::File::from_std(bundle_to_zipfile(&dir)?);
let total_len = zipfile.metadata().await?.len();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way John pointed this out on one of my previous PRs but you can SeekFrom::End(0) (since we later seek back to the beginning on line 503) to get the length without having to do an extra stat call. Either way works, just thought I would mention it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will end up being roughly equivalent; it's a "stat now, seek later" or "seek now, seek later", right? Either way, we need a syscall to get the length

@smklein smklein force-pushed the support-bundles-multipart branch from cbc04e3 to 52140ad Compare July 10, 2025 22:25
@smklein smklein force-pushed the support-bundles-multipart branch from 52140ad to 84be721 Compare July 10, 2025 22:28
@smklein smklein marked this pull request as ready for review July 10, 2025 22:28
@smklein smklein requested a review from augustuswm July 11, 2025 17:46
@augustuswm
Copy link
Contributor

First time looking at support bundle code and I'll give it a better read through, but is the ultimate idea that we are using disk storage as our tracking of the state of an upload? If we never call finalize, then the transferred data would just sit there correct?

@smklein
Copy link
Collaborator Author

smklein commented Jul 14, 2025

First time looking at support bundle code and I'll give it a better read through, but is the ultimate idea that we are using disk storage as our tracking of the state of an upload? If we never call finalize, then the transferred data would just sit there correct?

This is correct; the transferred data would exist on the sled agent with the name bundle.zip.tmp instead of bundle.zip, to indicate that the transfer has not completed. It would exist on the dataset specifically provisioned for the support bundle's storage (finalizing is just renaming it in the same directory from bundle.zip.tmp to bundle.zip).

I think this is okay:

  • If the transfer is still ongoing, we're only using space that has already been allocated specifically to the support bundle dataset
  • Regardless of whether or not we "finalize" the transfer, restart the transfer process, or delete the bundle entirely: we're using this dedicated-to-the-bundle dataset for storage, until nexus decides we don't need to use it anymore.

The question of "when and how do we release this storage" is up to Nexus: The background task only considers the bundle "deleted" if the sled agent can affirmatively say: "this bundle has been deleted, or no longer exists". This could happen before "finalize" or after "finalize" -- either way, we don't mark the bundle deleted in the DB until either (1) the sled responds affirmatively, or (2) the disk being used by the sled agent has been expunged

@augustuswm
Copy link
Contributor

Awesome, thanks for the explanation! That makes sense to me.

@smklein smklein merged commit 248e7ed into main Jul 14, 2025
17 checks passed
@smklein smklein deleted the support-bundles-multipart branch July 14, 2025 18:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support bundle stuck in collecting status (should use multi-part upload to sled agent)
4 participants