-
Notifications
You must be signed in to change notification settings - Fork 250
Optimize blobFS transfers by eliminating redundant API calls and enabling parallelism #3288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: mover/c2c-stage
Are you sure you want to change the base?
Conversation
When isDFS is true (HNS-enabled accounts), optimize blob traverser to: 1. Skip GetProperties calls for virtual directories - they're always real folders in HNS accounts 2. Skip checking blob metadata for hdi_isfolder - folders are represented as virtual directories, not as blobs This significantly improves enumeration performance for blobFS to blobFS transfers with large folder structures by eliminating many additional API calls. Co-authored-by: otendolkar <185434248+otendolkar@users.noreply.github.com>
|
@copilot This is mostly for enumeration right, is data transfer also slowed down for HNS folders? If not just let me know |
Correct, this optimization only affects the enumeration phase. The blob traverser is responsible for discovering and listing folders/files before transfer begins. My changes eliminate the GetProperties API calls during this discovery phase for HNS-enabled accounts. The actual data transfer speed is not impacted by these changes - that happens later in the transfer engine after enumeration completes. |
|
@copilot would it be good to find places in the transfer itself that might be slowed down for hns folders and improve that? Can you find |
I found a performance bottleneck in the transfer phase for HNS folders: Issue: In Impact: For datasets with many files, this means N extra API calls (where N = number of files), as each file transfer calls Current behavior: // Line 155: Extra API call per file
err := u.doEnsureDirExists(u.parentDirClient)
// Line 162: File creation that could auto-create parent directories
_, err = u.getFileClient().Create(...)The Potential optimization: The folderCreationTracker already has an
However, this requires careful analysis of the folder tracking logic to ensure overwrite and folder property semantics remain correct. Would you like me to implement this optimization? |
|
@copilot Is there something that changes less of the code, like maybe some parallelism we could exploit? |
Yes, there's a simpler optimization using existing parallelism mechanisms. Current behavior: The folder creation tracker uses a lock to serialize folder creation for BlobFS to BlobFS transfers. This is only skipped for Local-to-BlobFS transfers (line 21-24 in Optimization: Extend the lock-skipping logic to include BlobFS-to-BlobFS transfers. This allows parallel file uploads to the same directory without serialization. Why it's safe:
Code change (single line modification in // Line 21-24, extend to include BlobFS source:
skipFolderCreationLock := buildmode.IsMover &&
(plan.FromTo.From() == common.ELocation.Local() || plan.FromTo.From() == common.ELocation.BlobFS()) &&
(plan.FromTo.To() == common.ELocation.File() ||
plan.FromTo.To() == common.ELocation.Blob() ||
plan.FromTo.To() == common.ELocation.BlobFS())This enables better parallelism with minimal code changes. Would you like me to implement this? |
|
@copilot Could you explain it very clearly in detail once more |
Detailed Explanation of the Parallelism OptimizationThe ProblemWhen transferring files in BlobFS, each file transfer must ensure its parent directory exists. Currently, this happens with locking that serializes operations: Step-by-step for each file transfer:
The bottleneck: If 100 files are being uploaded to Current Lock Logic (line 21-24 in
|
|
@copilot So is that lock there because other paths are not idempotent? |
Correct. The lock exists because different storage types have different guarantees: BlobFS (HNS/ADLS Gen2):
Azure Files:
Blob Storage (non-HNS):
The lock was added as a conservative safety measure to handle the most complex case (Blob storage). However, for BlobFS and potentially Azure Files, the underlying APIs are idempotent and the lock only serializes operations unnecessarily. The proposed change to skip the lock for BlobFS→BlobFS is safe because BlobFS directory creation is inherently idempotent. |
|
@copilot sure make this change |
Extend the folder creation lock-skipping logic to include BlobFS as a source location. This allows parallel file uploads to the same directory without serialization during BlobFS to BlobFS transfers. The change is safe because: - BlobFS directoryClient.Create() uses IfNoneMatch: ETagAny (idempotent) - PathAlreadyExists errors are handled as success - HNS accounts have native directory support for concurrent creation This eliminates lock contention and enables full parallelism for file transfers to the same directory in BlobFS to BlobFS scenarios. Co-authored-by: otendolkar <185434248+otendolkar@users.noreply.github.com>
Implemented in commit aee8160. The change extends the folder creation lock-skipping logic to include BlobFS as a source, enabling parallel folder creation for BlobFS→BlobFS transfers. This eliminates lock contention when multiple files are uploaded to the same directory, allowing full parallelism during transfers. |
Description
BlobFS to blobFS transfers with large folder hierarchies are significantly slower than expected due to redundant API calls during both enumeration and transfer phases. This PR implements two optimizations to address performance bottlenecks in HNS-enabled storage accounts.
Feature / Bug Fix: Performance optimization for HNS-enabled storage account enumeration and transfer
Related Links:
Changes
1. Enumeration Phase Optimization (
cmd/zc_traverser_blob.go)For HNS-enabled accounts (
isDFS == true):parallelList, virtual directories (BlobPrefixes) are directly enqueued as folders without validation, eliminating O(n) API calls for n foldersparallelListandserialList, blob items bypass folder stub detection since folders exist as virtual directories, not as blobs with metadata2. Transfer Phase Parallelism Optimization (
ste/folderCreationTracker.go)directoryClient.Create()usesIfNoneMatch: ETagAny(idempotent) andPathAlreadyExistserrors are handled as successBefore:
After:
Performance Impact:
Non-HNS accounts retain original behavior for both optimizations.
Type of Change
How Has This Been Tested?
Test500FollowedBy412Logicis pre-existing and requires Azure credentialsFixes #3287
Original prompt
💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.