
[Feature] Large file uploads (901) #902


Merged

Changes from 13 commits
44 changes: 44 additions & 0 deletions databricks/sdk/config.py
@@ -1,5 +1,6 @@
import configparser
import copy
import datetime
import logging
import os
import pathlib
@@ -98,6 +99,49 @@ class Config:
files_api_client_download_max_total_recovers = None
files_api_client_download_max_total_recovers_without_progressing = 1

# File multipart upload parameters
# ----------------------

# Minimum input stream size (bytes) at which multipart / resumable uploads are used.
# For small files it's more efficient to make a single-shot upload request.
# When uploading a file, the SDK initially buffers this many bytes from the input stream.
# This parameter can be smaller or larger than multipart_upload_chunk_size.
multipart_upload_min_stream_size: int = 5 * 1024 * 1024

# Maximum number of presigned URLs that can be requested at a time.
#
# The more URLs we request at once, the higher the chance that some of them expire
# before we use them. We only discover that a presigned URL has expired *after* sending
# the input stream partition to the server, so to retry the upload of that partition we
# must rewind the stream. With a non-seekable stream we cannot rewind, and the upload
# is aborted. To reduce the chance of this, we request presigned URLs one at a time
# and use them immediately.
multipart_upload_batch_url_count: int = 1

# Size of the chunk to use for multipart uploads.
#
# The smaller the chunk, the lower the chance of network errors (or of a URL expiring),
# but the more requests we make.
# For AWS, the minimum is 5 MiB: https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html
# For GCP, the minimum is 256 KiB (and chunk sizes should be a multiple of 256 KiB).
# boto3 uses 8 MiB: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html#boto3.s3.transfer.TransferConfig
multipart_upload_chunk_size: int = 10 * 1024 * 1024

# Use the maximum expiration duration of 1 hour.
multipart_upload_url_expiration_duration: datetime.timedelta = datetime.timedelta(hours=1)

# This is not a "wall time" cutoff for the whole upload request, but the maximum time
# between consecutive data-reception events (even a single byte) from the server.
multipart_upload_single_chunk_upload_timeout_seconds: float = 60

# Cap on the number of custom retries during incremental uploads:
# 1) multipart: an upload-part URL has expired, so new upload URLs must be requested
#    to continue the upload
# 2) resumable: a chunk upload returned a retryable response (or raised a retryable
#    exception), so the upload status must be retrieved to continue the upload
# In these two cases the standard SDK retries (which are capped by the
# `retry_timeout_seconds` option) are not used.
# Note that the retry counter is reset whenever the upload is successfully resumed.
multipart_upload_max_retries = 3
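
To make the min-stream-size buffering decision above concrete, here is a rough illustrative sketch, not the PR's actual implementation; `_buffer_head` and `choose_upload_strategy` are hypothetical helpers:

    from typing import BinaryIO

    def _buffer_head(stream: BinaryIO, n: int) -> bytes:
        # read() may return fewer bytes than requested, so loop until we have
        # n bytes or the stream is exhausted.
        chunks, remaining = [], n
        while remaining > 0:
            chunk = stream.read(remaining)
            if not chunk:
                break  # EOF reached before the threshold
            chunks.append(chunk)
            remaining -= len(chunk)
        return b"".join(chunks)

    def choose_upload_strategy(stream: BinaryIO, min_stream_size: int) -> str:
        head = _buffer_head(stream, min_stream_size)
        # If the stream ended below the threshold, the whole file is already
        # buffered and a single-shot upload request is cheaper.
        if len(head) < min_stream_size:
            return "single-shot"
        # Otherwise the file is at least min_stream_size bytes, so a chunked
        # (multipart / resumable) upload pays off; `head` must be replayed first.
        return "multipart"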
Comment on lines +109 to +143

Contributor

As defined, these are class-level (static) variables shared by all instances of the Config class. My understanding is that they should be treated as static constants. If so, let's rename them in uppercase to make that clear.

    # Minimum input stream size (bytes) at which multipart / resumable uploads are used.
    # For small files it's more efficient to make a single-shot upload request.
    # When uploading a file, the SDK initially buffers this many bytes from the input stream.
    # This parameter can be smaller or larger than multipart_upload_chunk_size.
    MULTIPART_UPLOAD_MIN_STREAM_SIZE: int = 5 * 1024 * 1024

    # and so on...

If the intent is to make them actual instance variables, they should instead be defined in the __init__ function:

def __init__(
        self,
        *,
        # Deprecated. Use credentials_strategy instead.
        credentials_provider: Optional[CredentialsStrategy] = None,
        credentials_strategy: Optional[CredentialsStrategy] = None,
        product=None,
        product_version=None,
        clock: Optional[Clock] = None,
        **kwargs,
    ):

        # Init file settings.
        self.multipart_upload_min_stream_size = 5 * 1024 * 1024

        # and so on...
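
For context on the sharing concern, an illustrative snippet (not part of the PR) showing why the distinction matters: rebinding a class attribute is visible to every instance that has not shadowed it, while assignment through an instance affects only that instance.

    class Config:
        multipart_upload_max_retries = 3  # class attribute, shared by default

    a, b = Config(), Config()
    a.multipart_upload_max_retries = 5        # creates an instance attribute on `a` only
    print(b.multipart_upload_max_retries)     # 3  -- `b` still reads the class attribute
    Config.multipart_upload_max_retries = 10  # rebinding the class attribute...
    print(b.multipart_upload_max_retries)     # 10 -- ...is seen by every instance without an override
    print(a.multipart_upload_max_retries)     # 5  -- `a` keeps its own instance attribute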


def __init__(
    self,
    *,
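
As a usage illustration, a sketch assuming the standard WorkspaceClient entry point and a Unity Catalog volume path; the attribute names come from the diff above, and it assumes the parameters are read at upload time:

    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()
    # Hypothetical tuning: larger chunks mean fewer requests per upload,
    # subject to the cloud provider minimums noted in the comments above.
    w.config.multipart_upload_chunk_size = 16 * 1024 * 1024  # 16 MiB
    w.config.multipart_upload_max_retries = 5

    with open("large-file.bin", "rb") as f:
        # A stream larger than multipart_upload_min_stream_size (5 MiB)
        # would take the multipart / resumable path.
        w.files.upload("/Volumes/main/default/my_volume/large-file.bin", f)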