
[Feature] Large file uploads (901) #902


Merged

Conversation


@ksafonov-db ksafonov-db commented Feb 26, 2025

What changes are proposed in this pull request?

Support large file uploads using the Files API.

NO_CHANGELOG=true

How is this tested?

Unit tests.
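
For context, the entry point this change extends is the Files API `upload` method shown in the diff below; a minimal usage sketch, with illustrative paths:

    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()

    # With this change, inputs larger than the configured threshold are
    # uploaded in multiple parts instead of a single request.
    with open("large-archive.bin", "rb") as f:  # illustrative local file
        w.files.upload("/Volumes/main/default/my_volume/large-archive.bin", f, overwrite=True)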

@ksafonov-db ksafonov-db changed the title from "[901] Large file uploads" to "[Feature] Large file uploads" Feb 26, 2025
@ksafonov-db ksafonov-db changed the title from "[Feature] Large file uploads" to "[Feature] Large file uploads (901)" Feb 26, 2025
@renaudhartert-db renaudhartert-db self-requested a review March 4, 2025 10:18
Comment on lines +109 to +143
multipart_upload_min_stream_size: int = 5 * 1024 * 1024

# Maximum number of presigned URLs that can be requested at a time.
#
# The more URLs we request at once, the higher the chance that some of them will expire
# before we get to use them. We only discover that a presigned URL has expired *after*
# sending the input stream partition to the server, so to retry the upload of that partition
# we must rewind the stream. If the stream is not seekable we cannot rewind and must abort
# the upload. To reduce the chance of this, we request presigned URLs one by one
# and use them immediately.
multipart_upload_batch_url_count: int = 1

# Size of the chunk to use for multipart uploads.
#
# The smaller the chunk, the lower the chance of network errors (or of a URL expiring
# mid-upload), but the more requests we'll make.
# For AWS, the minimum is 5 MiB: https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html
# For GCP, the minimum is 256 KiB (and the recommended granularity is a multiple of 256 KiB).
# boto3 uses 8 MiB: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html#boto3.s3.transfer.TransferConfig
multipart_upload_chunk_size: int = 10 * 1024 * 1024

# Use the maximum duration of 1 hour.
multipart_upload_url_expiration_duration: datetime.timedelta = datetime.timedelta(hours=1)

# This is not a "wall time" cutoff for the whole upload request, but the maximum time
# between consecutive data reception events (even a single byte) from the server.
multipart_upload_single_chunk_upload_timeout_seconds: float = 60

# Cap on the number of custom retries during incremental uploads:
# 1) multipart: an upload part URL has expired, so new upload URLs must be requested to continue the upload;
# 2) resumable: a chunk upload produced a retryable response (or exception), so the upload status must be
# retrieved to continue the upload.
# In these two cases the standard SDK retries (which are capped by the `retry_timeout_seconds` option) are not used.
# Note that the retry counter is reset when the upload is successfully resumed.
multipart_upload_max_retries = 3

As defined, these are static variables that would be shared by all instances of the Config class. My understanding is that they should be treated as static constants. If so, let's rename them to uppercase to make that clear.

    # Minimum input stream size (bytes) at which multipart / resumable uploads are used.
    # For smaller files it's more efficient to make a single-shot upload request.
    # When uploading a file, the SDK will initially buffer this many bytes from the input stream.
    # This parameter can be smaller or larger than multipart_upload_chunk_size.
    MULTIPART_UPLOAD_MIN_STREAM_SIZE: int = 5 * 1024 * 1024

    # and so on...

If the intent is to make them actual instance variables, then they should be defined in the `__init__` method:

def __init__(
        self,
        *,
        # Deprecated. Use credentials_strategy instead.
        credentials_provider: Optional[CredentialsStrategy] = None,
        credentials_strategy: Optional[CredentialsStrategy] = None,
        product=None,
        product_version=None,
        clock: Optional[Clock] = None,
        **kwargs,
    ):

        # Init file settings.
        self.multipart_upload_min_stream_size = 5 * 1024 * 1024

        # and so on...
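
To make the distinction concrete, a small standalone sketch (not from the PR) of why class-level assignments behave like shared constants while `__init__` assignments are per-instance:

    class Config:
        # Class attribute: one value shared by every instance (and rebindable via the class).
        multipart_upload_chunk_size: int = 10 * 1024 * 1024

        def __init__(self):
            # Instance attribute: each Config object gets its own copy.
            self.multipart_upload_min_stream_size = 5 * 1024 * 1024

    a, b = Config(), Config()
    Config.multipart_upload_chunk_size = 1024   # visible through both a and b
    a.multipart_upload_min_stream_size = 42     # affects only a

    print(b.multipart_upload_chunk_size)        # 1024
    print(b.multipart_upload_min_stream_size)   # 5242880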

def _download_raw_stream(
def upload(self, file_path: str, contents: BinaryIO, *, overwrite: Optional[bool] = None):
# Upload empty and small files with one-shot upload.
pre_read_buffer = contents.read(self._config.multipart_upload_min_stream_size)

See the comment above: if multipart_upload_min_stream_size is meant to be a static constant, then we should access it this way:

Suggested change
pre_read_buffer = contents.read(self._config.multipart_upload_min_stream_size)
pre_read_buffer = contents.read(Config.MULTIPART_UPLOAD_MIN_STREAM_SIZE)

Comment on lines 743 to 747
# _api.do() does retry
initiate_upload_response = self._api.do(
"POST", f"/api/2.0/fs/files{_escape_multi_segment_path_parameter(file_path)}", query=query
)
# no need to check response status, _api.do() will throw exception on failure

[nit]

Suggested change
# _api.do() does retry
initiate_upload_response = self._api.do(
"POST", f"/api/2.0/fs/files{_escape_multi_segment_path_parameter(file_path)}", query=query
)
# no need to check response status, _api.do() will throw exception on failure
# Method _api.do() takes care of retrying and will raise an exception in case of failure.
initiate_upload_response = self._api.do(
"POST", f"/api/2.0/fs/files{_escape_multi_segment_path_parameter(file_path)}", query=query
)

@@ -713,11 +727,560 @@ def download(self, file_path: str) -> DownloadResponse:
initial_response.contents._response = wrapped_response
return initial_response

def _download_raw_stream(
def upload(self, file_path: str, contents: BinaryIO, *, overwrite: Optional[bool] = None):

Please add docstrings to explain what the introduced functions do.

Some general guidelines about docstrings: https://google.github.io/styleguide/pyguide.html#383-functions-and-methods. Note that we use the reStructuredText format, not the Google format: https://peps.python.org/pep-0287/
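
For example, something along these lines in reST format (the descriptions below are illustrative, not taken from the PR):

    def upload(self, file_path: str, contents: BinaryIO, *, overwrite: Optional[bool] = None):
        """Upload the contents of a binary stream to a file in the Files API.

        Small inputs are sent with a single-shot request; larger inputs fall back to
        multipart or resumable uploads depending on the cloud provider.

        :param file_path: Absolute remote path of the target file.
        :param contents: Binary stream to read the file contents from.
        :param overwrite: If True, overwrite an existing file at ``file_path``.
        """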

Comment on lines 802 to 810
def fill_buffer():
bytes_to_read = max(0, self._config.multipart_upload_chunk_size - len(buffer))
if bytes_to_read > 0:
next_buf = input_stream.read(bytes_to_read)
new_buffer = buffer + next_buf
return new_buffer
else:
# we have already buffered enough data
return buffer

I think this function is a little confusing, as it looks functional and procedural at the same time. I think it would help to declare it as a standalone static function that returns the next buffer. Something like this:

def _read_next_buffer(
    input_stream: BinaryIO,
    previous_buffer: bytes,
    chunk_size: int = Config.MULTIPART_UPLOAD_CHUNK_SIZE,
) -> bytes:
    """
    Returns the next buffer to be processed.

    The size of the next buffer can be more or less than the given chunk_size. If it is
    less, it means that all of the input_stream has been read.

    The function avoids reading new bytes from the input_stream if previous_buffer
    already contains chunk_size bytes to be read.
    """

    bytes_to_read = max(0, chunk_size - len(previous_buffer))

    if bytes_to_read <= 0:
        return previous_buffer  # buffer already contains a full chunk

    next_bytes = input_stream.read(bytes_to_read)
    new_buffer = previous_buffer + next_bytes
    return new_buffer

All that being said, I think it would be better to abstract all of this into a class that hides the buffering logic and exposes a simple interface for reading the next bytes. This can come as a follow-up to this PR, though.
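
A rough sketch of what such a wrapper could look like (the class and method names are hypothetical, not part of this PR):

    from typing import BinaryIO

    class _ChunkedStreamReader:
        """Wraps a (possibly non-seekable) input stream and serves it back in chunks."""

        def __init__(self, input_stream: BinaryIO, chunk_size: int, pre_read: bytes = b""):
            self._stream = input_stream
            self._chunk_size = chunk_size
            self._buffer = pre_read  # bytes already consumed from the stream, e.g. the single-shot pre-read

        def next_chunk(self) -> bytes:
            """Return up to chunk_size bytes; an empty result means the stream is exhausted."""
            while len(self._buffer) < self._chunk_size:
                data = self._stream.read(self._chunk_size - len(self._buffer))
                if not data:
                    break  # end of stream
                self._buffer += data
            chunk, self._buffer = self._buffer[: self._chunk_size], self._buffer[self._chunk_size :]
            return chunk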

# no need to check response status, _api.do() will throw exception on failure

upload_part_urls = upload_part_urls_response.get("upload_part_urls", [])
if not len(upload_part_urls):

Let's aim for clarity for developers who are not used to Python idioms.

Suggested change
if not len(upload_part_urls):
if len(upload_part_urls) == 0:

for upload_part_url in upload_part_urls:
buffer = fill_buffer()
actual_buffer_length = len(buffer)
if not actual_buffer_length:

Same as above: while maybe more "pythonic", this adds ambiguity.

@renaudhartert-db renaudhartert-db self-requested a review March 5, 2025 11:18

@renaudhartert-db renaudhartert-db left a comment


LGTM. I think we'll have to do a little more work on the code style (e.g. use types that are compatible with Python 3.7), but this should not block getting customer feedback on this experimental functionality. Test coverage is good, and I'm confident that the code does what it was designed to do.
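
On the Python 3.7 point: that mostly means sticking to the typing module's generics rather than the newer built-in generic and union syntax, for example (the helper name below is hypothetical):

    from typing import Dict, List, Optional

    # Compatible with Python 3.7:
    def _get_part_urls(response: Dict[str, List[str]]) -> Optional[List[str]]:
        return response.get("upload_part_urls")

    # Requires Python 3.9+ / 3.10+ (PEP 585 built-in generics, PEP 604 unions):
    # def _get_part_urls(response: dict[str, list[str]]) -> list[str] | None: ...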

auto-merge was automatically disabled March 5, 2025 13:09

Head branch was pushed to by a user without write access


github-actions bot commented Mar 5, 2025

If integration tests don't run automatically, an authorized user can run them manually by following the instructions below:

Trigger:
go/deco-tests-run/sdk-py

Inputs:

  • PR number: 902
  • Commit SHA: b4d39cc0f13c89a3ea2e51c4b3b60bec90c1466a

Checks will be approved automatically on success.

@renaudhartert-db renaudhartert-db added this pull request to the merge queue Mar 5, 2025
Merged via the queue into databricks:main with commit 7ca3fb7 Mar 5, 2025
18 checks passed