
Conversation

@geovalexis commented Jun 15, 2025

Motivation

Append mode is not a widely used feature in Azure Blob Storage, but it can be particularly useful for specific scenarios, such as writing logs or updating sentinel or marker files. As the native Azure Blob Storage SDK provides support for append mode, this PR integrates that functionality into the smart_open interface, making it more accessible and convenient for developers.

This implementation comes with a few limitations that are worth noting:

  • An append blob has a maximum size of roughly 195 GiB (50,000 blocks of up to 4 MiB each). This is a limitation of the Azure Blob Storage API itself and cannot be changed.
  • The blob must be created as an append blob in the first place, meaning that we cannot simply append to an existing blob that was created with a different blob type (e.g. block or page blob).

Given these limitations, append mode is not suitable for every use case and is not meant to be used indiscriminately by anyone looking to append data to a blob. Bearing them in mind, however, many developers may find it convenient, especially when they already use smart_open for other operations.
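In pseudo-code, the intended usage looks roughly like this (a sketch, not the PR's exact API surface; `client` is the transport param smart_open's Azure backend already uses, and the blob/container names are made up):

```
import smart_open

# assumed: 'ab' opens the blob as an AppendBlob, creating it if needed
with smart_open.open('azure://mycontainer/app.log', 'ab',
                     transport_params={'client': blob_service_client}) as f:
    f.write(b'worker started\n')
```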

Closes #836

Tests

New tests have been added to cover the new functionality, including:

  • Writing to a new blob in append mode
  • Appending to an existing blob in append mode
  • Attempting to append to a blob that was not created in append mode (which should raise an error)

Also, I took the chance to fix a few existing tests that were failing when using Azurite. The method BlobClient._stage_contents does not exist in the Azure Blob Storage SDK (or at least not in the latest versions), so I replaced it with BlobClient.get_block_list("uncommitted"), which returns the same information.

Checklist

  • Picked a concise, informative and complete title
  • Clearly explained the motivation behind the PR
  • Linked to any existing issues that your PR will be solving
  • Included tests for any new functionality
  • Checked that all unit tests pass

Comment on lines +983 to +999
    def test_append_compressed_gzip(self):
        """
        Does appending into an Azure Blob file work correctly when the file is compressed?
        We should be able to append into a compressed file. We will test this with a Gzip file.
        """
        expected = u'а не спеть ли мне песню... о любви'.encode('utf-8')
        blob_name = "test_append_gzip_%s" % BLOB_NAME

        with smart_open.azure.AppendWriter(CONTAINER_NAME, blob_name, CLIENT) as fout:
            with gzip.GzipFile(fileobj=fout, mode='w') as zipfile:
                zipfile.write(expected)

        with smart_open.azure.Reader(CONTAINER_NAME, blob_name, CLIENT) as fin:
            with gzip.GzipFile(fileobj=fin) as zipfile:
                actual = zipfile.read()

        self.assertEqual(expected, actual)
@ddelange (Collaborator) commented Jun 16, 2025

noting here that not all compression algorithms allow appending. gzip just adds a new header and 'restarts' compression without prior knowledge of the preceding bytes. we should be raising the appropriate errors for algorithms that don't, which might be nice to add a test case for as well.

that being said, could you refactor this test to use the top-level smart_open.open and the builtin (de)compression mechanism using multiple writes?

some pseudo-code:

with smart_open.open('azure://fname.txt.gz', 'wb') as fp:
    fp.write(expected)

with smart_open.open('azure://fname.txt.gz', 'ab') as fp:
    fp.write(expected)

with smart_open.open('azure://fname.txt.gz', 'rb') as fp:
    actual = fp.read()

self.assertEqual(actual, expected * 2)
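The gzip behaviour described above (each append starts a new member, and readers transparently concatenate members) can be demonstrated with the standard library alone. This is an illustrative sketch, independent of smart_open:

```python
import gzip
import io

buf = io.BytesIO()

# First write: a complete gzip member.
with gzip.GzipFile(fileobj=buf, mode='wb') as f:
    f.write(b'hello ')

# Append: gzip starts a new member (fresh header, fresh compression
# state) after the first one, with no knowledge of the prior bytes.
with gzip.GzipFile(fileobj=buf, mode='ab') as f:
    f.write(b'world')

buf.seek(0)
# The gzip reader transparently concatenates all members.
with gzip.GzipFile(fileobj=buf, mode='rb') as f:
    data = f.read()

print(data)  # b'hello world'
```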

Comment on lines +621 to +628
        # Uploads data as an AppendBlob type with automatic block chunking.
        # The AppendBlob is created first if it does not exist, or appended to if it does.
        return self._blob.upload_blob(
            data=b,
            blob_type=azure.storage.blob.BlobType.APPENDBLOB,
            overwrite=False,
            **self._blob_kwargs,
        )
Collaborator

can you implement the _min_part_size buffer mechanic, similar to the azure Writer class? this current implementation is unbuffered, and you'll quickly hit the maximum block count an AppendBlob supports if you do many small writes
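The buffering mechanic the reviewer asks for can be sketched in isolation. The class below is hypothetical (not smart_open's actual Writer implementation): small writes accumulate in memory and are flushed as one appended block once `min_part_size` bytes are buffered, keeping the block count low.

```python
import io


class BufferedAppender:
    """Sketch of a _min_part_size buffer for append writes.

    Hypothetical helper: `append_block` stands in for whatever call
    actually appends one block to the AppendBlob.
    """

    def __init__(self, append_block, min_part_size=4 * 1024 * 1024):
        self._append_block = append_block
        self._min_part_size = min_part_size
        self._buf = io.BytesIO()

    def write(self, b):
        # Accumulate in memory; only flush once the threshold is reached,
        # so many tiny writes become a single appended block.
        self._buf.write(b)
        if self._buf.tell() >= self._min_part_size:
            self._flush()
        return len(b)

    def _flush(self):
        if self._buf.tell():
            self._append_block(self._buf.getvalue())
            self._buf = io.BytesIO()

    def close(self):
        # Flush whatever remains, even if below the threshold.
        self._flush()


# Usage: five 2-byte writes produce a single 10-byte block.
blocks = []
w = BufferedAppender(blocks.append, min_part_size=10)
for _ in range(5):
    w.write(b'ab')
w.close()
# blocks == [b'ababababab']
```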

@ddelange (Collaborator) commented Jul 1, 2025

fyi: a packaging refactor and corresponding fixes for CI rot just merged into develop. below it mentions conflicts, but a local git pull upstream develop will resolve them without conflicts: it's a rename commit moving the tests folder one level up to the top-level directory.

Linked issue: [azure] Support for "append" mode for Azure Blobs