Add support for append mode for Azure Blob Storage
#865
base: develop
Conversation
```python
def test_append_compressed_gzip(self):
    """
    Does appending into an Azure Blob file work correctly when the file is compressed?

    We should be able to append into a compressed file. We will test this with a Gzip file.
    """
    expected = u'а не спеть ли мне песню... о любви'.encode('utf-8')
    blob_name = "test_append_gzip_%s" % BLOB_NAME

    with smart_open.azure.AppendWriter(CONTAINER_NAME, blob_name, CLIENT) as fout:
        with gzip.GzipFile(fileobj=fout, mode='w') as zipfile:
            zipfile.write(expected)

    with smart_open.azure.Reader(CONTAINER_NAME, blob_name, CLIENT) as fin:
        with gzip.GzipFile(fileobj=fin) as zipfile:
            actual = zipfile.read()

    self.assertEqual(expected, actual)
```
noting here that not all compression algorithms allow appending. gzip just adds a new header and 'restarts' compression without prior knowledge of the preceding bytes. we should be raising the appropriate errors for algorithms that don't support this, which might be nice to add a test case for as well.
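to illustrate why gzip in particular tolerates appending: concatenated gzip members decompress as one continuous stream. a quick local sketch, no Azure involved (the simulated "append" is just a second, independent gzip member written after the first):

```python
import gzip
import io

payload = 'а не спеть ли мне песню... о любви'.encode('utf-8')

buf = io.BytesIO()

# First "write": one complete gzip member.
with gzip.GzipFile(fileobj=buf, mode='wb') as z:
    z.write(payload)

# Simulated append: a second, independent gzip member written
# right after the first -- gzip restarts with a fresh header,
# needing no knowledge of the preceding bytes.
with gzip.GzipFile(fileobj=buf, mode='wb') as z:
    z.write(payload)

# Readers treat the concatenated members as a single stream.
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode='rb') as z:
    assert z.read() == payload * 2
```

this is exactly the property an append-mode blob writer relies on; formats without self-delimiting members would need to raise instead.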
that being said, could you refactor this test to use the top-level smart_open.open and the builtin (de)compression mechanism using multiple writes?
some pseudo-code:

```python
with smart_open.open('azure://fname.txt.gz', 'wb') as fp:
    fp.write(expected)
with smart_open.open('azure://fname.txt.gz', 'ab') as fp:
    fp.write(expected)
with smart_open.open('azure://fname.txt.gz', 'rb') as fp:
    actual = fp.read()
self.assertEqual(actual, expected * 2)
```

```python
# Uploads data as an AppendBlob type with automatic block chunking.
# The AppendBlob will be created first if it does not exist, or appended to if it already does.
return self._blob.upload_blob(
    data=b,
    blob_type=azure.storage.blob.BlobType.APPENDBLOB,
    overwrite=False,
    **self._blob_kwargs,
)
```
can you implement the _min_part_size buffer mechanic similar to the azure Writer class? this current implementation is unbuffered, and you'll quickly hit the max block limit an AppendBlob supports if you do many small writes
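for reference, a minimal sketch of what that buffering could look like. this is illustrative only, not the Writer implementation: the class name, `min_part_size` default, and the `blob` argument are assumptions — `blob` stands in for anything exposing `append_block(data, length=...)`, such as an `azure.storage.blob.BlobClient` for an AppendBlob:

```python
import io


class BufferedAppendWriter:
    """Illustrative sketch: coalesce small writes into larger
    append_block calls, so many tiny writes don't exhaust the
    block limit an AppendBlob supports."""

    def __init__(self, blob, min_part_size=4 * 1024 * 1024):
        self._blob = blob  # assumed to expose append_block(data, length=...)
        self._min_part_size = min_part_size
        self._buf = io.BytesIO()

    def write(self, b):
        self._buf.write(b)
        # Only hit the service once the buffer reaches min_part_size.
        if self._buf.tell() >= self._min_part_size:
            self._flush()
        return len(b)

    def _flush(self):
        data = self._buf.getvalue()
        if data:
            self._blob.append_block(data, length=len(data))
        self._buf = io.BytesIO()

    def close(self):
        # Flush whatever is left, however small.
        self._flush()
```

the point is just that writes below `min_part_size` never reach the network, so each uploaded block consumes one slot regardless of how many small `write()` calls produced it.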
fyi: a packaging refactor and corresponding fixes for CI rot just merged into develop. below it mentions conflicts, but a local
Motivation
Append mode is not a widely used feature in Azure Blob Storage, but it can be particularly useful for specific scenarios, such as writing logs or updating sentinel or marker files. As the native Azure Blob Storage SDK provides support for append mode, this PR integrates that functionality into the smart_open interface, making it more accessible and convenient for developers.

This implementation comes with a few limitations that are worth noting:

Given these limitations, append mode is not suitable for all use cases and is not meant to be used indiscriminately by anyone looking to append data to a blob. However, bearing them in mind, many developers may find it convenient, especially when they are already using smart_open for other operations.

Closes #836
Tests
New tests have been added to cover the new functionality, including:
Also, I took the chance to fix a few existing tests that were failing when using Azurite. The method BlobClient._stage_contents does not exist in the Azure Blob Storage SDK (or at least not in the latest versions), so I replaced it with BlobClient.get_block_list("uncommitted"), which returns the same information.

Checklist