Hi! This is a super interesting project! I'm just trying to understand the model. If I open an icechunk store into an xarray dataset, then immediately write that dataset back to the same store, what are the performance and storage implications? What if I change one attr in one of the data variables? If I change the data in one block? What if I add a new data variable but keep the existing data variables untouched? Does this differ with the different write modes? Does this differ if I'm reading from a … Thank you!
Replies: 1 comment 3 replies
Hello @SohumB, thank you for your questions and glad you are having some fun with Icechunk.
Icechunk is not content-addressable and doesn't implement deduplication today. We investigated this type of functionality before, and in fact the initial versions of Arraylake V1 repos (Earthmover's Icechunk predecessor) were deduplicating. We found two issues:
So, Icechunk doesn't attempt to deduplicate. If you open a repository and rewrite every element in it to the same value, you will effectively duplicate your storage. This is not an operation we expect people to do, so Icechunk doesn't optimize for it. But I'm curious about your use case.
Storage usage and operation time will be (amortized) linear in the size of the new write. So, if you write a few attributes and a few chunks (it doesn't matter whether they are new or updated), it will be fast and use very little storage.
Same here: regardless of the size of the repository at version N, version N+1 will take extra storage and time (amortized) linear in the size of the new writes.
No, the same answer is valid in all of these cases.
No, it doesn't.
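To make the storage argument above concrete, here is a minimal toy model in plain Python. It is not Icechunk's implementation or API: `ToyRepo`, `commit`, `read_all`, and `stored_bytes` are hypothetical names, and the sketch only tracks storage, not time (it copies the whole manifest on every commit for simplicity). The assumption it encodes is the one stated above: every written chunk becomes a new stored object, with no content hashing and therefore no deduplication.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    """Toy versioned chunk store (hypothetical, for illustration only)."""

    objects: dict[str, bytes] = field(default_factory=dict)  # object id -> chunk bytes
    versions: list[dict[str, str]] = field(default_factory=list)  # chunk key -> object id
    _next_id: int = 0

    def commit(self, writes: dict[str, bytes]) -> int:
        """Store each written chunk as a new object; return the bytes added."""
        # Unwritten chunks keep pointing at the objects of the previous version.
        manifest = dict(self.versions[-1]) if self.versions else {}
        added = 0
        for key, data in writes.items():
            object_id = f"obj-{self._next_id}"
            self._next_id += 1
            self.objects[object_id] = data  # no content hashing, so no deduplication
            manifest[key] = object_id
            added += len(data)
        self.versions.append(manifest)
        return added

    def read_all(self) -> dict[str, bytes]:
        """Return every chunk of the latest version."""
        return {key: self.objects[oid] for key, oid in self.versions[-1].items()}

    def stored_bytes(self) -> int:
        return sum(len(data) for data in self.objects.values())


repo = ToyRepo()
# Version 1: 100 chunks of 1000 bytes each.
initial_cost = repo.commit({f"temp/c/{i}": b"\x00" * 1000 for i in range(100)})
# Version 2: update a single chunk; the cost depends only on that write.
small_cost = repo.commit({"temp/c/7": b"\x01" * 1000})
# Version 3: read everything and write it back unchanged; storage is duplicated.
rewrite_cost = repo.commit(repo.read_all())
```

In this toy model the single-chunk update adds 1000 bytes no matter how large the repository is, while the read-then-rewrite round trip adds another full copy of the data.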
The way to handle these cases today is to use the Zarr API directly.
There is the concept of "region" writes, where you can tell Xarray to update specific regions of the store: https://docs.xarray.dev/en/stable/user-guide/io.html#modifying-existing-zarr-stores
`region="auto"` will attempt to detect the …