Deduplication? #922

SohumB · 2025-04-20T22:07:59Z

Hi! This is a super interesting project! I'm just trying to understand the model.

If I open an icechunk store into an xarray dataset, then immediately write that dataset back to the same store, what's the performance and storage implications? What if I change one attr in one of the data variables? If I change the data in one block? What if I add a new data variable but keep the existing data variables untouched? Does this differ with the different write modes? Does this differ if I'm reading from a readonly_session created from an earlier snapshot_id, but writing back to a writable_session("main")? Does this differ if I'm using dask (i.e., explicit chunks parameter to open_dataset)?

Thank you!

paraseba · 2025-04-22T15:44:01Z

paraseba
Apr 22, 2025
Maintainer

Hello @SohumB, thank you for your questions and glad you are having some fun with Icechunk.

> If I open an icechunk store into an xarray dataset, then immediately write that dataset back to the same store, what's the performance and storage implications?

Icechunk is not content addressable and doesn't implement deduplication today. We investigated this type of functionality before, and in fact the initial version of Arraylake V1 repos (Earthmover's predecessor to Icechunk) did deduplicate. We found two issues:

- Content addressable stores are much harder to garbage collect.
- Duplicated chunks are usually very compressible, so deduplication doesn't save much storage.

So, Icechunk doesn't attempt to deduplicate. If you open a repository and rewrite every element in it with the same value, you will effectively duplicate your storage. This is not an operation we expect people to do, so Icechunk doesn't optimize for it.
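
To make the trade-off concrete, here is a toy sketch in plain Python. It is not Icechunk's storage layout: both store classes and all names are hypothetical. The only point is that a content-addressed store keeps one copy of identical chunks, while a store that assigns a fresh ID to every write keeps two. The last line uses zlib as a stand-in compressor to show how small a constant-valued chunk becomes, which is one possible reading of why deduplicating such chunks saves little.

```python
import hashlib
import uuid
import zlib


class ContentAddressedStore:
    """Toy store: the key is the hash of the bytes, so equal chunks share one entry."""

    def __init__(self):
        self.objects = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.objects[key] = data  # writing the same bytes again is a no-op
        return key


class IdAddressedStore:
    """Toy store: every write gets a fresh ID, so equal chunks are stored twice."""

    def __init__(self):
        self.objects = {}

    def put(self, data: bytes) -> str:
        key = uuid.uuid4().hex
        self.objects[key] = data
        return key


chunk = bytes(1024)  # a constant (all-zero) chunk, hypothetical data

cas, ids = ContentAddressedStore(), IdAddressedStore()
for store in (cas, ids):
    store.put(chunk)  # original write
    store.put(chunk)  # "rewrite" with the same value

stored_cas = len(cas.objects)  # number of objects kept by the deduplicating store
stored_ids = len(ids.objects)  # number of objects kept by the non-deduplicating store

# Size of the same chunk after compression with zlib.
compressed_size = len(zlib.compress(chunk))
```

The deduplicating store ends up with a single object and the other with two. The flip side, not modeled here, is that deleting an object from the first store is only safe once no other reference to the same hash exists, which is the garbage-collection difficulty mentioned above.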

But I'm curious about your use case.

> What if I change one attr in one of the data variables? If I change the data in one block?

Storage usage and operation time will be (amortized) linear in the size of the new write. So, if you write a few attributes and a few chunks (it doesn't matter if they are new or updated), it will be fast and use very little storage.

> What if I add a new data variable but keep the existing data variables untouched?

Same: regardless of the size of the repository at version N, version N+1 will take extra storage and time that are (amortized) linear in the size of the new writes.

> Does this differ with the different write modes? Does this differ if I'm reading from a readonly_session created from an earlier snapshot_id, but writing back to a writable_session("main")?

No, the same answer is valid in all these cases.

> Does this differ if I'm using dask (i.e., explicit chunks parameter to open_dataset)?

No, it doesn't.

3 replies

SohumB Apr 22, 2025
Author

Thanks for the answer! To make sure I've understood:

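import icechunk as ic
import xarray as xr
from icechunk.xarray import to_icechunk

repo = ic.Repository.open(...)
session = repo.writable_session(...)
# Open the store's current contents as an xarray dataset.
ds = xr.open_dataset(session.store, engine="zarr", consolidated=False)

# Write the unmodified dataset straight back, overwriting the store.
to_icechunk(ds, session, mode="w")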

What I understand you saying is that this will write out a whole new copy of every data variable. Fair enough. I guess my question then boils down to: how do I get xarray to keep track of what the minimal set of changes to write back is? mode="a" (I think?) does the version of this where xarray checks the destination for, I think, distinct coords? and only writes out the ones that are new?

In particular, this operation where I want to take an existing dataset, and generate a new commit containing only an attrs change to one of the data variables: is it even possible to schedule that with xarray? Or, similarly, change an existing chunk? Or would the way to handle that be to use the lower level zarr library instead?

dcherian Apr 23, 2025
Maintainer

> generate a new commit containing only an attrs change to one of the data variables: is it even possible to schedule that with xarray? Or, similarly, change an existing chunk? Or would the way to handle that be to use the lower level zarr library instead?

The way to handle these cases today is to use the Zarr api directly.

> does the version of this where xarray checks the destination for, I think, distinct coords? and only writes out the ones that are new

There is the concept of "region" writes, where you can tell Xarray to update specific regions of the store (see https://docs.xarray.dev/en/stable/user-guide/io.html#modifying-existing-zarr-stores ). region="auto" will attempt to detect the region automatically. Importantly, you have to subset the dataset you will write to the appropriate region. In common use cases, this is pretty natural.

Answer selected by SohumB

SohumB Apr 23, 2025
Author

That all makes perfect sense. Thank you. I think I was confused by trying to directly apply my instincts from git over; I appreciate the explication!
