Compaction? #961
4 comments · 6 replies
-
The issue is quite similar to what was mentioned in a different discussion topic, except that here I really don't have the option to process the images together, because they arrive at different times. @rabernat mentioned there that some rechunking support is on the roadmap; is that the intended solution?
-
Hi @vladidobro. 👋 Thanks for your interesting question. The short answer is no: today Icechunk does not solve this problem generally. This is a flavor of the "rechunking" problem. You are feeling the tension between chunks that are optimized for your write pattern and chunks that are optimized for your read pattern. This problem is generic to the Zarr data model, which is what Icechunk implements, and Icechunk does not provide any capabilities for automatic rechunking today.
The consistency problem you mention, however, is something Icechunk can help with. By wrapping every update in a transaction, it guarantees there are no inconsistent or incomplete writes. In the longer term, rechunking is definitely a problem we are working on at Earthmover; the solution we are developing involves Icechunk but also additional layers of software and orchestration. In the meantime, the two-store approach you describe is a very common one among others facing this problem.
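To make the transactional part concrete, here is a minimal sketch of what a single ingest job could look like. It assumes the pre-1.0 icechunk Python API (e.g. `writable_session` and `ConflictError`) and a hypothetical `reflectance` array; check the docs for your version:

```python
import icechunk
import numpy as np
import zarr

# Hypothetical repo backed by object storage.
storage = icechunk.s3_storage(bucket="my-bucket", prefix="satellite-repo", from_env=True)
repo = icechunk.Repository.open(storage)

t, image = 42, np.random.rand(1000, 1000)  # placeholder for one incoming image

# Each writer works in its own session; nothing becomes visible to readers
# until the commit below succeeds.
session = repo.writable_session("main")
group = zarr.open_group(session.store, mode="r+")
group["reflectance"][t, :, :] = image

try:
    session.commit("ingest image t=42")
except icechunk.ConflictError:
    # Another writer committed first: rebase the session and retry,
    # or simply re-run the ingest job.
    raise
```

If two concurrent writers touched disjoint chunks, a rebase-and-retry should succeed without rewriting any data.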
-
Thank you for your response. I see. I guess the only way to avoid having two stores is to wait for ZEP003, and then the only sane way to implement such a compaction scheme will be with Icechunk (so that we don't have to move all the other already-written chunks). Or to use a completely different technology (I think TensorStore and TileDB have this feature, but they fall short of Zarr in other respects). Even if I were to do the rechunking manually, there is still no way to do it with Icechunk now, right?
-
I'm very interested in this discussion and look forward to the day when one could rechunk in place rather than duplicating the data to a new repo. For now, my project will likely use two zarr stores, like Vladislav. Cheers,
-
Hi, Icechunk community,
thank you for the amazing work that you are doing. While we are all patiently waiting for Icechunk to reach v1, I want to check whether I understand correctly how Icechunk handles non-cooperative concurrent writes.
As an example scenario, assume that I have data coming regularly from the satellites every 15 minutes with chunking (time, lat, lon) = (1, 1000, 1000).
I need the data as soon as possible, so one option is to pump it into vanilla zarr with the same chunks. This way, I will not need to handle any potential races or concurrency, because each chunk will be written by exactly one pipeline run.
But for analysis purposes, I would like to have the final dataset with finer spatial chunking, e.g. (100, 100, 100).
If I were to do it with vanilla zarr (and if I understand correctly), each job would have to download all the intersecting chunks from the already-written zarr store, do the update in memory, and then write the updated chunks back.
So writing a single satellite image would require me to download roughly 100 images' worth of data from my zarr store and then upload roughly 100 images' worth again.
This also has the obvious problem that if two satellite images were being written at the same time, there would be no consistency guarantees.
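To make the numbers concrete, a rough sketch with zarr-python 3 and made-up sizes (the array names and the in-memory store are just stand-ins):

```python
import numpy as np
import zarr

store = zarr.storage.MemoryStore()  # stand-in for the real object store

# Ingest-friendly layout: one chunk per image, so each 15-minute job
# writes exactly one new object and never touches existing ones.
coarse = zarr.create_array(
    store, name="coarse", shape=(1000, 1000, 1000),
    chunks=(1, 1000, 1000), dtype="f4",
)

# Analysis-friendly layout: a (1, 1000, 1000) image intersects
# 10 x 10 = 100 chunks, each of which also holds data for 100 time steps.
fine = zarr.create_array(
    store, name="fine", shape=(1000, 1000, 1000),
    chunks=(100, 100, 100), dtype="f4",
)

t, image = 42, np.random.rand(1000, 1000).astype("f4")

coarse[t] = image  # writes 1 chunk: ~1 image of data uploaded
fine[t] = image    # read-modify-write of 100 chunks: ~100 images down, ~100 up
```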
How exactly does Icechunk solve this problem?
The way I understand it from the spec is that it does the exact same thing, except that it does clever serialization, so that one of the racing jobs will fail and thus avoid inconsistency.
But this would still mean that writing a single image is ~200× slower than with the (1, 1000, 1000) chunks (100 chunks read plus 100 chunks written, versus a single chunk written). The object storage write costs would also be 100 times larger, because I would be writing the same data over and over.
The way this problem is solved in open table formats like Iceberg is "compaction": writers only ever append new data as small files, the table format abstracts over those small files, and a compaction operation can later merge them into larger files optimized for analysis.
This way, both the realtime and historical data have their optimal chunking and the user has a unified interface over the dataset.
From what I've seen, the Icechunk spec seems flexible enough to allow this, but I have not seen anything in the API that would help me do this kind of compaction.
Zarr sharding seems related, but does it allow this? Will there be an API in Icechunk that helps me pack historical data into big shards while still letting me write new data in small shards?
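For context, my understanding is that the shard layout is fixed when the array is created, something like this sketch using zarr-python 3's `shards=` argument with hypothetical sizes:

```python
import zarr

store = zarr.storage.MemoryStore()  # stand-in for the real store

# Big shards for analysis: one object per 100 images, with small inner
# chunks that can still be read individually via the shard index.
historical = zarr.create_array(
    store, name="historical", shape=(1000, 1000, 1000), dtype="f4",
    shards=(100, 1000, 1000),
    chunks=(1, 100, 100),
)
```

As far as I can tell, the shard shape is a static property of the array, and rewriting any inner chunk means rewriting the whole shard object, so on its own this would not remove the write amplification on ingest.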
An idea that I toyed with is to have two zarr stores, one "realtime" and one "historical", each with their own optimal chunking, then there would be a "compaction" pipeline that will move data from the realtime to the historical zarr.
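Very roughly, the compaction step could look like this in xarray (hypothetical paths, variable names, and cutoff):

```python
import xarray as xr

cutoff = "2025-01-01"  # hypothetical: everything older than this gets compacted

# Rewrite old realtime data with analysis-friendly chunks and append it
# to the historical store; a separate step would trim the realtime store.
realtime = xr.open_zarr("s3://my-bucket/realtime.zarr")
batch = (
    realtime.sel(time=slice(None, cutoff))
    .chunk({"time": 100, "lat": 100, "lon": 100})
)
batch.to_zarr("s3://my-bucket/historical.zarr", append_dim="time")
```

On top of that, every reader has to know to open both stores and concatenate them.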
But this is obviously very clunky and not user-friendly.
Does Icechunk solve this?
Thank you