Compaction? #961
4 comments · 6 replies
-
The issue is quite similar to what was mentioned in a different discussion topic, except that here I really don't have the option to process the images together, because they arrive at different times. @rabernat mentioned there that some rechunking support is on the roadmap; is that the intended solution?
-
Hi @vladidobro. 👋 Thanks for your interesting question. The short answer is no: today Icechunk does not solve this problem generally. This is a flavor of the "rechunking" problem. You are feeling the tension between chunks that are optimized for your write pattern and chunks that are optimized for your read pattern. This problem is generic to the Zarr data model, which is what Icechunk implements, and Icechunk does not provide any capabilities for automatic rechunking today.
The consistency problem you mention, however, is something Icechunk can help with. By wrapping every update in a transaction, it guarantees there are no inconsistent or incomplete writes. In the longer term, rechunking is definitely a problem we are working on at Earthmover; the solution we are developing involves Icechunk but also additional layers of software and orchestration. In the meantime, the two-store approach you describe is a very common one among others facing this problem.
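To make the transactional part concrete, here is a minimal sketch of what a single ingest job could look like. It assumes the pre-1.0 icechunk Python API (e.g. `writable_session` and `ConflictError`) and a hypothetical `reflectance` array; check the docs for your version:

```python
import icechunk
import numpy as np
import zarr

# Hypothetical repo backed by object storage.
storage = icechunk.s3_storage(bucket="my-bucket", prefix="satellite-repo", from_env=True)
repo = icechunk.Repository.open(storage)

t, image = 42, np.random.rand(1000, 1000)  # placeholder for one incoming image

# Each writer works in its own session; nothing becomes visible to readers
# until the commit below succeeds.
session = repo.writable_session("main")
group = zarr.open_group(session.store, mode="r+")
group["reflectance"][t, :, :] = image

try:
    session.commit("ingest image t=42")
except icechunk.ConflictError:
    # Another writer committed first: rebase the session and retry,
    # or simply re-run the ingest job.
    raise
```

If two concurrent writers touched disjoint chunks, a rebase-and-retry should succeed without rewriting any data.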
-
Thank you for your response. I see. I guess the only way to avoid having two stores is to wait for ZEP003, and then the only sane way to implement such a compaction scheme will be with Icechunk (so that we don't have to move all the other already-written chunks). Or to use a completely different technology (I think TensorStore and TileDB have this feature, but they fall short of Zarr in other respects). Even if I were to do the rechunking manually, there is still no way to do it with Icechunk now, right?
-
I'm very interested in this discussion and look forward to the day when one could rechunk in place rather than duplicating the data to a new repo. For now, my project will likely use two zarr stores, like Vladislav. Cheers,
-
Hi, Icechunk community,
thank you for the amazing work that you are doing. While we are all patiently waiting for Icechunk to reach v1, I want to check whether I understand correctly how Icechunk handles non-cooperative concurrent writes.
As an example scenario, assume that I have data coming regularly from the satellites every 15 minutes with chunking (time, lat, lon) = (1, 1000, 1000).
I need the data as soon as possible, so one option is to pump it into vanilla zarr with the same chunks. This way, I will not need to handle any potential races or concurrency, because each chunk will be written by exactly one pipeline run.
But for analysis purposes, I would like to have the final dataset with finer spatial chunking, e.g. (100, 100, 100).
If I were to do it with vanilla zarr (and if I understand correctly), each job would have to download all the intersecting chunks from the already-written zarr store, do the update in memory, and then write the updated chunks back.
So writing a single satellite image would require me to download roughly 100 images' worth of data from my zarr store and then upload roughly 100 images' worth again.
This also has the obvious problem that if two satellite images were being written at the same time, there would be no consistency guarantees.
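To make the numbers concrete, a rough sketch with zarr-python 3 and made-up sizes (the array names and the in-memory store are just stand-ins):

```python
import numpy as np
import zarr

store = zarr.storage.MemoryStore()  # stand-in for the real object store

# Ingest-friendly layout: one chunk per image, so each 15-minute job
# writes exactly one new object and never touches existing ones.
coarse = zarr.create_array(
    store, name="coarse", shape=(1000, 1000, 1000),
    chunks=(1, 1000, 1000), dtype="f4",
)

# Analysis-friendly layout: a (1, 1000, 1000) image intersects
# 10 x 10 = 100 chunks, each of which also holds data for 100 time steps.
fine = zarr.create_array(
    store, name="fine", shape=(1000, 1000, 1000),
    chunks=(100, 100, 100), dtype="f4",
)

t, image = 42, np.random.rand(1000, 1000).astype("f4")

coarse[t] = image  # writes 1 chunk: ~1 image of data uploaded
fine[t] = image    # read-modify-write of 100 chunks: ~100 images down, ~100 up
```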
How exactly does Icechunk solve this problem?
The way I understand it from the spec is that it does the exact same thing, except that it does clever serialization, so that one of the racing jobs will fail and thus avoid inconsistency.
But this would still mean that writing a single image is ~200× slower than with the (1, 1000, 1000) chunks (100 chunks read plus 100 chunks written, versus a single chunk written). The object storage write costs would also be 100 times larger, because I would be writing the same data over and over.
The way this problem is solved in open table formats like Iceberg is "compaction": writers only ever append new data as small files, the table format abstracts over those small files, and a compaction operation can later merge them into larger files optimized for analysis.
This way, both the realtime and historical data have their optimal chunking and the user has a unified interface over the dataset.
From what I've seen, the Icechunk spec seems flexible enough to allow this, but I have not seen anything in the API that would help me do this kind of compaction.
Zarr sharding seems related, but does it allow this? Will there be an API in Icechunk that helps me pack historical data into big shards while still letting me write new data in small shards?
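For context, my understanding is that the shard layout is fixed when the array is created, something like this sketch using zarr-python 3's `shards=` argument with hypothetical sizes:

```python
import zarr

store = zarr.storage.MemoryStore()  # stand-in for the real store

# Big shards for analysis: one object per 100 images, with small inner
# chunks that can still be read individually via the shard index.
historical = zarr.create_array(
    store, name="historical", shape=(1000, 1000, 1000), dtype="f4",
    shards=(100, 1000, 1000),
    chunks=(1, 100, 100),
)
```

As far as I can tell, the shard shape is a static property of the array, and rewriting any inner chunk means rewriting the whole shard object, so on its own this would not remove the write amplification on ingest.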
An idea that I toyed with is to have two zarr stores, one "realtime" and one "historical", each with their own optimal chunking, then there would be a "compaction" pipeline that will move data from the realtime to the historical zarr.
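Very roughly, the compaction step could look like this in xarray (hypothetical paths, variable names, and cutoff):

```python
import xarray as xr

cutoff = "2025-01-01"  # hypothetical: everything older than this gets compacted

# Rewrite old realtime data with analysis-friendly chunks and append it
# to the historical store; a separate step would trim the realtime store.
realtime = xr.open_zarr("s3://my-bucket/realtime.zarr")
batch = (
    realtime.sel(time=slice(None, cutoff))
    .chunk({"time": 100, "lat": 100, "lon": 100})
)
batch.to_zarr("s3://my-bucket/historical.zarr", append_dim="time")
```

On top of that, every reader has to know to open both stores and concatenate them.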
But this is obviously very clunky and not user-friendly.
Does Icechunk solve this?
Thank you