managing source file of virtual datasets within icechunk #1053

bendichter · 2025-02-18T19:12:46Z

Inspired by a convo in the webinar today, I wanted to propose a feature related to virtual datasets.

Currently, when you want to use a virtual dataset, that dataset is external to icechunk. If the source data changes, icechunk can detect that there was a change and invalidate the references.
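To make the "detect and invalidate" behavior concrete, here's a rough sketch of the general idea in plain Python. This is not icechunk's actual API: `VirtualRef`, `fingerprint`, `make_ref`, and `read_chunk` are all made-up names, and a SHA-256 content hash is only a stand-in for whatever fingerprint the store really records (an object-store ETag or a last-modified time would serve the same purpose).

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class VirtualRef:
    """Hypothetical reference to a byte range in an external source file."""
    path: Path
    offset: int
    length: int
    checksum: str  # fingerprint of the source file when the ref was created


def fingerprint(path: Path) -> str:
    # Assumption: a content hash stands in for the real change marker.
    return hashlib.sha256(path.read_bytes()).hexdigest()


def make_ref(path: Path, offset: int, length: int) -> VirtualRef:
    # Record the fingerprint at the moment the reference is created.
    return VirtualRef(path, offset, length, fingerprint(path))


def read_chunk(ref: VirtualRef) -> bytes:
    # Refuse to serve bytes if the source changed after the ref was made.
    if fingerprint(ref.path) != ref.checksum:
        raise ValueError(f"source file {ref.path} changed; reference is stale")
    with ref.path.open("rb") as f:
        f.seek(ref.offset)
        return f.read(ref.length)
```

The point is just that a reference can only fail loudly once the source changes; it has no way to get the old bytes back.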

An alternative approach would be to manage source data within icechunk. Then, if a change is made to a source file that an existing snapshot references, the original file is kept and the change is written to a new file. You may also want to automatically virtualize the new version of the source file and create a new snapshot in icechunk. This would let you maintain versions of datasets that include virtual components, and ensure that the source data stays consistent with the icechunk references.
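Roughly what I have in mind, as a toy in-memory sketch. Nothing here is an existing icechunk feature: `ManagedSources` and its methods are hypothetical names, content-addressing by SHA-256 is just one way to keep old versions around, and the "automatically virtualize the new version" step is left out entirely.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ManagedSources:
    """Toy stand-in for source files managed by the store (hypothetical)."""
    blobs: dict[str, bytes] = field(default_factory=dict)  # version id -> content
    head: dict[str, str] = field(default_factory=dict)     # file name -> current version id
    snapshots: list[dict[str, str]] = field(default_factory=list)

    def write(self, name: str, content: bytes) -> str:
        # A change creates a new blob; blobs pinned by older snapshots stay put.
        version = hashlib.sha256(content).hexdigest()[:12]
        self.blobs[version] = content
        self.head[name] = version
        return version

    def commit(self) -> int:
        # A snapshot freezes the exact version of every source file.
        self.snapshots.append(dict(self.head))
        return len(self.snapshots) - 1

    def read(self, snapshot_id: int, name: str) -> bytes:
        # Resolve the file through the snapshot, not through the current head.
        return self.blobs[self.snapshots[snapshot_id][name]]
```

So writing `session.nwb` (a made-up file name), committing, writing it again, and committing again would leave two snapshots that each read back their own version of the file.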

One downside of this approach: This would require users to interact with their source non-zarr files through the icechunk interface, which may not be a reasonable expectation.

cc @rly

p.s. this is a very cool project, thanks for working on it!

TomNicholas (Maintainer) · 2025-02-18T20:14:52Z

> This would require users to interact with their source non-zarr files through the icechunk interface

If you're willing to ask all users to do all writes of data through the icechunk interface, then why not just write as zarr?

I'm guessing that the reason you want to write new chunks as some other file format is because you want to access the chunks from a reader that only understands that other file format. But there are other solutions for the problem of access-zarr-as-if-not-zarr, e.g. arraylake's query service, or the clever subclassing of h5py.Dataset mentioned here (NeurodataWithoutBorders/lindi#91 (comment)).
