managing source file of virtual datasets within icechunk #1053
bendichter
started this conversation in
Ideas
-
If you're willing to ask all users to do all writes of data through the icechunk interface, then why not just write as zarr? I'm guessing that the reason you want to write new chunks as some other file format is because you want to access the chunks from a reader that only understands that other file format. But there are other solutions for the problem of access-zarr-as-if-not-zarr, e.g. arraylake's query service, or the clever subclassing of |
-
Inspired by a convo in the webinar today, I wanted to propose a feature related to virtual datasets.
Currently, when you want to use a virtual dataset, that dataset is external to icechunk. If the source data changes, icechunk can detect that there was a change and invalidate the references.
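To make the current behavior concrete, here is a toy sketch of how a stored reference could notice that its external source changed. It is not icechunk's actual API: `VirtualRef`, `fingerprint`, `read_chunk`, and `StaleReferenceError` are hypothetical names, a plain dict stands in for external storage, and a SHA-256 hash stands in for whatever change marker the store really records.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


def fingerprint(data: bytes) -> str:
    """Hypothetical change marker for a source file (here: a content hash)."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class VirtualRef:
    """Hypothetical reference to a byte range inside an external file."""
    path: str
    offset: int
    length: int
    expected_fingerprint: str  # recorded when the reference was created


class StaleReferenceError(Exception):
    """Raised when the source file no longer matches the recorded fingerprint."""


def read_chunk(ref: VirtualRef, external_files: dict[str, bytes]) -> bytes:
    data = external_files[ref.path]
    # Refuse to serve bytes from a file that changed after it was referenced.
    if fingerprint(data) != ref.expected_fingerprint:
        raise StaleReferenceError(f"{ref.path} changed since it was referenced")
    return data[ref.offset : ref.offset + ref.length]


# A dict stands in for external object storage (assumption for illustration).
files = {"data.h5": b"0123456789"}
ref = VirtualRef("data.h5", offset=2, length=4,
                 expected_fingerprint=fingerprint(files["data.h5"]))
chunk_before = read_chunk(ref, files)

# Someone overwrites the source file outside of the store.
files["data.h5"] = b"abcdefghij"
try:
    read_chunk(ref, files)
    stale_detected = False
except StaleReferenceError:
    stale_detected = True
```

The point of the sketch is that the change can be detected, but the old bytes are gone, so the reference can only be invalidated, not served.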
An alternative approach would be to manage the source data within icechunk. Then, if a source file that is referenced by an existing snapshot is changed, the original file is kept and a new file is created for the changed version. You may also want to automatically virtualize the new version of the source data file and create a new snapshot in icechunk. This would allow you to maintain versions of datasets that include virtual components, and ensure that the source data stays consistent with the icechunk references.
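Roughly what I have in mind, as a toy in-memory sketch and not a proposal for the actual API. `ManagedSourceStore` and its methods are hypothetical names, and for simplicity it keeps every version of a file rather than only the versions that a snapshot references.

```python
from __future__ import annotations

import hashlib


class ManagedSourceStore:
    """Toy model: source file versions are immutable and snapshots pin them."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}           # version id -> file contents
        self.current: dict[str, str] = {}           # path -> latest version id
        self.snapshots: list[dict[str, str]] = []   # snapshot -> {path: version id}

    def write(self, path: str, data: bytes) -> str:
        # A change never overwrites existing bytes; it adds a new version.
        version_id = hashlib.sha256(data).hexdigest()[:12]
        self.blobs[version_id] = data
        self.current[path] = version_id
        return version_id

    def commit(self) -> int:
        # A snapshot records which version of each file it refers to.
        self.snapshots.append(dict(self.current))
        return len(self.snapshots) - 1

    def read(self, snapshot_id: int, path: str) -> bytes:
        # Reads go through the snapshot, so they always see the pinned version.
        return self.blobs[self.snapshots[snapshot_id][path]]


store = ManagedSourceStore()
store.write("data.h5", b"version one")
snap0 = store.commit()

store.write("data.h5", b"version two")  # the old version is kept
snap1 = store.commit()                   # new snapshot for the new version

old = store.read(snap0, "data.h5")
new = store.read(snap1, "data.h5")
```

A real implementation would presumably also re-virtualize the new file before committing and garbage-collect versions that no snapshot references; both are left out here.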
One downside of this approach: it would require users to interact with their non-zarr source files through the icechunk interface, which may not be a reasonable expectation.
cc @rly
p.s. this is a very cool project, thanks for working on it!