Proof of concept: lazily loading datasets with VirtualiZarr #654
Codecov Report: all modified and coverable lines are covered by tests ✅

```diff
@@           Coverage Diff            @@
##           develop     #654   +/-  ##
=======================================
+ Coverage    88.44%   88.47%  +0.03%
=======================================
  Files           34       34
  Lines         1791     1796      +5
=======================================
+ Hits          1584     1589      +5
  Misses         207      207
```
I don't understand why this would be useful. It increases the API surface without adding the ability to do anything new - it's exactly the same as just

```python
manifest_store = HDFParser()('file.nc')
ds = xr.open_zarr(manifest_store)
```
My interpretation of the ask from #647 is that he had a lazy xarray dataset that he created without using virtualizarr at all, and wanted to somehow serialize that as virtual references. Which isn't possible unless that reference information is still in there somewhere to be extracted.
You're right that this pathway doesn't exist, but my understanding is that you're suggesting we implement basically

```python
def open_mfdataset(files):
    datasets = [vz.open_dataset(file) for file in files]
    return xr.combine_nested(datasets)
```

with the idea being that it returns a lazily-loaded dataset. But until xarray supports lazy concatenation, that will not work - instead it will trigger loading the whole dataset. And if it loads the whole dataset you may as well just do

```python
stores = [parser(file) for file in files]
datasets = [vz.open_zarr(store) for store in stores]
result = xr.combine_nested(datasets)
```

because you'll get the same outcome.
If xarray supported lazy concatenation then the whole idea would become a lot more interesting, because then there's suddenly way more crossover between concatenation in virtualizarr and concatenation in xarray.
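Since the eager-versus-lazy distinction is central to the argument above, here is a minimal, self-contained sketch of it in plain Python. The classes are hypothetical stand-ins, not xarray or VirtualiZarr internals: an eager combine forces every source "file" to load up front, while a lazy concatenation container only loads the source a read actually touches.

```python
class LazySource:
    """Stands in for a lazily-opened on-disk array (hypothetical)."""

    def __init__(self, data):
        self._data = data
        self.loaded = False

    def __len__(self):
        # Length is metadata: knowing it does not require a read.
        return len(self._data)

    def values(self):
        self.loaded = True  # simulate actually reading bytes from disk
        return list(self._data)


def eager_concat(sources):
    """Like combining today: materializes every source immediately."""
    out = []
    for s in sources:
        out.extend(s.values())
    return out


class LazyConcat:
    """Records its sources; loads only the one an index actually hits."""

    def __init__(self, sources):
        self.sources = sources

    def __getitem__(self, i):
        for s in self.sources:
            if i < len(s):
                return s.values()[i]  # only this source gets read
            i -= len(s)
        raise IndexError(i)


a, b = LazySource([1, 2]), LazySource([3, 4])
eager_concat([a, b])
assert a.loaded and b.loaded  # eager combine read every source

c, d = LazySource([1, 2]), LazySource([3, 4])
lazy = LazyConcat([c, d])
assert not (c.loaded or d.loaded)  # nothing read yet
assert lazy[3] == 4
assert d.loaded and not c.loaded  # only the touched source was read
```

This is only a toy model of the idea tracked upstream; real lazy concatenation in xarray would have to handle multi-dimensional indexing, alignment, and coordinates.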
This is helpful, thanks for the discussion. I'll close the PR in the morning (I don't see how to close from the app).
Happy to discuss this with you further, but shall we close this for now? I don't think 3 lines of code in a dangling PR is a very good way to track this...
xref pydata/xarray#4628 and pydata/xarray#10402
Note that the reason I didn't put effort into that and instead made virtualizarr is that you can't persist that entire datacube to kerchunk / icechunk if you do it this way.
Just showing how simple it would be to support the ask from #647. I think we should consider it.
Files changed: docs/releases.rst, api.rst