
Proof of concept: lazily loading datasets with VirtualiZarr #654


Closed
maxrjones wants to merge 1 commit into develop from open-dataset

Conversation

maxrjones (Member)

Just showing how simple it would be to support the ask from #647. I think we should consider it (a sketch of the idea follows the checklist below).

  • Closes #xxxx
  • Tests added
  • Tests passing
  • Full type hint coverage
  • Changes are documented in docs/releases.rst
  • New functions/methods are listed in api.rst
  • New functionality has documentation
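
For context, a minimal sketch of what such a wrapper could look like, assuming the HDFParser/ManifestStore API discussed in the comments below; this is an illustration of the idea, not the actual diff in this PR:

import xarray as xr
from virtualizarr.parsers import HDFParser

def open_dataset(filepath):
    # Build a ManifestStore of virtual references for the file,
    # then open it as a lazily-indexed xarray Dataset.
    manifest_store = HDFParser()(filepath)
    return xr.open_zarr(manifest_store)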

codecov bot commented Jul 2, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.47%. Comparing base (ebed400) to head (4080b4c).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #654      +/-   ##
===========================================
+ Coverage    88.44%   88.47%   +0.03%     
===========================================
  Files           34       34              
  Lines         1791     1796       +5     
===========================================
+ Hits          1584     1589       +5     
  Misses         207      207              
Files with missing lines    Coverage Δ
virtualizarr/__init__.py    75.00% <100.00%> (ø)
virtualizarr/xarray.py      86.23% <100.00%> (+0.66%) ⬆️

@TomNicholas (Member)

> what I would hope is a simple use case for VirtualiZarr, which is to create a lazy-loaded xarray of existing NetCDF files.

this statement is absolutely true - it is a simple use case.

I don't understand why this would be useful. It

  1. isn't serialized permanently
  2. doesn't involve or allow lazy concatenation (and couldn't until xarray supports lazy concatenation)
  3. saves the user only 2 or 3 lines of code
  4. offers nothing beyond what ManifestStore now offers

It increases the API surface without adding the ability to do anything new - it's totally the same as just

import xarray as xr
from virtualizarr.parsers import HDFParser

manifest_store = HDFParser()('file.nc')
ds = xr.open_zarr(manifest_store)

@maxrjones (Member, Author)

> It increases the API surface without adding the ability to do anything new

virtualizarr.open_mfdataset() is where the user would actually save effort. What is the equivalent for going from virtualizarr.open_virtual_mfdataset() to an actual xarray dataset without temporarily serializing to kerchunk/icechunk?

@TomNicholas (Member)

My interpretation of the ask from #647 is that he had a lazy xarray dataset that he created without using virtualizarr at all, and wanted to somehow serialize that as virtual references. Which isn't possible unless that reference information is somehow still in there to be extracted.

@TomNicholas (Member)

> What is the equivalent for going from virtualizarr.open_virtual_mfdataset() to an actual xarray dataset without temporarily serializing to kerchunk/icechunk?

You're right that this pathway doesn't exist, but my understanding is that you're suggesting we implement basically

import xarray as xr
import virtualizarr as vz

def open_mfdataset(files, concat_dim):
    # Open each file lazily, then combine (xr.combine_nested needs a concat_dim)
    datasets = [vz.open_dataset(file) for file in files]
    return xr.combine_nested(datasets, concat_dim=concat_dim)

with the idea being that it returns a lazily-loaded dataset. But until xarray supports lazy concatenation, that will not work - instead it will trigger loading the whole dataset. And if it loads the whole dataset you may as well just do

parser = HDFParser()
stores = [parser(file) for file in files]
datasets = [xr.open_zarr(store) for store in stores]
result = xr.combine_nested(datasets, concat_dim=concat_dim)

because you'll get the same outcome.
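
One way to see the load-triggering behaviour described above (a hedged sketch: the file paths, variable name, and concat dimension are placeholders, and the type check peeks at a private xarray attribute, so treat it as illustrative only):

import xarray as xr

# Opened without dask, each variable is wrapped in xarray's lazy
# indexing classes rather than read into memory.
ds1 = xr.open_dataset("a.nc")  # placeholder path
ds2 = xr.open_dataset("b.nc")  # placeholder path
print(type(ds1["some_var"].variable._data))  # a lazy wrapper, not np.ndarray

# xarray has no lazy concatenation (see pydata/xarray#4628), so concat
# coerces the lazy arrays to numpy, loading all of the data.
combined = xr.concat([ds1, ds2], dim="time")
print(type(combined["some_var"].variable._data))  # numpy.ndarray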

@TomNicholas (Member) commented Jul 2, 2025

If xarray supported lazy concatenation then the whole idea becomes a lot more interesting, because then there's suddenly way more crossover between concatenation in virtualizarr and concatenation in xarray.

@maxrjones (Member, Author) commented Jul 2, 2025

> If xarray supported lazy concatenation then the whole idea becomes a lot more interesting, because then there's suddenly way more crossover between concatenation in virtualizarr and concatenation in xarray.

This is helpful, thanks for the discussion. I'll close the PR in the morning (I don't see how to close from the app).

@TomNicholas (Member)

Happy to discuss this with you further, but shall we close this for now? I don't think 3 lines of code in a dangling PR is a very good way to track this...

@TomNicholas (Member)

xref pydata/xarray#4628 and pydata/xarray#10402

@TomNicholas (Member) commented Jul 2, 2025

Note that the reason I didn't put effort into that approach, and instead made virtualizarr, is that you can't persist the resulting datacube to kerchunk / icechunk if you do it this way.
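
For contrast, a hedged sketch of the virtualizarr pathway that does persist the combined datacube (function and accessor names as in the VirtualiZarr docs; the paths, concat dimension, and combine kwargs are placeholders):

import virtualizarr as vz

# Combine virtual references across files without loading any array data ...
vds = vz.open_virtual_mfdataset(["a.nc", "b.nc"], concat_dim="time", combine="nested")

# ... then persist the whole virtual datacube as kerchunk references,
# which can later be reopened without touching the original files' data.
vds.virtualize.to_kerchunk("combined.json", format="json")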

@maxrjones maxrjones closed this Jul 2, 2025
@TomNicholas TomNicholas deleted the open-dataset branch July 4, 2025 19:31