
Proof of concept: lazily loading datasets with VirtualiZarr #654


Closed
maxrjones wants to merge 1 commit into develop from open-dataset

Conversation

maxrjones (Member)

Just showing how simple it would be to support the ask from #647. I think we should consider it (a sketch of the idea follows the checklist below).

  • Closes #xxxx
  • Tests added
  • Tests passing
  • Full type hint coverage
  • Changes are documented in docs/releases.rst
  • New functions/methods are listed in api.rst
  • New functionality has documentation
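
For context, a minimal sketch of what such a wrapper could look like, assuming the HDFParser/ManifestStore API discussed in the comments below; this is an illustration of the idea, not the actual diff in this PR:

import xarray as xr
from virtualizarr.parsers import HDFParser

def open_dataset(filepath):
    # Build a ManifestStore of virtual references for the file,
    # then open it as a lazily-indexed xarray Dataset.
    manifest_store = HDFParser()(filepath)
    return xr.open_zarr(manifest_store)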

codecov bot commented Jul 2, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.47%. Comparing base (ebed400) to head (4080b4c).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #654      +/-   ##
===========================================
+ Coverage    88.44%   88.47%   +0.03%     
===========================================
  Files           34       34              
  Lines         1791     1796       +5     
===========================================
+ Hits          1584     1589       +5     
  Misses         207      207              
Files with missing lines    Coverage Δ
virtualizarr/__init__.py    75.00% <100.00%> (ø)
virtualizarr/xarray.py      86.23% <100.00%> (+0.66%) ⬆️

@TomNicholas (Member)

> what I would hope is a simple use case for VirtualiZarr, which is to create a lazy-loaded xarray of existing NetCDF files.

this statement is absolutely true - it is a simple use case.

I don't understand why this would be useful. It

  1. isn't serialized permanently
  2. doesn't involve or allow lazy concatenation (and couldn't until xarray supports lazy concatenation)
  3. saves the user only 2 or 3 lines of code
  4. offers nothing beyond what ManifestStore now offers

It increases the API surface without adding the ability to do anything new - it's totally the same as just

import xarray as xr
from virtualizarr.parsers import HDFParser

manifest_store = HDFParser()('file.nc')
ds = xr.open_zarr(manifest_store)

@maxrjones (Member, Author)

> It increases the API surface without adding the ability to do anything new

virtualizarr.open_mfdataset() is where the user would actually save effort. What is the equivalent for going from virtualizarr.open_virtual_mfdataset() to an actual xarray dataset without temporarily serializing to kerchunk/icechunk?

@TomNicholas (Member)

My interpretation of the ask from #647 is that he had a lazy xarray dataset that he created without using virtualizarr at all, and wanted to somehow serialize that as virtual references. Which isn't possible unless that reference information is somehow still in there to be extracted.

@TomNicholas (Member)

> What is the equivalent for going from virtualizarr.open_virtual_mfdataset() to an actual xarray dataset without temporarily serializing to kerchunk/icechunk?

You're right that this pathway doesn't exist, but my understanding is that you're suggesting we implement basically

import xarray as xr
import virtualizarr as vz

def open_mfdataset(files, concat_dim):
    # Open each file lazily, then combine (xr.combine_nested needs a concat_dim)
    datasets = [vz.open_dataset(file) for file in files]
    return xr.combine_nested(datasets, concat_dim=concat_dim)

with the idea being that it returns a lazily-loaded dataset. But until xarray supports lazy concatenation, that will not work - instead it will trigger loading the whole dataset. And if it loads the whole dataset you may as well just do

parser = HDFParser()
stores = [parser(file) for file in files]
datasets = [xr.open_zarr(store) for store in stores]
result = xr.combine_nested(datasets, concat_dim=concat_dim)

because you'll get the same outcome.
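
One way to see the load-triggering behaviour described above (a hedged sketch: the file paths, variable name, and concat dimension are placeholders, and the type check peeks at a private xarray attribute, so treat it as illustrative only):

import xarray as xr

# Opened without dask, each variable is wrapped in xarray's lazy
# indexing classes rather than read into memory.
ds1 = xr.open_dataset("a.nc")  # placeholder path
ds2 = xr.open_dataset("b.nc")  # placeholder path
print(type(ds1["some_var"].variable._data))  # a lazy wrapper, not np.ndarray

# xarray has no lazy concatenation (see pydata/xarray#4628), so concat
# coerces the lazy arrays to numpy, loading all of the data.
combined = xr.concat([ds1, ds2], dim="time")
print(type(combined["some_var"].variable._data))  # numpy.ndarray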

@TomNicholas (Member) commented Jul 2, 2025

If xarray supported lazy concatenation then the whole idea becomes a lot more interesting, because then there's suddenly way more crossover between concatenation in virtualizarr and concatenation in xarray.

@maxrjones (Member, Author) commented Jul 2, 2025

> If xarray supported lazy concatenation then the whole idea becomes a lot more interesting, because then there's suddenly way more crossover between concatenation in virtualizarr and concatenation in xarray.

This is helpful, thanks for the discussion. I'll close the PR in the morning (I don't see how to close from the app).

@TomNicholas (Member)

Happy to discuss this with you further, but shall we close this for now? I don't think 3 lines of code in a dangling PR is a very good way to track this...

@TomNicholas (Member)

xref pydata/xarray#4628 and pydata/xarray#10402

@TomNicholas (Member) commented Jul 2, 2025

Note that the reason I didn't put effort into that approach, and instead made virtualizarr, is that you can't persist the resulting datacube to kerchunk / icechunk if you do it this way.
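
For contrast, a hedged sketch of the virtualizarr pathway that does persist the combined datacube (function and accessor names as in the VirtualiZarr docs; the paths, concat dimension, and combine kwargs are placeholders):

import virtualizarr as vz

# Combine virtual references across files without loading any array data ...
vds = vz.open_virtual_mfdataset(["a.nc", "b.nc"], concat_dim="time", combine="nested")

# ... then persist the whole virtual datacube as kerchunk references,
# which can later be reopened without touching the original files' data.
vds.virtualize.to_kerchunk("combined.json", format="json")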

@maxrjones maxrjones closed this Jul 2, 2025
@TomNicholas TomNicholas deleted the open-dataset branch July 4, 2025 19:31