Does Xarray do lazy reads when opening netcdf from s3? #6404

ianliu · 2022-03-23T14:55:07Z

Xarray's netCDF documentation says data is read lazily from disk. Does this still hold for files residing in AWS S3?

simlmx · 2023-02-23T17:15:23Z

You should be able to do this:

import fsspec
import xarray as xr

with fsspec.open("s3://bucket/file.netcdf", mode="rb") as f:
    ds = xr.open_dataset(f, engine="h5netcdf")
    # Do stuff with `ds` here, while the file is still open

I had a case where I didn't want to use a context manager and ended up with something like this, which feels hacky but worked for me:

f = fsspec.open("s3://bucket/file.netcdf", mode="rb")
f = f.__enter__()
ds = xr.open_dataset(f, engine="h5netcdf")

3 replies

keewis Feb 23, 2023
Maintainer

As far as I can tell, the context manager is not really necessary:

xr.open_dataset(fsspec.open("...", mode="rb"), engine="h5netcdf")

should work as well.

simlmx Feb 23, 2023

That's what I tried but I got this error:

File "mycode.py", line 45, in _open
  self._data = xr.open_dataset(f, engine="h5netcdf")
File "/home/simon/.cache/pypoetry/virtualenvs/project-gDBmdbVF-py3.10/lib/python3.10/site-packages/xarray/backends/api.py", line 540, in open_dataset
  backend_ds = backend.open_dataset(
File "/home/simon/.cache/pypoetry/virtualenvs/project-gDBmdbVF-py3.10/lib/python3.10/site-packages/xarray/backends/h5netcdf_.py", line 407, in open_dataset
  store = H5NetCDFStore.open(
File "/home/simon/.cache/pypoetry/virtualenvs/project-gDBmdbVF-py3.10/lib/python3.10/site-packages/xarray/backends/h5netcdf_.py", line 170, in open
  return cls(manager, group=group, mode=mode, lock=lock, autoclose=autoclose)
File "/home/simon/.cache/pypoetry/virtualenvs/project-gDBmdbVF-py3.10/lib/python3.10/site-packages/xarray/backends/h5netcdf_.py", line 120, in __init__
  self._filename = find_root_and_group(self.ds)[0].filename
File "/home/simon/.cache/pypoetry/virtualenvs/project-gDBmdbVF-py3.10/lib/python3.10/site-packages/xarray/backends/h5netcdf_.py", line 181, in ds
  return self._acquire()
File "/home/simon/.cache/pypoetry/virtualenvs/project-gDBmdbVF-py3.10/lib/python3.10/site-packages/xarray/backends/h5netcdf_.py", line 173, in _acquire
  with self._manager.acquire_context(needs_lock) as root:
File "/home/simon/.pyenv/versions/3.10.8/lib/python3.10/contextlib.py", line 135, in __enter__
  return next(self.gen)
File "/home/simon/.cache/pypoetry/virtualenvs/project-gDBmdbVF-py3.10/lib/python3.10/site-packages/xarray/backends/file_manager.py", line 197, in acquire_context
  file, cached = self._acquire_with_cache_info(needs_lock)
File "/home/simon/.cache/pypoetry/virtualenvs/project-gDBmdbVF-py3.10/lib/python3.10/site-packages/xarray/backends/file_manager.py", line 215, in _acquire_with_cache_info
  file = self._opener(*self._args, **kwargs)
File "/home/simon/.cache/pypoetry/virtualenvs/project-gDBmdbVF-py3.10/lib/python3.10/site-packages/h5netcdf/core.py", line 1064, in __init__
  self._h5file = self._h5py.File(
File "/home/simon/.cache/pypoetry/virtualenvs/project-gDBmdbVF-py3.10/lib/python3.10/site-packages/h5py/_hl/files.py", line 542, in __init__
  name = filename_encode(name)
File "/home/simon/.cache/pypoetry/virtualenvs/project-gDBmdbVF-py3.10/lib/python3.10/site-packages/h5py/_hl/compat.py", line 19, in filename_encode
  filename = fspath(filename)

TypeError: expected str, bytes or os.PathLike object, not OpenFile
main simon@simon-rd:~/code/pv-site-production

Versions of potentially relevant libraries:

xarray==2022.12.0
fsspec==2022.11.0
s3fs==2022.11.0
h5netcdf==1.1.0
h5py==3.8.0

keewis Feb 23, 2023
Maintainer

I don't have access to HDF5 files over s3 so I can't verify, but I know that

fs = fsspec.filesystem("http")
xr.open_dataset(fs.open("http://127.0.0.1:9500/path/to/file.nc"), engine="h5netcdf")

succeeds in my environment. So I guess you either have to use fsspec.open() with a context manager, or call open() on a matching filesystem object (see also fsspec/filesystem_spec#579 for a lot of discussion on the differences and a few other options).
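
The TypeError in the traceback above can be reproduced without S3 or any third-party library. The sketch below uses a hypothetical stand-in class (named `OpenFile` only so the error message matches; it is not fsspec's implementation) that, like the object returned by fsspec.open(), is a context manager but neither a path nor a file-like object. The `describe` helper is also invented for illustration, and it assumes, as the traceback suggests, that anything that does not look like an open file ends up being passed to `os.fspath()`.

```python
import io
import os


class OpenFile:
    """Hypothetical stand-in for the wrapper returned by fsspec.open().

    It has no read()/seek() and no __fspath__(); the real file-like
    object only appears once the context manager is entered.
    """

    def __init__(self, payload: bytes):
        self._payload = payload

    def __enter__(self):
        # Entering the context yields the actual file-like object.
        self._file = io.BytesIO(self._payload)
        return self._file

    def __exit__(self, *exc_info):
        self._file.close()


def describe(obj) -> str:
    """Classify obj as file-like, path, or rejected (assumed dispatch)."""
    if hasattr(obj, "read") and hasattr(obj, "seek"):
        return "file-like"
    try:
        os.fspath(obj)  # the call that fails at the bottom of the traceback
    except TypeError as err:
        return f"TypeError: {err}"
    return "path"


wrapper = OpenFile(b"\x89HDF\r\n\x1a\n")
unentered = describe(wrapper)   # the wrapper itself is rejected
with wrapper as f:
    entered = describe(f)       # the entered object is accepted

print(unentered)
print(entered)
```

This is why both workarounds in this thread behave the same way: entering the context manager and calling open() on a filesystem object each hand the backend something with `read` and `seek`, rather than the un-entered wrapper.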

rsignell-usgs · 2023-08-03T16:48:12Z

When you read NetCDF4 files (which are HDF5 files with certain conventions) from S3 using Xarray, only the metadata and coordinate variables are loaded eagerly, while the data variables are loaded lazily, just as if the NetCDF4 file were on a local filesystem.

Here's a Jupyter Notebook demonstrating opening a 25GB file from S3 in a few seconds, then reading data lazily.
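
As a rough mental model of what "lazy" means here, the toy sketch below uses only the standard library; it is not Xarray's or h5netcdf's actual implementation. Opening the toy file reads a small header, and the bytes of the data variable are fetched only when a slice is requested. The names `CountingFile`, `LazyVariable`, and `open_toy`, and the 8-byte header layout, are invented for illustration. With a remote file-like object, each seek-and-read would roughly correspond to a byte-range request, subject to whatever caching the filesystem layer applies.

```python
import io
import struct


class CountingFile(io.BytesIO):
    """In-memory file that records how many bytes each read() returns."""

    def __init__(self, data: bytes):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


class LazyVariable:
    """Toy variable: stores only its location in the file; reads on indexing."""

    ITEM = struct.calcsize("<d")  # size of one little-endian float64

    def __init__(self, f, offset: int, length: int):
        self._f, self._offset, self.length = f, offset, length

    def __getitem__(self, key: slice):
        start, stop, step = key.indices(self.length)
        if step != 1:
            raise ValueError("toy example supports contiguous slices only")
        count = max(stop - start, 0)
        # Jump straight to the requested elements and read only those bytes.
        self._f.seek(self._offset + start * self.ITEM)
        raw = self._f.read(count * self.ITEM)
        return list(struct.unpack(f"<{count}d", raw))


def open_toy(f) -> LazyVariable:
    """Eagerly read the 8-byte header (the 'metadata'), nothing else."""
    (length,) = struct.unpack("<q", f.read(8))
    return LazyVariable(f, offset=8, length=length)


# Build a toy file: an int64 element count followed by 1000 float64 values.
values = [float(i) for i in range(1000)]
blob = struct.pack("<q", len(values)) + struct.pack("<1000d", *values)

f = CountingFile(blob)
var = open_toy(f)
after_open = f.bytes_read    # bytes read so far: header only
window = var[10:13]          # triggers a read of just three values
after_slice = f.bytes_read   # header plus the three requested values

print(after_open, window, after_slice)
```

The counter makes the point of the answer above concrete: the cost of opening depends on the size of the metadata, not on the size of the file, and later reads scale with the slice that is actually requested.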
