Description
What happened:
I've been trying to use h5netcdf since it gives access to the underlying h5py
dataset which makes it easier to prototype with different HDF5 backends. netcdf4-c is much harder to compile and thus prototype with since many of the internals are not accessible.
My original concern arose from observing that some of our datasets open with
- xarray + netcdf4-python in about 80 ms
- xarray + h5netcdf in about 600 ms.
Ultimately, I believe this boils down to the fact that accessing the _h5ds
property is quite costly. This makes "every" operation "slow".
Consider the code below:
import h5netcdf
from pathlib import Path
h5nc_file = h5netcdf.File(
    Path.home() / 'sample_file_zebrafish_0.18.55_4uCams_small.nc',
)
images = h5nc_file.variables['images']
for i in range(1000):
    images.shape
Using the Spyder profiler, one finds that 1000 calls to shape result in about 8000 calls to _h5ds:
import h5netcdf
from pathlib import Path
h5nc_file = h5netcdf.File(
    Path.home() / 'sample_file_zebrafish_0.18.55_4uCams_small.nc',
)
images = h5nc_file.variables['images']
for i in range(8000):
    images._h5ds
This is because this property is created on demand.
I think this has cascading effects because everything really relies on the access to the underlying h5py
structure.
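To illustrate the kind of fan-out I mean, here is a minimal sketch that does not use h5netcdf at all. The classes `FakeDataset` and `OnDemandVariable` are hypothetical stand-ins, and the way `shape` touches `_h5ds` once per dimension is an assumption made purely for illustration; it is not a reproduction of the actual h5netcdf call chain.

```python
class FakeDataset:
    """Hypothetical stand-in for an h5py dataset; it only carries a shape."""

    def __init__(self, shape):
        self.shape = shape


class OnDemandVariable:
    """Hypothetical variable that looks up its dataset on every access."""

    def __init__(self, store, path):
        self._store = store
        self._path = path
        self.lookups = 0  # counts how often the lookup is repeated

    @property
    def _h5ds(self):
        # Every access repeats the lookup instead of reusing a previous result.
        self.lookups += 1
        return self._store[self._path]

    @property
    def shape(self):
        # Illustrative assumption: one lookup for ndim, then one per dimension.
        ndim = len(self._h5ds.shape)
        return tuple(self._h5ds.shape[i] for i in range(ndim))


store = {"images": FakeDataset((4, 10, 64, 64))}
images = OnDemandVariable(store, "images")
for _ in range(1000):
    images.shape

# Each shape access costs 1 + ndim lookups in this sketch.
print(images.lookups)
```

With a real HDF5 lookup behind each access, the per-call cost gets multiplied by whatever the fan-out happens to be.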
A previous suggestion tried to remove the property, but it seems there was some concern about "resetting" the underlying _h5py
pointer. I am unclear on where this kind of use case would come up. My understanding is that this pointer would not change.
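If resetting really is needed, one possible compromise would be to cache the handle and expose an explicit reset. The sketch below is only an illustration under that assumption: `CachedVariable` and `_reset_h5ds` are hypothetical names, a plain dict stands in for the open file, and none of this reflects how h5netcdf is actually implemented.

```python
from functools import cached_property


class CachedVariable:
    """Hypothetical variable that caches its dataset handle after first use."""

    def __init__(self, store, path):
        self._store = store
        self._path = path
        self.lookups = 0  # counts how often the real lookup runs

    @cached_property
    def _h5ds(self):
        # Runs once; later accesses read the stored value from __dict__.
        self.lookups += 1
        return self._store[self._path]

    def _reset_h5ds(self):
        # Hypothetical hook: drop the cached handle so that the next
        # access performs the lookup again.
        self.__dict__.pop("_h5ds", None)


store = {"images": [0, 1, 2]}
var = CachedVariable(store, "images")
for _ in range(8000):
    var._h5ds
lookups_before_reset = var.lookups  # a single lookup for 8000 accesses

# Simulate the underlying object being replaced, then reset the cache.
store["images"] = [3, 4, 5]
var._reset_h5ds()
refreshed = var._h5ds
```

The trade-off is that whoever replaces the underlying object becomes responsible for calling the reset.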
Expected Output
For comparison, here is how long it takes to access the shape property with netCDF4:
from pathlib import Path
import netCDF4
nc_file = netCDF4.Dataset(
    str(Path.home() / 'sample_file_zebrafish_0.18.55_4uCams_small.nc'),
)
images = nc_file.variables['images']
for i in range(1000):
    images.shape
About 400 us instead of 350 ms.
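For anyone who wants to reproduce this kind of measurement without a profiler, a small helper along these lines should be enough. `time_attribute_access` is a hypothetical name, and the `Dummy` class only stands in for a real variable object so that the snippet runs without the sample file.

```python
import time


def time_attribute_access(obj, name, repeats=1000):
    """Return the total seconds spent reading obj.<name> `repeats` times."""
    start = time.perf_counter()
    for _ in range(repeats):
        getattr(obj, name)
    return time.perf_counter() - start


class Dummy:
    """Stand-in object; replace with a real h5netcdf or netCDF4 variable."""

    shape = (4, 10, 64, 64)


elapsed = time_attribute_access(Dummy(), "shape")
print(f"{elapsed * 1e6:.0f} us for 1000 accesses")
```

Passing the `images` variable from either library in place of `Dummy()` should give numbers comparable to the ones quoted here.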
Anything else we need to know?:
The sample image has 4 dimensions
sample_file_zebrafish_0.18.55_4uCams_small.zip
Version
Output of print(h5py.version.info, f"\nh5netcdf {h5netcdf.__version__}")
Summary of the h5py configuration
h5py 3.7.0
HDF5 1.12.2
Python 3.9.13 | packaged by ...
[GCC 10.4.0]
sys.platform linux
sys.maxsize 9223372036854775807
numpy 1.23.5
cython (built with) 0.29.32
numpy (built against) 1.20.3
HDF5 (built against) 1.12.2
h5netcdf 0.14.0.dev51+g8302776