Improving performance for h5netcdf #195

@hmaarrfk

What happened:

I've been trying to use h5netcdf because it gives access to the underlying h5py dataset, which makes it easier to prototype with different HDF5 backends. netcdf4-c is much harder to compile, and thus to prototype with, since many of its internals are not accessible.

My original concern arose from observing how long some of our datasets take to open:

  • xarray + netcdf4-python: about 80 ms
  • xarray + h5netcdf: 600 ms

Ultimately, I think things boil down to the fact that accessing the _h5ds property is quite costly. This makes "every" operation "slow".

Consider the code below:

import h5netcdf
from pathlib import Path

h5nc_file = h5netcdf.File(
    Path.home() / 'sample_file_zebrafish_0.18.55_4uCams_small.nc',
)

images = h5nc_file.variables['images']
for i in range(1000):
    images.shape

Using the Spyder profiler, one finds that 1000 calls to shape result in about 8000 calls to _h5ds.

import h5netcdf
from pathlib import Path

h5nc_file = h5netcdf.File(
    Path.home() / 'sample_file_zebrafish_0.18.55_4uCams_small.nc',
)

images = h5nc_file.variables['images']

for i in range(8000):
    images._h5ds

These accesses are costly because the property is created on demand.

I think this has cascading effects because everything really relies on the access to the underlying h5py structure.

A previous suggestion tried to remove the property, but it seems there was some concern about "resetting" the underlying _h5py pointer. I am unclear on where this kind of use case would come up. My understanding is that this pointer would not change.
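
To make the trade-off concrete, here is a minimal sketch of the difference between resolving a handle on every access and resolving it once. It does not use h5netcdf's actual internals: FakeFile, OnDemandVariable, CachedVariable, and reset are hypothetical names invented for illustration, and the lookup counter merely stands in for a costly HDF5 call.

```python
from functools import cached_property


class FakeFile:
    """Hypothetical stand-in for an open file; counts path lookups."""

    def __init__(self):
        self.lookups = 0
        self._datasets = {"/images": (10, 4, 64, 64)}  # path -> shape

    def __getitem__(self, path):
        self.lookups += 1  # each lookup stands in for a costly HDF5 call
        return self._datasets[path]


class _Variable:
    def __init__(self, file, path):
        self._file, self._path = file, path

    @property
    def shape(self):
        # In this toy model the "dataset" is just its shape tuple.
        return self._h5ds


class OnDemandVariable(_Variable):
    """Resolves the dataset again on every access."""

    @property
    def _h5ds(self):
        return self._file[self._path]


class CachedVariable(_Variable):
    """Resolves the dataset once and reuses the result."""

    @cached_property
    def _h5ds(self):
        return self._file[self._path]

    def reset(self):
        # Drop the cached handle so the next access resolves it again.
        self.__dict__.pop("_h5ds", None)


def count_lookups(variable_cls, n=1000):
    f = FakeFile()
    var = variable_cls(f, "/images")
    for _ in range(n):
        var.shape
    return f.lookups


print(count_lookups(OnDemandVariable))  # 1000 lookups
print(count_lookups(CachedVariable))    # 1 lookup
```

If a "reset" of the handle were ever needed, a cached attribute can still be invalidated explicitly, as the hypothetical reset method shows. Whether that fits h5netcdf's design is a separate question.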

Expected Output

For comparison, we can measure how long it takes to access the shape property with netCDF4:

from pathlib import Path
import netCDF4

# Open the same sample file with netCDF4 instead of h5netcdf
nc_file = netCDF4.Dataset(
    str(Path.home() / 'sample_file_zebrafish_0.18.55_4uCams_small.nc'),
)

images = nc_file.variables['images']
for i in range(1000):
    images.shape

This takes about 400 µs instead of 350 ms.
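
For anyone who wants to reproduce this kind of measurement without a profiler, the following is a rough sketch of a timing helper. The function name time_attribute_access and the Dummy class are made up for illustration. The helper only measures wall-clock time for repeated attribute reads, so the numbers will differ from profiler output.

```python
import time


def time_attribute_access(obj, name, repeats=1000):
    """Return the total seconds spent reading obj.<name> `repeats` times."""
    start = time.perf_counter()
    for _ in range(repeats):
        getattr(obj, name)
    return time.perf_counter() - start


class Dummy:
    """Placeholder object so the sketch runs without a data file."""

    shape = (10, 4, 64, 64)


elapsed = time_attribute_access(Dummy(), "shape")
print(f"1000 accesses: {elapsed * 1e3:.3f} ms")
```

With the sample file, one would pass the images variable obtained from either library in place of Dummy() to compare the two backends on equal terms.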

Anything else we need to know?:

The sample image has 4 dimensions

sample_file_zebrafish_0.18.55_4uCams_small.zip

Version

Output of print(h5py.version.info, f"\nh5netcdf {h5netcdf.__version__}")

Summary of the h5py configuration

h5py 3.7.0
HDF5 1.12.2
Python 3.9.13 | packaged by ...
[GCC 10.4.0]
sys.platform linux
sys.maxsize 9223372036854775807
numpy 1.23.5
cython (built with) 0.29.32
numpy (built against) 1.20.3
HDF5 (built against) 1.12.2

h5netcdf 0.14.0.dev51+g8302776
