-
Hi,

I'm writing a custom backend and trying to get the data loaded lazily as a dask array, but the result is not what I expect. Here is a minimal example:

```python
import h5py
from xarray.backends import BackendEntrypoint, BackendArray
from xarray import Dataset, DataArray, Variable
from xarray.core import indexing
import numpy as np

filename = "/tmp/testsgli.h5"
varname = "Lt_VN01"


def create_h5_data(filename, varname, shape):
    h5f = h5py.File(filename, mode="w")
    h5f[varname] = np.random.rand(*shape)
    h5f.close()


create_h5_data(filename, varname, (2000, 2000))


class H5Array(BackendArray):
    def __init__(self, array):
        self.shape = array.shape
        self.dtype = array.dtype
        self.array = array

    def __getitem__(self, key):
        return indexing.explicit_indexing_adapter(
            key, self.shape, indexing.IndexingSupport.BASIC, self._getitem
        )

    def _getitem(self, key):
        return self.array[key]


class SGLIBackend(BackendEntrypoint):
    def open_dataset(self, filename, *, drop_variables=None, **kwargs):
        ds = Dataset()
        h5f = h5py.File(filename)
        h5_arr = h5f["Lt_VN01"]
        ds["Lt_VN01"] = Variable(
            ["y", "x"],
            indexing.LazilyIndexedArray(H5Array(h5_arr)),
            encoding={"preferred_chunks": h5_arr.chunks},
        )
        return ds
print(SGLIBackend().open_dataset(filename)[varname].data)
```

As you can see, the result is not a dask array.

A side question: I see that the existing engines use a datastore in their source; where can I find documentation on how to use it in the context of a backend?
-
As far as I can tell, you're not supposed to handle `dask` in the backend; this will be taken care of by `open_dataset`. As such, I think you should change the last line to `xr.open_dataset(filename, engine=SGLIBackend, chunks={})`, but that raises an `AttributeError`:

Traceback

```
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [1], in <cell line: 49>()
41 ds["Lt_VN01"] = Variable(
42 ["y", "x"],
43 indexing.LazilyIndexedArray(H5Array(h5_arr)),
44 encoding={"preferred_chunks": h5_arr.chunks},
45 )
46 return ds
---> 49 xr.open_dataset(filename, engine=SGLIBackend, chunks={})
File .../xarray/backends/api.py:545, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, backend_kwargs, **kwargs)
538 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
539 backend_ds = backend.open_dataset(
540 filename_or_obj,
541 drop_variables=drop_variables,
542 **decoders,
543 **kwargs,
544 )
--> 545 ds = _dataset_from_backend_dataset(
546 backend_ds,
547 filename_or_obj,
548 engine,
549 chunks,
550 cache,
551 overwrite_encoded_chunks,
552 inline_array,
553 drop_variables=drop_variables,
554 **decoders,
555 **kwargs,
556 )
557 return ds
File .../xarray/backends/api.py:357, in _dataset_from_backend_dataset(backend_ds, filename_or_obj, engine, chunks, cache, overwrite_encoded_chunks, inline_array, **extra_tokens)
355 ds = backend_ds
356 else:
--> 357 ds = _chunk_ds(
358 backend_ds,
359 filename_or_obj,
360 engine,
361 chunks,
362 overwrite_encoded_chunks,
363 inline_array,
364 **extra_tokens,
365 )
367 ds.set_close(backend_ds._close)
369 # Ensure source filename always stored in dataset object
File .../xarray/backends/api.py:325, in _chunk_ds(backend_ds, filename_or_obj, engine, chunks, overwrite_encoded_chunks, inline_array, **extra_tokens)
323 variables = {}
324 for name, var in backend_ds.variables.items():
--> 325 var_chunks = _get_chunk(var, chunks)
326 variables[name] = _maybe_chunk(
327 name,
328 var,
(...)
333 inline_array=inline_array,
334 )
335 return backend_ds._replace(variables)
File .../xarray/core/dataset.py:211, in _get_chunk(var, chunks)
209 # Determine the explicit requested chunks.
210 preferred_chunks = var.encoding.get("preferred_chunks", {})
--> 211 preferred_chunk_shape = tuple(
212 preferred_chunks.get(dim, size) for dim, size in zip(dims, shape)
213 )
214 if isinstance(chunks, Number) or (chunks == "auto"):
215 chunks = dict.fromkeys(dims, chunks)
File .../xarray/core/dataset.py:212, in <genexpr>(.0)
209 # Determine the explicit requested chunks.
210 preferred_chunks = var.encoding.get("preferred_chunks", {})
211 preferred_chunk_shape = tuple(
--> 212 preferred_chunks.get(dim, size) for dim, size in zip(dims, shape)
213 )
214 if isinstance(chunks, Number) or (chunks == "auto"):
215 chunks = dict.fromkeys(dims, chunks)
AttributeError: 'NoneType' object has no attribute 'get'
```

which means that `encoding["preferred_chunks"]` has to be a mapping from dimension name to chunk size. In this case `h5_arr.chunks` is `None` (the file was created without chunking), so the simplest fix is to pass an empty dict.
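As a side note (my sketch, not part of the original reply, and it reuses `h5_arr`, `Variable`, and `H5Array` from the example above): if the HDF5 dataset had been created with chunking, one plausible way to fill in `preferred_chunks` is to zip the dimension names with `h5_arr.chunks`, since xarray looks the value up per dimension name:

```python
# Sketch only: h5py's .chunks is a tuple such as (500, 500) for chunked
# datasets and None for contiguous ones, so fall back to an empty dict.
dims = ["y", "x"]
preferred_chunks = dict(zip(dims, h5_arr.chunks)) if h5_arr.chunks else {}

ds["Lt_VN01"] = Variable(
    dims,
    indexing.LazilyIndexedArray(H5Array(h5_arr)),
    encoding={"preferred_chunks": preferred_chunks},
)
```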
Working example

```
In [1]: import h5py
   ...: from xarray.backends import BackendEntrypoint, BackendArray
   ...: from xarray import Dataset, DataArray, Variable
   ...: from xarray.core import indexing
   ...: import xarray as xr
   ...: import numpy as np
   ...:
   ...: filename = "/tmp/testsgli.h5"
   ...: varname = "Lt_VN01"
   ...:
   ...:
   ...: def create_h5_data(filename, varname, shape):
   ...:     h5f = h5py.File(filename, mode="w")
   ...:     h5f[varname] = np.random.rand(*shape)
   ...:     h5f.close()
   ...:
   ...:
   ...: create_h5_data(filename, varname, (2000, 2000))
   ...:
   ...:
   ...: class H5Array(BackendArray):
   ...:     def __init__(self, array):
   ...:         self.shape = array.shape
   ...:         self.dtype = array.dtype
   ...:         self.array = array
   ...:
   ...:     def __getitem__(self, key):
   ...:         return indexing.explicit_indexing_adapter(
   ...:             key, self.shape, indexing.IndexingSupport.BASIC, self._getitem
   ...:         )
   ...:
   ...:     def _getitem(self, key):
   ...:         return self.array[key]
   ...:
   ...:
   ...: class SGLIBackend(BackendEntrypoint):
   ...:     def open_dataset(self, filename, *, drop_variables=None, **kwargs):
   ...:         ds = Dataset()
   ...:         h5f = h5py.File(filename)
   ...:         h5_arr = h5f["Lt_VN01"]
   ...:         ds["Lt_VN01"] = Variable(
   ...:             ["y", "x"],
   ...:             indexing.LazilyIndexedArray(H5Array(h5_arr)),
   ...:             encoding={"preferred_chunks": {}},
   ...:         )
   ...:         return ds
   ...:
   ...:
   ...: xr.open_dataset(filename, engine=SGLIBackend, chunks={})
Out[1]:
<xarray.Dataset>
Dimensions: (y: 2000, x: 2000)
Dimensions without coordinates: y, x
Data variables:
    Lt_VN01  (y, x) float64 dask.array<chunksize=(2000, 2000), meta=np.ndarray>
```

As mentioned in one of the PRs that introduced the custom backends, the datastore is an implementation detail and can be removed at any moment (hence the lack of documentation).
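To round off the chunking point above (again my addition, not from the reply, and it assumes the `SGLIBackend` from the working example): explicit chunk sizes can also be requested through the `chunks` argument of `open_dataset`; the backend itself never needs to know about dask.

```python
# Sketch only: chunk sizes are requested at open time and xarray wraps the
# lazily indexed backend array in a dask array with those chunks.
ds = xr.open_dataset(filename, engine=SGLIBackend, chunks={"y": 500, "x": 500})
print(ds["Lt_VN01"].data)  # dask.array<..., chunksize=(500, 500), meta=np.ndarray>
```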
-
Does https://tutorial.xarray.dev/advanced/backends/2.Backend_with_Lazy_Loading.html help? (suggestions and PRs to improve docs and tutorials are always welcome)