Poor performance in a custom backend #6574

lstngr · 2022-05-05T08:11:00Z

lstngr
May 5, 2022

Hello everyone! I'm looking to implement a xarray backend which would allow to read a series of HDF5 files (not written following netCFD conventions).
Each file stores a series of dataset containing a 3D field

% h5ls data_65.h5  # along with data_0.h5, data_1.h5, etc...
325                      Dataset {25, 400, 400}  # 3D field at a specific time
326                      Dataset {25, 400, 400}  # 3D field at a later time
327                      Dataset {25, 400, 400}
328                      Dataset {25, 400, 400}
329                      Dataset {25, 400, 400}
t                        Dataset {5}
x                        Dataset {400}
y                        Dataset {400}
z                        Dataset {25}

I managed to get a backend which allows lazy-loading the data into xarray (I can provide the code if it's relevant, let me know),

import xarray as xr
ds = xr.open_mfdataset('/BIG14_TB/tmp/data_*.h5', engine='mytest', parallel=True)
print(ds)

<xarray.Dataset>
Dimensions:  (t: 750, z: 25, x: 400, y: 400)
Coordinates:
  * t        (t) int64 0 1 2 3 4 5 6 7 8 ... 741 742 743 744 745 746 747 748 749
  * z        (z) int64 0 1 2 3 4 5 6 7 8 9 10 ... 15 16 17 18 19 20 21 22 23 24
  * x        (x) int64 0 1 2 3 4 5 6 7 8 ... 391 392 393 394 395 396 397 398 399
  * y        (y) int64 0 1 2 3 4 5 6 7 8 ... 391 392 393 394 395 396 397 398 399
Data variables:
    myvar    (t, z, x, y) float64 dask.array<chunksize=(5, 25, 400, 400), meta=np.ndarray>

In this case, the full array is about 22GB and could still fit into memory (I've been experimenting with 32GB RAM). I'm looking to open multi-variable datasets which are in the range of 1TB.
I'm puzzled by two issues so far on which I would gladly like your opinion.

First off, let's consider the following computation:

pixel_in_time = ds['myvar'].isel(t=slice(-50, None)).mean(('x', 'y','z')).persist()

I can't help but notice that most of the time reported by the Dask profile is spent in the wait routine. After checking "by hand" with timers, I'm fairly sure this comes from workers waiting for a lock to access the data. This time would be spend in the _raw_indexing_method from the documentation.
Is there any way I can reduce this waiting time?

Secondly, when the dataset I'm manipulating approaches the size of the RAM, computations will freeze. This only happens when I'm using Dask's distributed scheduler. The task stream appears to create a lot of threads at the start, and at some point, Dask starts to complain about having too much unmanaged memory and starts to spill to disk (leading to the freeze). Such a behaviour could be triggered by the following code (I halted the client before it started spilling, but the task stream would continue to grow diagonally like shown below),

import numpy as np
time_average = np.sqrt(ds['myvar']).mean(('x', 'y')).persist()

I tried rechunking the data in various ways to no avail.
I would expect the task stream to keep as much threads as cores on my CPU, this is the way Dask-only computations behave. Am I missing something?

Thank you very much in advance!

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.3.18-150300.59.63-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: None

xarray: 0.20.1
pandas: 1.4.2
numpy: 1.21.5
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: 3.6.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.4
dask: 2022.02.1
distributed: 2022.2.1
matplotlib: 3.5.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.02.0
cupy: None
pint: None
sparse: None
setuptools: 61.2.0
pip: 21.2.4
conda: None
pytest: None
IPython: 8.3.0
sphinx: None

Answered by lstngr

May 18, 2022

I've investigated a bit more, and traced back the issue to calls to dask.array.stack that I made within xarray.backends.BackendArray. It seems Dask Distributed does not like it,

These calls do not appear in the task graphs after the dataset and dataarrays have been loaded,
New threads in the task stream usually appear to be related to stack operations

I will close (I'd prefer delete) this issue as it fails to describe the problem correctly, and eventually come back with a MCVE in the future.

View full answer

lstngr · 2022-05-18T15:37:51Z

lstngr
May 18, 2022
Author

I've investigated a bit more, and traced back the issue to calls to dask.array.stack that I made within xarray.backends.BackendArray. It seems Dask Distributed does not like it,

These calls do not appear in the task graphs after the dataset and dataarrays have been loaded,
New threads in the task stream usually appear to be related to stack operations

I will close (I'd prefer delete) this issue as it fails to describe the problem correctly, and eventually come back with a MCVE in the future.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Poor performance in a custom backend #6574

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Uh oh!

Poor performance in a custom backend #6574

Uh oh!

lstngr May 5, 2022

Replies: 1 comment

Uh oh!

lstngr May 18, 2022 Author

lstngr
May 5, 2022

lstngr
May 18, 2022
Author