Hello,
Just trying to figure out whether I am using xarray functionality in the most efficient way or hitting some limitation of xarray. I am getting a pretty slow runtime compared to our Matlab prototype code, which runs much faster (minutes in Matlab vs. hours in Python).

We have to apply the following filter, which bins data based on the `dt` variable values and then identifies invalid entries in the `data` variable. It takes about 2.1-4 seconds to process 100 spatial points (`ds.data[:, y, 100]`) with dask in parallel, which is rather slow. At that rate, running the filter on the whole dataset (with dimensions (40981, 834, 834)) would take ~3.5 hours for one data variable, and we need to run it for multiple variables. The Matlab code seems to do the same job much faster and processes the whole dataset in minutes. So I wonder if I am doing something inefficient with xarray here and there is a better, more efficient way to do it, or if I have just hit some inefficiency of xarray. Here is the code snippet that reproduces the problem:
Sequential processing actually seems to run slightly faster than parallel processing with 4 threads on 4 CPUs: I am getting 4.25 s for the sequential run vs. 4.45 s for the parallel run. What could be the reason?
I posted the question on the Pangeo discussion board, but thought I'd try asking at the source, i.e. xarray :) Any help is much appreciated!
Sequential processing:
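In outline, the sequential version does something like the following (a simplified sketch with synthetic stand-in data; the bin edges, the min-count threshold, and the invalidation rule are placeholders for our real values):

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the real dataset (which is (40981, 834, 834)):
# a (time, y, x) cube with a per-time "dt" variable used for binning.
rng = np.random.default_rng(0)
nt, ny, nx = 1000, 4, 4
ds = xr.Dataset(
    {
        "data": (("time", "y", "x"), rng.random((nt, ny, nx))),
        "dt": ("time", rng.uniform(0.0, 10.0, nt)),
    }
)
bins = np.arange(0.0, 10.5, 0.5)  # placeholder bin edges
min_count = 30                    # placeholder validity threshold

# Bin one spatial column of "data" by the dt values, then treat entries
# that fall into under-populated bins as invalid.
column = ds.data[:, 0, 0]
groups = column.groupby_bins(ds.dt, bins, right=False, include_lowest=True)
counts = groups.count()
invalid_bins = counts.where(counts < min_count, drop=True)
```

The real filter repeats this per-column step over all 834×834 spatial points, which is where the hours come from.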
Parallel processing:
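The parallel variant wraps the same per-column binning in `dask.delayed` and computes the columns on the threaded scheduler (again a simplified sketch; the 4 workers and the toy count-per-bin filter are placeholders):

```python
import dask
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
nt, ny = 1000, 8
ds = xr.Dataset(
    {
        "data": (("time", "y"), rng.random((nt, ny))),
        "dt": ("time", rng.uniform(0.0, 10.0, nt)),
    }
)
bins = np.arange(0.0, 10.5, 0.5)  # placeholder bin edges

def filter_column(column, dt, bins):
    # Stand-in per-column filter: bin by dt and count entries per bin.
    return column.groupby_bins(dt, bins, right=False, include_lowest=True).count()

# One delayed task per spatial column, computed on 4 threads.
tasks = [dask.delayed(filter_column)(ds.data[:, y], ds.dt, bins) for y in range(ny)]
results = dask.compute(*tasks, scheduler="threads", num_workers=4)
```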
Also, creating the groups with

`groups = data.groupby_bins(ds.dt, bins, right=False, include_lowest=True)`

seems to take ~5 times longer, which I am also curious about. Why does it take so much longer than:
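For a rough sense of scale, binning the same kind of values with plain NumPy (purely an illustrative stand-in, not necessarily the exact variant being compared above) can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)
dt = rng.uniform(0.0, 10.0, 40981)  # placeholder dt values
bins = np.arange(0.0, 10.5, 0.5)    # placeholder bin edges: 0.0, 0.5, ..., 10.0

# Assign every dt value to a left-closed bin [edge, next_edge) and count
# bin occupancy directly, without constructing per-group objects.
idx = np.digitize(dt, bins)                        # 1..len(bins)-1 for in-range values
counts = np.bincount(idx, minlength=len(bins) + 1)
```

Part of the gap presumably comes from `groupby_bins` also building pandas interval labels and group objects on top of the raw index assignment, but I'd like to understand it better.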
Here is my xarray.show_versions():