Questions with map_blocks and apply_ufunc #6370
-
Hi everyone, I'm testing ways to process a ragged array with xarray, and I have difficulties understanding the differences between `map_blocks()` and `apply_ufunc()`. I have a dataset that is loaded with chunks of different sizes; as an example:
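For concreteness, a ragged-chunked array like the one described could be built as follows; the variable names and chunk sizes here are made up for illustration:

```python
import dask.array as da
import numpy as np
import xarray as xr

# Hypothetical example data: a 1-D dask array with uneven chunk sizes
# (2, 3, and 5 elements per chunk), mimicking a ragged layout.
data = da.from_array(np.arange(10.0), chunks=((2, 3, 5),))
dt = xr.DataArray(data, dims="x", name="dt")
print(dt.chunks)  # ((2, 3, 5),)
```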
I can map a function to each block using `map_blocks()` as follows:
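A minimal sketch of that pattern, assuming a hypothetical 1-D ragged-chunked DataArray named `dt`: `xr.map_blocks` hands each block to the function as a DataArray and expects a DataArray (or Dataset) back.

```python
import dask.array as da
import numpy as np
import xarray as xr

dt = xr.DataArray(da.from_array(np.arange(10.0), chunks=((2, 3, 5),)), dims="x")

# The function receives each block as a DataArray and must return one;
# here the output block has the same shape as the input block.
doubled = xr.map_blocks(lambda block: block * 2, dt)
print(doubled.compute().values)
```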
This works fine. Two side questions:
Alright, so if the output of the function is the same length as the input, e.g.:
I can use `apply_ufunc()`. As I understand, …
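When the function is element-wise and preserves the shape, `dask="parallelized"` does work, because xarray can infer the output chunks from the input. A hedged sketch, with made-up data:

```python
import dask.array as da
import numpy as np
import xarray as xr

dt = xr.DataArray(da.from_array(np.arange(6.0), chunks=3), dims="x")

# No core dims needed: the function is applied block-by-block, and each
# output block has the same shape as its input block.
scaled = xr.apply_ufunc(
    lambda a: a * 2,
    dt,
    dask="parallelized",
    output_dtypes=[dt.dtype],
)
print(scaled.compute().values)
```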
Could anyone help me with this and maybe point me to some examples? Thanks!
Replies: 1 comment 2 replies
-
Here is a function that computes one mean per block of a 1-D dask array:

```python
import numpy as np

def per_block_mean(array):
    assert array.ndim == 1
    nchunks = len(array.chunks[0])
    # the function returns one value per chunk, so the output chunks are easy to construct
    output_chunks = ([1] * nchunks,)
    # the mapped function MUST return an array, not a scalar
    # https://github.com/dask/dask/issues/8822
    return array.map_blocks(lambda x: np.mean(x, keepdims=True), chunks=output_chunks)
```

Now that we have a function, we use `xr.apply_ufunc`:
```python
xr.apply_ufunc(
    per_block_mean,
    dt,
    input_core_dims=[["x"]],
    output_core_dims=[["x"]],
    exclude_dims=set("x"),
    dask="allowed",
)
```
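Putting the pieces together, here is a self-contained run of this approach on a hypothetical ragged array (chunk sizes 2, 3, and 5); with `dask="allowed"`, `apply_ufunc` passes the underlying dask array straight to `per_block_mean`:

```python
import dask.array as da
import numpy as np
import xarray as xr

def per_block_mean(array):
    assert array.ndim == 1
    nchunks = len(array.chunks[0])
    # one output value per input chunk
    output_chunks = ([1] * nchunks,)
    return array.map_blocks(lambda x: np.mean(x, keepdims=True), chunks=output_chunks)

dt = xr.DataArray(da.from_array(np.arange(10.0), chunks=((2, 3, 5),)), dims="x")

means = xr.apply_ufunc(
    per_block_mean,
    dt,
    input_core_dims=[["x"]],
    output_core_dims=[["x"]],
    exclude_dims=set("x"),  # the size of "x" changes, so it must be excluded
    dask="allowed",
)
print(means.compute().values)  # one mean per original chunk: 0.5, 3.0, 7.0
```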
`.data` pulls out the underlying dask array; `.data.map_blocks` calls `dask.array.map_blocks`. The function passed to `xarray.map_blocks` must return either a Dataset or a DataArray; the function passed to `dask.array.map_blocks` must return a numpy array.

Your example with `dask="parallelized"` cannot be right. It raises `ValueError: axes don't match array` on compute and should never have worked AFAICT (#6372).

For the solution, this operation is too complicated for `dask="parallelized"`, so we use `dask="allowed"`. This means your function must know how to handle dask arrays.

1. The function itself loo…
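The distinction between the two `map_blocks` APIs can be sketched in a few lines (the array contents are hypothetical): `dt.data.map_blocks` goes through dask and traffics in numpy arrays, while `xr.map_blocks` traffics in DataArrays.

```python
import dask.array as da
import numpy as np
import xarray as xr

dt = xr.DataArray(da.from_array(np.arange(6.0), chunks=3), dims="x")

# dask.array.map_blocks: the function receives and returns numpy arrays,
# and the result is a dask array
squared = dt.data.map_blocks(lambda block: block ** 2)
assert isinstance(squared, da.Array)

# xarray.map_blocks: the function receives and returns DataArrays,
# and the result is a DataArray
squared_xr = xr.map_blocks(lambda block: block ** 2, dt)
assert isinstance(squared_xr.compute(), xr.DataArray)
```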