Applying GroupBy.map along chunks instead of dimensions #8076
Unanswered
sadsimulation asked this question in Q&A
I'm processing larger-than-memory datasets using dask-backed xarrays. However, I often need to perform indexing using `dataset1.where(array2, drop=True)`. The intermediate computations to get `dataset1` and `array2` are fairly computationally intensive and make use of the xarray indexing and dimension-name features, so I feel like `xr.apply_ufunc` with `dask='allowed'` wouldn't be a good fit here. Unfortunately the masking causes the sizes of the output `xr.Dataset` dimensions to change based on the data (peak detection), so `xr.map_blocks` is difficult to apply because I don't know the output template shape.

A workaround that I have used to get things working at all is `dataset.groupby('mydim').map(myfunc)` on a `dataset` that is not backed by dask arrays. This is not great because the groups along `mydim` vary in size and don't fit a simple uniform chunking along that dimension, and it predictably utilizes only a single core as the groups are processed sequentially.

Is there an easy way to do something like `dataset.chunk(mydim=100).groupby('mydim').map(myfunc)` that would utilize my machine's CPUs better?
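One pattern that can sidestep the template problem entirely (a sketch, not something confirmed in this thread) is to build one `dask.delayed` task per group and concatenate the variable-sized results afterwards. `myfunc`, `mydim`, and the toy dataset below are stand-ins for the question's setup:

```python
import dask
import numpy as np
import xarray as xr

# Toy stand-ins for the question's `dataset` and `myfunc`: ten groups of 100
# samples along `mydim`, and a filter whose output length depends on the data
# (the property that rules out a fixed map_blocks template).
dataset = xr.Dataset(
    {"signal": ("mydim", np.random.default_rng(0).random(1000))},
    coords={"mydim": ("mydim", np.repeat(np.arange(10), 100))},
)

def myfunc(group):
    # Data-dependent output size, like the question's peak detection.
    # With a dask-backed dataset, load each group here before the heavy work.
    return group.where(group["signal"] > 0.5, drop=True)

# One delayed task per group, so groups run in parallel instead of one by one.
tasks = [dask.delayed(myfunc)(grp) for _, grp in dataset.groupby("mydim")]
results = dask.compute(*tasks, scheduler="processes")

# Stitch the variable-sized per-group outputs back together.
out = xr.concat(results, dim="mydim")
```

`scheduler="processes"` assumes `myfunc` is GIL-bound Python/NumPy work; the default threaded scheduler is cheaper if it releases the GIL.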
Replies: 1 comment

-
Can you create a small example to show how your groups are patterned? It seems like each group is sequential?
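If the groups really are sequential runs along `mydim`, as the reply asks, one possible follow-on (an assumption, not an answer from the thread) is to align the dask chunks with the group boundaries so that each chunk contains whole groups. This reuses the toy `dataset` from the sketch above:

```python
import numpy as np

# Assumes the labels along `mydim` are sorted runs, so np.unique's counts
# match the run lengths in order.
labels = dataset["mydim"].values
_, counts = np.unique(labels, return_counts=True)

# One dask chunk per group: any per-chunk work now sees whole groups,
# and the delayed pattern above can be applied chunk by chunk.
chunked = dataset.chunk({"mydim": tuple(int(c) for c in counts)})
print(chunked.chunks)  # e.g. Frozen({'mydim': (100, 100, ..., 100)})
```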