-
I am writing code that maps monthly climatologies from pressure to altitude. This needs to be done for every grid box individually, so I am using xarray.groupby. That works, but it is very slow. I also tried to speed this function up using flox:

```python
from flox.xarray import xarray_reduce
import xarray as xr


def swap_dims(ds: xr.Dataset) -> xr.Dataset:
    ds = ds.squeeze()
    ds = ds.swap_dims({"pressure": "altitude"})
    ds = ds.reset_coords("pressure")
    ds = ds.sortby("altitude")
    ds = interpolate_to_alt(ds)
    return ds


def convert_to_alt(ds: xr.Dataset) -> xr.Dataset:
    ds = ds.stack({"stacked_dim": ["latitude_bins", "longitude_bins", "time"]})
    ds = ds.chunk(chunks={"pressure": 1})
    ds = ds.groupby("stacked_dim").map(swap_dims)
    # ds = xarray_reduce(ds, by="stacked_dim", func=swap_dims)  # not working flox code
    ds = ds.unstack("stacked_dim")
    return ds
```

The stacked, chunked ds before the groupby or xarray_reduce looks like this:

```
<xarray.Dataset>
Dimensions: (pressure: 37, stacked_dim: 12458880)
Coordinates:
* pressure (pressure) int32 1 2 3 5 7 10 ... 900 925 950 975 1000
* stacked_dim (stacked_dim) object MultiIndex
* latitude_bins (stacked_dim) float32 90.0 90.0 90.0 ... -90.0 -90.0
* longitude_bins (stacked_dim) float32 0.0 0.0 0.0 ... 359.8 359.8 359.8
* time (stacked_dim) datetime64[ns] 2010-01-01 ... 2010-12-01
Data variables:
specific_humidity (pressure, stacked_dim) float32 dask.array<chunksize=(1, 12458880), meta=np.ndarray>
temperature (pressure, stacked_dim) float32 dask.array<chunksize=(1, 12458880), meta=np.ndarray>
    altitude (pressure, stacked_dim) float32 dask.array<chunksize=(1, 12458880), meta=np.ndarray>
```

When I use it like this, I get an error. Should the original xarray.groupby approach work if I just adjust something? Is this possible with flox directly, or do I have to look for a totally different multiprocessing option?

I also wrote a function using multiprocessing. It looks like this:

```python
import warnings
from multiprocessing import Pool, cpu_count
from time import perf_counter
from typing import Callable

import numpy as np
import xarray as xr
def interpolate_to_alt(
    ds: xr.Dataset, pres_var_name: str = "pressure"
) -> xr.Dataset:
    """
    Interpolates a given xarray Dataset to altitude.

    Parameters
    ----------
    ds : xr.Dataset
        The xarray Dataset to be interpolated.
    pres_var_name : str, optional
        The name of the pressure variable. Defaults to 'pressure'.

    Returns
    -------
    xr.Dataset
        The interpolated xarray Dataset.
    """
    log_vars = [
        pres_var_name,
    ]
    log_vars_ds = list(set(log_vars).intersection(list(ds.variables.keys())))
    attrs = {}
    for var_ in log_vars_ds:
        # Remember the attributes, which get lost during the log round trip
        attrs[var_] = ds[var_].attrs
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            ds[var_] = np.log(ds[var_])
    # Interpolate
    ds = ds.interp(altitude=np.arange(0, 80000, 100))
    # Antilog
    for var_ in log_vars_ds:
        ds[var_] = np.exp(ds[var_])
        ds[var_].attrs = attrs[var_]
    # Get rid of artifacts due to log/antilog
    ds[pres_var_name] = ds[pres_var_name].round(1)
    ds[pres_var_name].attrs = attrs[pres_var_name]
    return ds
def convert_to_alt(ds: xr.Dataset) -> xr.Dataset:
    """
    Converts a given xarray Dataset from pressure to altitude.

    Parameters
    ----------
    ds : xr.Dataset
        The xarray Dataset to be converted.

    Returns
    -------
    xr.Dataset
        The converted xarray Dataset.
    """
    # TODO(max): Add a .squeeze() here for better performance?
    ds = ds.swap_dims({"pressure": "altitude"}).dropna(
        dim="altitude",
        subset=["pressure"],
    )
    ds = ds.reset_coords("pressure")
    ds = ds.sortby("altitude")
    ds = interpolate_to_alt(ds)
    ds = ds.expand_dims(["latitude_bins", "longitude_bins", "time"])
    ds = ds.drop_vars("stacked_dim")
    return ds
def parallel_stack_func(
    ds: xr.Dataset,
    stacked_dims: tuple[str, str, str] = (
        "latitude_bins",
        "longitude_bins",
        "time",
    ),
    func: Callable = convert_to_alt,
    func_cpu_count: int = -1,
) -> xr.Dataset:
    """
    Executes a function in parallel on a given xr.Dataset object.

    Parameters
    ----------
    ds : xr.Dataset
        The input dataset to perform the function in parallel on.
    stacked_dims : tuple[str, str, str], optional
        The dimensions to stack the dataset along. Defaults to
        ("latitude_bins", "longitude_bins", "time").
    func : Callable, optional
        The function to be executed in parallel. Defaults to `convert_to_alt`.
    func_cpu_count : int, optional
        The number of CPU processes to use for parallel execution.
        Defaults to -1, which uses all available CPUs.

    Returns
    -------
    xr.Dataset
        The output dataset after performing the function in parallel.
    """
    if func_cpu_count == -1:
        func_cpu_count = cpu_count()
    ds = ds.stack({"stacked_dim": stacked_dims})
    ds = ds.load()
    ds_list = [ds.isel(stacked_dim=i).squeeze() for i in range(ds.stacked_dim.size)]
    with Pool(processes=func_cpu_count) as pool:
        res = pool.map(func, ds_list)
    out = xr.combine_by_coords(res)
    return out
    return out
```

It works quite fast (except for the merging), but I am not sure whether this is an elegant approach.
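One idea for the merge step might be something like this (an untested sketch: it assumes the per-column results keep their `stacked_dim` entry, i.e. that the `expand_dims`/`drop_vars` calls in `convert_to_alt` are skipped, so everything can be concatenated along one dimension and unstacked once instead of merged with `combine_by_coords`):

```python
# untested: concatenate along the stacked dimension, rebuild the
# MultiIndex from the level coordinates, then unstack once
out = xr.concat(res, dim="stacked_dim")
out = out.set_index(stacked_dim=["latitude_bins", "longitude_bins", "time"])
out = out.unstack("stacked_dim")
```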
-
Hi @mgorfer, do you see any difference when using groupby with a single index instead of a multi-index? Which version of Xarray are you using? Perhaps you are hitting the same issue as #7376, which has been fixed since v2023.06.0?
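For example, something like this could show whether the MultiIndex itself is the bottleneck (an untested sketch that swaps the MultiIndex for a flat integer index before grouping):

```python
import numpy as np

# drop the MultiIndex levels and group over a plain integer index instead
flat = ds.reset_index("stacked_dim")
flat = flat.assign_coords(stacked_dim=np.arange(flat.sizes["stacked_dim"]))
result = flat.groupby("stacked_dim").map(swap_dims)
```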
-
As mentioned in xarray-contrib/flox#260, this may be solvable using xr.apply_ufunc:

```python
import numpy as np
import xarray as xr

# load example dataset and subset (so it's faster)
air = xr.tutorial.open_dataset("air_temperature")
air = air.isel(time=slice(2))
air = air.air.astype(float)

# define the target longitude (use 330, so the result can be double-checked)
target = np.array([201, 206, 330])

# example to interpolate one lat/time combination
np.interp(target, air.lon, air.isel(time=0, lat=0).values)


# define a function to feed to xr.apply_ufunc
# (could directly use `np.interp`, but this accommodates
# the other manipulations, e.g. log)
def interp(values, coords, target):
    out = np.interp(target, coords, values)
    return out


# option 1: pass target as a numpy array (no "longitude" coordinates)
xr.apply_ufunc(
    interp,
    air,
    air.lon,
    kwargs={"target": target},
    vectorize=True,
    input_core_dims=[["lon"], ["lon"]],
    output_core_dims=[["longitude"]],
)

# option 2: pass target as a DataArray (has "longitude" coordinates)
target = xr.DataArray(target, dims="longitude", coords={"longitude": target})
xr.apply_ufunc(
    interp,
    air,
    air.lon,
    target,
    vectorize=True,
    input_core_dims=[["lon"], ["lon"], ["longitude"]],
    output_core_dims=[["longitude"]],
)
```
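Transferred to the dataset from the question, the same pattern could look roughly like this (a hedged sketch: the variable names follow the repr above, the log/antilog handling is omitted, and the data is assumed to be in memory — for dask inputs you would additionally pass `dask="parallelized"` and the output sizes):

```python
target_alt = np.arange(0, 80000, 100)


def interp_to_alt(values, alt, target):
    # np.interp expects ascending coordinates, so sort by altitude first
    order = np.argsort(alt)
    return np.interp(target, alt[order], values[order])


temperature_on_alt = xr.apply_ufunc(
    interp_to_alt,
    ds.temperature,
    ds.altitude,
    kwargs={"target": target_alt},
    vectorize=True,
    input_core_dims=[["pressure"], ["pressure"]],
    output_core_dims=[["alt"]],
)
```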
-
Use xgcm here: https://xgcm.readthedocs.io/en/latest/transform.html. It does the apply_ufunc thing along with numba kernels for fast interpolation.
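For reference, a rough sketch of what that could look like for the dataset in the question, following the linked tutorial (the axis setup and variable names are assumptions; check the call signature against the xgcm docs):

```python
import numpy as np
from xgcm import Grid

# treat the pressure levels as the vertical ("Z") axis
grid = Grid(ds, coords={"Z": {"center": "pressure"}}, periodic=False)

# interpolate temperature from pressure levels onto a regular altitude
# grid, using the altitude variable as the target data
target_alt = np.arange(0, 80000, 100)
temperature_on_alt = grid.transform(
    ds.temperature,
    "Z",
    target_alt,
    target_data=ds.altitude,
    method="linear",
)
```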