Performance and memory issues in slice and compute operations #6295

hmkhatri · 2022-02-24T00:05:05Z

I am working with multi-ensemble output and use the concat operation to combine multiple files into one xarray dataset.

import xarray as xr

ds = []
ds_month = []
for ensemble in ensemble_paths:
    # read all time steps for this ensemble member
    d = xr.open_mfdataset(ensemble + "*.nc", chunks={'time': 1, 'lat': 45, 'lon': 60})
    ds.append(d)
    # read a single time step only
    d = xr.open_dataset(ensemble + "time_0.nc", chunks={'time': 1, 'lat': 45, 'lon': 60})
    ds_month.append(d)

ds = xr.concat(ds, dim="r")
ds_month = xr.concat(ds_month, dim="r")
ds = ds.chunk({'r':-1})
ds_month = ds_month.chunk({'r':-1})

The main issue

I want to compute the mean over ensemble members for a specific time snapshot. I compared the required calculation using two methods (for consistency, chunks were defined the same way in both).

Method 1: From the dataset above, slice the required time snapshot and compute the mean.

ds_mean = ds.isel(time=0).mean('r').compute()

Method 2: Read files for the required time snapshot only and compute the mean.

ds_mean = ds_month.mean('r').compute()

I was expecting both methods to have similar efficiencies. However, method 1 tends to be significantly slower with a lot of inter-worker communication and data transfer (see task stream below).

On the other hand, method 2 works very smoothly.

It seems that, in method 1, data for a lot of time snapshots is being loaded before slicing. This issue appears to be related to dask/dask#3595. Based on suggestions from @dcherian and others, map_blocks(numpy.copy) should work fine. However, I could not make it work (I am new to map_blocks) and I get the following error.

tmp = ds.isel(time=0).map_blocks(numpy.copy)
ds_mean = tmp.compute()

~/miniconda3/lib/python3.9/site-packages/xarray/core/dataarray.py in map_blocks(self, func, args, kwargs, template)
   3811         from .parallel import map_blocks
   3812 
-> 3813         return map_blocks(func, self, args, kwargs, template)
   3814 
   3815     def polyfit(

~/miniconda3/lib/python3.9/site-packages/xarray/core/parallel.py in map_blocks(func, obj, args, kwargs, template)
    368     if template is None:
    369         # infer template by providing zero-shaped arrays
--> 370         template = infer_template(func, aligned[0], *args, **kwargs)
    371         template_indexes = set(template.xindexes)
    372         preserved_indexes = template_indexes & set(input_indexes)

~/miniconda3/lib/python3.9/site-packages/xarray/core/parallel.py in infer_template(func, obj, *args, **kwargs)
    138 
    139     if not isinstance(template, (Dataset, DataArray)):
--> 140         raise TypeError(
    141             "Function must return an xarray DataArray or Dataset. Instead it returned "
    142             f"{type(template)}"

TypeError: Function must return an xarray DataArray or Dataset. Instead it returned <class 'numpy.ndarray'>

I don't know what is going wrong with the map_blocks command. Could someone help with this? Are there better ways to handle slicing of large datasets?

I am using
dask 2021.10.0
Xarray 0.19.0

dcherian · 2022-02-24T02:45:42Z

dcherian
Feb 24, 2022
Maintainer

The map_blocks trick should be unnecessary given dask/dask#8015 is in 2021.09.0. Your error is because you're using xarray.map_blocks instead of dask.array.map_blocks.
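A minimal sketch of that distinction, using a small synthetic array rather than the dataset from the question (the shape, chunk sizes, and variable names are illustrative assumptions). `np.copy` returns a NumPy array, so it can be passed to dask's `map_blocks` on the underlying `.data`, whereas xarray's `map_blocks` needs a function that returns a DataArray or Dataset:

```python
import numpy as np
import dask.array as da
import xarray as xr

# Synthetic stand-in for the ensemble data (shape and chunks are assumptions).
arr = xr.DataArray(
    da.ones((4, 6, 6), chunks=(1, 3, 3)),
    dims=["time", "lat", "lon"],
)
snapshot = arr.isel(time=0)

# Option A: dask-level map_blocks on the wrapped dask array, then re-wrap it.
copied_dask = snapshot.copy(data=snapshot.data.map_blocks(np.copy))

# Option B: xarray-level map_blocks; the function receives and returns a DataArray.
copied_xr = snapshot.map_blocks(lambda block: block.copy())

result_a = copied_dask.compute()
result_b = copied_xr.compute()
```

Both options should give the same values as `snapshot.compute()`; whether either one changes performance depends on the dask version, as noted above.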

I don't know what's happening. Is there any difference in the memory usage? It would really help if you could provide a reproducible example

0 replies

hmkhatri · 2022-02-24T11:36:50Z

hmkhatri
Feb 24, 2022
Author

@dcherian Thanks for the quick response.

Is there any difference in the memory usage?

I do not see a significant difference in memory usage in test examples on my personal laptop. On the other hand, I saw huge memory usage (with the isel approach) when running on an HPC cluster with dask-mpi. It was probably because I set the memory target, spill, and pause options to False in the distributed.yaml file to check actual memory usage (>32 GB for 300 GB of data). Eventually, my code crashed.

It would really help if you could provide a reproducible example

Here is a reproducible example:

import xarray as xr
import numpy as np
from dask.distributed import Client, performance_report

client = Client(n_workers=4)

# First create a test dataset and save in a netcdf file

(nt,ny,nx)=(720,1000,1000)
dummy=xr.DataArray(data=np.random.randn(nt,ny,nx),dims=['t','y','x'])

ds = dummy.to_dataset(name = 'data') # full data for all time steps
ds['data_snapshot'] = dummy.isel(t=0) # data for one time step
ds.to_netcdf('test.nc')

# Next we read 100 ensemble members and use concat.
# For simplicity, we read the same data 100 times, treating each read as a different ensemble member.

ds = []
for ensemble in range(0,100):
    
    d = xr.open_dataset("test.nc", chunks={'t':1, 'y':100, 'x':100})
    ds.append(d)
    
ds = xr.concat(ds, dim='r')
ds = ds.chunk({'r':-1})

with performance_report(filename="reproduce-isel.html"):
    tmp = ds['data'].isel(t=0).mean('r')
    %time data_mean = tmp.compute()
    
with performance_report(filename="reproduce-snap.html"):
    tmp = ds['data_snapshot'].mean('r')
    %time data_mean = tmp.compute()

On my Mac, the first approach takes twice as long and there is a lot of red in the task stream.

From 1st approach (isel and then mean)

CPU times: user 4.35 s, sys: 300 ms, total: 4.65 s
Wall time: 5.65 s

From 2nd approach (mean from data snapshot)

CPU times: user 2 s, sys: 78.3 ms, total: 2.07 s
Wall time: 2.29 s

Also, I observed that using copy() does help in reducing inter-worker communication, but the worker load still remains larger (more light yellow and green in the task stream) than in the second approach above:
tmp = ds['data'].isel(t=0).copy().mean('r')

I do not know what is going wrong. I thought that the isel operation would be the same as working with a snapshot and that the calculations would be identical. Could the xr.concat operation create an issue here?

Any ideas? @rabernat @jbusecke

5 replies

rabernat Feb 25, 2022
Maintainer

On my Mac, the first approach takes twice as long and there is a lot of red in the task stream.

The core problem is that the "first approach" is creating a very large number of tasks. The overhead of transferring this graph between workers is what is causing the red you see in your dashboard.

It can be very instructive to look at the reprs for your dask arrays. I made your problem smaller to just focus on the dask part:

(nt,ny,nx)=(72,100,100)
dummy=xr.DataArray(data=np.random.randn(nt,ny,nx),dims=['t','y','x'])

ds = dummy.to_dataset(name = 'data') # full data for all time steps
ds['data_snapshot'] = dummy.isel(t=0) # data for one time step
ds.to_netcdf('test.nc')

ds = []
for ensemble in range(0,10):
    
    d = xr.open_dataset("test.nc", chunks={'t':1, 'y':10, 'x':10})
    ds.append(d)
    
ds = xr.concat(ds, dim='r')

You can see that there are already > 86000 tasks in the data array. This is a lot! Your bigger array probably has 10x that. So you are bumping against the limits suggested in the dask array best practices. In contrast, data_snapshot has far fewer.
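A rough way to see where these numbers come from is to count chunks by hand: every chunk needs at least one task, and each layered operation (open, concat, rechunk) adds more on top. The sketch below is plain arithmetic on the shapes and chunk sizes used in this thread; `count_chunks` is a hypothetical helper name, not a dask function, and the result is only a lower bound on the task count.

```python
import math

def count_chunks(shape, chunks):
    """Number of chunks for one array with the given shape and chunk sizes."""
    return math.prod(math.ceil(n / c) for n, c in zip(shape, chunks))

# Reduced example: 10 members, each (72, 100, 100) with chunks (1, 10, 10).
small = 10 * count_chunks((72, 100, 100), (1, 10, 10))

# Original reproducer: 100 members, each (720, 1000, 1000) with chunks (1, 100, 100).
large = 100 * count_chunks((720, 1000, 1000), (1, 100, 100))

print(small)  # 72000
print(large)  # 7200000
```

By this count the full-size reproducer has about 100 times as many chunks as the reduced one, which puts its graph in the millions of tasks before any computation is chained on top.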

It's important to remember that chaining computations on top of this array will only ever increase the number of tasks. With that in mind, I would remove this:

ds = ds.chunk({'r':-1})

as it is not really helping with anything. It just creates another layer of tasks.

The same is happening with ds['data'].isel(t=0).mean('r'). This just chains more tasks on top of the existing graph. Ideally, dask optimizations would figure out how to optimize the graph and cull all of the unneeded tasks, such that the graph sent to the scheduler only contains the tasks you actually need to compute your desired result. I don't know why that isn't happening here.

Maybe pinging a dask expert would help?

In the meantime, try to do whatever you can to reduce the number of tasks. In general, writing code like

xr.open_mfdataset(ensemble + "*.nc", chunks={'time':1, 'lat':45, 'lon':60})

is kind of just wishful thinking unless the underlying netCDF file is chunked suitably on disk. It's unlikely that specifying arbitrary chunks will lead to an I/O pattern that can parallelize efficiently against the data on disk.

hmkhatri Feb 25, 2022
Author

@rabernat thanks for the detailed explanation.

For now, I will avoid providing chunks and stick with xr.open_mfdataset("*.nc"). Hopefully, it will give a chunking that is efficient with the netCDF files and reduce the number of tasks. Unfortunately, I will still have to use the isel(time=t) operation, because the nc files I am working with contain multiple time steps in a single file.

One thing I noticed in the test example is that performing the isel operation before concat significantly reduces the number of tasks. In this case, I will have to read the data multiple times and use the isel and concat operations. Anyway, I will try different approaches to find the most efficient method.
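The effect of the ordering can be checked without any files by comparing unoptimized graph sizes. This is only a sketch: the members are synthetic `da.zeros` arrays (an assumption standing in for the opened netCDF data, with a smaller shape than in the thread), and the exact counts depend on the dask version, so none are quoted here.

```python
import dask.array as da
import xarray as xr

def make_member():
    # Synthetic stand-in for one ensemble member (shape and chunks are assumptions).
    return xr.DataArray(
        da.zeros((12, 40, 40), chunks=(1, 10, 10)),
        dims=["t", "y", "x"],
    )

members = [make_member() for _ in range(5)]

# Concatenate everything first, then select one time step.
concat_first = xr.concat(members, dim="r").isel(t=0)

# Select one time step per member first, then concatenate.
isel_first = xr.concat([m.isel(t=0) for m in members], dim="r")

# Size of each unoptimized task graph, via the dask collection protocol.
n_concat_first = len(concat_first.data.__dask_graph__())
n_isel_first = len(isel_first.data.__dask_graph__())
print(n_concat_first, n_isel_first)
```

Selecting first means the concatenation layer only covers the chunks of the selected time step, rather than every chunk of every member.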

rabernat Feb 25, 2022
Maintainer

I tried and failed to reproduce this problem at the dask level.

Create a chunked array

import dask.array as da
a = da.random.random(4, chunks=1)
a.visualize(optimize_graph=True)

Select just the first chunk

a[0].visualize(optimize_graph=True)

Stack two such arrays

b = da.stack([da.random.random(4, chunks=1), da.random.random(4, chunks=1)])
b.visualize(optimize_graph=True)

Take the mean across the stacked dimension of the first chunk

b[:, 0].mean(axis=0).visualize(optimize_graph=True)

This shows that dask should in principle be able to optimize this operation. But is this type of optimization happening in @hmkhatri's example? If not, why not?
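The same check can be done without rendering the graph, by counting tasks before and after optimization. This is a sketch under the same toy setup as above; `dask.optimize` and the `__dask_graph__` collection method are existing dask APIs, but the exact counts vary between dask versions, so only the two totals are printed.

```python
import dask
import dask.array as da

b = da.stack([da.random.random(4, chunks=1), da.random.random(4, chunks=1)])
selected_mean = b[:, 0].mean(axis=0)

# dask.optimize returns a tuple of collections whose graphs have been optimized.
(optimized,) = dask.optimize(selected_mean)

n_before = len(selected_mean.__dask_graph__())
n_after = len(optimized.__dask_graph__())
print(n_before, n_after)
```

If the optimized count is smaller, tasks for the unselected chunks have been culled. Running the same comparison on the xarray object's `.data` is one way to test whether that happens in the reproducer.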

rabernat Feb 25, 2022
Maintainer

For now, I will avoid providing chunks and stick with xr.open_mfdataset("*.nc")

I think you can safely chunk the time dimension, as this is the slowest varying dimension of the data. But chunking lon and lat probably doesn't accomplish anything.

Do you happen to know if your original real data files use NetCDF chunking?

hmkhatri Feb 28, 2022
Author

Do you happen to know if your original real data files use NetCDF chunking?

@rabernat The original data seems to have chunking. I did the following

from netCDF4 import Dataset    
ds = Dataset('file.nc', 'r')
data = ds.variables['thetao']
print(data.shape)       # (12, 75, 1205, 1440)
print(data.chunking())  # [1, 13, 241, 288]

On the other hand, reading with Xarray does not seem to make any chunks.

ds = xr.open_mfdataset('file.nc')
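If dask chunks are specified explicitly for this file, one common approach is to make them whole multiples of the on-disk chunks, so that each task reads complete netCDF chunks. The sketch below is plain arithmetic on the shape and chunking reported above; the `multiples` tuple is an illustrative choice, not a recommendation for this dataset.

```python
import math

shape = (12, 75, 1205, 1440)     # thetao shape reported above
disk_chunks = (1, 13, 241, 288)  # netCDF chunking reported above

# How many on-disk chunks each dimension is split into.
chunks_per_dim = [math.ceil(n / c) for n, c in zip(shape, disk_chunks)]
n_disk_chunks = math.prod(chunks_per_dim)

# Hypothetical dask chunking: whole multiples of the on-disk chunks, here one
# time step, one block of 13 levels, and the full horizontal grid.
multiples = (1, 1, 5, 5)
dask_chunks = tuple(m * c for m, c in zip(multiples, disk_chunks))
aligned = all(d % c == 0 for d, c in zip(dask_chunks, disk_chunks))

print(chunks_per_dim)  # [12, 6, 5, 5]
print(n_disk_chunks)   # 1800
print(dask_chunks)     # (1, 13, 1205, 1440)
print(aligned)         # True
```

Larger multiples mean fewer, bigger tasks; the trade-off is the memory needed per chunk on each worker.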
