Replies: 1 comment 1 reply
-
Our current groupby and resample implementations are inefficient with dask arrays; I suspect that's what's happening. It would be good to check the chunk sizes of … Since you're resampling from hourly to daily, I would try … If you're up for experimenting, you could try …
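The concrete suggestions in this reply were cut off in the thread. A minimal sketch of the kind of experiment it seems to point toward, assuming (as in the question below) an hourly wind-speed array built from ERA5-Land data; the file path and the `u10`/`v10` variable names are assumptions, not taken from the reply:

```python
import xarray as xr

# Assumed file path and variable names; day-aligned time chunks are one common
# way to make hourly -> daily resampling cheaper with dask.
ds = xr.open_dataset("ERA5_Land_wind.nc", chunks={"time": 24 * 30})
ERA5_WS = (ds["u10"] ** 2 + ds["v10"] ** 2) ** 0.5

# Check chunk sizes before and after the resample; very many tiny chunks
# (or a few enormous ones) are the usual cause of memory blow-ups.
print(ERA5_WS.chunks)

ERA5_WS_daily = ERA5_WS.resample(time="1D").mean()
print(ERA5_WS_daily.chunks)

# With time chunks covering whole days, each daily mean reads from a single
# chunk, so loading the result should stay close to the size of the result.
ERA5_WS_daily.load()

# Optional experiment: install the flox package ("pip install flox");
# recent xarray versions then use it automatically to speed up
# groupby/resample on dask arrays.
```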
-
I am working with large datasets and trying to do some basic computation on these large files. Here are the dimensions of the ERA5_Land_wind Dataset: [time (hours): 351384, latitude: 271, longitude: 611]. I can convert to wind speed (ERA5_WS) and even resample to daily means fairly quickly (10 to 20 seconds). However, when I try to load the ERA5_WS_daily DataArray (daily means, about 7 GB, [time (days): 14642, latitude: 271, longitude: 611]) into memory, it consumes all my memory (128 GB). Shouldn't it only consume the size of the resampled DataArray? Also, when I try different chunk sizes, resampling to daily means (ERA5_WS_daily_chunked) consumes all my memory. Something seems off here. Any help would be great.
Hopefully, this code snippet is helpful:
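(The original snippet was not captured in the thread. A hypothetical reconstruction of the workflow described above; the file path and the `u10`/`v10` variable names are assumptions, while the array names are the ones used in the question.)

```python
import xarray as xr

# Open lazily with dask; the chunk size here is an assumption, chosen only to
# mirror "trying different chunk sizes" from the question.
ERA5_Land_wind = xr.open_dataset("ERA5_Land_wind.nc", chunks={"time": 1000})

# Wind speed from the 10 m wind components (lazy, so this is essentially instant).
ERA5_WS = (ERA5_Land_wind["u10"] ** 2 + ERA5_Land_wind["v10"] ** 2) ** 0.5

# Resampling hourly -> daily is also lazy, which is why it "finishes" in seconds.
ERA5_WS_daily = ERA5_WS.resample(time="1D").mean()

# This is the step where memory blows up: computing the daily means ends up
# holding far more than the ~7 GB result in RAM.
ERA5_WS_daily = ERA5_WS_daily.load()

# Trying other chunk sizes before resampling shows the same behaviour.
ERA5_WS_daily_chunked = (
    ERA5_WS.chunk({"time": 5000}).resample(time="1D").mean().load()
)
```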