Issue with working with very large data #7819
Unanswered
Atousa-Saberi-NASA asked this question in Q&A
Hi xr community,
I'm dealing with very large data files and I'd like to ask a few questions:
I have 431 files (one per day), each of size lat x lon = 2000 x 3499 (the dimensions of the salinity variable). I use xarray & dask to read the data and perform a very simple task: calculating the running mean and subtracting it from the data. Printing a single value of the output takes ~30 s:
%%time
ds_anomaly.isel(lat=0, lon=0, date=0).Salt.values
CPU times: user 23.1 s, sys: 6.59 s, total: 29.7 s
Wall time: 29.7 s
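For reference, the pipeline is roughly the following (a minimal sketch, not my exact code; the file pattern, the combine arguments, and the 30-day window are placeholder assumptions):

import xarray as xr

# One file per day, 431 files, each 2000 x 3499 (lat x lon).
# chunks={} keeps each file as a lazy dask chunk instead of loading it.
ds = xr.open_mfdataset("salinity_*.nc", combine="by_coords", chunks={})

# Running mean along the time dimension, then the anomaly; both stay lazy.
window = 30  # assumed window length, in days
climatology = ds.Salt.rolling(date=window, center=True, min_periods=1).mean()
ds_anomaly = (ds.Salt - climatology).to_dataset(name="Salt")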
But saving it to netCDF, or actually computing ds_anomaly.values for the full dataset, either takes forever or runs the supercomputer's compute nodes out of memory.
I was wondering:
How can I speed this up if I simply want to save all the anomaly values? I don't explicitly create a dask client. Do you suggest using dask.distributed? If so, please provide some guidance (the first sketch below the list shows what I have in mind).
Is there any xarray parameter related to NaN values that I need to pass when saving the data? If I plot the anomalies on the fly (or animate them through time), the values look fine, but when the save succeeds at all, the file contains NaN for every value. This surprises me, because when I calculate the mean and the anomaly with pandas I specify skipna=True and the plots look right, so something must go wrong when writing the output to netCDF (the second sketch below shows what I plan to check).
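On the first question, this is the kind of thing I mean by using dask.distributed (a hedged sketch continuing the one above; the worker count and memory limit are placeholders to adapt to one compute node):

from dask.distributed import Client

# Start a local cluster; the numbers here are placeholder assumptions.
client = Client(n_workers=8, threads_per_worker=2, memory_limit="16GB")
print(client.dashboard_link)  # dashboard shows memory use and task progress

# to_netcdf on a dask-backed dataset writes chunk by chunk rather than
# materializing the whole array, so it should stay within memory limits.
ds_anomaly.to_netcdf("salinity_anomaly.nc")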
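On the second question, my current guess (an assumption, not verified) is that encoding inherited from the input files, e.g. scale_factor/add_offset/_FillValue with a packed integer dtype, could turn out-of-range anomaly values into fill values on write. A sketch of the check:

# Inspect the encoding the variable carries over from the source files.
print(ds.Salt.encoding)  # look for scale_factor, add_offset, _FillValue, dtype

# Drop the inherited encoding and write with an explicit, simple one
# (the fill value below is an arbitrary placeholder).
ds_anomaly.Salt.encoding = {}
ds_anomaly.to_netcdf(
    "salinity_anomaly.nc",
    encoding={"Salt": {"dtype": "float32", "_FillValue": -9999.0}},
)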
Thank you,
Atousa
Replies: 1 comment, 1 reply

I would start by reading the last two sections here and experimenting with chunk sizes.
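As a concrete starting point for that experiment (a sketch with assumed numbers): with one file per day, the data begins as 431 tiny chunks along the time dimension, which makes a rolling mean along date expensive. Rechunking so each chunk is contiguous in time, and split in space instead, is usually the first thing to try:

# -1 means "one chunk spanning the whole dimension"; the spatial chunk
# sizes are assumptions to tune against available memory.
ds = ds.chunk({"date": -1, "lat": 250, "lon": 500})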