Issue with working with very large data #7819
Unanswered
Atousa-Saberi-NASA asked this question in Q&A
Hi xr community,
I'm dealing with very large data files and I'd like to ask a few questions:
I have 431 files (one per day), each of size lat x lon = 2000 x 3499 (the dimensions of the salinity variable). I use xarray & dask to read the data and perform a very simple task: calculating the running mean and subtracting it from the data. Printing a single value of the output takes ~30 s:
%%time
ds_anomaly.isel(lat=0, lon=0, date=0).Salt.values
CPU times: user 23.1 s, sys: 6.59 s, total: 29.7 s
Wall time: 29.7 s
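For reference, the pipeline is roughly the following (a minimal sketch, not my exact code; the file pattern, the combine arguments, and the 30-day window are placeholder assumptions):

import xarray as xr

# One file per day, 431 files, each 2000 x 3499 (lat x lon).
# chunks={} keeps each file as a lazy dask chunk instead of loading it.
ds = xr.open_mfdataset("salinity_*.nc", combine="by_coords", chunks={})

# Running mean along the time dimension, then the anomaly; both stay lazy.
window = 30  # assumed window length, in days
climatology = ds.Salt.rolling(date=window, center=True, min_periods=1).mean()
ds_anomaly = (ds.Salt - climatology).to_dataset(name="Salt")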
But saving it to netCDF, or actually computing ds_anomaly.values for the full dataset, either takes forever or runs the supercomputer's compute nodes out of memory.
I was wondering:
How can I speed this up if I simply want to save all the anomaly values? I don't explicitly create a dask client. Do you suggest using dask.distributed? If so, please provide some guidance (the first sketch below the list shows what I have in mind).
Is there any xarray parameter related to NaN values that I need to pass when saving the data? If I plot the anomalies on the fly (or animate them through time), the values look fine, but when the save succeeds at all, the file contains NaN for every value. This surprises me, because when I calculate the mean and the anomaly with pandas I specify skipna=True and the plots look right, so something must go wrong when writing the output to netCDF (the second sketch below shows what I plan to check).
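On the first question, this is the kind of thing I mean by using dask.distributed (a hedged sketch continuing the one above; the worker count and memory limit are placeholders to adapt to one compute node):

from dask.distributed import Client

# Start a local cluster; the numbers here are placeholder assumptions.
client = Client(n_workers=8, threads_per_worker=2, memory_limit="16GB")
print(client.dashboard_link)  # dashboard shows memory use and task progress

# to_netcdf on a dask-backed dataset writes chunk by chunk rather than
# materializing the whole array, so it should stay within memory limits.
ds_anomaly.to_netcdf("salinity_anomaly.nc")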
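On the second question, my current guess (an assumption, not verified) is that encoding inherited from the input files, e.g. scale_factor/add_offset/_FillValue with a packed integer dtype, could turn out-of-range anomaly values into fill values on write. A sketch of the check:

# Inspect the encoding the variable carries over from the source files.
print(ds.Salt.encoding)  # look for scale_factor, add_offset, _FillValue, dtype

# Drop the inherited encoding and write with an explicit, simple one
# (the fill value below is an arbitrary placeholder).
ds_anomaly.Salt.encoding = {}
ds_anomaly.to_netcdf(
    "salinity_anomaly.nc",
    encoding={"Salt": {"dtype": "float32", "_FillValue": -9999.0}},
)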
Thank you,
Atousa
Replies: 1 comment, 1 reply

I would start by reading the last two sections here and experimenting with chunk sizes.
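As a concrete starting point for that experiment (a sketch with assumed numbers): with one file per day, the data begins as 431 tiny chunks along the time dimension, which makes a rolling mean along date expensive. Rechunking so each chunk is contiguous in time, and split in space instead, is usually the first thing to try:

# -1 means "one chunk spanning the whole dimension"; the spatial chunk
# sizes are assumptions to tune against available memory.
ds = ds.chunk({"date": -1, "lat": 250, "lon": 500})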