Suggestions for performance improvement (Best practices) when using Xarray with Dask #7632
ricardobarroslourenco
started this conversation in
Office Hours
Replies: 2 comments 6 replies
What is the We could have a longer discussion at the office hours on Friday.
@ricardobarroslourenco Regarding the HDF warnings, this might be related to #7549.
In a remote sensing workflow (please refer to the gist), I can currently load and process my data (originally ENVI raster files converted to netCDF with `rioxarray`), and the results are quite good. (I initially processed the data in R, where I ran into memory limitations and issues with data structures; Dask + Xarray solved that part of the problem.)

However, in this notebook I am facing performance issues, especially toward the end of the workflow. Saving the Dataset takes hours, yet CPU utilization is low (averaging 30-40% over the run) and the write bandwidth is almost unused, at around 2 Mbps in the Dask monitoring. I am running in an Apptainer container (derived from this Docker image) on an HPC node with 128 GB of RAM, saving to a `scratch` partition on a Lustre filesystem (with Gigabit speeds).

Any suggestions on how I can start improving my workflow?
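One thing that may be worth checking first is chunk size. Low CPU use combined with low write bandwidth is sometimes a sign of very many small chunks, where scheduling and per-write overhead dominate; whether that applies to this notebook is an assumption, not something confirmed here. The sketch below uses only the standard library to estimate how many chunks a given chunking produces and how large each one is. The array shape, the chunk shapes, and the `chunk_report` helper are all hypothetical and are not taken from the gist.

```python
from math import ceil, prod


def chunk_report(shape, chunks, itemsize=4):
    """Return (number of chunks, bytes in one full-size chunk) for an array.

    shape    -- full array shape, e.g. (time, y, x)
    chunks   -- chunk length along each dimension
    itemsize -- bytes per element (4 for float32, 8 for float64)
    """
    if len(shape) != len(chunks):
        raise ValueError("shape and chunks must have the same number of dimensions")
    # Each dimension is split into ceil(size / chunk_length) pieces.
    n_chunks = prod(ceil(s / c) for s, c in zip(shape, chunks))
    # A chunk can never be longer than the dimension it splits.
    chunk_bytes = prod(min(s, c) for s, c in zip(shape, chunks)) * itemsize
    return n_chunks, chunk_bytes


# Hypothetical float32 cube: 365 time steps on a 4000 x 4000 grid.
shape = (365, 4000, 4000)

for chunks in [(1, 100, 100), (365, 500, 500)]:
    n, nbytes = chunk_report(shape, chunks)
    print(f"chunks={chunks}: {n} chunks of {nbytes / 2**20:.2f} MiB each")
```

With these made-up numbers, the first chunking yields hundreds of thousands of tiny chunks, while the second yields a few dozen large ones. Dask's general guidance favors chunks on the order of 100 MB, so a result far below that may be a reason to rechunk before the write step.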
EDIT 1:

A screenshot of compute time per task. How can I find the functions that triggered these tasks? I was wondering whether there is room to improve things here (basically in the four large ones, which do stacking and mapping on store).
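On tracing tasks back to functions: Dask task keys usually start with the name of the operation that created them, followed by a hash-like token, so grouping the measured times by that prefix gives a rough per-operation total. The sketch below is a simplified, standard-library stand-in for that idea (Dask has its own helper, `dask.utils.key_split`, for extracting the prefix). The `task_prefix` and `total_time_by_prefix` helpers, the sample keys, and the timings are all hypothetical and are not taken from the screenshot.

```python
import re
from collections import defaultdict

# Trailing token that looks like a hash (8 or more hex characters).
_TOKEN = re.compile(r"-[0-9a-f]{8,}$")


def task_prefix(key):
    """Reduce a task key to its operation name.

    Keys are either strings ("name-token") or tuples whose first
    element is such a string, followed by chunk indices.
    """
    name = key[0] if isinstance(key, tuple) else key
    return _TOKEN.sub("", str(name))


def total_time_by_prefix(records):
    """Sum durations per operation name, slowest operation first."""
    totals = defaultdict(float)
    for key, seconds in records:
        totals[task_prefix(key)] += seconds
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical (key, seconds) pairs, as might be copied from a profile.
records = [
    (("stack-0123456789abcdef", 0, 0), 120.0),
    (("stack-0123456789abcdef", 0, 1), 95.5),
    (("store-map-fedcba9876543210", 0, 0), 300.0),
    ("open_dataset-89abcdef01234567", 2.5),
]

for name, seconds in total_time_by_prefix(records).items():
    print(f"{name}: {seconds:.1f} s")
```

The prefix only names the operation, not the notebook line, so matching it to a specific call (for example a stacking step or the final write) still has to be done by reading the code.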
EDIT 2: The source data can be found here.
EDIT 3: It seems that some issues are happening with the HDF5 library... =/
EDIT 4: Added a new version of the notebook, with better usage of `apply_ufunc`, but exporting to netCDF is still really slow...