Suggestions for performance improvement (Best practices) when using Xarray with Dask #7632

ricardobarroslourenco · 2023-03-15T18:24:50Z


In a remote sensing workflow (please refer to the gist), I can currently load and process my data (originally ENVI raster files converted to netCDF with rioxarray), and the results are quite good. I initially processed the data in R, where I ran into memory limitations and issues with data structures; Dask and Xarray solved that part of the problem.

However, in this notebook I am facing performance issues, especially towards the end of the workflow. Saving the Dataset takes hours, yet CPU utilization is fairly low (averaging 30-40% over the time series) and the write bandwidth is almost unused, at around 2 Mbps in the Dask monitoring. I am running in an Apptainer container (derived from this Docker image) on an HPC node with 128 GB of RAM, and saving to a scratch partition on a Lustre filesystem (with Gigabit speeds).

Any suggestions on how I can start improving my workflow?

EDIT 1:

A screenshot of compute time per task. How can I find the functions that triggered these tasks? I was wondering whether there is room for improvement here (mainly in the four large ones, which do the stacking and the mapping on store).

EDIT 2: The source data can be found here.

EDIT 3: It seems that some issues are happening with the HDF5 library... =/

EDIT 4: Added a new version of the notebook, with better usage of apply_ufunc, but exporting to netCDF is still really slow...

dcherian · 2023-03-15T18:36:19Z

Maintainer

What is the stack piece? We'll need to look at your code and input chunk sizes.
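
For readers who want to report their chunk sizes, here is a minimal sketch of how they can be inspected. The dataset is a synthetic stand-in; the variable name `reflectance`, the dimension names, and the sizes are assumptions for illustration and are not taken from the notebook.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the real data (hypothetical names and sizes).
ds = xr.Dataset(
    {"reflectance": (("band", "y", "x"), np.zeros((4, 64, 64), dtype="float32"))}
).chunk({"band": 1, "y": 32, "x": 32})

# Mapping from dimension name to the tuple of chunk lengths along it.
chunk_sizes = dict(ds.chunksizes)

# The underlying Dask array exposes the number of chunks and the
# shape of the largest chunk, which gives a rough per-chunk memory size.
arr = ds["reflectance"].data
n_chunks = arr.npartitions
chunk_mb = np.prod(arr.chunksize) * arr.dtype.itemsize / 1e6

print(chunk_sizes)
print(f"{n_chunks} chunks, about {chunk_mb:.4f} MB each")
```

Very many tiny chunks or a few very large ones are both common sources of slow Dask graphs, so these numbers are usually the first thing worth sharing.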

We could have a longer discussion at the office hours on Friday.

5 replies

ricardobarroslourenco Mar 15, 2023
Author

Great! I will sign up then.

Meanwhile, the notebook is available in this gist. I noticed something odd: when writing to disk (via a Dataset.to_netcdf call), the file is created and grows to 4.5 GB within the first 10-15 minutes, then stops growing (1h05m have elapsed so far). The cluster memory usage is around 7 GB.

Do you think a dependency (e.g., the netCDF library) may be causing the issue? The progress panel on Dask shows that many tasks are queued, but no actual file writing is happening.

ricardobarroslourenco Mar 15, 2023
Author

I have just registered for office hours and included the source data in the main post here.

ricardobarroslourenco Mar 17, 2023
Author

I just added a revised version of the notebook, with the recent improvements we discussed.

ricardobarroslourenco Mar 17, 2023
Author

@dcherian just putting a note here on what we discussed today at office hours (it might be helpful to others):

- Force .compute() before calling .to_netcdf() (in my case, I think this is already in place)
- Use Zarr instead of netCDF
  - Provides better parallelization
  - The netCDF format will likely adopt the Zarr structure soon: https://docs.unidata.ucar.edu/nug/current/nczarr_head.html

I think I have not forgotten anything. I will try to apply these changes and see how it goes (if successful, I will post back here). Thanks again!

dcherian Mar 17, 2023
Maintainer

And using flox for a multi-variable groupby is probably a lot more efficient (no stacking/unstacking).

kmuehlbauer · 2023-03-17T14:34:57Z

Maintainer

@ricardobarroslourenco Regarding the HDF warnings, this might be related to #7549.

1 reply

ricardobarroslourenco Mar 17, 2023
Author

Thanks for the reference and guidance @kmuehlbauer
