Is it really recommended to save intermediate Dask results as netCDF? #8372
I'm trying to understand the Optimization Tips in the documentation, where it is suggested to save intermediate results to disk as a netCDF file and then read them back in for further computation. I am wondering if this accomplishes anything that simply calling compute() would not. Also, it seems to contradict the advice recently provided in #7632 (reply in thread). I am wondering if this optimization tip (added in #7454) may be outdated or suboptimal, or if I'm misunderstanding something?
I can't speak for Xarray, but here is how I understand the Dask situation: every time you add a Dask operation to an existing sequence of Dask operations, Dask modifies its computation graph (i.e. its 'plan') for delivering the final result you're after. Whatever graphing strategies it uses, they don't always produce an optimal graph, and will sometimes do strange things like attempt to load all the data into memory multiple times.

Saving in the middle means Dask doesn't have to rationalise so many operations into one graph: your workflow is split into two graphs, one leading up to the save and one after re-loading, so it can avoid such problems. If you've ever used a query language like SQL, this is analogous to the cases where it's more efficient to cache intermediate results in a temporary table rather than do everything in one query.
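As a minimal sketch of the save/re-load pattern described above (the file names, chunk sizes, and the example computation here are placeholders, not anything from the thread):

```python
import xarray as xr

# Open lazily with Dask-backed chunks (file name and chunk size are placeholders).
ds = xr.open_dataset("input.nc", chunks={"time": 100})

# Some preprocessing that builds up a large task graph.
intermediate = ds - ds.mean("time")

# Writing to netCDF forces the first graph to execute; the data is streamed
# chunk by chunk, so only roughly one chunk needs to be in memory at a time.
intermediate.to_netcdf("intermediate.nc")

# Re-opening gives a dataset backed by a fresh, much smaller graph, so later
# operations no longer drag the preprocessing steps along with them.
intermediate = xr.open_dataset("intermediate.nc", chunks={"time": 100})
result = intermediate.std("time").compute()
```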
Intermediate compute() will achieve the same result, but depends on having enough memory available. The advantage of saving/loading is when your array is larger than memory: it can be streamed chunk-by-chunk into a file, then streamed back out again, using only one chunk's worth of memory at a time.
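For comparison, a minimal sketch of the in-memory alternative mentioned above (same placeholder names as the earlier sketch); this is simpler, but the whole intermediate result has to fit in memory:

```python
import xarray as xr

ds = xr.open_dataset("input.nc", chunks={"time": 100})  # placeholder file name

# compute() materialises the intermediate result as in-memory (NumPy-backed)
# data, so the entire intermediate must fit in RAM.
intermediate = (ds - ds.mean("time")).compute()

# Later steps then run eagerly on the in-memory result, with no Dask graph involved.
result = intermediate.std("time")
```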
I hadn't even considered that memory wouldn't be a concern 😂 It can be quite difficult to remember that some people only use Dask for parallelisation, and others only use it for larger-than-memory operations. Not everyone needs both. Makes it hard to think from each other's perspectives!