Is it really recommended to save intermediate Dask results as netCDF? #8372
I'm trying to understand the Optimization Tips in the documentation, where it is suggested to save intermediate results to disk as a netCDF file and then read them back in for further computation. I am wondering if this accomplishes anything that simply calling compute() would not. Also, it seems to contradict the advice recently provided in #7632 (reply in thread). I am wondering if this optimization tip (added in #7454) may be outdated or suboptimal, or if I'm misunderstanding something?
I can't speak for Xarray, but here is how I understand the Dask situation: every time you add a Dask operation to an existing sequence of Dask operations, Dask modifies its computation graph (i.e. its 'plan') for delivering the final result you're after. Whatever graphing strategies it uses, they don't always produce an optimal graph, and will sometimes do strange things like attempt to load all the data into memory multiple times.

Saving in the middle means Dask doesn't have to rationalise so many operations into one graph: your workflow is split into two graphs, one leading up to the save and one after re-loading, so it can avoid such problems. If you've ever used a query language like SQL, this is analogous to the cases where it's more efficient to cache intermediate results in a temporary table rather than do everything in one query.
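As a minimal sketch of the save/re-load pattern described above (the file names, chunk sizes, and the example computation here are placeholders, not anything from the thread):

```python
import xarray as xr

# Open lazily with Dask-backed chunks (file name and chunk size are placeholders).
ds = xr.open_dataset("input.nc", chunks={"time": 100})

# Some preprocessing that builds up a large task graph.
intermediate = ds - ds.mean("time")

# Writing to netCDF forces the first graph to execute; the data is streamed
# chunk by chunk, so only roughly one chunk needs to be in memory at a time.
intermediate.to_netcdf("intermediate.nc")

# Re-opening gives a dataset backed by a fresh, much smaller graph, so later
# operations no longer drag the preprocessing steps along with them.
intermediate = xr.open_dataset("intermediate.nc", chunks={"time": 100})
result = intermediate.std("time").compute()
```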
Intermediate compute() will achieve the same result, but depends on having enough memory available. The advantage of saving/loading is when your array is larger than memory: it can be streamed chunk-by-chunk into a file, then streamed back out again, using only one chunk's worth of memory at a time.
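For comparison, a minimal sketch of the in-memory alternative mentioned above (same placeholder names as the earlier sketch); this is simpler, but the whole intermediate result has to fit in memory:

```python
import xarray as xr

ds = xr.open_dataset("input.nc", chunks={"time": 100})  # placeholder file name

# compute() materialises the intermediate result as in-memory (NumPy-backed)
# data, so the entire intermediate must fit in RAM.
intermediate = (ds - ds.mean("time")).compute()

# Later steps then run eagerly on the in-memory result, with no Dask graph involved.
result = intermediate.std("time")
```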
I hadn't even considered that memory wouldn't be a concern 😂 It can be quite difficult to remember that some people only use Dask for parallelisation, and others only use it for larger-than-memory operations. Not everyone needs both. Makes it hard to think from each other's perspectives!