How does to_dask_dataframe chunk? #8451
Unanswered
stviaene-si asked this question in Q&A
Hi!
I was just wondering what happens to data chunks when calling `to_dask_dataframe` on an xarray Dataset. The documentation is not very verbose on that aspect, so I assumed that chunks would be more or less maintained as they are. However, when running some analysis, I noticed the following.

I've got a Dataset `x` chunked along 3 dimensions, with about 25 variables. When I transform it into a dask dataframe with `ddf = x.to_dask_dataframe()`, I get a warning from dask about producing large chunks. According to xarray's repr of my dataset, each chunk is about 32 MiB. My first guess was that all 25 of my variables were being collapsed into a single chunk, but when I check the partitions and divisions of the resulting dataframe, the whole thing has ended up in just two partitions.
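For concreteness, here is a minimal sketch of the kind of setup I mean. The variable names, array sizes, and chunk sizes are made up for illustration, not my actual data:

```python
import numpy as np
import xarray as xr

# A Dataset with ~25 variables, chunked along three dimensions.
# Shapes and chunk sizes here are illustrative only.
ds = xr.Dataset(
    {f"var{i}": (("x", "y", "z"), np.random.rand(40, 40, 40)) for i in range(25)}
).chunk({"x": 10, "y": 10, "z": 10})

ddf = ds.to_dask_dataframe()

# Far fewer partitions than the number of chunks per variable.
print(ddf.npartitions)
print(ddf.divisions)
```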
When I do as dask suggests in its warning and let it split large chunks, the partitioning comes out differently, but it still bears no resemblance to my original chunk layout.
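Continuing from the sketch above, what I tried looks roughly like this (the config key is the one dask's own PerformanceWarning mentions, if I'm reading it right):

```python
import dask

# Let dask split chunks that would otherwise grow too large during
# the conversion, as the warning suggests.
with dask.config.set(**{"array.slicing.split_large_chunks": True}):
    ddf_split = ds.to_dask_dataframe()

print(ddf_split.npartitions)
print(ddf_split.divisions)
```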
I'm just puzzled how my dask-backed Dataset with ~300 chunks per data variable ends up merged into 2 dask partitions. Am I missing something here? Is there a way to minimize the amount of data shuffling dask needs to do to convert my Dataset to a dask DataFrame? I suppose my ideal scenario would be for every original chunk to just get reshaped into dataframe format; I sketch one idea for that below.
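One idea I had, but haven't verified: since a dask dataframe partition is a contiguous range of rows, maybe chunking only along the outermost dimension before converting would let each chunk map onto a partition without shuffling. A sketch of what I mean, again continuing from the example above:

```python
# Untested idea: rechunk so only the outermost dimension is chunked
# (-1 means one chunk spanning the whole dimension), hoping each chunk
# then corresponds to a contiguous block of dataframe rows.
ds_rows = ds.chunk({"x": 10, "y": -1, "z": -1})
ddf_rows = ds_rows.to_dask_dataframe()
print(ddf_rows.npartitions)
```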
I am new to the dask/xarray stack, so please forgive my silly questions; I'm obviously happy to provide more details if that helps. Any help understanding what's going on here would be greatly appreciated!
Thanks!