How does to_dask_dataframe chunk? #8451
Unanswered
stviaene-si asked this question in Q&A
Hi!
I was just wondering what happens to data chunks when calling `to_dask_dataframe` on an xarray Dataset. The documentation is not very verbose on that aspect, so I assumed that chunks would be more or less maintained as they are. However, when running some analysis, I noticed the following.

I've got a Dataset `x` chunked along 3 dimensions, with about 25 variables. When I transform it into a dask dataframe with `ddf = x.to_dask_dataframe()`, I get a warning from dask about producing large chunks. According to xarray's repr of my dataset, each chunk is about 32 MiB. My first guess was that all 25 of my variables were being collapsed into a single chunk, but when I check the partitions and divisions of the resulting dataframe, the whole thing has ended up in just two partitions.
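For concreteness, here is a minimal sketch of the kind of setup I mean. The variable names, array sizes, and chunk sizes are made up for illustration, not my actual data:

```python
import numpy as np
import xarray as xr

# A Dataset with ~25 variables, chunked along three dimensions.
# Shapes and chunk sizes here are illustrative only.
ds = xr.Dataset(
    {f"var{i}": (("x", "y", "z"), np.random.rand(40, 40, 40)) for i in range(25)}
).chunk({"x": 10, "y": 10, "z": 10})

ddf = ds.to_dask_dataframe()

# Far fewer partitions than the number of chunks per variable.
print(ddf.npartitions)
print(ddf.divisions)
```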
When I do as dask suggests in its warning and let it split large chunks, the partitioning comes out differently, but it still bears no resemblance to my original chunk layout.
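Continuing from the sketch above, what I tried looks roughly like this (the config key is the one dask's own PerformanceWarning mentions, if I'm reading it right):

```python
import dask

# Let dask split chunks that would otherwise grow too large during
# the conversion, as the warning suggests.
with dask.config.set(**{"array.slicing.split_large_chunks": True}):
    ddf_split = ds.to_dask_dataframe()

print(ddf_split.npartitions)
print(ddf_split.divisions)
```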
I'm just puzzled how my dask-backed Dataset with ~300 chunks per data variable ends up merged into 2 dask partitions. Am I missing something here? Is there a way to minimize the amount of data shuffling dask needs to do to convert my Dataset to a dask DataFrame? I suppose my ideal scenario would be for every original chunk to just get reshaped into dataframe format; I sketch one idea for that below.
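One idea I had, but haven't verified: since a dask dataframe partition is a contiguous range of rows, maybe chunking only along the outermost dimension before converting would let each chunk map onto a partition without shuffling. A sketch of what I mean, again continuing from the example above:

```python
# Untested idea: rechunk so only the outermost dimension is chunked
# (-1 means one chunk spanning the whole dimension), hoping each chunk
# then corresponds to a contiguous block of dataframe rows.
ds_rows = ds.chunk({"x": 10, "y": -1, "z": -1})
ddf_rows = ds_rows.to_dask_dataframe()
print(ddf_rows.npartitions)
```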
I am new to the dask/xarray stack, so please forgive my silly questions; I'm obviously happy to provide more details if that helps. Any help understanding what's going on here would be greatly appreciated!
Thanks!