Skip to content

Strategy for chunking arrays #3433

@TomNicholas

Description

@TomNicholas

I often want to test some parallel dask code for input with a range of different chunking structures. This seems like a possible candidate for a general chunks strategy that could live in hypothesis.

We actually made a simple strategy to test various chunking structures in both xhistogram here and in rechunker here, but now I'm wondering if there is a general solution I could implement. It would also be really useful to use in xarray's tests.

Dask's rules for how they defines chunk shapes are:

We always specify a chunks argument to tell dask.array how to break up the underlying array into chunks. We can specify chunks in a variety of ways:

  1. A uniform dimension size like 1000, meaning chunks of size 1000 in each dimension

  2. A uniform chunk shape like (1000, 2000, 3000), meaning chunks of size 1000 in the first axis, 2000 in the second axis, and 3000 in the third

  3. Fully explicit sizes of all blocks along all dimensions, like ((1000, 1000, 500), (400, 400), (5, 5, 5, 5, 5))

  4. A dictionary specifying chunk size per dimension like {0: 1000, 1: 2000, 2: 3000}. This is just another way of writing the forms 2 and 3 above

  1. Xarray adds another way to specify chunks, which is the same as (4) but with string dimension names instead of integer axis names. We distinguish that representation by referring to it as .chunksizes instead of just .chunks.

I think the most generally useful form of chunk strategy should produce (3), which can then be wrapped up by the user to make (4) or (5) if desired.

My attempt so far looks like this

from typing import Tuple

import hypothesis.strategies as st


@st.composite
def chunks(
    draw: st.DrawFn,
    shape: Tuple[int, ...],
    axes: Tuple[int, ...] = None,
    min_chunk_length: int = 1,
    max_chunk_length: int = None,
) -> st.SearchStrategy[Tuple[Tuple[int, ...], ...]]:
    """
    Generate different chunking patterns for an N-D array of data.

    Returns chunking structure as a tuple of tuples, with each inner tuple containing
    the block lengths along one dimension of the array.

    You can limit chunking to specific axes using the `axes` kwarg.
    """

    if min_chunk_length < 1:
        raise ValueError()

    if axes is None:
        axes = tuple(range(len(shape)))

    chunks = []
    for axis, n in enumerate(shape):

        if max_chunk_length:
            _max_chunk_length = max(max_chunk_length, n)
        else:
            _max_chunk_length = n

        if axes is not None and axis in axes:
            block_lengths = draw(st.integers(min_value=min_chunk_length, max_value=_max_chunk_length))
        else:
            # don't chunk along this dimension
            block_lengths = n

        chunks.append(block_lengths)

    return tuple(chunks)

However this isn't actually quite what I want: it returns (2) rather than (3). To return (3) I need to work out how to make a strategy to create a list of integers whose value sums to a given number, e.g. chunks=((3, 3, 2)) for a 1D array of length 8. Any pointers on the best way to do that would be appreciated!

Happy to submit a PR if this is of interest!

Metadata

Metadata

Assignees

No one assigned

    Labels

    new-featureentirely novel capabilities or strategiesquestionnot sure it's a bug? questions welcome

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions