Strategy for chunking arrays

I often want to test some parallel dask code for input with a range of different chunking structures. This seems like a possible candidate for a general `chunks` strategy that could live in hypothesis.

We actually made a simple strategy to test various chunking structures in both [xhistogram here](https://github.com/xgcm/xhistogram/blob/0d3bf5ad9c118dccb211d30fc889963087c28dd4/xhistogram/test/test_chunking_hypotheses.py#L13) and in [rechunker here](https://github.com/pangeo-data/rechunker/blob/9d129321499a7b0be422fd159dff35e28db4cfac/tests/test_algorithm.py#L204), but now I'm wondering if there is a general solution I could implement. It would also be really useful to use in xarray's tests.

Dask's [rules for how they defines chunk shapes](https://docs.dask.org/en/stable/array-chunks.html#specifying-chunk-shapes) are:

> We always specify a chunks argument to tell dask.array how to break up the underlying array into chunks. We can specify chunks in a variety of ways:
>
> 1. A uniform dimension size like `1000`, meaning chunks of size 1000 in each dimension
> 
> 2. A uniform chunk shape like `(1000, 2000, 3000)`, meaning chunks of size 1000 in the first axis, 2000 in the second axis, and 3000 in the third
>
> 3. Fully explicit sizes of all blocks along all dimensions, like `((1000, 1000, 500), (400, 400), (5, 5, 5, 5, 5))`
> 
> 4. A dictionary specifying chunk size per dimension like `{0: 1000, 1: 2000, 2: 3000}`. This is just another way of writing the forms 2 and 3 above

5. Xarray adds [another way to specify chunks](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.chunksizes.html), which is the same as (4) but with string dimension names instead of integer axis names. We distinguish that representation by referring to it as `.chunksizes` instead of just `.chunks`.

I think the most generally useful form of chunk strategy should produce (3), which can then be wrapped up by the user to make (4) or (5) if desired.

My attempt so far looks like this

```python
from typing import Tuple

import hypothesis.strategies as st


@st.composite
def chunks(
    draw: st.DrawFn,
    shape: Tuple[int, ...],
    axes: Tuple[int, ...] = None,
    min_chunk_length: int = 1,
    max_chunk_length: int = None,
) -> st.SearchStrategy[Tuple[Tuple[int, ...], ...]]:
    """
    Generate different chunking patterns for an N-D array of data.

    Returns chunking structure as a tuple of tuples, with each inner tuple containing
    the block lengths along one dimension of the array.

    You can limit chunking to specific axes using the `axes` kwarg.
    """

    if min_chunk_length < 1:
        raise ValueError()

    if axes is None:
        axes = tuple(range(len(shape)))

    chunks = []
    for axis, n in enumerate(shape):

        if max_chunk_length:
            _max_chunk_length = max(max_chunk_length, n)
        else:
            _max_chunk_length = n

        if axes is not None and axis in axes:
            block_lengths = draw(st.integers(min_value=min_chunk_length, max_value=_max_chunk_length))
        else:
            # don't chunk along this dimension
            block_lengths = n

        chunks.append(block_lengths)

    return tuple(chunks)
```

However this isn't actually quite what I want: it returns (2) rather than (3). To return (3) I need to work out how to make a strategy to create a list of integers whose value sums to a given number, e.g. `chunks=((3, 3, 2))` for a 1D array of length 8. Any pointers on the best way to do that would be appreciated!

Happy to submit a PR if this is of interest!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Strategy for chunking arrays #3433

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Strategy for chunking arrays #3433

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions