# Grouper Objects
**Author**: Deepak Cherian <deepak@cherian.net>
**Created**: Nov 21, 2023

## Abstract

I propose the addition of Grouper objects to Xarray's public API so that
```python
Dataset.groupby(x=BinGrouper(bins=np.arange(2, 10)))
```
is identical to today's syntax:
```python
Dataset.groupby_bins("x", bins=np.arange(2, 10))
```

## Motivation and scope

Xarray's GroupBy API implements the split-apply-combine pattern (Wickham, 2011)[^1], which applies to a very large number of problems: histogramming, compositing, climatological averaging, resampling to a different time frequency, etc.
The pattern abstracts the following pseudocode:
```python
results = []
for element in unique_labels:
    subset = ds.sel(x=(ds.x == element))  # split
    # subset = ds.where(ds.x == element, drop=True)  # alternative
    result = subset.mean()  # apply
    results.append(result)

xr.concat(results)  # combine
```

to
```python
ds.groupby('x').mean()  # splits, applies, and combines
```
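
This equivalence can be checked end-to-end with a tiny illustrative dataset (the dataset and label values here are made up for the example; the split uses the `where(..., drop=True)` variant from the comment above):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"a": ("x", [10.0, 20.0, 30.0, 40.0])},
    coords={"x": [1, 1, 2, 2]},
)

# split-apply-combine by hand
manual = xr.concat(
    [ds.where(ds.x == label, drop=True).mean() for label in [1, 2]],
    dim="x",
)

# the one-liner
grouped = ds.groupby("x").mean()
# manual["a"] and grouped["a"] both contain [15.0, 35.0]
```
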

Efficient vectorized implementations of this pattern are implemented in numpy's [`ufunc.at`](https://numpy.org/doc/stable/reference/generated/numpy.ufunc.at.html), [`ufunc.reduceat`](https://numpy.org/doc/stable/reference/generated/numpy.ufunc.reduceat.html), [`numbagg.grouped`](https://github.com/numbagg/numbagg/blob/main/numbagg/grouped.py), [`numpy_groupies`](https://github.com/ml31415/numpy-groupies), and probably more.
These vectorized implementations *all* require, as input, an array of integer codes or labels that identify unique elements in the array being grouped over (`'x'` in the example above).
```python
import numpy as np

# array to reduce
a = np.array([1, 1, 1, 1, 2])

# initial value for result
out = np.zeros((3,), dtype=int)

# integer codes
labels = np.array([0, 0, 1, 2, 1])

# groupby-reduction
np.add.at(out, labels, a)
out  # array([2, 3, 1])
```

One can 'factorize' or construct such an array of integer codes using `pandas.factorize` or `numpy.unique(..., return_inverse=True)` for categorical arrays; `pandas.cut`, `pandas.qcut`, or `np.digitize` for discretizing continuous variables.
In practice, since `GroupBy` objects exist, much of the complexity in applying the groupby paradigm stems from appropriately factorizing or generating labels for the operation.
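
As a small illustration (not part of the proposal), here is how two of those tools produce the integer codes:

```python
import numpy as np
import pandas as pd

# categorical: pandas.factorize returns integer codes plus the unique labels
codes, uniques = pd.factorize(np.array(["a", "a", "b", "c", "b"]))
# codes → array([0, 0, 1, 2, 1]); uniques → array(['a', 'b', 'c'], dtype=object)

# continuous: pandas.cut with labels=False returns the bin index per element
bin_codes = pd.cut([0.1, 0.4, 0.6, 0.9], bins=[0, 0.5, 1.0], labels=False)
# bin_codes → array([0, 0, 1, 1])
```
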
Consider these examples:
1. [Bins that vary in a dimension](https://flox.readthedocs.io/en/latest/user-stories/nD-bins.html)
2. [Overlapping groups](https://flox.readthedocs.io/en/latest/user-stories/overlaps.html)
3. [Rolling resampling](https://github.com/pydata/xarray/discussions/8361)

Anecdotally, less experienced users commonly resort to the for-loopy implementation illustrated by the pseudocode above when the analysis at hand is not easily expressed using the API presented by Xarray's GroupBy object.
Xarray's GroupBy API today abstracts away the split, apply, and combine stages but not the "factorize" stage.
Grouper objects will close the gap.

## Usage and impact

Grouper objects
1. Will abstract useful factorization algorithms, and
2. Present a natural way to extend GroupBy to grouping by multiple variables: `ds.groupby(x=BinGrouper(...), t=Resampler(freq="M", ...)).mean()`.

In addition, Grouper objects provide a nice interface to add often-requested grouping functionality:
1. A new `SpaceResampler` would allow resampling over spatial dimensions ([issue](https://github.com/pydata/xarray/issues/4008)).
2. A `RollingTimeResampler` would allow rolling-like functionality that understands timestamps ([issue](https://github.com/pydata/xarray/issues/3216)).
3. A `QuantileBinGrouper` would abstract away `pd.qcut` ([issue](https://github.com/pydata/xarray/discussions/7110)).
4. A `SeasonGrouper` and `SeasonResampler` would abstract away common annoyances with such calculations today:
   1. Support seasons that span a year-end.
   2. Only include seasons with complete data coverage.
   3. Allow grouping over seasons of unequal length.
   4. See [this xcdat discussion](https://github.com/xCDAT/xcdat/issues/416) for `SeasonGrouper`-like functionality.
   5. Return results with seasons in a sensible order.
5. Weighted grouping ([issue](https://github.com/pydata/xarray/issues/3937)):
   1. Once `IntervalIndex`-like objects are supported, `Resampler` groupers can account for interval lengths when resampling.

## Backward Compatibility

Xarray's existing grouping functionality will be exposed using three new Groupers:
1. `UniqueGrouper` which uses `pandas.factorize`
2. `BinGrouper` which uses `pandas.cut`
3. `TimeResampler` which mimics pandas' `.resample`

Grouping by single variables will be unaffected, so that `ds.groupby('x')` will be identical to `ds.groupby(x=UniqueGrouper())`.
Similarly, `ds.groupby_bins('x', bins=np.arange(2, 10))` will be unchanged and identical to `ds.groupby(x=BinGrouper(bins=np.arange(2, 10)))`.

## Detailed description

All Grouper objects will subclass an abstract `Grouper` class:
```python
import abc

from xarray import DataArray


class Grouper(abc.ABC):
    @abc.abstractmethod
    def factorize(self, by: DataArray):
        raise NotImplementedError


class CustomGrouper(Grouper):
    def factorize(self, by: DataArray):
        ...
        return codes, group_indices, unique_coord, full_index

    def weights(self, by: DataArray) -> DataArray:
        ...
        return weights
```

### The `factorize` method

Today, the `factorize` method takes as input the group variable and returns 4 variables (I propose to clean this up below):
1. `codes`: An array of the same shape as `group` with int dtype. NaNs in `group` are coded by `-1` and ignored later.
2. `group_indices`: A list of the index locations of the `group` elements that belong to each group.
3. `unique_coord`: (Usually) a `pandas.Index` object of all unique `group` members present in `group`.
4. `full_index`: A `pandas.Index` of all `group` members. This is different from `unique_coord` for binning and resampling, where not all groups in the output may be represented in the input `group`. For grouping by a categorical variable, e.g. `['a', 'b', 'a', 'c']`, `full_index` and `unique_coord` are identical.

There is some redundancy here since `unique_coord` is always equal to or a subset of `full_index`.
We can clean this up (see Implementation below).
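
To make that contract concrete, here is a hypothetical sketch of what a categorical (`UniqueGrouper`-style) `factorize` might return; `factorize_unique` is an illustrative helper for this document, not the actual implementation:

```python
import numpy as np
import pandas as pd


def factorize_unique(by):
    # integer codes; pandas codes NaNs as -1 by default
    codes, uniques = pd.factorize(by)
    full_index = pd.Index(uniques)
    # for categorical grouping, unique_coord and full_index are identical
    unique_coord = full_index
    # index locations of the elements belonging to each group
    group_indices = [np.flatnonzero(codes == i) for i in range(len(full_index))]
    return codes, group_indices, unique_coord, full_index


codes, group_indices, unique_coord, full_index = factorize_unique(
    np.array(["a", "b", "a", "c"])
)
# codes → array([0, 1, 0, 2]); group_indices[0] → array([0, 2])
```
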

### The `weights` method (?)

The proposed `weights` method is optional and unimplemented today.
Groupers with `weights` will allow composing `weighted` and `groupby` ([issue](https://github.com/pydata/xarray/issues/3937)).
The `weights` method should return an appropriate array of weights such that the following property is satisfied:
```python
gb_sum = ds.groupby(by).sum()

weights = CustomGrouper.weights(by)
weighted_sum = xr.dot(ds, weights)

assert_identical(gb_sum, weighted_sum)
```
For example, the boolean weights for `group=np.array(['a', 'b', 'c', 'a', 'a'])` should be
```
[[1, 0, 0, 1, 1],
 [0, 1, 0, 0, 0],
 [0, 0, 1, 0, 0]]
```
This is the boolean "summarization matrix" referred to in the classic Iverson (1980, Section 4.3)[^2] and the "nub sieve" in [various APLs](https://aplwiki.com/wiki/Nub_Sieve).
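
Such a summarization matrix can be built directly from the integer codes; a minimal numpy sketch (the data values here are illustrative):

```python
import numpy as np

codes = np.array([0, 1, 2, 0, 0])  # factorized ['a', 'b', 'c', 'a', 'a']
n_groups = 3

# weights[g, i] == 1 iff element i belongs to group g
weights = (codes[None, :] == np.arange(n_groups)[:, None]).astype(int)

# a groupby-sum is then just a matrix-vector product with these weights
a = np.array([10, 20, 30, 40, 50])
group_sums = weights @ a  # array([100, 20, 30])
```
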

> [!NOTE]
> We can always construct `weights` automatically using `group_indices` from `factorize`, so this is not a required method.

For a rolling resampling, windowed weights are possible:
```
[[0.5, 1, 0.5, 0, 0],
 [0, 0.25, 1, 1, 0],
 [0, 0, 0, 1, 1]]
```

### The `preferred_chunks` method (?)

Rechunking support is another optional extension point.
In `flox` I experimented with automatically rechunking to make a groupby more parallel-friendly ([example 1](https://flox.readthedocs.io/en/latest/generated/flox.rechunk_for_blockwise.html), [example 2](https://flox.readthedocs.io/en/latest/generated/flox.rechunk_for_cohorts.html)).
A great example is resampling-style groupby reductions, for which `codes` might look like
```
0001|11122|3333
```
where `|` represents chunk boundaries. A simple rechunking to
```
000|111122|3333
```
would make this resampling reduction an embarrassingly parallel blockwise problem.

Similarly, consider monthly-mean climatologies for which the month numbers might be
```
1 2 3 4 5 | 6 7 8 9 10 | 11 12 1 2 3 | 4 5 6 7 8 | 9 10 11 12 |
```
A slight rechunking to
```
1 2 3 4 | 5 6 7 8 | 9 10 11 12 | 1 2 3 4 | 5 6 7 8 | 9 10 11 12 |
```
allows us to reduce `1, 2, 3, 4` separately from `5, 6, 7, 8` and `9, 10, 11, 12` while still being parallel-friendly (see the [flox documentation](https://flox.readthedocs.io/en/latest/implementation.html#method-cohorts) for more).

We could attempt to detect these patterns, or we could just have the Grouper take as input `chunks` and return a tuple of "nice" chunk sizes to rechunk to:
```python
def preferred_chunks(self, chunks: ChunksTuple) -> ChunksTuple:
    ...
```
For monthly means, since the period of repetition of labels is 12, the Grouper might choose possible chunk sizes of `((2,), (3,), (4,), (6,))`.
For resampling, the Grouper could choose to resample to a multiple or an even fraction of the resampling frequency.
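
As a sketch of the detection approach, here is a hypothetical helper (`blockwise_chunks` is not a proposed API) that snaps each chunk boundary to the nearest group boundary at or before it, reproducing the resampling example above; it assumes sorted integer codes:

```python
import numpy as np


def blockwise_chunks(codes, chunks):
    # positions where a new group starts (assumes codes are sorted)
    starts = np.flatnonzero(np.diff(codes)) + 1
    boundaries = np.cumsum(chunks)[:-1]
    # for each existing boundary, pick the closest group start at or before it
    idx = np.searchsorted(starts, boundaries, side="right") - 1
    new_bounds = np.unique(starts[idx[idx >= 0]])
    edges = np.concatenate([[0], new_bounds, [len(codes)]])
    return tuple(np.diff(edges))


# the 0001|11122|3333 example from above, with chunks (4, 5, 4)
codes = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3])
blockwise_chunks(codes, (4, 5, 4))  # → (3, 6, 4), i.e. 000|111122|3333
```

Already-aligned chunks are left alone, so the rechunk is a no-op when the reduction is blockwise to begin with.
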

## Related work

Pandas has [Grouper objects](https://pandas.pydata.org/docs/reference/api/pandas.Grouper.html#pandas-grouper) that represent the GroupBy instruction.
However, these objects do not appear to be extension points, unlike the Grouper objects proposed here.
Instead, Pandas' `ExtensionArray` has a [`factorize`](https://pandas.pydata.org/docs/reference/api/pandas.api.extensions.ExtensionArray.factorize.html) method.

Composing rolling with time resampling is a common workload:
1. Polars has [`group_by_dynamic`](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.group_by_dynamic.html), which appears to be like the proposed `RollingTimeResampler`.
2. scikit-downscale provides [`PaddedDOYGrouper`](https://github.com/pangeo-data/scikit-downscale/blob/e16944a32b44f774980fa953ea18e29a628c71b8/skdownscale/pointwise_models/groupers.py#L19).

## Implementation Proposal

1. Get rid of `squeeze` ([issue](https://github.com/pydata/xarray/issues/2157), [PR](https://github.com/pydata/xarray/pull/8506)).
2. Merge the existing two-class implementation into a single Grouper class:
   1. This design was implemented in [this PR](https://github.com/pydata/xarray/pull/7206) to account for some annoying data dependencies.
   2. See [this PR](https://github.com/pydata/xarray/pull/8509).
3. Clean up what's returned by `factorize` methods:
   1. A solution here might be to have `group_indices: Mapping[int, Sequence[int]]` be a mapping from group index in `full_index` to a sequence of integers.
   2. Return a `namedtuple` or `dataclass` from existing Grouper factorize methods to facilitate API changes in the future.
4. Figure out what to pass to `factorize`:
   1. Xarray eagerly reshapes nD variables to 1D. This is an implementation detail we need not expose.
   2. When grouping by an unindexed variable, Xarray passes a `_DummyGroup` object. This seems like something we don't want in the public interface. We could special-case "internal" Groupers to preserve the optimizations in `UniqueGrouper`.
5. Grouper objects will be exposed under the `xr.groupers` namespace. At first these will include `UniqueGrouper`, `BinGrouper`, and `TimeResampler`.

## Alternatives

One major design choice made here was to adopt the syntax `ds.groupby(x=BinGrouper(...))` instead of `ds.groupby(BinGrouper('x', ...))`.
This allows reuse of Grouper objects, for example:
```python
grouper = BinGrouper(...)
ds.groupby(x=grouper, y=grouper)
```
but requires that all variables being grouped by (`x` and `y` above) are present in the Dataset `ds`. This does not seem like a bad requirement.
Importantly, `Grouper` instances will be copied internally so that they can safely cache state that might be shared between `factorize` and `weights`.

Today, it is possible to `ds.groupby(DataArray, ...)`. This syntax will still be supported.

## Discussion

This proposal builds on these discussions:
1. https://github.com/xarray-contrib/flox/issues/191#issuecomment-1328898836
2. https://github.com/pydata/xarray/issues/6610

## Copyright

This document has been placed in the public domain.

## References and footnotes

[^1]: Wickham, H. (2011). The split-apply-combine strategy for data analysis. https://vita.had.co.nz/papers/plyr.html
[^2]: Iverson, K.E. (1980). Notation as a tool of thought. Commun. ACM 23, 8 (Aug. 1980), 444–465. https://doi.org/10.1145/358896.358899

0 commit comments

Comments
 (0)