# Grouper Objects
**Author**: Deepak Cherian <deepak@cherian.net>
**Created**: Nov 21, 2023

## Abstract

I propose the addition of Grouper objects to Xarray's public API so that
```python
Dataset.groupby(x=BinGrouper(bins=np.arange(0, 10, 2)))
```
is identical to today's syntax:
```python
Dataset.groupby_bins("x", bins=np.arange(0, 10, 2))
```

## Motivation and scope

Xarray's GroupBy API implements the split-apply-combine pattern (Wickham, 2011)[^1], which applies to a very large number of problems: histogramming, compositing, climatological averaging, resampling to a different time frequency, etc.
The pattern abstracts the following pseudocode:
```python
results = []
for element in unique_labels:
    subset = ds.sel(x=(ds.x == element))  # split
    # subset = ds.where(ds.x == element, drop=True)  # alternative
    result = subset.mean()  # apply
    results.append(result)

xr.concat(results, dim="x")  # combine
```

to
```python
ds.groupby('x').mean()  # splits, applies, and combines
```

Efficient vectorized implementations of this pattern are implemented in numpy's [`ufunc.at`](https://numpy.org/doc/stable/reference/generated/numpy.ufunc.at.html), [`ufunc.reduceat`](https://numpy.org/doc/stable/reference/generated/numpy.ufunc.reduceat.html), [`numbagg.grouped`](https://github.com/numbagg/numbagg/blob/main/numbagg/grouped.py), [`numpy_groupies`](https://github.com/ml31415/numpy-groupies), and probably more.
These vectorized implementations *all* require, as input, an array of integer codes or labels that identify unique elements in the array being grouped over (`'x'` in the example above).
```python
import numpy as np

# array to reduce
a = np.array([1, 1, 1, 1, 2])

# initial value for result
out = np.zeros((3,), dtype=int)

# integer codes
labels = np.array([0, 0, 1, 2, 1])

# groupby-reduction
np.add.at(out, labels, a)
out  # array([2, 3, 1])
```
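As a cross-check on the snippet above, the same grouped sum can be written with `np.bincount`, which likewise consumes integer codes:

```python
import numpy as np

a = np.array([1, 1, 1, 1, 2])
labels = np.array([0, 0, 1, 2, 1])

# bincount sums `a` within each group identified by `labels`
out = np.bincount(labels, weights=a, minlength=3).astype(int)
out  # array([2, 3, 1])
```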

One can 'factorize' or construct such an array of integer codes using `pandas.factorize` or `numpy.unique(..., return_inverse=True)` for categorical arrays, and `pandas.cut`, `pandas.qcut`, or `np.digitize` for discretizing continuous variables.
In practice, since `GroupBy` objects exist, much of the complexity in applying the groupby paradigm stems from appropriately factorizing or generating labels for the operation.
Consider these examples:
1. [Bins that vary in a dimension](https://flox.readthedocs.io/en/latest/user-stories/nD-bins.html)
2. [Overlapping groups](https://flox.readthedocs.io/en/latest/user-stories/overlaps.html)
3. [Rolling resampling](https://github.com/pydata/xarray/discussions/8361)
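For concreteness, here is what two of the factorization tools mentioned above return; the example values are illustrative, not taken from the proposal:

```python
import numpy as np
import pandas as pd

# categorical: integer codes plus the unique labels
codes, uniques = pd.factorize(np.array(["a", "b", "a", "c"]))
codes  # array([0, 1, 0, 2])

# continuous: discretize into bins; `.codes` holds the integer labels
binned = pd.cut([0.1, 0.5, 2.5, 9.0], bins=np.arange(0, 11, 2))
binned.codes  # array([0, 0, 1, 4], dtype=int8)
```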

Anecdotally, less experienced users commonly resort to the for-loopy implementation illustrated by the pseudocode above when the analysis at hand is not easily expressed using the API presented by Xarray's GroupBy object.
Xarray's GroupBy API today abstracts away the split, apply, and combine stages but not the "factorize" stage.
Grouper objects will close the gap.

## Usage and impact

Grouper objects
1. Will abstract useful factorization algorithms, and
2. Present a natural way to extend GroupBy to grouping by multiple variables: `ds.groupby(x=BinGrouper(...), t=Resampler(freq="M", ...)).mean()`.

In addition, Grouper objects provide a nice interface to add often-requested grouping functionality:
1. A new `SpaceResampler` would allow resampling over spatial dimensions ([issue](https://github.com/pydata/xarray/issues/4008)).
2. A `RollingTimeResampler` would allow rolling-like functionality that understands timestamps ([issue](https://github.com/pydata/xarray/issues/3216)).
3. A `QuantileBinGrouper` to abstract away `pd.qcut` ([issue](https://github.com/pydata/xarray/discussions/7110)).
4. A `SeasonGrouper` and `SeasonResampler` would abstract away common annoyances with such calculations today:
    1. Support seasons that span a year-end.
    2. Only include seasons with complete data coverage.
    3. Allow grouping over seasons of unequal length.
    4. See [this xcdat discussion](https://github.com/xCDAT/xcdat/issues/416) for `SeasonGrouper`-like functionality.
    5. Return results with seasons in a sensible order.
5. Weighted grouping ([issue](https://github.com/pydata/xarray/issues/3937)):
    1. Once `IntervalIndex`-like objects are supported, `Resampler` groupers can account for interval lengths when resampling.

## Backward Compatibility

Xarray's existing grouping functionality will be exposed using three new Groupers:
1. `UniqueGrouper`, which uses `pandas.factorize`
2. `BinGrouper`, which uses `pandas.cut`
3. `TimeResampler`, which mimics pandas' `.resample`

Grouping by single variables will be unaffected, so that `ds.groupby('x')` will be identical to `ds.groupby(x=UniqueGrouper())`.
Similarly, `ds.groupby_bins('x', bins=np.arange(0, 10, 2))` will be unchanged and identical to `ds.groupby(x=BinGrouper(bins=np.arange(0, 10, 2)))`.

## Detailed description

All Grouper objects will subclass an abstract `Grouper` class:
```python
import abc

from xarray import DataArray

class Grouper(abc.ABC):
    @abc.abstractmethod
    def factorize(self, by: DataArray):
        raise NotImplementedError

class CustomGrouper(Grouper):
    def factorize(self, by: DataArray):
        ...
        return codes, group_indices, unique_coord, full_index

    def weights(self, by: DataArray) -> DataArray:
        ...
        return weights
```

### The `factorize` method

Today, the `factorize` method takes as input the group variable and returns four variables (I propose to clean this up below):
1. `codes`: an array of the same shape as `group` with int dtype. NaNs in `group` are coded by `-1` and ignored later.
2. `group_indices`: a list of index locations of `group` elements that belong to a single group.
3. `unique_coord`: (usually) a `pandas.Index` object of all unique `group` members present in `group`.
4. `full_index`: a `pandas.Index` of all `group` members. This is different from `unique_coord` for binning and resampling, where not all groups in the output may be represented in the input `group`. For grouping by a categorical variable, e.g. `['a', 'b', 'a', 'c']`, `full_index` and `unique_coord` are identical.

There is some redundancy here since `unique_coord` is always equal to or a subset of `full_index`.
We can clean this up (see Implementation below).
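A minimal sketch of these four return values for a categorical group variable, built directly with pandas and numpy rather than the actual Xarray internals:

```python
import numpy as np
import pandas as pd

group = np.array(["a", "b", "a", "c"])

codes, uniques = pd.factorize(group)  # codes: array([0, 1, 0, 2])
unique_coord = pd.Index(uniques)      # unique members present in `group`
full_index = unique_coord             # identical here; differs for binning/resampling

# index locations of the elements belonging to each group
group_indices = [np.flatnonzero(codes == i).tolist() for i in range(len(full_index))]
group_indices  # [[0, 2], [1], [3]]
```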

### The `weights` method (?)

The proposed `weights` method is optional and unimplemented today.
Groupers with `weights` will allow composing `weighted` and `groupby` ([issue](https://github.com/pydata/xarray/issues/3937)).
The `weights` method should return an appropriate array of weights such that the following property is satisfied:
```python
gb_sum = ds.groupby(by).sum()

weights = CustomGrouper.weights(by)
weighted_sum = xr.dot(ds, weights)

assert_identical(gb_sum, weighted_sum)
```
For example, the boolean weights for `group=np.array(['a', 'b', 'c', 'a', 'a'])` should be
```
[[1, 0, 0, 1, 1],
 [0, 1, 0, 0, 0],
 [0, 0, 1, 0, 0]]
```
This is the boolean "summarization matrix" referred to in the classic Iverson (1980, Section 4.3)[^2] and the "nub sieve" in [various APLs](https://aplwiki.com/wiki/Nub_Sieve).
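Given integer codes, this matrix is just a one-hot encoding of the codes; a sketch:

```python
import numpy as np

group = np.array(["a", "b", "c", "a", "a"])
codes = np.unique(group, return_inverse=True)[1]  # array([0, 1, 2, 0, 0])

# one-hot encode: weights[i, j] = 1 iff element j belongs to group i
weights = (codes == np.arange(codes.max() + 1)[:, None]).astype(int)
weights
# array([[1, 0, 0, 1, 1],
#        [0, 1, 0, 0, 0],
#        [0, 0, 1, 0, 0]])
```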

> [!NOTE]
> We can always construct `weights` automatically using `group_indices` from `factorize`, so this is not a required method.

For a rolling resampling, windowed weights are possible:
```
[[0.5, 1, 0.5, 0, 0],
 [0, 0.25, 1, 1, 0],
 [0, 0, 0, 1, 1]]
```
### The `preferred_chunks` method (?)

Rechunking support is another optional extension point.
In `flox`, I experimented with automatically rechunking to make a groupby more parallel-friendly ([example 1](https://flox.readthedocs.io/en/latest/generated/flox.rechunk_for_blockwise.html), [example 2](https://flox.readthedocs.io/en/latest/generated/flox.rechunk_for_cohorts.html)).
A great example is resampling-style groupby reductions, for which `codes` might look like
```
0001|11122|3333
```
where `|` represents chunk boundaries. A simple rechunking to
```
000|111122|3333
```
would make this resampling reduction an embarrassingly parallel blockwise problem.

Similarly, consider monthly-mean climatologies for which the month numbers might be
```
1 2 3 4 5 | 6 7 8 9 10 | 11 12 1 2 3 | 4 5 6 7 8 | 9 10 11 12 |
```
A slight rechunking to
```
1 2 3 4 | 5 6 7 8 | 9 10 11 12 | 1 2 3 4 | 5 6 7 8 | 9 10 11 12 |
```
allows us to reduce `1, 2, 3, 4` separately from `5, 6, 7, 8` and `9, 10, 11, 12` while still being parallel-friendly (see the [flox documentation](https://flox.readthedocs.io/en/latest/implementation.html#method-cohorts) for more).

We could attempt to detect these patterns, or we could just have the Grouper take `chunks` as input and return a tuple of "nice" chunk sizes to rechunk to.
```python
def preferred_chunks(self, chunks: ChunksTuple) -> ChunksTuple:
    pass
```
For monthly means, since the period of repetition of labels is 12, the Grouper might choose possible chunk sizes of `((2,), (3,), (4,), (6,))`.
For resampling, the Grouper could choose to resample to a multiple or an even fraction of the resampling frequency.
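As a rough illustration of the rechunking idea (a hypothetical helper, not part of the proposed API), one could snap each existing chunk boundary to the nearest start of a group:

```python
import numpy as np

def snap_chunks_to_groups(chunks: tuple, codes: np.ndarray) -> tuple:
    """Move each chunk boundary to the nearest index where a new group begins."""
    # indices where the code changes, i.e. where a new group starts
    group_starts = np.flatnonzero(np.diff(codes, prepend=codes[0] - 1))
    boundaries = np.cumsum(chunks)[:-1]  # interior chunk boundaries
    snapped = np.unique([group_starts[np.abs(group_starts - b).argmin()] for b in boundaries])
    snapped = snapped[(snapped > 0) & (snapped < codes.size)]
    return tuple(int(n) for n in np.diff(np.r_[0, snapped, codes.size]))

# codes 0001|11122|3333 with chunks (4, 5, 4) become 000|111122|3333
codes = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3])
snap_chunks_to_groups((4, 5, 4), codes)  # (3, 6, 4)
```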

## Related work

Pandas has [Grouper objects](https://pandas.pydata.org/docs/reference/api/pandas.Grouper.html#pandas-grouper) that represent the GroupBy instruction.
However, these objects do not appear to be extension points, unlike the Grouper objects proposed here.
Instead, Pandas' `ExtensionArray` has a [`factorize`](https://pandas.pydata.org/docs/reference/api/pandas.api.extensions.ExtensionArray.factorize.html) method.

Composing rolling with time resampling is a common workload:
1. Polars has [`group_by_dynamic`](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.group_by_dynamic.html), which appears to be like the proposed `RollingResampler`.
2. scikit-downscale provides [`PaddedDOYGrouper`](https://github.com/pangeo-data/scikit-downscale/blob/e16944a32b44f774980fa953ea18e29a628c71b8/skdownscale/pointwise_models/groupers.py#L19).
## Implementation Proposal

1. Get rid of `squeeze` ([issue](https://github.com/pydata/xarray/issues/2157)): [PR](https://github.com/pydata/xarray/pull/8506)
2. Merge the existing two-class implementation into a single `Grouper` class.
    1. This design was implemented in [this PR](https://github.com/pydata/xarray/pull/7206) to account for some annoying data dependencies.
    2. See [this PR](https://github.com/pydata/xarray/pull/8509).
3. Clean up what's returned by `factorize` methods.
    1. A solution here might be to have `group_indices: Mapping[int, Sequence[int]]` be a mapping from group index in `full_index` to a sequence of integers.
    2. Return a `namedtuple` or `dataclass` from existing Grouper `factorize` methods to facilitate API changes in the future.
4. Figure out what to pass to `factorize`.
    1. Xarray eagerly reshapes nD variables to 1D. This is an implementation detail we need not expose.
    2. When grouping by an unindexed variable, Xarray passes a `_DummyGroup` object. This seems like something we don't want in the public interface. We could special-case "internal" Groupers to preserve the optimizations in `UniqueGrouper`.
5. Grouper objects will be exposed under the `xr.groupers` namespace. At first these will include `UniqueGrouper`, `BinGrouper`, and `TimeResampler`.
## Alternatives

One major design choice made here was to adopt the syntax `ds.groupby(x=BinGrouper(...))` instead of `ds.groupby(BinGrouper('x', ...))`.
This allows reuse of Grouper objects, for example:
```python
grouper = BinGrouper(...)
ds.groupby(x=grouper, y=grouper)
```
but requires that all variables being grouped by (`x` and `y` above) are present in the Dataset `ds`. This does not seem like a bad requirement.
Importantly, `Grouper` instances will be copied internally so that they can safely cache state that might be shared between `factorize` and `weights`.

Today, it is possible to write `ds.groupby(DataArray, ...)`. This syntax will still be supported.
## Discussion

This proposal builds on these discussions:
1. https://github.com/xarray-contrib/flox/issues/191#issuecomment-1328898836
2. https://github.com/pydata/xarray/issues/6610

## Copyright

This document has been placed in the public domain.

## References and footnotes

[^1]: Wickham, H. (2011). The split-apply-combine strategy for data analysis. https://vita.had.co.nz/papers/plyr.html
[^2]: Iverson, K.E. (1980). Notation as a tool of thought. Commun. ACM 23, 8 (Aug. 1980), 444–465. https://doi.org/10.1145/358896.358899