Commit ccbdd0d

Mention flox in docs (#7454)
* Mention flox in docs
* add to groupby.rst
* Add whats-new
* [skip-ci] Update doc/whats-new.rst
1 parent dcb20e9 commit ccbdd0d

5 files changed: +457 -15 lines changed

doc/user-guide/dask.rst

Lines changed: 10 additions & 7 deletions
@@ -559,16 +559,19 @@ With analysis pipelines involving both spatial subsetting and temporal resampling
 can become very slow or memory hungry in certain cases. Here are some optimization tips we have found
 through experience:

-1. Do your spatial and temporal indexing (e.g. ``.sel()`` or ``.isel()``) early in the pipeline, especially before calling ``resample()`` or ``groupby()``. Grouping and resampling trigger some computation on all the blocks, which in theory should commute with indexing, but this optimization hasn't been implemented in Dask yet. (See `Dask issue #746 <https://github.com/dask/dask/issues/746>`_). More generally, ``groupby()`` is a costly operation and does not (yet) perform well on datasets split across multiple files (see :pull:`5734` and linked discussions there).
+1. Do your spatial and temporal indexing (e.g. ``.sel()`` or ``.isel()``) early in the pipeline, especially before calling ``resample()`` or ``groupby()``. Grouping and resampling trigger some computation on all the blocks, which in theory should commute with indexing, but this optimization hasn't been implemented in Dask yet. (See `Dask issue #746 <https://github.com/dask/dask/issues/746>`_).

-2. Save intermediate results to disk as netCDF files (using ``to_netcdf()``) and then load them again with ``open_dataset()`` for further computations. For example, if subtracting the temporal mean from a dataset, save the temporal mean to disk before subtracting. Again, in theory, Dask should be able to do the computation in a streaming fashion, but in practice this is a fail case for the Dask scheduler, because it tries to keep every chunk of an array that it computes in memory. (See `Dask issue #874 <https://github.com/dask/dask/issues/874>`_)
+2. More generally, ``groupby()`` is a costly operation and will perform a lot better if the ``flox`` package is installed.
+   See the `flox documentation <https://flox.readthedocs.io/>`_ for more. By default Xarray will use ``flox`` if installed.

-3. Specify smaller chunks across space when using :py:meth:`~xarray.open_mfdataset` (e.g., ``chunks={'latitude': 10, 'longitude': 10}``). This makes spatial subsetting easier, because there's no risk you will load subsets of data which span multiple chunks. On individual files, prefer to subset before chunking (suggestion 1).
+3. Save intermediate results to disk as netCDF files (using ``to_netcdf()``) and then load them again with ``open_dataset()`` for further computations. For example, if subtracting the temporal mean from a dataset, save the temporal mean to disk before subtracting. Again, in theory, Dask should be able to do the computation in a streaming fashion, but in practice this is a fail case for the Dask scheduler, because it tries to keep every chunk of an array that it computes in memory. (See `Dask issue #874 <https://github.com/dask/dask/issues/874>`_)

-4. Chunk as early as possible, and avoid rechunking as much as possible. Always pass the ``chunks={}`` argument to :py:func:`~xarray.open_mfdataset` to avoid redundant file reads.
+4. Specify smaller chunks across space when using :py:meth:`~xarray.open_mfdataset` (e.g., ``chunks={'latitude': 10, 'longitude': 10}``). This makes spatial subsetting easier, because there's no risk you will load subsets of data which span multiple chunks. On individual files, prefer to subset before chunking (suggestion 1).

-5. Using the h5netcdf package by passing ``engine='h5netcdf'`` to :py:meth:`~xarray.open_mfdataset` can be quicker than the default ``engine='netcdf4'`` that uses the netCDF4 package.
+5. Chunk as early as possible, and avoid rechunking as much as possible. Always pass the ``chunks={}`` argument to :py:func:`~xarray.open_mfdataset` to avoid redundant file reads.

-6. Some dask-specific tips may be found `here <https://docs.dask.org/en/latest/array-best-practices.html>`_.
+6. Using the h5netcdf package by passing ``engine='h5netcdf'`` to :py:meth:`~xarray.open_mfdataset` can be quicker than the default ``engine='netcdf4'`` that uses the netCDF4 package.

-7. The dask `diagnostics <https://docs.dask.org/en/latest/understanding-performance.html>`_ can be useful in identifying performance bottlenecks.
+7. Some dask-specific tips may be found `here <https://docs.dask.org/en/latest/array-best-practices.html>`_.
+
+8. The dask `diagnostics <https://docs.dask.org/en/latest/understanding-performance.html>`_ can be useful in identifying performance bottlenecks.
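
To make the reordered tips concrete, here is a minimal sketch of tips 1 and 2 above, using a small synthetic dataset (the variable names, sizes, and chunking below are illustrative only and are not part of the commit): subset early, then group, letting Xarray dispatch the reduction to flox when it is installed.

    import numpy as np
    import pandas as pd
    import xarray as xr

    # Illustrative synthetic data; a real pipeline would open files lazily with dask.
    time = pd.date_range("2000-01-01", periods=365)
    ds = xr.Dataset(
        {"tas": (("time", "lat", "lon"), np.random.rand(365, 10, 20))},
        coords={"time": time, "lat": np.arange(10.0), "lon": np.arange(20.0)},
    ).chunk({"time": 90})

    # Tip 1: do spatial/temporal indexing before resample()/groupby().
    subset = ds.sel(lat=slice(2, 6))

    # Tip 2: this reduction is handled by flox automatically when flox is installed.
    monthly_mean = subset.groupby("time.month").mean()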
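And a sketch of tips 3 through 6, assuming a hypothetical collection of netCDF files (the glob pattern, dimension names, and chunk sizes are placeholders): open with explicit spatial chunks and the h5netcdf engine, write the intermediate temporal mean to disk, then reload it for the subtraction.

    import xarray as xr

    # Tips 4-6: pass explicit chunks and try engine="h5netcdf" when opening many files.
    # "data/*.nc" and the chunk sizes are placeholders, not from the commit.
    ds = xr.open_mfdataset(
        "data/*.nc",
        chunks={"latitude": 10, "longitude": 10},
        engine="h5netcdf",
    )

    # Tip 3: persist the intermediate temporal mean instead of keeping the whole
    # computation graph in memory, then reload it lazily for further work.
    ds.mean("time").to_netcdf("time_mean.nc")
    time_mean = xr.open_dataset("time_mean.nc", chunks={"latitude": 10, "longitude": 10})

    anomalies = ds - time_mean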

doc/user-guide/groupby.rst

Lines changed: 10 additions & 0 deletions
@@ -22,6 +22,16 @@ over a multi-dimensional variable has recently been implemented. Note that for
 one-dimensional data, it is usually faster to rely on pandas' implementation of
 the same pipeline.

+.. tip::
+
+   To substantially improve the performance of GroupBy operations, particularly
+   with dask, install the `flox <https://flox.readthedocs.io>`_ package. flox also
+   `extends <https://flox.readthedocs.io/en/latest/xarray.html>`_
+   Xarray's in-built GroupBy capabilities by allowing grouping by multiple variables,
+   and lazy grouping by dask arrays. Xarray will automatically use flox by default
+   if it is installed.
+
+
 Split
 ~~~~~
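
As a small illustration of the tip added above (synthetic data; names are placeholders): with flox installed the same ``groupby`` call is accelerated transparently, and the ``use_flox`` flag in ``xarray.set_options`` can toggle that behaviour, as sketched here.

    import numpy as np
    import pandas as pd
    import xarray as xr

    da = xr.DataArray(
        np.random.rand(365),
        coords={"time": pd.date_range("2000-01-01", periods=365)},
        dims="time",
    )

    # With flox installed, this reduction is dispatched to flox automatically.
    # Setting use_flox=False falls back to Xarray's built-in groupby path.
    with xr.set_options(use_flox=True):
        seasonal_mean = da.groupby("time.season").mean()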

doc/whats-new.rst

Lines changed: 2 additions & 0 deletions
@@ -57,6 +57,8 @@ Bug fixes
 Documentation
 ~~~~~~~~~~~~~

+- Mention the `flox package <https://flox.readthedocs.io>`_ in GroupBy documentation and docstrings.
+  By `Deepak Cherian <https://github.com/dcherian>`_.

 Internal Changes
 ~~~~~~~~~~~~~~~~
