
Commit 4b54421

Duck array documentation improvements (#7911)
* draft updates
* discuss array API standard
* fix sparse examples so they run
* Deepak's suggestions
  Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* link to duck arrays user guide from internals page
* fix various links
* itemized list
* mention dispatching on functions not in the array API standard
* examples of duckarrays
* add intended audience to xarray internals section
* move paragraph on why its called a duck array upwards
* delete section on numpy ufuncs
* explain difference between .values and to_numpy
* strongly prefer to_numpy over values
* recommend to_numpy instead of values in the how do I? page
* clearer about using to_numpy
* merge section on missing features
* remove todense from examples
* whatsnew
* double that
  Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* numpy array class clarification
  Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* Remove sentence about xarray's internals
  Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* array API standard
  Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* proper link for sparse.COO type
  Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* links to docstrings of array types
  Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* don't put variable in parentheses
  Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* double backquote formatting
  Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* better bracketing
  Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* fix list formatting
* add links to glue packages, dask, and cubed
* link to todense method
  Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* link to numpy-like arrays page
* link to numpy ufunc docs
* add example of using .to_numpy
* show example of .values failing
* move whatsnew entry to unreleased version
* fix warning in docs build
* trigger CI

---------
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
1 parent 1146c5a commit 4b54421

7 files changed: +232 -28 lines changed

doc/howdoi.rst

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -42,7 +42,7 @@ How do I ...
4242
* - extract the underlying array (e.g. NumPy or Dask arrays)
4343
- :py:attr:`DataArray.data`
4444
* - convert to and extract the underlying NumPy array
45-
- :py:attr:`DataArray.values`
45+
- :py:attr:`DataArray.to_numpy`
4646
* - convert to a pandas DataFrame
4747
- :py:attr:`Dataset.to_dataframe`
4848
* - sort values

doc/internals/duck-arrays-integration.rst

Lines changed: 38 additions & 6 deletions
@@ -1,23 +1,55 @@
11

2-
.. _internals.duck_arrays:
2+
.. _internals.duckarrays:
33

44
Integrating with duck arrays
55
=============================
66

77
.. warning::
88

9-
This is a experimental feature.
9+
This is an experimental feature. Please report any bugs or other difficulties on `xarray's issue tracker <https://github.com/pydata/xarray/issues>`_.
1010

11-
Xarray can wrap custom :term:`duck array` objects as long as they define numpy's
12-
``shape``, ``dtype`` and ``ndim`` properties and the ``__array__``,
13-
``__array_ufunc__`` and ``__array_function__`` methods.
11+
Xarray can wrap custom numpy-like arrays (":term:`duck array`\s") - see the :ref:`user guide documentation <userguide.duckarrays>`.
12+
This page is intended for developers who are interested in wrapping a new custom array type with xarray.
13+
14+
Duck array requirements
15+
~~~~~~~~~~~~~~~~~~~~~~~
16+
17+
Xarray does not explicitly check that required methods are defined by the underlying duck array object before
18+
attempting to wrap the given array. However, a wrapped array type should at a minimum define these attributes:
19+
20+
* ``shape`` property,
21+
* ``dtype`` property,
22+
* ``ndim`` property,
23+
* ``__array__`` method,
24+
* ``__array_ufunc__`` method,
25+
* ``__array_function__`` method.
26+
27+
These need to be defined consistently with :py:class:`numpy.ndarray`; for example, the array ``shape``
28+
property needs to obey `numpy's broadcasting rules <https://numpy.org/doc/stable/user/basics.broadcasting.html>`_
29+
(see also the `Python Array API standard's explanation <https://data-apis.org/array-api/latest/API_specification/broadcasting.html>`_
30+
of these same rules).
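
As a rough illustration of the requirements listed above, here is a minimal sketch of a class that defines all six attributes by delegating to an internal NumPy array. The class name ``WrappedArray`` and its helper methods are hypothetical, and details such as ``out=`` handling or nested arguments are deliberately left out; a real duck array library would need to cover far more of the NumPy API.

```python
import numpy as np


class WrappedArray:
    """Hypothetical duck array that stores its data in a NumPy array."""

    def __init__(self, data):
        self._data = np.asarray(data)

    # Properties that should agree with numpy.ndarray semantics.
    @property
    def shape(self):
        return self._data.shape

    @property
    def dtype(self):
        return self._data.dtype

    @property
    def ndim(self):
        return self._data.ndim

    def __array__(self, dtype=None, copy=None):
        # Explicit conversion to a real NumPy array, e.g. np.asarray(obj).
        return np.asarray(self._data, dtype=dtype)

    @staticmethod
    def _unwrap(values):
        # Replace wrapped arguments (top level only) with their NumPy data.
        return [v._data if isinstance(v, WrappedArray) else v for v in values]

    @staticmethod
    def _wrap(result):
        # Re-wrap array results; pass scalars and other objects through.
        return WrappedArray(result) if isinstance(result, np.ndarray) else result

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Called for ufuncs such as np.sqrt(obj) or np.add(obj, 1).
        result = getattr(ufunc, method)(*self._unwrap(inputs), **kwargs)
        return self._wrap(result)

    def __array_function__(self, func, types, args, kwargs):
        # Called for other NumPy functions such as np.sum(obj, axis=0).
        if not all(issubclass(t, (WrappedArray, np.ndarray)) for t in types):
            return NotImplemented
        return self._wrap(func(*self._unwrap(args), **kwargs))


arr = WrappedArray([[1.0, 4.0], [9.0, 16.0]])
print(arr.shape, arr.ndim, arr.dtype)
print(type(np.sqrt(arr)).__name__)         # dispatched through __array_ufunc__
print(type(np.sum(arr, axis=0)).__name__)  # dispatched through __array_function__
```

Both NumPy calls return a ``WrappedArray`` rather than a plain ``numpy.ndarray``, which is the behaviour the two dispatch protocols exist to make possible.
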
31+
32+
Python Array API standard support
33+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
34+
35+
As an integration library xarray benefits greatly from the standardization of duck-array libraries' APIs, and so is a
36+
big supporter of the `Python Array API Standard <https://data-apis.org/array-api/latest/>`_.
37+
38+
We aim to support any array libraries that follow the Array API standard out-of-the-box. However, xarray does occasionally
39+
call some numpy functions which are not (yet) part of the standard (e.g. :py:meth:`xarray.DataArray.pad` calls :py:func:`numpy.pad`).
40+
See `xarray issue #7848 <https://github.com/pydata/xarray/issues/7848>`_ for a list of such functions. We can still support dispatching on these functions through
41+
the array protocols above, but if you exclusively implement the methods in the Python Array API standard
42+
then some features in xarray will not work.
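
The distinction above can be sketched as follows. This is only an illustration of the two dispatch routes, not xarray's actual implementation; the helper names ``get_namespace`` and ``padded_mean`` are hypothetical.

```python
import numpy as np


def get_namespace(array):
    """Return the array's Array API namespace, or NumPy if it has none."""
    if hasattr(array, "__array_namespace__"):
        return array.__array_namespace__()
    return np


def padded_mean(array, pad_width):
    """Hypothetical helper that mixes a standard and a non-standard call."""
    xp = get_namespace(array)
    # ``mean`` is part of the Array API standard, so it can be looked up
    # on whichever namespace the array reports.
    mean = xp.mean(array)
    # ``pad`` is not part of the standard, so this call only works for
    # array types that NumPy can dispatch to via ``__array_function__``.
    padded = np.pad(array, pad_width)
    return mean, padded


mean, padded = padded_mean(np.arange(4.0), 1)
print(mean, padded)
```

An array type that implemented only the standard would support the first call but not the second, which is the kind of gap tracked in the issue linked above.
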
43+
44+
Custom inline reprs
45+
~~~~~~~~~~~~~~~~~~~
1446

1547
In certain situations (e.g. when printing the collapsed preview of
1648
variables of a ``Dataset``), xarray will display the repr of a :term:`duck array`
1749
in a single line, truncating it to a certain number of characters. If that
1850
would drop too much information, the :term:`duck array` may define a
1951
``_repr_inline_`` method that takes ``max_width`` (number of characters) as an
20-
argument:
52+
argument
2153

2254
.. code:: python
2355

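
For illustration only (this is not the example from the xarray documentation, which is cut off by the diff context here), a ``_repr_inline_`` method might look roughly like the sketch below. The class ``TinyDuckArray`` and its ``units`` attribute are hypothetical.

```python
class TinyDuckArray:
    """Hypothetical array-like class used only to illustrate ``_repr_inline_``."""

    def __init__(self, values, units):
        self.values = list(values)
        self.units = units

    def _repr_inline_(self, max_width):
        # Build a one-line summary, then truncate it to at most max_width
        # characters, marking the truncation with an ellipsis when possible.
        full = f"[{self.units}] " + " ".join(str(v) for v in self.values)
        if len(full) <= max_width:
            return full
        if max_width <= 3:
            return full[:max_width]
        return full[: max_width - 3] + "..."


print(TinyDuckArray(range(100), "m")._repr_inline_(20))
print(TinyDuckArray([1, 2], "m")._repr_inline_(20))
```

Putting the most important information (here the units) first means it survives truncation.
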
doc/internals/extending-xarray.rst

Lines changed: 2 additions & 0 deletions
@@ -1,4 +1,6 @@
11

2+
.. _internals.accessors:
3+
24
Extending xarray using accessors
35
================================
46

doc/internals/index.rst

Lines changed: 6 additions & 0 deletions
@@ -8,6 +8,12 @@ stack, NumPy and pandas. It is written in pure Python (no C or Cython
88
extensions), which makes it easy to develop and extend. Instead, we push
99
compiled code to :ref:`optional dependencies<installing>`.
1010

11+
The pages in this section are intended for:
12+
13+
* Contributors to xarray who wish to better understand some of the internals,
14+
* Developers who wish to extend xarray with domain-specific logic, perhaps to support a new scientific community of users,
15+
* Developers who wish to interface xarray with their existing tooling, e.g. by creating a plugin for reading a new file format, or wrapping a custom array type.
16+
1117

1218
.. toctree::
1319
:maxdepth: 2

doc/user-guide/data-structures.rst

Lines changed: 4 additions & 2 deletions
@@ -19,7 +19,8 @@ DataArray
1919
:py:class:`xarray.DataArray` is xarray's implementation of a labeled,
2020
multi-dimensional array. It has several key properties:
2121

22-
- ``values``: a :py:class:`numpy.ndarray` holding the array's values
22+
- ``values``: a :py:class:`numpy.ndarray` or
23+
:ref:`numpy-like array <userguide.duckarrays>` holding the array's values
2324
- ``dims``: dimension names for each axis (e.g., ``('x', 'y', 'z')``)
2425
- ``coords``: a dict-like container of arrays (*coordinates*) that label each
2526
point (e.g., 1-dimensional arrays of numbers, datetime objects or
@@ -46,7 +47,8 @@ Creating a DataArray
4647
The :py:class:`~xarray.DataArray` constructor takes:
4748

4849
- ``data``: a multi-dimensional array of values (e.g., a numpy ndarray,
49-
:py:class:`~pandas.Series`, :py:class:`~pandas.DataFrame` or ``pandas.Panel``)
50+
a :ref:`numpy-like array <userguide.duckarrays>`, :py:class:`~pandas.Series`,
51+
:py:class:`~pandas.DataFrame` or ``pandas.Panel``)
5052
- ``coords``: a list or dictionary of coordinates. If a list, it should be a
5153
list of tuples where the first element is the dimension name and the second
5254
element is the corresponding coordinate array_like object.

doc/user-guide/duckarrays.rst

Lines changed: 179 additions & 18 deletions
@@ -1,30 +1,183 @@
11
.. currentmodule:: xarray
22

3+
.. _userguide.duckarrays:
4+
35
Working with numpy-like arrays
46
==============================
57

8+
NumPy-like arrays (often known as :term:`duck array`\s) are drop-in replacements for the :py:class:`numpy.ndarray`
9+
class but with different features, such as propagating physical units or a different layout in memory.
10+
Xarray can often wrap these array types, allowing you to use labelled dimensions and indexes whilst benefiting from the
11+
additional features of these array libraries.
12+
13+
Some numpy-like array types that xarray already has some support for:
14+
15+
* `Cupy <https://cupy.dev/>`_ - GPU support (see `cupy-xarray <https://cupy-xarray.readthedocs.io>`_),
16+
* `Sparse <https://sparse.pydata.org/en/stable/>`_ - for performant arrays with many zero elements,
17+
* `Pint <https://pint.readthedocs.io/en/latest/>`_ - for tracking the physical units of your data (see `pint-xarray <https://pint-xarray.readthedocs.io>`_),
18+
* `Dask <https://docs.dask.org/en/stable/>`_ - parallel computing on larger-than-memory arrays (see :ref:`using dask with xarray <dask>`),
19+
* `Cubed <https://github.com/tomwhite/cubed/tree/main/cubed>`_ - another parallel computing framework that emphasises reliability (see `cubed-xarray <https://github.com/cubed-xarray>`_).
20+
621
.. warning::
722

8-
This feature should be considered experimental. Please report any bug you may find on
9-
xarray’s github repository.
23+
This feature should be considered somewhat experimental. Please report any bugs you find on
24+
`xarray’s issue tracker <https://github.com/pydata/xarray/issues>`_.
25+
26+
.. note::
27+
28+
For information on wrapping dask arrays see :ref:`dask`. Whilst xarray wraps dask arrays in a similar way to that
29+
described on this page, chunked array types like :py:class:`dask.array.Array` implement additional methods that require
30+
slightly different user code (e.g. calling ``.chunk`` or ``.compute``).
31+
32+
Why "duck"?
33+
-----------
34+
35+
Why is it also called a "duck" array? This comes from a common statement of object-oriented programming -
36+
"If it walks like a duck, and quacks like a duck, treat it like a duck". In other words, a library like xarray that
37+
is capable of using multiple different types of arrays does not have to explicitly check that each one it encounters is
38+
permitted (e.g. ``if dask``, ``if numpy``, ``if sparse`` etc.). Instead xarray can take the more permissive approach of simply
39+
treating the wrapped array as valid, attempting to call the relevant methods (e.g. ``.mean()``) and only raising an
40+
error if a problem occurs (e.g. the method is not found on the wrapped class). This is much more flexible, and allows
41+
objects and classes from different libraries to work together more easily.
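
The idea can be shown without any array library at all. In this sketch the classes ``InMemoryArray`` and ``RunningStats`` and the function ``describe`` are all hypothetical; the point is only that ``describe`` never checks which type it was given.

```python
class InMemoryArray:
    """Hypothetical container that keeps every value in a list."""

    def __init__(self, values):
        self.values = list(values)

    def mean(self):
        return sum(self.values) / len(self.values)


class RunningStats:
    """Hypothetical container that keeps only a count and a running total."""

    def __init__(self, values):
        self.count = 0
        self.total = 0.0
        for value in values:
            self.count += 1
            self.total += value

    def mean(self):
        return self.total / self.count


def describe(wrapped):
    # No isinstance() checks: anything with a .mean() method is accepted,
    # and an error is raised only if the method is actually missing.
    mean = getattr(wrapped, "mean", None)
    if mean is None:
        raise TypeError(f"{type(wrapped).__name__} does not provide .mean()")
    return mean()


print(describe(InMemoryArray([1, 2, 3])))
print(describe(RunningStats([1, 2, 3])))
```

Both calls succeed even though the two classes share no base class and store their data differently.
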
42+
43+
What is a numpy-like array?
44+
---------------------------
45+
46+
A "numpy-like array" (also known as a "duck array") is a class that contains array-like data, and implements key
47+
numpy-like functionality such as indexing, broadcasting, and computation methods.
48+
49+
For example, the `sparse <https://sparse.pydata.org/en/stable/>`_ library provides a sparse array type which is useful for representing nD array objects like sparse matrices
50+
in a memory-efficient manner. We can create a sparse array object (of the :py:class:`sparse.COO` type) from a numpy array like this:
51+
52+
.. ipython:: python
53+
54+
from sparse import COO
55+
56+
x = np.eye(4, dtype=np.uint8) # create diagonal identity matrix
57+
s = COO.from_numpy(x)
58+
s
1059
11-
NumPy-like arrays (:term:`duck array`) extend the :py:class:`numpy.ndarray` with
12-
additional features, like propagating physical units or a different layout in memory.
60+
This sparse object does not attempt to explicitly store every element in the array, only the non-zero elements.
61+
This approach is much more efficient for large arrays with only a few non-zero elements (such as tri-diagonal matrices).
62+
Sparse array objects can be converted back to a "dense" numpy array by calling :py:meth:`sparse.COO.todense`.
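
To make the storage idea concrete, here is a pure-Python sketch of a coordinate-list layout for a 2D array. It is a simplified illustration of the concept only and says nothing about how the ``sparse`` library is implemented internally; the function names are hypothetical.

```python
def dense_to_coo(dense):
    """Store only the non-zero entries of a 2D list as coordinates plus values."""
    coords, data = [], []
    for i, row in enumerate(dense):
        for j, value in enumerate(row):
            if value != 0:
                coords.append((i, j))
                data.append(value)
    shape = (len(dense), len(dense[0]) if dense else 0)
    return coords, data, shape


def coo_to_dense(coords, data, shape):
    """Rebuild the full 2D list, filling unstored positions with zeros."""
    dense = [[0] * shape[1] for _ in range(shape[0])]
    for (i, j), value in zip(coords, data):
        dense[i][j] = value
    return dense


identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
coords, data, shape = dense_to_coo(identity)
print(len(data), "stored values instead of", shape[0] * shape[1])
```

For the 4x4 identity matrix only the four diagonal entries are stored, and the saving grows with the size of the array.
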
1363

14-
:py:class:`DataArray` and :py:class:`Dataset` objects can wrap these duck arrays, as
15-
long as they satisfy certain conditions (see :ref:`internals.duck_arrays`).
64+
Just like :py:class:`numpy.ndarray` objects, :py:class:`sparse.COO` arrays support indexing
65+
66+
.. ipython:: python
67+
68+
s[1, 1] # diagonal elements should be ones
69+
s[2, 3] # off-diagonal elements should be zero
70+
71+
broadcasting,
72+
73+
.. ipython:: python
74+
75+
x2 = np.zeros(
76+
(4, 1), dtype=np.uint8
77+
) # create second sparse array of different shape
78+
s2 = COO.from_numpy(x2)
79+
(s * s2) # multiplication requires broadcasting
80+
81+
and various computation methods
82+
83+
.. ipython:: python
84+
85+
s.sum(axis=1)
86+
87+
This numpy-like array can also be passed directly to numpy functions, including so-called `numpy ufuncs <https://numpy.org/doc/stable/reference/ufuncs.html#available-ufuncs>`_
88+
("universal functions") and reductions such as :py:func:`numpy.sum`:
89+
90+
.. ipython:: python
91+
92+
np.sum(s, axis=1)
93+
94+
95+
Notice that in each case the API for calling the operation on the sparse array is identical to that of calling it on the
96+
equivalent numpy array - this is the sense in which the sparse array is "numpy-like".
1697

1798
.. note::
1899

19-
For ``dask`` support see :ref:`dask`.
100+
For discussion on exactly which methods a class needs to implement to be considered "numpy-like", see :ref:`internals.duckarrays`.
101+
102+
Wrapping numpy-like arrays in xarray
103+
------------------------------------
104+
105+
:py:class:`DataArray`, :py:class:`Dataset`, and :py:class:`Variable` objects can wrap these numpy-like arrays.
20106

107+
Constructing xarray objects which wrap numpy-like arrays
108+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
21109

22-
Missing features
23-
----------------
24-
Most of the API does support :term:`duck array` objects, but there are a few areas where
25-
the code will still cast to ``numpy`` arrays:
110+
The primary way to create an xarray object which wraps a numpy-like array is to pass that numpy-like array instance directly
111+
to the constructor of the xarray class. The :ref:`page on xarray data structures <data structures>` shows how :py:class:`DataArray` and :py:class:`Dataset`
112+
both accept data in various forms through their ``data`` argument, but in fact this data can also be any wrappable numpy-like array.
26113

27-
- dimension coordinates, and thus all indexing operations:
114+
For example, we can wrap the sparse array we created earlier inside a new DataArray object:
115+
116+
.. ipython:: python
117+
118+
s_da = xr.DataArray(s, dims=["i", "j"])
119+
s_da
120+
121+
We can see what's inside - the printable representation of our xarray object (the repr) automatically uses the printable
122+
representation of the underlying wrapped array.
123+
124+
Of course our sparse array object is still there underneath - it's stored under the ``.data`` attribute of the ``DataArray``:
125+
126+
.. ipython:: python
127+
128+
s_da.data
129+
130+
Array methods
131+
~~~~~~~~~~~~~
132+
133+
We saw above that numpy-like arrays provide numpy methods. Xarray automatically uses these when you call the corresponding xarray method:
134+
135+
.. ipython:: python
136+
137+
s_da.sum(dim="j")
138+
139+
Converting wrapped types
140+
~~~~~~~~~~~~~~~~~~~~~~~~
141+
142+
If you want to change the type inside your xarray object you can use :py:meth:`DataArray.as_numpy`:
143+
144+
.. ipython:: python
145+
146+
s_da.as_numpy()
147+
148+
This returns a new :py:class:`DataArray` object, but now wrapping a normal numpy array.
149+
150+
If instead you want to convert to numpy and return that numpy array you can use either :py:meth:`DataArray.to_numpy` or
151+
:py:attr:`DataArray.values`, where the former is strongly preferred. The difference is in the way they coerce to numpy: :py:attr:`~DataArray.values`
152+
always uses :py:func:`numpy.asarray`, which will fail for some array types (e.g. ``cupy``), whereas :py:meth:`~DataArray.to_numpy`
153+
uses the correct method depending on the array type.
154+
155+
.. ipython:: python
156+
157+
s_da.to_numpy()
158+
159+
.. ipython:: python
160+
:okexcept:
161+
162+
s_da.values
163+
164+
This illustrates the difference between :py:attr:`~DataArray.data` and :py:attr:`~DataArray.values`,
165+
which is sometimes a point of confusion for new xarray users.
166+
Explicitly: :py:attr:`DataArray.data` returns the underlying numpy-like array, regardless of type, whereas
167+
:py:attr:`DataArray.values` converts the underlying array to a numpy array before returning it.
168+
(This is another reason to use :py:meth:`~DataArray.to_numpy` over :py:attr:`~DataArray.values` - the intention is clearer.)
169+
170+
Conversion to numpy as a fallback
171+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
172+
173+
If a wrapped array does not implement the corresponding array method then xarray will often attempt to convert the
174+
underlying array to a numpy array so that the operation can be performed. You may want to watch out for this behavior,
175+
and report any instances in which it causes problems.
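
The general shape of such a fallback can be sketched like this. It is an illustration of the pattern only, not xarray's actual code; ``reduce_with_fallback`` and the ``Readings`` class are hypothetical.

```python
import numpy as np


def reduce_with_fallback(array, name, **kwargs):
    """Call ``array.<name>()`` if it exists, else convert to NumPy first."""
    method = getattr(array, name, None)
    if callable(method):
        return method(**kwargs), "native"
    # The wrapped type lacks this method, so coerce to a NumPy array.
    # Any extra behaviour of the original type is lost at this point.
    dense = np.asarray(array)
    return getattr(dense, name)(**kwargs), "numpy fallback"


class Readings:
    """Hypothetical array type that implements sum() but not max()."""

    def __init__(self, values):
        self._values = list(values)

    def __array__(self, dtype=None, copy=None):
        return np.array(self._values, dtype=dtype)

    def sum(self):
        return sum(self._values)


readings = Readings([1, 5, 3])
print(reduce_with_fallback(readings, "sum"))
print(reduce_with_fallback(readings, "max"))
```

The second call silently goes through NumPy, which is convenient but can be expensive (or lossy) for types such as sparse or GPU arrays.
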
176+
177+
Most of xarray's API does support using :term:`duck array` objects, but there are a few areas where
178+
the code will still convert to ``numpy`` arrays:
179+
180+
- Dimension coordinates, and thus all indexing operations:
28181

29182
* :py:meth:`Dataset.sel` and :py:meth:`DataArray.sel`
30183
* :py:meth:`Dataset.loc` and :py:meth:`DataArray.loc`
@@ -33,7 +186,7 @@ the code will still cast to ``numpy`` arrays:
33186
:py:meth:`DataArray.reindex` and :py:meth:`DataArray.reindex_like`: duck arrays in
34187
data variables and non-dimension coordinates won't be casted
35188

36-
- functions and methods that depend on external libraries or features of ``numpy`` not
189+
- Functions and methods that depend on external libraries or features of ``numpy`` not
37190
covered by ``__array_function__`` / ``__array_ufunc__``:
38191

39192
* :py:meth:`Dataset.ffill` and :py:meth:`DataArray.ffill` (uses ``bottleneck``)
@@ -49,17 +202,25 @@ the code will still cast to ``numpy`` arrays:
49202
:py:class:`numpy.vectorize`)
50203
* :py:func:`apply_ufunc` with ``vectorize=True`` (uses :py:class:`numpy.vectorize`)
51204

52-
- incompatibilities between different :term:`duck array` libraries:
205+
- Incompatibilities between different :term:`duck array` libraries:
53206

54207
* :py:meth:`Dataset.chunk` and :py:meth:`DataArray.chunk`: this fails if the data was
55208
not already chunked and the :term:`duck array` (e.g. a ``pint`` quantity) should
56-
wrap the new ``dask`` array; changing the chunk sizes works.
57-
209+
wrap the new ``dask`` array; changing the chunk sizes works, however.
58210

59211
Extensions using duck arrays
60212
----------------------------
61-
Here's a list of libraries extending ``xarray`` to make working with wrapped duck arrays
62-
easier:
213+
214+
Whilst the features above allow many numpy-like array libraries to be used pretty seamlessly with xarray, it often also
215+
makes sense to use an interfacing package to make certain tasks easier.
216+
217+
For example the `pint-xarray package <https://pint-xarray.readthedocs.io>`_ offers a custom ``.pint`` accessor (see :ref:`internals.accessors`) which provides
218+
convenient access to information stored within the wrapped array (e.g. ``.units`` and ``.magnitude``), and makes
219+
creating wrapped pint arrays (and especially xarray-wrapping-pint-wrapping-dask arrays) simpler for the user.
220+
221+
We maintain a list of libraries extending ``xarray`` to make working with particular wrapped duck arrays
222+
easier. If you know of more that aren't on this list please raise an issue to add them!
63223

64224
- `pint-xarray <https://pint-xarray.readthedocs.io>`_
65225
- `cupy-xarray <https://cupy-xarray.readthedocs.io>`_
226+
- `cubed-xarray <https://github.com/cubed-xarray>`_

doc/whats-new.rst

Lines changed: 2 additions & 1 deletion
@@ -38,6 +38,8 @@ Bug fixes
3838
Documentation
3939
~~~~~~~~~~~~~
4040

41+
- Expanded the page on wrapping numpy-like "duck" arrays.
42+
(:pull:`7911`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
4143

4244
Internal Changes
4345
~~~~~~~~~~~~~~~~
@@ -98,7 +100,6 @@ Bug fixes
98100
Documentation
99101
~~~~~~~~~~~~~
100102

101-
102103
Internal Changes
103104
~~~~~~~~~~~~~~~~
104105
