
Commit 4441f99

benbovy, Illviljan, pre-commit-ci[bot], and dcherian authored
Expose "Coordinates" as part of Xarray's public API (#7368)
* add indexes argument to Dataset.__init__
* make indexes arg public for DataArray.__init__
* Indexes constructor updates
  - easily create an empty Indexes collection
  - check consistency between indexes and variables
* use the generic Mapping[Any, Index] for indexes
* add wrap_pandas_multiindex function
* do not create default indexes when not desired
* fix Dataset dimensions
  TODO: check indexes shapes / dims for DataArray
* copy the coordinate variables of passed indexes
* DataArray: check dimensions/shape of index coords
* docstrings tweaks
* more Indexes safety
  Since its constructor can now be used publicly.
  Copy input mappings and check the type of input indexes.
* ensure input indexes are Xarray indexes
* add .assign_indexes() method
* add `IndexedCoordinates` subclass
  + add `IndexedCoordinates.from_pandas_multiindex` helper.
* rollback/update Dataset and DataArray constructors
  Drop the `indexes` argument or keep it as private API.
  When a `Coordinates` object is passed as `coords` argument, extract both
  coordinate variables and indexes and add them to the new Dataset or
  DataArray.
* update docstrings
* fix Dataset creation internal error
* add IndexedCoordinates.merge_coords
* drop IndexedCoordinates and reuse Coordinates
* update api docs
* make Coordinates init args optional
* docstrings updates
* convert to base variable when no index is given
* raise when an index is given with no variable
* skip create default indexes...
  ... When a Coordinates object is given to the Dataset constructor
* invariant checks: maybe skip IndexVariable checks
  ... when check_default_indexes is False.
* add Coordinates tests
* more Coordinates tests
* add Dataset constructor tests with Coordinates
* fix mypy
* assign_coords: do not create default indexes...
  ... when passing a Coordinates object
* support alignment of Coordinates
* clean-up
* fix failing test (dataarray coords not extracted)
* fix tests: prevent index conflicts
  Do not extract multi-coordinate indexes from DataArray if they are
  overwritten or dropped (dimension coordinate).
* add Coordinates.equals and Coordinates.identical
* more tests, docstrings, docs
* fix assert_* (Coordinates subclasses)
* review copy
* another few tests
* fix mypy
* update what's new
* do not copy indexes
  May corrupt multi-coordinate indexes.
* add Coordinates fastpath constructor
* fix sphinx directive
* re-add coord indexes in merge (dataset constructor)
  This re-enables the optimization in deep_align that skips alignment for
  any alignable (DataArray) in a dict that matches an index key.
* create coords with default idx: try a cleaner impl
  Coordinate variables and indexes extracted from DataArrays should be
  merged more properly.
* some useful comments for later
* xr.merge: add support for Coordinates objects
* allow skip align for object(s) in merge_core
  This fixes the decrease in performance observed in Dataset creation
  benchmarks.
  When creating a new Dataset, the variables and indexes in `Coordinates`
  should already be aligned together so it doesn't need to go through the
  complex alignment logic once again. `Coordinates` indexes are still used
  to align data variables.
* fix mypy
* what's new tweaks
* align Coordinates callbacks: don't reindex data vars
* fix Coordinates._overwrite_indexes callback
  mypy was rightfully complaining. This callback is called from Aligner
  only, which passes the first two arguments and ignores the rest.
* remove merge_coords
* futurewarning: pass multi-index via data vars
* review comments
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Fix circulat imports
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* typing: add Alignable protocol class
* try fixing mypy error (Self redefinition)
* remove Coordinate alias of Variable
  Much water has flowed under the bridge since it has been renamed.
* fix groupby test
* doc: remove merge_coords in api reference
* doc: improve docstrings and glossary
* use Self type annotation in Coordinate class
* better comment
* fix Self undefined error with python < 3.11
  Pyright displays an info message "Self is not valid in this context" but
  most important this should avoid runtime errors with python < 3.11.

---------

Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
1 parent efa2863 commit 4441f99

21 files changed

+1103 -277 lines changed

doc/api-hidden.rst

Lines changed: 38 additions & 10 deletions
@@ -9,17 +9,40 @@
 .. autosummary::
 :toctree: generated/
 
+Coordinates.from_pandas_multiindex
+Coordinates.get
+Coordinates.items
+Coordinates.keys
+Coordinates.values
+Coordinates.dims
+Coordinates.dtypes
+Coordinates.variables
+Coordinates.xindexes
+Coordinates.indexes
+Coordinates.to_dataset
+Coordinates.to_index
+Coordinates.update
+Coordinates.merge
+Coordinates.copy
+Coordinates.equals
+Coordinates.identical
+
 core.coordinates.DatasetCoordinates.get
 core.coordinates.DatasetCoordinates.items
 core.coordinates.DatasetCoordinates.keys
-core.coordinates.DatasetCoordinates.merge
-core.coordinates.DatasetCoordinates.to_dataset
-core.coordinates.DatasetCoordinates.to_index
-core.coordinates.DatasetCoordinates.update
 core.coordinates.DatasetCoordinates.values
 core.coordinates.DatasetCoordinates.dims
-core.coordinates.DatasetCoordinates.indexes
+core.coordinates.DatasetCoordinates.dtypes
 core.coordinates.DatasetCoordinates.variables
+core.coordinates.DatasetCoordinates.xindexes
+core.coordinates.DatasetCoordinates.indexes
+core.coordinates.DatasetCoordinates.to_dataset
+core.coordinates.DatasetCoordinates.to_index
+core.coordinates.DatasetCoordinates.update
+core.coordinates.DatasetCoordinates.merge
+core.coordinates.DataArrayCoordinates.copy
+core.coordinates.DatasetCoordinates.equals
+core.coordinates.DatasetCoordinates.identical
 
 core.rolling.DatasetCoarsen.boundary
 core.rolling.DatasetCoarsen.coord_func
@@ -47,14 +70,19 @@
 core.coordinates.DataArrayCoordinates.get
 core.coordinates.DataArrayCoordinates.items
 core.coordinates.DataArrayCoordinates.keys
-core.coordinates.DataArrayCoordinates.merge
-core.coordinates.DataArrayCoordinates.to_dataset
-core.coordinates.DataArrayCoordinates.to_index
-core.coordinates.DataArrayCoordinates.update
 core.coordinates.DataArrayCoordinates.values
 core.coordinates.DataArrayCoordinates.dims
-core.coordinates.DataArrayCoordinates.indexes
+core.coordinates.DataArrayCoordinates.dtypes
 core.coordinates.DataArrayCoordinates.variables
+core.coordinates.DataArrayCoordinates.xindexes
+core.coordinates.DataArrayCoordinates.indexes
+core.coordinates.DataArrayCoordinates.to_dataset
+core.coordinates.DataArrayCoordinates.to_index
+core.coordinates.DataArrayCoordinates.update
+core.coordinates.DataArrayCoordinates.merge
+core.coordinates.DataArrayCoordinates.copy
+core.coordinates.DataArrayCoordinates.equals
+core.coordinates.DataArrayCoordinates.identical
 
 core.rolling.DataArrayCoarsen.boundary
 core.rolling.DataArrayCoarsen.coord_func

doc/api.rst

Lines changed: 1 addition & 0 deletions
@@ -1085,6 +1085,7 @@ Advanced API
 .. autosummary::
 :toctree: generated/
 
+Coordinates
 Dataset.variables
 DataArray.variable
 Variable

doc/user-guide/terminology.rst

Lines changed: 44 additions & 25 deletions
@@ -54,23 +54,22 @@ complete examples, please consult the relevant documentation.*
 Coordinate
 An array that labels a dimension or set of dimensions of another
 ``DataArray``. In the usual one-dimensional case, the coordinate array's
-values can loosely be thought of as tick labels along a dimension. There
-are two types of coordinate arrays: *dimension coordinates* and
-*non-dimension coordinates* (see below). A coordinate named ``x`` can be
-retrieved from ``arr.coords[x]``. A ``DataArray`` can have more
-coordinates than dimensions because a single dimension can be labeled by
-multiple coordinate arrays. However, only one coordinate array can be a
-assigned as a particular dimension's dimension coordinate array. As a
+values can loosely be thought of as tick labels along a dimension. We
+distinguish :term:`Dimension coordinate` vs. :term:`Non-dimension
+coordinate` and :term:`Indexed coordinate` vs. :term:`Non-indexed
+coordinate`. A coordinate named ``x`` can be retrieved from
+``arr.coords[x]``. A ``DataArray`` can have more coordinates than
+dimensions because a single dimension can be labeled by multiple
+coordinate arrays. However, only one coordinate array can be a assigned
+as a particular dimension's dimension coordinate array. As a
 consequence, ``len(arr.dims) <= len(arr.coords)`` in general.
 
 Dimension coordinate
 A one-dimensional coordinate array assigned to ``arr`` with both a name
-and dimension name in ``arr.dims``. Dimension coordinates are used for
-label-based indexing and alignment, like the index found on a
-:py:class:`pandas.DataFrame` or :py:class:`pandas.Series`. In fact,
-dimension coordinates use :py:class:`pandas.Index` objects under the
-hood for efficient computation. Dimension coordinates are marked by
-``*`` when printing a ``DataArray`` or ``Dataset``.
+and dimension name in ``arr.dims``. Usually (but not always), a
+dimension coordinate is also an :term:`Indexed coordinate` so that it can
+be used for label-based indexing and alignment, like the index found on
+a :py:class:`pandas.DataFrame` or :py:class:`pandas.Series`.
 
 Non-dimension coordinate
 A coordinate array assigned to ``arr`` with a name in ``arr.coords`` but
@@ -79,20 +78,40 @@ complete examples, please consult the relevant documentation.*
 example, multidimensional coordinates are often used in geoscience
 datasets when :doc:`the data's physical coordinates (such as latitude
 and longitude) differ from their logical coordinates
-<../examples/multidimensional-coords>`. However, non-dimension coordinates
-are not indexed, and any operation on non-dimension coordinates that
-leverages indexing will fail. Printing ``arr.coords`` will print all of
-``arr``'s coordinate names, with the corresponding dimension(s) in
-parentheses. For example, ``coord_name (dim_name) 1 2 3 ...``.
+<../examples/multidimensional-coords>`. Printing ``arr.coords`` will
+print all of ``arr``'s coordinate names, with the corresponding
+dimension(s) in parentheses. For example, ``coord_name (dim_name) 1 2 3
+...``.
+
+Indexed coordinate
+A coordinate which has an associated :term:`Index`. Generally this means
+that the coordinate labels can be used for indexing (selection) and/or
+alignment. An indexed coordinate may have one or more arbitrary
+dimensions although in most cases it is also a :term:`Dimension
+coordinate`. It may or may not be grouped with other indexed coordinates
+depending on whether they share the same index. Indexed coordinates are
+marked by ``*`` when printing a ``DataArray`` or ``Dataset``.
+
+Non-indexed coordinate
+A coordinate which has no associated :term:`Index`. It may still
+represent fixed labels along one or more dimensions but it cannot be
+used for label-based indexing and alignment.
 
 Index
-An *index* is a data structure optimized for efficient selecting and
-slicing of an associated array. Xarray creates indexes for dimension
-coordinates so that operations along dimensions are fast, while
-non-dimension coordinates are not indexed. Under the hood, indexes are
-implemented as :py:class:`pandas.Index` objects. The index associated
-with dimension name ``x`` can be retrieved by ``arr.indexes[x]``. By
-construction, ``len(arr.dims) == len(arr.indexes)``
+An *index* is a data structure optimized for efficient data selection
+and alignment within a discrete or continuous space that is defined by
+coordinate labels (unless it is a functional index). By default, Xarray
+creates a :py:class:`~xarray.indexes.PandasIndex` object (i.e., a
+:py:class:`pandas.Index` wrapper) for each :term:`Dimension coordinate`.
+For more advanced use cases (e.g., staggered or irregular grids,
+geospatial indexes), Xarray also accepts any instance of a specialized
+:py:class:`~xarray.indexes.Index` subclass that is associated to one or
+more arbitrary coordinates. The index associated with the coordinate
+``x`` can be retrieved by ``arr.xindexes[x]`` (or ``arr.indexes["x"]``
+if the index is convertible to a :py:class:`pandas.Index` object). If
+two coordinates ``x`` and ``y`` share the same index,
+``arr.xindexes[x]`` and ``arr.xindexes[y]`` both return the same
+:py:class:`~xarray.indexes.Index` object.
 
 name
 The names of dimensions, coordinates, DataArray objects and data
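
Note on the revised "Index" glossary entry: the claim that coordinates sharing an index all return the same object from ``xindexes`` can be checked with a multi-index. This is a minimal sketch, assuming an xarray release that includes this commit plus pandas; the names ``letter``, ``number`` and the data values are made up for illustration.

```python
import pandas as pd
import xarray as xr

# A two-level pandas MultiIndex; the level names are illustrative only.
midx = pd.MultiIndex.from_product(
    [["a", "b"], [0, 1]], names=("letter", "number")
)

# Wrap it into coordinates along a dimension named "x".
coords = xr.Coordinates.from_pandas_multiindex(midx, "x")
arr = xr.DataArray([1.0, 2.0, 3.0, 4.0], dims="x", coords=coords)

# The level coordinates are backed by one and the same xarray index object.
shared = arr.xindexes["letter"] is arr.xindexes["number"]

# The pandas view of that index is available through ``indexes``.
pd_index = arr.indexes["x"]
print(shared, type(pd_index).__name__)
```

Here ``xindexes`` exposes the xarray ``Index`` objects, while ``indexes`` converts them to pandas indexes where that conversion is possible.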

doc/whats-new.rst

Lines changed: 14 additions & 0 deletions
@@ -22,6 +22,20 @@ v2023.07.1 (unreleased)
 New Features
 ~~~~~~~~~~~~
 
+- :py:class:`Coordinates` can now be constructed independently of any Dataset or
+DataArray (it is also returned by the :py:attr:`Dataset.coords` and
+:py:attr:`DataArray.coords` properties). ``Coordinates`` objects are useful for
+passing both coordinate variables and indexes to new Dataset / DataArray objects,
+e.g., via their constructor or via :py:meth:`Dataset.assign_coords`. We may also
+wrap coordinate variables in a ``Coordinates`` object in order to skip
+the automatic creation of (pandas) indexes for dimension coordinates.
+The :py:class:`Coordinates.from_pandas_multiindex` constructor may be used to
+create coordinates directly from a :py:class:`pandas.MultiIndex` object (it is
+preferred over passing it directly as coordinate data, which may be deprecated soon).
+Like Dataset and DataArray objects, ``Coordinates`` objects may now be used in
+:py:func:`align` and :py:func:`merge`.
+(:issue:`6392`, :pull:`7368`).
+By `Benoît Bovy <https://github.com/benbovy>`_.
 - Visually group together coordinates with the same indexes in the index section of the text repr (:pull:`7225`).
 By `Justus Magin <https://github.com/keewis>`_.
 - Allow creating Xarray objects where a multidimensional variable shares its name
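
Note on the new what's-new entry: the two usages it describes (skipping the default pandas index, and building coordinates from a ``pandas.MultiIndex``) might look as follows. This is only a sketch, assuming an xarray release that includes this commit plus pandas; the variable name ``temp`` and all values are hypothetical.

```python
import pandas as pd
import xarray as xr

# Passing an explicit empty ``indexes`` mapping wraps "x" as a coordinate
# without building the default pandas index for it.
coords_no_index = xr.Coordinates({"x": [10, 20, 30]}, indexes={})
ds = xr.Dataset({"temp": ("x", [1.0, 2.0, 3.0])}, coords=coords_no_index)

# For comparison: a plain dict of coordinates gets a default index on "x".
ds_default = xr.Dataset(
    {"temp": ("x", [1.0, 2.0, 3.0])}, coords={"x": [10, 20, 30]}
)

# Wrap a MultiIndex explicitly rather than passing it as coordinate data.
midx = pd.MultiIndex.from_arrays(
    [["a", "a", "b"], [0, 1, 0]], names=("letter", "number")
)
midx_coords = xr.Coordinates.from_pandas_multiindex(midx, "x")
ds_midx = xr.Dataset({"temp": ("x", [1.0, 2.0, 3.0])}, coords=midx_coords)

print("x" in ds.xindexes, "x" in ds_default.xindexes, sorted(ds_midx.xindexes))
```

In the first dataset ``x`` is still a coordinate, but label-based selection on it is not expected to work until an index is set.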

xarray/__init__.py

Lines changed: 3 additions & 1 deletion
@@ -26,6 +26,7 @@
 where,
 )
 from xarray.core.concat import concat
+from xarray.core.coordinates import Coordinates
 from xarray.core.dataarray import DataArray
 from xarray.core.dataset import Dataset
 from xarray.core.extensions import (
@@ -37,7 +38,7 @@
 from xarray.core.merge import Context, MergeError, merge
 from xarray.core.options import get_options, set_options
 from xarray.core.parallel import map_blocks
-from xarray.core.variable import Coordinate, IndexVariable, Variable, as_variable
+from xarray.core.variable import IndexVariable, Variable, as_variable
 from xarray.util.print_versions import show_versions
 
 try:
@@ -100,6 +101,7 @@
 "CFTimeIndex",
 "Context",
 "Coordinate",
+"Coordinates",
 "DataArray",
 "Dataset",
 "Index",
