Commit b63f78a

benbovy, dcherian and pre-commit-ci[bot] authored
More builtin indexes (#20)
* unrelated fixes
* add PandasIndex and PandasMultiIndex
* add example cross-refs
* PandasMultiIndex stack/unstack example
* Update docs/builtin/pdindex.md
  Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Update docs/builtin/pdmultiindex.md
  Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* Update docs/builtin/pdmultiindex.md
  Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* Update docs/builtin/pdmultiindex.md
  Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* Update docs/builtin/pdmultiindex.md
  Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* Update docs/builtin/pdmultiindex.md
  Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* Update docs/builtin/pdmultiindex.md
  Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* re-arrange PandasMultiIndex
* fix
* change PandasMultiIndex title
* fix PandasIndex and PandasMultiIndex cross-refs
* nit
* temp: install Xarray main branch
  This is needed for pd.RangeIndex with "lazy" coordinate variable. This will be needed for NDPointIndex too. TODO: remove when next version of Xarray is released.
* add pandas.RangeIndex and RangeIndex
* open_dataset: skip the creation of default indexes
* Update docs/builtin/pdrange.md
  Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* Update docs/builtin/range.md
  Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* Revert "Update docs/builtin/pdrange.md"
  This reverts commit abd9a34.

---------

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 0c40485 commit b63f78a

10 files changed

+480
-8
lines changed

docs/builtin/pdindex.md

Lines changed: 161 additions & 0 deletions
@@ -0,0 +1,161 @@
1+
---
2+
jupytext:
3+
text_representation:
4+
format_name: myst
5+
kernelspec:
6+
display_name: Python 3
7+
name: python
8+
---
9+
10+
# The default `PandasIndex`
11+
12+
````{grid}
13+
```{grid-item}
14+
:columns: 3
15+
```{image} https://pandas.pydata.org/docs/_static/pandas.svg
16+
---
17+
alt: Pandas logo
18+
width: 200px
19+
align: center
20+
---
21+
```
22+
````
23+
24+
## Highlights
25+
26+
1. {py:class}`xarray.indexes.PandasIndex` can wrap _one dimensional_ {py:class}`pandas.Index` objects to allow indexing along 1D coordinate variables. These indexes can apply to both {term}`"dimension" coordinates <xarray:Dimension coordinate>` and {term}`"non-dimension" coordinates <xarray:Non-dimension coordinate>`.
27+
1. When opening or constructing a new Dataset or DataArray, Xarray creates by default a {py:class}`xarray.indexes.PandasIndex` for each {term}`"dimension" coordinate <xarray:Dimension coordinate>`.
28+
1. It is possible to either drop those default indexes or skip their creation.
29+
30+
## Example
31+
32+
Let's open a tutorial dataset.
33+
34+
```{code-cell} python
35+
import xarray as xr
36+
```
37+
38+
```{code-cell} python
39+
---
40+
tags: [remove-cell]
41+
---
42+
%xmode minimal
43+
44+
xr.set_options(
45+
display_expand_indexes=True,
46+
display_expand_attrs=False,
47+
);
48+
```
49+
50+
```{code-cell} python
51+
ds_air = xr.tutorial.open_dataset("air_temperature")
52+
ds_air
53+
```
54+
55+
Opening the dataset created a default {py:class}`~xarray.indexes.PandasIndex` for each of
56+
the "lat", "lon" and "time" dimension coordinates, as we can also see below via
57+
the {py:attr}`xarray.Dataset.xindexes` property.
58+
59+
```{code-cell} python
60+
ds_air.xindexes
61+
```
62+
63+
Those indexes are used under the hood for, e.g., label-based selection.
64+
65+
```{code-cell} python
66+
ds_air.sel(time="2013")
67+
```
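
As an offline illustration of the same mechanism, here is a minimal sketch on a small synthetic dataset. The names `ds`, `air`, `in_2013` and `nearest` are made up for this example and are not part of the tutorial data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the tutorial dataset: 4 daily steps, 3 latitudes.
time = pd.date_range("2013-12-30", periods=4, freq="D")
ds = xr.Dataset(
    {"air": (("time", "lat"), np.arange(12.0).reshape(4, 3))},
    coords={"time": time, "lat": [10.0, 20.0, 30.0]},
)

# Each dimension coordinate received a default index.
index_types = {name: type(idx).__name__ for name, idx in ds.xindexes.items()}

# Label-based selection is delegated to those indexes.
in_2013 = ds.sel(time="2013")                 # partial datetime string
nearest = ds.sel(lat=19.0, method="nearest")  # nearest-neighbour lookup
```

Both the partial date string and the `method="nearest"` lookup are resolved by the underlying `pandas.Index` objects rather than by scanning the coordinate values.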
68+
69+
### Set indexes for non-dimension coordinates
70+
71+
Xarray does not automatically create an index for non-dimension coordinates like
72+
the "season (time)" coordinate added below.
73+
74+
```{code-cell} python
75+
ds_air.coords["season"] = ds_air.time.dt.season
76+
ds_air
77+
```
78+
79+
Without an index, it is not possible to select data based on the "season"
80+
coordinate.
81+
82+
```{code-cell} python
83+
---
84+
tags: [raises-exception]
85+
---
86+
ds_air.sel(season="DJF")
87+
```
88+
89+
However, it is possible to manually set a `PandasIndex` for that 1-dimensional
90+
coordinate.
91+
92+
```{code-cell} python
93+
ds_extra = ds_air.set_xindex("season", xr.indexes.PandasIndex)
94+
ds_extra
95+
```
96+
97+
This now enables label-based selection on the "season" coordinate.
98+
99+
```{code-cell} python
100+
ds_extra.sel(season="DJF")
101+
```
102+
103+
{py:meth}`xarray.Dataset.sel` does not yet support labels for multiple indexed
coordinates that share a dimension, unless those coordinates also share the same
index object, as shown in the {doc}`PandasMultiIndex example <pdmultiindex>`.
106+
107+
```{code-cell} python
108+
---
109+
tags: [raises-exception]
110+
---
111+
ds_extra.sel(season="DJF", time="2013")
112+
```
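
One possible workaround is to chain `.sel` calls so that each call involves a single index. The sketch below uses synthetic monthly data with hard-coded season labels; it assumes the combined query raises a `ValueError`, as current Xarray versions do.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Dec 2013, Jan 2014, Feb 2014, Mar 2014
time = pd.date_range("2013-12-01", periods=4, freq="MS")
ds = xr.Dataset(
    {"air": ("time", np.arange(4.0))},
    coords={"time": time, "season": ("time", ["DJF", "DJF", "DJF", "MAM"])},
).set_xindex("season")

# Two separate indexes along "time": one combined query is rejected.
try:
    ds.sel(season="DJF", time="2013")
    combined_failed = False
except ValueError:
    combined_failed = True

# Chaining applies one index per call.
chained = ds.sel(season="DJF").sel(time="2013")
```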
113+
114+
### Drop indexes
115+
116+
Indexes are not always necessary and (re-)computing them may introduce some
117+
unwanted overhead.
118+
119+
The line below drops the default indexes that were created when
120+
opening the example dataset.
121+
122+
```{code-cell} python
123+
ds_air.drop_indexes(["time", "lat", "lon"])
124+
```
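
Note that {py:meth}`xarray.Dataset.drop_indexes` returns a new object and leaves the original untouched. A minimal sketch on a toy dataset (the names `ds` and `dropped` are illustrative) shows what is and is not removed:

```python
import xarray as xr

ds = xr.Dataset(coords={"x": [1, 2], "y": [3, 4, 5]})
dropped = ds.drop_indexes(["x", "y"])

# The coordinate variables survive; only the index objects are gone.
remaining_indexes = list(dropped.xindexes)
remaining_coords = sorted(dropped.coords)

# The original dataset still has its default indexes.
original_indexes = sorted(ds.xindexes)

# Positional indexing does not rely on an index, so it still works.
first_x = dropped.isel(x=0)["x"].item()
```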
125+
126+
### Skip the creation of default indexes
127+
128+
Let's re-open the example dataset above, this time with no index.
129+
130+
```{code-cell} python
131+
ds_air_no_index = xr.tutorial.open_dataset(
132+
"air_temperature", create_default_indexes=False
133+
)
134+
135+
ds_air_no_index
136+
```
137+
138+
As with {py:func}`xarray.open_dataset`, indexes are created by default for
139+
dimension coordinates when constructing a new Dataset.
140+
141+
```{code-cell} python
142+
ds = xr.Dataset(coords={"x": [1, 2], "y": [3, 4, 5]})
143+
144+
ds
145+
```
146+
147+
The same happens when assigning new coordinates.
148+
149+
```{code-cell} python
150+
ds.assign_coords(u=[10, 20])
151+
```
152+
153+
To skip the creation of those default indexes, we need to explicitly create a
154+
new {py:class}`xarray.Coordinates` object and pass `indexes={}` (empty
155+
dictionary).
156+
157+
```{code-cell} python
158+
coords = xr.Coordinates({"u": [10, 20]}, indexes={})
159+
160+
ds.assign_coords(coords)
161+
```
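
To make the difference explicit, the following sketch compares both ways of assigning the same coordinate. The variable names `with_index` and `without_index` are chosen for this example only.

```python
import xarray as xr

ds = xr.Dataset(coords={"x": [1, 2], "y": [3, 4, 5]})

# Plain assignment: "u" becomes a dimension coordinate with a default index.
with_index = ds.assign_coords(u=[10, 20])

# Explicit Coordinates object with an empty index mapping: no index is built.
coords = xr.Coordinates({"u": [10, 20]}, indexes={})
without_index = ds.assign_coords(coords)

u_indexed = "u" in with_index.xindexes
u_not_indexed = "u" not in without_index.xindexes
u_still_a_coord = "u" in without_index.coords
```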

docs/builtin/pdinterval.md

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ Learn more at the [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_gui
3333
1. Sadly {py:class}`pandas.IntervalIndex` supports numpy datetimes but not [cftime](https://unidata.github.io/cftime/).
3434

3535
```{important}
36-
A pandas IntervalIndex models intervals using a single variable. The [Climate and Forecast Conventions](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#cell-boundaries), by contrast, model the intervals using two arrays: the intervals ("bounds" variable) and "central values".
36+
A pandas IntervalIndex models intervals using a single variable. The [Climate and Forecast Conventions](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#cell-boundaries), by contrast, model the intervals using two arrays: the intervals ("bounds" variable) and "central values".
3737
```
3838

3939
## Example

docs/builtin/pdmultiindex.md

Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
1+
---
2+
jupytext:
3+
text_representation:
4+
format_name: myst
5+
kernelspec:
6+
display_name: Python 3
7+
name: python
8+
---
9+
10+
# Stack and unstack with `PandasMultiIndex`
11+
12+
````{grid}
13+
```{grid-item}
14+
:columns: 3
15+
```{image} https://pandas.pydata.org/docs/_static/pandas.svg
16+
---
17+
alt: Pandas logo
18+
width: 200px
19+
align: center
20+
---
21+
```
22+
````
23+
24+
## Highlights
25+
26+
1. An {py:class}`xarray.indexes.PandasMultiIndex` is associated with multiple coordinate variables sharing the same dimension.
27+
1. Create a PandasMultiIndex from PandasIndex objects using {py:meth}`xarray.Dataset.stack` and convert back with {py:meth}`xarray.Dataset.unstack`.
28+
1. Labels of coordinates associated with a PandasMultiIndex can be passed all at once to `.sel`.
29+
30+
## Example
31+
32+
Let's open a tutorial dataset.
33+
34+
```{code-cell} python
35+
import xarray as xr
36+
```
37+
38+
```{code-cell} python
39+
---
40+
tags: [remove-cell]
41+
---
42+
%xmode minimal
43+
44+
xr.set_options(
45+
display_expand_indexes=True,
46+
display_expand_attrs=False,
47+
);
48+
```
49+
50+
```{code-cell} python
51+
ds_air = xr.tutorial.open_dataset("air_temperature")
52+
ds_air
53+
```
54+
55+
### Stack / Unstack
56+
57+
Stacking the "lat" and "lon" dimensions of the example dataset produces stacked
"lat" and "lon" coordinates that are both associated, by default, with a single
`PandasMultiIndex`.
60+
The underlying data are _reshaped_ to collapse the `lat` and `lon` dimensions to a new `space` dimension.
61+
62+
```{code-cell} python
63+
stacked = ds_air.stack(space=("lat", "lon"))
64+
stacked
65+
```
66+
67+
The multi-index allows retrieving the original, unstacked dataset where the
68+
"lat" and "lon" dimension coordinates have their own `PandasIndex`.
69+
70+
```{code-cell} python
71+
unstacked = stacked.unstack("space")
72+
unstacked
73+
```
74+
75+
### Assigning
76+
77+
We can also directly associate a {py:class}`~xarray.indexes.PandasMultiIndex`
78+
with existing coordinates sharing the same dimension.
79+
80+
```{code-cell} python
81+
ds_air = (
82+
ds_air
83+
.assign_coords(season=ds_air.time.dt.season)
84+
.rename_vars(time="datetime")
85+
.drop_indexes("datetime")
86+
)
87+
88+
ds_air
89+
```
90+
91+
```{code-cell} python
92+
multi_indexed = ds_air.set_xindex(["season", "datetime"], xr.indexes.PandasMultiIndex)
93+
multi_indexed
94+
```
95+
96+
### Indexing
97+
98+
Unlike in {doc}`the default PandasIndex <pdindex>` example, here it is possible
to provide labels to {py:meth}`xarray.Dataset.sel` for both multi-index
coordinates ("season" and "datetime") at once.
101+
102+
```{code-cell} python
103+
multi_indexed.sel(season="DJF", datetime="2013")
104+
```
105+
106+
Chaining `.sel` calls, with each coordinate having its own index, yields
equivalent results, though.
108+
109+
```{code-cell} python
110+
single_indexed = ds_air.set_xindex("datetime").set_xindex("season")
111+
112+
single_indexed.sel(season="DJF").sel(datetime="2013")
113+
```
114+
115+
### Assigning a `pandas.MultiIndex`
116+
117+
It is easy to wrap an existing {py:class}`pandas.MultiIndex` object into a new Xarray
118+
Dataset or DataArray.
119+
120+
```{code-cell} python
121+
import pandas as pd
122+
123+
midx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("foo", "bar"))
124+
midx
125+
```
126+
127+
This can be done via {py:meth}`xarray.Coordinates.from_pandas_multiindex`.
128+
129+
```{code-cell} python
130+
midx_coords = xr.Coordinates.from_pandas_multiindex(midx, dim="x")
131+
132+
ds = xr.Dataset(coords=midx_coords)
133+
ds
134+
```
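
As a hedged follow-up, the sketch below attaches a data variable (the name `v` and its values are invented for illustration) and checks that the wrapped multi-index supports selection by level and can be recovered as a pandas object.

```python
import pandas as pd
import xarray as xr

midx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("foo", "bar"))
midx_coords = xr.Coordinates.from_pandas_multiindex(midx, dim="x")

ds = xr.Dataset({"v": ("x", [10, 20, 30, 40])}, coords=midx_coords)

# The dimension coordinate and both levels are exposed as coordinates.
coord_names = sorted(ds.coords)

# Select with labels for both levels at once.
picked = ds.sel(foo="b", bar=1)["v"].item()

# The original pandas object is still available.
same_index = ds.indexes["x"].equals(midx)
```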
