Commit f994c4c

Joe Hamman and TomNicholas authored
update development roadmap (#5759)
* update development roadmap
* Apply suggestions from code review
Co-authored-by: Tom Nicholas <35968931+TomNicholas@users.noreply.github.com>
* add paragraph on lightweight variable package
* Update roadmap
* lint
Co-authored-by: Tom Nicholas <35968931+TomNicholas@users.noreply.github.com>
1 parent 6a29380 commit f994c4c

1 file changed: +74 -4 lines changed

doc/roadmap.rst

Lines changed: 74 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -3,9 +3,9 @@
33
Development roadmap
44
===================
55

6-
Authors: Stephan Hoyer, Joe Hamman and xarray developers
6+
Authors: Xarray developers
77

8-
Date: July 24, 2018
8+
Date: September 7, 2021
99

1010
Xarray is an open source Python library for labeled multidimensional
1111
arrays and datasets.
@@ -20,15 +20,15 @@ Why has xarray been successful? In our opinion:
2020

2121
- The dominant use-case for xarray is for analysis of gridded
2222
datasets in the geosciences, e.g., as part of the
23-
`Pangeo <http://pangeo-data.org>`__ project.
23+
`Pangeo <http://pangeo.io>`__ project.
2424
- Xarray is also used more broadly in the physical sciences, where
2525
we've found the needs for analyzing multidimensional datasets are
2626
remarkably consistent (e.g., see
2727
`SunPy <https://github.com/sunpy/ndcube>`__ and
2828
`PlasmaPy <https://github.com/PlasmaPy/PlasmaPy/issues/59>`__).
2929
- Finally, xarray is used in a variety of other domains, including
3030
finance, `probabilistic
31-
programming <https://github.com/arviz-devs/arviz/issues/97>`__ and
31+
programming <https://arviz-devs.github.io/arviz/>`__ and
3232
genomics.
3333

3434
- Xarray is also a **domain agnostic** solution:
@@ -87,12 +87,17 @@ We can generalize the community's needs into three main categories:
8787
- More flexible grids/indexing.
8888
- More flexible arrays/computing.
8989
- More flexible storage backends.
90+
- More flexible data structures.
9091

9192
Each of these is detailed further in the subsections below.
9293

9394
Flexible indexes
9495
~~~~~~~~~~~~~~~~
9596

97+
.. note::
98+
Work on flexible grids and indexes is currently underway. See
99+
`GH Project #1 <https://github.com/pydata/xarray/projects/1>`__ for more detail.
100+
96101
Xarray currently keeps track of indexes associated with coordinates by
97102
storing them in the form of a ``pandas.Index`` in special
98103
``xarray.IndexVariable`` objects.
@@ -130,6 +135,10 @@ build upon indexing, such as groupby operations with multiple variables.
130135
Flexible arrays
131136
~~~~~~~~~~~~~~~
132137

138+
.. note::
139+
Work on flexible arrays is currently underway. See
140+
`GH Project #2 <https://github.com/pydata/xarray/projects/2>`__ for more detail.
141+
133142
Xarray currently supports wrapping multidimensional arrays defined by
134143
NumPy, dask and, to a limited extent, pandas. It would be nice to have
135144
interfaces that allow xarray to wrap alternative N-D array
@@ -160,6 +169,10 @@ third-party libraries.
160169
Flexible storage
161170
~~~~~~~~~~~~~~~~
162171

172+
.. note::
173+
Work on flexible storage backends is currently underway. See
174+
`GH Project #3 <https://github.com/pydata/xarray/projects/3>`__ for more detail.
175+
163176
The xarray backends module has grown in size and complexity. Much of
164177
this growth has been "organic" and mostly to support incremental
165178
additions to the supported backends. This has left us with a fragile
@@ -181,9 +194,66 @@ development would include:
181194
- Possibly moving some infrequently used backends to third-party
182195
packages.
183196

197+
Flexible data structures
198+
~~~~~~~~~~~~~~~~~~~~~~~~
199+
200+
Xarray provides two primary data structures, the ``xarray.DataArray`` and
201+
the ``xarray.Dataset``. This section describes two possible data model
202+
extensions.
203+
204+
Tree-like data structure
205+
++++++++++++++++++++++++
206+
207+
.. note::
208+
Work on developing a hierarchical data structure in Xarray is just
209+
beginning. See `Datatree <https://github.com/TomNicholas/datatree>`__
210+
for an early prototype.
211+
212+
Xarray’s highest-level object is currently an ``xarray.Dataset``, whose data
213+
model echoes that of a single netCDF group. However, real-world datasets are
214+
often better represented by a collection of related Datasets. Particularly common
215+
examples include:
216+
217+
- Multi-resolution datasets,
218+
- Collections of time series datasets with differing lengths,
219+
- Heterogeneous datasets comprising multiple different types of related
220+
observational or simulation data,
221+
- Bayesian workflows involving various statistical distributions over multiple
222+
variables,
223+
- Whole netCDF files containing multiple groups,
224+
- Comparison of output from many similar models (such as in the IPCC's Coupled Model Intercomparison Projects).
225+
226+
A new tree-like data structure which is essentially a structured hierarchical
227+
collection of Datasets could represent these cases, and would instead map to
228+
multiple netCDF groups (see `GH4118 <https://github.com/pydata/xarray/issues/4118>`__).
229+
230+
Currently there are several libraries which have wrapped xarray in order to build
231+
domain-specific data structures (e.g. `xarray-multiscale <https://github.com/JaneliaSciComp/xarray-multiscale>`__),
232+
but a general ``xarray.DataTree`` object would obviate the need for these and
233+
consolidate effort in a single domain-agnostic tool, much as xarray has already achieved.
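
As a rough illustration of the idea (not the Datatree prototype's API), the sketch below models a hierarchy of dataset-like mappings addressed by netCDF-style group paths. The ``GroupNode`` class, its ``get`` and ``walk`` methods, and the use of plain dictionaries in place of ``xarray.Dataset`` objects are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class GroupNode:
    """Hypothetical tree node: one dataset-like mapping plus named child nodes."""

    data: dict = field(default_factory=dict)      # stand-in for an xarray.Dataset
    children: dict = field(default_factory=dict)  # child name -> GroupNode

    def get(self, path: str) -> "GroupNode":
        """Resolve a '/'-separated path, similar to addressing netCDF groups."""
        node = self
        for name in filter(None, path.split("/")):
            node = node.children[name]
        return node

    def walk(self, prefix: str = "/"):
        """Yield (path, node) pairs for this node and all of its descendants."""
        yield prefix, self
        for name, child in self.children.items():
            yield from child.walk(prefix.rstrip("/") + "/" + name)


# A multi-resolution collection: each level holds an array of a different size,
# which a single Dataset with shared dimensions could not represent directly.
root = GroupNode(
    children={
        "fine": GroupNode(data={"temperature": [1.0, 2.0, 3.0, 4.0]}),
        "coarse": GroupNode(data={"temperature": [1.5, 3.5]}),
    }
)
paths = [path for path, _ in root.walk()]
```

Here each node plays the role of one netCDF group, so a whole file with several groups would correspond to one tree rather than to several unrelated Datasets.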
234+
235+
Labeled array without coordinates
236+
+++++++++++++++++++++++++++++++++
237+
238+
There is a need for a lightweight array structure with named dimensions for
239+
convenient indexing and broadcasting. Xarray includes such a structure internally
240+
(``xarray.Variable``). We want to factor out Xarray's “Variable” object into a
241+
standalone package with minimal dependencies for integration with libraries that
242+
don't want to inherit Xarray's dependency on Pandas (e.g. scikit-learn).
243+
The new “Variable” class will follow established array protocols and the new
244+
data-apis standard. It will be capable of wrapping multiple array-like objects
245+
(e.g. NumPy, Dask, Sparse, Pint, CuPy, PyTorch). While “DataArray” fits some of
246+
these requirements, it offers a more complex data model than is desired for
247+
many applications and depends on Pandas.
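
To make "named dimensions for convenient indexing and broadcasting" concrete, here is a minimal pure-Python sketch. It is not ``xarray.Variable`` and does not reflect the planned package's API; the ``NamedArray`` class, its flat row-major storage, and its ``__add__`` behaviour are assumptions chosen only to show broadcasting by dimension name rather than by axis position.

```python
from itertools import product


class NamedArray:
    """Hypothetical array with named dimensions, stored as a flat row-major list."""

    def __init__(self, dims, shape, data):
        size = 1
        for n in shape:
            size *= n
        if len(dims) != len(shape) or len(data) != size:
            raise ValueError("dims, shape and data are inconsistent")
        self.dims, self.shape, self.data = tuple(dims), tuple(shape), list(data)

    def _get(self, index):
        """Look up one element from a {dim name: position} mapping.

        Dimensions this array does not have are ignored, which is what
        lets it broadcast against arrays with other dimensions.
        """
        offset = 0
        for dim, n in zip(self.dims, self.shape):
            offset = offset * n + index[dim]
        return self.data[offset]

    def __add__(self, other):
        # The result has the union of both operands' dimensions, in order.
        sizes = dict(zip(self.dims, self.shape))
        for dim, n in zip(other.dims, other.shape):
            if sizes.setdefault(dim, n) != n:
                raise ValueError(f"size mismatch on dimension {dim!r}")
        dims = tuple(sizes)
        shape = tuple(sizes[d] for d in dims)
        data = []
        for positions in product(*(range(n) for n in shape)):
            index = dict(zip(dims, positions))
            data.append(self._get(index) + other._get(index))
        return NamedArray(dims, shape, data)


a = NamedArray(("x",), (2,), [1, 2])
b = NamedArray(("y",), (3,), [10, 20, 30])
c = a + b  # dimensions are matched by name, so no manual reshaping is needed
```

In this sketch ``c`` has dimensions ``("x", "y")`` and shape ``(2, 3)``, without any coordinate labels or index objects being involved.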
248+
184249
Engaging more users
185250
-------------------
186251

252+
.. note::
253+
Work on improving Xarray’s documentation and user engagement is
254+
currently underway. See `GH Project #4 <https://github.com/pydata/xarray/projects/4>`__
255+
for more detail.
256+
187257
Like many open-source projects, the documentation of xarray has grown
188258
together with the library's features. While we think that the xarray
189259
documentation is comprehensive already, we acknowledge that the adoption
