Commit 95d6ffe (1 parent 46c81b2)

New extras page on the PyAOS stack

File tree

2 files changed: +210 -5 lines changed

_extras/discuss.md

Lines changed: 0 additions & 5 deletions
This file was deleted.

_extras/stack.md

Lines changed: 210 additions & 0 deletions
---
layout: page
title: PyAOS stack
---

It would be an understatement to say that Python has exploded onto the data science scene in recent years.
PyCon and SciPy conferences are held somewhere in the world every few months now,
at which loads of new and/or improved data science libraries are showcased to the community
(check out [pyvideo.org](https://pyvideo.org/) for conference recordings).
The ongoing rapid development of new libraries means that data scientists are (hopefully)
continually able to do more and more cool things with less and less time and effort,
but at the same time it can be difficult to figure out how they all relate to one another.
To assist in making sense of this constantly changing landscape,
this page summarises the current state of the weather and climate Python software “stack”
(i.e. the collection of libraries used for data analysis and visualisation).
The focus is on libraries that are widely used and that have good (and likely long-term) support.

![PyAOS stack](../fig/01-pyaos-stack.png)

## Core

The dashed box in the diagram represents the core of the stack, so let’s start this tour there.
The default library for dealing with numerical arrays in Python is [NumPy](http://www.numpy.org/).
It has a bunch of built-in functions for reading and writing common data formats like .csv,
but if your data is stored in netCDF format then the default library for getting data
into/out of those files is [netCDF4](http://unidata.github.io/netcdf4-python/netCDF4/index.html).
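
As a minimal sketch of the NumPy side of this (the column names and values are invented), reading a small .csv into an array might look like the following; for netCDF files, `netCDF4.Dataset` plays the equivalent role:

```python
import io

import numpy as np

# A small in-memory stand-in for a .csv file on disk
csv_file = io.StringIO("year,tmax,tmin\n2015,31.2,12.4\n2016,33.0,13.1\n")

# genfromtxt parses the header row and returns a (structured) NumPy array
data = np.genfromtxt(csv_file, delimiter=",", names=True)

print(data["tmax"])         # the tmax column as a NumPy array
print(data["tmax"].mean())  # 32.1
```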

Once you’ve read your data in, you’re probably going to want to do some statistical analysis.
The NumPy library has some built-in functions for calculating very simple statistics
(e.g. maximum, mean, standard deviation),
but for more complex analysis
(e.g. interpolation, integration, linear algebra)
the [SciPy](https://www.scipy.org/scipylib/index.html) library is the default.
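
A quick example of that division of labour (the temperature values are invented): the simple statistics come straight from NumPy, and anything fancier is handed to a SciPy submodule.

```python
import numpy as np

temps = np.array([21.4, 19.8, 25.1, 23.3, 20.7])  # hypothetical daily maxima

print(temps.max())   # 25.1
print(temps.mean())  # 22.06
print(temps.std())   # population standard deviation (ddof=0 by default)

# For anything beyond this (interpolation, integration, linear algebra),
# you would reach for the relevant SciPy submodule,
# e.g. scipy.interpolate, scipy.integrate or scipy.linalg.
```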

If you’re dealing with a particularly large dataset,
you may get memory errors (and/or slow performance)
when trying to read and process your data.
[Dask](https://dask.org/) works with the existing Python ecosystem (i.e. NumPy, SciPy etc)
to scale your analysis to multi-core machines and/or distributed clusters
(i.e. parallel processing).
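
Dask isn’t required to see the underlying idea. As a conceptual sketch (the array size and chunk size are arbitrary), a large array can be reduced one block at a time, so the whole thing never has to be processed at once:

```python
import numpy as np

# Pretend this array is too big to process comfortably in one go
big = np.arange(1_000_000, dtype=float)

# Reduce it in fixed-size chunks, keeping only running totals
chunk_size = 100_000
total, count = 0.0, 0
for start in range(0, big.size, chunk_size):
    chunk = big[start:start + chunk_size]
    total += chunk.sum()
    count += chunk.size

print(total / count)  # identical to big.mean()

# dask.array automates exactly this kind of blocked computation
# (and runs the chunks in parallel) behind a NumPy-like interface:
#   import dask.array as da
#   x = da.from_array(big, chunks=chunk_size)
#   x.mean().compute()
```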

The NumPy library doesn’t come with any plotting capability,
so if you want to visualise your NumPy data arrays then the default library is [matplotlib](https://matplotlib.org/).
As you can see at the [matplotlib gallery](https://matplotlib.org/gallery.html),
this library is great for any simple (e.g. bar charts, contour plots, line graphs),
static (e.g. .png, .eps, .pdf) plots.
The [cartopy](https://scitools.org.uk/cartopy/docs/latest/) library
provides additional functionality for common map projections,
while [Bokeh](http://bokeh.pydata.org/) allows for the creation of interactive plots
where you can zoom and scroll.
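
A minimal static matplotlib plot (the rainfall values are invented) looks like this; cartopy and Bokeh build on the same figure/axes concepts:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs headless
import matplotlib.pyplot as plt
import numpy as np

months = np.arange(1, 13)
rainfall = np.array([78, 66, 60, 45, 30, 22, 18, 20, 35, 50, 62, 70])  # invented

fig, ax = plt.subplots()
ax.plot(months, rainfall, marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Rainfall (mm)")
ax.set_title("Hypothetical monthly rainfall")
fig.savefig("rainfall.png")  # .eps and .pdf work just as well
```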

While pretty much all data analysis and visualisation tasks
could be achieved with a combination of these core libraries,
their highly flexible, all-purpose nature means relatively common/simple tasks
can often require quite a bit of work (i.e. many lines of code).
To make things more efficient for data scientists,
the scientific Python community has therefore built a number of libraries on top of the core stack.
These additional libraries aren’t as flexible
(they can’t do *everything* like the core stack can),
but they can do common tasks with far less effort.

## Generic additions

Let’s first consider the generic additional libraries.
That is, the ones that can be used in essentially all fields of data science.
The most popular of these libraries is undoubtedly [pandas](http://pandas.pydata.org/),
which has been a real game-changer for the Python data science community.
The key advance offered by pandas is the concept of labelled arrays.
Rather than referring to the individual elements of a data array using a numeric index
(as is required with NumPy),
the actual row and column headings can be used.
That means Fred’s information for the year 2005
could be obtained from a medical dataset by asking for `data(name='Fred', year=2005)`,
rather than having to remember the numeric index corresponding to that person and year.
This labelled array feature,
combined with a bunch of other features that simplify common statistical and plotting tasks
traditionally performed with SciPy and matplotlib,
greatly simplifies the code development process (read: fewer lines of code).
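
That call is pseudocode; pandas spells label-based lookup with `.loc`. A small sketch with invented data:

```python
import pandas as pd

records = pd.DataFrame(
    {"name": ["Fred", "Fred", "Mary"],
     "year": [2004, 2005, 2005],
     "weight": [85.1, 83.4, 61.2]}  # invented values
).set_index(["name", "year"])

# Label-based lookup: no numeric positions to remember
print(records.loc[("Fred", 2005)])
print(records.loc[("Fred", 2005), "weight"])  # 83.4
```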

One of the limitations of pandas
is that it’s only able to handle one- or two-dimensional (i.e. tabular) data arrays.
The [xarray](http://xarray.pydata.org/) library was therefore created
to extend the labelled array concept to n-dimensional arrays.
Not all of the pandas functionality is available
(which is a trade-off associated with being able to handle multi-dimensional arrays),
but the ability to refer to array elements by their actual latitude (e.g. 20 South),
longitude (e.g. 50 East), height (e.g. 500 hPa) and time (e.g. 2015-04-27), for example,
makes the xarray data array far easier to deal with than the NumPy array.
(As an added bonus, xarray also builds on netCDF4 to make netCDF input/output easier.)
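
A small sketch of that labelled selection (the coordinates and values here are invented):

```python
import numpy as np
import pandas as pd
import xarray as xr

temps = xr.DataArray(
    np.random.default_rng(0).normal(15, 5, size=(2, 3, 4)),
    dims=["time", "lat", "lon"],
    coords={
        "time": pd.date_range("2015-04-27", periods=2),
        "lat": [-30, -20, -10],
        "lon": [40, 50, 60, 70],
    },
    name="tas",
)

# Select by actual coordinate values rather than numeric positions
point = temps.sel(time="2015-04-27", lat=-20, lon=50)
print(float(point))

# and, as that added bonus, netCDF input/output is built in:
#   ds = xr.open_dataset("myfile.nc")  # hypothetical file name
```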

## Discipline-specific additions

While the xarray library is a good option for those working in the atmosphere and ocean sciences
(especially those dealing with large multi-dimensional arrays from model simulations),
the [SciTools](https://scitools.org.uk/) project (led by the Met Office)
has taken a different approach to building on top of the core stack.
Rather than striving to make their software generic
(xarray is designed to handle any multi-dimensional data),
they explicitly assume that users of their [Iris](https://scitools.org.uk/iris/docs/latest/)
library are dealing with weather/ocean/climate data.
Doing this allows them to make common weather/climate tasks super quick and easy,
and it also means they have added functionality specific to atmosphere and ocean science.
(The SciTools project is also behind cartopy
and a number of other useful libraries for analysing earth science data.)

In addition to Iris, you may also come across [CDAT](https://cdat.llnl.gov),
which is maintained by the team at Lawrence Livermore National Laboratory.
It was the precursor to xarray and Iris in the sense that it was the first package
for atmosphere and ocean scientists built on top of the core Python stack.
For a number of years the funding and direction of that project shifted towards
developing a graphical interface ([VCDAT](https://vcdat.llnl.gov))
for managing large workflows and visualising data
(i.e. as opposed to further developing the capabilities of the underlying Python libraries),
but it seems that CDAT is now once again under [active development](https://github.com/CDAT/cdat/wiki).
The VCDAT application also now runs as a JupyterLab extension, which is an exciting development.

> ## How to choose
>
> In terms of choosing between xarray and Iris,
> some people like the slightly more atmosphere/ocean-centric experience offered by Iris,
> while others don’t like the restrictions that places on their work
> and prefer the generic xarray experience
> (e.g. to use Iris your netCDF data files have to be CF compliant or close to it).
> Either way, they are both a vast improvement on the netCDF/NumPy/matplotlib experience.
{: .callout}

## Simplifying data exploration

While the plotting functionality associated with xarray and Iris
speeds up the process of visually exploring data (as compared to matplotlib),
there’s still a fair bit of messing around involved in tweaking the various aspects of a plot
(e.g. colour schemes, plot size, labels, map projections, etc).
This tweaking burden is an issue across all data science fields and programming languages,
so developers of the latest generation of visualisation tools
are moving towards something called *declarative visualisation*.
The basic concept is that the user simply has to describe the characteristics of their data,
and then the software figures out the optimal way to visualise it
(i.e. it makes all the tweaking decisions for you).

The two major Python libraries in the declarative visualisation space are
[HoloViews](http://holoviews.org/) and [Altair](https://altair-viz.github.io/).
The former (which has been around much longer) uses matplotlib or Bokeh under the hood,
which means it allows for the generation of static or interactive plots.
Since HoloViews doesn’t have support for geographic plots,
[GeoViews](http://geoviews.org/) has been created on top of it
(which incorporates cartopy and can handle Iris or xarray data arrays).

## Sub-discipline-specific libraries

So far we’ve considered libraries that do general,
broad-scale tasks like data input/output, common statistics, visualisation, etc.
Given their large user base,
these libraries are usually written and supported by large companies
(e.g. Anaconda supports Bokeh and HoloViews/GeoViews),
large institutions (e.g. the Met Office supports Iris, cartopy and GeoViews)
or the wider PyData community (e.g. pandas, xarray).
Within each sub-discipline of atmosphere and ocean science,
individuals and research groups take these libraries
and apply them to their very specific data analysis tasks.
Increasingly, these individuals and groups
are formally packaging and releasing their code for use within their community.
For instance, Andrew Dawson (an atmospheric scientist at Oxford)
does a lot of EOF analysis and manipulation of wind data,
so he has released his [eofs](https://ajdawson.github.io/eofs/latest/)
and [windspharm](https://ajdawson.github.io/windspharm/latest/) libraries
(which are able to handle data arrays from NumPy, Iris or xarray).
Similarly, a group at the Atmospheric Radiation Measurement (ARM) Climate Research Facility
has released the Python ARM Radar Toolkit ([Py-ART](http://arm-doe.github.io/pyart/))
for analysing weather radar data,
and a [similar story](https://www.unidata.ucar.edu/blogs/news/entry/metpy_an_open_source_python)
is true for [MetPy](https://unidata.github.io/MetPy/latest/index.html).

> ## Coming soon
>
> In terms of new libraries that might be available soon,
> the [Pangeo](https://pangeo.io/) project is actively supporting and encouraging
> the development of more domain-specific geoscience packages.
> It was also recently [announced](https://www.ncl.ucar.edu/Document/Pivot_to_Python/)
> that NCAR will adopt Python as their scripting language of choice
> for future development of analysis and visualisation tools,
> so expect to see many of your favourite [NCL](https://www.ncl.ucar.edu/) functions
> re-implemented as new Python libraries over the coming months/years.
{: .callout}

It would be impossible to list all the sub-discipline-specific libraries on this page,
but the [PyAOS community](http://pyaos.johnny-lin.com/) is an excellent resource
if you’re trying to find out what’s available in your area of research.

## Navigating the stack

All of the additional libraries discussed on this page
essentially exist to hide the complexity of the core libraries
(in software engineering this is known as abstraction).
Iris, for instance, was built to hide some of the complexity of netCDF4, NumPy and matplotlib.
GeoViews was built to hide some of the complexity of xarray/Iris, cartopy and Bokeh.
So if you want to start exploring your data, start at the top right of the stack
and work your way down and left as required.
If GeoViews doesn’t have quite the right functions for a particular plot that you want to create,
drop down a level and use some Iris and cartopy functions.
If Iris doesn’t have any functions for a statistical procedure that you want to apply,
go back down another level and use SciPy.
By starting at the top right and working your way back,
you’ll ensure that you never re-invent the wheel.
Nothing would be more heartbreaking than spending hours writing your own function (using netCDF4)
for extracting the metadata contained within a netCDF file, for instance,
only to find that Iris automatically keeps this information upon reading a file.
In this way, a solid working knowledge of the scientific Python stack
can save you a lot of time and effort.
