---
layout: page
title: PyAOS stack
---

It would be an understatement to say that Python has exploded onto the data science scene in recent years.
PyCon and SciPy conferences are held somewhere in the world every few months now,
at which loads of new and/or improved data science libraries are showcased to the community
(check out [pyvideo.org](https://pyvideo.org/) for conference recordings).
The ongoing rapid development of new libraries means that data scientists are (hopefully)
continually able to do more and more cool things with less and less time and effort,
but at the same time it can be difficult to figure out how they all relate to one another.
To assist in making sense of this constantly changing landscape,
this page summarises the current state of the weather and climate Python software “stack”
(i.e. the collection of libraries used for data analysis and visualisation).
The focus is on libraries that are widely used and that have good (and likely long-term) support.

## Core

The dashed box in the diagram represents the core of the stack, so let’s start this tour there.
The default library for dealing with numerical arrays in Python is [NumPy](http://www.numpy.org/).
It has a bunch of built-in functions for reading and writing common data formats like .csv,
but if your data is stored in netCDF format then the default library for getting data
into/out of those files is [netCDF4](http://unidata.github.io/netcdf4-python/netCDF4/index.html).
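
As a quick sketch of the NumPy side of this (the file contents and values here are invented for illustration; for netCDF files you would use `netCDF4.Dataset` instead):

```python
import io

import numpy as np

# A tiny CSV file, simulated in memory here; np.genfromtxt reads
# real files the same way (the values are invented for illustration)
csv_file = io.StringIO("20.1,20.5,19.8\n21.0,21.3,20.7\n")
temps = np.genfromtxt(csv_file, delimiter=",")
print(temps.shape)  # (2, 3)
```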

Once you’ve read your data in, you’re probably going to want to do some statistical analysis.
The NumPy library has some built-in functions for calculating very simple statistics
(e.g. maximum, mean, standard deviation),
but for more complex analysis
(e.g. interpolation, integration, linear algebra)
the [SciPy](https://www.scipy.org/scipylib/index.html) library is the default.

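For example (a minimal sketch with made-up numbers; `interp1d` is just one of many routines in `scipy.interpolate`):

```python
import numpy as np
from scipy import interpolate

data = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

# Simple statistics are built into NumPy itself
print(data.max(), data.mean(), data.std())

# More complex tasks, such as interpolation, live in SciPy
x = np.arange(5)
f = interpolate.interp1d(x, data)  # linear interpolation by default
print(f(1.5))  # halfway between data[1]=3.0 and data[2]=2.0, i.e. 2.5
```
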
If you’re dealing with a particularly large dataset,
you may get memory errors (and/or slow performance)
when trying to read and process your data.
[Dask](https://dask.org/) works with the existing Python ecosystem (i.e. NumPy, SciPy etc)
to scale your analysis to multi-core machines and/or distributed clusters
(i.e. parallel processing).

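The key idea behind Dask is to split one large array into chunks that can be processed independently (and hence in parallel). That idea can be sketched with plain NumPy; Dask’s `dask.array` interface automates the chunking while mimicking the familiar NumPy API:

```python
import numpy as np

# Pretend this array is too big to process in one go
big = np.arange(1_000_000, dtype=float)

# Process it chunk by chunk, as Dask would do behind the scenes
# (and potentially on many cores or machines at once)
chunk = 100_000
partial_sums = [big[i:i + chunk].sum() for i in range(0, big.size, chunk)]
mean = sum(partial_sums) / big.size
print(mean)  # matches big.mean()
```
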
The NumPy library doesn’t come with any plotting capability,
so if you want to visualise your NumPy data arrays then the default library is [matplotlib](https://matplotlib.org/).
As you can see at the [matplotlib gallery](https://matplotlib.org/gallery.html),
this library is great for any simple, static plot
(e.g. bar charts, contour plots and line graphs, saved as .png, .eps or .pdf files).
The [cartopy](https://scitools.org.uk/cartopy/docs/latest/) library
provides additional functionality for common map projections,
while [Bokeh](http://bokeh.pydata.org/) allows for the creation of interactive plots
where you can zoom and scroll.

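A minimal static matplotlib plot looks like this (the filename is arbitrary; the `Agg` backend renders straight to file, which is handy on remote machines):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend that renders straight to file
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("amplitude")
ax.legend()
fig.savefig("sine.png")  # .eps and .pdf work just as well
```

For maps, cartopy plugs into this same interface: passing one of its coordinate reference systems via `plt.axes(projection=...)` gives you axes that understand latitude and longitude.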
While pretty much all data analysis and visualisation tasks
could be achieved with a combination of these core libraries,
their highly flexible, all-purpose nature means relatively common/simple tasks
can often require quite a bit of work (i.e. many lines of code).
To make things more efficient for data scientists,
the scientific Python community has therefore built a number of libraries on top of the core stack.
These additional libraries aren’t as flexible
– they can’t do *everything* like the core stack can –
but they can do common tasks with far less effort.

## Generic additions

Let’s first consider the generic additional libraries.
That is, the ones that can be used in essentially all fields of data science.
The most popular of these libraries is undoubtedly [pandas](http://pandas.pydata.org/),
which has been a real game-changer for the Python data science community.
The key advance offered by pandas is the concept of labelled arrays.
Rather than referring to the individual elements of a data array using a numeric index
(as is required with NumPy),
the actual row and column headings can be used.
That means Fred’s information for the year 2005
could be obtained from a medical dataset by asking for `data(name='Fred', year=2005)`,
rather than having to remember the numeric index corresponding to that person and year.
This labelled array feature,
combined with a bunch of other features that simplify common statistical and plotting tasks
traditionally performed with SciPy and matplotlib,
greatly simplifies the code development process (read: fewer lines of code).

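The `data(name='Fred', year=2005)` call above is pseudocode; in real pandas the equivalent look-up uses an index and `.loc`. A minimal sketch with an invented dataset:

```python
import pandas as pd

# Hypothetical medical dataset, indexed by name and year
records = pd.DataFrame(
    {"name": ["Fred", "Fred", "Mary"],
     "year": [2004, 2005, 2005],
     "weight": [81.1, 82.3, 65.0]},
).set_index(["name", "year"])

# Select by label rather than by numeric position
print(records.loc[("Fred", 2005), "weight"])  # 82.3
```
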
One of the limitations of pandas
is that it’s only able to handle one- or two-dimensional (i.e. tabular) data arrays.
The [xarray](http://xarray.pydata.org/) library was therefore created
to extend the labelled array concept to x-dimensional arrays.
Not all of the pandas functionality is available
(which is a trade-off associated with being able to handle multi-dimensional arrays),
but the ability to refer to array elements by their actual latitude (e.g. 20 South),
longitude (e.g. 50 East), height (e.g. 500 hPa) and time (e.g. 2015-04-27), for example,
makes the xarray data array far easier to deal with than the NumPy array.
(As an added bonus, xarray also builds on netCDF4 to make netCDF input/output easier.)

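A minimal sketch of what that looks like (the coordinates and values are invented; real data would typically come from `xarray.open_dataset` on a netCDF file):

```python
import numpy as np
import xarray as xr

# Hypothetical 3D temperature field: time x latitude x longitude
temps = xr.DataArray(
    np.random.rand(2, 3, 4),
    dims=["time", "lat", "lon"],
    coords={"time": ["2015-04-26", "2015-04-27"],
            "lat": [-30.0, -20.0, -10.0],
            "lon": [40.0, 50.0, 60.0, 70.0]},
)

# Select by actual coordinate values rather than numeric indices
point = temps.sel(time="2015-04-27", lat=-20.0, lon=50.0)
print(float(point))
```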
## Discipline-specific additions

While the xarray library is a good option for those working in the atmosphere and ocean sciences
(especially those dealing with large multi-dimensional arrays from model simulations),
the [SciTools](https://scitools.org.uk/) project (led by the UK Met Office)
has taken a different approach to building on top of the core stack.
Rather than striving to make their software generic
(xarray is designed to handle any multi-dimensional data),
they explicitly assume that users of their [Iris](https://scitools.org.uk/iris/docs/latest/)
library are dealing with weather/ocean/climate data.
Doing this allows them to make common weather/climate tasks super quick and easy,
and it also means they have added functionality specific to atmosphere and ocean science.
(The SciTools project is also behind cartopy
and a number of other useful libraries for analysing earth science data.)

In addition to Iris, you may also come across [CDAT](https://cdat.llnl.gov),
which is maintained by the team at Lawrence Livermore National Laboratory.
It was the precursor to xarray and Iris in the sense that it was the first package
for atmosphere and ocean scientists built on top of the core Python stack.
For a number of years the funding and direction of that project shifted towards
developing a graphical interface ([VCDAT](https://vcdat.llnl.gov))
for managing large workflows and visualising data
(i.e. as opposed to further developing the capabilities of the underlying Python libraries),
but it seems that CDAT is now once again under [active development](https://github.com/CDAT/cdat/wiki).
The VCDAT application also now runs as a JupyterLab extension, which is an exciting development.

> ## How to choose
>
> In terms of choosing between xarray and Iris,
> some people like the slightly more atmosphere/ocean-centric experience offered by Iris,
> while others don’t like the restrictions that places on their work
> and prefer the generic xarray experience
> (e.g. to use Iris your netCDF data files have to be CF compliant or close to it).
> Either way, they are both a vast improvement on the netCDF/NumPy/matplotlib experience.
{: .callout}

## Simplifying data exploration

While the plotting functionality associated with xarray and Iris
speeds up the process of visually exploring data (as compared to matplotlib),
there’s still a fair bit of messing around involved in tweaking the various aspects of a plot
(e.g. colour schemes, plot size, labels, map projections, etc).
This tweaking burden is an issue across all data science fields and programming languages,
so developers of the latest generation of visualisation tools
are moving towards something called *declarative visualisation*.
The basic concept is that the user simply has to describe the characteristics of their data,
and then the software figures out the optimal way to visualise it
(i.e. it makes all the tweaking decisions for you).

The two major Python libraries in the declarative visualisation space are
[HoloViews](http://holoviews.org/) and [Altair](https://altair-viz.github.io/).
The former (which has been around much longer) uses matplotlib or Bokeh under the hood,
which means it allows for the generation of static or interactive plots.
Since HoloViews doesn’t have support for geographic plots,
[GeoViews](http://geoviews.org/) has been created on top of it
(which incorporates cartopy and can handle Iris or xarray data arrays).

## Sub-discipline-specific libraries

So far we’ve considered libraries that do general,
broad-scale tasks like data input/output, common statistics, visualisation, etc.
Given their large user base,
these libraries are usually written and supported by large companies
(e.g. Anaconda supports Bokeh and HoloViews/GeoViews),
large institutions (e.g. the Met Office supports Iris, cartopy and GeoViews)
or the wider PyData community (e.g. pandas, xarray).
Within each sub-discipline of atmosphere and ocean science,
individuals and research groups take these libraries
and apply them to their very specific data analysis tasks.
Increasingly, these individuals and groups
are formally packaging and releasing their code for use within their community.
For instance, Andrew Dawson (an atmospheric scientist at Oxford)
does a lot of EOF analysis and manipulation of wind data,
so he has released his [eofs](https://ajdawson.github.io/eofs/latest/)
and [windspharm](https://ajdawson.github.io/windspharm/latest/) libraries
(which are able to handle data arrays from NumPy, Iris or xarray).
Similarly, a group at the Atmospheric Radiation Measurement (ARM) Climate Research Facility
has released the Python ARM Radar Toolkit ([Py-ART](http://arm-doe.github.io/pyart/))
for analysing weather radar data,
and a [similar story](https://www.unidata.ucar.edu/blogs/news/entry/metpy_an_open_source_python)
is true for [MetPy](https://unidata.github.io/MetPy/latest/index.html).

> ## Coming soon
>
> In terms of new libraries that might be available soon,
> the [Pangeo](https://pangeo.io/) project is actively supporting and encouraging
> the development of more domain-specific geoscience packages.
> It was also recently [announced](https://www.ncl.ucar.edu/Document/Pivot_to_Python/)
> that NCAR will adopt Python as their scripting language of choice
> for future development of analysis and visualisation tools,
> so expect to see many of your favourite [NCL](https://www.ncl.ucar.edu/) functions
> re-implemented as new Python libraries over the coming months/years.
{: .callout}

It would be impossible to list all the sub-discipline-specific libraries on this page,
but the [PyAOS community](http://pyaos.johnny-lin.com/) is an excellent resource
if you’re trying to find out what’s available in your area of research.

## Navigating the stack

All of the additional libraries discussed on this page
essentially exist to hide the complexity of the core libraries
(in software engineering this is known as abstraction).
Iris, for instance, was built to hide some of the complexity of netCDF4, NumPy and matplotlib.
GeoViews was built to hide some of the complexity of xarray/Iris, cartopy and Bokeh.
So if you want to start exploring your data, start at the top right of the stack
and work your way down and left as required.
If GeoViews doesn’t have quite the right functions for a particular plot that you want to create,
drop down a level and use some Iris and cartopy functions.
If Iris doesn’t have any functions for a statistical procedure that you want to apply,
go down another level and use SciPy.
By starting at the top right and working your way down and left,
you’ll ensure that you never re-invent the wheel.
Nothing would be more heartbreaking than spending hours writing your own function (using netCDF4)
for extracting the metadata contained within a netCDF file, for instance,
only to find that Iris automatically keeps this information upon reading a file.
In this way, a solid working knowledge of the scientific Python stack
can save you a lot of time and effort.
