Commit 1f52cc7

Fixing rebase conflicts and adding links.
1 parent a084459 commit 1f52cc7

1 file changed

+111
-115
lines changed

_episodes/13-matplotlib.md

Lines changed: 111 additions & 115 deletions
@@ -3,87 +3,99 @@ title: "Data visualisation using Matplotlib"
33
teaching: 25
44
exercises: 25
55
questions:
6-
- "How can I create simple visualisations of my data?"
6+
- "How can I create visualisations of my data?"
77
objectives:
88
- "Import pyplot from the matplotlib library"
99
- "Create simple plots using pyplot"
1010
keypoints:
11-
- "Graphs can be drawn directly from pandas, but it still uses matplotlib"
11+
- "Graphs can be drawn directly from Pandas, but it still uses Matplotlib"
1212
- "Different graph types have different data requirements"
1313
- "Graphs are created from a variety of discrete components placed on a 'canvas', you don't have to use them all"
1414
- "Plotting multiple graphs on a single 'canvas' is possible"
1515
---
1616

17-
## Matplotlib
17+
## Plotting in Python
1818

19-
Matplotlib is a Python graphical library that can be used to produce a variety of different graph types.
19+
There is a wide variety of ways to plot in Python, as in many programming languages. Some do more of the design work for you and others let you customize the look of the plots and all of the little details yourself. `Pandas` has basic plots built into it that reduce the amount of syntax, if your data is already in a DataFrame.
20+
Matplotlib is a Python graphical library that can be used to produce a variety of different graph types. It is fully controllable down to basic elements and includes a module, `pyplot`, that sits somewhere in between (it is designed to feel like MATLAB plotting, if you happen to have done that before).
2021

21-
The pandas library contains very tight integration with matplotlib. There are functions in pandas that automatically call matplotlib functions to produce graphs.
2222

23-
Although we are using Matplotlib in this episode, pandas can make use of several other graphical libraries available from within Python such as ggplot2 and seaborn.
23+
The [Pandas][pandas-web] library contains very tight integration with [Matplotlib][matplotlib-web].
24+
There are functions in Pandas that automatically call Matplotlib functions to produce graphs.
2425

25-
## Importing matplotlib
26+
The Matplotlib library can be imported using any of the import techniques we have seen. As Pandas is generally imported
27+
with `import pandas as pd`, you will find that `matplotlib` is most commonly imported with `import matplotlib.pyplot as plt`, where 'plt' is the alias.
2628

27-
The matplotlib library can be imported using any of the import techniques we have seen. As `pandas` is generally imported with `import pandas as pd`, you will find that `matplotlib` is most commonly imported with `import matplotlib as plt` where 'plt' is the alias.
29+
In addition to importing the library, in a Jupyter notebook environment we need to tell Jupyter that when we produce a
30+
graph, we want it to display the graph in a cell in the notebook just like any other result. To do this we use the `%matplotlib inline` directive.
2831

29-
In addition to importing the library, in a Jupyter notebook environment we need to tell Jupyter that when we produce a graph, we want it to be display the graph in a cell in the notebook just like any other results. To do this we use the `%matplotlib inline` directive.
32+
Although we are using Matplotlib in this episode, Pandas can make use of several other graphical libraries available
33+
from within Python such as [ggplot2][ggplot2-web] and [Seaborn][seaborn-web]. Seaborn has some very powerful features
34+
and advanced plot types. One of its most useful features is formatting.
3035

31-
If you forget to do this, you will have to add `plt.show()` to see the graphs.
36+
## Plotting with Pandas
37+
38+
To plot with Pandas we have to import it as we have done in past episodes. We can also use the `%matplotlib inline` notebook magic so that plots appear in the notebook automatically. Without it, we need to call a `show()` command to display each plot.
3239

3340
~~~
34-
import matplotlib.pyplot as plt
41+
import pandas as pd
3542
%matplotlib inline
3643
~~~
3744
{: .language-python}
3845

39-
## Numpy
40-
41-
Numpy is another Python library. It is used for multi-dimensional array processing. In our case we just want to use it for its useful random number generation functions which we will use to create some fake data to demonstrate some of the graphing functions of matplotlib.
46+
We also need data to work with loaded into a DataFrame and it's helpful to look at a few rows to remember what's there.
4247

43-
We will use the alias `np`, following convention.
48+
~~~
49+
df = pd.read_csv("data/SAFI_full_shortname.csv")
50+
df.head()
51+
~~~
52+
{: .language-python}
4453

45-
## Bar charts
54+
Next, we can plot a histogram of a variable.
4655

4756
~~~
48-
np.random.rand(20)
57+
df['years_liv'].hist()
4958
~~~
5059
{: .language-python}
5160

52-
will generate 20 random numbers between 0 and 1.
61+
![png](output_4_1.png)
5362

54-
We are using these to create a pandas Series of values.
5563

56-
A bar chart only needs a single set of values. Each 'bar' represents the value from the Series of values.
57-
A pandas Series (and a Dataframe) have a method called 'plot'. We only need to tell plot what kind of graph we want.
64+
We can change the number of bins to make it look how we would like, for example:
5865

59-
The 'x' axis represents the index values of the Series
66+
~~~
67+
df['years_liv'].hist(bins=20)
68+
~~~
69+
{: .language-python}
6070

6171

62-
~~~
63-
import numpy as np
64-
import pandas as pd
72+
We can also specify the column as a parameter and a groupby column with the `by` keyword. There are a lot of keywords available to make it look better; we can see some of the most likely ones (as decided by Pandas developers) by using <kbd>shift</kbd> + <kbd>tab</kbd>. Let's try `layout`, `figsize`, and `sharex`.
6573

66-
np.random.seed(12345) # set a seed value to ensure reproducibility of the plots
67-
s = pd.Series(np.random.rand(20))
68-
#s
69-
# plot the bar chart
70-
s.plot(kind='bar')
74+
~~~
75+
df.hist(column='years_liv',by='village',layout=(1,3),figsize=(12,3),sharex=True)
7176
~~~
7277
{: .language-python}
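As a rough sketch of what these keywords do, the example below runs the same call on a tiny made-up DataFrame. The column names mirror the lesson, but the village labels and values are invented for illustration and are not taken from the SAFI file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the SAFI data: same column names, invented values.
toy = pd.DataFrame({
    "village": ["village_a", "village_a", "village_b",
                "village_b", "village_c", "village_c"],
    "years_liv": [4, 9, 15, 6, 20, 24],
})

# One histogram per village, laid out as 1 row x 3 columns, sharing the x axis
# so the panels can be compared directly.
axes = toy.hist(column="years_liv", by="village", bins=5,
                layout=(1, 3), figsize=(12, 3), sharex=True)

# Flatten whatever container pandas returned into a plain list of panels.
panels = list(np.ravel(axes))
print(len(panels), "panels drawn")
```

With `sharex=True` every panel uses the same x range, which makes differences between groups easier to see than when each panel picks its own scale.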
7378

74-
Internally the pandas 'plot' method has called the 'bar' method of matplotlib and provided a set of parameters, including the pandas.Series s to generate the graph.
79+
## Scatter plot
7580

76-
We can use matplotlib directly to produce a similar graph. In this case we need to pass two parameters, the number of bars we need and the pandas Series holding the values.
81+
The scatter plot requires the x and y coordinates of each of the points being plotted.
82+
To provide this we will generate two series of random data, one for the x coordinates and the other for the y coordinates.
7783

78-
We also have to explicitly call the `show()` function to produce the graph.
84+
We will generate two sets of points and plot them on the same graph.
85+
86+
We will also add other common features like a title, a legend and labels on the x and y axis.
7987

80-
~~~
81-
plt.bar(range(len(s)), s)
82-
plt.show()
8388
~~~
8489
{: .language-python}
90+
df.plot.scatter(x='gps:Latitude', y='gps:Longitude', c='gps:Altitude', colormap="viridis", figsize=[4,4])
91+
~~~~~~
92+
{: .language-python}
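The call above needs the SAFI file to run. As a hedged, self-contained sketch of the same idea, the example below uses a small invented DataFrame; the column names `latitude`, `longitude` and `altitude` and all values are assumptions made for illustration.

```python
import pandas as pd

# Invented coordinates and altitudes, only to show how the arguments fit together.
toy = pd.DataFrame({
    "latitude": [-19.11, -19.10, -19.04, -19.05],
    "longitude": [33.48, 33.47, 33.40, 33.41],
    "altitude": [698, 690, 735, 745],
})

# x and y name the columns holding the coordinates; c names the column used
# to colour each point, and colormap chooses how values map to colours.
ax = toy.plot.scatter(x="latitude", y="longitude", c="altitude",
                      colormap="viridis", figsize=[4, 4])

# The returned Matplotlib Axes can be adjusted further, for example with a title.
ax.set_title("Altitude by position (toy data)")
```

Because the Pandas plot methods return a Matplotlib Axes object, anything Matplotlib can do to an Axes is still available after the Pandas call.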
93+
94+
![png](output_10_1.png)
8595
8696
> ## Exercise
97+
> 1. Make a scatter plot of 'years_farm' vs 'years_liv' and color the points by 'buildings_in_compound'
98+
> 2. Make a bar plot of the mean number of rooms per wall type
8799
>
88100
> Compare the two graphs we have just drawn. How do they differ? Are the differences significant?
89101
>
@@ -98,63 +110,57 @@ plt.show()
98110
> > ~~~
99111
> > {: .language-python}
100112
> {: .solution}
113+
> Extension: try plotting by wall and roof type.
101114
{: .challenge}
102115
103-
## Histograms
116+
## Boxplot
117+
118+
A boxplot provides a simple representation of a variety of statistical qualities of a single set of data values.
104119
105-
We can plot histograms in a similar way, directly from pandas and also from Matplotlib
120+
![box_plot](../fig/vis_boxplot_01.png)
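To make the statistical qualities concrete, here is a minimal sketch that computes the numbers a boxplot typically draws, using an invented Series. It assumes the conventional 1.5 × IQR rule for the whiskers, which is also Matplotlib's default `whis` setting.

```python
import pandas as pd

# Invented values; 30 is deliberately far from the rest.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# The box spans the first to third quartile, with a line at the median.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("box:", q1, median, q3)
print("outliers:", outliers.tolist())
```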
106121
107-
The pandas way
122+
A common use of the boxplot is to compare the statistical variations across a set of variables.
123+
124+
The variables can be independent Series or columns of a DataFrame, plotted using the Pandas plot method.
108125
109126
~~~
110-
s = pd.Series(np.random.rand(20))
111-
# plot the bar chart
112-
s.plot(kind='hist')
127+
df.boxplot(by ='village',column=['buildings_in_compound'])
113128
~~~
114-
{: .language-python}
115-
116-
and the matplotlib way
129+
{:.language-python}
117130
131+
We can make it look prettier with seaborn, much more easily than fixing components manually with Matplotlib.
118132
~~~
119-
plt.hist(s)
120-
plt.show()
133+
import seaborn as sns
134+
sns.boxplot(data=df,x ='village',y='buildings_in_compound')
121135
~~~
122136
{: .language-python}
123137
124-
For the Histogram, each data point is allocated to 1 of 10 (by default) equal 'bins' of equal size (range of numbers) which are indicated along the x axis and the number of points (frequency) is shown on the y axis.
125-
126-
In this case the graphs are almost identical. The only difference being in the first graph the y axis has a label 'Frequency' associated with it.
127-
128-
We can fix this with a call to the `ylabel` function
129138
130139
~~~
131-
plt.ylabel('Frequency')
132-
plt.hist(s)
133-
plt.show()
140+
sns.lmplot(x='years_farm', y='years_liv',data=df,hue='village')
134141
~~~
135142
{: .language-python}
136143
137144
In general, most graphs can be broken down into a series of elements which, although typically related in some way, can all exist independently of each other. This allows us to create the graph in a rather piecemeal fashion.
138145
139146
The labels (if any) on the x and y axis are independent of the data values being represented. The title and the legend are also independent objects within the overall graph.
140147
141-
In matplotlib you create the graph by providing values for all of the individual components you choose to include. When you are ready, you call the `show` function.
148+
In Matplotlib you create the graph by providing values for all of the individual components you choose to include. When you are ready, you call the `show` function.
142149
143150
Using this same approach, we can plot two sets of data on the same graph.
144151
145152
We will use a scatter plot to demonstrate some of the available features.
146153
147-
For a scatter plot we need two sets of data points one for the x values
148-
and the other for the y values.
154+
## Fine-tuning figures with Matplotlib
149155
150-
## Scatter plot
156+
If we want to do more advanced or lower level things with our plots, we need to use Matplotlib directly, not through Pandas. First we need to import it.
151157
152-
The scatter plot requires the x and y coordinates of each of the points being plotted.
153-
To provide this we will generate two series of random data one for the x coordinates and the other for the y coordinates
154158
155-
We will generate two sets of points and plot them on the same graph.
159+
The Matplotlib library can be imported using any of the import techniques we have seen. As `pandas` is generally imported with `import pandas as pd`, you will find that `matplotlib` is most commonly imported with `import matplotlib.pyplot as plt`, where 'plt' is the alias.
156160
157-
We will also add other common features like a title, a legend and labels on the x and y axis.
161+
In addition to importing the library, in a Jupyter notebook environment we need to tell Jupyter that when we produce a graph we want it to display the graph in a cell in the notebook just like any other result. To do this we use the `%matplotlib inline` directive.
162+
163+
If you forget to do this, you will have to add `plt.show()` to see the graphs.
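A minimal sketch of the import and a first plot is shown below. It uses `matplotlib.pyplot`, the submodule named in this episode's objectives; the data points are arbitrary.

```python
import matplotlib.pyplot as plt  # pyplot is the submodule conventionally aliased as plt

plt.figure()
line, = plt.plot([0, 1, 2, 3], [0, 1, 4, 9])  # arbitrary example points
plt.xlabel("x")

# In a notebook with %matplotlib inline the figure appears under the cell.
# In a plain script, a call to plt.show() at this point opens the figure instead.
```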
158164
159165
~~~
160166
# Generate some data for 2 sets of points.
@@ -181,11 +187,11 @@ plt.show()
181187
{: .language-python}
182188
183189
In the call to the `scatter` method, the `label` parameter values are used by the _legend_.
184-
The `c` or `color` parameter can be set to any color matplotlib recognises. Full details of the available colours are available in the [matplotlib](http://matplotlib.org/api/colors_api.html) website. The [markers](http://matplotlib.org/api/markers_api.html) section will tell you what markers you can use instead of the default 'dots'. There is also an `s` (size) parameter which allows you to change the size of the marker.
190+
The `c` or `color` parameter can be set to any color Matplotlib recognises. Full details of the available colours are available in the [Matplotlib](http://matplotlib.org/api/colors_api.html) website. The [markers](http://matplotlib.org/api/markers_api.html) section will tell you what markers you can use instead of the default 'dots'. There is also an `s` (size) parameter which allows you to change the size of the marker.
185191
186192
> ## Exercise
187193
>
188-
> In the scatterplot the s parameter determines the size of the dots. s can be a simple numeric value, say s=100, which will produce dots all of the same size. However, you can pass a list of values (or a pandas Series) to provide sizes for the individual dots. This approach is very common as it allows us to provide an extra variable worth of information on the graph.
194+
> In the scatterplot the s parameter determines the size of the dots. s can be a simple numeric value, say s=100, which will produce dots all of the same size. However, you can pass a list of values (or a Pandas Series) to provide sizes for the individual dots. This approach is very common as it allows us to provide an extra variable worth of information on the graph.
189195
>
190196
> 1. Modify the code we used for the scatter plot to include a size value for each of the points in the series being plotted.
191197
> (The downside is that some of the smaller dots may be completely covered by the larger dots. To try and highlight when this has happened, we can change the opacity of the dots.)
@@ -221,77 +227,62 @@ The `c` or `color` parameter can be set to any color matplotlib recognises. Full
221227
> {: .solution}
222228
{: .challenge}
223229
224-
## Boxplot
225-
226-
A boxplot provides a simple representation of a variety of statistical qualities of a single set of data values.
227230
228-
![box_plot](../fig/vis_boxplot_01.png)
231+
Internally the Pandas 'plot' method has called the 'bar' method of Matplotlib and provided a set of parameters, including the pandas.Series s to generate the graph.
229232
230-
~~~
231-
x = pd.Series(np.random.standard_normal(256))
233+
We can use Matplotlib directly to produce a similar graph. In this case we need to pass two parameters, the number of bars we need and the Pandas Series holding the values.
232234
233-
# Show a boxplot of the data
234-
plt.boxplot(x)
235-
plt.show()
236-
~~~
237-
{: .language-python}
238-
239-
A common use of the boxplot is to compare the statistical variations across a set of variables.
235+
We also have to explicitly call the `show()` function to produce the graph.
240236
241-
The variables can be an independent series or columns of a Dataframe.
237+
## Saving Plots
242238
243239
~~~
244-
df = pd.DataFrame(np.random.normal(size=(100,5)), columns=list('ABCDE')) # creating a Dataframe directly with pandas
245-
plt.boxplot(df.A, labels = 'A')
240+
plt.savefig("rooms.png")
241+
plt.savefig("rooms.pdf", bbox_inches="tight", dpi=600)
246242
plt.show()
247243
~~~
248244
{: .language-python}
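The snippet above assumes a figure already exists. A self-contained sketch follows, with invented bar heights and a temporary output directory; both are assumptions chosen for illustration, and the file names simply echo the ones used above.

```python
import os
import tempfile

import matplotlib.pyplot as plt

# Invented values standing in for a "mean rooms per wall type" summary.
fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(["type_a", "type_b", "type_c"], [2, 5, 3])
ax.set_ylabel("mean rooms")

# Write into a temporary directory so the sketch does not clutter the project.
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "rooms.png")
pdf_path = os.path.join(out_dir, "rooms.pdf")

# The file extension selects the format; dpi matters for raster formats such as PNG,
# and bbox_inches="tight" trims surrounding whitespace.
fig.savefig(png_path, dpi=150)
fig.savefig(pdf_path, bbox_inches="tight")
plt.close(fig)
```

In a script it is usually safest to save before calling `plt.show()`, as the original snippet does, because the figure may no longer be available once the window is closed.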
249245
250-
> ## Exercise
251-
>
252-
> Can you change the code above so that columns `A` , `C` and `D` are all displayed on the same graph?
253-
>
254-
> > ## Solution
255-
> >
256-
> > ~~~
257-
> > df = pd.DataFrame(np.random.normal(size=(100,5)), columns=list('ABCDE'))
258-
> > plt.boxplot([df.A, df.C, df.D], labels = ['A', 'C', 'D'])
259-
> > plt.show()
260-
> > ~~~
261-
> > {: .language-python}
262-
> {: .solution}
263-
{: .challenge}
264246
265-
The boxplot function cannot accept a whole Dataframe. The code
247+
For the histogram, each data point is allocated to one of 10 (by default) 'bins' of equal size (range of numbers), which are indicated along the x axis, and the number of points (frequency) is shown on the y axis.
248+
249+
In this case the graphs are almost identical. The only difference is that in the first graph the y axis has a label 'Frequency' associated with it.
250+
251+
We can fix this with a call to the `ylabel` function:
266252
267253
~~~
268-
df = pd.DataFrame(np.random.normal(size=(100,5)), columns=list('ABCDE'))
269-
plt.boxplot(df)
254+
plt.ylabel('Frequency')
255+
plt.hist(s)
270256
plt.show()
271257
~~~
272258
{: .language-python}
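The snippet above relies on a Series `s` defined elsewhere. As a self-contained sketch, the version below creates 20 random values itself (an assumption made so it can run on its own) and also inspects what `plt.hist` returns.

```python
import numpy as np
import matplotlib.pyplot as plt

# 20 random numbers between 0 and 1; the seed only makes the sketch repeatable.
rng = np.random.default_rng(12345)
s = rng.random(20)

plt.figure()
# hist returns the count in each bin, the bin edges, and the drawn bars.
counts, edges, bars = plt.hist(s)
plt.ylabel("Frequency")

print("number of bins:", len(counts))
print("points counted:", int(counts.sum()))
```

With the default of 10 bins there are 11 edges, and the counts add up to the number of data points.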
273259
274-
will fail. However, we can use the pandas plot method.
275260
276-
~~~,
277-
df = pd.DataFrame(np.random.normal(size=(100,5)), columns=list('ABCDE'))
278-
df.plot(kind = 'box', return_type='axes') # the return_type='axes' is only needed for forward compatibility
279-
~~~
280-
{: .language-python}
261+
In general most graphs can be broken down into a series of elements which, although typically related in some way, can
262+
all exist independently of each other. This allows us to create the graph in a rather piecemeal fashion.
281263
282-
We can add a title to the above by adding the `title` parameter. However there are no parameters for adding the axis labels.
283-
To add labels, we can use matplotlib directly.
264+
The labels (if any) on the x and y axis are independent of the data values being represented.
265+
The title and the legend are also independent objects within the overall graph.
284266
285-
~~~
286-
df = pd.DataFrame(np.random.normal(size=(100,5)), columns=list('ABCDE'))
287-
df.plot(kind = 'box', return_type='axes')
267+
In Matplotlib you create the graph by providing values for all of the individual components you choose to
268+
include. When you are ready, you call the `show` function.
269+
270+
Using this same approach we can plot two sets of data on the same graph. We will use a scatter plot to
271+
demonstrate some of the available features.
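The piecemeal approach can be sketched as follows. The points are invented, and the example uses Matplotlib's object-oriented interface (`fig, ax = plt.subplots()`) rather than the `plt.` functions, as one possible way to make each component explicit.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Two independent sets of points drawn on the same axes.
ax.scatter([1, 2, 3], [2, 4, 1], c="blue", label="group 1")
ax.scatter([1, 2, 3], [3, 1, 4], c="red", marker="^", label="group 2")

# Title, axis labels and legend are separate components; any can be left out.
ax.set_title("Two sets of points")
ax.set_xlabel("x values")
ax.set_ylabel("y values")
legend = ax.legend()  # built from the label= values given to scatter
```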
272+
273+
> ## Exercise
274+
>
275+
> Revisit your favorite plot we've made so far, or make one with your own data then:
276+
>
277+
> 1. add axes labels
278+
> 2. add a title
279+
> 3. add a legend
280+
> 4. save it in two different formats
281+
>
282+
> Extension: try plotting by wall and roof type.
283+
>
284+
{: .challenge}
288285
289-
plt.title('Box Plot')
290-
plt.xlabel('xlabel')
291-
plt.ylabel('ylabel')
292-
plt.show()
293-
~~~
294-
{: .language-python}
295286
296287
## Saving a graph
297288
@@ -308,3 +299,8 @@ plt.ylabel('ylabel')
308299
plt.savefig('boxplot_from_df.pdf')
309300
~~~
310301
{: .language-python}
302+
303+
[matplotlib-web]: http://matplotlib.org/
304+
[pandas-web]: http://pandas.pydata.org/
305+
[ggplot2-web]: http://ggplot2.tidyverse.org/
306+
[seaborn-web]: https://seaborn.pydata.org/
