@@ -3,87 +3,99 @@ title: "Data visualisation using Matplotlib"
3
3
teaching: 25
4
4
exercises: 25
5
5
questions:
6
-
- "How can I create simple visualisations of my data?"
6
+
- "How can I create visualisations of my data?"
7
7
objectives:
8
8
- "Import pyplot from the matplotlib library"
9
9
- "Create simple plots using pyplot"
10
10
keypoints:
11
-
- "Graphs can be drawn directly from pandas, but it still uses matplotlib"
11
+
- "Graphs can be drawn directly from Pandas, but it still uses Matplotlib"
12
12
- "Different graph types have different data requirements"
13
13
- "Graphs are created from a variety of discrete components placed on a 'canvas', you don't have to use them all"
14
14
- "Plotting multiple graphs on a single 'canvas' is possible"
15
15
---
16
16
17
-
## Matplotlib
17
+
## Plotting in Python
18
18
19
-
Matplotlib is a Python graphical library that can be used to produce a variety of different graph types.
19
+
There is a wide variety of ways to plot in Python, as in many programming languages. Some libraries do more of the design work for you, while others let you customize the look of the plots and all of the little details yourself. `Pandas` has basic plots built into it that reduce the amount of syntax needed if your data is already in a DataFrame.
20
+
Matplotlib is a Python graphical library that can be used to produce a variety of different graph types. It is fully controllable down to basic elements and includes a module, `pylab`, that is somewhere in between (designed to feel like MATLAB plotting, if you happen to have done that before).
20
21
21
-
The pandas library contains very tight integration with matplotlib. There are functions in pandas that automatically call matplotlib functions to produce graphs.
22
22
23
-
Although we are using Matplotlib in this episode, pandas can make use of several other graphical libraries available from within Python such as ggplot2 and seaborn.
23
+
The [Pandas][pandas-web] library contains very tight integration with [Matplotlib][matplotlib-web].
24
+
There are functions in Pandas that automatically call Matplotlib functions to produce graphs.
24
25
25
-
## Importing matplotlib
26
+
The Matplotlib library can be imported using any of the import techniques we have seen. As Pandas is generally imported
27
+
with `import pandas as pd`, you will find that `matplotlib` is most commonly imported with `import matplotlib.pyplot as plt` where 'plt' is the alias.
26
28
27
-
The matplotlib library can be imported using any of the import techniques we have seen. As `pandas` is generally imported with `import pandas as pd`, you will find that `matplotlib` is most commonly imported with `import matplotlib as plt` where 'plt' is the alias.
29
+
In addition to importing the library, in a Jupyter notebook environment we need to tell Jupyter that when we produce a
30
+
graph, we want it to display the graph in a cell in the notebook just like any other result. To do this we use the `%matplotlib inline` directive.
28
31
29
-
In addition to importing the library, in a Jupyter notebook environment we need to tell Jupyter that when we produce a graph, we want it to be display the graph in a cell in the notebook just like any other results. To do this we use the `%matplotlib inline` directive.
32
+
Although we are using Matplotlib in this episode, Pandas can make use of several other graphical libraries available
33
+
from within Python such as [ggplot2][ggplot2-web] and [Seaborn][seaborn-web]. Seaborn has some very powerful features
34
+
and advanced plot types. One of its most useful features is formatting.
30
35
31
-
If you forget to do this, you will have to add `plt.show()` to see the graphs.
36
+
## Plotting with Pandas
37
+
38
+
To plot with Pandas we have to import it as we have done in past episodes. We can also use the `%matplotlib inline` notebook magic to reduce the syntax needed; without it we need to add a `show()` command.
32
39
33
40
~~~
34
-
import matplotlib.pyplot as plt
41
+
import pandas as pd
35
42
%matplotlib inline
36
43
~~~
37
44
{: .language-python}
38
45
39
-
## Numpy
40
-
41
-
Numpy is another Python library. It is used for multi-dimensional array processing. In our case we just want to use it for its useful random number generation functions which we will use to create some fake data to demonstrate some of the graphing functions of matplotlib.
46
+
We also need data to work with loaded into a DataFrame and it's helpful to look at a few rows to remember what's there.
42
47
43
-
We will use the alias `np`, following convention.
48
+
~~~
49
+
df = pd.read_csv("data/SAFI_full_shortname.csv")
50
+
df.head()
51
+
~~~
52
+
{: .language-python}
44
53
45
-
## Bar charts
54
+
Next, we can plot a histogram of a variable.
46
55
47
56
~~~
48
-
np.random.rand(20)
57
+
df['years_liv'].hist()
49
58
~~~
50
59
{: .language-python}
51
60
52
-
will generate 20 random numbers between 0 and 1.
61
+

53
62
54
-
We are using these to create a pandas Series of values.
55
63
56
-
A bar chart only needs a single set of values. Each 'bar' represents the value from the Series of values.
57
-
A pandas Series (and a Dataframe) have a method called 'plot'. We only need to tell plot what kind of graph we want.
64
+
We can change the number of bins to make it look how we would like, for example:
58
65
59
-
The 'x' axis represents the index values of the Series
66
+
~~~
67
+
df['years_liv'].hist(bins=20)
68
+
~~~
69
+
{: .language-python}
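As a rough illustration of what `bins` changes, the sketch below draws the default histogram next to a 20-bin one. It assumes a small made-up DataFrame standing in for the SAFI data (only the `years_liv` column name comes from the lesson; the values are random), and it uses Matplotlib's `plt.subplots` purely to place the two plots side by side.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for the SAFI data; only the column name comes from the lesson.
rng = np.random.default_rng(0)
df = pd.DataFrame({"years_liv": rng.integers(1, 60, size=100)})

# Two axes side by side so the effect of `bins` is easy to compare.
fig, (ax_default, ax_fine) = plt.subplots(1, 2, figsize=(8, 3))
df["years_liv"].hist(ax=ax_default)        # pandas default: 10 bins
df["years_liv"].hist(ax=ax_fine, bins=20)  # narrower bins, more detail
```

More bins show finer structure but leave fewer points in each bar, so the best value depends on how much data you have.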
60
70
61
71
62
-
~~~
63
-
import numpy as np
64
-
import pandas as pd
72
+
We can also specify the column as a parameter and a groupby column with the `by` keyword. There are a lot of keywords available to make it look better; we can see some of the most likely ones (as decided by the Pandas developers) by using <kbd>shift</kbd> + <kbd>tab</kbd>. Let's try `layout`, `figsize`, and `sharex`.
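A minimal sketch of how these keywords might be combined, assuming a made-up DataFrame in place of the SAFI data. The `years_liv` column name comes from the lesson; the `village` grouping column and all of the values are hypothetical and only there so the example runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; not needed in a notebook
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 'village' and the values are invented for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "years_liv": rng.integers(1, 60, size=90),
    "village": np.repeat(["A", "B", "C"], 30),
})

# One histogram per village, stacked in 3 rows and 1 column,
# on a 6 x 8 inch figure, all sharing the same x axis.
axes = df.hist(column="years_liv", by="village",
               layout=(3, 1), figsize=(6, 8), sharex=True)
```

Sharing the x axis makes the groups directly comparable, because every panel then covers the same range of years.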
65
73
66
-
np.random.seed(12345) # set a seed value to ensure reproducibility of the plots
Internally the pandas 'plot' method has called the 'bar' method of matplotlib and provided a set of parameters, including the pandas.Series s to generate the graph.
79
+
## Scatter plot
75
80
76
-
We can use matplotlib directly to produce a similar graph. In this case we need to pass two parameters, the number of bars we need and the pandas Series holding the values.
81
+
The scatter plot requires the x and y coordinates of each of the points being plotted.
82
+
To provide this we will generate two series of random data, one for the x coordinates and the other for the y coordinates.
77
83
78
-
We also have to explicitly call the `show()` function to produce the graph.
84
+
We will generate two sets of points and plot them on the same graph.
85
+
86
+
We will also add other common features like a title, a legend and labels on the x and y axes.
For the Histogram, each data point is allocated to 1 of 10 (by default) equal 'bins' of equal size (range of numbers) which are indicated along the x axis and the number of points (frequency) is shown on the y axis.
125
-
126
-
In this case the graphs are almost identical. The only difference being in the first graph the y axis has a label 'Frequency' associated with it.
127
-
128
-
We can fix this with a call to the `ylabel` function
In general, most graphs can be broken down into a series of elements which, although typically related in some way, can all exist independently of each other. This allows us to create the graph in a rather piecemeal fashion.
138
145
139
146
The labels (if any) on the x and y axis are independent of the data values being represented. The title and the legend are also independent objects within the overall graph.
140
147
141
-
In matplotlib you create the graph by providing values for all of the individual components you choose to include. When you are ready, you call the `show` function.
148
+
In Matplotlib you create the graph by providing values for all of the individual components you choose to include. When you are ready, you call the `show` function.
142
149
143
150
Using this same approach, we can plot two sets of data on the same graph.
144
151
145
152
We will use a scatter plot to demonstrate some of the available features.
146
153
147
-
For a scatter plot we need two sets of data points one for the x values
148
-
and the other for the y values.
154
+
## Fine-tuning figures with Matplotlib
149
155
150
-
## Scatter plot
156
+
If we want to do more advanced or lower level things with our plots, we need to use Matplotlib directly, not through Pandas. First we need to import it.
151
157
152
-
The scatter plot requires the x and y coordinates of each of the points being plotted.
153
-
To provide this we will generate two series of random data one for the x coordinates and the other for the y coordinates
154
158
155
-
We will generate two sets of points and plot them on the same graph.
159
+
The Matplotlib library can be imported using any of the import techniques we have seen. As `pandas` is generally imported with `import pandas as pd`, you will find that `matplotlib` is most commonly imported with `import matplotlib.pyplot as plt` where 'plt' is the alias.
156
160
157
-
We will also add other common features like a title, a legend and labels on the x and y axis.
161
+
In addition to importing the library, in a Jupyter notebook environment we need to tell Jupyter that when we produce a graph we want it to display the graph in a cell in the notebook just like any other result. To do this we use the `%matplotlib inline` directive.
162
+
163
+
If you forget to do this, you will have to add `plt.show()` to see the graphs.
158
164
159
165
~~~
160
166
# Generate some data for 2 sets of points.
@@ -181,11 +187,11 @@ plt.show()
181
187
{: .language-python}
182
188
183
189
In the call to the `scatter` method, the `label` parameter values are used by the _legend_.
184
-
The `c` or `color` parameter can be set to any color matplotlib recognises. Full details of the available colours are available in the [matplotlib](http://matplotlib.org/api/colors_api.html) website. The [markers](http://matplotlib.org/api/markers_api.html) section will tell you what markers you can use instead of the default 'dots'. There is also an `s` (size) parameter which allows you to change the size of the marker.
190
+
The `c` or `color` parameter can be set to any colour Matplotlib recognises. Full details of the available colours are on the [Matplotlib](http://matplotlib.org/api/colors_api.html) website. The [markers](http://matplotlib.org/api/markers_api.html) section will tell you what markers you can use instead of the default 'dots'. There is also an `s` (size) parameter which allows you to change the size of the marker.
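The parameters described above could be combined along these lines. This is only a sketch: the random data, the colour names, the marker choices and the label text are all assumptions made for illustration, not values taken from the lesson.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Two made-up sets of points.
rng = np.random.default_rng(42)
x1, y1 = rng.random(20), rng.random(20)
x2, y2 = rng.random(20), rng.random(20)

# c sets the colour, marker the symbol, s the marker size;
# label is the text the legend will show for each set.
plt.scatter(x1, y1, c="red", marker="o", s=40, label="first set")
plt.scatter(x2, y2, c="blue", marker="^", s=80, label="second set")

plt.title("Two sets of points")
plt.xlabel("x values")
plt.ylabel("y values")
legend = plt.legend()
```

Because each `scatter` call carries its own `label`, the legend needs no extra arguments to pick up both entries.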
185
191
186
192
> ## Exercise
187
193
>
188
-
> In the scatterplot the s parameter determines the size of the dots. s can be a simple numeric value, say s=100, which will produce dots all of the same size. However, you can pass a list of values (or a pandas Series) to provide sizes for the individual dots. This approach is very common as it allows us to provide an extra variable worth of information on the graph.
194
+
> In the scatter plot the `s` parameter determines the size of the dots. `s` can be a simple numeric value, say `s=100`, which will produce dots all of the same size. However, you can pass a list of values (or a Pandas Series) to provide sizes for the individual dots. This approach is very common as it allows us to show an extra variable's worth of information on the graph.
189
195
>
190
196
> 1. Modify the code we used for the scatter plot to include a size value for each of the points in the series being plotted.
191
197
> (The downside is that some of the smaller dots may be completely covered by the larger dots. To try and highlight when this has happened, we can change the opacity of the dots.)
@@ -221,77 +227,62 @@ The `c` or `color` parameter can be set to any color matplotlib recognises. Full
221
227
> {: .solution}
222
228
{: .challenge}
223
229
224
-
## Boxplot
225
-
226
-
A boxplot provides a simple representation of a variety of statistical qualities of a single set of data values.
227
230
228
-

231
+
Internally the Pandas 'plot' method has called the 'bar' method of Matplotlib and provided a set of parameters, including the pandas.Series s to generate the graph.
229
232
230
-
~~~
231
-
x = pd.Series(np.random.standard_normal(256))
233
+
We can use Matplotlib directly to produce a similar graph. In this case we need to pass two parameters, the number of bars we need and the Pandas Series holding the values.
232
234
233
-
# Show a boxplot of the data
234
-
plt.boxplot(x)
235
-
plt.show()
236
-
~~~
237
-
{: .language-python}
238
-
239
-
A common use of the boxplot is to compare the statistical variations across a set of variables.
235
+
We also have to explicitly call the `show()` function to produce the graph.
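A minimal sketch of the two routes, assuming a Series `s` of 20 random values like the one used in the earlier version of this episode. The positions passed to `plt.bar` are simply 0 to 19; that choice is an assumption made here to mimic the index that Pandas would put on the x axis.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(12345)  # fixed seed so the plot is reproducible
s = pd.Series(np.random.rand(20))

# Route 1: let Pandas call Matplotlib for us.
ax = s.plot(kind="bar")

# Route 2: call Matplotlib directly with the bar positions and the values.
plt.figure()
bars = plt.bar(range(len(s)), s)
```

In a script, or in a notebook without `%matplotlib inline`, a final `plt.show()` would be needed to display the figures.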
240
236
241
-
The variables can be an independent series or columns of a Dataframe.
237
+
## Saving Plots
242
238
243
239
~~~
244
-
df = pd.DataFrame(np.random.normal(size=(100,5)), columns=list('ABCDE')) # creating a Dataframe directly with pandas
The boxplot function cannot accept a whole Dataframe. The code
247
+
For the histogram, each data point is allocated to 1 of 10 (by default) 'bins' of equal size (range of numbers), which are indicated along the x axis, and the number of points (frequency) in each bin is shown on the y axis.
248
+
249
+
In this case the graphs are almost identical. The only difference is that in the first graph the y axis has the label 'Frequency' associated with it.
250
+
251
+
We can fix this with a call to the `ylabel` function.
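As a sketch of that fix, assuming a Series of 256 random values like the one used elsewhere in this episode: the Pandas histogram gets its 'Frequency' label automatically, while the Matplotlib one needs it set by hand.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(12345)  # fixed seed so the plot is reproducible
x = pd.Series(np.random.standard_normal(256))

# Pandas labels the y axis 'Frequency' for us.
ax_pandas = x.plot(kind="hist")

# Matplotlib leaves the y axis unlabelled, so we add the label ourselves.
plt.figure()
plt.hist(x)
plt.ylabel("Frequency")
ax_mpl = plt.gca()
```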