Commit b4b4c45

Fixing rebase conflicts and adding figure files. Finishing rebase of
gh-pages into branch started from #93.
1 parent 6b17a16 commit b4b4c45

10 files changed

+111
-48
lines changed

_episodes/13-matplotlib.md

Lines changed: 111 additions & 48 deletions
@@ -5,23 +5,23 @@ exercises: 25
55
questions:
66
- "How can I create visualisations of my data?"
77
objectives:
8+
- "Create simple plots using pandas"
89
- "Import pyplot from the matplotlib library"
9-
- "Create simple plots using pyplot"
10+
- "Customise plots using pyplot"
1011
keypoints:
1112
- "Graphs can be drawn directly from Pandas, but it still uses Matplotlib"
1213
- "Different graph types have different data requirements"
1314
- "Graphs are created from a variety of discrete components placed on a 'canvas', you don't have to use them all"
14-
- "Plotting multiple graphs on a single 'canvas' is possible"
1515
---
1616

1717
## Plotting in Python
1818

19-
There are a wide variety of ways to plot in Python, like many programming languages. Some do more of the design work for you and others let you customize the look of the plots and all of the little details yourself. `Pandas` has basic plots built into it that reduce the amount of syntax, if your data is already in a DataFrame.
20-
Matplotlib is a Python graphical library that can be used to produce a variety of different graph types, it is fully controllable down to basic elements and includes a module `pylab` that is somewhere in between (designed to feel like matlab plotting, if you happen to have done that before).
21-
22-
23-
The [Pandas][pandas-web] library contains very tight integration with [Matplotlib][matplotlib-web].
24-
There are functions in Pandas that automatically call Matplotlib functions to produce graphs.
19+
As in many programming languages, there is a wide variety of ways to plot in Python.
20+
Some do more of the design work for you and others let you customize the look of the plots and all of the little details yourself.
21+
[Pandas][pandas-web] has basic plots built into it that reduce the amount of syntax, if your data is already in a DataFrame.
22+
[Matplotlib][matplotlib-web] is a Python graphical library that can be used to produce a variety of different graph types;
23+
it is fully controllable down to basic elements and includes a module `pylab` that is somewhere in between
24+
(designed to feel like MATLAB plotting, if you happen to have done that before).
2525

2626
The Matplotlib library can be imported using any of the import techniques we have seen. As Pandas is generally imported
2727
with `import pandas as pd`, you will find that `matplotlib` is most commonly imported with `import matplotlib.pyplot as plt`, where `plt` is the alias.
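As a rough illustration of the import convention, the sketch below imports `pandas` as `pd` and `matplotlib.pyplot` as `plt`, then draws a histogram through pandas. The small DataFrame is made-up data standing in for the lesson's dataset, and the `Agg` backend is selected only so that the snippet also runs outside a notebook; in Jupyter you would typically use `%matplotlib inline` instead.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs as a plain script

import matplotlib.pyplot as plt  # conventional alias for pyplot
import pandas as pd              # conventional alias for pandas

# Made-up values, used only for illustration (not the lesson's dataset).
df = pd.DataFrame({"years_liv": [2, 5, 9, 12, 20, 23, 40]})

# pandas calls Matplotlib internally and returns the Matplotlib Axes it drew on.
ax = df["years_liv"].hist(bins=5)

# Because the plot lives on a Matplotlib Axes, pyplot functions can adjust it.
plt.xlabel("Years lived in village")

n_bars = len(ax.patches)  # one rectangle per histogram bin
```

The point of the sketch is that the object returned by the pandas call is an ordinary Matplotlib `Axes`, which is why `plt` functions can be mixed with pandas plotting.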
@@ -35,7 +35,18 @@ and advanced plot types. One of its most useful features is formatting.
3535

3636
## Plotting with Pandas
3737

38-
To plot with Pandas we have to import it as we have done in past episodes. We can also use the `%matplotlib inline` notebook magic to reduce syntax otherwise. Without that we need to a `show()` command
38+
The `pandas` library contains very tight integration with `matplotlib`. There are functions in `pandas` that
39+
automatically call `matplotlib` functions to produce graphs.
40+
41+
Other graphical libraries available from within Python include `plotnine` (a ggplot2 implementation for Python)
42+
and `seaborn`. [Seaborn](https://seaborn.pydata.org) has some very powerful features and advanced plot types.
43+
One of its most useful features is formatting.
44+
45+
## Plotting with Pandas
46+
47+
To plot with `pandas` we have to import it as we have done in past episodes.
48+
To tell Jupyter that when we produce a graph we want it to be displayed in a cell in the notebook just like any other results,
49+
we use the `%matplotlib inline` directive. Without it, we need to call Matplotlib's `show()` function explicitly.
3950

4051
~~~
4152
import pandas as pd
@@ -58,7 +69,7 @@ df['years_liv'].hist()
5869
~~~
5970
{: .language-python}
6071

61-
![png](output_4_1.png)
72+
![png](../fig/histogram1.png)
6273

6374

6475
We can change the number of bins to make it look how we would like, for example
@@ -68,32 +79,34 @@ df['years_liv'].hist(bins=20)
6879
~~~
6980
{: .language-python}
7081

71-
We can also specify the column as a parameter and a groupby column with the `by` keyword. there are a lot of keywords available to make it look better, we can see some of the most likely ones (as decided by Pandas developers) by using <kbd>shift</kbd> + <kbd>tab<kbd>. Lets try `layout`, `figsize`, and `sharex`.
82+
We can also specify the column as a parameter and a groupby column with the `by` keyword.
83+
There are many keywords available to improve its appearance; we can see some of the most likely ones
84+
(as decided by the Pandas developers) by pressing <kbd>shift</kbd> + <kbd>tab</kbd>.
85+
86+
Let's try `layout`, `figsize`, and `sharex`.
7287

7388
~~~
7489
df.hist(column='years_liv',by='village',layout=(1,3),figsize=(12,3),sharex=True)
7590
~~~
7691
{: .language-python}
7792

78-
## Scatter plot
93+
![png](../fig/histogram3.png)
7994

80-
The scatter plot requires the x and y coordinates of each of the points being plotted.
81-
To provide this we will generate two series of random data one for the x coordinates and the other for the y coordinates
8295

83-
We will generate two sets of points and plot them on the same graph.
96+
## Scatter plot
8497

85-
We will also add other common features like a title, a legend and labels on the x and y axis.
98+
The scatter plot requires the x and y coordinates of each of the points being plotted. We can add a third dimension as different colors with the `c` argument.
8699

100+
~~~
101+
df.plot.scatter(x='gps_Latitude', y='gps_Longitude', c='gps_Altitude', colormap="viridis", figsize=[4,4])
87102
~~~
88103
{: .language-python}
89-
df.plot.scatter(x='gps:Latitude', y='gps:Longitude', c='gps:Altitude', colormap="viridis", figsize=[4,4])
90-
~~~~~~
91-
{: .language-python}
92104

93-
![png](output_10_1.png)
105+
![png](../fig/scatter1.png)
106+
94107

95108
> ## Exercise
96-
> 1. Make a scatter plot of 'years_farm' vs 'years_liv' and color the points by 'buildings_in_compound'
109+
> 1. Make a scatter plot of `years_farm` vs `years_liv` and color the points by `buildings_in_compound`
97110
> 2. Make a bar plot of the mean number of rooms per wall type
98111
>
99112
> Compare the two graphs we have just drawn. How do they differ? Are the differences significant?
@@ -109,15 +122,32 @@ df.plot.scatter(x='gps:Latitude', y='gps:Longitude', c='gps:Altitude', colormap=
109122
> > ~~~
110123
> > {: .language-python}
111124
> {: .solution}
112-
> Extension: try plotting by by wall and roof type?
125+
> Extension: try plotting by wall and roof type?
126+
>
127+
> > ## Solution
128+
> > For the scatter plot:
129+
> >
130+
> > ~~~
131+
> > df.plot.scatter(x = 'years_liv', y = 'years_farm', c = 'buildings_in_compound', colormap = 'viridis')
132+
> > ~~~
133+
> > {: .language-python}
134+
> > ![png](../fig/scatter2.png)
135+
> >
136+
> > For the barplot: we first need to calculate the mean value of rooms per wall type, then we can make the plot.
137+
> >
138+
> > ~~~
139+
> > rooms_mean = df.groupby('respondent_wall_type')['rooms'].mean()
140+
> > rooms_mean.plot.bar()
141+
> > ~~~
142+
> > {: .language-python}
143+
> > ![png](../fig/barplot1.png)
144+
> {: .solution}
113145
{: .challenge}
114146
147+
115148
## Boxplot
116149
117150
A boxplot provides a simple representation of a variety of statistical qualities of a single set of data values.
118-
119-
![box_plot](../fig/vis_boxplot_01.png)
120-
121151
A common use of the boxplot is to compare the statistical variations across a set of variables.
122152
123153
The variables can be an independent series or columns of a Dataframe using the Pandas plot method
@@ -132,34 +162,51 @@ We can make it look prettier with seaborn, much more easily than fixing componen
132162
import seaborn as sns
133163
sns.boxplot(data=df,x ='village',y='buildings_in_compound')
134164
~~~
165+
{: .language-python}
166+
167+
![png](../fig/boxplot1.png)
168+
169+
We can make it look prettier with `seaborn`, much more easily than fixing components manually with `matplotlib`. [`Seaborn`](https://seaborn.pydata.org) is a Python data visualization library based on `matplotlib`. It provides a high-level interface for drawing attractive and informative statistical graphics. `Seaborn` comes with Anaconda; to make it available in our Python session we need to import it.
170+
171+
~~~
172+
import seaborn as sns
173+
sns.boxplot(data = df, x = 'village', y = 'buildings_in_compound')
174+
~~~
135175
{: .language-python}
136176
177+
![png](../fig/boxplot2.png)
178+
179+
We can also draw linear models in a plot using `lmplot()` from `seaborn`, e.g. for `years_farm` vs `years_liv` per `village`.
137180
138181
~~~
139182
sns.lmplot(x='years_farm', y='years_liv',data=df,hue='village')
140183
~~~
141184
{: .language-python}
185+
![png](../fig/lm1.png)
142186
143-
In general, most graphs can be broken down into a series of elements which, although typically related in some way, can all exist independently of each other. This allows us to create the graph in a rather piecemeal fashion.
187+
In general, most graphs can be broken down into a series of elements which, although typically related in some way,
188+
can all exist independently of each other. This allows us to create the graph in a rather piecemeal fashion.
144189
145-
The labels (if any) on the x and y axis are independent of the data values being represented. The title and the legend are also independent objects within the overall graph.
190+
The labels (if any) on the x and y axis are independent of the data values being represented. The title and the legend
191+
are also independent objects within the overall graph.
146192
147-
In Matplotlib you create the graph by providing values for all of the individual components you choose to include. When you are ready, you call the `show` function.
193+
In Matplotlib you create the graph by providing values for all of the individual components you choose to include.
194+
When you are ready, you call the `show` function.
148195
149196
Using this same approach, we can plot two sets of data on the same graph.
150197
151198
We will use a scatter plot to demonstrate some of the available features.
152199
153200
## Fine-tuning figures with Matplotlib
154201
155-
If we want to do more advanced or lower level things with our plots, we need to use Matplotlib directly, not through Pandas. First we need to import it.
156-
202+
If we want to do more advanced or lower level things with our plots, we need to use Matplotlib directly,
203+
not through Pandas. First we need to import it.
157204
158-
The Matplotlib library can be imported using any of the import techniques we have seen. As `pandas` is generally imported with `import pandas as pd`, you will find that `matplotlib` is most commonly imported with `import matplotlib.pylab as plt` where 'plt' is the alias.
159205
160-
In addition to importing the library, in a Jupyter notebook environment we need to tell Jupyter that when we produce a graph we want it to be display the graph in a cell in the notebook just like any other results. To do this we use the `%matplotlib inline` directive.
206+
## Customising our plots with Matplotlib
161207
162-
If you forget to do this, you will have to add `plt.show()` to see the graphs.
208+
We can further customise our plots with `matplotlib` directly. First we need to import it.
209+
The `matplotlib` library can be imported using any of the import techniques we have seen. As `pandas` is generally imported with `import pandas as pd`, you will find that `matplotlib` is most commonly imported with `import matplotlib.pyplot as plt`, where `plt` is the alias.
163210
164211
~~~
165212
# Generate some data for 2 sets of points.
@@ -231,32 +278,49 @@ Internally the Pandas 'plot' method has called the 'bar' method of Matplotlib an
231278
232279
We can use Matplotlib directly to produce a similar graph. In this case we need to pass two parameters, the number of bars we need and the Pandas Series holding the values.
233280
234-
We also have to explicitly call the `show()` function to produce the graph.
281+
Let's redo the boxplot we did above:
235282
236-
## Saving Plots
283+
~~~
284+
df.boxplot(column = 'buildings_in_compound', by = 'village')
285+
~~~
286+
{: .language-python}
287+
288+
![png](../fig/boxplot1.png)
289+
290+
The automatic title of the plot does not look good, the y-axis has no title, and the extra x-axis title is unnecessary. We can also remove the gridlines. Let's fix these things using functions from `plt`. Note: all the adjustments for the plot have to go into the same notebook cell as the plot statement itself.
237291
238292
~~~
239-
plt.savefig("rooms.png")
240-
plt.savefig("rooms.pdf", bbox_inches="tight", dpi=600)
241-
plt.show()
293+
df.boxplot(column = 'buildings_in_compound', by = 'village')
294+
plt.suptitle('') # remove the automatic title
295+
plt.title('Buildings in compounds per village') # add a title
296+
plt.ylabel('Number of buildings') # add a y-axis title
297+
plt.xlabel('') # remove the x-axis title
298+
plt.grid(False) # remove the grid lines
242299
~~~
243300
{: .language-python}
244301
302+
![png](../fig/boxplot3.png)
245303
246-
For the Histogram, each data point is allocated to 1 of 10 (by default) equal 'bins' of equal size (range of numbers) which are indicated along the x axis and the number of points (frequency) is shown on the y axis.
304+
In general most graphs can be broken down into a series of elements which, although typically related in some way, can all exist independently of each other. This allows us to create the graph in a rather piecemeal fashion.
305+
The labels (if any) on the x and y axis are independent of the data values being represented. The title and the legend are also independent objects within the overall graph.
306+
In `matplotlib` you create the graph by providing values for all of the individual components you choose to include.
247307
248-
In this case the graphs are almost identical. The only difference being in the first graph the y axis has a label 'Frequency' associated with it.
308+
## Saving a graph
249309
250-
We can fix this with a call to the `ylabel` function
310+
If you wish to save your graph as an image you can do so using the `plt.savefig()` function. The image can be saved as a pdf, jpg or png file by changing the file extension. `plt.savefig()` needs to be called at the end of all your plot statements in the same notebook cell.
251311
252312
~~~
253-
plt.ylabel('Frequency')
254-
plt.hist(s)
255-
plt.show()
313+
df.boxplot(column = 'buildings_in_compound', by = 'village')
314+
plt.suptitle('') # remove the automatic title
315+
plt.title('Buildings in compounds per village') # add a title
316+
plt.ylabel('Number of buildings') # add a y-axis title
317+
plt.xlabel('') # remove the x-axis title
318+
plt.grid(False) # remove the grid lines
319+
plt.savefig('safi_boxplot_buildings.pdf') # save as pdf file
320+
plt.savefig('safi_boxplot_buildings.png', dpi = 150) # save as png file, some extra arguments are provided
256321
~~~
257322
{: .language-python}
258323
259-
260324
In general most graphs can be broken down into a series of elements which, although typically related in some way, can
261325
all exist independently of each other. This allows us to create the graph in a rather piecemeal fashion.
262326
@@ -278,11 +342,10 @@ demonstrate some of the available features.
278342
> 3. add a legend
279343
> 4. save it in two different formats
280344
>
281-
> extension: try plotting by by wall and roof type?
345+
> Extension: try plotting by wall and roof type.
282346
>
283347
{: .challenge}
284348
285-
286349
## Saving a graph
287350
288351
If you wish to save your graph as an image you can do so using the `savefig()` function. The image can be saved as a pdf, jpg or png file by changing the file extension.
@@ -294,7 +357,7 @@ df.plot(kind = 'box', return_type='axes')
294357
plt.title('Box Plot')
295358
plt.xlabel('xlabel')
296359
plt.ylabel('ylabel')
297-
#plt.show()
360+
298361
plt.savefig('boxplot_from_df.pdf')
299362
~~~
300363
{: .language-python}
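The following sketch shows one way the customise-then-save sequence could look as a standalone script. It assumes a tiny made-up DataFrame with the same column names as the lesson, writes to a temporary directory, and uses hypothetical file names; `savefig` is called last so that all adjustments are included in the saved files.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs as a plain script

import matplotlib.pyplot as plt
import pandas as pd

# Made-up values, used only for illustration (not the lesson's dataset).
df = pd.DataFrame({
    "village": ["A", "A", "B", "B", "C", "C"],
    "buildings_in_compound": [1, 3, 2, 2, 4, 1],
})

df.boxplot(column="buildings_in_compound", by="village")
plt.suptitle("")                                 # remove the automatic title
plt.title("Buildings in compounds per village")  # add a title
plt.ylabel("Number of buildings")                # add a y-axis title
plt.xlabel("")                                   # remove the x-axis title
plt.grid(False)                                  # remove the grid lines
ax = plt.gca()                                   # keep a handle on the axes we customised

# Hypothetical output locations; the file extension selects the format.
out_dir = tempfile.mkdtemp()
pdf_path = os.path.join(out_dir, "boxplot_sketch.pdf")
png_path = os.path.join(out_dir, "boxplot_sketch.png")
plt.savefig(pdf_path)                                 # save as a PDF file
plt.savefig(png_path, dpi=150, bbox_inches="tight")   # save as a PNG file with extra options
```

In a notebook, the same statements would need to sit in a single cell, as noted above for the customisation steps.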

fig/barplot1.png

9.17 KB

fig/boxplot1.png

11.4 KB

fig/boxplot2.png

8.26 KB

fig/boxplot3.png

11 KB

fig/histogram1.png

7.04 KB

fig/histogram3.png

8.58 KB

fig/lm1.png

36 KB

fig/scatter1.png

11.1 KB

fig/scatter2.png

20.3 KB
