Commit 861e19c

Merge pull request #168 from vinisalazar/ep11
Trim line lengths and add command
2 parents 883ee62 + 2e7a115 commit 861e19c

1 file changed: +21 -8 lines changed

_episodes/11-joins.md

@@ -82,12 +82,17 @@ df_all_rows
 ~~~
 {: .language-python}

-We didn't explicitly set an index for any of the Dataframes we have used. For `df_SN7577i_a` and `df_SN7577i_b` default indexes would have been created by pandas. When we concatenated the Dataframes the indexes were also concatenated resulting in duplicate entries.
+We didn't explicitly set an index for any of the Dataframes we have used. For `df_SN7577i_a` and `df_SN7577i_b` default
+indexes would have been created by pandas. When we concatenated the Dataframes the indexes were also concatenated resulting in duplicate entries.

 This is really only a problem if you need to access a row by its index. We can fix the problem with the following code.

 ~~~
 df_all_rows=df_all_rows.reset_index(drop=True)
+
+# or, alternatively, there's the `ignore_index` option in the `pd.concat()` function:
+df_all_rows = pd.concat([df_SN7577i_a, df_SN7577i_b], ignore_index=True)
+
 df_all_rows
 ~~~
 {: .language-python}
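
Reviewer's sketch for the hunk above: the two fixes it shows (`reset_index(drop=True)` after the fact, or `ignore_index=True` inside `pd.concat()`) should give the same result. The frames below are small invented stand-ins, not the real `SN7577i` data, and the names `df_a`, `df_b`, `stacked`, `renumbered` and `ignored` are hypothetical.

```python
import pandas as pd

# Invented stand-ins for df_SN7577i_a and df_SN7577i_b (values are illustrative only).
df_a = pd.DataFrame({"Id": [1, 2, 3], "Q1": [1, 3, 2]})
df_b = pd.DataFrame({"Id": [4, 5, 6], "Q1": [2, 1, 3]})

# Plain concatenation keeps each frame's default index, so the labels repeat.
stacked = pd.concat([df_a, df_b])
print(stacked.index.tolist())  # [0, 1, 2, 0, 1, 2]

# Option 1: renumber the rows after concatenating.
renumbered = stacked.reset_index(drop=True)

# Option 2: tell concat to discard the original indexes.
ignored = pd.concat([df_a, df_b], ignore_index=True)

print(renumbered.index.tolist())  # [0, 1, 2, 3, 4, 5]
print(renumbered.equals(ignored))
```

Either form works; `ignore_index=True` simply avoids creating the intermediate frame with duplicate labels.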
@@ -102,7 +107,9 @@ df_all_rows
 ~~~
 {: .language-python}

-In this case `df_SN7577i_aa` has no Q4 column and `df_SN7577i_bb` has no `Q3` column. When they are concatenated, the resulting Dataframe has a column for `Q3` and `Q4`. For the rows corresponding to `df_SN7577i_aa` the values in the `Q4` column are missing and denoted by `NaN`. The same applies to `Q3` for the `df_SN7577i_bb` rows.
+In this case `df_SN7577i_aa` has no Q4 column and `df_SN7577i_bb` has no `Q3` column. When they are concatenated, the
+resulting Dataframe has a column for `Q3` and `Q4`. For the rows corresponding to `df_SN7577i_aa` the values in the `Q4`
+column are missing and denoted by `NaN`. The same applies to `Q3` for the `df_SN7577i_bb` rows.


 ### Scenario 2 - Adding the columns from one Dataframe to those of another Dataframe
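
Reviewer's sketch for the `Q3`/`Q4` paragraph reflowed in the hunk above: concatenating frames whose columns differ produces the union of the columns, with `NaN` where a frame had no value. The data is invented and the names `df_aa`, `df_bb` and `combined` are hypothetical stand-ins for the lesson's Dataframes.

```python
import pandas as pd

# Invented stand-ins: one frame has Q3 but no Q4, the other has Q4 but no Q3.
df_aa = pd.DataFrame({"Id": [1, 2], "Q3": [5, 6]})
df_bb = pd.DataFrame({"Id": [3, 4], "Q4": [7, 8]})

combined = pd.concat([df_aa, df_bb], ignore_index=True)

# The result carries every column seen in either input.
print(combined.columns.tolist())  # ['Id', 'Q3', 'Q4']

# Rows from df_aa have no Q4 value, and rows from df_bb have no Q3 value.
print(combined["Q4"].isna().tolist())  # [True, True, False, False]
print(combined["Q3"].isna().tolist())  # [False, False, True, True]
```

One side effect worth knowing: integer columns that gain `NaN` values this way are converted to floating point.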
@@ -115,15 +122,18 @@ df_all_cols
 ~~~
 {: .language-python}

-We use the `axis=1` parameter to indicate that it is the columns that need to be joined together. Notice that the `Id` column appears twice, because it was a column in each dataset. This is not particularly desirable, but also not necessarily a problem. However, there are better ways of combining columns from two Dataframes which avoid this problem.
+We use the `axis=1` parameter to indicate that it is the columns that need to be joined together. Notice that the `Id`
+column appears twice, because it was a column in each dataset. This is not particularly desirable, but also not
+necessarily a problem. However, there are better ways of combining columns from two Dataframes which avoid this problem.

 ### Scenario 3 - Using merge to join columns

 We can join columns from two Dataframes using the `merge()` function. This is similar to the SQL 'join' functionality.

 A detailed discussion of different join types is given in the [SQL lesson](./episodes/sql...).

-You specify the type of join you want using the `how` parameter. The default is the `inner` join which returns the columns from both tables where the `key` or common column values match in both Dataframes.
+You specify the type of join you want using the `how` parameter. The default is the `inner` join which returns the
+columns from both tables where the `key` or common column values match in both Dataframes.

 The possible values of the `how` parameter are shown in the picture below (taken from the Pandas documentation)

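
Reviewer's sketch for the hunk above: it contrasts `pd.concat(..., axis=1)`, which keeps both `Id` columns, with `pd.merge()`, and shows how the `how` parameter changes which rows survive. The data is invented and `df_c`, `df_d`, `side_by_side` and `row_counts` are hypothetical names.

```python
import pandas as pd

# Invented stand-ins; Id values 2 and 3 appear in both frames.
df_c = pd.DataFrame({"Id": [1, 2, 3], "Q1": [1, 3, 2]})
df_d = pd.DataFrame({"Id": [2, 3, 4], "Q4": [7, 8, 9]})

# axis=1 places the frames side by side, aligned on the row index (not on Id),
# so the Id column shows up twice.
side_by_side = pd.concat([df_c, df_d], axis=1)
print(side_by_side.columns.tolist())  # ['Id', 'Q1', 'Id', 'Q4']

# merge() matches rows on the key column and keeps a single Id column.
inner = pd.merge(df_c, df_d, how="inner", on="Id")
print(inner.columns.tolist())  # ['Id', 'Q1', 'Q4']
print(inner["Id"].tolist())    # [2, 3]

# The how parameter decides which keys are kept.
row_counts = {
    how: len(pd.merge(df_c, df_d, how=how, on="Id"))
    for how in ("inner", "outer", "left", "right")
}
print(row_counts)  # {'inner': 2, 'outer': 4, 'left': 3, 'right': 3}
```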
@@ -140,24 +150,27 @@ df_cd
 ~~~
 {: .language-python}

-In fact, if there is only one column with the same name in each Dataframe, it will be assumed to be the one you want to join on. In this example the `Id` column
+In fact, if there is only one column with the same name in each Dataframe, it will be assumed to be the one you want to
+join on. In this example the `Id` column

-Leaving the join column to default in this way is not best practice. It is better to explicitly name the column using the `on` parameter.
+Leaving the join column to default in this way is not best practice. It is better to explicitly name the column using
+the `on` parameter.

 ~~~
 df_cd = pd.merge(df_SN7577i_c, df_SN7577i_d, how='inner', on = 'Id')
 ~~~
 {: .language-python}

-In many circumstances, the column names that you wish to join on are not the same in both Dataframes, in which case you can use the `left_on` and `right_on` parameters to specify them separately.
+In many circumstances, the column names that you wish to join on are not the same in both Dataframes, in which case you
+can use the `left_on` and `right_on` parameters to specify them separately.

 ~~~
 df_cd = pd.merge(df_SN7577i_c, df_SN7577i_d, how='inner', left_on = 'Id', right_on = 'Id')
 ~~~
 {: .language-python}


-> ## Exercises
+> ## Practice with data
 >
 > 1. Examine the contents of the `SN7577i_aa` and `SN7577i_bb` csv files using Excel or equivalent.
 > 2. Using the `SN7577i_aa` and `SN7577i_bb` csv files, create a Dataframe which is the result of an outer join using the `Id` column to join on.
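
Reviewer's sketch for the `left_on`/`right_on` paragraph in the hunk above: the lesson's own example passes `'Id'` on both sides, so it cannot show what happens when the names really differ. The variant below assumes a hypothetical key column called `RespondentId` in the right-hand frame; that name and all values are invented for illustration.

```python
import pandas as pd

df_c = pd.DataFrame({"Id": [1, 2, 3], "Q1": [1, 3, 2]})
# Hypothetical: the same key stored under a different column name.
df_d = pd.DataFrame({"RespondentId": [2, 3, 4], "Q4": [7, 8, 9]})

# Name the key column on each side separately.
merged = pd.merge(df_c, df_d, how="inner", left_on="Id", right_on="RespondentId")

# Both key columns are kept because their names differ.
print(merged.columns.tolist())  # ['Id', 'Q1', 'RespondentId', 'Q4']

# Drop the redundant copy if it is not needed.
tidy = merged.drop(columns="RespondentId")
print(tidy.columns.tolist())  # ['Id', 'Q1', 'Q4']

# When both sides use the same name, as in the lesson's example,
# only one key column appears in the result.
same_name = pd.merge(
    df_c, df_d.rename(columns={"RespondentId": "Id"}),
    how="inner", left_on="Id", right_on="Id",
)
print(same_name.columns.tolist())  # ['Id', 'Q1', 'Q4']
```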
