_episodes/11-joins.md
21 additions & 8 deletions
@@ -82,12 +82,17 @@ df_all_rows
82
82
~~~
83
83
{: .language-python}
84
84
85
-
We didn't explicitly set an index for any of the Dataframes we have used. For `df_SN7577i_a` and `df_SN7577i_b` default indexes would have been created by pandas. When we concatenated the Dataframes the indexes were also concatenated resulting in duplicate entries.
85
+
We didn't explicitly set an index for any of the Dataframes we have used. For `df_SN7577i_a` and `df_SN7577i_b` default
86
+
indexes would have been created by pandas. When we concatenated the Dataframes, the indexes were also concatenated, resulting in duplicate index values.
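The duplicate index values described above can be seen in a minimal sketch. The two small Dataframes here (`df_a`, `df_b`) and their values are hypothetical stand-ins for `df_SN7577i_a` and `df_SN7577i_b`, not the lesson's data.

```python
import pandas as pd

# Hypothetical stand-ins for df_SN7577i_a and df_SN7577i_b (made-up values).
df_a = pd.DataFrame({"Id": [1, 2, 3], "Q1": [1, 3, 10]})
df_b = pd.DataFrame({"Id": [4, 5, 6], "Q1": [2, 9, 1]})

# Each Dataframe gets a default index 0, 1, 2; concat keeps both sets of labels.
df_all_rows = pd.concat([df_a, df_b])
index_labels = df_all_rows.index.tolist()

# Looking up a duplicated label returns every row that carries it.
rows_at_zero = df_all_rows.loc[0]

print(index_labels)
print(rows_at_zero)
```

Because the label `0` occurs once in each input, `df_all_rows.loc[0]` returns two rows rather than one.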
86
87
87
88
This is really only a problem if you need to access a row by its index. We can fix the problem with the following code.
88
89
89
90
~~~
90
91
df_all_rows = df_all_rows.reset_index(drop=True)
92
+
93
+
# or, alternatively, there's the `ignore_index` option in the `pd.concat()` function:
In this case `df_SN7577i_aa` has no Q4 column and `df_SN7577i_bb` has no `Q3` column. When they are concatenated, the resulting Dataframe has a column for `Q3` and `Q4`. For the rows corresponding to `df_SN7577i_aa` the values in the `Q4` column are missing and denoted by `NaN`. The same applies to `Q3` for the `df_SN7577i_bb` rows.
110
+
In this case `df_SN7577i_aa` has no `Q4` column and `df_SN7577i_bb` has no `Q3` column. When they are concatenated, the
111
+
resulting Dataframe has a column for `Q3` and `Q4`. For the rows corresponding to `df_SN7577i_aa` the values in the `Q4`
112
+
column are missing and denoted by `NaN`. The same applies to `Q3` for the `df_SN7577i_bb` rows.
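A small sketch of this behaviour, assuming two hypothetical Dataframes (`df_aa`, `df_bb`) with made-up values in place of `df_SN7577i_aa` and `df_SN7577i_bb`:

```python
import pandas as pd

# Hypothetical stand-ins: df_aa lacks Q4, df_bb lacks Q3 (made-up values).
df_aa = pd.DataFrame({"Id": [1, 2], "Q1": [1, 3], "Q3": [5, 6]})
df_bb = pd.DataFrame({"Id": [3, 4], "Q1": [2, 9], "Q4": [7, 8]})

# The result has the union of the columns; cells with no source value become NaN.
df_all = pd.concat([df_aa, df_bb], ignore_index=True)

print(df_all)
print(df_all["Q4"].isna().tolist())  # True for the rows that came from df_aa
```

Note that a column that receives `NaN` values is typically converted to a floating point type, since `NaN` is itself a float.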
106
113
107
114
108
115
### Scenario 2 - Adding the columns from one Dataframe to those of another Dataframe
@@ -115,15 +122,18 @@ df_all_cols
115
122
~~~
116
123
{: .language-python}
117
124
118
-
We use the `axis=1` parameter to indicate that it is the columns that need to be joined together. Notice that the `Id` column appears twice, because it was a column in each dataset. This is not particularly desirable, but also not necessarily a problem. However, there are better ways of combining columns from two Dataframes which avoid this problem.
125
+
We use the `axis=1` parameter to indicate that it is the columns that need to be joined together. Notice that the `Id`
126
+
column appears twice, because it was a column in each dataset. This is not particularly desirable, but also not
127
+
necessarily a problem. However, there are better ways of combining columns from two Dataframes which avoid this problem.
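The repeated `Id` column can be reproduced with a minimal sketch. The Dataframes `df_c` and `df_d` below are hypothetical, with made-up values, and only illustrate the effect of `axis=1`.

```python
import pandas as pd

# Hypothetical Dataframes that share an Id column (made-up values).
df_c = pd.DataFrame({"Id": [1, 2, 3], "Q1": [1, 3, 10]})
df_d = pd.DataFrame({"Id": [1, 2, 3], "Q2": [4, 5, 6]})

# axis=1 places the columns side by side, aligning rows on the index.
df_all_cols = pd.concat([df_c, df_d], axis=1)
column_names = df_all_cols.columns.tolist()

print(column_names)  # Id appears twice
```

One consequence is that `df_all_cols["Id"]` selects both `Id` columns and returns a Dataframe rather than a single Series.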
119
128
120
129
### Scenario 3 - Using merge to join columns
121
130
122
131
We can join columns from two Dataframes using the `merge()` function. This is similar to the SQL 'join' functionality.
123
132
124
133
A detailed discussion of different join types is given in the [SQL lesson](./episodes/sql...).
125
134
126
-
You specify the type of join you want using the `how` parameter. The default is the `inner` join which returns the columns from both tables where the `key` or common column values match in both Dataframes.
135
+
You specify the type of join you want using the `how` parameter. The default is the `inner` join which returns the
136
+
columns from both tables for those rows where the `key` (common column) values match in both Dataframes.
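As a rough illustration of how the `how` parameter changes the result, the sketch below merges two hypothetical Dataframes (`df_c`, `df_d`, made-up values) whose `Id` values only partly overlap.

```python
import pandas as pd

# Hypothetical Dataframes: Id 1 exists only on the left, Id 4 only on the right.
df_c = pd.DataFrame({"Id": [1, 2, 3], "Q1": [1, 3, 10]})
df_d = pd.DataFrame({"Id": [2, 3, 4], "Q2": [5, 6, 7]})

# Default how='inner': only Ids present in both Dataframes are kept.
inner = pd.merge(df_c, df_d, on="Id")

# how='left': every row of the left Dataframe is kept; unmatched Q2 becomes NaN.
left = pd.merge(df_c, df_d, how="left", on="Id")

# how='outer': every Id from either Dataframe is kept.
outer = pd.merge(df_c, df_d, how="outer", on="Id")

print(inner)
print(left)
print(outer)
```

With this data the inner join keeps two rows, the left join three, and the outer join four.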
127
137
128
138
The possible values of the `how` parameter are shown in the picture below (taken from the pandas documentation).
129
139
@@ -140,24 +150,27 @@ df_cd
140
150
~~~
141
151
{: .language-python}
142
152
143
-
In fact, if there is only one column with the same name in each Dataframe, it will be assumed to be the one you want to join on. In this example the `Id` column
153
+
In fact, if there is only one column with the same name in each Dataframe, it will be assumed to be the one you want to
154
+
join on. In this example, that is the `Id` column.
144
155
145
-
Leaving the join column to default in this way is not best practice. It is better to explicitly name the column using the `on` parameter.
156
+
Leaving the join column to default in this way is not best practice. It is better to explicitly name the column using
157
+
the `on` parameter.
146
158
147
159
~~~
148
160
df_cd = pd.merge(df_SN7577i_c, df_SN7577i_d, how='inner', on='Id')
149
161
~~~
150
162
{: .language-python}
151
163
152
-
In many circumstances, the column names that you wish to join on are not the same in both Dataframes, in which case you can use the `left_on` and `right_on` parameters to specify them separately.
164
+
In many circumstances, the column names that you wish to join on are not the same in both Dataframes, in which case you
165
+
can use the `left_on` and `right_on` parameters to specify them separately.
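A minimal sketch of `left_on` and `right_on`, assuming two hypothetical Dataframes in which the key is called `Id` on one side and `RespondentId` on the other (both the column name `RespondentId` and the values are made up for illustration):

```python
import pandas as pd

# Hypothetical Dataframes whose key columns have different names.
df_left = pd.DataFrame({"Id": [1, 2, 3], "Q1": [1, 3, 10]})
df_right = pd.DataFrame({"RespondentId": [1, 2, 3], "Q2": [5, 6, 7]})

# Name the key column of each Dataframe separately.
df_joined = pd.merge(
    df_left,
    df_right,
    how="inner",
    left_on="Id",
    right_on="RespondentId",
)

print(df_joined)
```

Both key columns are kept in the result, so one of them can be dropped afterwards if it is not needed.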