# A collection of R packages designed for data science
library(tidyverse)
<- read_csv("data/surveys.csv")
surveys <- read_csv("data/plots.csv") plot_types
13 Combining data
- Learn how to join two tables together.
- Be able to distinguish between different types of joins.
- Use filtering joins to identify (mis)matches between tables.
13.1 Context
Data is often split over multiple tables. We saw this in the previous section. Sometimes we need to combine information from multiple sources.
13.2 Section setup
We’ll continue this section with the script named da4-10-combining-data.R
. If needed, add the following code to the top of your script and run it.
We’ll continue this section with the script named da4-10-combining-data.py
. If needed, add the following code to the top of your script and run it.
# A Python data analysis and manipulation tool
import pandas as pd
# Python equivalent of `ggplot2`
from plotnine import *
= pd.read_csv("data/surveys.csv")
surveys = pd.read_csv("data/plots.csv") plot_types
13.3 Joining tables
We’ve already seen that some of our data sets have an identifier, for example the record_id
in the surveys
data set.
When it comes to joining data, these identifiers become extra important. After all, we need to be able to tell the computer how to join the data. We do this through a common key or variable between two data sets. We’ll illustrate how this works below.
13.3.1 Setting up your joins
There are different ways you can join tables, depending on which data you’d like to retain. The way these joins are named often depend on the direction in which you are joining. Let’s look at this in more detail, using examples.
<- read_csv("data/plots.csv") plot_types
Rows: 5 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): plot_type
dbl (1): plot_id
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let’s look at the data. We can see that there are five distinct plot types encoded in these data, using plot_id
as a key.
plot_types
# A tibble: 5 × 2
plot_id plot_type
<dbl> <chr>
1 1 Spectab exclosure
2 2 Control
3 3 Long-term Krat Exclosure
4 4 Rodent Exclosure
5 5 Short-term Krat Exclosure
Now let’s look at our surveys
data set. We know we have a plot_id
column there, too. Let’s check how many times the different plot_id
values occur in our data. For this, we can use the count()
function:
|> count(plot_id) surveys
# A tibble: 24 × 2
plot_id n
<dbl> <int>
1 1 1995
2 2 2194
3 3 1828
4 4 1969
5 5 1194
6 6 1582
7 7 816
8 8 1891
9 9 1936
10 10 469
# ℹ 14 more rows
This shows that there are 24 plot_id
values (there are 24 rows in the output). So, the plot_types
data set won’t contain information on all of these, since it only contains 5 distinct plots.
= pd.read_csv("data/plots.csv") plot_types
Let’s look at the data. We can see that there are five distinct plot types encoded in these data, using plot_id
as a key.
plot_types
plot_id plot_type
0 1 Spectab exclosure
1 2 Control
2 3 Long-term Krat Exclosure
3 4 Rodent Exclosure
4 5 Short-term Krat Exclosure
Now let’s look at our surveys
data set. We know we have a plot_id
column there, too. Let’s check how many times the different plot_id
values occur in our data.
'plot_id').size().reset_index(name='n') surveys.groupby(
plot_id n
0 1 1995
1 2 2194
2 3 1828
3 4 1969
4 5 1194
5 6 1582
6 7 816
7 8 1891
8 9 1936
9 10 469
10 11 1918
11 12 2365
12 13 1538
13 14 1885
14 15 1069
15 16 646
16 17 2039
17 18 1445
18 19 1189
19 20 1390
20 21 1173
21 22 1399
22 23 571
23 24 1048
This shows that there are 24 plot_id
values (there are 24 rows in the output). So, the plot_types
data set won’t contain information on all of these, since it only contains 5 distinct plots.
13.3.2 Left joins
We’ll be adding the data from plot_types
to the existing surveys
data. This uses the following principle:
This means that any plot_id
that appears in surveys
, but isn’t present in plot_types
will be empty or missing for its plot_type
value.
left_join(surveys, plot_types, by = "plot_id")
# A tibble: 35,549 × 10
record_id month day year plot_id species_id sex hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 1 7 16 1977 2 NL M 32 NA
2 2 7 16 1977 3 NL M 33 NA
3 3 7 16 1977 2 DM F 37 NA
4 4 7 16 1977 7 DM M 36 NA
5 5 7 16 1977 3 DM M 35 NA
6 6 7 16 1977 1 PF M 14 NA
7 7 7 16 1977 2 PE F NA NA
8 8 7 16 1977 1 DM M 37 NA
9 9 7 16 1977 1 DM F 34 NA
10 10 7 16 1977 6 PF F 20 NA
# ℹ 35,539 more rows
# ℹ 1 more variable: plot_type <chr>
Let’s assign that output to an object called surveys_left
.
<- left_join(surveys, plot_types, by = "plot_id") surveys_left
= pd.merge(surveys, plot_types, how = "left", on = "plot_id") surveys_left
surveys_left
record_id month ... weight plot_type
0 1 7 ... NaN Control
1 2 7 ... NaN Long-term Krat Exclosure
2 3 7 ... NaN Control
3 4 7 ... NaN NaN
4 5 7 ... NaN Long-term Krat Exclosure
... ... ... ... ... ...
35544 35545 12 ... NaN NaN
35545 35546 12 ... NaN NaN
35546 35547 12 ... 14.0 NaN
35547 35548 12 ... 51.0 NaN
35548 35549 12 ... NaN Short-term Krat Exclosure
[35549 rows x 10 columns]
Having this information now allows us to plot the data by plot_type
in a much more meaningful way than if we would have used plot_id
. For example, let’s look at the hindfoot_length
for each plot_type
.
= (ggplot(surveys_left, aes(x = "plot_type", y = "hindfoot_length")) +
p
geom_boxplot())
p.show()
/Users/martinvanrongen/miniforge3/envs/mv372/lib/python3.13/site-packages/plotnine/layer.py:284: PlotnineWarning: stat_boxplot : Removed 4111 rows containing non-finite values.
13.3.3 Right joins
right_join(surveys, plot_types, by = "plot_id")
# A tibble: 9,180 × 10
record_id month day year plot_id species_id sex hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 1 7 16 1977 2 NL M 32 NA
2 2 7 16 1977 3 NL M 33 NA
3 3 7 16 1977 2 DM F 37 NA
4 5 7 16 1977 3 DM M 35 NA
5 6 7 16 1977 1 PF M 14 NA
6 7 7 16 1977 2 PE F NA NA
7 8 7 16 1977 1 DM M 37 NA
8 9 7 16 1977 1 DM F 34 NA
9 11 7 16 1977 5 DS F 53 NA
10 13 7 16 1977 3 DM M 35 NA
# ℹ 9,170 more rows
# ℹ 1 more variable: plot_type <chr>
= pd.merge(surveys, plot_types, how = "right", on = "plot_id") surveys_right
Let’s see how many rows we’ve retained.
len(surveys_right)
9180
We can see that we have far fewer rows left (9180) than in the full data set (35549). This is because all the rows where there isn’t a match for plot_id
in the surveys
data set are dropped.
13.3.4 Inner joins
inner_join(surveys, plot_types, by = "plot_id")
# A tibble: 9,180 × 10
record_id month day year plot_id species_id sex hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 1 7 16 1977 2 NL M 32 NA
2 2 7 16 1977 3 NL M 33 NA
3 3 7 16 1977 2 DM F 37 NA
4 5 7 16 1977 3 DM M 35 NA
5 6 7 16 1977 1 PF M 14 NA
6 7 7 16 1977 2 PE F NA NA
7 8 7 16 1977 1 DM M 37 NA
8 9 7 16 1977 1 DM F 34 NA
9 11 7 16 1977 5 DS F 53 NA
10 13 7 16 1977 3 DM M 35 NA
# ℹ 9,170 more rows
# ℹ 1 more variable: plot_type <chr>
= pd.merge(surveys, plot_types, how = "inner", on = "plot_id") surveys_inner
Let’s see how many rows we’ve retained.
len(surveys_inner)
9180
This actually gives the same result as with the right join (9180 rows), which is because there aren’t any rows in plot_type
that don’t have a match in surveys
.
13.3.5 Full joins
full_join(surveys, plot_types, by = "plot_id")
# A tibble: 35,549 × 10
record_id month day year plot_id species_id sex hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 1 7 16 1977 2 NL M 32 NA
2 2 7 16 1977 3 NL M 33 NA
3 3 7 16 1977 2 DM F 37 NA
4 4 7 16 1977 7 DM M 36 NA
5 5 7 16 1977 3 DM M 35 NA
6 6 7 16 1977 1 PF M 14 NA
7 7 7 16 1977 2 PE F NA NA
8 8 7 16 1977 1 DM M 37 NA
9 9 7 16 1977 1 DM F 34 NA
10 10 7 16 1977 6 PF F 20 NA
# ℹ 35,539 more rows
# ℹ 1 more variable: plot_type <chr>
The “full” join in pandas’ merge()
function is referred to as "outer"
.
= pd.merge(surveys, plot_types, how = "outer", on = "plot_id") surveys_full
Let’s see how many rows we’ve retained.
len(surveys_full)
35549
This again gives us our entire surveys
data set, including the additional plot_type
information where available. The reason why this is not different to the left join is because there are no rows in plot_types
that do not have a match in surveys
.
13.3.6 Filtering joins
There is one last set of joins we haven’t discussed yet: filtering joins. These can be really helpful if you’re comparing two tables and want to specifically extract the rows that are either present (semi-join) or absent (anti-join) in the other table.
Semi-join:
semi_join(surveys, plot_types, by = "plot_id")
# A tibble: 9,180 × 9
record_id month day year plot_id species_id sex hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 1 7 16 1977 2 NL M 32 NA
2 2 7 16 1977 3 NL M 33 NA
3 3 7 16 1977 2 DM F 37 NA
4 5 7 16 1977 3 DM M 35 NA
5 6 7 16 1977 1 PF M 14 NA
6 7 7 16 1977 2 PE F NA NA
7 8 7 16 1977 1 DM M 37 NA
8 9 7 16 1977 1 DM F 34 NA
9 11 7 16 1977 5 DS F 53 NA
10 13 7 16 1977 3 DM M 35 NA
# ℹ 9,170 more rows
Anti-join:
anti_join(surveys, plot_types, by = "plot_id")
# A tibble: 26,369 × 9
record_id month day year plot_id species_id sex hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 4 7 16 1977 7 DM M 36 NA
2 10 7 16 1977 6 PF F 20 NA
3 12 7 16 1977 7 DM M 38 NA
4 14 7 16 1977 8 DM <NA> NA NA
5 15 7 16 1977 6 DM F 36 NA
6 20 7 17 1977 11 DS F 48 NA
7 21 7 17 1977 14 DM F 34 NA
8 22 7 17 1977 15 NL F 31 NA
9 23 7 17 1977 13 DM M 36 NA
10 24 7 17 1977 13 SH M 21 NA
# ℹ 26,359 more rows
Semi-join:
"plot_id"].isin(plot_types["plot_id"])] surveys[surveys[
record_id month day year ... species_id sex hindfoot_length weight
0 1 7 16 1977 ... NL M 32.0 NaN
1 2 7 16 1977 ... NL M 33.0 NaN
2 3 7 16 1977 ... DM F 37.0 NaN
4 5 7 16 1977 ... DM M 35.0 NaN
5 6 7 16 1977 ... PF M 14.0 NaN
... ... ... ... ... ... ... ... ... ...
35498 35499 12 31 2002 ... PB F 27.0 28.0
35499 35500 12 31 2002 ... PB F 25.0 28.0
35500 35501 12 31 2002 ... DO M 33.0 48.0
35501 35502 12 31 2002 ... PP M 21.0 16.0
35548 35549 12 31 2002 ... NaN NaN NaN NaN
[9180 rows x 9 columns]
Anti-join:
~surveys["plot_id"].isin(plot_types["plot_id"])] surveys[
record_id month day year ... species_id sex hindfoot_length weight
3 4 7 16 1977 ... DM M 36.0 NaN
9 10 7 16 1977 ... PF F 20.0 NaN
11 12 7 16 1977 ... DM M 38.0 NaN
13 14 7 16 1977 ... DM NaN NaN NaN
14 15 7 16 1977 ... DM F 36.0 NaN
... ... ... ... ... ... ... ... ... ...
35543 35544 12 31 2002 ... US NaN NaN NaN
35544 35545 12 31 2002 ... AH NaN NaN NaN
35545 35546 12 31 2002 ... AH NaN NaN NaN
35546 35547 12 31 2002 ... RM F 15.0 14.0
35547 35548 12 31 2002 ... DO M 36.0 51.0
[26369 rows x 9 columns]
13.4 Summary
- We can use left, right, inner and outer/full joins to merge two tables together.
- There needs to be at least one matching column between the two tables that can be used as a key to link them.
- We can also use filter joins to identify rows in a table that are present/absent in the other