library(tidyverse)
23 Combining and joining
Often, as data journalists, we’re looking at data across time or at data stored in multiple tables. To do that, we often need to merge that data together.
Depending on what we have, we may just need to stack datasets on top of each other to make new data. If we have 2019 data and 2018 data and we want that to be one file, we stack them. If we have a dataset of cows in counties and a dataset of populations by county, we’re going to join those two together on the county – the common element.
Let’s explore.
23.1 Combining data (stacking)
Let’s say that we have Maryland county voter registration data from five different elections in five different files. They have essentially the same record layout and cover Maryland’s counties (plus Baltimore City). We can combine them into a single dataframe.
Let’s do what we need to import them properly. I’ve merged it all into one step for each of the datasets.
county_voters_2016 <- read_csv("data/county_voters_2016.csv")
Rows: 24 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): COUNTY
dbl (8): YEAR, DEM, REP, LIB, GRN, UNA, OTH, TOTAL
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
county_voters_2018 <- read_csv("data/county_voters_2018.csv")
Rows: 24 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): COUNTY
dbl (8): YEAR, DEM, REP, LIB, GRN, UNA, OTH, TOTAL
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
county_voters_2020 <- read_csv("data/county_voters_2020.csv")
Rows: 24 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): COUNTY
dbl (8): YEAR, DEM, REP, GRN, LIB, OTH, UNA, TOTAL
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
county_voters_2022 <- read_csv("data/county_voters_2022.csv")
Rows: 24 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): COUNTY
dbl (8): YEAR, DEM, REP, GRN, LIB, UNA, OTH, TOTAL
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
county_voters_2024 <- read_csv("data/county_voters_2024.csv")
Rows: 25 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): COUNTY
dbl (7): YEAR, DEM, REP, LIB, UNA, OTH, TOTAL
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
These datasets have the same column names containing identically formatted data, with one small exception: the 2024 file doesn’t have a GRN column. That’s fine – when we stack them, any column missing from one file will simply be filled with NA. To merge them together and compare them over time, we need to stack them. The verb here, in R, is bind_rows. You tell the function what you want to combine and it does it, as long as the columns the tables share have the same names and formats.
Since we have five dataframes, we’re going to need to pass them as a list, meaning they’ll be enclosed inside the list function.
county_voters_combined <- bind_rows(list(county_voters_2016, county_voters_2018, county_voters_2020, county_voters_2022, county_voters_2024))
And boom, like that, we have 121 rows of data in a single dataframe instead of five separate ones. Now we can ask more interesting questions, like how a county’s registration patterns have changed over time.
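If you’re wondering what happened to the GRN column that’s missing from the 2024 file, here’s a minimal sketch with two tiny made-up dataframes (the values are invented for illustration): bind_rows simply fills any column that’s absent from one input with NA.
df_a <- tibble(COUNTY = "Example", YEAR = 2022, GRN = 12)
df_b <- tibble(COUNTY = "Example", YEAR = 2024)
bind_rows(df_a, df_b)
# The 2024 row gets NA in the GRN column instead of causing an error.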
There are plenty of uses for bind_rows: any regularly updated data that comes in the same format, like crime reports or award recipients or player game statistics. Or election results.
23.2 Joining data
More complicated is when you have two separate tables that are connected by a common element or elements. But there’s a verb for that, too: join.
Let’s start by reading in some Maryland 2020 county population data:
maryland_population <- read_csv('data/maryland_population_2020.csv')
Rows: 24 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): COUNTY
dbl (1): POP2020
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
One of the columns we have is called COUNTY, which also appears in our county_voters_2020 dataframe.
To put the Maryland population data and voter registration data together, we need to use something called a join. There are different kinds of joins. It helps to think of two tables sitting next to each other. A left_join takes all the records from the left table and only the records that match in the right one. A right_join does the opposite: it keeps everything from the right table and only the matches from the left. An inner_join keeps only the records that match in both tables. There’s one other join – a full_join, which returns all rows of both regardless of whether there’s a match – but I’ve never once had a use for a full join.
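Here’s a minimal sketch of the differences, using two tiny made-up tables (the county names and numbers are invented for illustration):
left_tbl <- tibble(COUNTY = c("Kent", "Talbot"), POP = c(100, 200))
right_tbl <- tibble(COUNTY = c("Kent", "Howard"), DEM = c(10, 30))

left_join(left_tbl, right_tbl, by = "COUNTY")   # Kent and Talbot; Talbot gets NA for DEM
right_join(left_tbl, right_tbl, by = "COUNTY")  # Kent and Howard; Howard gets NA for POP
inner_join(left_tbl, right_tbl, by = "COUNTY")  # just Kent, the only match
full_join(left_tbl, right_tbl, by = "COUNTY")   # all three counties, with NAs where needed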
In the best-case scenario, the two tables we want to join share a common column. In this case, both of our tables have a column called COUNTY with the same characteristics: the values in both look identical, including how they distinguish Baltimore City from Baltimore County. This is important, because joins work on exact matches.
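Exact means exact: different capitalization or a stray space is enough to break a match. A quick made-up illustration, using stringr functions that load with the tidyverse:
"Baltimore City" == "BALTIMORE CITY "                           # FALSE: case and trailing space differ
str_to_title(str_trim("BALTIMORE CITY ")) == "Baltimore City"   # TRUE once standardized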
We can do this join multiple ways and get a similar result. We can put the population file on the left and the registration data on the right and use a left join to get them all together, using by= to join on the correct column. I’m going to count the rows at the end, and the reason is important: Rule 1 in joining data is to have an idea of what you expect to get. With a left join and population on the left, I start with 24 rows, so I expect to get 24 rows when I’m done.
maryland_population |> left_join(county_voters_2020, by="COUNTY") |> nrow()
[1] 24
Remove the nrow and run it again for yourself. By default, dplyr will do a “natural” join, matching on all the columns the two tables have in common. So if we take out the by=, it’ll use every common column. That may not be right in every instance, but let’s try it. If it works, we should still get 24 rows.
maryland_population |> left_join(county_voters_2020)
Joining with `by = join_by(COUNTY)`
# A tibble: 24 × 10
COUNTY POP2020 YEAR DEM REP GRN LIB OTH UNA TOTAL
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Allegany 68106 2020 12820 22530 74 204 434 7674 43736
2 Anne Arundel 588261 2020 174494 135457 564 1922 3017 90162 405616
3 Baltimore City 585708 2020 311610 30163 802 951 3709 52450 399685
4 Baltimore County 854535 2020 313870 142534 898 2227 6303 100576 566408
5 Calvert 92783 2020 24587 28181 89 332 617 14178 67984
6 Caroline 33293 2020 6629 10039 33 86 182 4208 21177
7 Carroll 172891 2020 33662 63967 155 670 1137 25770 125361
8 Cecil 103725 2020 21601 30880 103 341 784 15110 68819
9 Charles 166617 2020 72416 24711 112 349 865 19849 118302
10 Dorchester 32531 2020 9848 8730 19 78 164 3348 22187
# ℹ 14 more rows
Since the two tables only have one column in common, the join used just that column, and we got the same answer. If the tables had more columns in common, you’d see columns with .x and .y after them in your results - that’s a sign of duplicate columns between the two tables, and you may decide you don’t need both moving forward.
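As a quick made-up illustration of those suffixes: if both tables had a YEAR column and we joined only on COUNTY, dplyr would keep both copies and label them so you can tell them apart.
a <- tibble(COUNTY = "Kent", YEAR = 2020, POP = 100)
b <- tibble(COUNTY = "Kent", YEAR = 2020, DEM = 10)
left_join(a, b, by = "COUNTY")   # the result has YEAR.x and YEAR.y columns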
Let’s save our joined data to a new dataframe so we can keep working with it.
maryland_population_with_voters <- maryland_population |> left_join(county_voters_2020)
Joining with `by = join_by(COUNTY)`
Now, with our joined data, we can answer questions in a more useful way. But joins can do even more than just bring data together; they can include additional data to enable you to ask more sophisticated questions. Right now we have registered voters and total population. But we can do more.
Let’s try adding more Maryland demographic data to the mix. Using a file describing the 18-and-over population (from which eligible voters come) from the state’s data catalog, we can read it into R:
maryland_demographics <- read_csv('data/maryland_demographics.csv')
Rows: 24 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): NAME
dbl (10): GEOCODE, pop_18_over, pop_one_race, pop_white, pop_black, pop_nati...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Again, we can use a left_join to make our demographic data available. This time we’ll need to specify the two columns to join on because they do not have identical names: we’ll use COUNTY from our population data and NAME from the demographic data. The order matters - the first column is from the dataframe you name first.
maryland_population_with_voters_and_demographics <- maryland_population_with_voters |> left_join(maryland_demographics, by=c("COUNTY"="NAME"))
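As an aside, newer versions of dplyr (1.1.0 and later) also accept the join_by() helper you see echoed in the join messages above; the sketch below should be equivalent to the by=c("COUNTY"="NAME") form used here.
maryland_population_with_voters |>
  left_join(maryland_demographics, by = join_by(COUNTY == NAME))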
Now we’ve got population data and demographic data by county, which means we can draw on both datasets when asking questions. For example, we can find the counties where Black residents make up the largest share of the 18-and-over population and, alongside that, the percentage of registered Democrats in each county.
We can get this by using mutate and arrange:
maryland_population_with_voters_and_demographics |>
  mutate(pct_black_18_plus = (pop_black/pop_18_over)*100, pct_dems = (DEM/TOTAL)*100) |>
  arrange(desc(pct_black_18_plus)) |>
  select(COUNTY, pct_black_18_plus, pct_dems)
# A tibble: 24 × 3
COUNTY pct_black_18_plus pct_dems
<chr> <dbl> <dbl>
1 Prince George's 60.9 78.3
2 Baltimore City 56.3 78.0
3 Charles 48.2 61.2
4 Somerset 39.0 41.8
5 Baltimore County 28.8 55.4
6 Dorchester 26.2 44.4
7 Wicomico 25.6 42.3
8 Howard 18.7 52.4
9 Montgomery 18.1 61.0
10 Anne Arundel 17.4 43.0
# ℹ 14 more rows
If you know Maryland political demographics, this result isn’t too surprising, but Somerset County - the state’s second-smallest in terms of population - stands out: the Black share of its 18-and-over population is larger than in Baltimore County or Montgomery County.
Let’s change that to look at Asian population:
maryland_population_with_voters_and_demographics |>
  mutate(pct_asian_18_plus = (pop_asian/pop_18_over)*100, pct_dems = (DEM/TOTAL)*100) |>
  arrange(desc(pct_asian_18_plus)) |>
  select(COUNTY, pct_asian_18_plus, pct_dems)
# A tibble: 24 × 3
COUNTY pct_asian_18_plus pct_dems
<chr> <dbl> <dbl>
1 Howard 19.4 52.4
2 Montgomery 16.0 61.0
3 Baltimore County 6.34 55.4
4 Frederick 4.88 38.9
5 Prince George's 4.68 78.3
6 Anne Arundel 4.52 43.0
7 Baltimore City 4.17 78.0
8 Charles 3.55 61.2
9 Harford 3.15 35.4
10 St. Mary's 3.13 35.7
# ℹ 14 more rows
Here, Howard and Montgomery County stand out in terms of the percentage of Asian population 18 and over. The jurisdictions with the highest percentage of Democrats - Prince George’s and Baltimore City - have small Asian populations.
Sometimes joins look like they should work but don’t. Often this is because the two columns you’re joining on have different data types: joining a numeric column to a character column, for example, will fail even when the values look identical. Here’s an example using a file of Maryland ZIP Code Tabulation Areas:
maryland_zcta <- read_csv('data/maryland_zcta.csv')
Rows: 468 Columns: 40
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): FIRST_CLAS, FIRST_MTFC, FIRST_FUNC, REPORT_2_P, REPORT_9_P
dbl (35): OBJECTID_1, ZCTA5CE10, FIRST_STAT, FIRST_GEOI, ZCTA5N, STATE, AREA...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(maryland_zcta)
Rows: 468
Columns: 40
$ OBJECTID_1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ ZCTA5CE10 <dbl> 20601, 20602, 20603, 20606, 20607, 20608, 20609, 20611, 206…
$ FIRST_STAT <dbl> 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,…
$ FIRST_GEOI <dbl> 2420601, 2420602, 2420603, 2420606, 2420607, 2420608, 24206…
$ FIRST_CLAS <chr> "B5", "B5", "B5", "B5", "B5", "B5", "B5", "B5", "B5", "B5",…
$ FIRST_MTFC <chr> "G6350", "G6350", "G6350", "G6350", "G6350", "G6350", "G635…
$ FIRST_FUNC <chr> "S", "S", "S", "S", "S", "S", "S", "S", "S", "S", "S", "S",…
$ ZCTA5N <dbl> 20601, 20602, 20603, 20606, 20607, 20608, 20609, 20611, 206…
$ STATE <dbl> 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,…
$ AREALAND <dbl> 115635266, 35830723, 44239637, 7501011, 54357590, 45583064,…
$ AREAWATR <dbl> 387684, 352762, 219356, 1248760, 448221, 5330329, 6602735, …
$ POP100 <dbl> 24156, 24955, 28967, 431, 9802, 919, 1120, 1078, 261, 11860…
$ HU100 <dbl> 8722, 9736, 10317, 230, 3504, 426, 554, 413, 142, 4424, 204…
$ NHW <dbl> 9785, 8466, 9625, 377, 2165, 438, 1009, 798, 245, 4044, 352…
$ NHB <dbl> 11146, 13054, 15025, 45, 6321, 453, 82, 215, 12, 6786, 32, …
$ NHAI <dbl> 155, 116, 98, 1, 33, 5, 2, 5, 0, 106, 2, 32, 3, 4, 38, 8, 1…
$ NHA <dbl> 880, 731, 1446, 4, 560, 2, 1, 10, 0, 186, 3, 165, 5, 1, 402…
$ NHNH <dbl> 11, 15, 24, 0, 3, 0, 1, 0, 0, 4, 1, 2, 0, 0, 4, 1, 0, 3, 1,…
$ NHO <dbl> 48, 58, 65, 0, 6, 0, 0, 0, 0, 8, 0, 1, 0, 3, 5, 8, 0, 5, 10…
$ NHT <dbl> 849, 999, 1091, 0, 234, 9, 15, 33, 1, 321, 13, 213, 14, 4, …
$ HISP <dbl> 1282, 1516, 1593, 4, 480, 12, 10, 17, 3, 405, 2, 244, 9, 7,…
$ PNHW <dbl> 40.5, 33.9, 33.2, 87.5, 22.1, 47.7, 90.1, 74.0, 93.9, 34.1,…
$ PNHB <dbl> 46.1, 52.3, 51.9, 10.4, 64.5, 49.3, 7.3, 19.9, 4.6, 57.2, 7…
$ PNHAI <dbl> 0.6, 0.5, 0.3, 0.2, 0.3, 0.5, 0.2, 0.5, 0.0, 0.9, 0.5, 0.5,…
$ PNHA <dbl> 3.6, 2.9, 5.0, 0.9, 5.7, 0.2, 0.1, 0.9, 0.0, 1.6, 0.7, 2.8,…
$ PNHNH <dbl> 0.0, 0.1, 0.1, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.2, 0.0,…
$ PNHO <dbl> 0.2, 0.2, 0.2, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0,…
$ PNHT <dbl> 3.5, 4.0, 3.8, 0.0, 2.4, 1.0, 1.3, 3.1, 0.4, 2.7, 3.2, 3.6,…
$ PHISP <dbl> 5.3, 6.1, 5.5, 0.9, 4.9, 1.3, 0.9, 1.6, 1.1, 3.4, 0.5, 4.2,…
$ POP65_ <dbl> 1922, 1964, 1400, 108, 847, 173, 271, 129, 54, 1372, 73, 55…
$ PCTPOP65_ <dbl> 8.0, 7.9, 4.8, 25.1, 8.6, 18.8, 24.2, 12.0, 20.7, 11.6, 18.…
$ MEDAGE <dbl> 37.3, 32.6, 34.5, 49.1, 40.9, 46.6, 47.6, 44.3, 47.3, 40.8,…
$ VACNS <dbl> 376, 769, 531, 15, 172, 39, 32, 22, 14, 249, 18, 158, 8, 18…
$ PVACNS <dbl> 4.3, 7.9, 5.1, 6.5, 4.9, 9.2, 5.8, 5.3, 9.9, 5.6, 8.8, 7.2,…
$ PHOWN <dbl> 71.1, 59.7, 73.8, 49.7, 83.1, 60.4, 44.8, 63.8, 38.3, 73.9,…
$ PWOMORT <dbl> 11.2, 9.0, 4.7, 39.3, 10.3, 28.2, 38.7, 21.8, 43.9, 17.4, 2…
$ PRENT <dbl> 19.9, 34.4, 22.6, 18.1, 7.4, 15.9, 27.0, 18.3, 31.7, 10.5, …
$ PLT18SP <dbl> 30.4, 43.6, 29.9, 31.2, 22.1, 14.1, 28.9, 24.5, 43.9, 26.7,…
$ REPORT_2_P <chr> "http://mdpgis.mdp.state.md.us/Census2010/PDF/00_SF1DP_2Pro…
$ REPORT_9_P <chr> "http://mdpgis.mdp.state.md.us/census2010/PDF/00_SF1_9PROFI…
You can see that ZCTA5N, the column representing the Zip Code Tabulation Area, is a numeric column. But should it be? Do we ever want to know the average zip code in Maryland? Zip codes and ZCTAs look like numbers but really are character columns. Let’s change that so that we can be sure to join them correctly with other data where the zip codes are not numbers. We’ll use mutate:
maryland_zcta <- maryland_zcta |> mutate(across(ZCTA5N, as.character))
What’s happening here is that we’re telling R to take all of the values in the ZCTA5N column and convert them to character with as.character. If we wanted to change a column to numbers, we’d use as.numeric instead. When you join two dataframes, the join columns must be the same data type.
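To see why the data type matters, here’s a minimal made-up sketch: the commented-out join fails because one zip column is character and the other is numeric, and converting one side fixes it.
zips_chr <- tibble(zip = c("20601", "20602"), city = c("Waldorf", "Waldorf"))
zips_num <- tibble(zip = c(20601, 20602), pop = c(24156, 24955))

# left_join(zips_chr, zips_num, by = "zip")   # errors: incompatible types (character vs. double)

zips_num <- zips_num |> mutate(across(zip, as.character))
left_join(zips_chr, zips_num, by = "zip")     # works once both zip columns are character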
Joining datasets allows you to expand the range and sophistication of questions you’re able to ask. It is one of the most powerful tools in a journalist’s toolkit.