23  Combining and joining

Often, as data journalists, we’re looking at data across time or at data stored in multiple tables. And to do that, we need to often need to merge that data together.

Depending on what we have, we may just need to stack data on top of each other to make new data. If we have 2019 data and 2018 data and we want that to be one file, we stack them. If we have a dataset of cows in counties and a dataset of populations in county, we’re going to join those two together on the county – the common element.

Let’s explore.

23.1 Combining data (stacking)

Let’s say that we have Maryland county voter registration data from five different elections in five different files. They have the same record layout and the same number of counties (plus Baltimore City). We can combine them into a single dataframe.

Let’s do what we need to import them properly. I’ve merged it all into one step for each of the datasets.

library(tidyverse)
county_voters_2016 <- read_csv("data/county_voters_2016.csv")
Rows: 24 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): COUNTY
dbl (8): YEAR, DEM, REP, LIB, GRN, UNA, OTH, TOTAL

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
county_voters_2018 <- read_csv("data/county_voters_2018.csv")
Rows: 24 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): COUNTY
dbl (8): YEAR, DEM, REP, LIB, GRN, UNA, OTH, TOTAL

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
county_voters_2020 <- read_csv("data/county_voters_2020.csv")
Rows: 24 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): COUNTY
dbl (8): YEAR, DEM, REP, GRN, LIB, OTH, UNA, TOTAL

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
county_voters_2022 <- read_csv("data/county_voters_2022.csv")
Rows: 24 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): COUNTY
dbl (8): YEAR, DEM, REP, GRN, LIB, UNA, OTH, TOTAL

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
county_voters_2024 <- read_csv("data/county_voters_2024.csv")
Rows: 25 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): COUNTY
dbl (7): YEAR, DEM, REP, LIB, UNA, OTH, TOTAL

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

All of these datasets have the same number of columns, all with the same names, so if we want to merge them together to compare them over time, we need to stack them together. The verb here, in R, is bind_rows. You tell the function what you want to combine and it does it, assuming that you’ve got column names in common containing identically formatted data.

Since we have five dataframes, we’re going to need to pass them as a list, meaning they’ll be enclosed inside the list function.

county_voters_combined <- bind_rows(list(county_voters_2016, county_voters_2018, county_voters_2020, county_voters_2022, county_voters_2024))

And boom, like that, we have 125 rows of data together instead of five dataframes. Now we can ask more interesting questions like how a county’s registration patterns have changed over time.

There are plenty of uses for bind_rows: any regularly updated data that comes in the same format like crime reports or award recipients or player game statistics. Or election results.

23.2 Joining data

More complicated is when you have two separate tables that are connected by a common element or elements. But there’s a verb for that, too: join.

Let’s start by reading in some Maryland 2020 county population data:

maryland_population <- read_csv('data/maryland_population_2020.csv')
Rows: 24 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): COUNTY
dbl (1): POP2020

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

One of the columns we have is called county, which is what we have in our county_voters_2020 dataframe.

To put the Maryland population data and voter registration data together, we need to use something called a join. There are different kinds of joins. It’s better if you think of two tables sitting next to each other. A left_join takes all the records from the left table and only the records that match in the right one. A right_join does the same thing. An inner_join takes only the records where they are equal. There’s one other join – a full_join which returns all rows of both, regardless of if there’s a match – but I’ve never once had a use for a full join.

In the best-case scenario, the two tables we want to join share a common column. In this case, both of our tables have a column called county that has the same characteristics: values in both look identical, including how they distinguish Baltimore City from Baltimore County. This is important, because joins work on exact matches.

We can do this join multiple ways and get a similar result. We can put the population file on the left and the registration data on the right and use a left join to get them all together. And we use by= to join by the correct column. I’m going to count the rows at the end. The reason I’m doing this is important: Rule 1 in joining data is having an idea of what you are expecting to get. So with a left join with population on the left, I have 24 rows, so I expect to get 24 rows when I’m done.

maryland_population |> left_join(county_voters_2020, by="COUNTY") |> nrow()
[1] 24

Remove the nrow and run it again for yourself. By default, dplyr will do a “natural” join, where it’ll match all the matching columns in both tables. So if we take out the by, it’ll use all the common columns between the tables. That may not be right in every instance but let’s try it. If it works, we should get 24 rows.

maryland_population |> left_join(county_voters_2020)
Joining with `by = join_by(COUNTY)`
# A tibble: 24 × 10
   COUNTY           POP2020  YEAR    DEM    REP   GRN   LIB   OTH    UNA  TOTAL
   <chr>              <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>
 1 Allegany           68106  2020  12820  22530    74   204   434   7674  43736
 2 Anne Arundel      588261  2020 174494 135457   564  1922  3017  90162 405616
 3 Baltimore City    585708  2020 311610  30163   802   951  3709  52450 399685
 4 Baltimore County  854535  2020 313870 142534   898  2227  6303 100576 566408
 5 Calvert            92783  2020  24587  28181    89   332   617  14178  67984
 6 Caroline           33293  2020   6629  10039    33    86   182   4208  21177
 7 Carroll           172891  2020  33662  63967   155   670  1137  25770 125361
 8 Cecil             103725  2020  21601  30880   103   341   784  15110  68819
 9 Charles           166617  2020  72416  24711   112   349   865  19849 118302
10 Dorchester         32531  2020   9848   8730    19    78   164   3348  22187
# ℹ 14 more rows

Since we only have one column in common between the two tables, the join only used that column. And we got the same answer. If we had more columns in common, you could see in your results columns with .X after them - that’s a sign of duplicative columns between two tables, and you may decide you don’t need both moving forward.

Let’s save our joined data to a new dataframe, but this time let’s remove the select function so we don’t limit the columns to just three.

maryland_population_with_voters <- maryland_population |> left_join(county_voters_2020)
Joining with `by = join_by(COUNTY)`

Now, with our joined data, we can answer questions in a more useful way. But joins can do even more than just bring data together; they can include additional data to enable you to ask more sophisticated questions. Right now we have registered voters and total population. But we can do more.

Let’s try adding more Maryland demographic data to the mix. Using a file describing the 18-and-over population (from which eligible voters come) from the state’s data catalog, we can read it into R:

maryland_demographics <- read_csv('data/maryland_demographics.csv')
Rows: 24 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): NAME
dbl (10): GEOCODE, pop_18_over, pop_one_race, pop_white, pop_black, pop_nati...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Again, we can use a left_join to make our demographic data available. This time we’ll need to specify the two fields to join because they do not have identical names. We’ll use COUNTY from our population data and NAME from the demographic data, and the order matters - the first column is from the dataframe you name first.

maryland_population_with_voters_and_demographics <- maryland_population_with_voters |> left_join(maryland_demographics, by=c("COUNTY"="NAME"))

Now we’ve got population data and demographic data by county. That means we can draw from both datasets in asking our questions. For example, we could see the counties with the highest 18+ Black population as a percentage of all population 18 and over and also the percentage of Democrats in that county.

We can get this by using mutate and arrange:

maryland_population_with_voters_and_demographics |>
  mutate(pct_black_18_plus = (pop_black/pop_18_over)*100, pct_dems = (DEM/TOTAL)*100) |>
  arrange(desc(pct_black_18_plus)) |>
  select(COUNTY, pct_black_18_plus, pct_dems)
# A tibble: 24 × 3
   COUNTY           pct_black_18_plus pct_dems
   <chr>                        <dbl>    <dbl>
 1 Prince George's               60.9     78.3
 2 Baltimore City                56.3     78.0
 3 Charles                       48.2     61.2
 4 Somerset                      39.0     41.8
 5 Baltimore County              28.8     55.4
 6 Dorchester                    26.2     44.4
 7 Wicomico                      25.6     42.3
 8 Howard                        18.7     52.4
 9 Montgomery                    18.1     61.0
10 Anne Arundel                  17.4     43.0
# ℹ 14 more rows

If you know Maryland political demographics, this result isn’t too surprising, but Somerset County - the state’s 2nd smallest in terms of population - stands out for its Black population, which is a greater percentage than Baltimore County and Montgomery County.

Let’s change that to look at Asian population:

maryland_population_with_voters_and_demographics |>
  mutate(pct_asian_18_plus = (pop_asian/pop_18_over)*100, pct_dems = (DEM/TOTAL)*100) |>
  arrange(desc(pct_asian_18_plus)) |>
  select(COUNTY, pct_asian_18_plus, pct_dems)
# A tibble: 24 × 3
   COUNTY           pct_asian_18_plus pct_dems
   <chr>                        <dbl>    <dbl>
 1 Howard                       19.4      52.4
 2 Montgomery                   16.0      61.0
 3 Baltimore County              6.34     55.4
 4 Frederick                     4.88     38.9
 5 Prince George's               4.68     78.3
 6 Anne Arundel                  4.52     43.0
 7 Baltimore City                4.17     78.0
 8 Charles                       3.55     61.2
 9 Harford                       3.15     35.4
10 St. Mary's                    3.13     35.7
# ℹ 14 more rows

Here, Howard and Montgomery County stand out in terms of the percentage of Asian population 18 and over. The jurisdictions with the highest percentage of Democrats - Prince George’s and Baltimore City - have small Asian populations.

Sometimes joins look like they should work but don’t. Often this is due to the two columns you’re joining on having different data types: joining a column to a column, for example. Let’s walk through an example of that using some demographic data by zip code.

maryland_zcta <- read_csv('data/maryland_zcta.csv')
Rows: 468 Columns: 40
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): FIRST_CLAS, FIRST_MTFC, FIRST_FUNC, REPORT_2_P, REPORT_9_P
dbl (35): OBJECTID_1, ZCTA5CE10, FIRST_STAT, FIRST_GEOI, ZCTA5N, STATE, AREA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(maryland_zcta)
Rows: 468
Columns: 40
$ OBJECTID_1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ ZCTA5CE10  <dbl> 20601, 20602, 20603, 20606, 20607, 20608, 20609, 20611, 206…
$ FIRST_STAT <dbl> 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,…
$ FIRST_GEOI <dbl> 2420601, 2420602, 2420603, 2420606, 2420607, 2420608, 24206…
$ FIRST_CLAS <chr> "B5", "B5", "B5", "B5", "B5", "B5", "B5", "B5", "B5", "B5",…
$ FIRST_MTFC <chr> "G6350", "G6350", "G6350", "G6350", "G6350", "G6350", "G635…
$ FIRST_FUNC <chr> "S", "S", "S", "S", "S", "S", "S", "S", "S", "S", "S", "S",…
$ ZCTA5N     <dbl> 20601, 20602, 20603, 20606, 20607, 20608, 20609, 20611, 206…
$ STATE      <dbl> 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,…
$ AREALAND   <dbl> 115635266, 35830723, 44239637, 7501011, 54357590, 45583064,…
$ AREAWATR   <dbl> 387684, 352762, 219356, 1248760, 448221, 5330329, 6602735, …
$ POP100     <dbl> 24156, 24955, 28967, 431, 9802, 919, 1120, 1078, 261, 11860…
$ HU100      <dbl> 8722, 9736, 10317, 230, 3504, 426, 554, 413, 142, 4424, 204…
$ NHW        <dbl> 9785, 8466, 9625, 377, 2165, 438, 1009, 798, 245, 4044, 352…
$ NHB        <dbl> 11146, 13054, 15025, 45, 6321, 453, 82, 215, 12, 6786, 32, …
$ NHAI       <dbl> 155, 116, 98, 1, 33, 5, 2, 5, 0, 106, 2, 32, 3, 4, 38, 8, 1…
$ NHA        <dbl> 880, 731, 1446, 4, 560, 2, 1, 10, 0, 186, 3, 165, 5, 1, 402…
$ NHNH       <dbl> 11, 15, 24, 0, 3, 0, 1, 0, 0, 4, 1, 2, 0, 0, 4, 1, 0, 3, 1,…
$ NHO        <dbl> 48, 58, 65, 0, 6, 0, 0, 0, 0, 8, 0, 1, 0, 3, 5, 8, 0, 5, 10…
$ NHT        <dbl> 849, 999, 1091, 0, 234, 9, 15, 33, 1, 321, 13, 213, 14, 4, …
$ HISP       <dbl> 1282, 1516, 1593, 4, 480, 12, 10, 17, 3, 405, 2, 244, 9, 7,…
$ PNHW       <dbl> 40.5, 33.9, 33.2, 87.5, 22.1, 47.7, 90.1, 74.0, 93.9, 34.1,…
$ PNHB       <dbl> 46.1, 52.3, 51.9, 10.4, 64.5, 49.3, 7.3, 19.9, 4.6, 57.2, 7…
$ PNHAI      <dbl> 0.6, 0.5, 0.3, 0.2, 0.3, 0.5, 0.2, 0.5, 0.0, 0.9, 0.5, 0.5,…
$ PNHA       <dbl> 3.6, 2.9, 5.0, 0.9, 5.7, 0.2, 0.1, 0.9, 0.0, 1.6, 0.7, 2.8,…
$ PNHNH      <dbl> 0.0, 0.1, 0.1, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.2, 0.0,…
$ PNHO       <dbl> 0.2, 0.2, 0.2, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0,…
$ PNHT       <dbl> 3.5, 4.0, 3.8, 0.0, 2.4, 1.0, 1.3, 3.1, 0.4, 2.7, 3.2, 3.6,…
$ PHISP      <dbl> 5.3, 6.1, 5.5, 0.9, 4.9, 1.3, 0.9, 1.6, 1.1, 3.4, 0.5, 4.2,…
$ POP65_     <dbl> 1922, 1964, 1400, 108, 847, 173, 271, 129, 54, 1372, 73, 55…
$ PCTPOP65_  <dbl> 8.0, 7.9, 4.8, 25.1, 8.6, 18.8, 24.2, 12.0, 20.7, 11.6, 18.…
$ MEDAGE     <dbl> 37.3, 32.6, 34.5, 49.1, 40.9, 46.6, 47.6, 44.3, 47.3, 40.8,…
$ VACNS      <dbl> 376, 769, 531, 15, 172, 39, 32, 22, 14, 249, 18, 158, 8, 18…
$ PVACNS     <dbl> 4.3, 7.9, 5.1, 6.5, 4.9, 9.2, 5.8, 5.3, 9.9, 5.6, 8.8, 7.2,…
$ PHOWN      <dbl> 71.1, 59.7, 73.8, 49.7, 83.1, 60.4, 44.8, 63.8, 38.3, 73.9,…
$ PWOMORT    <dbl> 11.2, 9.0, 4.7, 39.3, 10.3, 28.2, 38.7, 21.8, 43.9, 17.4, 2…
$ PRENT      <dbl> 19.9, 34.4, 22.6, 18.1, 7.4, 15.9, 27.0, 18.3, 31.7, 10.5, …
$ PLT18SP    <dbl> 30.4, 43.6, 29.9, 31.2, 22.1, 14.1, 28.9, 24.5, 43.9, 26.7,…
$ REPORT_2_P <chr> "http://mdpgis.mdp.state.md.us/Census2010/PDF/00_SF1DP_2Pro…
$ REPORT_9_P <chr> "http://mdpgis.mdp.state.md.us/census2010/PDF/00_SF1_9PROFI…

You can see that ZCTA5N, the column representing the Zip Code Tabulation Area, is a numeric column. But should it be? Do we ever want to know the average zip code in Maryland? Zip codes and ZCTAs look like numbers but really are character columns. Let’s change that so that we can be sure to join them correctly with other data where the zip codes are not numbers. We’ll use mutate:

maryland_zcta <- maryland_zcta |> mutate(across(ZCTA5N, as.character))

What’s happening here is that we’re telling R to take all of the values in the ZCTA5N column and make them “as.character”. If we wanted to change a column to numeric, we’d do “as.numeric”. When you join two dataframes, the join columns must be the same datatype.

Joining datasets allows you to expand the range and sophistication of questions you’re able to ask. It is one of the most powerful tools in a journalist’s toolkit.