library(tidyverse)
18 Mutating data
Often the data you have will prompt questions that it doesn’t immediately answer. Election results, for example, have raw vote totals but we often don’t use those to make comparisons between candidates unless the numbers are small. We need percentages!
To do that in R, we can use dplyr
and mutate
to calculate new metrics in a new field using existing fields of data. That’s the essence of mutate
- using the data you have to answer a new question.
So first we’ll import the tidyverse so we can read in our data and begin to work with it.
Now we’ll import a dataset of county-level gubernatorial results from Maryland’s 2022 general election that is in the data folder in this chapter’s pre-lab directory. We’ll use this to explore ways to create new information from existing data.
<- read_csv('data/md_gov_county.csv') general_22
Rows: 24 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): county
dbl (7): fips_code, cox, moore, lashar, wallace, harding, write_ins
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let’s add a column called percent_moore
for the percentage of votes that went to Wes Moore, the Democratic candidate who won the election, in each county. The code to calculate a percentage is pretty simple. Remember, with summarize
, we used n()
to count things. With mutate
, we use very similar syntax to calculate a new value – a new column of data – using other values in our dataset.
To calculate a percentage, we need both the number of votes for Moore but also the total number of votes. We’ll use mutate to create both columns. The first will be total votes. The key here is to save the dataframe to itself so that our changes stick.
<- general_22 |>
general_22 mutate(
total_votes = cox + moore + lashar + wallace + write_ins,
pct_moore = moore/total_votes
)
But what do you see right away? Do those numbers look like we expect them to? No. They’re a decimal expressed as a percentage. So let’s fix that by multiplying by 100. Since we’re replacing the contents of our new pct_moore
column, we can just update our previous code and run it again:
<- general_22 |>
general_22 mutate(
pct_moore = (moore/total_votes)*100
)
Now, does this ordering do anything for us? No. Let’s fix that with arrange.
<- general_22 |>
general_22 mutate(
pct_moore = (moore/total_votes)*100
|>
) arrange(desc(pct_moore))
So now we have results ordered by pct_moore
with the highest percentage first. To see the lowest percentage first, we can reverse that arrange
function - we don’t need to recalculate the column:
|>
general_22 arrange(pct_moore)
# A tibble: 24 × 10
fips_code county cox moore lashar wallace harding write_ins total_votes
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 24023 Garrett 8195 2402 194 52 138 0 10843
2 24001 Allegany 13833 6390 279 143 267 0 20645
3 24015 Cecil 18159 8691 472 155 311 0 27477
4 24011 Caroline 6727 3276 176 56 140 0 10235
5 24039 Somerset 3974 2254 67 46 81 0 6341
6 24043 Washington 26943 15723 614 252 472 0 43532
7 24013 Carroll 38969 25155 1515 436 561 0 66075
8 24047 Worcester 13433 8550 273 121 161 0 22377
9 24037 Saint Mar… 20279 13291 661 253 346 0 34484
10 24035 Queen Ann… 12840 8577 416 120 212 0 21953
# ℹ 14 more rows
# ℹ 1 more variable: pct_moore <dbl>
Moore had his weakest performance in Garrett County, at the far western edge of the state.
18.1 Another use of mutate
Mutate is also useful for standardizing data - for example, making different spellings of, say, campaign spending recipients.
Let’s load some Maryland state campaign expenditures into a maryland_expenses
dataframe, and focus in particular on the payee_name
column.
<- read_csv("data/maryland_expenses.csv") maryland_expenses
Rows: 97912 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): expenditure_date, payee_name, address, payee_type, committee_name,...
dbl (1): amount
lgl (1): expense_toward
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
maryland_expenses
# A tibble: 97,912 × 14
expenditure_date payee_name address payee_type amount committee_name
<chr> <chr> <chr> <chr> <dbl> <chr>
1 3/12/2021 <NA> <NA> Reimburse 350 Salling Joh…
2 3/29/2021 Dundalk Eagle News… PO Box… Business/… 329 Salling Joh…
3 4/29/2021 Dundalk Eagle News… PO Box… Business/… 400 Salling Joh…
4 5/18/2021 Dundalk Eagle News… PO Box… Business/… 350 Salling Joh…
5 6/9/2021 Dundalk Heritage F… Dundal… Business/… 200 Salling Joh…
6 6/9/2021 Dundalk Heritage F… Dundal… Business/… 250 Salling Joh…
7 6/1/2021 Neighborhood Signs 6655 a… Business/… 77.4 Salling Joh…
8 4/16/2021 <NA> <NA> Reimburse 150 Salling Joh…
9 7/1/2021 MSP CUSTOM SOL 1000 P… Business/… 238. Salling Joh…
10 7/2/2021 Squire's Restaurant 6723 H… Business/… 260 Salling Joh…
# ℹ 97,902 more rows
# ℹ 8 more variables: expense_category <chr>, expense_purpose <chr>,
# expense_toward <lgl>, expense_method <chr>, vendor <chr>, fundtype <chr>,
# comments <chr>, x14 <chr>
You’ll notice that there’s a mix of styles: lower-case and upper-case names like “Anedot” and “ANEDOT”, for example. R will think those are two different payees, and that will mean that any aggregates we create based on payee_name won’t be accurate.
So how can we fix that? Mutate - it’s not just for math! And a function called str_to_upper
that will convert a character column into all uppercase.
<- maryland_expenses |>
standardized_maryland_expenses mutate(
payee_upper = str_to_upper(payee_name)
)
There are lots of potential uses for standardization - addresses, zip codes, anything that can be misspelled or abbreviated.
18.2 A more powerful use
Mutate is even more useful when combined with some additional functions. Let’s keep rolling with our expenditure data. Take a look at the address column: it contains a full address, including the state, spelled out. It would be useful to have a separate state
column with an abbreviation. We can check to see if a state name is contained in that column and then populate a new column with the value we want, using the functions str_detect
and case_when
. We can identify the state by the following pattern: a space, followed by the full name, followed by another space. So, ” Maryland “. The case_when
function handles multiple variations, such as if the state is Maryland or the state is Texas, etc. Crucially, we can tell R to populate the new column with NA
if it doesn’t find a match.
<- maryland_expenses |>
maryland_expenses_with_state mutate(
state = case_when(
str_detect(address, " Maryland ") ~ "MD",
str_detect(address, " California ") ~ "CA",
str_detect(address, " Washington ") ~ "WA",
str_detect(address, " Louisiana ") ~ "LA",
str_detect(address, " Florida ") ~ "FL",
str_detect(address, " North Carolina ") ~ "NC",
str_detect(address, " Massachusetts ") ~ "MA",
str_detect(address, " West Virginia ") ~ "WV",
str_detect(address, " Virginia ") ~ "VA",
.default = NA
) )
There’s a lot going on here, so let’s unpack it. It starts out as a typical mutate statement, but case_when
introduces some new things. Each line checks to see if the pattern is contained in the address column, followed by ~
and then a value for the new column for records that match that check. You can read it like this: “If we find ’ Maryland ’ in the address column, then put ‘MD’ in the state column” for Maryland and then a handful of states, and if we don’t match any state we’re looking for, make state NA
.
We can then use our new state
column in group_by statements to make summarizing easier.
|>
maryland_expenses_with_state group_by(state) |>
summarize(total = sum(amount)) |>
arrange(desc(total))
# A tibble: 10 × 2
state total
<chr> <dbl>
1 MD 77723146.
2 WA 15552127.
3 VA 10519646.
4 CA 3370284.
5 FL 1470592.
6 MA 1264728.
7 NC 691006.
8 LA 255522.
9 WV 41088.
10 <NA> NA
Most expenditures seem to have occurred in Maryland, which makes sense, although we haven’t assigned a state for every transaction.
Mutate is there to make your data more useful and to make it easier for you to ask more and better questions of it.