library(tidyverse)
5 Mutating data
One of the most common data analysis techniques is to look at change over time. The most common way of comparing change over time is through percent change. The math behind calculating percent change is very simple, and you should know it off the top of your head. The easy way to remember it is:
(new - old) / old
Or new minus old divided by old. Your new number minus the old number, the result of which is divided by the old number. To do that in R, we can use dplyr
and mutate
to calculate new metrics in a new field using existing fields of data.
So first we’ll import the tidyverse so we can read in our data and begin to work with it.
Now you’ll need a common and simple dataset of total attendance at NCAA football games over the last few seasons.
For this walkthrough:
You’ll import it something like this.
<- read_csv('data/attendance.csv') attendance
Rows: 146 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Institution, Conference
dbl (11): 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
If you want to see the first six rows – handy to take a peek at your data – you can use the function head
.
head(attendance)
# A tibble: 6 × 13
Insti…¹ Confe…² `2013` `2014` `2015` `2016` `2017` `2018` `2019` `2020` `2021`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Air Fo… MWC 228562 168967 156158 177519 174924 166205 162505 5600 136984
2 Akron MAC 107101 55019 108588 62021 117416 92575 107752 NA 61165
3 Alabama SEC 710538 710736 707786 712747 712053 710931 707817 97120 691038
4 Appala… FBS In… 149366 NA NA NA NA NA NA NA NA
5 Appala… Sun Be… NA 138995 128755 156916 154722 131716 166640 NA NA
6 Arizona Pac-12 285713 354973 308355 338017 255791 318051 237194 NA 209401
# … with 2 more variables: `2022` <dbl>, `2023` <dbl>, and abbreviated variable
# names ¹Institution, ²Conference
The code to calculate percent change is pretty simple. Remember, with summarize
, we used n()
to count things. With mutate
, we use very similar syntax to calculate a new value using other values in our dataset. So in this case, we’re trying to do (new-old)/old, but we’re doing it with fields. If we look at what we got when we did head
, you’ll see there’s `2023` as the new data, and we’ll use `2022` as the old data. So we’re looking at one year. Then, to help us, we’ll use arrange again to sort it, so we get the fastest growing school over one year.
|> mutate(
attendance change = (`2023` - `2022`)/`2022`
)
# A tibble: 146 × 14
Institution Confe…¹ `2013` `2014` `2015` `2016` `2017` `2018` `2019` `2020`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Air Force MWC 228562 168967 156158 177519 174924 166205 162505 5600
2 Akron MAC 107101 55019 108588 62021 117416 92575 107752 NA
3 Alabama SEC 710538 710736 707786 712747 712053 710931 707817 97120
4 Appalachian … FBS In… 149366 NA NA NA NA NA NA NA
5 Appalachian … Sun Be… NA 138995 128755 156916 154722 131716 166640 NA
6 Arizona Pac-12 285713 354973 308355 338017 255791 318051 237194 NA
7 Arizona St. Pac-12 501509 343073 368985 286417 359660 291091 344161 NA
8 Arkansas SEC 431174 399124 471279 487067 442569 367748 356517 82500
9 Arkansas St. Sun Be… 149477 149163 138043 136200 119538 119001 124017 24095
10 Army West Po… FBS In… 169781 171310 185946 163267 185543 190156 185935 45548
# … with 136 more rows, 4 more variables: `2021` <dbl>, `2022` <dbl>,
# `2023` <dbl>, change <dbl>, and abbreviated variable name ¹Conference
What do we see right away? Do those numbers look like we expect them to? No. They’re a decimal expressed as a percentage. So let’s fix that by multiplying by 100.
|> mutate(
attendance change = ((`2023` - `2022`)/`2022`)*100
)
# A tibble: 146 × 14
Institution Confe…¹ `2013` `2014` `2015` `2016` `2017` `2018` `2019` `2020`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Air Force MWC 228562 168967 156158 177519 174924 166205 162505 5600
2 Akron MAC 107101 55019 108588 62021 117416 92575 107752 NA
3 Alabama SEC 710538 710736 707786 712747 712053 710931 707817 97120
4 Appalachian … FBS In… 149366 NA NA NA NA NA NA NA
5 Appalachian … Sun Be… NA 138995 128755 156916 154722 131716 166640 NA
6 Arizona Pac-12 285713 354973 308355 338017 255791 318051 237194 NA
7 Arizona St. Pac-12 501509 343073 368985 286417 359660 291091 344161 NA
8 Arkansas SEC 431174 399124 471279 487067 442569 367748 356517 82500
9 Arkansas St. Sun Be… 149477 149163 138043 136200 119538 119001 124017 24095
10 Army West Po… FBS In… 169781 171310 185946 163267 185543 190156 185935 45548
# … with 136 more rows, 4 more variables: `2021` <dbl>, `2022` <dbl>,
# `2023` <dbl>, change <dbl>, and abbreviated variable name ¹Conference
Now, does this ordering do anything for us? No. Let’s fix that with arrange.
|> mutate(
attendance change = ((`2023` - `2022`)/`2022`)*100
|> arrange(desc(change)) )
# A tibble: 146 × 14
Institution Confere…¹ `2013` `2014` `2015` `2016` `2017` `2018` `2019` `2020`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Houston AAC 145537 198177 271836 233716 195499 179029 127592 35152
2 South Fla. AAC 242911 214861 159468 262772 188408 231102 222759 20742
3 Arizona St. Pac-12 501509 343073 368985 286417 359660 291091 344161 NA
4 Louisville ACC NA 317829 294413 324391 276957 351755 299475 68816
5 Wyoming MWC 116854 117593 108361 148860 144299 113277 138042 9210
6 UNLV MWC 120486 94043 116228 110336 104692 100935 119186 4000
7 Oregon St. Pac-12 257784 295230 216476 263357 208524 211252 194546 NA
8 Stanford Pac-12 355081 287174 349417 264853 284388 227052 259123 NA
9 Baylor Big 12 321639 280257 276960 275029 262978 248017 318621 46668
10 Maryland Big Ten NA 281884 310389 237690 237859 201562 226871 NA
# … with 136 more rows, 4 more variables: `2021` <dbl>, `2022` <dbl>,
# `2023` <dbl>, change <dbl>, and abbreviated variable name ¹Conference
So who had the most growth in 2023 compared to the year before? Houston, followed by USF and Arizona State. How about Maryland at #10!
5.1 A more complex example
There’s metric in basketball that’s easy to understand – shooting percentage. It’s the number of shots made divided by the number of shots attempted. Simple, right? Except it’s a little too simple. Because what about three point shooters? They tend to be more vailable because the three point shot is worth more. What about players who get to the line? In shooting percentage, free throws are nowhere to be found.
Basketball nerds, because of these weaknesses, have created a new metric called True Shooting Percentage. True shooting percentage takes into account all aspects of a players shooting to determine who the real shooters are.
Using dplyr
and mutate
, we can calculate true shooting percentage. So let’s look at a new dataset, one of every college basketball player’s season stats in 2023-24 season. It’s a dataset of 5,688 players, and we’ve got 59 variables – one of them is True Shooting Percentage, but we’re going to ignore that.
For this walkthrough:
Import it like this:
<- read_csv("data/players24.csv") players
Rows: 5678 Columns: 59
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Team, Player, Class, Pos, Height, Hometown, High School, Summary, ...
dbl (49): #, Weight, Rk.x, G, GS, MP, FG, FGA, FG%, 2P, 2PA, 2P%, 3P, 3PA, 3...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The basic true shooting percentage formula is (Points / (2*(FieldGoalAttempts + (.44 * FreeThrowAttempts)))) * 100
. Let’s talk that through. Points divided by a lot. It’s really field goal attempts plus 44 percent of the free throw attempts. Why? Because that’s about what a free throw is worth, compared to other ways to score. After adding those things together, you double it. And after you divide points by that number, you multiply the whole lot by 100.
In our data, we need to be able to find the fields so we can complete the formula. To do that, one way is to use the Environment tab in R Studio. In the Environment tab is a listing of all the data you’ve imported, and if you click the triangle next to it, it’ll list all the field names, giving you a bit of information about each one.
So what does True Shooting Percentage look like in code?
Let’s think about this differently. Who had the best true shooting season last year?
|>
players mutate(trueshooting = (PTS/(2*(FGA + (.44*FTA))))*100) |>
arrange(desc(trueshooting))
# A tibble: 5,678 × 60
Team Player `#` Class Pos Height Weight Homet…¹ High …² Summary Rk.x
<chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
1 Winthro… Henry… 21 FR G 5-11 175 Lewisb… Lewisb… 1.3 Pt… 14
2 Tenness… D.J. … 44 FR G 6-5 207 Ricahr… Minnes… 3.0 Pt… 13
3 San Jos… Benne… 24 FR G 6-0 180 Austin… L.C. A… 1.2 Pt… 13
4 Richmon… Liam … 10 JR G 6-2 190 Kansas… Pembro… 1.0 Pt… 12
5 Niagara… Jalen… 10 SO G 5-10 160 Chicag… De La … 0.6 Pt… 12
6 Nevada … Isaac… 24 FR F 6-8 195 Phoeni… Hillcr… 3.0 Pt… 13
7 Georget… Austi… 4 SO F 6-4 180 Hoover… Hoover… 1.0 Pt… 11
8 Florida… Renat… 15 JR G 6-1 180 Planta… Americ… 3.0 Pt… 15
9 Florida… Adefe… 22 FR G 6-1 160 Mirama… Norlan… 1.5 Pt… 16
10 Dayton … Attic… 54 SO G 6-6 190 Wester… St. Fr… 0.4 Pt… 12
# … with 5,668 more rows, 49 more variables: G <dbl>, GS <dbl>, MP <dbl>,
# FG <dbl>, FGA <dbl>, `FG%` <dbl>, `2P` <dbl>, `2PA` <dbl>, `2P%` <dbl>,
# `3P` <dbl>, `3PA` <dbl>, `3P%` <dbl>, FT <dbl>, FTA <dbl>, `FT%` <dbl>,
# ORB <dbl>, DRB <dbl>, TRB <dbl>, AST <dbl>, STL <dbl>, BLK <dbl>,
# TOV <dbl>, PF <dbl>, PTS <dbl>, Rk.y <dbl>, PER <dbl>, `TS%` <dbl>,
# `eFG%` <dbl>, `3PAr` <dbl>, FTr <dbl>, PProd <dbl>, `ORB%` <dbl>,
# `DRB%` <dbl>, `TRB%` <dbl>, `AST%` <dbl>, `STL%` <dbl>, `BLK%` <dbl>, …
You’ll be forgiven if you did not hear about Winthrop’s shooting sensation Henry Harrison. He played in seven games, took three shots and actually hit them all. They all happened to be three pointers, which is three more three pointer than I’ve hit in college basketball. So props to him. Does that mean he had the best true shooting season in college basketball in the 2023-24 season?
Not hardly.
We’ll talk about how to narrow the pile and filter out data in the next chapter.