library(tidyverse)
7 Transforming data
Sometimes long data needs to be wide, and sometimes wide data needs to be long. I’ll explain.
You are soon going to discover that long before you can visualize data, you need to have it in a form that the visualization library can deal with. One of the ways that isn’t immediately obvious is how your data is cast. Most of the data you will encounter will be wide – each row will represent a single entity with multiple measures for that entity. So think of states. Your row of your dataset could have the state name, population, average life expectancy and other demographic data.
But what if your visualization library needs one row for each measure? So state, data type and the data. Maryland, Population, 6,177,224. That’s one row. Then the next row is Maryland, Average Life Expectancy, 78.5 That’s the next row. That’s where recasting your data comes in.
We can use a library called tidyr
to pivot_longer
or pivot_wider
the data, depending on what we need. We’ll use a dataset of college football attendance to demonstrate.
For this walkthrough:
First we need some libraries.
Now we’ll load the data.
<- read_csv('data/attendance.csv') attendance
Rows: 146 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Institution, Conference
dbl (11): 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
attendance
# A tibble: 146 × 13
Institution Confe…¹ `2013` `2014` `2015` `2016` `2017` `2018` `2019` `2020`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Air Force MWC 228562 168967 156158 177519 174924 166205 162505 5600
2 Akron MAC 107101 55019 108588 62021 117416 92575 107752 NA
3 Alabama SEC 710538 710736 707786 712747 712053 710931 707817 97120
4 Appalachian … FBS In… 149366 NA NA NA NA NA NA NA
5 Appalachian … Sun Be… NA 138995 128755 156916 154722 131716 166640 NA
6 Arizona Pac-12 285713 354973 308355 338017 255791 318051 237194 NA
7 Arizona St. Pac-12 501509 343073 368985 286417 359660 291091 344161 NA
8 Arkansas SEC 431174 399124 471279 487067 442569 367748 356517 82500
9 Arkansas St. Sun Be… 149477 149163 138043 136200 119538 119001 124017 24095
10 Army West Po… FBS In… 169781 171310 185946 163267 185543 190156 185935 45548
# … with 136 more rows, 3 more variables: `2021` <dbl>, `2022` <dbl>,
# `2023` <dbl>, and abbreviated variable name ¹Conference
So as you can see, each row represents a school, and then each column represents a year. This is great for calculating the percent change – we can subtract a column from a column and divide by that column. But later, when we want to chart each school’s attendance over the years, we have to have each row be one team for one year. Maryland in 2013, then Maryland in 2014, and Maryland in 2015 and so on.
To do that, we use pivot_longer
because we’re making wide data long. Since all of the columns we want to make rows start with 20, we can use that in our cols
directive. Then we give that column a name – Year – and the values for each year need a name too. Those are the attendance figure. We can see right away how this works.
|> pivot_longer(cols = starts_with("20"), names_to = "Year", values_to = "Attendance") attendance
# A tibble: 1,606 × 4
Institution Conference Year Attendance
<chr> <chr> <chr> <dbl>
1 Air Force MWC 2013 228562
2 Air Force MWC 2014 168967
3 Air Force MWC 2015 156158
4 Air Force MWC 2016 177519
5 Air Force MWC 2017 174924
6 Air Force MWC 2018 166205
7 Air Force MWC 2019 162505
8 Air Force MWC 2020 5600
9 Air Force MWC 2021 136984
10 Air Force MWC 2022 188482
# … with 1,596 more rows
We’ve gone from 149 rows to more than 1,600, but that’s expected when we have 10+ years for each team.
7.1 Making long data wide
We can reverse this process using pivot_wider
, which makes long data wide.
Why do any of this?
In some cases, you’re going to be given long data and you need to calculate some metric using two of the years – a percent change for instance. So you’ll need to make the data wide to do that. You might then have to re-lengthen the data now with the percent change. Some project require you to do all kinds of flexing like this. It just depends on the data.
So let’s take what we made above and turn it back into wide data.
<- attendance |> pivot_longer(cols = starts_with("20"), names_to = "Year", values_to = "Attendance")
longdata
longdata
# A tibble: 1,606 × 4
Institution Conference Year Attendance
<chr> <chr> <chr> <dbl>
1 Air Force MWC 2013 228562
2 Air Force MWC 2014 168967
3 Air Force MWC 2015 156158
4 Air Force MWC 2016 177519
5 Air Force MWC 2017 174924
6 Air Force MWC 2018 166205
7 Air Force MWC 2019 162505
8 Air Force MWC 2020 5600
9 Air Force MWC 2021 136984
10 Air Force MWC 2022 188482
# … with 1,596 more rows
To pivot_wider
, we just need to say where our column names are coming from – the Year – and where the data under it should come from – Attendance.
|> pivot_wider(names_from = Year, values_from = Attendance) longdata
# A tibble: 146 × 13
Institution Confe…¹ `2013` `2014` `2015` `2016` `2017` `2018` `2019` `2020`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Air Force MWC 228562 168967 156158 177519 174924 166205 162505 5600
2 Akron MAC 107101 55019 108588 62021 117416 92575 107752 NA
3 Alabama SEC 710538 710736 707786 712747 712053 710931 707817 97120
4 Appalachian … FBS In… 149366 NA NA NA NA NA NA NA
5 Appalachian … Sun Be… NA 138995 128755 156916 154722 131716 166640 NA
6 Arizona Pac-12 285713 354973 308355 338017 255791 318051 237194 NA
7 Arizona St. Pac-12 501509 343073 368985 286417 359660 291091 344161 NA
8 Arkansas SEC 431174 399124 471279 487067 442569 367748 356517 82500
9 Arkansas St. Sun Be… 149477 149163 138043 136200 119538 119001 124017 24095
10 Army West Po… FBS In… 169781 171310 185946 163267 185543 190156 185935 45548
# … with 136 more rows, 3 more variables: `2021` <dbl>, `2022` <dbl>,
# `2023` <dbl>, and abbreviated variable name ¹Conference
And just like that, we’re back.
7.2 Why this matters
This matters because certain visualization types need wide or long data. A significant hurdle you will face for the rest of the semester is getting the data in the right format for what you want to do.
So let me walk you through an example using this data.
Let’s look at Maryland’s attendance over the time period. In order to do that, I need long data because that’s what the charting library, ggplot2
, needs. You’re going to learn a lot more about ggplot later. Since this data is organized by school and conference, I also need to remove records that have no attendance data (because we have a Maryland in the ACC row).
<- longdata |> filter(Institution == "Maryland") |> filter(!is.na(Attendance)) maryland
Now that we have long data for just Maryland, we can chart it.
ggplot(maryland, aes(x=Year, y=Attendance, group=1)) +
geom_line() +
scale_y_continuous(labels = scales::comma) +
labs(x="Year", y="Attendance", title="Maryland Football Attendance Climbs Under Locksley", subtitle="Heading back to the 300,000 mark?", caption="Source: NCAA | By Derek Willis", color = "Outcome") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
axis.title = element_text(size = 10),
axis.title.y = element_blank(),
axis.text = element_text(size = 7),
axis.ticks = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
legend.position="bottom"
)