7 Transforming data

Sometimes long data needs to be wide, and sometimes wide data needs to be long. I’ll explain.

You are soon going to discover that long before you can visualize data, you need to have it in a form that the visualization library can deal with. One of the ways that isn’t immediately obvious is how your data is cast. Most of the data you will encounter will be wide – each row will represent a single entity with multiple measures for that entity. So think of states. Your row of your dataset could have the state name, population, average life expectancy and other demographic data.

But what if your visualization library needs one row for each measure? So state, data type and the data. Maryland, Population, 6,177,224. That’s one row. Then the next row is Maryland, Average Life Expectancy, 78.5 That’s the next row. That’s where recasting your data comes in.

We can use a library called tidyr to pivot_longer or pivot_wider the data, depending on what we need. We’ll use a dataset of college football attendance to demonstrate.

For this walkthrough:

First we need some libraries.

library(tidyverse)

Now we’ll load the data.

attendance <- read_csv('data/attendance.csv')

Rows: 146 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Institution, Conference
dbl (12): 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

attendance

# A tibble: 146 × 14
   Institution     Conference   `2013` `2014` `2015` `2016` `2017` `2018` `2019`
   <chr>           <chr>         <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Air Force       MWC          228562 168967 156158 177519 174924 166205 162505
 2 Akron           MAC          107101  55019 108588  62021 117416  92575 107752
 3 Alabama         SEC          710538 710736 707786 712747 712053 710931 707817
 4 Appalachian St. FBS Indepen… 149366     NA     NA     NA     NA     NA     NA
 5 Appalachian St. Sun Belt         NA 138995 128755 156916 154722 131716 166640
 6 Arizona         Big 12       285713 354973 308355 338017 255791 318051 237194
 7 Arizona St.     Big 12       501509 343073 368985 286417 359660 291091 344161
 8 Arkansas        SEC          431174 399124 471279 487067 442569 367748 356517
 9 Arkansas St.    Sun Belt     149477 149163 138043 136200 119538 119001 124017
10 Army West Point The American 169781 171310 185946 163267 185543 190156 185935
# ℹ 136 more rows
# ℹ 5 more variables: `2020` <dbl>, `2021` <dbl>, `2022` <dbl>, `2023` <dbl>,
#   `2024` <dbl>

So as you can see, each row represents a school, and then each column represents a year. This is great for calculating the percent change – we can subtract a column from a column and divide by that column. But later, when we want to chart each school’s attendance over the years, we have to have each row be one team for one year. Maryland in 2013, then Maryland in 2014, and Maryland in 2015 and so on.

To do that, we use pivot_longer because we’re making wide data long. Since all of the columns we want to make rows start with 20, we can use that in our cols directive. Then we give that column a name – Year – and the values for each year need a name too. Those are the attendance figure. We can see right away how this works.

attendance |> pivot_longer(cols = starts_with("20"), names_to = "Year", values_to = "Attendance")

# A tibble: 1,752 × 4
   Institution Conference Year  Attendance
   <chr>       <chr>      <chr>      <dbl>
 1 Air Force   MWC        2013      228562
 2 Air Force   MWC        2014      168967
 3 Air Force   MWC        2015      156158
 4 Air Force   MWC        2016      177519
 5 Air Force   MWC        2017      174924
 6 Air Force   MWC        2018      166205
 7 Air Force   MWC        2019      162505
 8 Air Force   MWC        2020        5600
 9 Air Force   MWC        2021      136984
10 Air Force   MWC        2022      188482
# ℹ 1,742 more rows

We’ve gone from 147 rows to more than 1,700, but that’s expected when we have 10+ years for each team.

7.1 Making long data wide

We can reverse this process using pivot_wider, which makes long data wide.

Why do any of this?

In some cases, you’re going to be given long data and you need to calculate some metric using two of the years – a percent change for instance. So you’ll need to make the data wide to do that. You might then have to re-lengthen the data now with the percent change. Some project require you to do all kinds of flexing like this. It just depends on the data.

So let’s take what we made above and turn it back into wide data.

longdata <- attendance |> pivot_longer(cols = starts_with("20"), names_to = "Year", values_to = "Attendance")

longdata

# A tibble: 1,752 × 4
   Institution Conference Year  Attendance
   <chr>       <chr>      <chr>      <dbl>
 1 Air Force   MWC        2013      228562
 2 Air Force   MWC        2014      168967
 3 Air Force   MWC        2015      156158
 4 Air Force   MWC        2016      177519
 5 Air Force   MWC        2017      174924
 6 Air Force   MWC        2018      166205
 7 Air Force   MWC        2019      162505
 8 Air Force   MWC        2020        5600
 9 Air Force   MWC        2021      136984
10 Air Force   MWC        2022      188482
# ℹ 1,742 more rows

To pivot_wider, we just need to say where our column names are coming from – the Year – and where the data under it should come from – Attendance.

longdata |> pivot_wider(names_from = Year, values_from = Attendance)

Warning: Values from `Attendance` are not uniquely identified; output will contain
list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
  {data} |>
  dplyr::summarise(n = dplyr::n(), .by = c(Institution, Conference, Year)) |>
  dplyr::filter(n > 1L)

# A tibble: 144 × 14
   Institution     Conference   `2013` `2014` `2015` `2016` `2017` `2018` `2019`
   <chr>           <chr>        <list> <list> <list> <list> <list> <list> <list>
 1 Air Force       MWC          <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 2 Akron           MAC          <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 3 Alabama         SEC          <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 4 Appalachian St. FBS Indepen… <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 5 Appalachian St. Sun Belt     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 6 Arizona         Big 12       <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 7 Arizona St.     Big 12       <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 8 Arkansas        SEC          <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
 9 Arkansas St.    Sun Belt     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
10 Army West Point The American <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 
# ℹ 134 more rows
# ℹ 5 more variables: `2020` <list>, `2021` <list>, `2022` <list>,
#   `2023` <list>, `2024` <list>

And just like that, we’re back.

7.2 Why this matters

This matters because certain visualization types need wide or long data. A significant hurdle you will face for the rest of the semester is getting the data in the right format for what you want to do.

So let me walk you through an example using this data.

Let’s look at Maryland’s attendance over the time period. In order to do that, I need long data because that’s what the charting library, ggplot2, needs. You’re going to learn a lot more about ggplot later. Since this data is organized by school and conference, I also need to remove records that have no attendance data (because we have a Maryland in the ACC row).

maryland <- longdata |> filter(Institution == "Maryland") |> filter(!is.na(Attendance))

Now that we have long data for just Maryland, we can chart it.

ggplot(maryland, aes(x=Year, y=Attendance, group=1)) + 
  geom_line() + 
  scale_y_continuous(labels = scales::comma) + 
  labs(x="Year", y="Attendance", title="Attendance Up and Down Under Locksley", subtitle="Heading back to the 300,000 mark?", caption="Source: NCAA | By Derek Willis", color = "Outcome") +
  theme_minimal() + 
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.title = element_text(size = 10),
    axis.title.y = element_blank(),
    axis.text = element_text(size = 7),
    axis.ticks = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    legend.position="bottom"
  )