4  Aggregates

R is a statistical programming language that is purpose built for data analysis.

Base R does a lot, but there is a mountain of external libraries that make R better, easier and more fully featured. We already installed the tidyverse – or you should have if you followed the instructions for the last assignment – which isn’t exactly a library, but a collection of libraries. Together, they make up the tidyverse. Individually, they are extraordinarily useful for what they do. We can load them all at once using the tidyverse name, or we can load them individually. Let’s start with individually.

The two libraries we are going to need for this assignment are readr and dplyr. The library readr reads different types of data in as a dataframe. For this assignment, we’re going to read in csv data or Comma Separated Values data. That’s data that has a comma between each column of data.

Then we’re going to use dplyr to analyze it.

To use a library, you need to import it. Good practice – one I’m going to insist on – is that you put all your library steps at the top of your notebook.

That code looks like this:

library(readr)

To load them both, you need one library line for each:

library(readr)
library(dplyr)

You can keep doing that for as many libraries as you need. I’ve seen notebooks with 10 or more library imports.

But the tidyverse has a neat little trick. We can load most of the libraries we’ll need for the whole semester with one line:

library(tidyverse)

From now on, if that’s not the first line of your notebook, you’re probably doing it wrong.
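If you want to convince yourself that the one-liner really does load both libraries, here is a quick check. It is a minimal sketch that assumes the tidyverse is already installed; the name attached is made up for this example.

```r
library(tidyverse)

# search() lists everything currently attached to the R session.
# After library(tidyverse), readr and dplyr should both be in that list.
attached <- c("package:readr", "package:dplyr") %in% search()
attached
```

If both values come back TRUE, you do not need separate library(readr) and library(dplyr) lines.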

4.1 Basic data analysis: Group By and Count

The first thing we need to do is get some data to work with. We do that by reading it in. In our case, we’re going to read data from a csv file – a comma-separated values file.

The CSV file we’re going to read from is a Basketball Reference page of advanced metrics for NBA players this past season. The Sports Reference sites are a godsend of data, a trove of stuff, and we’re going to use it a lot in this class.


So step 2, after setting up our libraries, is most often going to be importing data. In order to analyze data, we need data, so it stands to reason that this would be something we’d do very early.

The code looks something like this, but hold off copying it just yet:

nbaplayers <- read_csv("~/SportsData/nbaadvancedplayers2324.csv")

Let’s unpack that.

The first part – nbaplayers – is the name of your variable. A variable is just a name of a thing that stores stuff. In this case, our variable is a data frame, which is R’s way of storing data (technically it’s a tibble, which is the tidyverse way of storing data, but the differences aren’t important and people use them interchangeably). We can call this whatever we want. I always want to name data frames after what is in them. In this case, we’re going to import a dataset of NBA players. Variable names, by convention, are one word, all lower case. You can end a variable with a number, but you can’t start one with a number.

The <- bit is the variable assignment operator. It’s how we know we’re assigning something to a word. Think of the arrow as saying “Take everything on the right of this arrow and stuff it into the thing on the left.” So we’re creating an empty vessel called nbaplayers and stuffing all this data into it.
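Here is the arrow at work on something smaller than a whole dataset. This is just a sketch: the names games, minutes_per_game and total_minutes are made up for illustration and are not part of the NBA data.

```r
# Take the value on the right and stuff it into the name on the left
games <- 82
minutes_per_game <- 48

# Once a name holds a value, you can use the name anywhere you'd use the value
total_minutes <- games * minutes_per_game
total_minutes
```

Running the name by itself on the last line prints what is stored in it, the same way running nbaplayers by itself will print the data.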

The read_csv bits are pretty obvious, except for one thing. What happens in the quote marks is the path to the data. In there, I have to tell R where it will find the data. The easiest thing to do, if you are confused about how to find your data, is to put your data in the same folder as your notebook (you’ll have to save that notebook first). If you do that, then you just need to put the name of the file in there (nbaadvancedplayers2324.csv). In my case, in my home directory (that’s the ~ part), there is a folder called SportsData that has the file called nbaadvancedplayers2324.csv in it. Some people – insane people – leave the data in their downloads folder. The data path then would be ~/Downloads/nameofthedatafilehere.csv on PC or Mac.

What you put in there will be different from mine. So your first task is to import the data.
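If paths still feel fuzzy, here is a self-contained sketch you can run without downloading anything. It writes a tiny made-up CSV to a temporary file and reads it back. The players, the numbers and the names csv_path and toy are all invented for the example; only read_csv and its show_col_types argument come from readr.

```r
library(readr)

# Write a three-line CSV (a header plus two made-up players) to a temp file
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("Player,Pos,MP",
    "Player A,C,1200",
    "Player B,PG,800"),
  csv_path
)

# The path goes first; show_col_types = FALSE quiets the column message
toy <- read_csv(csv_path, show_col_types = FALSE)
toy
```

The only thing that changes for your real data is what goes in the path.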

nbaplayers <- read_csv("data/nbaadvancedplayers2324.csv")
Rows: 735 Columns: 27
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Player, Pos, Tm
dbl (24): Rk, Age, G, MP, PER, TS%, 3PAr, FTr, ORB%, DRB%, TRB%, AST%, STL%,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Now we can inspect the data we imported. What does it look like? To do that, we use head(nbaplayers) to show the headers and the first six rows of data. If we wanted to see them all, we could just simply enter nbaplayers and run it.

To get the number of records in our dataset, we run nrow(nbaplayers)

head(nbaplayers)
# A tibble: 6 × 27
     Rk Player     Pos     Age Tm        G    MP   PER `TS%` `3PAr`   FTr `ORB%`
  <dbl> <chr>      <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>
1     1 Precious … PF-C     24 TOT      74  1624  14.6 0.545  0.207 0.239   13  
2     1 Precious … C        24 TOR      25   437  15   0.512  0.276 0.247   12.3
3     1 Precious … PF       24 NYK      49  1187  14.5 0.564  0.167 0.234   13.3
4     2 Bam Adeba… C        26 MIA      71  2416  19.8 0.576  0.041 0.381    7.4
5     3 Ochai Agb… SG       23 TOT      78  1641   7.7 0.497  0.487 0.129    4.9
6     3 Ochai Agb… SG       23 UTA      51  1003   8.1 0.531  0.57  0.08     3.9
# … with 15 more variables: `DRB%` <dbl>, `TRB%` <dbl>, `AST%` <dbl>,
#   `STL%` <dbl>, `BLK%` <dbl>, `TOV%` <dbl>, `USG%` <dbl>, OWS <dbl>,
#   DWS <dbl>, WS <dbl>, `WS/48` <dbl>, OBPM <dbl>, DBPM <dbl>, BPM <dbl>,
#   VORP <dbl>
nrow(nbaplayers)
[1] 735

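Another way to look at nrow – we have 735 rows of player data from this season in our dataset. Note that a traded player, like the first one in the head output, appears on more than one row.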

What if we wanted to know how many players there were by position? To do that by hand, we’d have to take each of the 735 records and sort them into a pile. We’d put them in groups and then count them.

dplyr has a group by function in it that does just this. A massive amount of data analysis involves grouping like things together at some point. So it’s a good place to start.

So to do this, we’ll take our dataset and we’ll introduce a new operator: |>. The best way to read that operator, in my opinion, is to interpret that as “and then do this.”
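To see what “and then do this” means in practice, here is a small sketch using only base R. It assumes R 4.1 or later, which is where the |> operator was added. The data frame toy and its values are made up for the example.

```r
# A tiny made-up data frame
toy <- data.frame(Player = c("A", "B", "C"), MP = c(1200, 800, 30))

# The usual way: the data goes inside the function call
nested <- head(toy, 2)

# The pipe way: take toy, and then do head on it
piped <- toy |> head(2)

# Both approaches give exactly the same result
identical(nested, piped)
```

The pipe just hands whatever is on its left to the function on its right as the first argument.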

After we group them together, we need to count them. We do that first by saying we want to summarize our data (a count is a part of a summary). To get a summary, we have to tell it what we want. So in this case, we want a count. To get that, let’s create a thing called total and set it equal to n(), which is dplyr’s way of counting something.

Here’s the code:

nbaplayers |> 
  group_by(Pos) |>
  summarise(
    total = n()
  )
# A tibble: 12 × 2
   Pos   total
   <chr> <int>
 1 C       119
 2 C-PF      3
 3 PF      147
 4 PF-C      1
 5 PF-SF     1
 6 PG      147
 7 PG-SG     4
 8 SF      155
 9 SF-PF     2
10 SF-SG     1
11 SG      154
12 SG-PG     1

So let’s walk through that. We start with our dataset – nbaplayers – and then we tell it to group the data by a given field in the data. We can find the field names by looking at the output of head, or by looking in the environment, where you’ll see nbaplayers.

In this case, we wanted to group together positions, signified by the field name Pos. After we group the data, we need to count them up. In dplyr, we use summarize (dplyr accepts both the summarize and summarise spellings), which can do more than just count things. Inside the parentheses in summarize, we set up the summaries we want. In this case, we just want a count of the positions: total = n() says create a new field called total and set it equal to n(), which might look weird, but it’s common in stats. The number of things in a dataset? Statisticians call it n. There are n number of players in this dataset. So n() is a function that counts the number of things there are.
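It can help to run the same pattern on data small enough to count on your fingers. This is a sketch with six made-up players; the names toy and position_counts are invented for the example.

```r
library(dplyr)

# Six made-up players: three point guards, two small forwards, one center
toy <- tibble(
  Player = c("A", "B", "C", "D", "E", "F"),
  Pos    = c("PG", "SF", "PG", "C", "SF", "PG")
)

# Group by position, and then count the rows in each group
position_counts <- toy |>
  group_by(Pos) |>
  summarise(
    total = n()
  )

position_counts
```

You should get one row per position, with total matching what you would count by hand.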

And when we run that, we get a list of positions with a count next to them. But it’s not in any order. So we’ll add another And Then Do This |> and use arrange. Arrange does what you think it does – it arranges data in order. By default, it’s in ascending order – smallest to largest. But if we want to know the position with the most players, we need to sort it in descending order. That looks like this:

nbaplayers |>
  group_by(Pos) |>
  summarise(
    total = n()
  ) |> arrange(desc(total))
# A tibble: 12 × 2
   Pos   total
   <chr> <int>
 1 SF      155
 2 SG      154
 3 PF      147
 4 PG      147
 5 C       119
 6 PG-SG     4
 7 C-PF      3
 8 SF-PF     2
 9 PF-C      1
10 PF-SF     1
11 SF-SG     1
12 SG-PG     1

So the most common position in the NBA? Small forward, followed by shooting guard.
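Grouping, counting and sorting is so common that dplyr also has a shortcut for it, the count function. It goes beyond what this walkthrough uses, so treat this as a side note. The sketch below uses made-up data to show that the shortcut and the long way agree; toy, by_hand and shortcut are names invented for the example.

```r
library(dplyr)

# Made-up players: three point guards, two small forwards, one center
toy <- tibble(
  Player = c("A", "B", "C", "D", "E", "F"),
  Pos    = c("PG", "SF", "PG", "C", "SF", "PG")
)

# The long way, as in the walkthrough
by_hand <- toy |>
  group_by(Pos) |>
  summarise(total = n()) |>
  arrange(desc(total))

# The shortcut: sort = TRUE puts the biggest group first,
# and name = "total" names the count column
shortcut <- toy |> count(Pos, sort = TRUE, name = "total")

shortcut
```

The long way is still worth learning first, because summarise can do far more than count.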

We can, if we want, group by more than one thing. Which team has the most of a single position? To do that, we can group by the team – called Tm in the data – and position, or Pos in the data:

nbaplayers |>
  group_by(Tm, Pos) |>
  summarise(
    total = n()
  ) |> arrange(desc(total))
`summarise()` has grouped output by 'Tm'. You can override using the `.groups`
argument.
# A tibble: 162 × 3
# Groups:   Tm [31]
   Tm    Pos   total
   <chr> <chr> <int>
 1 TOT   PG       18
 2 TOT   PF       15
 3 TOT   SF       14
 4 DET   SG       10
 5 TOT   C        10
 6 MEM   PF        9
 7 NYK   SG        9
 8 TOR   SG        9
 9 TOT   SG        8
10 CHO   PG        7
# … with 152 more rows

So wait, what team is TOT?

Valuable lesson: whoever collects the data has opinions on how to solve problems. In this case, Basketball Reference, when a player gets traded, records stats for the player’s first team, their second team, and a combined season total for a team called TOT, meaning Total. Is there a team abbreviated TOT? No. So ignore them here.
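If you would rather have R ignore the TOT rows for you, one option is to drop them before grouping. This is a sketch on made-up data, not part of the original walkthrough. It uses dplyr’s filter function, and it uses the .groups argument mentioned in the message above to return an ungrouped result. The names toy and team_counts are invented for the example.

```r
library(dplyr)

# Made-up data: Player A was traded, so A has a TOT row plus one row per team
toy <- tibble(
  Player = c("A", "A", "A", "B", "C", "D"),
  Tm     = c("TOT", "TOR", "NYK", "MIA", "NYK", "NYK"),
  Pos    = c("C", "C", "C", "C", "SG", "SG")
)

team_counts <- toy |>
  filter(Tm != "TOT") |>                      # keep rows where the team is not TOT
  group_by(Tm, Pos) |>
  summarise(total = n(), .groups = "drop") |> # count, then drop the grouping
  arrange(desc(total))

team_counts
```

With the TOT rows gone, every row left in the result is a real team.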

Detroit had 10 shooting guards! You can learn a bit about how a team is assembled by looking at these simple counts.

4.2 Other aggregates: Mean and median

In the last example, we grouped some data together and counted it up, but there’s so much more you can do. You can do multiple measures in a single step as well.

Sticking with our NBA player data, we can calculate any number of measures inside summarize. Here, we’ll use R’s built in mean and median functions to calculate … well, you get the idea.

Let’s look just at the number of minutes each position gets.

nbaplayers |>
  group_by(Pos) |>
  summarise(
    count = n(),
    mean_minutes = mean(MP),
    median_minutes = median(MP)
  )
# A tibble: 12 × 4
   Pos   count mean_minutes median_minutes
   <chr> <int>        <dbl>          <dbl>
 1 C       119         930.           763 
 2 C-PF      3        1115.           974 
 3 PF      147         879.           581 
 4 PF-C      1        1624           1624 
 5 PF-SF     1         630            630 
 6 PG      147         905.           678 
 7 PG-SG     4        1581           1926.
 8 SF      155         912.           608 
 9 SF-PF     2         983            983 
10 SF-SG     1        2160           2160 
11 SG      154         887.           700.
12 SG-PG     1         507            507 

Let’s look at centers. The average center plays 930 minutes and the median is 763 minutes.

Why?

Let’s let sort help us.

nbaplayers |> arrange(desc(MP))
# A tibble: 735 × 27
      Rk Player    Pos     Age Tm        G    MP   PER `TS%` `3PAr`   FTr `ORB%`
   <dbl> <chr>     <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>
 1   120 DeMar De… SF       34 CHI      79  2989  19.7 0.584  0.166 0.452    1.6
 2   444 Domantas… C        27 SAC      82  2928  23.2 0.637  0.081 0.389   11  
 3   540 Coby Whi… PG       23 CHI      79  2881  14.5 0.57   0.46  0.215    1.7
 4    61 Mikal Br… SF       27 BRK      82  2854  14.9 0.56   0.457 0.245    2.5
 5    25 Paolo Ba… PF       21 ORL      80  2799  17.3 0.546  0.249 0.397    3.4
 6   137 Kevin Du… PF       35 PHO      75  2791  21.2 0.626  0.283 0.295    1.7
 7   363 Dejounte… SG       27 ATL      78  2783  17.7 0.555  0.379 0.179    2.3
 8   140 Anthony … SG       22 MIN      79  2770  19.7 0.575  0.341 0.325    2.2
 9   263 Nikola J… C        28 DEN      79  2737  31   0.65   0.164 0.31     9.3
10    76 Jalen Br… PG       27 NYK      77  2726  23.4 0.592  0.319 0.302    1.8
# … with 725 more rows, and 15 more variables: `DRB%` <dbl>, `TRB%` <dbl>,
#   `AST%` <dbl>, `STL%` <dbl>, `BLK%` <dbl>, `TOV%` <dbl>, `USG%` <dbl>,
#   OWS <dbl>, DWS <dbl>, WS <dbl>, `WS/48` <dbl>, OBPM <dbl>, DBPM <dbl>,
#   BPM <dbl>, VORP <dbl>

The player with the most minutes on the floor is a small forward. So that means there’s DeMar DeRozan rolling up 2,989 minutes in a season, and then there’s Philly sensation Javonte Smart. Never heard of Javonte Smart? Might be because he logged a single minute in one game this season.

That’s a huge difference.

So when choosing a measure of the middle, you have to ask yourself – could I have extremes? Because a median won’t be sensitive to extremes. It will be the point at which half the numbers are above and half are below. The average or mean will be a measure of the middle, but if you have a bunch of pine riders and then one ironman superstar, the average will be wildly skewed.
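You can watch this happen with five numbers. The sketch below uses made-up minute totals, four bench players and one ironman, stored in a vector I’m calling minutes.

```r
# Four pine riders and one ironman (made-up minute totals)
minutes <- c(5, 10, 12, 15, 2900)

# The mean gets dragged toward the one huge value
mean(minutes)

# The median is just the middle value when sorted, so it barely notices
median(minutes)
```

The mean comes out in the hundreds even though four of the five players never got past 15 minutes, while the median stays down with the bench.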

4.3 Even more aggregates

There’s a ton of things we can do in summarize – we’ll work with more of them as the course progresses – but here’s a few other questions you can ask.

Which position in the NBA plays the most minutes? And what is the highest and lowest minute total for that position? And how wide is the spread between minutes? We can find that with sum to add up the minutes to get the total minutes, min to find the minimum minutes, max to find the maximum minutes and sd to find the standard deviation in the numbers.

nbaplayers |> 
  group_by(Pos) |> 
  summarise(
    total = sum(MP), 
    avgminutes = mean(MP), 
    minminutes = min(MP),
    maxminutes = max(MP),
    stdev = sd(MP)) |> arrange(desc(total))
# A tibble: 12 × 6
   Pos    total avgminutes minminutes maxminutes stdev
   <chr>  <dbl>      <dbl>      <dbl>      <dbl> <dbl>
 1 SF    141284       912.          1       2989  832.
 2 SG    136539       887.          3       2783  807.
 3 PG    132967       905.          1       2881  820.
 4 PF    129198       879.          4       2799  815.
 5 C     110659       930.          1       2928  781.
 6 PG-SG   6324      1581         433       2040  769.
 7 C-PF    3344      1115.        555       1815  642.
 8 SF-SG   2160      2160        2160       2160   NA 
 9 SF-PF   1966       983         488       1478  700.
10 PF-C    1624      1624        1624       1624   NA 
11 PF-SF    630       630         630        630   NA 
12 SG-PG    507       507         507        507   NA 

So again, no surprise, small forwards spend the most minutes on the floor in the NBA. They average 912 minutes, but we noted why that’s trouble. The minimum is a one-minute wonder, max is some team failing at load management, and the standard deviation is a measure of how spread out the data is. In this case, it’s the highest spread among positions. So you know you’ve got some huge minutes players and a bunch of bench players.
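You may have noticed the NA values in the stdev column. Those show up for positions with only one player, because you can’t measure spread with a single number. Here is a short sketch using minute totals taken from the table above; the names one_player and two_players are made up.

```r
# PF-C has a single player, so there is no spread to measure
one_player <- sd(c(1624))
one_player

# SF-PF has two players, at the min and max shown in the table
two_players <- sd(c(488, 1478))
two_players
```

The first result is NA and the second rounds to 700, matching the table.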