4 Aggregates

R is a statistical programming language that is purpose built for data analysis.

Base R does a lot, but there are a mountain of external libraries that do things to make R better/easier/more fully featured. We already installed the tidyverse – or you should have if you followed the instructions for the last assignment – which isn’t exactly a library, but a collection of libraries. Together, they make up the tidyverse. Individually, they are extraordinarily useful for what they do. We can load them all at once using the tidyverse name, or we can load them individually. Let’s start with individually.

The two libraries we are going to need for this assignment are readr and dplyr. The library readr reads different types of data in as a dataframe. For this assignment, we’re going to read in csv data or Comma Separated Values data. That’s data that has a comma between each column of data.

Then we’re going to use dplyr to analyze it.

To use a library, you need to import it. Good practice – one I’m going to insist on – is that you put all your library steps at the top of your notebook.

That code looks like this:

library(readr)

To load them both, you need to run that code twice:

library(readr)
library(dplyr)

You can keep doing that for as many libraries as you need. I’ve seen notebooks with 10 or more library imports.

But the tidyverse has a neat little trick. We can load most of the libraries we’ll need for the whole semester with one line:

library(tidyverse)

From now on, if that’s not the first line of your notebook, you’re probably doing it wrong.

4.1 Basic data analysis: Group By and Count

The first thing we need to do is get some data to work with. We do that by reading it in. In our case, we’re going to read data from a csv file – a comma-separated values file.

The CSV file we’re going to read from is a Basketball Reference page of advanced metrics for NBA players this past season. The Sports Reference sites are a godsend of data, a trove of stuff, and we’re going to use it a lot in this class.

For this walkthrough:

So step 2, after setting up our libraries, is most often going to be importing data. In order to analyze data, we need data, so it stands to reason that this would be something we’d do very early.

The code looks something like this, but hold off copying it just yet:

nbaplayers <- read_csv("~/SportsData/nbaadvancedplayers2425.csv")

Let’s unpack that.

The first part – nbaplayers – is the name of your variable. A variable is just a name of a thing that stores stuff. In this case, our variable is a data frame, which is R’s way of storing data (technically it’s a tibble, which is the tidyverse way of storing data, but the differences aren’t important and people use them interchangeably). We can call this whatever we want. I always want to name data frames after what is in it. In this case, we’re going to import a dataset of NBA players. Variable names, by convention are one word all lower case. You can end a variable with a number, but you can’t start one with a number.

The <- bit is the variable assignment operator. It’s how we know we’re assigning something to a word. Think of the arrow as saying “Take everything on the right of this arrow and stuff it into the thing on the left.” So we’re creating an empty vessel called nbaplayers and stuffing all this data into it.

The read_csv bits are pretty obvious, except for one thing. What happens in the quote marks is the path to the data. In there, I have to tell R where it will find the data. The easiest thing to do, if you are confused about how to find your data, is to put your data in the same folder as as your notebook (you’ll have to save that notebook first). If you do that, then you just need to put the name of the file in there (nbaadvancedplayers2122.csv). In my case, in my home directory (that’s the ~ part), there is a folder called SportsData that has the file called nbaadvancedplayers2122.csv in it. Some people – insane people – leave the data in their downloads folder. The data path then would be ~/Downloads/nameofthedatafilehere.csv on PC or Mac.

What you put in there will be different from mine. So your first task is to import the data.

nbaplayers <- read_csv("data/nbaadvancedplayers2425.csv")

Rows: 735 Columns: 30
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): Player, Team, Pos, Player-additional
dbl (25): Rk, Age, G, GS, MP, PER, TS%, 3PAr, FTr, ORB%, DRB%, TRB%, AST%, S...
lgl  (1): Awards

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Now we can inspect the data we imported. What does it look like? To do that, we use head(nbaplayers) to show the headers and the first six rows of data. If we wanted to see them all, we could just simply enter nbaplayers and run it.

To get the number of records in our dataset, we run nrow(nbaplayers)

head(nbaplayers)

# A tibble: 6 × 30
     Rk Player        Age Team  Pos       G    GS    MP   PER `TS%` `3PAr`   FTr
  <dbl> <chr>       <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
1     1 Mikal Brid…    28 NYK   SF       82    82  3036  14   0.585  0.391 0.1  
2     2 Josh Hart      29 NYK   SG       77    77  2897  16.5 0.611  0.327 0.266
3     3 Anthony Ed…    23 MIN   SG       79    79  2871  20.1 0.595  0.503 0.308
4     4 Devin Book…    28 PHO   SG       75    75  2795  19.3 0.589  0.388 0.34 
5     5 James Hard…    35 LAC   PG       79    79  2789  20   0.582  0.516 0.446
6     6 DeMar DeRo…    35 SAC   SF       77    77  2768  17.7 0.569  0.196 0.337
# ℹ 18 more variables: `ORB%` <dbl>, `DRB%` <dbl>, `TRB%` <dbl>, `AST%` <dbl>,
#   `STL%` <dbl>, `BLK%` <dbl>, `TOV%` <dbl>, `USG%` <dbl>, OWS <dbl>,
#   DWS <dbl>, WS <dbl>, `WS/48` <dbl>, OBPM <dbl>, DBPM <dbl>, BPM <dbl>,
#   VORP <dbl>, Awards <lgl>, `Player-additional` <chr>

nrow(nbaplayers)

[1] 735

Another way to look at nrow – we have 735 players from this season in our dataset.

What if we wanted to know how many players there were by position? To do that by hand, we’d have to take each of the 735 records and sort them into a pile. We’d put them in groups and then count them.

dplyr has a group by function in it that does just this. A massive amount of data analysis involves grouping like things together at some point. So it’s a good place to start.

So to do this, we’ll take our dataset and we’ll introduce a new operator: |>. The best way to read that operator, in my opinion, is to interpret that as “and then do this.”

After we group them together, we need to count them. We do that first by saying we want to summarize our data (a count is a part of a summary). To get a summary, we have to tell it what we want. So in this case, we want a count. To get that, let’s create a thing called total and set it equal to n(), which is dplyrs way of counting something.

Here’s the code:

nbaplayers |> 
  group_by(Pos) |>
  summarise(
    total = n()
  )

# A tibble: 5 × 2
  Pos   total
  <chr> <int>
1 C       136
2 PF      134
3 PG      135
4 SF      139
5 SG      191

So let’s walk through that. We start with our dataset – nbaplayers – and then we tell it to group the data by a given field in the data which we get by looking at either the output of head or you can look in the environment where you’ll see nbaplayers.

In this case, we wanted to group together positions, signified by the field name Pos. After we group the data, we need to count them up. In dplyr, we use summarize which can do more than just count things. Inside the parentheses in summarize, we set up the summaries we want. In this case, we just want a count of the positions: total = n(), says create a new field, called total and set it equal to n(), which might look weird, but it’s common in stats. The number of things in a dataset? Statisticians call in n. There are n number of players in this dataset. So n() is a function that counts the number of things there are.

And when we run that, we get a list of positions with a count next to them. But it’s not in any order. So we’ll add another And Then Do This |> and use arrange. Arrange does what you think it does – it arranges data in order. By default, it’s in ascending order – smallest to largest. But if we want to know the position with the most players, we need to sort it in descending order. That looks like this:

nbaplayers |>
  group_by(Pos) |>
  summarise(
    total = n()
  ) |> arrange(desc(total))

# A tibble: 5 × 2
  Pos   total
  <chr> <int>
1 SG      191
2 SF      139
3 C       136
4 PG      135
5 PF      134

So the most common position in the NBA? Shooting guard, followed by small forward.

We can, if we want, group by more than one thing. Which team has the most of a single position? To do that, we can group by the team – called Tm in the data – and position, or Pos in the data:

nbaplayers |>
  group_by(Team, Pos) |>
  summarise(
    total = n()
  ) |> arrange(desc(total))

`summarise()` has grouped output by 'Team'. You can override using the
`.groups` argument.

# A tibble: 158 × 3
# Groups:   Team [32]
   Team  Pos   total
   <chr> <chr> <int>
 1 2TM   SG       20
 2 2TM   PG       15
 3 2TM   C        14
 4 2TM   PF       14
 5 2TM   SF       14
 6 CHO   SG       11
 7 OKC   SG        8
 8 ATL   SG        7
 9 CLE   SG        7
10 GSW   SG        7
# ℹ 148 more rows

So wait, what team is 2TM?

Valuable lesson: whoever collects the data has opinions on how to solve problems. In this case, Basketball Reference, when a player get’s traded, records stats for the player’s first team, their second team, and a combined season total for a team called 2TM, meaning Total. Is there a team abbreviated 2TM? No. So ignore them here.

Charlotte had 11 shooting guards! You can learn a bit about how a team is assembled by looking at these simple counts.

4.2 Other aggregates: Mean and median

In the last example, we grouped some data together and counted it up, but there’s so much more you can do. You can do multiple measures in a single step as well.

Sticking with our NBA player data, we can calculate any number of measures inside summarize. Here, we’ll use R’s built in mean and median functions to calculate … well, you get the idea.

Let’s look just a the number of minutes each position gets.

nbaplayers |>
  group_by(Pos) |>
  summarise(
    count = n(),
    mean_minutes = mean(MP),
    median_minutes = median(MP)
  )

# A tibble: 5 × 4
  Pos   count mean_minutes median_minutes
  <chr> <int>        <dbl>          <dbl>
1 C       136         856.           606.
2 PF      134         894.           724 
3 PG      135         962.           717 
4 SF      139         866.           648 
5 SG      191         935.           732

Let’s look at centers. The average center plays 855 minutes and the median is 606 minutes.

Why?

Let’s let sort help us.

nbaplayers |> arrange(desc(MP))

# A tibble: 735 × 30
      Rk Player       Age Team  Pos       G    GS    MP   PER `TS%` `3PAr`   FTr
   <dbl> <chr>      <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
 1     1 Mikal Bri…    28 NYK   SF       82    82  3036  14   0.585  0.391 0.1  
 2     2 Josh Hart     29 NYK   SG       77    77  2897  16.5 0.611  0.327 0.266
 3     3 Anthony E…    23 MIN   SG       79    79  2871  20.1 0.595  0.503 0.308
 4     4 Devin Boo…    28 PHO   SG       75    75  2795  19.3 0.589  0.388 0.34 
 5     5 James Har…    35 LAC   PG       79    79  2789  20   0.582  0.516 0.446
 6     6 DeMar DeR…    35 SAC   SF       77    77  2768  17.7 0.569  0.196 0.337
 7     7 Trae Young    26 ATL   PG       76    76  2739  18.3 0.567  0.467 0.408
 8     8 Tyler Her…    25 MIA   SG       77    77  2725  19.7 0.605  0.486 0.237
 9     9 OG Anunoby    27 NYK   PF       74    74  2706  15.4 0.591  0.448 0.22 
10    10 Jalen Gre…    22 HOU   SG       82    82  2697  15.1 0.544  0.46  0.234
# ℹ 725 more rows
# ℹ 18 more variables: `ORB%` <dbl>, `DRB%` <dbl>, `TRB%` <dbl>, `AST%` <dbl>,
#   `STL%` <dbl>, `BLK%` <dbl>, `TOV%` <dbl>, `USG%` <dbl>, OWS <dbl>,
#   DWS <dbl>, WS <dbl>, `WS/48` <dbl>, OBPM <dbl>, DBPM <dbl>, BPM <dbl>,
#   VORP <dbl>, Awards <lgl>, `Player-additional` <chr>

The player with the most minutes on the floor is a small forward. So that means there’s Mikal Bridges rolling up 3,036 minutes in a season, and then there’s OKC sensation Alex Reese. Never heard of Alex Reese? Might be because he logged two minute in one game this season.

That’s a huge difference.

So when choosing a measure of the middle, you have to ask yourself – could I have extremes? Because a median won’t be sensitive to extremes. It will be the point at which half the numbers are above and half are below. The average or mean will be a measure of the middle, but if you have a bunch of pine riders and then one ironman superstar, the average will be wildly skewed.

4.3 Even more aggregates

There’s a ton of things we can do in summarize – we’ll work with more of them as the course progresses – but here’s a few other questions you can ask.

Which position in the NBA plays the most minutes? And what is the highest and lowest minute total for that position? And how wide is the spread between minutes? We can find that with sum to add up the minutes to get the total minutes, min to find the minimum minutes, max to find the maximum minutes and sd to find the standard deviation in the numbers.

nbaplayers |> 
  group_by(Pos) |> 
  summarise(
    total = sum(MP), 
    avgminutes = mean(MP), 
    minminutes = min(MP),
    maxminutes = max(MP),
    stdev = sd(MP)) |> arrange(desc(total))

# A tibble: 5 × 6
  Pos    total avgminutes minminutes maxminutes stdev
  <chr>  <dbl>      <dbl>      <dbl>      <dbl> <dbl>
1 SG    178577       935.          3       2897  786.
2 PG    129889       962.          4       2789  791.
3 SF    120323       866.          5       3036  795.
4 PF    119786       894.          2       2706  760.
5 C     116389       856.          3       2674  734.

So again, no surprise, shooting guards spend the most minutes on the floor in the NBA. They average 934 minutes, but we noted why that’s trouble. The minimum is a one-minute wonder, max is some team failing at load management, and the standard deviation is a measure of how spread out the data is. In this case, not the highest spread among positions, but pretty high. So you know you’ve got some huge minutes players and a bunch of bench players.