4 Aggregates
R is a statistical programming language that is purpose-built for data analysis.
Base R does a lot, but there is a mountain of external libraries that make R better, easier and more fully featured. We already installed the tidyverse – or you should have if you followed the instructions for the last assignment – which isn’t exactly a library, but a collection of libraries. Together, they make up the tidyverse. Individually, they are extraordinarily useful for what they do. We can load them all at once using the tidyverse name, or we can load them individually. Let’s start with individually.
The two libraries we are going to need for this assignment are readr and dplyr. The library readr reads different types of data in as a data frame. For this assignment, we’re going to read in CSV data, or comma-separated values data. That’s data that has a comma between each column of data.
Then we’re going to use dplyr to analyze it.
To use a library, you need to import it. Good practice – one I’m going to insist on – is that you put all your library steps at the top of your notebook.
That code looks like this:
library(readr)
To load them both, you need to run that code twice:
library(readr)
library(dplyr)
You can keep doing that for as many libraries as you need. I’ve seen notebooks with 10 or more library imports.
But the tidyverse has a neat little trick. We can load most of the libraries we’ll need for the whole semester with one line:
library(tidyverse)
From now on, if that’s not the first line of your notebook, you’re probably doing it wrong.
4.1 Basic data analysis: Group By and Count
The first thing we need to do is get some data to work with. We do that by reading it in. In our case, we’re going to read data from a csv file – a comma-separated values file.
The CSV file we’re going to read from is a Basketball Reference page of advanced metrics for NBA players this past season. The Sports Reference sites are a godsend of data, a trove of stuff, and we’re going to use it a lot in this class.
So step 2, after setting up our libraries, is most often going to be importing data. In order to analyze data, we need data, so it stands to reason that this would be something we’d do very early.
The code looks something like this, but hold off copying it just yet:
nbaplayers <- read_csv("~/SportsData/nbaadvancedplayers2324.csv")
Let’s unpack that.
The first part – nbaplayers – is the name of your variable. A variable is just a name of a thing that stores stuff. In this case, our variable is a data frame, which is R’s way of storing data (technically it’s a tibble, which is the tidyverse way of storing data, but the differences aren’t important and people use them interchangeably). We can call this whatever we want. I always want to name data frames after what is in them. In this case, we’re going to import a dataset of NBA players. Variable names, by convention, are one word, all lower case. You can end a variable name with a number, but you can’t start one with a number.
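Here is a small sketch of those naming rules in action. The names totalplayers and season2324 are made up for illustration and are not part of the assignment.

```r
# Assign a value to a name with the arrow operator.
totalplayers <- 735

# A number at the end of a name is fine.
season2324 <- "2023-24"

# A name that starts with a number is a syntax error. To show that
# without stopping the script, we ask R to parse the text and catch
# the failure instead of running the line directly.
badname <- try(parse(text = "2324season <- 1"), silent = TRUE)
is_error <- inherits(badname, "try-error")
print(is_error)
```

If is_error prints TRUE, R refused to even read the line, which is what happens when a name starts with a number.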
The <- bit is the variable assignment operator. It’s how we know we’re assigning something to a word. Think of the arrow as saying “Take everything on the right of this arrow and stuff it into the thing on the left.” So we’re creating an empty vessel called nbaplayers and stuffing all this data into it.
The read_csv bit is pretty obvious, except for one thing. What goes in the quote marks is the path to the data. In there, I have to tell R where it will find the data. The easiest thing to do, if you are confused about how to find your data, is to put your data in the same folder as your notebook (you’ll have to save that notebook first). If you do that, then you just need to put the name of the file in there (nbaadvancedplayers2324.csv). In my case, in my home directory (that’s the ~ part), there is a folder called SportsData that has the file called nbaadvancedplayers2324.csv in it. Some people – insane people – leave the data in their downloads folder. The data path then would be ~/Downloads/nameofthedatafilehere.csv on PC or Mac.
What you put in there will be different from mine. So your first task is to import the data.
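If R complains that it can’t find your file, a quick check like the one below can help. This is only a sketch: the file name is an assumption, so swap in the name of the file you actually downloaded, and keep in mind that the folder R reports depends on how and where you run your notebook.

```r
# Assumed file name for illustration; use your own file's name here.
datafile <- "nbaadvancedplayers2324.csv"

# Where is R currently looking? This is the working directory.
print(getwd())

# TRUE if the file sits in that folder, FALSE if it does not.
found <- file.exists(datafile)
print(found)

# path.expand() shows what the ~ shortcut stands for on your machine.
print(path.expand("~"))
```

If found is FALSE, either move the file next to your notebook or spell out the full path inside read_csv.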
nbaplayers <- read_csv("data/nbaadvancedplayers2324.csv")
Rows: 735 Columns: 27
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Player, Pos, Tm
dbl (24): Rk, Age, G, MP, PER, TS%, 3PAr, FTr, ORB%, DRB%, TRB%, AST%, STL%,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Now we can inspect the data we imported. What does it look like? To do that, we use head(nbaplayers) to show the headers and the first six rows of data. If we wanted to see them all, we could simply enter nbaplayers and run it.
To get the number of records in our dataset, we run nrow(nbaplayers).
head(nbaplayers)
# A tibble: 6 × 27
Rk Player Pos Age Tm G MP PER `TS%` `3PAr` FTr `ORB%`
<dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 Precious … PF-C 24 TOT 74 1624 14.6 0.545 0.207 0.239 13
2 1 Precious … C 24 TOR 25 437 15 0.512 0.276 0.247 12.3
3 1 Precious … PF 24 NYK 49 1187 14.5 0.564 0.167 0.234 13.3
4 2 Bam Adeba… C 26 MIA 71 2416 19.8 0.576 0.041 0.381 7.4
5 3 Ochai Agb… SG 23 TOT 78 1641 7.7 0.497 0.487 0.129 4.9
6 3 Ochai Agb… SG 23 UTA 51 1003 8.1 0.531 0.57 0.08 3.9
# … with 15 more variables: `DRB%` <dbl>, `TRB%` <dbl>, `AST%` <dbl>,
# `STL%` <dbl>, `BLK%` <dbl>, `TOV%` <dbl>, `USG%` <dbl>, OWS <dbl>,
# DWS <dbl>, WS <dbl>, `WS/48` <dbl>, OBPM <dbl>, DBPM <dbl>, BPM <dbl>,
# VORP <dbl>
nrow(nbaplayers)
[1] 735
Another way to look at nrow – we have 735 players from this season in our dataset.
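To get a feel for these inspection functions without the real file, here is a sketch on a made-up data frame. The toy data and its column names are invented for illustration.

```r
# A made-up data frame with 10 rows, standing in for real data.
toy <- data.frame(
  id = 1:10,
  minutes = seq(100, 1000, by = 100)
)

# head() returns the first six rows unless you ask for a different number.
firstsix <- head(toy)
firstthree <- head(toy, 3)

# nrow() counts records; ncol() counts fields.
print(nrow(toy))
print(ncol(toy))
```

The same calls work on nbaplayers once you have imported it.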
What if we wanted to know how many players there were by position? To do that by hand, we’d have to take each of the 735 records and sort them into piles. We’d put them in groups and then count them.
dplyr has a group by function in it that does just this. A massive amount of data analysis involves grouping like things together at some point. So it’s a good place to start.
So to do this, we’ll take our dataset and we’ll introduce a new operator: |>. The best way to read that operator, in my opinion, is to interpret that as “and then do this.”
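If the pipe feels abstract, this small sketch shows the same calculation written both ways. It assumes R 4.1 or later, which is where the |> operator first appeared, and the numbers are arbitrary.

```r
# Without the pipe, you read from the inside out.
nested <- sum(sqrt(c(1, 4, 9)))

# With the pipe, you read top to bottom: take the numbers,
# and then take square roots, and then add them up.
piped <- c(1, 4, 9) |>
  sqrt() |>
  sum()

print(piped)
```

Both versions give the same answer. The pipe just puts the steps in the order you would say them out loud.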
After we group them together, we need to count them. We do that first by saying we want to summarize our data (a count is a part of a summary). To get a summary, we have to tell it what we want. So in this case, we want a count. To get that, let’s create a thing called total and set it equal to n(), which is dplyr’s way of counting something.
Here’s the code:
nbaplayers |>
  group_by(Pos) |>
  summarise(
    total = n()
  )
# A tibble: 12 × 2
Pos total
<chr> <int>
1 C 119
2 C-PF 3
3 PF 147
4 PF-C 1
5 PF-SF 1
6 PG 147
7 PG-SG 4
8 SF 155
9 SF-PF 2
10 SF-SG 1
11 SG 154
12 SG-PG 1
So let’s walk through that. We start with our dataset – nbaplayers – and then we tell it to group the data by a given field. You can find the field names by looking at the output of head, or by looking in the environment, where you’ll see nbaplayers.
In this case, we wanted to group together positions, signified by the field name Pos. After we group the data, we need to count them up. In dplyr, we use summarize (the code above spells it summarise; dplyr accepts both), which can do more than just count things. Inside the parentheses in summarize, we set up the summaries we want. In this case, we just want a count of the positions: total = n() says create a new field, called total, and set it equal to n(), which might look weird, but it’s common in stats. The number of things in a dataset? Statisticians call it n. There are n number of players in this dataset. So n() is a function that counts the number of things there are.
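The same pattern can be tried on a tiny made-up table, where you can check the counts by eye. The toy data below is invented for illustration. The last step uses count(), a dplyr shortcut that groups and counts in one call and names its result column n.

```r
library(dplyr)

# Made-up data: five players, three positions.
toy <- data.frame(
  Player = c("A", "B", "C", "D", "E"),
  Pos = c("C", "PG", "PG", "SF", "PG")
)

# Group by position, then count the rows in each group.
bypos <- toy |>
  group_by(Pos) |>
  summarise(total = n())

print(bypos)

# count() does the grouping and counting in a single step.
shortcut <- toy |>
  count(Pos)

print(shortcut)
```

The longer group_by and summarise version is worth learning first, because the same skeleton carries over to other summaries.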
And when we run that, we get a list of positions with a count next to them. But it’s not in any order. So we’ll add another And Then Do This |> and use arrange. Arrange does what you think it does – it arranges data in order. By default, it’s in ascending order – smallest to largest. But if we want to know the position with the most players, we need to sort it in descending order. That looks like this:
nbaplayers |>
  group_by(Pos) |>
  summarise(
    total = n()
  ) |>
  arrange(desc(total))
# A tibble: 12 × 2
Pos total
<chr> <int>
1 SF 155
2 SG 154
3 PF 147
4 PG 147
5 C 119
6 PG-SG 4
7 C-PF 3
8 SF-PF 2
9 PF-C 1
10 PF-SF 1
11 SF-SG 1
12 SG-PG 1
So the most common position in the NBA? Small forward, followed by shooting guard.
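To see the difference between the default sort and a descending one, here is a sketch using three of the counts from the output above, typed into a small data frame by hand.

```r
library(dplyr)

# Three position counts copied from the output above.
counts <- data.frame(
  Pos = c("PG", "C", "SF"),
  total = c(147, 119, 155)
)

# Default: ascending, smallest to largest.
ascending <- counts |>
  arrange(total)

# Wrapping the field in desc() flips the order.
descending <- counts |>
  arrange(desc(total))

print(ascending)
print(descending)
```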
We can, if we want, group by more than one thing. Which team has the most of a single position? To do that, we can group by the team – called Tm in the data – and position, or Pos in the data:
nbaplayers |>
  group_by(Tm, Pos) |>
  summarise(
    total = n()
  ) |>
  arrange(desc(total))
`summarise()` has grouped output by 'Tm'. You can override using the `.groups`
argument.
# A tibble: 162 × 3
# Groups: Tm [31]
Tm Pos total
<chr> <chr> <int>
1 TOT PG 18
2 TOT PF 15
3 TOT SF 14
4 DET SG 10
5 TOT C 10
6 MEM PF 9
7 NYK SG 9
8 TOR SG 9
9 TOT SG 8
10 CHO PG 7
# … with 152 more rows
So wait, what team is TOT?
Valuable lesson: whoever collects the data has opinions on how to solve problems. In this case, Basketball Reference, when a player gets traded, records stats for the player’s first team, their second team, and a combined season total for a team called TOT, meaning Total. Is there a team abbreviated TOT? No. So ignore them here.
Detroit had 10 shooting guards! You can learn a bit about how a team is assembled by looking at these simple counts.
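Ignoring TOT by eye works, but you can also drop those rows before counting. The sketch below does that on made-up data using dplyr’s filter, which this chapter hasn’t introduced yet, so treat it as a preview. It also sets .groups = "drop", the argument the message above mentions, so the result comes back ungrouped.

```r
library(dplyr)

# Made-up data: player A was traded, so A has a TOT row plus two team rows.
toy <- data.frame(
  Player = c("A", "A", "A", "B", "C"),
  Tm = c("TOT", "TOR", "NYK", "DET", "DET"),
  Pos = c("PF", "PF", "PF", "SG", "SG")
)

teampos <- toy |>
  filter(Tm != "TOT") |>                       # keep rows where team is not TOT
  group_by(Tm, Pos) |>
  summarise(total = n(), .groups = "drop") |>  # count, then remove grouping
  arrange(desc(total))

print(teampos)
```

Whether to drop TOT depends on the question. For counting players by team it is noise, but for a player’s full season numbers it is the row you want.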
4.2 Other aggregates: Mean and median
In the last example, we grouped some data together and counted it up, but there’s so much more you can do. You can do multiple measures in a single step as well.
Sticking with our NBA player data, we can calculate any number of measures inside summarize. Here, we’ll use R’s built in mean and median functions to calculate … well, you get the idea.
Let’s look just at the number of minutes each position gets.
nbaplayers |>
  group_by(Pos) |>
  summarise(
    count = n(),
    mean_minutes = mean(MP),
    median_minutes = median(MP)
  )
# A tibble: 12 × 4
Pos count mean_minutes median_minutes
<chr> <int> <dbl> <dbl>
1 C 119 930. 763
2 C-PF 3 1115. 974
3 PF 147 879. 581
4 PF-C 1 1624 1624
5 PF-SF 1 630 630
6 PG 147 905. 678
7 PG-SG 4 1581 1926.
8 SF 155 912. 608
9 SF-PF 2 983 983
10 SF-SG 1 2160 2160
11 SG 154 887. 700.
12 SG-PG 1 507 507
Let’s look at centers. The average center plays about 930 minutes and the median is 763 minutes.
Why?
Let’s let sort help us.
nbaplayers |> arrange(desc(MP))
# A tibble: 735 × 27
Rk Player Pos Age Tm G MP PER `TS%` `3PAr` FTr `ORB%`
<dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 120 DeMar De… SF 34 CHI 79 2989 19.7 0.584 0.166 0.452 1.6
2 444 Domantas… C 27 SAC 82 2928 23.2 0.637 0.081 0.389 11
3 540 Coby Whi… PG 23 CHI 79 2881 14.5 0.57 0.46 0.215 1.7
4 61 Mikal Br… SF 27 BRK 82 2854 14.9 0.56 0.457 0.245 2.5
5 25 Paolo Ba… PF 21 ORL 80 2799 17.3 0.546 0.249 0.397 3.4
6 137 Kevin Du… PF 35 PHO 75 2791 21.2 0.626 0.283 0.295 1.7
7 363 Dejounte… SG 27 ATL 78 2783 17.7 0.555 0.379 0.179 2.3
8 140 Anthony … SG 22 MIN 79 2770 19.7 0.575 0.341 0.325 2.2
9 263 Nikola J… C 28 DEN 79 2737 31 0.65 0.164 0.31 9.3
10 76 Jalen Br… PG 27 NYK 77 2726 23.4 0.592 0.319 0.302 1.8
# … with 725 more rows, and 15 more variables: `DRB%` <dbl>, `TRB%` <dbl>,
# `AST%` <dbl>, `STL%` <dbl>, `BLK%` <dbl>, `TOV%` <dbl>, `USG%` <dbl>,
# OWS <dbl>, DWS <dbl>, WS <dbl>, `WS/48` <dbl>, OBPM <dbl>, DBPM <dbl>,
# BPM <dbl>, VORP <dbl>
The player with the most minutes on the floor is a small forward. So that means there’s DeMar DeRozan rolling up 2,989 minutes in a season, and then there’s Philly sensation Javonte Smart. Never heard of Javonte Smart? Might be because he logged a single minute in one game this season.
That’s a huge difference.
So when choosing a measure of the middle, you have to ask yourself – could I have extremes? Because a median won’t be sensitive to extremes. It will be the point at which half the numbers are above and half are below. The average or mean will be a measure of the middle, but if you have a bunch of pine riders and then one ironman superstar, the average will be wildly skewed.
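A tiny worked example makes the point. The five minute totals below are made up, apart from borrowing the 1 and the 2,989 from the extremes mentioned above: four bench players and one ironman.

```r
# Made-up minutes: four pine riders and one ironman.
minutes <- c(1, 40, 60, 80, 2989)

# The mean adds everything up and divides by five,
# so the one huge value drags it upward.
avg <- mean(minutes)

# The median is just the middle value once the numbers are sorted.
mid <- median(minutes)

print(avg)
print(mid)
```

The mean lands at 634 minutes, a number no player in this group is anywhere near. The median of 60 is a much better description of the typical player here.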
4.3 Even more aggregates
There’s a ton of things we can do in summarize – we’ll work with more of them as the course progresses – but here’s a few other questions you can ask.
Which position in the NBA plays the most minutes? And what is the highest and lowest minute total for that position? And how wide is the spread between minutes? We can find that with sum to add up the minutes to get the total minutes, min to find the minimum minutes, max to find the maximum minutes and sd to find the standard deviation in the numbers.
nbaplayers |>
  group_by(Pos) |>
  summarise(
    total = sum(MP),
    avgminutes = mean(MP),
    minminutes = min(MP),
    maxminutes = max(MP),
    stdev = sd(MP)
  ) |>
  arrange(desc(total))
# A tibble: 12 × 6
Pos total avgminutes minminutes maxminutes stdev
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SF 141284 912. 1 2989 832.
2 SG 136539 887. 3 2783 807.
3 PG 132967 905. 1 2881 820.
4 PF 129198 879. 4 2799 815.
5 C 110659 930. 1 2928 781.
6 PG-SG 6324 1581 433 2040 769.
7 C-PF 3344 1115. 555 1815 642.
8 SF-SG 2160 2160 2160 2160 NA
9 SF-PF 1966 983 488 1478 700.
10 PF-C 1624 1624 1624 1624 NA
11 PF-SF 630 630 630 630 NA
12 SG-PG 507 507 507 507 NA
So again, no surprise, small forwards spend the most minutes on the floor in the NBA. They average about 912 minutes, but we noted why that’s trouble. The minimum is a one-minute wonder, max is some team failing at load management, and the standard deviation is a measure of how spread out the data is. In this case, not the highest spread among positions, but pretty high. So you know you’ve got some huge minutes players and a bunch of bench players.
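You may have noticed the NA values in the stdev column. Those show up for positions with a single player, because a spread needs at least two values to measure. This sketch reproduces that with numbers taken from the table above: the lone SF-SG total, and the minimum and maximum of the two SF-PF players.

```r
# One player at a position: there is no spread to measure, so sd() gives NA.
single <- sd(c(2160))
print(single)

# Two players (the SF-PF minimum and maximum): now sd() has something to work with.
pair <- sd(c(488, 1478))
print(pair)
```

The second result should come out at roughly 700, in line with the SF-PF row in the table.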