library(readr)
4 Aggregates
R is a statistical programming language that is purpose built for data analysis.
Base R does a lot, but there are a mountain of external libraries that do things to make R better/easier/more fully featured. We already installed the tidyverse – or you should have if you followed the instructions for the last assignment – which isn’t exactly a library, but a collection of libraries. Together, they make up the tidyverse. Individually, they are extraordinarily useful for what they do. We can load them all at once using the tidyverse name, or we can load them individually. Let’s start with individually.
The two libraries we are going to need for this assignment are readr
and dplyr
. The library readr
reads different types of data in as a dataframe. For this assignment, we’re going to read in csv data or Comma Separated Values data. That’s data that has a comma between each column of data.
Then we’re going to use dplyr
to analyze it.
To use a library, you need to import it. Good practice – one I’m going to insist on – is that you put all your library steps at the top of your notebook.
That code looks like this:
To load them both, you need to run that code twice:
library(readr)
library(dplyr)
You can keep doing that for as many libraries as you need. I’ve seen notebooks with 10 or more library imports.
But the tidyverse has a neat little trick. We can load most of the libraries we’ll need for the whole semester with one line:
library(tidyverse)
From now on, if that’s not the first line of your notebook, you’re probably doing it wrong.
4.1 Basic data analysis: Group By and Count
The first thing we need to do is get some data to work with. We do that by reading it in. In our case, we’re going to read data from a csv file – a comma-separated values file.
The CSV file we’re going to read from is a Basketball Reference page of advanced metrics for NBA players this past season. The Sports Reference sites are a godsend of data, a trove of stuff, and we’re going to use it a lot in this class.
For this walkthrough:
So step 2, after setting up our libraries, is most often going to be importing data. In order to analyze data, we need data, so it stands to reason that this would be something we’d do very early.
The code looks something like this, but hold off copying it just yet:
nbaplayers <- read_csv("~/SportsData/nbaadvancedplayers2425.csv")
Let’s unpack that.
The first part – nbaplayers – is the name of your variable. A variable is just a name of a thing that stores stuff. In this case, our variable is a data frame, which is R’s way of storing data (technically it’s a tibble, which is the tidyverse way of storing data, but the differences aren’t important and people use them interchangeably). We can call this whatever we want. I always want to name data frames after what is in it. In this case, we’re going to import a dataset of NBA players. Variable names, by convention are one word all lower case. You can end a variable with a number, but you can’t start one with a number.
The <- bit is the variable assignment operator. It’s how we know we’re assigning something to a word. Think of the arrow as saying “Take everything on the right of this arrow and stuff it into the thing on the left.” So we’re creating an empty vessel called nbaplayers
and stuffing all this data into it.
The read_csv
bits are pretty obvious, except for one thing. What happens in the quote marks is the path to the data. In there, I have to tell R where it will find the data. The easiest thing to do, if you are confused about how to find your data, is to put your data in the same folder as as your notebook (you’ll have to save that notebook first). If you do that, then you just need to put the name of the file in there (nbaadvancedplayers2122.csv). In my case, in my home directory (that’s the ~
part), there is a folder called SportsData that has the file called nbaadvancedplayers2122.csv in it. Some people – insane people – leave the data in their downloads folder. The data path then would be ~/Downloads/nameofthedatafilehere.csv
on PC or Mac.
What you put in there will be different from mine. So your first task is to import the data.
<- read_csv("data/nbaadvancedplayers2425.csv") nbaplayers
Rows: 735 Columns: 30
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Player, Team, Pos, Player-additional
dbl (25): Rk, Age, G, GS, MP, PER, TS%, 3PAr, FTr, ORB%, DRB%, TRB%, AST%, S...
lgl (1): Awards
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Now we can inspect the data we imported. What does it look like? To do that, we use head(nbaplayers)
to show the headers and the first six rows of data. If we wanted to see them all, we could just simply enter nbaplayers
and run it.
To get the number of records in our dataset, we run nrow(nbaplayers)
head(nbaplayers)
# A tibble: 6 × 30
Rk Player Age Team Pos G GS MP PER `TS%` `3PAr` FTr
<dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 Mikal Brid… 28 NYK SF 82 82 3036 14 0.585 0.391 0.1
2 2 Josh Hart 29 NYK SG 77 77 2897 16.5 0.611 0.327 0.266
3 3 Anthony Ed… 23 MIN SG 79 79 2871 20.1 0.595 0.503 0.308
4 4 Devin Book… 28 PHO SG 75 75 2795 19.3 0.589 0.388 0.34
5 5 James Hard… 35 LAC PG 79 79 2789 20 0.582 0.516 0.446
6 6 DeMar DeRo… 35 SAC SF 77 77 2768 17.7 0.569 0.196 0.337
# ℹ 18 more variables: `ORB%` <dbl>, `DRB%` <dbl>, `TRB%` <dbl>, `AST%` <dbl>,
# `STL%` <dbl>, `BLK%` <dbl>, `TOV%` <dbl>, `USG%` <dbl>, OWS <dbl>,
# DWS <dbl>, WS <dbl>, `WS/48` <dbl>, OBPM <dbl>, DBPM <dbl>, BPM <dbl>,
# VORP <dbl>, Awards <lgl>, `Player-additional` <chr>
nrow(nbaplayers)
[1] 735
Another way to look at nrow – we have 735 players from this season in our dataset.
What if we wanted to know how many players there were by position? To do that by hand, we’d have to take each of the 735 records and sort them into a pile. We’d put them in groups and then count them.
dplyr
has a group by function in it that does just this. A massive amount of data analysis involves grouping like things together at some point. So it’s a good place to start.
So to do this, we’ll take our dataset and we’ll introduce a new operator: |>. The best way to read that operator, in my opinion, is to interpret that as “and then do this.”
After we group them together, we need to count them. We do that first by saying we want to summarize our data (a count is a part of a summary). To get a summary, we have to tell it what we want. So in this case, we want a count. To get that, let’s create a thing called total and set it equal to n(), which is dplyr
s way of counting something.
Here’s the code:
|>
nbaplayers group_by(Pos) |>
summarise(
total = n()
)
# A tibble: 5 × 2
Pos total
<chr> <int>
1 C 136
2 PF 134
3 PG 135
4 SF 139
5 SG 191
So let’s walk through that. We start with our dataset – nbaplayers
– and then we tell it to group the data by a given field in the data which we get by looking at either the output of head
or you can look in the environment where you’ll see nbaplayers
.
In this case, we wanted to group together positions, signified by the field name Pos. After we group the data, we need to count them up. In dplyr, we use summarize
which can do more than just count things. Inside the parentheses in summarize, we set up the summaries we want. In this case, we just want a count of the positions: total = n(),
says create a new field, called total
and set it equal to n()
, which might look weird, but it’s common in stats. The number of things in a dataset? Statisticians call in n. There are n number of players in this dataset. So n()
is a function that counts the number of things there are.
And when we run that, we get a list of positions with a count next to them. But it’s not in any order. So we’ll add another And Then Do This |> and use arrange
. Arrange does what you think it does – it arranges data in order. By default, it’s in ascending order – smallest to largest. But if we want to know the position with the most players, we need to sort it in descending order. That looks like this:
|>
nbaplayers group_by(Pos) |>
summarise(
total = n()
|> arrange(desc(total)) )
# A tibble: 5 × 2
Pos total
<chr> <int>
1 SG 191
2 SF 139
3 C 136
4 PG 135
5 PF 134
So the most common position in the NBA? Shooting guard, followed by small forward.
We can, if we want, group by more than one thing. Which team has the most of a single position? To do that, we can group by the team – called Tm in the data – and position, or Pos in the data:
|>
nbaplayers group_by(Team, Pos) |>
summarise(
total = n()
|> arrange(desc(total)) )
`summarise()` has grouped output by 'Team'. You can override using the
`.groups` argument.
# A tibble: 158 × 3
# Groups: Team [32]
Team Pos total
<chr> <chr> <int>
1 2TM SG 20
2 2TM PG 15
3 2TM C 14
4 2TM PF 14
5 2TM SF 14
6 CHO SG 11
7 OKC SG 8
8 ATL SG 7
9 CLE SG 7
10 GSW SG 7
# ℹ 148 more rows
So wait, what team is 2TM?
Valuable lesson: whoever collects the data has opinions on how to solve problems. In this case, Basketball Reference, when a player get’s traded, records stats for the player’s first team, their second team, and a combined season total for a team called 2TM, meaning Total. Is there a team abbreviated 2TM? No. So ignore them here.
Charlotte had 11 shooting guards! You can learn a bit about how a team is assembled by looking at these simple counts.
4.2 Other aggregates: Mean and median
In the last example, we grouped some data together and counted it up, but there’s so much more you can do. You can do multiple measures in a single step as well.
Sticking with our NBA player data, we can calculate any number of measures inside summarize. Here, we’ll use R’s built in mean and median functions to calculate … well, you get the idea.
Let’s look just a the number of minutes each position gets.
|>
nbaplayers group_by(Pos) |>
summarise(
count = n(),
mean_minutes = mean(MP),
median_minutes = median(MP)
)
# A tibble: 5 × 4
Pos count mean_minutes median_minutes
<chr> <int> <dbl> <dbl>
1 C 136 856. 606.
2 PF 134 894. 724
3 PG 135 962. 717
4 SF 139 866. 648
5 SG 191 935. 732
Let’s look at centers. The average center plays 855 minutes and the median is 606 minutes.
Why?
Let’s let sort help us.
|> arrange(desc(MP)) nbaplayers
# A tibble: 735 × 30
Rk Player Age Team Pos G GS MP PER `TS%` `3PAr` FTr
<dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 Mikal Bri… 28 NYK SF 82 82 3036 14 0.585 0.391 0.1
2 2 Josh Hart 29 NYK SG 77 77 2897 16.5 0.611 0.327 0.266
3 3 Anthony E… 23 MIN SG 79 79 2871 20.1 0.595 0.503 0.308
4 4 Devin Boo… 28 PHO SG 75 75 2795 19.3 0.589 0.388 0.34
5 5 James Har… 35 LAC PG 79 79 2789 20 0.582 0.516 0.446
6 6 DeMar DeR… 35 SAC SF 77 77 2768 17.7 0.569 0.196 0.337
7 7 Trae Young 26 ATL PG 76 76 2739 18.3 0.567 0.467 0.408
8 8 Tyler Her… 25 MIA SG 77 77 2725 19.7 0.605 0.486 0.237
9 9 OG Anunoby 27 NYK PF 74 74 2706 15.4 0.591 0.448 0.22
10 10 Jalen Gre… 22 HOU SG 82 82 2697 15.1 0.544 0.46 0.234
# ℹ 725 more rows
# ℹ 18 more variables: `ORB%` <dbl>, `DRB%` <dbl>, `TRB%` <dbl>, `AST%` <dbl>,
# `STL%` <dbl>, `BLK%` <dbl>, `TOV%` <dbl>, `USG%` <dbl>, OWS <dbl>,
# DWS <dbl>, WS <dbl>, `WS/48` <dbl>, OBPM <dbl>, DBPM <dbl>, BPM <dbl>,
# VORP <dbl>, Awards <lgl>, `Player-additional` <chr>
The player with the most minutes on the floor is a small forward. So that means there’s Mikal Bridges rolling up 3,036 minutes in a season, and then there’s OKC sensation Alex Reese. Never heard of Alex Reese? Might be because he logged two minute in one game this season.
That’s a huge difference.
So when choosing a measure of the middle, you have to ask yourself – could I have extremes? Because a median won’t be sensitive to extremes. It will be the point at which half the numbers are above and half are below. The average or mean will be a measure of the middle, but if you have a bunch of pine riders and then one ironman superstar, the average will be wildly skewed.
4.3 Even more aggregates
There’s a ton of things we can do in summarize – we’ll work with more of them as the course progresses – but here’s a few other questions you can ask.
Which position in the NBA plays the most minutes? And what is the highest and lowest minute total for that position? And how wide is the spread between minutes? We can find that with sum
to add up the minutes to get the total minutes, min
to find the minimum minutes, max
to find the maximum minutes and sd
to find the standard deviation in the numbers.
|>
nbaplayers group_by(Pos) |>
summarise(
total = sum(MP),
avgminutes = mean(MP),
minminutes = min(MP),
maxminutes = max(MP),
stdev = sd(MP)) |> arrange(desc(total))
# A tibble: 5 × 6
Pos total avgminutes minminutes maxminutes stdev
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SG 178577 935. 3 2897 786.
2 PG 129889 962. 4 2789 791.
3 SF 120323 866. 5 3036 795.
4 PF 119786 894. 2 2706 760.
5 C 116389 856. 3 2674 734.
So again, no surprise, shooting guards spend the most minutes on the floor in the NBA. They average 934 minutes, but we noted why that’s trouble. The minimum is a one-minute wonder, max is some team failing at load management, and the standard deviation is a measure of how spread out the data is. In this case, not the highest spread among positions, but pretty high. So you know you’ve got some huge minutes players and a bunch of bench players.