library(tidyverse)
library(cfbfastR)
39 Using packages to get data
There is a growing number of packages and repositories of sports data, largely because there’s a growing number of people who want to analyze that data. We’ve done it ourselves with simple Google Sheets tricks. Then there’s rvest, a package for scraping the data yourself from websites. But with these packages, someone has done the work of gathering the data for you. All you have to learn are the commands to get it.
One very promising collection of libraries is something called the SportsDataverse, which has a collection of packages covering specific sports, all of which are in various stages of development. Some are more complete than others, but they are all being actively worked on by developers. Packages of interest in this class are:
- cfbfastR, for college football.
- hoopR, for men’s professional and college basketball.
- wehoop, for women’s professional and college basketball.
- baseballr, for professional and college baseball.
- worldfootballR, for soccer data from around the world.
- hockeyR, for NHL hockey data.
- recruitR, for college sports recruiting.
Not part of the SportsDataverse, but in the same neighborhood, is nflfastR, which can provide NFL play-by-play data.
Because they’re all under development, not all of them can be installed with just a simple install.packages("something"). Some require a little work, and some require API keys.
The main issue for you is to read the documentation carefully.
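A quick way to see which of these packages are already on your machine is to ask R directly. This is a minimal sketch using base R only. The package list is just the names mentioned above, and the right install command for any missing package is an assumption you should check against that package's documentation.

```r
# Package names taken from the list above.
pkgs <- c("cfbfastR", "hoopR", "wehoop", "baseballr",
          "worldfootballR", "hockeyR", "recruitR", "nflfastR")

# requireNamespace() returns TRUE if a package is installed, FALSE otherwise.
# It loads the package's namespace but does not attach it.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need installing.
missing_pkgs <- names(installed)[!installed]
print(missing_pkgs)

# For packages that are on CRAN, this line would install whatever is missing.
# Others may need a development install; check each package's documentation.
# install.packages(missing_pkgs)
```

The install line is commented out on purpose, so nothing gets installed until you have read the instructions for each package.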
39.1 Using cfbfastR as a cautionary tale
cfbfastR presents us a good view into the promise and peril of libraries like this.
First, to make this work, follow the installation instructions and then follow how to get an API key from College Football Data and how to add that to your environment. But maybe wait to do that until you read the whole section.
After installations, we can load it up.
You might be thinking, “Oh wow, I can get play-by-play data for college football. Let’s find the five most heartbreaking plays of last year’s Maryland season.” What better way to measure doom than the steepest drop-off in win probability, which is included in the data?
Great idea. Let’s do it. You’ll need to make sure that your API key has been added to your environment.
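Before requesting anything, it can help to confirm that R can actually see the key. This is a minimal sketch, assuming the key is stored in an environment variable named CFBD_API_KEY, which is the name the cfbfastR documentation uses as far as we know; confirm it there. The helper name has_cfbd_key() is made up for this example.

```r
# Hypothetical helper: TRUE if a non-empty CFBD_API_KEY is set in this session.
# The variable name is an assumption; check the cfbfastR documentation.
has_cfbd_key <- function() {
  nzchar(Sys.getenv("CFBD_API_KEY"))
}

if (has_cfbd_key()) {
  message("API key found.")
} else {
  message("No API key found. Add CFBD_API_KEY to your environment and restart R.")
}
```

If the check fails right after you added the key, restarting R is usually the first thing to try, since environment files are read at startup.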
The first thing to do is read the documentation. You’ll see that you can request data for each week. For example, here’s week 1 against Buffalo.
maryland <- cfbd_pbp_data(
  2022,
  week = 1,
  season_type = "regular",
  team = "Maryland",
  epa_wpa = TRUE
)
• 09:40:20 | Start processing of 1 game...
There’s no easy way to get all of a single team’s games at once. One way that isn’t pretty, but works, is to request each week separately:
wk1 <- cfbd_pbp_data(2022, week=1, season_type = "regular", team = "Maryland", epa_wpa = TRUE)
Sys.sleep(2)
wk2 <- cfbd_pbp_data(2022, week=2, season_type = "regular", team = "Maryland", epa_wpa = TRUE)
Sys.sleep(2)
wk3 <- cfbd_pbp_data(2022, week=3, season_type = "regular", team = "Maryland", epa_wpa = TRUE)
Sys.sleep(2)
wk4 <- cfbd_pbp_data(2022, week=4, season_type = "regular", team = "Maryland", epa_wpa = TRUE)
Sys.sleep(2)
wk5 <- cfbd_pbp_data(2022, week=5, season_type = "regular", team = "Maryland", epa_wpa = TRUE)
Sys.sleep(2)
wk6 <- cfbd_pbp_data(2022, week=6, season_type = "regular", team = "Maryland", epa_wpa = TRUE)
Sys.sleep(2)
wk8 <- cfbd_pbp_data(2022, week=8, season_type = "regular", team = "Maryland", epa_wpa = TRUE)
Sys.sleep(2)
wk9 <- cfbd_pbp_data(2022, week=9, season_type = "regular", team = "Maryland", epa_wpa = TRUE)
Sys.sleep(2)
wk10 <- cfbd_pbp_data(2022, week=10, season_type = "regular", team = "Maryland", epa_wpa = TRUE)
Sys.sleep(2)
wk11 <- cfbd_pbp_data(2022, week=11, season_type = "regular", team = "Maryland", epa_wpa = TRUE)
Sys.sleep(2)
wk12 <- cfbd_pbp_data(2022, week=12, season_type = "regular", team = "Maryland", epa_wpa = TRUE)

umplays <- bind_rows(wk1, wk2, wk3, wk4, wk5, wk6, wk8, wk9, wk10, wk11, wk12)
The Sys.sleep(2) calls pause for two seconds before the next request runs. Since we’re requesting data from someone else’s computer, we want to be kind. If you request a week when Maryland didn’t play, such as a bye week, you’ll get an empty result and a warning. The bind_rows call puts all the dataframes into a single dataframe.
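The same request, pause, repeat pattern can also be written as a loop, so adding or dropping a week means changing one vector. The sketch below is an illustration, not cfbfastR's own approach: fetch_weeks() and fake_fetch() are hypothetical helpers, and fake_fetch() stands in for the real request so the example runs without an API key.

```r
library(dplyr)

# Hypothetical helper: call fetch_fn once per week, pausing after each request,
# then stack the results into one data frame.
fetch_weeks <- function(weeks, fetch_fn, pause = 2) {
  results <- lapply(weeks, function(w) {
    out <- fetch_fn(w)
    Sys.sleep(pause)  # be kind to the server
    out
  })
  bind_rows(results)
}

# Stand-in for the real request, so this runs offline.
fake_fetch <- function(w) {
  tibble(wk = w, play_text = paste("Example play from week", w))
}

umplays_demo <- fetch_weeks(c(1, 2, 3), fake_fetch, pause = 0)
print(umplays_demo)
```

With the real package, fetch_fn would wrap the call used in this section, something like function(w) cfbd_pbp_data(2022, week = w, season_type = "regular", team = "Maryland", epa_wpa = TRUE), with pause left at 2.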
Now you’re ready to look at heartbreak. How do we define heartbreak? How about like this: you first have to lose the game, it comes in the third or fourth quarter, it involves a play (i.e. not a timeout), and it results in the biggest drop in win probability.
umplays |>
  filter(pos_team == "Maryland" & wk > 4 & play_type != "Timeout") |>
  filter(period == 3 | period == 4) |>
  mutate(HeartbreakLevel = wp_before - wp_after) |>
  arrange(desc(HeartbreakLevel)) |>
  top_n(5, wt=HeartbreakLevel) |>
  select(period, clock.minutes, def_pos_team, play_type, play_text)
── play-by-play data from CollegeFootballData.com ──────────── cfbfastR 1.9.0 ──
ℹ Data updated: 2023-11-13 09:40:24 EST
# A tibble: 5 × 5
period clock.minutes def_pos_team play_type play_text
<int> <int> <chr> <chr> <chr>
1 3 13 Northwestern Rush Roman Hemby run for 2 yds to…
2 3 9 Purdue Rush Antwain Littleton II run for…
3 3 12 Ohio State Blocked Punt TEAM punt blocked by Lathan …
4 3 7 Purdue Sack Taulia Tagovailoa sacked by …
5 4 15 Purdue Pass Reception Taulia Tagovailoa pass compl…
The most heartbreaking play of the season, according to our data? A third quarter run for two yards against Northwestern. Hmm - Maryland won that game, though. The other top plays - mostly against Purdue and a blocked punt by Ohio State - seem more in line with what we want.
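The Northwestern result points to a gap between the definition and the code: nothing in the filter appears to check whether Maryland lost. One way to close that gap is to join a table of game results and keep only the losses. The sketch below uses made-up data. The names wp_before, wp_after, period, play_type and def_pos_team come from the code above, while the results table and its maryland_won column are assumptions for illustration, not something cfbfastR is known to provide.

```r
library(dplyr)

# Made-up plays; the numbers are invented for illustration.
plays <- tribble(
  ~def_pos_team,  ~period, ~play_type, ~wp_before, ~wp_after,
  "Northwestern", 3,       "Rush",     0.80,       0.55,
  "Purdue",       3,       "Rush",     0.60,       0.40,
  "Purdue",       4,       "Timeout",  0.40,       0.40,
  "Ohio State",   2,       "Pass",     0.50,       0.20,
  "Ohio State",   3,       "Punt",     0.45,       0.30
)

# Hypothetical lookup table: one row per opponent, with the game result.
results <- tribble(
  ~def_pos_team,  ~maryland_won,
  "Northwestern", TRUE,
  "Purdue",       FALSE,
  "Ohio State",   FALSE
)

heartbreak <- plays |>
  inner_join(results, by = "def_pos_team") |>
  filter(!maryland_won) |>              # losses only
  filter(period %in% c(3, 4)) |>        # third or fourth quarter
  filter(play_type != "Timeout") |>     # actual plays only
  mutate(HeartbreakLevel = wp_before - wp_after) |>
  arrange(desc(HeartbreakLevel))

print(heartbreak)
```

In this toy data the Northwestern play has the biggest drop, but it falls out because Maryland won that game, leaving one Purdue play and one Ohio State play.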
39.2 Another example
The wehoop package is mature enough to have a version on CRAN, so you can install it the usual way with install.packages("wehoop"). Another helpful library to install is progressr, with install.packages("progressr").
library(wehoop)
Many of these libraries have more than play-by-play data. For example, wehoop has box scores and player data for both the WNBA and college basketball. From personal experience, WNBA data isn’t hard to get, but women’s college basketball is a giant pain.
So, who is Maryland’s single-season points champion over the last five seasons?
progressr::with_progress({
  wbb_player_box <- wehoop::load_wbb_player_box(2018:2022)
})
With progressr, you’ll see a progress bar in the console, which lets you know that your command is still working, since some of these requests take minutes to complete. Player box scores are quicker: five seasons took a matter of seconds.
If you look at the wbb_player_box data we now have, we have each player in each game over each season – more than 300,000 records. Finding out who Maryland’s top 10 single-season scoring leaders are is a matter of grouping, summing and filtering.
wbb_player_box |>
  filter(team_short_display_name == "Maryland", !is.na(points)) |>
  group_by(athlete_display_name, season) |>
  summarise(totalPoints = sum(as.numeric(points))) |>
  arrange(desc(totalPoints)) |>
  ungroup() |>
  top_n(10, wt=totalPoints)
# A tibble: 11 × 3
athlete_display_name season totalPoints
<chr> <int> <dbl>
1 Kaila Charles 2018 610
2 Kaila Charles 2019 579
3 Angel Reese 2022 569
4 Ashley Owusu 2021 518
5 Diamond Miller 2021 501
6 Kaila Charles 2020 456
7 Taylor Mikesell 2019 456
8 Stephanie Jones 2019 435
9 Ashley Owusu 2022 385
10 Ashley Owusu 2020 383
11 Shakira Austin 2020 383
Kaila Charles’ 2018 season tops the list: her 610 points are the most any Maryland player scored in a single season over this span.
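The output above has 11 rows even though the code asked for 10. That is because top_n() keeps ties, and two players share 383 points at the cutoff. Here is a minimal sketch with made-up totals showing the difference. slice_max() is the newer dplyr function for the same job, and its with_ties argument controls this behavior.

```r
library(dplyr)

# Made-up totals with a tie for the last spot.
scorers <- tibble(
  player = c("A", "B", "C", "D"),
  totalPoints = c(600, 500, 383, 383)
)

# top_n() keeps every row tied at the cutoff, so asking for 3 returns 4 rows.
with_ties <- scorers |> top_n(3, wt = totalPoints)

# slice_max() with with_ties = FALSE returns exactly 3 rows.
exact <- scorers |> slice_max(totalPoints, n = 3, with_ties = FALSE)

print(nrow(with_ties))
print(nrow(exact))
```

Which one you want depends on the question. For a leaderboard, keeping ties is usually the fairer choice, since dropping one of two tied players is arbitrary.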