library(rvest)
library(tidyverse)
34 Intro to rvest
All the way back in Chapter 2, we used Google Sheets and importHTML to get our own data out of a website. For me, that’s a lot of pointing and clicking and copying and pasting. R has a library that can automate the harvesting of data from HTML on the internet. It’s called rvest
.
Let’s grab a simple, basic HTML table from College Football Stats. There’s nothing particularly strange about this table – it’s simply formatted and easy to scrape.
First we’ll need some libraries. We’re going to use a library called rvest
, which you can get by running install.packages('rvest')
in the console.
The rvest package has functions that make fetching, reading and parsing HTML simple. The first thing we need to do is specify a url that we’re going to scrape.
<- "http://www.cfbstats.com/2023/leader/national/team/offense/split01/category09/sort01.html" scoringoffenseurl
Now, the most difficult part of scraping data from any website is knowing what exact HTML tag you need to grab. In this case, we want a <table>
tag that has all of our data table in it. But how do you tell R which one that is? Well, it’s easy, once you know what to do. But it’s not simple. So I’ve made a short video to show you how to find it.
When you have simple tables, the code is very simple. You create a variable to receive the data, then pass it the url, read the html that was fetched, find the node you need using your XPath value you just copied and you tell rvest that it’s a table.
<- scoringoffenseurl |>
scoringoffense read_html() |>
html_nodes(xpath = '//*[@id="content"]/div[2]/table') |>
html_table()
What we get from this is … not a dataframe. It’s a list with one element in it, which just so happens to be our dataframe. When you get this, the solution is simple: just overwrite the variable you created with the first list element.
<- scoringoffense[[1]] scoringoffense
And what do we have?
head(scoringoffense)
# A tibble: 6 × 10
`` Name G TD FG `1XP` `2XP` Safety Points `Points/G`
<int> <chr> <int> <int> <int> <int> <int> <int> <int> <dbl>
1 1 LSU 13 79 12 78 1 1 592 45.5
2 2 Oregon 14 83 12 75 5 0 619 44.2
3 3 USC 13 74 10 66 2 0 544 41.8
4 4 Oklahoma 13 71 15 69 0 1 542 41.7
5 5 Georgia 14 71 21 71 0 1 562 40.1
6 6 Memphis 13 68 13 61 2 0 512 39.4
We have data, ready for analysis.
34.1 A slightly more complicated example
What if we want more than one year in our dataframe?
This is a common problem. What if we want to look at every scoring offense going back several years? The website has them going back to 2009. How can we combine them?
First, we should note, that the data does not have anything in it to indicate what year it comes from. So we’re going to have to add that. And we’re going to have to figure out a way to stack two dataframes on top of each other.
So let’s grab 2022.
<- "http://www.cfbstats.com/2022/leader/national/team/offense/split01/category09/sort01.html"
scoringoffenseurl22
<- scoringoffenseurl22 |>
scoringoffense22 read_html() |>
html_nodes(xpath = '//*[@id="content"]/div[2]/table') |>
html_table()
<- scoringoffense22[[1]] scoringoffense22
First, how are we going to know, in the data, which year our data is from? We can use mutate.
<- scoringoffense22 |> mutate(YEAR = 2022) scoringoffense22
Error in initialize(...): attempt to use zero-length variable name
Uh oh. Error. What does it say? It’s … not clear, but a hint is that our first column doesn’t have a name. Each column must be named. If you look at our data in the environment tab in the upper right corner, you’ll see that indeed, the first column has no name. It’s the FBS rank of each team. So we can fix that and mutate in the same step. We’ll do that using rename
and since the field doesn’t have a name to rename it, we’ll use a position argument. We’ll say rename column 1 as Rank.
<- scoringoffense |> rename(Rank = 1) |> mutate(YEAR = 2023)
scoringoffense23 <- scoringoffense22 |> rename(Rank = 1) |> mutate(YEAR = 2022) scoringoffense22
And now, to combine the two tables together length-wise – we need to make long data – we’ll use a dpylr function called bind_rows
. The good thing is bind_rows is simple.
<- bind_rows(scoringoffense23, scoringoffense22) combined
Note in the environment tab we now have a data frame called combined that has 264 observations – which just so happens to be what 131 from 2022 and 133 from 2023 add up to.
head(combined)
# A tibble: 6 × 11
Rank Name G TD FG `1XP` `2XP` Safety Points `Points/G` YEAR
<int> <chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl>
1 1 LSU 13 79 12 78 1 1 592 45.5 2023
2 2 Oregon 14 83 12 75 5 0 619 44.2 2023
3 3 USC 13 74 10 66 2 0 544 41.8 2023
4 4 Oklahoma 13 71 15 69 0 1 542 41.7 2023
5 5 Georgia 14 71 21 71 0 1 562 40.1 2023
6 6 Memphis 13 68 13 61 2 0 512 39.4 2023
34.2 An even more complicated example
What do you do when the table has non-standard headers?
Unfortunately, non-standard means there’s no one way to do it – it’s going to depend on the table and the headers. But here’s one idea: Don’t try to make it work.
I’ll explain.
Let’s try to get season team stats from Sports Reference. If you look at that page, you’ll see the problem right away – the headers span two rows, and they repeat. That’s going to be all kinds of no good. You can’t import that. Dataframes must have names all in one row. If you have two-line headers, you have a problem you have to fix before you can do anything else with it.
First we’ll grab the page.
<- "https://www.sports-reference.com/cbb/seasons/2023-school-stats.html" url
Now, similar to our example above, we’ll read the html, use XPath to find the table, and then read that table with a directive passed to it setting the header to FALSE. That tells rvest that there isn’t a header row. Just import it as data.
<- url |>
stats read_html() |>
html_nodes(xpath = '//*[@id="basic_school_stats"]') |>
html_table(header=FALSE)
What we get back is a list of one element (similar to above). So let’s pop it out into a data frame.
<- stats[[1]] |> slice(-1) |> slice(-1) stats
And we’ll take a look at what we have.
head(stats)
# A tibble: 6 × 38
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <lgl> <chr>
1 1 Abile… 30 13 17 .433 -2.79 0.90 NA 5 11 NA 9
2 2 Air F… 32 14 18 .438 2.00 2.12 NA 5 13 NA 10
3 3 Akron 33 22 11 .667 4.19 -1.65 NA 13 5 NA 15
4 4 Alaba… 37 31 6 .838 23.19 9.65 NA 16 2 NA 15
5 5 Alaba… 33 15 18 .455 -10.… -7.71 NA 10 8 NA 9
6 6 Alaba… 31 8 23 .258 -16.… -6.29 NA 6 12 NA 5
# … with 25 more variables: X14 <chr>, X15 <lgl>, X16 <chr>, X17 <chr>,
# X18 <lgl>, X19 <chr>, X20 <chr>, X21 <lgl>, X22 <chr>, X23 <chr>,
# X24 <chr>, X25 <chr>, X26 <chr>, X27 <chr>, X28 <chr>, X29 <chr>,
# X30 <chr>, X31 <chr>, X32 <chr>, X33 <chr>, X34 <chr>, X35 <chr>,
# X36 <chr>, X37 <chr>, X38 <chr>
So, that’s not ideal. We have headers and data mixed together, and our columns are named X1 to X38. Also note: They’re all character fields. Because the headers are interspersed with data, it all gets called character data. So we’ve got to first rename each field.
<- stats |> rename(Rank=X1, School=X2, Games=X3, OverallWins=X4, OverallLosses=X5, WinPct=X6, OverallSRS=X7, OverallSOS=X8, Blank1=X9, ConferenceWins=X10, ConferenceLosses=X11, Blank2=X12, HomeWins=X13, HomeLosses=X14, Blank3=X15, AwayWins=X16, AwayLosses=X17, Blank4=X18, ForPoints=X19, OppPoints=X20, Blank5=X21, Minutes=X22, FieldGoalsMade=X23, FieldGoalsAttempted=X24, FieldGoalPCT=X25, ThreePointMade=X26, ThreePointAttempts=X27, ThreePointPct=X28, FreeThrowsMade=X29, FreeThrowsAttempted=X30, FreeThrowPCT=X31, OffensiveRebounds=X32, TotalRebounds=X33, Assists=X34, Steals=X35, Blocks=X36, Turnovers=X37, PersonalFouls=X38) stats
Now we have to get rid of those headers interspersed in the data. We can do that with filter that say keep all the stuff that isn’t this.
<- stats |> filter(Rank != "Rk" & Games != "Overall") stats
And finally, we need to change the file type of all the fields that need it. We’re going to use a clever little trick, which goes like this: We’re going to use mutate_at
, which means mutate these fields. The pattern for mutate_at
is mutate_at
these variables and do this thing to them. But instead of specifying which of 38 variables we’re going to mutate, we’re going to specify the one we don’t want to change, which is the name of the school. And we just want to convert them to numeric, which is simple. Here’s what it looks like:
<- stats |> mutate_at(vars(-School), as.numeric) stats
One last thing: Who needs columns called Blank1, Blank2, Blank3, etc?
<- stats |> select(-starts_with("Blank")) stats
And just like that, we have a method for getting up to the minute season stats for every team in Division I.
head(stats)
# A tibble: 6 × 33
Rank School Games Overa…¹ Overa…² WinPct Overa…³ Overa…⁴ Confe…⁵ Confe…⁶
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 Abilene Ch… 30 13 17 0.433 -2.79 0.9 5 11
2 2 Air Force 32 14 18 0.438 2 2.12 5 13
3 3 Akron 33 22 11 0.667 4.19 -1.65 13 5
4 4 Alabama NC… 37 31 6 0.838 23.2 9.65 16 2
5 5 Alabama A&M 33 15 18 0.455 -10.8 -7.71 10 8
6 6 Alabama St… 31 8 23 0.258 -16.3 -6.29 6 12
# … with 23 more variables: HomeWins <dbl>, HomeLosses <dbl>, AwayWins <dbl>,
# AwayLosses <dbl>, ForPoints <dbl>, OppPoints <dbl>, Minutes <dbl>,
# FieldGoalsMade <dbl>, FieldGoalsAttempted <dbl>, FieldGoalPCT <dbl>,
# ThreePointMade <dbl>, ThreePointAttempts <dbl>, ThreePointPct <dbl>,
# FreeThrowsMade <dbl>, FreeThrowsAttempted <dbl>, FreeThrowPCT <dbl>,
# OffensiveRebounds <dbl>, TotalRebounds <dbl>, Assists <dbl>, Steals <dbl>,
# Blocks <dbl>, Turnovers <dbl>, PersonalFouls <dbl>, and abbreviated …