34  Intro to rvest

All the way back in Chapter 2, we used Google Sheets and importHTML to get our own data out of a website. For me, that’s a lot of pointing and clicking and copying and pasting. R has a library that can automate the harvesting of data from HTML on the internet. It’s called rvest.

Let’s grab a simple, basic HTML table from College Football Stats. There’s nothing particularly strange about this table – it’s simply formatted and easy to scrape.

First we’ll need some libraries. We’re going to use a library called rvest, which you can get by running install.packages('rvest') in the console.

library(rvest)
library(tidyverse)

The rvest package has functions that make fetching, reading and parsing HTML simple. The first thing we need to do is specify a url that we’re going to scrape.

scoringoffenseurl <- "http://www.cfbstats.com/2023/leader/national/team/offense/split01/category09/sort01.html"

Now, the most difficult part of scraping data from any website is knowing what exact HTML tag you need to grab. In this case, we want a <table> tag that has all of our data table in it. But how do you tell R which one that is? Well, it’s easy, once you know what to do. But it’s not simple. So I’ve made a short video to show you how to find it.

When you have simple tables, the code is very simple. You create a variable to receive the data, then pass it the url, read the html that was fetched, find the node you need using your XPath value you just copied and you tell rvest that it’s a table.

scoringoffense <- scoringoffenseurl |>
  read_html() |>
  html_nodes(xpath = '//*[@id="content"]/div[2]/table') |>
  html_table()

What we get from this is … not a dataframe. It’s a list with one element in it, which just so happens to be our dataframe. When you get this, the solution is simple: just overwrite the variable you created with the first list element.

scoringoffense <- scoringoffense[[1]]

And what do we have?

head(scoringoffense)
# A tibble: 6 × 10
     `` Name         G    TD    FG `1XP` `2XP` Safety Points `Points/G`
  <int> <chr>    <int> <int> <int> <int> <int>  <int>  <int>      <dbl>
1     1 LSU         13    79    12    78     1      1    592       45.5
2     2 Oregon      14    83    12    75     5      0    619       44.2
3     3 USC         13    74    10    66     2      0    544       41.8
4     4 Oklahoma    13    71    15    69     0      1    542       41.7
5     5 Georgia     14    71    21    71     0      1    562       40.1
6     6 Memphis     13    68    13    61     2      0    512       39.4

We have data, ready for analysis.

34.1 A slightly more complicated example

What if we want more than one year in our dataframe?

This is a common problem. What if we want to look at every scoring offense going back several years? The website has them going back to 2009. How can we combine them?

First, we should note, that the data does not have anything in it to indicate what year it comes from. So we’re going to have to add that. And we’re going to have to figure out a way to stack two dataframes on top of each other.

So let’s grab 2022.

scoringoffenseurl22 <- "http://www.cfbstats.com/2022/leader/national/team/offense/split01/category09/sort01.html"

scoringoffense22 <- scoringoffenseurl22 |>
  read_html() |>
  html_nodes(xpath = '//*[@id="content"]/div[2]/table') |>
  html_table()

scoringoffense22 <- scoringoffense22[[1]]

First, how are we going to know, in the data, which year our data is from? We can use mutate.

scoringoffense22 <- scoringoffense22 |> mutate(YEAR = 2022)
Error in initialize(...): attempt to use zero-length variable name

Uh oh. Error. What does it say? It’s … not clear, but a hint is that our first column doesn’t have a name. Each column must be named. If you look at our data in the environment tab in the upper right corner, you’ll see that indeed, the first column has no name. It’s the FBS rank of each team. So we can fix that and mutate in the same step. We’ll do that using rename and since the field doesn’t have a name to rename it, we’ll use a position argument. We’ll say rename column 1 as Rank.

scoringoffense23 <- scoringoffense |> rename(Rank = 1) |> mutate(YEAR = 2023)
scoringoffense22 <- scoringoffense22 |> rename(Rank = 1) |> mutate(YEAR = 2022)

And now, to combine the two tables together length-wise – we need to make long data – we’ll use a dpylr function called bind_rows. The good thing is bind_rows is simple.

combined <- bind_rows(scoringoffense23, scoringoffense22)

Note in the environment tab we now have a data frame called combined that has 264 observations – which just so happens to be what 131 from 2022 and 133 from 2023 add up to.

head(combined)
# A tibble: 6 × 11
   Rank Name         G    TD    FG `1XP` `2XP` Safety Points `Points/G`  YEAR
  <int> <chr>    <int> <int> <int> <int> <int>  <int>  <int>      <dbl> <dbl>
1     1 LSU         13    79    12    78     1      1    592       45.5  2023
2     2 Oregon      14    83    12    75     5      0    619       44.2  2023
3     3 USC         13    74    10    66     2      0    544       41.8  2023
4     4 Oklahoma    13    71    15    69     0      1    542       41.7  2023
5     5 Georgia     14    71    21    71     0      1    562       40.1  2023
6     6 Memphis     13    68    13    61     2      0    512       39.4  2023

34.2 An even more complicated example

What do you do when the table has non-standard headers?

Unfortunately, non-standard means there’s no one way to do it – it’s going to depend on the table and the headers. But here’s one idea: Don’t try to make it work.

I’ll explain.

Let’s try to get season team stats from Sports Reference. If you look at that page, you’ll see the problem right away – the headers span two rows, and they repeat. That’s going to be all kinds of no good. You can’t import that. Dataframes must have names all in one row. If you have two-line headers, you have a problem you have to fix before you can do anything else with it.

First we’ll grab the page.

url <- "https://www.sports-reference.com/cbb/seasons/2023-school-stats.html"

Now, similar to our example above, we’ll read the html, use XPath to find the table, and then read that table with a directive passed to it setting the header to FALSE. That tells rvest that there isn’t a header row. Just import it as data.

stats <- url |>
  read_html() |>
  html_nodes(xpath = '//*[@id="basic_school_stats"]') |>
  html_table(header=FALSE)

What we get back is a list of one element (similar to above). So let’s pop it out into a data frame.

stats <- stats[[1]] |> slice(-1) |> slice(-1)

And we’ll take a look at what we have.

head(stats)
# A tibble: 6 × 38
  X1    X2     X3    X4    X5    X6    X7    X8    X9    X10   X11   X12   X13  
  <chr> <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <lgl> <chr>
1 1     Abile… 30    13    17    .433  -2.79 0.90  NA    5     11    NA    9    
2 2     Air F… 32    14    18    .438  2.00  2.12  NA    5     13    NA    10   
3 3     Akron  33    22    11    .667  4.19  -1.65 NA    13    5     NA    15   
4 4     Alaba… 37    31    6     .838  23.19 9.65  NA    16    2     NA    15   
5 5     Alaba… 33    15    18    .455  -10.… -7.71 NA    10    8     NA    9    
6 6     Alaba… 31    8     23    .258  -16.… -6.29 NA    6     12    NA    5    
# … with 25 more variables: X14 <chr>, X15 <lgl>, X16 <chr>, X17 <chr>,
#   X18 <lgl>, X19 <chr>, X20 <chr>, X21 <lgl>, X22 <chr>, X23 <chr>,
#   X24 <chr>, X25 <chr>, X26 <chr>, X27 <chr>, X28 <chr>, X29 <chr>,
#   X30 <chr>, X31 <chr>, X32 <chr>, X33 <chr>, X34 <chr>, X35 <chr>,
#   X36 <chr>, X37 <chr>, X38 <chr>

So, that’s not ideal. We have headers and data mixed together, and our columns are named X1 to X38. Also note: They’re all character fields. Because the headers are interspersed with data, it all gets called character data. So we’ve got to first rename each field.

stats <- stats |> rename(Rank=X1, School=X2, Games=X3, OverallWins=X4, OverallLosses=X5, WinPct=X6, OverallSRS=X7, OverallSOS=X8, Blank1=X9, ConferenceWins=X10, ConferenceLosses=X11, Blank2=X12, HomeWins=X13, HomeLosses=X14, Blank3=X15, AwayWins=X16, AwayLosses=X17, Blank4=X18, ForPoints=X19, OppPoints=X20, Blank5=X21, Minutes=X22, FieldGoalsMade=X23, FieldGoalsAttempted=X24, FieldGoalPCT=X25, ThreePointMade=X26, ThreePointAttempts=X27, ThreePointPct=X28, FreeThrowsMade=X29, FreeThrowsAttempted=X30, FreeThrowPCT=X31, OffensiveRebounds=X32, TotalRebounds=X33, Assists=X34, Steals=X35, Blocks=X36, Turnovers=X37, PersonalFouls=X38)

Now we have to get rid of those headers interspersed in the data. We can do that with filter that say keep all the stuff that isn’t this.

stats <- stats |> filter(Rank != "Rk" & Games != "Overall") 

And finally, we need to change the file type of all the fields that need it. We’re going to use a clever little trick, which goes like this: We’re going to use mutate_at, which means mutate these fields. The pattern for mutate_at is mutate_at these variables and do this thing to them. But instead of specifying which of 38 variables we’re going to mutate, we’re going to specify the one we don’t want to change, which is the name of the school. And we just want to convert them to numeric, which is simple. Here’s what it looks like:

stats <- stats |> mutate_at(vars(-School), as.numeric)

One last thing: Who needs columns called Blank1, Blank2, Blank3, etc?

stats <- stats |> select(-starts_with("Blank"))

And just like that, we have a method for getting up to the minute season stats for every team in Division I.

head(stats)
# A tibble: 6 × 33
   Rank School      Games Overa…¹ Overa…² WinPct Overa…³ Overa…⁴ Confe…⁵ Confe…⁶
  <dbl> <chr>       <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1     1 Abilene Ch…    30      13      17  0.433   -2.79    0.9        5      11
2     2 Air Force      32      14      18  0.438    2       2.12       5      13
3     3 Akron          33      22      11  0.667    4.19   -1.65      13       5
4     4 Alabama NC…    37      31       6  0.838   23.2     9.65      16       2
5     5 Alabama A&M    33      15      18  0.455  -10.8    -7.71      10       8
6     6 Alabama St…    31       8      23  0.258  -16.3    -6.29       6      12
# … with 23 more variables: HomeWins <dbl>, HomeLosses <dbl>, AwayWins <dbl>,
#   AwayLosses <dbl>, ForPoints <dbl>, OppPoints <dbl>, Minutes <dbl>,
#   FieldGoalsMade <dbl>, FieldGoalsAttempted <dbl>, FieldGoalPCT <dbl>,
#   ThreePointMade <dbl>, ThreePointAttempts <dbl>, ThreePointPct <dbl>,
#   FreeThrowsMade <dbl>, FreeThrowsAttempted <dbl>, FreeThrowPCT <dbl>,
#   OffensiveRebounds <dbl>, TotalRebounds <dbl>, Assists <dbl>, Steals <dbl>,
#   Blocks <dbl>, Turnovers <dbl>, PersonalFouls <dbl>, and abbreviated …