30  Text cleaning

On occasion, you’ll get some data from someone that … isn’t quite what you need it to be. There’s something flawed in it. Some extra text, some choice that the data provider made that you just don’t agree with.

There’s a ton of tools in the tidyverse to fix this, and you already have some tools in your toolbox. Let’s take a look at a couple.

First, you know what you need.

library(tidyverse)

Now, two examples.

30.1 Stripping out text

Throughout this class, we’ve used data from Sports Reference. If you’ve used their Share > CSV method to copy data from a table, you may have noticed some extra cruft in the player name field. If you haven’t seen it, I’ll give you an example – a dataset of NBA players and their advanced metrics.

For this walkthrough:

Now load it.

nbaplayers <- read_csv("data/nbaplayers.csv")
New names:
Rows: 624 Columns: 29
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(3): Player, Pos, Tm dbl (24): Rk, Age, G, MP, PER, TS%, 3PAr, FTr, ORB%, DRB%,
TRB%, AST%, STL%,... lgl (2): ...20, ...25
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...20`
• `` -> `...25`

Let’s take a look:

head(nbaplayers)
# A tibble: 6 × 29
     Rk Player     Pos     Age Tm        G    MP   PER `TS%` `3PAr`   FTr `ORB%`
  <dbl> <chr>      <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>
1     1 "Steven A… C        26 OKC      58  1564  20.8 0.605  0.007 0.413   14.4
2     2 "Bam Adeb… PF       22 MIA      65  2235  20.6 0.606  0.018 0.476    8.7
3     3 "LaMarcus… C        34 SAS      53  1754  19.8 0.571  0.198 0.241    6.3
4     4 "Nickeil … SG       21 NOP      41   501   7.6 0.441  0.515 0.123    1.7
5     5 "Grayson … SG       24 MEM      30   498  11.4 0.577  0.517 0.199    1.1
6     6 "Jarrett … C        21 BRK      64  1647  20.3 0.658  0.012 0.574   12.5
# … with 17 more variables: `DRB%` <dbl>, `TRB%` <dbl>, `AST%` <dbl>,
#   `STL%` <dbl>, `BLK%` <dbl>, `TOV%` <dbl>, `USG%` <dbl>, ...20 <lgl>,
#   OWS <dbl>, DWS <dbl>, WS <dbl>, `WS/48` <dbl>, ...25 <lgl>, OBPM <dbl>,
#   DBPM <dbl>, BPM <dbl>, VORP <dbl>

You can see that every players name is their name, then two backslashes, then some version of their name that must have meaning to Sports Reference, but not to us. So we need to get rid of that.

To do this, we’re going to use a little regular expression magic. Regular expressions are a programmatic way to find any pattern in text. What we’re looking for is that \\ business. But, that presents a problem, because the \ is a special character. It’s called an escape character. That escape character means what comes next is potentially special. For instance, if you see \n, that’s a newline character. So normally, if you see that, it would add a return.

So for us to get rid of the \ we’re going to have to escape the escape character with an escape character. And we have two of them. So we have to do it twice.

Yes. Really.

So if we wanted to find two backslashes, we need \\\\. Then, using regular expressions, we can say “and then everything else after this” with this: .*

No really. That’s it. So we’re looking for \\\\.*. That’ll find two backslashes and then everything after it. If you think this is hard … you’re right. Regular expressions are an entire month of a programming course by themselves. They are EXTREMELY powerful.

To find something in text, we’ll use a function called gsub. The pattern in gsub is pattern, what we want to replace it with, what column this can all be found in. So in our example, the pattern is \\\\.*, what we want to replace it with is … nothing, and this is all in the Player column. Here’s the code.

nbaplayers |> mutate(Player=gsub("\\\\.*","",Player)) |> head()
# A tibble: 6 × 29
     Rk Player     Pos     Age Tm        G    MP   PER `TS%` `3PAr`   FTr `ORB%`
  <dbl> <chr>      <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>
1     1 Steven Ad… C        26 OKC      58  1564  20.8 0.605  0.007 0.413   14.4
2     2 Bam Adeba… PF       22 MIA      65  2235  20.6 0.606  0.018 0.476    8.7
3     3 LaMarcus … C        34 SAS      53  1754  19.8 0.571  0.198 0.241    6.3
4     4 Nickeil A… SG       21 NOP      41   501   7.6 0.441  0.515 0.123    1.7
5     5 Grayson A… SG       24 MEM      30   498  11.4 0.577  0.517 0.199    1.1
6     6 Jarrett A… C        21 BRK      64  1647  20.3 0.658  0.012 0.574   12.5
# … with 17 more variables: `DRB%` <dbl>, `TRB%` <dbl>, `AST%` <dbl>,
#   `STL%` <dbl>, `BLK%` <dbl>, `TOV%` <dbl>, `USG%` <dbl>, ...20 <lgl>,
#   OWS <dbl>, DWS <dbl>, WS <dbl>, `WS/48` <dbl>, ...25 <lgl>, OBPM <dbl>,
#   DBPM <dbl>, BPM <dbl>, VORP <dbl>

Just like that, the trash is gone.

30.2 Another example: splitting columns

Text cleaning is really just a set of logic puzzles. What do I need to do? How can I get there step by step?

The NCAA does some very interesting things with data, making it pretty useless.

For this walkthrough:

Let’s import it and take a look.

kills <- read_csv("data/killsperset.csv")
Rows: 150 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Player, Cl, Ht, Pos, Season
dbl (4): Rank, S, Kills, Per Set

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(kills)
# A tibble: 6 × 9
   Rank Player                      Cl    Ht    Pos       S Kills Per S…¹ Season
  <dbl> <chr>                       <chr> <chr> <chr> <dbl> <dbl>   <dbl> <chr> 
1     1 Lindsey Ruddins, UC Santa … So.   6-2   OH       90   526    5.84 2017-…
2     2 Pilar Victoria, Arkansas (… Sr.   5-11  OH      116   634    5.47 2017-…
3     3 Laura Milos, Oral Roberts … Sr.   5-10  OH      106   560    5.28 2017-…
4     4 Carlyle Nusbaum, Lipscomb … Jr.   5-10  OH      100   522    5.22 2017-…
5     5 Veronica Jones-Perry, BYU … Jr.   6-0   OH      118   569    4.82 2017-…
6     6 Torrey Van Winden, Cal Pol… So.   6-3   OH      101   477    4.72 2017-…
# … with abbreviated variable name ¹​`Per Set`

First things first, Player isn’t just player, it’s player, school and conference, all in one. And Ht is a character field – and in feet and inches.

So … this is a mess. But there is a pattern. See it? A comma after the player’s name. The Conference is in parens. We can use that.

For this, we’re going to use a tidyr function called separate to split columns into multiple columns based on a character. We’ll do this step by step.

First, let’s use that comma to split the player and the rest. Ignore the head at the end. That’s just to keep it from showing you all 150.

kills |> separate(Player, into=c("Player", "School"), sep=",") |> head()
# A tibble: 6 × 10
   Rank Player               School Cl    Ht    Pos       S Kills Per S…¹ Season
  <dbl> <chr>                <chr>  <chr> <chr> <chr> <dbl> <dbl>   <dbl> <chr> 
1     1 Lindsey Ruddins      " UC … So.   6-2   OH       90   526    5.84 2017-…
2     2 Pilar Victoria       " Ark… Sr.   5-11  OH      116   634    5.47 2017-…
3     3 Laura Milos          " Ora… Sr.   5-10  OH      106   560    5.28 2017-…
4     4 Carlyle Nusbaum      " Lip… Jr.   5-10  OH      100   522    5.22 2017-…
5     5 Veronica Jones-Perry " BYU… Jr.   6-0   OH      118   569    4.82 2017-…
6     6 Torrey Van Winden    " Cal… So.   6-3   OH      101   477    4.72 2017-…
# … with abbreviated variable name ¹​`Per Set`

Good start.

Now, let’s get the conference separated. A problem is going to crop up here – the paren is a special character, so we have to escape it with the \\.

kills |> 
  separate(Player, into=c("Player", "School"), sep=",") |>
  separate(School, into=c("School", "Conference"), sep="\\(") |> 
  head()
Warning: Expected 2 pieces. Additional pieces discarded in 3 rows [15, 42, 83].
# A tibble: 6 × 11
   Rank Player       School Confe…¹ Cl    Ht    Pos       S Kills Per S…² Season
  <dbl> <chr>        <chr>  <chr>   <chr> <chr> <chr> <dbl> <dbl>   <dbl> <chr> 
1     1 Lindsey Rud… " UC … Big We… So.   6-2   OH       90   526    5.84 2017-…
2     2 Pilar Victo… " Ark… SEC)    Sr.   5-11  OH      116   634    5.47 2017-…
3     3 Laura Milos  " Ora… Summit… Sr.   5-10  OH      106   560    5.28 2017-…
4     4 Carlyle Nus… " Lip… ASUN)   Jr.   5-10  OH      100   522    5.22 2017-…
5     5 Veronica Jo… " BYU… WCC)    Jr.   6-0   OH      118   569    4.82 2017-…
6     6 Torrey Van … " Cal… Big We… So.   6-3   OH      101   477    4.72 2017-…
# … with abbreviated variable names ¹​Conference, ²​`Per Set`

Uh oh. Says we have problems in rows 15, 42 and 83. What are they? The NCAA has decided to put (FL), (NY) and (PA) into three teams to tell you they’re in Florida, New York and Pennsylvania respectively. Well, we can fix that with some gsub and we’ll use a switch called fixed, which when set to TRUE it means this literal string, no special characters.

kills |> 
  separate(Player, into=c("Player", "School"), sep=",") |> 
  mutate(School = gsub("(FL)", "FL", School, fixed=TRUE)) |>
  mutate(School = gsub("(NY)", "NY", School, fixed=TRUE)) |>
  mutate(School = gsub("(PA)", "PA", School, fixed=TRUE)) |>
  separate(School, into=c("School", "Conference"), sep="\\(") |> 
  head()
# A tibble: 6 × 11
   Rank Player       School Confe…¹ Cl    Ht    Pos       S Kills Per S…² Season
  <dbl> <chr>        <chr>  <chr>   <chr> <chr> <chr> <dbl> <dbl>   <dbl> <chr> 
1     1 Lindsey Rud… " UC … Big We… So.   6-2   OH       90   526    5.84 2017-…
2     2 Pilar Victo… " Ark… SEC)    Sr.   5-11  OH      116   634    5.47 2017-…
3     3 Laura Milos  " Ora… Summit… Sr.   5-10  OH      106   560    5.28 2017-…
4     4 Carlyle Nus… " Lip… ASUN)   Jr.   5-10  OH      100   522    5.22 2017-…
5     5 Veronica Jo… " BYU… WCC)    Jr.   6-0   OH      118   569    4.82 2017-…
6     6 Torrey Van … " Cal… Big We… So.   6-3   OH      101   477    4.72 2017-…
# … with abbreviated variable names ¹​Conference, ²​`Per Set`

One last thing: see the trailing paren?

kills |> 
  separate(Player, into=c("Player", "School"), sep=",") |> 
  mutate(School = gsub("(FL)", "FL", School, fixed=TRUE)) |>
  mutate(School = gsub("(NY)", "NY", School, fixed=TRUE)) |>
  mutate(School = gsub("(PA)", "PA", School, fixed=TRUE)) |>
  separate(School, into=c("School", "Conference"), sep="\\(") |> 
  mutate(Conference=gsub(")", "", Conference)) |>
  head()
# A tibble: 6 × 11
   Rank Player       School Confe…¹ Cl    Ht    Pos       S Kills Per S…² Season
  <dbl> <chr>        <chr>  <chr>   <chr> <chr> <chr> <dbl> <dbl>   <dbl> <chr> 
1     1 Lindsey Rud… " UC … Big We… So.   6-2   OH       90   526    5.84 2017-…
2     2 Pilar Victo… " Ark… SEC     Sr.   5-11  OH      116   634    5.47 2017-…
3     3 Laura Milos  " Ora… Summit… Sr.   5-10  OH      106   560    5.28 2017-…
4     4 Carlyle Nus… " Lip… ASUN    Jr.   5-10  OH      100   522    5.22 2017-…
5     5 Veronica Jo… " BYU… WCC     Jr.   6-0   OH      118   569    4.82 2017-…
6     6 Torrey Van … " Cal… Big We… So.   6-3   OH      101   477    4.72 2017-…
# … with abbreviated variable names ¹​Conference, ²​`Per Set`

Looking good, no errors.

Now, what should we do about Ht? 6-2 is not going to tell me much when I want to run a regression of height to kills per set. And it’s a character field. So we need to convert it to numbers.

Separate again comes to the rescue.

kills |> 
  separate(Player, into=c("Player", "School"), sep=",") |> 
  mutate(School = gsub("(FL)", "FL", School, fixed=TRUE)) |>
  mutate(School = gsub("(NY)", "NY", School, fixed=TRUE)) |>
  mutate(School = gsub("(PA)", "PA", School, fixed=TRUE)) |>
  separate(School, into=c("School", "Conference"), sep="\\(") |> 
  mutate(Conference=gsub(")", "", Conference)) |>
  separate(Ht, into=c("Feet", "Inches"), sep="-") |>
  mutate(Feet = as.numeric(Feet), Inches = as.numeric(Inches)) |>
  head()
# A tibble: 6 × 12
   Rank Player       School Confe…¹ Cl     Feet Inches Pos       S Kills Per S…²
  <dbl> <chr>        <chr>  <chr>   <chr> <dbl>  <dbl> <chr> <dbl> <dbl>   <dbl>
1     1 Lindsey Rud… " UC … Big We… So.       6      2 OH       90   526    5.84
2     2 Pilar Victo… " Ark… SEC     Sr.       5     11 OH      116   634    5.47
3     3 Laura Milos  " Ora… Summit… Sr.       5     10 OH      106   560    5.28
4     4 Carlyle Nus… " Lip… ASUN    Jr.       5     10 OH      100   522    5.22
5     5 Veronica Jo… " BYU… WCC     Jr.       6      0 OH      118   569    4.82
6     6 Torrey Van … " Cal… Big We… So.       6      3 OH      101   477    4.72
# … with 1 more variable: Season <chr>, and abbreviated variable names
#   ¹​Conference, ²​`Per Set`

But how do we turn that into a height? Math!

kills |> 
  separate(Player, into=c("Player", "School"), sep=",") |> 
  mutate(School = gsub("(FL)", "FL", School, fixed=TRUE)) |>
  mutate(School = gsub("(NY)", "NY", School, fixed=TRUE)) |>
  mutate(School = gsub("(PA)", "PA", School, fixed=TRUE)) |>
  separate(School, into=c("School", "Conference"), sep="\\(") |> 
  mutate(Conference=gsub(")", "", Conference)) |>
  separate(Ht, into=c("Feet", "Inches"), sep="-") |>
  mutate(Feet = as.numeric(Feet), Inches = as.numeric(Inches)) |>
  mutate(Height = (Feet*12)+Inches) |>
  head()
# A tibble: 6 × 13
   Rank Player       School Confe…¹ Cl     Feet Inches Pos       S Kills Per S…²
  <dbl> <chr>        <chr>  <chr>   <chr> <dbl>  <dbl> <chr> <dbl> <dbl>   <dbl>
1     1 Lindsey Rud… " UC … Big We… So.       6      2 OH       90   526    5.84
2     2 Pilar Victo… " Ark… SEC     Sr.       5     11 OH      116   634    5.47
3     3 Laura Milos  " Ora… Summit… Sr.       5     10 OH      106   560    5.28
4     4 Carlyle Nus… " Lip… ASUN    Jr.       5     10 OH      100   522    5.22
5     5 Veronica Jo… " BYU… WCC     Jr.       6      0 OH      118   569    4.82
6     6 Torrey Van … " Cal… Big We… So.       6      3 OH      101   477    4.72
# … with 2 more variables: Season <chr>, Height <dbl>, and abbreviated variable
#   names ¹​Conference, ²​`Per Set`

And now, in 10 lines of code, using separate, mutate and gsub, we’ve turned the mess that is the NCAA’s data into actually useful data we can analyze.

These patterns of thought come in handy when facing messed up data.