library(tidyverse)
30 Text cleaning
On occasion, you’ll get some data from someone that … isn’t quite what you need it to be. There’s something flawed in it. Some extra text, some choice that the data provider made that you just don’t agree with.
There’s a ton of tools in the tidyverse to fix this, and you already have some tools in your toolbox. Let’s take a look at a couple.
First, you know what you need.
Now, two examples.
30.1 Stripping out text
Throughout this class, we’ve used data from Sports Reference. If you’ve used their Share > CSV method to copy data from a table, you may have noticed some extra cruft in the player name field. If you haven’t seen it, I’ll give you an example – a dataset of NBA players and their advanced metrics.
For this walkthrough:
Now load it.
<- read_csv("data/nbaplayers.csv") nbaplayers
New names:
Rows: 624 Columns: 29
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(3): Player, Pos, Tm dbl (24): Rk, Age, G, MP, PER, TS%, 3PAr, FTr, ORB%, DRB%,
TRB%, AST%, STL%,... lgl (2): ...20, ...25
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...20`
• `` -> `...25`
Let’s take a look:
head(nbaplayers)
# A tibble: 6 × 29
Rk Player Pos Age Tm G MP PER `TS%` `3PAr` FTr `ORB%`
<dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 "Steven A… C 26 OKC 58 1564 20.8 0.605 0.007 0.413 14.4
2 2 "Bam Adeb… PF 22 MIA 65 2235 20.6 0.606 0.018 0.476 8.7
3 3 "LaMarcus… C 34 SAS 53 1754 19.8 0.571 0.198 0.241 6.3
4 4 "Nickeil … SG 21 NOP 41 501 7.6 0.441 0.515 0.123 1.7
5 5 "Grayson … SG 24 MEM 30 498 11.4 0.577 0.517 0.199 1.1
6 6 "Jarrett … C 21 BRK 64 1647 20.3 0.658 0.012 0.574 12.5
# … with 17 more variables: `DRB%` <dbl>, `TRB%` <dbl>, `AST%` <dbl>,
# `STL%` <dbl>, `BLK%` <dbl>, `TOV%` <dbl>, `USG%` <dbl>, ...20 <lgl>,
# OWS <dbl>, DWS <dbl>, WS <dbl>, `WS/48` <dbl>, ...25 <lgl>, OBPM <dbl>,
# DBPM <dbl>, BPM <dbl>, VORP <dbl>
You can see that every players name is their name, then two backslashes, then some version of their name that must have meaning to Sports Reference, but not to us. So we need to get rid of that.
To do this, we’re going to use a little regular expression magic. Regular expressions are a programmatic way to find any pattern in text. What we’re looking for is that \\
business. But, that presents a problem, because the \
is a special character. It’s called an escape character. That escape character means what comes next is potentially special. For instance, if you see \n
, that’s a newline character. So normally, if you see that, it would add a return.
So for us to get rid of the \
we’re going to have to escape the escape character with an escape character. And we have two of them. So we have to do it twice.
Yes. Really.
So if we wanted to find two backslashes, we need \\\\
. Then, using regular expressions, we can say “and then everything else after this” with this: .*
No really. That’s it. So we’re looking for \\\\.*
. That’ll find two backslashes and then everything after it. If you think this is hard … you’re right. Regular expressions are an entire month of a programming course by themselves. They are EXTREMELY powerful.
To find something in text, we’ll use a function called gsub
. The pattern in gsub
is pattern, what we want to replace it with, what column this can all be found in
. So in our example, the pattern is \\\\.*
, what we want to replace it with is … nothing, and this is all in the Player column. Here’s the code.
|> mutate(Player=gsub("\\\\.*","",Player)) |> head() nbaplayers
# A tibble: 6 × 29
Rk Player Pos Age Tm G MP PER `TS%` `3PAr` FTr `ORB%`
<dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 Steven Ad… C 26 OKC 58 1564 20.8 0.605 0.007 0.413 14.4
2 2 Bam Adeba… PF 22 MIA 65 2235 20.6 0.606 0.018 0.476 8.7
3 3 LaMarcus … C 34 SAS 53 1754 19.8 0.571 0.198 0.241 6.3
4 4 Nickeil A… SG 21 NOP 41 501 7.6 0.441 0.515 0.123 1.7
5 5 Grayson A… SG 24 MEM 30 498 11.4 0.577 0.517 0.199 1.1
6 6 Jarrett A… C 21 BRK 64 1647 20.3 0.658 0.012 0.574 12.5
# … with 17 more variables: `DRB%` <dbl>, `TRB%` <dbl>, `AST%` <dbl>,
# `STL%` <dbl>, `BLK%` <dbl>, `TOV%` <dbl>, `USG%` <dbl>, ...20 <lgl>,
# OWS <dbl>, DWS <dbl>, WS <dbl>, `WS/48` <dbl>, ...25 <lgl>, OBPM <dbl>,
# DBPM <dbl>, BPM <dbl>, VORP <dbl>
Just like that, the trash is gone.
30.2 Another example: splitting columns
Text cleaning is really just a set of logic puzzles. What do I need to do? How can I get there step by step?
The NCAA does some very interesting things with data, making it pretty useless.
For this walkthrough:
Let’s import it and take a look.
<- read_csv("data/killsperset.csv") kills
Rows: 150 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Player, Cl, Ht, Pos, Season
dbl (4): Rank, S, Kills, Per Set
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(kills)
# A tibble: 6 × 9
Rank Player Cl Ht Pos S Kills Per S…¹ Season
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 1 Lindsey Ruddins, UC Santa … So. 6-2 OH 90 526 5.84 2017-…
2 2 Pilar Victoria, Arkansas (… Sr. 5-11 OH 116 634 5.47 2017-…
3 3 Laura Milos, Oral Roberts … Sr. 5-10 OH 106 560 5.28 2017-…
4 4 Carlyle Nusbaum, Lipscomb … Jr. 5-10 OH 100 522 5.22 2017-…
5 5 Veronica Jones-Perry, BYU … Jr. 6-0 OH 118 569 4.82 2017-…
6 6 Torrey Van Winden, Cal Pol… So. 6-3 OH 101 477 4.72 2017-…
# … with abbreviated variable name ¹`Per Set`
First things first, Player isn’t just player, it’s player, school and conference, all in one. And Ht is a character field – and in feet and inches.
So … this is a mess. But there is a pattern. See it? A comma after the player’s name. The Conference is in parens. We can use that.
For this, we’re going to use a tidyr
function called separate
to split columns into multiple columns based on a character. We’ll do this step by step.
First, let’s use that comma to split the player and the rest. Ignore the head at the end. That’s just to keep it from showing you all 150.
|> separate(Player, into=c("Player", "School"), sep=",") |> head() kills
# A tibble: 6 × 10
Rank Player School Cl Ht Pos S Kills Per S…¹ Season
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 1 Lindsey Ruddins " UC … So. 6-2 OH 90 526 5.84 2017-…
2 2 Pilar Victoria " Ark… Sr. 5-11 OH 116 634 5.47 2017-…
3 3 Laura Milos " Ora… Sr. 5-10 OH 106 560 5.28 2017-…
4 4 Carlyle Nusbaum " Lip… Jr. 5-10 OH 100 522 5.22 2017-…
5 5 Veronica Jones-Perry " BYU… Jr. 6-0 OH 118 569 4.82 2017-…
6 6 Torrey Van Winden " Cal… So. 6-3 OH 101 477 4.72 2017-…
# … with abbreviated variable name ¹`Per Set`
Good start.
Now, let’s get the conference separated. A problem is going to crop up here – the paren is a special character, so we have to escape it with the \\
.
|>
kills separate(Player, into=c("Player", "School"), sep=",") |>
separate(School, into=c("School", "Conference"), sep="\\(") |>
head()
Warning: Expected 2 pieces. Additional pieces discarded in 3 rows [15, 42, 83].
# A tibble: 6 × 11
Rank Player School Confe…¹ Cl Ht Pos S Kills Per S…² Season
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 1 Lindsey Rud… " UC … Big We… So. 6-2 OH 90 526 5.84 2017-…
2 2 Pilar Victo… " Ark… SEC) Sr. 5-11 OH 116 634 5.47 2017-…
3 3 Laura Milos " Ora… Summit… Sr. 5-10 OH 106 560 5.28 2017-…
4 4 Carlyle Nus… " Lip… ASUN) Jr. 5-10 OH 100 522 5.22 2017-…
5 5 Veronica Jo… " BYU… WCC) Jr. 6-0 OH 118 569 4.82 2017-…
6 6 Torrey Van … " Cal… Big We… So. 6-3 OH 101 477 4.72 2017-…
# … with abbreviated variable names ¹Conference, ²`Per Set`
Uh oh. Says we have problems in rows 15, 42 and 83. What are they? The NCAA has decided to put (FL), (NY) and (PA) into three teams to tell you they’re in Florida, New York and Pennsylvania respectively. Well, we can fix that with some gsub and we’ll use a switch called fixed
, which when set to TRUE it means this literal string, no special characters.
|>
kills separate(Player, into=c("Player", "School"), sep=",") |>
mutate(School = gsub("(FL)", "FL", School, fixed=TRUE)) |>
mutate(School = gsub("(NY)", "NY", School, fixed=TRUE)) |>
mutate(School = gsub("(PA)", "PA", School, fixed=TRUE)) |>
separate(School, into=c("School", "Conference"), sep="\\(") |>
head()
# A tibble: 6 × 11
Rank Player School Confe…¹ Cl Ht Pos S Kills Per S…² Season
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 1 Lindsey Rud… " UC … Big We… So. 6-2 OH 90 526 5.84 2017-…
2 2 Pilar Victo… " Ark… SEC) Sr. 5-11 OH 116 634 5.47 2017-…
3 3 Laura Milos " Ora… Summit… Sr. 5-10 OH 106 560 5.28 2017-…
4 4 Carlyle Nus… " Lip… ASUN) Jr. 5-10 OH 100 522 5.22 2017-…
5 5 Veronica Jo… " BYU… WCC) Jr. 6-0 OH 118 569 4.82 2017-…
6 6 Torrey Van … " Cal… Big We… So. 6-3 OH 101 477 4.72 2017-…
# … with abbreviated variable names ¹Conference, ²`Per Set`
One last thing: see the trailing paren?
|>
kills separate(Player, into=c("Player", "School"), sep=",") |>
mutate(School = gsub("(FL)", "FL", School, fixed=TRUE)) |>
mutate(School = gsub("(NY)", "NY", School, fixed=TRUE)) |>
mutate(School = gsub("(PA)", "PA", School, fixed=TRUE)) |>
separate(School, into=c("School", "Conference"), sep="\\(") |>
mutate(Conference=gsub(")", "", Conference)) |>
head()
# A tibble: 6 × 11
Rank Player School Confe…¹ Cl Ht Pos S Kills Per S…² Season
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 1 Lindsey Rud… " UC … Big We… So. 6-2 OH 90 526 5.84 2017-…
2 2 Pilar Victo… " Ark… SEC Sr. 5-11 OH 116 634 5.47 2017-…
3 3 Laura Milos " Ora… Summit… Sr. 5-10 OH 106 560 5.28 2017-…
4 4 Carlyle Nus… " Lip… ASUN Jr. 5-10 OH 100 522 5.22 2017-…
5 5 Veronica Jo… " BYU… WCC Jr. 6-0 OH 118 569 4.82 2017-…
6 6 Torrey Van … " Cal… Big We… So. 6-3 OH 101 477 4.72 2017-…
# … with abbreviated variable names ¹Conference, ²`Per Set`
Looking good, no errors.
Now, what should we do about Ht? 6-2 is not going to tell me much when I want to run a regression of height to kills per set. And it’s a character field. So we need to convert it to numbers.
Separate again comes to the rescue.
|>
kills separate(Player, into=c("Player", "School"), sep=",") |>
mutate(School = gsub("(FL)", "FL", School, fixed=TRUE)) |>
mutate(School = gsub("(NY)", "NY", School, fixed=TRUE)) |>
mutate(School = gsub("(PA)", "PA", School, fixed=TRUE)) |>
separate(School, into=c("School", "Conference"), sep="\\(") |>
mutate(Conference=gsub(")", "", Conference)) |>
separate(Ht, into=c("Feet", "Inches"), sep="-") |>
mutate(Feet = as.numeric(Feet), Inches = as.numeric(Inches)) |>
head()
# A tibble: 6 × 12
Rank Player School Confe…¹ Cl Feet Inches Pos S Kills Per S…²
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 Lindsey Rud… " UC … Big We… So. 6 2 OH 90 526 5.84
2 2 Pilar Victo… " Ark… SEC Sr. 5 11 OH 116 634 5.47
3 3 Laura Milos " Ora… Summit… Sr. 5 10 OH 106 560 5.28
4 4 Carlyle Nus… " Lip… ASUN Jr. 5 10 OH 100 522 5.22
5 5 Veronica Jo… " BYU… WCC Jr. 6 0 OH 118 569 4.82
6 6 Torrey Van … " Cal… Big We… So. 6 3 OH 101 477 4.72
# … with 1 more variable: Season <chr>, and abbreviated variable names
# ¹Conference, ²`Per Set`
But how do we turn that into a height? Math!
|>
kills separate(Player, into=c("Player", "School"), sep=",") |>
mutate(School = gsub("(FL)", "FL", School, fixed=TRUE)) |>
mutate(School = gsub("(NY)", "NY", School, fixed=TRUE)) |>
mutate(School = gsub("(PA)", "PA", School, fixed=TRUE)) |>
separate(School, into=c("School", "Conference"), sep="\\(") |>
mutate(Conference=gsub(")", "", Conference)) |>
separate(Ht, into=c("Feet", "Inches"), sep="-") |>
mutate(Feet = as.numeric(Feet), Inches = as.numeric(Inches)) |>
mutate(Height = (Feet*12)+Inches) |>
head()
# A tibble: 6 × 13
Rank Player School Confe…¹ Cl Feet Inches Pos S Kills Per S…²
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 Lindsey Rud… " UC … Big We… So. 6 2 OH 90 526 5.84
2 2 Pilar Victo… " Ark… SEC Sr. 5 11 OH 116 634 5.47
3 3 Laura Milos " Ora… Summit… Sr. 5 10 OH 106 560 5.28
4 4 Carlyle Nus… " Lip… ASUN Jr. 5 10 OH 100 522 5.22
5 5 Veronica Jo… " BYU… WCC Jr. 6 0 OH 118 569 4.82
6 6 Torrey Van … " Cal… Big We… So. 6 3 OH 101 477 4.72
# … with 2 more variables: Season <chr>, Height <dbl>, and abbreviated variable
# names ¹Conference, ²`Per Set`
And now, in 10 lines of code, using separate, mutate and gsub, we’ve turned the mess that is the NCAA’s data into actually useful data we can analyze.
These patterns of thought come in handy when facing messed up data.