library(tidyverse)
library(janitor)
20 Data Cleaning Part II: Janitor
The necessary bane of every data journalist’s existence is data cleaning.
Every developer, every data system, every agency, they all have opinions about how data gets collected. Some decisions make sense from the outside. Some decisions are based entirely on internal politics: who is creating the data, how they are creating it, why they are creating it. Is it automated? Is it manual? Are data normalized? Are there free form fields where users can just type into or does the system restrict them to choices?
Your journalistic questions – what you want the data to tell you – are almost never part of that equation.
So cleaning data is the process of fixing issues in your data so you can answer the questions you want to answer. Data cleaning is a critical step that you can’t skip. A standard metric is that 80 percent of the time working with data will be spent cleaning and verifying data, and 20 percent the more exciting parts like analysis and visualization.
The tidyverse has a lot of built-in tools for data cleaning. We’re also going to make use of a new library, called janitor
that has a bunch of great functions for cleaning data. Let’s load those now.
Let’s continue with our Maryland grants and loans data that we worked with in the previous chapter.
<- read_csv("data/State_of_Maryland_Grant_and_Loan_Data__FY2009_to_FY2022_20250115.csv") md_grants_loans
Rows: 19482 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Grantor, Grantee, Zip Code, Description, Category, Date
dbl (3): Fiscal Year, Amount, Fiscal Period
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
There are a number of issues with this data set that might get in the way of asking questions and receiving accurate answers. They are:
- The column names have spaces in them. This isn’t a deal-breaker, as we used this dataframe previously. But it does require that you do some things differently when writing code, and ideally you don’t want spaces in your column names.
- Inconsistent capitalization across multiple columns. Sometimes the grantee is capitalized, and other times not. Portions of the grantor name are sometimes capitalized. This issue will ruin your ability to count and add things using those columns.
- The zip field mixes five digit ZIP codes and nine digit ZIP codes, and some of the records include spaces. If we wanted to group and count the number of loans in a given ZIP code, this inconsistency would not let us do that correctly.
- The category column is inconsistent and has some missing values.
Let’s get cleaning. Our goal will be to build up one block of code that does all the necessary cleaning in order to answer this question: which zip code has gotten the most amount of money from the Maryland Tourism Board?
20.1 Cleaning headers
One of the first places we can start with cleaning data is cleaning the column names (or headers).
Every system has their own way of recording headers, and every developer has their own thoughts of what a good idea is within it. R is most happy when headers are lower case, without special characters.
If column headers start with a number, or have a space in between two words, you have to set them off with backticks when using them in a function. Generally speaking, we want one word (or words separated by an underscore), all lowercase, that don’t start with numbers.
The janitor
library makes fixing headers trivially simple with the function clean_names()
# cleaning function
<- md_grants_loans |>
cleaned_md_grants_loans clean_names()
# display the cleaned dataset
cleaned_md_grants_loans
# A tibble: 19,482 × 9
grantor grantee zip_code fiscal_year amount description category
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 Commerce/Maryland T… PRINCE… 20772 2017 1.28e5 Maryland T… Grant
2 Department of Healt… Associ… 21201 2010 9.34e4 Minority O… Grant
3 Maryland Department… WESTED… 94107-1… 2014 1.61e6 GRANTS FOR… Grant
4 Department of Healt… Maryla… 21075 2010 2.14e5 Babies Bor… Grant
5 Department of Natur… Anacos… 20710 2017 1.74e5 Payments m… Grant
6 Department of Busin… Washin… 21740 2009 5.59e4 grant fund… Grant
7 Boards and Commissi… Domest… 21045 2014 1.72e5 Domestic V… Grant
8 MD Small Business D… Dacore… 20601 2018 1.04e5 Maryland S… Loan
9 Maryland Higher Edu… Mount … 21727 2015 1.75e6 Sellinger … Grant
10 Department of Busin… Olney … 20830 2010 2.16e5 Grant fund… Grant
# ℹ 19,472 more rows
# ℹ 2 more variables: fiscal_period <dbl>, date <chr>
This function changed Zip Code
to zip_code
and generally got rid of capital letters and replaced spaces with underscores. If we wanted to rename a column, we can use a tidyverse function rename()
to do that. Let’s change grantor
to source
as an example. NOTE: when using rename()
, the new name comes first.
# cleaning function
<- md_grants_loans |>
cleaned_md_grants_loans clean_names() |>
rename(source = grantor)
# display the cleaned dataset
cleaned_md_grants_loans
# A tibble: 19,482 × 9
source grantee zip_code fiscal_year amount description category fiscal_period
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
1 Comme… PRINCE… 20772 2017 1.28e5 Maryland T… Grant 1
2 Depar… Associ… 21201 2010 9.34e4 Minority O… Grant 1
3 Maryl… WESTED… 94107-1… 2014 1.61e6 GRANTS FOR… Grant 1
4 Depar… Maryla… 21075 2010 2.14e5 Babies Bor… Grant 1
5 Depar… Anacos… 20710 2017 1.74e5 Payments m… Grant 1
6 Depar… Washin… 21740 2009 5.59e4 grant fund… Grant 1
7 Board… Domest… 21045 2014 1.72e5 Domestic V… Grant 1
8 MD Sm… Dacore… 20601 2018 1.04e5 Maryland S… Loan 1
9 Maryl… Mount … 21727 2015 1.75e6 Sellinger … Grant 1
10 Depar… Olney … 20830 2010 2.16e5 Grant fund… Grant 1
# ℹ 19,472 more rows
# ℹ 1 more variable: date <chr>
20.2 Changing capitalization
Right now the source
, grantee
and description
columns have inconsistent capitalization. We can fix that using a mutate statement and a function that changes the case of text called str_to_upper()
. We’ll use the same columns, overwriting what’s in there since all we’re doing is changing case.
# cleaning function
<- md_grants_loans |>
cleaned_md_grants_loans clean_names() |>
rename(source = grantor) |>
mutate(source = str_to_upper(source), grantee = str_to_upper(grantee), description = str_to_upper(description))
# display the cleaned dataset
cleaned_md_grants_loans
# A tibble: 19,482 × 9
source grantee zip_code fiscal_year amount description category fiscal_period
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
1 COMME… PRINCE… 20772 2017 1.28e5 MARYLAND T… Grant 1
2 DEPAR… ASSOCI… 21201 2010 9.34e4 MINORITY O… Grant 1
3 MARYL… WESTED… 94107-1… 2014 1.61e6 GRANTS FOR… Grant 1
4 DEPAR… MARYLA… 21075 2010 2.14e5 BABIES BOR… Grant 1
5 DEPAR… ANACOS… 20710 2017 1.74e5 PAYMENTS M… Grant 1
6 DEPAR… WASHIN… 21740 2009 5.59e4 GRANT FUND… Grant 1
7 BOARD… DOMEST… 21045 2014 1.72e5 DOMESTIC V… Grant 1
8 MD SM… DACORE… 20601 2018 1.04e5 MARYLAND S… Loan 1
9 MARYL… MOUNT … 21727 2015 1.75e6 SELLINGER … Grant 1
10 DEPAR… OLNEY … 20830 2010 2.16e5 GRANT FUND… Grant 1
# ℹ 19,472 more rows
# ℹ 1 more variable: date <chr>
What this does is make it so that using group_by
will result in fewer rows due to inconsistent capitalization. It won’t fix misspellings, but working off a single case style definitely helps.
20.3 Duplicates
One of the most difficult problems to fix in data is duplicate records in the data. They can creep in with bad joins, bad data entry practices, mistakes – all kinds of reasons. A duplicated record isn’t always there because of an error, but you need to know if it’s there before making that determination.
So the question is, do we have any records repeated?
Here we’ll use a function called get_dupes
from the janitor library to check for fully repeated records in our cleaned data set.
|>
cleaned_md_grants_loans get_dupes()
No variable names specified - using all columns.
# A tibble: 58 × 10
source grantee zip_code fiscal_year amount description category fiscal_period
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
1 BOARD… CASA C… 21740 2009 6.62e4 SERVICES T… Grant 1
2 BOARD… CASA C… 21740 2009 6.62e4 SERVICES T… Grant 1
3 BOARD… FAMILY… 21218 2009 2.18e5 PRE-ADJUDI… Grant 1
4 BOARD… FAMILY… 21218 2009 2.18e5 PRE-ADJUDI… Grant 1
5 BOARD… HEARTL… 21705 2009 5.32e4 UNDERSERVE… Grant 1
6 BOARD… HEARTL… 21705 2009 5.32e4 UNDERSERVE… Grant 1
7 BOARD… MARYLA… 21012 2009 9.72e4 CAPACITY B… Grant 1
8 BOARD… MARYLA… 21012 2009 9.72e4 CAPACITY B… Grant 1
9 BOARD… MARYLA… 21012 2009 9.72e4 SEXUAL ASS… Grant 1
10 BOARD… MARYLA… 21012 2009 9.72e4 SEXUAL ASS… Grant 1
# ℹ 48 more rows
# ℹ 2 more variables: date <chr>, dupe_count <int>
And the answer is … maybe? Because the original dataset doesn’t have a unique identifier for each grant, it’s possible that we have duplicates here, as many as 58. If we could confirm that these actually are duplicates, we can fix this by adding the function distinct()
to our cleaning script. This will keep only one copy of each unique record in our table. But we’d need to confirm that first.
# cleaning function
<- md_grants_loans |>
cleaned_md_grants_loans clean_names() |>
rename(source = grantor) |>
mutate(source = str_to_upper(source), grantee = str_to_upper(grantee), description = str_to_upper(description)) |>
distinct()
# display the cleaned dataset
cleaned_md_grants_loans
# A tibble: 19,453 × 9
source grantee zip_code fiscal_year amount description category fiscal_period
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
1 COMME… PRINCE… 20772 2017 1.28e5 MARYLAND T… Grant 1
2 DEPAR… ASSOCI… 21201 2010 9.34e4 MINORITY O… Grant 1
3 MARYL… WESTED… 94107-1… 2014 1.61e6 GRANTS FOR… Grant 1
4 DEPAR… MARYLA… 21075 2010 2.14e5 BABIES BOR… Grant 1
5 DEPAR… ANACOS… 20710 2017 1.74e5 PAYMENTS M… Grant 1
6 DEPAR… WASHIN… 21740 2009 5.59e4 GRANT FUND… Grant 1
7 BOARD… DOMEST… 21045 2014 1.72e5 DOMESTIC V… Grant 1
8 MD SM… DACORE… 20601 2018 1.04e5 MARYLAND S… Loan 1
9 MARYL… MOUNT … 21727 2015 1.75e6 SELLINGER … Grant 1
10 DEPAR… OLNEY … 20830 2010 2.16e5 GRANT FUND… Grant 1
# ℹ 19,443 more rows
# ℹ 1 more variable: date <chr>
20.4 Cleaning strings
The rest of the problems with this data set all have to do with inconsistent format of values in a few of the columns. To fix these problems, we’re going to make use of mutate() in concert with “string functions” – special functions that allow us to clean up columns stored as character strings. The tidyverse package stringr
has lots of useful string functions, more than we’ll learn in this chapter.
Let’s start by cleaning up the zip field. Remember, some of the rows had a five-digit ZIP code, while others had a nine-digit ZIP code, separated by a hyphen or not.
We’re going to write code that tells R to make a new column for our zips, keeping the first five digits on the left, and get rid of anything after that by using mutate()
in concert with str_sub()
, from the stringr
package.
# cleaning function
<- md_grants_loans |>
cleaned_md_grants_loans clean_names() |>
rename(source = grantor) |>
mutate(source = str_to_upper(source), grantee = str_to_upper(grantee), description = str_to_upper(description)) |>
distinct() |>
mutate(zip5 = str_sub(zip_code, start=1L, end=5L))
# display the cleaned dataset
cleaned_md_grants_loans
# A tibble: 19,453 × 10
source grantee zip_code fiscal_year amount description category fiscal_period
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
1 COMME… PRINCE… 20772 2017 1.28e5 MARYLAND T… Grant 1
2 DEPAR… ASSOCI… 21201 2010 9.34e4 MINORITY O… Grant 1
3 MARYL… WESTED… 94107-1… 2014 1.61e6 GRANTS FOR… Grant 1
4 DEPAR… MARYLA… 21075 2010 2.14e5 BABIES BOR… Grant 1
5 DEPAR… ANACOS… 20710 2017 1.74e5 PAYMENTS M… Grant 1
6 DEPAR… WASHIN… 21740 2009 5.59e4 GRANT FUND… Grant 1
7 BOARD… DOMEST… 21045 2014 1.72e5 DOMESTIC V… Grant 1
8 MD SM… DACORE… 20601 2018 1.04e5 MARYLAND S… Loan 1
9 MARYL… MOUNT … 21727 2015 1.75e6 SELLINGER … Grant 1
10 DEPAR… OLNEY … 20830 2010 2.16e5 GRANT FUND… Grant 1
# ℹ 19,443 more rows
# ℹ 2 more variables: date <chr>, zip5 <chr>
Let’s break down that last line of code. It says: take the value in each zip column and extract the first character on the left (1L) through the fifth character on the left (5L), and then use that five-digit zip to populate a new zip5 column.
If we arrange the zip5 column we can see that there are some non-digits in there, so let’s make those NA. For that, we’re going to use case_when()
, a function that let’s us say if a value meets a certain condition, then change it, and if it doesn’t, don’t change it.
# cleaning function
<- md_grants_loans |>
cleaned_md_grants_loans clean_names() |>
rename(source = grantor) |>
mutate(source = str_to_upper(source), grantee = str_to_upper(grantee), description = str_to_upper(description)) |>
distinct() |>
mutate(zip5 = str_sub(zip_code, start=1L, end=5L)) |>
mutate(zip5 = case_when(
== "Vario" ~ NA,
zip5 == "UB7 O" ~ NA,
zip5 == "UB7 " ~ NA,
zip5 .default = zip5
))
# display the cleaned dataset
cleaned_md_grants_loans
# A tibble: 19,453 × 10
source grantee zip_code fiscal_year amount description category fiscal_period
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
1 COMME… PRINCE… 20772 2017 1.28e5 MARYLAND T… Grant 1
2 DEPAR… ASSOCI… 21201 2010 9.34e4 MINORITY O… Grant 1
3 MARYL… WESTED… 94107-1… 2014 1.61e6 GRANTS FOR… Grant 1
4 DEPAR… MARYLA… 21075 2010 2.14e5 BABIES BOR… Grant 1
5 DEPAR… ANACOS… 20710 2017 1.74e5 PAYMENTS M… Grant 1
6 DEPAR… WASHIN… 21740 2009 5.59e4 GRANT FUND… Grant 1
7 BOARD… DOMEST… 21045 2014 1.72e5 DOMESTIC V… Grant 1
8 MD SM… DACORE… 20601 2018 1.04e5 MARYLAND S… Loan 1
9 MARYL… MOUNT … 21727 2015 1.75e6 SELLINGER … Grant 1
10 DEPAR… OLNEY … 20830 2010 2.16e5 GRANT FUND… Grant 1
# ℹ 19,443 more rows
# ℹ 2 more variables: date <chr>, zip5 <chr>
That last bit is a little complex, so let’s break it down.
What the code above says, in English, is this: Look at all the values in the zip5 column. If the value is “Vario”, then (that’s what the “~” means, then) replace it with NA. Same for the other variations. If it’s anything other than that (that’s what “TRUE” means, otherwise), then keep the existing value in that column.
Instead of specifying the exact value, we can also solve the problem by using something more generalizable, using a function called str_detect(), which allows us to search parts of words.
The second line of our case_when() function below now says, in English: look in the city column. If you find that one of the values starts with “UB7” (the “^” symbol means “starts with”), then (the tilde ~ means then) change it to NA.
# cleaning function
<- md_grants_loans |>
cleaned_md_grants_loans clean_names() |>
rename(source = grantor) |>
mutate(source = str_to_upper(source), grantee = str_to_upper(grantee), description = str_to_upper(description)) |>
distinct() |>
mutate(zip5 = str_sub(zip_code, start=1L, end=5L)) |>
mutate(zip5 = case_when(
== "Vario" ~ NA,
zip5 str_detect(zip5, "^UB7") ~ NA,
.default = zip5
))
# display the cleaned dataset
cleaned_md_grants_loans
# A tibble: 19,453 × 10
source grantee zip_code fiscal_year amount description category fiscal_period
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
1 COMME… PRINCE… 20772 2017 1.28e5 MARYLAND T… Grant 1
2 DEPAR… ASSOCI… 21201 2010 9.34e4 MINORITY O… Grant 1
3 MARYL… WESTED… 94107-1… 2014 1.61e6 GRANTS FOR… Grant 1
4 DEPAR… MARYLA… 21075 2010 2.14e5 BABIES BOR… Grant 1
5 DEPAR… ANACOS… 20710 2017 1.74e5 PAYMENTS M… Grant 1
6 DEPAR… WASHIN… 21740 2009 5.59e4 GRANT FUND… Grant 1
7 BOARD… DOMEST… 21045 2014 1.72e5 DOMESTIC V… Grant 1
8 MD SM… DACORE… 20601 2018 1.04e5 MARYLAND S… Loan 1
9 MARYL… MOUNT … 21727 2015 1.75e6 SELLINGER … Grant 1
10 DEPAR… OLNEY … 20830 2010 2.16e5 GRANT FUND… Grant 1
# ℹ 19,443 more rows
# ℹ 2 more variables: date <chr>, zip5 <chr>
We’ve gotten the source and zip code data as clean as we can, and now we can answer our question: which zip code has gotten the most amount of money from the Maryland Tourism Board? A good rule of thumb is that you should only spend time cleaning fields that are critical to the specific analysis you want to do.
|>
cleaned_md_grants_loans filter(source == 'COMMERCE/MARYLAND TOURISM BOARD') |>
group_by(zip5) |>
summarize(total_amount = sum(amount)) |>
arrange(desc(total_amount))
# A tibble: 8 × 2
zip5 total_amount
<chr> <dbl>
1 21202 648069
2 21601 400000
3 20005 250000
4 21401 190570
5 20772 128041
6 21701 116708
7 21740 63070
8 21014 52490
Why, it’s downtown Baltimore, including the Inner Harbor area.