20  Data Cleaning Part II: Janitor

The necessary bane of every data journalist’s existence is data cleaning.

Every developer, every data system, every agency, they all have opinions about how data gets collected. Some decisions make sense from the outside. Some decisions are based entirely on internal politics: who is creating the data, how they are creating it, why they are creating it. Is it automated? Is it manual? Are data normalized? Are there free form fields where users can just type into or does the system restrict them to choices?

Your journalistic questions – what you want the data to tell you – are almost never part of that equation.

So cleaning data is the process of fixing issues in your data so you can answer the questions you want to answer. Data cleaning is a critical step that you can’t skip. A standard metric is that 80 percent of the time working with data will be spent cleaning and verifying data, and 20 percent the more exciting parts like analysis and visualization.

The tidyverse has a lot of built-in tools for data cleaning. We’re also going to make use of a new library, called janitor that has a bunch of great functions for cleaning data. Let’s load those now.

library(tidyverse)
library(janitor)

Let’s continue with our Maryland grants and loans data that we worked with in the previous chapter.

md_grants_loans <- read_csv("data/State_of_Maryland_Grant_and_Loan_Data__FY2009_to_FY2022_20250115.csv")
Rows: 19482 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Grantor, Grantee, Zip Code, Description, Category, Date
dbl (3): Fiscal Year, Amount, Fiscal Period

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

There are a number of issues with this data set that might get in the way of asking questions and receiving accurate answers. They are:

Let’s get cleaning. Our goal will be to build up one block of code that does all the necessary cleaning in order to answer this question: which zip code has gotten the most amount of money from the Maryland Tourism Board?

20.1 Cleaning headers

One of the first places we can start with cleaning data is cleaning the column names (or headers).

Every system has their own way of recording headers, and every developer has their own thoughts of what a good idea is within it. R is most happy when headers are lower case, without special characters.

If column headers start with a number, or have a space in between two words, you have to set them off with backticks when using them in a function. Generally speaking, we want one word (or words separated by an underscore), all lowercase, that don’t start with numbers.

The janitor library makes fixing headers trivially simple with the function clean_names()

# cleaning function
cleaned_md_grants_loans <- md_grants_loans |>
  clean_names()

# display the cleaned dataset
cleaned_md_grants_loans
# A tibble: 19,482 × 9
   grantor              grantee zip_code fiscal_year amount description category
   <chr>                <chr>   <chr>          <dbl>  <dbl> <chr>       <chr>   
 1 Commerce/Maryland T… PRINCE… 20772           2017 1.28e5 Maryland T… Grant   
 2 Department of Healt… Associ… 21201           2010 9.34e4 Minority O… Grant   
 3 Maryland Department… WESTED… 94107-1…        2014 1.61e6 GRANTS FOR… Grant   
 4 Department of Healt… Maryla… 21075           2010 2.14e5 Babies Bor… Grant   
 5 Department of Natur… Anacos… 20710           2017 1.74e5 Payments m… Grant   
 6 Department of Busin… Washin… 21740           2009 5.59e4 grant fund… Grant   
 7 Boards and Commissi… Domest… 21045           2014 1.72e5 Domestic V… Grant   
 8 MD Small Business D… Dacore… 20601           2018 1.04e5 Maryland S… Loan    
 9 Maryland Higher Edu… Mount … 21727           2015 1.75e6 Sellinger … Grant   
10 Department of Busin… Olney … 20830           2010 2.16e5 Grant fund… Grant   
# ℹ 19,472 more rows
# ℹ 2 more variables: fiscal_period <dbl>, date <chr>

This function changed Zip Code to zip_code and generally got rid of capital letters and replaced spaces with underscores. If we wanted to rename a column, we can use a tidyverse function rename() to do that. Let’s change grantor to source as an example. NOTE: when using rename(), the new name comes first.

# cleaning function
cleaned_md_grants_loans <- md_grants_loans |>
  clean_names() |> 
  rename(source = grantor)

# display the cleaned dataset
cleaned_md_grants_loans
# A tibble: 19,482 × 9
   source grantee zip_code fiscal_year amount description category fiscal_period
   <chr>  <chr>   <chr>          <dbl>  <dbl> <chr>       <chr>            <dbl>
 1 Comme… PRINCE… 20772           2017 1.28e5 Maryland T… Grant                1
 2 Depar… Associ… 21201           2010 9.34e4 Minority O… Grant                1
 3 Maryl… WESTED… 94107-1…        2014 1.61e6 GRANTS FOR… Grant                1
 4 Depar… Maryla… 21075           2010 2.14e5 Babies Bor… Grant                1
 5 Depar… Anacos… 20710           2017 1.74e5 Payments m… Grant                1
 6 Depar… Washin… 21740           2009 5.59e4 grant fund… Grant                1
 7 Board… Domest… 21045           2014 1.72e5 Domestic V… Grant                1
 8 MD Sm… Dacore… 20601           2018 1.04e5 Maryland S… Loan                 1
 9 Maryl… Mount … 21727           2015 1.75e6 Sellinger … Grant                1
10 Depar… Olney … 20830           2010 2.16e5 Grant fund… Grant                1
# ℹ 19,472 more rows
# ℹ 1 more variable: date <chr>

20.2 Changing capitalization

Right now the source, grantee and description columns have inconsistent capitalization. We can fix that using a mutate statement and a function that changes the case of text called str_to_upper(). We’ll use the same columns, overwriting what’s in there since all we’re doing is changing case.

# cleaning function
cleaned_md_grants_loans <- md_grants_loans |>
  clean_names() |> 
  rename(source = grantor) |> 
  mutate(source = str_to_upper(source), grantee = str_to_upper(grantee), description = str_to_upper(description))

# display the cleaned dataset
cleaned_md_grants_loans
# A tibble: 19,482 × 9
   source grantee zip_code fiscal_year amount description category fiscal_period
   <chr>  <chr>   <chr>          <dbl>  <dbl> <chr>       <chr>            <dbl>
 1 COMME… PRINCE… 20772           2017 1.28e5 MARYLAND T… Grant                1
 2 DEPAR… ASSOCI… 21201           2010 9.34e4 MINORITY O… Grant                1
 3 MARYL… WESTED… 94107-1…        2014 1.61e6 GRANTS FOR… Grant                1
 4 DEPAR… MARYLA… 21075           2010 2.14e5 BABIES BOR… Grant                1
 5 DEPAR… ANACOS… 20710           2017 1.74e5 PAYMENTS M… Grant                1
 6 DEPAR… WASHIN… 21740           2009 5.59e4 GRANT FUND… Grant                1
 7 BOARD… DOMEST… 21045           2014 1.72e5 DOMESTIC V… Grant                1
 8 MD SM… DACORE… 20601           2018 1.04e5 MARYLAND S… Loan                 1
 9 MARYL… MOUNT … 21727           2015 1.75e6 SELLINGER … Grant                1
10 DEPAR… OLNEY … 20830           2010 2.16e5 GRANT FUND… Grant                1
# ℹ 19,472 more rows
# ℹ 1 more variable: date <chr>

What this does is make it so that using group_by will result in fewer rows due to inconsistent capitalization. It won’t fix misspellings, but working off a single case style definitely helps.

20.3 Duplicates

One of the most difficult problems to fix in data is duplicate records in the data. They can creep in with bad joins, bad data entry practices, mistakes – all kinds of reasons. A duplicated record isn’t always there because of an error, but you need to know if it’s there before making that determination.

So the question is, do we have any records repeated?

Here we’ll use a function called get_dupes from the janitor library to check for fully repeated records in our cleaned data set.

cleaned_md_grants_loans |>
  get_dupes()
No variable names specified - using all columns.
# A tibble: 58 × 10
   source grantee zip_code fiscal_year amount description category fiscal_period
   <chr>  <chr>   <chr>          <dbl>  <dbl> <chr>       <chr>            <dbl>
 1 BOARD… CASA C… 21740           2009 6.62e4 SERVICES T… Grant                1
 2 BOARD… CASA C… 21740           2009 6.62e4 SERVICES T… Grant                1
 3 BOARD… FAMILY… 21218           2009 2.18e5 PRE-ADJUDI… Grant                1
 4 BOARD… FAMILY… 21218           2009 2.18e5 PRE-ADJUDI… Grant                1
 5 BOARD… HEARTL… 21705           2009 5.32e4 UNDERSERVE… Grant                1
 6 BOARD… HEARTL… 21705           2009 5.32e4 UNDERSERVE… Grant                1
 7 BOARD… MARYLA… 21012           2009 9.72e4 CAPACITY B… Grant                1
 8 BOARD… MARYLA… 21012           2009 9.72e4 CAPACITY B… Grant                1
 9 BOARD… MARYLA… 21012           2009 9.72e4 SEXUAL ASS… Grant                1
10 BOARD… MARYLA… 21012           2009 9.72e4 SEXUAL ASS… Grant                1
# ℹ 48 more rows
# ℹ 2 more variables: date <chr>, dupe_count <int>

And the answer is … maybe? Because the original dataset doesn’t have a unique identifier for each grant, it’s possible that we have duplicates here, as many as 58. If we could confirm that these actually are duplicates, we can fix this by adding the function distinct() to our cleaning script. This will keep only one copy of each unique record in our table. But we’d need to confirm that first.

# cleaning function
cleaned_md_grants_loans <- md_grants_loans |>
  clean_names() |> 
  rename(source = grantor) |> 
  mutate(source = str_to_upper(source), grantee = str_to_upper(grantee), description = str_to_upper(description)) |> 
  distinct()

# display the cleaned dataset
cleaned_md_grants_loans
# A tibble: 19,453 × 9
   source grantee zip_code fiscal_year amount description category fiscal_period
   <chr>  <chr>   <chr>          <dbl>  <dbl> <chr>       <chr>            <dbl>
 1 COMME… PRINCE… 20772           2017 1.28e5 MARYLAND T… Grant                1
 2 DEPAR… ASSOCI… 21201           2010 9.34e4 MINORITY O… Grant                1
 3 MARYL… WESTED… 94107-1…        2014 1.61e6 GRANTS FOR… Grant                1
 4 DEPAR… MARYLA… 21075           2010 2.14e5 BABIES BOR… Grant                1
 5 DEPAR… ANACOS… 20710           2017 1.74e5 PAYMENTS M… Grant                1
 6 DEPAR… WASHIN… 21740           2009 5.59e4 GRANT FUND… Grant                1
 7 BOARD… DOMEST… 21045           2014 1.72e5 DOMESTIC V… Grant                1
 8 MD SM… DACORE… 20601           2018 1.04e5 MARYLAND S… Loan                 1
 9 MARYL… MOUNT … 21727           2015 1.75e6 SELLINGER … Grant                1
10 DEPAR… OLNEY … 20830           2010 2.16e5 GRANT FUND… Grant                1
# ℹ 19,443 more rows
# ℹ 1 more variable: date <chr>

20.4 Cleaning strings

The rest of the problems with this data set all have to do with inconsistent format of values in a few of the columns. To fix these problems, we’re going to make use of mutate() in concert with “string functions” – special functions that allow us to clean up columns stored as character strings. The tidyverse package stringr has lots of useful string functions, more than we’ll learn in this chapter.

Let’s start by cleaning up the zip field. Remember, some of the rows had a five-digit ZIP code, while others had a nine-digit ZIP code, separated by a hyphen or not.

We’re going to write code that tells R to make a new column for our zips, keeping the first five digits on the left, and get rid of anything after that by using mutate() in concert with str_sub(), from the stringr package.

# cleaning function
cleaned_md_grants_loans <- md_grants_loans |>
  clean_names() |> 
  rename(source = grantor) |> 
  mutate(source = str_to_upper(source), grantee = str_to_upper(grantee), description = str_to_upper(description)) |> 
  distinct() |>
  mutate(zip5 = str_sub(zip_code, start=1L, end=5L))


# display the cleaned dataset
cleaned_md_grants_loans
# A tibble: 19,453 × 10
   source grantee zip_code fiscal_year amount description category fiscal_period
   <chr>  <chr>   <chr>          <dbl>  <dbl> <chr>       <chr>            <dbl>
 1 COMME… PRINCE… 20772           2017 1.28e5 MARYLAND T… Grant                1
 2 DEPAR… ASSOCI… 21201           2010 9.34e4 MINORITY O… Grant                1
 3 MARYL… WESTED… 94107-1…        2014 1.61e6 GRANTS FOR… Grant                1
 4 DEPAR… MARYLA… 21075           2010 2.14e5 BABIES BOR… Grant                1
 5 DEPAR… ANACOS… 20710           2017 1.74e5 PAYMENTS M… Grant                1
 6 DEPAR… WASHIN… 21740           2009 5.59e4 GRANT FUND… Grant                1
 7 BOARD… DOMEST… 21045           2014 1.72e5 DOMESTIC V… Grant                1
 8 MD SM… DACORE… 20601           2018 1.04e5 MARYLAND S… Loan                 1
 9 MARYL… MOUNT … 21727           2015 1.75e6 SELLINGER … Grant                1
10 DEPAR… OLNEY … 20830           2010 2.16e5 GRANT FUND… Grant                1
# ℹ 19,443 more rows
# ℹ 2 more variables: date <chr>, zip5 <chr>

Let’s break down that last line of code. It says: take the value in each zip column and extract the first character on the left (1L) through the fifth character on the left (5L), and then use that five-digit zip to populate a new zip5 column.

If we arrange the zip5 column we can see that there are some non-digits in there, so let’s make those NA. For that, we’re going to use case_when(), a function that let’s us say if a value meets a certain condition, then change it, and if it doesn’t, don’t change it.

# cleaning function
cleaned_md_grants_loans <- md_grants_loans |>
  clean_names() |> 
  rename(source = grantor) |> 
  mutate(source = str_to_upper(source), grantee = str_to_upper(grantee), description = str_to_upper(description)) |> 
  distinct() |>
  mutate(zip5 = str_sub(zip_code, start=1L, end=5L)) |>
  mutate(zip5 = case_when(
    zip5 == "Vario" ~ NA,
    zip5 == "UB7 O" ~ NA,
    zip5 == "UB7 " ~ NA,
    .default = zip5
  ))

# display the cleaned dataset
cleaned_md_grants_loans
# A tibble: 19,453 × 10
   source grantee zip_code fiscal_year amount description category fiscal_period
   <chr>  <chr>   <chr>          <dbl>  <dbl> <chr>       <chr>            <dbl>
 1 COMME… PRINCE… 20772           2017 1.28e5 MARYLAND T… Grant                1
 2 DEPAR… ASSOCI… 21201           2010 9.34e4 MINORITY O… Grant                1
 3 MARYL… WESTED… 94107-1…        2014 1.61e6 GRANTS FOR… Grant                1
 4 DEPAR… MARYLA… 21075           2010 2.14e5 BABIES BOR… Grant                1
 5 DEPAR… ANACOS… 20710           2017 1.74e5 PAYMENTS M… Grant                1
 6 DEPAR… WASHIN… 21740           2009 5.59e4 GRANT FUND… Grant                1
 7 BOARD… DOMEST… 21045           2014 1.72e5 DOMESTIC V… Grant                1
 8 MD SM… DACORE… 20601           2018 1.04e5 MARYLAND S… Loan                 1
 9 MARYL… MOUNT … 21727           2015 1.75e6 SELLINGER … Grant                1
10 DEPAR… OLNEY … 20830           2010 2.16e5 GRANT FUND… Grant                1
# ℹ 19,443 more rows
# ℹ 2 more variables: date <chr>, zip5 <chr>

That last bit is a little complex, so let’s break it down.

What the code above says, in English, is this: Look at all the values in the zip5 column. If the value is “Vario”, then (that’s what the “~” means, then) replace it with NA. Same for the other variations. If it’s anything other than that (that’s what “TRUE” means, otherwise), then keep the existing value in that column.

Instead of specifying the exact value, we can also solve the problem by using something more generalizable, using a function called str_detect(), which allows us to search parts of words.

The second line of our case_when() function below now says, in English: look in the city column. If you find that one of the values starts with “UB7” (the “^” symbol means “starts with”), then (the tilde ~ means then) change it to NA.

# cleaning function
cleaned_md_grants_loans <- md_grants_loans |>
  clean_names() |> 
  rename(source = grantor) |> 
  mutate(source = str_to_upper(source), grantee = str_to_upper(grantee), description = str_to_upper(description)) |> 
  distinct() |>
  mutate(zip5 = str_sub(zip_code, start=1L, end=5L)) |>
  mutate(zip5 = case_when(
    zip5 == "Vario" ~ NA,
    str_detect(zip5, "^UB7") ~ NA,
    .default = zip5
  ))

# display the cleaned dataset
cleaned_md_grants_loans
# A tibble: 19,453 × 10
   source grantee zip_code fiscal_year amount description category fiscal_period
   <chr>  <chr>   <chr>          <dbl>  <dbl> <chr>       <chr>            <dbl>
 1 COMME… PRINCE… 20772           2017 1.28e5 MARYLAND T… Grant                1
 2 DEPAR… ASSOCI… 21201           2010 9.34e4 MINORITY O… Grant                1
 3 MARYL… WESTED… 94107-1…        2014 1.61e6 GRANTS FOR… Grant                1
 4 DEPAR… MARYLA… 21075           2010 2.14e5 BABIES BOR… Grant                1
 5 DEPAR… ANACOS… 20710           2017 1.74e5 PAYMENTS M… Grant                1
 6 DEPAR… WASHIN… 21740           2009 5.59e4 GRANT FUND… Grant                1
 7 BOARD… DOMEST… 21045           2014 1.72e5 DOMESTIC V… Grant                1
 8 MD SM… DACORE… 20601           2018 1.04e5 MARYLAND S… Loan                 1
 9 MARYL… MOUNT … 21727           2015 1.75e6 SELLINGER … Grant                1
10 DEPAR… OLNEY … 20830           2010 2.16e5 GRANT FUND… Grant                1
# ℹ 19,443 more rows
# ℹ 2 more variables: date <chr>, zip5 <chr>

We’ve gotten the source and zip code data as clean as we can, and now we can answer our question: which zip code has gotten the most amount of money from the Maryland Tourism Board? A good rule of thumb is that you should only spend time cleaning fields that are critical to the specific analysis you want to do.

cleaned_md_grants_loans |> 
  filter(source == 'COMMERCE/MARYLAND TOURISM BOARD') |> 
  group_by(zip5) |> 
  summarize(total_amount = sum(amount)) |> 
  arrange(desc(total_amount))
# A tibble: 8 × 2
  zip5  total_amount
  <chr>        <dbl>
1 21202       648069
2 21601       400000
3 20005       250000
4 21401       190570
5 20772       128041
6 21701       116708
7 21740        63070
8 21014        52490

Why, it’s downtown Baltimore, including the Inner Harbor area.