19  Data Cleaning Part I: Data smells

Any time you are given a dataset from anyone, you should immediately be suspicious. Is this data what I think it is? Does it include what I expect? Is there anything I need to know about it? Will it produce the information I expect?

One of the first things you should do is give it the smell test.

Failure to give data the smell test can lead you to miss stories and get your butt kicked on a competitive story.

With data smells, we’re trying to find common mistakes in data. For more on data smells, read the GitHub wiki post that started it all. Some common data smells are:

- Wrong type of data
- Wrong spatial data
- Gaps in data
- Missing data
- Unusual outliers

Not all of these data smells are detectable in code. You may have to ask people about the data. You may have to compare it to another dataset yourself. Does the agency that uses the data produce reports from the data? Does your analysis match those reports? That will expose wrongly derived data, or wrong units, or mistakes you made with inclusion or exclusion.

But several of these data smells can be checked in code, right when we load the data, before we do anything else.

We’re going to examine several here as they apply to some Maryland state government payments data and Maryland state government grant & loan data.

19.1 Wrong Type

First, let’s look at Wrong Type Of Data.

We can sniff that out by looking at the output of readr.

Let’s load the tidyverse.

# Remove scientific notation
options(scipen=999)
# Load the tidyverse
library(tidyverse)

This time, we’re going to load the data in a CSV format, which stands for comma separated values and is essentially a fancy structured text file. Each column in the csv is separated – “delimited” – by a comma from the next column.

We’re also going to introduce a new argument to our function that reads in the data, read_csv(), called “guess_max”. As R reads in the csv file, it will attempt to make some calls on what “data type” to assign to each field: number, character, date, and so on. The “guess_max” argument says: look at the values in however many rows we specify before deciding which data type to assign. In this case, we’ll pick 10.

# Load the data
payments <- read_csv("data/State_of_Maryland_Payments_Data__FY2008_to_FY2024_20250115.csv", guess_max=10)
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 369008 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Agency Name, Vendor Name, Date, Category
dbl (4): Fiscal Year, Vendor Zip, Amount, Fiscal Period

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Pay attention to the red warning that signals “one or more parsing issues.” It advises us to run the problems() function to see what went wrong. Let’s do that.

problems(payments)
# A tibble: 369 × 5
     row   col expected actual file                                             
   <int> <int> <chr>    <chr>  <chr>                                            
 1  9568     4 a double L4Z4B  /Users/dwillis/code/datajournalismbook-maryland/…
 2 10117     4 a double L4Y1Z  /Users/dwillis/code/datajournalismbook-maryland/…
 3 10476     4 a double 0LAND  /Users/dwillis/code/datajournalismbook-maryland/…
 4 12308     4 a double H4S    /Users/dwillis/code/datajournalismbook-maryland/…
 5 13084     4 a double N2G    /Users/dwillis/code/datajournalismbook-maryland/…
 6 15549     4 a double GOR    /Users/dwillis/code/datajournalismbook-maryland/…
 7 15772     4 a double N5V    /Users/dwillis/code/datajournalismbook-maryland/…
 8 15844     4 a double T1Y    /Users/dwillis/code/datajournalismbook-maryland/…
 9 15877     4 a double R2C    /Users/dwillis/code/datajournalismbook-maryland/…
10 15973     4 a double L4L34  /Users/dwillis/code/datajournalismbook-maryland/…
# ℹ 359 more rows

It produces a table of all the parsing problems. It has 369 rows, which means we have some problems, but not a huge number considering the data has 369,008 rows. In almost every case here, the readr library has guessed that a given column was of a “double” data type – a number. It made that guess based on very limited information – only 10 rows. So when it hit a value that looked like a character string, it didn’t know what to do, and it didn’t read that value in correctly.

The easy way to fix this is to set the guess_max argument higher. It will take a little longer to load, but we’ll use every single row in the data set to guess the column type – all 369,008 of them.

payments <- read_csv("data/State_of_Maryland_Payments_Data__FY2008_to_FY2024_20250115.csv", guess_max=369008)
Rows: 369008 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Agency Name, Vendor Name, Vendor Zip, Date, Category
dbl (3): Fiscal Year, Amount, Fiscal Period

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

No parsing errors this time! You can see what the columns are using the glimpse function:

glimpse(payments)
Rows: 369,008
Columns: 8
$ `Fiscal Year`   <dbl> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, …
$ `Agency Name`   <chr> "AID TO UNIVERSITY OF MD MEDICAL SYSTEM", "BALTIMORE C…
$ `Vendor Name`   <chr> "U M M S", "3M PHJ3884", "4 IMPRINT", "A J STATIONERS"…
$ `Vendor Zip`    <chr> "21273", "15250", "53201", "21226", "21228", "21210", …
$ Amount          <dbl> 9701191.00, 6212.00, 8312.96, 21810.09, 30195.57, 1002…
$ `Fiscal Period` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ Date            <chr> "01/01/2008 12:00:00 AM", "01/01/2008 12:00:00 AM", "0…
$ Category        <chr> "Vendor Payment", "Vendor Payment", "Vendor Payment", …

Things that should be characters – like agency name, vendor name – are characters (chr). Things that should be numbers (dbl) – like amount and fiscal year – are numbers. We’ve seen before that sometimes dates aren’t defined as date datatypes by R - we can fix that using lubridate.
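For example, the Date column here is a character string like “01/01/2008 12:00:00 AM”. A minimal sketch of the fix with lubridate’s mdy_hms() function, assuming the dates all follow that month/day/year format and that the AM/PM indicator is recognized (which lubridate’s parsers generally handle):

```r
# lubridate is installed with the tidyverse
library(lubridate)

# A sample value in the same format as the Date column
d <- mdy_hms("01/01/2008 12:00:00 AM")

class(d)

# On the real data, the fix might look like:
# payments <- payments |>
#   mutate(Date = mdy_hms(Date))
```

Once the column is a real date-time, you can sort it chronologically and extract pieces like the year or month for grouping.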

19.2 Wrong Spatial Data

The second smell we can find in code is wrong spatial data. Spatial data means data that refers to some geography; in this dataset the only geographical element is the vendor’s zip code. Zip codes should be, at a minimum, five characters long (although composed of numbers, zip codes aren’t used as numbers).

We can check to see if any of the zip codes are less than five characters by using a function called str_length inside a filter:

payments |>
  group_by(`Vendor Zip`) |>
  filter(str_length(`Vendor Zip`) < 5) |> 
  summarise(
    count=n()
  ) |>
  arrange(desc(count))
# A tibble: 553 × 2
   `Vendor Zip` count
   <chr>        <int>
 1 2241          2044
 2 8873           434
 3 2284           356
 4 7921           296
 5 4915           197
 6 8650           195
 7 7188           194
 8 7101           191
 9 1441           133
10 7102           112
# ℹ 543 more rows

So, yes, we definitely have some zip codes that are less than 5 characters long, which is not good, particularly because we don’t have any other geographical information (such as a state) that would tell us whether we’re missing a leading zero or some other character.
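If we were confident these were U.S. zip codes that lost their leading zeros somewhere along the way – a common problem when zips get stored as numbers – stringr’s str_pad() could restore them. That’s an assumption, not a certainty: the parsing problems we saw earlier (values like “L4Z4B”) suggest some of these may be foreign postal codes instead. A sketch:

```r
# stringr is loaded with the tidyverse
library(stringr)

# Some of the short zip codes from the output above
zips <- c("2241", "8873", "7921", "21031")

# Left-pad each value to five characters with zeros
str_pad(zips, width = 5, side = "left", pad = "0")
# [1] "02241" "08873" "07921" "21031"

# On the real data, this might look like:
# payments |>
#   mutate(`Vendor Zip` = str_pad(`Vendor Zip`, width = 5, side = "left", pad = "0"))
```

Values that are already five characters long are left alone, so the padding only touches the suspect rows.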

19.3 Gaps in data & Missing data

Let’s now look at gaps in data. These often occur when you have a date or time element in your data, but there are other potential gaps, too. To illustrate those, we’re going to introduce some Maryland state grant and loan data from 2009 forward. Let’s load it and take a look:

md_grants_loans <- read_csv("data/State_of_Maryland_Grant_and_Loan_Data__FY2009_to_FY2022_20250115.csv")
Rows: 19482 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Grantor, Grantee, Zip Code, Description, Category, Date
dbl (3): Fiscal Year, Amount, Fiscal Period

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Each row represents a recipient of a state grant or loan, along with information about their location and the state agency that provided the money. When we talk about gaps, they often reflect administrative rules or changes in how the data was collected. Here’s an example: let’s count the number of payments in each category (Grant or Loan) by year in this dataset:

md_grants_loans |> 
  group_by(`Fiscal Year`, Category) |> 
  summarize(count = n()) |> 
  arrange(`Fiscal Year`)
`summarise()` has grouped output by 'Fiscal Year'. You can override using the
`.groups` argument.
# A tibble: 35 × 3
# Groups:   Fiscal Year [15]
   `Fiscal Year` Category count
           <dbl> <chr>    <int>
 1          2009 Grant      819
 2          2010 Grant      612
 3          2010 Loan        76
 4          2010 <NA>         1
 5          2011 Grant      990
 6          2011 Loan        45
 7          2011 <NA>         1
 8          2012 Grant     1100
 9          2012 Loan        58
10          2013 Grant      986
# ℹ 25 more rows

We can see a couple of issues here: first, there is no loan data for FY 2009. That’s mentioned in the source page for the data. It’s good to be aware of all gaps in data, but they don’t always represent a problem. Second, and more problematic, there are a few records where the Category is NA - that data is missing. There also are some inconsistent values - there are 50 records in FY2013 with the category of “L” (probably loans) and one in FY 2017 that is listed as “Contract”.
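One quick way to surface those inconsistent values is to count every distinct value in the Category column. Here’s a sketch using a toy tibble built from values mentioned in this chapter, since count() works the same way on any data frame:

```r
library(dplyr)

# A toy stand-in for the Category column, using values mentioned above
toy <- tibble(Category = c("Grant", "Grant", "Loan", "L", "Contract", NA))

# count() tallies each distinct value, including NA
toy |>
  count(Category, sort = TRUE)

# On the real data:
# md_grants_loans |> count(Category, sort = TRUE)
```

A count like this makes stray values – “L”, “Contract”, NA – jump out immediately, because each one gets its own row in the result.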

19.4 Unusual Outliers

Any time you are going to focus on a column for analysis, you should check for unusual values. Are there any unusually large values or unusually small values? Are there any values that raise immediate questions about the data? Let’s look at the smallest amounts in the grants and loan data.

md_grants_loans |> 
  arrange(Amount)
# A tibble: 19,482 × 9
   Grantor          Grantee `Zip Code` `Fiscal Year` Amount Description Category
   <chr>            <chr>   <chr>              <dbl>  <dbl> <chr>       <chr>   
 1 Department of N… Land P… 21285               2009   60   Catherine … Grant   
 2 Maryland Depart… UNITED… 21031               2012   96.0 FOOD SERVI… Grant   
 3 Conservation Re… Easter… 21601               2018  186.  Rolling Vi… Grant   
 4 Maryland Depart… FAMILY… 20877               2013  304   RTTT - EAR… Grant   
 5 Maryland Depart… Washin… 21740               2017  361   Hold Harml… Grant   
 6 Maryland Depart… THE CH… 21215-3211          2012  362.  FOOD SERVI… Grant   
 7 Maryland Depart… ARCHDI… 21227               2012  379.  MARYLAND M… Grant   
 8 Maryland Depart… HUMAN … 21158               2013  387   CASH FOR C… Grant   
 9 Department of N… The Co… 22209               2012  402.  R Creighto… Grant   
10 Governor's Offi… Court … 21204               2018  411.  Children's… Grant   
# ℹ 19,472 more rows
# ℹ 2 more variables: `Fiscal Period` <dbl>, Date <chr>

There are two grants for less than $100, which might not be problematic at all. But given that just two of more than 19,000 records are for such small amounts, you might wonder whether there are suggested amounts for applicants and how tiny requests get evaluated compared to very large ones. As journalists, we should be skeptical of information put in front of us and ask why, or what it says about the data itself.
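The same check works at the other end of the distribution: sorting in descending order with desc() puts the largest amounts first. A sketch with hypothetical stand-in values:

```r
library(dplyr)

# Hypothetical amounts standing in for the Amount column
toy <- tibble(Grantee = c("A", "B", "C", "D"),
              Amount = c(411, 25000000, 60, 96))

# Sort largest-first to eyeball the big end of the distribution
toy |>
  arrange(desc(Amount))

# On the real data:
# md_grants_loans |> arrange(desc(Amount))
```

Unusually large values deserve the same scrutiny as tiny ones: a single payment orders of magnitude bigger than the rest could be a data entry error, or it could be the story.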