16 Filters and selections

More often than not, we have more data than we want. Sometimes we need to be rid of that data. In dplyr, there’s two ways to go about this: filtering and selecting.

Filtering creates a subset of the data based on criteria. All records where the amount is greater than 150,000. All records that match “College Park”. Something like that. Filtering works with rows – when we filter, we get fewer rows back than we start with.

Selecting simply returns only the fields named. So if you only want to see city and amount, you select those fields. When you look at your data again, you’ll have two columns. If you try to use one of your columns that you had before you used select, you’ll get an error. Selecting works with columns. You will have the same number of records when you are done, but fewer columns of data to work with.

Let’s continue to work with the UMD course data we used in the previous chapter. First, we need to load the tidyverse:

library(tidyverse)

umd_courses <- read_rds("data/umd_courses.rds")

If we want to see only those courses offered a particular department, we can use the filter function to isolate just those records. Filter works with something called a comparison operator. We need to filter all records equal to “Journalism”. The comparison operators in R, like most programming languages, are == for equal to, != for not equal to, > for greater than, >= for greater than or equal to and so on.

Be careful: = is not == and = is not “equal to”. = is an assignment operator in most languages – how things get named.

journalism_courses <- umd_courses |> filter(department == "Journalism")

head(journalism_courses)

# A tibble: 6 × 9
  id       title        description   term department sections instructors seats
  <chr>    <chr>        <chr>        <dbl> <chr>         <dbl> <chr>       <dbl>
1 JOUR282  Beyond Face… "Credit on… 202112 Journalism        1 Denitsa Yo…    35
2 JOUR698  Special Pro…  <NA>       202112 Journalism        0 <NA>            0
3 JOUR175  Media Liter… "Additiona… 202112 Journalism        1 Susan Moel…    20
4 JOUR199  Survey Appr…  <NA>       202112 Journalism        1 Karen Denny     5
5 JOUR130  Self-Presen… "Credit on… 202112 Journalism        1 Amber Moore    25
6 JOUR459W Special Top… "This cour… 202112 Journalism        1 Shannon Sc…    23
# ℹ 1 more variable: syllabus_count <dbl>

And just like that, we have just Journalism results, which we can verify looking at the head, the first six rows.

We also have more data than we might want. For example, we may only want to work with the course id and title.

To simplify our dataset, we can use select.

selected_journalism_courses <- journalism_courses |> select(id, title)

head(selected_journalism_courses)

# A tibble: 6 × 2
  id       title                                                                
  <chr>    <chr>                                                                
1 JOUR282  Beyond Facebook: How Social Media are Transforming Society, Culture,…
2 JOUR698  Special Problems in Communication                                    
3 JOUR175  Media Literacy                                                       
4 JOUR199  Survey Apprenticeship                                                
5 JOUR130  Self-Presentation in the Age of YouTube                              
6 JOUR459W Special Topics in Journalism; Sports Media & Athlete Branding

And now we only have two columns of data for whatever analysis we might want to do.

16.1 Combining filters

So let’s say we wanted to see all the courses in the Theatre department with at least 15 seats. We can do this a number of ways. The first is we can chain together a whole lot of filters.

theatre_seats_15 <- umd_courses |> filter(department == "Theatre") |> filter(seats >= 15)

nrow(theatre_seats_15)

[1] 308

That gives us 308 records But that’s repetitive, no? We can do better using a single filter and boolean operators – AND and OR. In this case, AND is & and OR is |.

The difference? With AND, all conditions must be true to be included. With OR, any of those conditions things can be true and it will be included.

Here’s the difference.

and_theatre_seats_15 <- umd_courses |> filter(department == "Theatre" & seats >= 15)

nrow(and_theatre_seats_15)

[1] 308

So AND gives us the same answer we got before. What does OR give us?

and_theatre_seats_15 <- umd_courses |> filter(department == "Theatre" | seats >= 15)

nrow(and_theatre_seats_15)

[1] 54000

So there’s 54,000 rows that are EITHER Theatre classes OR have at least 15 seats. OR is additive; AND is restrictive.

A general tip about using filter: it’s easier to work your way towards the filter syntax you need rather than try and write it once and trust the result. Each time you modify your filter, check the results to see if they make sense. This adds a little time to your process but you’ll thank yourself for doing it because it helps avoid mistakes.