library(tidyverse)
16 Filters and selections
More often than not, we have more data than we want. Sometimes we need to be rid of that data. In dplyr
, there’s two ways to go about this: filtering and selecting.
Filtering creates a subset of the data based on criteria. All records where the amount is greater than 150,000. All records that match “College Park”. Something like that. Filtering works with rows – when we filter, we get fewer rows back than we start with.
Selecting simply returns only the fields named. So if you only want to see city and amount, you select those fields. When you look at your data again, you’ll have two columns. If you try to use one of your columns that you had before you used select, you’ll get an error. Selecting works with columns. You will have the same number of records when you are done, but fewer columns of data to work with.
Let’s continue to work with the UMD course data we used in the previous chapter. First, we need to load the tidyverse:
<- read_rds("data/umd_courses.rds") umd_courses
If we want to see only those courses offered a particular department, we can use the filter
function to isolate just those records. Filter works with something called a comparison operator. We need to filter all records equal to “Journalism”. The comparison operators in R, like most programming languages, are == for equal to, != for not equal to, > for greater than, >= for greater than or equal to and so on.
Be careful: =
is not ==
and =
is not “equal to”. =
is an assignment operator in most languages – how things get named.
<- umd_courses |> filter(department == "Journalism")
journalism_courses
head(journalism_courses)
# A tibble: 6 × 9
id title description term department sections instructors seats
<chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 JOUR282 Beyond Face… "Credit on… 202112 Journalism 1 Denitsa Yo… 35
2 JOUR698 Special Pro… <NA> 202112 Journalism 0 <NA> 0
3 JOUR175 Media Liter… "Additiona… 202112 Journalism 1 Susan Moel… 20
4 JOUR199 Survey Appr… <NA> 202112 Journalism 1 Karen Denny 5
5 JOUR130 Self-Presen… "Credit on… 202112 Journalism 1 Amber Moore 25
6 JOUR459W Special Top… "This cour… 202112 Journalism 1 Shannon Sc… 23
# ℹ 1 more variable: syllabus_count <dbl>
And just like that, we have just Journalism results, which we can verify looking at the head, the first six rows.
We also have more data than we might want. For example, we may only want to work with the course id and title.
To simplify our dataset, we can use select.
<- journalism_courses |> select(id, title)
selected_journalism_courses
head(selected_journalism_courses)
# A tibble: 6 × 2
id title
<chr> <chr>
1 JOUR282 Beyond Facebook: How Social Media are Transforming Society, Culture,…
2 JOUR698 Special Problems in Communication
3 JOUR175 Media Literacy
4 JOUR199 Survey Apprenticeship
5 JOUR130 Self-Presentation in the Age of YouTube
6 JOUR459W Special Topics in Journalism; Sports Media & Athlete Branding
And now we only have two columns of data for whatever analysis we might want to do.
16.1 Combining filters
So let’s say we wanted to see all the courses in the Theatre department with at least 15 seats. We can do this a number of ways. The first is we can chain together a whole lot of filters.
<- umd_courses |> filter(department == "Theatre") |> filter(seats >= 15)
theatre_seats_15
nrow(theatre_seats_15)
[1] 308
That gives us 308 records But that’s repetitive, no? We can do better using a single filter and boolean operators – AND and OR. In this case, AND is &
and OR is |
.
The difference? With AND, all conditions must be true to be included. With OR, any of those conditions things can be true and it will be included.
Here’s the difference.
<- umd_courses |> filter(department == "Theatre" & seats >= 15)
and_theatre_seats_15
nrow(and_theatre_seats_15)
[1] 308
So AND gives us the same answer we got before. What does OR give us?
<- umd_courses |> filter(department == "Theatre" | seats >= 15)
and_theatre_seats_15
nrow(and_theatre_seats_15)
[1] 54000
So there’s 54,000 rows that are EITHER Theatre classes OR have at least 15 seats. OR is additive; AND is restrictive.
A general tip about using filter: it’s easier to work your way towards the filter syntax you need rather than try and write it once and trust the result. Each time you modify your filter, check the results to see if they make sense. This adds a little time to your process but you’ll thank yourself for doing it because it helps avoid mistakes.