6  Filters and selections

More often than not, we have more data than we want. Sometimes we need to be rid of that data. In dplyr, there’s two ways to go about this: filtering and selecting.

Filtering creates a subset of the data based on criteria. All records where the count is greater than 10. All records that match “Maryland”. Something like that.

Selecting simply returns only the fields named. So if you only want to see School and Attendance, you select those fields. When you look at your data again, you’ll have two columns. If you try to use one of your columns that you had before you used select, you’ll get an error.

Let’s work with our football attendance data to show some examples.

For this walkthrough:

First we’ll need the tidyverse.

library(tidyverse)

Now import the data.

attendance <- read_csv('data/attendance.csv')
Rows: 146 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Institution, Conference
dbl (12): 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

So, first things first, let’s say we don’t care about all this Air Force, Akron, Alabama crap and just want to see Dear Old Maryland We do that with filter and then we pass it a condition.

Before we do that, a note about conditions. Most of the conditional operators you’ll understand – greater than and less than are > and <. The tough one to remember is equal to. In conditional statements, equal to is == not =. If you haven’t noticed, = is a variable assignment operator, not a conditional statement. So equal is == and NOT equal is !=.

So if you want to see Institutions equal to Maryland, you do this:

attendance |> filter(Institution == "Maryland")
# A tibble: 2 × 14
  Institution Conference `2013` `2014` `2015` `2016` `2017` `2018` `2019` `2020`
  <chr>       <chr>       <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Maryland    Big Ten    288946     NA     NA     NA     NA     NA     NA     NA
2 Maryland    Big Ten        NA 281884 310389 237690 237859 201562 226871     NA
# ℹ 4 more variables: `2021` <dbl>, `2022` <dbl>, `2023` <dbl>, `2024` <dbl>

Or if we want to see schools that had more than half a million people buy tickets to a football game last season, we do the following. NOTE THE BACKTICKS.

attendance |> filter(`2024` >= 500000)
# A tibble: 17 × 14
   Institution    Conference    `2013` `2014` `2015` `2016` `2017` `2018` `2019`
   <chr>          <chr>          <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Alabama        SEC           710538 710736 707786 712747 712053 710931 707817
 2 Auburn         SEC           685252 612157 612157 695498 605120 591236 600355
 3 Clemson        ACC           574333 572262 588266 566787 565412 562799 566074
 4 Florida        SEC           524638 515001 630457 439229 520290 576299 508103
 5 Georgia        SEC           556476 649222 649222 556476 556476 649222 649722
 6 LSU            SEC           639927 712063 654084 708618 591034 705733 705892
 7 Michigan       Big Ten       781144 734364 771174 883741 669534 775156 780215
 8 Nebraska       Big Ten       727466 638744 629983 631402 628583 623240 625436
 9 Notre Dame     FBS Independ… 484770 484770 484770 484770 543354 465732 534017
10 Ohio St.       Big Ten       734528 744075 750705 750944 752464 713630 723679
11 Oklahoma       SEC           508334 510972 512139 521142 519119 607146 499533
12 Penn St.       Big Ten       676112 711358 698590 701800 746946 738396 739747
13 South Carolina SEC           576805 569664 472934 538441 550099 515396 545737
14 Tennessee      SEC           669087 698276 704088 706776 670454 650887 702912
15 Texas          SEC           593857 564618 540210 587283 556667 586277 577834
16 Texas A&M      SEC           697003 630735 725354 713418 691612 698908 711258
17 Wisconsin      Big Ten       552378 556642 546099 476144 551766 540072 535301
# ℹ 5 more variables: `2020` <dbl>, `2021` <dbl>, `2022` <dbl>, `2023` <dbl>,
#   `2024` <dbl>

But what if we want to see all of the Power Five conferences? We could use conditional logic in our filter. The conditional logic operators are | for OR and & for AND. NOTE: AND means all conditions have to be met. OR means any of the conditions work. So be careful about boolean logic.

attendance |> filter(Conference == "Big 10" | Conference == "SEC" | Conference == "Pac-12" | Conference == "ACC" | Conference == "Big 12")
# A tibble: 50 × 14
   Institution    Conference `2013` `2014` `2015` `2016` `2017` `2018` `2019`
   <chr>          <chr>       <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Alabama        SEC        710538 710736 707786 712747 712053 710931 707817
 2 Arizona        Big 12     285713 354973 308355 338017 255791 318051 237194
 3 Arizona St.    Big 12     501509 343073 368985 286417 359660 291091 344161
 4 Arkansas       SEC        431174 399124 471279 487067 442569 367748 356517
 5 Auburn         SEC        685252 612157 612157 695498 605120 591236 600355
 6 Baylor         Big 12     321639 280257 276960 275029 262978 248017 318621
 7 Boston College ACC        198035 239893 211433 192942 215546 263363 205111
 8 BYU            Big 12     367349 342843 351191 351413 337599 314855 357281
 9 California     ACC        345303 286051 292797 279769 219290 300061 254597
10 Cincinnati     Big 12     190624 173037 222578 235095 170603 183112 215908
# ℹ 40 more rows
# ℹ 5 more variables: `2020` <dbl>, `2021` <dbl>, `2022` <dbl>, `2023` <dbl>,
#   `2024` <dbl>

But that’s a lot of repetitive code. And a lot of typing. And typing is the devil. So what if we could create a list and pass it into the filter? It’s pretty simple.

We can create a new variable – remember variables can represent just about anything – and create a list. To do that we use the c operator, which stands for concatenate. That just means take all the stuff in the parenthesis after the c and bunch it into a list.

Note here: text is in quotes. If they were numbers, we wouldn’t need the quotes.

powerfive <- c("SEC", "Big Ten", "Pac-12", "Big 12", "ACC")

Now with a list, we can use the %in% operator. It does what you think it does – it gives you data that matches things IN the list you give it.

attendance |> filter(Conference %in% powerfive)
# A tibble: 69 × 14
   Institution    Conference `2013` `2014` `2015` `2016` `2017` `2018` `2019`
   <chr>          <chr>       <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Alabama        SEC        710538 710736 707786 712747 712053 710931 707817
 2 Arizona        Big 12     285713 354973 308355 338017 255791 318051 237194
 3 Arizona St.    Big 12     501509 343073 368985 286417 359660 291091 344161
 4 Arkansas       SEC        431174 399124 471279 487067 442569 367748 356517
 5 Auburn         SEC        685252 612157 612157 695498 605120 591236 600355
 6 Baylor         Big 12     321639 280257 276960 275029 262978 248017 318621
 7 Boston College ACC        198035 239893 211433 192942 215546 263363 205111
 8 BYU            Big 12     367349 342843 351191 351413 337599 314855 357281
 9 California     ACC        345303 286051 292797 279769 219290 300061 254597
10 Cincinnati     Big 12     190624 173037 222578 235095 170603 183112 215908
# ℹ 59 more rows
# ℹ 5 more variables: `2020` <dbl>, `2021` <dbl>, `2022` <dbl>, `2023` <dbl>,
#   `2024` <dbl>

6.1 Selecting data to make it easier to read

So now we have our Power Five list. What if we just wanted to see attendance from the most recent season and ignore all the rest? Select to the rescue.

attendance |> filter(Conference %in% powerfive) |> select(Institution, Conference, `2024`)
# A tibble: 69 × 3
   Institution    Conference `2024`
   <chr>          <chr>       <dbl>
 1 Alabama        SEC        700539
 2 Arizona        Big 12     327230
 3 Arizona St.    Big 12     293901
 4 Arkansas       SEC        480478
 5 Auburn         SEC        704344
 6 Baylor         Big 12     253810
 7 Boston College ACC        279236
 8 BYU            Big 12     377091
 9 California     ACC        274214
10 Cincinnati     Big 12     219688
# ℹ 59 more rows

If you have truly massive data, Select has tools to help you select fields that start_with the same things or ends with a certain word. The documentation will guide you if you need those someday. For 90 plus percent of what we do, just naming the fields will be sufficient.

6.2 Using conditional filters to set limits

Let’s return to the problem of one-hit wonders in basketball mucking up our true shooting analysis. How can we set limits in something like a question of who had the best season? Let’s grab every player from last season.

For this walkthrough:

Let’s get set up similar to the previous chapter.

players <- read_csv("data/players25.csv")
Rows: 5818 Columns: 63
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): Team, Player, Class, Pos.x, Height, Hometown, High School, Summary...
dbl (49): #, Weight, Rk.x, G, GS, MP, FG, FGA, FG%, 3P, 3PA, 3P%, 2P, 2PA, 2...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
players |>
  mutate(trueshooting = (PTS/(2*(FGA + (.44*FTA))))*100) |>
  arrange(desc(trueshooting))
# A tibble: 5,818 × 64
   Team    Player   `#` Class Pos.x Height Weight Hometown `High School` Summary
   <chr>   <chr>  <dbl> <chr> <chr> <chr>   <dbl> <chr>    <chr>         <chr>  
 1 Wiscon… Isaac…    15 JR    G     6-2       165 Oregon,… Oregon (WI)   0.2 Pt…
 2 TCU Ho… Cole …    35 SR    G     6-3       180 San Jos… Bellarmine C… 3.0 Pt…
 3 TCU Ho… Drew …    30 FR    G     6-2       180 Dallas,… Highland Par… 1.5 Pt…
 4 Stanfo… Derin…     1 SO    G     6-4       190 Istanbu… The Ashevill… 0.6 Pt…
 5 St. Bo… Jack …    15 SO    G     6-0       172 Olean, … Olean (NY)    3.0 Pt…
 6 Seattl… Eric …    19 FR    G     5-10      167 Beijing… Ruamrudee In… 3.0 Pt…
 7 Samfor… Corey…    10 FR    G     6-4       175 Fairbur… Landmark Chr… 1.5 Pt…
 8 Queens… Aneek…    85 SO    G     5-10      145 Foster … San Mateo (C… 3.0 Pt…
 9 Old Do… CJ Pa…    15 FR    F     6-6       182 Norfolk… Maury (VA)    1.0 Pt…
10 Oklaho… Jake …    30 SR    G     6-3       187 Norman,… Loyola Acade… 0.4 Pt…
# ℹ 5,808 more rows
# ℹ 54 more variables: Rk.x <dbl>, Pos.y <chr>, G <dbl>, GS <dbl>, MP <dbl>,
#   FG <dbl>, FGA <dbl>, `FG%` <dbl>, `3P` <dbl>, `3PA` <dbl>, `3P%` <dbl>,
#   `2P` <dbl>, `2PA` <dbl>, `2P%` <dbl>, `eFG%` <dbl>, FT <dbl>, FTA <dbl>,
#   `FT%` <dbl>, ORB <dbl>, DRB <dbl>, TRB <dbl>, AST <dbl>, STL <dbl>,
#   BLK <dbl>, TOV <dbl>, PF <dbl>, PTS <dbl>, Rk.y <dbl>, Pos <chr>,
#   PER <dbl>, `TS%` <dbl>, `3PAr` <dbl>, FTr <dbl>, PProd <dbl>, …

In that season, we’ve got several players that can lay claim to the title of One Shot One Three True Shooting champion.

In most contests, like the batting title in Major League Baseball, there’s a minimum number of X to qualify. In baseball, it’s at bats. In basketball, it attempts. So let’s set a floor and see how it changes. What if we said you had to have played 100 minutes in a season? The top players in college basketball play more than 1000 minutes in a season. So 100 is not that much. Let’s try it and see.

players |>
  mutate(trueshooting = (PTS/(2*(FGA + (.44*FTA))))*100) |>
  arrange(desc(trueshooting)) |>
  filter(MP > 100)
# A tibble: 3,682 × 64
   Team    Player   `#` Class Pos.x Height Weight Hometown `High School` Summary
   <chr>   <chr>  <dbl> <chr> <chr> <chr>   <dbl> <chr>    <chr>         <chr>  
 1 Arizon… Conra…    55 SO    G     6-0       165 Granoll… Club Joventu… 1.6 Pt…
 2 Syracu… Nahee…    10 SR    C     7-4       248 Philade… Plymouth Whi… 2.4 Pt…
 3 Butler… Augus…     0 SO    F     6-8       220 Salvado… NBA Academy … 5.1 Pt…
 4 UConn … Samso…    35 SR    F     6-10      205 Lomé, T… The Patrick … 7.5 Pt…
 5 Dayton… Isaac…    13 JR    C     6-11      251 Port Al… Alberni Dist… 3.2 Pt…
 6 Utah S… Isaac…    23 FR    F     6-8       240 Idaho F… Hillcrest (I… 2.1 Pt…
 7 Northw… Lado …     0 SR    F     6-10      215 Dallas,… Dallas Lakel… 2.4 Pt…
 8 Purdue… Will …    44 SO    C     7-2       260 Stockho… Riksbasketgy… 1.8 Pt…
 9 Baylor… Josh …    17 JR    F     6-10      230 Asaba, … NBA Academy … 7.4 Pt…
10 Milwau… Dariu…    34 SR    F     6-8       225 Murray,… Murray (KY)   3.0 Pt…
# ℹ 3,672 more rows
# ℹ 54 more variables: Rk.x <dbl>, Pos.y <chr>, G <dbl>, GS <dbl>, MP <dbl>,
#   FG <dbl>, FGA <dbl>, `FG%` <dbl>, `3P` <dbl>, `3PA` <dbl>, `3P%` <dbl>,
#   `2P` <dbl>, `2PA` <dbl>, `2P%` <dbl>, `eFG%` <dbl>, FT <dbl>, FTA <dbl>,
#   `FT%` <dbl>, ORB <dbl>, DRB <dbl>, TRB <dbl>, AST <dbl>, STL <dbl>,
#   BLK <dbl>, TOV <dbl>, PF <dbl>, PTS <dbl>, Rk.y <dbl>, Pos <chr>,
#   PER <dbl>, `TS%` <dbl>, `3PAr` <dbl>, FTr <dbl>, PProd <dbl>, …

Now you get Radford’s TJ NeSmith, who played in seven games and was on the floor for 113 minutes. So he played a little bit, but not a lot. But in that time, he only attempted 32 shots, and made 75 percent of them. In other words, when he shot, he probably scored. He just rarely shot.

So is 100 minutes our level? Here’s the truth – there’s not really an answer here. We’re picking a cutoff. If you can cite a reason for it and defend it, then it probably works.

6.3 Top list

One last little dplyr trick that’s nice to have in the toolbox is a shortcut for selecting only the top values for your dataset. Want to make a Top 10 List? Or Top 25? Or Top Whatever You Want? It’s easy.

So what are the top 10 Power Five schools by season attendance. All we’re doing here is chaining commands together with what we’ve already got. We’re filtering by our list of Power Five conferences, we’re selecting the three fields we need, now we’re going to arrange it by total attendance and then we’ll introduce the new function: top_n. The top_n function just takes a number. So we want a top 10 list? We do it like this:

attendance |> filter(Conference %in% powerfive) |> select(Institution, Conference, `2024`) |> arrange(desc(`2024`)) |> top_n(10)
Selecting by 2024
# A tibble: 10 × 3
   Institution Conference `2024`
   <chr>       <chr>       <dbl>
 1 Ohio St.    Big Ten    936550
 2 Michigan    Big Ten    884382
 3 Penn St.    Big Ten    864665
 4 Texas       SEC        817852
 5 Texas A&M   SEC        719927
 6 Tennessee   SEC        713405
 7 LSU         SEC        708645
 8 Auburn      SEC        704344
 9 Alabama     SEC        700539
10 Florida     SEC        630116

That’s all there is to it. Just remember – for it to work correctly, you need to arrange your data BEFORE you run top_n. Otherwise, you’re just getting the first 10 values in the list. The function doesn’t know what field you want the top values of. You have to do it.