6 Filters and selections

More often than not, we have more data than we want. Sometimes we need to be rid of that data. In dplyr, there’s two ways to go about this: filtering and selecting.

Filtering creates a subset of the data based on criteria. All records where the count is greater than 10. All records that match “Maryland”. Something like that.

Selecting simply returns only the fields named. So if you only want to see School and Attendance, you select those fields. When you look at your data again, you’ll have two columns. If you try to use one of your columns that you had before you used select, you’ll get an error.

Let’s work with our football attendance data to show some examples.

For this walkthrough:

First we’ll need the tidyverse.

library(tidyverse)

Now import the data.

attendance <- read_csv('data/attendance.csv')

Rows: 146 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Institution, Conference
dbl (11): 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

So, first things first, let’s say we don’t care about all this Air Force, Akron, Alabama crap and just want to see Dear Old Maryland We do that with filter and then we pass it a condition.

Before we do that, a note about conditions. Most of the conditional operators you’ll understand – greater than and less than are > and <. The tough one to remember is equal to. In conditional statements, equal to is == not =. If you haven’t noticed, = is a variable assignment operator, not a conditional statement. So equal is == and NOT equal is !=.

So if you want to see Institutions equal to Maryland, you do this:

attendance |> filter(Institution == "Maryland")

# A tibble: 2 × 13
  Insti…¹ Confe…² `2013` `2014` `2015` `2016` `2017` `2018` `2019` `2020` `2021`
  <chr>   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Maryla… ACC     288946     NA     NA     NA     NA     NA     NA     NA     NA
2 Maryla… Big Ten     NA 281884 310389 237690 237859 201562 226871     NA 266480
# … with 2 more variables: `2022` <dbl>, `2023` <dbl>, and abbreviated variable
#   names ¹Institution, ²Conference

Or if we want to see schools that had more than half a million people buy tickets to a football game last season, we do the following. NOTE THE BACKTICKS.

attendance |> filter(`2022` >= 500000)

# A tibble: 17 × 13
   Institution   Confe…¹ `2013` `2014` `2015` `2016` `2017` `2018` `2019` `2020`
   <chr>         <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Alabama       SEC     710538 710736 707786 712747 712053 710931 707817  97120
 2 Arkansas      SEC     431174 399124 471279 487067 442569 367748 356517  82500
 3 Auburn        SEC     685252 612157 612157 695498 605120 591236 600355  87450
 4 Clemson       ACC     574333 572262 588266 566787 565412 562799 566074 112367
 5 Florida       SEC     524638 515001 630457 439229 520290 576299 508103  74348
 6 Georgia       SEC     556476 649222 649222 556476 556476 649222 649722  61572
 7 LSU           SEC     639927 712063 654084 708618 591034 705733 705892  87233
 8 Michigan      Big Ten 781144 734364 771174 883741 669534 775156 780215     NA
 9 Nebraska      Big Ten 727466 638744 629983 631402 628583 623240 625436     NA
10 Ohio St.      Big Ten 734528 744075 750705 750944 752464 713630 723679   3254
11 Oklahoma      Big 12  508334 510972 512139 521142 519119 607146 499533 113500
12 Penn St.      Big Ten 676112 711358 698590 701800 746946 738396 739747   4500
13 South Caroli… SEC     576805 569664 472934 538441 550099 515396 545737  77075
14 Tennessee     SEC     669087 698276 704088 706776 670454 650887 702912 112660
15 Texas         Big 12  593857 564618 540210 587283 556667 586277 577834  85690
16 Texas A&M     SEC     697003 630735 725354 713418 691612 698908 711258  99503
17 Wisconsin     Big Ten 552378 556642 546099 476144 551766 540072 535301     NA
# … with 3 more variables: `2021` <dbl>, `2022` <dbl>, `2023` <dbl>, and
#   abbreviated variable name ¹Conference

But what if we want to see all of the Power Five conferences? We could use conditional logic in our filter. The conditional logic operators are | for OR and & for AND. NOTE: AND means all conditions have to be met. OR means any of the conditions work. So be careful about boolean logic.

attendance |> filter(Conference == "Big 10" | Conference == "SEC" | Conference == "Pac-12" | Conference == "ACC" | Conference == "Big 12")

# A tibble: 51 × 13
   Institution   Confe…¹ `2013` `2014` `2015` `2016` `2017` `2018` `2019` `2020`
   <chr>         <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Alabama       SEC     710538 710736 707786 712747 712053 710931 707817  97120
 2 Arizona       Pac-12  285713 354973 308355 338017 255791 318051 237194     NA
 3 Arizona St.   Pac-12  501509 343073 368985 286417 359660 291091 344161     NA
 4 Arkansas      SEC     431174 399124 471279 487067 442569 367748 356517  82500
 5 Auburn        SEC     685252 612157 612157 695498 605120 591236 600355  87450
 6 Baylor        Big 12  321639 280257 276960 275029 262978 248017 318621  46668
 7 Boston Colle… ACC     198035 239893 211433 192942 215546 263363 205111     NA
 8 California    Pac-12  345303 286051 292797 279769 219290 300061 254597     NA
 9 Clemson       ACC     574333 572262 588266 566787 565412 562799 566074 112367
10 Colorado      Pac-12  230778 226670 236331 279652 282335 274852 297435     NA
# … with 41 more rows, 3 more variables: `2021` <dbl>, `2022` <dbl>,
#   `2023` <dbl>, and abbreviated variable name ¹Conference

But that’s a lot of repetitive code. And a lot of typing. And typing is the devil. So what if we could create a list and pass it into the filter? It’s pretty simple.

We can create a new variable – remember variables can represent just about anything – and create a list. To do that we use the c operator, which stands for concatenate. That just means take all the stuff in the parenthesis after the c and bunch it into a list.

Note here: text is in quotes. If they were numbers, we wouldn’t need the quotes.

powerfive <- c("SEC", "Big Ten", "Pac-12", "Big 12", "ACC")

Now with a list, we can use the %in% operator. It does what you think it does – it gives you data that matches things IN the list you give it.

attendance |> filter(Conference %in% powerfive)

# A tibble: 65 × 13
   Institution   Confe…¹ `2013` `2014` `2015` `2016` `2017` `2018` `2019` `2020`
   <chr>         <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Alabama       SEC     710538 710736 707786 712747 712053 710931 707817  97120
 2 Arizona       Pac-12  285713 354973 308355 338017 255791 318051 237194     NA
 3 Arizona St.   Pac-12  501509 343073 368985 286417 359660 291091 344161     NA
 4 Arkansas      SEC     431174 399124 471279 487067 442569 367748 356517  82500
 5 Auburn        SEC     685252 612157 612157 695498 605120 591236 600355  87450
 6 Baylor        Big 12  321639 280257 276960 275029 262978 248017 318621  46668
 7 Boston Colle… ACC     198035 239893 211433 192942 215546 263363 205111     NA
 8 California    Pac-12  345303 286051 292797 279769 219290 300061 254597     NA
 9 Clemson       ACC     574333 572262 588266 566787 565412 562799 566074 112367
10 Colorado      Pac-12  230778 226670 236331 279652 282335 274852 297435     NA
# … with 55 more rows, 3 more variables: `2021` <dbl>, `2022` <dbl>,
#   `2023` <dbl>, and abbreviated variable name ¹Conference

6.1 Selecting data to make it easier to read

So now we have our Power Five list. What if we just wanted to see attendance from the most recent season and ignore all the rest? Select to the rescue.

attendance |> filter(Conference %in% powerfive) |> select(Institution, Conference, `2022`)

# A tibble: 65 × 3
   Institution    Conference `2022`
   <chr>          <chr>       <dbl>
 1 Alabama        SEC        692870
 2 Arizona        Pac-12     309465
 3 Arizona St.    Pac-12     258488
 4 Arkansas       SEC        512087
 5 Auburn         SEC        681621
 6 Baylor         Big 12     272779
 7 Boston College ACC        214233
 8 California     Pac-12     270172
 9 Clemson        ACC        564860
10 Colorado       Pac-12     257084
# … with 55 more rows

If you have truly massive data, Select has tools to help you select fields that start_with the same things or ends with a certain word. The documentation will guide you if you need those someday. For 90 plus percent of what we do, just naming the fields will be sufficient.

6.2 Using conditional filters to set limits

Let’s return to the problem of one-hit wonders in basketball mucking up our true shooting analysis. How can we set limits in something like a question of who had the best season? Let’s grab every player from last season.

For this walkthrough:

Let’s get set up similar to the previous chapter.

players <- read_csv("data/players24.csv")

Rows: 5678 Columns: 59
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Team, Player, Class, Pos, Height, Hometown, High School, Summary, ...
dbl (49): #, Weight, Rk.x, G, GS, MP, FG, FGA, FG%, 2P, 2PA, 2P%, 3P, 3PA, 3...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

players |>
  mutate(trueshooting = (PTS/(2*(FGA + (.44*FTA))))*100) |>
  arrange(desc(trueshooting))

# A tibble: 5,678 × 60
   Team     Player   `#` Class Pos   Height Weight Homet…¹ High …² Summary  Rk.x
   <chr>    <chr>  <dbl> <chr> <chr> <chr>   <dbl> <chr>   <chr>   <chr>   <dbl>
 1 Winthro… Henry…    21 FR    G     5-11      175 Lewisb… Lewisb… 1.3 Pt…    14
 2 Tenness… D.J. …    44 FR    G     6-5       207 Ricahr… Minnes… 3.0 Pt…    13
 3 San Jos… Benne…    24 FR    G     6-0       180 Austin… L.C. A… 1.2 Pt…    13
 4 Richmon… Liam …    10 JR    G     6-2       190 Kansas… Pembro… 1.0 Pt…    12
 5 Niagara… Jalen…    10 SO    G     5-10      160 Chicag… De La … 0.6 Pt…    12
 6 Nevada … Isaac…    24 FR    F     6-8       195 Phoeni… Hillcr… 3.0 Pt…    13
 7 Georget… Austi…     4 SO    F     6-4       180 Hoover… Hoover… 1.0 Pt…    11
 8 Florida… Renat…    15 JR    G     6-1       180 Planta… Americ… 3.0 Pt…    15
 9 Florida… Adefe…    22 FR    G     6-1       160 Mirama… Norlan… 1.5 Pt…    16
10 Dayton … Attic…    54 SO    G     6-6       190 Wester… St. Fr… 0.4 Pt…    12
# … with 5,668 more rows, 49 more variables: G <dbl>, GS <dbl>, MP <dbl>,
#   FG <dbl>, FGA <dbl>, `FG%` <dbl>, `2P` <dbl>, `2PA` <dbl>, `2P%` <dbl>,
#   `3P` <dbl>, `3PA` <dbl>, `3P%` <dbl>, FT <dbl>, FTA <dbl>, `FT%` <dbl>,
#   ORB <dbl>, DRB <dbl>, TRB <dbl>, AST <dbl>, STL <dbl>, BLK <dbl>,
#   TOV <dbl>, PF <dbl>, PTS <dbl>, Rk.y <dbl>, PER <dbl>, `TS%` <dbl>,
#   `eFG%` <dbl>, `3PAr` <dbl>, FTr <dbl>, PProd <dbl>, `ORB%` <dbl>,
#   `DRB%` <dbl>, `TRB%` <dbl>, `AST%` <dbl>, `STL%` <dbl>, `BLK%` <dbl>, …

In that season, we’ve got several players that can lay claim to the title of One Shot One Three True Shooting champion.

In most contests, like the batting title in Major League Baseball, there’s a minimum number of X to qualify. In baseball, it’s at bats. In basketball, it attempts. So let’s set a floor and see how it changes. What if we said you had to have played 100 minutes in a season? The top players in college basketball play more than 1000 minutes in a season. So 100 is not that much. Let’s try it and see.

players |>
  mutate(trueshooting = (PTS/(2*(FGA + (.44*FTA))))*100) |>
  arrange(desc(trueshooting)) |>
  filter(MP > 100)

# A tibble: 3,820 × 60
   Team     Player   `#` Class Pos   Height Weight Homet…¹ High …² Summary  Rk.x
   <chr>    <chr>  <dbl> <chr> <chr> <chr>   <dbl> <chr>   <chr>   <chr>   <dbl>
 1 Radford… TJ Ne…    23 SR    F     6-9       220 Burlin… Graham… 8.9 Pt…     9
 2 UNC Wil… Khama…    21 JR    F     6-8       175 Tallah… James … 3.8 Pt…     7
 3 Baylor … Dantw…    12 SR    G     6-2       190 Ocala,… PHS Ac… 4.7 Pt…    10
 4 Furman … Coope…    21 FR    F     6-11      210 Woodbr… Saint … 3.9 Pt…     9
 5 Temple … Emman…    21 JR    C     6-10      225 Jackso… Oak Hi… 0.7 Pt…    10
 6 UNC Tar… Jalen…    13 SO    F     6-10      225 Gary, … West S… 3.9 Pt…     8
 7 St. Bon… Noel …    20 JR    C     6-11      250 Leesbu… Flint … 6.1 Pt…     7
 8 North D… Brian…    12 JR    F     6-9       230 Chicag… DePaul… 2.3 Pt…    10
 9 Illinoi… Nicco…    11 FR    G     6-1       170 Bologn… DME Ac… 1.5 Pt…    11
10 College… James…    23 FR    F     6-11      210 Fayett… E.E. S… 5.0 Pt…     7
# … with 3,810 more rows, 49 more variables: G <dbl>, GS <dbl>, MP <dbl>,
#   FG <dbl>, FGA <dbl>, `FG%` <dbl>, `2P` <dbl>, `2PA` <dbl>, `2P%` <dbl>,
#   `3P` <dbl>, `3PA` <dbl>, `3P%` <dbl>, FT <dbl>, FTA <dbl>, `FT%` <dbl>,
#   ORB <dbl>, DRB <dbl>, TRB <dbl>, AST <dbl>, STL <dbl>, BLK <dbl>,
#   TOV <dbl>, PF <dbl>, PTS <dbl>, Rk.y <dbl>, PER <dbl>, `TS%` <dbl>,
#   `eFG%` <dbl>, `3PAr` <dbl>, FTr <dbl>, PProd <dbl>, `ORB%` <dbl>,
#   `DRB%` <dbl>, `TRB%` <dbl>, `AST%` <dbl>, `STL%` <dbl>, `BLK%` <dbl>, …

Now you get Radford’s TJ NeSmith, who played in seven games and was on the floor for 113 minutes. So he played a little bit, but not a lot. But in that time, he only attempted 32 shots, and made 75 percent of them. In other words, when he shot, he probably scored. He just rarely shot.

So is 100 minutes our level? Here’s the truth – there’s not really an answer here. We’re picking a cutoff. If you can cite a reason for it and defend it, then it probably works.

6.3 Top list

One last little dplyr trick that’s nice to have in the toolbox is a shortcut for selecting only the top values for your dataset. Want to make a Top 10 List? Or Top 25? Or Top Whatever You Want? It’s easy.

So what are the top 10 Power Five schools by season attendance. All we’re doing here is chaining commands together with what we’ve already got. We’re filtering by our list of Power Five conferences, we’re selecting the three fields we need, now we’re going to arrange it by total attendance and then we’ll introduce the new function: top_n. The top_n function just takes a number. So we want a top 10 list? We do it like this:

attendance |> filter(Conference %in% powerfive) |> select(Institution, Conference, `2023`) |> arrange(desc(`2023`)) |> top_n(10)

Selecting by 2023

# A tibble: 10 × 3
   Institution Conference `2023`
   <chr>       <chr>       <dbl>
 1 Michigan    Big Ten    769797
 2 Penn St.    Big Ten    758864
 3 Tennessee   SEC        713405
 4 LSU         SEC        705191
 5 Alabama     SEC        700539
 6 Texas A&M   SEC        694639
 7 Georgia     SEC        649222
 8 Ohio St.    Big Ten    622753
 9 Auburn      SEC        616301
10 Texas       Big 12     609748

That’s all there is to it. Just remember – for it to work correctly, you need to sort your data BEFORE you run top_n. Otherwise, you’re just getting the first 10 values in the list. The function doesn’t know what field you want the top values of. You have to do it.