library(tidyverse)
6 Filters and selections
More often than not, we have more data than we want. Sometimes we need to be rid of that data. In dplyr
, there’s two ways to go about this: filtering and selecting.
Filtering creates a subset of the data based on criteria. All records where the count is greater than 10. All records that match “Maryland”. Something like that.
Selecting simply returns only the fields named. So if you only want to see School and Attendance, you select those fields. When you look at your data again, you’ll have two columns. If you try to use one of your columns that you had before you used select, you’ll get an error.
Let’s work with our football attendance data to show some examples.
For this walkthrough:
First we’ll need the tidyverse.
Now import the data.
<- read_csv('data/attendance.csv') attendance
Rows: 146 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Institution, Conference
dbl (11): 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
So, first things first, let’s say we don’t care about all this Air Force, Akron, Alabama crap and just want to see Dear Old Maryland We do that with filter
and then we pass it a condition.
Before we do that, a note about conditions. Most of the conditional operators you’ll understand – greater than and less than are > and <. The tough one to remember is equal to. In conditional statements, equal to is == not =. If you haven’t noticed, = is a variable assignment operator, not a conditional statement. So equal is == and NOT equal is !=.
So if you want to see Institutions equal to Maryland, you do this:
|> filter(Institution == "Maryland") attendance
# A tibble: 2 × 13
Insti…¹ Confe…² `2013` `2014` `2015` `2016` `2017` `2018` `2019` `2020` `2021`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Maryla… ACC 288946 NA NA NA NA NA NA NA NA
2 Maryla… Big Ten NA 281884 310389 237690 237859 201562 226871 NA 266480
# … with 2 more variables: `2022` <dbl>, `2023` <dbl>, and abbreviated variable
# names ¹Institution, ²Conference
Or if we want to see schools that had more than half a million people buy tickets to a football game last season, we do the following. NOTE THE BACKTICKS.
|> filter(`2022` >= 500000) attendance
# A tibble: 17 × 13
Institution Confe…¹ `2013` `2014` `2015` `2016` `2017` `2018` `2019` `2020`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Alabama SEC 710538 710736 707786 712747 712053 710931 707817 97120
2 Arkansas SEC 431174 399124 471279 487067 442569 367748 356517 82500
3 Auburn SEC 685252 612157 612157 695498 605120 591236 600355 87450
4 Clemson ACC 574333 572262 588266 566787 565412 562799 566074 112367
5 Florida SEC 524638 515001 630457 439229 520290 576299 508103 74348
6 Georgia SEC 556476 649222 649222 556476 556476 649222 649722 61572
7 LSU SEC 639927 712063 654084 708618 591034 705733 705892 87233
8 Michigan Big Ten 781144 734364 771174 883741 669534 775156 780215 NA
9 Nebraska Big Ten 727466 638744 629983 631402 628583 623240 625436 NA
10 Ohio St. Big Ten 734528 744075 750705 750944 752464 713630 723679 3254
11 Oklahoma Big 12 508334 510972 512139 521142 519119 607146 499533 113500
12 Penn St. Big Ten 676112 711358 698590 701800 746946 738396 739747 4500
13 South Caroli… SEC 576805 569664 472934 538441 550099 515396 545737 77075
14 Tennessee SEC 669087 698276 704088 706776 670454 650887 702912 112660
15 Texas Big 12 593857 564618 540210 587283 556667 586277 577834 85690
16 Texas A&M SEC 697003 630735 725354 713418 691612 698908 711258 99503
17 Wisconsin Big Ten 552378 556642 546099 476144 551766 540072 535301 NA
# … with 3 more variables: `2021` <dbl>, `2022` <dbl>, `2023` <dbl>, and
# abbreviated variable name ¹Conference
But what if we want to see all of the Power Five conferences? We could use conditional logic in our filter. The conditional logic operators are |
for OR and &
for AND. NOTE: AND means all conditions have to be met. OR means any of the conditions work. So be careful about boolean logic.
|> filter(Conference == "Big 10" | Conference == "SEC" | Conference == "Pac-12" | Conference == "ACC" | Conference == "Big 12") attendance
# A tibble: 51 × 13
Institution Confe…¹ `2013` `2014` `2015` `2016` `2017` `2018` `2019` `2020`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Alabama SEC 710538 710736 707786 712747 712053 710931 707817 97120
2 Arizona Pac-12 285713 354973 308355 338017 255791 318051 237194 NA
3 Arizona St. Pac-12 501509 343073 368985 286417 359660 291091 344161 NA
4 Arkansas SEC 431174 399124 471279 487067 442569 367748 356517 82500
5 Auburn SEC 685252 612157 612157 695498 605120 591236 600355 87450
6 Baylor Big 12 321639 280257 276960 275029 262978 248017 318621 46668
7 Boston Colle… ACC 198035 239893 211433 192942 215546 263363 205111 NA
8 California Pac-12 345303 286051 292797 279769 219290 300061 254597 NA
9 Clemson ACC 574333 572262 588266 566787 565412 562799 566074 112367
10 Colorado Pac-12 230778 226670 236331 279652 282335 274852 297435 NA
# … with 41 more rows, 3 more variables: `2021` <dbl>, `2022` <dbl>,
# `2023` <dbl>, and abbreviated variable name ¹Conference
But that’s a lot of repetitive code. And a lot of typing. And typing is the devil. So what if we could create a list and pass it into the filter? It’s pretty simple.
We can create a new variable – remember variables can represent just about anything – and create a list. To do that we use the c
operator, which stands for concatenate. That just means take all the stuff in the parenthesis after the c and bunch it into a list.
Note here: text is in quotes. If they were numbers, we wouldn’t need the quotes.
<- c("SEC", "Big Ten", "Pac-12", "Big 12", "ACC") powerfive
Now with a list, we can use the %in% operator. It does what you think it does – it gives you data that matches things IN the list you give it.
|> filter(Conference %in% powerfive) attendance
# A tibble: 65 × 13
Institution Confe…¹ `2013` `2014` `2015` `2016` `2017` `2018` `2019` `2020`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Alabama SEC 710538 710736 707786 712747 712053 710931 707817 97120
2 Arizona Pac-12 285713 354973 308355 338017 255791 318051 237194 NA
3 Arizona St. Pac-12 501509 343073 368985 286417 359660 291091 344161 NA
4 Arkansas SEC 431174 399124 471279 487067 442569 367748 356517 82500
5 Auburn SEC 685252 612157 612157 695498 605120 591236 600355 87450
6 Baylor Big 12 321639 280257 276960 275029 262978 248017 318621 46668
7 Boston Colle… ACC 198035 239893 211433 192942 215546 263363 205111 NA
8 California Pac-12 345303 286051 292797 279769 219290 300061 254597 NA
9 Clemson ACC 574333 572262 588266 566787 565412 562799 566074 112367
10 Colorado Pac-12 230778 226670 236331 279652 282335 274852 297435 NA
# … with 55 more rows, 3 more variables: `2021` <dbl>, `2022` <dbl>,
# `2023` <dbl>, and abbreviated variable name ¹Conference
6.1 Selecting data to make it easier to read
So now we have our Power Five list. What if we just wanted to see attendance from the most recent season and ignore all the rest? Select to the rescue.
|> filter(Conference %in% powerfive) |> select(Institution, Conference, `2022`) attendance
# A tibble: 65 × 3
Institution Conference `2022`
<chr> <chr> <dbl>
1 Alabama SEC 692870
2 Arizona Pac-12 309465
3 Arizona St. Pac-12 258488
4 Arkansas SEC 512087
5 Auburn SEC 681621
6 Baylor Big 12 272779
7 Boston College ACC 214233
8 California Pac-12 270172
9 Clemson ACC 564860
10 Colorado Pac-12 257084
# … with 55 more rows
If you have truly massive data, Select has tools to help you select fields that start_with the same things or ends with a certain word. The documentation will guide you if you need those someday. For 90 plus percent of what we do, just naming the fields will be sufficient.
6.2 Using conditional filters to set limits
Let’s return to the problem of one-hit wonders in basketball mucking up our true shooting analysis. How can we set limits in something like a question of who had the best season? Let’s grab every player from last season.
For this walkthrough:
Let’s get set up similar to the previous chapter.
<- read_csv("data/players24.csv") players
Rows: 5678 Columns: 59
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Team, Player, Class, Pos, Height, Hometown, High School, Summary, ...
dbl (49): #, Weight, Rk.x, G, GS, MP, FG, FGA, FG%, 2P, 2PA, 2P%, 3P, 3PA, 3...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
|>
players mutate(trueshooting = (PTS/(2*(FGA + (.44*FTA))))*100) |>
arrange(desc(trueshooting))
# A tibble: 5,678 × 60
Team Player `#` Class Pos Height Weight Homet…¹ High …² Summary Rk.x
<chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
1 Winthro… Henry… 21 FR G 5-11 175 Lewisb… Lewisb… 1.3 Pt… 14
2 Tenness… D.J. … 44 FR G 6-5 207 Ricahr… Minnes… 3.0 Pt… 13
3 San Jos… Benne… 24 FR G 6-0 180 Austin… L.C. A… 1.2 Pt… 13
4 Richmon… Liam … 10 JR G 6-2 190 Kansas… Pembro… 1.0 Pt… 12
5 Niagara… Jalen… 10 SO G 5-10 160 Chicag… De La … 0.6 Pt… 12
6 Nevada … Isaac… 24 FR F 6-8 195 Phoeni… Hillcr… 3.0 Pt… 13
7 Georget… Austi… 4 SO F 6-4 180 Hoover… Hoover… 1.0 Pt… 11
8 Florida… Renat… 15 JR G 6-1 180 Planta… Americ… 3.0 Pt… 15
9 Florida… Adefe… 22 FR G 6-1 160 Mirama… Norlan… 1.5 Pt… 16
10 Dayton … Attic… 54 SO G 6-6 190 Wester… St. Fr… 0.4 Pt… 12
# … with 5,668 more rows, 49 more variables: G <dbl>, GS <dbl>, MP <dbl>,
# FG <dbl>, FGA <dbl>, `FG%` <dbl>, `2P` <dbl>, `2PA` <dbl>, `2P%` <dbl>,
# `3P` <dbl>, `3PA` <dbl>, `3P%` <dbl>, FT <dbl>, FTA <dbl>, `FT%` <dbl>,
# ORB <dbl>, DRB <dbl>, TRB <dbl>, AST <dbl>, STL <dbl>, BLK <dbl>,
# TOV <dbl>, PF <dbl>, PTS <dbl>, Rk.y <dbl>, PER <dbl>, `TS%` <dbl>,
# `eFG%` <dbl>, `3PAr` <dbl>, FTr <dbl>, PProd <dbl>, `ORB%` <dbl>,
# `DRB%` <dbl>, `TRB%` <dbl>, `AST%` <dbl>, `STL%` <dbl>, `BLK%` <dbl>, …
In that season, we’ve got several players that can lay claim to the title of One Shot One Three True Shooting champion.
In most contests, like the batting title in Major League Baseball, there’s a minimum number of X to qualify. In baseball, it’s at bats. In basketball, it attempts. So let’s set a floor and see how it changes. What if we said you had to have played 100 minutes in a season? The top players in college basketball play more than 1000 minutes in a season. So 100 is not that much. Let’s try it and see.
|>
players mutate(trueshooting = (PTS/(2*(FGA + (.44*FTA))))*100) |>
arrange(desc(trueshooting)) |>
filter(MP > 100)
# A tibble: 3,820 × 60
Team Player `#` Class Pos Height Weight Homet…¹ High …² Summary Rk.x
<chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
1 Radford… TJ Ne… 23 SR F 6-9 220 Burlin… Graham… 8.9 Pt… 9
2 UNC Wil… Khama… 21 JR F 6-8 175 Tallah… James … 3.8 Pt… 7
3 Baylor … Dantw… 12 SR G 6-2 190 Ocala,… PHS Ac… 4.7 Pt… 10
4 Furman … Coope… 21 FR F 6-11 210 Woodbr… Saint … 3.9 Pt… 9
5 Temple … Emman… 21 JR C 6-10 225 Jackso… Oak Hi… 0.7 Pt… 10
6 UNC Tar… Jalen… 13 SO F 6-10 225 Gary, … West S… 3.9 Pt… 8
7 St. Bon… Noel … 20 JR C 6-11 250 Leesbu… Flint … 6.1 Pt… 7
8 North D… Brian… 12 JR F 6-9 230 Chicag… DePaul… 2.3 Pt… 10
9 Illinoi… Nicco… 11 FR G 6-1 170 Bologn… DME Ac… 1.5 Pt… 11
10 College… James… 23 FR F 6-11 210 Fayett… E.E. S… 5.0 Pt… 7
# … with 3,810 more rows, 49 more variables: G <dbl>, GS <dbl>, MP <dbl>,
# FG <dbl>, FGA <dbl>, `FG%` <dbl>, `2P` <dbl>, `2PA` <dbl>, `2P%` <dbl>,
# `3P` <dbl>, `3PA` <dbl>, `3P%` <dbl>, FT <dbl>, FTA <dbl>, `FT%` <dbl>,
# ORB <dbl>, DRB <dbl>, TRB <dbl>, AST <dbl>, STL <dbl>, BLK <dbl>,
# TOV <dbl>, PF <dbl>, PTS <dbl>, Rk.y <dbl>, PER <dbl>, `TS%` <dbl>,
# `eFG%` <dbl>, `3PAr` <dbl>, FTr <dbl>, PProd <dbl>, `ORB%` <dbl>,
# `DRB%` <dbl>, `TRB%` <dbl>, `AST%` <dbl>, `STL%` <dbl>, `BLK%` <dbl>, …
Now you get Radford’s TJ NeSmith, who played in seven games and was on the floor for 113 minutes. So he played a little bit, but not a lot. But in that time, he only attempted 32 shots, and made 75 percent of them. In other words, when he shot, he probably scored. He just rarely shot.
So is 100 minutes our level? Here’s the truth – there’s not really an answer here. We’re picking a cutoff. If you can cite a reason for it and defend it, then it probably works.
6.3 Top list
One last little dplyr trick that’s nice to have in the toolbox is a shortcut for selecting only the top values for your dataset. Want to make a Top 10 List? Or Top 25? Or Top Whatever You Want? It’s easy.
So what are the top 10 Power Five schools by season attendance. All we’re doing here is chaining commands together with what we’ve already got. We’re filtering by our list of Power Five conferences, we’re selecting the three fields we need, now we’re going to arrange it by total attendance and then we’ll introduce the new function: top_n
. The top_n
function just takes a number. So we want a top 10 list? We do it like this:
|> filter(Conference %in% powerfive) |> select(Institution, Conference, `2023`) |> arrange(desc(`2023`)) |> top_n(10) attendance
Selecting by 2023
# A tibble: 10 × 3
Institution Conference `2023`
<chr> <chr> <dbl>
1 Michigan Big Ten 769797
2 Penn St. Big Ten 758864
3 Tennessee SEC 713405
4 LSU SEC 705191
5 Alabama SEC 700539
6 Texas A&M SEC 694639
7 Georgia SEC 649222
8 Ohio St. Big Ten 622753
9 Auburn SEC 616301
10 Texas Big 12 609748
That’s all there is to it. Just remember – for it to work correctly, you need to sort your data BEFORE you run top_n. Otherwise, you’re just getting the first 10 values in the list. The function doesn’t know what field you want the top values of. You have to do it.