12 Z-scores

Z-scores are a handy way to standardize numbers so you can compare things across groupings or time. In this class, we may want to compare teams by year, or era. We can use z-scores to answer questions like who was the greatest X of all time, because a z-score can put them in the context of their era.

A z-score is a measure of how far a particular stat is from the mean. It’s measured in standard deviations from that mean. A standard deviation is a measure of how much variation (how spread out) the numbers in a data set are. What it means here, with regard to z-scores, is that zero is perfectly average. If it’s 1, it’s one standard deviation above the mean, and about 34 percent of all cases are between 0 and 1.

If you think of the normal distribution, it means that about 84.1 percent of all cases are below that 1. If it were -1, it would mean the number is one standard deviation below the mean, and about 84.1 percent of cases would be above that -1. So if you have numbers with z-scores of 3 or even 4, that means that number is waaaaaay above the mean.
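If you want to check those shares yourself, base R's pnorm() gives the proportion of a standard normal distribution that falls below a given z-score. This is a small sketch, and the variable names are just for illustration:

```r
# Share of a standard normal distribution below z = 1
below_one <- pnorm(1)

# Share between the mean (z = 0) and z = 1
zero_to_one <- pnorm(1) - pnorm(0)

# Share above z = -1, which mirrors the share below z = 1
above_minus_one <- 1 - pnorm(-1)

# Roughly 0.841, 0.341 and 0.841
round(c(below_one, zero_to_one, above_minus_one), 3)
```

Real sports data is rarely perfectly normal, so treat these percentages as a rule of thumb rather than a guarantee.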

So let’s use last year’s Maryland women’s basketball team, which, if you haven’t been paying attention to current events, was talented but had a few struggles.

12.1 Calculating a Z score in R

For this we’ll need the logs of all college basketball games last season.

Load the tidyverse.

library(tidyverse)

And load the data.

gamelogs <- read_csv("data/wbblogs24.csv")

Rows: 11503 Columns: 48
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (8): Season, TeamFullName, Opponent, HomeAway, W_L, URL, Conference, Team
dbl  (39): Game, TeamScore, OpponentScore, TeamFG, TeamFGA, TeamFGPCT, Team3...
date  (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The first thing we need to do is select some fields we think represent team quality and a few things to help us keep things straight. So I’m going to pick shooting percentage, rebounding and the opponent version of the same two:

teamquality <- gamelogs |> 
  select(Conference, Team, TeamFGPCT, TeamTotalRebounds, OpponentFGPCT, OpponentTotalRebounds)

And since we have individual game data, we need to collapse this into one record for each team. We do that with … group by and summarize.

teamtotals <- teamquality |> 
  group_by(Conference, Team) |> 
  summarise(
    FGAvg = mean(TeamFGPCT), 
    ReboundAvg = mean(TeamTotalRebounds), 
    OppFGAvg = mean(OpponentFGPCT),
    OffRebAvg = mean(OpponentTotalRebounds)
    )

`summarise()` has grouped output by 'Conference'. You can override using the
`.groups` argument.
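That message is worth understanding: after summarise(), the result is still grouped by Conference unless you say otherwise. Here is a minimal sketch on made-up data (toy_logs and its values are hypothetical, not from the real file) showing the .groups argument, along with what mean() does when a game has a missing value:

```r
library(dplyr)  # loaded for you as part of the tidyverse

# Hypothetical game logs: team B is missing one shooting percentage
toy_logs <- tibble(
  Conference = c("X", "X", "X", "X"),
  Team = c("A", "A", "B", "B"),
  TeamFGPCT = c(0.40, 0.50, 0.45, NA)
)

toy_totals <- toy_logs |>
  group_by(Conference, Team) |>
  summarise(
    FGAvg = mean(TeamFGPCT),                    # NA for team B
    FGAvgNoNA = mean(TeamFGPCT, na.rm = TRUE),  # skips the missing game
    .groups = "drop"                            # return an ungrouped tibble
  )

toy_totals
```

Whether you drop the grouping or skip missing values is a choice, not a rule. The walkthrough keeps the default behavior for both.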

To calculate a z-score in R, the easiest way is to use the scale function in base R. To use it, you use scale(FieldName, center=TRUE, scale=TRUE). The center and scale arguments indicate if you want to subtract the mean and if you want to divide by the standard deviation, respectively. We do.
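To see what scale() is doing, here is a small sketch that computes z-scores by hand and compares them with scale(). The numbers in fg are made up for illustration:

```r
# Hypothetical shooting percentages for five teams
fg <- c(0.38, 0.40, 0.42, 0.45, 0.50)

# z-score by hand: subtract the mean, divide by the standard deviation
by_hand <- (fg - mean(fg)) / sd(fg)

# scale() returns a one-column matrix, so as.numeric() flattens it to a vector
with_scale <- as.numeric(scale(fg, center = TRUE, scale = TRUE))

# TRUE: the two approaches agree
all.equal(by_hand, with_scale)
```

The as.numeric() wrapper is why the z-score columns end up as plain numbers instead of matrix columns.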

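When we have multiple z-scores, it’s pretty standard practice to add them together into a composite score. That’s what we’re doing at the end here with TotalZscore. Note: We have to invert OppZscore and OppRebZScore by multiplying them by -1 because the lower someone’s opponent number is, the better. Also note that teamtotals is still grouped by Conference, so scale() works within each conference: every z-score below measures a team against its own conference, not against the whole country.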

teamzscore <- teamtotals |> 
  mutate(
    FGzscore = as.numeric(scale(FGAvg, center = TRUE, scale = TRUE)),
    RebZscore = as.numeric(scale(ReboundAvg, center = TRUE, scale = TRUE)),
    OppZscore = as.numeric(scale(OppFGAvg, center = TRUE, scale = TRUE)) * -1,
    OppRebZScore = as.numeric(scale(OffRebAvg, center = TRUE, scale = TRUE)) * -1,
    TotalZscore = FGzscore + RebZscore + OppZscore + OppRebZScore
  )
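Because the data is grouped by Conference when mutate() runs, it matters which teams each z-score is compared against. The sketch below uses invented averages for two hypothetical conferences to show the difference between a within-conference z-score and one computed across all teams after ungroup():

```r
library(dplyr)  # loaded for you as part of the tidyverse

# Hypothetical team averages in two conferences of different strength
toy_totals <- tibble(
  Conference = c("Strong", "Strong", "Strong", "Weak", "Weak", "Weak"),
  Team = c("A", "B", "C", "D", "E", "F"),
  FGAvg = c(0.46, 0.48, 0.50, 0.36, 0.38, 0.40)
)

compared <- toy_totals |>
  group_by(Conference) |>
  # Grouped: each team is measured against its own conference
  mutate(WithinConf = as.numeric(scale(FGAvg))) |>
  ungroup() |>
  # Ungrouped: each team is measured against every team in the data
  mutate(National = as.numeric(scale(FGAvg)))

compared
```

In this toy data, team F tops its weak conference with the same within-conference z-score as team C, but its national z-score is below zero. Neither version is wrong. They answer different questions, so be clear about which one you are reporting.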

So now we have a dataframe called teamzscore that has 360 basketball teams with z-scores. What does it look like?

head(teamzscore)

# A tibble: 6 × 11
# Groups:   Conference [1]
  Confere…¹ Team   FGAvg Rebou…² OppFG…³ OffRe…⁴ FGzsc…⁵ RebZs…⁶ OppZs…⁷ OppRe…⁸
  <chr>     <chr>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 A-10 WBB  Davi… NA        NA    NA        NA   NA       NA      NA      NA    
2 A-10 WBB  Dayt…  0.400    35.7   0.437    28.7 -0.0992   1.17   -1.10    0.649
3 A-10 WBB  Duqu…  0.424    33.8   0.390    32.4  0.595    0.637   0.441  -0.432
4 A-10 WBB  Ford… NA        NA    NA        NA   NA       NA      NA      NA    
5 A-10 WBB  Geor…  0.397    34.9   0.381    32.7 -0.201    0.946   0.764  -0.530
6 A-10 WBB  Geor…  0.370    33.7   0.390    29.7 -0.980    0.631   0.466   0.364
# … with 1 more variable: TotalZscore <dbl>, and abbreviated variable names
#   ¹Conference, ²ReboundAvg, ³OppFGAvg, ⁴OffRebAvg, ⁵FGzscore, ⁶RebZscore,
#   ⁷OppZscore, ⁸OppRebZScore

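A way to read this: a team with a TotalZscore of 0 is precisely average. The larger the positive number, the more exceptional they are. The larger the negative number, the more truly terrible they are. The rows full of NA are teams with at least one missing value in their game logs, because mean() returns NA unless you add na.rm = TRUE.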

So who are the best teams in the country?

teamzscore |> arrange(desc(TotalZscore))

# A tibble: 360 × 11
# Groups:   Conference [33]
   Confere…¹ Team  FGAvg Rebou…² OppFG…³ OffRe…⁴ FGzsc…⁵ RebZs…⁶ OppZs…⁷ OppRe…⁸
   <chr>     <chr> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 SEC WBB   Sout… 0.495    42.2   0.324    28.3   2.13    1.70    2.17    1.22 
 2 MAAC WBB  Fair… 0.464    33.1   0.361    27.4   2.26    1.26    1.61    1.64 
 3 Big East… Conn… 0.498    33.5   0.356    27.8   2.41    0.883   1.78    0.999
 4 WCC WBB   Gonz… 0.482    34.2   0.403    25.4   1.82    1.05    0.811   2.26 
 5 Sun Belt… Jame… 0.424    42.4   0.363    31.0   1.10    1.64    1.64    1.47 
 6 Big 12 W… Texas 0.490    36.2   0.384    24.4   1.63    0.775   1.17    2.22 
 7 Summit W… Sout… 0.476    35.1   0.370    26.5   1.95    0.878   1.54    1.42 
 8 OVC WBB   Sout… 0.452    35.6   0.359    31.1   1.90    1.32    1.90    0.596
 9 MWC WBB   Neva… 0.463    37.6   0.381    28.2   1.69    1.84    0.954   1.18 
10 SWAC WBB  Jack… 0.397    38.5   0.341    29.9   0.722   1.36    1.79    1.38 
# … with 350 more rows, 1 more variable: TotalZscore <dbl>, and abbreviated
#   variable names ¹Conference, ²ReboundAvg, ³OppFGAvg, ⁴OffRebAvg, ⁵FGzscore,
#   ⁶RebZscore, ⁷OppZscore, ⁸OppRebZScore

Don’t sleep on Fairfield! If we look at major-conference schools, South Carolina and UConn are at the top, which checks out.

But closer to home, how is Maryland doing?

teamzscore |> 
  filter(Conference == "Big Ten WBB") |> 
  arrange(desc(TotalZscore)) |>
  select(Team, TotalZscore)

Adding missing grouping variables: `Conference`

# A tibble: 14 × 3
# Groups:   Conference [1]
   Conference  Team           TotalZscore
   <chr>       <chr>                <dbl>
 1 Big Ten WBB Iowa                 4.41 
 2 Big Ten WBB Indiana              3.84 
 3 Big Ten WBB Nebraska             3.15 
 4 Big Ten WBB Penn State           1.88 
 5 Big Ten WBB Illinois             1.20 
 6 Big Ten WBB Michigan             0.396
 7 Big Ten WBB Maryland            -0.651
 8 Big Ten WBB Ohio State          -0.662
 9 Big Ten WBB Minnesota           -0.856
10 Big Ten WBB Wisconsin           -0.905
11 Big Ten WBB Michigan State      -1.17 
12 Big Ten WBB Rutgers             -1.91 
13 Big Ten WBB Purdue              -2.54 
14 Big Ten WBB Northwestern        -6.20

So, as we can see, with our composite z-score, Maryland is below the Big Ten average; not great. But better than Ohio State. Notice how, by this measure, Indiana and Iowa are far ahead of most of the conference, with Nebraska a somewhat surprising third.
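About that "Adding missing grouping variables" message: select() will not drop a column the data is grouped by, so Conference comes along even though we didn't ask for it. A minimal sketch with made-up scores (toy_scores is hypothetical) shows one way to get only the requested columns, by calling ungroup() first:

```r
library(dplyr)  # loaded for you as part of the tidyverse

# Hypothetical composite scores, grouped the same way as teamzscore
toy_scores <- tibble(
  Conference = c("X", "X", "Y"),
  Team = c("A", "B", "C"),
  TotalZscore = c(1.2, -0.4, 0.3)
) |>
  group_by(Conference)

# Still grouped: select() keeps Conference and prints a message saying so
kept <- toy_scores |> select(Team, TotalZscore)

# Ungrouped first: only the requested columns come back
dropped <- toy_scores |> ungroup() |> select(Team, TotalZscore)

names(kept)     # "Conference" "Team" "TotalZscore"
names(dropped)  # "Team" "TotalZscore"
```

Keeping Conference in the output is harmless here, so the walkthrough leaves the grouping in place.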

We can limit our results to just Power Five conferences plus the Big East:

powerfive_plus_one <- c("SEC WBB", "Big Ten WBB", "Pac-12 WBB", "Big 12 WBB", "ACC WBB", "Big East WBB")
teamzscore |> 
  filter(Conference %in% powerfive_plus_one) |> 
  arrange(desc(TotalZscore)) |>
  select(Team, TotalZscore)

Adding missing grouping variables: `Conference`

# A tibble: 80 × 3
# Groups:   Conference [6]
   Conference   Team            TotalZscore
   <chr>        <chr>                 <dbl>
 1 SEC WBB      South Carolina         7.22
 2 Big East WBB Connecticut            6.07
 3 Big 12 WBB   Texas                  5.80
 4 Pac-12 WBB   Stanford               4.97
 5 SEC WBB      Louisiana State        4.83
 6 Pac-12 WBB   UCLA                   4.67
 7 ACC WBB      Virginia Tech          4.53
 8 Big Ten WBB  Iowa                   4.41
 9 Big Ten WBB  Indiana                3.84
10 ACC WBB      Duke                   3.65
# … with 70 more rows

This makes a certain amount of sense: three of the Final Four teams (South Carolina, UConn and Iowa) are in the top 10. N.C. State, the fourth team, ranks 16th. Duke is an interesting #10 here. It doesn’t necessarily mean they were the 10th-best team, but given their competition they shot the ball and rebounded the ball very well.

12.2 Writing about z-scores

The great thing about z-scores is that they make it very easy for you, the sports analyst, to create your own measures of who is better than who. The downside: Only a small handful of sports fans know what the hell a z-score is.

As such, you should try as hard as you can to avoid writing about them.

If the word z-score appears in your story or in a chart, you need to explain what it is. “The ranking uses a statistical measure of the distance from the mean called a z-score” is a good way to go about it. You don’t need a full stats textbook definition, just a quick explanation. And keep it simple.

Never use z-score in a headline. Write around it. Away from it. Z-score in a headline is attention repellent. You won’t get anyone to look at it. So “Tottenham tops in z-score” bad, “Tottenham tops in the Premier League” good.