11  Residuals

When looking at a linear model of your data, there’s a measure you need to be aware of called residuals. The residual is the difference between the real outcome and what the model predicted: actual minus predicted. Take our model at the end of the correlation and regression chapter. Our model predicted Maryland’s women’s soccer team should have outscored George Mason by a goal a year ago. The match was a 3-2 loss, a margin of -1. So our residual is -1 minus 1, or -2.
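As a quick check of that arithmetic, here is a minimal sketch in R. The variable names are made up for illustration; the only convention assumed is that a residual is the actual value minus the predicted value, which is also how R’s `residuals()` reports them.

```r
# Hypothetical names, for illustration only.
predicted_margin <- 1      # the model's call: Maryland by one goal
actual_margin <- 2 - 3     # the result: a 3-2 loss, so a margin of -1

# Residual = actual outcome minus predicted outcome
residual <- actual_margin - predicted_margin
residual
```

A negative residual means the team did worse than the model expected; a positive one means it did better.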

Residuals can tell you several things, but most important is whether a linear model is the right model for your data. If the residuals appear to be random, then a linear model is appropriate. If they have a pattern, it means something else is going on in your data and a linear model isn’t appropriate.

Residuals can also tell you who is under-performing and over-performing the model. And the more robust the model – the better your r-squared value is – the more meaningful that label of under or over-performing is.

Let’s go back to our model for men’s college basketball. For our predictor, let’s use Net FG Percentage - the difference between the two teams’ shooting success.

For this walkthrough, load the tidyverse and read in the game logs.

library(tidyverse)
logs <- read_csv("data/cbblogs1125.csv.zip")
Multiple files in zip: reading 'cbblogs1125.csv'
Rows: 169224 Columns: 61
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (10): Season, GameType, TeamFullName, Opponent, HomeAway, W_L, OT, URL,...
dbl  (50): file_source, Game, TeamScore, OpponentScore, TeamFG, TeamFGA, Tea...
date  (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

First, let’s make the columns we’ll need.

residualmodel <- logs |> mutate(differential = TeamScore - OpponentScore, FGPctMargin = TeamFGPCT - OpponentFGPCT)

Now let’s create our model.

fit <- lm(differential ~ FGPctMargin, data = residualmodel)
summary(fit)

Call:
lm(formula = differential ~ FGPctMargin, data = residualmodel)

Residuals:
    Min      1Q  Median      3Q     Max 
-51.912  -6.248  -0.206   5.946  70.084 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.55052    0.02324   23.69   <2e-16 ***
FGPctMargin 119.82551    0.20700  578.87   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.544 on 169218 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.6645,    Adjusted R-squared:  0.6645 
F-statistic: 3.351e+05 on 1 and 169218 DF,  p-value: < 2.2e-16

We’ve seen this output before, but let’s review, because you should do this any time you use a scatterplot to make a point. First, note the Min and Max residual at the top. One team under-performed the model by nearly 52 points (!), and another over-performed it by 70 points (!!). The median residual, where half are above and half are below, is -0.206, just slightly below the fit line. Close to zero here is good.

Next: Look at the Adjusted R-squared value. What that says is that about 66 percent of the variation in a team’s scoring differential can be explained by its FG percentage margin.

Last: Look at the p-value. We are looking for a p-value smaller than .05. Below .05, we can say that our relationship is unlikely to be the result of random chance. And, in this case, it REALLY isn’t random. But if you know a little bit about basketball, it doesn’t surprise you that the better you shoot compared to your opponent, the more you win by. It’s an intuitive result.
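If you would rather pull those two numbers out of the model than read them off the printout, a sketch like this should work. It runs on simulated data, not the game logs, and the names `sim`, `sim_fit` and `sim_summary` are invented for the example.

```r
# Simulated stand-in for the game logs (values are made up)
set.seed(11)
sim <- data.frame(FGPctMargin = rnorm(500, sd = 0.1))
sim$differential <- 0.5 + 120 * sim$FGPctMargin + rnorm(500, sd = 9.5)

sim_fit <- lm(differential ~ FGPctMargin, data = sim)
sim_summary <- summary(sim_fit)

# Adjusted R-squared, as shown near the bottom of the printed summary
adj_r2 <- sim_summary$adj.r.squared

# p-value for the slope, from the coefficients table
slope_p <- coef(sim_summary)["FGPctMargin", "Pr(>|t|)"]

adj_r2
slope_p < 0.05
```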

What we want to do now is look at those residuals. We want to add them to our individual game records. We can do that by adding two new fields – predicted and residuals – to our dataframe like this:

residualmodel <- residualmodel |> mutate(predicted = predict(fit), residuals = residuals(fit))
Error in `mutate()`:
ℹ In argument: `predicted = predict(fit)`.
Caused by error:
! `predicted` must be size 169224 or 1, not 169220.

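Uh, oh. What’s going on here? When you get a message like this, where R is complaining about the size of the data, it most likely means that your model is using some columns that have NA values. lm() quietly dropped those rows, so the model has 169,220 predictions for a dataframe with 169,224 rows. In this case, the number of affected rows is small – just 4, as the “observations deleted due to missingness” line in the summary says – so let’s just get rid of those rows by filtering on one of the calculated columns from our model: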

residualmodel <- residualmodel |> filter(!is.na(FGPctMargin))

Now we can try re-running the code to add the predicted and residuals columns:

residualmodel <- residualmodel |> mutate(predicted = predict(fit), residuals = residuals(fit))
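Filtering is not the only way around the size mismatch. As a sketch of one alternative, base R’s `lm()` accepts `na.action = na.exclude`, which still leaves the incomplete rows out of the fit but pads the output of `predict()` and `residuals()` with NA so it lines up with the original rows. The toy data frame below is made up for the example.

```r
# Toy data (made up) with one missing predictor value
toy <- data.frame(
  differential = c(10, -4, 7, -12, 3),
  FGPctMargin  = c(0.08, -0.03, NA, -0.10, 0.02)
)

# na.exclude drops the NA row when fitting, but remembers where it was
toy_fit <- lm(differential ~ FGPctMargin, data = toy, na.action = na.exclude)

# predict() and residuals() come back padded with NA, so their length
# matches the number of rows in the original data
toy$predicted <- predict(toy_fit)
toy$residuals <- residuals(toy_fit)
toy
```

The tradeoff is that the NA rows stay in your dataframe, so you need to remember they are there when you sort or summarize.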

Now we can sort our data by those residuals. Sorting in descending order gives us the games where teams overperformed the model. To make it easier to read, I’m going to use select to give us just the columns we need to see and limit our results to Big Ten games.

residualmodel |> filter(Conference == 'Big Ten MBB', GameType == 'REG (Conf)') |> arrange(desc(residuals)) |> select(Date, Team, Opponent, W_L, differential, FGPctMargin, predicted, residuals)
# A tibble: 3,884 × 8
   Date       Team   Opponent W_L   differential FGPctMargin predicted residuals
   <date>     <chr>  <chr>    <chr>        <dbl>       <dbl>     <dbl>     <dbl>
 1 2016-01-18 Purdue Rutgers  W               50      0.109      13.6       36.4
 2 2023-01-28 Maryl… Nebraska W               19     -0.087      -9.87      28.9
 3 2020-01-21 Maryl… Northwe… W               11     -0.148     -17.2       28.2
 4 2016-03-02 Michi… Rutgers  W               31      0.0250      3.55      27.5
 5 2011-02-15 Ohio … Michiga… W               10     -0.145     -16.8       26.8
 6 2023-02-12 Iowa   Minneso… W               12     -0.125     -14.4       26.4
 7 2013-01-30 India… Purdue   W               37      0.093      11.7       25.3
 8 2015-01-20 Wisco… Iowa     W               32      0.057       7.38      24.6
 9 2016-01-23 India… Northwe… W               32      0.071       9.06      22.9
10 2012-01-29 India… Iowa     W               14     -0.0780     -8.80      22.8
# ℹ 3,874 more rows

So looking at this table, what you see here are the teams who scored more than their FG percentage margin would indicate. One of the predicted values should jump off the page at you.

Look at that Maryland-Northwestern game from 2020. The Wildcats shot better than the Terps, and the model predicted Northwestern would win by 17 points. Instead, Maryland won by 11!
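You can reproduce that row by hand from the coefficients in the model summary. This sketch uses the rounded values printed above, so expect it to match the table only to about one decimal place.

```r
# Coefficients as printed in the model summary
intercept <- 0.55052
slope <- 119.82551

# Maryland vs. Northwestern, 2020-01-21 (rounded values from the table)
fg_pct_margin <- -0.148
actual_differential <- 11

# Prediction from the fit line, then actual minus predicted
predicted <- intercept + slope * fg_pct_margin
residual <- actual_differential - predicted

round(predicted, 1)
round(residual, 1)
```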

But, before we can bestow any validity on this model, we need to see if this linear model is appropriate. We’ve done some of that by looking at our p-values and R-squared values. But one more check is to look at the residuals themselves. We do that by plotting the residuals against the predictor. We’ll get into plotting soon, but for now just seeing it is enough.

The lack of a shape here – the seemingly random nature – is a good sign that a linear model works for our data. If there were a pattern, that would indicate something else was going on in our data and we would need a different model.
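A residual plot of this kind can be sketched with ggplot2 as follows. This is an illustration on simulated data, not the code behind the chart described above; `sim`, `sim_fit` and `residual_plot` are invented names.

```r
library(ggplot2)

# Simulated stand-in for the game logs (values are made up)
set.seed(11)
sim <- data.frame(FGPctMargin = rnorm(500, mean = 0, sd = 0.1))
sim$differential <- 0.5 + 120 * sim$FGPctMargin + rnorm(500, sd = 9.5)

sim_fit <- lm(differential ~ FGPctMargin, data = sim)
sim$residuals <- residuals(sim_fit)

# Residuals against the predictor: look for a shapeless cloud around zero
residual_plot <- ggplot(sim, aes(x = FGPctMargin, y = residuals)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = 0)

residual_plot
```

A curve, a funnel or any other visible structure in that cloud would be the pattern to worry about.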

Another way to view your residuals is by connecting the predicted value with the actual value.

`geom_smooth()` using formula = 'y ~ x'

The blue line here separates underperformers from overperformers.
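One way to draw a chart like that, sketched here on simulated data with invented names, is to plot the actual values, add the fit line with `geom_smooth()`, and use `geom_segment()` to connect each point to its predicted value.

```r
library(ggplot2)

# Simulated stand-in for the game logs (values are made up)
set.seed(11)
sim <- data.frame(FGPctMargin = rnorm(200, mean = 0, sd = 0.1))
sim$differential <- 0.5 + 120 * sim$FGPctMargin + rnorm(200, sd = 9.5)

sim_fit <- lm(differential ~ FGPctMargin, data = sim)
sim$predicted <- predict(sim_fit)

segment_plot <- ggplot(sim, aes(x = FGPctMargin, y = differential)) +
  # each segment runs vertically from the actual value to the predicted value
  geom_segment(aes(xend = FGPctMargin, yend = predicted), alpha = 0.3) +
  geom_point() +
  # the fit line: points above it overperformed, points below underperformed
  geom_smooth(method = "lm", se = FALSE)

segment_plot
```

The length of each segment is the size of that game’s residual.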

11.1 Fouls

Now let’s look at a case where it doesn’t work as well: the total number of fouls.

fouls <- logs |> 
  mutate(
    differential = TeamScore - OpponentScore, 
    TotalFouls = TeamPersonalFouls+OpponentPersonalFouls
  )
pfit <- lm(differential ~ TotalFouls, data = fouls)
summary(pfit)

Call:
lm(formula = differential ~ TotalFouls, data = fouls)

Residuals:
    Min      1Q  Median      3Q     Max 
-95.709 -10.567   0.008   9.645 106.574 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.831495   0.199862   19.17   <2e-16 ***
TotalFouls  -0.070755   0.005468  -12.94   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.47 on 169218 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.0009884, Adjusted R-squared:  0.0009825 
F-statistic: 167.4 on 1 and 169218 DF,  p-value: < 2.2e-16

So from top to bottom:

  • Our min and max go from about -96 to positive 107.
  • Our adjusted R-squared is … 0.0009825. Not much at all.
  • Our p-value is … less than .05, so that’s something.
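To put the TotalFouls coefficient in context, here is a small sketch using the estimates printed in the summary. The foul totals of 30 and 40 are hypothetical, picked only to show the size of the effect.

```r
# Estimates as printed in the model summary
intercept <- 3.831495
slope <- -0.070755

# Predicted differential for two hypothetical foul totals
pred_30 <- intercept + slope * 30
pred_40 <- intercept + slope * 40

# Ten extra fouls move the prediction by well under one point
pred_40 - pred_30
```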

So what we can say about this model is that it’s statistically significant, but doesn’t really explain much. It’s not meaningless, but on its own the total number of fouls doesn’t go very far in explaining the point differential. Normally, we’d stop right here – why bother going forward with a predictive model that isn’t terribly predictive? But let’s do it anyway. Oh, and see that “(4 observations deleted due to missingness)” bit? That means we need to lose some incomplete data again.

fouls <- fouls |> filter(!is.na(TotalFouls))
fouls$predicted <- predict(pfit)
fouls$residuals <- residuals(pfit)
fouls |> arrange(desc(residuals)) |> select(Team, Opponent, W_L, TeamScore, OpponentScore, TotalFouls, residuals)
# A tibble: 169,220 × 7
   Team              Opponent W_L   TeamScore OpponentScore TotalFouls residuals
   <chr>             <chr>    <chr>     <dbl>         <dbl>      <dbl>     <dbl>
 1 Bryant            <NA>     W           147            39         34     107. 
 2 Southern          <NA>     W           116            12         37     103. 
 3 McNeese State     <NA>     W           140            37         33     102. 
 4 Western Carolina  <NA>     W           141            39         30     100. 
 5 Appalachian State <NA>     W           135            34         35      99.6
 6 Kansas City       <NA>     W           119            19         24      97.9
 7 Purdue Fort Wayne <NA>     W           130            34         19      93.5
 8 James Madison     <NA>     W           135            40         29      93.2
 9 Grambling         <NA>     W           147            52         20      92.6
10 Utah              Mississ… W           143            49         30      92.3
# ℹ 169,210 more rows

First, note that all of the biggest misses here are blowouts, some of the most lopsided games in the data, the worst being Bryant vs. Thomas. The model missed that differential by … 107 points. The margin of victory? 108 points. In other words, this model is not great! But let’s look at the residual plot anyway.
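That 107-point miss can be checked against the coefficients from the fouls model. The sketch below uses the rounded estimates printed in the summary and the Bryant row from the table.

```r
# Estimates as printed in the fouls model summary
intercept <- 3.831495
slope <- -0.070755

# Bryant's game, from the first row of the table
team_score <- 147
opponent_score <- 39
total_fouls <- 34

actual <- team_score - opponent_score
predicted <- intercept + slope * total_fouls
residual <- actual - predicted

round(predicted, 2)
round(residual, 1)
```

The model expected a win by a point or two, so nearly the entire margin ends up in the residual.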

Well … the residual plot actually says that a linear model is appropriate. Which is an important lesson – just because your residual plot says a linear model works here, that doesn’t say your linear model is good. There are other measures for that, and you need to use them.

Here’s the segment plot of residuals – you’ll see some really long lines. That’s a bad sign. Another bad sign? A flat fit line. It means there’s almost no relationship between these two things. Which we already know.

`geom_smooth()` using formula = 'y ~ x'