11  Residuals

When looking at a linear model of your data, there’s a measure you need to be aware of called residuals. The residual is the distance between what the model predicted and what the real outcome was. Take our model at the end of the correlation and regression chapter. Our model predicted Maryland’s women’s soccer team should have outscored Navy by 2.12 goals a year ago. The match was a 3-3 draw – an actual differential of zero. So our residual is -2.12.
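
Put as a formula: residual = actual - predicted = 0 - 2.12 = -2.12.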

Residuals can tell you several things, but most important is whether a linear model is the right model for your data. If the residuals appear to be random, then a linear model is appropriate. If they have a pattern, it means something else is going on in your data and a linear model isn’t appropriate.

Residuals can also tell you who is under-performing and over-performing the model. And the more robust the model – the higher your r-squared value – the more meaningful that label of under- or over-performing is.

Let’s go back to our model for college basketball. For our predictor, let’s use Net FG Percentage – the difference between the two teams’ shooting percentages.

For this walkthrough, we’ll use a dataset of college basketball game logs. First, load the tidyverse and read in the data.

library(tidyverse)
logs <- read_csv("data/cbblogs1524.csv")
Rows: 98161 Columns: 51
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (9): Season, TeamFull, Opponent, HomeAway, W_L, URL, Conference, Team,...
dbl  (39): Game, TeamScore, OpponentScore, TeamFG, TeamFGA, TeamFGPCT, Team3...
lgl   (2): Blank, season
date  (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

First, let’s make the columns we’ll need.

residualmodel <- logs |> 
  mutate(
    differential = TeamScore - OpponentScore, 
    FGPctMargin = TeamFGPCT - OpponentFGPCT
  )

Now let’s create our model.

fit <- lm(differential ~ FGPctMargin, data = residualmodel)
summary(fit)

Call:
lm(formula = differential ~ FGPctMargin, data = residualmodel)

Residuals:
    Min      1Q  Median      3Q     Max 
-51.242  -6.244  -0.221   5.916  68.455 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.53004    0.03038   17.45   <2e-16 ***
FGPctMargin 122.54301    0.27274  449.31   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.503 on 98133 degrees of freedom
  (26 observations deleted due to missingness)
Multiple R-squared:  0.6729,    Adjusted R-squared:  0.6729 
F-statistic: 2.019e+05 on 1 and 98133 DF,  p-value: < 2.2e-16

We’ve seen this output before, but let’s review, because if you are using scatterplots and fit lines to make a point, you should do this. First, note the Min and Max residuals at the top. A team has under-performed the model by 51 points (!), and a team has over-performed it by 68 points (!!). The median residual, where half are above and half are below, is just slightly below the fit line. Close to zero here is good.

Next: Look at the Adjusted R-squared value. What that says is that about 67 percent of the variation in a team’s scoring differential can be explained by its FG percentage margin.

Last: Look at the p-value. We are looking for a p-value smaller than .05. At that level, we can say that our correlation didn’t happen at random. And, in this case, it REALLY didn’t happen at random. But if you know a little bit about basketball, it doesn’t surprise you that the better you shoot compared to your opponent, the more you win by. It’s an intuitive result.

What we want to do now is look at those residuals. We want to add them to our individual game records. We can do that by adding two new fields – predicted and residuals – to our dataframe like this:

residualmodel <- residualmodel |> mutate(predicted = predict(fit), residuals = residuals(fit))
Error in `mutate()`:
ℹ In argument: `predicted = predict(fit)`.
Caused by error:
! `predicted` must be size 98161 or 1, not 98135.

Uh, oh. What’s going on here? When you get a message like this, where R is complaining about the size of the data, it most likely means that your model dropped some rows with NA values that are still sitting in your dataframe. In this case, the number of affected rows is small – 98,161 minus 98,135 is just 26 – so let’s just get rid of those rows by filtering on the calculated column our model uses:

residualmodel <- residualmodel |> filter(!is.na(FGPctMargin))

Now we can try re-running the code to add the predicted and residuals columns:

residualmodel <- residualmodel |> mutate(predicted = predict(fit), residuals = residuals(fit))

Now we can sort our data by those residuals. Sorting in descending order gives us the games where teams overperformed the model. To make it easier to read, I’m going to use select to give us just the columns we need to see and limit our results to Big Ten teams.

residualmodel |> 
  filter(Conference == 'Big Ten') |> 
  arrange(desc(residuals)) |> 
  select(Date, Team, Opponent, W_L, differential, FGPctMargin, predicted, residuals)
# A tibble: 1,046 × 8
   Date       Team         Opponent        W_L   diffe…¹ FGPct…² predi…³ resid…⁴
   <date>     <chr>        <chr>           <chr>   <dbl>   <dbl>   <dbl>   <dbl>
 1 2020-12-15 Northwestern Quincy          W          52  0.089    11.4     40.6
 2 2020-11-25 Nebraska     McNeese State   W          47  0.118    15.0     32.0
 3 2020-11-25 Illinois     North Carolina… W          62  0.243    30.3     31.7
 4 2021-11-22 Iowa         Western Michig… W          48  0.147    18.5     29.5
 5 2021-11-16 Iowa         North Carolina… W          17 -0.098   -11.5     28.5
 6 2020-12-05 Northwestern Chicago State   W          45  0.15     18.9     26.1
 7 2014-11-16 Illinois     Coppin State    W          58  0.258    32.1     25.9
 8 2020-12-13 Iowa         Northern Illin… W          53  0.219    27.4     25.6
 9 2020-12-10 Minnesota    UMKC            W          29  0.0510    6.78    22.2
10 2022-01-14 Purdue       Nebraska        W          27  0.0360    4.94    22.1
# … with 1,036 more rows, and abbreviated variable names ¹​differential,
#   ²​FGPctMargin, ³​predicted, ⁴​residuals

So looking at this table, what you see here are the games where teams scored more than their FG percentage margin would indicate. One of them should jump off the page at you.

Look at that Northwestern-Quincy game from December 2020. Based on the shooting margin, the model predicted the Wildcats would win by about 11 points. Instead, they won by 52 – over-performing the model by more than 40 points!

But, before we can bestow any validity on this model, we need to see if a linear model is appropriate. We’ve done some of that by looking at our p-values and R-squared values. But one more check is to look at the residuals themselves. We do that by plotting the residuals against the predictor. We’ll get into plotting soon, but for now just seeing it is enough.
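
Here’s a minimal sketch of that plot, using ggplot2 (it loads with the tidyverse) and the residualmodel dataframe we built above – the styling details are up to you:

ggplot(residualmodel, aes(x = FGPctMargin, y = residuals)) + 
  # each dot is one game: how far the actual result landed from the model
  geom_point(alpha = .2)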

The lack of a shape here – the seemingly random nature – is a good sign that a linear model works for our data. If there was a pattern, that would indicate something else was going on in our data and we needed a different model.

Another way to view your residuals is by connecting the predicted value with the actual value.
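
One way to sketch that with ggplot2, under the same assumptions as above:

ggplot(residualmodel, aes(x = FGPctMargin, y = differential)) + 
  # a segment from each game's actual differential to its predicted one
  geom_segment(aes(xend = FGPctMargin, yend = predicted), alpha = .2) +
  geom_point(alpha = .2) +
  # the fit line itself
  geom_smooth(method = "lm", se = FALSE)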

`geom_smooth()` using formula = 'y ~ x'

The blue line here separates underperformers from overperformers.

11.1 Fouls

Now let’s look at a case where it doesn’t work as well: the total number of fouls.

fouls <- logs |> 
  mutate(
    differential = TeamScore - OpponentScore, 
    TotalFouls = TeamPersonalFouls+OpponentPersonalFouls
  )
pfit <- lm(differential ~ TotalFouls, data = fouls)
summary(pfit)

Call:
lm(formula = differential ~ TotalFouls, data = fouls)

Residuals:
    Min      1Q  Median      3Q     Max 
-81.056 -10.628  -0.056   9.587 106.658 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.771005   0.262763  14.351   <2e-16 ***
TotalFouls  -0.071448   0.007219  -9.898   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.61 on 98133 degrees of freedom
  (26 observations deleted due to missingness)
Multiple R-squared:  0.0009973, Adjusted R-squared:  0.0009871 
F-statistic: 97.97 on 1 and 98133 DF,  p-value: < 2.2e-16

So from top to bottom:

  • Our min and max residuals go from -81 to positive 107
  • Our adjusted R-squared is … 0.0009871. Not much at all.
  • Our p-value is … less than .05, so that’s something.

So what we can say about this model is that it’s statistically significant, but doesn’t really explain much. It’s not meaningless, but on its own the total number of fouls doesn’t go very far in explaining the point differential. Normally, we’d stop right here – why bother going forward with a predictive model that isn’t terribly predictive? But let’s do it anyway. Oh, and see that “(26 observations deleted due to missingness)” bit? That means we need to lose some incomplete data again.

fouls <- fouls |> filter(!is.na(TotalFouls))
fouls$predicted <- predict(pfit)
fouls$residuals <- residuals(pfit)
fouls |> 
  arrange(desc(residuals)) |> 
  select(Team, Opponent, W_L, TeamScore, OpponentScore, TotalFouls, residuals)
# A tibble: 98,135 × 7
   Team                  Opponent         W_L   TeamSc…¹ Oppon…² Total…³ resid…⁴
   <chr>                 <chr>            <chr>    <dbl>   <dbl>   <dbl>   <dbl>
 1 Bryant                Thomas (ME)      W          147      39      34   107. 
 2 McNeese State         Dallas Christian W          140      37      33   102. 
 3 Appalachian State     Toccoa Falls     W          135      34      35    99.7
 4 Purdue Fort Wayne     <NA>             W          130      34      19    93.6
 5 James Madison         Carlow           W          135      40      29    93.3
 6 North Dakota State    Oak Hills        W          108      14      12    91.1
 7 Florida International Trinity (FL)     W          146      55      39    90.0
 8 Lamar                 Howard Payne     W          121      32      35    87.7
 9 Georgia Southern      Carver College   W          139      51      38    86.9
10 Youngstown State      Franciscan       W          134      46      35    86.7
# … with 98,125 more rows, and abbreviated variable names ¹​TeamScore,
#   ²​OpponentScore, ³​TotalFouls, ⁴​residuals

First, note that the biggest misses here are all blowout games, the most extreme being Bryant vs. Thomas (ME). The model missed that differential by … 107 points. The margin of victory? 108 points. In other words, this model is not great! But let’s look at it anyway.
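
As before, we can eyeball the residuals against the predictor. A minimal sketch, same assumptions as the earlier plots:

ggplot(fouls, aes(x = TotalFouls, y = residuals)) + 
  # a visible pattern here would argue against a linear model
  geom_point(alpha = .2)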

Well … it actually says that a linear model is appropriate. Which is an important lesson – just because your residual plot says a linear model works here, that doesn’t mean your linear model is good. There are other measures for that, and you need to use them.

Here’s the segment plot of residuals – you’ll see some really long lines. That’s a bad sign. Another bad sign? A flat fit line. It means there’s no relationship between these two things. Which we already know.
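
A sketch of that plot, mirroring the earlier segment plot:

ggplot(fouls, aes(x = TotalFouls, y = differential)) + 
  # long segments are big misses between actual and predicted
  geom_segment(aes(xend = TotalFouls, yend = predicted), alpha = .2) +
  geom_point(alpha = .2) +
  geom_smooth(method = "lm", se = FALSE)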

`geom_smooth()` using formula = 'y ~ x'