library(tidyverse)
11 Residuals
When looking at a linear model of your data, there’s a measure you need to be aware of called residuals. A residual is the difference between the real outcome and what the model predicted: actual minus predicted. Take our model at the end of the correlation and regression chapter. It predicted that Maryland’s women’s soccer team should have outscored Navy by 2.12 goals a year ago. The match was a 3-3 draw, a margin of zero. So our residual is 0 - 2.12 = -2.12.
Residuals can tell you several things, but the most important is whether a linear model is the right model for your data. If the residuals appear to be random, then a linear model is appropriate. If they have a pattern, it means something else is going on in your data and a linear model isn’t appropriate.
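The sign of a residual is easy to flip by accident, so here is the soccer example as a minimal sketch. The variable names are made up for illustration; only the 2.12 prediction and the 3-3 score come from the text.

```r
# Predicted goal margin for Maryland over Navy, from the earlier chapter's model.
predicted_margin <- 2.12

# Actual margin: the match ended 3-3.
actual_margin <- 3 - 3

# Residual = actual - predicted. Negative means the team under-performed the model.
residual <- actual_margin - predicted_margin
residual
```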
Residuals can also tell you who is under-performing and over-performing the model. And the more robust the model – the better your r-squared value is – the more meaningful that label of under or over-performing is.
Let’s go back to our model for college basketball. For our predictor, let’s use Net FG Percentage - the difference between the two teams’ shooting success.
For this walkthrough, load the tidyverse with library(tidyverse), then read in the game logs:
logs <- read_csv("data/cbblogs1524.csv")
Rows: 98161 Columns: 51
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Season, TeamFull, Opponent, HomeAway, W_L, URL, Conference, Team,...
dbl (39): Game, TeamScore, OpponentScore, TeamFG, TeamFGA, TeamFGPCT, Team3...
lgl (2): Blank, season
date (1): Date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
First, let’s make the columns we’ll need.
residualmodel <- logs |> mutate(differential = TeamScore - OpponentScore, FGPctMargin = TeamFGPCT - OpponentFGPCT)
Now let’s create our model.
fit <- lm(differential ~ FGPctMargin, data = residualmodel)
summary(fit)
Call:
lm(formula = differential ~ FGPctMargin, data = residualmodel)
Residuals:
Min 1Q Median 3Q Max
-51.242 -6.244 -0.221 5.916 68.455
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.53004 0.03038 17.45 <2e-16 ***
FGPctMargin 122.54301 0.27274 449.31 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.503 on 98133 degrees of freedom
(26 observations deleted due to missingness)
Multiple R-squared: 0.6729, Adjusted R-squared: 0.6729
F-statistic: 2.019e+05 on 1 and 98133 DF, p-value: < 2.2e-16
We’ve seen this output before, but let’s review because if you are using scatterplots to make a point, you should do this. First, note the Min and Max residuals at the top. One team under-performed the model by 51 points (!), and one over-performed it by 68 points (!!). The median residual, where half are above and half are below, is -0.221, just slightly below the fit line. A median close to zero is good.
Next: Look at the Adjusted R-squared value. It says that about 67 percent of the variation in scoring differential can be explained by FG percentage margin.
Last: Look at the p-value. We are looking for a p-value smaller than .05. Below .05, we can say that our relationship is unlikely to have happened at random. And, in this case, it REALLY didn’t happen at random. But if you know a little bit about basketball, it doesn’t surprise you that the more you outshoot your opponent, the more you win by. It’s an intuitive result.
What we want to do now is look at those residuals. We want to add them to our individual game records. We can do that by adding two new fields – predicted and residuals – to our dataframe like this:
residualmodel <- residualmodel |> mutate(predicted = predict(fit), residuals = residuals(fit))
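The three numbers reviewed above can also be pulled out of a model summary in code instead of being read off the printout. This is a minimal sketch on simulated data. The toy data frame, its column names and the numbers used to generate it are assumptions for illustration, not the game logs.

```r
# Simulated stand-in for the game logs; every value here is made up.
set.seed(42)
margin <- rnorm(500, mean = 0, sd = 0.08)
differential <- 0.5 + 120 * margin + rnorm(500, sd = 9.5)
toy <- data.frame(differential, margin)

toy_fit <- lm(differential ~ margin, data = toy)
toy_summary <- summary(toy_fit)

# Median residual: close to zero is what we hope to see.
median_resid <- median(residuals(toy_fit))

# Adjusted R-squared: share of variation in the outcome the model explains.
adj_r2 <- toy_summary$adj.r.squared

# p-value for the predictor, taken from the coefficient table.
slope_p <- coef(toy_summary)["margin", "Pr(>|t|)"]

c(median_resid = median_resid, adj_r2 = adj_r2, slope_p = slope_p)
```

The same three lines should work on the real fit object by swapping toy_fit for fit and "margin" for "FGPctMargin".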
Error in `mutate()`:
ℹ In argument: `predicted = predict(fit)`.
Caused by error:
! `predicted` must be size 98161 or 1, not 98135.
Uh, oh. What’s going on here? When you get a message like this, where R is complaining about the size of the data, it most likely means that your model is using some columns that have NA values. lm() quietly drops those rows, so the model has fewer predictions than the dataframe has rows. In this case, the number of rows is small - 26, the gap between 98,161 and 98,135 - so let’s just get rid of those rows by filtering on the calculated column from our model:
residualmodel <- residualmodel |> filter(!is.na(FGPctMargin))
Now we can try re-running the code to add the predicted and residuals columns:
residualmodel <- residualmodel |> mutate(predicted = predict(fit), residuals = residuals(fit))
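To see why the sizes disagree, here is a small sketch on a made-up six-row data frame (the values are invented; only the column names mirror the walkthrough). It also shows an alternative that is not the approach used above: fitting with na.action = na.exclude, which makes predict() and residuals() pad the dropped rows with NA so the lengths line up without filtering.

```r
# Toy data with one missing outcome and one missing predictor; values are made up.
games <- data.frame(
  differential = c(10, -4, 7, NA, 3, -12),
  FGPctMargin  = c(0.08, -0.03, 0.05, 0.02, NA, -0.10)
)

# Default behavior: rows with an NA in any model column are dropped before fitting.
toy_fit <- lm(differential ~ FGPctMargin, data = games)
n_rows <- nrow(games)                    # rows in the data
n_resid <- length(residuals(toy_fit))    # rows the model actually used

# Fix used in this chapter: keep only the complete rows, then attach the columns.
complete <- games[!is.na(games$differential) & !is.na(games$FGPctMargin), ]
complete$predicted <- predict(toy_fit)
complete$residuals <- residuals(toy_fit)

# Alternative: na.exclude keeps the original length and fills dropped rows with NA.
fit_ex <- lm(differential ~ FGPctMargin, data = games, na.action = na.exclude)
games$predicted <- predict(fit_ex)
games$residuals <- residuals(fit_ex)
games
```

Filtering gives a clean table of complete games. The na.exclude route keeps every row, which helps if you need to join the results back to other data later.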
Now we can sort our data by those residuals. Sorting in descending order gives us the games where teams overperformed the model. To make it easier to read, I’m going to use select to give us just the columns we need to see and limit our results to Big Ten teams.
residualmodel |> filter(Conference == 'Big Ten') |> arrange(desc(residuals)) |> select(Date, Team, Opponent, W_L, differential, FGPctMargin, predicted, residuals)
# A tibble: 1,046 × 8
Date Team Opponent W_L diffe…¹ FGPct…² predi…³ resid…⁴
<date> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2020-12-15 Northwestern Quincy W 52 0.089 11.4 40.6
2 2020-11-25 Nebraska McNeese State W 47 0.118 15.0 32.0
3 2020-11-25 Illinois North Carolina… W 62 0.243 30.3 31.7
4 2021-11-22 Iowa Western Michig… W 48 0.147 18.5 29.5
5 2021-11-16 Iowa North Carolina… W 17 -0.098 -11.5 28.5
6 2020-12-05 Northwestern Chicago State W 45 0.15 18.9 26.1
7 2014-11-16 Illinois Coppin State W 58 0.258 32.1 25.9
8 2020-12-13 Iowa Northern Illin… W 53 0.219 27.4 25.6
9 2020-12-10 Minnesota UMKC W 29 0.0510 6.78 22.2
10 2022-01-14 Purdue Nebraska W 27 0.0360 4.94 22.1
# … with 1,036 more rows, and abbreviated variable names ¹differential,
# ²FGPctMargin, ³predicted, ⁴residuals
So looking at this table, what you see here are the teams who scored more than their FG percentage margin would indicate. One of them should jump off the page at you.
Look at row 5: Iowa’s game on 2021-11-16. Iowa shot worse than its opponent, so the model predicted Iowa would lose by about 11.5 points. Instead, Iowa won by 17!
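The predicted and residuals columns in the table can be checked by hand from the coefficients in the model summary. This is a small sketch using rows 1 and 5. The helper name predict_margin is made up for illustration, and the FG percentage margins are the rounded values the table displays, so expect agreement only to about one decimal place.

```r
# Coefficients copied from the summary(fit) output.
intercept <- 0.53004
slope <- 122.54301

# Hypothetical helper: predicted point differential for a given FG percentage margin.
predict_margin <- function(fg_pct_margin) {
  intercept + slope * fg_pct_margin
}

# Row 1: Northwestern won by 52 with an FG percentage margin of 0.089.
row1_predicted <- predict_margin(0.089)
row1_residual <- 52 - row1_predicted

# Row 5: Iowa won by 17 with an FG percentage margin of -0.098.
row5_predicted <- predict_margin(-0.098)
row5_residual <- 17 - row5_predicted

# Rounded to one decimal, these line up with the predicted and residuals columns.
round(c(row1_predicted, row1_residual, row5_predicted, row5_residual), 1)
```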
But, before we can bestow any validity on this model, we need to see if this linear model is appropriate. We’ve done some of that by looking at our p-values and R-squared values. But one more check is to look at the residuals themselves. We do that by plotting the residuals against the predictor. We’ll get into plotting soon, but for now just seeing it is enough.
The lack of a shape here – the seemingly random nature – is a good sign that a linear model works for our data. If there was a pattern, that would indicate something else was going on in our data and we needed a different model.
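For contrast, here is a sketch of what a pattern looks like. The data is simulated with a deliberately curved relationship (an assumption for illustration, unrelated to the basketball data) and then fit with a straight line anyway.

```r
# Simulated curved relationship: y depends on x squared, not on x.
set.seed(1)
x <- seq(-3, 3, length.out = 200)
y <- x^2 + rnorm(200, sd = 0.5)

# Fit a straight line to data that is not linear.
curved_fit <- lm(y ~ x)
res <- residuals(curved_fit)

# The residuals are not random: they run negative in the middle of the
# x range and positive at both ends, tracing a U shape.
middle <- abs(x) < 1
mean_middle <- mean(res[middle])
mean_ends <- mean(res[!middle])
c(middle = mean_middle, ends = mean_ends)

# Base R residual plot; the U shape is the pattern to watch for.
plot(x, res, xlab = "Predictor", ylab = "Residual")
abline(h = 0)
```

A cloud with no shape around the zero line is what the FG percentage model shows. A curve like this one would say a straight line is the wrong model.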
Another way to view your residuals is by connecting the predicted value with the actual value.
The blue line here separates underperformers from overperformers.
11.1 Fouls
Now let’s look at a case where a linear model doesn’t work as well: the total number of fouls.
fouls <- logs |>
  mutate(
    differential = TeamScore - OpponentScore,
    TotalFouls = TeamPersonalFouls + OpponentPersonalFouls
  )
pfit <- lm(differential ~ TotalFouls, data = fouls)
summary(pfit)
Call:
lm(formula = differential ~ TotalFouls, data = fouls)
Residuals:
Min 1Q Median 3Q Max
-81.056 -10.628 -0.056 9.587 106.658
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.771005 0.262763 14.351 <2e-16 ***
TotalFouls -0.071448 0.007219 -9.898 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.61 on 98133 degrees of freedom
(26 observations deleted due to missingness)
Multiple R-squared: 0.0009973, Adjusted R-squared: 0.0009871
F-statistic: 97.97 on 1 and 98133 DF, p-value: < 2.2e-16
So from top to bottom:
- Our min and max go from -81 to positive 107
- Our adjusted R-squared is … 0.0009871. Not much at all.
- Our p-value is … less than .05, so that’s something.
So what we can say about this model is that it’s statistically significant, but doesn’t really explain much. It’s not meaningless, but on its own the total number of fouls doesn’t go very far in explaining the point differential. Normally, we’d stop right here – why bother going forward with a predictive model that isn’t terribly predictive? But let’s do it anyway. Oh, and see that “(26 observations deleted due to missingness)” bit? That means we need to lose some incomplete data again.
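A tiny p-value alongside a tiny R-squared is what you tend to get when a very weak relationship meets a very large number of rows. The simulation below is a sketch of that effect. The numbers are made up, loosely shaped like the summary output, and none of it is the real game data.

```r
# Simulated games: a very weak negative link between fouls and point differential.
set.seed(2024)
n <- 100000
total_fouls <- rpois(n, lambda = 36)
differential <- 3.8 - 0.07 * total_fouls + rnorm(n, sd = 16.6)

sim_fit <- lm(differential ~ total_fouls)
sim_summary <- summary(sim_fit)

# With this many rows, even a tiny slope is distinguishable from zero ...
p_value <- coef(sim_summary)["total_fouls", "Pr(>|t|)"]

# ... but the predictor still explains almost none of the variation.
r_squared <- sim_summary$r.squared

c(p_value = p_value, r_squared = r_squared)
```

Statistical significance says the slope probably isn’t zero. R-squared says how much that slope helps you predict. You need both.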
fouls <- fouls |> filter(!is.na(TotalFouls))
fouls$predicted <- predict(pfit)
fouls$residuals <- residuals(pfit)
fouls |> arrange(desc(residuals)) |> select(Team, Opponent, W_L, TeamScore, OpponentScore, TotalFouls, residuals)
# A tibble: 98,135 × 7
Team Opponent W_L TeamSc…¹ Oppon…² Total…³ resid…⁴
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Bryant Thomas (ME) W 147 39 34 107.
2 McNeese State Dallas Christian W 140 37 33 102.
3 Appalachian State Toccoa Falls W 135 34 35 99.7
4 Purdue Fort Wayne <NA> W 130 34 19 93.6
5 James Madison Carlow W 135 40 29 93.3
6 North Dakota State Oak Hills W 108 14 12 91.1
7 Florida International Trinity (FL) W 146 55 39 90.0
8 Lamar Howard Payne W 121 32 35 87.7
9 Georgia Southern Carver College W 139 51 38 86.9
10 Youngstown State Franciscan W 134 46 35 86.7
# … with 98,125 more rows, and abbreviated variable names ¹TeamScore,
# ²OpponentScore, ³TotalFouls, ⁴residuals
First, note that the biggest misses here are all blowout games, the most lopsided games in the data, with the biggest being Bryant vs. Thomas. The model missed that differential by … 107 points. The margin of victory? 108 points. In other words, this model is not great! But let’s look at the residual plot anyway.
Well … it actually says that a linear model is appropriate. Which is an important lesson – just because your residual plot says a linear model works here, that doesn’t mean your linear model is good. There are other measures for that, and you need to use them.
Here’s the segment plot of residuals – you’ll see some really long lines. That’s a bad sign. Another bad sign? A flat fit line. It means there’s almost no relationship between these two things. Which we already know.