11 Residuals
When looking at a linear model of your data, there’s a measure you need to be aware of called residuals. A residual is the distance between what the model predicted and what the real outcome was. Take our model at the end of the correlation and regression chapter. Our model predicted Maryland’s women’s soccer team should have outscored George Mason by a goal a year ago. The match was a 3-2 loss. So our residual is -2.
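If the arithmetic helps, here it is as a quick sketch in R, using the numbers from that soccer example:

# A residual is the actual outcome minus the model's prediction
predicted <- 1   # the model said Maryland should win by a goal
actual <- -1     # a 3-2 loss is a -1 goal differential
actual - predicted
# [1] -2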
Residuals can tell you several things, but the most important is whether a linear model is the right model for your data. If the residuals appear to be random, then a linear model is appropriate. If they have a pattern, it means something else is going on in your data and a linear model isn’t appropriate.
Residuals can also tell you who is under-performing and over-performing the model. And the more robust the model – the better your r-squared value is – the more meaningful that label of under- or over-performing is.
Let’s go back to our model for men’s college basketball. For our predictor, let’s use Net FG Percentage – the difference between the two teams’ field goal percentages.
For this walkthrough, you’ll need the college basketball game logs saved as data/cbblogs1125.csv.zip in your data folder.
Then load the tidyverse.
library(tidyverse)
And read in the game logs.
logs <- read_csv("data/cbblogs1125.csv.zip")
Multiple files in zip: reading 'cbblogs1125.csv'
Rows: 169224 Columns: 61
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Season, GameType, TeamFullName, Opponent, HomeAway, W_L, OT, URL,...
dbl (50): file_source, Game, TeamScore, OpponentScore, TeamFG, TeamFGA, Tea...
date (1): Date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
First, let’s make the columns we’ll need.
residualmodel <- logs |>
  mutate(differential = TeamScore - OpponentScore,
         FGPctMargin = TeamFGPCT - OpponentFGPCT)
Now let’s create our model.
fit <- lm(differential ~ FGPctMargin, data = residualmodel)
summary(fit)
Call:
lm(formula = differential ~ FGPctMargin, data = residualmodel)
Residuals:
Min 1Q Median 3Q Max
-51.912 -6.248 -0.206 5.946 70.084
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.55052 0.02324 23.69 <2e-16 ***
FGPctMargin 119.82551 0.20700 578.87 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.544 on 169218 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.6645, Adjusted R-squared: 0.6645
F-statistic: 3.351e+05 on 1 and 169218 DF, p-value: < 2.2e-16
We’ve seen this output before, but let’s review it, because if you are using scatterplots to make a point, you should do this. First, note the Min and Max residuals at the top. A team has under-performed the model by nearly 52 points (!), and a team has over-performed it by 70 points (!!). The median residual, the point where half the residuals are above and half are below, is just barely below zero. Close to zero here is good.
Next: Look at the Adjusted R-squared value. What that says is that about 66 percent of the variation in a team’s scoring differential can be explained by its FG percentage margin.
Last: Look at the p-value. We are looking for a p-value smaller than .05. Below .05, we can say that our correlation most likely didn’t happen at random. And, in this case, it REALLY didn’t happen at random. But if you know a little bit about basketball, it doesn’t surprise you that the better you shoot compared to your opponent, the more you win by. It’s an intuitive result.
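If you’d rather pull those numbers out of the model object instead of reading them off the printed summary, base R keeps them on the summary itself. A quick sketch, nothing specific to this model:

# The summary object stores the values we just read
model_summary <- summary(fit)
model_summary$adj.r.squared  # about 0.66
model_summary$coefficients   # estimates, standard errors, t values, p-values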
What we want to do now is look at those residuals, adding them to our individual game records. We can do that by adding two new fields – predicted and residuals – to our dataframe like this:
residualmodel <- residualmodel |> mutate(predicted = predict(fit), residuals = residuals(fit))
Error in `mutate()`:
ℹ In argument: `predicted = predict(fit)`.
Caused by error:
! `predicted` must be size 169224 or 1, not 169220.
Uh, oh. What’s going on here? When you get a message like this, where R is complaining about the size of the data, it most likely means that your model is using some columns that have NA values. In this case, the gap is small – our dataframe has 169,224 rows and the model used 169,220, so just four rows have missing values – and we can get rid of them by filtering on the calculated column our model uses:
residualmodel <- residualmodel |> filter(!is.na(FGPctMargin))
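If you want to confirm the filter did what we expected, check the row count – if all four incomplete rows were missing FGPctMargin, it should now match the 169,220 the model complained about:

# Row count should now match the size the model used
nrow(residualmodel)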
Now we can try re-running the code to add the predicted and residuals columns:
residualmodel <- residualmodel |> mutate(predicted = predict(fit), residuals = residuals(fit))
Now we can sort our data by those residuals. Sorting in descending order gives us the games where teams overperformed the model. To make it easier to read, I’m going to use select to give us just the columns we need to see and limit our results to Big Ten regular-season conference games.
residualmodel |>
  filter(Conference == 'Big Ten MBB', GameType == 'REG (Conf)') |>
  arrange(desc(residuals)) |>
  select(Date, Team, Opponent, W_L, differential, FGPctMargin, predicted, residuals)
# A tibble: 3,884 × 8
Date Team Opponent W_L differential FGPctMargin predicted residuals
<date> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2016-01-18 Purdue Rutgers W 50 0.109 13.6 36.4
2 2023-01-28 Maryl… Nebraska W 19 -0.087 -9.87 28.9
3 2020-01-21 Maryl… Northwe… W 11 -0.148 -17.2 28.2
4 2016-03-02 Michi… Rutgers W 31 0.0250 3.55 27.5
5 2011-02-15 Ohio … Michiga… W 10 -0.145 -16.8 26.8
6 2023-02-12 Iowa Minneso… W 12 -0.125 -14.4 26.4
7 2013-01-30 India… Purdue W 37 0.093 11.7 25.3
8 2015-01-20 Wisco… Iowa W 32 0.057 7.38 24.6
9 2016-01-23 India… Northwe… W 32 0.071 9.06 22.9
10 2012-01-29 India… Iowa W 14 -0.0780 -8.80 22.8
# ℹ 3,874 more rows
So looking at this table, what you see here are the teams who scored more than their FG percentage margin would indicate. One of the predicted values should jump off the page at you.
Look at that Maryland-Northwestern game from 2020. The Wildcats shot better than the Terps, and the model predicted Northwestern would win by 17 points. Instead, Maryland won by 11!
But, before we can bestow any validity on this model, we need to see if a linear model is appropriate at all. We’ve done some of that by looking at our p-values and R-squared values. But one more check is to look at the residuals themselves. We do that by plotting the residuals against the predictor. We’ll get into plotting soon, but for now just seeing it is enough.
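A minimal sketch of that plot – the residuals on the y-axis, the predictor on the x-axis, and a low alpha so 169,000 points don’t turn into a single blob:

ggplot(residualmodel, aes(x = FGPctMargin, y = residuals)) +
  geom_point(alpha = .1)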
The lack of a shape here – the seemingly random nature – is a good sign that a linear model works for our data. If there was a pattern, that would indicate something else was going on in our data and we needed a different model.
Another way to view your residuals is by connecting the predicted value with the actual value.
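One way to draw that – a sketch, not necessarily how the original chart was styled – is a segment running from each game’s predicted differential to its actual one, with a regression line layered on top:

ggplot(residualmodel, aes(x = FGPctMargin, y = differential)) +
  geom_segment(aes(xend = FGPctMargin, yend = predicted), alpha = .1) +
  geom_point(alpha = .1) +
  geom_smooth(method = "lm")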
`geom_smooth()` using formula = 'y ~ x'
The blue line here separates underperformers from overperformers.
11.1 Fouls
Now let’s look at a case where it doesn’t work as well: the total number of fouls.
fouls <- logs |>
  mutate(
    differential = TeamScore - OpponentScore,
    TotalFouls = TeamPersonalFouls + OpponentPersonalFouls
  )
pfit <- lm(differential ~ TotalFouls, data = fouls)
summary(pfit)
Call:
lm(formula = differential ~ TotalFouls, data = fouls)
Residuals:
Min 1Q Median 3Q Max
-95.709 -10.567 0.008 9.645 106.574
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.831495 0.199862 19.17 <2e-16 ***
TotalFouls -0.070755 0.005468 -12.94 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.47 on 169218 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.0009884, Adjusted R-squared: 0.0009825
F-statistic: 167.4 on 1 and 169218 DF, p-value: < 2.2e-16
So from top to bottom:
- Our min and max residuals go from about -96 to almost positive 107
- Our adjusted R-squared is … 0.0009825. Not much at all.
- Our p-value is … less than .05, so that’s something.
So what we can say about this model is that it’s statistically significant, but doesn’t really explain much. It’s not meaningless, but on its own the total number of fouls doesn’t go very far in explaining the point differential. Normally, we’d stop right here – why bother going forward with a predictive model that isn’t terribly predictive? But let’s do it anyway. Oh, and see that “(4 observations deleted due to missingness)” bit? That means we need to lose some incomplete data again.
fouls <- fouls |> filter(!is.na(TotalFouls))
fouls$predicted <- predict(pfit)
fouls$residuals <- residuals(pfit)
fouls |>
  arrange(desc(residuals)) |>
  select(Team, Opponent, W_L, TeamScore, OpponentScore, TotalFouls, residuals)
# A tibble: 169,220 × 7
Team Opponent W_L TeamScore OpponentScore TotalFouls residuals
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Bryant <NA> W 147 39 34 107.
2 Southern <NA> W 116 12 37 103.
3 McNeese State <NA> W 140 37 33 102.
4 Western Carolina <NA> W 141 39 30 100.
5 Appalachian State <NA> W 135 34 35 99.6
6 Kansas City <NA> W 119 19 24 97.9
7 Purdue Fort Wayne <NA> W 130 34 19 93.5
8 James Madison <NA> W 135 40 29 93.2
9 Grambling <NA> W 147 52 20 92.6
10 Utah Mississ… W 143 49 30 92.3
# ℹ 169,210 more rows
First, note that the biggest misses here are all blowout games – the most lopsided results in the data, the worst being Bryant vs. Thomas. The model missed that differential by … 107 points. The margin of victory? 108 points. In other words, this model is not great! But let’s look at it anyway.
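We check the same way as before, plotting the residuals against the predictor:

ggplot(fouls, aes(x = TotalFouls, y = residuals)) +
  geom_point(alpha = .1)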
Well … it actually says that a linear model is appropriate. Which is an important lesson – just because your residual plot says a linear model works here, that doesn’t mean your linear model is good. There are other measures for that, and you need to use them.
Here’s the segment plot of residuals – you’ll see some really long lines. That’s a bad sign. Another bad sign? A flat fit line. It means there’s no relationship between these two things. Which we already know.
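As a sketch, it’s the same recipe as the earlier segment plot:

ggplot(fouls, aes(x = TotalFouls, y = differential)) +
  geom_segment(aes(xend = TotalFouls, yend = predicted), alpha = .1) +
  geom_point(alpha = .1) +
  geom_smooth(method = "lm")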
`geom_smooth()` using formula = 'y ~ x'