10 Multiple regression
Last chapter, we looked at correlations and linear regression to see how one element of a game predicted the score. But we know that a single variable, in all but the rarest instances, is not going to be that predictive. We need more than one. Enter multiple regression. Multiple regression lets us add – wait for it – multiple predictors to our equation to help us get a better fit to reality.
That presents its own problems. So let’s get set up. The dataset we’ll use is all men’s college basketball games between 2015 and 2024.
For this walkthrough:
We need the tidyverse.

library(tidyverse)

And the data.
logs <- read_csv("data/cbblogs1524.csv")
Rows: 98161 Columns: 51
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Season, TeamFull, Opponent, HomeAway, W_L, URL, Conference, Team,...
dbl (39): Game, TeamScore, OpponentScore, TeamFG, TeamFGA, TeamFGPCT, Team3...
lgl (2): Blank, season
date (1): Date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
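That column message is just read_csv being chatty about the types it guessed. The hint at the bottom of the message works as advertised if you want to quiet it:

# Optional: silence the column-type message on import
logs <- read_csv("data/cbblogs1524.csv", show_col_types = FALSE)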
One way to show how successful a basketball team was for a game is to show the differential between the team’s score and the opponent’s score. Score a lot more than the opponent = good, score a lot less than the opponent = bad. And, relatively speaking, the more the better. So let’s create that differential. Let’s add in net rebounds. And because we’ll need it later, let’s add the turnover margin.
logs <- logs |> mutate(
  Differential = TeamScore - OpponentScore,
  NetRebounds = TeamTotalRebounds - OpponentTotalRebounds,
  TurnoverMargin = TeamTurnovers - OpponentTurnovers)
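Before we model with the new columns, it never hurts to peek at them and confirm the math did what we intended:

# Peek at the three new columns for the first few games
logs |>
  select(Team, Opponent, Differential, NetRebounds, TurnoverMargin) |>
  head()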
The linear model code we used before is pretty straightforward: this field is predicted by that field. Here’s a simple linear model that predicts a team’s point differential from its net rebounds.
rebounds <- lm(Differential ~ NetRebounds, data=logs)

summary(rebounds)
Call:
lm(formula = Differential ~ NetRebounds, data = logs)
Residuals:
Min 1Q Median 3Q Max
-59.216 -8.579 -0.233 8.203 78.185
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.651821 0.041475 15.72 <2e-16 ***
NetRebounds 1.072698 0.004278 250.72 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.97 on 98133 degrees of freedom
(26 observations deleted due to missingness)
Multiple R-squared: 0.3905, Adjusted R-squared: 0.3904
F-statistic: 6.286e+04 on 1 and 98133 DF, p-value: < 2.2e-16
Remember: There’s a lot here, but only some of it we care about. What is the Adjusted R-squared value? What’s the p-value and is it less than .05? In this case, we can explain about 39 percent of the variation in the differential with net rebounds alone.
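If you’d rather pull those numbers out directly than scan the whole printout, summary() on an lm object returns a list you can index:

# The pieces of the printout we keep asking about
summary(rebounds)$adj.r.squared   # adjusted R-squared
summary(rebounds)$sigma           # residual standard error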
To add more predictors to this mix, we merely add them. But it’s not that simple, as you’ll see in a moment. So first, let’s look at adding turnover margin to our prediction model:
model1 <- lm(Differential ~ NetRebounds + TurnoverMargin, data=logs)

summary(model1)
Call:
lm(formula = Differential ~ NetRebounds + TurnoverMargin, data = logs)
Residuals:
Min 1Q Median 3Q Max
-46.158 -6.899 -0.062 6.845 45.874
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.303968 0.032952 9.225 <2e-16 ***
NetRebounds 1.200686 0.003438 349.286 <2e-16 ***
TurnoverMargin -1.557334 0.006487 -240.067 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.3 on 98132 degrees of freedom
(26 observations deleted due to missingness)
Multiple R-squared: 0.616, Adjusted R-squared: 0.616
F-statistic: 7.87e+04 on 2 and 98132 DF, p-value: < 2.2e-16
First things first: What is the adjusted R-squared? This model explains about 61 percent of the variation in the differential.
Second: what is the p-value and is it less than .05?
Third: Compare the residual standard error. We went from 12.97 to 10.3. The meaning of this is both really opaque and also simple – by adding data, we reduced the amount of error in our model. Residual standard error is the typical distance between what our model would predict and what we actually have in the data. So lots of residual error means the distance between reality and our model is wider. So the width of our predictive range in this example shrank while we improved the amount of the difference we could predict. That’s good, and it’s not always going to be the case.
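If “residual standard error” still feels opaque, you can get very close to it yourself: it’s almost exactly the standard deviation of the model’s misses.

# Nearly the residual standard error: how far off is a typical prediction?
# (sd divides by n-1 instead of the model's degrees of freedom, so with
# 98,000+ games the difference is invisible)
sd(residuals(model1))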
One of the more difficult things to understand about multiple regression is the issue of multicollinearity. What that means is that there is significant correlation overlap between two variables – the two are related to each other as well as to the target output – and all you are doing by adding both of them is adding error with no real value to the R-squared. In pure statistics, we don’t want any multicollinearity at all. Violating that assumption limits the applicability of what you are doing. So if we have some multicollinearity, it limits our scope of application to college basketball. We can’t say this will work for every basketball league and level everywhere. What we need to do is see how correlated each value is to each other and throw out ones that are highly co-correlated.
So to find those, we have to create a correlation matrix that shows us how each value is correlated to our outcome variable, but also with each other. We can do that with the Hmisc library. We install that in the console with install.packages("Hmisc").
library(Hmisc)
We can pass in every numeric value to Hmisc and get a correlation matrix out of it, but since we have a large number of values – and many of them character values – we should strip that down and reorder them. So that’s what I’m doing here. I’m saying give me all the columns with numeric values, except for Game, and then show me the differential, net rebounds, turnover margin, and the shooting and rebounding numbers for both teams.
simplelogs <- logs |>
  select_if(is.numeric) |>
  select(-Game) |>
  select(Differential, NetRebounds, TurnoverMargin, TeamFGPCT, TeamTotalRebounds, OpponentFGPCT, OpponentTotalRebounds)
Before we proceed, what we’re looking to do is follow the Differential column down, looking for correlation values near 1 or -1. Correlations go from -1, meaning perfect negative correlation, to 0, meaning no correlation, to 1, meaning perfect positive correlation. So we’re looking for numbers near 1 or -1 for their predictive value. BUT: We then need to see if that value is also highly correlated with something else. If it is, we have a decision to make.
We get our correlation matrix like this:
cormatrix <- rcorr(as.matrix(simplelogs))

cormatrix$r
Differential NetRebounds TurnoverMargin TeamFGPCT
Differential 1.0000000 0.6248635 -0.37224236 0.60842699
NetRebounds 0.6248635 1.0000000 0.15509130 0.36726732
TurnoverMargin -0.3722424 0.1550913 1.00000000 -0.03109904
TeamFGPCT 0.6084270 0.3672673 -0.03109904 1.00000000
TeamTotalRebounds 0.4887337 0.7514775 0.09713362 0.02665011
OpponentFGPCT -0.6165787 -0.3776729 0.03438953 -0.11505868
OpponentTotalRebounds -0.4292394 -0.7190817 -0.13197214 -0.52680962
TeamTotalRebounds OpponentFGPCT OpponentTotalRebounds
Differential 0.48873374 -0.616578694 -0.429239429
NetRebounds 0.75147749 -0.377672886 -0.719081689
TurnoverMargin 0.09713362 0.034389532 -0.131972135
TeamFGPCT 0.02665011 -0.115058678 -0.526809615
TeamTotalRebounds 1.00000000 -0.546787074 -0.081890501
OpponentFGPCT -0.54678707 1.000000000 -0.005413951
OpponentTotalRebounds -0.08189050 -0.005413951 1.000000000
Notice right away – NetRebounds is highly correlated with Differential. But NetRebounds is also highly correlated with TeamTotalRebounds. And that makes sense: TeamTotalRebounds feeds into NetRebounds. Including both of these measures would be pointless – they would add error without adding much in the way of predictive power.
Your turn: What else do you see? What other values have predictive power and aren’t co-correlated? Add or remove some of the columns above and re-run the correlation matrix.
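The correlation matrix is the eyeball test. If you want a single number instead, variance inflation factors are a standard check: a VIF near 1 means a predictor is independent of the others, and the higher it climbs, the more that predictor is explainable by the rest (common rules of thumb treat values above 5 or 10 as serious). A sketch, assuming you’ve installed the car package with install.packages("car"):

library(car)

# Deliberately include the overlapping pair from above; expect NetRebounds
# and TeamTotalRebounds to carry the elevated VIFs
vif(lm(Differential ~ NetRebounds + TeamTotalRebounds + TurnoverMargin, data=logs))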
We can add more predictors simply by adding them to the formula. Let’s add the FG PCT for both the team and opponent. They’re correlated to Differential, but not as much as you might expect.
model2 <- lm(Differential ~ NetRebounds + TurnoverMargin + TeamFGPCT + OpponentFGPCT, data=logs)

summary(model2)
Call:
lm(formula = Differential ~ NetRebounds + TurnoverMargin + TeamFGPCT +
OpponentFGPCT, data = logs)
Residuals:
Min 1Q Median 3Q Max
-26.171 -3.679 -0.023 3.651 37.368
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.226590 0.156372 1.449 0.147
NetRebounds 0.656535 0.002128 308.520 <2e-16 ***
TurnoverMargin -1.311144 0.003484 -376.362 <2e-16 ***
TeamFGPCT 91.270480 0.254245 358.986 <2e-16 ***
OpponentFGPCT -91.510307 0.254204 -359.988 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.474 on 98130 degrees of freedom
(26 observations deleted due to missingness)
Multiple R-squared: 0.8915, Adjusted R-squared: 0.8915
F-statistic: 2.015e+05 on 4 and 98130 DF, p-value: < 2.2e-16
Go down the list:
What is the Adjusted R-squared now? What is the p-value and is it less than .05? What is the Residual standard error?
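If you’re going to ask those same three questions of every model, the broom package (installed with the tidyverse, though not attached by library(tidyverse)) lays the answers out in one row:

library(broom)

# One row per model: r.squared, adj.r.squared, sigma, p.value and more
glance(model2)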
The final thing we can do with this is predict things. Look at our coefficients table. See the Estimates? We can build a formula from that, same as we did with linear regressions.
How does this apply in the real world? Let’s pretend for a minute that you are Kevin Willard, and you want to win conference titles. To do that, we need to know what attributes of a team we should emphasize. We can do that by looking at what previous Big Ten conference champions looked like.
So if our goal is to predict a conference champion team, we need to know what those teams did. Here are the regular season conference champions in this dataset.
logs |>
  filter(
    Team == "Michigan" & Season == '2020-2021' |
    Team == "Wisconsin" & Season == '2019-2020' |
    Team == "Michigan State" & Season == '2018-2019' |
    Team == "Michigan State" & Season == '2017-2018' |
    Team == 'Illinois' & Season == '2021-2022' |
    Team == 'Purdue' & Season == '2022-2023' |
    Team == 'Purdue' & Season == '2023-2024'
  ) |>
  summarise(
    meanNetRebounds = mean(NetRebounds),
    meanTurnoverMargin = mean(TurnoverMargin),
    meanTeamFGPCT = mean(TeamFGPCT),
    meanOpponentFGPCT = mean(OpponentFGPCT)
  )
# A tibble: 1 × 4
meanNetRebounds meanTurnoverMargin meanTeamFGPCT meanOpponentFGPCT
<dbl> <dbl> <dbl> <dbl>
1 9.61 1.97 0.477 0.399
Now it’s just plug and chug.
# (netrebounds estimate * meanNetRebounds) + (turnover margin estimate * meanTurnoverMargin) + (TeamFGPCT estimate * meanTeamFGPCT) + (OpponentFGPCT estimate * meanOpponentFGPCT) + Intercept
(0.656535 * 9.61) + (-1.311144 * 1.97) + (91.270480 * 0.477) + (-91.510307 * 0.399) + 0.226590
[1] 10.97634
So a team with those numbers is going to outscore its opponents by about 11 points per game on average. Not a ton, but hey, the Big Ten has been a competitive conference lately.
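Copying coefficients by hand invites transcription errors. predict() does the same arithmetic straight from the model object; here is a sketch feeding it the championship averages from the tibble above:

# Same plug-and-chug, but R fills in the coefficients for us;
# with the same inputs this returns the same answer as the hand calculation
champs <- tibble(
  NetRebounds = 9.61,
  TurnoverMargin = 1.97,
  TeamFGPCT = 0.477,
  OpponentFGPCT = 0.399
)

predict(model2, newdata = champs)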
How does that compare to Maryland in the 2023-24 season?
logs |>
  filter(Team == "Maryland" & Season == '2023-2024') |>
  summarise(
    meanNetRebounds = mean(NetRebounds),
    meanTurnoverMargin = mean(TurnoverMargin),
    meanTeamFGPCT = mean(TeamFGPCT),
    meanOpponentFGPCT = mean(OpponentFGPCT)
  )
# A tibble: 1 × 4
meanNetRebounds meanTurnoverMargin meanTeamFGPCT meanOpponentFGPCT
<dbl> <dbl> <dbl> <dbl>
1 1.27 -0.394 0.414 0.419
(0.656535 * 1.27) + (-1.311144 * -0.394) + (91.270480 * 0.414) + (-91.510307 * 0.419) + 0.226590
[1] 1.02014
By this model, we would, on average, barely outscore our opponents over that season – about a point a game. The reality?
logs |>
  filter(Team == "Maryland" & Season == '2023-2024') |>
  summarise(avg_score = mean(TeamScore), avg_opp = mean(OpponentScore))
# A tibble: 1 × 2
avg_score avg_opp
<dbl> <dbl>
1 69.0 65.9
We outscored opponents by about 3 points on average, so maybe this isn’t the best model – or maybe Maryland found a way to be successful outside of these parameters. What would you change?
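One way to start answering that is to score the model game by game instead of on season averages and see where it misses. A sketch:

# How far off is model2 for each Maryland game, and on average?
maryland <- logs |>
  filter(Team == "Maryland" & Season == '2023-2024')

maryland |>
  mutate(
    predicted = predict(model2, newdata = maryland),
    miss = Differential - predicted
  ) |>
  summarise(
    mean_miss = mean(miss, na.rm = TRUE),
    biggest_miss = max(abs(miss), na.rm = TRUE)
  )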