10  Multiple regression

Last chapter, we looked at correlations and linear regression to predict how one element of a game would predict the score. But we know that a single variable, in all but the rarest instances, is not going to be that predictive. We need more than one. Enter multiple regression. Multiple regression lets us add – wait for it – multiple predictors to our equation to help us get a better fit to reality.

That presents it’s own problems. So let’s get set up. The dataset we’ll use is all men’s college basketball games between 2011 and 2025.

For this walkthrough:

We need the tidyverse.

library(tidyverse)

And the data.

logs <- read_csv("data/cbblogs1125.csv.zip")
Multiple files in zip: reading 'cbblogs1125.csv'
Rows: 169224 Columns: 61
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (10): Season, GameType, TeamFullName, Opponent, HomeAway, W_L, OT, URL,...
dbl  (50): file_source, Game, TeamScore, OpponentScore, TeamFG, TeamFGA, Tea...
date  (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

One way to show how successful a basketball team was for a game is to show the differential between the team’s score and the opponent’s score. Score a lot more than the opponent = good, score a lot less than the opponent = bad. And, relatively speaking, the more the better. So let’s create that differential. Let’s add in net rebounds. And because we’ll need it later, let’s add the turnover margin.

logs <- logs |> mutate(
  Differential = TeamScore - OpponentScore, 
  NetRebounds = TeamTotalRebounds - OpponentTotalRebounds,
  TurnoverMargin = TeamTurnovers - OpponentTurnovers)

The linear model code we used before is pretty straight forward. Its field is predicted by field. Here’s a simple linear model that looks at predicting a team’s point differential by looking at their net turnovers.

rebounds <- lm(Differential ~ NetRebounds, data=logs)
summary(rebounds)

Call:
lm(formula = Differential ~ NetRebounds, data = logs)

Residuals:
    Min      1Q  Median      3Q     Max 
-61.642  -8.532  -0.207   8.248  94.358 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.697139   0.031515   22.12   <2e-16 ***
NetRebounds 1.054990   0.003255  324.14   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.94 on 169218 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.3831,    Adjusted R-squared:  0.3831 
F-statistic: 1.051e+05 on 1 and 169218 DF,  p-value: < 2.2e-16

Remember: There’s a lot here, but only some of it we care about. What is the Adjusted R-squared value? What’s the p-value and is it less than .05? In this case, we can predict about 38 percent of the difference in differential with the net rebounds in the game.

To add more predictors to this mix, we merely add them. But it’s not that simple, as you’ll see in a moment. So first, let’s look at adding turnover margin to our prediction model:

model1 <- lm(Differential ~ NetRebounds + TurnoverMargin, data=logs)
summary(model1)

Call:
lm(formula = Differential ~ NetRebounds + TurnoverMargin, data = logs)

Residuals:
    Min      1Q  Median      3Q     Max 
-46.307  -6.873  -0.064   6.830  49.523 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.338141   0.025092   13.48   <2e-16 ***
NetRebounds     1.184100   0.002621  451.72   <2e-16 ***
TurnoverMargin -1.546417   0.004933 -313.48   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.29 on 169217 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.6097,    Adjusted R-squared:  0.6097 
F-statistic: 1.322e+05 on 2 and 169217 DF,  p-value: < 2.2e-16

First things first: What is the adjusted R-squared? This model predicts about 61 percent of the differential.

Second: what is the p-value and is it less than .05?

Third: Compare the residual standard error. We went from 12.94 to 10.3. The meaning of this is both really opaque and also simple – by adding data, we reduced the amount of error in our model. Residual standard error is the total distance between what our model would predict and what we actually have in the data. So lots of residual error means the distance between reality and our model is wider. So the width of our predictive range in this example shrank while we improved the amount of the difference we could predict. That’s good, and not always going to be the case.

One of the more difficult things to understand about multiple regression is the issue of multicollinearity. What that means is that there is significant correlation overlap between two variables – the two are related to each other as well as to the target output – and all you are doing by adding both of them is adding error with no real value to the R-squared. In pure statistics, we don’t want any multicollinearity at all. Violating that assumption limits the applicability of what you are doing. So if we have some multicollinearity, it limits our scope of application to college basketball. We can’t say this will work for every basketball league and level everywhere. What we need to do is see how correlated each value is to each other and throw out ones that are highly co-correlated.

So to find those, we have to create a correlation matrix that shows us how each value is correlated to our outcome variable, but also with each other. We can do that in the Hmisc library. We install that in the console with install.packages("Hmisc")

library(Hmisc)

We can pass in every numeric value to the Hmisc library and get a correlation matrix out of it, but since we have a large number of values – and many of them character values – we should strip that down and reorder them. So that’s what I’m doing here. I’m saying give me all the columns with numeric values, except for Game, and then show me the differential, net yards, turnover margin and then everything else.

simplelogs <- logs |> select_if(is.numeric) |> select(-Game) |> select(Differential, NetRebounds, TurnoverMargin, TeamFGPCT, TeamTotalRebounds, OpponentFGPCT, OpponentTotalRebounds)

Before we proceed, what we’re looking to do is follow the Differential column down, looking for correlation values near 1 or -1. Correlations go from -1, meaning perfect negative correlation, to 0, meaning no correlation, to 1, meaning perfect positive correlation. So we’re looking for numbers near 1 or -1 for their predictive value. BUT: We then need to see if that value is also highly correlated with something else. If it is, we have a decision to make.

We get our correlation matrix like this:

cormatrix <- rcorr(as.matrix(simplelogs))

cormatrix$r
                      Differential NetRebounds TurnoverMargin   TeamFGPCT
Differential             1.0000000   0.6189143    -0.37292019  0.60659177
NetRebounds              0.6189143   1.0000000     0.15712085  0.36075626
TurnoverMargin          -0.3729202   0.1571208     1.00000000 -0.02617653
TeamFGPCT                0.6065918   0.3607563    -0.02617653  1.00000000
TeamTotalRebounds        0.4796608   0.7474168     0.09997749  0.02222265
OpponentFGPCT           -0.6124710  -0.3688248     0.02892920 -0.11829996
OpponentTotalRebounds   -0.4265115  -0.7180688    -0.13111165 -0.51822144
                      TeamTotalRebounds OpponentFGPCT OpponentTotalRebounds
Differential                 0.47966080  -0.612471013          -0.426511535
NetRebounds                  0.74741681  -0.368824848          -0.718068849
TurnoverMargin               0.09997749   0.028929201          -0.131111653
TeamFGPCT                    0.02222265  -0.118299958          -0.518221439
TeamTotalRebounds            1.00000000  -0.534652332          -0.074323920
OpponentFGPCT               -0.53465233   1.000000000          -0.006469967
OpponentTotalRebounds       -0.07432392  -0.006469967           1.000000000

Notice right away – NetRebounds is highly correlated. But NetRebounds is also highly correlated with TeamTotalRebounds. And that makes sense: TeamTotalRebounds feeds into NetRebounds. Including both of these measures would be pointless – they would add error without adding much in the way of predictive power.

Your turn: What else do you see? What other values have predictive power and aren’t co-correlated? Add or remove some of the columns above and re-run the correlation matrix.

We can add more just by simply adding them. Let’s add the average FG PCT for both the team and opponent. They’re correlated to Differential, but not as much as you might expect.

model2 <- lm(Differential ~ NetRebounds + TurnoverMargin + TeamFGPCT + OpponentFGPCT, data=logs)
summary(model2)

Call:
lm(formula = Differential ~ NetRebounds + TurnoverMargin + TeamFGPCT + 
    OpponentFGPCT, data = logs)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.955  -3.650  -0.026   3.626  37.661 

Coefficients:
                 Estimate Std. Error  t value Pr(>|t|)    
(Intercept)      0.168687   0.117743    1.433    0.152    
NetRebounds      0.655831   0.001604  408.886   <2e-16 ***
TurnoverMargin  -1.317830   0.002637 -499.805   <2e-16 ***
TeamFGPCT       89.952654   0.190823  471.392   <2e-16 ***
OpponentFGPCT  -90.064048   0.190918 -471.742   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.454 on 169215 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.8904,    Adjusted R-squared:  0.8904 
F-statistic: 3.438e+05 on 4 and 169215 DF,  p-value: < 2.2e-16

Go down the list:

What is the Adjusted R-squared now? What is the p-value and is it less than .05? What is the Residual standard error?

The final thing we can do with this is predict things. Look at our coefficients table. See the Estimates? We can build a formula from that, same as we did with linear regressions.

How does this apply in the real world? Let’s pretend for a minute that you are Buzz Williams, and you want to win conference titles. To do that, we need to know what attributes of a team we should emphasize. We can do that by looking at what previous Big Ten conference champions looked like.

So if our goal is to predict a conference champion team, we need to know what those teams did. Here’s the regular season conference champions in this dataset since the 2017 season:

logs |> 
  filter(Team == "Michigan" & Season == '2020-2021' | Team == "Wisconsin" & Season == '2019-2020' | Team == "Michigan State" & Season == '2018-2019' | Team == "Michigan State" & Season == '2017-2018' | Team == 'Illinois' & Season == '2021-2022' | Team == 'Purdue' & Season == '2022-2023' | Team == 'Purdue' & Season == '2023-2024' | Team == "Michigan State" & Season == '2024-2025') |> 
  summarise(
    meanNetRebounds = mean(NetRebounds),
    meanTurnoverMargin = mean(TurnoverMargin),
    meanTeamFGPCT = mean(TeamFGPCT),
    meanOpponentFGPCT = mean(OpponentFGPCT)
  )
# A tibble: 1 × 4
  meanNetRebounds meanTurnoverMargin meanTeamFGPCT meanOpponentFGPCT
            <dbl>              <dbl>         <dbl>             <dbl>
1            8.16               1.40         0.468             0.401

Now it’s just plug and chug.

# (netrebounds estimate * meanNetRebounds) + (turnover margin estimate * meanTurnoverMargin) + (TeamFGPCT estimate * meanTeamFGPCT) + (OpponentFGPCT estimate * meanOpponentFGPCT) + Intercept
(0.655831*8.155235) + (-1.317830*1.397112) + (89.952654*0.46787) + (-90.064048*0.4012419) + 0.168687
[1] 9.624665

So a team with those numbers is going to average scoring 9.6 more points per game than their opponent. Not a ton, but hey, the Big Ten has been a competitive conference lately.

How does that compare to Maryland in 2024-25 season?

logs |> 
  filter(
    Team == "Maryland" & Season == '2024-2025'
    ) |> 
  summarise(
    meanNetRebounds = mean(NetRebounds),
    meanTurnoverMargin = mean(TurnoverMargin),
    meanTeamFGPCT = mean(TeamFGPCT),
    meanOpponentFGPCT = mean(OpponentFGPCT)
  )
# A tibble: 1 × 4
  meanNetRebounds meanTurnoverMargin meanTeamFGPCT meanOpponentFGPCT
            <dbl>              <dbl>         <dbl>             <dbl>
1            2.47              -3.44         0.469             0.415
(0.655831*2.472222) + (-1.317830*-3.444444) + (89.952654*0.4687778) + (-90.064048*0.4150278) + 0.168687
[1] 11.11796

By this model, it predicted UMD would, on average, outscore its opponents by 11 points over that season. The reality?

logs |> 
     filter(
         Team == "Maryland" & Season == '2024-2025'
     ) |> summarise(avg_score = mean(TeamScore), avg_opp = mean(OpponentScore))
# A tibble: 1 × 2
  avg_score avg_opp
      <dbl>   <dbl>
1      81.1    67.2

We outscored them by nearly 14 points on average, which suggests that perhaps Maryland found a way to be even more successful outside of these parameters. What would you change?