17  Circular bar plots

Does November basketball really not matter? Are games played early in the season, before teams have had a chance to learn how to play together and when many teams feast on cupcake schedules, meaningful come March?

Let’s look, using a new form of chart called a circular bar plot. It’s a chart type that combines several forms we’ve used before: bar charts to show magnitude, stacked bar charts to show proportion, but we’re going to add bending the chart around a circle to add some visual interesting-ness to it. We’re also going to use time as an x-axis value to make a not subtle circle of time reference – a common technique with circular bar charts.

We’ll use a dataset of every women’s college basketball game last season.

For this walkthrough:

Load your libraries.

library(tidyverse)
library(lubridate)

And load your data.

logs <- read_csv("data/wbblogs24.csv")
Rows: 11503 Columns: 48
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (8): Season, TeamFullName, Opponent, HomeAway, W_L, URL, Conference, Team
dbl  (39): Game, TeamScore, OpponentScore, TeamFG, TeamFGA, TeamFGPCT, Team3...
date  (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

17.1 Does November basketball matter?

So let’s test the notion of November Basketball Doesn’t Matter. What matters in basketball? Let’s start simple: Wins.

Sports Reference’s win columns are weird, so we need to scan through them and find W and L and we’ll give them numbers using case_when. I’ll also filter out post-season tournament games.

winlosslogs <- logs |> 
  filter(Date < '2024-03-15') |> 
  mutate(winloss = case_when(
    grepl("W", W_L) ~ 1, 
    grepl("L", W_L) ~ 0)
) 

We can group by date and conference and sum up the wins. How many wins by day does each conference get?

dates <- winlosslogs |> group_by(Date, Conference) |> summarise(wins = sum(winloss))
`summarise()` has grouped output by 'Date'. You can override using the
`.groups` argument.

Earlier, we did stacked bar charts. We have what we need to do that now.

ggplot() + geom_bar(data=dates, aes(x=Date, weight=wins, fill=Conference)) + theme_minimal()

Eeek. This is already looking not great. But to make it a circular bar chart, we add coord_polar() to our chart.

ggplot() + geom_bar(data=dates, aes(x=Date, weight=wins, fill=Conference)) + theme_minimal() + coord_polar()

Based on that, the day is probably too thin a slice, and there’s way too many conferences in college basketball. Let’s group this by months and filter out all but the power five conferences.

p5 <- c("SEC WBB", "Big Ten WBB", "Pac-12 WBB", "Big 12 WBB", "ACC WBB")

To get months, we’re going to use a function in the library lubridate called floor_date, which combine with mutate will give us a field of just months.

wins <- winlosslogs |> mutate(month = floor_date(Date, unit="months")) |> group_by(month, Conference) |> summarise(wins=sum(winloss)) |> filter(Conference %in% p5) 
`summarise()` has grouped output by 'month'. You can override using the
`.groups` argument.

Now we can use wins to make our circular bar chart of wins by month in the Power Five.

ggplot() + geom_bar(data=wins, aes(x=month, weight=wins, fill=Conference)) + theme_minimal() + coord_polar()

Yikes. That looks a lot like a broken pie chart. So months are too thick of a slice. Let’s use weeks in our floor date to see what that gives us.

wins <- winlosslogs |> mutate(week = floor_date(Date, unit="weeks")) |> group_by(week, Conference) |> summarise(wins=sum(winloss)) |> filter(Conference %in% p5) 
`summarise()` has grouped output by 'week'. You can override using the
`.groups` argument.
ggplot() + geom_bar(data=wins, aes(x=week, weight=wins, fill=Conference)) + theme_minimal() + coord_polar()

That looks better. But what does it say? Does November basketball matter? What this is saying is … yeah, it kinda does. The reason? Lots of wins get piled up in November and December, during non-conference play. So if you are a team with NCAA tournament dreams, you need to win games in November to make sure your tournament resume is where it needs to be come March. Does an individual win or loss matter? Probably not. But your record in November does.

17.2 Does it show you where you are?

So here is the problem we have:

  1. We have data for every game. In the past, we were able to calculate the team wins and losses because the way the data records them is team is the main team, and they win or lose. The opponent is recorded, but not in its own column of that name. In addition, the opponent has the mirror image of this game as well, where they are team. So essentially every game is in here twice – one for each team that plays in the game.
  2. We need to attach the opponent’s winning percentage to each game so we can decide if it’s a quality win for team.

First we need to populate opponent based on the whatever is not the team. Then what we have to do is invert the process that we’ve done before. We need to group by the opponent and we need to invert the wins and losses. A win in the win column is a win for the team. That means each loss in the win column is a WIN for the opponent.

Once we invert, the data looks very similar to what we’ve done before. One other thing: I noticed there’s some tournament games in here, so the filter at the end strips them out like we did before.

oppwinlosslogs <- logs |> 
  mutate(winloss = case_when(
    grepl("W", W_L) ~ 0, 
    grepl("L", W_L) ~ 1)
) |> 
  filter(Date < "2024-03-15")

So now we have a dataframe called oppwinlosslogs that has an inverted winloss column. So now we can group by the Opponent and sum the wins and it will tell us how many games the Opponent won. We can also count the wins and get a winning percentage.

oppwinlosslogs |> group_by(Opponent) |> summarise(games=n(), wins=sum(winloss)) |> mutate(winpct = wins/games) -> opprecord

Now we have a dataframe of 614 opponent winning records. Wait, what? There’s like ~350 teams in major college basketball, so why 614? If you look through it, there’s a bunch of teams playing lower level teams. Given that they are lower level, they’re likely cannon fodder and will lose the game, and we’re going to filter them out in a minute.

Now we can join the opponent winning percentage to our winlosslogs data so we can answer our question about quality wins.

winlosslogs <- logs |> 
  mutate(winloss = case_when(
    grepl("W", W_L) ~ 1, 
    grepl("L", W_L) ~ 0)
) |> 
  filter(Date < "2024-03-15")
winlosslogs |> left_join(opprecord, by=("Opponent")) -> winswithopppct

Now that we have a table called winswithopppct, we can filter out non-power 5 teams and teams that won less than 60 percent of their games and run the same calculations in the book.

p5 <- c("SEC WBB", "Big Ten WBB", "Pac-12 WBB", "Big 12 WBB", "ACC WBB")
winswithopppct |> filter(winpct > .6) |> mutate(week = floor_date(Date, unit="weeks")) |> group_by(week, Conference) |> summarise(wins=sum(winloss)) |> filter(Conference %in% p5) -> qualitywins
`summarise()` has grouped output by 'week'. You can override using the
`.groups` argument.

Now with our dataframe called qualitywins, we can chart it again.

ggplot() + geom_bar(data=qualitywins, aes(x=week, weight=wins, fill=Conference)) + theme_minimal() + coord_polar()

Look at this chart and compare it to the first one.