19 Line charts
So far, we’ve seen that bar charts, stacked or otherwise, are good for showing the relative size of one thing compared to another. Stacked bars and waffle charts are good at showing proportions of a whole.
Line charts are good for showing change over time.
Let’s look at how we can answer this question: Why did Maryland’s men’s team struggle at basketball last season?
We’ll need the logs of every game in college basketball for this.
Let’s start getting all that we need. We can use the tidyverse shortcut.
library(tidyverse)
And now load the data.
logs <- read_csv("data/logs24.csv")
Rows: 11990 Columns: 48
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Season, TeamFullName, Opponent, HomeAway, W_L, URL, Conference, Team
dbl (39): Game, TeamScore, OpponentScore, TeamFG, TeamFGA, TeamFGPCT, Team3...
date (1): Date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This data has every game from every team in it, so we need to use filtering to limit it, because we just want to look at Maryland. If you don’t remember, flip back to chapter 6.
umd <- logs |> filter(Team == "Maryland")
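As a quick check of what filter is doing, here is a minimal sketch on a tiny made-up dataset. The toy_logs and toy_umd names and all of the values are invented for illustration; they are not taken from logs24.csv.

```r
library(tidyverse)

# Toy stand-in for the game logs; the values are invented for illustration
toy_logs <- tibble(
  Team = c("Maryland", "Illinois", "Maryland"),
  TeamFGPCT = c(0.45, 0.50, 0.40)
)

# Keep only the rows where Team is exactly "Maryland"
toy_umd <- toy_logs |> filter(Team == "Maryland")

# Count the rows that survived the filter
nrow(toy_umd)
```

Counting rows before and after a filter is a cheap way to confirm the filter matched the team name you expected.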
Because this data has just Maryland data in it, the dates are formatted correctly, and the data is long data (instead of wide), we have what we need to make line charts.
Unlike the bar charts we’ve made so far, line charts need us to name a y-axis variable. So in our ggplot step, we have to define what our x and y axes are. In this case, the x axis is our Date (the most common x axis in line charts is going to be a date of some variety) and y is up to us. We’ve seen from previous walkthroughs that how well a team shoots the ball has a lot to do with how well a team does in a season, so let’s chart that.
ggplot() + geom_line(data=umd, aes(x=Date, y=TeamFGPCT))
The problem here is that the Y axis doesn’t start with zero. That makes this look more dramatic than it is. To make the axis what you want, you can use scale_x_continuous or scale_y_continuous and pass in limits, a vector with the bottom and top values you want. You do that like this:
ggplot() +
geom_line(data=umd, aes(x=Date, y=TeamFGPCT)) +
scale_y_continuous(limits = c(0, .6))
Note also that our X axis labels are automated. It knows it’s a date and it just labels it by month.
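If the automatic month labels are not what you want, one option is scale_x_date, which lets you choose the break spacing and the label format. This is a sketch on made-up data, not part of the original walkthrough; the toy_games name and its values are assumptions for illustration.

```r
library(tidyverse)

# Made-up dates and shooting percentages, for illustration only
toy_games <- tibble(
  Date = as.Date(c("2023-11-07", "2023-12-01", "2024-01-05", "2024-02-03")),
  TeamFGPCT = c(0.44, 0.41, 0.47, 0.39)
)

p <- ggplot() +
  geom_line(data = toy_games, aes(x = Date, y = TeamFGPCT)) +
  scale_y_continuous(limits = c(0, .6)) +
  # One tick per month, labeled with the abbreviated month name
  scale_x_date(date_breaks = "1 month", date_labels = "%b")

p
```

With the real data, the same scale_x_date line could be added to the umd chart as one more layer.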
19.1 This is too simple.
With datasets, we want to invite comparison. So let’s answer the question visually. Let’s put two lines on the same chart. How does Maryland compare to Illinois, for example?
ill <- logs |> filter(Team == "Illinois")
In this case, because we have two different datasets, we’re going to put everything in the geom instead of the ggplot step. We also have to explicitly state what dataset we’re using by saying data= in the geom step.
First, let’s chart Maryland. Read carefully. First we set the data. Then we set our aesthetic. Unlike bars, we need an X and a Y variable. In this case, our X is the date of the game and Y is the thing we want the lines to move with: the team field goal percentage, TeamFGPCT.
ggplot() + geom_line(data=umd, aes(x=Date, y=TeamFGPCT), color="red")
Now, by using +, we can add Illinois to it. REMEMBER COPY AND PASTE IS A THING. Nothing changes except what data you are using.
ggplot() +
geom_line(data=umd, aes(x=Date, y=TeamFGPCT), color="red") +
geom_line(data=ill, aes(x=Date, y=TeamFGPCT), color="orange")
Let’s flatten our lines out by zeroing the Y axis. We’ll set the upper limit of the y-axis to 0.70 because Illinois shot fantastically well in one December game.
ggplot() +
geom_line(data=umd, aes(x=Date, y=TeamFGPCT), color="red") +
geom_line(data=ill, aes(x=Date, y=TeamFGPCT), color="orange") +
scale_y_continuous(limits = c(0, .70))
So visually speaking, the difference between Maryland’s and Illinois’ seasons is that while both had some significant variations shooting the ball, Illinois’ positive outliers were greater and Maryland’s low points were lower, especially after conference play started.
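A different way to get two lines, sketched here as an alternative rather than as the approach this chapter uses, is to keep both teams in one dataframe and map color to Team inside aes. ggplot then draws one line per team and adds a legend. The toy_two dataframe and its values are made up for illustration.

```r
library(tidyverse)

# Made-up rows standing in for the Maryland and Illinois game logs
toy_two <- tibble(
  Team = rep(c("Maryland", "Illinois"), each = 3),
  Date = rep(as.Date(c("2023-11-07", "2023-12-01", "2024-01-05")), times = 2),
  TeamFGPCT = c(0.44, 0.41, 0.47, 0.48, 0.52, 0.45)
)

p <- ggplot() +
  # Mapping color to Team draws one line per team and adds a legend
  geom_line(data = toy_two, aes(x = Date, y = TeamFGPCT, color = Team)) +
  # Assign each team the color used in the charts above
  scale_color_manual(values = c("Maryland" = "red", "Illinois" = "orange")) +
  scale_y_continuous(limits = c(0, .70))

p
```

With the real logs, the combined dataframe could come from a single filter such as Team %in% c("Maryland", "Illinois"). The tradeoff is less control over each line individually in exchange for a legend you do not have to build by hand.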
19.2 But what if I wanted to add a lot of lines.
Fine. How about all Power Five schools? This data is for example purposes. You don’t have to do it.
powerfive <- c("SEC MBB", "Big Ten MBB", "Pac-12 MBB", "Big 12 MBB", "ACC MBB")
p5conf <- logs |> filter(Conference %in% powerfive)
I can keep layering on layers all day if I want. And if my dataset has more than one team in it, I need to use the group aesthetic so each team gets its own line. And the layering comes in order, so if you’re going to combine a big bunch of lines with a smaller group of lines, you want the bunch on the bottom. To do that, put the bunch first in your code. The first geom in the code gets rendered first. The second gets layered on top of that. The third gets layered on that and so on.
ggplot() +
geom_line(data=p5conf, aes(x=Date, y=TeamFGPCT, group=Team), color="grey") +
geom_line(data=umd, aes(x=Date, y=TeamFGPCT), color="red") +
geom_line(data=ill, aes(x=Date, y=TeamFGPCT), color="orange") +
scale_y_continuous(limits = c(0, .65))
Warning: Removed 1 row containing missing values (`geom_line()`).
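That warning most likely appears because limits on a scale treat any value outside the range as missing, and at least one game in the data falls outside 0 to .65. If you would rather zoom without discarding data, coord_cartesian is one option. The sketch below uses made-up values; toy_games is an invented name, not part of the original data.

```r
library(tidyverse)

# Made-up values; the 0.70 game sits above the 0.65 upper limit
toy_games <- tibble(
  Date = as.Date(c("2023-11-07", "2023-12-01", "2024-01-05")),
  TeamFGPCT = c(0.45, 0.50, 0.70)
)

# Scale limits turn out-of-range values into NA, so that game is not drawn
dropped <- ggplot() +
  geom_line(data = toy_games, aes(x = Date, y = TeamFGPCT)) +
  scale_y_continuous(limits = c(0, .65))

# coord_cartesian only zooms the view, so every row is kept
zoomed <- ggplot() +
  geom_line(data = toy_games, aes(x = Date, y = TeamFGPCT)) +
  coord_cartesian(ylim = c(0, .65))
```

In the zoomed version the line simply runs off the top of the panel instead of being cut short.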
What do we see here? How have Maryland’s and Illinois’ seasons evolved against the rest of the Power Five teams?
But how does that compare to the average? We can add that pretty easily by creating a new dataframe with it and adding another geom_line.
average <- logs |> group_by(Date) |> summarise(mean_shooting=mean(TeamFGPCT))
ggplot() +
geom_line(data=p5conf, aes(x=Date, y=TeamFGPCT, group=Team), color="grey") +
geom_line(data=umd, aes(x=Date, y=TeamFGPCT), color="red") +
geom_line(data=ill, aes(x=Date, y=TeamFGPCT), color="orange") +
geom_line(data=average, aes(x=Date, y=mean_shooting), color="black") +
scale_y_continuous(limits = c(0, .70))
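To see what the group_by and summarise step produces, here is a minimal sketch on a made-up four-row dataset with two teams playing on each of two dates. The toy_logs and toy_average names and all values are invented for illustration.

```r
library(tidyverse)

# Made-up logs: two teams playing on each of two dates
toy_logs <- tibble(
  Date = as.Date(c("2023-11-07", "2023-11-07", "2023-12-01", "2023-12-01")),
  Team = c("Maryland", "Illinois", "Maryland", "Illinois"),
  TeamFGPCT = c(0.40, 0.50, 0.45, 0.55)
)

# Collapse to one row per date holding the mean shooting percentage
toy_average <- toy_logs |>
  group_by(Date) |>
  summarise(mean_shooting = mean(TeamFGPCT))

toy_average
```

The result has one row per date, which is why the average can be drawn as a single line without a group aesthetic.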