32 AI and Data Journalism

The first thing to know about the large language models that have attracted so much attention, money and coverage is this: they are not fact machines.

But they are - mostly - very useful for people who write code and for those trying to work through complex problems. That’s you. At its core, a large language model predicts the next word in a phrase or sentence. LLMs are probabilistic prediction machines built on a huge set of training data. This chapter walks through some tasks and examples using LLMs.
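To make "predict the next word" concrete, here is a deliberately tiny sketch in base R. It is not how a real LLM works internally (those use neural networks trained on tokens, not word-pair counts); the three-sentence corpus and the function name `next_word_probs` are invented for illustration.

```r
# Toy corpus, invented for this illustration.
corpus <- c(
  "the council approved the budget",
  "the council rejected the proposal",
  "the mayor approved the budget"
)

# Build a table of (current word, following word) pairs, one sentence at a time.
bigrams <- do.call(rbind, lapply(strsplit(corpus, " ", fixed = TRUE), function(w) {
  data.frame(current = head(w, -1), following = tail(w, -1),
             stringsAsFactors = FALSE)
}))

# Estimate the probability of each word that follows `word` in the corpus.
next_word_probs <- function(word) {
  counts <- table(bigrams$following[bigrams$current == word])
  sort(counts / sum(counts), decreasing = TRUE)
}

probs <- next_word_probs("the")
print(probs)

# "Generate" by sampling one next word in proportion to those probabilities.
# Without a fixed seed, repeated runs can return different words.
next_word <- sample(names(probs), size = 1, prob = as.numeric(probs))
print(next_word)
```

That sampling step is a small-scale version of why the same prompt can produce different answers on different runs.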

32.1 Setup

We’ll be using a service called Groq for the examples here. You should sign up for a free account and create an API key. Make sure you copy that key. We’ll also need to install an R package to handle the responses:

# install.packages("devtools")
devtools::install_github("heurekalabsco/axolotr")

Then we can load that library and, using your API key, set up your credentials:

library(axolotr)

create_credentials(GROQ_API_KEY = "YOUR API KEY HERE")

Credentials updated successfully. Please restart your R session for changes to take effect.

See that “Please restart your R session for changes to take effect.”? Go ahead and do that; you’ll need to rerun the library() function above.
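After restarting, a quick check can confirm the key is visible to R. This sketch assumes `create_credentials()` saved the key as an environment variable named `GROQ_API_KEY` (an assumption based on the argument name above, not confirmed from the package documentation); `has_key()` is a hypothetical helper, not part of axolotr.

```r
# Hypothetical helper: TRUE if an environment variable is set and non-empty.
# Assumes the key lives in an environment variable (for example via .Renviron).
has_key <- function(name = "GROQ_API_KEY") {
  nzchar(Sys.getenv(name, unset = ""))
}

if (has_key()) {
  message("GROQ_API_KEY is set.")
} else {
  message("GROQ_API_KEY not found; re-run create_credentials() and restart R.")
}
```

The check reports only whether a key exists and never prints the key itself, which keeps the secret out of notebooks and logs.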

Let’s make sure that worked. We’ll be using the Llama 3.3 model released by Meta.

groq_response <- axolotr::ask(
  prompt = "Give me five names for a pet lemur",
  model = "llama-3.3-70b-versatile"
)

cat(groq_response)

Here are five name suggestions for a pet lemur:

1. **Loki**: Inspired by the mischievous Norse god, this name suits a playful and curious lemur.
2. **Mango**: A sweet and tropical name that matches the lemur's exotic and vibrant personality.
3. **Finnley**: A charming and adventurous name that suits a lively and energetic lemur.
4. **Kiko**: A fun and quirky name that means "strong" or "powerful" in some African cultures, making it a great fit for a bold and agile lemur.
5. **Sakura**: A beautiful and delicate name that means "cherry blossom" in Japanese, which suits a gentle and affectionate lemur with a touch of elegance.

Remember to choose a name that you and your lemur will love and enjoy using!

I guess you’re getting a lemur?

32.2 Three Uses of AI in Data Journalism

There are at least three good uses of AI in data journalism:

turning unstructured information into data
helping with code debugging and explanation
brainstorming about strategies for data analysis and visualization

If you’ve tried to use a large language model to actually do data analysis, you know it can work, but the results are often frustrating. Think of AI as a potentially useful assistant for the work you’re doing. If you have a clear idea of the question you want to ask or the direction you want to go, an LLM can help. If you don’t, it will probably be less helpful. Let’s go over a quick example of each use.

32.2.1 Turning Unstructured Information into Data

News organizations are sitting on a trove of valuable raw materials - the words, images, audio and video that they produce every day. We can (hopefully) search it, but search doesn’t always deliver meaning, let alone elevate patterns. For that, often it helps to turn that information into structured data. Let’s look at an example involving my friend Tyson Evans, who recently celebrated his 10th wedding anniversary. You can read about his wedding in The New York Times.

This announcement is a story, but it’s also data - or it should be.

What if we could extract the key details of that text (the names, ages, location, officiant and parents) into, say, a CSV file? That’s something that LLMs are pretty good at. Let’s give it a shot using the full text of that announcement:

text = "Gabriela Nunes Herman and Tyson Charles Evans were married Saturday at the home of their friends Marcy Gringlas and Joel Greenberg in Chilmark, Mass. Rachel Been, a friend of the couple who received a one-day solemnization certificate from Massachusetts, officiated. The bride, 33, will continue to use her name professionally. She is a Brooklyn-based freelance photographer for magazines and newspapers. She graduated from Wesleyan University in Middletown, Conn. She is a daughter of Dr. Talia N. Herman of Brookline, Mass., and Jeffrey N. Herman of Cambridge, Mass. The bride’s father is a lawyer and the executive vice president of DecisionQuest, a national trial consulting firm in Boston. Her mother is a senior primary care internist at Harvard Vanguard Medical Associates, a practice in Boston. The groom, 31, is a deputy editor of interactive news at The New York Times and an adjunct professor at the Columbia University Graduate School of Journalism. He graduated from the University of California, Los Angeles. He is the son of Carmen K. Evans of Climax Springs, Mo., and Frank J. Evans of St. Joseph, Mo. The groom’s father retired as the president of UPCO, a national retailer of pet supplies in St. Joseph."

evans_response <- axolotr::ask(
  prompt = paste("Given the following text, extract information into a CSV file with the following structure with no yapping: celebrant1,celebrant2,location,officiant,celebrant1_age,celebrant2_age,celebrant1_parent1,celebrant1_parent2,celebrant2_parent1,celebrant2_parent2", text),
  model = "llama-3.3-70b-versatile"
)

cat(evans_response)

Gabriela Nunes Herman,Tyson Charles Evans,Chilmark, Mass.,Rachel Been,33,31,Dr. Talia N. Herman,Jeffrey N. Herman,Carmen K. Evans,Frank J. Evans

A brief word about that “no yapping” bit: it’s a way to tell your friendly LLM to cut down on the chattiness in its response. What we care about is the data, not the narrative. And look at the results: without even being given an example or told that the text described a wedding, the LLM did a solid job, with one catch: the comma in “Chilmark, Mass.” is not quoted, so the row has one field too many for a strict CSV parser. Now imagine if you could do this with hundreds or thousands of similar announcements. You’ve just built a database.
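Before trusting output like this at scale, it is worth checking that each row has the number of fields you asked for. A field-count check catches rows like the one above, where an unquoted comma inside a value ("Chilmark, Mass.") adds a field. The following is a minimal base R sketch; `n_fields()` is a hypothetical helper, and the quoted row is edited by hand to show what a parseable response looks like, not something the model returned.

```r
header <- paste0(
  "celebrant1,celebrant2,location,officiant,celebrant1_age,celebrant2_age,",
  "celebrant1_parent1,celebrant1_parent2,celebrant2_parent1,celebrant2_parent2"
)

# The model's response, copied from the output above.
response <- paste0(
  "Gabriela Nunes Herman,Tyson Charles Evans,Chilmark, Mass.,Rachel Been,33,31,",
  "Dr. Talia N. Herman,Jeffrey N. Herman,Carmen K. Evans,Frank J. Evans"
)

# Hypothetical helper: count comma-separated fields, respecting double quotes.
n_fields <- function(line) {
  length(scan(text = line, what = character(), sep = ",", quote = "\"",
              quiet = TRUE))
}

n_fields(header)    # fields requested
n_fields(response)  # fields returned; a mismatch means the row needs attention

# Hand-edited version with the location quoted (not model output).
quoted <- sub("Chilmark, Mass.", "\"Chilmark, Mass.\"", response, fixed = TRUE)

# Only parse rows that pass the field-count check.
if (n_fields(quoted) == n_fields(header)) {
  weddings <- read.csv(text = paste(header, quoted, sep = "\n"))
  print(weddings$location)
}
```

One way to reduce the problem is to ask in the prompt for every field to be wrapped in double quotes, then keep the field-count check as a safety net when you process many announcements.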

32.2.2 Helping with Code Debugging and Explanation

When you’re writing code and run into error messages, you should read them. But if they don’t make sense to you, you can ask an LLM to translate, which is another great use case for AI. As with any debugging exercise, you should provide some context, such as “Using R and the tidyverse …”, and describe what you’re trying to do. You can also ask LLMs to explain an error message in a different way. Here’s an example:

debug_response <- axolotr::ask(
  prompt = "Explain the following R error message using brief, simple language and suggest a single fix. I am using the tidyverse library: could not find function '|>'",
  model = "llama-3.3-70b-versatile"
)

cat(debug_response)

**Error Message:** "could not find function '|>'"

**What it means:** The "|" symbol is a new pipe operator in R, but your R version is too old to recognize it.

**Fix:** Update your R version to 4.1 or later, or use the older pipe operator (%>%) from the magrittr package, which is part of the tidyverse. Replace "|>" with "%>%" in your code.

The trouble is that if you run that several times, it will give you slightly different answers. Not fact machines. But you should be able to try some of the suggested solutions and see if any of them work. An even better use could be to pass in working code that you don’t fully understand and ask the LLM to explain it to you.
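To give the model consistent context each time you ask about an error, it can help to assemble the prompt from its parts. The helper below is a hypothetical sketch (`build_debug_prompt()` is not part of axolotr, and the example task description is made up); it only builds a string, which you could then pass as the `prompt` argument to `axolotr::ask()` as in the example above.

```r
# Hypothetical helper: combine instructions, context, an optional task
# description and the error message into a single prompt string.
build_debug_prompt <- function(error, task = NULL,
                               context = "I am using R and the tidyverse.") {
  parts <- c(
    "Explain the following R error message using brief, simple language and suggest a single fix.",
    context,
    if (!is.null(task)) paste("What I am trying to do:", task),
    paste("Error message:", error)
  )
  # NULL entries (a missing task) are dropped by c() before pasting.
  paste(parts, collapse = "\n\n")
}

debug_prompt <- build_debug_prompt(
  error = "could not find function '|>'",
  task = "chain several dplyr functions together"  # illustrative only
)
cat(debug_prompt)
```

Keeping the wording fixed and changing only the error and task makes it easier to compare answers across runs.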

32.2.3 Brainstorming about Strategies for Data Analysis and Visualization

Let’s say that you have some data that you want to interview, but aren’t sure how to proceed. LLMs can provide some direction, but you may not want to follow their directions exactly. You shouldn’t accept their judgments uncritically; you’ll still need to think for yourself. Here’s an example of how that might go:

idea_response <- axolotr::ask(
  prompt = "I have a CSV file of daily data on campus police incidents, including the type of incident, building location and time. Using R and the tidyverse, suggest some ways that I could find patterns in the data. Use the new-style pipe operator (|>) in any code examples",
  model = "llama-3.3-70b-versatile"
)

cat(idea_response)

Here are some ways you can find patterns in your campus police incidents data using R and the tidyverse:

### 1. Incident Frequency by Type
You can start by looking at the frequency of each type of incident. This can help you identify which types of incidents are most common.

```r
library(readr)
library(dplyr)

incidents <- read_csv("incidents.csv")

incidents |>
  count(type) |>
  arrange(desc(n))
```

### 2. Incident Frequency by Building
You can also look at the frequency of incidents by building location. This can help you identify which buildings are most prone to incidents.

```r
incidents |>
  count(building) |>
  arrange(desc(n))
```

### 3. Incident Frequency by Time of Day
You can look at the frequency of incidents by time of day to see if there are any patterns. For example, are incidents more common during the day or at night?

```r
library(lubridate)

incidents |>
  mutate(time = hms(time)) |>
  extract(time, into = c("hour", "minute", "second"), regex = "(\\d+):(\\d+):(\\d+)") |>
  count(hour) |>
  arrange(desc(n))
```

### 4. Incident Types by Building
You can also look at the types of incidents that occur in each building to see if there are any patterns.

```r
incidents |>
  group_by(building) |>
  count(type) |>
  arrange(building, desc(n))
```

### 5. Time Series Analysis
If your data has a date or timestamp column, you can use time series analysis to look for patterns over time.

```r
library(forecast)

incidents |>
  arrange(date) |>
  pull(type) |>
  ts() |>
  autoplot()
```

### 6. Heatmap of Incidents by Building and Time
You can create a heatmap to visualize the frequency of incidents by building and time of day.

```r
library(ggplot2)

incidents |>
  mutate(time = hms(time)) |>
  extract(time, into = c("hour", "minute", "second"), regex = "(\\d+):(\\d+):(\\d+)") |>
  ggplot(aes(x = building, y = hour)) +
  geom_tile(aes(fill = ..count..), stat = "bin2d") +
  scale_fill_gradient(low = "white", high = "red") +
  theme_void()
```

These are just a few examples of how you can find patterns in your campus police incidents data using R and the tidyverse. The specific methods you use will depend on the structure and content of your data.

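Note that the column names may not match your data: the LLM is guessing at your data’s structure, so it helps to include the actual column names in your prompt. Check the code before you rely on it, too; the examples above call extract(), which comes from tidyr, without ever loading that package.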
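A sketch of one way to supply those column names: build a short description of the columns from the data itself and paste it into the prompt. `describe_columns()` is a hypothetical helper, and the small `incidents` data frame below is made up to stand in for your real file.

```r
# Made-up stand-in for the real incidents data.
incidents <- data.frame(
  incident_type = c("Theft", "Vandalism"),
  building = c("Library", "Student Union"),
  reported_at = as.POSIXct(c("2024-09-01 22:15:00", "2024-09-02 03:40:00"),
                           tz = "UTC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one "name (type)" entry per column.
describe_columns <- function(df) {
  types <- vapply(df, function(col) class(col)[1], character(1))
  paste0(names(df), " (", types, ")", collapse = ", ")
}

prompt <- paste(
  "I have a data frame of campus police incidents with these columns:",
  describe_columns(incidents),
  "Using R and the tidyverse, suggest some ways that I could find patterns in the data."
)
cat(prompt)
```

Sending only column names and types, rather than the rows themselves, also avoids sharing potentially sensitive records with a third-party service.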

32.2.4 Can’t I just ask it to do data analysis for me?

Well, you can try, but I’m not confident you’ll like the results. As this story from The Pudding makes clear, the potential for using LLMs to not just assist with but perform data analysis is real. What will make the difference is how much context you can provide and how clear your ideas and questions are. You still have to do the work.