Rivers of Data


Derek Willis

Fifth in a series of essays humbly titled “Fixing Journalism”

News never stops. We’ve accepted this as a condition of our modern age, for better or worse, and all types of news organizations are coming to grips with serving an audience that is more connected than ever. Yet one of the ironies of the 24-hour news cycle is that even as newsrooms get better tools, when they use data it tends to be in lengthier project reporting rather than daily reporting.

This results in impressive and exclusive stories with big impact that win awards and garner plenty of attention within the journalism industry. In combing newspaper websites for entries for The Scoop, I’ll admit that my eyes are drawn to those projects that involve analyzing hundreds of thousands or millions of records. Bigger seems better to a lot of folks, and sometimes it is.

This isn’t an argument against such stories. They are necessary, even vital to journalism’s public service role and, if anything, we should do more of them. But not every paper has the resources to commit to such extensive work on a continuous basis, and even if they did, there are plenty of daily and short-term stories that could benefit from data.

What’s more, the flow of information is increasing. Now newsrooms confront rivers of data both large and small. They flow continuously, shifting in subtle and obvious ways. Some are external and some we create, even if we don’t recognize the process. We (grudgingly) accept the permanent campaign, the permanent news cycle. It’s time we grappled with the permanent river of data. Because like news, data never stops.

The pace of data has been somewhat slow in the past, thanks to the stubborn popularity of paper and the limitations of our own equipment – to say nothing of the relative lack of data-savvy folks within newsrooms. At many places the people with CAR skills are fenced off in the projects team, although slowly this seems to be changing.

Data isn’t just useful for finding one thing, or for one story. Like much of the other information that passes through a newsroom, data too often falls through the floor instead of being reused. Too often it’s not even collected. But even an older dataset that does not get updated can have research value – how many times do journalists need to know something about a situation that occurred years ago, or to trace the history of a person or organization?

Part of the problem is the word “data,” to be honest. To many reporters and editors, that either refers to a character from “Star Trek: The Next Generation” or to a government-produced chart in a report. There’s nothing systematic about the way we treat it, and so it’s no surprise that it doesn’t become a priority. But as a source of information, data should be given equal footing with other sources. Reporters pore over current and past statements, positions and documents. We should do the same with data.

It’s easier to sell editors and reporters on large datasets, particularly when they are exclusive and relate to a hot topic. Plus, it’s just fun to say that you analyzed hundreds of thousands or millions of records, as if the more records you have the harder the computer has to work. But we should care as much, if not more, about the smaller sets of data that flow through the newsroom: the regular reporting checks, the timelines we build and maintain, even the monotonous parade of press releases and reports.

Why? Because we’re dealing with so much information that both spotting individual items of importance and recognizing shifts in long-term patterns are harder than ever. Sure, we have Google, we have Nexis. But a typical search in either is looking for a single point of information (when was a person born?) or a broad summary (how many times has the phrase “supersize me” appeared in major newspapers?). You don’t get both, but you could. You just have to have a process flexible enough to allow for both the big picture and the little details.

Here’s an example: a few months ago I cooked up a Python script that would produce both a database of electronic federal campaign filings and an RSS feed of filings that updated several times a day. The latter is built from the former, so I’m both adding to a larger database and monitoring the flow of information into it. Doing this allowed me to spot a single filing – the final one by Enron’s PAC – and send it on to a colleague who used it for his column. Meanwhile, the larger dataset enabled me to figure out that one political committee had filed 18 amended reports in a two-day period and to check how often something like that had happened in the past.

The secret to this is that by compiling the details, you automatically get the big picture. So even if individual events don’t yield many stories, chances are that the aggregation of them all will.
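For readers who want to see the shape of that idea, here is a minimal sketch, not the script described above. It assumes a hypothetical SQLite table called filings and invented column names, a placeholder feed link, and it counts amendments per committee per day rather than over a two-day window. The point is only that one archive can serve both purposes: the feed is a query for the newest rows, and the big picture is a query over all of them.

```python
import sqlite3
import xml.etree.ElementTree as ET
from datetime import datetime
from email.utils import format_datetime

# Hypothetical schema: every name here is an assumption made for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS filings (
    filing_id    INTEGER PRIMARY KEY,
    committee    TEXT NOT NULL,
    is_amendment INTEGER NOT NULL,
    filed_at     TEXT NOT NULL      -- ISO 8601 timestamp
)
"""


def store_filings(conn, filings):
    """Add filings to the archive; rows whose filing_id is already stored are skipped."""
    conn.execute(SCHEMA)
    conn.executemany("INSERT OR IGNORE INTO filings VALUES (?, ?, ?, ?)", filings)
    conn.commit()


def build_feed(conn, limit=20):
    """Render the most recent filings in the archive as an RSS 2.0 string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Recent campaign filings"
    ET.SubElement(channel, "link").text = "http://example.com/filings"  # placeholder
    ET.SubElement(channel, "description").text = "Newest filings in the archive"
    rows = conn.execute(
        "SELECT filing_id, committee, is_amendment, filed_at FROM filings "
        "ORDER BY filed_at DESC, filing_id DESC LIMIT ?",
        (limit,),
    )
    for filing_id, committee, is_amendment, filed_at in rows:
        item = ET.SubElement(channel, "item")
        kind = "amended report" if is_amendment else "new report"
        ET.SubElement(item, "title").text = f"{committee}: {kind}"
        ET.SubElement(item, "guid", isPermaLink="false").text = str(filing_id)
        ET.SubElement(item, "pubDate").text = format_datetime(
            datetime.fromisoformat(filed_at)
        )
    return ET.tostring(rss, encoding="unicode")


def amendment_counts(conn, minimum=2):
    """Big-picture query: committees with at least `minimum` amendments on one day."""
    return conn.execute(
        "SELECT committee, substr(filed_at, 1, 10) AS day, COUNT(*) AS n "
        "FROM filings WHERE is_amendment = 1 "
        "GROUP BY committee, substr(filed_at, 1, 10) "
        "HAVING COUNT(*) >= ? ORDER BY n DESC",
        (minimum,),
    ).fetchall()
```

Run store_filings on a schedule with whatever new rows have arrived, then call build_feed for the reporter and amendment_counts for the database editor. Because both read from the same table, the feed and the long view never drift apart.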

The best part is that we do some of this already – lots of newspapers run weekly property sales, crime reports and other small rivers of data, but they don’t always think about how those can be deployed to help everyday reporting. Think about the crime reports: a database editor could keep tabs on the larger data set fed by weekly updates, while the cops reporter could get running updates in a feed that could spark questions or alert him or her to emerging patterns.

And the Web is making this process easier, as governments and organizations put more and more information online. Part of the barrier to building and maintaining databases is the constant effort required for upkeep. While this won’t disappear entirely, the Web cuts down on the time and manpower needed, and it provides the best way to deliver the data.

I suspect a lot of newsroom folks might get hung up on the “but it’s a bunch of numbers” issue. Yes, data can be numbers, but it also can be a collection of events. Jo Craven McGinty of the New York Times – a longtime friend via IRE – recently demonstrated how simply collecting event information can produce interesting content. She took data on hate crimes in New York City, mapped it and summarized it for the paper (the national edition version trimmed the story but included the map and other graphics).

The advantage of maintaining this dataset is that, when future hate crimes occur, they can instantly be checked against the history for contextual information. Anybody can say there was a hate crime on Broad Street. Saying that it was the fifth in two years within a four-block span is what can separate one newspaper from the multitude of competing voices.
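That kind of check is simple enough to sketch. The following is an illustration under stated assumptions, not anyone's production code: the Incident record, the function names, the 400-meter radius standing in for "a four-block span" and the 365-day year are all invented for the example. Given a new incident and an archive of earlier ones, it returns the earlier incidents that fall inside both the distance and the time window.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


@dataclass(frozen=True)
class Incident:
    """Hypothetical record: when and where something happened."""
    occurred: date
    lat: float
    lon: float


def distance_m(a, b):
    """Great-circle distance between two incidents in meters (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(radians, (a.lat, a.lon, b.lat, b.lon))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))


def nearby_history(new, history, radius_m=400, years=2):
    """Earlier incidents within radius_m of `new` during the preceding `years`."""
    earliest = new.occurred - timedelta(days=365 * years)
    return [
        old for old in history
        if earliest <= old.occurred <= new.occurred
        and distance_m(new, old) <= radius_m
    ]
```

If nearby_history returns four earlier incidents, the new one is the fifth, and that is the sentence that goes in the story. The radius and the window are editorial choices, so they are left as parameters.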

To do that, newsrooms need people who can “read” data like they can read the face of an interview subject or spot the meaningful paragraph buried in a lengthy report. That used to be asking a lot, but now we can present information in a variety of ways that are familiar to even those with basic computer skills, thanks to dynamic Web pages, maps and RSS feeds that appear in your inbox. We now have the means to make data accessible and interesting to even the least data-savvy journalist.

Done well, the display of data actually overcomes the massive nature of some datasets and draws in even people inexperienced in using data. Chicagocrime.org is a perfect example of this – Adrian Holovaty put a simple yet compelling front end on a publicly available dataset, and now everybody can be a part-time crime analyst.

This all may sound like a justification for hiring people with CAR skills, and I suppose that’s true in a certain sense. But it’s more than that – it’s a call for newsrooms to put their reporters and editors in the best position to publish with authority, context and the kind of detail that readers will appreciate and come to expect.

The tools and skills to make this happen are here now and will only get better. But we as journalists must decide to invest in our ability to handle information in order to make sense of those rivers of data that flow by us every day.