What APIs Mean for Data Journalists

Anthony DeBarros of USA Today and I talked about APIs at this year’s CAR conference in Raleigh. We got a lot of “Web people”, to use a lame expression, in the audience. If you’re a reporter who works with data, why should you care?

The simple answer is that APIs are an extension of what reporters do every day: ask questions. The difference is that instead of forcing reporters to gather data from multiple sources, format it to fit their local database needs and then update that database when new releases are available, APIs let them query live data from all over the Web. If you have experience working with, say, Microsoft Access and setting up an ODBC connection to a remote database, APIs are kind of like that – except that you have near-instant access to more sources of data, more useful tools (like geocoders) and more timely information than ever before.

My path working with data went something like this: spreadsheets came first, which I routinely describe as the “gateway drug” of computer-assisted reporting. Some people become such Excel wizards that it almost doesn’t make sense for them to move beyond that expertise; there is so much you can do in a spreadsheet that it alone is worth the time to learn. But there were things about spreadsheets that annoyed and frustrated me. Pivot tables were a clumsy fit for me – they got me close to what I wanted in many instances but never quite there. And so I moved on to databases.

Databases are still one of my favorite things. They are powerful, relatively flexible and range in utility from the ultra-portable SQLite to the transactional goodness that is PostgreSQL. But they take time and effort to build, maintain and – perhaps most importantly in the long run – connect to additional sources of information. APIs are not a complete solution to these problems, but they provide a very good one that data journalists should be familiar with and consider incorporating into their work.

A simple example is the reporter who wants to track the votes of his or her state’s delegation in Congress. There are several APIs for this data, including the one I work on and another by OpenCongress. The reporter could build a database of these votes by hand or write scripts to parse the House and Senate vote data and insert them into it. But why, when the data is freely available via HTTP?
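To make "available via HTTP" concrete, here is a minimal sketch in Python of what asking for a delegation's votes might look like. The endpoint, the parameter names and the `api-key` parameter are all hypothetical, made up for illustration; a real API's documentation will tell you what its URLs and parameters actually are.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "https://api.example.com/congress/votes"


def build_votes_url(state, chamber, api_key):
    """Build the query URL for one state's delegation in one chamber."""
    params = {"state": state, "chamber": chamber, "api-key": api_key}
    return BASE_URL + "?" + urlencode(params)


def fetch_votes(state, chamber, api_key):
    """Send the HTTP request and return the raw response body as text."""
    with urlopen(build_votes_url(state, chamber, api_key)) as response:
        return response.read().decode("utf-8")
```

The question ("how did North Carolina's House members vote?") becomes a URL, and the answer comes back as text. There is nothing to build or maintain locally.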

It can’t be that simple, can it? Well, no. But it can be simpler. The data you get from APIs usually comes in XML or JSON. Data journalists have, for better or worse, been dealing with XML for a while now. JSON may be less familiar, but it is quite nice to deal with and there are plenty of libraries with which to do so. But even better than that is the fact that other people have already solved that problem for you. Not long after we released the NYT Congress API I noticed a Ruby client library for it on GitHub. I had never met the author; he had never contacted me. Just the same, he made it easier for people using Ruby to query the API and get back data. There’s also an excellent Python library for it, written by NPR’s Chris Amico.
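If JSON is new to you, here is a rough sketch of what handling it looks like, using only Python's standard library. The response below is invented for illustration: the field names (`results`, `votes`, `position` and so on) are assumptions, not the layout of the NYT Congress API or any other real service.

```python
import json
from collections import Counter

# Invented sample response; real APIs will use their own field names.
SAMPLE_RESPONSE = """
{
  "results": {
    "votes": [
      {"member": "Rep. A", "state": "NC", "position": "Yes"},
      {"member": "Rep. B", "state": "NC", "position": "No"},
      {"member": "Rep. C", "state": "NC", "position": "Yes"},
      {"member": "Rep. D", "state": "VA", "position": "Not Voting"}
    ]
  }
}
"""


def tally_positions(raw_json, state):
    """Count how one state's members voted, by position."""
    data = json.loads(raw_json)  # JSON text becomes dicts and lists
    votes = data["results"]["votes"]
    return Counter(v["position"] for v in votes if v["state"] == state)


tally = tally_positions(SAMPLE_RESPONSE, "NC")
for position, count in tally.items():
    print(position, count)
```

Once parsed, the response is just ordinary dictionaries and lists, which is why client libraries for these APIs tend to be small and quick to write.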

Thus can you, the data journalist, benefit from other people who need and use APIs. Check out GovKit, a Ruby wrapper to multiple government and political APIs, created by the folks at the Participatory Politics Foundation. Go play with it, and figure out what sorts of things you can do when the number of data sources you’re able to tap into multiplies overnight. The possibilities for journalists are only limited by the kinds of questions we can imagine and try to answer. APIs can make it easier to act on that greatest of questions: What if?