On Legislative Data Transparency

This week I was honored to speak at the Legislative Data and Transparency Conference put on by the Committee on House Administration. If you’re so inclined, the videos of the presentations are online at the conference site, although I must warn you that they contain heavy doses of XML references and other fun stuff. What follows is not my presentation, strictly speaking, but most of it along with some other thoughts.

Being a former Congressional Quarterly staffer, I have an innate fondness for House Admin, one of the lesser-known committees but one with a large influence over what kinds of information the public can see about the House side of the legislative process. The committee has jurisdiction over the Library of Congress, which by extension means Thomas, the online home of so much Congressional information.

There are many other posts about what people committed to transparency want from Congress, but Daniel Schuman of the Sunlight Foundation sums them up pretty well: “To the maximum extent possible, legislative information must be available online, in real time, and in machine readable formats.”

I don’t disagree, and I am sympathetic to complaints that Congress has been slow to address the availability of bulk data. People such as Josh Tauberer have been screen-scraping Thomas since 2004, and I joined in the process a year later at washingtonpost.com. In 2012, we’re both still doing it, now joined by Sunlight, OpenCongress and who knows how many others (speaking of OpenCongress, if you want a less patient restatement of Schuman’s thoughts, OC’s David Moore has a stem-winder of a post for you).

I, too, long for the day when I don’t have to wonder when my HTML parsers will break after a seemingly innocuous change to Thomas’ styles, or when I don’t have to enter three different IDs for a new Senator (Bioguide, LIS and Thomas’ own unique sequential number). But my presentation on Thursday concentrated on a more fundamental need. Before bulk data can become really useful, it has to be more consistent, understandable and accurate. Right now, if you’re not willing to put in a lot of time studying the quirks of Congress, you will always face the likelihood that your data, however lovingly collected, has plenty of errors.

For example, in the Senate it is possible for the Majority Leader and Minority Leader to alter the rules of math when it comes to how many senators constitute a three-fifths majority. The death of Sen. Ted Kennedy in 2009 reduced the number of Democrats in the chamber at that time to 59, and the total number of senators “duly elected and sworn” to 99. For votes requiring a three-fifths majority (thanks, Malcolm), a 99-member Senate would need 59.4 senators for passage, or at least 59. But the party leaders agreed to keep the three-fifths threshold at 60 votes throughout the period when the Senate had 99 senators, not 100. For much of that period, any three-fifths vote displayed on nytimes.com had the wrong number of votes required for passage, because I was relying on math. I could not find any place in the Congressional Record or anywhere else where this was documented.
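
To make the arithmetic concrete, here is a minimal sketch in Python. The function names and the rounding options are hypothetical, invented for illustration; nothing here reflects an official data source. It shows that the answer depends on a rounding choice the data never states, and that a published threshold, if one existed, should win over anything computed.

```python
from fractions import Fraction
import math

def three_fifths_threshold(sworn, rounding="ceil"):
    """Compute a three-fifths threshold from the number of sworn senators.

    The rounding rule is an assumption: the source data does not say
    which one applies, and that is exactly the problem.
    """
    exact = Fraction(3 * sworn, 5)  # exact value, e.g. 297/5 for 99 senators
    if rounding == "ceil":
        return math.ceil(exact)
    if rounding == "nearest":
        return round(exact)
    raise ValueError(f"unknown rounding rule: {rounding}")

def required_votes(sworn, published=None):
    """Prefer a published threshold (hypothetical field) over arithmetic."""
    if published is not None:
        return published
    return three_fifths_threshold(sworn)

nearest_99 = three_fifths_threshold(99, "nearest")
ceil_99 = three_fifths_threshold(99, "ceil")
agreed_99 = required_votes(99, published=60)
```

With 99 senators, rounding to the nearest whole number reproduces the 59 described above, while rounding up gives 60. A consumer of the data cannot tell which number is right without the threshold being stated alongside the vote.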

An edge case, you might say. But when it comes to Congress, there are loads of them. A reporter called me several weeks ago to ask why a seemingly simple question about three members of her state’s delegation was maddeningly hard to answer. All three had been elected to the House the same year, and had served since then. But each of them had a different total number of votes he or she was eligible to cast. How could that be?

It took me a little while, but the only explanation I could find was that their dates of service had to differ in some way, and my guess was that not all of them were sworn in for each session on the same day. It happens. Unfortunately, neither Thomas nor the Clerk of the House provides an easy way to find out when a particular member was sworn in, despite the fact that it is a basic element of what makes someone a Member of Congress. At the conference, I heard someone say that it would be possible to provide a list of swearing-in dates for every lawmaker. That’s good, and needed, but it’s not good enough. I need, and the data demands, timestamps in this case. That’s the only way I can be sure of what votes a member was or was not eligible to vote on.
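
A rough sketch of why dates alone fall short, again in Python. The members, swearing-in times and vote times below are made up for illustration, and `eligible_votes` is a hypothetical helper, not part of any existing library or official feed.

```python
from datetime import datetime

def eligible_votes(sworn_in, vote_times, left_office=None):
    """Return the votes held while the member was in office."""
    return [
        t for t in vote_times
        if t >= sworn_in and (left_office is None or t < left_office)
    ]

# Hypothetical roll call times.
votes = [
    datetime(2011, 1, 5, 14, 30),
    datetime(2011, 1, 5, 18, 0),
    datetime(2011, 1, 6, 11, 15),
]

# Hypothetical members sworn in on the same day, two hours apart.
member_a_sworn = datetime(2011, 1, 5, 14, 0)
member_b_sworn = datetime(2011, 1, 5, 16, 0)

count_a = len(eligible_votes(member_a_sworn, votes))
count_b = len(eligible_votes(member_b_sworn, votes))
same_date = member_a_sworn.date() == member_b_sworn.date()
```

In this made-up case the two members share a swearing-in date but end up with different eligible-vote totals, because one vote fell between the two oaths. A list of dates would make them look identical.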

You might think that you could find the total number of House votes for a given year by looking at the Clerk’s votes site. In 2011, the last vote was roll call 949. Alas, officially, there were 948 votes that year, because roll call 484 was vacated and replaced by vote 485, and thus never really happened.
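
In code, the difference looks like this. It is a small sketch that assumes roll calls are numbered sequentially from 1 within a year and that a list of vacated roll calls is available from somewhere, which is the part the official sources do not hand you; `official_vote_count` is a hypothetical name.

```python
def official_vote_count(last_roll_call, vacated):
    """Count votes in a year, excluding vacated roll call numbers.

    Assumes numbering starts at 1 and has no other gaps.
    """
    in_range = {n for n in vacated if 1 <= n <= last_roll_call}
    return last_roll_call - len(in_range)

naive_2011 = official_vote_count(949, set())
official_2011 = official_vote_count(949, {484})
```

The naive count is simply the last roll call number, 949. Only with the vacated list in hand does the count drop to 948.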

In my presentation I cited a few other examples, but they mostly boil down to this: unless we can make congressional information easier to use and understand by people outside the small circle of legislative wonks, bulk data access by itself won’t solve our problems. Today the most creative uses of congressional information, such as Sunlight’s Capitol Words project, suffer from this limitation. I love Capitol Words, but right now the Congressional Record – the source for it – cannot reliably tell me in a machine-readable form whether a particular word or phrase or speech was even spoken out loud on the floor of the House or Senate. That’s kind of a big deal, for reporters, historians and the public.

If we can’t use congressional data to answer what should be straightforward questions, or can’t agree on what the answers should be, providing immediate access to that data in bulk form may not be as helpful as we would think, and in some cases risks adding to the confusion. It may expose more of those problems, which is of some usefulness, but if the ultimate goal is not just access but understanding, we need to address the fundamental issues of accuracy and consistency before we switch on the firehose.