What kind of sicko signs up for political fundraising emails from just about every committee? Oh, right, that’s me.
Sure, there’s a certain masochism to this, but I’m genuinely interested in seeing how campaigns communicate their messages to prospective donors, and in the distance between what they’re willing to say in an email to their supporters and what they say in other contexts. But, as with nearly every aspect of election data, there’s a catch: the fundraising emails don’t have the same kind of unique identifiers that, say, Federal Election Commission records do. At best, they have FEC-mandated disclaimers, but even those aren’t as useful as they could be: they use committee names rather than the unique IDs that campaign finance data relies on.
Here’s where I’ve turned to LLMs to help. Initially, I thought I’d be able to use pattern matching to identify and extract disclaimer language from emails, and to a certain degree that’s possible. Many of the emails employ similar language and put the disclaimer near the end of the message, like this one from the Pennsylvania Democratic Party:
The phrase “Paid for by The” is a good marker for this language, but not every committee uses that exact syntax, and you can imagine fundraising emails where the phrase “paid for by the taxpayers” appears in a different part of the email. The other issue is getting the name out of the disclaimer - knowing when to stop, basically. Sometimes that’s easy: the entire disclaimer is “Paid for by [name of committee]” with no other nearby text. Other times the disclaimer language just keeps going:
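To make the pattern-matching idea concrete, here’s a rough sketch of what that naive approach looks like. The regex and the stop heuristics are illustrative guesses, not the code in the repository:

```python
import re

# Illustrative only: grab the text after "Paid for by" and hope the committee
# name ends at a period, pipe or newline. It's case-insensitive, so it will
# also happily match "paid for by the taxpayers" in body copy - exactly the
# kind of false positive described above.
DISCLAIMER_RE = re.compile(r"Paid for by\s+(?P<committee>[^.|\n]+)", re.IGNORECASE)

def extract_committee(email_text):
    match = DISCLAIMER_RE.search(email_text)
    if not match:
        return None
    name = match.group("committee").strip()
    # Knowing when to stop is the hard part: trim trailing boilerplate like
    # "and not authorized by any candidate..." or an address after a comma.
    name = re.split(r"\band not authorized\b|,\s*\d", name, maxsplit=1)[0]
    return name.strip(" ,") or None
```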
This seems like something I could train a model to detect, and that did cross my mind. But mostly I was interested in seeing both how an LLM would handle this task and whether there were differences between various models given the same data and prompt. So I took 1,000 emails from November 2024 (I get anywhere from 3-12k a month, depending on where we are in the election calendar) and identified the committee name from the disclaimer text, along with the name of the sender, which often appears only in the email text. Then I fed the same emails to multiple LLMs and provided the following instructions:
> Produce a JSON object with the following keys: ‘committee’, which is the name of the committee in the disclaimer that begins with Paid for by but does not include “Paid for by”, the committee address or the treasurer name. If no committee is present, the value of ‘committee’ should be None. Also add a key called ‘sender’, which is the name of the person, if any, mentioned as the author of the email. If there is no person named, the value is None. Do not include any other text, no yapping.
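Mechanically, each run boiled down to something like the sketch below. This one goes against a local Ollama model over its REST API; the hosted models follow the same pattern with their own clients, and the model name, timeout and prompt handling here are simplified placeholders:

```python
import json
import requests

INSTRUCTIONS = "Produce a JSON object with the following keys: ..."  # the prompt above

def ask_model(email_text, model="llama3.3"):
    """Send one email plus the instructions to a local Ollama model."""
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": f"{INSTRUCTIONS}\n\n{email_text}"}],
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    content = resp.json()["message"]["content"]
    try:
        return json.loads(content)  # expect {"committee": ..., "sender": ...}
    except json.JSONDecodeError:
        return None  # a response that isn't valid JSON counts against the model
```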
I ran that prompt against 20 different models, via Ollama, Groq and directly through the Anthropic and OpenAI APIs, and then compared the results to my original data, which records each committee name using the naming style present in the email. If you’ve played around with a lot of LLMs, you won’t be surprised by the results - or maybe some of them will surprise you:
Model | Total Records | Committee Matches | Percent Matched |
---|---|---|---|
OpenAI 4o | 1000 | 934 | 93.4% |
Claude 3.5 Sonnet | 985 | 903 | 91.7% |
OpenAI 4o November | 1000 | 858 | 85.8% |
Gemma2 | 495 | 403 | 81.4% |
Llama 3.3 | 1000 | 802 | 80.2% |
Llama 3.1 8b | 507 | 406 | 80.1% |
Phi 4 | 1000 | 798 | 79.8% |
Gemma2 27b | 994 | 790 | 79.5% |
Claude 3.5 Haiku | 880 | 654 | 74.3% |
Mixtral | 461 | 337 | 73.1% |
QwQ | 996 | 716 | 71.9% |
Mistral Small | 502 | 346 | 68.9% |
DeepSeek R1 32b | 1000 | 663 | 66.3% |
Starling LM | 270 | 168 | 62.2% |
EXAONE 3.5 | 960 | 582 | 60.6% |
InternLM2 | 992 | 574 | 57.9% |
Llama 3.2 3b | 524 | 239 | 45.6% |
DeepSeek R1 8b | 781 | 296 | 37.9% |
Phi 3 | 521 | 143 | 27.4% |
Solar Pro | 1000 | 259 | 25.9% |
First, an explainer on what the columns mean: total records is the number of emails, out of 1,000, for which the LLM was able to generate a JSON response without raising an error. The commercial models, plus newer open-source ones such as Llama 3.3, DeepSeek’s 32b version and Phi 4, were basically able to do that for all or nearly all of the records. Others struggled to handle even half of them.
Committee matches is the number of times that, when the LLM did produce a valid response, it correctly matched the committee name in the training dataset; the percentage is calculated against the records each model managed to parse, not the full 1,000. My own cutoff for “good enough” here is about 80 percent, meaning that in addition to the models I mentioned above, Gemma2 (27b version) gets pretty close.
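In code, the scoring works out to roughly the sketch below; the comparison here is plain string equality, and the actual code in the repository may normalize names differently:

```python
def score(results, labeled):
    """Compare model output against the hand-extracted committee names.

    `results` maps email id -> parsed JSON dict (or None if the model failed);
    `labeled` maps email id -> the committee name taken from the disclaimer.
    """
    total = sum(1 for r in results.values() if r is not None)
    matches = sum(
        1
        for email_id, r in results.items()
        if r is not None and r.get("committee") == labeled.get(email_id)
    )
    return {
        "total_records": total,
        "committee_matches": matches,
        "percent": round(100 * matches / total, 1) if total else 0.0,
    }
```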
What stands out to me here is the performance of Llama 3.3 and the drop-off for the smaller models (honestly, though, shout-out to InternLM2, which did kinda ok!). I recognize that the latter could probably improve with some fine-tuning, of which I performed none; this is zero-shot stuff. My next steps are to run more emails through the process, and I’d love some feedback on ways to improve it.
You can see the training data, the JSON files and the code used to produce them at this repository.