The Disinfodrome system hosts multiple large sets of PDF files. Here’s an offhand list of sources.
There are a variety of Congressional investigations, some 13k docs total.
There are a number of large FOIA folders; the only one that’s generally available is the Arizona communications with Cyber Ninjas, which is 34k documents.
The FEC can be confusing when dealing with a shifty PAC operator, but it’s a source of huge volumes of documents, like the 425k pages of Trump filings.
Inoreader offers an RSS-to-PDF service that saves into Dropbox, which is useful for capturing entire Substacks.
There are a variety of ways to mine web sites for documents; those become PDFs in an effort to disarm any possible malware/beacons.
Some leaks are largely PDF or things that are amenable to conversion to PDF.
There are some books that offer information so compelling that I hunt them up in PDF format so they can be indexed.
One of the things that was supposed to happen in Q3, before the world went absolutely crazy, was that we’d be doing more with court cases. I have my own personal 1st Amendment case in Texas that is about to resolve, but other than that any court-related stuff here has been incidental - like Kellye SoRelle Finds Out.
The other day I noticed the plaintiff in the Texas case, who is the poster child for being a slow learner, got himself served in another lawsuit. Among the plaintiffs I see there’s a professional nuisance, so I’m just not ever going to mention any details, but this did put me back into the mode of checking out documents on PACER, which costs money.
Any case of public interest is quite likely to end up in RECAP, a free service that PACER users install so their purchases join an enormous pool of freely available court documents. You just need to add the extension to your browser; once it’s installed, you click it and it’ll offer a search dialog.
RECAP is not just an extension; it also offers an API, so it can be programmed. Today I am experimenting with the Dify (say: dee-fy) AI framework and I don’t see a specific RECAP tool (yet), so I will have to explore how to make search calls with a supported wrapper.
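Before wiring anything into an AI framework, it helps to see what a bare search call might look like. This is a minimal sketch, not a tested integration. It assumes the RECAP archive is searched through CourtListener's REST search endpoint, that `type=r` restricts results to RECAP material, and that results carry `caseName` and `docketNumber` fields. The version path, parameter names, field names, and the helper names `build_search_url`, `search_recap`, and `summarize` are all assumptions to check against the current API documentation.

```python
import json
import urllib.parse
import urllib.request

# Assumption: RECAP search is served by CourtListener's REST API at this path.
# Confirm the version and parameters in the current API docs before relying on it.
BASE_URL = "https://www.courtlistener.com/api/rest/v4/search/"


def build_search_url(query):
    """Build a search URL restricted to RECAP (PACER-derived) results."""
    params = {"q": query, "type": "r"}  # "r" = RECAP results (assumed value)
    return BASE_URL + "?" + urllib.parse.urlencode(params)


def search_recap(query, token=None, timeout=30):
    """Run one search call and return the decoded JSON response."""
    request = urllib.request.Request(build_search_url(query))
    if token:
        # Assumption: token authentication uses this header format.
        request.add_header("Authorization", f"Token {token}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def summarize(payload):
    """Return (case name, docket number) pairs, tolerating missing keys."""
    return [
        (result.get("caseName"), result.get("docketNumber"))
        for result in payload.get("results", [])
    ]
```

Keeping the URL builder and the summarizer separate from the network call means the same pieces could sit behind whatever generic HTTP or custom-tool wrapper the framework supports, and they can be checked without spending any API quota.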
Conclusion:
The big picture? What did we learn this year? What has mattered in all the things we’ve done and seen?
Which contributed to …
And now there are other things brewing that are similar - long-term attention to detail is about to produce results against the enemies of our democracy. The root of this is the decay of the 4th estate, which is too busy sanewashing a career criminal with dementia to do detailed digging.
“Do your own research” is a QAnon trope, but if we connect AI tools to known good data sources (PACER only somewhat qualifies) then we are empowered. I say “somewhat” here because of the frivolous litigation angle - crackpots use court filings to launder conspiracy theories into objective reality.
Analytical Tradecraft Standards in an Age of AI is a good read if you plan to apply things like this to large document caches. The summary is pretty straightforward: you can use AI to augment searching, not replace it. You can use AI to draft content, not author it. And these are true only if you have quality control on inputs.