Disarming Dangerous Documents

Looks like MIOS is going to involve a LOT of reading.

Apr 18, 2024

We’re a couple weeks into Q2 and the Malign Influence Operations Safari is coming right along. MIOS: Doin’ A FISA telegraphs a story that is apparently queued up at a well-known publication, without giving up any of the particulars. This one was delayed by document provenance issues, but we got that cleared.

I have another one that I think is even better, but it’s got a serious provenance problem. It’s been pitched to several outlets and I think I finally have one that’s going to make it visible. A small hint in advance - it’s got something to do with Rushing Into Semrush.

ANDDDDddddddd since this is a trend, a third piece based on leak data that’s been stranded for years, even longer than the first, is now getting serious attention from some of the usual suspects. No story yet, not even a pitch, but I know there will be.

This flurry of activity puts me back in the mode of thinking about how non-technical folk deal with potentially hazardous material.

Attention Conservation Notice:

I’ve been in an internet sissy fight with Open Semantic Search, the elderly document indexing system I use, for the last thirty-six hours. Activity is way up, so much so that I’m going to have to do something different. This is going to be backstory. If you just want to read some documents, or to read something someone else writes after reading some documents, you can probably go find something else to do.

Open Semantic Search:

I have been a fan of Open Semantic Search since I first laid eyes on it. But I am NOT a fan of 193 open issues on GitHub and I don’t think the developer is going to return. The underlying concept is fabulous:

Apache Tika extracting document metadata.
Tesseract doing Optical Character Recognition.
spaCy doing named entity recognition.
Internal interface to Neo4j for graph database analysis.
Optional interface to Elasticsearch, an unfinished thought.
Faceted search and annotations.

I tried the Elastic plugin back when and got little value out of it. I have never liked Neo4j and I spent some time trying to convert the graph output to ArangoDB instead. The days are long, the years are short, and I only get paid for this in a very indirect fashion, so I let it go.

Hacks, Leaks, and Revelations:

While Hacks, Leaks, and Revelations is a great book for the technically minded would-be investigator, it takes a LOT of hand-holding to get a qualitative analyst and writer through it. The thing that most excited me in the book was Dangerzone, but it misbehaved on Ubuntu 22.04 and again on 23.10. I had other things to do and now Ubuntu 24.04 Noble Numbat is imminent, just a week to go, so I’m resolutely not touching it again until after the upgrade shuffle is done.

I considered the Qubes-oriented approach to using it, but again, just as I got interested, my twelve-year-old workstation started having fits, so the new laptop sits with Ubuntu 23.10 on it, waiting to pounce if the big machine quits in a permanent fashion. The new laptop is also the only thing I have new enough to support the AVX2 instructions that LM Studio requires, so there’s another stumbling block to Qubes.

Stepping back a bit, I recognize this as one of those times where round robin maintenance isn’t going to get the job done. I need to pick just one problem from the set and focus on it until it’s done, then grab another one and finish it off. Once that’s happened I should have more room to maneuver.

Intermediate Solutions:

So Open Semantic Search is behaving badly, unless I log in and run one portion of it from the command line. Theoretically there’s no difference, but in practice unattended mode fails and the terminal process keeps chugging along.

And I do mean CHUG. It’s been like this half the time for a good six or eight hours during which it’s only done 12k of 34k documents in the most urgent batch.

The safety in this case came courtesy of LibreOffice, which I just discovered has an aggressive, promiscuous “whatever to PDF” command line function. The only things I’ve found thus far that it won’t convert are twenty-year-old Lotus spreadsheet files.

soffice --headless --convert-to pdf your_document.doc

So I monkeyed around with find+exec and rsync, doctoring a source tree to include PDF conversions, then moving just the PDFs to an NFS share for the OSS instance.
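For anyone who would rather not wrestle with find+exec and rsync, the same doctoring can be roughed out in Python. This is a stand-in sketch, not the commands actually used here. The function names and any paths you feed them are hypothetical; the only outside piece is the soffice command shown above, plus its --outdir option so each PDF lands next to its source file.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(src: Path, outdir: Path) -> list[str]:
    """Build the headless LibreOffice command for a single file."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(outdir), str(src)]


def convert_tree(source_root: Path) -> list[Path]:
    """Convert every non-PDF file in place; return the files that failed."""
    failed = []
    # Snapshot the listing first so PDFs created during the run are not revisited.
    candidates = sorted(p for p in source_root.rglob("*") if p.is_file())
    for src in candidates:
        if src.suffix.lower() == ".pdf":
            continue
        result = subprocess.run(build_convert_command(src, src.parent),
                                capture_output=True, text=True)
        # Assumption: the output is the source name with a .pdf suffix.
        # Two sources sharing a stem (a.doc, a.xls) would collide on a.pdf.
        if result.returncode != 0 or not src.with_suffix(".pdf").exists():
            failed.append(src)
    return failed


def copy_pdfs(source_root: Path, dest_root: Path) -> list[Path]:
    """Copy only the PDFs to dest_root, keeping the relative layout."""
    copied = []
    for pdf in sorted(source_root.rglob("*")):
        if pdf.is_file() and pdf.suffix.lower() == ".pdf":
            target = dest_root / pdf.relative_to(source_root)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(pdf, target)
            copied.append(target)
    return copied
```

Call convert_tree on the source tree, look over whatever comes back as failed (the old Lotus files, for instance), then call copy_pdfs with the mounted NFS share as the destination. Keep the destination outside the source tree.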

Another thing I learned in this process is that there are simple things even the least technical of users can do:

“Potentially hazardous documents? We just open them with Google Docs.”

Networks, Neural Among Them:

If you’ve got a big pile of documents the AI way is to employ Retrieval Augmented Generation, which retrieves the relevant passages and hands them to a Large Language Model, a type of neural network, as context. There are a bunch of ways to experiment with this, but they generally want a vector database rather than a NoSQL system like ArangoDB or Elasticsearch. Briefly, the vectors are a numeric representation of text, which is how LLMs see the data used to train them.
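To make "numeric representation of text" a little less abstract, here is a toy sketch of the retrieval half of the idea. Big assumption: real systems get their vectors from a learned embedding model and keep them in a vector database. This stand-in uses plain word counts and a Python dict, and every function name in it is made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts, standing in for a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, docs: dict[str, str], k: int = 1) -> list[str]:
    """Return the names of the k documents closest to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda name: cosine(q, embed(docs[name])),
                    reverse=True)
    return ranked[:k]
```

Hand retrieve a question and a dict of document names to text, and whatever it returns is what would get pasted into the LLM prompt as context. Swap the word counts for real embeddings and the dict for a vector store and the shape of the thing stays the same.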

While being able to have a chat with an AI that knows the details of a dataset is pretty cool, there are things that an AI can’t do at this time. Network analysis is one of those things. The Disinfodrome server has 14,000 documents from all the Trump Russia investigations, but if you want to know connections between individuals, companies, etc., the manual graph I’ve created over the last three and a half years by recording entities mentioned in 1,400 news articles and court documents is a much better tool for the majority of questions one might ask.

OSS has a native output to Neo4j, but it’s got an extremely rudimentary interface. If I’ve got a small amount of data to look at, maybe Maltego is the right choice. If there is a lot, it’s gotta be Gephi. Updating how documents are handled will result in API or database access, and from there it’s easy to create GML format files using NetworkX, or maybe launch Graphistry directly.
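The NetworkX step might look something like the sketch below. It assumes the entity mentions have already been pulled out of the API or database as (document, entity) pairs. That input shape, the "kind" attribute, and the function names are all hypothetical. The NetworkX calls themselves (Graph, add_node, add_edge, write_gml) are the library's own, and networkx has to be installed separately.

```python
import networkx as nx


def build_graph(mentions):
    """Build a document-to-entity graph from (document, entity) pairs."""
    g = nx.Graph()
    for doc, entity in mentions:
        # 'kind' is an illustrative attribute so a viewer can color the two node types.
        g.add_node(doc, kind="document")
        g.add_node(entity, kind="entity")
        g.add_edge(doc, entity)
    return g


def export_gml(mentions, path):
    """Write the mention graph to a GML file and return the graph."""
    g = build_graph(mentions)
    nx.write_gml(g, path)
    return g
```

The resulting file opens in Gephi, and an entity that turns up in many documents shows itself as a high-degree node.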

Conclusion:

It’s weird to have a pipeline of interlocking upgrades that are moving slowly AND a pipeline of hits moving at what is lightning speed for the business. Every lick of this stuff is Malign Influence Operations Safari material, too, but I’m not sure how well it’s working for me to talk methods while talking around the actual content.

Maybe we need to do what’s in Hacks, Leaks, and Revelations - pick something in the way of a leak to use as a test set. Any requests?

🇺🇦 Netwar Irregulars Bulletin 🇺🇦