I finally bit the bullet and put $10 into Claude.ai API credits. The motivation is simple - there are nearly 2,500 unique Maltego file names on my system, the result of having been a licensed user for fourteen years.
This is problematic for all sorts of reasons, and last fall I took a swing at creating some tools to simplify things.
Attention Conservation Notice:
A deep dive into the secret desires of a longtime Maltego user who happens to be a serious network analysis nerd, too. There’s a lot of hardcore GraphCraft in here; enter at your peril.
Problem Scope:
Given that there are almost 2,500 files, finding specific things can be tricky. Here’s a general overview of what’s available.
There are 676 files that begin with “20[12][0-9]-”, starting with five files created in 2017 and peaking at 226 date-stamped files created in 2024. The pace this year is dramatically reduced: just 46 files thus far.
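For the curious, tallying those date-stamped files takes just a few lines of Python. A minimal sketch: the directory location is a placeholder, and extending the “20[12][0-9]-” prefix to a full YYYY-MM-DD pattern is my assumption about the naming scheme.

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical location of the Maltego graph files
GRAPH_DIR = Path.home() / "Maltego"

# Matches a leading YYYY-MM-DD date stamp, e.g. "2024-03-15-topic.mtgl"
DATE_STAMP = re.compile(r"^20[12][0-9]-\d{2}-\d{2}-")

# Tally date-stamped files by their four-digit year prefix
counts = Counter(
    p.name[:4]
    for p in GRAPH_DIR.iterdir()
    if DATE_STAMP.match(p.name)
)

for year in sorted(counts):
    print(f"{year}: {counts[year]} files")
```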
There are fewer than a dozen “master” files, large graphs which contain the contents of many smaller graphs. As an example of how that works: when a burst of articles appeared about a person or situation relevant to the big MAGA graph, I would create a YYYY-MM-DD-<topic> file, apply IBM Watson’s named entity recognition to the news stories, and once I was happy with the subgraph, paste the results back into the large parent graph.
There are 592 files dating from 2020 onward that either went into the big graph or are reports I pulled out of it and gave to someone else.
And that means there are almost 2,000 files that have some sort of name but lack the leading date stamp. I checked the Unix timestamps and that’s hopeless; things get moved around when I change operating systems. I didn’t create 928 unique files on May 3rd of 2019; that was when I retired my old Intel MacBook Pro and migrated to an HP workstation.
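That migration artifact is easy to spot programmatically, if you’re curious: bucket every file by modification date and look for implausibly large buckets. A minimal sketch, using the same hypothetical directory as above.

```python
import datetime
from collections import Counter
from pathlib import Path

GRAPH_DIR = Path.home() / "Maltego"  # hypothetical location

# Bucket every file by its modification date; a single huge bucket
# (e.g. 928 files on one day in 2019) marks a machine migration,
# not a real burst of work.
by_date = Counter(
    datetime.date.fromtimestamp(p.stat().st_mtime)
    for p in GRAPH_DIR.iterdir()
    if p.is_file()
)

for day, n in by_date.most_common(5):
    print(day, n)
```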
The Terror Within:
There are some entities important enough to have multiple YYYY-MM-DD files associated with their names. But the vast majority of the 8,534 entities in the big graph are not particularly notable.
There is some hope for the work after 2020, but the eight years prior? If an entity wasn’t important enough to get a file of its own, I may have a bunch of stuff on it, but finding it is well-nigh impossible.
There was a period when I could feed this pile of Maltego files to Open Semantic Search and it did a tolerable job of making them accessible. I don’t know precisely what changed, probably something in Tika, but that capability disappeared. OSS is orphanware; I know how to build it from scratch, but I am NOT taking over a project like that without the blessing of Markus and a patron who will cover the time I spend on it.
A Glimmer Of Hope:
I set out last year to build a Maltego2Gephi utility, a simple tool that takes the GraphML you can cut/paste from Maltego and turns it into GML that Gephi can read. It worked, albeit poorly, and while exploring this problem I discovered I could unpack the newer MTGL files using the tar command. Digging around inside the structure, I found that the entity types in use were stored as XML, while the actual data was in Lucene indices.
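Both halves of that discovery reproduce in a few lines of Python. This is a minimal sketch rather than the actual Maltego2Gephi code - the file names are placeholders, and it leans on NetworkX’s stock read_graphml/write_gml routines for the conversion.

```python
import tarfile
import networkx as nx

# Newer MTGL files are plain tar archives; unpack one to see the
# XML entity definitions and Lucene index directories inside.
with tarfile.open("example.mtgl") as archive:   # hypothetical file name
    for member in archive.getnames():
        print(member)
    archive.extractall("mtgl_contents")

# The Maltego2Gephi step: GraphML pasted out of Maltego, rewritten
# as GML that Gephi can open directly.
g = nx.read_graphml("pasted.graphml")           # hypothetical export
nx.write_gml(g, "pasted.gml")
```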
Those Lucene indices use the same format that underpins Elasticsearch and Solr. I started looking for a command-line utility that would expose the content - because the Unix strings command can do nothing with this complex data format. I ran aground in a shoal of Python abandonware. A lot of people used to do things like this, but I couldn’t make out what Lucene users are doing now for command-line access.
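For a sense of what that access looks like when it does work: PyLucene can dump the stored fields of an index, assuming you survive its installation, which is exactly the painful part. A sketch under assumptions - a working PyLucene build, a Lucene version compatible with what Maltego writes, and a hypothetical path to one of the indices unpacked above.

```python
import lucene
from java.nio.file import Paths
from org.apache.lucene.index import DirectoryReader
from org.apache.lucene.store import FSDirectory

lucene.initVM()

# Hypothetical path to a Lucene index unpacked from an MTGL archive
index_path = "mtgl_contents/Graphs/Graph1/index"

reader = DirectoryReader.open(FSDirectory.open(Paths.get(index_path)))
for i in range(reader.maxDoc()):
    doc = reader.document(i)          # stored fields for document i
    for field in doc.getFields():
        print(field.name(), "=", field.stringValue())
reader.close()
```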
Enter Claude:
I’ve been seeing videos about Claude Code like this one for the last couple of weeks.
So Friday night I put $10 into their API and waded in swinging.
Maltego2Arango, the resulting code, whose name hints at the bigger picture here, DOES! NOT! WORK!
To be fair, this was an absolutely terrible place to start - a backwater problem tied to a bunch of end-of-life repositories, attempted on macOS instead of Linux. But Claude offers a framework, a discernible strategy for going at the problem, and it cost me the princely sum of $2.84 to get there.
Conclusion:
Having seen Claude Code in action in the terminal window of my PyCharm IDE, I’m ready for more. Giving it the Maltego problem was an acid test - lots of complexity, lots of abandoned code - and what I learned is that I need a bit more of a foundation before this is going to work. By that I mean some workable way to get content out of Lucene indices, a place to stand for that first move.
The next thing is going to involve putting attention on SubstackSNA. The substack_api repo is under active development, and I am already pulling data out of it and getting it into ArangoDB. There is a well-documented ArangoDB NetworkX adapter. This is something I could do by hand; there’s no hideous mess of support packages to install, unlike what I found with Lucene. This seems much simpler.
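The adapter round-trip looks roughly like this. A minimal sketch assuming a local ArangoDB instance - the connection details and the graph name are placeholders, while ADBNX_Adapter and arangodb_graph_to_networkx come from the adapter’s documented API.

```python
import networkx as nx
from arango import ArangoClient            # python-arango driver
from adbnx_adapter import ADBNX_Adapter    # ArangoDB <-> NetworkX adapter

# Connect to a local ArangoDB instance; host and credentials are placeholders.
db = ArangoClient(hosts="http://localhost:8529").db(
    "_system", username="root", password="passwd"
)
adapter = ADBNX_Adapter(db)

# Pull a named ArangoDB graph into NetworkX for analysis
# ("substack_sna" is a hypothetical graph name).
nx_graph = adapter.arangodb_graph_to_networkx("substack_sna")
print(nx_graph.number_of_nodes(), "nodes,", nx_graph.number_of_edges(), "edges")

# From here it's ordinary NetworkX, e.g. centrality scoring.
centrality = nx.degree_centrality(nx_graph)
```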
The far horizon here is ArangoDB’s combination of LLMs and a knowledge graph. LLM hallucinations are likely to get worse as the models begin to eat their own tails, and there is a lot of model poisoning work out there. If humans collect stuff they know to be legitimate, like I did with that enormous MAGA graph, the fusion of that real knowledge graph with an LLM will permit the creation of an expert system that doesn’t just tell plausible lies at random.
Back in 2010 I was the designer for Progressive Congress News, which reached 23% of Congressional staff at its peak, and which finished in the top five among seven hundred applicants for the 2011 Knight Foundation News Challenge. Even back then, we were building knowledge graphs using a social media platform. I think AI-enabled human curators are going to best the machines … I just need a little machine assist to get a minimum viable prototype together so I can prove this theory.