Maltego File Grooming

An endless chore draws to a close ...

Jun 16, 2025

Years ago I spent some months working with a fellow Maltego curator. This person made endless derivative entities and they came to infest all of my “master” graphs, as well as a number of child graphs from that time onward. I’d remove them from a file I was working on, then I’d cut and paste from elsewhere and they’d creep back into circulation.

I’ve been removing this stuff as I encounter it for over three years. And this weekend was The End. As I work on command line handling of graphs I do not want this shite spilling everywhere, so I am doing a final cleanse.

This sort of work always feels like the worst drudgery, but I always learn things about my data and my processes …

Attention Conservation Notice:

Just Maltego curation stuff herein. If you don’t have an enormous pile of these files to wrangle, just move on with your day.

Scope:

There’s a Maltego folder with 1,561 files in it that was once the definitive collection.

That became unwieldy and the Scratch folder took its place for daily work … until it ballooned to 655 files.

And finally the MALT folder in my home directory became the daily work space … and it’s now 350 files.

That’s 2,566 file names. But when I rsync all three locations to a ZFS dataset on my Proxmox system there are just 2,367 unique mtg[lx] files. Since I did the rsync in order of creation (Maltego, Scratch, then MALT), I’m confident those 2,367 files are the latest and the 199 that went missing are duplicate names that a later copy overwrote. I might lose some stuff from the fringes but the core of my work is safe.
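The same last-copy-wins merge can be previewed without touching any files. This is a minimal sketch, assuming a plain rsync where a later source overwrites an earlier file with the same relative path; the function name `merge_in_order` is hypothetical and not part of any tool mentioned here.

```python
from pathlib import Path


def merge_in_order(roots, patterns=("*.mtgx", "*.mtgl")):
    """Return (total files seen, {relative path: winning source path}).

    Roots are visited in the order given, so a file in a later root
    replaces one with the same relative path from an earlier root.
    This mirrors an assumed plain rsync of each folder in turn.
    """
    winners = {}
    total = 0
    for root in roots:
        root = Path(root)
        for pattern in patterns:
            for path in root.rglob(pattern):
                if path.is_file():
                    total += 1
                    winners[path.relative_to(root)] = path
    return total, winners
```

Called with the three folders in creation order, `total - len(winners)` is the number of name collisions, which should line up with the 199 figure above if the assumption about rsync holds.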

1,210 of those files are the pre-2016 MTGX format. 1,157 are the post-2016 MTGL format. There’s an explanation of the difference between the two file types in my Maltego2Arango GitHub repo. Briefly, the old format has an explicit graph in GraphML format. The new format uses Lucene indices and the graph is implicit, so it’s much harder to get at the details.

Unpacking:

The graph details are complex, but a good first step would be exploring the entities. Each of the two file types has the same folders within it, so I used this incantation to break them all out.

find . -type f -name "*mtg[lx]" -execdir unzip {} -d {}.dir \;
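For anyone without unzip on hand, roughly the same unpacking can be done with Python's standard library. This is a sketch rather than a drop-in replacement: the function name `unpack_graphs` is made up for illustration, and it matches the suffix case-insensitively, which the find pattern above does not.

```python
import zipfile
from pathlib import Path


def unpack_graphs(root):
    """Extract every .mtgx/.mtgl under root into a sibling '<name>.dir' folder."""
    unpacked = []
    # sorted() materializes the listing first, so the folders created
    # below are not walked during the same pass.
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in {".mtgx", ".mtgl"}:
            continue
        if not zipfile.is_zipfile(path):
            continue  # skip anything that is not actually a zip archive
        target = path.with_name(path.name + ".dir")
        with zipfile.ZipFile(path) as archive:
            archive.extractall(target)
        unpacked.append(target)
    return unpacked
```

The return value is the list of folders created, which makes it easy to spot graph files that were skipped because they were not valid archives.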

This yielded 18,487 entity files and I was pleased to find only three instances of the entities I was trying to eliminate had slipped through the dragnet. They were terminated with prejudice.

Next step was to check the .entity files to see if they’ve changed over the years.

time find . -type f -name "*entity" -exec openssl sha256 {} \; >> SHA256.txt

And … oh my have they changed. There are 1,081 files with a Domain entity in them and there are eleven different revisions of that Entity type(!?!?). I interrobang here because you would think that the notion of a Domain would be pretty well settled.

Overall there are 190 Entities in use, but they have 528 unique appearances.
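The type-versus-revision tally can be reproduced directly from the unpacked tree. A minimal sketch, assuming the file name minus its .entity suffix identifies the entity type (for example a hypothetical `maltego.Domain.entity`); `revisions_by_type` is an illustrative name, not an existing tool.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def revisions_by_type(root):
    """Map each entity type to the set of distinct SHA256 digests seen for it."""
    revisions = defaultdict(set)
    for path in Path(root).rglob("*.entity"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            # Assumption: the stem of the file name is the entity type.
            revisions[path.stem].add(digest)
    return revisions
```

With that in hand, `len(revisions)` is the number of entity types and `sum(len(v) for v in revisions.values())` is the number of unique appearances, the two figures quoted above.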

Implications of Variances:

So what should be done with these differences?

I manually reviewed a couple of types, and there are minor changes to the XML in the .entity file. It looks like at some point a <Groups/> tag was added. It has no separate open/close pair, but that is just XML’s self-closing form for an empty element, so the file is still well-formed.

I piled all of one type with many revisions into a single file and sorted based on matching line count. The files are substantially similar. A SHA256 hash changes completely if there’s one single new space or carriage return in a file.
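One way to test whether the revisions differ only in whitespace or tag style is to hash a canonical form of the XML instead of the raw bytes. This is a sketch using `xml.etree.ElementTree.canonicalize` (Python 3.8 or later), which writes empty elements as start/end pairs and, with `strip_text=True`, drops surrounding whitespace. Whether that collapses the revisions in this collection is untested; the helper names are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path


def raw_sha256(path):
    """Hash the file bytes exactly as stored on disk."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def canonical_sha256(path):
    """Hash the C14N form of the XML, ignoring layout-only differences.

    strip_text=True removes leading and trailing whitespace around text,
    so re-indented or re-wrapped files produce the same digest.
    """
    canonical = ET.canonicalize(from_file=str(path), strip_text=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

If two revisions of a type share a canonical digest, the difference between them is formatting only. If the digests still differ, something structural changed, such as an added element or attribute.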

What I THINK at this point is that the changes in the .entity files probably don’t matter much at the level where we’d use them. I suspect that simply converting these XML files to JSON and stashing them in ArangoDB will lead to 190 Entity types that can be referenced in graph databases.
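As an illustration of the XML to JSON step, here is a minimal standard-library sketch. It is not the utility used for the load described below, and the output layout (`tag`, `attributes`, `text`, `children`) is an arbitrary choice made for this example rather than anything ArangoDB or Maltego requires.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Convert one XML element, and its children recursively, to a plain dict."""
    node = {"tag": element.tag}
    if element.attrib:
        node["attributes"] = dict(element.attrib)
    text = (element.text or "").strip()
    if text:
        node["text"] = text
    # Tail text (text after a child's closing tag) is ignored in this sketch.
    children = [element_to_dict(child) for child in element]
    if children:
        node["children"] = children
    return node


def entity_file_to_json(path):
    """Parse an .entity file and return it as a JSON string."""
    root = ET.parse(str(path)).getroot()
    return json.dumps(element_to_dict(root), indent=2, sort_keys=True)
```

Because the output comes from `json.dumps`, it always parses back with `json.loads`, which is a cheap sanity check to run before loading thousands of documents into a collection.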

Loading:

A bit later I had 18,241 json files in a directory tree on my ArangoDB server.

And then I learned that the Python script ChatGPT had written produced trash output, not JSON.

… time passes …

And the replacement files, created with a proper, albeit much slower, XML to JSON utility, led to 18,239 records in an ArangoDB collection. Not the 190 or 1,081, but an entry per occurrence. I guess that’s a start.

Conclusion:

I guess Maltego2Arango is going to progress. There really needs to be something like the strings command for Maltego graphs. Since this requires some deep handling of the Lucene indices in newer Maltego files, it’s not unreasonable to require those wanting to use it to have Docker and an ArangoDB instance. SQLite3 might be a simpler starting point, but we ARE dealing with graph data, and I suppose I should write some transforms …

Big picture: where’s the AI FOSS? I’m producing FOSS here … but AI is serving the same role that soft shoulders and swampy spring ditches do on Iowa gravel roads. Keeping your eye on the road and staying right in the center is the safest course …

And post by post, the search results for Maltego2Arango will improve.

🇺🇦 Netwar Irregulars Bulletin 🇺🇦
