The Dead Internet Theory, the notion that humans are gone and it’s all just manipulative bots, started almost ten years ago. The recent advances in AI now make this outcome a distinct possibility.
I’ve been thinking about this for a while and an article on “model collapse” just turned up, so I guess it’s time to discuss this.
Attention Conservation Notice:
Speculation on the far horizon for AI developments. If you’re into the big picture, trying to see what’s next, this might just be a good read for you.
Model Collapse:
The AI models we have were fed on human-authored text scraped from basically everywhere. Now, with DeepSeek, we’re hearing about a model that aggregates results from other models. The problem comes with the feedback loop, as described in Model Collapse over at TechCrunch.
Like the Ouroboros, current AIs are feeding on slop produced by the prior generation. The closest human analogue is what’s happened to popular music over the last couple of decades: hip-hop sampling is creative, but the current Top 40, where you can’t tell when one song ends and the next begins, is exemplary of “hits picked by machines”.
I have a personal “progress canary” in this area: when an LLM can do a reasonable job with regular expressions, that will be a big step forward. I would probably reach for a mechanical pencil and a pad if I had to work on the expression below, but I can interpret it.
([A-Za-z]{3} [\d]{2} [\d]{1,2}:[\d]{1,2}:[\d]{1,2}) ([\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}) (\[S\=[\d]{9}\]) (\[[A-Za-z]ID=.{1,18}\])\s{1,3}?(\(N\s[\d]{5,20}\))?(\s+(.*))\s{1,3}?(\[Time:.*\])?
If you want to understand what’s going on there, I strongly recommend Mastering Regular Expressions. This is old-school Unix text pattern matching: very powerful, but fairly impenetrable unless you spend some time with that book.
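If you want to see the pattern in action, here is a minimal Python sketch that compiles it and runs it against an invented log line. The sample line and the field meanings in the comments are my guesses at the intended format, not output from any real system.

```python
import re

# The pattern from above, split into labeled pieces. The field meanings in
# the comments are guesses at the intended log format.
LOG_PATTERN = re.compile(
    r"([A-Za-z]{3} [\d]{2} [\d]{1,2}:[\d]{1,2}:[\d]{1,2}) "  # timestamp, e.g. "Feb 03 12:34:56"
    r"([\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}) "         # IPv4 address
    r"(\[S\=[\d]{9}\]) "                                      # nine-digit session tag [S=...]
    r"(\[[A-Za-z]ID=.{1,18}\])"                               # an ID tag such as [UID=...]
    r"\s{1,3}?(\(N\s[\d]{5,20}\))?"                           # optional (N nnnnn) sequence field
    r"(\s+(.*))"                                              # the free-text remainder
    r"\s{1,3}?(\[Time:.*\])?"                                 # optional trailing [Time:...] field
)

# An invented line in roughly that shape.
line = "Feb 03 12:34:56 10.1.2.3 [S=123456789] [UID=alice] (N 12345) login accepted"

match = LOG_PATTERN.search(line)
if match:
    # Print each capture group; note the trailing lazy \s{1,3}? can clip
    # the free-text group at its last run of whitespace.
    for i, group in enumerate(match.groups(), start=1):
        print(i, repr(group))
```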
The Model Collapse theory is going to prove out. LLMs might get better at regex … maybe … but their knowledge is going to hit the skids as they ingest more and more slop.
Countermeasures:
I have made several false starts with AI, first stumbling over my ancient DDR3 systems that lack the AVX2 processor extensions, then stumbling over wanting a GraphRAG system and not liking any of the offerings. Today I have a minuscule box with AVX2, a little Dell OptiPlex with an i7-4790, 32GB of RAM, and no GPU. There’s also an HP EliteDesk G5 with an i7-9700T, 32GB (expandable to 64GB), again with no GPU. The EliteDesk is likely to remain on Windows 11, as I have a job prospect that would involve using Sentinel Visualizer.
The direction I am headed involves ArangoDB’s LLM + Knowledge Graph work. This platform is an old friend from my Twitter streaming days, one where I’ve already climbed the barrier to entry. A plain LLM builds its responses from a sliding window over the text of the prompt, plus whatever is encoded in the language model itself. That’s fine for crafting English paragraphs, but it breaks down when words and phrases carry complex, interconnected definitions. Human knowledge is a graph of connected concepts.
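To make that concrete, here is a minimal sketch using the python-arango driver. The database name, graph name, entity key, and the "name"/"relation" attributes are all hypothetical; the point is to pull an entity’s neighborhood out of the graph and hand it to the LLM as grounding context, rather than trusting whatever happens to be baked into the model’s weights.

```python
from arango import ArangoClient

# Connect to a local ArangoDB instance -- host, database name, and
# credentials are placeholders.
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("knowledge", username="root", password="changeme")

# Walk one to two hops out from a starting entity in a named graph and
# collect the connected entities. Graph name and entity key are invented.
query = """
FOR v, e IN 1..2 ANY @start GRAPH 'claims'
  RETURN { entity: v.name, via: e.relation }
"""
neighborhood = list(db.aql.execute(query, bind_vars={"start": "entities/acme_corp"}))

# The neighborhood becomes grounding context for the prompt.
context = "\n".join(f"{item['entity']} ({item['via']})" for item in neighborhood)
prompt = f"Using only these related facts:\n{context}\n\nSummarize what we know about Acme Corp."
print(prompt)
```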
Here’s a primo example of how that works. 1,575 of the 8,518 entities in this graph are URLs pointing to top news outlets, court documents, and the like. I think it’s about a million words overall … but the 19,640 links connecting the nearly 7,000 entities to the articles they appear in and to each other? Producing results out of a structure like this takes the human that put it together.
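As a hypothetical illustration of why those links are the valuable part, here is one way to ask the graph which entities the curation has connected most heavily; the “links” edge collection name is invented.

```python
from arango import ArangoClient

# Same placeholder connection as the sketch above.
db = ArangoClient(hosts="http://localhost:8529").db(
    "knowledge", username="root", password="changeme"
)

# Count each entity's outgoing links and list the ten best connected.
top_hubs = db.aql.execute(
    """
    FOR link IN links
      COLLECT entity = link._from WITH COUNT INTO degree
      SORT degree DESC
      LIMIT 10
      RETURN { entity: entity, degree: degree }
    """
)
for hub in top_hubs:
    print(hub["entity"], hub["degree"])
```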
Human-written content matters for LLMs; human-curated content matters for knowledge graphs. If you’ve read any of these posts, you know I’m starting to take a jaundiced view of LLM effects on human networks.
Foreshadowing:
This has been on my mind for a while … there are different facets to this hypercube of a problem.
What’s the solution here? Well … there will be a humanity-focused angle to Shall We Play A Game? If we can ensure the curation feeding a knowledge graph is free of AI slop, that would be a big start. Even if an LLM’s “knowledge” implodes, a well-curated structure backing it should still permit quality results.
Conclusion:
It’s May. The shelves are going to be empty soon, as empty as my typically thin purse. There are a couple of work things on the horizon, and both have document/knowledge-graph angles to them.
So maybe I’m going to have a non-hobby reason to dig into this stuff … which would be nice.
Hearing the Substack reader speak that regex aloud is extremely interesting.