Years ago, when conflicts on Twitter got particularly fierce, there were people who would pull out JGAAP - the Java Graphical Authorship Attribution Program. The text of tweets would get fed to this system in an effort to identify who the actual operator of some highly provocative sock puppet was. The system never really worked correctly on 140-character chunks, and it was little better when given many of them. The best use I ever made of it was disqualifying a specific candidate as puppetmaster.
Fast forward to 2024 and once again there is a need to employ Natural Language Processing to sort out who's who. There are good books out there for Python: the Natural Language Toolkit is for experiments, and spaCy is for high-volume work. I've used both of these in my own coding, and spaCy also appears as a component of Open Semantic Search. This case is a little different, since I am not sure what I am doing yet, and I want a broader, more capable environment.
And as a Python guy, that means I just installed Orange Data Mining.
Attention Conservation Notice:
This is going to be pretty technical, but what we’re doing applies to The Online Operation Kill Chain, although there is no one specific phase where it fits.
ODM Introduction:
Orange Data Mining provides a graphical programming environment akin to n8n: you drag various processes from a palette and connect them to form a workflow.
The palette on the left is open to Data, which lets you access CSV, SQL, and other sorts of inputs. Below that area there are five other classes of widgets - Transform, Visualize, Model, Evaluate, and Unsupervised. I got familiar with Orange a couple years ago using their excellent video tutorial series, but it never became a staple for me, so I’m totally in review mode.
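Orange is also importable as an ordinary Python library, which is part of the appeal for a Python guy. Here's a minimal sketch, assuming a stock Orange install, that loads one of the bundled sample datasets from a script instead of the GUI:

```python
# Minimal sketch: Orange used as a plain Python library rather than
# through the GUI. "iris" is one of the sample datasets bundled with
# Orange; any CSV path would work in its place.
import Orange

data = Orange.data.Table("iris")
print(len(data), "rows")
print(data.domain)  # the columns Orange discovered in the data
```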
The text mining examples are six years old and they reference a happier time, when anyone could access the Twitter API. My task in this case involves long-form data coming from Substack. Even so, the examples are short and clear, and you can make out how things will work.
Stylometry Tradecraft:
If you open JGAAP and start poking around you will find MANY different methods for each stage of analysis. Much like NLTK, there has been an endless flood of academic papers on natural language processing, and the algorithms they describe get integrated into various tools. This makes for a confusing environment with a steep learning curve.
What sorts of comparisons are available? Here are a few I know off the top of my head.
Character frequency.
Word frequency.
Character bigrams and trigrams.
Word bigrams and trigrams.
Character frequency literally counts each letter, while word frequency counts each word. The bigrams and trigrams are forms of n-gram analysis: a plain frequency count is a unigram, while bigrams and trigrams count the pairs or trios of characters or words that appear together. This is just barely scratching the surface; if you want to know more there is a seemingly infinite supply of papers like this one, Whodunit? Learning to Contrast for Authorship Attribution.
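To make that concrete, here is a minimal sketch of all four counts using NLTK's ngrams helper; the sample sentence is just a placeholder, not real data:

```python
# Minimal sketch of the four comparisons listed above, using NLTK's
# ngrams helper. The sample text is a placeholder, not real data.
from collections import Counter
from nltk.util import ngrams

text = "the quick brown fox jumps over the lazy dog"
words = text.split()

char_freq = Counter(text.replace(" ", ""))  # character frequency (unigrams)
word_freq = Counter(words)                  # word frequency (unigrams)

char_bigrams = Counter(ngrams(text, 2))     # character pairs
word_trigrams = Counter(ngrams(words, 3))   # word trios

print(char_freq.most_common(3))
print(word_trigrams.most_common(3))
```

An authorship comparison then reduces each text to counts like these and measures how far apart the resulting profiles are.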
As always, I have a LOT of reading to do before I can make sense of the task at hand.
Conclusion:
Have you followed the Claudine Gay plagiarism saga?
And then the plagiarism accusations against Neri Oxman, the wife of billionaire Bill Ackman, the guy who engineered Gay’s ouster?
If you feed many papers to a system that uses a large N for n-gram analysis, with N on the order of a significant fraction of a sentence, you're going to spot sources that were not properly cited.
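A minimal sketch of that idea, assuming word 8-grams as the matching unit; the window size and file names here are illustrative assumptions, not any specific tool's method:

```python
# Minimal sketch of large-N n-gram overlap as a plagiarism signal.
# The 8-word window and the file names are assumptions for
# illustration only.
from nltk.util import ngrams

def shared_ngrams(doc_a, doc_b, n=8):
    """Word n-grams appearing in both documents."""
    grams_a = set(ngrams(doc_a.lower().split(), n))
    grams_b = set(ngrams(doc_b.lower().split(), n))
    return grams_a & grams_b

suspect = open("suspect_paper.txt").read()      # hypothetical inputs
candidate = open("candidate_source.txt").read()
for gram in sorted(shared_ngrams(suspect, candidate)):
    print(" ".join(gram))  # each hit is a verbatim run worth checking
```

Eight consecutive words rarely repeat across independent texts by chance, which is why matching with a large N surfaces uncited borrowing so reliably.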
My initial work in this instance has to do with Substack, but given the overall context I am sure a plagiarism wild hunt will follow. I'm glad to have tasking that will finally make me focus on mastering ODM, but I get the feeling this is going to turn into an endless brawl with a lot of needless casualties.
But this is 2024, and brawling seems to be what we, as a society, are focusing on this year.