CuratorCore: Parabeagle’s New Foundation
Refactoring my growing tool set.
Having Parabeagled My Substack a few days ago, I’ve been chipping away at the Parabeagle codebase, and it’s become obvious …
I’ve got to refactor it, because it needs to be a proper Python package on PyPI(!)
CuratorCore is just an empty GitHub repo for now, but this abstraction of ChromaDB, built for paralegals and others doing analytical exploration of large document caches, IS what’s next.
Attention Conservation Notice:
Just vibe coding some Python stuff. If you don’t do this, it’s probably best to move on to something you actually want to read.
Assessment:
Parabeagle started out specifically for court cases. I have one that I’m dealing with as part of my Brand Defense Strategist duties with Cicada3301, and there were a couple of others I needed to review last fall. I didn’t want them mixed together, and loading and unloading them into a Chroma instance was a nonstarter. That’s how Parabeagle was birthed.
Let’s Call it Poogle is a tongue-in-cheek read on what’s gone so badly wrong with Google’s search. I was looking for something on this Substack, but with 773,000 words across 890 articles at the time, I could only find the few articles I’d bookmarked because I reference them frequently. Making it accessible took an afternoon of coding, and then an hour of runtime on an M1 Mac. Once it was properly captured, finding the things I was seeking took just a couple of minutes.
So that is two similar problems, and lurking in the background is my urge to reboot Disinfodrome. I’ve spent time with Claude digging into Open Semantic Search, which I already know well enough to build by hand. I cobbled up Solr7-mcp as a means to talk to OSS, but it’s pretty sad when applied to large groups of documents, exploding context windows even on well-provisioned models.
Planning:
Regular readers will know that I am a daily follower of the work of Nate B. Jones. Below is a modification of one of his “Messy Idea Management” prompts that’s been through a couple of volleys of editing between Claude and me. I explicitly told Claude we were just planning. As I described in Harnessing Agent Chaos, using the system is like walking a brace of huskies, so I keep ‘em in their kennel while I provide instruction.
This is the first step towards extracting the common functions needed by Parabeagle and SubstackSavor into the CuratorCore package.
Conclusion:
This is a small but interesting (at least to me) software project, but there’s a much larger thing lurking.
I’ve been a reader/doer my whole career; now I have to become a delegator/evaluator.
This is THE problem for 2026: the need to master agentification. I can see what this means, but I haven’t accomplished a lot. I do have a lot of volunteer management experience in the context of social movements, but delegating to small groups of willing humans is very different from harnessing LLM(s). The technology work I’ve done on my own still leans heavily on my integrator sensibilities, rather than on my ability to get instructions written in markdown I can feed to a system like Antigravity or Claude Code.
One of the things I am doing to facilitate this is taking on toolsmith duties for small groups - my startup, an NGO, and the Cicadians helping with the new puzzle, Shall We Play A Game? This is the right thing, it just feels … strange. I have to pay less attention than I normally would to a much broader range of activity than I would normally take on.
When I compare my progress to others, I see at an intellectual level that I am moving quickly. I absolutely do NOT feel this, quite the opposite - every day is an overwhelming onslaught of change.
CuratorCore Prompt:
# Parabeagle Core Abstraction Plan v2
---
## Vision
Parabeagle is mutating. The name is evocative for paralegal/court case work, but there are broader applications. The goal is to:
1. **Extract core functionality** into a reusable Python package
2. **Preserve domain-specific interfaces** with their own branding
3. **Enable multiple specialized tools** built on the same foundation
4. Be mindful that a team environment with MindsDB is just over the horizon.
---
## Terminology Standard
ChromaDB’s use of “document” for text chunks confuses non-programmers who expect “document” to mean “file.” We establish clear, consistent terminology:
### The Filing Cabinet Metaphor
| Term | Definition | Physical Analog |
|------|------------|-----------------|
| **CABINET** | Top-level container for all user data | A filing cabinet |
| **DRAWER** | A project or case containing related work | A drawer in the cabinet |
| **FOLDER** | A logical grouping of files within a drawer | A folder in the drawer |
| **FILE** | An actual PDF or other document on disk | A piece of paper |
| **CHUNK** | A semantic unit of text extracted from a file | A highlighted passage |
| **COLLECTION** | A group of chunks with shared embeddings | An index (see scoping below) |
### Collection Scoping
A COLLECTION can exist at multiple levels:
| Scope | Contains | Auto-Created |
|-------|----------|-------------|
| **FOLDER** | Chunks from all FILES in one FOLDER | Yes |
| **DRAWER** | Chunks from all FILES across all FOLDERS in one DRAWER | Yes |
| **CABINET** | Chunks from all FILES in the entire CABINET | Yes |
| **Custom** | User-defined selection across FOLDER(S), DRAWER(S), or metadata filters | No |
**Key insight:** A FILE’s CHUNKS will appear in at least three auto-created COLLECTIONS (FOLDER, DRAWER, CABINET). Custom COLLECTIONS allow users to create cross-cutting indexes based on metadata, topics, or arbitrary selection.
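
The scoping rule above can be sketched in a few lines. This is only an illustration: `FileLocation`, `auto_collections`, and the slash-separated labels are hypothetical names invented here, not anything from the Parabeagle codebase, and the labels are not meant to be valid ChromaDB collection names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileLocation:
    """Where a FILE sits in the hierarchy (hypothetical helper type)."""
    cabinet: str
    drawer: str
    folder: str


def auto_collections(loc: FileLocation) -> list[str]:
    """Return labels for the three auto-created COLLECTIONS a FILE's CHUNKS join."""
    return [
        loc.cabinet,                                 # CABINET scope
        f"{loc.cabinet}/{loc.drawer}",               # DRAWER scope
        f"{loc.cabinet}/{loc.drawer}/{loc.folder}",  # FOLDER scope
    ]


loc = FileLocation(cabinet="main", drawer="smith-v-jones", folder="exhibits")
for label in auto_collections(loc):
    print(label)
```

Custom COLLECTIONS are left out on purpose, since they are user-defined and cannot be derived from a FILE's location alone.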
### Terminology Rules
- ✅ **CHUNK** — what ChromaDB calls “document”
- ✅ **FILE** — a PDF or other file on disk (never “document”)
- ✅ **DRAWER** — what legal professionals call a “case”; avoids an overloaded term and maintains the visual metaphor of filing.
- ✅ **COLLECTION** — a searchable index of chunks
- ❌ **DOCUMENT** — never use this word in our software or documentation, beyond initial explanation of why it’s not used.
- ❌ **CASE** — never use this word in our software or documentation, beyond initial explanation of why it’s not used.
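
One way to enforce these rules would be a small check run over user-facing strings and docs. The sketch below is an assumption about how that might look, with `BANNED_TERMS` and `find_banned_terms` as made-up names. It is deliberately crude: it matches whole words only, so "documentation" and "lowercase" pass, but a legitimate phrase such as "use case" would still be flagged and need a human decision.

```python
import re

# Terms the plan bans from software and documentation (hypothetical constant).
BANNED_TERMS = ("document", "case")

# Whole-word match, optional plural "s", ignoring capitalization.
_PATTERN = re.compile(r"\b(" + "|".join(BANNED_TERMS) + r")s?\b", re.IGNORECASE)


def find_banned_terms(text: str) -> list[tuple[int, str]]:
    """Return (line_number, matched_word) for every banned term found in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


sample = "Open the FILE.\nEach Document has chunks.\nSee the documentation; lowercase is fine."
print(find_banned_terms(sample))
```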
### Operations
| Operation | Description |
|-----------|-------------|
| **EXPORT** | Bundle FILES and COLLECTIONS into a shareable archive (.zip) |
| **IMPORT** | Receive an archive and integrate into the user’s system |
| **DEFANG** | Display URLs safely as `example[.]com` (infosec convention) |
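
A minimal sketch of DEFANG, assuming the simplest reading of the convention: every dot in the host portion becomes `[.]`, while the path is left alone. The plan only shows `example[.]com`, so the handling of subdomains and paths here is an assumption, and `defang_url` is a hypothetical name. URLs with a query string but no path are not handled specially.

```python
def defang_url(url: str) -> str:
    """Make a URL non-clickable by rewriting dots in its host as "[.]"."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        # No scheme present: treat the whole string as host[/path].
        scheme, rest = "", url
    host, slash, tail = rest.partition("/")
    safe_host = host.replace(".", "[.]")
    return f"{scheme}{sep}{safe_host}{slash}{tail}"


print(defang_url("example.com"))
print(defang_url("https://sub.example.com/a/b.pdf"))
```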
---
## Data Model Hierarchy
```
CABINET (root, respects $CHROMAFILES)
├── CABINET COLLECTION (all chunks in cabinet)
└── DRAWER (project)
    ├── DRAWER COLLECTION (all chunks in drawer)
    └── FOLDER (logical grouping)
        ├── FOLDER COLLECTION (chunks from this folder only)
        └── FILE (PDF on disk)
            └── CHUNK (text passage, stored in ChromaDB)

CUSTOM COLLECTION (user-defined, can span any combination)
```
**Note:** The three auto-created COLLECTION types (FOLDER, DRAWER, CABINET) ensure every CHUNK is searchable at multiple granularities. Custom COLLECTIONS are user-created for specialized queries.
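
The hierarchy could be modeled with plain dataclasses, as in the sketch below. Every class, field, and function name here is hypothetical, since CuratorCore has no code yet. Each entity gets a random UUID, in line with the plan's requirement for UUID-based identity, and the three helper functions show how the same CHUNKS become visible at FOLDER, DRAWER, and CABINET granularity.

```python
import uuid
from dataclasses import dataclass, field
from typing import Iterator


def new_id() -> str:
    """Random UUID as a string, so identity never depends on a filename."""
    return str(uuid.uuid4())


@dataclass
class Chunk:
    text: str
    id: str = field(default_factory=new_id)


@dataclass
class File:
    relative_path: str  # stored relative to the $CHROMAFILES root
    chunks: list[Chunk] = field(default_factory=list)
    id: str = field(default_factory=new_id)


@dataclass
class Folder:
    name: str
    files: list[File] = field(default_factory=list)
    id: str = field(default_factory=new_id)


@dataclass
class Drawer:
    name: str
    folders: list[Folder] = field(default_factory=list)
    id: str = field(default_factory=new_id)


@dataclass
class Cabinet:
    name: str
    drawers: list[Drawer] = field(default_factory=list)
    id: str = field(default_factory=new_id)


def folder_chunks(folder: Folder) -> Iterator[Chunk]:
    """CHUNKS visible to a FOLDER-scoped COLLECTION."""
    for f in folder.files:
        yield from f.chunks


def drawer_chunks(drawer: Drawer) -> Iterator[Chunk]:
    """CHUNKS visible to a DRAWER-scoped COLLECTION."""
    for folder in drawer.folders:
        yield from folder_chunks(folder)


def cabinet_chunks(cabinet: Cabinet) -> Iterator[Chunk]:
    """CHUNKS visible to the CABINET-scoped COLLECTION."""
    for drawer in cabinet.drawers:
        yield from drawer_chunks(drawer)
```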
### Key Design Insights
1. **DRAWER/FOLDER/FILE hierarchy** — Court work requires multiple logically separate folders within a case (drawer). A flat folder structure is insufficient.
2. **Collections span a whole drawer** — A COLLECTION should default to indexing all CHUNKS from all FILES across an entire DRAWER, though single-FOLDER collections remain possible.
3. **Portable via $CHROMAFILES** — Person A creates a DRAWER, exports it, shares with Person B who imports it. The system must correctly resolve paths relative to each user’s `$CHROMAFILES` environment variable.
4. **Source links in responses** — MCP server responses must include clickable links to the source FILE in its FOLDER. This requires proper `$CHROMAFILES` support.
   > **Runtime path resolution:** When displaying file paths, the `$CHROMAFILES` environment variable is prepended at runtime — NOT during IMPORT. All stored paths are relative to the CHROMAFILES root, making archives portable between users.
5. **URL metadata preservation** — If a FILE was fetched from a URL (e.g., Substack), that URL must be preserved in metadata and shown to users.
   - **Archive.org integration** — Option to check if URL exists in Wayback Machine
   - **Auto-archive** — Option to submit URL to archive.org if not already preserved
   > **Preservation must be optional:** In legal contexts, a paralegal may want to preserve evidence but explicitly NOT create a public archive that could assist opposing counsel. Ephemeral content is sometimes strategically valuable.
6. **DEFANG option** — Available at FOLDER or DRAWER level. Displays URLs as `example[.]com` for safe handling of potentially malicious links.
7. **Flexible EXPORT/IMPORT scope** — Must support:
   - Single FOLDER
   - Entire DRAWER
   - Full CABINET
8. **UUID-based identity** — For MindsDB integration and multi-CABINET scenarios, all entities must use UUIDs (not filenames or auto-increment IDs) to avoid collisions.
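
The runtime path resolution described in the list above might look like the following sketch. The function names are hypothetical, and using a `file://` URI as the "clickable link" is an assumption about what an MCP client would render. The same stored relative path resolves to a different absolute location for each user, which is what makes an archive portable.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_file_path(relative_path: str,
                      env: Optional[Mapping[str, str]] = None) -> Path:
    """Join a stored relative path onto the current user's $CHROMAFILES root."""
    env = os.environ if env is None else env
    root = env.get("CHROMAFILES")
    if not root:
        raise RuntimeError("CHROMAFILES is not set")
    rel = Path(relative_path)
    # Stored paths must stay inside the root: no absolute paths, no "..".
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"not a safe relative path: {relative_path!r}")
    return Path(root) / rel


def source_link(relative_path: str,
                env: Optional[Mapping[str, str]] = None) -> str:
    """Build a file:// link to the source FILE for display in a response."""
    return resolve_file_path(relative_path, env).absolute().as_uri()


stored = "drawer-1/exhibits/a.pdf"  # what the database would keep
print(resolve_file_path(stored, {"CHROMAFILES": "/home/alice/files"}))
print(resolve_file_path(stored, {"CHROMAFILES": "/Users/bob/chroma"}))
```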
---
## Use Cases After Abstraction
| Product | Domain | Key Capability |
|---------|--------|----------------|
| **Parabeagle** | Legal/Court | Case management with DRAWER hierarchy for paralegals |
| **SubstackSavor** | Research | Newsletter archive collection (extensible to other sites) |
| **Intel Cache** | Intelligence | Document-based analysis (requires OCR) |
| **Maltego Curator** | OSINT | Document curation for graph analysis workflows |
---
## Problem Statement
Analysts and curators spend hours manually organizing, chunking, and searching document collections. They cobble together:
- Expensive paid solutions
- Misapplied free tools
- Brittle custom scripts
The cognitive load of managing the toolchain prevents them from doing actual analysis.
### Who It’s For
| Attribute | Description |
|-----------|-------------|
| **Primary user** | Analyst/curator managing document collections |
| **Current workaround** | Mix of paid solutions, incorrect free tools, custom scripting |
| **Pain point** | Too much *doing* curation work, not enough *analyzing* content |
---
## Technical Approach
### The Simplest Version
Abstract the storage layer into a Python package that multiple MCP servers can import.
**Package name:** `CuratorCore`
- PyPI package: `curatorcore`
- Import: `from curatorcore import ...`
- Must NOT reference Parabeagle (which is its own branded product)
- Evokes the curation/filing metaphor
- Clean namespace for PyPI
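
To make "a storage layer that multiple MCP servers can import" concrete, here is a toy sketch. None of it is the real `curatorcore` API, which does not exist yet: `ChunkStore`, `InMemoryStore`, and the two ingest functions are hypothetical, and the in-memory store uses substring matching where the real one would use ChromaDB embeddings. The point is only that both front ends depend on the same small interface and neither mentions the other.

```python
from typing import Protocol


class ChunkStore(Protocol):
    """The shared interface a core package might expose (hypothetical)."""

    def add(self, collection: str, chunk_id: str, text: str) -> None: ...

    def search(self, collection: str, query: str) -> list[str]: ...


class InMemoryStore:
    """Toy stand-in for a ChromaDB-backed store: substring match, no embeddings."""

    def __init__(self) -> None:
        self._data: dict[str, dict[str, str]] = {}

    def add(self, collection: str, chunk_id: str, text: str) -> None:
        self._data.setdefault(collection, {})[chunk_id] = text

    def search(self, collection: str, query: str) -> list[str]:
        needle = query.lower()
        chunks = self._data.get(collection, {})
        return [cid for cid, text in chunks.items() if needle in text.lower()]


def ingest_filing(store: ChunkStore, drawer: str, chunk_id: str, text: str) -> None:
    """What a Parabeagle-style front end might do with the shared store."""
    store.add(f"drawer:{drawer}", chunk_id, text)


def ingest_post(store: ChunkStore, newsletter: str, chunk_id: str, text: str) -> None:
    """What a SubstackSavor-style front end might do with the same store."""
    store.add(f"newsletter:{newsletter}", chunk_id, text)


store = InMemoryStore()
ingest_filing(store, "drawer-1", "c1", "Motion to dismiss")
ingest_post(store, "my-substack", "c2", "Let's call it Poogle")
print(store.search("drawer:drawer-1", "motion"))
```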
### Success Criteria
- [x] Core functions identified and isolated
- [ ] `curatorcore` package builds and passes tests independently
- [ ] Parabeagle MCP imports `curatorcore` instead of monolithic code
- [ ] SubstackSavor imports `curatorcore` and works standalone
- [ ] Both installable via `uv` or `pip`
- [ ] EXPORT/IMPORT works between different users’ systems
### Must-Haves
- Separate `curatorcore` package with CABINET/DRAWER/FOLDER/FILE/CHUNK/COLLECTION model
- Parabeagle and SubstackSavor as distinct repos depending on `curatorcore`
- Publishable Python package (PyPI or private index)
- Consistent terminology throughout (no “document” anywhere)
- UUID-based entity identification
- $CHROMAFILES environment variable support
### Nice-to-Haves
- OCR capability for scanned PDFs
- Multi-COLLECTION search across a DRAWER
- OPML batch import for Substack feeds
- DEFANG toggle at FOLDER/DRAWER level
### Out of Scope
- Team/multi-user features
- Cloud hosting or SaaS deployment
- Authentication/authorization systems
- Mobile interfaces
---
## Reality Check
| Assessment | Status |
|------------|--------|
| Realistic to build | ✅ Single developer, existing codebase, clear boundaries |
| Requires careful API design | ⚠️ Core package must be stable before dependents |
| May uncover hidden dependencies | ⚠️ Extraction often reveals coupling |
---
## Build Plan
| Session | Duration | Goal |
|---------|----------|------|
| 1 | 2-3 hours | Map current codebase, identify core vs. interface code |
| 2 | 2-3 hours | Extract core into `curatorcore` package structure |
| 3 | 1-2 hours | Refactor Parabeagle MCP to import from package |
| 4 | 1-2 hours | Create SubstackSavor as separate project using package |
| 5 | 1-2 hours | Implement EXPORT/IMPORT with $CHROMAFILES support |
**Stop when:** Both applications work independently with shared core, and EXPORT/IMPORT enables file sharing between users.
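
As a rough sketch of the session 5 goal, the round trip below exports one FOLDER from one user's root and imports it under another's. It covers FILES only. Bundling the COLLECTIONS (the embeddings) is left out, and the `manifest.json` layout and both function names are assumptions made for illustration. Paths inside the archive are stored relative to the CHROMAFILES root, so nothing user-specific travels with the zip.

```python
import json
import zipfile
from pathlib import Path

MANIFEST_NAME = "manifest.json"  # hypothetical manifest file inside the archive


def export_folder(chromafiles_root: Path, relative_folder: str,
                  archive_path: Path) -> list[str]:
    """Zip every FILE under root/relative_folder, with root-relative names."""
    root = Path(chromafiles_root)
    members = sorted(
        p.relative_to(root).as_posix()
        for p in (root / relative_folder).rglob("*")
        if p.is_file()
    )
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for rel in members:
            zf.write(root / rel, arcname=rel)
        zf.writestr(MANIFEST_NAME, json.dumps({"files": members}))
    return members


def import_archive(archive_path: Path, chromafiles_root: Path) -> list[str]:
    """Unpack an archive under the importing user's own root."""
    root = Path(chromafiles_root)
    with zipfile.ZipFile(archive_path) as zf:
        members = json.loads(zf.read(MANIFEST_NAME))["files"]
        for rel in members:
            rel_path = Path(rel)
            # Refuse entries that would land outside the root.
            if rel_path.is_absolute() or ".." in rel_path.parts:
                raise ValueError(f"unsafe archive entry: {rel!r}")
            target = root / rel_path
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(zf.read(rel))
    return members
```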
---
## One Sentence
> This refactoring lets **analysts** use **purpose-built tools for their domain** instead of **one monolithic system trying to serve everyone**.


