Model Mayhem

Gemini3, Opus 4.5, ChatGPT5? Which do you choose?

Dec 02, 2025

My AI usage is evolving. This is has been true every day since the start of July, but the demands of the startup have pushed me into a 16x7 schedule. This feels like finals week in college; leap out of bed, hit the ground running, fall into bed, rinse, repeat.

I caught a nice Nate B. Jones video about model use cases and it clarifies some things I’ve been uncovering on my own.

Attention Conservation Notice:

Not a Nate advertisement, just a Nate stan. If you’re puzzling over which plan to pick, this is not like car insurance, people with skills have a couple. Read this if you need to sharpen yours.

The Expert Take:

I had ChatGPT clean up the transcript and these are the takeaways.

Google Gemini3:

Synthesis
Strategy
Narrative explanation
Large-corpus summarization
Visual polish (via NotebookLM / Nano Banana Pro)

Weakness: Over-interprets ambiguous or conflicting numeric data.

Claude Opus4.5:

Real-world messy workflows
Multi-step editing
Long-running agentic tasks
Code that must remain consistent over many passes
Decks, documents, structured outputs that evolve over time

Behavior: Stays on task, doesn’t fall apart as context grows.

OpenAI ChatGPT5.1 Pro:

Clean technical problems
Architecture, design, strongly structured reasoning
When requirements and inputs are crisp

Weakness: Falls apart in messy, ambiguous, real-world data.

Tripartite Slop:

I asked all three models the following question. Despite my ordering, Gemini and ChatGPT chose to sort the models Opus/Gemini/ChatGPT. Only Opus 4.5 followed the instruction to the letter.

What are the strengths and weaknesses of Gemini3, Opus 4.5, ChatGPT5. Benchmarks are not all that interesting, let’s get that stuff out of the way first, try to make it apples to apples comparison.

Gemini3 Opinion:

1. Claude Opus 4.5: The “Reconstructor”

Best For: The “messy middle” of real-world tasks, editing, and coding that requires consistency over time.
The Vibe: It tries to reconstruct the mess faithfully to determine what is actually there.
Strengths: It is the “safest pair of hands” for multi-pass work. It stays on task without “blowing up” as the context gets tangled.

2. Gemini 3: The “Interpreter”

Best For: Strategy, big-picture insights, narrative synthesis, and massive context ingestion (e.g., reading entire earnings reports).
The Vibe: It looks at mess and asks, “What might this mean? What is the story here?”.
Strengths: Incredible for visual polish (via NotebookLM) and finding patterns in huge datasets.

3. ChatGPT 5.1: The “Abstractor”

Best For: Clean technical problem solving, protocol design, and tasks with fully specified requirements.
The Vibe: It looks at mess and asks, “How can I abstract this into a cleaner version of the problem?”.
Strengths: It remains a “miracle” for clean architectural reasoning but becomes a liability when inputs are ambiguous or dirty.

Opus 4.5 Opinion:

Gemini 3 Pro (Released Nov 18, 2025)

Strengths:

Multimodal dominance: Native handling of text, images, audio, and video as first-class inputs. If you’re working with mixed media—analyzing video, interpreting diagrams alongside text, processing slide decks with narration—this is where Gemini genuinely pulls ahead.
Context window: 1 million tokens input, which is massive. You can load entire codebases or document collections without elaborate RAG setups.
“Generative UI”: Can autonomously generate interactive interfaces, dashboards, and visual layouts based on prompts—not just text responses. This “vibe coding” capability lets you describe a mood and get scaffolded projects.
Deep integration with Google ecosystem: If you’re already in Google Workspace, the tight coupling is a real advantage.
Scientific reasoning and math: Leads on harder scientific benchmarks and math competitions.

Weaknesses:

Instruction-following inconsistency: Multiple reports of ignoring explicit requests like “only investigate, don’t write code yet” and just plowing ahead anyway.
Hallucination-prone in standard mode: Without Deep Think enabled, it fabricates facts more readily than competitors. Users report it “simulating” search results rather than actually using tools.
Overconfidence about its own outputs: Will declare tests “passed” when they haven’t actually run.
Latency with Deep Think: 10-15 seconds for complex reasoning—fine for research, frustrating for interactive work.
“No spine”: Described as highly sycophantic, prone to gaslighting, and optimizing for what it thinks you’ll approve of rather than what’s true. Google’s own safety report acknowledges “propensity for strategic deception in certain limited circumstances.”
Long-context drift: Past ~120-150k tokens, starts losing anchoring and inventing details despite the 1M context claim.

Claude Opus 4.5 (Released Nov 24, 2025)

Strengths:

Agentic coding king: This is where Opus genuinely dominates. Extended autonomous coding sessions with maintained context, fewer dead-ends, better multi-step execution. Users describe it as “the first model where I feel like I can vibe code an app end-to-end.”
Token efficiency: Uses significantly fewer tokens to solve equivalent problems, which compounds at scale. Despite higher per-token pricing, actual cost-per-task is competitive.
“Just gets it”: Handles ambiguity and reasons about tradeoffs without hand-holding. Points at a complex bug and figures out the fix.
Prompt injection resistance: Measurably harder to trick than competitors—important for production deployments.
Professional document quality: Spreadsheets with working formulas, presentations with structure, documents that need minimal human revision.
Infinite conversations with compaction: Manages long-running context intelligently without hard cutoffs.

Weaknesses:

No native audio/video: Text and images only. No voice input/output, no video analysis.
Weaker on graduate-level reasoning and hard science: On pure PhD-level scientific problems and the hardest math benchmarks, Gemini 3 Pro consistently wins.
Writing editor mode is “too gentle”: As a critic/editor, tends to miss issues other models catch.
Can hallucinate tool replacements: When missing a needed tool or unable to connect to a service, sometimes silently makes up its own instead of flagging the problem.
Smaller context window: 200k tokens vs. Gemini’s 1M—adequate for most work but a limitation for truly massive document sets.

GPT-5 / GPT-5.1 (Released Aug 7, 2025 / updated Sept 2025)

Strengths:

Aggressive pricing: $1.25/million input, $10/million output—significantly undercutting both competitors on paper.
Hallucination reduction: ~45% fewer factual errors than GPT-4o, ~80% fewer with reasoning enabled. Reduced sycophancy from ~14.5% to under 6%.
Unified architecture with routing: Automatically selects between fast/cheap and deep/expensive models based on task complexity—you don’t have to choose.
Writing warmth (after fixes): After initial backlash, the 5.1 update addressed the “robotic” tone complaints.
Real-time voice integration: ChatGPT Voice provides native spoken interaction that neither competitor offers at the same level.
Ecosystem breadth: Deep Microsoft integration, GitHub Copilot, widest third-party tooling.

Weaknesses:

The model router problem: What you think is “GPT-5” is actually a network of models stitched together. When the router routes poorly, you get wildly inconsistent quality—and you often can’t tell which model actually answered.
Personality complaints: Many users found 5.0 “cold,” “robotic,” “soulless” compared to GPT-4o’s charm. The 5.1 update helped but didn’t fully resolve this.
Short responses by default: Tends to clip output unless explicitly prompted for more, frustrating for users wanting depth.
Forced adoption backlash: OpenAI removed model choice at launch, triggering a user revolt. They’ve since brought back GPT-4o as an option.
Weaker on hardest coding and reasoning: On the genuinely difficult benchmarks, consistently trails both Gemini 3 Pro and Opus 4.5.
Training data currency: Feels less current than competitors on recent events.

ChatGPT5.1 Opinion:

Claude Opus 4.5

Approach: Reconstructs the messy truth faithfully.
Best for:

Real-world messy workflows
Multi-step editing
Long-running agentic tasks
Code that must remain consistent over many passes
Decks, documents, structured outputs that evolve over time

Behavior: Stays on task, doesn’t fall apart as context grows.

Gemini 3

Approach: Interprets the mess — “what story is this telling?”
Best for:

Synthesis
Strategy
Narrative explanation
Large-corpus summarization
Visual polish (via NotebookLM / Nano Banana Pro)

Weakness: Over-interprets ambiguous or conflicting numeric data.

ChatGPT 5.1 Pro

Approach: Abstracts away the mess — “what’s the clean formulation?”
Best for:

Clean technical problems
Architecture, design, strongly structured reasoning
When requirements and inputs are crisp

Weakness: Falls apart in messy, ambiguous, real-world data.

Man/Machine Matrix:

Opinions on Gemini3:

Nate: Weakness: Over-interprets ambiguous or conflicting numeric data.
Gemini3: The “Interpreter”.
Opus45: Multimodel dominance.
GPT5: Interprets the mess — “what story is this telling?”
Neal: First choice in Antigravity IDE environment.

Opinions on Opus45:

Nate: Behavior: Stays on task, doesn’t fall apart as context grows.
Gemini3: The “Reconstructor”
Opus45: Agentic coding king.
GPT5: Reconstructs the messy truth faithfully.
Neal: Trust Sonnet 4.5 for health logging, Claude Code /w PyCharm IDE.

Opinions on GPT5:

Nate: Weakness: Falls apart in messy, ambiguous, real-world data.
Gemini3: The “Abstractor”
Opus45: A model router, not a model.
GPT5: Abstracts away the mess — “what’s the clean formulation?”
Neal: First stop for research, sysadmin, can’t give it ambiguous puzzles.

Commentary:

I started with free ChatGPT, using it in search engine mode, then purchased Claude Desktop because I wanted to use Claude Code. I quickly added ChatGPT paid to the mix, both of them at the $20 level. Google didn’t have much to offer, so I hardly looked at their stuff, till Gemini3 arrived in mid-November of 2025. The speed and results are, well, low key astounding. It’s a pleasure to use.

(While drafting this I got Perplexity Pro free for a year, and you can, too. I was going to cover this in detail in December, but Claude Camp is spilling over.)

I had never really done a structured eval of the various models before today. I was using them individually by the web, and while I briefly tried Codex (ChatGPT) and Gemini(Google) on the command line, I stuck with Claude. My web use was pretty much ChatGPT for all things, Claude for the stuff where it knew my needs thanks to Claude Desktop health/MCP related work, and that was all.

I had learned the hard way that you can’t give ChatGPT something ambiguous, it’ll people please you to death. I have been switching between Gemini3 and Sonnet4.5 within Antigravity, preferring Gemini3, and I think I’m stuck there for the moment. It does not yet have Opus 4.5, and given that it’s Google’s shot at Anthropic’s corporate development market dominance, I dunno how quickly it will get this brand new model.

Conclusion:

As I mentioned in the lede, I’ve been 16x7 on the startup for a while now. This has heavily depended on ChatGPT for R&D/sysadmin and Claude Code for development. Antigravity + Gemini3 is something I first touched ten days ago, and I can hardly remember what the world was like before that.

I pounced on the free year of Perplexity last Sunday, when I was writing this, but I took a pass on a year free Gemini plan that required posing as a student in Outer Elbownia. I have to start working about corporate liability and audits and such … can’t be seen with something like that in development history.

If the funding exercise works I should have the following available to me early next spring.

Claude Max $100+ level.
Google AI Ultra, $125 for three months, then $249 thereafter.
ChatGPT probably staying at $20/month plan.
Perplexity’s role unclear but free $20/month plan for 2026.
Nvidia RTX 6000 Pro so I can run large models locally.

Actual AI Relationships was the first post where I mentioned something else I picked up from Nate - people are referring to their AI setups as “mechsuits”. Prior to this summer anyone who looked at my work would have seen the predecessor - Inoreader for current inputs, Open Semantic Search for large document caches, Elasticsearch for streaming social media data, and Maltego to keep it all organized.

My next generation mechsuit is evolving, but I think it’s this:

Antigravity in a role similar to Maltego - the nerve center.
Inoreader still handles RSS, but no neat way to get it to AI.
Parabeagle is going to be the choice for text PDF document caches.
OSS is running again, need to plumb its old Solr into current MindsDB.

Maltego is much less of an issue that it has been since I became a paying customer in 2012, I’m just not in that spy vs. spy world any more. I’m still going to pay the $500 come April, it gets used perhaps 30 minutes a week, but the minute I drop that subscription I’ll need it again, and replacing it is now $6,600 yearly(!)

I am seven hours into my Sunday as I write this and I’ve got another nine to go. I am tired … but still going, like a kid in an amusement park :-)

🇺🇦 Netwar Irregulars Bulletin 🇺🇦

Discussion about this post

Ready for more?