AI On Your Hardware
You probably don’t need this.
When I became interested in AI nine months ago, I had some vague notion that it would assist me in analytical tasks and programming. Things have changed a number of times since then, and a recent Reddit post from a newcomer brought me back to the topic.
BLUF: If you’re a professional looking for career advice, get yourself Claude Pro so you can check out Code and Cowork, and start envisioning a three-digit monthly bill for Claude Max. AI costs what turn-of-the-century cell phone service did, so get used to it. Unless your job duties involve buying AI inference at scale, there’s little point in trying to run it on your own hardware.
Attention Conservation Notice:
There will be lots of numbers and hardware architecture stuff, and it’s not going to be all that well organized. The above BLUF is all you need to know, unless you’re trying to build an AI-based solution for your employer, or starting an AI-focused company.
Services:
The frontier models - Google’s Gemini, Anthropic’s Opus/Sonnet, and the ChatGPT-5 melange - all run on VERY capable hardware. An Nvidia DGX H200 rack-mount system has eight GPUs with 141GB of memory each, and that’s a bottom-of-the-barrel box; there was a recent announcement about Anthropic agreeing to use Amazon Trainium chips on the input end and the dinner-plate-sized Cerebras processors on the output side. A million tokens - that’s roughly 800,000 words - will take between 50GB and 300GB of memory, on top of whatever the model itself requires.
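That 50GB-to-300GB range can be sanity-checked with back-of-the-envelope KV-cache math: each token stores a key and a value vector per transformer layer. The layer counts, head shapes, and precisions below are illustrative assumptions, not the specs of any particular frontier model.

```python
# Rough KV-cache sizing. Each token stores one key and one value vector
# per layer, hence the factor of 2. All model shapes here are invented
# for illustration; they are not published frontier-model specs.

def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    # bytes_per_value: 2 for fp16/bf16, 1 for fp8 cache
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value / 1e9

# Hypothetical compact model, fp8 KV cache:
low = kv_cache_gb(1_000_000, layers=24, kv_heads=8, head_dim=128, bytes_per_value=1)
# Hypothetical deep model, fp16 KV cache:
high = kv_cache_gb(1_000_000, layers=72, kv_heads=8, head_dim=128, bytes_per_value=2)
print(f"{low:.0f} GB to {high:.0f} GB per million tokens")  # 49 GB to 295 GB
```

Squint at the two ends and you land right around the 50GB-300GB spread quoted above, which is why long contexts eat server memory so fast.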
You cannot begin to compete with what a frontier model does by purchasing your own equipment.
Motivation:
I have reason to run models locally. The specifics include:
Bulk embedding of documents and chats.
Benchmarking and performance testing medical-specific models.
Understanding hybrid in-house/provider inference routing.
Simulating mobile device inference performance.
Exploring self-hosted model security issues.
Tuning models for my specific workloads.
As you can see, I need to understand how things I’ll be buying in bulk will behave, I need to touch models that don’t exist as services, and perhaps create “fine tunes” of my own.
Hardware:
Today I’m pecking away on an M1 Mac Pro; under my desk sits a system with a single Nvidia RTX 5060Ti. Once we close on funding, I could see at least some of the following happening:
Old Mac gives way to 32GB M4 or M5 Mac Air.
RTX 5060Ti 16GB in my Proxmox workstation is joined by a twin.
AMD Max+ 395 128GB desktop joins the mix.
Nvidia DGX Spark 128GB replaces my Proxmox desktop.
Mac Studio w/ 128GB or 256GB for exploration/training.
Nvidia RTX 6000 Pro 96GB GPU joins RTX 5060Ti in Proxmox workstation.
The only thing wrong with my current Mac is that there’s someone I know whose existing Mac is at death’s door. I am going to do something that lets me run larger models here at home. In order of increasing cost, the Max+ 395 128GB ($2,200), the DGX Spark 128GB ($4,500), a Mac Studio ($4,000 to $10,000 depending on configuration), and the RTX 6000 Pro ($9,200) are all possibilities. The 128GB unified-memory systems offer more raw capacity than the dedicated 96GB GPU.
There is a distinct possibility that, even if I have plenty of money, only the first three get done. That would get me visibility into the three major AI ecosystems (AMD, Apple, Nvidia) for the price of just one of the three more expensive options. I cringe a bit to say this (I love owning my own gear), but the capital cost just doesn’t make sense. The $9,200 acquisition cost of the RTX 6000 Pro must be compared against the roughly $3/hour cost of renting a dramatically more capable H200.
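The buy-versus-rent math is worth making explicit, using only the figures from the text; the hours-per-month workload below is my own assumption for illustration.

```python
# Buy-vs-rent break-even, using the figures in the text:
# $9,200 for an RTX 6000 Pro vs roughly $3/hour to rent an H200.
card_cost = 9200        # USD, one-time purchase
rental_rate = 3.0       # USD per hour, rented H200

breakeven_hours = card_cost / rental_rate
print(f"Break-even: {breakeven_hours:.0f} rental hours")  # Break-even: 3067 rental hours

# Assume a heavy 160 hours of real GPU work per month (my number,
# not the author's): that's well over a year before owning wins out,
# and the rented H200 is the far more capable machine the whole time.
print(f"Months at 160 h/month: {breakeven_hours / 160:.1f}")  # 19.2
```

And that ignores power, depreciation, and the fact that rented capacity scales to eight GPUs when a job needs it.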
Conclusion:
Last fall I had a dying HP Z420 with an archaic 6GB Nvidia GTX 1060. The $500 HP Z4 that replaced it has been a good investment, steadily earning its keep. The $500 RTX 5060Ti? It’s satisfied my curiosity but it’s sat idle 99% of the time. With the benefit of hindsight, an RTX 3050 would have done that for half the cost. Buying a second 5060Ti would be very specifically for satisfying my curiosity about vLLM.
I suspect that when I have 96GB of GPU memory it’ll end up running Qwen3.5-122B-A10B, a 122-billion-parameter mixture-of-experts model that activates just 10 billion parameters per token. I’d put it in an agent setup like IronClaw, and its job would be standing watch over our production system.
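Why a 122B-parameter model fits in 96GB at all comes down to quantization: weight storage scales with total parameters and bits per weight, while per-token compute tracks only the ~10B active parameters. The model name is taken as given from the text; the byte sizes below are generic assumptions, not its published specs.

```python
# Weight-memory footprint of a 122B-total / 10B-active MoE model
# at common quantization levels. Bit widths are generic assumptions.

def weights_gb(params_billion, bits_per_weight):
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

total_params, active_params = 122, 10  # billions, from the text

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weights_gb(total_params, bits):.0f} GB of weights")
# 16-bit: 244 GB  -> does not fit in 96 GB
#  8-bit: 122 GB  -> still does not fit
#  4-bit:  61 GB  -> fits, with headroom left for KV cache
```

Only the 4-bit quantization lands under 96GB, which is the usual way these large MoE models end up on a single workstation card.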
But like I said at the start, unless you are an intense tinkerer, or your work duties involve understanding things at a volume-purchaser level, let your AI spend be monthly services that get you access to frontier models. Services I see in my future include:
Claude Max - perhaps $200/month level.
Perplexity Max at $200/month for at least a month.
Perplexity Pro at $20/month if Max doesn’t get used enough.
Exa - best business intel search I’ve used thus far.
Minimal Gemini & ChatGPT just so I stay familiar.
Maybe an API token spend for an agent, rather than local GPU.
Right now I’ve got what looks like a turn-of-the-century cell phone bill. I expect it to grow into a turn-of-the-century car payment: several hundred dollars monthly.
If you are not running this hard, if you’re not “self-disrupting” in order to stay on top of things, 2026 is the year you’ll get hit from the shadows by someone like me who’s been eyeing your revenue streams.