Self Hosting LLMs
There's an excellent source of AI-related wisdom on YouTube - @bycloudAI. I'm running on fumes, and this video on the bigger picture of running LLMs locally was really helpful. Let's watch it and then consider how to level up in this area, eliminating subscription costs.
Attention Conservation Notice:
Hopeless nerd mumbling about AI performance particulars and hardware minutiae. Go no further unless you're planning on following this path yourself.
LLMs Of Your Own:
There you have it: everything you might want to know about keeping LLMs as pets. This is an enormous area, and I guess I'm starting to become literate in it.
Local LLMs:
I have limited gear available, and until the startup I'm working on gets funded, this is all there is.
16GB M1 Pro MacBook - main desktop; can maybe get a 5GB model going if I turn off everything else.
8GB M1 MacBook Air - provides 5GB models over the network using LM Studio.
i7-9700T - 32GB desktop that can serve 16GB models over the network with LM Studio, but very, VERY slowly.
Pi5 - 8GB machine; why am I installing Ollama here?
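Serving a model over the network this way means any machine on the LAN can talk to it through LM Studio's OpenAI-compatible REST API (port 1234 by default). A minimal sketch - the host IP and model name below are placeholders, not what my machines actually run:

```python
"""Query an LM Studio instance on another machine via its
OpenAI-compatible REST API. LM Studio listens on port 1234 by
default; host and model here are illustrative placeholders."""
import json
import urllib.request


def build_payload(model: str, prompt: str) -> dict:
    """The minimal request body the chat-completions endpoint expects."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }


def chat(host: str, model: str, prompt: str, port: int = 1234) -> dict:
    """Send a chat-completion request to a networked LM Studio box."""
    req = urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Live call would look like: chat("192.168.1.50", "qwen2.5-7b-instruct", "Hello")
    print(build_payload("qwen2.5-7b-instruct", "Hello"))
```

Ollama on the Pi5 exposes a similar HTTP API (port 11434), so the same pattern applies there.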
I am paying for ChatGPT Plus ($20/mo) just for general usage and I've got Claude Pro ($20/mo) for coding. I had an OpenAI API account but something wiped out my initial $5 investment, I know not what. This is the problem - API usage is a hydra.
Parabeagle was the first place I got aggressive about this problem, replacing OpenAI with a local embedding method. This does not require a lot of RAM and runs well on the M1 Pro. It will process tens of thousands of documents at a rate where I didn't bother to note the run time; it just got stuff done.
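The post doesn't name the embedding stack Parabeagle uses, so here's a generic sketch of the pattern, assuming sentence-transformers (a common choice for local embeddings on Apple Silicon). The point is that nothing touches a paid API - the model runs entirely on the laptop:

```python
"""Local-embedding sketch, assuming sentence-transformers
(pip install sentence-transformers); the model name is a common
small default, not necessarily what Parabeagle uses."""
from math import sqrt


def embed(texts):
    """Embed a list of strings locally - no API key, no per-call cost."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")  # small enough for an M1
    return model.encode(texts)


def cosine(a, b) -> float:
    """Cosine similarity - how retrieved chunks get ranked against a query."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)
```

Usage would be something like `scores = [cosine(embed(["query"])[0], v) for v in doc_vectors]`, with the top-scoring documents fed to the LLM.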
Since there's a mobile component, I'm getting familiar with React Native. I've been directed to climb the Figma learning curve ($20/mo). I'm using it with VSCode, which makes me stabby when I'm compelled to touch it before being properly caffeinated. I started looking at Letta for various reasons, and thus I have a Google Gemini API key. That led to the discovery of Gemini CLI, which does what Claude Code does, at no cost so far.
The React Native TypeScript environment is Figma for design, VSCode for development, and Gemini for the AI assist. This is wholly separate from the Python-specific PyCharm and Claude Code I've been using for Parabeagle. It's all brand new, but I'm finding Gemini pleasing; it's different from Claude Code, but I'm moving through things with it that would have taken me months sans assistance.
The need for a local model was urgent with Parabeagle, and I suspect it will become urgent with the two frameworks I'm using - Letta and MindsDB. I don't care if prototypes are dead slow, as long as they're not draining my limited funds.
Fantasies:
My tired old HP Z420 with its Nvidia GTX 1060 is still sitting here under my test bench, but it hasn't been powered on since the spring of 2024. The CPU lacks the AVX2 feature that modern AI apps demand, and the 6GB GPU is almost ten years old. Think "steam powered car" - it's hot, it's noisy, and you're going nowhere in a hurry.
I asked for a Z440, which has AVX2, and a 16GB Nvidia RTX 5060 Ti. Since then I've been reading, and I think the generation after the Z440, the Z4 G4, would be much better. There is a fairly small premium - we're talking $350 machines on eBay - and they come with the AVX-512 instructions. This is SIMD stuff - Single Instruction, Multiple Data - which is used for matrix processing. It's not as fast as dedicated-purpose GPUs, but it makes a tremendous difference. AVX-512 is a confusing family of instruction sets; it would be better to get an AVX10 system, but that means adding another zero to the price tag. Picking the HP is me just crossing my fingers that a top-tier vendor got the AVX-512 stuff right.
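The "confusing family" problem is concrete: AVX-512 is not one flag but a pile of subsets (avx512f, avx512vl, avx512dq, and so on), and different CPUs ship different combinations. On Linux you can see exactly which subsets a box advertises by reading /proc/cpuinfo - a quick sketch:

```python
"""List the AVX-512 subsets a CPU advertises. On Linux the feature
flags live in /proc/cpuinfo; the parser is split out so it also
works on a saved flags line from another machine."""
import re
from pathlib import Path


def avx512_subsets(flags: str) -> list:
    """Pull the distinct avx512* feature names out of a CPU flags string."""
    return sorted(set(re.findall(r"avx512[a-z0-9_]*", flags)))


def cpu_flags() -> str:
    """Read the flags line from /proc/cpuinfo (Linux only)."""
    for line in Path("/proc/cpuinfo").read_text().splitlines():
        if line.startswith("flags"):
            return line
    return ""


if __name__ == "__main__":
    try:
        subsets = avx512_subsets(cpu_flags())
    except OSError:
        subsets = []  # not Linux, or /proc unavailable
    print(subsets or "no AVX-512 detected (or not Linux)")
```

llama.cpp and friends pick their SIMD kernels based on exactly these flags, which is why a CPU with only a partial subset can behave unexpectedly.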
Apple systems have unified memory - the CPU and GPU share the machine's physical memory. This is not as fast as the dedicated GPU memory of Nvidia cards, but it costs less and it's more flexible. The Nvidia DGX Spark caused a stir when announced, but it's been eclipsed by systems based on AMD's Ryzen AI Max+ 395. Careful here: if you want a full-featured desktop, you have to get a PRO model of that chip, assuming you'll want to do virtualization at some point. Even so, a 128GB unified-memory system for $2,000 beats the heck out of the $3,500 M4 Max Mac Studio. If I were to throw $3,500 at Apple, it would be for a 96GB M3 Ultra Studio - the memory bandwidth is double that of the M4s, and that shows big time in AI-related work.
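Why bandwidth "shows big time": token generation is memory-bound - producing each token streams essentially all the model weights through the chip once, so tokens per second tops out near bandwidth divided by model size. A back-of-envelope sketch (the bandwidth figures are approximate published specs, and this is a ceiling, not a benchmark):

```python
"""Back-of-envelope: decode speed for a memory-bound LLM is roughly
memory bandwidth / model size in bytes, since every generated token
reads all the weights once. Bandwidth numbers are approximate specs."""


def max_tokens_per_sec(bandwidth_gbs: float, model_gb: float) -> float:
    """Rough upper bound on decode speed for a memory-bound model."""
    return bandwidth_gbs / model_gb


# A ~40GB quantized model (e.g. a 70B at ~4.5 bits per weight):
for name, bw in [("M4 Max (~546 GB/s)", 546.0),
                 ("M3 Ultra (~819 GB/s)", 819.0),
                 ("Ryzen AI Max+ 395 (~256 GB/s)", 256.0)]:
    print(f"{name}: ~{max_tokens_per_sec(bw, 40.0):.1f} tok/s ceiling")
```

Real throughput lands below these ceilings, but the ratios hold, which is why doubling bandwidth matters more than almost anything else for local inference.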
The absolute least-cost "get something in here to do a local LLM" option would be an Intel Arc Pro A40, but in addition to the $200 this would involve major surgery on the little Dell Optiplex that does Proxmox/file-server duty. I'd be giving up both Proxmox and the associated 5TB of mirrored storage in exchange for a tiny 6GB GPU. And it's NOT Nvidia. Branching out from CUDA systems is inevitable, but I would not bet the farm on this. All I need is one *whoops* in this area and it'll cost months.
Conclusion:
Morning comes, I make my way to my desk, and I immediately plunge into integration work. I am again serving in the R&D CTO role advertised on my LinkedIn profile. I am occasionally ... startled ... by this. Being so sick for so very long, then being mostly put back together, is ... it's just weird. I got a call today; they're trying to move up my appointment with the allergist, and once that happens I may well be 100% free of constraints.
There is a brewing MAGA meltdown, and I've started to pay attention to the Six Republican Factions. This is just a weird hobby at this point; I put maybe 2,000 hours into the big MAGA graph I kept from 2020-2024, and it's sort of like the final season of Game of Thrones. I'm not THAT into it, but I did sink a vast amount of time and energy into it, so I do kinda want to know how it ends.
I haven't had any new hardware this year, other than swapping a small monitor for the 32" fishbowl on my test bench, which was a nonevent. I'm gonna go out on a limb here and predict that the piecemeal HP workstation will precede any funding leading to some fabulous Apple gadget here on my desk.
But it's all good ... I have my life back, my career is rebooting, who could have predicted this?

