Champagne Taste, Beer Money: Building Sovereign AI on a Budget
“Sovereign AI” is having a real moment. It’s all over LinkedIn, it’s in the analyst decks, and there’s a genuine industry forming around a simple premise, that you should control the AI you depend on instead of renting all of it from someone else. I think that’s exactly right. It’s also an old idea wearing a new hat, which is to own the thing you depend on. I’ve been self-hosting what matters to me since at least 1996, so pointing that same instinct at AI was the obvious next step.
To be clear up front, because the internet is the internet, this is not a hit piece on the frontier labs. Claude, GPT, Gemini, they’re genuinely excellent and I use at least Claude every day. But this is about having a Plan B you actually own, so that when a price changes, or a model gets retired, or a terms-of-service paragraph quietly gets reworded, or some pesky government decides to get involved, none of it can switch off something my family or my 24/7 agents and interests have come to rely on.
So I built one. In the shop DC. On a budget.
Champagne Taste, Beer Money
If money were no object I’d have racked something with a couple of NVIDIA A100s, let CUDA do its thing, and spent my weekends on literally anything else. Money is an object. So I did what I always do around here (see roughly every other post on this blog). Buy the most capability per dollar I can find and pay the rest in elbow grease.
I did a lot of research and cost comparing like I normally do, scouring eBay window shopping and looking at all the pretties I can’t have. Then I went and scoured the LLM benchmarking sites and started comparing specs. Ultimately I landed on four AMD Instinct MI100s. Real datacenter cards, 32GB of HBM2 apiece, 128GB of VRAM in the box, picked up used for a fraction of what the green-team equivalent runs.
The catch, because with me there’s always a catch, is that these are CDNA1 / gfx908 parts on AMD’s ROCm stack. Roughly the entire LLM tooling ecosystem quietly assumes you’re on CUDA. I was not. Notice the pattern. I optimized for cheap and relatively fast, which meant good was only going to be for some definition of good, not easy good.
Bill of Materials
For the homelab crowd who always asks what’s in the box, here’s the parts list. Most of it came off eBay, some of it after a few “make offer” rounds that knocked the prices down a bit.
| Qty | Component | Price |
|---|---|---|
| 4 | AMD Instinct MI100 32GB | ~$4,000 |
| 1 | ASRock ROMED8-2T motherboard | $770-$1,100 |
| 1 | AMD EPYC 7402 24-core 2.8GHz | $87 |
| 1 | EVGA SuperNOVA 1600W PSU | ~$165 |
| 1 | Rosewill RSV-AI01 4U chassis | $450 |
| 4 | GPU cooling fan shrouds | ~$108 |
| 1 | Dynatron A41 low-profile CPU cooler | $59 |

I already had the 128GB of ECC DDR4-2400 and the SSDs for boot and model storage sitting around, so those aren’t in the totals. All in, after the haggling, the rig cost me somewhere between $5,400 and $5,600. The memory and storage I reused is honestly where the real money would be if I were buying it all fresh today.
You could absolutely put nicer cards in a box like this if money were no object. A single RTX A6000 runs $8,000 to $11,000 and gives you 96GB on one card, which is arguably the better tool. I did not have A6000 money. I’m also eyeing an Infinity Fabric bridge that ties all four MI100s together for fast card-to-card communication, which would help the models that have to span more than one GPU.
Holding the Pieces
Buying used datacenter gear is a patience exercise. The parts showed up a few at a time as auctions closed and “make offer” haggles landed, so the build came together in fits and starts over a couple of weeks. That part I expected. The fun started once everything was in hand and had to physically coexist in one box.
The first CPU cooler I bought was too tall to fit under the lid of a 4U chassis. Easy mistake, annoying fix, order a low-profile SP3 cooler and wait again.
The EPYC chip taught me something I didn’t know. The big SP3 socket has a metal retention frame held down by Torx screws, and they have to be tightened in the marked order to spec, a T20 bit at 11 to 13 inch-pounds. Skip the sequence or under-torque it and the socket doesn’t seat evenly across that huge grid of pins, which shows up as missing memory. I had DIMM slots that simply wouldn’t register until I got a torque driver and tightened everything in the right order. Now all the RAM is there.
Airflow was the next puzzle. I wanted the rear 80mm fans pulling hot air straight out the back behind the cards, but the chassis fan mount didn’t line up with where the leftmost PCIe slot sat. The fix was a PCIe slot relocator, one of those adapter-and-ribbon-cable kits that lets you move a card to a different physical position. It works, but it’s an unfortunate design decision on an otherwise decent chassis.
And the cards themselves are the real lesson. These are passive datacenter accelerators. They have no fans of their own, because they were built to live in a chassis with a wall of screaming high-pressure fans shoving air through them. Sit one in anything less and it will happily climb to 100°C and start throttling itself to survive. Most of the work here wasn’t the GPUs, it was building enough airflow to keep them honest. Shrouds on every card, high-static-pressure fans, and a chassis that actually moves air.
Then It Was the Software’s Turn
With the cards finally seated, cooled, and stable, the software had opinions of its own. To get a useful amount of context onto a 32GB card you have to be clever about the model’s working memory, what’s called the KV cache. That’s the running record of everything the model has read so far in a conversation, and it grows with the length of the context. Keep it at full precision and a long conversation eats all your VRAM in a hurry, so the move is to compress it. The catch is that the tool I picked to do that compression had only ever been tested on two newer AMD datacenter cards and one consumer NVIDIA card. Mine were an older generation, on none of those lists. Unsupported software, unsupported silicon, and me. What could go wrong.
Plenty, as it turns out. Here is roughly how the debugging went.
First it ran, but small. With the cache at full precision everything worked, coherent and stable. The problem was that there was no room left for the big context windows I actually wanted, so this was never the finish line.
Next I switched on the cache compression, and the output fell apart. The model would load, pass its health check, and then under real load start producing confident nonsense. Real words, real grammar, no meaning. The useful clue was that the full-precision path was still perfectly clean, which told me the model and the core math were fine. The bug was specifically in the compressed-cache code, which someone had written and tuned for the newer cards. On my older ones the numbers just didn’t come out the same.
Then came crashes that only happened on the hard problems. Short prompts were rock solid, long ones would abort partway through. Underneath, a language model is mostly an enormous stack of matrix multiplications, and the GPU’s math library ships pre-built routines tuned for specific shapes and sizes of those multiplications. The particular sizes that only come up with long prompts had no matching routine for my card, so instead of slowing down it simply died. The fix was forcing the system to load an older, more complete copy of that math library, the one that actually included the routines my cards needed.
The last one cost me a weekend. Two builds of the identical source code, one rock solid and one quietly producing corrupt results. The only difference was how they were assembled. The broken build borrowed a few low-level system libraries from the operating system at startup, and those happened to be a slightly different version than the ones the code had been built against. Everything loads, nothing throws an error, and then the answers come out subtly wrong because two pieces of software disagree about the exact memory layout of the data they hand each other. Rebuilding with those libraries baked directly into the program, so there was no second version to disagree with, made it vanish.
The thread running through all of it is that none of these showed up on a quick test. A “what’s 2+2” check passed every single time, which is exactly how it kept fooling me. The failures only appeared under the long, messy, varied prompts that real agent traffic throws at a model, and a build that breezed through a 6,000-token test would still fall over at 30,000. Every one of these came with a perfectly good off-ramp labeled “just give up and rent an API.” I wrote down what broke each time and kept driving.
Picking an Engine and a Model
There’s a choice underneath all of that I haven’t mentioned yet, what to actually run the models on. I started with vLLM, the serious production-grade inference server, the kind that does clever batching to serve a lot of requests at once. Getting it to build for cards this old was a multi-day project by itself, mostly fighting the ROCm toolchain into compiling at all. I did get it running, and in testing I pushed well over a million tokens of context cache through it on a small model, which was a fun number to hit. But the features I actually wanted kept running into walls on hardware and model types this far off the beaten path.
llama.cpp, through a fork focused on cache compression, turned out to be the more forgiving path on older cards, and it opened up the whole ecosystem of pre-quantized open models. So production runs on llama.cpp today, and vLLM stays on the bench for experiments. In fairness, since I made that call other people have gotten vLLM running well on these exact MI100 cards through AMD’s AITER kernels, so the wall I hit was more about my patience than the hardware. I went the llama.cpp route and haven’t had a reason to switch back yet, though I’m curious enough now to go back and benchmark it properly.
Then there are the models, and the two tiers are a deliberate split. For hard reasoning and long-horizon coding I want a dense model that puts all of its parameters to work on every token, so the heavy tier is Qwen3.6-27B. The cheap, high-volume tier is its faster sibling, the mixture-of-experts Qwen3.6-35B-A3B, which only activates 3B of its parameters per token, so it’s quick and perfectly good for the easy work but can’t reach the dense model on the hard stuff. I’ve run models from the Gemma, Qwen, GLM, and DeepSeek families on this box. The bigger ones, like GLM and DeepSeek, do run, but they only fit on 32GB cards if you drop to smaller quant sizes and shorter context windows, which hands back exactly the quality and headroom you were trying to buy in the first place. After many days of testing and tweaking, the Qwen models have consistently punched above their weight here, which matches where they land on the public leaderboards. Either way the rig doesn’t care which one is loaded, so trying a new one is a download and a config change, not a project.
Making It Survive 24/7
Getting a model to answer one prompt is a demo. I wanted infrastructure, including multiple usable agents running around the clock. Coding assistants, background automation loops, a voice assistant for the house. The kind of thing that has to keep working at 3am while I’m asleep, without me hovering over it.
That’s a different problem, and a less forgiving one. A few things mattered.
The first was running a pool of model instances instead of one. Right now that’s three copies of a 27B Qwen model, one pinned to each of three cards for the heavy interactive work, plus a faster 35B mixture-of-experts model on the fourth card for the cheap, high-volume stuff. Out front sits an LLM router gateway I built myself, which gives me ultimate flexibility over how requests get balanced across the pool, so one wedged request doesn’t drag its neighbors down with it, and a shared CPU-side cache lets them reuse each other’s warm context.
On a single stream the heavy 27B model decodes at about 24 tokens a second and the lighter mixture-of-experts model at about 48, both comfortably faster than I can read, and the pool runs several of those at once in parallel.
The second was watchdogs that notice when an instance has crashed or gotten stuck and bring it back on their own. “24/7” is a promise about 3am, not about business hours.
The third was a lot of fussy VRAM accounting. Each 27B card serves two parallel slots of 128K-token context by squeezing the KV cache down to 8-bit, which is enough for an agent to hold a real codebase in its head, and it all has to fit inside 32GB without tipping a card into out-of-memory the second two requests land at once.
Day to day, I drive the whole thing from my editor. I use Zed and opencode, both pointed at the rig through a preset I call Open Opus, a nod to the model it’s quietly standing in for. My coding assistant talks to four GPUs in my own building instead of an API in someone else’s.
Here’s the rig with all four cards working, first from the command line and then from the custom monitoring dashboards I built to keep an eye on it.
$ amd-smi
+------------------------------------------------------------------------------+
| AMD-SMI 26.3.0+2bd1678d3d |
| OS kernel Version: 6.17.13-2-pve |
| ROCm Version: 7.12.0 |
| VBIOS Version: 000.000.000.000.015466 |
| Platform: Linux Baremetal |
|-------------------------------------+----------------------------------------|
| BDF GPU-Name | Mem-Uti Temp UEC Power-Usage |
| GPU HIP-ID OAM-ID Partition-Mode | GFX-Uti Fan Mem-Usage |
|=====================================+========================================|
| 0000:03:00.0 AMD Instinct MI100 | 6 % 59 °C 0 55/290 W |
| 0 3 N/A N/A | 45 % N/A 24267/32752 MB |
|-------------------------------------+----------------------------------------|
| 0000:83:00.0 AMD Instinct MI100 | 40 % 73 °C 0 289/290 W |
| 1 1 N/A N/A | 99 % N/A 31919/32752 MB |
|-------------------------------------+----------------------------------------|
| 0000:86:00.0 AMD Instinct MI100 | 43 % 72 °C 0 294/290 W |
| 2 2 N/A N/A | 100 % N/A 31917/32752 MB |
|-------------------------------------+----------------------------------------|
| 0000:c3:00.0 AMD Instinct MI100 | 41 % 72 °C 0 297/290 W |
| 3 0 N/A N/A | 100 % N/A 31915/32752 MB |
+-------------------------------------+----------------------------------------+

The whole fleet at a glance. Junction temps, total power draw, and per-card utilization and VRAM across all four MI100s.

Per-card drill-down. Utilization, memory, temperature, and power for GPU 0 through 3.
The payoff is a rig where all of it, the coding, the agents, the house assistant, runs on hardware I own, in a building I control.
But What About X
Right about here someone says it. “You can get 128GB for three grand, why on earth did you spend $5,600?” It’s a fair shot, and in 2026 it’s a real option. NVIDIA’s DGX Spark and the AMD Strix Halo boxes (Ryzen AI Max+ 395) put 128GB of unified memory and a capable GPU in something the size of a hardback book, drawing a hundred-odd watts in silence on a desk. If all I wanted was to run one big model quietly in a home office, I’d probably own one.
But not all 128GB is the same 128GB. Those boxes share their memory over roughly 256 to 273 GB/s of bandwidth. A single MI100 moves about 1.2 TB/s, and I have four of them. Token generation is almost entirely bottlenecked by memory bandwidth, so on raw decode speed each of my cards is in a different league, and four of them run at once. That last part is the whole game. A unified-memory box is one pool, which means one model doing one thing at a time. My rig is four independent high-bandwidth cards, which is what turns it from a single-user appliance into a 24/7 server running a pool of agents in parallel.
So the sticker comparison is the wrong one. “128GB for $3,000” and “128GB for $5,600” are not the same purchase. One is a quiet, low-power box that holds a large model and serves it to one person at a time. The other is four datacenter GPUs with several times the bandwidth, feeding a whole household of agents around the clock. The little boxes genuinely win on size, noise, power, and, for the NVIDIA one, the painless CUDA software story I spent weeks fighting on the AMD side. I wanted the always-on multi-agent server, so I paid for it in watts and fan noise instead of dollars. Different job, different tool.
Why Bother
Because “just use the API” is the exact dependency I’m trying not to have, unless it’s my API.
Run your own inference and your costs don’t reprice on somebody else’s earnings call. Your data never leaves the building. And nobody three states away can deprecate, rewrite, or regulate a capability out from under you on a random Tuesday. In my setup the home rig is the root of trust, the boring trusted core everything else leans on, not the disposable edge.
That’s what sovereign AI looks like at a personal scale. The companies now building it for organizations that can’t or shouldn’t roll their own are working the same conviction from the other end, and the conviction is sound. It isn’t isolation. I still reach for a frontier model when it’s the right tool for the job. It’s the same reason I keep a generator, a second internet uplink, and a shelf of cold spares. When the thing you depend on belongs to someone else, owning your own stops being a hobby and starts being insurance.
Further Reading
A few of the projects and rabbit holes behind this build, in case you’re tempted to do the same thing.
- TheTom/llama-cpp-turboquant — the TurboQuant KV-cache fork I run, on the feature/turboquant-kv-cache branch
- TurboQuant on llama.cpp — the upstream discussion and the research it grew out of
- btbtyler09/mi100-llm-testing — someone else putting MI100s to work, on the vLLM and AITER path instead of llama.cpp
- vLLM and llama.cpp — the two inference engines
- AMD ROCm — the compute stack that made all of this both possible and occasionally painful