How Much Is Your AI DevOps Costing You?

It's been almost a year since I closed out the first run of this blog. I said it was a series, and then I let it sit. The honest reason is that the back half of that year was busy in the way that doesn't make for clean writing. But the lessons kept coming, and this one was expensive enough to drag me back to the keyboard. Call it Season 2. The bills got bigger, and so did my opinions.

Here's the one that started it.

The Question That Starts Every Cost Story

A founder asked me a simple question: why is our AWS bill so high?

That question is never simple. It's the smoke. You still have to go find the fire. So I went and looked.

There were nineteen GPU instances running. Twenty-four hours a day. They had been running like that for about ten days before anyone went looking, and they were doing almost nothing.

What the System Was Supposed to Do

The design itself was good. I want to be clear about that up front, because the failure here wasn't the architecture. It was what got bolted onto it.

The platform ran on Kubernetes. A job-manager would pull an image, spin up a GPU instance in AWS, run the job, store the output, and release the instance. Spin up, do the work, tear down. That's the whole loop. GPUs are expensive, so you only want them alive while there's work in front of them.

The instance shape — structure, memory, storage, size — was defined in YAML. And it was deliberately built to be flexible: it could provision through either Kubernetes node groups or Karpenter, depending on what the workload needed. That flexibility was the point. It meant the system could scale to zero when idle and grab exactly the right hardware when it didn't.

Scale to zero. That's the design. Nineteen idle GPUs is the opposite of the design. So what happened?

What the AI Built Instead

A developer needed to make a change. They used AI to help. The AI added a node group and an autoscaler with a default minimum size — and that change got deployed outside the infrastructure-as-code that managed everything else.

Read that again. The guardrails were in the IaC. The change went around them.

That autoscaler was set to a five-node minimum. Not zero. Five. Which means five GPU instances were contractually obligated to exist at all times, doing nothing, waiting. The reason given was that the developer didn't want to run out of GPU instances in the availability zone.

It's a reasonable-sounding fear. It is also how you light money on fire.

Then It Got Worse

Here's where it stops being one mistake and becomes a pattern.

The developer had run out of capacity before — more than once. But the reason they kept running out wasn't a hardware shortage. It was that they hadn't set it up properly in the first place. So instead of fixing the setup, they kept building around it.

They decided they needed more memory and storage on an instance. So they spun up a completely different node group with higher requirements — and set that minimum to ten instances.

Then they stood up a third thing: a completely separate static node group for a pod that was mostly idle, because they thought it needed GPU instances. It mostly didn't.

Do the arithmetic. Five, plus ten, plus four. Nineteen GPU instances, running 24/7, most of them solving a problem that didn't exist — created by trying to solve a problem that was caused by the previous fix.

None of it was malicious. It never is. It was a person reaching for AI to get unstuck, getting an answer that ran, and trusting "it ran" as proof of "it's right." Each layer was a patch on the last patch. The AI happily helped build every one of them, because the AI had no idea what the original design was trying to do.

The Part Nobody Wants to Hear

It sat like that for roughly ten days before I investigated, caught it, and shut it down — killed the idle instances, disabled the autoscaler, and unwound the node groups that should never have existed.

Ten days of nineteen idle GPUs is about $20,000. For nothing. No output, no jobs, no value. Just heat.

And here's the uncomfortable part: the expensive fix and the competent fix cost exactly the same amount of effort. The platform already supported what the developer needed. The flexible YAML, the node-group-or-Karpenter design — it was built to grab the right hardware on demand and scale back to zero. The capable move was to use it. Instead, three new things got built outside the system that was designed to handle all three.

The AI didn't fail. It did exactly what it was told. The failure was upstream of the AI.

What Competent Would Have Looked Like

Competent would have been realizing the AI could not infer the design without proper context.

That's the whole thing. That's the lesson that cost twenty grand to relearn.

The AI couldn't see the scale-to-zero intent. It couldn't see that the flexibility was already there. It couldn't see that "I keep running out of capacity" was a symptom of a bad setup, not a case for raising the floor to five, then ten. It saw a prompt, and it answered the prompt. Confidently. Outside the IaC.

You cannot prompt your way out of not understanding your own system. The AI will give you an answer at the resolution of the context you hand it. Hand it no design, and it will invent one — and the one it invents will run, and it will cost you, and it will look fine right up until the founder asks why the bill is so high.

The Receipt

Nine months ago I said AI-generated infrastructure was faster but dumber, and that speed without judgment is just a faster path to mistakes. I believed it then. I have the invoice now.

So when someone asks me how much their AI DevOps is costing them, I don't assume the answer is zero. The acceleration is real. The savings are real. But the failure mode is real too, and it doesn't show up in the demo. It shows up ten days later, running 24/7, in an availability zone, doing nothing.

Go look at your own minimums. Today. If something is set to a non-zero floor, somebody decided that on purpose — and you want to know whether that somebody understood the design, or just didn't want to run out.

👉 This is the kind of thing I find for a living. If you're scaling fast with AI in the loop and you're not sure what's quietly running underneath you, that's exactly the conversation I want to have. The bill is the symptom. The judgment gap is the disease.

If your team is shipping AI-generated infrastructure faster than anyone is reviewing it, you may already be paying for it and not know yet. Let's find out before the founder asks.