Daniel sent us this one — he's been thinking about serverless GPU platforms, and he's got a hunch that the shared-resource model might actually be a big deal environmentally, not just economically. His question is basically: what does GPU idle draw actually look like, what counts as optimal utilization, and is there a real case that serverless is the greener way to do AI compute? It's a good question. Everyone talks about cost with serverless, but the energy angle gets way less airtime.
It's one of those things where the intuition lines up with the physics, which almost never happens. You've got these enormously power-hungry chips — an H100 can pull seven hundred watts at full tilt — and in a dedicated setup, a lot of that capacity is just sitting there waiting for something to do. The question is how much waiting, and what it costs when it waits.
Seven hundred watts. So running one H100 at full load is like running seven incandescent light bulbs from the old days, except the light bulbs don't need liquid cooling.
Right, and that seven hundred watts is the thermal design power — the ceiling. But here's the thing that surprised me when I dug into this: the idle power on these chips is not trivial. An H100 at idle still draws somewhere around a hundred to a hundred and fifty watts. That's not nothing. That's a couple of old-school light bulbs just sitting there, doing zero computation, burning through electricity.
So the GPU is like a car idling in a parking lot, still burning a meaningful amount of fuel.
And the car analogy actually holds up better than most analogies do. With a car, you've got the engine running, burning fuel, going nowhere. With a GPU, the chip is powered, the memory is refreshed, the interconnects are live — it's ready to go, but it's not doing productive work. The hundred to a hundred and fifty watts is the cost of readiness.
Which immediately tells you that a GPU sitting at fifty percent utilization is not saving fifty percent of the power. The curve is not linear.
Not even close. This is what most coverage gets wrong. People think, oh, if I'm only using my GPU at thirty percent, I'm only burning thirty percent of the power. But you've got that base draw — that idle floor — plus whatever the actual compute workload adds on top. The relationship between utilization and power consumption is more like a staircase with a very high first step.
Walk me through the math. If someone's renting a dedicated GPU instance — say on RunPod or one of the specialist clouds — and they're hitting, I don't know, forty percent utilization over a month, what does their actual energy waste look like compared to the useful work?
Let's use a concrete example. Say you've got an H100 pulling about a hundred and thirty watts at idle, and it scales up to roughly seven hundred watts at full utilization. If you're averaging forty percent utilization — and I should say, that's actually not a terrible number in this space — then a lot of the time you're either at idle or in some intermediate state. The actual power draw might average out to maybe three hundred and fifty, four hundred watts. So you're burning maybe half the full-load power, but only getting forty percent of the full-load work. The gap between those two numbers is your waste.
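[Show notes: a back-of-envelope sketch of that math in Python. The 130-watt idle floor and 700-watt full-load figure are the numbers from the conversation; the linear ramp between them is an illustrative assumption, not a measured power curve.]

```python
# Rough model of GPU power draw vs. utilization, using the figures
# discussed above. The linear ramp above the idle floor is an assumption
# for illustration, not a measured curve.

IDLE_W = 130.0   # approximate idle draw discussed above
FULL_W = 700.0   # approximate full-load draw (H100 TDP)

def avg_power_watts(utilization: float) -> float:
    """Average draw, assuming a linear ramp from the idle floor to full load."""
    return IDLE_W + utilization * (FULL_W - IDLE_W)

util = 0.40
power = avg_power_watts(util)                # ~358 W
share_of_full_power = power / FULL_W         # ~0.51
print(f"At {util:.0%} utilization: {power:.0f} W, "
      f"{share_of_full_power:.0%} of full-load power for {util:.0%} of the work")
```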
That's at forty percent, which you're saying is respectable.
The numbers I've seen from data center surveys suggest that typical GPU utilization in dedicated deployments hovers between thirty and sixty percent. And that's aggregated across a whole cluster — some nodes are hot, some are cold. But the cold ones are still drawing that idle power.
The dedicated GPU user is essentially paying for a lot of readiness. The chip is standing by in case they need it.
Readiness is expensive, both in dollars and in kilowatt-hours. This is where the serverless model gets interesting. When you're on Modal or a similar platform, you're not renting a GPU — you're renting a slice of GPU time. The platform is aggregating demand across thousands of users, and they can pack workloads onto those chips much more densely. If one user's function finishes, another user's function starts within milliseconds. The GPU doesn't get to idle.
The GPU equivalent of hot-desking.
Hot-desking with a very aggressive office manager. The platform's whole business model depends on keeping those chips busy. Every idle second is money they don't recoup. So they're incentivized to pack workloads as tightly as possible.
Which means the environmental incentive and the economic incentive are pointing in the same direction. You mentioned that almost never happens.
It really doesn't. Usually you've got this tension — doing the environmentally responsible thing costs more. Here, maximizing utilization is both the profit-maximizing move and the energy-minimizing move. The platform wants to squeeze every possible cycle out of every GPU, and that squeezing means fewer total GPUs need to be manufactured, shipped, powered, and cooled to serve the same total workload.
Let's talk about that cooling for a second. Because a GPU doesn't just draw power for itself — the data center has to remove the heat it generates. What's the overhead on that?
The standard metric is power usage effectiveness, or PUE. A PUE of one point zero means every watt going into the data center goes into compute — zero overhead. Nobody achieves that. The industry average, according to the Uptime Institute's most recent survey, hovers around one point five five. That means for every watt your GPU draws, you need another point five five watts for cooling, power distribution losses, lighting, all the infrastructure overhead. The hyperscalers — Google, Microsoft, the big ones — they get down to around one point one, one point one two. But your average colocation or smaller cloud provider is sitting right around that one point five five, sometimes worse.
The real energy cost of that idling H100 is not a hundred and thirty watts. It's a hundred and thirty watts times whatever the PUE multiplier is. Closer to two hundred watts of actual grid draw for a chip doing absolutely nothing.
And when you scale that up across a fleet of thousands of GPUs, the waste gets staggering. Let's say a mid-sized AI company is running a hundred H100s around the clock, with an average utilization of forty-five percent and a PUE of one point four. The idle waste — just the portion of power draw that represents no productive work — could be in the range of tens of thousands of watts, continuously. Over a year, we're talking about hundreds of megawatt-hours of electricity that accomplished nothing except heating the outside air.
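[Show notes: the same sort of rough arithmetic for the fleet example. The 1.55 and 1.4 PUE values and the 130-watt idle floor are the figures from the conversation, and counting the whole idle floor as non-productive overhead is one way to draw the line, not an audited methodology.]

```python
# Rough waste estimate for the scenario above: 100 H100s, 130 W idle floor,
# facility PUE of 1.4. The idle floor is treated as overhead the whole time,
# since it buys readiness rather than computation.

IDLE_W = 130.0
HOURS_PER_YEAR = 8760

# Single idle H100 at the industry-average PUE of ~1.55:
single_idle_grid_w = IDLE_W * 1.55            # ~200 W of grid draw for zero work

# Fleet of 100 GPUs at PUE 1.4:
fleet_waste_w = IDLE_W * 1.4 * 100            # ~18,200 W, continuously
fleet_waste_mwh = fleet_waste_w * HOURS_PER_YEAR / 1e6   # ~160 MWh per year

print(f"One idle H100, grid draw: {single_idle_grid_w:.0f} W")
print(f"Fleet of 100, continuous non-productive draw: {fleet_waste_w:,.0f} W")
print(f"Fleet of 100, yearly waste: {fleet_waste_mwh:.0f} MWh")
```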
Hundreds of megawatt-hours to do nothing. The sloth in me appreciates the commitment to inactivity, but even I have limits.
That's a hundred GPUs. The big training clusters are tens of thousands of GPUs. The waste at scale is genuinely enormous.
Okay, so let me push on something. You mentioned optimal utilization. Daniel's prompt asked whether a hundred percent utilization is even the benchmark, or whether you need to leave headroom for maintenance and error checking. What does optimal actually look like?
This is a great question because it breaks the naive assumption that a hundred percent is the target. In practice, you never want to run at a literal hundred percent utilization sustained. There are a few reasons. One is that GPU memory needs to be managed — you need some headroom for memory allocation and deallocation, and if you're packed to the gills, you start getting out-of-memory errors that kill jobs. Another is thermal — running at absolute maximum for extended periods can accelerate hardware degradation. And then there's scheduling overhead — the orchestrator needs a tiny bit of slack to place new jobs without queueing delays.
What's the number?
The sweet spot most operators target is somewhere between eighty and ninety percent utilization. At that level, you're getting excellent efficiency, but you've got enough slack to handle spikes, failures, and maintenance without everything falling over. The serverless platforms can push toward the higher end of that range because they've got sophisticated schedulers that can pack workloads very tightly and shift them around in near-real-time.
A dedicated instance held by a single tenant — they're probably nowhere near that.
Almost certainly not. Think about the usage pattern of a typical small AI shop. They're doing development during business hours — maybe a few inference jobs, some fine-tuning runs. The GPU is sitting there, powered on, drawing that idle wattage, cooling system humming away, for maybe sixty to seventy percent of the hours in a week.
The GPU is working a nine-to-five and the rest of the time it's just keeping the chair warm.
Except the chair costs real money and has a carbon footprint. And this is where the serverless model shines. Because the platform aggregates demand globally, across time zones, the workload curve gets smoothed out. When the North American developers go to sleep, the European ones are waking up, and then the Asian ones. The GPU never really gets a break.
Time-zone arbitrage as environmental policy.
Unintentional, but yes. The platform doesn't set out to be green — it sets out to make money by maximizing utilization. But the byproduct is that you're squeezing far more useful work out of each physical GPU. And that means fewer GPUs need to exist.
Which brings us to the embodied carbon question. Manufacturing these things isn't free, environmentally speaking.
Not at all. Chip fabrication is enormously resource-intensive. The semiconductor manufacturing process involves ultra-pure water, hazardous chemicals, clean rooms that have to maintain extremely tight environmental controls — the energy footprint of producing a single advanced GPU is substantial. I've seen estimates that the embodied carbon of manufacturing an H100-class GPU could be in the range of a hundred and fifty to two hundred kilograms of CO2 equivalent. That's before the chip ever draws a single watt of operational power.
If serverless means you can serve the same total workload with, say, thirty percent fewer GPUs in circulation, you're avoiding not just the operational energy waste but all that manufacturing impact too.
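[Show notes: an illustration of the embodied-carbon point. The 150 to 200 kilograms of CO2-equivalent per GPU is the rough estimate quoted above, and the 10,000-GPU fleet size is a hypothetical chosen purely for illustration.]

```python
# If higher utilization lets the same workload run on ~30% fewer GPUs,
# the avoided manufacturing emissions scale with fleet size. Per-GPU
# embodied carbon here is the rough estimate from the conversation.

EMBODIED_KG_CO2E = (150, 200)   # per H100-class GPU, per the estimate above
FLEET_SIZE = 10_000             # hypothetical fleet
REDUCTION = 0.30                # fraction of GPUs no longer needed

avoided_gpus = FLEET_SIZE * REDUCTION
for kg_per_gpu in EMBODIED_KG_CO2E:
    tonnes = avoided_gpus * kg_per_gpu / 1000
    print(f"At {kg_per_gpu} kg CO2e per GPU: "
          f"~{tonnes:,.0f} tonnes of manufacturing emissions avoided")
```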
And we haven't even talked about the downstream effects — the rare earth mining, the water usage in fabs, the transportation. The environmental case for maximizing utilization goes well beyond the electricity bill.
Let's get concrete for a second. If I'm a small startup and I move from a dedicated GPU instance to a serverless platform, what's the actual difference in energy per unit of useful work?
It's hard to give a single number because it depends so much on the baseline. But let's construct a scenario. Take a startup that's been renting a dedicated H100 on a cloud provider, running at about thirty-five percent utilization on average — which is not uncommon for a small team. Their effective energy per teraflop of useful computation is terrible because they're paying the idle overhead for all those unused hours. Looking at the GPU alone, moving to a serverless platform operating at eighty-five percent aggregate utilization on the same hardware cuts the energy per useful teraflop by something like a quarter. Fold in the rest of the idle host — the CPU, the memory, the fans that a dedicated instance keeps powered around the clock — plus the PUE gap between a typical facility and a hyperscale one, and the drop can plausibly exceed half.
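[Show notes: a sketch of that energy-per-useful-work comparison. The GPU figures are the ones from earlier in the conversation; the 300 watts of host overhead per GPU and the PUE values of 1.55 versus 1.1 are illustrative assumptions, and they are what push the gap past the halfway mark.]

```python
# Energy per unit of useful work, dedicated vs. serverless, under the
# assumptions stated in the note above. Not a benchmark of any real platform.

GPU_IDLE_W, GPU_FULL_W = 130.0, 700.0
HOST_OVERHEAD_W = 300.0   # CPU, RAM, fans attributed to the GPU (assumed figure)

def energy_per_useful_unit(utilization: float, pue: float,
                           include_host: bool = True) -> float:
    """Average facility watts divided by the fraction of capacity doing useful work."""
    base = GPU_IDLE_W + (HOST_OVERHEAD_W if include_host else 0.0)
    avg_watts = base + utilization * (GPU_FULL_W - GPU_IDLE_W)
    return avg_watts * pue / utilization

dedicated = energy_per_useful_unit(0.35, pue=1.55)    # typical facility
serverless = energy_per_useful_unit(0.85, pue=1.10)   # hyperscale facility
print(f"With host overhead and the PUE gap: "
      f"{1 - serverless / dedicated:.0%} less energy per unit of work")   # ~57%

gpu_only_ded = energy_per_useful_unit(0.35, pue=1.0, include_host=False)
gpu_only_srv = energy_per_useful_unit(0.85, pue=1.0, include_host=False)
print(f"GPU alone, equal PUE: {1 - gpu_only_srv / gpu_only_ded:.0%} less")  # ~23%
```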
More than half. That's not a marginal improvement.
It's a step change. And the economics track the energy. The startup's bill drops too, because they're only paying for the compute they actually use. The alignment between cost and carbon is almost perfect in this scenario.
Which makes you wonder — if the case is this strong, why isn't serverless the default? Why are people still renting dedicated instances?
A few reasons. One is latency sensitivity. If you're running a real-time inference service where milliseconds matter, the cold-start time on a serverless function can be a problem. The platform has to spin up a container, load your model into GPU memory — that can take seconds, sometimes tens of seconds. For batch processing or asynchronous workloads, that's fine. For a chatbot that needs to respond instantly, it's a harder sell.
The cold start is the serverless equivalent of waiting for the car to warm up.
Right, and nobody wants that in a production user-facing application. The platforms are getting better at this — they can keep models warm in memory, they can predict demand spikes — but it's still a real constraint.
Another reason is predictability. When you rent a dedicated GPU, you know exactly what your bill is going to be. It's a fixed monthly cost. With serverless, your bill scales with usage, which is great when usage is low, but can be terrifying if you get an unexpected spike. There's a budgeting comfort in the dedicated model, even if it's less efficient.
The financial equivalent of wanting a car in the garage even if you only drive it on weekends.
And then there's the control factor. Some teams want to optimize at the hardware level — they want to manage CUDA versions, driver configurations, they want to squeeze every last drop of performance out of the silicon. Serverless abstracts all that away, which is great for productivity but can feel constraining if you're a hardware-level optimization nerd.
The dedicated model persists partly for technical reasons, partly for psychological ones.
Partly because the environmental cost isn't priced in. If there were a carbon tax that reflected the true cost of that idle waste, the economics would shift dramatically. But right now, the dedicated GPU renter isn't paying for the externality. The grid bears it, the climate bears it, but the monthly invoice doesn't reflect it.
Which brings us to the policy question buried in the prompt. Should we be doing more to encourage serverless models, especially for GPU compute?
I think the case is surprisingly strong, and it doesn't require heavy-handed regulation. Even just transparency would help. If cloud providers were required to disclose the average utilization of their GPU fleets, or if there were an energy-efficiency rating for different compute models, that would push buyers toward the more efficient option. Right now, most people renting GPUs have no idea what their actual utilization is, let alone what the carbon impact looks like.
Information asymmetry keeping the inefficient model alive.
And the serverless platforms themselves could lean into this more. Modal and their competitors don't really market the environmental angle. It's all about developer experience and cost savings. But the environmental case is compelling, and it's not greenwashing — the math holds up.
Let's talk about what happens at the other end of the spectrum. The hyperscale training runs — the companies that are training frontier models on clusters of tens of thousands of GPUs. Is serverless even relevant there?
Not for the training itself. When you're doing a massive distributed training run, you need tightly coupled GPUs with high-bandwidth interconnects, all working in lockstep. That's the polar opposite of the serverless use case. But the inference side — serving the model to millions of users — that's where serverless shines. And inference is actually the bigger energy consumer over the lifetime of a model. Training is a one-time cost; inference is continuous.
Even for the big players, the environmental argument for serverless inference holds.
And some of the big players are effectively running their own internal serverless platforms. They've built orchestration layers that pool inference requests across multiple models and route them to available GPU capacity. It's serverless in all but name.
The hyperscalers have reinvented serverless for themselves, while the rest of the market is still debating whether to adopt it.
That's a very good way to put it. The efficiency gains are real enough that the companies with the most at stake — the ones running the biggest fleets — have already internalized the model. They just don't call it serverless.
Let's circle back to something Daniel asked about maintenance overhead. Do GPUs need scheduled downtime for error checking, or can you really push them continuously?
GPUs don't need scheduled maintenance the way a car needs an oil change. But they do experience transient errors — cosmic rays flipping bits in memory, that kind of thing. The error correction on the memory — ECC — handles most of that transparently. What does happen over time is that thermal cycling — heating up and cooling down repeatedly — can stress the solder joints and interconnects. So a GPU that's kept at a relatively stable temperature, even a high one, may actually last longer than one that's constantly cycling between idle and full load.
Which is another point for serverless. The chips stay warm.
A serverless GPU that's running at a steady eighty-five percent utilization has a much more stable thermal profile than a dedicated GPU that's going from zero to a hundred and back to zero every day. The thermal stability may actually extend hardware life.
We've got lower idle waste, higher utilization, smoother thermal profiles, reduced manufacturing demand. Is there any part of this where serverless is worse environmentally?
The one counterargument I can think of is the overhead of the orchestration layer itself. The serverless platform has to run schedulers, load balancers, monitoring systems — all of which consume compute resources of their own. And the multi-tenancy adds some overhead — container isolation, network virtualization, that kind of thing. But that overhead is tiny compared to the idle waste it eliminates. We're talking maybe a two to five percent overhead versus potentially fifty percent or more waste in dedicated deployments.
The net is still strongly positive.
The orchestration overhead is the environmental equivalent of the electricity used by the serverless platform's office lights. It's real, but it's a rounding error compared to the main event.
I want to pull on one thread you mentioned earlier — the cold start problem. Is that a fundamental limitation, or is it solvable?
It's mostly solvable. The platforms are already doing clever things — keeping frequently used models pre-loaded in GPU memory, using predictive scaling to anticipate demand, snapshotting container state so that startup time goes from tens of seconds to single-digit seconds. I wouldn't be surprised if, within a couple of years, the cold start latency is low enough that it's not a meaningful barrier for most use cases.
The technical objections are eroding.
And the economic case was already strong. The environmental case adds another dimension that I think is going to become harder to ignore, especially as AI energy consumption gets more scrutiny in the press and from regulators.
The International Energy Agency has been flagging data center energy growth as a real concern. If serverless can meaningfully bend that curve, it's not just a nice-to-have.
The IEA's latest numbers show data centers consuming somewhere around two hundred and forty to three hundred and forty terawatt-hours annually, and AI workloads are the fastest-growing component of that. Even a ten or fifteen percent reduction through better utilization would be massive in absolute terms. We're talking tens of terawatt-hours — the output of multiple power plants.
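[Show notes: a quick check on the orders of magnitude. The 240 to 340 terawatt-hour range is the IEA estimate cited in the conversation; the 10 to 15 percent savings and the one-gigawatt reference plant are hypotheticals.]

```python
# Sanity check on the "tens of terawatt-hours" claim.

low_twh, high_twh = 240, 340   # IEA data-center estimate quoted above
for savings in (0.10, 0.15):
    print(f"{savings:.0%} reduction: "
          f"{low_twh * savings:.0f}-{high_twh * savings:.0f} TWh/year")

# A 1 GW plant at a 90% capacity factor generates roughly 7.9 TWh/year,
# so even the low end of that range is several plants' worth of output.
plant_twh = 1.0 * 8760 * 0.9 / 1000
print(f"Reference plant output: ~{plant_twh:.1f} TWh/year")
```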
That's just the operational side. Add in the embodied carbon savings from needing fewer GPUs manufactured, and the total impact is even larger.
The manufacturing angle is underappreciated. Every GPU not made is a GPU's worth of fabrication energy, water, and materials that never gets consumed. Serverless effectively reduces the total addressable market for GPU hardware — not because people are doing less compute, but because they're using the existing hardware more efficiently.
The Jevons paradox lurks in the background here, though. If serverless makes GPU compute cheaper and more accessible, total demand might go up enough to offset the per-unit efficiency gains.
This is the classic rebound effect. And you're right to raise it — it's the strongest counterargument to the environmental case. If serverless enables a wave of new AI applications that wouldn't have been viable under the dedicated model, total GPU energy consumption could still rise even if per-workload efficiency improves dramatically.
The carpool lane that gets so popular it ends up just as congested as the regular lanes.
But I think there's an important distinction. Even if total consumption rises, we're still better off than if that same total workload were being served by dedicated instances. The efficiency gain is real regardless of whether absolute consumption goes up. It's the difference between a world where AI compute grows efficiently and one where it grows wastefully.
The environmental case for serverless holds even under pessimistic assumptions about demand growth.
The counterfactual matters. The question isn't whether AI energy consumption will grow — it will. The question is whether it grows on a foundation of thirty percent utilization or eighty-five percent utilization. The difference between those two curves, over a decade, is enormous.
We haven't even talked about the geographic dimension. Serverless platforms can locate their GPU fleets in regions with clean grids.
A serverless platform can choose to put its data centers in places like Quebec, where the grid is almost entirely hydroelectric, or in the Nordics, where there's abundant wind. A small company renting a dedicated GPU might not have that flexibility — they take whatever region their cloud provider offers. The serverless model centralizes the siting decision, and that centralization can be leveraged for environmental benefit.
You get a double dividend: higher utilization and cleaner power.
It depends on the platform's choices, but the structure enables it in a way that the fragmented dedicated model doesn't.
Let's land the plane a bit. If someone listening is running AI workloads on dedicated GPUs and they're wondering whether to switch — what's the threshold where it makes sense environmentally and economically?
If you're running at less than about sixty percent sustained utilization, you're almost certainly better off on a serverless platform — both in cost and in carbon. Between sixty and eighty percent, it depends on your workload patterns and how spikey your demand is. Above eighty percent sustained, dedicated might actually be more efficient because you're avoiding the orchestration overhead without much idle waste.
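[Show notes: the rule of thumb from this exchange, written out as a tiny function. The thresholds are the speaker's heuristic, not numbers from any platform's documentation.]

```python
# A rough placement heuristic based on sustained GPU utilization.

def placement_hint(sustained_utilization: float) -> str:
    """Rough guidance for a utilization value between 0.0 and 1.0."""
    if sustained_utilization < 0.60:
        return "serverless: idle waste likely dominates on a dedicated instance"
    if sustained_utilization <= 0.80:
        return "depends: look at workload spikiness and latency requirements"
    return "dedicated: little idle waste, and you skip the orchestration overhead"

for u in (0.35, 0.70, 0.90):
    print(f"{u:.0%}: {placement_hint(u)}")
```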
Most small-to-medium users are well below sixty percent.
The utilization numbers I've seen suggest that sub-sixty percent is the norm, not the exception, for dedicated GPU instances outside of hyperscale environments.
The environmental case for serverless isn't just intuition — it holds up under scrutiny. The idle draw is real, the utilization gap is wide, the thermal and manufacturing benefits are substantial, and the counterarguments are mostly about edge cases and rebound effects that don't negate the core efficiency gain.
The beautiful thing is, you don't need to be an environmentalist to make the switch. The economic case alone justifies it for most users. The carbon savings are basically a free byproduct of saving money.
Like adopting a feral cat that turns out to be house-trained.
That's actually a pretty good analogy. You're getting a benefit you didn't sign up for, and it costs you nothing extra.
Daniel's hunch was right. Serverless isn't just elegant economics — it's a greener way to do GPU compute.
The more the industry moves in that direction, the bigger the aggregate impact. This is one of those rare cases where individual cost optimization and collective environmental benefit are pointing in exactly the same direction. You don't have to choose between your budget and the planet.
The question now is whether the market moves fast enough on its own, or whether some gentle nudging — disclosure requirements, efficiency ratings — would accelerate the shift.
My bet is that economics will do most of the work. GPU compute is expensive, and waste is waste. But transparency would help. If every cloud console showed you your average GPU utilization and estimated idle waste, a lot of people would be surprised — and a lot of them would switch.
Now: Hilbert's daily fun fact.
Hilbert: In the 1840s, naturalists studying jellyfish in the Caspian Sea near present-day Tajikistan discovered that the jellyfish bell is composed of roughly ninety-five percent water and only about zero point two percent structural protein — the rest being dissolved salts and trace minerals — making it one of the most chemically minimal animal structures ever documented.
...right.
That explains why they never get invited to structural engineering conferences.
The open question — and I think this is where the conversation goes next — is whether serverless GPU platforms start marketing the environmental angle explicitly, and whether that changes adoption patterns. The numbers are there. Somebody just has to put them on the bill.
If they don't, maybe the regulators eventually will. Either way, the direction of travel seems clear. Thanks to our producer Hilbert Flumingtop. This has been My Weird Prompts. Find us at myweirdprompts dot com, and if you've got a question like this one — something where your intuition might actually be right and you want us to dig into the numbers — send it in.
We'll be here.