Daniel sent us this one — he wants to get into ASICs, Application-Specific Integrated Circuits. Chips built for exactly one job. But before we go there, he wants us to pin down what we even mean by a chip. Is it a CPU, part of a CPU, part of a GPU? What's the common piece that shows up across all of them? And then the bigger question — when you customize silicon at the deepest level, what does that actually mean in practice, and what does it displace in terms of cost? There's a lot to unpack here.
Oh, this is a great one. And by the way, quick note — today's episode is being powered by DeepSeek V four Pro, which feels appropriate for a chip architecture discussion. But let's get into it. The common piece Daniel's asking about is the die. The silicon die. That's the actual slab of semiconductor — almost always silicon — cut from a wafer, where the entire integrated circuit lives. Every chip, whether it's a five hundred dollar CPU or a fifteen thousand dollar custom ASIC, starts as the same thing. The difference is entirely in what gets etched onto it.
A CPU is a die, a GPU is a die, an ASIC is a die — they're all just rectangles of silicon with different personalities etched into them.
And the personality is everything. A CPU is a general-purpose sequential processor. It's built to handle anything you throw at it — spreadsheets, web browsing, operating system tasks. But that flexibility comes with enormous overhead. Instruction decoders, branch predictors, multiple levels of cache hierarchy, general-purpose arithmetic logic units. It's carrying around a lot of machinery that isn't actually doing the computation you care about. It's orchestrating.
A GPU strips some of that away?
A GPU strips away the sequential optimization and goes massively parallel. Instead of a few very sophisticated cores, you get thousands of simpler cores. Each one isn't as smart individually, but together they can chew through matrix multiplication like nothing else. That's why they dominate AI training. But they're still general-purpose in their own way — they can run any shader program, any CUDA kernel. They've still got flexibility baked in.
An ASIC takes that logic to its extreme. You're saying, I know exactly what this chip will ever need to do, and I'm burning that one program into the hardware itself. No flexibility, no overhead. Just the computation.
The efficiency numbers are staggering. I pulled some data on this — in cryptocurrency mining, an ASIC achieves about fifteen to seventeen joules per terahash. A GPU doing the same work consumes around fifty thousand joules per terahash. That's not a percentage difference. That's roughly three thousand times more energy for the same output.
Wait, fifty thousand versus seventeen? That's not an efficiency gap, that's a different universe. How is the GPU even in the same conversation at that point?
Because the GPU can mine Bitcoin today and render a video tomorrow. The ASIC can mine Bitcoin and only mine Bitcoin. If the mining algorithm changes, or if the coin collapses, your ASIC is e-waste. The GPU still has resale value. So the inefficiency is the price you pay for not betting your entire investment on one algorithm remaining relevant.
That's the hidden cost of flexibility. You're paying for optionality in wasted electricity every single second the chip runs. And for a hyperscaler running millions of chips twenty-four seven, that optionality premium becomes astronomical.
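For anyone following along at home, here's that gap as a quick back-of-the-envelope sketch in Python. The joules-per-terahash figures are the rough numbers quoted above, and the electricity price is an assumption, not a measurement from any real rig:

```python
# Back-of-the-envelope energy comparison, using the rough per-terahash
# figures quoted above (illustrative, not measured on any specific hardware).

ASIC_J_PER_TH = 16.0      # joules per terahash, midpoint of the 15-17 range
GPU_J_PER_TH = 50_000.0   # joules per terahash for the same SHA-256 work

print(f"GPU energy per unit of work: ~{GPU_J_PER_TH / ASIC_J_PER_TH:,.0f}x the ASIC")

# Electricity cost for one sustained terahash per second over a day,
# assuming $0.10 per kWh (also a placeholder).
SECONDS_PER_DAY = 24 * 3600
JOULES_PER_KWH = 3.6e6
PRICE_PER_KWH = 0.10

for name, j_per_th in (("ASIC", ASIC_J_PER_TH), ("GPU", GPU_J_PER_TH)):
    kwh_per_day = j_per_th * SECONDS_PER_DAY / JOULES_PER_KWH
    print(f"{name}: {kwh_per_day:,.1f} kWh/day -> ${kwh_per_day * PRICE_PER_KWH:,.2f}/day")
```

With these placeholder inputs the ASIC spends a few cents a day per terahash per second, while the GPU spends on the order of a hundred dollars. Same work, wildly different bill.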
That's exactly why Google built the TPU — the Tensor Processing Unit. It's the most prominent ASIC example in AI right now. The latest generation, Ironwood, version seven, hits four thousand six hundred and fourteen teraflops per chip. They trained Gemini three entirely on TPU pods. No general-purpose GPU overhead. Every transistor on that die is doing matrix math.
What does that actually displace in terms of cost when you go custom? Daniel asked about the non-customized cost comparison.
Let me break this down because the economics are genuinely fascinating. Off-the-shelf CPUs run a hundred to five hundred dollars per chip. GPUs go from three hundred to over two thousand. ASICs can cost fifteen hundred to fifteen thousand per chip at low volume. But that's the wrong way to look at it.
Because the chip unit cost isn't the whole story.
The big number is the NRE — non-recurring engineering. That's the design cost. For a structured ASIC, you're looking at two hundred thousand to seven hundred fifty thousand dollars. For a standard-cell ASIC, eight hundred thousand to two and a half million plus. At advanced nodes like seven nanometers, you can blow past ten million just in mask costs before you've fabricated a single working chip.
There's a breakeven volume where those upfront costs get amortized enough that the ASIC becomes cheaper than buying off-the-shelf parts.
The industry rule of thumb is between fifty thousand and two hundred thousand units per year. Below ten thousand units a year, don't even think about an ASIC — just use CPUs or GPUs or maybe an FPGA. Above two hundred thousand units a year, the ASIC almost always wins. And the savings compound because an ASIC consolidates multiple functions onto one die, so your printed circuit board shrinks, your power supply requirements drop, cooling gets simpler, packaging gets cheaper.
You're not just saving on the chip itself. You're shrinking the entire bill of materials around it.
That's the part most coverage misses. Everyone focuses on the silicon cost, but the system-level savings are often larger. Fewer components, smaller board, less heat to manage, simpler power delivery. At scale, that's millions.
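Here's the amortization math behind that rule of thumb in sketch form. Every figure is a hypothetical placeholder picked from the ranges we discussed, not a vendor quote:

```python
# A minimal sketch of the NRE amortization behind the breakeven rule of thumb.
# Every number is a hypothetical placeholder, not a quote from any vendor.

NRE = 5_000_000          # one-time design cost: RTL, verification, masks, etc.
SAVINGS_PER_UNIT = 100   # assumed per-unit saving vs the off-the-shelf part,
                         # including the board, power, and cooling savings

breakeven_units = NRE / SAVINGS_PER_UNIT
print(f"Breakeven at {breakeven_units:,.0f} units")   # 50,000 units

def payback_years(annual_volume: int) -> float:
    """Years of shipments needed to recover the one-time NRE."""
    return NRE / (SAVINGS_PER_UNIT * annual_volume)

for volume in (10_000, 50_000, 200_000):
    print(f"{volume:>7,} units/year -> NRE recovered in {payback_years(volume):.2f} years")
```

With these assumed inputs, ten thousand units a year takes five years to pay back the design cost, fifty thousand takes one year, and two hundred thousand pays it back in a quarter, which is exactly the shape of the rule of thumb.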
Alright, let's get to the deepest part of Daniel's question. What does customizing silicon at the deepest level actually mean in practice? Not the business case, not the economics — what are engineers literally doing?
The design flow runs from something called RTL to GDSII. RTL is Register Transfer Level — engineers write code in Verilog or VHDL that describes the logic of the chip. This is where you define what the chip actually does at a functional level. Then you synthesize that into a gate-level netlist — essentially a map of logic gates and how they connect. Then you do physical floorplanning, placement, and routing. And finally you produce a GDSII file, which is the geometric layout data sent to the foundry. That's the file that says, here is exactly where every transistor goes, every metal layer, every connection.
It's not just writing a program that runs on existing hardware. You're describing the hardware itself in code, and that code gets turned into physical geometry.
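To see that distinction in miniature, here's a toy sketch in plain Python standing in for Verilog, so don't read it as real RTL. It describes the same one-bit adder behaviorally and then as an explicit gate-level netlist:

```python
# Toy illustration of the RTL -> gate-level netlist idea, in plain Python
# rather than Verilog. The same one-bit half adder is described behaviorally
# (what it does) and structurally (which gates, wired how). Real synthesis
# tools perform this transformation automatically, across millions of gates.

def half_adder_rtl(a: int, b: int) -> tuple[int, int]:
    """Behavioral, RTL-style view: describe the function, not the gates."""
    total = a + b
    return total & 1, (total >> 1) & 1   # (sum bit, carry bit)

def half_adder_netlist(a: int, b: int) -> tuple[int, int]:
    """Gate-level view: explicit gates and wires, like a synthesized netlist."""
    s = a ^ b   # XOR gate produces the sum bit
    c = a & b   # AND gate produces the carry bit
    return s, c

# Equivalence check across all inputs, which is conceptually what a formal
# equivalence tool does between the RTL and the synthesized netlist.
for a in (0, 1):
    for b in (0, 1):
        assert half_adder_rtl(a, b) == half_adder_netlist(a, b)
print("Behavioral and gate-level descriptions agree on every input")
```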
At the deepest level — what's called full-custom design — every single transistor, every logic cell, every metal layer is designed from scratch. You're not using pre-built building blocks. You're placing individual transistors to optimize for your exact computation. This yields the highest possible density and efficiency, but it's enormously expensive and time-consuming. You'd only do this for the most performance-critical chips, or for analog components where standard cells don't cut it.
That sounds almost absurdly labor-intensive. Who's actually doing full-custom anymore?
It's rare. The dominant approach today is semi-custom, or standard-cell design. You use pre-designed, pre-characterized logic cells from a library — AND gates, flip-flops, multiplexers, all the basic building blocks — and you do custom placement and routing of those cells. You're still getting a custom chip, but you're not reinventing the NAND gate. This dramatically reduces risk and design time.
It's like the difference between milling your own lumber and building with pre-cut framing. The house is still custom, but you're not starting by felling trees.
That's the one analogy I'll allow myself today. And there's an even lighter version called gate-array ASIC, where the transistors are pre-fabricated on the wafer and only the metal interconnect layers are customized. That cuts turnaround time and NRE costs significantly, but you lose some density and performance.
We've got this spectrum. On one end, buying an off-the-shelf CPU or GPU — maximum flexibility, zero NRE, terrible efficiency per watt for any specific task. In the middle, FPGAs — reconfigurable after manufacturing, better efficiency than general-purpose, but still three to four times more power-hungry than an equivalent ASIC. And on the far end, full-custom silicon where every transistor is placed by hand for one workload.
FPGAs are interesting because they blur the line. You can reconfigure them after manufacturing. The hardware adapts to the software rather than the other way around. But that reconfigurability comes at a power penalty, because all those programmable interconnects and lookup tables add capacitance to every signal path and leak power even when nothing is switching.
Where does the tape-out fit in? I've heard that term and it always sounds vaguely terrifying.
Tape-out is the point of no return. It's when you send the GDSII file to the foundry and say, fab this. The design is frozen. A single mask set for a modern seven nanometer chip can cost millions of dollars. If you find a bug after tape-out, a respin can cost tens of millions and add months of delay. It's terrifying. That's why verification is something like sixty to seventy percent of the total design effort.
You spend more time checking your work than actually doing the work. That feels like a life lesson disguised as an engineering practice.
It's a brutal discipline. A bug in software, you push an update. A bug in silicon, you've got a very expensive paperweight. I remember reading about the Intel FDIV bug in the original Pentium — a tiny floating-point division error that cost them four hundred seventy five million dollars to replace. And that was in the nineties. Today's chips are orders of magnitude more complex.
Let's talk about what's happening right now in AI hardware, because this is where the ASIC versus GPU debate is most alive. Nvidia dominates with general-purpose GPUs. But Google, Amazon, Microsoft, Meta — they're all building custom silicon.
Every hyperscaler is asking the same question. We know our workloads. We know exactly what operations our models run billions of times. Why are we paying Nvidia's margins for general-purpose hardware when we could build chips that do only what we need, at a fraction of the power?
Nvidia's answer is basically, because by the time you design and fab your custom chip, we'll have released two new generations that are faster than your ASIC anyway.
That's the real tension. ASICs lock you into a specific design at a specific process node. If the algorithms change — and in AI, they change fast — your custom silicon is obsolete. Google's TPU strategy works because they control the entire stack. They design the chips, they design the models, they design the training frameworks. They can co-optimize in a way that a company buying off-the-shelf hardware can't.
Apple's doing something similar with the Neural Engine in their M-series and A-series chips. I saw they're now processing over sixty trillion operations per second for on-device AI. That's an ASIC embedded inside a general-purpose system-on-chip.
It's optimized for a very specific thing — low-latency inference with minimal power draw. Privacy-sensitive, on-device processing. They're not trying to train models with it. They're running face recognition, voice processing, predictive text. Fixed workloads, known patterns, massive volume. Perfect ASIC territory.
The fork in the road is basically, are your AI workloads stable enough to justify burning them into silicon? If yes, ASIC wins on efficiency by orders of magnitude. If no, you stay on GPUs or FPGAs and pay the flexibility tax.
There's a fascinating wildcard here. Open-source chip design is becoming a thing. Google has something called OpenMPW — open-source chip design shuttle runs. There's a project called Tiny Tapeout where you can get a custom chip fabricated for a few hundred dollars on a shared reticle. You literally submit your design, it gets placed on a die alongside other people's designs, and you get your own little square of silicon back.
Wait, a few hundred dollars for a custom chip? That's not a typo?
Not a typo. The shared reticle model means multiple designs share the mask costs. Your design might only be a few hundred microns square, but it's real. It's fabricated on a real process node. Now, this is for tiny designs — we're not talking about a competitor to an Nvidia H one hundred. But the barrier to entry is collapsing.
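The arithmetic behind that is simple enough to sketch. Both numbers below are made-up placeholders, but they show why splitting one mask set across hundreds of tiny designs collapses the price:

```python
# Why a shared reticle brings fabrication down to hobbyist prices.
# Both numbers are made-up placeholders for illustration only.

RUN_COST = 300_000       # assumed mask set plus wafer run on a mature node
DESIGNS_PER_RUN = 500    # assumed number of tiny designs sharing the reticle

print(f"~${RUN_COST / DESIGNS_PER_RUN:,.0f} per design, before packaging and overhead")
```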
We might be approaching a world where startups, not just hyperscalers, can afford custom silicon. That changes the economics entirely. If the NRE drops from millions to thousands, the breakeven volume plummets.
We're not there yet for production-grade chips. But the trajectory is real. The same democratization that happened with software — where anyone with a laptop can build and deploy an app — might be coming for silicon. Open-source EDA tools, shared fabrication runs, standard cell libraries that are freely available. It's early, but it's not science fiction.
Let me pull us back to Daniel's original question about the common component. We've established it's the die. But I want to make sure we haven't glossed over something. When you look at a CPU, part of what you're seeing is the package — the black rectangle with pins or contacts on the bottom. Inside that package is the die. But is the die the whole CPU?
The die is the silicon itself. But a complete chip also includes the package, the substrate that connects the die to the outside world, the thermal interface material, the heat spreader. In advanced packaging, you might have multiple dies in one package — chiplets connected by silicon interposers. So the die is the fundamental computational unit, but a modern processor is often a system of dies.
That's where the ASIC advantage gets even more interesting. If you're building a custom chip, you can integrate things that would normally be separate components. Memory controllers, I/O interfaces, specialized accelerators — all on one die.
Or on multiple dies in one package, using advanced packaging. AMD's been doing this with chiplets for years. But the principle is the same — integration reduces latency, reduces power, reduces board complexity. Every time data has to leave the chip and travel across a PCB trace to another component, you're burning energy and adding nanoseconds of delay.
The ASIC philosophy isn't just about stripping out unused functionality. It's also about pulling in functionality that would otherwise live elsewhere in the system.
Integration and specialization, working together. And this connects to something Daniel's been thinking about — inference versus training. Training is still dominated by GPUs because the workloads are varied and the models are changing fast. But inference — running a trained model in production — is becoming the bigger cost. And inference workloads are much more stable. You know exactly what operations your model needs. That's prime ASIC territory.
Which is why we're seeing so many inference-specific chips hitting the market. Groq, Cerebras, SambaNova — they're all building chips that are essentially ASICs for transformer inference.
Nvidia's response has been interesting. Their acquisition strategy, their move toward more specialized tensor cores within their GPUs — they're essentially embedding ASIC-like blocks inside a general-purpose architecture. Best of both worlds, in theory. The flexibility of a GPU with ASIC-level efficiency for the operations that matter most.
There's still overhead. You're still paying for the general-purpose infrastructure around those tensor cores.
And that's the bet. Nvidia is betting that the overhead is worth paying for the flexibility. The hyperscalers are betting that it isn't. We'll see who's right.
Let me ask you something. If you were advising a company that's deciding between going ASIC or sticking with GPUs for their AI workload, what's the one question you'd tell them to answer first?
How stable is your algorithm? Not your model architecture — your fundamental operations. If you're doing standard transformer inference with a known set of matrix sizes and attention patterns, and you're doing it at volume, you should be talking to ASIC vendors. If you're experimenting with new architectures every quarter, if you're doing research rather than production serving, stay flexible. The worst outcome is spending two years and ten million dollars on a custom chip for a model that's obsolete before the chip tapes out.
That's the nightmare scenario. You finally get your silicon back from the foundry and the world has moved on to a completely different architecture.
It's happened. There were companies that built ASICs for specific cryptocurrency mining algorithms, and then the coin forked to a new proof-of-work function, and overnight those chips became doorstops. That's the risk distilled to its purest form.
Alright, let's shift gears slightly. Daniel asked about what gets displaced in terms of cost. We've covered the chip-level economics. But what about the human side? When a company goes ASIC, what roles and what processes get displaced?
That's a great angle. You need a completely different team. Instead of software engineers writing CUDA kernels, you need hardware engineers writing Verilog. You need verification engineers, physical designers, DFT — design for testability — specialists. You need people who understand clock domain crossing and timing closure and signal integrity. It's a different discipline entirely.
You're not just buying different chips. You're building a different organization.
The timeline is completely different. Software you can iterate on in hours or days. An ASIC design cycle is typically eighteen to twenty-four months from architecture to silicon. If you're used to the software cadence of ship, measure, fix, ship again — hardware will feel glacial.
That's why FPGAs are so popular as a stepping stone. You can prototype your design, test it in real workloads, iterate quickly, and then — once it's stable — commit it to an ASIC.
FPGA for prototyping, ASIC for production. That's the classic playbook. And it works well because the RTL code you write for the FPGA can largely be reused for the ASIC. The synthesis target changes, but the logic description doesn't.
The RTL is the portable part. The actual gates and wires change depending on whether you're targeting a reconfigurable fabric or a fixed silicon process, but the functional description carries over.
And that's why RTL skills are so valuable. If you can write good Verilog or VHDL, you can target an FPGA today and an ASIC tomorrow. Your code describes what the hardware does, and the tools figure out how to map it to the target technology.
Let's talk about where this is all heading. We've got hyperscalers designing their own chips. We've got open-source toolchains lowering the barrier. We've got chiplets and advanced packaging making it easier to mix and match specialized dies. Is the future of computing a world where general-purpose CPUs and GPUs become niche products, and everything else is some flavor of custom silicon?
I don't think general-purpose goes away. There's always going to be a need for flexible compute. Your phone, your laptop — those run too many different workloads to justify a pure ASIC approach. But what I think we'll see is general-purpose chips becoming host processors surrounded by specialized accelerators. Your CPU handles the operating system, the browser, the unpredictable stuff. But the heavy lifting — AI inference, video encoding, cryptography, sensor processing — gets offloaded to ASIC blocks on the same die or in the same package.
Which is already happening. Apple's doing it. Intel's doing it with their NPU. AMD's doing it with their AI engine. The general-purpose core is becoming the conductor, not the orchestra.
That's a profound shift. For decades, the industry was about making the general-purpose core faster. Higher clock speeds, deeper pipelines, better branch prediction. Now we're hitting the limits of that approach, and the gains are coming from specialization. It's a complete inversion of the philosophy that drove Moore's Law.
Moore's Law was about making transistors cheaper and faster. The specialization era is about using those transistors more intelligently. Same transistor budget, but instead of building a bigger general-purpose core, you build a collection of specialized engines that each do one thing extremely well.
This is why the ASIC conversation matters so much right now. We're at an inflection point where the economics of general-purpose computing are breaking down for the most valuable workloads. AI training and inference consume enormous amounts of energy and silicon. The incentive to specialize is overwhelming.
The thing that strikes me is how much of this is invisible to the end user. You're using ASICs constantly and you don't know it. The cellular modem in your phone is an ASIC. The Wi-Fi chip is an ASIC. The Bluetooth controller is an ASIC. The touchscreen controller is an ASIC. All these fixed-function chips doing one job perfectly, hidden inside a device that feels like a general-purpose computer.
That's the triumph of the ASIC approach. When the function is stable and the volume is high enough, you don't even think about it. It just becomes a component. The entire modern world runs on custom silicon that nobody sees.
Alright, let's land this. Daniel wanted to understand what a chip is at its core — the die — and what it means to customize at the deepest level. We've covered that spectrum from off-the-shelf to full-custom. But if there's one thing I want listeners to take away, it's that the die is the great equalizer. Everything starts as the same slab of silicon. The difference between a five-dollar microcontroller and a fifteen-thousand-dollar AI accelerator is entirely in the design — the RTL code, the physical layout, the engineering hours poured into optimization.
The economics are counterintuitive. Custom silicon is expensive to develop but cheap to operate at scale. General-purpose silicon is cheap to buy but expensive to operate at scale. The breakeven depends entirely on your volume and how stable your workload is.
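To put rough numbers on the operating side, here's a sketch of a fleet-scale electricity bill. Fleet size, power draw, and electricity price are all assumed placeholders, not figures for any real deployment:

```python
# Sketch of the "cheap to buy, expensive to operate" flip at fleet scale.
# Fleet size, power draw, and electricity price are all assumed placeholders.

FLEET_SIZE = 100_000      # accelerators running around the clock
HOURS_PER_YEAR = 24 * 365
PRICE_PER_KWH = 0.08      # assumed industrial electricity rate

GPU_WATTS = 700           # assumed board power for a general-purpose GPU
ASIC_WATTS = 200          # assumed power for a custom chip doing the same work

def annual_power_bill(watts: float) -> float:
    """Yearly electricity cost for the whole fleet at a given per-chip power."""
    kwh = watts / 1000 * HOURS_PER_YEAR * FLEET_SIZE
    return kwh * PRICE_PER_KWH

gpu_bill = annual_power_bill(GPU_WATTS)
asic_bill = annual_power_bill(ASIC_WATTS)
print(f"GPU fleet:  ${gpu_bill:,.0f}/year in electricity")
print(f"ASIC fleet: ${asic_bill:,.0f}/year in electricity")
print(f"Difference: ${gpu_bill - asic_bill:,.0f}/year")
```

With these assumed figures the gap is tens of millions of dollars a year in electricity alone, before you count the silicon itself. That's the kind of number that makes a multi-million-dollar NRE look small.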
Now: Hilbert's daily fun fact.
The collective noun for a group of porcupines is a prickle.
What can listeners actually do with any of this? First, if you're building a product that needs compute, don't default to a CPU or GPU. Ask whether your workload is stable enough to justify an FPGA prototype, and whether your volume justifies an eventual ASIC. The breakeven math is surprisingly accessible — fifty thousand units a year is not an astronomical number for a successful product.
Second, if you're a software engineer curious about hardware, the barrier to entry has never been lower. You can learn Verilog online, simulate your designs with open-source tools, and even get a tiny chip fabricated through Tiny Tapeout for a few hundred dollars. The path from software to silicon is real and it's getting shorter every year.
Third, pay attention to where your compute actually runs. The ASICs in your phone are doing sixty trillion operations a second while sipping power. The GPU in your cloud instance is burning hundreds of watts. Understanding the hardware your code runs on is not optional anymore — it's core engineering literacy.
The big open question for me is whether the open-source silicon movement can do for hardware what open source did for software. If chip design becomes accessible to startups and individual engineers the way software development did, we could see an explosion of specialized silicon that makes the current AI hardware landscape look primitive. Or it might turn out that the physics and economics of fabrication keep custom silicon in the hands of hyperscalers forever. I don't know which way it goes.
I suspect it'll be somewhere in the middle. The tools will democratize, but the fabrication will always require enormous capital. What changes is who gets to play with the design tools, not who owns the fabs. And that's still a massive shift.
Thanks to Hilbert Flumingtop for producing. This has been My Weird Prompts. You can find every episode at myweirdprompts dot com or wherever you get your podcasts. If you enjoyed this, leave us a review — it helps.
We'll be back with another one soon.