Here's what Daniel sent us this week. He's asking about spec-driven development, the approach that's quickly entrenching itself as the dominant paradigm for agentic AI code generation. He wants us to dig into the tools and best practices for defining clear specs when you're working with AI coding agents, why the quality of the spec is now the real bottleneck, and whether this whole movement is genuinely new or just old ideas getting a fresh coat of paint. There's a lot to unpack here.
There really is. And I think the "is it new" question is actually the most interesting place to start, because it cuts to whether developers should be paying attention to this or just ignoring it as hype.
So let's set the stage. The argument is basically that as AI coding tools have evolved from autocomplete into these fully autonomous cloud agents, the thing that determines output quality has shifted. It's no longer about how good the model is at writing a line of code. It's about how clearly you told it what to build.
Cursor's CEO Michael Truell laid this out really well back in February. He described three eras of AI coding. Era one is tab autocomplete, one keystroke at a time. Era two is synchronous agents, the prompt-and-response loop most people are still in. Era three, where we are now, is autonomous cloud agents that tackle large tasks independently over hours and return logs, video recordings, and live previews rather than just diffs.
And the numbers behind that shift are kind of staggering. In March of last year, Cursor had two and a half times as many tab users as agent users. By February this year, that had completely flipped. Twice as many agent users as tab users. Agent usage grew fifteen times in a single year.
And thirty-five percent of the PRs merged internally at Cursor are now created by autonomous cloud agents. That's not a beta feature, that's their actual development workflow. Their recurring revenue doubled in three months to two billion ARR. So when Truell talks about era three being real, he has the numbers to back it up.
Which makes the spec question urgent in a very practical way. If an agent is running for hours on a cloud VM, a vague prompt doesn't just produce mediocre code. It produces hours of wasted compute and a debugging nightmare at the end.
This is the core insight of spec-driven development. The spec is the primary artifact. Code becomes the implementation detail of the spec, not the other way around. And there's a January arXiv paper by Deepak Babu Piskala that formalizes this really nicely. He describes a spectrum of three levels of specification rigor.
Walk me through those.
The first is spec-first, where you write the spec before coding and potentially discard it afterward. That's good for prototypes and initial AI-assisted development. The second is spec-anchored, where the spec is maintained alongside code throughout the entire lifecycle, and tests enforce alignment. That's your long-lived production system pattern. The third is spec-as-source, the most radical form, where the spec is the only artifact humans ever edit and code is entirely generated. Think automotive workflows where Simulink models generate C code directly.
The spec-as-source thing is wild to think about. Code files with comments saying "generated from spec, do not edit." That's a pretty significant inversion of how most developers think about their job.
It is. And Tessl, which started as a context management tool and has evolved into basically a package manager for agent skills, is pushing exactly this direction. They have a framework in private beta where that's the actual model. But I want to come back to whether that's realistic, because the history here is not entirely encouraging.
Before we get to the skeptic's case, let's talk about what a good spec actually looks like in practice, because I think a lot of developers hear "write specs first" and think that means writing documentation nobody reads.
The paper's framing is useful here. A good spec is behavior-focused, meaning it describes what happens, not how. It's testable, so every requirement is verifiable. It's unambiguous, meaning different readers reach the same interpretation. And it's complete enough to cover essential cases without over-specifying. That last one is tricky, because there's a real failure mode of writing specs that are so detailed they're essentially pseudo-code, which defeats the entire purpose.
The example in the research is a good one. If you prompt an agent with "add photo sharing to my app," you've just handed it a dozen implicit decisions. Format, permissions, size limits, storage, compression. The agent has to guess all of those. A spec eliminates that guessing.
And the format matters. The paper recommends Given-When-Then, the Gherkin format, for acceptance criteria. Kiro, Amazon's agentic IDE, uses EARS notation, which stands for Easy Approach to Requirements Syntax. The template is: when a trigger occurs, the system shall produce a specified response. These formats aren't just stylistic preferences. They force you to make every assumption explicit before you hand the task to an agent.
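To make that concrete, here's what acceptance criteria for the photo-sharing example might look like in each format. The specific limits and behaviors are illustrative, not drawn from any real spec:

```
# Given-When-Then (Gherkin):
Scenario: Reject an oversized upload
  Given a signed-in user with upload permission
  When the user uploads a file larger than 10 MB
  Then the upload is rejected with a clear size-limit error

# EARS ("when <trigger>, the <system> shall <response>"):
When a user uploads a file larger than 10 MB,
the photo service shall reject the upload and return a size-limit error.
```

Notice that both forms force the implicit decisions, the size limit, the error behavior, into the open before any agent starts work.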
So let's talk about the tools, because the ecosystem here has exploded. By the way, Claude Sonnet 4.6 is writing our script today, which is a fun little detail given we're talking about AI-generated artifacts. But back to the tools. GitHub Spec Kit is the headline number. Eighty-seven thousand six hundred stars as of today, with version zero point six point two dropping literally this morning.
And it's MIT licensed, CLI-based, agent-agnostic, supports twenty-five or more AI agents including Claude Code, Cursor, Windsurf, Gemini CLI, the full list. The workflow is a pipeline of commands. You start with a constitution, then specify, then plan, then break down tasks, then implement. That sequencing is deliberate. Each phase is a gate.
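In practice that pipeline looks roughly like the following. Command names here are from recent Spec Kit versions and may differ in yours; the slash commands run inside your coding agent after the CLI bootstraps the project:

```
specify init my-project       # bootstrap the Spec Kit templates

# then, inside your agent session:
/speckit.constitution   # write the project's governing principles
/speckit.specify        # describe what to build (the spec)
/speckit.plan           # choose the technical approach
/speckit.tasks          # break the plan into reviewable tasks
/speckit.implement      # let the agent execute the tasks
```

Each command is a gate: you review its output before moving to the next phase.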
The constitution concept is the one I find most interesting. It's described as the project's immutable principles that govern all development decisions. Not a per-feature spec, not a per-session rules file, but persistent project memory that every agent interaction is anchored to.
That's actually a meaningful distinction. Cursor's dot cursorrules files are often described as pseudo-specs, and they do serve a similar function at first glance. But they have no automated validation, no spec lifecycle management. They're basically persistent system prompts. The constitution in Spec Kit is something more structural. IBM has even published a fork for infrastructure-as-code workflows, which tells you the concept is landing in enterprise contexts.
The extension ecosystem for Spec Kit alone is fifty-plus extensions. Jira integration, Azure DevOps sync, CI/CD gates, security review. For a tool that's only been around a few months, that's a remarkable amount of community investment.
BMAD-METHOD is the other big one. Forty-four thousand five hundred stars, five thousand three hundred forks, version six point three released three days ago. The name stands for Breakthrough Method for Agile AI-Driven Development, which I'll admit sounds like a motivational poster, but the actual system is genuinely sophisticated.
Twelve specialized agent personas. Mary the Business Analyst, Preston the Product Manager, Winston the Architect. There's something both impressive and slightly unsettling about that level of role decomposition.
What's smart about it is the scale-adaptive approach. Quick Flow for bug fixes and small tasks, Enterprise Flow for full platform development. That addresses one of the real criticisms of spec-driven tooling, which is that it's overkill for most day-to-day coding. You don't need sixteen acceptance criteria to fix a null pointer exception.
Which is exactly the criticism Birgitta Böckeler raised on Martin Fowler's site last October. She looked at Kiro specifically and found a bug fix that generated four user stories with sixteen acceptance criteria. And her broader point was that Spec Kit created so many intermediate markdown files it became tedious to review. Quote: "I'd rather review code than all these markdown files."
That's a real tension. The value of the spec is clarity, but if the tooling generates so much scaffolding that the signal is buried in the noise, you've traded one kind of overhead for another.
She also coined a German word for it. Verschlimmbesserung. Making something worse in the attempt to make it better.
Which is a word that should probably be on a t-shirt at every developer conference.
Alongside "move fast and break specs."
But here's where I think the living spec versus static spec distinction becomes really important. Kiro and GitHub Spec Kit are static. You write the spec upfront, and it doesn't update as implementation proceeds. For a complex multi-service project, that spec can start drifting from reality within hours of the agent starting work.
Augment Code's Intent platform is the counter-approach. They describe it as a living spec, one that updates bidirectionally as agents implement changes. They have a coordinator agent spawning specialist agents in parallel, including an Investigate agent, an Implement agent, a Verify agent, a Critique agent, a Debug agent, and a Code Review agent. And their context engine maintains semantic understanding across four hundred thousand or more files.
The pricing is sixty dollars a month for Standard, two hundred for Max. Which is a significant commitment, but if you're running complex multi-agent workflows on large codebases, the alternative is probably more expensive in engineering time.
OpenSpec takes a different angle. Twenty-eight thousand four hundred stars, and it's explicitly brownfield-first. It uses delta markers, added, modified, removed, to track changes against existing functionality. And it enforces a strict three-phase state machine: proposal, apply, archive. You can't generate code until a spec has been explicitly approved.
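A change proposal in that delta style might look something like this. This is a hypothetical sketch of the format, not copied from OpenSpec's docs:

```markdown
## ADDED Requirements
### Requirement: Photo sharing
Users SHALL be able to share photos with their followers.

## MODIFIED Requirements
### Requirement: Storage quota
Per-user quota raised from 1 GB to 5 GB.

## REMOVED Requirements
### Requirement: Legacy thumbnail endpoint
```

The point of the deltas is that you review only what's changing against the existing spec, not the whole document from scratch.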
That approval gate is interesting because it's a deliberate friction point. The whole premise is that you want humans in the loop before implementation begins, not after. Which is the opposite of the "just ship it and see what happens" culture a lot of teams have drifted into with AI tools.
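The gate itself is just a small state machine. Here's a minimal Python sketch of a proposal-apply-archive lifecycle with an explicit human approval step. This is our own illustration of the idea, not OpenSpec's actual implementation:

```python
class SpecChange:
    """Minimal sketch of a proposal -> approved -> applied -> archived lifecycle."""

    # Legal transitions: implementation is blocked until a human approves.
    TRANSITIONS = {
        "proposal": {"approved"},
        "approved": {"applied"},
        "applied": {"archived"},
    }

    def __init__(self, name):
        self.name = name
        self.state = "proposal"

    def _move(self, new_state):
        if new_state not in self.TRANSITIONS.get(self.state, set()):
            raise RuntimeError(f"cannot go {self.state} -> {new_state}")
        self.state = new_state

    def approve(self):  # human sign-off on the spec
        self._move("approved")

    def apply(self):    # only now may an agent generate code
        self._move("applied")

    def archive(self):  # change is folded into the long-lived spec
        self._move("archived")


change = SpecChange("add-photo-sharing")
try:
    change.apply()      # skipping approval is rejected
except RuntimeError as err:
    print(err)          # cannot go proposal -> applied
change.approve()
change.apply()
change.archive()
print(change.state)     # archived
```

The friction is the feature: there is simply no legal path from "proposal" to "applied" that doesn't pass through a human approval.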
Let's talk about the productivity question, because this is where the honest answer gets complicated. The METR study from early last year found that AI tools made experienced open-source developers nineteen percent slower. That number was widely cited and widely argued about.
The February update from METR is more nuanced and honestly more interesting. Their late twenty twenty-five follow-up estimated developers were about eighteen percent faster. But the really telling finding is that thirty to fifty percent of developers are now choosing not to submit tasks to the study because they don't want to do them without AI. One developer said, "I avoid issues where AI can finish things in two hours but I'd have to spend twenty hours."
Which means the selection effects in the study are enormous. The tasks people are willing to do without AI are a self-selected subset of easier or more interesting tasks. The true productivity delta is probably much larger than the measured one.
Augment Code's argument is that the early twenty twenty-five slowdown was largely because unstructured prompts created debugging loops that consumed the time saved on generation. You save two hours writing code but spend four hours debugging the output because the agent was guessing at your intent the whole time.
And the spec-driven case is that if you eliminate the ambiguity upfront, you eliminate the debugging loops. The arXiv paper cites a controlled study showing human-refined specs improve LLM-generated code quality with error reductions of up to fifty percent. There's also a financial services case study showing a seventy-five percent reduction in integration cycle time after adopting an API-first spec-driven approach.
Four percent of all GitHub commits are now authored by Claude Code, by the way. That's the current figure. That's not a niche workflow anymore.
So now let's get to the "is this just BDD with branding" question, because that's the most pointed version of the skeptic's argument. Bryan Finster, a DevOps practitioner, said exactly that. And the arXiv paper actually addresses it head-on.
The paper's position is that SDD is an evolution, not a revolution. The core insight, write specs first and let code derive from them, has been agile wisdom for decades. TDD to BDD to SDD is a natural lineage. What's genuinely new is three things. Better tooling that makes executable specs practical. CI/CD maturity that enables automated enforcement. And AI as a consumer where spec quality directly determines output quality. That third point is the one that changes the stakes.
When your consumer is a human developer, a slightly ambiguous spec just leads to a clarifying conversation. When your consumer is an autonomous agent running for six hours on a cloud VM, that ambiguity materializes as a concrete artifact you now have to debug.
The Thoughtworks Technology Radar from November twenty twenty-five placed SDD in the Assess category, which is their language for "worth exploring to understand how it affects your enterprise." But they also raised the most sobering parallel. Model-Driven Development from the early two thousands. Same promise: define the model, generate the code, humans only edit the high-level artifact.
And MDD largely failed for business applications.
It did. Thoughtworks' note was: "We may be relearning a bitter lesson that handcrafting detailed rules for AI ultimately doesn't scale." The Böckeler analysis draws the same parallel explicitly: MDD sat at an awkward abstraction level and created too much overhead and too many constraints.
So what's different this time? Because there has to be an answer to that or the whole movement is doomed to repeat the same failure.
I think there are two meaningful differences. First, natural language specs plus LLMs eliminate the need for a custom parseable language. MDD required you to learn a domain-specific language, which was its own kind of overhead and created lock-in. Natural language specs don't have that barrier. Second, the non-determinism of LLMs is actually a feature here, not a bug. MDD failed partly because the generated code was too rigid. LLMs can navigate ambiguity in implementation while still being constrained by a clear behavioral spec.
Although non-determinism cuts both ways. If the spec is the only source of truth and the code is entirely generated, you need very high confidence that the generation is correct. And right now, that confidence isn't there for most production contexts.
Which is why spec-anchored, the middle tier, is probably where most teams should be operating. Not spec-first as a throwaway, not spec-as-source as the radical endpoint, but spec-anchored where the spec is maintained alongside code and tests enforce alignment.
Let's talk about Kiro's agent hooks, because I think this is an underappreciated part of the picture. It's Amazon's agentic IDE built on VS Code, launched July last year, and the hooks concept is different from the upfront planning workflow.
Hooks are event-driven automations triggered on file save or create. You save a React component and the agent automatically updates the test file. You modify API endpoints and the agent refreshes the README. It's spec-driven development at the micro level rather than the project level. Continuous, ambient enforcement rather than upfront planning.
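Kiro stores hooks as small JSON files under `.kiro/hooks`. A hook that keeps tests in sync with a component might look roughly like this; the field names reflect our reading of Kiro's docs and may differ by version:

```json
{
  "enabled": true,
  "name": "Sync component tests",
  "description": "Update the matching test file when a React component changes",
  "version": "1",
  "when": {
    "type": "fileEdited",
    "patterns": ["src/components/**/*.tsx"]
  },
  "then": {
    "type": "askAgent",
    "prompt": "A component changed. Update its test file so the tests still cover the component's current behavior."
  }
}
```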
That feels like the more practical near-term adoption path for teams that aren't ready to restructure their entire development workflow around a constitution and a five-phase pipeline.
The limitation with Kiro right now is that it only supports Claude models and the specs are static, they don't update during implementation. For small to medium projects that's probably fine. For complex multi-service architectures it starts to break down.
The "self-spec" pattern is interesting too. The idea that the LLM authors its own specification before generating code. You give it a high-level prompt, it produces a spec, humans review and refine the spec, then the same or another agent implements against it. That creates an explicit separation between planning and execution that most current workflows collapse together.
And it surfaces the assumptions the model is making before they get baked into code. If the agent's spec for "add photo sharing" says it'll use S3 for storage and you actually need on-premise storage, you catch that before implementation, not after six hours of cloud agent runtime.
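In code, the self-spec pattern is just two model calls with a human gate between them. Here's a minimal Python sketch, with a stubbed `call_model` standing in for whatever LLM client you actually use; everything here, including the canned responses, is illustrative:

```python
def call_model(prompt):
    """Stub for an LLM call; swap in your real client here."""
    if prompt.startswith("SPEC:"):
        return "Given a signed-in user, when they upload a photo, then it is stored and shared."
    return "def upload_photo(user, file): ..."  # canned 'implementation'


def self_spec_workflow(intent, review):
    # Phase 1: the model drafts its own spec from the high-level intent.
    draft_spec = call_model("SPEC: " + intent)

    # Phase 2: a human reviews and refines the spec BEFORE any code exists.
    approved_spec = review(draft_spec)
    if approved_spec is None:
        return None  # reviewer rejected the plan; no code is generated

    # Phase 3: the same or another agent implements against the approved spec.
    return call_model("IMPLEMENT: " + approved_spec)


# A reviewer that tightens the storage assumption before approving:
code = self_spec_workflow(
    "add photo sharing to my app",
    review=lambda spec: spec + " Photos are stored on-premise, not in S3.",
)
print(code is not None)  # True
```

The structural point is that the review callback sits between planning and execution, which is exactly the separation most current prompt-and-go workflows collapse.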
So if you're a developer or a team trying to figure out where to actually start with this, what does the practical guidance look like?
I'd start with the constitution or memory bank concept regardless of which tooling you adopt. Getting the immutable project principles written down, the tech stack, the security model, the architectural constraints, is valuable independent of anything else. That document alone eliminates a huge class of agent guessing.
And it's reusable across tools. If you write a solid constitution today and switch from Spec Kit to BMAD-Method next month, the constitution travels with you.
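What actually goes in a constitution? A hypothetical minimal example; the section names are ours, and Spec Kit's own template differs in the details:

```markdown
# Project Constitution

## Tech stack
- TypeScript, Node 20, PostgreSQL. No new runtime dependencies without review.

## Security
- All endpoints require authentication; secrets come from the vault, never from env files.

## Architecture
- Services communicate over the message bus only; no shared database access.

## Testing
- Every behavioral change ships with a failing-then-passing test.
```

Even a page this short removes a whole class of agent guessing: the agent never has to infer your stack, your secrets policy, or your service boundaries.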
Second, use Given-When-Then or EARS notation for acceptance criteria on any non-trivial feature. Not for bug fixes, not for small refactors. But for anything that's going to run as an autonomous agent task for more than a few minutes, having explicit behavioral requirements is the difference between a useful output and a debugging session.
Third, match the rigor level to the task. This is where a lot of teams get it wrong. BMAD-Method's scale-adaptive approach is smart precisely because it doesn't force enterprise flow on a bug fix. Sixteen acceptance criteria for a null pointer exception is bureaucracy, not engineering.
Fourth, think seriously about the static versus living spec question before you commit to tooling. If you're working on a project with many services, many contributors, and fast-moving requirements, a static spec will drift within hours. Augment Code's living spec approach or something like OpenSpec's explicit approval gates may be worth the additional infrastructure cost.
And fifth, the Tessl point about context quality is worth sitting with. Their data shows that well-structured context can drive a three point three times improvement in agent use of libraries. The Cisco security skill example went from a forty-seven percent baseline to eighty-four percent agent success rate. HashiCorp's Terraform stacks skill went from forty-seven to ninety-six percent. That's not a marginal improvement, that's the difference between a tool that works and one that doesn't.
The broader implication there is that spec quality and context quality are converging. The spec tells the agent what to build. The context tells the agent what it has to work with. Getting both right is the actual discipline.
There's a version of this conversation where we end up concluding that spec-driven development is just good engineering practice that's always been true and is now being rediscovered under a new name. And honestly, I think that's partially right. But the scale argument is real.
When your agent is running autonomously for hours and the cost of a wrong direction is hours of wasted compute, the stakes of getting the spec right are qualitatively different from the stakes when you're working synchronously with a human developer. The discipline was always valuable. Now it's necessary.
The eighty-seven thousand stars on Spec Kit in a matter of months tells you something about developer appetite for structure. That's not vendor marketing driving those numbers. That's developers who've experienced the debugging loop that comes from loose prompts and are actively looking for a better workflow.
And BMAD-Method's forty-four thousand stars with five thousand forks and a hundred and thirty-five contributors. OpenSpec at twenty-eight thousand four hundred. This is a genuine grassroots movement. The community is building this infrastructure because they need it, not because a product team told them to.
The question I keep coming back to is the target user question. Because Böckeler raised it sharply. Spec-driven tools include product-level concepts. User stories, product requirements documents, acceptance criteria. Are these tools for developers? For product managers? For some developer-PM hybrid that doesn't really exist yet at most companies?
That's an honest tension. If SDD requires developers to do requirements analysis, you're asking them to develop a skill set that's traditionally sat with a different function. BMAD-Method's persona system, with separate agents for Business Analyst, Product Manager, and Architect roles, is partly an attempt to address this. The developer writes the high-level intent and the agent personas do the requirements decomposition. But someone still has to review the output.
Someone still has to know if the acceptance criteria are actually correct.
Which brings you back to the false confidence pitfall. A passing spec test only guarantees code matches the spec. If the spec is wrong, the code faithfully implements the wrong thing. The discipline of writing good specs is not something tooling can fully automate. It requires domain knowledge and judgment that still lives with humans.
That's probably the most important thing to say about all of this. The tools are genuinely useful and the ecosystem is maturing fast. But the underlying skill of translating intent into unambiguous, testable, behavior-focused requirements is a human skill that becomes more valuable as agents become more powerful, not less.
The agents are getting better at implementation. The bottleneck is shifting to specification. Which means the highest-leverage investment a developer can make right now is getting good at writing specs, not at writing code.
Alright. Future implications. The spec-as-source world, if it arrives, probably looks less like developers writing code and more like developers writing and reviewing specifications, with agents generating and verifying the implementation layer continuously. The Tessl vision of code marked "generated from spec, do not edit" is still a few reliability jumps away from being the default, but the direction is clear.
And the living spec versus static spec debate is going to intensify as projects get more complex. Right now most teams don't have the infrastructure for living specs. But as the tooling matures, the cost of that infrastructure will drop. The teams that figure out bidirectional spec-code synchronization early will have a meaningful advantage on large codebases.
The open question I'd leave listeners with is this: if the spec is the primary artifact and code is the implementation detail, what does version control look like in five years? Do you version the spec and let the code be regenerated from it? That's a genuinely unsettled question with interesting implications for how we think about code ownership, audit trails, and debugging.
That's a good one to sit with.
Thanks as always to our producer Hilbert Flumingtop for keeping this whole operation running. Big thanks to Modal for providing the GPU credits that power this show. If you're enjoying My Weird Prompts, a quick review on your podcast app helps us reach new listeners more than you'd think. This has been My Weird Prompts. We'll see you on the next one.