Here's what Daniel sent us this week. He's asking about spec-driven development, the approach that's quickly entrenching itself as the dominant paradigm for agentic AI code generation. He wants us to dig into the tools and best practices for defining clear specs when you're working with AI coding agents, why the quality of the spec is now the real bottleneck, and whether this whole movement is genuinely new or just old ideas getting a fresh coat of paint. There's a lot to unpack here.
There really is. And I think the "is it new" question is actually the most interesting place to start, because it cuts to whether developers should be paying attention to this or just ignoring it as hype.
So let's set the stage. The argument is basically that as AI coding tools have evolved from autocomplete into these fully autonomous cloud agents, the thing that determines output quality has shifted. It's no longer about how good the model is at writing a line of code. It's about how clearly you told it what to build.
Cursor's CEO Michael Truell laid this out really well back in February. He described three eras of AI coding. Era one is tab autocomplete, one keystroke at a time. Era two is synchronous agents, the prompt-and-response loop most people are still in. Era three, where we are now, is autonomous cloud agents that tackle large tasks independently over hours and return logs, video recordings, and live previews rather than just diffs.
And the numbers behind that shift are kind of staggering. In March of last year, Cursor had two and a half times as many tab users as agent users. By February this year, that had completely flipped. Twice as many agent users as tab users. Agent usage grew fifteen times in a single year.
And thirty-five percent of the PRs merged internally at Cursor are now created by autonomous cloud agents. That's not a beta feature, that's their actual development workflow. Their recurring revenue doubled in three months to two billion ARR. So when Truell talks about era three being real, he has the numbers to back it up.
Which makes the spec question urgent in a very practical way. If an agent is running for hours on a cloud VM, a vague prompt doesn't just produce mediocre code. It produces hours of wasted compute and a debugging nightmare at the end.
This is the core insight of spec-driven development. The spec is the primary artifact. Code becomes the implementation detail of the spec, not the other way around. And there's a January arXiv paper by Deepak Babu Piskala that formalizes this really nicely. He describes a spectrum of three levels of specification rigor.
Walk me through those.
The first is spec-first, where you write the spec before coding and potentially discard it afterward. That's good for prototypes and initial AI-assisted development. The second is spec-anchored, where the spec is maintained alongside code throughout the entire lifecycle, and tests enforce alignment. That's your long-lived production system pattern. The third is spec-as-source, the most radical form, where the spec is the only artifact humans ever edit and code is entirely generated. Think automotive workflows where Simulink models generate C code directly.
The spec-as-source thing is wild to think about. Code files with comments saying "generated from spec, do not edit." That's a pretty significant inversion of how most developers think about their job.
It is. And Tessl, which started as a context management tool and has evolved into basically a package manager for agent skills, is pushing exactly this direction. They have a framework in private beta where that's the actual model. But I want to come back to whether that's realistic, because the history here is not entirely encouraging.
Before we get to the skeptic's case, let's talk about what a good spec actually looks like in practice, because I think a lot of developers hear "write specs first" and think that means writing documentation nobody reads.
The paper's framing is useful here. A good spec is behavior-focused, meaning it describes what happens, not how. It's testable, so every requirement is verifiable. It's unambiguous, meaning different readers reach the same interpretation. And it's complete enough to cover essential cases without over-specifying. That last one is tricky, because there's a real failure mode of writing specs that are so detailed they're essentially pseudo-code, which defeats the entire purpose.
The example in the research is a good one. If you prompt an agent with "add photo sharing to my app," you've just handed it a dozen implicit decisions. Format, permissions, size limits, storage, compression. The agent has to guess all of those. A spec eliminates that guessing.
And the format matters. The paper recommends Given-When-Then, the Gherkin format, for acceptance criteria. Kiro, Amazon's agentic IDE, uses EARS notation, which stands for Easy Approach to Requirements Syntax. The template is: when a trigger occurs, the system shall produce a specified response. These formats aren't just stylistic preferences. They force you to make every assumption explicit before you hand the task to an agent.
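To make that concrete, here's what acceptance criteria for the photo-sharing example might look like in each format. The specific limits and behaviors are illustrative, not drawn from any real spec:

```
# Given-When-Then (Gherkin):
Scenario: Reject an oversized upload
  Given a signed-in user with upload permission
  When the user uploads a file larger than 10 MB
  Then the upload is rejected with a clear size-limit error

# EARS ("when <trigger>, the <system> shall <response>"):
When a user uploads a file larger than 10 MB,
the photo service shall reject the upload and return a size-limit error.
```

Notice that both forms force the implicit decisions, the size limit, the error behavior, into the open before any agent starts work.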
So let's talk about the tools, because the ecosystem here has exploded. By the way, Claude Sonnet 4.6 is writing our script today, which is a fun little detail given we're talking about AI-generated artifacts. But back to the tools. GitHub Spec Kit is the headline number. Eighty-seven thousand six hundred stars as of today, with version zero point six point two dropping literally this morning.
And it's MIT licensed, CLI-based, agent-agnostic, supports twenty-five or more AI agents including Claude Code, Cursor, Windsurf, Gemini CLI, the full list. The workflow is a pipeline of commands. You start with a constitution, then specify, then plan, then break down tasks, then implement. That sequencing is deliberate. Each phase is a gate.
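In practice that pipeline looks roughly like the following. Command names here are from recent Spec Kit versions and may differ in yours; the slash commands run inside your coding agent after the CLI bootstraps the project:

```
specify init my-project       # bootstrap the Spec Kit templates

# then, inside your agent session:
/speckit.constitution   # write the project's governing principles
/speckit.specify        # describe what to build (the spec)
/speckit.plan           # choose the technical approach
/speckit.tasks          # break the plan into reviewable tasks
/speckit.implement      # let the agent execute the tasks
```

Each command is a gate: you review its output before moving to the next phase.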
The constitution concept is the one I find most interesting. It's described as the project's immutable principles that govern all development decisions. Not a per-feature spec, not a per-session rules file, but persistent project memory that every agent interaction is anchored to.
That's actually a meaningful distinction. Cursor's dot cursorrules files are often described as pseudo-specs, and they do serve a similar function at first glance. But they have no automated validation, no spec lifecycle management. They're basically persistent system prompts. The constitution in Spec Kit is something more structural. IBM has even published a fork for infrastructure-as-code workflows, which tells you the concept is landing in enterprise contexts.
The extension ecosystem for Spec Kit alone is fifty-plus extensions. Jira integration, Azure DevOps sync, CI/CD gates, security review. For a tool that's only been around a few months, that's a remarkable amount of community investment.
BMAD-METHOD is the other big one. Forty-four thousand five hundred stars, five thousand three hundred forks, version six point three released three days ago. The name stands for Breakthrough Method for Agile AI-Driven Development, which I'll admit sounds like a motivational poster, but the actual system is genuinely sophisticated.
Twelve specialized agent personas. Mary the Business Analyst, Preston the Product Manager, Winston the Architect. There's something both impressive and slightly unsettling about that level of role decomposition.
What's smart about it is the scale-adaptive approach. Quick Flow for bug fixes and small tasks, Enterprise Flow for full platform development. That addresses one of the real criticisms of spec-driven tooling, which is that it's overkill for most day-to-day coding. You don't need sixteen acceptance criteria to fix a null pointer exception.
Which is exactly the criticism Birgitta Böckeler raised on Martin Fowler's site last October. She looked at Kiro specifically and found a bug fix that generated four user stories with sixteen acceptance criteria. And her broader point was that Spec Kit created so many intermediate markdown files it became tedious to review. Quote: "I'd rather review code than all these markdown files."
That's a real tension. The value of the spec is clarity, but if the tooling generates so much scaffolding that the signal is buried in the noise, you've traded one kind of overhead for another.
She also coined a German word for it. Verschlimmbesserung. Making something worse in the attempt to make it better.
Which is a word that should probably be on a t-shirt at every developer conference.
Alongside "move fast and break specs."
But here's where I think the living spec versus static spec distinction becomes really important. Kiro and GitHub Spec Kit are static. You write the spec upfront, and it doesn't update as implementation proceeds. For a complex multi-service project, that spec can start drifting from reality within hours of the agent starting work.
Augment Code's Intent platform is the counter-approach. They describe it as a living spec, one that updates bidirectionally as agents implement changes. They have a coordinator agent spawning specialist agents in parallel, including an Investigate agent, an Implement agent, a Verify agent, a Critique agent, a Debug agent, and a Code Review agent. And their context engine maintains semantic understanding across four hundred thousand or more files.
The pricing is sixty dollars a month for Standard, two hundred for Max. Which is a significant commitment, but if you're running complex multi-agent workflows on large codebases, the alternative is probably more expensive in engineering time.
OpenSpec takes a different angle. Twenty-eight thousand four hundred stars, and it's explicitly brownfield-first. It uses delta markers, added, modified, removed, to track changes against existing functionality. And it enforces a strict three-phase state machine: proposal, apply, archive. You can't generate code until a spec has been explicitly approved.
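A change proposal in that delta style might look something like this. This is a hypothetical sketch of the format, not copied from OpenSpec's docs:

```markdown
## ADDED Requirements
### Requirement: Photo sharing
Users SHALL be able to share photos with their followers.

## MODIFIED Requirements
### Requirement: Storage quota
Per-user quota raised from 1 GB to 5 GB.

## REMOVED Requirements
### Requirement: Legacy thumbnail endpoint
```

The point of the deltas is that you review only what's changing against the existing spec, not the whole document from scratch.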
That approval gate is interesting because it's a deliberate friction point. The whole premise is that you want humans in the loop before implementation begins, not after. Which is the opposite of the "just ship it and see what happens" culture a lot of teams have drifted into with AI tools.
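The gate itself is just a small state machine. Here's a minimal Python sketch of a proposal-apply-archive lifecycle with an explicit human approval step. This is our own illustration of the idea, not OpenSpec's actual implementation:

```python
class SpecChange:
    """Minimal sketch of a proposal -> approved -> applied -> archived lifecycle."""

    # Legal transitions: implementation is blocked until a human approves.
    TRANSITIONS = {
        "proposal": {"approved"},
        "approved": {"applied"},
        "applied": {"archived"},
    }

    def __init__(self, name):
        self.name = name
        self.state = "proposal"

    def _move(self, new_state):
        if new_state not in self.TRANSITIONS.get(self.state, set()):
            raise RuntimeError(f"cannot go {self.state} -> {new_state}")
        self.state = new_state

    def approve(self):  # human sign-off on the spec
        self._move("approved")

    def apply(self):    # only now may an agent generate code
        self._move("applied")

    def archive(self):  # change is folded into the long-lived spec
        self._move("archived")


change = SpecChange("add-photo-sharing")
try:
    change.apply()      # skipping approval is rejected
except RuntimeError as err:
    print(err)          # cannot go proposal -> applied
change.approve()
change.apply()
change.archive()
print(change.state)     # archived
```

The friction is the feature: there is simply no legal path from "proposal" to "applied" that doesn't pass through a human approval.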
Let's talk about the productivity question, because this is where the honest answer gets complicated. The METR study from early last year found that AI tools made experienced open-source developers nineteen percent slower. That number was widely cited and widely argued about.
The February update from METR is more nuanced and honestly more interesting. Their late twenty twenty-five follow-up estimated developers were about eighteen percent faster. But the really telling finding is that thirty to fifty percent of developers are now choosing not to submit tasks to the study because they don't want to do them without AI. One developer said, "I avoid issues where AI can finish things in two hours but I'd have to spend twenty hours."
Which means the selection effects in the study are enormous. The tasks people are willing to do without AI are a self-selected subset of easier or more interesting tasks. The true productivity delta is probably much larger than the measured one.
Augment Code's argument is that the early twenty twenty-five slowdown was largely because unstructured prompts created debugging loops that consumed the time saved on generation. You save two hours writing code but spend four hours debugging the output because the agent was guessing at your intent the whole time.
And the spec-driven case is that if you eliminate the ambiguity upfront, you eliminate the debugging loops. The arXiv paper cites a controlled study showing human-refined specs improve LLM-generated code quality with error reductions of up to fifty percent. There's also a financial services case study showing a seventy-five percent reduction in integration cycle time after adopting an API-first spec-driven approach.
Four percent of all GitHub commits are now authored by Claude Code, by the way. That's the current figure. That's not a niche workflow anymore.
So now let's get to the "is this just BDD with branding" question, because that's the most pointed version of the skeptic's argument. Bryan Finster, a DevOps practitioner, said exactly that. And the arXiv paper actually addresses it head-on.
The paper's position is that SDD is an evolution, not a revolution. The core insight, write specs first and let code derive from them, has been agile wisdom for decades. TDD to BDD to SDD is a natural lineage. What's genuinely new is three things. Better tooling that makes executable specs practical. CI/CD maturity that enables automated enforcement. And AI as a consumer where spec quality directly determines output quality. That third point is the one that changes the stakes.
When your consumer is a human developer, a slightly ambiguous spec just leads to a clarifying conversation. When your consumer is an autonomous agent running for six hours on a cloud VM, that ambiguity materializes as a concrete artifact you now have to debug.
The Thoughtworks Technology Radar from November twenty twenty-five placed SDD in the Assess category, which is their language for "worth exploring to understand how it affects your enterprise." But they also raised the most sobering parallel. Model-Driven Development from the early two thousands. Same promise: define the model, generate the code, humans only edit the high-level artifact.
And MDD largely failed for business applications.
It did. Thoughtworks' note was: "We may be relearning a bitter lesson that handcrafting detailed rules for AI ultimately doesn't scale." The Böckeler analysis draws the same parallel explicitly: MDD sat at an awkward abstraction level and created too much overhead and too many constraints.
So what's different this time? Because there has to be an answer to that or the whole movement is doomed to repeat the same failure.
I think there are two meaningful differences. First, natural language specs plus LLMs eliminate the need for a custom parseable language. MDD required you to learn a domain-specific language, which was its own kind of overhead and created lock-in. Natural language specs don't have that barrier. Second, the non-determinism of LLMs is actually a feature here, not a bug. MDD failed partly because the generated code was too rigid. LLMs can navigate ambiguity in implementation while still being constrained by a clear behavioral spec.
Although non-determinism cuts both ways. If the spec is the only source of truth and the code is entirely generated, you need very high confidence that the generation is correct. And right now, that confidence isn't there for most production contexts.
Which is why spec-anchored, the middle tier, is probably where most teams should be operating. Not spec-first as a throwaway, not spec-as-source as the radical endpoint, but spec-anchored where the spec is maintained alongside code and tests enforce alignment.
Let's talk about Kiro's agent hooks, because I think this is an underappreciated part of the picture. It's Amazon's agentic IDE built on VS Code, launched July last year, and the hooks concept is different from the upfront planning workflow.
Hooks are event-driven automations triggered on file save or create. You save a React component and the agent automatically updates the test file. You modify API endpoints and the agent refreshes the README. It's spec-driven development at the micro level rather than the project level. Continuous, ambient enforcement rather than upfront planning.
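Kiro stores hooks as small JSON files under `.kiro/hooks`. A hook that keeps tests in sync with a component might look roughly like this; the field names reflect our reading of Kiro's docs and may differ by version:

```json
{
  "enabled": true,
  "name": "Sync component tests",
  "description": "Update the matching test file when a React component changes",
  "version": "1",
  "when": {
    "type": "fileEdited",
    "patterns": ["src/components/**/*.tsx"]
  },
  "then": {
    "type": "askAgent",
    "prompt": "A component changed. Update its test file so the tests still cover the component's current behavior."
  }
}
```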
That feels like the more practical near-term adoption path for teams that aren't ready to restructure their entire development workflow around a constitution and a five-phase pipeline.
The limitation with Kiro right now is that it only supports Claude models and the specs are static, they don't update during implementation. For small to medium projects that's probably fine. For complex multi-service architectures it starts to break down.
The "self-spec" pattern is interesting too. The idea that the LLM authors its own specification before generating code. You give it a high-level prompt, it produces a spec, humans review and refine the spec, then the same or another agent implements against it. That creates an explicit separation between planning and execution that most current workflows collapse together.
And it surfaces the assumptions the model is making before they get baked into code. If the agent's spec for "add photo sharing" says it'll use S3 for storage and you actually need on-premise storage, you catch that before implementation, not after six hours of cloud agent runtime.
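In code, the self-spec pattern is just two model calls with a human gate between them. Here's a minimal Python sketch, with a stubbed `call_model` standing in for whatever LLM client you actually use; everything here, including the canned responses, is illustrative:

```python
def call_model(prompt):
    """Stub for an LLM call; swap in your real client here."""
    if prompt.startswith("SPEC:"):
        return "Given a signed-in user, when they upload a photo, then it is stored and shared."
    return "def upload_photo(user, file): ..."  # canned 'implementation'


def self_spec_workflow(intent, review):
    # Phase 1: the model drafts its own spec from the high-level intent.
    draft_spec = call_model("SPEC: " + intent)

    # Phase 2: a human reviews and refines the spec BEFORE any code exists.
    approved_spec = review(draft_spec)
    if approved_spec is None:
        return None  # reviewer rejected the plan; no code is generated

    # Phase 3: the same or another agent implements against the approved spec.
    return call_model("IMPLEMENT: " + approved_spec)


# A reviewer that tightens the storage assumption before approving:
code = self_spec_workflow(
    "add photo sharing to my app",
    review=lambda spec: spec + " Photos are stored on-premise, not in S3.",
)
print(code is not None)  # True
```

The structural point is that the review callback sits between planning and execution, which is exactly the separation most current prompt-and-go workflows collapse.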
So if you're a developer or a team trying to figure out where to actually start with this, what does the practical guidance look like?
I'd start with the constitution or memory bank concept regardless of which tooling you adopt. Getting the immutable project principles written down, the tech stack, the security model, the architectural constraints, is valuable independent of anything else. That document alone eliminates a huge class of agent guessing.
And it's reusable across tools. If you write a solid constitution today and switch from Spec Kit to BMAD-Method next month, the constitution travels with you.
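What actually goes in a constitution? A hypothetical minimal example; the section names are ours, and Spec Kit's own template differs in the details:

```markdown
# Project Constitution

## Tech stack
- TypeScript, Node 20, PostgreSQL. No new runtime dependencies without review.

## Security
- All endpoints require authentication; secrets come from the vault, never from env files.

## Architecture
- Services communicate over the message bus only; no shared database access.

## Testing
- Every behavioral change ships with a failing-then-passing test.
```

Even a page this short removes a whole class of agent guessing: the agent never has to infer your stack, your secrets policy, or your service boundaries.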
Second, use Given-When-Then or EARS notation for acceptance criteria on any non-trivial feature. Not for bug fixes, not for small refactors. But for anything that's going to run as an autonomous agent task for more than a few minutes, having explicit behavioral requirements is the difference between a useful output and a debugging session.
Third, match the rigor level to the task. This is where a lot of teams get it wrong. BMAD-Method's scale-adaptive approach is smart precisely because it doesn't force enterprise flow on a bug fix. Sixteen acceptance criteria for a null pointer exception is bureaucracy, not engineering.
Fourth, think seriously about the static versus living spec question before you commit to tooling. If you're working on a project with many services, many contributors, and fast-moving requirements, a static spec will drift within hours. Augment Code's living spec approach or something like OpenSpec's explicit approval gates may be worth the additional infrastructure cost.
And fifth, the Tessl point about context quality is worth sitting with. Their data shows that well-structured context can drive a three point three times improvement in agent use of libraries. The Cisco security skill example went from a forty-seven percent baseline to eighty-four percent agent success rate. HashiCorp's Terraform stacks skill went from forty-seven to ninety-six percent. That's not a marginal improvement, that's the difference between a tool that works and one that doesn't.
The broader implication there is that spec quality and context quality are converging. The spec tells the agent what to build. The context tells the agent what it has to work with. Getting both right is the actual discipline.
There's a version of this conversation where we end up concluding that spec-driven development is just good engineering practice that's always been true and is now being rediscovered under a new name. And honestly, I think that's partially right. But the scale argument is real.
When your agent is running autonomously for hours and the cost of a wrong direction is hours of wasted compute, the stakes of getting the spec right are qualitatively different from the stakes when you're working synchronously with a human developer. The discipline was always valuable. Now it's necessary.
The eighty-seven thousand stars on Spec Kit in a matter of months tells you something about developer appetite for structure. That's not vendor marketing driving those numbers. That's developers who've experienced the debugging loop that comes from loose prompts and are actively looking for a better workflow.
And BMAD-Method's forty-four thousand stars with five thousand forks and a hundred and thirty-five contributors. OpenSpec at twenty-eight thousand four hundred. This is a genuine grassroots movement. The community is building this infrastructure because they need it, not because a product team told them to.
The question I keep coming back to is the target user question. Because Böckeler raised it sharply. Spec-driven tools include product-level concepts. User stories, product requirements documents, acceptance criteria. Are these tools for developers? For product managers? For some developer-PM hybrid that doesn't really exist yet at most companies?
That's an honest tension. If SDD requires developers to do requirements analysis, you're asking them to develop a skill set that's traditionally sat with a different function. BMAD-Method's persona system, with separate agents for Business Analyst, Product Manager, and Architect roles, is partly an attempt to address this. The developer writes the high-level intent and the agent personas do the requirements decomposition. But someone still has to review the output.
Someone still has to know if the acceptance criteria are actually correct.
Which brings you back to the false confidence pitfall. A passing spec test only guarantees code matches the spec. If the spec is wrong, the code faithfully implements the wrong thing. The discipline of writing good specs is not something tooling can fully automate. It requires domain knowledge and judgment that still lives with humans.
That's probably the most important thing to say about all of this. The tools are genuinely useful and the ecosystem is maturing fast. But the underlying skill of translating intent into unambiguous, testable, behavior-focused requirements is a human skill that becomes more valuable as agents become more powerful, not less.
The agents are getting better at implementation. The bottleneck is shifting to specification. Which means the highest-leverage investment a developer can make right now is getting good at writing specs, not at writing code.
Alright. Future implications. The spec-as-source world, if it arrives, probably looks less like developers writing code and more like developers writing and reviewing specifications, with agents generating and verifying the implementation layer continuously. The Tessl vision of code marked "generated from spec, do not edit" is still a few reliability jumps away from being the default, but the direction is clear.
And the living spec versus static spec debate is going to intensify as projects get more complex. Right now most teams don't have the infrastructure for living specs. But as the tooling matures, the cost of that infrastructure will drop. The teams that figure out bidirectional spec-code synchronization early will have a meaningful advantage on large codebases.
The open question I'd leave listeners with is this: if the spec is the primary artifact and code is the implementation detail, what does version control look like in five years? Do you version the spec and let the code be regenerated from it? That's a genuinely unsettled question with interesting implications for how we think about code ownership, audit trails, and debugging.
That's a good one to sit with.
Thanks as always to our producer Hilbert Flumingtop for keeping this whole operation running. Big thanks to Modal for providing the GPU credits that power this show. If you're enjoying My Weird Prompts, a quick review on your podcast app helps us reach new listeners more than you'd think. This has been My Weird Prompts. We'll see you on the next one.