Here’s what Daniel sent us — he’s been thinking about how stack definition and selection change in the age of Claude Code and agentic AI. His core idea is that GitHub is this incredible discovery layer now. You have a specific need, like background noise removal in an audio pipeline, and you can find dozens of implementations. Before agentic code tools, a lot of these projects just sat there — great code, but never quite mature enough for mainstream adoption because they lacked polished docs or tutorials. Now, he curates five or ten promising repos, feeds them to Claude, and asks whether they fit his architecture. The agent reads the source directly and gives a real evaluation. He’s also wrestling with how to document those decisions — what you chose, why, what you rejected — so that a year later, when the picture changes, you’re not starting from scratch. His question for us is how to make that documentation airtight for both human developers and code generation agents.
Oh, this is exactly the conversation I’ve been wanting to have. And by the way, quick note — today’s script is being generated by DeepSeek V four Pro.
So where do we even start with this? Because what Daniel’s describing isn’t just a neat productivity hack. It’s a fundamental shift in how we assemble software stacks.
It really is. Think about what stack selection looked like even three years ago. You’d have a quarterly architecture review, someone would spend two weeks evaluating libraries, reading documentation, maybe building a small prototype. The decision was high-stakes because you were locking in for months or years. Now, with what Claude Code can do — indexing entire repositories, parsing dependency trees, checking license compatibility, assessing whether an API surface aligns with your existing architecture — that evaluation cycle collapses from weeks to seconds.
The stakes change with it, right? If evaluation is fast and cheap, stack selection stops being this quarterly high-anxiety ritual and becomes something you can revisit whenever the landscape shifts.
That’s the part I find genuinely exciting. GitHub hosts something like two hundred million repositories now, with roughly forty million public repos containing usable code. Before agentic tooling, the vast majority of those were effectively invisible. You’d only find them if they had strong SEO, good READMEs, an active community. But Claude Code doesn’t need a polished tutorial. It reads the source code directly. It can look at a repo with twelve stars and a sparse README and still assess whether the implementation is solid.
The discovery surface expands massively. But there’s a flip side to that, and I think this is where Daniel’s question about documentation gets interesting. If an agent can evaluate a library by reading its source, what’s stopping it from hallucinating compatibility? Or missing something critical that a human would catch by actually running the code?
That’s the right question to ask. There’s been recent work on architectural design decisions in AI agent harnesses that identifies exactly this failure mode. Agents are excellent at pattern matching across codebases, but they can be overconfident about integration fit. They might see a function signature that looks compatible and assume everything works, without catching subtle runtime behavior differences.
The agent’s evaluation is a starting point, not a final verdict.
And this is where Daniel’s instinct to curate five or ten candidates first, rather than just asking Claude to find libraries from scratch, is actually really smart. He’s bringing human judgment to the initial filtering — he knows his domain, he knows what looks promising — and then using the agent for the deep evaluation that would take him hours per repository.
I want to pull on a thread here. Daniel mentioned that before agentic code development, a lot of great projects just sat there, never mature enough for mainstream adoption. What’s the actual gap that these tools bridge?
I’d call it the maturity gap. A typical open source project needs a certain threshold of polish to be adoptable by humans — clear installation instructions, usage examples, API documentation, test coverage you can verify, some evidence of maintenance. That’s a high bar. Most repos never clear it. But an AI agent doesn’t need the tutorial. It can scan the source, trace the dependency graph, evaluate the test suite even if it’s sparse, and form a judgment about whether the core implementation is sound.
The agent is essentially reading the code the way a very fast, very thorough senior engineer would — skipping the marketing and going straight to the implementation.
And that changes which projects are viable candidates for your stack. You’re no longer limited to the top five results on a GitHub search sorted by stars. You can find that library with forty-seven stars that happens to solve your exact problem elegantly, because the agent can verify that it does.
Which brings us to the trust problem. If I’m going to build a production system around a library that an agent recommended, I need more than “the source code looks good.” What’s the verification step?
This is where what some people are calling stack probes comes in. The idea is you don’t just ask the agent to evaluate a library — you ask it to generate a proof-of-concept integration with your existing codebase using that library. A small, disposable experiment. Maybe fifty lines of code. Then you run it. If it compiles, if it passes basic smoke tests, if the output looks correct, you’ve got real evidence. And if it fails, you’ve only burned a few minutes. You can do this for all five or ten candidates in parallel — something completely impractical before. Claude Code can generate all ten in minutes, and you evaluate the results.
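To pin that down, here's a minimal sketch of what one of those probes could look like, assuming a hypothetical candidate library with a frame-based API over raw PCM. The package surface, the NoiseSuppressor class, and the numbers are all invented for illustration; a real probe would target the candidate's actual API and use a captured audio clip.

```typescript
// probe-candidate.ts: a disposable smoke test, not production code.
// The candidate's API is hypothetical; swap in the real library's import.
import { performance } from "node:perf_hooks";

// Assumed surface: frame-based processing of 16-bit PCM (hypothetical).
declare class NoiseSuppressor {
  constructor(opts: { sampleRate: number });
  process(frame: Int16Array): Int16Array;
}

const SAMPLE_RATE = 48_000;
const FRAME_SIZE = 480; // 10 ms frames at 48 kHz
const LATENCY_BUDGET_MS = 20; // from our architectural context

// White-noise stand-in for mic input; a real probe should use real audio.
function syntheticFrame(): Int16Array {
  const frame = new Int16Array(FRAME_SIZE);
  for (let i = 0; i < FRAME_SIZE; i++) {
    frame[i] = Math.floor((Math.random() * 2 - 1) * 32_767);
  }
  return frame;
}

const suppressor = new NoiseSuppressor({ sampleRate: SAMPLE_RATE });
const frames = Array.from({ length: 100 }, syntheticFrame);

const start = performance.now();
for (const frame of frames) {
  const out = suppressor.process(frame);
  if (out.length !== FRAME_SIZE) {
    throw new Error(`unexpected output length: ${out.length}`);
  }
}
const perFrameMs = (performance.now() - start) / frames.length;

console.log(`avg per-frame latency: ${perFrameMs.toFixed(2)} ms`);
console.log(perFrameMs <= LATENCY_BUDGET_MS
  ? "PASS: within budget, promote to deeper evaluation"
  : "FAIL: over budget, reject or investigate");
```

If it compiles against the real candidate and the numbers hold, that's evidence; if it fails, the failure itself goes into the record.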
I want to make sure we’re grounding this. Can you walk through a concrete example? Say I’m building a React Native audio pipeline and I need background noise removal. What does this actually look like?
Okay, so you start by searching GitHub for noise suppression libraries. You might find RNNoise, which is Mozilla’s recurrent neural network approach. You might find SpeexDSP, which has been around forever. You might find newer deep learning approaches like DeepFilterNet, plus open reimplementations inspired by commercial tools like Krisp. You curate maybe six candidates. Then you give Claude your architectural context — “I’m building a React Native app, the audio pipeline uses WebRTC, I need real-time noise suppression that runs on-device, latency budget is twenty milliseconds, the existing stack uses TypeScript with native modules for performance-critical paths.” Then you point it at each repository and ask specific questions. Does this library’s API fit our pipeline? What’s the dependency footprint? Are there license conflicts? Does it support the platforms we need? Can it meet our latency budget based on what the source reveals about its processing approach?
Claude can answer those from reading the source?
For the most part, yes. It can parse the public API surface, check the package manifest for dependencies and flag conflicts with your existing stack, read the license file, analyze the processing pipeline to estimate whether the approach is likely to meet a twenty-millisecond budget. Some of this is inference, not certainty — but it’s informed inference based on actually reading the implementation.
What about the things it gets wrong?
The most common failure mode is hallucinated compatibility — the agent assumes two things work together because the interfaces look compatible, but there’s a runtime behavior mismatch it can’t see from static analysis. Another is stale knowledge — if the agent leans on its training-data memory of a repository instead of re-reading the current source, it can miss a recent breaking change. And there’s a subtler risk: agents can develop what I’d call aesthetic preferences. They might favor a library with cleaner code over one with messier code but better runtime characteristics, because they’re pattern-matching on code quality rather than operational behavior.
That last one is interesting. The agent is optimizing for something, but it might not be what you’d optimize for.
And that’s why the stack probe — the actual integration experiment — matters so much. It catches the cases where static analysis isn’t enough. If the agent says “this library is a perfect fit” but the proof-of-concept integration fails at runtime, you’ve learned something important about the limits of the evaluation.
The workflow Daniel’s describing — curate, evaluate, probe, decide — has a built-in verification loop. It’s not blind trust in the agent’s judgment.
The agent is an accelerator, not an oracle. It does in seconds what would take you hours, but you still apply human judgment to the output.
Okay, so that’s the discovery and selection side. But Daniel’s prompt also raised the documentation question, and I think this is where things get really interesting. Once you’ve made a decision — we’re using RNNoise for noise suppression, here’s why, here’s what we rejected — how do you capture that in a way that’s useful six months later when you’re not the person maintaining it?
Or when the agent that’s generating code for you six months later needs to understand why you made that choice, so it doesn’t try to swap in a different library that looks better on the surface but fails for the same reasons you already discovered and rejected.
This is the part where I think a lot of teams fall down. They make the decision, they have the Slack thread about it, maybe there’s a design doc somewhere in Google Drive, and then everyone moves on. A year later, nobody remembers why they chose PostgreSQL over something else, or why they rejected a particular library, and someone either repeats the research or makes a change that breaks something subtle.
This is exactly the problem that Architecture Decision Records were designed to solve. Michael Nygard popularized this back in twenty eleven. The idea is beautifully simple: for every significant architectural decision, you write a short markdown file that captures the context, the decision itself, the alternatives you considered, and the consequences. You store these files in version control alongside the code. The standard template is five sections. Status — is this proposed, accepted, deprecated, superseded? Context — what problem were you solving? Decision — what did you choose and why? Alternatives considered — what else did you evaluate and why did you reject it? Consequences — what are the trade-offs, the risks, the things you’re accepting by making this choice?
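To make the template tangible, here's a hedged sketch of a filled-in ADR using the noise suppression example; every specific in it is illustrative rather than a real decision record:

```markdown
# ADR-0007: Use RNNoise for real-time noise suppression

## Status
Accepted (2025-06-12)

## Context
The React Native audio pipeline needs on-device noise suppression within a
20 ms per-frame latency budget on mid-range mobile hardware.

## Decision
Adopt RNNoise, integrated as a native module behind a TypeScript wrapper.

## Alternatives Considered
- SpeexDSP: rejected because it does not support the sampling rate we need.
- DeepFilterNet-style models: rejected because inference cost blew the
  latency budget on target devices during the probe.

## Consequences
- We own the native build for both mobile platforms.
- Revisit if a rejected alternative ships on-device inference under 20 ms.
```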
The key discipline is that you never edit an accepted ADR. If a decision changes, you create a new ADR that supersedes the old one. So you get this clean audit trail of how your thinking evolved over time.
And this is where it gets really powerful in the agentic era. That audit trail is exactly what an AI agent needs to understand your architectural constraints without rediscovering them. You can feed your ADRs into Claude Code as system context, and suddenly the agent knows not just what you chose, but what you considered and rejected and why.
Daniel's prompt actually raised two distinct things, and I think it's worth separating them. Stack definition is the question of what goes in your stack. Stack selection is how you choose between options. And in the agentic era, those two things are getting coupled in ways they weren't before.
In twenty twenty-three, stack definition was mostly a human exercise. You'd research libraries, read documentation, maybe do a spike, make a decision, and document it somewhere. The documentation was for future humans. Stack selection was the research phase, definition was the write-up phase. They were sequential. Now they're iterative and continuous, because the agent is a consumer of both. The stack definition — the document that says "we use RNNoise for noise suppression because X, Y, Z" — becomes active context for future code generation. And the selection process itself is accelerated by agents doing the evaluation. So you're not doing selection once a quarter and then documenting the outcome. You're constantly re-evaluating, because the cost of evaluation has dropped to near zero.
That's a shift from stack as architecture to stack as living system.
And it connects to something Daniel mentioned that I think is underappreciated. GitHub has become a component marketplace in a way that wasn't really true before. The long tail of those forty million public repos is enormous. But before agentic tools, that long tail was effectively invisible — because humans don't have time to evaluate the forty-seventh most-starred noise suppression library. An AI agent doesn't care about any of that. It reads the source. So the addressable market of usable libraries just expanded by orders of magnitude.
Which brings us to the documentation tension Daniel raised. Documentation is only valuable if someone reads it. If you write beautiful architectural decision records and nobody ever looks at them again, you've created overhead without leverage.
This is the part that actually excites me. Agentic AI creates a guaranteed consumer for those records. An agent never gets bored. It never skims. It reads every word of every ADR you feed it. That flips the return on investment calculation entirely. The documentation you write today becomes the context that prevents an agent from making a bad architectural decision six months from now.
The pitfall Daniel identified — no point documenting if no one reads it — gets addressed not by making humans better at reading docs, but by creating a new audience that reads everything by default.
That audience is increasingly the primary audience for certain types of documentation. When Claude Code is generating the implementation, the ADR isn't background reading — it's a constraint. It's saying "do not use library X, we already rejected it for reason Y." That's not passive. That's executable.
Let's put some teeth on that. If I'm actually running this evaluation workflow, what does Claude Code look at in each repository that I'd miss if I were skimming READMEs?
The first thing it does is map the dependency tree against your existing stack. Say you're on React Native with a specific version of the WebRTC library. Claude will check whether RNNoise pulls in anything that conflicts — a different version of a shared dependency, a native module that expects a different build toolchain. That's the kind of thing that would take a human an hour of digging through package files and build scripts. And it does that across six candidates simultaneously, in seconds. Then it looks at the API surface area. For a noise suppression library, the question is: does it expose a clean interface that matches our pipeline? Are we dealing with raw PCM buffers or encoded audio? Does it process in chunks or streams? Claude can answer that by reading the header files, the public method signatures, the type definitions. It's not guessing — it's parsing actual source.
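As a sketch of what "reading the API surface" means in practice: the agent is checking whether the candidate's signatures line up with what your pipeline produces. Both candidate signatures below are invented for illustration.

```typescript
// What our pipeline produces: 10 ms chunks of raw 16-bit PCM.
interface AudioFrame {
  pcm: Int16Array;   // raw samples, not encoded audio
  sampleRate: 48000; // fixed by the WebRTC capture settings
}

// Candidate A (hypothetical): frame-based, raw PCM in, raw PCM out.
// Shape-compatible with our pipeline; a thin adapter suffices.
declare function processFrame(frame: Int16Array): Int16Array;

// Candidate B (hypothetical): stream-based, wants encoded Opus packets.
// The signature alone reveals a mismatch: we'd have to encode, process,
// and decode inside a 20 ms budget. Likely a rejection before any probe runs.
declare function processStream(
  packets: AsyncIterable<Uint8Array>
): AsyncIterable<Uint8Array>;
```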
Here's what I keep coming back to. How do I trust that evaluation? The agent might correctly identify the API shape, but it could miss that the library has a memory leak under sustained load, or that the maintainer abandoned it six weeks ago.
This is where the maturity gap thing really matters. A lot of these repos have solid core implementations — the algorithm works, the code is sound — but the documentation is thin. A human evaluator looks at that and walks away because the onboarding cost seems too high. Claude doesn't need onboarding. It reads the implementation directly. But reading the implementation doesn't tell you about the memory leak. And that's exactly why the stack probe is non-negotiable. You don't stop at the evaluation. You ask Claude to generate a proof-of-concept integration — a small, self-contained module that wires the candidate library into your actual pipeline. You run it, you profile it, you see if it breaks.
The agent's evaluation is the filter. The probe is the verification.
The evaluation narrows six candidates to two. The probe tells you which one actually works. And even a failed probe is valuable, because it tells you something specific about why a library doesn't fit. That's information you capture in the ADR.
Which brings us to the minimum bar for a repository to even make it into the candidate list. What does a repo need to have for this workflow to work?
It needs a license file, because that's a hard gate. It needs a build system that's parseable — Claude can read CMake files, package.json manifests, Gradle scripts, whatever the ecosystem uses. It needs source code that's not minified or obfuscated. And ideally it has some indication of recent activity — commits within the last year, open issues that are actually getting responses. That last one is more for you than for Claude. The agent can't reliably judge community health.
The aesthetic preference problem you mentioned earlier — the agent favoring clean code over better runtime behavior. How do you guard against that?
You make the evaluation criteria explicit. Don't just say "evaluate this library." Say "evaluate this library against these five criteria: latency, dependency footprint, platform support, license compatibility, and integration complexity with our existing pipeline." When the criteria are spelled out, the agent is less likely to substitute its own. And this connects directly to the documentation side. When you write the ADR, those criteria become part of the record. Future you — or a future agent — can see not just what you chose, but what you were optimizing for. That's the difference between "we picked RNNoise" and "we picked RNNoise because on-device latency was the binding constraint and it was the only candidate that reliably stayed under twenty milliseconds."
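One lightweight way to do that, sketched here with invented thresholds, is to keep the criteria as a checked-in structure that both the evaluation prompt and the eventual ADR are built from:

```typescript
// evaluation-criteria.ts: pasted into the agent's prompt for every
// candidate, then copied into the ADR so the record shows what we
// were optimizing for. All thresholds here are illustrative.
export const noiseSuppressionCriteria = {
  latency: "per-frame processing under 20 ms on mid-range mobile hardware",
  dependencyFootprint: "no conflicts with our pinned WebRTC native module",
  platformSupport: "iOS and Android, buildable as a React Native module",
  license: "permissive (MIT/BSD/Apache-2.0); copyleft is a hard reject",
  integrationComplexity: "raw PCM frame interface, or a thin adapter at most",
} as const;
```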
Once you've made the choice, you need it to stick. The decision has to live somewhere that both humans and agents can find it. And this is where ADRs really shine as the documentation framework.
There's a standardized template now. The MADR project — Markdown Architectural Decision Records — is widely adopted on GitHub. It gives you a consistent structure: context, decision, alternatives considered, consequences. The discipline is that you never edit an accepted ADR. If things change, you write a new one that supersedes the old.
Which means you get an audit trail. But the question is whether that audit trail is actually useful, or whether it's just paperwork.
It becomes useful the moment an agent consumes it. Here's the concrete workflow I'd propose. Before you write any code for a new feature, you define the stack in a living document. It includes your ADRs, a dependency graph, and your integration patterns. You feed that entire document to Claude Code as system context at the start of every session. So the agent walks in already knowing the architectural constraints — and the reasoning behind them. That's the piece that's usually missing. Code shows what you chose. It doesn't show what you considered and rejected. Without that, an agent might suggest a library you already ruled out six months ago, and you're paying tokens to re-litigate a settled question.
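With Claude Code specifically, the natural mechanism is a CLAUDE.md file at the repo root, which the agent reads at the start of a session. A minimal sketch, with illustrative ADR numbers and paths:

```markdown
# CLAUDE.md (project context, read at session start)

## Architecture constraints
- Read `docs/decisions/` before proposing any new dependency.
- ADR-0007: noise suppression is RNNoise. Do not substitute SpeexDSP
  (rejected: sampling-rate support) or swap libraries without a new ADR.
- Current stack and integration patterns: see `STACK.md`.
```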
The documentation ROI problem Daniel raised. How do you make sure these ADRs don't go stale?
You can set up CI checks that flag when an ADR references a dependency version that's been superseded. Or when a rejected alternative has gained significant new features. Imagine a check that runs quarterly, scans your ADRs, and says: "ADR number twelve rejected SQLite because it lacked JSONB support. SQLite added JSONB in version three point forty five. This decision may be stale."
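Here's a hedged sketch of what that quarterly check might look like as a script. The version-claim comment convention is invented for illustration, and the version comparison is deliberately naive; a real check would use a semver library.

```typescript
// adr-staleness-check.ts: quarterly CI job, illustrative only.
// Invented convention: ADRs annotate version-sensitive claims as
//   <!-- version-claim: sqlite3 < 3.45 lacked JSONB support -->
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

const ADR_DIR = "docs/decisions";
const CLAIM_RE = /<!-- version-claim: (\S+) < (\S+) (.+?) -->/g;

// Stand-in for a real registry lookup (npm here; swap per ecosystem).
async function latestVersion(pkg: string): Promise<string> {
  const res = await fetch(`https://registry.npmjs.org/${pkg}/latest`);
  const body = (await res.json()) as { version: string };
  return body.version;
}

async function main(): Promise<void> {
  for (const file of readdirSync(ADR_DIR)) {
    const text = readFileSync(join(ADR_DIR, file), "utf8");
    for (const [, pkg, claimedBelow, note] of text.matchAll(CLAIM_RE)) {
      const latest = await latestVersion(pkg);
      // Naive numeric-aware comparison; use a semver library in practice.
      if (latest.localeCompare(claimedBelow, undefined, { numeric: true }) >= 0) {
        console.warn(
          `${file}: claim "${note}" assumed ${pkg} < ${claimedBelow}; ` +
            `latest is ${latest}. This decision may be stale.`
        );
      }
    }
  }
}

main().catch(console.error);
```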
That's the knock-on effect. The agent isn't just consuming the documentation — it's actively monitoring whether the documentation still holds.
There's a real case study here. A team maintained forty-seven ADRs over eighteen months. They fed the entire set to Claude Code and asked it to identify stale decisions. It found three. One of them was a database choice where the rejected alternative had since added the exact features that were the original deal-breakers. Catching that early saved them a migration that would have taken weeks.
That's the shift. Documentation stops being a passive record and becomes an active participant in architecture.
The format matters. A traditional wiki page is prose-heavy, unstructured, hard for an agent to parse reliably. An ADR in a docs slash decisions folder follows a consistent template. The agent knows exactly where to find the context, the decision, the alternatives. Structure is what makes it machine-consumable.
How detailed does an ADR need to be? What's the minimum viable documentation that's useful for both humans and agents?
Five bullet points per section is plenty. You don't need essays. For the alternatives section, the key is capturing why you rejected each option — not just listing names. "Rejected SpeexDSP because it doesn't support the sampling rate we need" is actionable. "Considered SpeexDSP" is not.
The minimum process?
Start with a single ADR template. Use it for one decision. See how it feels. Then add a stack manifest — one file that lists your current stack, key decisions, and integration patterns. Update it when decisions change. Set a quarterly calendar reminder to review ADRs with an agent. That's it. Three artifacts, one recurring review. You're not building a bureaucracy.
The lightweight part matters. The moment documentation feels like overhead, people stop doing it.
That's the thing — in the agentic era, it's not overhead. It's leverage. Every ADR you write is context you never have to explain again. To a human or an agent. And that leverage compounds. Every time you onboard a new developer or spin up a new agent session, that ADR corpus is doing work. It's not sitting in a wiki nobody visits.
Let's make this concrete. Someone's listening, they've got a stack they've never formally documented. What's the Monday morning version of this?
Step one: define the need. Be specific — not "audio processing," but "real-time background noise removal for a React Native WebRTC pipeline targeting mobile devices." Step two: curate five to ten candidates from GitHub. Step three: feed them to Claude Code with explicit evaluation criteria and your stack constraints, and let it generate a disposable prototype against the top candidate. Step four: document the decision as an ADR. Even if it's half a page. Date, context, decision, consequences, alternatives considered. Keep each section to a few bullets, like we said. Don't over-engineer it. The template is not the product — the reasoning is. And step five: add that ADR to your agent's system context for future sessions. So the next time Claude Code wakes up in your repo, it already knows why you're on RNNoise and not SpeexDSP.
There's one artifact that makes step five actually work, and that's a stack manifest. A single file — call it STACK dot md, call it whatever — that lists your current stack, your key architectural decisions, and your integration patterns. Your agent reads it at the start of every session. A living architectural index. And you update it when decisions change. Which brings us to the recurring review. Set a quarterly calendar reminder. Feed your ADR corpus to an agent and ask: "Which of these decisions are stale? What should we reconsider?" The agent can compare your documented constraints against the current state of the ecosystem — new library versions, new alternatives, deprecations you missed.
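A sketch of what that manifest might contain, with illustrative entries carrying forward this episode's examples:

```markdown
# STACK.md (living architectural index; agents read this at session start)

## Current stack
- React Native + TypeScript; native modules for performance-critical paths
- Audio: WebRTC capture -> RNNoise suppression (ADR-0007)

## Key decisions
- ADR-0007: RNNoise for noise suppression (latency was the binding constraint)
- ADR-0012: PostgreSQL over SQLite (JSONB support; flagged for re-review)

## Integration patterns
- Audio moves through the pipeline as 10 ms raw PCM frames, never encoded

Last reviewed: most recent quarterly ADR review.
```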
That's the piece that makes documentation self-correcting. Without the review cycle, you're just building a graveyard of past decisions.
With it, you've got a documentation system that actively maintains your architecture. It's not overhead. It's force multiplication.
Which brings me to the question I keep coming back to. All of this assumes the ADR format is what agents should be reading. But ADRs were designed for humans — they're prose documents that happen to be structured. As agents become the primary consumers, do we need something new? Something between natural language and a formal specification?
A machine-readable architectural spec. That's an open question. Right now we're using markdown files with a template, and it works because Claude can parse natural language well enough. But there's a ceiling. An ADR says "we chose PostgreSQL for JSONB support." An agent has to infer what that implies about your data model. A formal spec could declare constraints directly — "this system requires a relational database with native JSON querying and geospatial indexing" — and the agent could match against it programmatically. The tension is that formal specs are harder to write. Humans won't do it. So the sweet spot might be something we haven't quite figured out yet. A format that's still writeable by a tired developer on a Friday afternoon, but structured enough that an agent can validate decisions against it without hallucinating implications.
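Nobody has settled that format yet, but to make the idea tangible, here's a speculative sketch: constraints declared as data an agent could validate candidates against, rather than prose it has to interpret.

```typescript
// stack-spec.ts: a speculative middle ground between prose ADRs and a
// formal spec. Constraints are declared, not implied, so an agent can
// check a proposed dependency against them programmatically.
type Constraint =
  | { kind: "requires"; capability: string; rationale: string }
  | { kind: "forbids"; dependency: string; rationale: string; adr: string };

export const dataLayerSpec: Constraint[] = [
  {
    kind: "requires",
    capability: "relational database with native JSON querying",
    rationale: "document-shaped config queried alongside relational rows",
  },
  {
    kind: "requires",
    capability: "geospatial indexing",
    rationale: "proximity search on user locations",
  },
  {
    kind: "forbids",
    dependency: "sqlite3",
    rationale: "lacked JSONB support at decision time",
    adr: "ADR-0012", // the staleness check above may reopen this
  },
];
```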
Which leads to the bigger implication. You talked about quarterly reviews where an agent checks your ADRs for staleness. But what if that becomes continuous? Stack selection as a service — an agent that's always watching the GitHub ecosystem, comparing new releases against your documented decisions, and flagging opportunities.
"Library X you rejected in Q two twenty twenty-five just shipped the feature that was your deal-breaker. Here's a pull request with the integration." That's where this is heading. And it only works if your decisions are documented in a format the agent can reliably interpret.
The documentation stops being a record and starts being a sensor.
That's the shift. And I think that's the thing worth sitting with. We've talked about documentation as leverage — but it might actually be infrastructure.
Now, Hilbert's daily fun fact.
The average cumulus cloud weighs about one point one million pounds.
We'll leave you with that open question — what does the format look like when machines are the primary audience for architectural decisions? If you've got thoughts, we'd love to hear them. Thanks to Hilbert Flumingtop for producing. This has been My Weird Prompts. Find us at myweirdprompts dot com.