Daniel sent us one this week about something that's genuinely transformed how this show gets made. He's been iterating on the production pipeline — he does this every few weeks, it's a passion project, he fits it in when he can. The big change recently was swapping out Google Search and Tavily for Exa AI as the retrieval backend. Same branding code, same prompts, same language model, but the accuracy jumped dramatically. He's been spot-checking the results and they're consistently accurate in ways the previous tools just weren't. The question is: what makes agentic search tools like Exa work so well, and when they fail, why do they fail?
This is one of those cases where a single component swap reveals something fundamental about the whole system. Everyone obsesses over which language model they're using — is it Claude, is it GPT-four-o, is it Gemini — but the retrieval layer is doing more heavy lifting than most people realize. I've seen entire conference panels devoted to debating model selection, and nobody asks what search backend is feeding the thing. It's like arguing about which chef is better while ignoring where the ingredients came from.
Garbage in, garbage out, but with a neural network on top. Which, by the way, is worse than regular garbage in, garbage out. A neural network doesn't just pass along bad data — it amplifies it, smooths out the rough edges, makes it sound authoritative. You get garbage that went to finishing school.
Exactly the problem. If your retrieval returns a hallucinated fact, the best language model in the world will confidently elaborate on that hallucination. It'll add context, nuance, historical background — all of it built on a foundation of sand. So let's unpack what's actually happening under the hood. What makes Exa different from the search tools we were using before?
Before we get into the architecture, give me the landscape. What were we actually comparing here? Because I think a lot of people hear "search API" and imagine they're all basically the same thing with different branding.
Three very different approaches to the same problem. Google Search — and I mean the programmatic search API, not you typing into a browser — returns a ranked list of web pages based primarily on PageRank and dozens of other signals. It's optimized for popularity and authority, not factual accuracy. The core assumption is that if lots of people link to something, it's probably useful. That's a great assumption for "where's the nearest pizza place" but a terrible one for "what did the EU actually legislate last month." Tavily, which we used before Exa, takes a different approach: it searches, retrieves the top results, and then uses a language model to summarize them into a single coherent paragraph. It's a single-shot summarization layer on top of a search index. Exa does something fundamentally different. It uses what they call neural search — embedding queries and documents into a shared vector space, then retrieving based on semantic similarity rather than keyword matching. And critically, it doesn't just fire off one query and return results. It issues multiple sub-queries, evaluates what comes back, and refines its search based on what it finds.
It's not just a better search engine. It's a different category of thing. Like the difference between a library catalog and a research assistant who actually reads the books.
That's the key insight. To understand why Exa works so well, we need to look at three specific architectural decisions that set it apart from Google and Tavily. First, the embedding approach. Second, the agentic loop. Third, the structured metadata output.
Let's start with embeddings. What does neural search actually mean in practice? Because "vector space" sounds like something from a physics lecture.
Traditional search engines like Google use an inverted index. Imagine the index at the back of a book — you look up a word, it tells you which pages contain that word. Google does this at massive scale, with sophisticated ranking algorithms on top, but fundamentally it's matching keywords to documents. You search for "EU AI regulation May twenty twenty-six" and it looks for pages containing those terms. Neural search works differently. Exa takes your query and converts it into a vector — essentially a list of numbers that represents the semantic meaning of your question. It does the same thing with every document in its index. Then it finds documents whose meaning-vectors are closest to your query-vector. The magic is that this captures conceptual similarity, not just word overlap.
If I search for "latest EU rules on artificial intelligence," it might find a document titled "European Commission Updates AI Liability Framework" even though none of my exact words appear in that title.
And here's a concrete example that makes this tangible. Imagine you're searching for information about a specific car recall. A keyword search for "Toyota brake recall twenty twenty-five" will only return documents that contain some combination of those words. It'll miss a document titled "Automaker Announces Safety Campaign for Prius Braking System" because the word "recall" doesn't appear — even though "safety campaign" is the industry term for the exact same thing. Exa's embedding model understands that "safety campaign" and "recall" are semantically equivalent in this context. It maps them to nearby points in vector space and retrieves both.
It's not just matching strings, it's matching meaning. Which feels like the promise of AI search that we've been hearing about for years.
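A minimal sketch of the contrast described above, for readers following along in the transcript. It assumes a toy inverted index and hand-assigned three-dimensional vectors; the documents, vector values, and function names are all invented for illustration and say nothing about how Exa or Google actually represent text.

```python
import math
from collections import defaultdict

# Invented sample documents echoing the recall example in the conversation.
DOCS = {
    "doc_recall": "Toyota issues brake recall for 2025 models",
    "doc_campaign": "Automaker announces safety campaign for Prius braking system",
    "doc_pizza": "Best pizza places open late downtown",
}

# Hypothetical embeddings on made-up axes: [vehicle safety, brakes, food].
# A real embedding model would learn hundreds of dimensions from data.
TOY_VECTORS = {
    "doc_recall": [0.9, 0.8, 0.0],
    "doc_campaign": [0.8, 0.7, 0.0],
    "doc_pizza": [0.0, 0.0, 1.0],
}
QUERY_VECTOR = [0.9, 0.9, 0.0]  # stands in for "Toyota brake recall 2025"


def build_inverted_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def keyword_search(index, query):
    """Return documents that contain at least one exact query word."""
    hits = set()
    for word in query.lower().split():
        hits |= index.get(word, set())
    return hits


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(query_vec, vectors, threshold=0.8):
    """Return documents whose vector is close to the query vector."""
    return {d for d, v in vectors.items() if cosine(query_vec, v) >= threshold}


index = build_inverted_index(DOCS)
print(keyword_search(index, "Toyota brake recall 2025"))
print(semantic_search(QUERY_VECTOR, TOY_VECTORS))
```

The keyword lookup only finds the document that literally says "recall," while the vector comparison also surfaces the "safety campaign" document because its toy vector sits close to the query.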
Exa's embedding model is trained on a corpus of over a billion documents with a specific focus on news articles from the last five years. That temporal training bias is important — it means the model understands that "latest" and "recent" are temporal signals, not just generic adjectives. A general-purpose embedding model might treat "latest developments in quantum computing" the same as "developments in quantum computing." Exa's model knows to weight recency. It's been trained to recognize that "latest" isn't just flavor text — it's a constraint on the search.
Which explains why it's particularly good for the kind of prompts Daniel sends in — current events, regulatory changes, things where the date actually matters. If you're asking about something that happened last week, recency isn't optional, it's the whole point.
And this connects to the second architectural decision: the agentic loop. This is where Exa really diverges from both Google and Tavily. Google returns a list of links. Tavily returns a summarized paragraph. Exa acts more like a researcher who's been given an assignment and figures out how to break it down. It doesn't just execute a search — it plans a search strategy.
Walk me through a concrete example. Daniel sends in a prompt asking about the latest regulatory changes for AI in the European Union as of May twenty twenty-six. What does Exa actually do? Paint me the step-by-step.
It decomposes the query. Instead of searching for that entire sentence as one block, it might issue three or four separate sub-queries. One for "EU AI Act amendments twenty twenty-six." Another for "European Commission AI regulation May twenty twenty-six." A third for "AI liability directive updates twenty twenty-six." Each of those sub-queries gets its own neural search, returns its own set of results, and then Exa evaluates the relevance of what came back. If one sub-query returned mostly outdated content, it might re-query with a stronger recency filter. If another returned content from low-authority domains, it might re-query with a domain authority threshold. This iterative refinement is what makes it "agentic" — it's not a single lookup, it's a process.
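The decompose, evaluate, and re-query pattern described above might be sketched roughly as follows. This is an illustration under stated assumptions, not Exa's implementation: `fake_search` is a stand-in backend with invented documents, the "mostly stale" rule is a made-up heuristic, and the sub-queries are passed in by hand rather than generated.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Result:
    title: str
    published: date


def fake_search(query, after=None):
    """Stand-in backend with invented documents; it ignores the query text."""
    corpus = [
        Result("EU AI Act overview", date(2024, 3, 1)),
        Result("EU AI Act explainer", date(2025, 1, 10)),
        Result("Commission updates AI rules", date(2026, 5, 18)),
    ]
    if after is None:
        return corpus
    return [r for r in corpus if r.published >= after]


def mostly_stale(results, cutoff):
    """True when more than half of the results predate the cutoff."""
    stale = sum(1 for r in results if r.published < cutoff)
    return stale > len(results) / 2


def refine_search(sub_queries, search, cutoff, max_rounds=2):
    """Run each sub-query, retrying with a recency filter if results look stale."""
    seen = {}
    for query in sub_queries:
        results = search(query)
        rounds = 1
        while rounds < max_rounds and mostly_stale(results, cutoff):
            results = search(query, after=cutoff)  # re-query with a date filter
            rounds += 1
        for r in results:
            seen.setdefault(r.title, r)  # de-duplicate across sub-queries
    return list(seen.values())


sub_queries = [
    "EU AI Act amendments 2026",
    "AI liability directive updates 2026",
]
for result in refine_search(sub_queries, fake_search, cutoff=date(2026, 1, 1)):
    print(result.title, result.published)
```

The `max_rounds` cap is a deliberate choice in this sketch: a refinement loop with no upper bound can keep re-querying indefinitely when nothing fresh exists.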
It's essentially doing what a good human researcher does. You don't type one search into Google and call it done. You try a few different phrasings, you see what comes back, you adjust based on what you find, you drill down into the promising leads and abandon the dead ends.
Except it's doing this in milliseconds, and the "adjustment" is algorithmic rather than intuitive. But the structure is similar. And this is where the contrast with Tavily becomes really stark. Tavily would have just grabbed the top article and summarized it.
Which might be from April twenty twenty-six, or might conflate proposed changes with enacted ones, or might be an opinion piece rather than a regulatory filing. And you'd never know because the summarization layer would smooth all that over into one neat paragraph.
And that's the double failure mode. Tavily's summarization layer introduces a second point of failure — the language model doing the summarizing can hallucinate even if the source article is accurate. So you've got the retrieval risk and the summarization risk stacked on top of each other. Exa returns raw content with structured metadata, leaving the summarization to Daniel's own pipeline, where he can control the prompts and the temperature settings and the grounding instructions.
Which brings us to the third architectural decision — the metadata. And this is the one that I think gets overlooked in most discussions of search tools, but it's actually what makes the whole system auditable.
This is the part that makes spot-checking possible. Exa doesn't just return text. It returns publication date, author, domain authority score, and direct quotes with source URLs. So Daniel's pipeline can include not just the retrieved information but also the provenance of that information. When the language model generates the final output, it's not just saying "the EU passed new regulations" — it can say "according to the European Commission's May eighteenth press release" and link to it.
That's verifiable. You can click through and confirm. It transforms the output from a claim into a claim with a receipt. Which, for a show that's about accurate information, is kind of the whole ballgame.
Which is impossible with Google's snippet-based results. Google might show you a snippet that says "the EU passed new AI regulations" but it's extracted from a page that also says "...are expected to be proposed next year" and Google's snippet algorithm cut off the crucial context. I've seen this happen dozens of times. The snippet is technically a quote from the page, but it's been surgically extracted to mean the opposite of what the page actually says. Exa's structured approach preserves that context because it returns the full content, not a decontextualized fragment.
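To make the "claim with a receipt" idea concrete, here is a small sketch of carrying provenance through to the generated sentence. The record fields mirror the kinds of metadata mentioned in the conversation, but the class name, field names, and the example URL are assumptions made for illustration, not the shape of any real API response.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RetrievedSource:
    # Hypothetical field names; a real backend will use its own schema.
    url: str
    publisher: str
    published: date
    quote: str


def attribute(claim, source):
    """Attach publisher, date, and URL to a claim so a reader can check it."""
    return (
        f"{claim} (according to {source.publisher}, "
        f"{source.published.isoformat()}, {source.url})"
    )


source = RetrievedSource(
    url="https://example.org/press-release",  # placeholder URL
    publisher="the European Commission",
    published=date(2026, 5, 18),
    quote="Placeholder quote text from the source page.",
)
print(attribute("The EU updated its AI rules", source))
```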
We've got three things working together. Semantic embeddings that understand meaning, not just keywords. An agentic loop that decomposes and refines queries. Structured metadata that preserves provenance. It's a system designed for grounding, not for general web search. It's not trying to be everything to everyone.
That's the point most coverage misses. Google Search processes over eight point five billion queries a day, but its index is optimized for general web search — finding the most popular, most linked-to page about a topic. Exa is optimized for a different use case entirely: retrieving accurate, verifiable information that an AI system can ground itself in. Those are fundamentally different objectives. It's like comparing a Swiss Army knife to a scalpel. They're both cutting tools, but you don't want your surgeon using the Swiss Army knife.
Of course they are. One is "show me what everyone's clicking on," the other is "show me what's actually true." And those only overlap by accident, not by design. Popularity and accuracy are correlated in some domains and completely uncorrelated in others.
Think about medical information. The most popular page about a specific symptom might be a WebMD article that's been SEO-optimized to death and hasn't been updated in three years. The most accurate page might be a recent NIH publication with almost no backlinks. Google's algorithm will favor the former. Exa's architecture — with its recency bias and authority weighting — is designed to surface the latter.
Exa isn't perfect. In fact, some of the same mechanisms that make it so effective also create specific failure modes. And I think this is where the conversation gets really interesting, because understanding the failure modes is how you learn to use the tool well.
Let's go through the major ones. Failure mode one: temporal ambiguity. Exa's embedding model prioritizes recency, which is great for current events queries. But if you ask a question that has a stable, timeless answer, the recency bias can work against you. Ask "what is the capital of France" and Exa might prioritize a twenty twenty-six news article that mentions Paris in passing over a static knowledge source that definitively states the answer. The query doesn't signal recency, but the model's training bias toward recent content means it might still weight newer documents higher.
It's solving for a problem you didn't have. You wanted the most authoritative answer, not the most recent one. It's like asking a librarian for a book on French geography and having them hand you yesterday's newspaper because it happens to mention Lyon.
This gets worse with queries that are ambiguous about their temporal scope. "What are the leading causes of climate change" — is that asking for the established scientific consensus, which hasn't changed much in decades, or the latest research on attribution science, which is evolving rapidly? Exa has to guess, and it'll default to recency. If what you actually wanted was the long-established consensus, you might get a very skewed picture built from the last six months of research papers, which will emphasize the cutting-edge debates rather than the settled science.
What's failure mode two?
Domain authority bias. Exa's structured metadata includes domain authority scores, which help filter out low-quality sources. But domain authority can be gamed, and it can also be misleading. A well-established news outlet with high domain authority might publish an inaccurate or biased article about a politically charged topic. A niche academic blog with low domain authority might have the most rigorous analysis. Exa's authority weighting would favor the news outlet.
Daniel mentioned he caught this once with a politically charged topic during his spot-checking. He didn't give me the specifics, but he said the top result was from a major outlet and it was just... But the domain authority score was sky-high, so Exa treated it as gospel.
It's a hard problem because domain authority is a useful heuristic most of the time. The BBC is generally more reliable than some random WordPress blog. You'd be foolish not to weight authority at all. But the edge cases matter, especially for the kind of factual grounding this podcast depends on. And the edge cases tend to cluster around exactly the topics where accuracy matters most — politically contested issues, emerging science, regulatory changes where different stakeholders are spinning the story.
Failure mode three?
The echo chamber problem, and this is the most interesting one from an architectural perspective. Because Exa iterates on its own results, it can reinforce its own initial errors. The first sub-query returns a result that's slightly off-topic but semantically adjacent to what you actually want. The second sub-query is informed by that first result, so it drifts further. The third sub-query builds on the second. By the end of the agentic loop, you've wandered into a completely different topic area and you don't even realize it because each step seemed reasonable in isolation.
Like a game of semantic telephone. The first person whispers "EU AI regulations" and by the fifth person it's "European technology policy framework" and nobody noticed the drift because each step was only a small change.
That's exactly what it is. And this is where the "agentic" label can be misleading. People hear "agentic search" and imagine the AI is thinking like a human researcher, exercising judgment, catching its own mistakes. It's not. It's a deterministic process of iterative query refinement. It can amplify errors as easily as it can correct them. There's no genuine reasoning happening; it's pattern matching at each step. The system doesn't know it's drifting. It has no meta-cognition, no ability to step back and say "wait, am I still answering the original question?"
Which is an important corrective to the hype. Agentic doesn't mean intelligent. It means it takes multiple steps. That's it. The intelligence is in the design of the loop, not in some emergent reasoning capability. And I think that distinction gets lost in a lot of the marketing around these tools.
And if the loop is designed with a slight bias toward recency, or authority, or certain query decompositions, that bias gets amplified with each iteration. It's not self-correcting; it's self-reinforcing.
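One way a pipeline could guard against the "semantic telephone" drift described above is to compare every refined sub-query against the original question rather than against the previous step. The sketch below assumes word-overlap (Jaccard) similarity as a crude stand-in for embedding similarity, and the threshold and query chain are invented for illustration.

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings, from 0.0 to 1.0."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def first_drifted_step(original, sub_queries, min_similarity=0.2):
    """Index of the first sub-query that strays too far from the original.

    Each step is compared to the ORIGINAL query, so a chain of small
    changes cannot accumulate unnoticed. Returns None if nothing drifts.
    """
    for i, query in enumerate(sub_queries):
        if jaccard(original, query) < min_similarity:
            return i
    return None


ORIGINAL = "EU AI regulations"
STEPS = [
    "EU AI regulations 2026",
    "EU AI policy updates",
    "European AI policy framework",
    "European technology policy framework",
]
print(first_drifted_step(ORIGINAL, STEPS))
```

Each step in `STEPS` shares words with the one before it, which is why a step-to-step check would wave them all through; anchoring the comparison to the original query is what exposes the drift.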
What's the fourth failure mode?
Vague queries. Exa's agentic loop works best with well-formed, specific queries. If you send in something vague like "tell me about AI," the system doesn't know which sub-queries to prioritize. Should it search for AI history, AI applications, AI ethics, AI regulation, AI research breakthroughs? It has to guess, and the guesses compound. Daniel's prompts are typically very specific, like "latest regulatory changes for AI in the EU as of May twenty twenty-six," which is why he's seeing such a high success rate. The system has clear signals to work with. It knows this is about regulation, geography, and recency. It can decompose confidently.
The quality of the output is downstream of the quality of the query. Garbage in, garbage out, but with a neural network on top — we're back to that. But it's worse than that, right? Because with a vague query, you're not just getting bad results, you're getting confidently bad results that have been through multiple rounds of refinement.
I'd add: the quality of the query is partly about specificity, but it's also about knowing what the tool is optimized for. Exa is optimized for temporal relevance and factual grounding. If you're asking a question that benefits from those optimizations, it'll shine. If you're asking a question that requires different optimizations — broad conceptual exploration, historical context, serendipitous discovery — you might be better off with a different tool. It's not that Exa is bad at those things, it's that it wasn't designed for them.
Let's do a direct comparison. Same query across the three tools. Something like "recent breakthroughs in quantum computing." I want to see how each one handles it and where they each stumble.
Google Search would return a mix. Probably a Wikipedia page on quantum computing — high authority, but possibly outdated. A few news articles from major outlets about recent developments. Some press releases from companies announcing their latest achievements. The ranking would be driven by links and engagement metrics, not factual accuracy. The Wikipedia page might be the top result even though it doesn't mention the most recent breakthrough, simply because it has the most inbound links and the highest overall authority score. You'd get breadth but not necessarily relevance.
Tavily would take the top results from its own search, feed them to a language model, and produce a summary paragraph. The problem is that the summary might blend information from multiple sources in ways that lose nuance. A twenty twenty-three article about one lab's achievement and a twenty twenty-six press release about a different breakthrough might get merged into a single narrative that implies they're connected when they're not. Or the summarization model might hallucinate a detail that wasn't in any of the source documents. I've seen summarization models insert phrases like "researchers announced" when the source material was a speculative preprint, or "the breakthrough confirms" when the source said "suggests." Those are small word changes with massive implications for accuracy.
Two layers of potential error. The retrieval layer might return the wrong sources, and the summarization layer might misrepresent them. And Daniel's experience reflects this — when the pipeline was using Tavily, the same prompts were producing less accurate results. It wasn't that Tavily is a bad product, it's that the architecture introduces failure modes that are hard to control for.
What would Exa do with the quantum computing query?
Exa would decompose it into sub-queries, probably something like "quantum computing breakthrough twenty twenty-six," "quantum supremacy milestone recent," "quantum error correction advance." It would retrieve results based on semantic similarity, not keyword matching, so it might find a peer-reviewed paper that doesn't use the word "breakthrough" but describes a significant advance. It would return structured metadata with each result, including publication dates and domain authority scores. And it would prioritize recency because the query includes "recent."
Here's the catch — what if the most significant recent breakthrough was published on a lower-authority domain? A university lab's press release, or a preprint server like arXiv? The actual breakthrough might be sitting on a dot-edu domain with a modest authority score while Nature dot com has a less significant but more polished article from the same time period.
That's exactly the case study Daniel's experience hints at. Exa might return a twenty twenty-three Nature paper about a specific lab's achievement because Nature has extremely high domain authority, while missing a more significant twenty twenty-six breakthrough published on a university's own website. The recency signal and the authority signal are in tension, and Exa has to resolve that tension algorithmically. Sometimes it gets the balance wrong. And the user might never know they missed the bigger story.
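The tension between the recency signal and the authority signal can be illustrated with a toy weighted score. Everything here is an assumption for the sake of the example: the linear recency decay, the four-year horizon, the authority numbers, and the weights are invented, and nothing in the conversation specifies how Exa actually combines these signals.

```python
from datetime import date


def recency_score(published, as_of, horizon_days=1460):
    """Linear decay from 1.0 (published today) to 0.0 (horizon_days old)."""
    age_days = (as_of - published).days
    return min(1.0, max(0.0, 1.0 - age_days / horizon_days))


def combined_score(doc, as_of, authority_weight):
    """Blend authority and recency; the two weights sum to 1.0."""
    recency_weight = 1.0 - authority_weight
    return (
        authority_weight * doc["authority"]
        + recency_weight * recency_score(doc["published"], as_of)
    )


def top_result(docs, as_of, authority_weight):
    """Id of the highest-scoring document under the given weighting."""
    best = max(docs, key=lambda d: combined_score(d, as_of, authority_weight))
    return best["id"]


# Invented documents: an older high-authority paper and a newer low-authority page.
DOCS = [
    {"id": "journal_2023", "authority": 0.95, "published": date(2023, 6, 1)},
    {"id": "university_2026", "authority": 0.40, "published": date(2026, 5, 1)},
]
AS_OF = date(2026, 6, 1)

print(top_result(DOCS, AS_OF, authority_weight=0.8))  # authority-heavy weighting
print(top_result(DOCS, AS_OF, authority_weight=0.3))  # recency-heavy weighting
```

With the authority-heavy weighting the 2023 journal paper ranks first, and with the recency-heavy weighting the 2026 university page does, which is the trade-off the hosts describe: the same two documents, ordered differently by a tuning choice the user never sees.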
What have we learned about when this works and when it fails? Let's synthesize.
It works best when three conditions are met. One, the query has clear temporal signals — "latest," "recent," "as of May twenty twenty-six." Two, the topic is well-covered by high-authority sources, so domain authority is a reliable quality signal. Three, the query is specific enough to decompose into well-scoped sub-queries. It fails when any of those conditions break down — when the query is temporally ambiguous, when the best sources are low-authority, or when the query is too vague to decompose effectively.
Which means there's a skill to using it well. It's not a magic box. And I think that's actually good news — it means the tool rewards expertise. The better you understand how it works, the better your results will be.
None of these tools are magic boxes. And I think that's the deeper lesson here. The AI community has a tendency to treat each new tool as a step toward general intelligence, when really what we're seeing is better engineering for specific use cases. Exa is brilliantly engineered for grounding AI-generated content in verifiable facts. That's a specific problem with specific constraints, and they've designed an architecture that matches those constraints. It's not AGI. It's excellent product design.
What does this mean for someone building a RAG pipeline, or even just trying to get better results from AI tools? Let's get practical. If someone's listening and they want to improve their own retrieval setup tomorrow, what should they do?
Four things you can do right now. First, craft your prompts with temporal qualifiers and domain preferences baked in. Don't just ask "what are the latest AI regulations" — ask "what AI regulations were enacted or proposed in the last thirty days, prioritizing government publications and peer-reviewed sources." Exa's agentic loop works best with constrained queries. Give it constraints. The more you tell it about what kind of sources you trust and what time window you care about, the better it can decompose the query.
You're not just describing what you want to know, you're giving the retrieval system the parameters it needs to do its job well. You're speaking its language. It's like the difference between telling a research assistant "find me stuff about AI" versus "find me every EU parliamentary document from the last quarter that mentions the AI Act, and prioritize primary sources over news coverage."
Second actionable insight: always spot-check. Exa's structured metadata makes this easy — it returns source URLs and direct quotes. Verify that the quote appears in the source and that the source actually supports the claim. This is the single most important quality control step in any RAG pipeline, and it's shockingly underutilized. Most people treat search results as ground truth. Daniel doesn't, and it's why he can say with confidence that Exa is returning accurate results. He's not trusting — he's verifying.
Trust but verify, applied to search APIs. And I'd add: spot-checking isn't just about catching errors. It's also about building intuition for when the tool is likely to fail. The more you spot-check, the better you get at predicting which queries will produce problems before you even run them.
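The first half of the spot-check described above, confirming that a returned quote really appears in the source, is easy to automate. This sketch assumes you already have the source page's text as a string; the function names and sample text are invented, and the matching is deliberately simple (whitespace and case only).

```python
import re


def normalize(text):
    """Collapse runs of whitespace and lowercase, so line breaks don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_appears(quote, source_text):
    """True if the quote occurs verbatim in the source, ignoring case and spacing."""
    return normalize(quote) in normalize(source_text)


# Invented source text for illustration.
SOURCE_TEXT = """New AI regulations
are expected to be proposed next year, the Commission said."""

print(quote_appears("are expected to be proposed next year", SOURCE_TEXT))
print(quote_appears("the EU passed new AI regulations", SOURCE_TEXT))
```

A passing check only shows that the words exist on the page. Whether the source actually supports the claim, as opposed to being quoted out of context, still needs a human read.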
Third: if you get inconsistent results for the same prompt, rephrase. The agentic loop is sensitive to initial query quality — a small change in wording can trigger different sub-queries and different results. "EU AI regulations May twenty twenty-six" might decompose differently than "latest European Union artificial intelligence rules." Experiment with phrasing until you find what works consistently. Treat query formulation as a skill you develop, not a one-shot guess.
Fourth: don't rely on a single tool. Exa is excellent for temporal, fact-dense queries. But for topics where authority matters more than recency (historical research, established scientific knowledge, legal precedents) you might want to supplement with a direct Wikipedia API call or a curated database. No single search tool is perfect for all use cases. The best RAG pipelines use multiple retrieval sources and reconcile the results. It's more engineering work upfront, but the accuracy gain is real and measurable.
Which is more work, but the accuracy gain is real. And this gets back to something you said earlier — people obsess over model selection but ignore the retrieval layer. Multiple retrieval sources with reconciliation is probably a bigger accuracy win than switching from GPT-four-o to Claude or vice versa.
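One simple form of the reconciliation mentioned above is to keep track of which retrieval sources returned each URL and treat agreement as a weak corroboration signal. The backend names and URLs below are placeholders, and matching on URL alone is an assumption; a real pipeline would likely also compare the claims themselves.

```python
from collections import defaultdict


def reconcile(results_by_source):
    """Map each URL to the set of retrieval sources that returned it."""
    seen = defaultdict(set)
    for source, urls in results_by_source.items():
        for url in urls:
            seen[url.rstrip("/")].add(source)  # treat trailing slash as equivalent
    return dict(seen)


def corroborated(results_by_source, min_sources=2):
    """URLs returned by at least min_sources independent backends, sorted."""
    merged = reconcile(results_by_source)
    return sorted(u for u, srcs in merged.items() if len(srcs) >= min_sources)


# Placeholder backends and URLs for illustration.
RESULTS = {
    "neural_search": ["https://example.org/ai-act/", "https://example.org/liability"],
    "encyclopedia": ["https://example.org/ai-act"],
    "curated_db": ["https://example.org/case-law"],
}
print(corroborated(RESULTS))
```

URLs that only one backend returned are not necessarily wrong; in this sketch they are simply the ones worth spot-checking first.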
That's really the lesson here. Daniel saw a dramatic improvement in accuracy by swapping one component of his pipeline. Same branding code, same prompts, same language model. The only variable was the search backend. That tells you how much the retrieval layer matters. It's not the flashy part of the stack, but it might be the most important part.
It's the unsung hero. Everyone's focused on the language model, but the language model can only work with what it's given. It's like having a brilliant chef working with whatever shows up at the back door. If the delivery truck brings spoiled produce, it doesn't matter how good the chef is.
A great language model with bad retrieval produces bad output. A mediocre language model with great retrieval produces good output. Invest in your search backend. It's not glamorous, but it's where the leverage is.
Which is a sentence that would have sounded completely insane five years ago. "Invest in your search backend." In twenty twenty-one, search was a solved problem. You typed into Google and you got results. The idea that search quality was a strategic differentiator for AI applications would have sounded like saying "invest in your calculator."
Welcome to twenty twenty-six, where your search API is a strategic asset. The world changed under us.
Let's wrap up with a look at where this is heading. What's the next frontier for agentic search? Because text retrieval is clearly maturing, but the world isn't just text.
Multi-modal retrieval. Exa is already experimenting with image and video search. The challenge is that grounding gets much harder when your source material isn't text. How do you verify that a claim derived from a video transcript is accurate? How do you attribute information from an infographic? The structured metadata approach that works so well for text doesn't translate cleanly to visual media. A video doesn't have a single publication date in the same way an article does. An infographic might combine data from multiple sources with different levels of reliability.
Because the "source" isn't a single URL with a publication date and an author. It's a frame in a video, or a data point in a chart, or a claim made in a podcast. And each of those has its own provenance chain that might be completely opaque. Who made the chart? Where did they get the data? Was the data updated since the video was published?
That's before we even get to the problem of deepfakes and synthetic media. If your retrieval system pulls in a video that looks like a news report but was generated by an AI, the grounding verification problem becomes exponentially harder. You can't just check the domain authority of the hosting platform; the content itself might be fabricated. We're entering a world where the verification problem isn't just "is this source reliable" but "is this source even real."
The accuracy gains we're seeing now with text might not translate to multi-modal retrieval. Or at least, not without a whole new set of architectural innovations. The structured metadata approach that works for text relies on assumptions that don't hold for other media types.
Which brings me to the open question that keeps me up at night. As agentic search tools become more sophisticated, will they replace traditional search engines for AI-powered applications, or will they remain a specialized tool for specific use cases like content grounding? I don't know the answer. Google's scale is staggering — eight point five billion queries a day, an index that covers essentially the entire web. Exa is optimized for a narrow but critical use case. The question is whether the narrow use case eventually expands to cover most of what people actually need from search, or whether the breadth of general web search remains indispensable.
My bet is on specialization. The history of technology suggests that general-purpose tools get unbundled into specialized tools that do specific things extremely well. Google Search is the department store. Exa is the boutique that only sells one thing, but it's the best version of that thing. And I think we'll see more boutiques — search tools optimized for legal research, for medical literature, for financial filings. Each one doing one thing better than any general tool could.
For Daniel's use case — generating accurate, grounded content about current events — the boutique wins. The department store has everything, but the boutique has exactly what you need and nothing you don't.
Which is why the podcast sounds better than it did two months ago. The retrieval layer changed, and everything downstream improved. Same hosts, same format, same language model — better information going in, better show coming out.
Now: Hilbert's daily fun fact.
Hilbert: In sixteen ninety-four, a Mongolian saddle-maker named Tserendorj designed a specialized curved awl for leather-stitching that could pierce three layers of hardened yak hide in a single motion. He demonstrated it before the local khan, who was so impressed he commissioned five hundred saddles. Tserendorj died of pneumonia three days later, and the awl design was lost for over two centuries. It was rediscovered in nineteen-oh-three by a Russian ethnographer who found a single surviving sketch in a monastery archive, but by then industrialized saddle production had made the technique obsolete.
...right. So the moral of the story is document your work, or your revolutionary yak-hide-piercing awl dies with you.
The khan never got his saddles. Five hundred saddles, just... Because one guy caught a cold. That's the most brutal supply chain disruption I've ever heard of.
This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop. If you enjoyed this episode, find us at myweirdprompts dot com. We'll be back next week.
Probably with something even weirder. Maybe Hilbert will tell us about the seventeen-hundreds next time. What happened in seventeen-oh-four, Hilbert? Actually, don't answer that.