I was looking at some of our old infrastructure logs the other day and it struck me how much the conversation has shifted. A couple of years ago, everyone was just scrambling to get any kind of vector search working. It was the great vector gold rush. But today, here in late March twenty twenty-six, it feels like we have entered the era of vector optimization. We are no longer just asking if we can turn text into numbers; we are asking how we can do it across five different modalities without going bankrupt in the process.
It is a completely different world, Corn. Especially with the explosion of releases we have seen just in the last few weeks. Today's prompt from Daniel is about the modern embedding landscape, and it really forces us to look at how the fundamental unit of A-I memory has evolved. We have moved past simple text-to-vector conversion into these universal, multimodal spaces. It is no longer just about finding a document that looks like your query; it is about finding a frame in a video that answers a question posed in an audio clip, all while managing the massive overhead that comes with high-dimensional data.
Daniel is asking us to compare the major players like OpenAI, Cohere, Voyage, and Jina, but he also wants us to dig into the technical debt side of this. We have talked before in episode twelve fourteen about how choosing an embedding model is essentially a permanent marriage to a specific coordinate system. If you want to switch models later, you have to re-index every single document in your database. That is a massive undertaking if you are talking about millions or billions of vectors. It is the ultimate architectural lock-in.
And that is why the stakes are so high right now. If you pick the wrong horse today, you are stuck with a massive migration bill down the road. What is fascinating is that the definition of a good horse has changed. We used to just look at how many dimensions a model had, thinking that more was always better. We thought a fifteen-hundred-dimensional vector was inherently smarter than a seven-hundred-dimensional one. But now, we have these Matryoshka architectures that let you scale the dimensions up or down depending on your budget and your latency needs. It is about flexibility, not just raw size.
Before we get into the heavy math of Matryoshka models, let's look at the landscape as it stands here in late March. It feels like Jina A-I really threw a wrench in things with their version four release just a few days ago, on March twentieth.
Jina Embeddings version four is a massive shift. They built it on the Qwen two point five V-L three billion backbone. For listeners who aren't tracking every model release, that means they are using a vision-language model as the foundation for embeddings. It is not just a text model that has been bolted onto an image model. It is natively multimodal. One of the coolest things they introduced is what they call Scenario-Switch. They are using Low-Rank Adaptation, or LoRA adapters, to let the model optimize itself for different tasks on the fly.
So, instead of having one model for retrieval and another for text matching, you just swap the adapter? How does that actually work in a production pipeline?
Precisely. Think of it like a Swiss Army knife where the blade changes shape based on what you are cutting. You can optimize for a specific query type, like a short search string, or a specific retrieval task, like finding a needle in a haystack of technical manuals, without needing a whole new model. It makes the embedding process much more surgical. But while Jina is pushing the edge on adapters, Google just dropped the Gemini Embedding two Preview on March tenth, which is trying to be the ultimate generalist. They are calling it a five-modality model.
Five modalities. Let me guess: text, image, video, audio, and... what is the fifth?
P-D-F. They are treating the document structure itself as a modality. It supports three thousand seventy-two dimensions, and it maps everything into a single unified space. Imagine being able to search through a video clip using a snippet of audio or finding a specific chart in a P-D-F by describing it in text, all within the same vector index. It is the realization of the "universal embedding" dream we have been hearing about for years.
That sounds like a dream for enterprise search, but I have to wonder about the cost. If we are talking about three thousand seventy-two dimensions for every single chunk of data, the storage math starts to look pretty scary. This reminds me of our discussion in episode twelve fifteen about the vector database hangover. People build these massive indexes and then realize they are spending thousands of dollars a month just on R-A-M to keep those vectors searchable.
That is exactly where the math gets brutal. Let's break it down for the listeners. A standard vector with one thousand twenty-four dimensions using thirty-two-bit floating point numbers requires four kilobytes of memory. If you have a million documents, that is four gigabytes. But once you add in indexing overhead, like H-N-S-W graphs, and replication for high availability, you are looking at significant costs. One million documents can easily cost you over one hundred dollars a month just in R-A-M. Now, scale that to a billion documents, which many enterprises have, and you are looking at a hundred thousand dollars a month just to keep the lights on for your search engine.
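For the show notes, that back-of-the-envelope math looks something like this in Python. The index-overhead and replication multipliers are illustrative assumptions, not measurements from any particular database.

```python
# Back-of-the-envelope RAM math for a vector index.
# Overhead and replication multipliers are illustrative assumptions.
DIMS = 1024          # dimensions per vector
BYTES_PER_FLOAT = 4  # 32-bit floats

bytes_per_vector = DIMS * BYTES_PER_FLOAT          # 4096 bytes = 4 KB
raw_gb = 1_000_000 * bytes_per_vector / 1024**3    # ~3.8 GiB for 1M docs

# HNSW graph links plus replication can double or triple the footprint.
INDEX_OVERHEAD = 1.5   # assumed multiplier for graph structure
REPLICAS = 2           # assumed copies for high availability
total_gb = raw_gb * INDEX_OVERHEAD * REPLICAS

print(f"raw: {raw_gb:.1f} GiB, with overhead and replicas: {total_gb:.1f} GiB")
```

Scale the document count by a thousand and the same arithmetic lands you in the terabyte range, which is where the six-figure monthly bills come from.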
Which is why everyone is talking about OpenAI's text-embedding-three-large and small models. They were the ones who really popularized Matryoshka Representation Learning, or M-R-L. I love that name, by the way. It refers to those Russian nesting dolls where you open one and there is a smaller one inside.
It is a perfect name for the tech. In a traditional embedding model, the information is spread out across the entire vector. If you want fewer dimensions, you usually have to train a whole new, smaller model. But with M-R-L, the model is trained with a specific loss function so that the most important information is packed into the earlier dimensions of the vector. The first sixty-four dimensions are a summary of the first one hundred twenty-eight, which are a summary of the first two hundred fifty-six, and so on.
So, if you use OpenAI's large model at its full three thousand seventy-two dimensions, you get maximum accuracy. But if you only need, say, ninety-eight percent of that accuracy, you can just lop off the end of the vector and only store the first two hundred fifty-six dimensions.
And that reduces your storage costs by twelve times. That seems like a no-brainer for most applications. Why would anyone use the full three thousand seventy-two dimensions if they can get almost the same performance at a fraction of the cost?
Well, that two percent difference in accuracy can be the difference between finding the right legal precedent or the right medical diagnosis and missing it entirely. In high-precision domains, you want every bit of signal you can get. But for general purpose search, M-R-L is a lifesaver. OpenAI's small model is incredibly cheap at two cents per million tokens, while the large model is thirteen cents. It is a safe default for a reason. It is the I-B-M of embeddings: nobody ever got fired for choosing OpenAI.
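To make the nesting-doll idea concrete for the show notes, here is a minimal truncation sketch in numpy. The full vector here is random stand-in data rather than a real API response, and the re-normalization step matters because cosine search assumes unit-length vectors.

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions of a Matryoshka embedding
    and re-normalize, since cosine search assumes unit length."""
    head = vec[:dims]
    return head / np.linalg.norm(head)

# Stand-in for a real 3072-dim embedding (random here, for illustration).
rng = np.random.default_rng(0)
full = rng.standard_normal(3072)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 256)
print(small.shape)                  # (256,)
print(full.nbytes // small.nbytes)  # 12x smaller
```

In practice OpenAI's embeddings endpoint also accepts a dimensions parameter, so the truncation can happen server-side and you never ship the full vector at all.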
But OpenAI isn't the only one in the game. If we look at the enterprise side, Cohere Embed version four is still the gold standard for many people, especially if they are dealing with more than just English. Cohere is fascinating because they focus so heavily on what they call compression-aware training. They know that developers are going to quantize their vectors to save money, so they train the model to be resilient to that compression. Their multilingual performance across more than one hundred languages is still the benchmark to beat. They charge ten cents per million tokens, which sits right in the middle of the price range.
Speaking of benchmarks, we should talk about the M-T-E-B, the Massive Text Embedding Benchmark. It is the scoreboard everyone looks at, but I am starting to hear a lot of grumbling that the scoreboard might be rigged, or at least, it is not measuring what we think it is.
There is a massive controversy right now regarding benchmark contamination. Because these benchmarks are open-source, it is very tempting for a model provider to include the test data in their training set. If a model has seen the questions and the answers during training, it is going to look like a genius on the leaderboard, but it might fall apart when you give it real-world data it has never seen before. We are seeing models that rank in the top five on M-T-E-B but fail to find basic information in a proprietary company wiki.
And it is not just contamination. The M-T-E-B tracks fifty-six datasets across eight categories, things like classification, clustering, and semantic textual similarity, or S-T-S. But a lot of engineers are realizing that a high S-T-S score doesn't necessarily mean the model is good at R-A-G, or Retrieval-Augmented Generation.
That is a crucial distinction. S-T-S measures how similar two sentences are. For example, "The cat sat on the mat" and "A feline rested on the rug." They are semantically similar. But in a R-A-G system, you aren't looking for similarity; you are looking for the answer to a question. If I ask, "What is the capital of France?" I don't want a sentence that is semantically similar to my question, like "What is the largest city in France?" I want a sentence that contains the answer: "Paris is the capital of France." Those two sentences aren't actually that similar in a vector space unless the model is specifically trained for asymmetric retrieval.
This is why we are seeing a shift toward what people are calling Agentic Benchmarks. Instead of just looking at a single retrieval step, they measure how well the embeddings support multi-step reasoning. If an A-I agent has to find three different pieces of information across ten documents to answer a complex prompt, how often does it succeed? That is a much better proxy for real-world performance. We are moving from "does this look like that" to "can this help me solve a problem."
And if you look at those kinds of technical, code-heavy, and multi-step tasks, Voyage A-I has been making a lot of noise. They were acquired by MongoDB last year, and on February fifteenth, they released Voyage Multimodal three point five. It is currently leading the benchmarks for technical retrieval. Voyage is interesting because they are founded by Tengyu Ma, a professor from Stanford. They seem to be taking a very academic, high-precision approach. Their Voyage-three-large model is optimized for long contexts, up to thirty-two thousand tokens, which is massive for an embedding model. Most models top out at eight thousand.
If you are trying to embed a whole technical manual or a massive codebase, that extra context window is a game changer. You don't have to chunk the data into tiny pieces and lose the global context. But Herman, you pay for it. Voyage-three-large is eighteen cents per million tokens. That is significantly more expensive than OpenAI or Cohere. Is it worth it?
If you are building a tool for engineers or doctors, that cost is usually worth the precision. If a developer is searching a codebase and the embedding model misses a crucial function definition because it was outside the context window, that is a failure. Voyage is betting that precision in high-value domains is worth the premium price.
Let's talk about the open-source side of things. B-G-E from the Beijing Academy of Artificial Intelligence and Nomic A-I seem to be the two big names there. B-G-E M-three is still the heavyweight for open-source multilingual tasks, right?
It is. The M-three stands for Multi-task, Multi-granularity, and Multi-lingual. It is a beast of a model. But Nomic is doing something really cool with their Embed version two. It is a very lightweight model, only one hundred thirty-seven million parameters. Because it is so small, you can run it locally on your own hardware or even on edge devices. And despite its size, it supports a context window of eight thousand one hundred ninety-two tokens.
I love the idea of local embeddings. If you are worried about privacy or if you are working in a secure environment where you can't send your data to an A-P-I, having a model like Nomic that you can run in-house is essential. Plus, Nomic is big on reproducibility. They release their training data and their entire pipeline, which is a breath of fresh air compared to the "black box" models from the big providers. It gives you a level of auditability that you just can't get with OpenAI.
It really does. Now, I want to circle back to the storage math because there is another piece of technology that pairs with Matryoshka models to solve the cost problem, and that is Binary Quantization.
I was waiting for you to bring this up. This is where we turn those complex floating point numbers into just ones and zeros, right?
In binary quantization, you take each dimension of the vector and you just record whether it is positive or negative. You turn a thirty-two-bit number into a single bit. That is a thirty-two-times reduction in size. When you combine that with Matryoshka models, where you have already cut the dimensions from three thousand seventy-two down to, say, two hundred fifty-six, the two savings multiply: you are storing thirty-two bytes per vector instead of twelve kilobytes, which shrinks your total R-A-M footprint by well over ninety-five percent.
But what does that do to the accuracy? I imagine turning a nuanced number into a simple one or zero has to lose a lot of information. It feels like trying to describe a painting using only black and white pixels.
You would think so, but it is surprisingly effective for the first stage of retrieval. Think of it like a funnel. You use the super-compressed, binary vectors to quickly scan millions of documents and find the top one hundred candidates. This is incredibly fast because computers are very good at comparing bitstrings using something called Hamming distance. Then, you pull the full-precision vectors for just those one hundred documents to do a final, high-accuracy reranking. It gives you the speed and cost savings of compression without sacrificing the final quality of the search results.
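Here is a sketch of that two-stage funnel for the show notes, using random stand-in vectors and numpy's bit-packing helpers. A production system would lean on a vector database's built-in quantization rather than hand-rolled numpy, but the logic is the same: a cheap Hamming-distance scan over everything, then a full-precision cosine rerank over the survivors.

```python
import numpy as np

def binarize(vecs: np.ndarray) -> np.ndarray:
    """Quantize float vectors to packed sign bits (1 if dim > 0)."""
    return np.packbits(vecs > 0, axis=1)

def hamming(query_bits: np.ndarray, doc_bits: np.ndarray) -> np.ndarray:
    """Bit-level Hamming distance from one packed query to many docs."""
    xor = np.bitwise_xor(doc_bits, query_bits)
    return np.unpackbits(xor, axis=1).sum(axis=1)

rng = np.random.default_rng(1)
docs = rng.standard_normal((10_000, 256)).astype(np.float32)
# Plant a near-duplicate of document 42 as the query.
query = docs[42] + 0.1 * rng.standard_normal(256).astype(np.float32)

# Stage 1: cheap binary scan over all ten thousand vectors.
doc_bits = binarize(docs)
query_bits = binarize(query[None, :])
candidates = np.argsort(hamming(query_bits, doc_bits))[:100]

# Stage 2: full-precision cosine rerank over the 100 survivors.
subset = docs[candidates]
scores = subset @ query / (
    np.linalg.norm(subset, axis=1) * np.linalg.norm(query))
best = candidates[np.argmax(scores)]
print(best)  # the planted neighbor, document 42
```

Each two-hundred-fifty-six-dimension binary signature packs into just thirty-two bytes, which is why the first stage can afford to touch every document.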
That feels like the right architectural pattern for twenty twenty-six. Use the cheap, fast stuff to narrow the field, then use the expensive, high-precision stuff to cross the finish line. It is about being smart with your compute budget.
It is. And that leads us to the decision-making criteria Daniel asked about. How do you actually choose between these models?
If I am building a high-throughput app where latency is everything, like a real-time autocomplete or a simple recommendation engine, I am looking at Nomic for local deployment or OpenAI-small for their A-P-I. They are fast, they are cheap, and the performance is more than enough for those use cases. You don't need a three-billion parameter multimodal model to suggest the next word in a search bar.
And if you are in a high-stakes domain, like legal tech or medical research, you go with Voyage-three or the new Gemini two model. You want that high-precision, long-context capability. If you are doing global search across dozens of languages, Cohere is your best bet because they have put so much work into the multilingual nuances that English-only models just miss.
There is also that "performance tax" we should mention. When you use a multilingual model for an English-only task, you usually see a five to ten percent drop in performance compared to a model that was trained specifically for English. It is a jack-of-all-trades, master-of-none situation. If your users are only ever going to search in English, don't use a multilingual model just because it sounds more advanced. You are paying a tax you don't need to pay in both accuracy and often in token cost.
That is a great point. And the final piece of advice for anyone building these systems is: do not trust the M-T-E-B leaderboard blindly. You have to test these models on your own domain-specific data. A model that ranks number one for general web search might be terrible at searching through your company's proprietary Python codebase or your internal H-R documents. We call this the "ground truth" problem.
Exactly. You need to build your own small evaluation set of queries and expected results. It takes a bit of time upfront, but it is the only way to know whether the model you are marrying is actually going to be a good partner. If you don't do this, you are just guessing based on someone else's marketing.
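For listeners who want a starting point, here is a minimal recall-at-k harness on toy data. The planted gold documents here are stand-ins for your own hand-labeled query-to-document pairs, which is the part you actually have to build yourself.

```python
import numpy as np

def recall_at_k(query_vecs, doc_vecs, gold_ids, k=5):
    """Fraction of queries whose gold document appears among the
    top-k cosine neighbors. Inputs are assumed L2-normalized."""
    scores = query_vecs @ doc_vecs.T            # cosine, since normalized
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = [gold in row for gold, row in zip(gold_ids, topk)]
    return sum(hits) / len(hits)

# Toy stand-in data: 3 queries, 50 docs, gold docs planted near each query.
rng = np.random.default_rng(7)
docs = rng.standard_normal((50, 64))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
gold = [3, 17, 40]
queries = docs[gold] + 0.05 * rng.standard_normal((3, 64))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

print(recall_at_k(queries, docs, gold))
```

Swap in embeddings of your real queries and documents from each candidate model, and the model with the highest recall on your own data wins, whatever the leaderboard says.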
I think we should also look at the future of these universal embeddings. With Gemini two and Jina version four, we are seeing the walls between modalities come down. In the past, if you wanted to search images, you used a C-L-I-P model. If you wanted to search text, you used a B-E-R-T-based model. But now, everything is being mapped into the same neighborhood.
Does that mean specialized models are going to become obsolete? Will we eventually just have one "everything" embedding model?
I think we will always have a place for specialists. A model trained specifically on genomic data or chemical structures is always going to outperform a general-purpose multimodal model in those fields. But for the ninety percent of applications that deal with standard documents, images, and videos, the universal models are becoming too convenient to ignore.
It is about the friction. If I can use one A-P-I and one vector index to handle every type of data my company generates, that is a huge win for the engineering team. It reduces the "vector sprawl" we talked about in episode twelve twelve. You don't want five different databases for five different types of vectors. You want a unified memory layer.
Precisely. The goal is a unified memory layer for your A-I applications. Whether that is a R-A-G pipeline or an autonomous agent, having a consistent, high-quality, and cost-effective embedding strategy is the foundation for everything else. If your memory is fragmented, your agent is going to be confused.
So, to wrap this up for Daniel and everyone else listening, the "vibe check" for embeddings in March twenty twenty-six is pretty clear. Optimization is the name of the game. If you aren't using Matryoshka models and some form of quantization, you are overpaying for your storage. If you aren't looking at multimodal capabilities, you are going to be left behind when the next wave of "universal" apps hits.
And most importantly, don't get married to a model without a prenuptial agreement in the form of a solid evaluation pipeline. Know exactly what you are getting into before you commit to that coordinate system. Future-proof your storage by using Matryoshka-aware indexing so you can scale up or down as your needs change.
Well said. I think that covers the landscape for now, though I am sure another five models will drop by the time we finish recording this. The pace is just relentless.
That is the nature of the beast. But the fundamentals of the math and the storage economics are going to stay relevant for a long time. The physics of R-A-M costs don't change as fast as the model architectures.
Definitely. We will have to keep an eye on how those Agentic Benchmarks evolve. I suspect we are going to see a lot of these "leaderboard legends" get knocked down a peg when they have to do real-world reasoning across multiple documents.
I am looking forward to it. It is about time we had a more honest way to measure these things. Real-world performance is the only metric that matters in the end.
Alright, let's leave it there. This has been a deep dive into the state of the vector. Thanks to our producer Hilbert Flumingtop for keeping the gears turning behind the scenes.
And a big thanks to Modal for providing the G-P-U credits that power our research and this show.
If you found this technical deep dive useful, we would love it if you could leave us a review on your favorite podcast app. It really helps other curious nerds find the show. You can also find our full archive and show notes at myweirdprompts dot com.
This has been My Weird Prompts. We will catch you in the next one.
Later.