Hey everyone, welcome back to My Weird Prompts. I am Corn Poppleberry, and I am joined as always by my brother, Herman. We are coming to you from our home in Jerusalem on this fine March afternoon. It is March fifteenth, two thousand twenty-six, and the tech world is currently obsessed with one thing: simplification.
Herman Poppleberry here. It is good to be back in the chairs, Corn. We have had a busy week, but I have been looking forward to this specific discussion since our housemate Daniel sent over the prompt a few days ago. He has been watching the tech landscape shift, and he basically asked us why we are still making things so complicated. He sees the "Just Use Postgres" movement taking over Twitter and LinkedIn, and he wants to know if the era of the specialized database is finally dead.
It is a provocative question, and honestly, it is a fair one. The prompt today is really a challenge to the modern data architecture. For the last couple of years, there has been this massive movement in the software engineering world called "Just Use Postgres." The idea is that Postgres has become so capable, with extensions for vectors, time-series data, and even basic search, that we should stop over-engineering our stacks with ten different specialized databases. We are seeing people try to run their entire company off a single RDS instance.
And I love the spirit of that movement. I really do. Simplicity is a virtue in engineering. But Daniel’s prompt pushes on the ceiling of that philosophy. He wants to know why, if Postgres is so great, we still have these massive, multi-million dollar deployments of Snowflake, ClickHouse, and BigQuery. If the relational database won the war, why does the specialized analytical database still exist? Is it just corporate inertia, or is there something deeper?
It is the classic architectural divergence between what we call Online Transactional Processing, or O L T P, and Online Analytical Processing, or O L A P. We touched on the database explosion back in episode one thousand one hundred twenty-four, and we even debated the future of Postgres in episode one thousand one hundred twenty-three. But today, we are going to look at the physical constraints. We are going to look at why, no matter how much we love Postgres, the laws of physics and hardware eventually force us to build a separate house for our data.
It is like trying to use a Ferrari as a dump truck, Corn. You can put a trailer hitch on a Ferrari, and it might pull a few loads of gravel, but eventually, the transmission is going to melt. A Ferrari is built for speed and precision in a single direction. A dump truck is built for volume and throughput. They are both vehicles, but their internal geometry is fundamentally different. If you try to move a mountain with a fleet of Ferraris, you are going to go bankrupt and break all your cars.
That is a perfect place to start. Let us talk about that internal geometry. When we talk about a traditional database like Postgres or MySQL, we are talking about row-based storage. Herman, explain why that is the gold standard for transactions but a nightmare for analytics.
So, imagine a spreadsheet. In a row-based database, when the computer saves that data to the disk, it writes it row by row. It puts the name, then the age, then the address, then the credit card number, all right next to each other in a single block on disk. This is what we call the N-ary Storage Model, and it is incredible for transactional workloads. If a customer logs into your website and needs to see their profile, the database can go to one spot on the disk, grab that entire row, and serve it up instantly. It is efficient because the disk head does not have to jump around. It is a single seek and a single read.
Right, because in a transaction, you usually want the whole object. You want the whole user or the whole order. But what happens when a data analyst comes along and says, "I want to know the average age of every user in our database of ten million people"?
That is where the Ferrari starts to smoke. In a row-based system, to get the average age, the computer still has to read the entire row for every single user. It has to read the name, the address, the credit card number, and the bio, just to get to that one tiny piece of data called age. If your row is one kilobyte and the age is only four bytes, you are wasting over ninety-nine percent of your disk I O and your memory bandwidth reading data you do not need. If you have a one terabyte table, you are literally reading one terabyte of data just to extract a few megabytes of ages. It is a massive waste of resources.
And that is the birth of the columnar database. Systems like ClickHouse, or Snowflake, or even DuckDB, which has been getting a lot of hype lately. Instead of storing the data row by row, they store it column by column. All the names are together in one block. All the ages are together in another. This is the Decomposition Storage Model.
It is a simple flip of the orientation, but the implications are massive. If I want the average age of ten million people in a columnar database, the system only touches the age column. It scans a single contiguous block of numbers. It is orders of magnitude faster because it is not touching the names or the addresses at all. It is the difference between reading one chapter of a book and reading every single page just to find one specific word on each page. But there is a second, even more important benefit to columnar storage that people often overlook, and that is compression.
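Herman's row-versus-column contrast can be sketched in plain Python. This is a minimal illustration, not how a real engine lays out bytes: Python lists of dicts stand in for row pages, and a flat list of integers stands in for a column file. The names and sizes here are made up for the example.

```python
import random

# Toy dataset: one hundred thousand "users".
N = 100_000
random.seed(0)
ages = [random.randint(18, 90) for _ in range(N)]

# Row-oriented layout: every field of a record travels together.
rows = [{"name": f"user{i}", "age": ages[i], "bio": "x" * 100} for i in range(N)]

# Column-oriented layout: just the ages, contiguous and homogeneous.
age_column = ages[:]

# Average age the "row store" way: touch every full record to read one field.
avg_from_rows = sum(r["age"] for r in rows) / N

# Average age the "column store" way: scan one flat array and nothing else.
avg_from_column = sum(age_column) / N

assert avg_from_rows == avg_from_column
```

The answers are identical, but the row path drags the name and the hundred-byte bio through memory for every record, which is exactly the ninety-nine percent waste from the one-kilobyte-row example above.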
Oh, this is huge. If you have a column of data where every entry is a country name, you might have the word "United States" appearing ten thousand times in a row.
Computers are incredibly good at compressing repetitive data. We use things like Run-Length Encoding or Dictionary Encoding. In a row-based database, your data is a mix of strings, integers, and booleans, so it is hard to compress effectively because the data is heterogeneous. But in a columnar store, because the data is homogeneous, you can get compression ratios of ten to one or even fifty to one. This means you can store a petabyte of data in a fraction of the physical space, which also makes it faster to read because you are pulling fewer bits off the disk. You are effectively increasing your disk bandwidth through the power of math.
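The Run-Length Encoding idea Herman mentioned is simple enough to write out. This is a toy sketch of the concept, not any particular engine's codec; real columnar formats layer dictionary encoding, bit-packing, and general-purpose compression on top.

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode a sequence into [(value, run_length), ...]."""
    return [(v, len(list(run))) for v, run in groupby(values)]

def rle_decode(pairs):
    """Expand [(value, run_length), ...] back to the original sequence."""
    out = []
    for v, n in pairs:
        out.extend([v] * n)
    return out

# A sorted country column compresses brutally well under RLE.
column = ["United States"] * 10_000 + ["Israel"] * 5_000 + ["Japan"] * 2_500
encoded = rle_encode(column)

assert encoded == [("United States", 10_000), ("Israel", 5_000), ("Japan", 2_500)]
assert rle_decode(encoded) == column
```

Seventeen thousand five hundred strings collapse into three value-and-count pairs, which is why a homogeneous, sorted column can hit those ten-to-one and fifty-to-one ratios.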
So, that explains the storage side. But there is also the compute side. I know you have been looking into vectorized execution recently. How does that play into this divide? Because it is not just about how the data sits on the disk; it is about how the CPU actually chews through it.
This is where we get into the actual CPU architecture. Modern processors have these instructions called S I M D, which stands for Single Instruction, Multiple Data. Essentially, a modern CPU can perform a single operation, like an addition, on a whole batch of numbers at the same time. Row-based databases usually process data one row at a time. They use what is called the Volcano-style iterator model. They loop through the rows: "Give me a row, process it, give me the next row." It is very flexible but very slow for big scans.
Because of the overhead of the function calls for every single row, right?
There is a lot of branching and a lot of wasted CPU cycles. Columnar engines, on the other hand, are designed for vectorized execution. They grab a batch of, say, one thousand ages, shove them into the CPU registers, and calculate the sum in a few clock cycles using those S I M D instructions. It is like the difference between a person counting coins one by one and a machine that weighs a whole bag of coins to tell you the value instantly.
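The Volcano-versus-vectorized contrast can be sketched as well. Python cannot issue real S I M D instructions, so this is only a shape-of-the-code illustration: the first function pulls one row at a time through an iterator, the second processes contiguous batches of a single column, which is the pattern that lets a real engine use those wide CPU instructions.

```python
from array import array

# Volcano-style: pull one row at a time through an iterator chain.
def volcano_sum(rows):
    total = 0
    for row in iter(rows):        # one "give me a row" round trip per record
        total += row["age"]
    return total

# Vectorized-style: operate on whole batches of a single column at once.
def vectorized_sum(age_column, batch_size=1024):
    total = 0
    for i in range(0, len(age_column), batch_size):
        batch = age_column[i:i + batch_size]  # contiguous slice of one column
        total += sum(batch)                   # one tight loop per batch
    return total

ages = array("i", range(18, 18 + 10_000))     # homogeneous, densely packed
rows = [{"age": a} for a in ages]

assert volcano_sum(rows) == vectorized_sum(ages)
```

Both produce the same sum; the difference in a real engine is that the batch version amortizes the per-row call overhead and hands the CPU data it can chew in parallel.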
So even if Postgres added a columnar extension, which it has through things like Hydra or ParadeDB or Citus, it is often fighting against its own core architecture which was built for that row-by-row, transactional consistency. It is hard to bolt on vectorized execution after the fact when the whole system is built around the idea of a single row being the atomic unit of work. Postgres is optimized for ACID compliance—Atomicity, Consistency, Isolation, and Durability. It wants to make sure that if you move five dollars from account A to account B, that data is safe forever.
Precisely. And let us not forget the B-Tree index. This is the bread and butter of Postgres. B-Trees are amazing for finding a single record in a sea of millions. If you want to find user ID five million four hundred thirty-two, a B-Tree will find it in a few hops. But they have a massive overhead for writing. Every time you insert a row, the database has to update several indexes, which involves locking and shifting data around. In an analytical system, where you might be dumping a hundred million rows of log data every hour, the overhead of a B-Tree would bring the system to its knees. Analytical databases usually use something called L S M trees or just simple partition pruning and min-max indexes, which are much more efficient for high-volume writes and massive scans.
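The min-max index idea Herman closed on is worth seeing concretely. This is a hypothetical sketch: each partition of a log table keeps only the minimum and maximum timestamp it contains, and the scanner skips any partition whose range cannot overlap the query.

```python
# Each partition keeps a min-max summary of its timestamp column.
partitions = [
    {"file": "part-0", "min_ts": 0,   "max_ts": 249, "rows": list(range(0, 250))},
    {"file": "part-1", "min_ts": 250, "max_ts": 499, "rows": list(range(250, 500))},
    {"file": "part-2", "min_ts": 500, "max_ts": 749, "rows": list(range(500, 750))},
    {"file": "part-3", "min_ts": 750, "max_ts": 999, "rows": list(range(750, 1000))},
]

def scan(lo, hi):
    """Return matching timestamps plus the list of partitions actually read."""
    hits, scanned = [], []
    for p in partitions:
        # Min-max check: skip the whole partition if its range cannot overlap.
        if p["max_ts"] < lo or p["min_ts"] > hi:
            continue
        scanned.append(p["file"])
        hits.extend(ts for ts in p["rows"] if lo <= ts <= hi)
    return hits, scanned

hits, scanned = scan(500, 600)
assert scanned == ["part-2"]          # three of four partitions pruned
assert hits == list(range(500, 601))
```

Unlike a B-Tree, this summary costs almost nothing to maintain on ingest: appending a partition just records two numbers, which is why it suits those hundred-million-rows-an-hour log workloads.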
Okay, so we have established that the hardware and the data structures themselves demand specialization once you hit a certain scale. But that brings us to the messiest part of the whole thing. If I have my production data in Postgres because I need those fast transactions, but I need my analytics in ClickHouse or Snowflake, how do I get the data from point A to point B? This is the data pipeline problem, and it is where most engineers lose their minds.
This is where the engineering gets really expensive. You cannot just run a "SELECT STAR" from Postgres and insert it into Snowflake every five minutes. You will crash your production database. The overhead of that query will lock tables, eat up memory, and your actual users will start seeing five hundred errors. You need a way to move data without impacting the live users.
Daniel specifically asked about how a company like Netflix handles this. And Netflix is actually a great case study because they were pioneers in this space. They do not just have one database; they have thousands of microservices, each with its own database. They have a massive "Database of Databases" problem.
Right. So, if you are Netflix, and you want to know which shows are trending in Jerusalem right now, you have to pull data from a service that handles user clicks, a service that handles geographic metadata, and a service that handles the content library. These are all different databases, likely managed by different teams. The modern way to do this is through something called Change Data Capture, or C D C.
I love the elegance of C D C. Instead of querying the database tables directly, you have a tool like Debezium that sits there and watches the database’s write-ahead log, or the WAL. Every database has a log where it records every single change—every insert, update, and delete—before it actually commits it to the table. It is the database's "black box" recorder. Debezium reads that log in real-time and turns every change into a message.
And then those messages get tossed into a streaming platform, usually Apache Kafka. This is the heart of the pipeline. Kafka acts as a giant, high-speed buffer. It is a distributed commit log that can handle millions of events per second. Once the data is in Kafka, it can be sent to multiple places at once. One stream might go to an Elasticsearch cluster for real-time search, another might go to an S3 bucket which acts as your data lake, and another might go directly into an analytical warehouse like Snowflake.
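The fan-out Herman just described can be sketched without a real Kafka cluster. The change events below are a simplified, hypothetical shape loosely modeled on Debezium's envelope (real payloads carry schema and source metadata, and `before` images for updates), and the two sink functions stand in for independent consumer groups reading the same topic.

```python
import json

# Simplified Debezium-style change events: op "c" = create, "u" = update.
events = [
    {"op": "c", "table": "users",  "after": {"id": 1, "name": "Daniel"}},
    {"op": "u", "table": "users",  "after": {"id": 1, "name": "Daniel P"}},
    {"op": "c", "table": "orders", "after": {"id": 7, "user_id": 1}},
]

# Stand-ins for downstream systems fed by the same stream.
search_index, data_lake = [], []

def to_search(ev):
    if ev["table"] == "users":           # only user documents are searchable
        search_index.append(ev["after"])

def to_lake(ev):
    data_lake.append(json.dumps(ev))     # the lake keeps everything, raw

for ev in events:                        # with Kafka, each sink would be its
    to_search(ev)                        # own consumer group reading the
    to_lake(ev)                          # same topic independently

assert len(search_index) == 2 and len(data_lake) == 3
```

The key property is that the production database never sees any of this: the events were read from the write-ahead log, not from the tables.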
Let us pause there for a second. We just used two terms that people often confuse: Data Lake and Data Warehouse. In the Netflix model, what is the difference? Because they use both extensively.
Think of the Data Lake as the attic of your house. When the data comes out of Kafka, it is often messy. It is raw J S O N, it might have missing fields, or it might be in a format that is hard to query. You dump all of that into a Data Lake, like Amazon S3, using a format like Apache Iceberg or Parquet. It is incredibly cheap storage. You keep everything there because you might need it later, even if you do not know why yet. Netflix actually uses Apache Iceberg to manage their data lake because it allows them to treat S3 like a giant table with schema evolution and ACID transactions.
But you cannot really run a business off an attic. You will never find anything, and the performance is terrible if you are trying to do complex joins.
That is where the Data Warehouse comes in. You have these transformation jobs, often using a tool called D B T, which stands for Data Build Tool. These jobs take the raw, messy data from the Data Lake, clean it up, apply business logic, and then load it into the Data Warehouse in a structured format, like a star schema. The Data Warehouse is the clean, organized showroom where the analysts can actually do their work. It is where you define things like "What is a subscription?" or "What counts as a view?"
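In practice D B T expresses this step as SQL models, but the transformation itself is easy to sketch in Python. The "a view is a play over thirty seconds" rule below is an invented example of the kind of business logic that lives in the warehouse layer, not a real Netflix definition.

```python
import json

# Raw events as they land in the lake: messy, with missing fields.
raw = [
    '{"user_id": 1, "event": "play", "ms": 120000}',
    '{"user_id": 1, "event": "play", "ms": 500}',
    '{"user_id": 2, "event": "play"}',          # missing duration entirely
]

# Business rule the warehouse enforces: a "view" is a play over 30 seconds.
MIN_VIEW_MS = 30_000

def to_fact_views(raw_lines):
    """Clean raw lake records into a structured fact table of views."""
    facts = []
    for line in raw_lines:
        ev = json.loads(line)
        ms = ev.get("ms", 0)                    # default the missing field
        if ev["event"] == "play" and ms >= MIN_VIEW_MS:
            facts.append({"user_id": ev["user_id"], "duration_ms": ms})
    return facts

fact_views = to_fact_views(raw)
assert fact_views == [{"user_id": 1, "duration_ms": 120000}]
```

Three raw records produce one fact row, and every analyst downstream now gets the same answer to "what counts as a view," because the rule is encoded once, in one place.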
It sounds like a lot of moving parts. You have the production database, the C D C connector, the Kafka cluster, the Data Lake, the transformation layer, and finally the Data Warehouse. This is roughly what people call the Lambda or Kappa architecture, Lambda if you keep a separate batch layer alongside the stream, Kappa if you treat the stream as the single source of truth. But Herman, is this not exactly what Daniel was worried about? The complexity here is staggering. Why would anyone do this if they could just run a query on Postgres?
It is. And for a small startup, it is absolutely overkill. If you have ten gigabytes of data, please, for the love of all that is holy, just use Postgres. You can run analytics on Postgres just fine at that scale. But when you are Netflix, and you are dealing with petabytes of data and millions of events per second, you simply do not have a choice. The operational overhead of the pipeline is the price you pay for being able to ask complex questions of your data without crashing the site. Imagine if every time an analyst wanted to see the quarterly revenue, the "Play" button stopped working for a million users. That is the trade-off.
What about the scale itself? When does a company actually need to move from the "just use Postgres" phase to the "I need a Data Warehouse" phase? Is there a specific number? Because I think people often jump the gun because they want to play with the cool tools.
There is no hard and fast rule, but a good rule of thumb is around one hundred gigabytes of analytical data. Once your aggregations, like your sums and averages over large tables, start taking more than a few seconds, you are entering the danger zone. Another trigger is when you need to join data from multiple different sources. If you have data in Postgres, data in Salesforce, and data in Google Analytics, trying to join those inside Postgres is a nightmare. You need a central warehouse to act as the single source of truth.
That brings up a great point about the semantic layer. We talked about this a bit in episode one thousand two hundred twelve, when we were looking at the sprawl of specialized databases. The problem with having data in all these different places is that nobody knows which version of the data is correct. The Data Warehouse, when done right, provides that semantic layer where everyone agrees that "this is how we define a daily active user." If you do not have that, you end up with "Data Silos."
And that is a conservative principle in engineering, really. Order and governance. If you just have a bunch of data lakes without a warehouse, you end up with a data swamp. You have ten different analysts giving ten different numbers for revenue because they are all querying the raw logs differently. One analyst might be counting canceled subscriptions, and another might not. The warehouse is where you enforce the rules of the road. It is where the business logic lives.
Let us talk about the market for a minute. If someone is listening to this and they realize they have outgrown their single Postgres instance, who are the big players they are looking at in two thousand twenty-six?
It is a fascinating market right now. You have the cloud giants, of course. Google BigQuery, Amazon Redshift, and Microsoft Fabric. These are great because they are integrated into the clouds people are already using. BigQuery, in particular, is incredible because it is serverless. You do not even manage a database; you just write SQL and Google handles the rest. It scales from zero to petabytes without you ever touching a configuration file.
But then you have the independent giants like Snowflake and Databricks. They have been in a bit of a cold war for the last few years, right? It feels like they are both trying to become the same thing.
Oh, it is a total war. Snowflake started as a pure Data Warehouse. They were the ones who perfected the separation of storage and compute. You pay for the storage on S3, and you only pay for the compute when you are actually running a query. It was a game changer for cost. Databricks, on the other hand, came from the Data Lake world. They were the Spark people. They said, "Keep your data in a lake, and we will give you the tools to query it."
And now they are meeting in the middle. Snowflake is adding more support for unstructured data and machine learning, and Databricks created the "Lakehouse" concept to add warehouse-like structure to the data lake using Delta Lake.
They are converging. And then you have the challengers like ClickHouse. ClickHouse is the speed king. It is an open-source columnar database that is terrifyingly fast for real-time analytics. If you need to show a dashboard to a million users that updates every second, you probably want ClickHouse. It is what powers the analytics for things like Cloudflare and Uber. And then there is DuckDB, which is the darling of the data world right now. It is an analytical database that runs inside your application, just like SQLite. It is perfect for when you want to do heavy analytics on a local machine or in a small web app without setting up a whole server. We are even seeing "MotherDuck," which is a serverless version of DuckDB that bridges the gap between local and cloud.
It is interesting to see the pendulum swing. We went from everything in one database to massive fragmentation, and now we see tools like DuckDB trying to bring some of that power back to the local level. But what about the future? Daniel’s prompt touched on whether this divide will ever go away. Have you heard of H T A P?
Hybrid Transactional and Analytical Processing. This is the holy grail. The idea is a single database that can handle both row-based transactions and columnar analytics simultaneously. There are systems like TiDB and SingleStore, and even some new Postgres forks, that are trying to do this. They actually store the data twice internally, once in rows for the transactions and once in columns for the analytics, and the query optimizer decides which one to use based on the query.
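The "store it twice, let the optimizer choose" idea can be sketched as a toy router. Everything here is hypothetical scaffolding for illustration: the same two logical records live in a row layout and a column layout, and a trivial "optimizer" picks a store from the shape of the query.

```python
# The same logical table, stored twice.
row_store = {1: {"id": 1, "balance": 100}, 2: {"id": 2, "balance": 250}}
col_store = {"id": [1, 2], "balance": [100, 250]}

def point_lookup(pk):
    # Transactional shape: fetch one whole record -> use the row store.
    return row_store[pk]

def aggregate(column):
    # Analytical shape: scan one column across all records -> column store.
    return sum(col_store[column])

def run(query):
    """A toy query optimizer: route by query shape."""
    if query["kind"] == "lookup":
        return point_lookup(query["pk"])
    return aggregate(query["column"])

assert run({"kind": "lookup", "pk": 2}) == {"id": 2, "balance": 250}
assert run({"kind": "agg", "column": "balance"}) == 350
```

The hard part real HTAP systems face is invisible in this sketch: keeping the two copies transactionally consistent with each other under heavy concurrent writes.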
That sounds like it solves the pipeline problem, but it must be incredibly complex to build. You are basically running two databases in a trench coat.
It is. And it is expensive. For most companies, the laws of physics still favor the split. It is just more efficient to have a specialized engine for each task. If you try to do everything in one engine, you often end up with a "jack of all trades, master of none" situation. I think for the next decade, we are still going to be building pipelines. The tools will just get better at hiding the complexity. We are already seeing that with things like PeerDB or other managed C D C services that make the move from Postgres to Snowflake almost a one-click process.
So, to summarize the answer to Daniel’s prompt: we need specialized analytical databases because of the fundamental physical differences in how we read and write data. Row-based for finding a needle in a haystack, columnar for weighing the whole haystack. And while Postgres is getting better every day, it still cannot cheat the cache lines of a CPU or the compression benefits of homogeneous data columns. You can only fight the hardware for so long before it fights back.
Do not over-engineer too early, but do not ignore the laws of physics. If your analytics are slow, and your production database is struggling under the weight of your BI tools, it is time to look at a warehouse. And remember, the "Just Use Postgres" movement is a reaction to the complexity of the last decade, but it is not a magic wand that makes the scale problem disappear.
That feels like a solid place to wrap up the core discussion. But before we go, Herman, what are the practical takeaways for someone who is feeling the pain of this growth right now? Let us give them a decision matrix.
First, check your indexing. Before you move to a whole new database, make sure you are not just missing a few key indexes in Postgres. Use the "EXPLAIN ANALYZE" command to see where the bottlenecks are. If you are doing a full table scan on a ten million row table, that is your problem right there. Second, if you do need to scale, look at Postgres extensions first. TimescaleDB is amazing for time-series data, and Citus can help you shard your database across multiple nodes if you are hitting CPU limits.
And third, if you do decide to build a pipeline, do not build it yourself from scratch. Do not write a custom Python script that runs on a cron job. Use established tools like Debezium for your C D C and Kafka or a managed alternative like Confluent or Redpanda for your transport. The last thing you want to be doing is writing custom code to move data. It will break, you will have schema drift, and you will lose data. Once your analysts stop trusting the data, your job becomes ten times harder.
Trust is the most important currency in data. Once the numbers in the warehouse do not match the numbers in production, the whole system is useless. You will spend all your time in meetings explaining why the revenue numbers are different instead of actually building things.
Very true. Well, this has been a deep dive into the plumbing of the internet. I hope this gave Daniel and all of you a better understanding of why the world of databases is so fragmented. It is not just because engineers like new toys, although we do. It is because the hardware demands it. We are slaves to the speed of light and the layout of silicon.
It usually comes down to the hardware in the end. If you enjoyed this episode, we have a whole archive of similar deep dives at myweirdprompts dot com. You can find our RSS feed there, or you can search for My Weird Prompts on Telegram to get a notification every time we drop a new one. We have been getting some great feedback on the Telegram channel lately, so keep those comments coming.
And if you have a second, please leave us a review on your podcast app or on Spotify. It genuinely helps the show grow and helps other people find these discussions. We really appreciate the support from our listeners. It is what keeps us going.
We really do. We have been doing this for over one thousand two hundred episodes now, and it is the community that keeps us coming back to these microphones. We love the weird prompts you all send in.
If you want to dive deeper into some of the topics we mentioned today, check out episode one thousand two hundred thirteen on the NoSQL paradox, or episode one thousand two hundred twenty-one where we talked about using AI to break the SQL migration bottleneck. We also have a great episode, one thousand two hundred twelve, about the vector database sprawl which is a great cautionary tale for anyone thinking about adding yet another database to their stack.
Plenty of good stuff in the archives. Thanks again to Daniel for the prompt. It was a great excuse to geek out on storage engines. We will see you all in the next one.
This has been My Weird Prompts. I am Corn Poppleberry.
And I am Herman Poppleberry. Peace.