#1505: BigQuery & GDELT: Mining Global News with AI

Discover how Google BigQuery and the GDELT project allow researchers to analyze billions of global news records using real-time AI and SQL.

Episode Details
Published
Duration
16:23
Audio
Direct link
Pipeline
V5
TTS Engine
chatterbox-regular
LLM

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The Scale of Global Sentiment

Analyzing the pulse of the planet requires more than a standard database. The Global Database of Events, Language, and Tone (GDELT) project monitors news in over 100 languages, updating every fifteen minutes. With a repository exceeding 2.5 billion georeferenced records dating back to 1979, it represents a "real-time mirror" of human society. To process this mountain of information, researchers increasingly turn to Google BigQuery, a platform built for the massively parallel processing such vast datasets require.

Columnar Storage and the Dremel Engine

The primary challenge of analyzing GDELT-scale data lies in the architecture of traditional databases. Most production environments use row-oriented storage, the pattern behind online transaction processing (OLTP) systems, which is ideal for looking up specific user records or processing transactions. However, when an analyst needs to calculate sentiment trends across an entire continent over a decade, a row-based system becomes a bottleneck: it must read every full row, irrelevant columns and all, just to extract the one attribute the query needs.

BigQuery solves this through columnar storage and the Dremel execution engine. By storing data in columns rather than rows, the system reads only the specific attributes a query requires. This massively parallel architecture allows it to scan petabytes of data in seconds, effectively decoupling storage from compute. The serverless model means users pay only for what they use: either per terabyte scanned on demand, or for reserved compute capacity measured in "slots".
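The contrast can be sketched with a toy in Python. The records and layout here are purely illustrative; real columnar formats add compression, encoding, and block statistics on top, but the access pattern is the point:

```python
# Row store: a list of full records. Computing an average tone
# forces us to touch every field of every row.
rows = [
    {"date": "2024-01-01", "country": "BR", "tone": -2.1, "url": "a"},
    {"date": "2024-01-02", "country": "AR", "tone": 1.4, "url": "b"},
    {"date": "2024-01-03", "country": "BR", "tone": 0.3, "url": "c"},
]
avg_from_rows = sum(r["tone"] for r in rows) / len(rows)

# Column store: each attribute lives in its own array, so the same
# query reads only the "tone" column and ignores everything else.
columns = {
    "date": ["2024-01-01", "2024-01-02", "2024-01-03"],
    "country": ["BR", "AR", "BR"],
    "tone": [-2.1, 1.4, 0.3],
    "url": ["a", "b", "c"],
}
avg_from_columns = sum(columns["tone"]) / len(columns["tone"])

assert avg_from_rows == avg_from_columns
```

Both layouts yield the same answer; the difference is how much data had to be read to get it, which is exactly what on-demand pricing charges for.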

The Shift to AI-Native Analytics

The role of the data warehouse is evolving from a "cold storage" repository for old reports into an active engine for artificial intelligence. Recent updates have introduced functions like AI.EMBED and AI.SIMILARITY, allowing users to generate vector embeddings directly within SQL queries. This means a developer can perform semantic similarity searches—finding news stories that are conceptually related rather than just sharing keywords—without moving data to a separate machine learning environment.
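The idea behind those functions, comparing embedding vectors rather than keywords, can be illustrated with a small cosine-similarity sketch in Python. The vectors below are made up for illustration; real embeddings come from a model and have hundreds of dimensions, and in BigQuery the comparison happens inside SQL rather than in application code:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Pretend embeddings for three news stories.
ceasefire_talks = [0.9, 0.1, 0.0]
peace_negotiation = [0.8, 0.2, 0.1]
football_scores = [0.0, 0.1, 0.9]

# Conceptually related stories score high even with no shared keywords...
assert cosine_similarity(ceasefire_talks, peace_negotiation) > 0.9
# ...while unrelated stories score low.
assert cosine_similarity(ceasefire_talks, football_scores) < 0.2
```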

Furthermore, integrations with models like Gemini 3.1 enable natural language data cleaning and museum-scale cataloging. These tools allow for the analysis of non-textual data as well, such as real-time visual analysis of international news broadcasts.

Why Offloading Matters

A recurring theme in high-level data management is the "separation of concerns." Keeping decades of historical data in a production database can degrade application performance, leading to lag or crashes when complex analytical queries are run. By offloading this "historical baggage" to BigQuery, organizations keep their production environments lean and responsive.

This separation also allows for more complex data joining. By moving internal sales or logistics data into the same warehouse as external datasets like GDELT, companies can cross-reference their performance against global events, such as geopolitical shifts or supply chain disruptions. While this creates some data duplication, the trade-off is a vastly superior ability to derive insights without risking the stability of live applications.
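A toy Python sketch of that cross-referencing, with invented sales figures and news-tone values standing in for the real internal and GDELT tables:

```python
# Hypothetical internal sales figures, keyed by (date, region).
sales = {
    ("2024-03-01", "PortCityA"): 120,
    ("2024-03-02", "PortCityA"): 45,
    ("2024-03-02", "PortCityB"): 130,
}

# Hypothetical external signal (GDELT-like): average news tone per day.
news_tone = {
    ("2024-03-01", "PortCityA"): 0.8,
    ("2024-03-02", "PortCityA"): -6.2,  # e.g. a port protest in the news
    ("2024-03-02", "PortCityB"): 0.5,
}

# The "join": pair each sales figure with that day's news tone.
joined = {
    key: (units, news_tone[key])
    for key, units in sales.items()
    if key in news_tone
}

# Flag days where sales dipped while tone turned sharply negative.
flagged = [k for k, (units, tone) in joined.items() if units < 100 and tone < -5]
```

In a warehouse this is one SQL join running against the analytical copy, so the live application never feels it.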


Episode #1505: BigQuery & GDELT: Mining Global News with AI

Daniel's Prompt
Daniel
Custom topic: let's do a deep dive into Google big query. Daniel was using it the other day to query gdelt - which we should also cover in this episode. Why do we need specific databases for querying "big" data and
Corn
I was looking at some of the stuff Daniel sent over this morning, and it reminded me of that time you tried to organize our entire childhood photo collection into a relational database. Do you remember that? You had tables for locations, dates, people, and even the type of film used.
Herman
I still maintain that was a very logical way to handle the metadata, Corn. Though I will admit, querying for every photo where you were wearing that ridiculous striped sweater took longer than I expected. I am Herman Poppleberry, by the way, for anyone joining us for the first time.
Corn
Today's prompt from Daniel is about Google BigQuery, and it seems he has been getting his hands dirty with the Global Database of Events, Language, and Tone project, or GDELT. He is asking us to dive into why we need these massive, specialized databases and what it looks like when you actually offload your production data into a backend like this for the heavy lifting.
Herman
This is right up my alley. Daniel is playing with one of the most fascinating datasets on the planet. GDELT is essentially a real-time mirror of human society. It was created by Kalev Leetaru—pronounced KAH-lev leh-TAR-oo—and it monitors news in over one hundred languages from almost every corner of the globe. It updates every fifteen minutes. When you realize it contains over two and a half billion georeferenced records going back to nineteen seventy-nine, you start to understand why Daniel is using BigQuery. You simply cannot run that on a standard setup.
Corn
Two and a half billion records. That sounds like the kind of scale where a normal database just throws up its hands and quits. I know Daniel works in technology communications and automation, so he is probably looking for patterns or sentiment shifts in global news. But walk me through the BigQuery side of this. What makes it the right tool for something like GDELT?
Herman
It comes down to the architecture. Most people are used to what we call Online Transactional Processing databases, or OLTP. Think of PostgreSQL or MySQL. Those are row-based. If you want to find one specific user or update a single order, they are incredibly fast. But GDELT is a multi-terabyte beast. If you ask a row-based database to calculate the average sentiment of news reports in South America over the last ten years, it has to read every single row, including every column you do not need, like the author's name or the source URL.
Corn
And that is where the bottleneck happens. You are essentially asking the database to read the entire library just to find out the average length of the books.
Herman
That is a perfect way to put it. BigQuery is an Online Analytical Processing system, or OLAP, and it uses columnar storage. If you have a table with one hundred columns, but your query only asks for two of them, BigQuery literally only reads those two columns from the disk. This is powered by the Dremel execution engine—that is D-R-E-M-E-L. It is a massively parallel processing engine that can scan petabytes of data in seconds. It splits the query into thousands of small pieces and distributes them across a massive cluster of machines.
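The fan-out Herman describes can be sketched as a toy in Python. The shard size and data are purely illustrative; Dremel distributes this across thousands of machines rather than a list comprehension:

```python
# Toy of Dremel-style fan-out: split one aggregation into many small
# partial aggregations, then merge the partial results.
tones = list(range(1000))  # stand-in for a huge "tone" column
shard_size = 100
shards = [tones[i:i + shard_size] for i in range(0, len(tones), shard_size)]

# Each "worker" returns (sum, count) for its shard...
partials = [(sum(s), len(s)) for s in shards]

# ...and a final stage merges the partials into the global average.
total, count = map(sum, zip(*partials))
average = total / count
```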
Corn
I imagine that is why Daniel can get results in seconds rather than hours. I saw that Kalev Leetaru recently demonstrated using BigQuery with the new Gemini three point one model to do museum-scale cataloging of manuscripts that are five hundred years old. He even used it for real-time visual analysis of international news via the GDELT TV Visual Explorer. That is a wild jump from real-time news text to historical preservation and video analysis.
Herman
It shows the flexibility of the platform under Thomas Kurian’s leadership at Google Cloud. Just yesterday, on March twenty-third, Google announced that the AI dot EMBED and AI dot SIMILARITY functions are now generally available in BigQuery. This means Daniel can now take those billions of news records in GDELT and generate vector embeddings directly within his SQL queries. He does not have to move the data to a separate machine learning environment. He can do semantic similarity searches right there. If he wants to find news stories that are conceptually similar to a specific event, even if they do not share the same keywords, he can do that with a standard SQL statement.
Corn
That feels like a massive shift. We used to talk about the data warehouse being this cold storage place where data goes to die or maybe get turned into a dusty chart once a month. Now, it sounds like the warehouse is becoming the brain of the operation. If Daniel can run generative models directly on the data, what does that do to the traditional workflow of a developer?
Herman
It simplifies it immensely. In the past, if you wanted to do what Daniel is doing, you would have to extract the data, transform it, load it into a specialized vector database, and then run your models. Now, BigQuery ML supports the automatic deployment of open-source models to Vertex AI endpoints. This was just updated on March seventeenth. It streamlines the entire pipeline. You stay within the Google Cloud ecosystem, and the data lives in Colossus, which is Google’s distributed file system. Because storage and compute are decoupled, you only pay for the processing power you use, which they measure in units called slots.
Corn
I have always found the concept of slots a bit abstract. Is it just a fancy word for a virtual CPU?
Herman
Essentially, yes. A slot is a unit of computational capacity. You can use on-demand pricing, which is about six dollars and twenty-five cents per terabyte scanned, or you can buy dedicated capacity if you are running massive workloads constantly. For Daniel, the on-demand model is probably perfect because he gets the first terabyte every month for free. But the real magic is how BigQuery handles the scale. It can process over one terabyte of data per second if you have enough slots allocated.
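As a rough Python sketch of that on-demand pricing model. The $6.25 rate and the one free unit per month are the figures quoted in this episode; check current Google Cloud pricing before relying on them, and note that Google bills in binary terabytes (TiB):

```python
def on_demand_cost_usd(bytes_scanned, price_per_tib=6.25, free_tib=1.0):
    """Estimate BigQuery on-demand cost: a flat rate per TiB scanned,
    with the first TiB each month free. Figures are from the episode,
    not authoritative pricing."""
    tib = bytes_scanned / 2**40
    billable = max(0.0, tib - free_tib)
    return billable * price_per_tib

# A 500 GiB scan stays inside the free tier...
assert on_demand_cost_usd(500 * 2**30) == 0.0
# ...while a 5 TiB scan bills 4 TiB beyond it.
assert on_demand_cost_usd(5 * 2**40) == 25.0
```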
Corn
Let's talk about the offloading aspect Daniel mentioned. Why wouldn't a company just keep all their data in their production database? If I am running a successful e-commerce site, why not just keep twenty or more years of transaction history in my main Postgres instance?
Herman
Because you will eventually kill your application. Production databases are optimized for those point lookups I mentioned. Every time you run a massive analytical query, like "give me the year-over-year growth for every product category since two thousand ten," you are consuming Input Output Operations Per Second. You might even lock tables. If your database is busy scanning ten years of history for a report, it might not be able to process a new customer's password change or a checkout request. Your app lags, or worse, it crashes.
Corn
So, it is about separation of concerns. You keep the production database lean and mean for the live users, and you shove the historical baggage into the warehouse for the analysts.
Herman
And it is not just about performance. It is about the type of questions you can ask. When you offload production data into BigQuery, you can join it with external sources. This is exactly what Daniel might be doing with GDELT. Imagine joining your internal sales data with global news trends. You might find that your sales in a specific region are tied to certain geopolitical events or even weather patterns that GDELT is tracking. You cannot do that easily if your data is siloed in a standard SQL database.
Corn
I can see how that would be powerful for something like conflict forecasting or supply chain management. If you are a global logistics company like UPS, which we know uses BigQuery, you need to know if a protest in a specific port city is going to impact your deliveries three days from now. GDELT provides that real-time signal, and BigQuery provides the engine to cross-reference it with your internal fleet data.
Herman
That is a perfect example. And with the governance updates they rolled out on March sixteenth, it is getting easier for global organizations to manage this. They introduced a new Global Default Location setting. This prevents people from accidentally creating datasets in the wrong region, which is a huge deal for data residency laws. If you are working with sensitive data in Europe or Israel, you need to be sure it stays where it is supposed to.
Corn
You mentioned Gemini three point one earlier. How is that actually changing the way people interact with BigQuery? Is it just a chatbot that writes SQL for you, or is it deeper than that?
Herman
It is much deeper. The integration now allows for natural language data cleaning and transformation. If Daniel has a bunch of messy files in Cloud Storage, he can use Gemini directly from the BigQuery console to say, "Clean up these dates and normalize the currency columns," and it will handle the transformation logic. It makes the data prep phase, which usually takes up eighty percent of a data scientist's time, much more manageable.
Corn
It sounds like the barrier to entry is dropping, but the ceiling for what you can do is getting much higher. I am curious about the AI-native label Google is pushing. Is that just marketing, or is there a fundamental change in how the database operates?
Herman
I think it is fundamental. When you integrate embeddings and similarity searches into the SQL layer, you are changing the definition of what a database is. It is no longer just a place to store numbers and strings. It is a place to store and query meaning. If Daniel can write a SQL query that says "SELECT news story FROM gdelt table WHERE semantic similarity of the story text to peace negotiations is greater than zero point eight," he is querying the essence of the news, not just looking for the word peace.
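The spoken query amounts to a threshold filter over similarity scores. A toy Python version, with made-up scores standing in for the output of a real embedding comparison:

```python
# Each story paired with its (invented) similarity to "peace negotiations".
stories = [
    ("Ceasefire talks resume in border region", 0.91),
    ("Central bank raises interest rates", 0.12),
    ("Mediators announce framework agreement", 0.84),
]

# Keep stories whose semantic similarity exceeds 0.8, regardless of
# whether the word "peace" ever appears in the text.
matches = [title for title, score in stories if score > 0.8]
```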
Corn
That is a big deal for researchers. I remember we touched on specialized databases in episode eleven twenty-four when we talked about why one size no longer fits all. It seems BigQuery is trying to be the one size fits all for the analytical side, even if it is not the right choice for the live application side.
Herman
That is the trade-off. You would never use BigQuery to back a mobile app where every millisecond of latency matters for a single row fetch. But for anything involving a scan of a large dataset, it is hard to beat. The Dremel engine is just too efficient at what it does. And because it is serverless, you do not have to worry about managing the underlying infrastructure. You do not have to spin up clusters or worry about disk space. It just scales.
Corn
What about the risks? If Daniel is offloading all this data, he is creating a lot of duplication. You have the data in the production database, and then you have it again in the warehouse. Does that lead to source of truth problems?
Herman
It can if you are not careful with your pipelines. Usually, you use a process called Change Data Capture to stream updates from your production database to BigQuery in near real-time. This keeps them in sync. But you are right, the warehouse becomes a different kind of truth. It is the historical truth. The production database is the right now truth. The real risk is usually cost. If you write a poorly optimized query that scans a petabyte of data on-demand, you could be looking at a multi-thousand-dollar bill for a single click.
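The Change Data Capture flow Herman describes can be sketched as a toy replay loop in Python; the event format is invented for illustration, while real CDC pipelines read the database's write-ahead log:

```python
# Toy change-data-capture: replay a stream of change events from a
# production database onto a warehouse replica to keep it in sync.
replica = {}  # warehouse copy, keyed by primary key

change_stream = [
    {"op": "insert", "id": 1, "row": {"id": 1, "status": "new"}},
    {"op": "insert", "id": 2, "row": {"id": 2, "status": "new"}},
    {"op": "update", "id": 1, "row": {"id": 1, "status": "shipped"}},
    {"op": "delete", "id": 2, "row": None},
]

for event in change_stream:
    if event["op"] == "delete":
        replica.pop(event["id"], None)
    else:
        # Inserts and updates both upsert the full row.
        replica[event["id"]] = event["row"]
```

After replay, the replica reflects the production state at the end of the stream: the historical truth catching up to the right-now truth.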
Corn
That is a scary thought. "Oops, I just spent six thousand dollars on a typo."
Herman
Google has added a lot of guardrails for that, like the ability to set custom quotas and limits. And with the new AI tools, the system is getting better at predicting how much a query will cost before you run it. But for someone like Daniel, who is technically literate and understands the underlying architecture, it is an incredibly powerful sandbox. He can play with the entire history of global news sentiment for the price of a couple of coffees a month, as long as he stays within that free tier and optimizes his scans.
Corn
I like the idea of Daniel sitting there in Jerusalem, looking at global trends through this massive Google lens. It feels very future of work. You don't need a room full of servers anymore; you just need a good prompt and a solid understanding of how to structure your analysis.
Herman
It really is. And GDELT is such a unique project because it is so open. It is one of the primary public datasets in BigQuery. Anyone can go in and start querying it. You can look at the tone of news reports during the early stages of a conflict or track how certain topics gain traction across different languages. The fact that it is updated every fifteen minutes means you can almost see the world's collective consciousness shifting in real-time.
Corn
It makes me wonder about the edge cases. What happens when the AI starts reading the news that was generated by other AIs, and then Daniel queries that sentiment in BigQuery? We are getting into some weird feedback loops there.
Herman
That is a real challenge for the GDELT project. Distinguishing between organic human reporting and automated content is becoming a massive part of the work. But that is also why having these AI-native features in the database is so important. You can use models to help filter and categorize the data at scale.
Corn
So, for the listeners who are currently struggling with a slow production database, what is the first step? Is it as simple as just dumping it into BigQuery?
Herman
Not quite. You want to identify your most expensive analytical queries first. Look at the reports that take the longest to run or the ones that make your developers nervous. Those are your prime candidates for offloading. You can use tools like Google's own Data Transfer Service or third-party pipelines to move that data. Once it is in BigQuery, you stop running those reports against your production database entirely. You will see an immediate improvement in application performance.
Corn
And then you can start doing the fun stuff, like joining it with weather data or news trends from GDELT.
Herman
I mean, that is where the real insights are. You start seeing the why behind the numbers. If your sales dropped in a certain region, was it because of a local holiday you didn't account for, or was there a specific news event that GDELT picked up on that shifted consumer sentiment? That is the kind of second-order analysis that builds great companies.
Corn
I think we have covered a lot of ground here. From the Dremel engine to the museum-scale manuscripts of Kalev Leetaru. It is clear that BigQuery isn't just a database; it is a massive analytical platform that is increasingly being defined by its AI capabilities.
Herman
It is a fascinating time to be working with data. The transition to AI-native features like AI dot EMBED is really just the beginning. By this time next year, I wouldn't be surprised if the SQL language itself has evolved even further to make these complex semantic queries even more natural.
Corn
Before we wrap up, I want to make sure we give a shout-out to the people who make this show happen.
Herman
Of course. A huge thanks to our producer, Hilbert Flumingtop, for keeping us on track and making sure the audio doesn't sound like it was recorded inside a tin can.
Corn
And a big thanks to Modal for providing the GPU credits that power this show. We couldn't do these deep dives without that support.
Herman
This has been My Weird Prompts. If you are finding these discussions helpful, we would love for you to follow us on Telegram. Just search for My Weird Prompts to get notified whenever we drop a new episode.
Corn
We will be back next time with whatever weird prompt Daniel decides to throw at us. Hopefully, it involves more columnar storage and fewer striped sweaters.
Herman
I make no promises on the sweaters, Corn.
Corn
Goodbye, everyone.
Herman
See you next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.