Welcome back to My Weird Prompts, everyone. This is episode seven hundred twenty-nine, and I am Corn, joined as always by my brother.
Herman Poppleberry, at your service. It is good to be here, Corn. I have been looking at some of the new updates on our website this week, and it is pretty wild to see our entire show history mapped out like that.
It really is. Daniel was telling me he spent a good chunk of time making sure our pipeline was solid and even added a three-dimensional topic graph. It is like you can literally swim through the connections between all seven hundred-plus episodes we have done. But the real kicker is that while he was doing that, he also uploaded the entire archive of this show to the Internet Archive.
Which is such a perfect segue into his prompt for today. Daniel is curious about the history and the mechanics of the Internet Archive itself. He wants to know how they manage the sheer complexity of indexing a growing internet, the funding hurdles they face as a voluntary project, and how their centralized model compares to distributed alternatives like Arweave.
It is a massive topic. I think most people use the Wayback Machine occasionally when they hit a four-oh-four error, but they do not realize the Internet Archive is essentially the Library of Alexandria for the digital age, except it is also fighting for its life in court half the time.
That is exactly right. And it is not just websites. They have got books, movies, software, and now, apparently, seven hundred episodes of us talking about weird prompts. So, where do you want to start, Corn? Should we look at the origin story?
Let us do the origins. I think it is important to understand that this was not started by a government or a university. It was Brewster Kahle. He founded it in nineteen ninety-six, which, in internet years, is practically the Stone Age.
Brewster Kahle is a fascinating figure. Before he started the Archive, he was one of the minds behind Alexa Internet, which was later sold to Amazon. The initial idea was that the internet was this incredibly ephemeral thing. In the mid-nineties, the average lifespan of a webpage was something like forty-four days. If you did not save it, it was gone forever. Kahle saw this and realized we were losing our cultural history in real-time. He famously said that the job of a library is to provide universal access to all knowledge, and he realized that if the "knowledge" was moving to the web, the library had to move there too.
It is crazy to think about. If you look at a newspaper from nineteen twenty, it is probably sitting in a library somewhere. But a blog post from nineteen ninety-eight? Without the Archive, that is just vapor. So, how did they actually start "saving" the internet? I mean, you cannot just right-click and save the whole world wide web.
Well, technically, they kind of do. They use what are called web crawlers. The most famous one is called Heritrix, which is an open-source, extensible, web-scale crawler that they developed. Think of it like a digital spider that just crawls from link to link, downloading the content it finds. But the complexity is staggering because the internet is not just flat text files anymore. You have got JavaScript, dynamic content, and database-driven sites that are very hard to capture in a static snapshot.
Right, because if you just save the HTML code but the site relies on a server-side database to display anything, you just end up with a broken page. How do they handle that?
They use a format called WARC, which stands for Web ARChive. It is an international standard. A WARC file does not just save the page you see; it saves the entire transaction. It records the request the crawler sent, the response from the server, the metadata, everything. When you use the Wayback Machine, it is basically replaying those recorded transactions to rebuild the page in your browser. It is a time machine for data packets. But even with WARC, the modern web is a nightmare. Think about a site like Instagram or a complex web app. Those are not just pages; they are streams of data. The Archive has had to develop tools like Brozzler, a crawler that drives a real browser to render the page and capture the results, rather than just grabbing the raw code.
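To make the WARC idea concrete, here is a rough, standard-library-only Python sketch of what a single response record looks like. Real WARC files (the ISO twenty-eight-five-hundred standard) carry additional headers such as payload digests and are usually compressed per record, so treat this as an illustration of the framing, not a spec-complete writer.

```python
import uuid
from datetime import datetime, timezone

def make_warc_record(target_uri: str, http_payload: bytes) -> bytes:
    """Build a simplified WARC/1.1 response record: a header block,
    a blank line, the captured HTTP transaction, and a record separator."""
    headers = [
        b"WARC/1.1",
        b"WARC-Type: response",
        b"WARC-Record-ID: <urn:uuid:%s>" % str(uuid.uuid4()).encode(),
        b"WARC-Date: " + datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ").encode(),
        b"WARC-Target-URI: " + target_uri.encode(),
        b"Content-Type: application/http; msgtype=response",
        b"Content-Length: %d" % len(http_payload),
    ]
    # Headers joined by CRLF, then blank line, payload, and two CRLFs.
    return b"\r\n".join(headers) + b"\r\n\r\n" + http_payload + b"\r\n\r\n"

# The payload is the raw HTTP response the crawler received, verbatim.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hello</html>"
record = make_warc_record("http://example.com/", payload)
print(record.decode().splitlines()[0])  # WARC/1.1
```

The point is that the record preserves the whole exchange, which is what lets the Wayback Machine replay it later instead of merely showing a saved file.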
That explains why sometimes the Wayback Machine feels a bit laggy or why some images do not load. It is trying to reconstruct a moment in time from these discrete packets. But Herman, the scale here is what blows my mind. We are talking about billions of pages. How do they store all of that?
It is astronomical. As of early twenty twenty-six, they are managing over one hundred petabytes of data. For perspective, a petabyte is a thousand terabytes. And they are adding millions of new items every week. They do not just use the cloud, either. For a long time, they were famous for building their own storage nodes. They called them Petaboxes. These were custom-designed server racks that were optimized for high density and low power consumption. They actually have their own data centers in San Francisco and elsewhere.
Wait, so they are a nonprofit, but they are running their own physical data centers? That sounds incredibly expensive. I mean, if you are a voluntary project, how do you pay the electricity bill for one hundred petabytes of spinning disks?
That is the big struggle. Their funding comes primarily from grants, individual donations, and some fees they charge for specialized crawling services for libraries or governments. But they operate on a shoestring budget compared to the tech giants. Brewster Kahle has always been very vocal about the fact that a library should be independent. If you host all your data on Amazon Web Services or Google Cloud, you are at the mercy of their pricing and their terms of service. To be a true library, you have to own the "stacks," even if those stacks are made of silicon and cooling fans.
I love that philosophy, but it feels very precarious. If they have a bad fundraising year, does the history of the internet just start getting deleted?
They have redundancy. They try to keep multiple copies of the data in different geographic locations, including a partial mirror in Egypt at the Bibliotheca Alexandrina. But you are right, the financial pressure is constant. And it is not just the cost of the hardware. It is the legal costs. This leads into what Daniel mentioned about the funding challenges and the legal battles.
You are talking about the "Controlled Digital Lending" thing, right? The lawsuit with the book publishers?
Exactly. This is the biggest existential threat they have faced. During the pandemic, the Internet Archive launched the National Emergency Library. Normally, they follow a rule called Controlled Digital Lending, where if they have one physical copy of a book in their warehouse, they can lend out one digital copy at a time. It mimics how a physical library works. But during the lockdowns, they removed the waitlists because students could not get to physical libraries.
And the publishers lost it.
They sued. Hachette, HarperCollins, Wiley, and Penguin Random House. They argued that the Internet Archive was essentially running a pirate site. The Archive argued they were a library performing a library function. Unfortunately for the Archive, the courts have been quite harsh. They lost at the district court in twenty twenty-three, and in twenty twenty-four the appeals court upheld that ruling. It has forced them to remove hundreds of thousands of books from their digital lending library. It is a massive blow to the concept of a "digital library."
That is the part that hurts. When we talk about "digital preservation," we usually think about technical failures or bit rot, but here, the "rot" is coming from a court order. It makes you realize that the Internet Archive, for all its greatness, is a centralized institution subject to the laws of the United States.
That is the perfect pivot to the second half of Daniel's prompt. He asked about the centralized nature of the Archive and how it compares to things like Arweave. If the Internet Archive gets a court order to delete something, they delete it. They have to. Their terms of service even say they reserve the right to remove content for any reason. If you are looking for "censorship resistance," the Internet Archive might not be your final destination.
So let us talk about Arweave. I know we have mentioned it before, but for those who missed it, how does it actually solve this problem? Is it just a "decentralized" Wayback Machine?
In a way, yes. Arweave is what they call a "perma-web." It uses a blockchain-like structure they call the blockweave, but instead of just recording financial transactions, it stores data. The big innovation there is the economic model. When you upload something to Arweave, you pay a one-time fee upfront. That fee goes into an endowment, and the endowment pays out to miners over time to keep the data stored forever.
Forever is a very long time, Herman. How do they guarantee that?
They use a mathematical model that assumes the cost of storage will continue to decrease over time, which it has for decades. By charging a fee that covers hundreds of years of storage at current prices, they can theoretically fund the storage indefinitely as the tech gets cheaper. And because it is decentralized, there is no "Brewster Kahle" to sue. There is no central office. The data is spread across thousands of nodes globally.
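The endowment arithmetic Herman is describing is essentially a geometric series: if storing a fixed amount of data costs C this year and that cost shrinks by a rate r every year, the total cost forever is C divided by r. A toy calculation with made-up numbers, not Arweave's actual parameters:

```python
def perpetual_storage_fee(annual_cost_now: float, yearly_decline: float) -> float:
    """One-time fee covering storage forever, assuming the yearly cost
    of storing a fixed amount of data shrinks by `yearly_decline` each year.

    Sum of the geometric series C + C(1-r) + C(1-r)^2 + ... = C / r.
    Illustrative only; Arweave's real endowment model uses its own,
    deliberately conservative decline-rate assumptions.
    """
    if not 0 < yearly_decline < 1:
        raise ValueError("decline rate must be between 0 and 1")
    return annual_cost_now / yearly_decline

# If storing one gigabyte costs $0.02 this year and storage gets
# 10 percent cheaper every year, a one-time $0.20 covers it forever.
print(round(perpetual_storage_fee(0.02, 0.10), 4))  # 0.2
```

The smaller the assumed decline rate, the larger the upfront fee, which is why the model leans on storage costs continuing to fall.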
So if a publisher wants a book removed from Arweave, who do they send the cease and desist letter to?
That is the point. There is no one to send it to. Once it is on the network and confirmed, it is technically impossible for any single person to delete it. It is the ultimate censorship-resistant archive. But that brings us to the "mechanics" Daniel asked about. Arweave uses something called "Succinct Proofs of Random Access" or S-P-o-R-A. It basically forces miners to prove they have access to old, rare data in order to earn new rewards. This ensures that the data does not just sit on one server; it is replicated across the whole network.
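As a toy illustration of the proof-of-access idea, not Arweave's actual S-P-o-R-A construction, a miner can be challenged to hash a data chunk whose index is derived from the previous block, so only nodes that actually hold that chunk can answer:

```python
import hashlib

def choose_challenge_chunk(prev_block_hash: bytes, n_chunks: int) -> int:
    """Derive which archived chunk a miner must prove it holds.
    The index comes from the previous block hash, so it cannot be
    predicted in advance; miners holding more of the dataset can
    answer more challenges. (A toy stand-in for SPoRA; the real
    protocol is considerably more involved.)"""
    return int.from_bytes(hashlib.sha256(prev_block_hash).digest(), "big") % n_chunks

def prove_access(chunk_data: bytes, prev_block_hash: bytes) -> str:
    """The 'proof' is a hash binding the stored chunk to this block."""
    return hashlib.sha256(prev_block_hash + chunk_data).hexdigest()

# A pretend dataset of one thousand archived records.
chunks = [f"archived record {i}".encode() for i in range(1000)]
prev_hash = hashlib.sha256(b"block 41").digest()

idx = choose_challenge_chunk(prev_hash, len(chunks))
proof = prove_access(chunks[idx], prev_hash)
print(idx, proof[:16])
```

A node that discarded the challenged chunk simply cannot produce the matching hash, which is the economic nudge toward keeping rare data around.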
But wait, there has to be a catch. If I can put anything on Arweave and it can never be deleted, that sounds like it could get very dark, very fast. I mean, what about illegal content or harmful data?
That is the massive ethical dilemma of the perma-web. Arweave does have some mechanisms where individual node operators can choose what they host. So, a node in a certain country might filter out content that is illegal there. But the data itself remains on the network as a whole. It is a very different philosophy than the Internet Archive. The Internet Archive is a curated, human-led institution. They try to be responsible actors. Arweave is a protocol. It is neutral, for better or worse.
It is the difference between a librarian and a law of physics. A librarian can decide that a certain book is harmful or that a court order must be followed. A law of physics just exists. But I am curious, Herman, about the "right to remove" that Daniel mentioned. On the Internet Archive, if I am a private citizen and they have archived a version of my website from twenty years ago that I am embarrassed by, can I ask them to take it down?
Yes, you can. They generally honor "robots dot t-x-t" files, which are the instructions websites give to crawlers. If you add a "do not crawl" rule to your site today, the Wayback Machine will often hide the historical snapshots, too. They try to be good neighbors. They also have a takedown request system for copyright or privacy concerns.
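The robots dot t-x-t mechanism is simple enough to demonstrate with Python's standard library. The file below is a hypothetical example, though "ia_archiver" really is the user agent the Archive's classic crawler announced:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks archive crawlers from /private/
# while leaving everything open for everyone else.
robots_txt = """\
User-agent: ia_archiver
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.modified()  # mark the rules as freshly read so can_fetch will answer
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("ia_archiver", "https://example.com/private/diary.html"))
print(parser.can_fetch("ia_archiver", "https://example.com/public/page.html"))
```

The first check comes back False and the second True, which is exactly the kind of decision a crawler makes before fetching each page.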
See, that feels like a feature to me, not a bug. The "right to be forgotten" is a pretty big deal in modern ethics. If I put something on Arweave, I am giving up that right forever. I am literally tattooing that data onto the internet.
It is a trade-off. Do you want a library that can be pressured by governments and corporations but can also correct mistakes and protect privacy? Or do you want a permanent record that is immune to pressure but can never be erased, even if it is wrong or harmful?
I think we need both. The Internet Archive is our collective memory, but Arweave is our insurance policy. If the Internet Archive ever gets sued out of existence—which is a real fear given these recent court cases—having that data mirrored on a decentralized network might be the only way we do not enter a new "Digital Dark Age."
That is actually why Daniel's work this week was so interesting. He uploaded our show to the Internet Archive, but he is also looking into mirroring it on Arweave. He mentioned that it is quite expensive to do that right now because we have so much audio data, but it is an aspiration. It is about creating layers of permanence.
Let us go back to the storage mechanics for a second. You mentioned petabytes. How does the Internet Archive deal with the "complexity" Daniel asked about? I imagine that as the web becomes more complex—more video, more heavy assets—the storage requirements do not just grow linearly; they explode.
They use a lot of deduplication. If ten thousand websites all link to the same logo of a major corporation, the Archive does not need to save ten thousand copies of that image. They can save it once and point to it. They also use heavy compression. But the real complexity is in the "link rot" Daniel mentioned. When a site goes down, all the links on it break. The Archive has to manage this massive internal map to make sure that when you are browsing a site from two thousand five, the links on that page lead you to other archived pages from two thousand five, not the current live web.
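The deduplication idea can be sketched as a content-addressed store: key every blob by its hash, so identical payloads are stored once no matter how many U-R-Ls reference them. A toy model, not the Archive's actual storage layer:

```python
import hashlib

class DedupStore:
    """Content-addressed blob store: identical payloads (the same logo
    fetched from ten thousand pages) are kept exactly once, keyed by
    their SHA-256 digest."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}  # digest -> payload
        self._refs: dict[str, str] = {}     # url -> digest

    def put(self, url: str, payload: bytes) -> str:
        digest = hashlib.sha256(payload).hexdigest()
        self._blobs.setdefault(digest, payload)  # store the bytes once
        self._refs[url] = digest                 # but remember every URL
        return digest

    def get(self, url: str) -> bytes:
        return self._blobs[self._refs[url]]

    def unique_bytes(self) -> int:
        return sum(len(b) for b in self._blobs.values())

store = DedupStore()
logo = b"\x89PNG...corporate logo bytes..."
for i in range(10_000):
    store.put(f"https://site{i}.example/logo.png", logo)

print(len(store._refs), "references,", store.unique_bytes(), "bytes stored")
```

Ten thousand references, but the unique storage footprint is the size of one copy of the logo.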
That sounds like a nightmare to maintain.
It is. The replay system has to rewrite the U-R-Ls on the fly as you browse so that every link keeps you inside the snapshot. And then there is the software side. They have the "Emularity" project, where they actually run emulators in your browser. You can go to the Internet Archive and play an old M-S-D-O-S game or poke around an early version of Windows. They are not just saving the files; they are saving the environment needed to run them.
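The link-rewriting trick can be sketched with a simple regex pass. The real replay system also rewrites JavaScript, CSS, and embedded resources, so this shows only the core idea:

```python
import re

def rewrite_links(html: str, timestamp: str) -> str:
    """Rewrite absolute links so they point back into the archive at the
    same moment in time, the way the Wayback Machine keeps you inside a
    snapshot instead of bouncing out to the live web. A toy regex pass,
    not the actual replay engine."""
    def repl(match: re.Match) -> str:
        return f'href="https://web.archive.org/web/{timestamp}/{match.group(1)}"'
    # Only absolute http(s) links are rewritten in this sketch.
    return re.sub(r'href="(https?://[^"]+)"', repl, html)

page = '<a href="http://example.com/news">old news</a>'
print(rewrite_links(page, "20050101000000"))
```

Every absolute link gets the same timestamp prefix, so clicking around a site from two thousand five keeps you in two thousand five.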
That is the part I find most impressive. Saving a document is one thing, but saving the ability to read that document fifty years from now is much harder. If you have a file from nineteen eighty-four but no computer can open it, you have not really preserved anything.
Exactly. Digital preservation is a two-part problem: bit preservation, making sure the ones and zeros do not change, and functional preservation, making sure we can still interpret those ones and zeros. The Internet Archive is one of the few places doing both at scale. They even have a physical archive! Did you know that, Corn? In Richmond, California, they have shipping containers full of physical copies of the books they have digitized. They keep the physical "source" as a backup for the digital record.
That is incredible. It is like they are hedging their bets against every possible type of failure. But let us talk about the "growing internet" part of Daniel's prompt. We are seeing more and more content behind paywalls and login screens. How does the Archive handle the "Dark Web" or just the "Private Web"?
That is a huge hurdle. The Archive generally does not crawl behind logins or paywalls. This means a huge portion of our modern discourse—things happening inside private Discord servers, Facebook groups, or behind the New York Times paywall—is not being preserved in the same way the open web was in the nineties. We are actually entering a period where we might have less historical record of the twenty-twenties than we do of the nineteen-nineties because so much of the web is now "closed."
That is a terrifying thought. We think we are the most documented generation in history, but if the servers at Meta or X go dark, or if they just decide to delete old posts to save on storage costs, that history is gone.
Precisely. And that is why the "Save Page Now" feature is so important. It allows individuals to act as volunteer librarians. If you see something important on the open web, you can manually tell the Archive to go grab it. But for the closed web, we are still in a very precarious position.
So, what about the funding? Daniel mentioned it is a voluntary project. If someone is listening to this and they realize, "Wow, I use the Wayback Machine once a week and I do not want it to die," what is the best way for them to support it?
Well, they take donations, obviously. But they also look for volunteers to help with things like the "Open Library" project or the "Archive-It" service. And honestly, just using it and citing it helps. The more people who see it as an essential public utility, the harder it is for publishers to argue that it is just a "piracy site" without facing public backlash.
It is funny, we call it a "weird prompt," but this is actually one of the most consequential topics we have covered. The way we store our history defines how future generations will see us. If we leave it all in the hands of private companies like Facebook or X, formerly Twitter, that history could be deleted the second it becomes unprofitable to host it.
Or the second a new C-E-O decides they do not like the "vibe" of the old data. We have seen that happen. Platforms die, and they take everything with them. GeoCities is the classic example. When Yahoo shut it down in two thousand nine, they were going to delete millions of personal websites. The Internet Archive and a group called the Archive Team stepped in and did a "panic download" to save as much as they could. That is why you can still see those old, neon-colored, "under construction" pages today.
"Panic download." That should be the name of a band. But it also highlights the "single point of failure" problem. If the Archive Team had not been there, that culture would be gone.
This is where I get really excited about the Arweave comparison. In a decentralized world, you do not need a "panic download." The system is designed so that the data is always being mirrored and verified by multiple parties. But the cost is the big barrier. Storing one hundred petabytes on Arweave would cost a fortune right now. The Internet Archive's centralized model is much, much cheaper per gigabyte.
So we are in this middle ground. We have the centralized, affordable, but legally vulnerable Archive, and we have the decentralized, expensive, but invincible Perma-web.
And the future is probably a hybrid. You use the Internet Archive for your day-to-day research and for the massive, broad-scale crawling of the web. But for the truly "critical" data—the things that must never be lost or censored—you pay the premium to put it on a distributed network.
Like our podcast.
Exactly. Daniel is making sure we are in the "day-to-day" archive now, and maybe one day, we will be on the "invincible" one.
I like the idea of some archaeologist in the year three thousand finding an Arweave node and listening to us talk about Brewster Kahle. They will be like, "Wow, these guys really loved their metadata."
"And they really liked their brotherly banter."
Guilty as charged. But honestly, Herman, does it worry you that we are relying so much on a nonprofit? I mean, think about Wikipedia. We take it for granted, but if the Wikimedia Foundation disappeared tomorrow, the world would be significantly dumber.
It worries me every day. We have privatized the "town square" of the internet, but we have left the "library" to be run by volunteers and donations. It should be a public good. In a perfect world, the Internet Archive would have a multi-billion dollar endowment protected by international treaty. But we do not live in that world. We live in a world where Penguin Random House can sue a library for lending books.
It feels like a fundamental misunderstanding of what the internet is. The internet was built to share information, but our legal systems are still built around the idea of controlling it. The Internet Archive is essentially trying to force the internet to behave like a library, and the legal system is trying to force it to behave like a bookstore.
That is a great way to put it. A library is a place where you can access information regardless of your ability to pay. A bookstore is a place where access is a commodity. The Internet Archive is the only thing standing between us and the total commodification of our digital history.
So, what are the practical takeaways for our listeners? I mean, besides donating. How can a regular person use the Archive better?
One thing people do not realize is that you can "save a page" yourself. If you are on a website and you think, "This is important, and I bet it will be gone in a month," you can go to the Wayback Machine and there is a "Save Page Now" feature. You just paste the U-R-L, and it triggers a crawl right then. You are basically acting as a volunteer librarian.
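For listeners who want to script this, the Archive exposes a public availability endpoint and a Save Page Now U-R-L. The sketch below uses only Python's standard library; the exact response fields are an assumption that may evolve, so check the Archive's current A-P-I documentation before relying on them:

```python
import json
import urllib.parse
import urllib.request

def availability_query(url: str) -> str:
    """Build the Wayback Machine's documented availability endpoint URL."""
    return "https://archive.org/wayback/available?" + urllib.parse.urlencode({"url": url})

def latest_snapshot(url: str):
    """Fetch the most recent archived copy of a URL, or None if there
    is no snapshot. (Response shape per the documented availability API.)"""
    with urllib.request.urlopen(availability_query(url)) as resp:
        data = json.load(resp)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap else None

def save_page_now(url: str) -> int:
    """Trigger an on-demand crawl: the programmatic 'Save Page Now'."""
    with urllib.request.urlopen("https://web.archive.org/save/" + url) as resp:
        return resp.status

# Network calls are left commented so the sketch runs offline:
# print(latest_snapshot("example.com"))
# print(save_page_now("https://example.com/article"))
```

Pasting a U-R-L into the Save Page Now box on the website does the same thing; this is just the volunteer-librarian workflow, automated.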
I do that all the time with news articles that I think might get "stealth-edited" later. It is a great way to keep people accountable.
It is! And for researchers, the Internet Archive has an amazing collection of "dead software." If you need to run a specific version of a program from nineteen ninety-two for a project, you can probably find it there, and you might even be able to run it in your browser. It is an incredible resource for overcoming the "functional preservation" hurdle we talked about.
I think it is also worth mentioning the "Wayback Machine" browser extension. I have it on my laptop. If I hit a broken link, it automatically checks the Archive to see if there is a saved version. It makes the "link rot" problem almost invisible.
It turns the entire internet into a version-controlled system. It is brilliant. But again, it all depends on that central server staying up.
Which brings us back to Arweave. I am curious, Herman, if you were to pick one thing—one piece of human culture—that you would pay to put on Arweave to ensure it survives for a thousand years, what would it be?
Oh, man. That is a heavy question. Honestly? Probably the Wikipedia database. It is the most concentrated version of human knowledge we have ever created. If we lost everything else but kept Wikipedia, we could probably rebuild a lot of it.
That is a solid choice. I think I would pick the source code for the original World Wide Web. Just so people in the future can see how simple it was at the start, before we added all the tracking and the ads and the complexity.
A little bit of digital nostalgia for the year three thousand. I love it.
Well, I think we have covered the bases on Daniel's prompt. We looked at the history, the Heritrix crawlers, the WARC files, the Petaboxes, the legal drama, and the decentralized future. It is a lot to take in, but it makes me feel better knowing that people like Brewster Kahle and the folks at Arweave are thinking about this.
It is a battle for the soul of the internet, Corn. And I am glad we are on the side of the librarians.
Me too. And hey, if you are listening to this and you have found some value in our seven hundred-plus episodes, maybe head over to the Internet Archive and see what else they have. It is a rabbit hole you will never want to come out of.
And if you are enjoying the show, we would really appreciate a quick review on your podcast app or Spotify. It genuinely helps other people find us, and it helps keep this whole "My Weird Prompts" project going.
Yeah, we are not on Arweave yet, so we still need your help to stay visible in the current algorithms. You can find us on Spotify, Apple Podcasts, and pretty much everywhere else. Our website is myweirdprompts dot com, where you can find the R-S-S feed and a contact form if you want to send us your own weird prompt.
Or just email us at show at myweirdprompts dot com. We love hearing from you.
Thanks to Daniel for the prompt today and for getting our archive sorted. It feels good to know we are "official" now.
It really does. Until next time, I am Herman Poppleberry.
And I am Corn. This has been My Weird Prompts. Thanks for listening.
Goodbye, everyone!