#3748: Your Backup Is Probably Corrupted Right Now

How to catch ZFS pool degradation before your backup faithfully preserves garbage for weeks.

Episode Details

Episode ID: MWP-3927
Published: Jun 20
Duration: 28:53
Audio: Direct link
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro
Topics: data-integrity backup-strategies hardware-reliability

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

This episode tackles a nightmare scenario: discovering that your backups have been silently filling with corrupted data for weeks. The core problem is that ZFS, despite its robust data integrity features, doesn't scream when things go wrong — it quietly increments error counters in zpool status and waits for you to notice. For anyone running backups, this means a pre-backup health check is non-negotiable. The gold standard is running a scrub to verify every block's checksum, but since scrubs can take hours or days on large pools, the practical workflow involves scheduling regular scrubs and checking the most recent results plus current zpool status before each backup. The critical distinction between ZFS checksums and SMART monitoring is that SMART only reports the drive's physical health metrics — it can't detect corrupted data from firmware bugs, bad cables, or controller faults. ZFS checksums verify the actual data, making them the final word on integrity. For notification systems, the escalation ladder matters: email is too easily buried, push notifications via Pushover or Gotify are effective for home labs, and PagerDuty makes sense for production systems where downtime costs money. The uncomfortable question about pulling the plug on a degraded pool without RAID comes down to risk tolerance — every second of operation is a gamble, but the instinct to go cold immediately is not wrong.

Daniel sent us this one, and it's basically a war story wrapped in a technical question. He learned the hard way that if your ZFS pool starts degrading silently, every full disk backup you make from that point forward is writing corruption. The question is, how do you build a workflow that catches pool degradation before the backup runs, alerts you in a way you'll actually notice, and then, once you know something's wrong, what's the smart disaster response? And he's asking specifically: are there tools that bake this in already, how does ZFS integrity checking compare to SMART, what notification system makes sense, and if you're not running RAID, do you make the system cold immediately? That last one is the one that makes my fur stand up a little.

It should make your fur stand up. The "not running RAID" part is the buried lede here. Because if you have redundancy, degradation is a yellow alert. You have time. If you don't have redundancy, degradation is a five-alarm fire and every second you're running that pool you're gambling. But let's back up and do this in order, because the question is really about designing a defensive perimeter around the backup itself.

And I think the thing that probably stung for him, and would sting for anyone, is the realization that the backup you thought was your safety net had been quietly filling itself with garbage for weeks. That's the kind of thing that makes you stare at a ceiling at two in the morning.

It's the worst kind of failure because it's silent. ZFS will tell you about checksum errors if you ask it, but it doesn't scream. It doesn't tap you on the shoulder. It just increments a counter in zpool status and waits for you to notice. And most people don't run zpool status every morning with their coffee.

Which is the first thing we should probably nail down. How do you actually check pool health before a backup? Because the answer is straightforward but the implementation is where people get lazy.

The command is zpool status, and the thing you're looking for is any non-zero value in the READ, WRITE, or CKSUM columns. Those three columns are the canary. READ and WRITE count I/O errors the device reported, and CKSUM counts blocks whose contents didn't match their checksum. If any of them are non-zero, something between the disk and the pool has gone wrong. ZFS checksums every block, so when it reads a block and the checksum doesn't match, it increments the CKSUM counter. It's not a heuristic. The default checksum, fletcher4, isn't cryptographic, but a mismatch is still a direct signal that the data coming off the disk is not the data that was written.
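As a rough illustration of that check, here is a minimal Python sketch that scans zpool status output for non-zero counters. The pool name tank, the device names, and the sample text are invented for the example, and the parsing assumes the usual NAME STATE READ WRITE CKSUM column layout, which is worth verifying against your own output.

```python
import subprocess

# States that can appear in the STATE column of `zpool status` config rows.
VDEV_STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}


def find_error_rows(status_text):
    """Return (name, state, read, write, cksum) for rows with a non-zero counter."""
    flagged = []
    for line in status_text.splitlines():
        parts = line.split()
        # Config rows look like: NAME STATE READ WRITE CKSUM [optional notes]
        if len(parts) < 5 or parts[1] not in VDEV_STATES:
            continue
        name, state, read, write, cksum = parts[:5]
        # Counters are compared as text because zpool may abbreviate them (e.g. 1.2K).
        if any(counter != "0" for counter in (read, write, cksum)):
            flagged.append((name, state, read, write, cksum))
    return flagged


def read_pool_status(pool):
    """Run `zpool status <pool>` and return its text output (needs ZFS installed)."""
    result = subprocess.run(
        ["zpool", "status", pool], capture_output=True, text=True, check=True
    )
    return result.stdout


# Hypothetical sample output, used so the parser can be tried without a pool.
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

\tNAME        STATE     READ WRITE CKSUM
\ttank        ONLINE       0     0     0
\t  mirror-0  ONLINE       0     0     0
\t    ada1    ONLINE       0     0     0
\t    ada2    ONLINE       0     0     3

errors: No known data errors
"""

if __name__ == "__main__":
    for row in find_error_rows(SAMPLE):
        print(row)
```

On a real system you would feed read_pool_status("tank") into find_error_rows instead of the sample text.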

That's worth underlining because it's easy to conflate this with SMART, and the prompt specifically asks about the difference. So let's talk about that. ZFS checksum verification versus SMART. They're doing fundamentally different things, right?

SMART is the drive's own self-reported health metrics. It's looking at physical parameters: reallocated sectors, spin-up time, temperature, uncorrectable read errors, things like that. It's the drive's internal diagnostics. And it's useful. A drive with a sky-high reallocated sector count is probably going to fail soon. But SMART does not validate your data. A drive can pass SMART with flying colors and still return corrupted data because of a firmware bug, a cable issue, a RAM problem, a controller fault. That's the category of failure that SMART is blind to.

It's the equivalent of a smoke detector that checks whether it has batteries but doesn't actually detect smoke.

ZFS checksums, on the other hand, validate the data itself. They don't care why the data is wrong. They just know it's wrong. And this is the crucial distinction for backup workflows. If you're doing a full disk backup with something like dd or a block-level tool, you are copying bits verbatim. If those bits are corrupted, you are faithfully preserving the corruption. You're not backing up your data, you're backing up the noise.

The pre-backup health check is non-negotiable. And it's not just zpool status. You'd want to run a scrub first, or at least have a recent scrub result to look at.

Scrub is the gold standard. A scrub reads every single block in the pool, verifies every checksum, and repairs anything it can if you have redundancy. If you don't have redundancy, it will at least tell you which files are damaged. The problem is scrub takes time. On a large pool it can run for hours or even days. So you can't realistically run a full scrub before every backup. What you can do is schedule regular scrubs, say monthly or biweekly, and then before each backup, check the results of the most recent scrub and run zpool status for any new errors since then.
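One way to check the most recent scrub is to read the scan line from zpool status. The sketch below is an assumption-heavy example: it expects the "scrub repaired ... with N errors on DATE" wording that current OpenZFS prints, an English locale for the day and month names, and a thirty-day threshold chosen only for illustration.

```python
import re
from datetime import datetime

# Matches a completed-scrub line such as:
# "scan: scrub repaired 0B in 00:10:32 with 0 errors on Sun Jun  8 00:34:33 2025"
SCRUB_DONE = re.compile(r"scan:\s+scrub repaired .* with (\d+) errors on (.+)$")


def last_scrub(status_text):
    """Return (finished_at, error_count) for the last completed scrub, or None."""
    for line in status_text.splitlines():
        match = SCRUB_DONE.search(line.strip())
        if match:
            # Collapse the double space zpool prints before single-digit days.
            stamp = " ".join(match.group(2).split())
            finished = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
            return finished, int(match.group(1))
    return None


def scrub_is_fresh(status_text, now, max_age_days=30):
    """True only if a scrub finished with zero errors within max_age_days."""
    result = last_scrub(status_text)
    if result is None:
        # No completed scrub found (never run, or still in progress).
        return False
    finished, errors = result
    return errors == 0 and (now - finished).days <= max_age_days
```

Passing now as a parameter keeps the function easy to test; a real script would pass datetime.now().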

This is where the tooling question comes in. Daniel asked whether there are tools that already bake this workflow in. And I think the honest answer is, sort of, but not in the polished turnkey way most people want.

There are a few layers to this. At the most basic level, ZFS has the zed, the ZFS Event Daemon. It's been part of OpenZFS for years. It can execute scripts when certain events happen, including checksum errors or pool degradation. Out of the box it mostly just logs things, but you can configure it to send emails, fire off webhooks, trigger arbitrary commands. It's not a backup tool, it's an event responder, but it's the foundation that most custom solutions are built on.

Zed is the notification plumbing but it doesn't do the backup orchestration itself.

Then there's something like Sanoid and Syncoid, which are the most popular third-party tools in the ZFS backup space. Sanoid handles snapshot management and Syncoid handles replication. Sanoid can be configured to run pre-snapshot and post-snapshot scripts, so you could absolutely put a health check in the pre-snapshot hook. As I understand it, you also need Sanoid's no_inconsistent_snapshot option set so that a failed pre-snapshot script actually skips the snapshot, and you'd want to gate the Syncoid run on the same check. But it's not built in as a default behavior. You have to wire it up yourself.

Which brings us to the uncomfortable truth of ZFS tooling, which is that it's CLI-dominated and expects you to know what you're doing. The ecosystem assumes competence.

It's very much a power-user filesystem. The trade-off is that you get incredible data integrity guarantees and features like snapshots, compression, and deduplication, but the management surface is mostly command-line and the tooling expects you to compose your own workflows. There's no ZFS Backup Wizard that holds your hand through all of this. TrueNAS has a web interface that wraps a lot of this, and it does have built-in health notifications and scrub scheduling, but for custom backup pipelines you're still writing scripts.

If someone's sitting there thinking, I want what Daniel described, a pre-backup health check that gates the backup and alerts me if something's wrong, they're probably writing a shell script that does zpool status, greps for non-zero error counts, and either proceeds with the backup or fires an alert.

That script is maybe thirty lines. It's not complicated. The hard part isn't the check itself, it's the alerting. And the prompt specifically asks about notification systems, so let's talk about that. The question is, how do you make sure you actually notice when the alert fires?

This is the part where I feel like the answer depends on how much pain you've already experienced. Someone who's never lost data might be fine with email. Someone who's stared into the void at two in the morning after losing a pool wants something that will physically shake them awake.

The escalation ladder matters. Email is the lowest tier. It's asynchronous, it can get buried, it can go to spam, and if your mail server is on the same machine that's failing, you might not even get it. SMS is a step up, but it costs money and has delivery lag. Push notifications through something like Pushover or Gotify are fast and free or cheap, and they bypass the email problem entirely. Then you have the pager-duty tier, PagerDuty or Opsgenie, which will actually escalate, call your phone, wake you up. That's overkill for a home lab but absolutely appropriate for any system where downtime costs real money.

Daniel mentioned PagerDuty by name in the prompt, which tells me he's thinking about this in terms of something that actually demands attention, not something that politely suggests you might want to look at a log file someday.

PagerDuty is interesting because it's designed for on-call engineering teams. It has escalation policies, scheduling, acknowledgement requirements. If you don't acknowledge the alert within a certain window, it escalates to the next person. For a home server, you could configure it to escalate from a push notification to an SMS to a phone call over the course of, say, ten minutes. The thing is, it costs money and it's a whole service to manage. For most people, a well-configured Pushover alert that makes an obnoxious sound and overrides do-not-disturb is probably sufficient.

The other thing about notification design is that the alert needs to be specific. You don't want "something is wrong with the server." You want "ZFS pool tank has three checksum errors on disk ada2. Pool status attached." The more specific the alert, the faster the response.

That's where scripting comes in again. Zed can capture all of that detail because ZFS events carry the pool name, the vdev, the error type. A good zed configuration pipes that into your notification system with all the context. There's a zed.rc file you can customize, and the scripts zed runs, called zedlets, sit alongside it in the zed.d directory. The defaults that ship with most distributions are pretty barebones but the hooks are all there.
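To make the "specific alert" idea concrete, here is a small sketch of a custom zedlet body in Python. It assumes zed exposes event details as ZEVENT_POOL, ZEVENT_SUBCLASS, and ZEVENT_VDEV_PATH environment variables; those names are my understanding of the zedlet interface and should be checked against the zed documentation for your version. The message format itself is made up.

```python
import os


def alert_from_event(env):
    """Build a one-line alert from the ZEVENT_* variables passed to a zedlet."""
    # Each lookup falls back to a placeholder because not every event sets every field.
    pool = env.get("ZEVENT_POOL", "unknown pool")
    subclass = env.get("ZEVENT_SUBCLASS", "unknown event")
    device = env.get("ZEVENT_VDEV_PATH", "unknown device")
    return f"ZFS event '{subclass}' on pool {pool}, device {device}"


if __name__ == "__main__":
    # When zed runs this file as a zedlet, event details arrive via the environment.
    print(alert_from_event(os.environ))
```

The returned string is what you would hand to whichever notification channel you settle on.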

Let's talk about the other part of the prompt, which is SMART versus ZFS checking. You already made the key distinction, but I want to push on something. Is there a case for running both? Does SMART catch things that ZFS checksums miss, even if the checksums are the final word on data integrity?

Absolutely run both. They're complementary. SMART tells you the drive is physically deteriorating before it starts returning bad data. Reallocated sectors are the classic early warning. The drive reads a sector, realizes it's failing, and remaps it to a spare sector. That's SMART attribute five, Reallocated Sector Count. If that number is climbing, the drive is on its way out. You want to know that before the checksum errors start because by the time ZFS sees checksum errors, some data may already be unrecoverable if you don't have redundancy.
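A sketch of that early-warning check, assuming the classic ten-column attribute table that smartctl -A prints for ATA drives (NVMe drives report health differently). Watching attribute 197, pending sectors, alongside attribute 5 is a common pairing; the sample table and the zero threshold are illustrative only.

```python
# Attribute IDs worth watching on ATA drives: 5 = reallocated, 197 = pending sectors.
WATCHED_IDS = {5, 197}


def watched_raw_values(smartctl_text):
    """Map attribute name -> raw value for the watched IDs in `smartctl -A` output."""
    found = {}
    for line in smartctl_text.splitlines():
        parts = line.split()
        # Attribute rows have at least 10 columns and start with the numeric ID.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        if int(parts[0]) in WATCHED_IDS and parts[9].isdigit():
            found[parts[1]] = int(parts[9])
    return found


def smart_warnings(smartctl_text, threshold=0):
    """Return the names of watched attributes whose raw value exceeds threshold."""
    values = watched_raw_values(smartctl_text)
    return sorted(name for name, raw in values.items() if raw > threshold)


# Hypothetical excerpt of `smartctl -A /dev/sda` output.
SMART_SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4321
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    print(smart_warnings(SMART_SAMPLE))
```

Tracking whether the raw value is climbing between runs is more informative than a single reading, but that needs stored history and is left out here.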

SMART is the early warning radar and ZFS checksums are the impact confirmation.

There's another category: memory errors. ZFS trusts the system RAM. If you have a bad RAM stick that's silently corrupting data in flight, ZFS will checksum the corrupted data and write it to disk, and it will think everything is fine because the checksum matches the corrupted data. This is why serious ZFS deployments use ECC RAM. Without ECC, your checksums are only as trustworthy as your memory.

That's a whole other rabbit hole and probably beyond the scope of this prompt, but it's worth flagging. The chain of trust goes deeper than just the disks.

It does, and it's one of those things where the more you learn, the more paranoid you get. But for the specific workflow Daniel is describing, the pre-backup check is mostly about catching disk-level corruption. And the most reliable way to do that is zpool status plus a recent scrub, supplemented by SMART monitoring for early warning.

Let's get to the part that made my fur stand up. Once degradation is detected and you're not running RAID, should the system be made immediately cold?

This is where the answer gets nuanced and a little uncomfortable. The short answer is: it depends on what's degraded and what you're trying to protect. But the instinct to go cold immediately is not wrong.

Walk me through the scenarios.

Scenario one: you have a single-disk pool, no redundancy, and zpool status shows checksum errors. That means some data is already corrupted. You don't know how much. Every additional read or write risks making it worse because the drive is failing. In that scenario, yes, you want to stop all activity on that pool immediately. Ideally you'd export the pool, physically disconnect the drive, and then figure out your recovery strategy on a copy of the disk image. The clock is ticking on that drive and every spin is borrowed time.

You make it cold, you image the drive with something like ddrescue, and you work from the image.

Ddrescue is the tool for this. It reads the disk block by block, skips unreadable sectors, and can retry failed areas later. You get as much data as the drive will physically give you, and then you try to salvage what you can from the image. It's slow and painful and you will lose something, but you'll lose less than if you kept the pool online and the drive ate itself completely.
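The go-cold-and-image sequence might be laid out like this. It is only a sketch: the pool name, device path, and file names are placeholders, and the two-pass ddrescue pattern (a fast pass with -n, then retries with -d -r3 against the same mapfile) is one common approach rather than the only one. The image and mapfile should live on a different, healthy disk.

```python
import shlex


def cold_imaging_plan(pool, device, image, mapfile):
    """Return the commands, in order, for exporting a pool and imaging its disk."""
    return [
        # Stop all ZFS activity on the pool before touching the raw device.
        ["zpool", "export", pool],
        # First pass: copy everything that reads cleanly, skip the slow scraping phase.
        ["ddrescue", "-n", device, image, mapfile],
        # Second pass: direct disk access, retry the bad areas up to three times.
        ["ddrescue", "-d", "-r3", device, image, mapfile],
    ]


if __name__ == "__main__":
    # Placeholder names; print the plan for review instead of running it.
    for cmd in cold_imaging_plan("tank", "/dev/sdX", "/mnt/rescue/tank.img", "/mnt/rescue/tank.map"):
        print(shlex.join(cmd))
```

Printing the commands rather than executing them is deliberate: on a failing disk you want a human to confirm the device path before anything runs.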

Scenario two: you have a mirror or a RAIDZ, and only one disk is showing errors. In that case, you don't go cold. ZFS will heal the data from the healthy copy automatically during a scrub or even during normal reads. You identify the failing disk, you replace it, and the pool resilvers. The system stays online the whole time. This is the entire point of redundancy. Degradation with redundancy is an inconvenience. Degradation without redundancy is a disaster.

Which is why the prompt specifying "if you're not running RAID" changes the answer from "relax, replace the disk" to "shut it down now."

I think it's worth saying explicitly: if you care enough about your data to be running ZFS and doing full disk backups, you should probably be running at least a mirror. The cost of a second disk is trivial compared to the cost of data loss. A mirror turns a catastrophic failure into a maintenance task.

There's a philosophical question here about the backup strategy itself. The prompt describes full disk backups as an "ultimate fail-safe" layered on top of incremental backups. And I think that's correct, but it's also worth asking what exactly the full disk backup is protecting against that the incrementals don't cover.

Incremental backups protect against file-level issues: accidental deletion, ransomware, application-level corruption, user error. They're granular and space-efficient. A full disk backup, a block-level image, protects against filesystem-level corruption, bootloader issues, partition table disasters, operating system rot. If your ZFS pool itself becomes unmountable, your file-level incrementals can give you files back, but they can't rebuild the pool, the boot setup, or anything they didn't capture. You need a block-level image to recover the pool or to extract data with forensic tools.

The full disk backup is the parachute, and the incrementals are the safety net below the parachute. Different failure modes, different tools.

The nightmare scenario is the one Daniel hit: the parachute has a hole in it and you don't find out until you pull the cord. That's why the pre-backup health check is so critical for full disk backups specifically. With incremental file backups, if a file is corrupted, you might lose that file. With a corrupted full disk backup, you might lose everything.

Let's talk about what a complete, production-grade pre-backup health check script actually looks like. I think people hearing this want something concrete they can adapt.

At minimum, you want three checks. First, check the pool status. If zpool status shows any pool as anything other than ONLINE with zero errors, abort. Second, check when the last scrub completed. If the last scrub is older than your threshold, say thirty days, abort or at least warn. A stale scrub is almost as bad as no scrub. Third, check SMART on all disks. If any disk has a high reallocated sector count or pending sectors, that's a warning. You might still run the backup, but you definitely want to be notified.

The script should have a clear exit code. Zero means go, anything else means stop. That way the backup tool can just check the exit code.

And the backup tool itself should be configured to not run if the pre-check script returns non-zero. That's the gating function. Most backup schedulers, whether it's cron or systemd timers or something more sophisticated, can be configured this way. The pre-check script runs first, and if it fails, the backup job never starts.
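The gating logic can be sketched in a few lines. This example assumes you already have the pool state, a list of devices with errors, and a fresh-scrub flag from your own checks; the function names and the reason strings are invented for illustration.

```python
import subprocess
import sys


def gate(pool_state, error_rows, scrub_fresh):
    """Return (exit_code, reasons); exit code 0 means the backup may run."""
    reasons = []
    if pool_state != "ONLINE":
        reasons.append(f"pool state is {pool_state}")
    if error_rows:
        reasons.append(f"{len(error_rows)} device(s) report errors")
    if not scrub_fresh:
        reasons.append("no recent clean scrub")
    return (1 if reasons else 0), reasons


def run_gated(exit_code, backup_cmd):
    """Run backup_cmd only when the gate passed; return the final exit code."""
    if exit_code != 0:
        # The backup never starts; the caller sees the gate's exit code.
        return exit_code
    return subprocess.run(backup_cmd).returncode


if __name__ == "__main__":
    code, reasons = gate("DEGRADED", [("ada2", "DEGRADED", "0", "0", "3")], False)
    print(code, reasons)
    sys.exit(code)
```

SMART findings are left out of the gate on purpose, since the discussion treats them as a warning rather than a reason to abort.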

What about the notification side? You mentioned zed earlier, but for a pre-backup check script, you'd probably want explicit notifications within the script itself.

Right, because zed is reactive. It fires when ZFS detects an event. But your pre-backup script is proactive. It's checking before the backup, and if it finds something, it needs to tell you immediately. The simplest approach is a shell script that uses curl to hit a webhook. Pushover has a simple API. You can send a notification with a single curl command. Here's where I'd say: if you're going to the trouble of building this, invest the fifteen minutes to set up a dedicated notification channel. Don't just send an email to your overflowing inbox. Make it obnoxious.
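For reference, the same Pushover call can be made from Python's standard library instead of curl. This is a sketch under assumptions: the token and user key are placeholders you would get from your own Pushover account, and the retry and expire values are arbitrary picks within what I understand Pushover accepts for emergency-priority messages.

```python
import urllib.parse
import urllib.request

PUSHOVER_URL = "https://api.pushover.net/1/messages.json"


def build_payload(token, user, title, message, emergency=False):
    """Build the form fields for a Pushover message."""
    fields = {"token": token, "user": user, "title": title, "message": message}
    if emergency:
        # Priority 2 repeats until acknowledged; retry and expire are then required.
        fields.update({"priority": "2", "retry": "60", "expire": "3600"})
    return fields


def send(fields):
    """POST the fields to Pushover and return the HTTP status code."""
    data = urllib.parse.urlencode(fields).encode()
    with urllib.request.urlopen(PUSHOVER_URL, data=data, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    # Placeholder credentials: this only prints the payload, it does not send.
    print(build_payload("APP_TOKEN", "USER_KEY", "ZFS alert", "pool tank: 3 checksum errors on ada2", emergency=True))
```

Keeping payload construction separate from the network call makes it possible to test the message content without sending anything.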

The notification needs to have the same urgency as the problem. A degraded pool without redundancy is a "wake you up at three in the morning" problem. The alert should match that.

This connects to a broader principle in system administration that I think is underappreciated: alert fatigue is real, but so is alert invisibility. If all your alerts go to the same place with the same priority, you'll either ignore all of them or be overwhelmed by all of them. You need tiers. Pool degradation without redundancy is the highest tier. It should bypass your normal notification filters.

Like the nuclear football of server alerts.

It should feel different from "your disk is eighty percent full" or "a new package update is available." It should be the equivalent of a fire alarm, not a calendar reminder.

Let's pivot slightly. The prompt is really about smart disaster mitigation, and we've talked about detection and alerting. What about the mitigation itself? Once you know the pool is degraded and you've made the decision to go cold or not, what's the actual recovery playbook?

If you have redundancy, the playbook is well-documented. Identify the failing disk with zpool status, which will show you which device has errors. Physically replace it. Run zpool replace. Wait for resilver. Verify the pool is healthy. That's it. The whole process is designed to be boring.
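Written out as commands, that playbook is short. The sketch below only builds the command list and interprets the output of zpool status -x; the pool and device names are placeholders, and the "is healthy" strings are what zpool status -x prints on the systems I have seen, so treat them as an assumption to verify.

```python
def replacement_plan(pool, failing_device, new_device):
    """Return the zpool commands for swapping a failing disk, in order."""
    return [
        # Identify the device that is accumulating errors.
        ["zpool", "status", pool],
        # Start the resilver onto the new disk.
        ["zpool", "replace", pool, failing_device, new_device],
        # Afterwards, -x prints a one-line summary when nothing is wrong.
        ["zpool", "status", "-x", pool],
    ]


def reports_healthy(status_x_output):
    """True if `zpool status -x` output is the short all-clear message."""
    text = status_x_output.strip()
    return text == "all pools are healthy" or text.endswith("is healthy")
```

While a resilver is still running, zpool status -x prints the full status block instead of the one-liner, so reports_healthy stays False until the pool is clean.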

If you don't have redundancy?

Then the playbook depends on what you're trying to save. If the pool is still mountable, your first priority is to copy everything you can to a known-good destination. Not a backup, just a raw copy. rsync, zfs send to an external drive, whatever you have. Get the data off the degraded pool onto something healthy. Then you destroy the pool, replace the failing hardware, and recreate it from your backups or from the copy you just made.

If the pool isn't mountable?

That's when you reach for ddrescue and forensics tools. You image the disk, you try to import the pool from the image, and you salvage what you can. This is a terrible place to be and the recovery rate is never one hundred percent. Which is why the entire point of this conversation is to never get to this point. The pre-backup health check, the notifications, the SMART monitoring, it's all designed to catch the problem before the pool becomes unmountable.

The ounce of prevention that's worth several terabytes of cure.

I want to circle back to something the prompt mentioned that we haven't fully addressed. Daniel said he'd design a custom backup solution from scratch with this workflow. The question is whether any existing tools bake it in. I think the honest answer is no, not as a fully integrated solution. But I also think that's not a huge problem because the pieces are all there and they're not that hard to assemble.

It's a Lego situation. The bricks exist, but nobody's selling the pre-assembled castle.

For the kind of person who's running ZFS and doing block-level backups, assembling those bricks is probably within their skill set. The bigger risk isn't the technical difficulty, it's the failure of imagination. It's not realizing that the pre-backup health check is necessary until you've already lost data.

Which is exactly what happened here. The prompt is basically a post-mortem of a near-miss or a direct hit, and the question is, how do I make sure this never happens again.

The answer, to summarize the actionable part, is: one, schedule regular scrubs and check the results. Two, write a pre-backup script that gates on zpool status and recent scrub completion. Three, configure notifications that you will actually notice, with appropriate urgency. Four, monitor SMART for early warning. Five, if you're not running redundancy, treat degradation as an emergency and go cold while you still can. Six, seriously consider whether the cost of a second disk for a mirror is less than the cost of losing the data. The answer is almost always yes.

That sixth point feels like the real takeaway. A lot of people run single-disk ZFS pools because they think ZFS's checksumming alone protects them. It doesn't. It tells you the data is corrupted, but without redundancy it can't fix it. It's a detection system, not a repair system.

That's the single biggest misconception about ZFS. Checksums without redundancy are a canary in a coal mine. They tell you something is wrong, but they don't save you. You need a mirror or RAIDZ for that. And if you can't afford redundancy, you need to be even more vigilant about backups and health checks, because your margin for error is zero.

The canary is very helpful, but it's not going to fly down and rescue you.

It's just going to stop singing, and then you're on your own.

Let's talk about one more angle that the prompt touches on implicitly. The notification channel question. Daniel mentioned PagerDuty, but I think there's a middle ground between "email that goes to spam" and "enterprise incident management platform." What about something like Healthchecks.io?

Healthchecks.io is a great suggestion. It's a dead man's switch for cron jobs and scheduled tasks. You configure your backup script to ping Healthchecks.io when it starts and when it succeeds. If Healthchecks.io doesn't receive the expected ping within the configured window, it alerts you. It supports email, Slack, Discord, Telegram, Pushover, a whole bunch of channels. The free tier covers a handful of checks, which is plenty for a home server.

You could have the pre-backup health check ping a Healthchecks.io endpoint, and if the check fails and the backup never runs, Healthchecks.io screams at you because the expected success ping never arrived.

And it's dead simple to set up. A single curl command in your script. The nice thing about this approach is it covers two failure modes: the health check found a problem, and the backup itself failed or never ran. Either way, you get alerted.
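Here is roughly what that wiring looks like in Python rather than curl. The check UUID is a placeholder you would copy from your own Healthchecks.io project, and the hc-ping.com URL scheme with /start and /fail suffixes is how the hosted service works as far as I know; a self-hosted instance would use a different base URL.

```python
import urllib.request

PING_BASE = "https://hc-ping.com"


def ping_url(check_uuid, signal=""):
    """Return the ping URL; signal is "", "start", or "fail"."""
    if signal not in ("", "start", "fail"):
        raise ValueError(f"unsupported signal: {signal!r}")
    suffix = f"/{signal}" if signal else ""
    return f"{PING_BASE}/{check_uuid}{suffix}"


def http_ping(url):
    """Send a GET to the ping URL; network errors are swallowed on purpose."""
    try:
        urllib.request.urlopen(url, timeout=10).close()
    except OSError:
        pass  # a missed success ping is itself what triggers the alert


def run_with_pings(check_uuid, job, ping=http_ping):
    """Ping start, run job(), then ping success or fail; return job's result."""
    ping(ping_url(check_uuid, "start"))
    ok = job()
    ping(ping_url(check_uuid, "" if ok else "fail"))
    return ok
```

Here job would be a function that runs the health check and the backup and returns True only if both succeed; the ping argument exists so the flow can be exercised without touching the network.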

That's elegant. The silence is the alarm.

It's the kind of thing that takes five minutes to set up and might save you from discovering a problem three months after it started. Which, reading between the lines of the prompt, sounds like roughly what happened.

The prompt has the energy of someone who just had a very bad weekend and is now redesigning their entire approach to data integrity at three in the morning.

Those are usually the best system designs, honestly. Adversity-driven architecture.

The school of hard knocks has an excellent job placement rate.

It really does. And I think one thing that's implicit in all of this is that backup integrity is a process, not a product. You can buy backup software, you can buy redundant hardware, but if you don't have the monitoring and alerting and regular verification, you're just hoping. And hope is not a strategy for data integrity.

That's almost a T-shirt slogan. Hope is not a backup strategy.

It's not. And the specific insight from this prompt, the thing I think is worth really underlining, is that full disk backups have a unique vulnerability. With file-level incrementals, if one file is corrupted in the backup, you lose that file. With a block-level image, corruption at the source means the entire image is suspect. The blast radius is total.

Which is why the pre-backup health check isn't a nice-to-have. It's the gate that prevents you from confidently backing up garbage for months and only discovering it when you need to restore.

The restore is the worst possible time to discover your backup is bad. That's the moment when you have no other options. The whole point of a backup is that it's there when everything else has failed. If the backup itself is broken, you've failed at the one job backups have.

The workflow Daniel described is basically: scrub regularly, check pool health before backup, alert aggressively if something's wrong, and if you're running without redundancy, treat degradation as an all-hands emergency. That's the blueprint.

The tools exist to do all of this. zpool status, zpool scrub, zed, SMART monitoring via smartctl, Pushover or Healthchecks.io for notifications, and a small shell script to tie it together. It's not a polished commercial product but it's a weekend project for someone comfortable with the command line.

Which I think describes the target audience for this pretty well. If you're running ZFS and doing block-level backups, you are already comfortable with the command line. The missing piece isn't technical skill, it's awareness of the failure mode.

Hopefully this conversation fills that gap for some people. The failure mode is real, it's silent, and it will ruin your day if you don't catch it early.

One last thing before we wrap. The prompt mentions making the system cold if degradation is detected without RAID. I want to emphasize that "cold" means different things in different contexts. For a home server, it might mean exporting the pool and shutting down the machine. For a production system, it might mean failing over to a standby and isolating the degraded node. The principle is the same: stop the bleeding, then figure out the treatment.

The treatment, in almost every case, starts with getting a copy of whatever data you can onto known-good media. Whether that's a zfs send to an external drive, an rsync to another machine, or a ddrescue image of the raw disk. The failing hardware is a ticking clock. You copy first, analyze later.

That's the kind of clear, actionable advice that comes from having been through it or from knowing someone who has. The prompt is basically a cautionary tale with a technical question attached, and the answer is: yes, you can build this, and yes, you absolutely should.

If you're listening to this and realizing your backup pipeline doesn't have a pre-flight health check, this is your sign to add one. Before the next backup runs.

Before the next scrub completes and you forget to look at the results.

Before the canary stops singing.

Now: Hilbert's daily fun fact.

Hilbert: In nineteen forty-three, a geometry textbook printed in Sapporo contained a misprinted diagram of a pentagonal tiling that, by shifting one vertex by three millimeters on the page, accidentally described a previously unknown aperiodic tiling pattern. The error was not discovered until a Hokkaido University librarian noticed the anomaly in nineteen eighty-eight while digitizing the collection.

Three millimeters on a printed page in nineteen forty-three. The entire difference between a known tiling and a discovery, just sitting in a textbook for forty-five years.

I love that the librarian noticed. Forty-five years of students probably looked at that diagram and thought, huh, that looks a bit off, and moved on.

This has been My Weird Prompts, with me, Herman Poppleberry.

Produced by Hilbert Flumingtop. If you've got a data integrity war story or a question you want us to dig into, find us at myweirdprompts.

If this episode saved you from a bad weekend, leave us a review wherever you listen. It helps more than you'd think.

Until next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.
