Why We Check Our Own Backups (And Why You Should Doubt Yours)

We had a freshness-monitor bug that made backups look healthy when they weren't running. We caught it. Here's the story, and why automated 'backups OK' green ticks deserve healthy scepticism.

The single most dangerous monitoring state isn't red. It's green-when-it-should-be-red. A red light gets fixed. A wrongly-green light gets ignored — until you need the thing it was supposed to be monitoring.

We hit one of those last week. The backups themselves were fine. The thing that checked whether the backups were running had a subtle bug. Here's what happened.

The check

Our backup loop on each fleet node touches a freshness marker file at the end of every cycle. Our diagnostic tool, flame doctor, reads the marker's modification time and warns if it's older than 36 hours, fails if it's older than 7 days. Standard pattern. Works on every Linux box on the planet.

We rolled out an updated version of the backup loop a few days ago. The update added the freshness marker (it didn't exist before). We deployed the new script, started the backup cycle, walked away.

A few days later, our Discord started complaining: "no fresh backup marker on spark or litespeed in 5 days."

The bug, before we explain it

Try to guess the bug. The script was deployed. The cron schedule was correct. The backups were running on schedule. The marker function was in the script. Nothing was misconfigured.

The bug was: busybox sh doesn't reload function bodies when the script file changes.

The backup loop is a long-running shell process. It started up before our update with the old script in memory. We then edited the script on disk to add the marker function. The next time the running loop iterated, it executed its own cached version of the loop body — which still didn't have the marker function in it. So no marker got written. So flame-doctor saw stale timestamps. So it correctly complained.

The fix was a one-liner: kill the running process. Our supervisor restarted it from the current script. Markers started appearing within minutes.

The lesson

The lesson isn't "busybox is annoying" (although it is). The lesson is that monitoring code can decay independently of the thing it monitors. A green tick from a check you wrote yesterday doesn't mean the system is healthy — it means the check thinks the system is healthy. Those aren't the same.

Our actual backups were fine. They were running. The data was uploaded to Backblaze on schedule. The check that proved this was wrong was wrong by silence — it stopped reporting "yes" because the writer had a bug, not because the writer caught a real failure.

We caught it because Discord noticed the silence. If our alerting had been "ping me when something breaks" instead of "ping me when something stops reporting", we would have missed it indefinitely.

What this changes about how we do checks

Service supervisors should watch script mtimes. If /progs/flame/bin/some-loop.sh changes on disk, the supervisor should restart the long-running process so it picks up the new code. We had been treating shell scripts the same way as compiled binaries, which only matter at restart. They aren't. Interpretation behaviour differs, and it bit us.

Freshness checks beat success checks. "I saw a 'completed' line yesterday" can be wrong if the writer is broken. "The marker file is fresh" requires both that the writer ran and that it ran the current code. The check has to depend on a side-effect that proves the writer is alive now, not just that it once was.

What you should ask your hosting company

A hard question for any hosting company: when did you last actually restore a customer backup, end to end, without the customer asking? "We back up daily" is table stakes. "We periodically restore, by ourselves, to verify the chain works" is a different posture.

Our restore path is exercised. We've used it in anger more than once this year — on customer-requested restores, on a recovery-scheme verification round-trip, and on at least one self-test where we wanted to be sure the encryption-decryption was lossless. If a hosting provider can't tell you when they last restored their own data, treat that as data.

Backups are easy to set up and easy to assume work. The expensive question is whether the check that says they work is itself working. We caught ours. The general lesson is wider: green ticks lie if no one watches the watcher.