Every hosting company has a worst-day scenario. Ours is: the primary server is gone, the backups are encrypted, and the key that decrypts those backups was on the primary server. That's the pit you can't climb out of.
We've engineered ours away. Here's how.
The vault
There's a single master vault on our primary server — a small encrypted file holding the secrets that ultimately decrypt customer backups, sign manifests, and authenticate the fleet. Lose this vault and we lose everything; expose it and a single attacker steals everything. Both endings are unacceptable.
So we never store the unlock key in one place.
Shamir secret sharing
The master key is split into three shares using Shamir's secret sharing over GF(256). Any two of the three shares reconstruct the master; one share alone is mathematically useless. Not "hard to brute-force" — actually useless, the way a single point doesn't define a line.
The three shares are distributed deliberately:
- Share 1 lives on our secondary fleet node, sealed with that node's public key
- Share 2 lives on our tertiary fleet node, sealed with that node's public key
- Share 3 is printed once during setup and goes offline — paper, password manager, or both
The "sealed" part matters. We use NaCl sealed-box encryption so each fleet share can only be opened by the destination node's private key. Even if an attacker grabs the bytes off a fleet node, they get a 2-of-3 share they can't decrypt.
What this protects against
Primary server destroyed: rebuild a new primary, pull any two shares (one fleet share plus the offline Share 3, or the two fleet shares), reconstruct the master, decrypt the vault, restore.
One fleet node compromised: the attacker gets a sealed share they can't open and which alone isn't enough anyway. We rotate the master and re-split regardless.
Operator's offline copy lost: the two fleet shares still combine to reconstruct. We re-issue Share 3 and rotate.
All three locations compromised at once: that's the threat we accept can't be engineered around. Three independent compromises within the same window is a different category of problem from any one of them.
What it doesn't protect against
Operator error during reconstruction. Shamir's math will happily combine any two byte strings into something, so a wrong input doesn't fail at the interpolation step. What makes the mistake noisy is the vault's authenticated encryption: if we paste the wrong base64 string or use the wrong fleet share, the reconstructed key fails the vault's authentication check and decryption aborts, rather than silently producing garbage and corrupting the vault.
It also doesn't protect against future cryptanalytic breaks of XChaCha20-Poly1305 or Curve25519. We pick conservative algorithms and rotate the master periodically; if the world changes, we re-key.
Why we built it ourselves
We use NaCl primitives (well-reviewed, conservative, no surprises) but the Shamir layer and vault format are our own code, ~720 lines of Go. Two reasons we didn't import a library:
- Auditability. A small in-tree implementation is something we can fully read. We did.
- No runtime dependencies. The recovery binary is statically linked Go. If our distro is broken, the fleet network is down, and we're rebuilding from a fresh Alpine image at 3am, the recovery binary still runs.
What it looks like in practice
The whole thing is one CLI tool: flame recovery init to create the scheme, flame recovery verify to sanity-check it, flame recovery reconstruct to actually rebuild the master from shares. The init command prints the offline material exactly once and refuses to print it again — there's no "show me Share 3" command, by design.
We've been carrying this design for a while. As of this week it's deployed on the live fleet.
The honest version of "we keep your data safe" is that we engineer for our own worst day, not just normal operations. A backup is only as good as the key that decrypts it. We built the recovery layer for the keys themselves.