You have backup nodes. You think you have disaster recovery. You do not know unless you tested it last week.
Disaster recovery is not a configuration file. It is a plan you execute under pressure, when your primary infrastructure is dead, and customers are watching money disappear.
Blockchain nodes fail in specific ways. A Geth node runs out of disk space and stops syncing. Your fallback routes to a secondary node in the same data center. That node is also out of disk space because you did not notice the disk usage metric. Both fail together and payment infrastructure goes dark.
A Solana validator loses network connectivity to the cluster. It cannot validate blocks and cannot submit transactions. You have a backup validator in a different region. Failover routes traffic there. The backup validator is 5 slots behind and needs 10 seconds to catch up. During those 10 seconds, transactions submitted to the backup sit in its local queue and never land onchain.
A bridge node crashes and corrupts the attestation database. Your backup bridge is running hot-standby mode and replicated the corruption immediately. Both bridges are now suspect. Settlement stalls while you decide which copy of truth is correct.
Planning for disaster means understanding failure modes specific to your setup. Different chains fail differently. Your first step is to catalog failure modes for each chain you run. What breaks a validator on this chain? What breaks a bridge? How fast can you detect it?
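The catalog does not need to be elaborate. Here is a sketch of what it can look like, written as a Go structure; the components, failure modes, and detection windows are illustrative examples, not a complete list.

```go
package main

import "fmt"

// FailureMode records one way a component can break and how you expect to notice.
type FailureMode struct {
	Component string // validator, RPC node, bridge
	Failure   string // what breaks
	Detection string // which metric or check surfaces it
	Window    string // how fast you expect to detect it
}

func main() {
	// Illustrative entries only; build yours from the chains you actually run.
	catalog := map[string][]FailureMode{
		"ethereum": {
			{"geth RPC node", "disk full, sync stalls", "disk usage + block height lag", "< 5 min"},
			{"geth RPC node", "peer count drops to zero", "peer count metric", "< 5 min"},
		},
		"solana": {
			{"validator", "loses connectivity to the cluster", "slot lag vs cluster", "< 1 min"},
		},
		"bridge": {
			{"attestation service", "corrupted local database", "integrity check on startup", "< 10 min"},
		},
	}
	for chain, modes := range catalog {
		for _, m := range modes {
			fmt.Printf("%s / %s: %s (detect via %s within %s)\n",
				chain, m.Component, m.Failure, m.Detection, m.Window)
		}
	}
}
```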
Geographic Redundancy and RPO/RTO
RPO, the recovery point objective, is how much data you are willing to lose. RTO, the recovery time objective, is how long you are willing to be down.
A payment system has harsh RPO requirements. You cannot lose transaction data. If a customer paid you $100,000 and you lost the record, that is indistinguishable from fraud to the customer, even though the blockchain confirms the payment. So your RPO is measured in seconds, not hours.
RTO is how fast you switch to the backup. For consumer payments, RTO might be 30 seconds. For B2B settlement, RTO might be 5 seconds. For high-frequency trading, RTO might be sub-100ms.
Set your RPO and RTO targets before you design the system. Do not design first and hope the numbers work out.
For primary infrastructure, your RPO should be as close to zero as possible. Use synchronous replication if you can afford the latency cost. Most teams use async replication with 1-5 second lag and accept that much potential loss if the primary crashes before data ships to the secondary.
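That lag is something you can watch continuously against your RPO budget. A sketch of how, assuming the primary and secondary each expose the timestamp of their last replicated write at a hypothetical /last-write endpoint; substitute whatever your replication layer actually reports.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// lastWrite fetches the timestamp of the most recent replicated record.
// The /last-write endpoint is hypothetical; adapt it to your replication layer.
func lastWrite(baseURL string) (time.Time, error) {
	resp, err := http.Get(baseURL + "/last-write")
	if err != nil {
		return time.Time{}, err
	}
	defer resp.Body.Close()
	var body struct {
		Timestamp time.Time `json:"timestamp"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return time.Time{}, err
	}
	return body.Timestamp, nil
}

func main() {
	const rpoBudget = 5 * time.Second // your stated RPO target

	primary, err := lastWrite("http://primary.internal:8080")
	if err != nil {
		fmt.Println("primary unreachable:", err)
		return
	}
	secondary, err := lastWrite("http://secondary.internal:8080")
	if err != nil {
		fmt.Println("secondary unreachable:", err)
		return
	}

	lag := primary.Sub(secondary)
	if lag > rpoBudget {
		fmt.Printf("ALERT: replication lag %v exceeds RPO budget %v\n", lag, rpoBudget)
	} else {
		fmt.Printf("replication lag %v within RPO budget %v\n", lag, rpoBudget)
	}
}
```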
For RTO, measure against the metric that matters to customers. If customers care about settlement time, RTO means "time from primary failure to transaction settling on secondary." If they care about connectivity, RTO means "time from primary failure to first successful request to secondary." These are different. The latter might be 1 second, the former might be 30 seconds.
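During a drill, keep the two clocks separate: one stops at the first successful request against the secondary, the other only when a test payment actually settles. A sketch, with placeholder endpoints and a stand-in settlement check:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitUntil polls check every 500ms and returns how long it took to succeed.
func waitUntil(start time.Time, check func() bool) time.Duration {
	for !check() {
		time.Sleep(500 * time.Millisecond)
	}
	return time.Since(start)
}

func main() {
	secondary := "http://secondary.internal:8545" // hypothetical secondary endpoint
	failoverStart := time.Now()                   // stamp this when you kill the primary

	// Connectivity RTO: first successful request against the secondary.
	connectivityRTO := waitUntil(failoverStart, func() bool {
		resp, err := http.Get(secondary + "/health")
		if err != nil {
			return false
		}
		resp.Body.Close()
		return resp.StatusCode == http.StatusOK
	})

	// Settlement RTO: a test transaction submitted at failover time is confirmed.
	settlementRTO := waitUntil(failoverStart, func() bool {
		return testPaymentSettled() // placeholder: query chain state or your ledger
	})

	fmt.Printf("connectivity RTO: %v, settlement RTO: %v\n", connectivityRTO, settlementRTO)
}

// testPaymentSettled is a stand-in for checking that the drill transaction
// submitted at failover time has actually settled.
func testPaymentSettled() bool { return true }
```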
Running nodes in a single region is gambling. A datacenter outage, a regional network partition, or a software update gone wrong can take you offline completely. Multi-region is the minimum viable setup.
Run your primary node in one region. Run your secondary in another region at least 100km away (to avoid shared infrastructure, power grids, and fiber cuts). Configure your routing layer to health-check both nodes. If the primary fails, route to the secondary.
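A minimal version of that routing layer, sketched with Go's standard reverse proxy: health-check the primary on a timer, send traffic to it while it answers, and fall back to the secondary when it does not. The node URLs and the /health endpoint are assumptions for illustration.

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
	"time"
)

var useSecondary atomic.Bool // flipped by the health checker

// healthy reports whether the node behind base answers its health endpoint.
func healthy(base string) bool {
	client := http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(base + "/health") // assumed health endpoint
	if err != nil {
		return false
	}
	resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	primary, _ := url.Parse("http://primary.eu-west:8545")      // region 1
	secondary, _ := url.Parse("http://secondary.eu-north:8545") // region 2, far enough away

	// Re-check the primary every 5 seconds and flip routing accordingly.
	go func() {
		for range time.Tick(5 * time.Second) {
			useSecondary.Store(!healthy(primary.String()))
		}
	}()

	proxy := &httputil.ReverseProxy{
		Rewrite: func(r *httputil.ProxyRequest) {
			target := primary
			if useSecondary.Load() {
				target = secondary
			}
			r.SetURL(target)
		},
	}
	http.ListenAndServe(":8080", proxy)
}
```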
This is not free. A Geth node synced to current state consumes 1-2TB of storage and 20-50GB of bandwidth per day. Running it in two regions is expensive. Running it in three regions is very expensive. So be selective. Prioritize by value.
For bridges and routers, you can afford more redundancy because they are lighter weight. Run bridges in three regions if possible.
Keep your secondary node's state current. Use snapshot sync to reduce catchup time. For high-value infrastructure, run the secondary in hot-standby mode, where it processes blocks in real-time but does not produce its own blocks. When failover happens, the secondary is already current.
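Whether the standby is actually current is something you can check continuously. A sketch assuming both nodes speak Ethereum JSON-RPC: compare eth_blockNumber on primary and standby and flag lag beyond a threshold. For other chains, swap in the equivalent height or slot query.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

// blockNumber calls eth_blockNumber on an Ethereum JSON-RPC endpoint.
func blockNumber(rpcURL string) (uint64, error) {
	payload := []byte(`{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}`)
	resp, err := http.Post(rpcURL, "application/json", bytes.NewReader(payload))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	var out struct {
		Result string `json:"result"` // hex-encoded block number, e.g. "0x10d4f"
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimPrefix(out.Result, "0x"), 16, 64)
}

func main() {
	const maxLag = 5 // blocks of acceptable standby lag

	primary, err := blockNumber("http://primary.internal:8545")
	if err != nil {
		fmt.Println("primary unreachable:", err)
		return
	}
	standby, err := blockNumber("http://standby.internal:8545")
	if err != nil {
		fmt.Println("standby unreachable:", err)
		return
	}
	if primary > standby && primary-standby > maxLag {
		fmt.Printf("ALERT: standby is %d blocks behind\n", primary-standby)
	} else {
		fmt.Printf("standby lag: %d blocks\n", int64(primary)-int64(standby))
	}
}
```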
Testing Recovery
Disaster recovery only works if you have tested it under failure conditions. Not simulated failure. Actual failure.
Schedule a chaos test. On a Tuesday morning, shut down your primary node. Measure how long until transactions complete on the secondary. Measure how many requests fail during transition. Measure how the system behaves.
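One way to get those numbers out of the drill: run a steady stream of requests through your normal routing layer while someone shuts the primary down, and record how many requests fail and how long the outage window lasts. A sketch with a placeholder endpoint:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	const (
		endpoint = "http://router.internal:8080/health" // goes through your routing layer
		interval = 250 * time.Millisecond
		duration = 5 * time.Minute // drill window; kill the primary somewhere in the middle
	)

	var total, failed int
	var firstFail, lastFail time.Time

	deadline := time.Now().Add(duration)
	for time.Now().Before(deadline) {
		total++
		client := http.Client{Timeout: 2 * time.Second}
		resp, err := client.Get(endpoint)
		if err != nil || resp.StatusCode != http.StatusOK {
			failed++
			if firstFail.IsZero() {
				firstFail = time.Now()
			}
			lastFail = time.Now()
		}
		if resp != nil {
			resp.Body.Close()
		}
		time.Sleep(interval)
	}

	fmt.Printf("requests: %d, failed: %d\n", total, failed)
	if !firstFail.IsZero() {
		fmt.Printf("outage window: %v (first failure at %s)\n",
			lastFail.Sub(firstFail), firstFail.Format(time.RFC3339))
	}
}
```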
Run chaos tests regularly. Monthly is reasonable for critical infrastructure. Quarterly minimum.
Chaos tests surface issues that static reviews miss. Your secondary node has a bug in a rarely-taken code path that only manifests under high load during failover. Your monitoring alert has a typo and never fires. Your runbook assumes someone is awake at 3am, but the team is asleep. The chaos test finds these things before a real disaster.
Document what you learn from each chaos test. After three chaos tests, you will have a list of failure modes and fixes. After six, you will understand your system's true failure profile.
Do not run chaos against your full production traffic. Spin up a staging environment that mirrors production, then break it. Or run chaos on a small slice of production traffic (5-10%) so real customers are not impacted.
You want to test failover without actually failing over production. Blue-green deployment keeps the primary running production traffic. Deploy the secondary in a green environment. Run test traffic through green. Verify that green works, then switch. Switch back to primary if green fails.
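At the router, blue-green reduces to an atomic swap of the active target with an instant way back. A sketch, with a hypothetical admin endpoint driving the switch:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

func main() {
	blue, _ := url.Parse("http://blue.internal:8545")   // current production primary
	green, _ := url.Parse("http://green.internal:8545") // candidate secondary

	var active atomic.Pointer[url.URL]
	active.Store(blue)

	proxy := &httputil.ReverseProxy{
		Rewrite: func(r *httputil.ProxyRequest) {
			r.SetURL(active.Load()) // all traffic goes to whichever target is active
		},
	}

	// Hypothetical admin endpoint: flip to green once test traffic looks good,
	// flip back to blue instantly if it does not.
	http.HandleFunc("/admin/switch", func(w http.ResponseWriter, r *http.Request) {
		switch r.URL.Query().Get("target") {
		case "green":
			active.Store(green)
		case "blue":
			active.Store(blue)
		}
		fmt.Fprintf(w, "active: %s\n", active.Load())
	})

	http.Handle("/", proxy)
	http.ListenAndServe(":8080", nil)
}
```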
Another approach is canary failover. Route 1% of production traffic to the secondary for one week. Measure success rate, latency, error rates. If everything looks good, increase to 10%, then 50%, then 100%. This is safer than a hard cutover because you catch issues at small scale.
Both approaches require redundancy at the router layer. You need a router that can distribute traffic intelligently and switch targets on demand.
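A sketch of that kind of router: send a configurable percentage of traffic to the secondary and keep per-target request and error counts so you can compare error rates before turning the dial up. The URLs, the percentage knob, and the admin endpoint are illustrative.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

type target struct {
	url      *url.URL
	requests atomic.Int64
	errors   atomic.Int64
}

func main() {
	primary := &target{}
	primary.url, _ = url.Parse("http://primary.internal:8545")
	canary := &target{}
	canary.url, _ = url.Parse("http://secondary.internal:8545")

	var canaryPercent atomic.Int64
	canaryPercent.Store(1) // start at 1%; raise to 10, 50, 100 as results hold up

	pick := func() *target {
		if rand.Int63n(100) < canaryPercent.Load() {
			return canary
		}
		return primary
	}

	proxy := &httputil.ReverseProxy{
		Rewrite: func(r *httputil.ProxyRequest) {
			t := pick()
			// Stash the chosen target so ModifyResponse can attribute the result.
			r.Out.Header.Set("X-Canary-Target", t.url.Host)
			t.requests.Add(1)
			r.SetURL(t.url)
		},
		ModifyResponse: func(resp *http.Response) error {
			if resp.StatusCode >= 500 {
				if resp.Request.Header.Get("X-Canary-Target") == canary.url.Host {
					canary.errors.Add(1)
				} else {
					primary.errors.Add(1)
				}
			}
			return nil
		},
	}

	// Expose the counters so you can compare error rates per target.
	http.HandleFunc("/admin/stats", func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprintf(w, "primary: %d req, %d errors; canary: %d req, %d errors\n",
			primary.requests.Load(), primary.errors.Load(),
			canary.requests.Load(), canary.errors.Load())
	})
	http.Handle("/", proxy)
	http.ListenAndServe(":8080", nil)
}
```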
Shut down the primary when you are confident. Measure what happens to the system. Measure customer impact. If you cannot put a specific number on customer impact and recovery time, you do not have disaster recovery. You have a backup. Backups are nice. Recovery is different. Recovery is knowing exactly how fast your system comes back, under which conditions, and what trade-offs you accept along the way.