Disaster Recovery in System Design

What is Disaster Recovery?

Disaster Recovery (DR) refers to the set of processes, policies, and technologies used to recover and restore a system after a failure or catastrophic event. The goal is to resume normal operations as quickly and smoothly as possible, minimizing downtime and data loss.

Why is Disaster Recovery Important?

Imagine if a company like Netflix or Amazon lost all its data or went offline for hours. It would impact not only revenue but also customer trust. Disaster Recovery ensures your system is resilient and can bounce back from failures such as:

Natural disasters (floods, earthquakes)
Data center outages
Cyber attacks (ransomware, DDoS)
Human errors (accidental deletion of production data)

Key Metrics in Disaster Recovery

Before designing a DR strategy, two key terms must be understood:

1. Recovery Time Objective (RTO)

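RTO is the maximum acceptable length of time your system can be offline after a failure. For example, if your RTO is 1 hour, service must be restored within 1 hour of the outage starting.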

2. Recovery Point Objective (RPO)

RPO defines the maximum acceptable amount of data loss measured in time. For example, if your RPO is 5 minutes, your system should not lose more than 5 minutes of data during a disaster.
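To make the two metrics concrete, here is a minimal sketch that checks one incident against RTO and RPO targets. The `Incident` class, its field names, and the helper functions are hypothetical names chosen for illustration; the only assumption is that you can record three timestamps for each incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    last_good_copy_at: datetime  # last backup or replicated write that survived
    failed_at: datetime          # moment the system went down
    restored_at: datetime        # moment service was available again


def actual_downtime(incident: Incident) -> timedelta:
    """Time the system was offline; compare this against the RTO."""
    return incident.restored_at - incident.failed_at


def actual_data_loss(incident: Incident) -> timedelta:
    """Window of writes lost in the failure; compare this against the RPO."""
    return incident.failed_at - incident.last_good_copy_at


def meets_objectives(incident: Incident, rto: timedelta, rpo: timedelta) -> bool:
    """True only if both the downtime and the data-loss window are within target."""
    return actual_downtime(incident) <= rto and actual_data_loss(incident) <= rpo
```

For instance, an incident with the last good copy at 11:57, a failure at 12:00, and recovery at 12:20 has 20 minutes of downtime and a 3-minute data-loss window, so it meets an RPO of 5 minutes but misses an RTO of 15 minutes.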

Question: Why can't we always have RTO and RPO of zero?

Answer: Because driving both to zero would require real-time replication, active-active data centers, and instant automated failover, all of which are expensive to build and operate. Most companies have to strike a balance between cost and availability.

Disaster Recovery Strategies

Let’s explore the most common DR strategies with detailed examples:

1. Backup and Restore

This is the simplest and most cost-effective DR strategy. Systems regularly back up data to a secure location. In case of a disaster, the data is restored from the backup.

Example:

A startup backs up its database to Amazon S3 every night. If the database server crashes, the team restores the most recent backup and resumes service, accepting the loss of any data written since that backup (up to 24 hours).

Question: Is this approach suitable for critical financial systems?

Answer: No. Critical systems require more frequent backups or real-time replication. A full-day data loss is unacceptable in banking systems.
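The backup-and-restore cycle described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a SQLite database and a local directory standing in for remote storage such as S3, and the function names (`back_up`, `restore_latest`), the file naming scheme, and the `keep` retention parameter are all hypothetical.

```python
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def back_up(db_path: Path, backup_dir: Path, keep: int = 7) -> Path:
    """Copy the live database to a timestamped file and prune old copies."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    target = backup_dir / f"backup-{stamp}.sqlite3"

    source = sqlite3.connect(db_path)
    dest = sqlite3.connect(target)
    try:
        source.backup(dest)  # SQLite's online backup API
    finally:
        dest.close()
        source.close()

    # Timestamped names sort chronologically, so the oldest files come first.
    backups = sorted(backup_dir.glob("backup-*.sqlite3"))
    for old in backups[:-keep]:
        old.unlink()
    return target


def restore_latest(backup_dir: Path, db_path: Path) -> Path:
    """Rebuild the database from the most recent backup file."""
    backups = sorted(backup_dir.glob("backup-*.sqlite3"))
    if not backups:
        raise FileNotFoundError(f"no backups found in {backup_dir}")
    latest = backups[-1]
    source = sqlite3.connect(latest)
    dest = sqlite3.connect(db_path)
    try:
        source.backup(dest)
    finally:
        dest.close()
        source.close()
    return latest
```

Anything written after the last call to `back_up` is lost on restore, which is why the backup interval sets the worst-case data loss for this strategy.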

2. Pilot Light

In this strategy, a minimal version of your system runs in a secondary region. It keeps essential components like the database ready but doesn't handle traffic until a disaster occurs. When a disaster does occur, the rest of the infrastructure is spun up quickly around that core.

Example:

An e-commerce site maintains a read-only replica of its production database in another AWS region. If the primary region fails, the team quickly launches app servers there and switches DNS to the backup region.

3. Warm Standby

Here, a scaled-down version of the system is always running in the backup location. It can handle some traffic and be scaled up quickly to full capacity.

Example:

A social media platform keeps a warm standby in GCP while operating in AWS. The GCP setup runs at about 30% of production capacity. If AWS fails, traffic shifts to GCP as that environment scales up to full capacity.

Question: What is the trade-off between pilot light and warm standby?

Answer: Warm standby is more expensive but allows faster failover with less setup time. Pilot light is cheaper but takes longer to become fully operational during a disaster.
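The difference in failover time comes down to how much capacity still has to be launched. The sketch below puts rough arithmetic on that idea. The function names are hypothetical, and the batch size and minutes per batch are made-up inputs, not measurements from any cloud provider.

```python
import math


def instances_to_launch(full_capacity: int, running_percent: int) -> int:
    """Instances still missing in the recovery region at failover time."""
    already_running = full_capacity * running_percent // 100
    return full_capacity - already_running


def minutes_to_full_capacity(missing: int, batch_size: int, minutes_per_batch: float) -> float:
    """Time to launch the missing instances in fixed-size batches."""
    return math.ceil(missing / batch_size) * minutes_per_batch


# Hypothetical system that needs 40 app servers at full capacity.
pilot_light_missing = instances_to_launch(full_capacity=40, running_percent=0)
warm_standby_missing = instances_to_launch(full_capacity=40, running_percent=30)
```

Under these assumed numbers, pilot light has to launch all 40 servers while warm standby has to launch 28. A warm standby can also start serving part of the traffic immediately, which this arithmetic does not capture.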

4. Active-Active

In this strategy, multiple data centers are running at full capacity simultaneously and serve real-time traffic. If one fails, others continue serving users without noticeable disruption.

Example:

Google Search operates in multiple data centers worldwide. Users are routed to the nearest one via load balancing. If one region goes offline, traffic reroutes automatically to the others.
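The routing idea behind active-active can be sketched as "send each user to the closest region that is still healthy". This is a simplified illustration, not how any particular provider implements it; the `Region` class, its fields, and `pick_region` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Region:
    name: str
    latency_ms: float    # measured latency from the user to this region
    healthy: bool = True


def pick_region(regions: list[Region]) -> Optional[Region]:
    """Return the lowest-latency healthy region, or None if all are down."""
    healthy = [r for r in regions if r.healthy]
    return min(healthy, key=lambda r: r.latency_ms, default=None)
```

When the nearest region is marked unhealthy, the same call returns the next closest one, so users are rerouted without any manual step.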

Question: Is active-active always the best approach?

Answer: It offers the best availability and fastest recovery, but it’s also the most complex and expensive to build and maintain. Not every system needs such high resilience.
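One way to think about choosing among the four strategies is to pick the cheapest one that still meets your RTO and RPO. The sketch below encodes that rule. Every number in it is an illustrative placeholder chosen only to preserve the ordering described in this article (cost rises as recovery gets faster); none of them are benchmarks, and the names `Strategy` and `cheapest_strategy` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_rto_min: float  # illustrative placeholder, not a benchmark
    typical_rpo_min: float  # illustrative placeholder, not a benchmark
    relative_cost: int      # 1 = cheapest, 4 = most expensive


STRATEGIES = [
    Strategy("backup and restore", typical_rto_min=240, typical_rpo_min=1440, relative_cost=1),
    Strategy("pilot light", typical_rto_min=60, typical_rpo_min=5, relative_cost=2),
    Strategy("warm standby", typical_rto_min=10, typical_rpo_min=1, relative_cost=3),
    Strategy("active-active", typical_rto_min=0, typical_rpo_min=0, relative_cost=4),
]


def cheapest_strategy(rto_min: float, rpo_min: float) -> Optional[Strategy]:
    """Return the lowest-cost strategy whose placeholder RTO and RPO fit the targets."""
    candidates = [
        s for s in STRATEGIES
        if s.typical_rto_min <= rto_min and s.typical_rpo_min <= rpo_min
    ]
    return min(candidates, key=lambda s: s.relative_cost, default=None)
```

In practice you would replace the placeholder values with figures measured in your own DR drills.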

Testing Disaster Recovery

Designing a DR plan is not enough; you must also test it periodically. Testing ensures that backup processes work, failovers function, and your team knows the recovery steps.

Example:

A company runs a quarterly DR drill. They simulate a data center outage by taking one region offline and observing how the system recovers. Metrics like RTO and RPO are tracked and documented.
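During a drill like this, recovery time can be measured by polling a health check from the moment the outage is induced. The sketch below shows one way to do that. The function name and parameters are hypothetical, and `is_healthy` stands in for whatever check your system exposes (for example, an HTTP health endpoint). The clock and sleep functions are injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable, Optional


def measure_recovery_seconds(
    is_healthy: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll a health check after an induced outage.

    Returns the seconds elapsed until the first healthy response,
    or None if the system is still unhealthy when the timeout expires.
    """
    started = clock()
    while True:
        healthy = is_healthy()
        elapsed = clock() - started
        if healthy:
            return elapsed
        if elapsed >= timeout_s:
            return None
        sleep(poll_interval_s)
```

The measured value is accurate only to within one polling interval, so choose an interval that is small relative to the RTO you are trying to verify.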

Common Mistakes in Disaster Recovery Planning

Not testing the recovery process
Assuming backups are restorable without verifying
Storing backups in the same region or availability zone as the primary system
Failing to define RTO and RPO for each component
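The second mistake in the list above, assuming backups are restorable, can be guarded against with an automated check. The following is a minimal sketch assuming the backup is a SQLite file; the function name `backup_is_restorable` and the `required_tables` parameter are hypothetical. It restores the backup into a throwaway in-memory database and verifies it there.

```python
import sqlite3
from pathlib import Path


def backup_is_restorable(backup_path: Path, required_tables: set[str]) -> bool:
    """Restore a backup into a scratch database and sanity-check the result."""
    if not backup_path.is_file():
        return False
    scratch = sqlite3.connect(":memory:")
    source = sqlite3.connect(backup_path)
    try:
        source.backup(scratch)  # restore into a throwaway database
        status = scratch.execute("PRAGMA integrity_check").fetchone()[0]
        tables = {
            row[0]
            for row in scratch.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            )
        }
    except sqlite3.DatabaseError:
        # Raised when the file is corrupt or is not a database at all.
        return False
    finally:
        source.close()
        scratch.close()
    return status == "ok" and required_tables <= tables
```

Running a check like this after every backup turns "we have backups" into "we have backups that restore", though it does not replace a full recovery drill.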

Real-World Scenario: Disaster at a Payment Company

In 2020, a payment service provider faced a massive outage due to a misconfigured firewall rule. Its payment API went offline for hours. Because the company had no warm standby or region-level redundancy, thousands of merchants were affected.

Question: How could this have been avoided?

Answer: With a warm standby in a different region and automated failover, they could have rerouted traffic and resumed operations quickly, provided the faulty firewall rule had not also been rolled out to the standby region.
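Automated failover usually means acting on repeated health-check failures rather than a single one, so that a brief blip does not trigger a region switch. Here is a minimal sketch of that logic. The class name, the region names, and the threshold of three failures are assumptions for illustration, and the actual traffic switch (DNS or load balancer update) is left as a comment.

```python
class FailoverController:
    """Switch traffic to a standby region after consecutive failed health checks."""

    def __init__(self, primary: str, standby: str, failure_threshold: int = 3):
        self.primary = primary
        self.standby = standby
        self.failure_threshold = failure_threshold
        self.active = primary
        self._consecutive_failures = 0

    def record_health_check(self, primary_healthy: bool) -> str:
        """Feed in one health-check result; return the region that should serve traffic."""
        if primary_healthy:
            self._consecutive_failures = 0
        else:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self.failure_threshold:
                # In a real system, this is where DNS or load balancer
                # configuration would be updated to point at the standby.
                self.active = self.standby
        return self.active
```

This sketch deliberately does not fail back automatically when the primary recovers. Returning traffic to the primary is typically a deliberate step taken after the root cause has been confirmed and fixed.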

Conclusion

Disaster Recovery is an essential non-functional requirement for any system that values uptime and reliability. As a beginner, focus on understanding RTO and RPO, and start with simple backups. As your system grows, evaluate the trade-offs between cost and recovery speed using strategies like warm standby and active-active setups.
