
Disaster Recovery in System Design



What is Disaster Recovery?

Disaster Recovery (DR) refers to the set of processes, policies, and technologies used to recover and restore a system after a failure or catastrophic event. The goal is to resume normal operations as quickly and smoothly as possible, minimizing downtime and data loss.

Why is Disaster Recovery Important?

Imagine if a company like Netflix or Amazon lost all its data or went offline for hours. It would impact not only revenue but also customer trust. Disaster Recovery ensures your system is resilient and can bounce back from failures such as:

Hardware failures (disk crashes, server outages)
Natural disasters (earthquakes, floods, fires)
Cyberattacks (ransomware, data breaches)
Human error (accidental deletion, misconfiguration)

Key Metrics in Disaster Recovery

Before designing a DR strategy, two key terms must be understood:

1. Recovery Time Objective (RTO)

RTO is the maximum acceptable time your system can be offline after a failure.

2. Recovery Point Objective (RPO)

RPO defines the maximum acceptable amount of data loss measured in time. For example, if your RPO is 5 minutes, your system should not lose more than 5 minutes of data during a disaster.

Question: Why can't we always have RTO and RPO of zero?

Answer: Achieving zero RTO and RPO would require synchronous real-time replication, active-active data centers, and instant automated failover, all of which are extremely expensive. For most companies, a balance between cost and availability is necessary.
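To make these two metrics concrete, here is a minimal sketch that checks whether a recovery met its RTO and RPO targets. The function name and timestamps are illustrative, not part of any standard library:

```python
from datetime import datetime, timedelta

def recovery_met_objectives(last_backup, failure_time, restored_time,
                            rto: timedelta, rpo: timedelta):
    """Check a recovery against RTO/RPO targets.

    RTO: how long the system was down (failure -> restored).
    RPO: how much data was lost (last backup -> failure).
    """
    downtime = restored_time - failure_time
    data_loss_window = failure_time - last_backup
    return downtime <= rto, data_loss_window <= rpo

# Example: backup at 02:00, failure at 02:03, service restored at 02:20.
last_backup = datetime(2024, 1, 1, 2, 0)
failure = datetime(2024, 1, 1, 2, 3)
restored = datetime(2024, 1, 1, 2, 20)

rto_ok, rpo_ok = recovery_met_objectives(
    last_backup, failure, restored,
    rto=timedelta(minutes=30), rpo=timedelta(minutes=5))
print(rto_ok, rpo_ok)  # 17 min downtime <= 30, 3 min data loss <= 5
```

Both checks pass here; with a 5-minute RPO but a nightly backup, the second check would fail, which is exactly the mismatch the next sections address.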

Disaster Recovery Strategies

Let’s explore the most common DR strategies with detailed examples:

1. Backup and Restore

This is the simplest and most cost-effective DR strategy. Systems regularly back up data to a secure location. In case of a disaster, the data is restored from the backup.

Example:

A startup backs up its database every night to AWS S3. If the database server crashes, the team restores the last backup and resumes services, accepting some data loss (up to 24 hours).

Question: Is this approach suitable for critical financial systems?

Answer: No. Critical systems require more frequent backups or real-time replication. A full-day data loss is unacceptable in banking systems.
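The backup-and-restore flow above can be sketched with local files standing in for object storage (in practice the copy would go to something like AWS S3). The file names and data are made up for illustration:

```python
import shutil
import tempfile
from pathlib import Path

def nightly_backup(db_file: Path, backup_dir: Path) -> Path:
    """Copy the database file to the backup location (stand-in for S3)."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    dest = backup_dir / f"{db_file.name}.bak"
    shutil.copy2(db_file, dest)
    return dest

def restore(backup_file: Path, db_file: Path) -> None:
    """Restore the most recent backup after a crash."""
    shutil.copy2(backup_file, db_file)

# Simulated day: write data, back up, write more data, crash, restore.
workdir = Path(tempfile.mkdtemp())
db = workdir / "app.db"
db.write_text("orders: 1..100")              # state at backup time

bak = nightly_backup(db, workdir / "backups")
db.write_text("orders: 1..150")              # written after the backup
db.unlink()                                  # disaster: database lost

restore(bak, db)
print(db.read_text())  # "orders: 1..100" -- orders 101..150 are lost
```

Everything written between the last backup and the crash is gone, which is why the RPO of this strategy equals the backup interval.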

2. Pilot Light

In this strategy, a minimal version of your system runs in a secondary region. It keeps essential components like the database ready but doesn't handle traffic until a disaster occurs. When a disaster strikes, additional infrastructure is spun up quickly.

Example:

An e-commerce site maintains a read-only replica of its production database in another AWS region. If the primary region fails, they quickly launch app servers and switch DNS to the backup region.
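A pilot-light failover can be modeled as a small state machine. The class and attribute names below are illustrative assumptions, not an AWS API:

```python
class PilotLight:
    """Minimal model of a pilot-light DR setup in a secondary region."""

    def __init__(self, app_servers_needed: int):
        self.replica_role = "read-only"   # replica stays in sync, serves no writes
        self.app_servers = 0              # nothing else runs until a disaster
        self.dns_target = "primary"
        self.app_servers_needed = app_servers_needed

    def activate(self):
        """On disaster: promote the replica, launch servers, switch DNS."""
        self.replica_role = "primary"
        self.app_servers = self.app_servers_needed
        self.dns_target = "secondary"

site = PilotLight(app_servers_needed=10)
site.activate()
print(site.replica_role, site.app_servers, site.dns_target)
# primary 10 secondary
```

Note the recovery time is dominated by launching those app servers; only the "pilot light" (the replica) was already burning.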

3. Warm Standby

Here, a scaled-down version of the system is always running in the backup location. It can handle some traffic and be scaled up quickly to full capacity.

Example:

A social media platform keeps a warm standby in GCP while operating in AWS. The GCP setup runs 30% of the infrastructure. If AWS fails, traffic shifts gradually to GCP.

Question: What is the trade-off between pilot light and warm standby?

Answer: Warm standby is more expensive but allows faster failover with less setup time. Pilot light is cheaper but takes longer to become fully operational during a disaster.
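The gradual traffic shift described in the warm-standby example can be sketched as a weighted routing function. The step size and percentages are illustrative:

```python
def shift_traffic(primary_weight: int, step: int = 20) -> tuple:
    """Move `step` percent of traffic from the primary to the standby."""
    primary = max(0, primary_weight - step)
    return primary, 100 - primary

# Primary fails: shift traffic to the warm standby in steps, giving
# the standby time to scale from partial capacity (e.g. 30%) to full.
weights = (100, 0)
history = []
while weights[0] > 0:
    weights = shift_traffic(weights[0])
    history.append(weights)
print(history)  # [(80, 20), (60, 40), (40, 60), (20, 80), (0, 100)]
```

Shifting in steps rather than all at once avoids overwhelming the standby before it finishes scaling up.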

4. Active-Active

In this strategy, multiple data centers are running at full capacity simultaneously and serve real-time traffic. If one fails, others continue serving users without noticeable disruption.

Example:

Google Search operates in multiple data centers worldwide. Users are routed to the nearest one via load balancing. If one region goes offline, traffic reroutes automatically to the others.

Question: Is active-active always the best approach?

Answer: It offers the best availability and fastest recovery, but it’s also the most complex and expensive to build and maintain. Not every system needs such high resilience.
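Active-active routing boils down to "send each user to the nearest healthy region." Here is a minimal sketch; the region names and latencies are made up for illustration:

```python
def route(user_latencies_ms: dict, healthy: set) -> str:
    """Pick the lowest-latency region that is still healthy."""
    candidates = {r: ms for r, ms in user_latencies_ms.items() if r in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)

latencies = {"us-east": 20, "eu-west": 90, "ap-south": 180}

print(route(latencies, healthy={"us-east", "eu-west", "ap-south"}))  # us-east
# us-east goes offline: traffic reroutes to the next-nearest region.
print(route(latencies, healthy={"eu-west", "ap-south"}))             # eu-west
```

Because every region is already serving live traffic, failover is just a routing change; no infrastructure needs to be launched.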

Testing Disaster Recovery

Designing DR is not enough—you must test it periodically. Testing ensures that backup processes work, failovers function, and your team knows the recovery steps.

Example:

A company runs a quarterly DR drill. They simulate a data center outage by taking one region offline and observing how the system recovers. Metrics like RTO and RPO are tracked and documented.
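A drill harness can time the failover and compare it with the RTO target. The failover here is simulated with a sleep; in a real drill it would be the actual region switch:

```python
import time

def run_drill(failover, rto_seconds: float):
    """Time a failover procedure and check it against the RTO target."""
    start = time.monotonic()
    failover()
    elapsed = time.monotonic() - start
    return elapsed, elapsed <= rto_seconds

def simulated_failover():
    time.sleep(0.1)  # stand-in for restoring service in the backup region

elapsed, met_rto = run_drill(simulated_failover, rto_seconds=1.0)
print(f"recovered in {elapsed:.2f}s, RTO met: {met_rto}")
```

Recording these numbers each quarter shows whether recovery time is drifting away from the objective before a real disaster exposes it.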

Common Mistakes in Disaster Recovery Planning

Never testing backups or failover procedures until a real disaster
Storing backups in the same region as the primary system
Having no documented, rehearsed recovery runbook
Choosing a DR strategy without agreed RTO/RPO targets

Real-World Scenario: Disaster at a Payment Company

In 2020, a payment service provider faced a massive outage due to a misconfigured firewall rule. Their payment API went offline for hours. Since they didn’t have warm standby or region-level redundancy, thousands of merchants were affected.

Question: How could this have been avoided?

Answer: With a warm standby in a different region and automated failover, they could have rerouted traffic and resumed operations quickly.

Conclusion

Disaster Recovery is an essential non-functional requirement for any system that values uptime and reliability. As a beginner, focus on understanding RTO and RPO, and start with simple backups. As your system grows, evaluate the trade-offs between cost and recovery speed using strategies like warm standby and active-active setups.


