Disaster Recovery (DR) refers to the set of processes, policies, and technologies used to recover and restore a system after a failure or catastrophic event. The goal is to resume normal operations as quickly and smoothly as possible, minimizing downtime and data loss.
Imagine if a company like Netflix or Amazon lost all its data or went offline for hours. It would impact not only revenue but also customer trust. Disaster Recovery ensures your system is resilient and can bounce back from such failures.
Before designing a DR strategy, two key terms must be understood:
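Recovery Time Objective (RTO) is the maximum acceptable time your system can be offline after a failure. For example, if your RTO is 1 hour, service should be restored within 1 hour of the outage.
Recovery Point Objective (RPO) is the maximum acceptable amount of data loss, measured in time. For example, if your RPO is 5 minutes, your system should not lose more than 5 minutes of data during a disaster.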
Not every system can target a zero RTO and zero RPO, because that would require real-time backups, active-active data centers, and instant failover, all of which are costly. Most companies need to balance cost against availability.
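To make the two objectives concrete, here is a minimal sketch that checks a recovery plan against RTO and RPO targets. The function names and the sample numbers are assumptions chosen for illustration; the only rule it encodes is that periodic backups can lose up to one full backup interval of writes.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """With periodic backups, a failure just before the next backup
    loses almost one full interval of writes."""
    return backup_interval


def meets_objectives(expected_downtime: timedelta,
                     expected_data_loss: timedelta,
                     rto: timedelta,
                     rpo: timedelta) -> bool:
    """A plan is acceptable only if it stays within both targets."""
    return expected_downtime <= rto and expected_data_loss <= rpo


# Hypothetical targets: 1 hour of downtime, 5 minutes of data loss.
rto = timedelta(hours=1)
rpo = timedelta(minutes=5)

nightly_loss = worst_case_data_loss(timedelta(hours=24))
nightly_ok = meets_objectives(timedelta(minutes=30), nightly_loss, rto, rpo)
print(nightly_ok)  # False: a nightly backup cannot satisfy a 5-minute RPO
```

Tightening the RPO from 24 hours to 5 minutes is what pushes a design from periodic backups toward continuous replication, which is where the cost rises.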
Let’s explore the most common DR strategies with detailed examples:
Backup and restore is the simplest and most cost-effective DR strategy. Systems regularly back up data to a secure location. After a disaster, the data is restored from the most recent backup.
For example, a startup backs up its database every night to AWS S3. If the database server crashes, the team restores the last backup and resumes service, accepting up to 24 hours of data loss.
Nightly backups are not enough for every system. Critical systems require more frequent backups or real-time replication; losing a full day of data is unacceptable in a banking system.
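The backup-and-restore cycle can be sketched with nothing but the standard library. This is an illustration under stated assumptions, not a production tool: a local directory stands in for the remote backup location (such as an S3 bucket), the database is a single file, and `back_up` and `restore_latest` are hypothetical helper names.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def back_up(db_file: Path, backup_dir: Path,
            now: Optional[datetime] = None) -> Path:
    """Copy the database file to a timestamped backup (timestamps in UTC)."""
    now = now or datetime.now(timezone.utc)
    backup_dir.mkdir(parents=True, exist_ok=True)
    # Fixed-width timestamps sort lexicographically in chronological order.
    target = backup_dir / f"{db_file.stem}-{now:%Y%m%dT%H%M%SZ}{db_file.suffix}"
    shutil.copy2(db_file, target)
    return target


def restore_latest(backup_dir: Path, db_file: Path) -> Path:
    """Overwrite the database file with the most recent backup."""
    backups = sorted(backup_dir.glob(f"{db_file.stem}-*{db_file.suffix}"))
    if not backups:
        raise FileNotFoundError(f"no backups found in {backup_dir}")
    latest = backups[-1]
    shutil.copy2(latest, db_file)
    return latest


# Simulate one nightly backup, a crash, and a restore.
with tempfile.TemporaryDirectory() as tmp:
    db = Path(tmp) / "app.db"
    backups = Path(tmp) / "backups"
    db.write_text("orders as of last night")
    back_up(db, backups, datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc))
    db.write_text("corrupted")  # everything written after the backup is lost
    restore_latest(backups, db)
    restored = db.read_text()

print(restored)  # orders as of last night
```

Anything written between the last backup and the crash is gone after the restore, which is exactly the data loss the RPO is meant to bound.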
In the pilot light strategy, a minimal version of your system runs in a secondary region. It keeps essential components such as the database ready but doesn't handle traffic until a disaster occurs. When one does, the rest of the infrastructure is spun up quickly.
For example, an e-commerce site maintains a read-only replica of its production database in another AWS region. If the primary region fails, the team quickly launches app servers and switches DNS to the backup region.
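The failover decision in a pilot light setup can be sketched as a small controller. Everything here is hypothetical: the class name, the three-failure threshold, and the step of promoting the replica (the example only mentions launching app servers and switching DNS). The steps are recorded as log entries rather than real cloud API calls.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PilotLightFailover:
    """Trigger failover after several consecutive failed health checks."""
    failure_threshold: int = 3          # assumed value, tune per system
    consecutive_failures: int = 0
    failed_over: bool = False
    log: List[str] = field(default_factory=list)

    def observe(self, primary_healthy: bool) -> bool:
        """Record one health check; return True if it triggered failover."""
        if self.failed_over:
            return False
        if primary_healthy:
            self.consecutive_failures = 0  # a single success resets the count
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self._fail_over()
            return True
        return False

    def _fail_over(self) -> None:
        # Placeholders for real infrastructure actions.
        self.log.append("promote read replica to primary")  # assumed step
        self.log.append("launch app servers in backup region")
        self.log.append("switch DNS to backup region")
        self.failed_over = True


controller = PilotLightFailover()
results = [controller.observe(ok) for ok in [True, False, False, False]]
print(results)         # [False, False, False, True]
print(controller.log)
```

Requiring several consecutive failures avoids failing over on a single dropped health check, at the cost of a slightly longer detection time that counts toward the RTO.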
In a warm standby, a scaled-down version of the system is always running in the backup location. It can handle some traffic and be scaled up quickly to full capacity.
For example, a social media platform keeps a warm standby in GCP while operating in AWS. The GCP setup runs 30% of the infrastructure. If AWS fails, traffic shifts gradually to GCP.
Compared with pilot light, warm standby is more expensive but allows faster failover with less setup time. Pilot light is cheaper but takes longer to become fully operational during a disaster.
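A gradual shift to a warm standby can be planned as a series of traffic weights. The sketch below is an assumption-laden illustration: `plan_shift` is a hypothetical helper, the shift happens in equal steps, and the standby is scaled up whenever its traffic share would exceed its current capacity (30% to start, as in the example above).

```python
from typing import Dict, List


def plan_shift(steps: int, standby_capacity_pct: int = 30) -> List[Dict[str, int]]:
    """Plan an equal-step traffic shift from the primary to the standby.

    Each entry gives the traffic split (in percent) and the standby
    capacity that must be in place before that step is applied.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    plan = []
    for i in range(1, steps + 1):
        standby_traffic = (100 * i) // steps  # the last step is always 100
        plan.append({
            "primary_traffic": 100 - standby_traffic,
            "standby_traffic": standby_traffic,
            # Never scale below what is already running.
            "standby_capacity": max(standby_capacity_pct, standby_traffic),
        })
    return plan


for step in plan_shift(4):
    print(step)
```

The first step fits within the 30% that is already running, so traffic can start moving immediately while the remaining capacity is still being provisioned. That head start is the main advantage over pilot light.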
In the active-active strategy, multiple data centers run at full capacity simultaneously and serve live traffic. If one fails, the others continue serving users without noticeable disruption.
For example, Google Search operates in multiple data centers worldwide. Users are routed to the nearest one via load balancing. If one region goes offline, traffic reroutes automatically to the others.
Active-active offers the best availability and fastest recovery, but it is also the most complex and expensive to build and maintain. Not every system needs such high resilience.
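The routing rule behind an active-active setup, send each user to the nearest healthy region, can be sketched in a few lines. The region names and latency figures below are made up for illustration, and `pick_region` is a hypothetical helper rather than any load balancer's actual API.

```python
from typing import Dict, Optional


def pick_region(latency_ms: Dict[str, float],
                healthy: Dict[str, bool]) -> Optional[str]:
    """Return the lowest-latency healthy region, or None if all are down."""
    candidates = [r for r in latency_ms if healthy.get(r, False)]
    if not candidates:
        return None
    return min(candidates, key=lambda r: latency_ms[r])


# Hypothetical latencies measured from one user's location.
latency = {"us-east": 20.0, "eu-west": 95.0, "ap-south": 180.0}

all_up = {"us-east": True, "eu-west": True, "ap-south": True}
print(pick_region(latency, all_up))       # us-east

us_east_down = {"us-east": False, "eu-west": True, "ap-south": True}
print(pick_region(latency, us_east_down))  # eu-west
```

Because every region already serves live traffic at full capacity, failover is just a routing change; nothing has to be launched or scaled first.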
Designing DR is not enough; you must test it periodically. Testing ensures that backup processes work, failovers function, and your team knows the recovery steps.
For example, a company runs a quarterly DR drill. They simulate a data center outage by taking one region offline and observing how the system recovers. The measured recovery time and data loss are tracked and documented against the RTO and RPO targets.
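The outcome of a drill like this can be recorded as three timestamps and compared with the targets. The class name, field names, and sample times below are assumptions for illustration; real drills would pull these values from monitoring and replication logs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    outage_start: datetime            # when the region was taken offline
    service_restored: datetime        # when users were served again
    last_recoverable_write: datetime  # newest write present after recovery

    @property
    def recovery_time(self) -> timedelta:
        """Actual downtime, to be compared with the RTO."""
        return self.service_restored - self.outage_start

    @property
    def data_loss_window(self) -> timedelta:
        """Span of writes lost, to be compared with the RPO."""
        return self.outage_start - self.last_recoverable_write

    def passed(self, rto: timedelta, rpo: timedelta) -> bool:
        return self.recovery_time <= rto and self.data_loss_window <= rpo


drill = DrillResult(
    outage_start=datetime(2024, 3, 1, 10, 0),
    service_restored=datetime(2024, 3, 1, 10, 40),
    last_recoverable_write=datetime(2024, 3, 1, 9, 57),
)
print(drill.recovery_time)     # 0:40:00
print(drill.data_loss_window)  # 0:03:00
print(drill.passed(rto=timedelta(hours=1), rpo=timedelta(minutes=5)))  # True
```

Keeping these results from drill to drill shows whether recovery is getting faster or slower as the system changes.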
In 2020, a payment service provider faced a massive outage due to a misconfigured firewall rule. Their payment API went offline for hours. Since they didn’t have warm standby or region-level redundancy, thousands of merchants were affected.
With a warm standby in a different region and automated failover, the provider could have rerouted traffic and resumed operations quickly.
Disaster Recovery is an essential non-functional requirement for any system that values uptime and reliability. As a beginner, focus on understanding RTO and RPO, and start with simple backups. As your system grows, evaluate the trade-offs between cost and recovery speed using strategies like warm standby and active-active setups.