What is Availability?
Availability refers to the ability of a system to be operational and accessible when it is needed. It is usually represented as a percentage of uptime over a specific period. A highly available system ensures that users can access it without significant downtime.
How is Availability Measured?
Availability is commonly expressed in terms of "nines". For example:
- 99% availability means ~3.65 days of downtime in a year
- 99.9% availability means ~8.76 hours of downtime in a year
- 99.99% availability means ~52.56 minutes of downtime in a year
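The downtime figures above follow directly from the percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name `downtime_hours_per_year` is made up for this illustration):

```python
def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the downtime budget in hours for a 365-day year."""
    hours_per_year = 365 * 24  # 8760 hours; leap years are ignored
    unavailable_fraction = 1 - availability_percent / 100
    return hours_per_year * unavailable_fraction


for target in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(target)
    print(f"{target}% -> {hours:.2f} hours ({hours * 60:.2f} minutes) per year")
```

Each extra "nine" cuts the allowed downtime by a factor of ten.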
Example: Availability in a Web Application
Imagine a basic e-commerce website. If this website is hosted on a single server and that server crashes or restarts, the site becomes unavailable to users. If we add a load balancer and deploy multiple server instances across data centers, the application becomes more available: if one server fails, the load balancer routes traffic to another.
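To get a rough feel for how much redundancy helps, the effect can be estimated with a simple probability model. This is a sketch under strong assumptions: servers fail independently of each other, any single healthy server can carry the traffic, and the load balancer itself never fails. Real deployments rarely satisfy all three, so treat the result as an upper bound.

```python
def combined_availability(single_server: float, replicas: int) -> float:
    """Availability of `replicas` redundant servers, assuming independent failures.

    The site is down only when every replica is down at the same time.
    """
    if replicas < 1:
        raise ValueError("at least one server is required")
    all_down = (1 - single_server) ** replicas
    return 1 - all_down


for n in (1, 2, 3):
    print(f"{n} server(s) at 99% each -> {combined_availability(0.99, n):.6f}")
```

Under these assumptions, two servers at 99% each give roughly 99.99% combined availability.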
What is Reliability?
Reliability is the probability that a system will run without failure over a specific period. A reliable system performs its intended function correctly and consistently under expected conditions. While availability focuses on being accessible, reliability focuses on being correct.
Example: Reliability in a Messaging System
Consider a chat application like WhatsApp. You may be able to open the app and send a message (availability), but if that message never reaches the recipient or arrives corrupted, the system is not reliable. A reliable messaging system ensures that once you send a message, it will reach the intended user exactly once, in the correct order, and without modification.
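One common way to approximate "exactly once, in the correct order" on the receiving side is to number messages and have the receiver drop duplicates and hold back anything that arrives early. The sketch below is illustrative only: the class `OrderedReceiver` and its sequence-number scheme are assumptions for this example, not how any particular chat application works, and it does not cover corruption checks.

```python
class OrderedReceiver:
    """Delivers each message once, in sequence order, despite duplicates and reordering."""

    def __init__(self):
        self._next_seq = 0   # sequence number expected next
        self._pending = {}   # messages that arrived ahead of their turn
        self.delivered = []  # what the user actually sees

    def receive(self, seq: int, text: str) -> None:
        # Ignore anything already delivered or already buffered (a duplicate).
        if seq < self._next_seq or seq in self._pending:
            return
        self._pending[seq] = text
        # Release every message that is now contiguous with what was delivered.
        while self._next_seq in self._pending:
            self.delivered.append(self._pending.pop(self._next_seq))
            self._next_seq += 1


receiver = OrderedReceiver()
receiver.receive(1, "world")  # arrives early, held back
receiver.receive(0, "hello")  # unblocks both messages
receiver.receive(1, "world")  # duplicate from a sender retry, ignored
print(receiver.delivered)
```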
Availability vs Reliability: Key Differences
Aspect | Availability | Reliability |
---|---|---|
Definition | System is up and reachable | System works correctly without failure |
Focus | Uptime | Correctness |
Example | App is accessible 24/7 | Data is consistently processed and delivered without loss |
Measurement | Uptime percentage (e.g. 99.99%) | Mean Time Between Failures (MTBF) |
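The two measurements in the table are related. The sketch below shows how MTBF can be computed from observed failures and how it connects to both concepts. It brings in two things the table does not mention, so treat them as assumptions: Mean Time To Repair (MTTR), and an exponential failure model (constant failure rate) for the reliability estimate.

```python
import math


def mtbf(total_operating_hours: float, failure_count: int) -> float:
    """Mean Time Between Failures: operating time divided by number of failures."""
    return total_operating_hours / failure_count


def reliability(mission_hours: float, mtbf_hours: float) -> float:
    """Probability of running `mission_hours` without failure.

    Assumes a constant failure rate (exponential model).
    """
    return math.exp(-mission_hours / mtbf_hours)


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the system is up, given average repair time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


observed_mtbf = mtbf(total_operating_hours=1000, failure_count=4)
print(f"MTBF: {observed_mtbf:.0f} hours")
print(f"Chance of 24 failure-free hours: {reliability(24, observed_mtbf):.3f}")
print(f"Availability with 1-hour repairs: {steady_state_availability(observed_mtbf, 1):.4f}")
```

A system that fails often but recovers in seconds can show high availability with a low MTBF, which is one reason the two properties are tracked separately.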
Question: Can a System be Available but Not Reliable?
Yes. A system may be accessible (available) but deliver incorrect or inconsistent results (not reliable).
Example: A payment gateway is online and lets you make payments (available), but due to a bug, it charges the customer twice (not reliable).
Question: Can a System be Reliable but Not Available?
Yes. A system may deliver correct results when it runs (reliable), but it may not be accessible all the time (not available).
Example: A data processing system that gives accurate results when run, but is only available 4 hours a day due to maintenance or resource constraints.
Real-World Analogy: ATM Machines
Think of an ATM:
- If the ATM screen works and accepts your card, but then shows a system error after PIN entry, it is available but not reliable.
- If the ATM processes transactions correctly when it's running, but is often out of service, it is reliable but not available.
- A good ATM is both highly available (works 24/7) and highly reliable (does not make errors).
Strategies to Improve Availability
- Use load balancers to distribute traffic
- Deploy across multiple availability zones or regions
- Use failover mechanisms and health checks
- Implement auto-scaling and redundancy
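Several of these strategies work together: health checks tell the load balancer which servers are up, and failover means routing around the ones that are not. A minimal in-memory sketch of that idea follows. The `LoadBalancer` class and backend names are hypothetical, and a production load balancer would probe backends over the network rather than being told their state.

```python
import itertools


class LoadBalancer:
    """Round-robin balancer that skips backends marked unhealthy."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._cycle = itertools.cycle(self._backends)

    def mark_health(self, backend: str, healthy: bool) -> None:
        # In practice this would be driven by periodic health checks.
        if healthy:
            self._healthy.add(backend)
        else:
            self._healthy.discard(backend)

    def pick(self) -> str:
        # Try each backend at most once per request before giving up.
        for _ in range(len(self._backends)):
            candidate = next(self._cycle)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


lb = LoadBalancer(["server-a", "server-b", "server-c"])
lb.mark_health("server-b", False)  # simulate a failed health check
picks = [lb.pick() for _ in range(4)]
print(picks)
```

Traffic keeps flowing to the remaining servers while `server-b` is down, and it rejoins the rotation once it is marked healthy again.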
Strategies to Improve Reliability
- Implement strong error handling and retries
- Use idempotent operations (repeating the same request has no additional effect beyond the first)
- Maintain data consistency by choosing a model that fits the use case, such as ACID transactions or eventual consistency
- Monitor logs, metrics, and anomalies
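Retries and idempotency are most useful together: a retry is only safe if repeating the request cannot apply its effect twice. The sketch below revisits the double-charge scenario with a toy in-memory gateway. Everything here (`PaymentGateway`, `retry`, the idempotency key `"order-42"`) is hypothetical and does not represent a real payment API.

```python
import time


class PaymentGateway:
    """Toy gateway that remembers idempotency keys it has already processed."""

    def __init__(self):
        self._processed = {}  # idempotency key -> receipt
        self.charges = []     # every charge actually applied

    def charge(self, idempotency_key: str, customer: str, amount_cents: int) -> dict:
        # A repeated key returns the original receipt instead of charging again.
        if idempotency_key in self._processed:
            return self._processed[idempotency_key]
        self.charges.append((customer, amount_cents))
        receipt = {"customer": customer, "amount_cents": amount_cents}
        self._processed[idempotency_key] = receipt
        return receipt


def retry(operation, attempts: int = 3, base_delay: float = 0.01):
    """Call `operation`, retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the error
            time.sleep(base_delay * 2 ** attempt)


gateway = PaymentGateway()
calls = {"count": 0}


def flaky_charge():
    """Simulates a charge whose first response is lost on the network."""
    calls["count"] += 1
    receipt = gateway.charge("order-42", "alice", 1999)
    if calls["count"] == 1:
        raise ConnectionError("response lost after the charge was applied")
    return receipt


receipt = retry(flaky_charge)
print(f"attempts: {calls['count']}, charges applied: {len(gateway.charges)}")
```

The client had to send the request twice, but the customer is charged once because the second attempt reuses the same idempotency key.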
Interview Insight
In system design interviews, you may be asked to build a highly available and reliable system. Clarify what the interviewer prioritizes:
"Is it more important that the system is always accessible, or that it never fails when processing data?"
Conclusion
Availability and reliability are two foundational pillars of system design. Both are crucial, but they solve different problems. Availability ensures users can access the system; reliability ensures the system works correctly. The right balance depends on the system: a banking application will usually favor correctness even at the cost of some downtime, while a social media feed can often tolerate an occasional stale or missing item more easily than an outage.
Quick Recap
- Availability = System is accessible
- Reliability = System behaves correctly
- Both are measured differently and need distinct strategies
- Real-world systems aim to maximize both