Introduction
In this module, we will learn how to design WhatsApp, a real-time, scalable, fault-tolerant messaging platform. WhatsApp is used by billions of people to send and receive instant messages, images, and voice notes. Designing such a system requires careful attention to latency, consistency, and scalability.
Functional Requirements
- One-on-one messaging
- Group messaging
- Message status updates (sent, delivered, read)
- Media sharing (images, videos, documents)
- End-to-end encryption
Non-Functional Requirements
- High availability and fault tolerance
- Low latency
- Scalability (supporting millions of concurrent users)
- Security and privacy
Step-by-Step Architecture
Client-Server Communication
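WhatsApp uses a mobile-first approach where the client (mobile app) maintains a persistent connection with the backend using a protocol such as XMPP or WebSocket, both of which run over a long-lived TCP connection. Because the connection stays open, the server can push a message to the client as soon as it arrives instead of waiting for the client to poll.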
Example:
When User A sends a message to User B:
- User A writes a message and hits "send".
- The message is encrypted and sent to the WhatsApp server.
- The server routes it to User B (if online), or stores it in a message queue (if offline).
Question:
How can we ensure User B receives the message even if their phone is off?
Answer:
We use a Message Queue to store undelivered messages. When User B reconnects, the server pushes the queued messages to the client.
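The routing decision described above can be sketched in a few lines. This is a minimal, single-process illustration, not WhatsApp's actual implementation: the class name MessageRouter, its methods, and the use of plain Python lists as stand-ins for live connections are all assumptions made for the example.

```python
from collections import defaultdict, deque


class MessageRouter:
    """In-memory stand-in for the routing server (hypothetical, single process)."""

    def __init__(self):
        self.online = {}                   # user_id -> list standing in for a live connection
        self.pending = defaultdict(deque)  # user_id -> messages waiting for a reconnect

    def connect(self, user_id):
        """Register a connection and flush anything queued while the user was away."""
        inbox = []
        self.online[user_id] = inbox
        while self.pending[user_id]:
            inbox.append(self.pending[user_id].popleft())
        return inbox

    def disconnect(self, user_id):
        self.online.pop(user_id, None)

    def send(self, sender, recipient, ciphertext):
        message = (sender, ciphertext)
        if recipient in self.online:
            self.online[recipient].append(message)  # push over the live connection
            return "delivered"
        self.pending[recipient].append(message)     # hold until the recipient returns
        return "queued"


router = MessageRouter()
router.connect("alice")
first = router.send("alice", "bob", b"<encrypted 1>")   # bob is offline
second = router.send("alice", "bob", b"<encrypted 2>")  # still offline
bob_inbox = router.connect("bob")                       # queued messages are flushed here
third = router.send("alice", "bob", b"<encrypted 3>")   # bob is now online
```

The server only ever handles the encrypted payload as opaque bytes, and queued messages reach User B in the order they were sent.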
Message Queue and Storage
Messages are stored temporarily in a distributed message queue (e.g., Kafka or RabbitMQ) until the recipient's device acknowledges them. Once a message is delivered and acknowledged, it can be deleted or archived.
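To make the "delete only after acknowledgement" rule concrete, here is a small sketch of a pending-message store. It is an in-memory assumption for illustration and does not use the Kafka or RabbitMQ client APIs; the names PendingStore, enqueue, undelivered_for, and ack are hypothetical.

```python
import itertools


class PendingStore:
    """Holds each undelivered message until the recipient acknowledges it."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._pending = {}  # message_id -> (recipient, payload)

    def enqueue(self, recipient, payload):
        message_id = next(self._ids)
        self._pending[message_id] = (recipient, payload)
        return message_id

    def undelivered_for(self, recipient):
        """Everything still owed to this recipient; safe to call again after a dropped connection."""
        return [
            (message_id, payload)
            for message_id, (owner, payload) in self._pending.items()
            if owner == recipient
        ]

    def ack(self, message_id):
        """Delete on acknowledgement. Returns False for an unknown or duplicate ack."""
        return self._pending.pop(message_id, None) is not None


store = PendingStore()
m1 = store.enqueue("bob", b"<encrypted 1>")
m2 = store.enqueue("bob", b"<encrypted 2>")
before_ack = store.undelivered_for("bob")  # both messages are still pending
store.ack(m1)                              # bob's device confirms the first one
after_ack = store.undelivered_for("bob")   # only the second remains
duplicate_ack = store.ack(m1)              # a repeated ack is ignored
```

Because a message is removed only on acknowledgement, a connection that drops mid-delivery results in a resend rather than a lost message. The trade-off is that the client may see the same message twice and should de-duplicate by message ID.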
Database Design
- Users: user_id, phone_number, name, profile_picture
- Messages: message_id, sender_id, receiver_id/group_id, timestamp, status, content
- Groups: group_id, name, members
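One possible way to express these tables is shown below, using Python's built-in sqlite3 module so the sketch runs as-is. Several details are assumptions rather than part of the design above: the column types, the table name chat_groups (chosen to avoid clashing with an SQL keyword), the separate group_members table that replaces the "members" field, and the constraint that a message targets either a user or a group but not both.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    user_id         INTEGER PRIMARY KEY,
    phone_number    TEXT NOT NULL UNIQUE,
    name            TEXT,
    profile_picture TEXT
);
CREATE TABLE chat_groups (
    group_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL
);
-- "members" is modelled as one row per (group, user) pair
CREATE TABLE group_members (
    group_id INTEGER NOT NULL REFERENCES chat_groups(group_id),
    user_id  INTEGER NOT NULL REFERENCES users(user_id),
    PRIMARY KEY (group_id, user_id)
);
CREATE TABLE messages (
    message_id  INTEGER PRIMARY KEY,
    sender_id   INTEGER NOT NULL REFERENCES users(user_id),
    receiver_id INTEGER REFERENCES users(user_id),
    group_id    INTEGER REFERENCES chat_groups(group_id),
    timestamp   INTEGER NOT NULL,
    status      TEXT NOT NULL DEFAULT 'sent'
                CHECK (status IN ('sent', 'delivered', 'read')),
    content     BLOB NOT NULL,
    -- exactly one of receiver_id / group_id must be set
    CHECK ((receiver_id IS NULL) != (group_id IS NULL))
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO users (user_id, phone_number, name) VALUES (1, '+10000000001', 'User A')")
conn.execute("INSERT INTO users (user_id, phone_number, name) VALUES (2, '+10000000002', 'User B')")
conn.execute(
    "INSERT INTO messages (sender_id, receiver_id, timestamp, content) VALUES (?, ?, ?, ?)",
    (1, 2, 1700000000, b"<ciphertext>"),
)
row = conn.execute("SELECT sender_id, receiver_id, status FROM messages").fetchone()
tables = sorted(r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'"))
```

A production system at this scale would likely use a distributed store rather than a single relational database, but the same fields and relationships apply.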
Scalability Strategy
Horizontal Scaling
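Different microservices handle different tasks: message service, media service, notification service, and so on. Each service runs as multiple instances behind a load balancer, so we can scale each one independently based on traffic by adding or removing instances.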
Sharding
To handle billions of messages, we shard the message storage by user ID or region to distribute load across databases.
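As a rough sketch of sharding by user ID, the function below hashes the ID and maps it to one of a fixed number of shards. The shard count and the function name shard_for are assumptions for the example. A stable hash (SHA-256 here) is used instead of Python's built-in hash(), which is randomized per process for strings.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 16  # hypothetical shard count


def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard index in [0, num_shards) using a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# The same user always lands on the same shard, so all of their messages live together.
alice_shard = shard_for("user-alice")

# Many users spread across all shards.
distribution = Counter(shard_for(f"user-{i}") for i in range(10_000))
```

Simple modulo sharding remaps most users when the shard count changes. Consistent hashing is a common way to reduce that movement if shards are added often.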
Media Sharing Design
WhatsApp supports image, video, and document sharing. Media files are not sent directly through the messaging queue. Instead, they are uploaded to cloud storage (e.g., Amazon S3), and a download link is sent in the message.
Example:
User A sends a video:
- Client uploads video to cloud storage and gets a secure URL.
- Client sends the video URL as a message to User B.
- User B downloads the media using the URL.
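The three steps above can be sketched as follows. BlobStore is an in-memory stand-in for cloud storage and does not reflect the Amazon S3 API; the URL, the message fields, and the SHA-256 integrity check are all assumptions added for illustration.

```python
import hashlib
import secrets


class BlobStore:
    """In-memory stand-in for cloud object storage (hypothetical API)."""

    BASE_URL = "https://media.example.invalid/"

    def __init__(self):
        self._objects = {}

    def upload(self, data: bytes) -> str:
        key = secrets.token_urlsafe(16)  # unguessable object key
        self._objects[key] = data
        return self.BASE_URL + key

    def download(self, url: str) -> bytes:
        return self._objects[url[len(self.BASE_URL):]]


def build_media_message(store, sender, recipient, media: bytes) -> dict:
    """Upload the media and return the small message that travels through the queue."""
    url = store.upload(media)
    return {
        "sender": sender,
        "recipient": recipient,
        "type": "media",
        "url": url,
        "sha256": hashlib.sha256(media).hexdigest(),  # lets the receiver verify the download
        "size": len(media),
    }


def fetch_media(store, message: dict) -> bytes:
    data = store.download(message["url"])
    if hashlib.sha256(data).hexdigest() != message["sha256"]:
        raise ValueError("media failed integrity check")
    return data


store = BlobStore()
video = b"\x00" * 4096  # placeholder bytes standing in for a video file
message = build_media_message(store, "alice", "bob", video)
fetched = fetch_media(store, message)
```

Only the small message dictionary passes through the messaging path, while the large file moves between the clients and storage directly.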
Message Status Updates
WhatsApp shows three status levels for messages: Sent ✓, Delivered ✓✓, and Read ✓✓ in blue. Each transition is acknowledged back to the sender:
- ✓: Message received by server
- ✓✓: Message delivered to recipient
- ✓✓ (blue): Message read by recipient
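These transitions can be modelled as a small forward-only state machine. The sketch below is an assumption about how one might implement it, with hypothetical names Status, TICKS, and advance. The key property is that a late or duplicate acknowledgement never moves a message backwards.

```python
from enum import IntEnum


class Status(IntEnum):
    SENT = 1       # received by server
    DELIVERED = 2  # delivered to recipient
    READ = 3       # read by recipient


TICKS = {
    Status.SENT: "✓",
    Status.DELIVERED: "✓✓",
    Status.READ: "✓✓ (blue)",
}


def advance(current: Status, reported: Status) -> Status:
    """Apply an acknowledgement; the status only ever moves forward."""
    return max(current, reported)


status = Status.SENT
status = advance(status, Status.DELIVERED)  # recipient's device acknowledged delivery
status = advance(status, Status.READ)       # recipient opened the chat
status = advance(status, Status.DELIVERED)  # a delayed duplicate ack arrives; ignored
label = TICKS[status]
```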
Question:
What if the recipient is offline?
Answer:
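The server holds the message until the recipient comes online, so the sender sees only ✓ in the meantime. Once the recipient's device receives and acknowledges the message, the server relays the delivered (and later read) status back to the sender.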
End-to-End Encryption
WhatsApp uses the Signal Protocol for end-to-end encryption. Each message is encrypted on the sender’s device and decrypted only on the receiver’s device.
Key Concepts:
- Public-private key cryptography
- Session keys and key rotation
- Forward secrecy to protect past messages even if current keys are compromised
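To illustrate the idea behind key rotation and forward secrecy, the sketch below shows a symmetric key ratchet: each message gets its own key, and the chain key is replaced by a one-way function after every message. This is only one building block and is not the Signal Protocol itself, which also includes key agreement and a Diffie-Hellman ratchet. The function name ratchet_step, the input constants, and the demo secret are assumptions for the example.

```python
import hashlib
import hmac


def ratchet_step(chain_key: bytes) -> tuple[bytes, bytes]:
    """Derive a one-time message key and the next chain key from the current chain key."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return message_key, next_chain_key


# In a real protocol the initial chain key comes from a key agreement between
# the two devices; here it is a fixed demo value.
chain_key = hashlib.sha256(b"demo shared secret, not a real key agreement").digest()

message_keys = []
for _ in range(3):
    message_key, chain_key = ratchet_step(chain_key)  # the old chain key is discarded
    message_keys.append(message_key)
```

Because HMAC is one-way, someone who obtains the current chain key cannot compute the earlier chain keys or message keys, provided the devices delete old keys after use. That is the property the forward secrecy bullet refers to.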
Group Messaging Design
Group chats are trickier as messages must be delivered to multiple recipients.
Broadcast Model:
- Sender sends one message to server.
- Server fans out that message to all group members via their respective message queues.
Optimization:
Store message once and reference it in each user's inbox to reduce duplication.
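A minimal sketch of this fan-out with a store-once optimization follows. The class GroupFanout and its methods are hypothetical, everything is held in memory, and the payload is treated as opaque bytes.

```python
import itertools


class GroupFanout:
    """Stores each group message once and fans out only its ID to member inboxes."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.messages = {}  # message_id -> (sender, payload), stored once
        self.inboxes = {}   # user_id -> list of message_ids (references only)
        self.groups = {}    # group_id -> set of member user_ids

    def create_group(self, group_id, members):
        self.groups[group_id] = set(members)
        for member in members:
            self.inboxes.setdefault(member, [])

    def send_to_group(self, group_id, sender, payload):
        message_id = next(self._ids)
        self.messages[message_id] = (sender, payload)
        for member in self.groups[group_id]:
            if member != sender:  # the sender already has the message
                self.inboxes[member].append(message_id)
        return message_id

    def read_inbox(self, user_id):
        """Resolve the references in a user's inbox to the stored messages."""
        return [self.messages[message_id] for message_id in self.inboxes[user_id]]


server = GroupFanout()
server.create_group("family", ["alice", "bob", "carol"])
message_id = server.send_to_group("family", "alice", b"<encrypted hello>")
bob_view = server.read_inbox("bob")
```

With N members, the server writes one payload and N-1 small references instead of N-1 full copies, which matters most for large groups and large payloads.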
Notifications and Offline Handling
When users are offline, messages are queued, and the server sends push notifications via FCM/APNs. Upon reconnect, the app syncs new messages.
Monitoring and Reliability
Monitor systems using tools like Prometheus, Grafana, and Sentry for real-time alerts and performance tracking. Deploy failover systems and replication to ensure high availability.
Conclusion
Designing WhatsApp involves balancing real-time delivery, storage efficiency, reliability, and encryption. By splitting the architecture into microservices and separating message content from media, we can design a scalable and robust system.
Key Takeaways
- Use persistent TCP/WebSocket connections for real-time delivery.
- Store messages temporarily using queues for offline users.
- Separate storage of messages and media for efficiency.
- Ensure security with end-to-end encryption using protocols like Signal.
- Design with scalability and fault tolerance in mind using microservices and sharding.