Rate Limiting in System Design

What is Rate Limiting?

Rate limiting is a technique used in system design to control the number of requests a user or client can make to a server within a specified time window. It protects your services from abuse, prevents system overload, and ensures fair usage among all clients.

Why Do We Need Rate Limiting?

Imagine a website that provides a public API. If one user starts making thousands of requests every second, it can slow down or even crash the server for everyone else. Rate limiting solves this by putting a cap on how many requests a user can make, such as 100 requests per minute.

Real-World Examples of Rate Limiting

Example 1: Login Attempts

Suppose a user is trying to log in to an application. If there is no limit on login attempts, a malicious actor could perform a brute-force attack by trying all possible passwords.

To avoid this, we can apply rate limiting: for example, allow at most 5 failed login attempts per account within 10 minutes. After the fifth failed attempt, block further attempts on that account for 10 minutes.
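
As a rough illustration, the policy above could be sketched as follows. This is a minimal in-memory example, not a production design: the class name LoginThrottle, its method names, and the choice to pass the current time in as a `now` argument (in seconds) are all assumptions made for this sketch.

```python
class LoginThrottle:
    """Hypothetical sketch: lock an account after too many failed logins."""

    def __init__(self, max_failures=5, window_seconds=600, lockout_seconds=600):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.lockout_seconds = lockout_seconds
        self.failures = {}      # username -> timestamps of recent failures
        self.locked_until = {}  # username -> time at which the lockout ends

    def is_locked(self, username, now):
        return now < self.locked_until.get(username, 0.0)

    def record_failure(self, username, now):
        # Keep only failures that happened inside the last window.
        recent = [t for t in self.failures.get(username, [])
                  if now - t < self.window_seconds]
        recent.append(now)
        if len(recent) >= self.max_failures:
            # Limit reached: start the lockout and reset the failure history.
            self.locked_until[username] = now + self.lockout_seconds
            recent = []
        self.failures[username] = recent


throttle = LoginThrottle()
for t in (0, 10, 20, 30, 40):          # five failed attempts in 40 seconds
    throttle.record_failure("alice", now=t)
print(throttle.is_locked("alice", now=41))   # True
print(throttle.is_locked("alice", now=640))  # False, the lockout has expired
```

A real login system would keep this state in a shared store and would usually also limit attempts per IP address, so that an attacker cannot simply switch usernames.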

Example 2: Public API Usage

Let’s say you’re offering a free weather API. You want to allow fair use but prevent abuse. You could define:

Free plan: 60 requests per hour
Paid plan: 1000 requests per hour

This encourages users to upgrade while protecting your backend resources from being overwhelmed by free users.

Intuition-Building Question

Question: If an attacker sends 1000 requests in 1 second, and you limit to 10 requests per second, will the attacker still be able to slow your system down?

Answer: It depends on where the rate limiter runs. If requests beyond the limit are dropped early (e.g., at the API gateway), only 10 of the 1000 requests reach your application, so it is largely protected. But if every request reaches the application before being rejected, the work of receiving and rejecting 1000 requests per second can still overwhelm the system.

How Rate Limiting Works

Rate limiting works by tracking the number of requests from a user (or IP address, API key, etc.) and comparing it to a pre-defined threshold. If the threshold is exceeded, the server rejects or delays further requests until capacity is available again, for example when the time window resets.

Common Algorithms Used for Rate Limiting

1. Fixed Window

In this method, time is divided into fixed windows (e.g., 1 minute). If the limit is 100 requests per minute, the system allows up to 100 requests from a user in each window. When the window ends, the counter resets to zero.

Problem: A user can send 100 requests at the end of one window and 100 at the start of the next, effectively sending 200 requests within a few seconds, which is twice the intended limit.
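
To make the boundary problem concrete, here is a minimal in-memory sketch of a fixed window counter. The class name FixedWindowLimiter and the `now` parameter (a timestamp in seconds, passed in explicitly so the example is deterministic) are assumptions made for illustration.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Counts requests per client in fixed, non-overlapping windows."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        # (client_id, window_index) -> requests accepted in that window.
        # Old windows are never deleted here; a real limiter would expire them.
        self.counts = defaultdict(int)

    def allow(self, client_id, now):
        window_index = int(now // self.window_seconds)
        key = (client_id, window_index)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


limiter = FixedWindowLimiter(limit=100, window_seconds=60)
# 100 requests at t=59s (end of window 0), then 100 at t=60s (start of window 1)
late = sum(limiter.allow("alice", now=59.0) for _ in range(100))
early = sum(limiter.allow("alice", now=60.0) for _ in range(100))
print(late + early)  # 200: all accepted, although they arrived about 1 second apart
```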

2. Sliding Window Log

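This approach logs the timestamp of each request and removes timestamps that fall outside the current time window. A request is allowed only if the log holds fewer entries than the limit. This tracks the request rate more accurately than a fixed window, but it uses more memory and processing because every request is stored.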
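
A minimal sketch of this idea, assuming an in-memory store and a hypothetical class name SlidingWindowLog, might look like this. The current time is passed in as `now` (seconds) to keep the example deterministic.

```python
from collections import defaultdict, deque


class SlidingWindowLog:
    """Keeps one timestamp per accepted request for each client."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.logs = defaultdict(deque)  # client_id -> timestamps, oldest first

    def allow(self, client_id, now):
        log = self.logs[client_id]
        # Remove timestamps that are no longer inside the window ending at `now`.
        while log and log[0] <= now - self.window_seconds:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True


limiter = SlidingWindowLog(limit=3, window_seconds=60)
print([limiter.allow("alice", now=t) for t in (0, 10, 20, 30, 61)])
# [True, True, True, False, True]
```

The request at t=30 is rejected because three requests already fall inside the last 60 seconds; by t=61 the request from t=0 has aged out, so there is room again.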

3. Sliding Window Counter

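A more memory-efficient approximation of the sliding window log. Instead of storing every timestamp, it keeps two counters (e.g., for the current minute and the previous minute) and estimates the number of requests in the sliding window by adding the current count to the previous count, weighted by the fraction of the previous minute that still overlaps the sliding window.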
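
One common way to write the estimate is: current count + previous count × (1 − fraction of the current window that has elapsed). The sketch below applies that formula for a single client. The class name SlidingWindowCounter is hypothetical, and the code assumes that `now` (seconds) never moves backwards.

```python
class SlidingWindowCounter:
    """Approximates a sliding window using two fixed-window counters."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.current_index = 0
        self.current_count = 0
        self.previous_count = 0

    def _roll(self, now):
        index = int(now // self.window_seconds)
        if index == self.current_index + 1:
            # Moved into the next window: current becomes previous.
            self.previous_count = self.current_count
            self.current_count = 0
        elif index > self.current_index + 1:
            # At least one whole window passed with no traffic.
            self.previous_count = 0
            self.current_count = 0
        self.current_index = index

    def estimate(self, now):
        self._roll(now)
        elapsed_fraction = (now % self.window_seconds) / self.window_seconds
        return self.current_count + self.previous_count * (1 - elapsed_fraction)

    def allow(self, now):
        if self.estimate(now) >= self.limit:
            return False
        self.current_count += 1
        return True


limiter = SlidingWindowCounter(limit=100, window_seconds=60)
first = sum(limiter.allow(now=30.0) for _ in range(80))   # 80 accepted in window 0
# At t=75s the current window is 25% elapsed, so the estimate starts at 80 * 0.75 = 60.
second = sum(limiter.allow(now=75.0) for _ in range(50))
print(first, second)  # 80 40
```

The estimate assumes requests in the previous window were spread evenly, so it can be slightly off, but it needs only two counters per client.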

4. Token Bucket

This is one of the most commonly used algorithms. Here's how it works:

You have a bucket with a fixed capacity that fills with tokens at a fixed rate (e.g., 1 token per second). Once the bucket is full, extra tokens are discarded.
Each request needs a token. If a token is available, the request is allowed, and the token is removed.
If no token is available, the request is denied.

Advantage: This allows for some burst traffic, because unused tokens accumulate over time, up to the bucket's capacity.
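
The steps above can be sketched in a few lines. This is a minimal single-client example, not a definitive implementation: the class name TokenBucket, the explicit `capacity` parameter, and the choice to start with a full bucket are assumptions. Tokens are refilled lazily, based on the time elapsed since the last call, rather than by a background timer.

```python
class TokenBucket:
    """Single-client token bucket with lazy refill."""

    def __init__(self, capacity, refill_rate, now=0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate   # tokens added per second
        self.tokens = float(capacity)    # assumption: the bucket starts full
        self.last_refill = now

    def allow(self, now):
        # Add the tokens earned since the last call, capped at the capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(capacity=5, refill_rate=1.0)
burst = sum(bucket.allow(now=0.0) for _ in range(6))
later = sum(bucket.allow(now=2.0) for _ in range(3))
print(burst, later)  # 5 2
```

The burst of 6 requests at t=0 gets 5 through (the full bucket), and after 2 seconds only the 2 newly earned tokens are available.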

5. Leaky Bucket

Imagine a bucket with a hole in the bottom. Water (requests) enters the bucket at any rate, but leaks out (is processed) at a steady rate. If the bucket is full, new requests overflow and are dropped.

This algorithm smooths out bursty traffic by ensuring a constant output rate.
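
Here is a minimal sketch of the queue-based form of this algorithm, under a few assumptions: the class name LeakyBucket and its `offer` and `leak` methods are hypothetical, the bucket has a fixed `capacity`, and the caller invokes `leak` periodically with the current time in seconds.

```python
from collections import deque


class LeakyBucket:
    """Queues incoming requests and releases them at a steady rate."""

    def __init__(self, capacity, leak_rate, now=0.0):
        self.capacity = capacity     # maximum number of queued requests
        self.leak_rate = leak_rate   # requests released per second
        self.queue = deque()
        self.last_leak = now

    def offer(self, request):
        """Try to add a request; returns False if the bucket overflows."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def leak(self, now):
        """Return the requests that drain out between the last leak and `now`."""
        slots = int((now - self.last_leak) * self.leak_rate)
        drained = []
        if slots > 0:
            for _ in range(min(slots, len(self.queue))):
                drained.append(self.queue.popleft())
            # Advance by whole slots only, so fractional time is not lost.
            self.last_leak += slots / self.leak_rate
        return drained


bucket = LeakyBucket(capacity=3, leak_rate=1.0)
accepted = [bucket.offer(f"r{i}") for i in range(5)]  # burst of 5 at t=0
print(accepted)              # [True, True, True, False, False]
print(bucket.leak(now=2.0))  # ['r0', 'r1']
print(bucket.leak(now=3.0))  # ['r2']
```

However large the incoming burst is, requests leave the bucket at no more than 1 per second, and anything beyond the capacity of 3 is dropped.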

Question to Deepen Understanding

Question: Which is better for APIs that can tolerate occasional bursts: Token Bucket or Leaky Bucket?

Answer: The Token Bucket is better for bursty traffic because it lets tokens accumulate and passes short spikes through immediately. The Leaky Bucket enforces a constant output rate: bursts are queued and released gradually, and requests that overflow the bucket are dropped.

Where to Apply Rate Limiting?

API Gateways (e.g., Kong, AWS API Gateway)
Web Application Firewalls (WAF)
Backend services
Login endpoints
Message queues or task runners

HTTP Headers Used in Rate Limiting

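Many APIs include rate limit headers in their responses. The names below are a widely used convention rather than a formal standard, so check each API's documentation:

X-RateLimit-Limit: Maximum number of requests allowed in the current window
X-RateLimit-Remaining: Remaining requests before hitting the limit
X-RateLimit-Reset: Time at which the limit resets (depending on the API, a Unix timestamp or a number of seconds)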

What Happens When the Limit is Exceeded?

When a client exceeds the rate limit, the server usually returns the HTTP status code 429 Too Many Requests, often with a Retry-After header that tells the client how long to wait. This is a signal to the client to back off and retry later.
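
As an illustration, a server-side helper that builds such a response might look like the sketch below. The function name rate_limit_response and its parameters are hypothetical, and the sketch assumes X-RateLimit-Reset carries a Unix timestamp. Retry-After is a standard HTTP header, given here as a number of seconds.

```python
import math


def rate_limit_response(limit, reset_at, now):
    """Build (status, headers, body) for a client that exceeded its limit.

    `reset_at` and `now` are Unix timestamps in seconds.
    """
    retry_after = max(0, math.ceil(reset_at - now))
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": str(int(reset_at)),  # assumed format: Unix timestamp
        "Retry-After": str(retry_after),          # seconds the client should wait
    }
    return 429, headers, "Too Many Requests"


status, headers, body = rate_limit_response(limit=100, reset_at=1060, now=1000.5)
print(status, headers["Retry-After"])  # 429 60
```

A well-behaved client reads Retry-After (or the reset header) and waits at least that long before sending the next request.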

Best Practices

Apply rate limiting as close as possible to the source of traffic (such as edge servers or API gateways)
Use descriptive headers so clients know how to behave
Log rate limit events for monitoring and alerting
Have a clear policy for rate limits in your API documentation

Conclusion

Rate limiting is an important component in system design, especially for scalable and resilient systems. It ensures that resources are used fairly and efficiently, prevents abuse, and keeps your services healthy under load.

As a beginner, you should start recognizing where and how to apply rate limiting in real-world systems such as login APIs, third-party services, or any shared resource.
