What is Rate Limiting?
Rate limiting is a technique used in system design to control the number of requests a user or client can make to a server within a specified time window. It protects your services from abuse, prevents system overload, and ensures fair usage among all clients.
Why Do We Need Rate Limiting?
Imagine a website that provides a public API. If one user starts making thousands of requests every second, it can slow down or even crash the server for everyone else. Rate limiting solves this by putting a cap on how many requests a user can make, such as 100 requests per minute.
Real-World Examples of Rate Limiting
Example 1: Login Attempts
Suppose a user is trying to log in to an application. If there is no limit on login attempts, a malicious actor could perform a brute-force attack by trying all possible passwords.
To avoid this, we can apply rate limiting: for example, only allow 5 login attempts per 10 minutes. After 5 failed attempts, block the user for 10 minutes.
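Here is a minimal in-memory sketch of this policy in Python (the function names login_allowed and record_failed_login are illustrative, not from any particular framework):

```python
import time
from collections import defaultdict

# A minimal in-memory sketch (not production-ready): allow at most
# MAX_ATTEMPTS failed logins per user within WINDOW_SECONDS.
MAX_ATTEMPTS = 5
WINDOW_SECONDS = 600  # 10 minutes

failed_attempts = defaultdict(list)  # username -> timestamps of failures

def login_allowed(username: str) -> bool:
    now = time.time()
    # Keep only failures that fall inside the current window.
    recent = [t for t in failed_attempts[username] if now - t < WINDOW_SECONDS]
    failed_attempts[username] = recent
    return len(recent) < MAX_ATTEMPTS

def record_failed_login(username: str) -> None:
    failed_attempts[username].append(time.time())
```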
Example 2: Public API Usage
Let’s say you’re offering a free weather API. You want to allow fair use but prevent abuse. You could define:
- Free plan: 60 requests per hour
- Paid plan: 1000 requests per hour
This encourages users to upgrade while protecting your backend resources from being overwhelmed by free users.
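In code, such a tiered policy can start as a simple lookup table. A minimal sketch, with plan names and limits mirroring the example above:

```python
# Hypothetical per-plan limits; the numbers mirror the example above.
RATE_LIMITS = {
    "free": 60,    # requests per hour
    "paid": 1000,  # requests per hour
}

def hourly_limit(plan: str) -> int:
    # Unknown or missing plans fall back to the free tier.
    return RATE_LIMITS.get(plan, RATE_LIMITS["free"])
```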
Intuition-Building Question
Question: If an attacker sends 1000 requests in 1 second, and you limit to 10 requests per second, will the attacker still be able to slow your system down?
Answer: It depends on how the rate limiter is implemented. If requests beyond the limit are dropped early (e.g., at the API gateway), then your system will stay safe. But if all requests reach the application before being rejected, the system can still become overwhelmed.
How Rate Limiting Works
Rate limiting works by tracking the number of requests from a user (or IP address, API key, etc.) and comparing it to a pre-defined threshold. If the threshold is exceeded, the server blocks further requests until the time window resets.
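Stripped down to its essence, that logic looks something like the sketch below: one counter per client key, reset when the window expires. The algorithms in the next section are different strategies for doing this counting:

```python
import time

LIMIT = 100      # maximum requests allowed per window
WINDOW = 60.0    # window length in seconds

counters = {}    # key (user, IP, API key, ...) -> (window_start, count)

def allow_request(key: str) -> bool:
    now = time.time()
    start, count = counters.get(key, (now, 0))
    if now - start >= WINDOW:
        start, count = now, 0        # window expired: start counting afresh
    if count >= LIMIT:
        return False                 # threshold exceeded: block the request
    counters[key] = (start, count + 1)
    return True
```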
Common Algorithms Used for Rate Limiting
1. Fixed Window
In this method, time is divided into fixed windows (e.g., 1 minute). If the limit is 100 requests per minute, the system allows up to 100 requests from a user within that minute. When the next minute begins, the counter resets to zero.
Problem: A user can send 100 requests at the end of one window and 100 at the start of the next—effectively sending 200 requests in a short burst.
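A minimal fixed window sketch in Python (illustrative values; note that the boundary-burst problem described above applies to it):

```python
import time
from collections import defaultdict

LIMIT = 100
WINDOW = 60  # seconds

# Requests are counted per (key, window number). Old windows are never
# cleaned up in this sketch; a real implementation would expire them.
counts = defaultdict(int)

def allow(key: str) -> bool:
    window_id = int(time.time() // WINDOW)  # which minute we are in
    if counts[(key, window_id)] >= LIMIT:
        return False
    counts[(key, window_id)] += 1
    return True
```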
2. Sliding Window Log
This approach logs the timestamp of each request and removes old ones that fall outside the current time window. It tracks request rates more accurately than a fixed window but uses more memory and processing.
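A sketch of the idea, storing one timestamp per request (memory grows with the limit, as noted above):

```python
import time
from collections import defaultdict, deque

LIMIT = 100
WINDOW = 60.0  # seconds

logs = defaultdict(deque)  # key -> timestamps of recent requests

def allow(key: str) -> bool:
    now = time.time()
    log = logs[key]
    # Evict timestamps that have fallen outside the sliding window.
    while log and now - log[0] >= WINDOW:
        log.popleft()
    if len(log) >= LIMIT:
        return False
    log.append(now)
    return True
```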
3. Sliding Window Counter
A more efficient version of the sliding window log. It approximates the number of requests using two time buckets (e.g., current minute and previous minute), combining them based on how much of each overlaps with the actual time range.
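A sketch of this approximation, assuming the same 100-requests-per-minute limit:

```python
import time

LIMIT = 100
WINDOW = 60.0  # seconds

state = {}  # key -> (window_id, current_count, previous_count)

def allow(key: str) -> bool:
    now = time.time()
    window_id = int(now // WINDOW)
    cur_id, cur, prev = state.get(key, (window_id, 0, 0))
    if window_id != cur_id:
        # Roll over: the old current count becomes the previous count,
        # unless more than one full window has passed since then.
        prev = cur if window_id == cur_id + 1 else 0
        cur_id, cur = window_id, 0
    # Weight the previous window by how much of it the sliding
    # 60-second range still overlaps.
    overlap = 1.0 - (now % WINDOW) / WINDOW
    if cur + prev * overlap >= LIMIT:
        return False
    state[key] = (cur_id, cur + 1, prev)
    return True
```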
4. Token Bucket
This is one of the most commonly used algorithms. Here's how it works:
- You have a bucket that fills with tokens at a fixed rate (e.g., 1 token per second).
- Each request needs a token. If a token is available, the request is allowed, and the token is removed.
- If no token is available, the request is denied.
Advantage: This allows for some burst traffic, as tokens can be accumulated over time.
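A minimal token bucket sketch; the capacity and refill rate are arbitrary example values:

```python
import time

CAPACITY = 10      # maximum tokens, i.e. maximum burst size
REFILL_RATE = 1.0  # tokens added per second

buckets = {}  # key -> (tokens, last_refill_time)

def allow(key: str) -> bool:
    now = time.time()
    tokens, last = buckets.get(key, (CAPACITY, now))
    # Refill for the time elapsed since the last check, capped at CAPACITY.
    tokens = min(CAPACITY, tokens + (now - last) * REFILL_RATE)
    if tokens < 1:
        buckets[key] = (tokens, now)
        return False
    buckets[key] = (tokens - 1, now)  # spend one token on this request
    return True
```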
5. Leaky Bucket
Imagine a bucket with a hole in the bottom. Water (requests) can enter the bucket at any rate, but it leaks out at a steady rate. If the bucket is full, incoming requests overflow and are dropped.
This algorithm smooths out bursty traffic by ensuring a constant output rate.
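One common way to implement this is the "leaky bucket as a meter" variant sketched below: the water level drains steadily over time, and a request is dropped if adding it would overflow the bucket:

```python
import time

CAPACITY = 10    # how much the bucket holds before it overflows
LEAK_RATE = 1.0  # requests drained per second

buckets = {}  # key -> (water_level, last_leak_time)

def allow(key: str) -> bool:
    now = time.time()
    level, last = buckets.get(key, (0.0, now))
    # Drain the bucket for the time elapsed since the last check.
    level = max(0.0, level - (now - last) * LEAK_RATE)
    if level + 1 > CAPACITY:
        buckets[key] = (level, now)
        return False  # bucket would overflow: drop the request
    buckets[key] = (level + 1, now)
    return True
```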
Question to Deepen Understanding
Question: Which is better for APIs that can tolerate occasional bursts — Token Bucket or Leaky Bucket?
Answer: The Token Bucket is better for bursty traffic because it allows tokens to accumulate and supports short spikes. Leaky Bucket enforces a more consistent rate and discards bursts.
Where to Apply Rate Limiting?
- API Gateways (e.g., Kong, AWS API Gateway)
- Web Application Firewalls (WAF)
- Backend services
- Login endpoints
- Message queues or task runners
HTTP Headers Used in Rate Limiting
Many APIs include rate limit information in their response headers:
- X-RateLimit-Limit: Maximum number of requests allowed
- X-RateLimit-Remaining: Remaining requests before hitting the limit
- X-RateLimit-Reset: Time at which the limit resets
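For example, a client could read these headers to pace itself. A brief sketch using the requests library (the URL is hypothetical, and exact header names vary by provider):

```python
import requests  # third-party HTTP client, used here for illustration

resp = requests.get("https://api.example.com/weather")  # hypothetical URL

limit = resp.headers.get("X-RateLimit-Limit")
remaining = resp.headers.get("X-RateLimit-Remaining")
reset = resp.headers.get("X-RateLimit-Reset")
print(f"{remaining}/{limit} requests left; limit resets at {reset}")
```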
What Happens When the Limit is Exceeded?
When a client exceeds the rate limit, the server usually returns HTTP status code 429 Too Many Requests, often along with a Retry-After header. This is a signal to the client to wait and retry after some time.
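A simple client-side sketch that honors this signal (the helper get_with_retry is illustrative):

```python
import time
import requests  # third-party HTTP client

def get_with_retry(url: str, max_retries: int = 3) -> requests.Response:
    for attempt in range(max_retries):
        resp = requests.get(url)
        if resp.status_code != 429:
            return resp
        # Honor Retry-After when present (this sketch handles the
        # seconds form only, not the HTTP-date form); otherwise back
        # off exponentially: 1s, 2s, 4s, ...
        wait = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    return resp
```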
Best Practices
- Apply rate limiting closest to the source of traffic (like edge servers or API gateways)
- Use descriptive headers so clients know how to behave
- Log rate limit events for monitoring and alerting
- Have a clear policy for rate limits in your API documentation
Conclusion
Rate limiting is a crucial component in system design, especially for scalable and resilient systems. It ensures that resources are used fairly and efficiently, prevents abuse, and keeps your services healthy under load.
As a beginner, you should start recognizing where and how to apply rate limiting in real-world systems such as login APIs, third-party services, or any shared resource.