Designing a rate limiter
From per-process counters to regional token buckets and adaptive abuse controls - five revisions that expose the traps in a deceptively small interview prompt.
Why this design question bites
On paper a rate limiter is one piece of middleware: count requests, return 429 Too Many Requests when a caller is over the line, otherwise forward. That's why interviewers reach for it. The naive version works on a demo. Real traffic asks harder questions almost immediately. Which identity are you actually limiting? What happens once the API runs across ten servers behind a load balancer? Can a smart caller time bursts around your window boundaries? And what does the limiter do when its own dependency starts getting slow?
The whole point is to be much faster than the service you're protecting, while staying accurate enough to actually catch abuse. The hard part isn't building one globally consistent counter. It's deciding where you really need exactness, where approximation is fine, and how to roll out policy changes without melting your customers in the process.
Requirements we'll design against
- Functional: enforce per-API-key, per-user, per-IP, and per-endpoint limits; return 429 with Retry-After; support plan tiers; update policies without redeploying application code.
- Non-functional: add less than 5 ms p99 inside a region; protect a backend serving around 100k requests per second; tolerate regional Redis failures with explicit route-level behavior; produce enough telemetry to debug false positives and abuse spikes.
Kata setup
Scored characteristics
Archie reviews every rate limiter iteration against the same three quality attributes, then turns the lowest-scoring gaps into the next concrete design move.
Performance
The system's responsiveness and throughput under various loads. This includes latency, stress testing, peak analysis, and capacity planning.
Scalability
The ability of the system to handle an increasing number of users, transactions, or data volume, typically by adding or removing resources (elasticity).
Abuse resistance
How well the design resists evasive clients, false positives, policy mistakes, and attacks that try to exploit the limiter itself.
Five versions, five failure modes
Each version below fixes the failure mode that broke the version before it. The point isn't to leap to a maximal architecture in one step. It's to make the next bottleneck visible, then handle it on purpose.
v1 - In-memory fixed windows
Start with the naive middleware. Every API process keeps a map of counters keyed by identity:endpoint:minute. Above the threshold you return 429, otherwise you forward.
It's fast because there are no network calls. It's also only correct for one process. Run ten API workers behind a load balancer and a client gets ten quotas instead of one. Restarts and autoscaling reset the counters. And the fixed-window algorithm lets a caller double their effective rate by sending the back half of one window plus the front half of the next.
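Here's v1 as a minimal sketch, assuming a generic request handler; LIMIT, allow, and the key format are illustrative:

```python
import time
from collections import defaultdict

LIMIT = 100  # requests per identity:endpoint per minute (illustrative)

# Per-process state: ten workers means ten independent copies of this map,
# and a restart or autoscale event wipes it. Stale windows also accumulate.
counters: dict[str, int] = defaultdict(int)

def allow(identity: str, endpoint: str) -> bool:
    window = int(time.time() // 60)  # fixed one-minute window
    key = f"{identity}:{endpoint}:{window}"
    counters[key] += 1
    return counters[key] <= LIMIT
```

The boundary problem is visible in the key: LIMIT requests at second 59 and LIMIT more at second 61 land in different windows, doubling the effective rate across those two seconds.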
v2 - Shared Redis counters
Move the counters into Redis. Every API worker increments the same key for identity:endpoint:window, and the per-key TTL handles cleanup. That kills the worst v1 problem: load balancing stops multiplying a client's effective quota.
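As a sketch, assuming redis-py; the key format and 120-second TTL are illustrative:

```python
import time
import redis

r = redis.Redis(host="redis", port=6379)

def allow(identity: str, endpoint: str, limit: int = 100) -> bool:
    window = int(time.time() // 60)
    key = f"rl:{identity}:{endpoint}:{window}"
    pipe = r.pipeline()
    pipe.incr(key)         # atomic: every worker increments the same counter
    pipe.expire(key, 120)  # TTL sweeps old windows away
    count, _ = pipe.execute()
    return count <= limit
```

INCR is atomic, so the count itself is fleet-accurate; what's still missing is everything the callout below covers.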
Shared state is not the whole design
Redis fixes fleet-wide visibility, but the fixed-window algorithm still leaks bursts at window boundaries. Be honest about what v2 actually solved: every worker now sees the same state. Admission decisions still aren't smooth, and any decision composed of multiple commands still isn't race-free.
v3 - Atomic token buckets
Swap fixed windows for token buckets, implemented as a single Redis Lua script. Each quota key holds two values: the current token count and the last refill timestamp. The script refills based on elapsed time, consumes a token if one's available, refreshes the TTL, and returns the allow/deny decision along with the metadata clients actually need (remaining, retry-after). This is also a good moment to lift the limiter out of the API workers and run it as its own service behind the API Gateway. Admission gets a stable contract, and the algorithm can evolve without redeploying every backend.
One script, one decision
Don't build the quota decision out of several client-side Redis commands. One Lua script keeps refill, consume, TTL update, and retry-after calculation atomic, so two workers can't both observe and spend the same token. Two details bite if you skip them. First, read the bucket clock from redis.call('TIME') inside the script, not from the client's wall clock; that kills skew across workers and stops a client from forging a future timestamp to refill early. Second, on Redis Cluster, multi-key rules (per-user × per-endpoint enforced together, for example) need all keys in the same hash slot, so name them with a shared hash tag like rl:{acme}:user:123 and rl:{acme}:ep:search. Without it, Cluster rejects the EVAL with a CROSSSLOT error.
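Here's a sketch of that script, assuming redis-py and Redis 5+ (script effects replication is what lets a script read TIME before writing); the key name, field names, and the 100-token bucket are illustrative:

```python
import redis

# Token bucket as one atomic decision. The clock comes from redis.call('TIME'),
# so every worker shares the server's clock and clients can't forge timestamps.
TOKEN_BUCKET = """
local cap    = tonumber(ARGV[1])   -- bucket capacity
local rate   = tonumber(ARGV[2])   -- tokens refilled per second
local ttl    = tonumber(ARGV[3])   -- key lifetime in seconds
local t      = redis.call('TIME')  -- {seconds, microseconds}, server-side clock
local now    = tonumber(t[1]) + tonumber(t[2]) / 1e6
local b      = redis.call('HMGET', KEYS[1], 'tokens', 'stamp')
local tokens = tonumber(b[1]) or cap   -- missing key = full bucket
local stamp  = tonumber(b[2]) or now
tokens = math.min(cap, tokens + (now - stamp) * rate)  -- refill for elapsed time
local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end
redis.call('HSET', KEYS[1], 'tokens', tokens, 'stamp', now)
redis.call('EXPIRE', KEYS[1], ttl)
local retry_after = 0
if allowed == 0 then
  retry_after = math.ceil((1 - tokens) / rate)  -- seconds until one token exists
end
-- tokens goes back as a string: Lua numbers truncate to integers on return
return {allowed, tostring(tokens), retry_after}
"""

r = redis.Redis()
bucket = r.register_script(TOKEN_BUCKET)  # EVALSHA with automatic NOSCRIPT fallback

# The {acme} hash tag pins this key (and any sibling keys) to one Cluster slot.
allowed, remaining, retry_after = bucket(
    keys=["rl:{acme}:user:123"],
    args=[100, 100 / 60, 120],  # capacity, refill per second, TTL
)
```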
v4 - Regional edge enforcement
The v3 limiter is accurate inside one region, but a user in Sydney shouldn't have to cross an ocean before your API can answer them. Deploy the limiter service and Redis cluster regionally. Add an edge worker for the coarse stuff: obvious bot bursts, abusive IP ranges, anything that doesn't need precise quota state to reject. Replicate policy outward as versioned snapshots, so each region enforces locally while operators still manage rules from one place.
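One way to shape the replicated policy, as a hedged sketch; PolicySnapshot and the rule layout are assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class PolicySnapshot:
    version: int  # assigned centrally, strictly increasing
    rules: dict[str, dict] = field(default_factory=dict)
    # e.g. {"rl:{acme}:ep:search": {"rate": 50, "burst": 100}}

current = PolicySnapshot(version=0)

def apply_snapshot(candidate: PolicySnapshot) -> None:
    """Regions only move forward: a stale or replayed snapshot is a no-op,
    so out-of-order delivery can't roll a region back to old limits."""
    global current
    if candidate.version > current.version:
        current = candidate
```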
Regional accuracy is usually enough
For API protection, it's usually fine that each region enforces a local share of the quota. Exact worldwide counters are slower and a lot more painful to operate. If the limit is contractual or billing-grade, design that as a separate global reservation or reconciliation problem instead of stretching the limiter to do it.
Does the budget actually fit? Rough math.
100k rps regional with a 5 ms p99 budget — does that actually fit? A token-bucket EVALSHA on a warm Redis primary lands in roughly 0.2–0.5 ms server-side, so the dominant cost is the round trip and the gateway's own work, not the script. One Redis primary handles ~80–150k EVAL ops/sec before tail latency starts to widen, so plan on 2–4 shards in the regional cluster (sized for headroom, not steady state) and pipeline batches per gateway connection. The 5 ms budget breaks down to roughly 1 ms gateway → limiter, 1 ms limiter → Redis round trip, <1 ms inside Lua, 1 ms back; the rest is slack for GC, retries, and the occasional Cluster reroute. Pool the limiter → Redis connections at (cores × 2) per shard so you don't queue under burst.
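The same arithmetic in a few lines, using the mid-range assumptions above (every constant here is illustrative):

```python
regional_rps = 100_000       # peak regional request rate
evals_per_primary = 100_000  # mid-range sustainable EVAL ops/sec per primary
headroom = 0.5               # run shards at ~50% of sustainable, not at the edge

usable = int(evals_per_primary * headroom)    # 50_000 per shard
shards = -(-regional_rps // usable)           # ceiling division -> 2
print(f"shards at steady state: {shards}")    # 2; provision 2-4 for headroom

cores = 16                                    # illustrative limiter host
print(f"connections per shard: {cores * 2}")  # pool sizing from the text: 32
```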
Plan for hot keys before you need to
A celebrity tenant or a single hot endpoint maps to one quota key, which maps to one Cluster slot, which maps to one Redis shard. That shard becomes the regional bottleneck. Two real mitigations, sketched below. Shard the quota key: split it into, say, 16 sub-counters, each with rate/16, plus a soft sum-of-shards reconciliation for cross-shard fairness. Put the sub-shard index inside the hash tag (rl:{acme:s0}:user:123 through rl:{acme:s15}:user:123) so the sub-counters actually land in different slots; with the index outside the tag, all 16 share one slot and the hot shard remains. You accept a small overshoot in exchange for removing the hot spot. Or local pre-decrement in the limiter process: each instance leases N tokens, decrements them locally, and reconciles periodically with Redis. That's the fastest path, but only safe for non-billing-grade limits. Catch hot keys with redis-cli --hotkeys or the slowlog before they turn into an incident.
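The key-sharding mitigation as a sketch; SUB_SHARDS and the random spread are the illustrative parts:

```python
import random

SUB_SHARDS = 16  # split one hot quota key into 16 sub-counters

def sharded_key(tenant: str, user: str) -> str:
    s = random.randrange(SUB_SHARDS)
    # The sub-shard index lives inside the hash tag, so Cluster maps each
    # sub-counter to its own slot instead of piling all 16 onto one shard.
    return f"rl:{{{tenant}:s{s}}}:user:{user}"

def per_shard_rate(rate: float) -> float:
    return rate / SUB_SHARDS  # each sub-counter enforces 1/16 of the quota

# Each request spends a token from one randomly chosen sub-counter.
# Worst-case overshoot is small and bounded; a periodic sum-of-shards
# job reconciles cross-shard fairness, as described above.
key = sharded_key("acme", "123")  # e.g. rl:{acme:s7}:user:123
```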
v5 - Production controls and abuse feedback
The final version keeps the hot path local and moves the learning loop off it. The limiter emits a compact event for every allow and deny. Dashboards plot deny rate, false-positive rate, retry-after distribution, Redis latency, and which routes are currently in a failover mode. Abuse scoring reads that same stream and pushes adaptive limits back through the policy store as new versioned rules. New rules run in shadow mode first; nothing gets enforced until the shadow data looks sane. And every route has an explicit answer for what to do when Redis goes wobbly — fail open, fail closed, or fall back to a local emergency bucket.
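What "an explicit answer" can look like as configuration, sketched with illustrative routes; the point is that the failure mode is a reviewed policy choice, not an accident of exception handling:

```python
from enum import Enum

class RedisDown(Enum):
    FAIL_OPEN = "allow"      # availability over protection
    FAIL_CLOSED = "deny"     # protection over availability
    LOCAL_BUCKET = "local"   # degrade to a coarse per-process emergency bucket

ROUTE_POLICY = {
    "GET /search":  RedisDown.FAIL_OPEN,     # read traffic: let it through
    "POST /orders": RedisDown.LOCAL_BUCKET,  # writes: keep coarse protection
    "POST /login":  RedisDown.FAIL_CLOSED,   # credential-stuffing target: deny
}

def on_redis_error(route: str) -> RedisDown:
    return ROUTE_POLICY.get(route, RedisDown.LOCAL_BUCKET)  # safe default
```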
Known limitations
v5 still isn't an exact global quota system. A client willing to spray traffic across all your regions can collect a regional allowance in each one until the async abuse controls notice. For protecting an API from abuse, that's a fair tradeoff for the latency you save. For billing-grade quotas, contractual limits, or anything carving up a genuinely scarce resource, you'd want reservations, reconciliation, or a separate global quota service.
It also assumes whoever calls the limiter has already worked out who is calling. Authentication, bot detection, IP reputation, and client fingerprinting all feed the limiter, but each is a separate system with its own false positives and privacy tradeoffs.
What Archie unlocked, step by step
Archie's reviews walk the design through the same order a strong interview answer follows. Make the state shared. Make the decisions atomic and fair. Move enforcement closer to users. Then add the controls that stop policy changes from hurting real traffic.
Score lift under Archie's reviews
Each step is one Archie review — the gap it flagged became the next version's design move. Rows are ordered from the area Archie still nudges hardest to the one it took furthest, with the overall score on the bottom row. Deltas show the total lift from v1 to v5.
- Scalability: 24 → 84 (+60)
- Abuse resistance: 26 → 86 (+60)
- Performance: 37 → 88 (+51)
- Overall: 29 → 86 (+57)