Rate limiter for external APIs.
Build a production-grade rate limiter that throttles outbound calls to a
third-party API (an OTA, an LLM provider, a payment gateway), preserves
per-tenant FIFO order on retries, has zero coordination overhead between
multiple consumers, and scales linearly. The pattern uses the
.gate() primitive shipped in the
queen-mq
package, with the per-partition lease as natural backpressure — no
deferred queue, no timers, no Redis.
(8 runners, 100 tenants)
53s · 0 errors
vs. configured target
minimum drain time
Numbers are from a stress run on a single 10-core M-series Mac with one
PostgreSQL 18 container. Reproduce with
streams/examples/08-rate-limiter-stress.js.
1. The problem
Many APIs we depend on impose strict rate limits, and exceeding them costs money or service:
- OTAs (Booking, Airbnb, VRBO/Expedia, Hotel-Nerds, Google Hotel) — typically X req/min or Y operations/min per partner key, with suspension as the penalty.
- WhatsApp Cloud / SendGrid / Twilio — per-phone-number or per-IP tiers; exceeding triggers throttling or temporary bans.
- LLM providers (OpenAI, Anthropic) — TPM (tokens / min) and RPM (requests / min) per API key; exceeding returns 429 and a Retry-After.
A naïve approach (fire requests as fast as they arrive, retry on 429) is brittle: it floods the upstream, burns retries, and cannot enforce fairness across tenants. A correct rate limiter has to:
- Hold the surplus durably (no in-memory queue that dies on restart).
- Preserve per-key FIFO order across retries (otherwise a stale "set room available = false" can overwrite a fresh "set true").
- Coordinate across multiple worker processes without race conditions on the budget counter.
- Expose state for ops debugging.
Queen + streams gives all four properties out of the box.
2. The pattern
A producer pushes outbound API calls onto a Queen queue partitioned by
the rate-limit key (typically tenant, or
tenant + endpoint). A streaming pipeline reads that queue
through a .gate() operator that runs a token bucket per
partition. Allowed messages are forwarded to a sink queue from which the
actual API caller drains.
When the gate denies a message, the runner skips the cycle entirely (no state mutation, no ack, no sink push) and lets the source-queue lease expire naturally. The broker then redelivers the un-acked tail of the batch in its original order. This is what makes ordering FIFO on retries — the same partition lease sits with the runner across the deny → expiry → redeliver cycle, so denied messages can never overtake later messages.
Pushing denied messages onto a "retry later" queue (the typical Kafka-shaped pattern) breaks per-key FIFO order: by the time the deferred message rejoins the main queue, newer messages may have already passed through. Queen's per-partition lease lets us avoid this entirely — the broker itself is the backpressure mechanism.
3. Quickstart
The smallest working example. One tenant, 30 messages pushed in burst, bucket of 10 with 5/sec refill. The first 10 should arrive almost instantly, the rest at 5/sec.
Every snippet on this page was run against a live Queen server at
http://localhost:6632 before being included. The
quickstart drains in ~6 seconds with all 30 messages arriving in
seq order.
The broker client and the streaming SDK ship in the same package on every supported runtime:
# Node.js (npm)
npm install queen-mq
# Python (pip)
pip install queen-mq
# Go
go get github.com/smartpricing/queen/clients/client-go
// JavaScript
import { Queen, Stream, tokenBucketGate, slidingWindowGate } from 'queen-mq'
# Python
from queen import Queen, Stream, token_bucket_gate, sliding_window_gate
// Go
import (
queen "github.com/smartpricing/queen/clients/client-go"
"github.com/smartpricing/queen/clients/client-go/streams"
"github.com/smartpricing/queen/clients/client-go/streams/helpers"
)
The same .gate() primitive, helper factories, and
config_hash wire format work identically across all three
languages. A query registered by the JS client can be resumed by a
worker written in Go (or vice versa) — the SHA-256 of the operator
chain matches across runtimes.
The bucket takes ~6s to drain 30 messages: 10 burst (instant) + 20 more at 5/sec ≈ 4s. The remaining ~2s is broker round-trip overhead from the lease cycle.
Same wire protocol, same config_hash, same
per-partition lease back-pressure semantics. Pick whichever runtime
fits your service; the bucket state is shared via PostgreSQL so workers
written in different languages can co-process the same query.
4. Four canonical models
Real APIs measure rate limits in different units. The same
.gate() primitive expresses all four common shapes —
what changes is only the costFn and the queue contents:
| Model | Real-world example | Queue contains | costFn |
|---|---|---|---|
A — req/sec"100 calls/sec, batch arbitrary inside" |
Airbnb StandardCalendars; Google Hotel uploads | Pre-batched HTTP requests each carrying an items[] |
() => 1 |
B — msg/sec"100 ops/sec, regardless of batching" |
WhatsApp Cloud per-phone tier; Twilio; SendGrid | Individual messages (one logical operation per item) |
() => 1 |
C — 1:1"each call carries exactly one op" |
Legacy VRBO/Expedia "one operation per call" endpoints | Individual operations | () => 1 |
D — cost-based"weight per request varies" |
Expedia weight-per-endpoint; OpenAI TPM | Requests with a weight field |
msg => msg.weight |
Models A, B, and C share the same costFn — they differ
only in what you put on the queue. Model D shows the only place where
the gate code itself changes: the cost function reads the message
payload to compute how many tokens to charge.
Model D — cost-based example
100 requests with weights 1..5 (avg 3, total 300). Bucket of 100, refill 50/sec. Drains in ~6 seconds.
5. Two limits at once (model A + B cascade)
Some APIs impose both a request-per-second limit AND an
operations-per-second limit at the same time. Booking.com is the
canonical example: "≤ 100 calls/sec AND ≤ 1000 ops/sec per
partner". Both must be satisfied simultaneously.
Solution: chain two .gate() stages via an intermediate
queue. The MIN of the two binding rates wins. Per-partition order is
preserved across the chain because each stage maintains its own
lease on the same partition key.
In the verified run, gate 2 is the binding limit: the request rate stabilises at 6/sec (well below gate 1's limit of 10) precisely because each request carries 5 items and gate 2 caps the total at 30 msg/sec ⇒ 30/5 = 6 req/sec.
6. Sliding-window quotas (model E)
Some APIs impose quota-style limits — "max N events in any
rolling window of W seconds" — typically used for daily caps
("max 1000 emails/day per IP") or hourly bursts. The
slidingWindowGate helper implements an efficient 2-bucket
approximation that is accurate within ±1× the rate at boundaries
(precise enough for rate-limiting; for billing-grade precision keep
per-event timestamps in state).
For sliding windows, WINDOW_SEC ≥ 5 × LEASE_SEC.
If they're comparable (e.g. lease=2s ≈ window=2s) every lease expiry
lands at a window boundary and the previous-bucket weight blocks the
entire next window, collapsing throughput to ~half of nominal. Real
quota use cases (windows of minutes, hours, days) are far above
typical lease times, so this is mostly an issue for synthetic tests.
The 29 messages observed in the worst 5-second window (vs. the configured limit of 20) is the documented ±1× boundary inaccuracy of the 2-bucket approximation. For real-world quotas (e.g. 1000/day) the same approximation gives 1000-1500 across any 24h window, which is acceptable in nearly all use cases.
7. Sizing rule
For the token bucket, the maximum sustainable rate per partition is:
The reason: each lease cycle, the runner can drain at most
capacity tokens before being denied; after the deny the
partition is parked for leaseSec seconds before the next
cycle. So if capacity is too small, the deny cadence
becomes the bottleneck rather than the refill rate.
Recommendation: always set
capacity ≥ refillPerSec × leaseSec. This guarantees the
cycle quantum doesn't artificially clamp the bucket. The default
config in streams/examples/08-rate-limiter-stress.js uses
capacity = refillPerSec × leaseSec exactly.
For a target of R ops/sec/tenant on a 2-second source-queue
lease, set refillPerSec: R and
capacity: 2 × R. Then set
batchSize on the streaming runner to at least
capacity so a full bucket can be drained in a single
cycle.
8. Scaling with multiple consumers
A single Node process running Stream.run() sustains
roughly 1500–2500 msg/sec end-to-end on a modest setup, limited by
HTTP round-trips and PG cycle latency. To scale higher, run multiple
consumers in parallel — each calls .run() with the same
queryId. Queen's per-partition lease guarantees that a
given partition is processed by AT MOST ONE runner at any moment, so
per-key state writes never collide and ordering stays intact.
In production each runner is typically its own Pod. The state coordination
is fully handled by PostgreSQL: every gate decision reads the bucket
state from queen_streams.state within the same transaction
that mutates it, and the partition lease enforces single-writer
semantics across all runners.
9. Stress test results
The pattern was stress-tested at three scales. Source:
streams/examples/08-rate-limiter-stress.js.
| Config | Total msgs | Time | Aggregate throughput | Per-tenant rate (avg) | Errors | Efficiency |
|---|---|---|---|---|---|---|
| 10 tenants × 500 msg, 1 runner | 5,000 | 6.4s | 785 msg/s | 101.4 (target 100) | 0 | 78% |
| 50 tenants × 2,000 msg, 4 runners | 100,000 | 22.6s | 4,426 msg/s | 96.9 (target 100) | 0 | 89% |
| 100 tenants × 5,000 msg, 8 runners | 500,000 | 53.0s | 9,431 msg/s | 100.5 (target 100) | 0 | 94% |
Throughput scales nearly linearly with the number of runners. The same
test was repeated for all four canonical models (A, B, C, D) plus the
cascade and sliding window — all 5 verifiable against
streams/examples/09-rate-limiter-all-models.js:
| Model | Description | Drained | Per-tenant rate (target) | Aggregate | Time | Status |
|---|---|---|---|---|---|---|
| A | req/s | 100,000 | 104.0 (100) | 4,702/s | 21.3s | ✅ |
| B | msg/s | 100,000 | 106.0 (100) | 4,713/s | 21.2s | ✅ |
| D | cost-based | 100,000 reqs (550k weight) | 1,051 weight/s (1,000) | 42,953 w/s | 12.8s | ✅ |
| A+B | cascading | 50,000 | 119.9 req/s (100) + 511 msg/s (500) | 4,671/s | 10.7s | ✅ |
| E | sliding-window (20 / 10s) | 10,000 | 11.2 (10) | 518/s | 19.3s | ✅ |
10. Inspecting state
All gate state lives in the queen_streams.state table — a
plain PostgreSQL table you can query directly for ops debugging.
Because the state is plain SQL, you can hook it directly into Grafana, Metabase, or your existing alerting pipeline — no extra metrics exporter required.
11. Common pitfalls
.gate() with windowing in the same stream
The gate's partial-ack-on-deny semantics are incompatible with the
full-batch atomic-cycle model that windows + reducers assume. The
Stream compiler enforces this with a clear error. If you
need both rate limiting AND windowed aggregation, run them as two
sequential streams with an intermediate queue.
The bucket state is sharded by partition_id in
queen_streams.state. If you partition the source queue
by tenant but rate-limit by tenant + endpoint
(via .keyBy()), two different runners can hold different
partition leases yet write the same logical key, causing cross-worker
contention. Always partition by exactly the rate-limit key. The
runtime warns you if you use .keyBy() to override.
retryLimit to a large number
A denied message goes back into the queue via lease expiry — Queen
reads that as a "retry". On a heavily throttled tenant a single
message can be redelivered hundreds of times before its bucket
refills enough. Set retryLimit to ≥ 1000 on the source
queue or messages will end up in the DLQ unfairly. (We use 100,000
in the stress tests.)
Repeating from the sizing callout: for
slidingWindowGate, ensure
windowSec ≥ 5 × leaseSec on the source queue. Otherwise
deny→retry round-trips collide with window boundaries and effective
throughput collapses.
12. References
All helpers used in this recipe are tiny pure-JavaScript factories defined in a single file (~140 lines). You can read the full source in the time it takes to read this page:
| What | Where | Re-exported from |
|---|---|---|
tokenBucketGate({ capacity, refillPerSec, costFn }) |
client-v2/streams/helpers/rateLimiter.js (function declaration) |
queen-mq top-level(see client-v2/index.js) |
slidingWindowGate({ limit, windowSec, costFn }) |
client-v2/streams/helpers/rateLimiter.js (function declaration) |
queen-mq top-level(same) |
| Minimal example (used in §3 quickstart) | streams/examples/07-rate-limiter.js |
|
| Stress test (used in §9 numbers) | streams/examples/08-rate-limiter-stress.js |
|
| All-models test (A / B / C / D / cascade / sliding-window) | streams/examples/09-rate-limiter-all-models.js |
|
The .gate() operator (compiler integration) |
client-v2/streams/operators/GateOperator.js |
|
| Stored procedure (SP) implementing the cycle commit | lib/schema/procedures/021_streams_cycle_v1.sql |
