18 tests. 1.6 billion message events. Zero loss.
All numbers on this page come from benchmark sessions run on
2026-04-25 and 2026-04-26 on a fresh 32 vCPU / 62 GiB host with
the upstream postgres:latest PostgreSQL image, Queen 0.14.0.alpha.3,
Apache Kafka 3.7 (KRaft single-node), and RabbitMQ 3.12. Each test ran for 15 minutes
against a fresh server state. The full raw data (per-minute time series, Postgres
stats, system metrics, autocannon and perf-test output) lives in
benchmark-queen/2026-04-26.
Test environment
| Component | Configuration |
|---|---|
| Host | 32 vCPU · 62 GiB RAM · no swap · Ubuntu 24.04 (kernel 6.8) |
| PostgreSQL | postgres:latest · shared_buffers=24 GB · effective_cache_size=48 GB · autovacuum_naptime=10s · autovacuum_vacuum_scale_factor=0.05 |
| Queen | 0.14.0.alpha.3 · NUM_WORKERS=10 · DB_POOL_SIZE=50 · SIDECAR_POOL_SIZE=250 · nofile=65535 |
| Cleanup | docker rm -v + docker volume prune -f before each test. Every test starts with an empty database. |
| Duration | 15 min (900 s) per test |
| Client | autocannon v8 on Node 22, same host |
| Payload | ~28 bytes per message (matched across all systems compared) |
Throughput & latency together
Throughput numbers without latency are half the story. Here are the four push configurations side by side, with both the producer and consumer latency profiles:
| Test | Setup | Push msg/s | p50 / p99 push | p50 / p99 pop | Queen CPU |
|---|---|---|---|---|---|
| bp-1 | 1×50, batch=1 | 5 796 | 8 / 13 ms | 11 / 355 ms | 5.2 vCPU |
| bp-10 | 1×50, batch=10 | 39 060 | 11 / 38 ms | 16 / 356 ms | 7.4 vCPU |
| bp-100 | 1×50, batch=100 | 104 400 | 45 / 131 ms | 35 / 254 ms | 19 vCPU |
| hi-part-1 | 5×100 conns, batch=1 | 29 487 | 15 / 44 ms | 59 / 4 199 ms | 27 vCPU |
From bp-1 to bp-10, push throughput jumps 6.7× while p99 only goes from 13 ms to 38 ms, a 2.2× efficiency gain per request. At the 100k msg/s peak, p99 is still 131 ms. Compare that to Kafka under sustained unbounded load (1.5M msg/s with a p99 of 2 966 ms): Queen at 100k msg/s has ~22× lower tail latency than Kafka at its peak.
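For reference, the bp-10 shape is simple enough to drive programmatically. Below is a minimal sketch using autocannon's Node API; the Queen endpoint path, port, and batch payload shape are assumptions for illustration, and the real runner scripts live in the benchmark repo.

```ts
import autocannon from "autocannon";

// A batch of 10 messages with ~28-byte payloads, mirroring bp-10.
// The payload shape and endpoint below are assumptions, not the real scripts.
const batch = Array.from({ length: 10 }, (_, i) => ({
  partition: `p-${i}`,
  payload: "x".repeat(28),
}));

const result = await autocannon({
  url: "http://localhost:6632/api/v1/queues/bench/messages", // assumed endpoint
  method: "POST",
  headers: { "content-type": "application/json" },
  body: JSON.stringify(batch),
  connections: 50, // 1 queue x 50 connections, as in bp-10
  pipelining: 1,
  duration: 900,   // 15 minutes
});

// autocannon reports requests/s; multiply by the batch size for msg/s.
console.log(`${(result.requests.average * 10).toFixed(0)} msg/s pushed`);
```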
Partition scaling, the Queen claim, validated
Same producer concurrency (5×100 connections), increasing partition count from 2 to 10 001. If partitions were physical commit-logs, this would degrade catastrophically. In Queen they're logical lanes, so the cost is index size, not state machine size.
Partition count goes up by 5 000×; throughput drops by 29%. Zero errors at every scale. This is the curve that lets Queen offer "one partition per chat conversation" in production without operational drama.
Time-series stability, what 15 minutes of bp-10 looks like
Throughput averages don't tell you whether a system is stable or oscillating. Below
is the per-minute pop rate from bp-10's own queue-ops.json
output, a rendering of the actual time series.
Per-minute throughput bounces between 35k and 46k msg/s for 14 minutes, then ramps down as the producer hits its target. No drift, no degradation, no warm-up artifact. The same is true of every sustained test in the suite.
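If you want to regenerate that view from the raw output, the shape of the computation is sketched below. The field names are assumptions about queue-ops.json (the actual schema is in the benchmark repo); the per-minute bucketing is the point.

```ts
import { readFileSync } from "node:fs";

interface Sample {
  timestamp: number; // assumed: sample time in ms since epoch
  popped: number;    // assumed: messages popped since the previous sample
}

const samples: Sample[] = JSON.parse(
  readFileSync("bp-10/queue-ops.json", "utf8"),
);

// Bucket samples into 60 s windows and report the average rate per window.
const perMinute = new Map<number, number>();
for (const s of samples) {
  const bucket = Math.floor(s.timestamp / 60_000);
  perMinute.set(bucket, (perMinute.get(bucket) ?? 0) + s.popped);
}

for (const [bucket, total] of [...perMinute.entries()].sort((a, b) => a[0] - b[0])) {
  console.log(`minute ${bucket}: ${(total / 60).toFixed(0)} msg/s popped`);
}
```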
Consumer-group fan-out
| Groups | Push msg/s | Total delivery rate | Pop / push | Per-group p99 | Pending at end |
|---|---|---|---|---|---|
| 1 | 39 060 | 38 351 msg/s | 1.0× | 356 ms | 1.5% |
| 5 | 26 890 | 127 777 msg/s | 4.9× | 315 ms | 0 |
| 10 | 17 890 | 165 480 msg/s | 9.3× | 471 ms | 0 |
Multi-tenancy is essentially free
Same total client load, distributed across one queue vs ten queues:
| Test | Setup | Push msg/s | Pop msg/s | Δ |
|---|---|---|---|---|
| q-1 | 1 queue, 1×50, batch=10, MP=1000 | 39 130 | ~38 000 | baseline |
| q-10 | 10 queues, same total load | 40 500 | 38 540 | +3.5% / +1.4% |
Routing to ten different queues instead of one was slightly faster (less contention on the single partition's advisory lock), though the difference is well within noise. Multi-tenant deployments don't pay a partitioning tax.
Realistic pipeline, producer → worker → q2 fan-out
The throughput numbers above are pure broker-side push/pop micro-benchmarks. Most
real workloads look different: multiple stages, real client SDK with batching and
long-poll, work simulation between stages, multiple downstream consumer groups.
This benchmark exercises that shape end-to-end with the official
queen-mq JS client and pm2:
[producer ×2] ──push──▶ pipe-q1 ──pop──▶ [worker ×7] ──push──▶ pipe-q2 ─┬─pop──▶ [analytics ×7]
│
└─pop──▶ [log ×7]
The producer pushes one message per HTTP call. Each worker
pops a batch from q1, simulates 5–20 ms of long-tail per-message work in
parallel via Promise.all, forwards the batch to q2 preserving
per-partition ordering, then acks q1 (at-least-once: separate push then
ack, no transactional commit). Both analytics and
log consumer groups drain q2 with the same shape, fan-out,
simulated work, batch ack. The whole thing runs on a 16 vCPU / 64 GB
DigitalOcean VM with PG synchronous_commit=on.
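The worker stage's control flow, as described above, looks roughly like the sketch below. The QueueClient interface is a placeholder, not the queen-mq client's actual API; it exists only to make the pop → work → push → ack shape concrete.

```ts
// NOT the queen-mq API: method names and signatures here are placeholders.
interface Message { partition: string; payload: unknown }

interface QueueClient {
  pop(queue: string, opts: { batch: number }): Promise<Message[]>;
  push(queue: string, messages: Message[]): Promise<void>;
  ack(queue: string, messages: Message[]): Promise<void>;
}

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));
const randomWorkMs = () => 5 + Math.random() * 15; // 5-20 ms simulated work

async function workerLoop(client: QueueClient): Promise<void> {
  for (;;) {
    const batch = await client.pop("pipe-q1", { batch: 100 });
    if (batch.length === 0) continue; // long-poll is the client's job

    // Simulate per-message work in parallel, as in the benchmark.
    await Promise.all(batch.map(() => sleep(randomWorkMs())));

    // Forward to q2 first, then ack q1: at-least-once, not transactional.
    // Pushing the whole batch keeps per-partition order intact.
    await client.push("pipe-q2", batch);
    await client.ack("pipe-q1", batch);
  }
}
```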
Same engine, same client, same pipeline shape, two configurations of the partition knob:
1 000 partitions · per-entity ordering
- 1 000 partitions, batch = 100, partitions/pop = 10
- Producer = worker drain = 3 688 msg/s
- End-to-end p50 / p99 = 359 ms / 1 024 ms
- Per-entity FIFO ordering preserved (Kafka-like)
- For: chat rooms, per-tenant streams, per-user state
10 partitions · weak ordering, max throughput
- 10 partitions, batch = 1 000, partitions/pop = 1
- Producer = worker drain = 6 673 msg/s (+81 %)
- End-to-end p50 / p99 = 755 ms / 1 103 ms
- Per-shard FIFO (RabbitMQ-like, but ordering still per-lane)
- For: task queues, work distribution, log shipping
| Stage / Config | 1 000-part p50 / p99 / max | 10-part p50 / p99 / max |
|---|---|---|
| End-to-end (producer → analytics) | 359 / 1 024 / 15 464 ms | 755 / 1 103 / 1 747 ms |
| q1 → worker | 213 / 705 / 15 387 ms | 440 / 745 / 1 010 ms |
| q2 → analytics | 114 / 514 / 3 198 ms | 312 / 601 / 999 ms |
| Resource (steady state) | 1 000-part config | 10-part config |
|---|---|---|
| queen container CPU | ~390 % (3.9 vCPU) | ~340 % (3.4 vCPU) |
| queen container RSS | ~70 MB | ~175 MB (bigger in-flight batches) |
| postgres CPU | ~1 500 % (15 vCPU) | ~470 % (4.7 vCPU, 3.2× cheaper) |
| postgres RSS at end | 16.7 GB | 12.3 GB |
In the throughput-tuned config, Postgres CPU is 3.2× lower at nearly
double the throughput: bigger batches amortise the per-call SQL
overhead so much that the per-message PG cost collapses. The system is then
producer-bound, not PG-bound; adding more producer
processes would push the rate well into the 30–50 k msg/s range without
changing anything broker-side. Full writeups with reproduction recipes:
pipeline-queen.md
(high-cardinality) and
pipeline-queen-throughput.md
(throughput-tuned).
Queen vs Kafka vs RabbitMQ, same hardware, same payload
Single-node, 1 000 partitions/queues, persistent durability, 15-min run, ~28-byte
message payload, same 32 vCPU / 62 GiB host. All three measured directly: Kafka with
kafka-producer-perf-test.sh, RabbitMQ with the official
pivotalrabbitmq/perf-test. Each system uses its own native
protocol (HTTP/JSON for Queen, the Kafka wire protocol, AMQP for RabbitMQ).
Test methodology, important asymmetries you should know about
Each system was tested in its idiomatic high-throughput client configuration, not with literally identical client settings. The protocols differ enough that "same number of connections" doesn't mean the same thing in each system. Specifically:
| System | Client config | Effective concurrency |
|---|---|---|
| Queen | autocannon, 50 HTTP connections, batch=10, pipelining=1 | ~50 in-flight HTTP requests |
| Kafka | 1 producer process, linger.ms=10, batch.size=16384, max.in.flight=5 | ~5 in-flight batched requests on 1 TCP connection |
| RabbitMQ | 1 producer process, --confirm 200 | ~200 unconfirmed publishes on 1 AMQP connection |
Durability tiers are also not identical:
- Queen: synchronous_commit=on, fsync of WAL before HTTP 201. Strictest.
- Kafka: acks=1, the broker writes to the OS page cache, no fsync. If the broker crashes before the OS flushes (~1 s), messages can be lost. A single broker means no replication backup either.
- RabbitMQ: delivery_mode=2 + --confirm 200, written to the queue index on disk before confirm, but flushes are batched at the index level.
What this means for the comparison: Re-running with matched client concurrency (50 producers each) would push Kafka above 2 M msg/s and RabbitMQ to probably ~80-120 k msg/s. Tightening Kafka's durability to fsync-per-message would drop it to maybe 100-300 k msg/s. The numbers shown are each system at its idiomatic high-throughput config, with its idiomatic durability tier. That's a defensible-but-not-exhaustive choice. Queen's memory and architectural advantages hold regardless; the throughput numbers are the most sensitive to client setup.
| | Queen MQ (bp-10) | Kafka 3.7 (kafka-1000p) | RabbitMQ 3.12 (rabbitmq-1000q) |
|---|---|---|---|
| Throughput | 39k msg/s | 1.52M msg/s | 34.7k msg/s |
| Push / confirm p99 | 38 ms | 2 966 ms | 9.3 ms |
| Server memory | 52 MB RSS | 3.1–7.2 GB heap | 188 MB RSS |
| CPU | 7.4 vCPU | 3.5 vCPU | 1.5 vCPU |
| Disk | ~400 B per msg | ~36 B per msg | ~9 GB written total |
| Per-key order | native | native | 1 queue per key |
| Replay | timestamp / offset | timestamp / offset | streams only |
| Ops surface | 1 binary + PG | broker + KRaft | broker + Erlang |
Kafka wins on raw throughput (roughly 39× Queen's bp-10 number), but at a weaker durability tier (acks=1, no fsync), with 78× higher saturation latency and ~80× more memory.
RabbitMQ ties Queen on throughput (35k vs 39k msg/s) and decisively
wins on latency (9 ms vs 38 ms p99 confirm) and CPU
(1.5 vs 7.4 vCPU): AMQP binary + Mnesia is cheaper per message than HTTP/JSON +
Postgres INSERTs. Queen wins on memory (52 MB vs 188 MB, ~3.6×
lighter), on the strictest default durability, and on
architectural features no benchmark can show: per-key ordering with
parallel consumers, replay-from-timestamp, transactional integration with PG, and dynamic
high-cardinality partitions. None of the three dominates on raw numbers at
this tier; pick based on architectural fit and the operational story.
Detailed cross-system reports:
vs-kafka.md ·
vs-rabbitmq.md
What broker benchmarks miss, your workers are usually the bottleneck
Every benchmark on this page measures broker capacity. But most production workloads aren't broker-bound. If your messages do real work (a database write, an API call, an LLM inference, a webhook delivery), your consumer fleet becomes the bottleneck long before the broker does. That changes which system matters, and by how much.
Here's the math at 20 ms of work per message (representative of
a typical DB write or moderate API call). One worker processes 50 msg/s. Useful
throughput equals workers × 50 msg/s:
| Target msg/s | Workers needed | Cost-of-fleet order of magnitude* |
|---|---|---|
| 5 000 msg/s | 100 | ~$500 / month |
| 10 000 msg/s | 200 | ~$1 k / month |
| 39 000 msg/s (Queen bp-10 ceiling) | 780 | ~$5 k / month |
| 104 000 msg/s (Queen peak, batch=100) | 2 080 | ~$10 k / month |
| 500 000 msg/s | 10 000 | ~$50 k / month |
| 1 500 000 msg/s (Kafka measured) | 30 000 | ~$150 k / month |
*Rough estimate at small-EC2 / small-container per-worker pricing. Real cost depends on your stack and what each worker does.
Said differently: at a real workload of 5 k msg/s, all three systems sit at low utilization on the broker side:
| System | Broker ceiling | Utilization at 5 k msg/s with 100 workers |
|---|---|---|
| Queen bp-10 | 39 k msg/s | 13 % |
| RabbitMQ classic 1000q | 35 k msg/s | 14 % |
| Kafka 3.7 single-node | 1.5 M msg/s | 0.3 % |
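The arithmetic behind both tables is two divisions; a small sketch:

```ts
// Workers needed at a target rate for a given per-message work time,
// and broker utilization at that rate. Numbers match the tables above.
const perWorkerRate = (workMs: number) => 1000 / workMs; // msg/s per worker

const workersNeeded = (targetMsgPerSec: number, workMs = 20) =>
  Math.ceil(targetMsgPerSec / perWorkerRate(workMs));

const brokerUtilization = (targetMsgPerSec: number, ceilingMsgPerSec: number) =>
  targetMsgPerSec / ceilingMsgPerSec;

console.log(workersNeeded(5_000));                 // 100 workers at 20 ms/msg
console.log(brokerUtilization(5_000, 39_000));     // ~0.13 -> Queen bp-10 row
console.log(brokerUtilization(5_000, 1_500_000));  // ~0.003 -> Kafka single-node row
```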
When the broker isn't the bottleneck, what actually matters is:
- Latency at low load, Queen ~5–10 ms p99 (unsaturated), Kafka ~5–10 ms, RabbitMQ ~3–5 ms. Roughly tied. The “Kafka 39× faster than Queen” comparison was for saturated brokers; nobody runs production workloads saturated.
- Operational cost, Queen is 1 binary + your existing PG. Kafka is broker cluster + KRaft + topic admin + monitoring stack. The annual operational difference dwarfs the broker license/hardware cost.
- Durability semantics, Queen's synchronous_commit=on default is the strictest of the three. Kafka with acks=1 isn't equivalent; matched durability (acks=all + flush.messages=1) cuts Kafka throughput by 5–10×.
- Integration with your stack, Queen lets you do BEGIN; INSERT order; queen.push(...); COMMIT; in one PG transaction, as sketched after this list. The transactional-outbox pattern (a real cost in Kafka deployments) goes away.
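A sketch of that transactional push using node-postgres: the queen.push SQL call is the one named in the bullet above, but its argument order, the schema, and the orders table here are assumptions for illustration.

```ts
import { Client } from "pg";

const db = new Client({ connectionString: process.env.DATABASE_URL });
await db.connect();

const order = { id: 123, customerId: 42, total: 99.5 }; // example data

try {
  await db.query("BEGIN");
  await db.query(
    "INSERT INTO orders (id, customer_id, total) VALUES ($1, $2, $3)",
    [order.id, order.customerId, order.total],
  );
  // Enqueue in the same transaction: the row and the message commit
  // together or not at all, so no separate outbox table is needed.
  // Argument order of queen.push is assumed, not documented here.
  await db.query("SELECT queen.push($1, $2, $3)", [
    "order-events",                 // queue
    String(order.customerId),       // partition key
    JSON.stringify({ orderId: order.id, total: order.total }),
  ]);
  await db.query("COMMIT");
} catch (err) {
  await db.query("ROLLBACK");
  throw err;
} finally {
  await db.end();
}
```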
The broker comparison only matters when work per message is sub-millisecond: true streaming pipelines (Kafka Streams, Flink, ksqlDB), log shipping, click-stream analytics. For those, Kafka is the right tool and nothing else competes. For everything else (order processing, notifications, webhooks, ML inference jobs, chat handling, workflow steps), the broker sits idle, the worker fleet sets your cost, and the differentiator is what's easy to operate and integrate.
Resource efficiency
The most consistent signal across all 18 tests: Queen is small.
| Metric | bp-1 | bp-10 | bp-100 (peak) | bp-10-cg10 (10 groups) |
|---|---|---|---|---|
| Queen RSS max | 30 MB | 52 MB | 72 MB | 169 MB |
| Queen CPU avg | 5.2 vCPU | 7.4 vCPU | 19 vCPU | 18 vCPU |
| DB pool active avg | 2.4 | 2.4 | 2.7 | 2.7 |
| PG cache hit rate | 100% | 100% | 99.99% | 100% |
| messages_consumed table | n/a | 84 MB | n/a | 743 MB |
The messages_consumed table grows fast under fan-out: 743 MB after 15 minutes at 10 consumer
groups (~70 GB/day sustained). TTL retention is critical for any deployment running
more than a few days; configure completedRetentionSeconds on hot queues.
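A hypothetical shape for that configuration; only the completedRetentionSeconds option name comes from the text above, and the admin interface around it is a placeholder, not the documented API.

```ts
// Placeholder admin interface, NOT the real queen-mq client.
interface QueueAdmin {
  configureQueue(
    queue: string,
    opts: { completedRetentionSeconds: number },
  ): Promise<void>;
}

async function boundRetention(admin: QueueAdmin): Promise<void> {
  // Drop consumed rows after one hour instead of keeping them forever;
  // at ~70 GB/day of fan-out growth this keeps messages_consumed bounded.
  await admin.configureQueue("chat-events", { completedRetentionSeconds: 3600 });
}
```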
Adaptive concurrency in action
Queen ships with DB_POOL_SIZE=50 by default. Across the entire benchmark
suite, including the 100k msg/s peak test, the libqueen Vegas controller kept the
active connection count at ~2.5. The other ~47 connections sat idle
in the pool, available as overflow for transient spikes.
This is exactly what the TCP-Vegas-style controller is supposed to do: when adding in-flight work doesn't reduce RTT, the controller knows the pipe isn't congested and stays put. You can't manually tune your way to a better number for this workload; the controller already found it. See the architecture page for the math.
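For intuition, the core of a Vegas-style limit controller fits in a few lines. This is an illustrative sketch of the idea, not libqueen's actual implementation or its tuning constants.

```ts
// Vegas-style concurrency limiter sketch: estimate how many requests are
// queued behind the pipe from the RTT inflation over the best RTT seen.
class VegasLimiter {
  private limit = 2;          // current concurrency limit
  private minRtt = Infinity;  // best RTT observed (the uncongested baseline)

  // Call after each completed operation with its measured RTT.
  onSample(rttMs: number): void {
    this.minRtt = Math.min(this.minRtt, rttMs);
    // Estimated in-flight work that is queueing rather than flowing:
    // limit * (1 - minRtt / rtt), per the Vegas formulation.
    const queued = this.limit * (1 - this.minRtt / rttMs);
    if (queued < 1) this.limit += 1;                         // no queueing: probe upward
    else if (queued > 3) this.limit = Math.max(1, this.limit - 1); // queueing: back off
    // between the thresholds: RTT is stable, keep the limit where it is
  }

  currentLimit(): number {
    return this.limit;
  }
}
```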
0.14 vs 0.12, adaptive engine impact
Same five tests, run once on each version, fresh DB:
| Test | 0.14 push | 0.12 push | 0.14 pop | 0.12 pop |
|---|---|---|---|---|
| bp-10 | 39 060 | 31 820 | 38 351 | 17 149 |
| bp-100 | 104 400 | 64 400 | 101 675 | 61 279 |
| hi-part-1 | 29 487 | 13 696 | 27 849 | 3 194 |
| hi-part-10000 | 21 044 | 17 331 | 17 825 | 3 643 |
| q-10 | 40 500 | 31 610 | 38 540 | 29 555 |
Pop throughput improved 80–90% under partition contention, the single biggest win from the libqueen rewrite. PG memory usage is 30–70% lower for the same workload, and the PG deadlock failure mode under heavy fan-out has been eliminated.
Data integrity, the headline number
Short suite
~3B events · 18 tests · 15 min each · ackFailed 0 · DLQ messages 0
Long-running test
1.5B messages · 14 hours · sustained rate ~28k msg/s · messages table 35 GB at end · 0 lost
Combined
1.6B+ events · 0 hardware crashes · 0 PG outages · 0 data loss · 100% failover replay
Bugs we found
Honest accounting. Four issues surfaced during the run; all four are cosmetic, none caused data loss. Listed here because a benchmark page that doesn't tell you what broke isn't useful.
| Bug | When it fires | Severity | Fix |
|---|---|---|---|
| StatsService.refresh_all_stats_v1 30 s timeout | Sustained ≥30k msg/s with many partitions, or multi-cg load | Cosmetic; advisory lock prevents pile-up | One line: SET LOCAL statement_timeout = 0; |
| evict_expired_waiting_messages 30 s timeout | 10 consumer-group load only | Cosmetic | Same one-line fix in the retention procedure |
| PG deadlock detected during high-concurrency push | 10 001 partitions on 0.12 (mostly fixed in 0.14) | None; file-buffer failover catches everything | 0.14's v3 push procedure eliminates most cases |
| queue-ops reports pushMessages: 0 | Always (in 0.14.0.alpha.3) | Cosmetic; reporting only | Bug in per-queue stats aggregator |
Reproduce it yourself
All raw data, runner scripts, configs, and analysis live in the repo:
benchmark-queen/2026-04-26.
Each test directory has the per-consumer logs, queue-ops time-series JSON, postgres
stats, system metrics, and final docker-stats output. The
HOW-TO-RUN.md
walks through reproducing the entire suite step by step (Docker images, PostgreSQL
tuning, autocannon configuration).
How much Postgres do I need?
Interactive calculator turning your target msg/s into a PG vCPU budget, derived directly from these benchmarks. ~10 k PG ops/s/vCPU rule of thumb.
Full README
Compacted result table for all 18 tests, key conclusions, version comparison verdict.
pipeline-queen.md
4-stage pipeline at 1 000 partitions for per-entity ordering: 3 688 msg/s end-to-end, p99 = 1.02 s, 99.96 % completeness.
pipeline-queen-throughput.md
Same pipeline at 10 partitions for max throughput: 6 673 msg/s end-to-end, p99 = 1.10 s, 3.2× lower PG CPU.
HOW-TO-RUN.md
Step-by-step reproduction: Docker setup, PG tuning, the runner scripts, parameter sweeps.
vs-kafka.md
Direct head-to-head with Apache Kafka 3.7 at 1 000 partitions, full latency profile.
cg-axis-comparison.md
The 1 vs 5 vs 10 consumer-group sweep, including per-group fairness numbers.
