Queen MQ
Benchmarks

18 tests. 1.6 billion message events. Zero loss.

All numbers on this page come from benchmark sessions run on 2026-04-25 and 2026-04-26 on a fresh 32 vCPU / 62 GiB host with PostgreSQL upstream postgres:latest, Queen 0.14.0.alpha.3, Apache Kafka 3.7 (KRaft single-node), and RabbitMQ 3.12. Each test ran for 15 minutes against a fresh server state. The full raw data (per-minute time series, Postgres stats, system metrics, autocannon and perf-test output) lives in benchmark-queen/2026-04-26.

Headline

| Metric | Result | Notes |
|---|---|---|
| Push (batch=100) | 104.4k msg/s | single producer · 1×50 conns |
| Push p99 (batch=10) | 38 ms | at 39k msg/s sustained |
| Fan-out (10 cg) | 165k msg/s total deliveries | 9.3× pop multiplier |
| Server RSS at peak | 52 MB | vs Kafka 3.1–7.2 GB |
| Partition density | 10 001 partitions | 21k msg/s sustained · zero degradation |
| DB pool active | 2.5 / 50 | Vegas finds the right number itself |
| PG cache hit rate | 99.99 % | across the entire suite |
| Lost messages | 0 | across 1.6 billion events |

Test environment

| Component | Configuration |
|---|---|
| Host | 32 vCPU · 62 GiB RAM · no swap · Ubuntu 24.04 (kernel 6.8) |
| PostgreSQL | postgres:latest · shared_buffers=24 GB · effective_cache_size=48 GB · autovacuum_naptime=10s · autovacuum_vacuum_scale_factor=0.05 |
| Queen | 0.14.0.alpha.3 · NUM_WORKERS=10 · DB_POOL_SIZE=50 · SIDECAR_POOL_SIZE=250 · nofile=65535 |
| Cleanup | docker rm -v + docker volume prune -f before each test; every test starts with an empty database |
| Duration | 15 min (900 s) per test |
| Client | autocannon v8 on Node 22, same host |
| Payload | ~28 bytes per message (matched across all systems compared) |
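For a sense of how the client side was driven, here is a minimal sketch using autocannon's programmatic API. It is illustrative only: the endpoint path, request body shape, and queue name are assumptions, not Queen's documented HTTP API; the real runner scripts live in benchmark-queen/2026-04-26.

```typescript
// Illustrative load-generation sketch (bp-10 shape: 1×50 connections, batch=10, 15 min).
// The URL and payload format are assumptions, not the documented Queen endpoint.
import autocannon from 'autocannon';

async function runPushLoad() {
  const result = await autocannon({
    url: 'http://localhost:8080/queues/bench/messages', // hypothetical push endpoint
    method: 'POST',
    connections: 50,   // 1×50 conns
    pipelining: 1,
    duration: 900,     // 15 minutes
    headers: { 'content-type': 'application/json' },
    // batch=10: ten ~28-byte messages per HTTP request
    body: JSON.stringify({
      messages: Array.from({ length: 10 }, (_, i) => ({ payload: `bench-message-${i}-0000000000` })),
    }),
  });
  console.log(`req/s avg: ${result.requests.average}, latency p99: ${result.latency.p99} ms`);
}

runPushLoad().catch(console.error);
```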

Throughput & latency together

Throughput numbers without latency are half the story. Here are the four push configurations side by side, with both the producer and consumer latency profiles:

| Test | Setup | Push msg/s | p50 / p99 push | p50 / p99 pop | Queen CPU |
|---|---|---|---|---|---|
| bp-1 | 1×50, batch=1 | 5 796 | 8 / 13 ms | 11 / 355 ms | 5.2 vCPU |
| bp-10 | 1×50, batch=10 | 39 060 | 11 / 38 ms | 16 / 356 ms | 7.4 vCPU |
| bp-100 | 1×50, batch=100 | 104 400 | 45 / 131 ms | 35 / 254 ms | 19 vCPU |
| hi-part-1 | 5×100 conns, batch=1 | 29 487 | 15 / 44 ms | 59 / 4 199 ms | 27 vCPU |

The latency story is the surprising one. Going from bp-1 to bp-10, push throughput jumps 6.7× while p99 only goes from 13 ms to 38 ms, roughly a 2.2× efficiency gain per request. At the 100k msg/s peak, p99 is still 131 ms. Compare with Kafka under sustained unbounded load: 1.5 M msg/s, but a p99 of 2 966 ms; Queen at 100k msg/s has ~22× lower tail latency than Kafka at its peak.

Partition scaling, the Queen claim, validated

Same producer concurrency (5×100 connections), with the partition count increased from 2 to 10 001. If partitions were physical commit logs, this would degrade catastrophically. In Queen they are logical lanes, so the cost is index size, not state-machine size.

| Partitions | Push msg/s |
|---|---|
| 2 | 29 487 |
| 11 | 27 902 |
| 101 | 26 041 |
| 1 001 | 25 016 |
| 10 001 | 21 044 |

Partition count goes up by 5 000×; throughput drops by 29%. Zero errors at every scale. This is the curve that lets Queen offer "one partition per chat conversation" in production without operational drama.
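To make the claim concrete, here is what "one partition per chat conversation" might look like from application code. The client surface shown (Queen, push, its options) is a hypothetical sketch for illustration, not the documented queen-mq SDK; the pattern is the point: the partition key is the conversation id.

```typescript
// Hypothetical sketch: partition = conversation id, so each conversation keeps FIFO order
// while the broker treats partitions as cheap logical lanes rather than physical logs.
// The client API shown here is assumed, not the documented queen-mq surface.
import { Queen } from 'queen-mq'; // assumed import

const queen = new Queen({ url: 'http://localhost:8080' });

export async function onChatMessage(conversationId: string, text: string) {
  await queen.push('chat-events', {
    partition: conversationId, // tens of thousands of live lanes is fine, per the table above
    payload: { conversationId, text, at: Date.now() },
  });
}
```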

Time-series stability, what 15 minutes of bp-10 looks like

Throughput averages don't tell you whether a system is stable or oscillating. Below is the per-minute pop rate from bp-10's own queue-ops.json output, a rendering of the actual time series.

[Chart] bp-10 · pop msg/s · 15 min run · per-minute aggregates. Y-axis 10k–50k msg/s, mean ≈ 38k; x-axis minutes 1–15, ramping down at the end.

Per-minute throughput bounces between 35k and 46k msg/s for 14 minutes, then ramps down as the producer hits its target. No drift, no degradation, no warm-up artifact. The same is true of every sustained test in the suite.

Consumer-group fan-out

| Groups | Push msg/s | Total deliveries | Pop / push | Per-group p99 | Pending at end |
|---|---|---|---|---|---|
| 1 | 39 060 | 38 351 msg/s | 1.0× | 356 ms | 1.5% |
| 5 | 26 890 | 127 777 msg/s | 4.9× | 315 ms | 0 |
| 10 | 17 890 | 165 480 msg/s | 9.3× | 471 ms | 0 |

Fairness across groups holds at every scale tested. At 10 consumer groups, all 10 deliver 360 ± 1 req/s with p99s within 5 ms of each other. Dispatch is fair: no group gets favored, none gets starved. Per-group CPU cost is sub-linear: Queen needs 7.4 vCPU for 1 group but only 18 vCPU for 10 groups, about 25% of the per-group cost of running them alone, thanks to libqueen's per-JobType batching amortizing work across requests.

The cost: adding consumer groups slows the producer by 31% at 5 groups and 54% at 10. The groups are competing for Queen worker slots and PG resources, by design; the system isn't oversubscribed. If you need higher push throughput while consuming, add hardware.
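For readers unfamiliar with the semantics: each consumer group receives every message once, independently of the others. The sketch below illustrates that shape with an assumed client API; the method names and options are not the documented queen-mq surface.

```typescript
// Fan-out sketch: each consumer group drains its own copy of the stream.
// Client API (pop/ack signatures, option names) is assumed for illustration.
import { Queen } from 'queen-mq'; // assumed import

const queen = new Queen({ url: 'http://localhost:8080' });

async function runGroup(group: string) {
  for (;;) {
    // Long-poll a batch for this group; other groups receive the same messages separately.
    const batch = await queen.pop('orders', { consumerGroup: group, batch: 100, waitMs: 5000 });
    if (batch.length === 0) continue;
    await handle(group, batch); // application work
    await queen.ack('orders', { consumerGroup: group, messages: batch });
  }
}

async function handle(group: string, batch: unknown[]) { /* ... */ }

// Ten groups like this is what turns 17.9k msg/s of pushes into 165k msg/s of deliveries.
['analytics', 'billing', 'audit'].forEach(g => runGroup(g).catch(console.error));
```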

Multi-tenancy is essentially free

Same total client load, distributed across one queue vs ten queues:

| Test | Setup | Push msg/s | Pop msg/s | Δ |
|---|---|---|---|---|
| q-1 | 1 queue, 1×50, batch=10, MP=1000 | 39 130 | ~38 000 | baseline |
| q-10 | 10 queues, same total load | 40 500 | 38 540 | +3.5% / +1.4% |

Routing to ten different queues instead of one was actually slightly faster (less contention on a single partition's advisory lock), though the difference is within noise. Multi-tenant deployments don't pay a partitioning tax.

Realistic pipeline, producer → worker → q2 fan-out

The throughput numbers above are pure broker-side push/pop micro-benchmarks. Most real workloads look different: multiple stages, real client SDK with batching and long-poll, work simulation between stages, multiple downstream consumer groups. This benchmark exercises that shape end-to-end with the official queen-mq JS client and pm2:

[producer ×2] ──push──▶  pipe-q1  ──pop──▶  [worker ×7] ──push──▶  pipe-q2 ─┬─pop──▶ [analytics ×7]
                                                                              │
                                                                              └─pop──▶ [log ×7]

Producer pushes single messages per HTTP call. Each worker pops a batch from q1, simulates 5–20 ms of long-tail per-message work in parallel via Promise.all, forwards the batch to q2 preserving per-partition ordering, then acks q1 (at-least-once: separate push then ack, no transactional commit). Both analytics and log consumer groups drain q2 with the same shape, fan-out, simulated work, batch ack. The whole thing runs on a 16 vCPU / 64 GB DigitalOcean VM with PG synchronous_commit=on.
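The worker stage is the interesting one. The sketch below shows its shape against an assumed client API; the real pipeline code and exact SDK calls live in pipeline-queen.md, so treat the method names and options here as illustrative.

```typescript
// Worker stage sketch: pop a batch from q1, do parallel simulated work, forward to q2
// on the same partition, then ack q1. Client API names are assumed, not documented.
import { Queen } from 'queen-mq'; // assumed import

const queen = new Queen({ url: 'http://localhost:8080' });
const simulateWork = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function workerLoop() {
  for (;;) {
    const batch = await queen.pop('pipe-q1', { batch: 100, partitionsPerPop: 10, waitMs: 5000 });
    if (batch.length === 0) continue;

    // 5–20 ms of long-tail per-message work, run in parallel across the batch.
    await Promise.all(batch.map(() => simulateWork(5 + Math.random() * 15)));

    // Forward on the same partition so per-entity ordering survives the hop to q2.
    await queen.push('pipe-q2', batch.map((msg: any) => ({ partition: msg.partition, payload: msg.payload })));

    // Ack q1 last. A crash between push and ack re-delivers the batch: duplicates are
    // possible, loss is not. That is the at-least-once contract the benchmark verifies.
    await queen.ack('pipe-q1', batch);
  }
}

workerLoop().catch(console.error);
```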

Same engine, same client, same pipeline shape, two configurations of the partition knob:

High-cardinality config

1 000 partitions · per-entity ordering

  • 1 000 partitions, batch = 100, partitions/pop = 10
  • Producer = worker drain = 3 688 msg/s
  • End-to-end p50 / p99 = 359 ms / 1 024 ms
  • Per-entity FIFO ordering preserved (Kafka-like)
  • For: chat rooms, per-tenant streams, per-user state

Throughput-tuned config

10 partitions · weak ordering, max throughput

  • 10 partitions, batch = 1 000, partitions/pop = 1
  • Producer = worker drain = 6 673 msg/s (+81 %)
  • End-to-end p50 / p99 = 755 ms / 1 103 ms
  • Per-shard FIFO (RabbitMQ-like, but ordering still per-lane)
  • For: task queues, work distribution, log shipping

| Metric | Result | Notes |
|---|---|---|
| Best end-to-end throughput | 6 673 msg/s | producer = worker drain · q1 stays drained |
| End-to-end p99 (both configs) | ~1 100 ms | producer push → analytics ack |
| Delivery completeness | 99.9 % | a few thousand in-flight at cutoff |
| Duplicate processing | 0 | at-least-once held end-to-end · both configs |

| Stage | 1 000-part p50 / p99 / max | 10-part p50 / p99 / max |
|---|---|---|
| End-to-end (producer → analytics) | 359 / 1 024 / 15 464 ms | 755 / 1 103 / 1 747 ms |
| q1 → worker | 213 / 705 / 15 387 ms | 440 / 745 / 1 010 ms |
| q2 → analytics | 114 / 514 / 3 198 ms | 312 / 601 / 999 ms |

Partition count is a continuous knob, not a binary choice. Slide it toward many partitions for per-entity ordering at lower throughput; slide it toward few partitions for higher throughput with weak (per-shard) ordering. Both modes share the same SDK, same C++ engine, same Postgres, same durability tier. 0 duplicates and 0 lost messages in both runs. The 10-partition config also gives 9× tighter max latency (1.75 s vs 15.5 s) because every lane is always being drained.

| Resource (steady state) | 1 000-part config | 10-part config |
|---|---|---|
| queen container CPU | ~390 % (3.9 vCPU) | ~340 % (3.4 vCPU) |
| queen container RSS | ~70 MB | ~175 MB (bigger in-flight batches) |
| postgres CPU | ~1 500 % (15 vCPU) | ~470 % (4.7 vCPU, 3.2× cheaper) |
| postgres RSS at end | 16.7 GB | 12.3 GB |

In the throughput-tuned config, Postgres CPU is 3.2× lower at nearly double the throughput: bigger batches amortize the per-call SQL overhead so much that the per-message PG cost collapses. The system is then producer-bound, not PG-bound; adding more producer processes would push the rate well into the 30–50 k msg/s range without changing anything broker-side. Full writeups with reproduction recipes: pipeline-queen.md (high-cardinality) and pipeline-queen-throughput.md (throughput-tuned).
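The knob itself is just configuration. As a rough sketch (field names are illustrative, not the real pipeline config schema), the two runs above differ only in these values:

```typescript
// Illustrative only: field names are assumptions; the measured numbers are from the two runs above.
const highCardinalityConfig = {
  partitions: 1_000,  // one lane per entity → per-entity FIFO
  popBatch: 100,
  partitionsPerPop: 10,
  // measured: 3 688 msg/s end-to-end, p99 ≈ 1.02 s, postgres ≈ 15 vCPU
};

const throughputTunedConfig = {
  partitions: 10,     // few wide lanes → per-shard FIFO only
  popBatch: 1_000,
  partitionsPerPop: 1,
  // measured: 6 673 msg/s end-to-end, p99 ≈ 1.10 s, postgres ≈ 4.7 vCPU
};
```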

Queen vs Kafka vs RabbitMQ, same hardware, same payload

Single-node, 1 000 partitions/queues, persistent durability, 15-min run, ~28-byte message payload, same 32 vCPU / 62 GiB host. All three were measured directly: Kafka with kafka-producer-perf-test.sh, RabbitMQ with the official pivotalrabbitmq/perf-test. Each system uses its own native protocol (HTTP/JSON for Queen, the Kafka wire protocol for Kafka, AMQP for RabbitMQ).

Test methodology, important asymmetries you should know about

Each system was tested in its idiomatic high-throughput client configuration, not with literally identical client settings. The protocols differ enough that "same number of connections" doesn't mean the same thing in each system. Specifically:

| System | Client config | Effective concurrency |
|---|---|---|
| Queen | autocannon, 50 HTTP connections, batch=10, pipelining=1 | ~50 in-flight HTTP requests |
| Kafka | 1 producer process, linger.ms=10, batch.size=16384, max.in.flight=5 | ~5 in-flight batched requests on 1 TCP connection |
| RabbitMQ | 1 producer process, --confirm 200 | ~200 unconfirmed publishes on 1 AMQP connection |

Durability tiers are also not identical:

  • Queen: synchronous_commit=on, fsync of WAL before HTTP 201. Strictest.
  • Kafka: acks=1, broker writes to OS page cache, no fsync. If the broker crashes before the OS flushes (~1 s), messages can be lost. A single broker also means no replication to fall back on.
  • RabbitMQ: delivery_mode=2 + --confirm 200, written to queue index on disk before confirm, but flushes are batched at the index level.

What this means for the comparison: re-running with matched client concurrency (50 producers each) would likely push Kafka above 2 M msg/s and RabbitMQ to roughly 80–120 k msg/s. Tightening Kafka's durability to fsync-per-message would drop it to perhaps 100–300 k msg/s. The numbers shown are each system at its idiomatic high-throughput config, with its idiomatic durability tier, a defensible but not exhaustive choice. Queen's memory and architectural advantages hold regardless; the throughput numbers are the most sensitive to client setup.

| Metric | Queen MQ bp-10 | Kafka 3.7 kafka-1000p | RabbitMQ 3.12 rabbitmq-1000q |
|---|---|---|---|
| Throughput | 39k msg/s | 1.52M msg/s | 34.7k msg/s |
| p99 (push / confirm) | 38 ms | 2 966 ms | 9.3 ms |
| Server memory | 52 MB RSS | 3.1–7.2 GB heap | 188 MB RSS |
| CPU | 7.4 vCPU | 3.5 vCPU | 1.5 vCPU |
| Disk | ~400 B per msg | ~36 B per msg | ~9 GB written |
| Per-key order | native | native | 1 queue per key |
| Replay | timestamp / offset | timestamp / offset | streams only |
| Ops surface | 1 binary + PG | broker + KRaft | broker + Erlang |

Honest summary. Kafka does 39× more throughput at 1.5 M msg/s, but at a weaker durability tier (acks=1, no fsync), with 78× higher saturation latency and ~80× more memory. RabbitMQ ties Queen on throughput (35k vs 39k msg/s) and decisively wins on latency (9 ms vs 38 ms p99 confirm) and CPU (1.5 vs 7.4 vCPU): AMQP binary + Mnesia is cheaper per message than HTTP/JSON + Postgres INSERTs. Queen wins on memory (52 MB vs 188 MB, ~3.6× lighter), on the strictest default durability, and on architectural features no benchmark can show: per-key ordering with parallel consumers, replay-from-timestamp, transactional integration with PG, dynamic high-cardinality partitions. None of the three dominates on raw numbers at this tier; pick based on architectural fit and the operational story.

Detailed cross-system reports: vs-kafka.md · vs-rabbitmq.md

What broker benchmarks miss, your workers are usually the bottleneck

Every benchmark on this page measures broker capacity. But most production workloads aren't broker-bound. If your messages do real work (a database write, an API call, an LLM inference, a webhook delivery), your consumer fleet becomes the bottleneck long before the broker does. That changes which system matters, and by how much.

Here's the math at 20 ms of work per message (representative of a typical DB write or moderate API call). One worker processes 50 msg/s. Useful throughput equals workers × 50 msg/s:

| Target msg/s | Workers needed | Cost-of-fleet order of magnitude* |
|---|---|---|
| 5 000 msg/s | 100 | ~$500 / month |
| 10 000 msg/s | 200 | ~$1 k / month |
| 39 000 msg/s (Queen bp-10 ceiling) | 780 | ~$5 k / month |
| 104 000 msg/s (Queen peak, batch=100) | 2 080 | ~$10 k / month |
| 500 000 msg/s | 10 000 | ~$50 k / month |
| 1 500 000 msg/s (Kafka measured) | 30 000 | ~$150 k / month |

*Rough estimate at small-EC2 / small-container per-worker pricing. Real cost depends on your stack and what each worker does.
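The arithmetic behind the table is easy to sanity-check yourself; the sketch below assumes the same 20 ms of work per message and nothing else.

```typescript
// Fleet sizing from first principles: one worker sustains 1000 / workMs msg/s,
// so the fleet is targetRate / perWorkerRate. At workMs = 20, that's 50 msg/s per worker.
function workersNeeded(targetMsgPerSec: number, workMs = 20): number {
  const perWorkerMsgPerSec = 1000 / workMs;
  return Math.ceil(targetMsgPerSec / perWorkerMsgPerSec);
}

console.log(workersNeeded(39_000));    // 780    → saturating the Queen bp-10 ceiling
console.log(workersNeeded(1_500_000)); // 30 000 → saturating Kafka's measured rate
```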

Kafka's 1.5 M msg/s is mostly unreachable headroom for real workloads. Saturating it requires ~30 000 worker processes, a fleet most companies will never run. Real-world business workloads typically run on 100–2 000 workers, which means ~5 k–100 k msg/s of actual demand. That's well within Queen's envelope, with Kafka's broker at ~5 % utilization where Queen's runs at 25–80 %.

Said differently: at a real workload of 5 k msg/s, all three systems sit at low utilization on the broker side:

| System | Broker ceiling | Utilization at 5 k msg/s with 100 workers |
|---|---|---|
| Queen bp-10 | 39 k msg/s | 13 % |
| RabbitMQ classic 1000q | 35 k msg/s | 14 % |
| Kafka 3.7 single-node | 1.5 M msg/s | 0.3 % |

When the broker isn't the bottleneck, raw broker throughput stops being the deciding factor.

The broker comparison only matters when work per message is sub-millisecond: true streaming pipelines (Kafka Streams, Flink, ksqlDB), log shipping, click-stream analytics. For those, Kafka is the right tool and nothing else competes. For everything else (order processing, notifications, webhooks, ML inference jobs, chat handling, workflow steps), the broker sits idle, the worker fleet sets your cost, and the differentiator is what's easy to operate and integrate.

The honest framing. Queen's 39 k–104 k msg/s broker is a ceiling you can actually use for most production workloads. Kafka's 1.5 M is mostly headroom you'll never reach. RabbitMQ is in Queen's territory but with a different feature set. If your messages do real work, pick the system that's easiest to operate and integrate, not the one with the biggest headline number.

Resource efficiency

The most consistent signal across all 18 tests: Queen is small.

| Metric | bp-1 | bp-10 | bp-100 (peak) | bp-10-cg10 (10 groups) |
|---|---|---|---|---|
| Queen RSS max | 30 MB | 52 MB | 72 MB | 169 MB |
| Queen CPU avg | 5.2 vCPU | 7.4 vCPU | 19 vCPU | 18 vCPU |
| DB pool active avg | 2.4 | 2.4 | 2.7 | 2.7 |
| PG cache hit rate | 100% | 100% | 99.99% | 100% |
| messages_consumed table | | 84 MB | | 743 MB |

The one operational caveat: the messages_consumed table grows fast under fan-out, reaching 743 MB after 15 minutes at 10 consumer groups (~70 GB/day sustained). TTL retention is critical for any deployment running more than a few days; configure completedRetentionSeconds on hot queues.
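As an illustration (the configuration call and option shape below are assumptions; only completedRetentionSeconds is named by the text), a short retention window keeps the table bounded:

```typescript
// Hypothetical sketch: configureQueue is an assumed client method; completedRetentionSeconds
// is the setting named above. At ~70 GB/day of growth under 10-group fan-out, a 1-hour
// window keeps messages_consumed around 3 GB instead of letting it grow unbounded.
import { Queen } from 'queen-mq'; // assumed import

const queen = new Queen({ url: 'http://localhost:8080' });

export async function configureRetention() {
  await queen.configureQueue('orders', {
    completedRetentionSeconds: 3600, // evict consumed/acked rows after 1 hour
  });
}
```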

Adaptive concurrency in action

Queen ships with DB_POOL_SIZE=50 by default. Across the entire benchmark suite, including the 100k msg/s peak test, the libqueen Vegas controller kept the active connection count at ~2.5. The other 47 connections sit idle in the pool, available as overflow for transient spikes.

Configured: 50 · Active avg: 2.5

This is exactly what the TCP-Vegas-style controller is supposed to do: when adding in-flight work doesn't reduce RTT, the controller knows the pipe isn't congested and holds concurrency where it is. You can't manually tune your way to a better number for this workload; the controller already found it. See the architecture page for the math.
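For intuition, here is a simplified Vegas-style decision rule. This is a sketch only, not libqueen's actual controller: estimate how many requests are queueing from the gap between the best-case RTT and the observed RTT, and only grow the in-flight budget while that gap stays near zero.

```typescript
// Simplified TCP-Vegas-style concurrency controller (illustrative, not libqueen's code).
class VegasLimiter {
  private limit = 4;            // current in-flight budget (out of the pool of 50)
  private baseRttMs = Infinity; // best RTT seen ≈ uncongested service time

  onSample(rttMs: number): void {
    this.baseRttMs = Math.min(this.baseRttMs, rttMs);
    const expected = this.limit / this.baseRttMs;        // req/ms if nothing is queueing
    const actual = this.limit / rttMs;                   // req/ms actually achieved
    const queued = (expected - actual) * this.baseRttMs; // ≈ requests sitting in a queue

    if (queued < 1) this.limit += 1;                               // no queueing: probe upward
    else if (queued > 3) this.limit = Math.max(1, this.limit - 1); // queue building: back off
    // between the thresholds: hold, which is the steady state the benchmark observed at ~2.5
  }

  get inFlightBudget(): number { return this.limit; }
}
```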

0.14 vs 0.12, adaptive engine impact

Same five tests, run once on each version, fresh DB:

| Test | 0.14 push | 0.12 push | 0.14 pop | 0.12 pop |
|---|---|---|---|---|
| bp-10 | 39 060 | 31 820 | 38 351 | 17 149 |
| bp-100 | 104 400 | 64 400 | 101 675 | 61 279 |
| hi-part-1 | 29 487 | 13 696 | 27 849 | 3 194 |
| hi-part-10000 | 21 044 | 17 331 | 17 825 | 3 643 |
| q-10 | 40 500 | 31 610 | 38 540 | 29 555 |

Pop throughput improved 80–90% under partition contention, the single biggest win from the libqueen rewrite. PG memory usage is 30–70% lower for the same workload, and the PG deadlock failure mode under heavy fan-out is eliminated.

Data integrity, the headline number

Short suite

~3B events · 18 tests · 15 min each · ackFailed: 0 · DLQ messages: 0

Long-running test

1.5B messages · 14 hours · sustained ~28k msg/s · messages table 35 GB at end · lost: 0

Combined

1.6B+ events · hardware crashes: 0 · PG outages: 0 · data loss: 0 · failover replay: 100%

Bugs we found

Honest accounting. Four issues surfaced during the run; all four are cosmetic, none caused data loss. Listed here because a benchmark page that doesn't tell you what broke isn't useful.

| Bug | When it fires | Severity | Fix |
|---|---|---|---|
| StatsService.refresh_all_stats_v1 30 s timeout | Sustained ≥30k msg/s with many partitions, or multi-cg load | Cosmetic; advisory lock prevents pile-up | One line: SET LOCAL statement_timeout = 0; |
| evict_expired_waiting_messages 30 s timeout | 10 consumer-group load only | Cosmetic | Same one-line fix in the retention procedure |
| PG deadlock detected during high-concurrency push | 10 001 partitions on 0.12 (mostly fixed in 0.14) | None; file-buffer failover catches everything | 0.14's v3 push procedure eliminates most cases |
| queue-ops reports pushMessages: 0 | Always (in 0.14.0.alpha.3) | Cosmetic; reporting only | Bug in per-queue stats aggregator |

Reproduce it yourself

All raw data, runner scripts, configs, and analysis live in the repo: benchmark-queen/2026-04-26. Each test directory has the per-consumer logs, queue-ops time-series JSON, postgres stats, system metrics, and final docker-stats output. The HOW-TO-RUN.md walks through reproducing the entire suite step by step (Docker images, PostgreSQL tuning, autocannon configuration).

Sizing

How much Postgres do I need?

Interactive calculator turning your target msg/s into a PG vCPU budget, derived directly from these benchmarks. ~10 k PG ops/s/vCPU rule of thumb.

Summary

Full README

Compacted result table for all 18 tests, key conclusions, version comparison verdict.

Pipeline · ordered

pipeline-queen.md

4-stage pipeline at 1 000 partitions for per-entity ordering: 3 688 msg/s end-to-end, p99 = 1.02 s, 99.96 % completeness.

Pipeline · throughput

pipeline-queen-throughput.md

Same pipeline at 10 partitions for max throughput: 6 673 msg/s end-to-end, p99 = 1.10 s, 3.2× lower PG CPU.

Recipe

HOW-TO-RUN.md

Step-by-step reproduction: Docker setup, PG tuning, the runner scripts, parameter sweeps.

Comparison

vs-kafka.md

Direct head-to-head with Apache Kafka 3.7 at 1 000 partitions, full latency profile.

Deep-dive

cg-axis-comparison.md

The 1 vs 5 vs 10 consumer-group sweep, including per-group fairness numbers.