Matrix · Queen 0.16 (pushser) vs 0.14 · June 2026

Same suite, new engine: faster on 10 of 11 cells.

We re-ran the April benchmark matrix on Queen 0.16 (the pushser architecture) on the same 32 vCPU / 62 GiB host with the same PostgreSQL tuning and the same autocannon harness, so the only thing that changed is the broker. Baseline is Queen 0.14.0.alpha.3 from the April 2026 run. Full numbers and raw data: benchmark-queen/2026-06-07.

Headline

Cells faster vs 0.14: 10 / 11, by 1.3× to 3.0×

Peak push: 171.6k msg/s (bp-100 · was 104.4k)

Partition scaling: flat to 10k (~33k msg/s · 0.14 decayed)

Fan-out (10 cg): 2.94× vs 0.14 on the same cell

Same hardware, same Postgres, same client; only the broker changed. The pushser engine is faster on every cell but one, and its biggest gains are exactly where they matter: batched producers, high partition counts, and consumer-group fan-out.

Push throughput vs 0.14

Push throughput per cell is the producers' aggregate autocannon request rate × messages per push, measured over 300 s per cell with a fresh database each time. Factor = 0.16 ÷ 0.14.

Cell	Setup	0.14 msg/s	0.16 msg/s	0.16 p50	Factor
`bp-1`	50 conns, batch 1	5.8k	17.5k	2 ms	3.02×
`bp-10`	50 conns, batch 10	39.1k	85.1k	4 ms	2.18×
`bp-100`	50 conns, batch 100	104.4k	171.6k	18 ms	1.64×
`q-1`	1 queue, batch 10	39.1k	84.7k	4 ms	2.17×
`q-10`	10 queues, batch 10	40.5k	86.5k	4 ms	2.13×
`bp-10-cg5`	5 consumer groups	26.9k	68.0k	5 ms	2.53×
`bp-10-cg10`	10 consumer groups	17.9k	52.7k	7 ms	2.94×
`hi-part-10`	500 conns, 10 part	27.9k	24.7k	23 ms	0.89×
`hi-part-100`	500 conns, 100 part	26.0k	33.8k	13 ms	1.30×
`hi-part-1000`	500 conns, 1k part	25.0k	33.4k	13 ms	1.33×
`hi-part-10000`	500 conns, 10k part	21.0k	32.5k	13 ms	1.55×

Biggest wins are the low-concurrency cells (bp/q/cg: 1.6× to 3.0×), where the pushser push pipeline does the most work per round-trip. The one regression is hi-part-10 (−11%): few partitions under high connection count is the new engine's single weak spot.
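As a quick sanity check on the Factor column, the sketch below recomputes 0.16 ÷ 0.14 from the table values. It works from the rounded figures shown above (thousands of msg/s), so a recomputed factor can differ from the published one by 0.01; the published column presumably comes from the unrounded raw data. The names `CELLS` and `factor` are illustrative only.

```python
# Push throughput in thousands of msg/s, copied from the table above
# as (Queen 0.14, Queen 0.16). Inputs are rounded to one decimal, so a
# recomputed factor may differ from the published column by 0.01.
CELLS = {
    "bp-1": (5.8, 17.5),
    "bp-10": (39.1, 85.1),
    "bp-100": (104.4, 171.6),
    "q-1": (39.1, 84.7),
    "q-10": (40.5, 86.5),
    "bp-10-cg5": (26.9, 68.0),
    "bp-10-cg10": (17.9, 52.7),
    "hi-part-10": (27.9, 24.7),
    "hi-part-100": (26.0, 33.8),
    "hi-part-1000": (25.0, 33.4),
    "hi-part-10000": (21.0, 32.5),
}


def factor(old_kmsg_s, new_kmsg_s):
    """Speed-up of 0.16 over 0.14 for one cell (0.16 / 0.14)."""
    return new_kmsg_s / old_kmsg_s


factors = {cell: factor(old, new) for cell, (old, new) in CELLS.items()}
faster = [cell for cell, f in factors.items() if f > 1.0]
slower = [cell for cell, f in factors.items() if f <= 1.0]

for cell, f in factors.items():
    print(f"{cell:<14} {f:.2f}x")
print(f"faster on {len(faster)} of {len(CELLS)} cells; slower: {slower}")
```

Running it reproduces the 10-of-11 headline and singles out hi-part-10 as the only cell below 1.0×.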

Partition scaling: flat where 0.14 declined

Same producer load (500 connections, 1 msg/push), increasing the partition count from 10 to 10,000. In Queen, partitions are logical lanes, so the cost is index size, not state machine size, and the new engine flattens the curve.

Partitions	Queen 0.14 msg/s (declines)	Queen 0.16 msg/s (holds flat)
10	27.9k	24.7k
100	26.0k	33.8k
1,000	25.0k	33.4k
10,000	21.0k	32.5k

This is an architectural win. 0.14 falls from 27.9k to 21.0k msg/s (about 25%) as partitions grow from 10 to 10,000; 0.16 starts lower at 10 partitions (24.7k) but holds at ~32-34k from 100 partitions up. The advantage widens with scale, reaching 1.55× at 10,000 partitions, which is exactly the "one partition per entity" regime Queen is built for.
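The shape of the two curves can be put into numbers with a few lines of arithmetic. The sketch below uses the rounded hi-part throughputs from this section; `pct_change` is a hypothetical helper, not part of any Queen tooling.

```python
# hi-part cells, thousands of msg/s, rounded values from this section.
PARTITIONS = [10, 100, 1_000, 10_000]
V014 = [27.9, 26.0, 25.0, 21.0]
V016 = [24.7, 33.8, 33.4, 32.5]


def pct_change(first, last):
    """Relative change from first to last, in percent."""
    return (last - first) / first * 100.0


# 0.14 across the whole sweep, 10 -> 10,000 partitions.
decline_014 = pct_change(V014[0], V014[-1])
# 0.16 across its plateau, 100 -> 10,000 partitions.
drift_016 = pct_change(V016[1], V016[-1])
# 0.16 / 0.14 at each partition count.
ratios = [new / old for old, new in zip(V014, V016)]

print(f"0.14, 10 -> 10,000 partitions: {decline_014:+.1f}%")
print(f"0.16, 100 -> 10,000 partitions: {drift_016:+.1f}%")
for parts, ratio in zip(PARTITIONS, ratios):
    print(f"{parts:>6} partitions: {ratio:.2f}x")
```

On these figures 0.14 loses roughly a quarter of its throughput over the sweep, 0.16 drifts by a few percent across its plateau, and the 0.16 ÷ 0.14 ratio rises at every step.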

Consumer-group fan-out

Same producer, increasing the number of consumer groups draining the queue:

Groups	0.14 push msg/s	0.16 push msg/s	Factor
5 groups	26.9k	68.0k	2.53×
10 groups	17.9k	52.7k	2.94×

Fan-out gets relatively better as you add groups (2.53× → 2.94×): the pushser batching amortizes the extra per-group pop work more effectively than 0.14 did.
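One way to see the amortization is to ask how much push throughput each version keeps as groups are added. The sketch below assumes the bp-10 cell is the matching producer setup without extra consumer groups (the cell names suggest this, but the report does not state it), and `retained` is an illustrative helper.

```python
# Thousands of msg/s, rounded values from the tables in this report.
# ASSUMPTION: bp-10 is the same producer setup without the extra groups.
BASE = {"0.14": 39.1, "0.16": 85.1}
FANOUT = {
    5: {"0.14": 26.9, "0.16": 68.0},
    10: {"0.14": 17.9, "0.16": 52.7},
}


def retained(version, groups):
    """Fraction of bp-10 push throughput kept with `groups` consumer groups."""
    return FANOUT[groups][version] / BASE[version]


for groups in FANOUT:
    for version in BASE:
        pct = retained(version, groups) * 100
        print(f"{version} with {groups:>2} groups keeps {pct:.0f}% of bp-10")
```

Under that assumption, 0.16 keeps a larger share of its bp-10 rate than 0.14 at both 5 and 10 groups, which is consistent with the factor growing from 2.53× to 2.94×.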

Resource efficiency

CPU as cores (of 32). The broker stays tiny; Postgres is where the work, and the ceiling, lives.

Cell	msg/s	Queen CPU	Queen RSS	PG CPU	PG RSS
`bp-10`	85.1k	2.0	46 MB	8.7	13.5 GB
`bp-100`	171.6k	0.9	73 MB	11.2	25.0 GB
`bp-10-cg10`	52.7k	2.3	101 MB	15.0	9.7 GB
`hi-part-10000`	32.5k	5.4	53 MB	8.5	6.1 GB

The broker is lean everywhere: 32-106 MB RSS and ≤6 cores across the whole matrix. The headline: bp-100 pushes 171.6k msg/s on ~0.9 broker cores. Big batches mean few HTTP requests, so the broker burns almost no CPU; PG does the inserting.
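The link between batch size and broker CPU follows from simple arithmetic: producer request rate is msg/s divided by messages per push. The sketch below works that out for three cells using the rounded table values. It counts producer requests only (consumer pops also reach the broker), and the helper names are illustrative.

```python
# cell: (k msg/s, messages per push, Queen CPU cores), from the tables above.
ROWS = {
    "bp-10": (85.1, 10, 2.0),
    "bp-100": (171.6, 100, 0.9),
    "hi-part-10000": (32.5, 1, 5.4),
}


def producer_requests_k(kmsg_s, msgs_per_push):
    """Producer HTTP requests per second, in thousands."""
    return kmsg_s / msgs_per_push


def kmsg_per_broker_core(kmsg_s, queen_cores):
    """Thousands of msg/s pushed per broker core."""
    return kmsg_s / queen_cores


summary = {
    cell: (producer_requests_k(k, batch), kmsg_per_broker_core(k, cores))
    for cell, (k, batch, cores) in ROWS.items()
}

for cell, (req_k, per_core) in summary.items():
    print(f"{cell:<14} {req_k:6.2f}k req/s  {per_core:6.1f}k msg/s per broker core")
```

bp-100 has the lowest producer request rate of the three and the highest msg/s per broker core, while hi-part-10000, at one message per push, sits at the other end on both counts.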

The config lesson: match push-hold to your concurrency

The pushser push path fuses requests, holding briefly to batch them. At low client concurrency (this harness's 50 connections) a non-zero hold serializes the push lane; setting QUEEN_PUSH_MAX_HOLD_MS=0 pipelines it. The same knob, tuned to the workload, is the difference between 26k and 92k on bp-10:

Config	push msg/s	p50	p99
0.14 baseline	39.1k	11 ms	38 ms
0.16, default hold	26.0k	18 ms	48 ms
0.16, hold = 2 ms	34.2k	13 ms	39 ms
0.16, hold = 0	92.5k	4 ms	31 ms

Tune the hold to the workload. Low-concurrency / batched clients want hold=0 (pipelines the push path). High-concurrency throughput, like the 24-hour soak with 650 producers, wants a larger hold so fusion packs many messages per Postgres commit (it sustained ~118k msg/s that way). All matrix numbers above use hold=0, the right setting for this harness.
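Why latency turns into throughput at low concurrency can be sketched with a closed-loop estimate: if each connection keeps one push in flight, throughput is roughly connections × batch ÷ latency. The sketch assumes the harness runs autocannon with one request in flight per connection (autocannon's default pipelining) and uses p50 as a stand-in for mean latency, so the estimate is optimistic. It is a rough model, not a description of the engine's internals.

```python
CONNECTIONS = 50   # bp-10 harness concurrency
BATCH = 10         # messages per push in bp-10

# config: (p50 in ms, measured k msg/s), from the table above.
CONFIGS = {
    "0.14 baseline": (11, 39.1),
    "0.16, default hold": (18, 26.0),
    "0.16, hold = 2 ms": (13, 34.2),
    "0.16, hold = 0": (4, 92.5),
}


def closed_loop_kmsg_s(connections, batch, latency_ms):
    """Estimated k msg/s if every connection waits latency_ms per push.

    ASSUMPTION: one request in flight per connection, and p50 used in
    place of mean latency, which makes this an optimistic estimate.
    """
    pushes_per_s = connections * 1000.0 / latency_ms
    return pushes_per_s * batch / 1000.0


estimates = {
    name: closed_loop_kmsg_s(CONNECTIONS, BATCH, p50)
    for name, (p50, _) in CONFIGS.items()
}

for name, (p50, measured) in CONFIGS.items():
    print(f"{name:<20} p50 {p50:>2} ms  est {estimates[name]:6.1f}k  measured {measured:5.1f}k")
```

The estimates land above the measured numbers but rank the four configurations in the same order, which fits the explanation that a non-zero hold adds to every round-trip when only 50 connections are driving the push lane.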

Caveats

This is the low-concurrency autocannon harness (50-500 connections). It measures per-cell push efficiency, not the engine's sustained ceiling; for that, see the 24-hour soak (~118k msg/s sustained, balanced).
The metric is push throughput (producer side); pop throughput is recorded per cell in the raw consumer logs.
q-100 is omitted: it has no April baseline, and the producer log didn't record cleanly. A handful of April-only cells weren't re-run.
Errors were negligible across all cells (non2xx ≤ 2 per run, ~3×10⁻⁵%).

Companion benchmark