How Queen works under the hood.
Queen is three layers stacked end-to-end. PostgreSQL stores everything. A C++ engine called libqueen fuses many HTTP requests into batched stored-procedure calls. A uWebSockets HTTP server distributes work across worker threads and routes responses back to the right connection. This page is the mental model: enough to predict how the system behaves under any workload, without reading the source.
The three layers
Read the system top-down. Each layer has one job.
The trick that makes this fast: HTTP threads never touch libpq directly.
They drop a request into libqueen's per-type queue and return immediately. libqueen
collects requests of the same type, fuses them into one batched stored-procedure call
on a non-blocking PostgreSQL connection, then disperses the per-row results back to
each waiting HTTP request via a cross-thread defer().
Data model, PostgreSQL is the source of truth
Everything the system knows about, every queue, every message, every consumer offset, lives in PostgreSQL rows. There is no in-memory queue state on the server that survives a restart. The schema is intentionally small:
| Table | What it holds |
|---|---|
| `queues` | Per-queue configuration: `lease_time`, `retry_limit`, retention windows, encryption flag, DLQ flags, priority. |
| `partitions` | Named lane inside a queue. Logical only; Queen does not use PostgreSQL native table partitioning. |
| `messages` | The payload. The dedupe key is `(partition_id, transaction_id)`, not just `transaction_id`: idempotency is partition-scoped. |
| `partition_consumers` | The lease + offset cursor for each `(partition_id, consumer_group)` pair. Holds `last_consumed_id`, `lease_expires_at`, `worker_id`, `batch_size`, `acked_count`, `batch_retry_count`. This single row is what makes "queue mode" and "consumer-group mode" share the same code path. |
| `consumer_groups_metadata` | Per-named-group subscription mode (all / new / from timestamp). |
| `partition_lookup` | "Head of stream" snapshot per partition: `last_message_id`, `last_message_created_at`, `updated_at`. Lets pop decide whether a partition has anything new without scanning `messages`. |
| `consumer_watermarks` | Wildcard-pop optimization: `last_empty_scan_at` per (queue, group). Caught-up consumers skip stale partitions. |
| `dead_letter_queue` | Failed messages that exceeded the retry limit, plus the original error message. |
| `traces`, `messages_consumed`, `stats`, `system_metrics` | Append-only logs and rollups powering the dashboard. |
Two indexes carry the hot path
- `messages (partition_id, created_at, id)`, the cursor scan that pop walks to find the next batch of unconsumed messages on a claimed partition.
- `UNIQUE (partition_id, transaction_id)` on `messages`, which enforces dedupe via `ON CONFLICT DO NOTHING` in push.
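In DDL form, roughly the following. The index names match what the storage-cost section later on this page reports; the authoritative definitions live in `lib/schema/`:

```sql
-- Cursor-scan index: pop walks (created_at, id) within one claimed partition.
CREATE INDEX idx_messages_partition_created
    ON queen.messages (partition_id, created_at, id);

-- Dedupe index: push inserts with ON CONFLICT DO NOTHING against this.
CREATE UNIQUE INDEX messages_partition_transaction_unique
    ON queen.messages (partition_id, transaction_id);
```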
Update-heavy tables (partition_lookup, partition_consumers,
stats) ship with FILLFACTOR=50 and tuned autovacuum knobs to
keep them HOT-update-friendly under sustained write load.
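A sketch of that tuning; the `fillfactor` is the documented 50, while the autovacuum values here are illustrative placeholders, not the shipped settings:

```sql
ALTER TABLE queen.partition_consumers SET (
    fillfactor = 50,                        -- leave room on each page for HOT updates
    autovacuum_vacuum_scale_factor = 0.01,  -- vacuum early on update-heavy tables
    autovacuum_vacuum_cost_delay = 0        -- don't throttle those vacuums
);
```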
Two coordination primitives
Both live inside the procedures, not the schema:
- POP wildcard claim: `pg_try_advisory_xact_lock(hashtextextended(partition_id::text, …))`. Non-blocking try; the first request to claim a partition wins it for the duration of its transaction.
- ACK serialization: `pg_advisory_xact_lock(...md5(partition_id || consumer_group)...)`. Blocking lock; prevents two concurrent acks from racing on the same cursor row.
These are the entire concurrency story. There are no application-level mutexes, no external coordination services. PostgreSQL is the lock manager.
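A minimal illustration of the two idioms. The key derivation here (hash seed, md5-to-bigint conversion) is a sketch; the exact inputs live in the procedure source:

```sql
BEGIN;
-- POP wildcard claim: returns false immediately if another pop holds partition 42.
SELECT pg_try_advisory_xact_lock(hashtextextended('42', 0));

-- ACK serialization: blocks until the previous ack on this cursor commits.
SELECT pg_advisory_xact_lock(
    ('x' || substr(md5('42' || 'billing-workers'), 1, 16))::bit(64)::bigint);
COMMIT;  -- both locks release here, with the transaction
```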
The __QUEUE_MODE__ trick
When a consumer pops without specifying a group, every code path internally uses the literal string `'__QUEUE_MODE__'` as the consumer-group value. Same `partition_consumers` row mechanism, same offset tracking, same procedures, just a special name. This is why the behavioral difference between "queue mode" and "consumer-group mode" is small enough to coexist in one stored procedure.
Stored procedures, the hot-path API
Every hot-path operation is a named, versioned PostgreSQL procedure. They all take a JSON array (the batch from libqueen) and return a JSON result. This is what makes request fusion safe: many client requests collapse into one procedure call, one network round trip, one transaction.
| Procedure | What it does |
|---|---|
| `queen.push_messages_v3` | Upserts queue + partition rows, deduplicates within the batch and against existing rows via `ON CONFLICT`, inserts messages, returns inserted ids and the per-partition `partition_updates` the server then writes to `partition_lookup`. |
| `queen.pop_unified_batch_v4` | One procedure for both queue mode and consumer-group mode, both fixed-partition and wildcard. Scans `partition_lookup` + `consumer_watermarks`, claims partitions with an advisory try-lock, takes the lease in `partition_consumers`, fetches messages by cursor. v4 supports a per-call `max_partitions` field: the wildcard branch can drain up to N partitions in a single call (one shared lease across all of them, `batch_size` as the global message cap). v3 is preserved for emergency revert by flipping the dispatched SQL string in `lib/queen/pending_job.hpp`. |
| `queen.ack_messages_v2` | For each ack: serialize on (partition, group), validate the lease, advance the cursor (on `completed`) or release + retry/DLQ-route (on `failed`). Clears the lease when `acked_count` hits `batch_size`. |
| `queen.execute_transaction_v2` | Inlines simplified ack + push operations in one PL/pgSQL function, i.e. one PG transaction. Either everything commits or nothing does. This is what makes "exactly-once across queues" actually true. |
| `queen.renew_lease_v2` | Extends `lease_expires_at` on rows where `worker_id` matches and the lease hasn't already expired. Idempotent and batchable. |
| `queen.has_pending_messages` | Cheap `EXISTS` with the same predicate as pop's candidate scan. Used by long-poll readiness checks without claiming anything. |
| `queen.update_partition_lookup_v1` | Monotonic UPSERT into `partition_lookup`. Called by the server after every push commit, replacing what used to be a row trigger on `messages`. |
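Because every procedure speaks JSON-array-in / JSON-out, you can exercise the hot path from `psql` without the server. The field names below are inferred from the push walkthrough later on this page, not a documented contract; check the procedure source before relying on them:

```sql
-- Hypothetical two-message batch, the same shape libqueen would send.
SELECT queen.push_messages_v3('[
  {"queue": "orders", "partition": "Default",
   "transactionId": "ord-1001", "payload": {"total": 99.50}},
  {"queue": "orders", "partition": "Default",
   "transactionId": "ord-1002", "payload": {"total": 12.00}}
]'::jsonb);
```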
How push actually works
Push deserves a closer look: it is the highest-throughput hot path. `push_messages_v3` is three statements with no temp tables. In sketch form (simplified, with illustrative column names; the real bodies live in `lib/schema/procedures/`):
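```sql
-- 1. Upsert queue rows for every queue named in the batch.
INSERT INTO queen.queues (name)
SELECT DISTINCT item->>'queue'
FROM jsonb_array_elements($1::jsonb) AS item
ON CONFLICT (name) DO NOTHING;

-- 2. Upsert the referenced partitions.
INSERT INTO queen.partitions (queue_name, name)
SELECT DISTINCT item->>'queue', item->>'partition'
FROM jsonb_array_elements($1::jsonb) AS item
ON CONFLICT (queue_name, name) DO NOTHING;

-- 3. Insert the messages; the unique index silently absorbs duplicate
--    (partition_id, transaction_id) pairs.
INSERT INTO queen.messages (partition_id, transaction_id, payload)
SELECT p.id, item->>'transactionId', item->'payload'
FROM jsonb_array_elements($1::jsonb) AS item
JOIN queen.partitions p
  ON p.queue_name = item->>'queue' AND p.name = item->>'partition'
ON CONFLICT (partition_id, transaction_id) DO NOTHING;
```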
Three SQL statements, one network round trip from libqueen, hundreds of pushed messages per call under load. The `partition_updates` aggregate in the response is what the server replays into `update_partition_lookup_v1` after commit, so the wildcard pop path always finds the latest writes.
How pop finds the next message
Pop is the most intricate procedure because it covers four shapes (queue / consumer group × fixed / wildcard partition) in one function. The wildcard-by-consumer-group path is the interesting one:
- Read the `last_empty_scan_at` watermark from `consumer_watermarks`.
- Scan `partition_lookup` for partitions where `updated_at >= watermark - 2 minutes` (caught-up consumers skip the rest), the lease is free, and the cursor is behind the head.
- For each candidate, try `pg_try_advisory_xact_lock` on the partition id. The first successful claim wins; the loop exits.
- UPSERT a row in `partition_consumers`, then UPDATE it to take the lease and read the cursor.
- Fetch the batch: `SELECT … FROM messages WHERE partition_id = $ AND (created_at, id) > (cursor) ORDER BY (created_at, id) LIMIT batch_size`.
If no partition was claimed, the watermark is bumped and the procedure returns `no_available_partition`. libqueen turns this into a long-poll backoff rather than an error response.
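Put together, the candidate scan and the claim collapse into one filtered query, roughly like this sketch (column names assumed; the try-lock in the WHERE clause is what lets losing consumers skip a claimed partition for free):

```sql
-- Sketch: find one claimable partition for ($1 = queue, $2 = group, $3 = watermark).
SELECT pl.partition_id
FROM queen.partition_lookup pl
LEFT JOIN queen.partition_consumers pc
       ON pc.partition_id = pl.partition_id
      AND pc.consumer_group = $2
WHERE pl.queue_name = $1
  AND pl.updated_at >= $3 - interval '2 minutes'                    -- watermark filter
  AND (pc.lease_expires_at IS NULL OR pc.lease_expires_at < now())  -- lease is free
  AND COALESCE(pc.last_consumed_id, 0) < pl.last_message_id         -- cursor behind head
  AND pg_try_advisory_xact_lock(hashtextextended(pl.partition_id::text, 0))
LIMIT 1;
```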
libqueen, the request-fusion engine
libqueen is a per-HTTP-worker C++ library that owns its own libuv event loop and a pool of non-blocking PostgreSQL connections. Its job: take many HTTP requests of the same kind and turn them into one batched stored-procedure call.
The JobType enum
Each JobType has its own queue, its own batch policy, and its own concurrency controller. They run in parallel: a high-load pop doesn't block push, and vice versa.
BatchPolicy, when to fire
Three knobs per type, settable via `QUEEN_<TYPE>_*` env vars. The rule they implement: fire when enough requests are queued, or when the oldest one has waited too long. So latency is bounded by `max_hold_ms` when load is low, and throughput compounds via batching when load is high. Defaults (batch size / max hold): PUSH 50/20 ms, POP 20/5 ms (latency-sensitive), ACK 50/20 ms, TRANSACTION 1/0 (never fuse).
ConcurrencyController, adaptive in-flight cap
Two implementations, switched by `QUEEN_CONCURRENCY_MODE`:

- Static: a fixed cap, an atomic counter against `max_concurrent`. Predictable and boring.
- Vegas (default): TCP-Vegas-inspired adaptive limit. The controller computes `queue_load = in_flight × (1 − rtt_min ⁄ rtt_recent)`. If `queue_load < α` (default 3), grow the limit; if `queue_load > β` (default 12), shrink. `rtt_min` is a 30-second sliding minimum, so a single fast outlier doesn't pin it. The limit adjusts at most once per second to prevent thrashing.

A worked example: with `rtt_min` = 2 ms and `rtt_recent` = 4 ms at 10 batches in flight, `queue_load = 10 × (1 − 2⁄4) = 5`, between α and β, so the limit holds. If PostgreSQL queueing pushes `rtt_recent` to 10 ms with 20 in flight, `queue_load = 20 × (1 − 2⁄10) = 16 > β`, and the limit shrinks.

This is why `QUEEN_PUSH_MAX_CONCURRENT` ships at 24 even though Vegas typically converges to ~17 in steady state: Vegas needs headroom to grow into when load rises. The cap is a safety ceiling, not an operating point.
DrainOrchestrator, the heart
All scheduling runs on libqueen's dedicated libuv thread. The drain loop wakes on four sources:
| Wake source | Mechanism |
|---|---|
| New job submitted | uv_async_send from any thread |
| Batch completed | Slot freed in the connection pool |
| Hold timer fired | uv_timer_t re-armed at the earliest front_enqueue_time + max_hold_ms |
| Long-poll backoff ready | Earliest pop next_check deadline |
One drain pass: process expired long-polls; round-robin over the six JobTypes (with a
rotating start index so no type starves); for each type, while the BatchPolicy says
fire and the ConcurrencyController has a slot and there's a free DB connection, take
up to max_batch jobs, merge their JSON arrays, and call
PQsendQueryParams on a non-blocking connection. Re-arm the safety-net
timer.
libpq async, integrated with libuv
Each PostgreSQL connection is wrapped in a DBConnection with the libpq
socket fd and a uv_poll_t. When a batch fires, libqueen starts polling
for UV_WRITABLE | UV_READABLE:
- On writable → `PQflush` until the query is sent.
- On readable → `PQconsumeInput` + `PQisBusy` loop, then `PQgetResult` reads the JSON response.
- The procedure returns one result row containing one JSONB array; libqueen splits it back into per-job results and invokes each `PendingJob::callback(result_str)`.
- The slot returns to the free pool and kicks the orchestrator again, which is why throughput stays high when many slots become free at once.
Long polling
When pop returns `no_available_partition`, libqueen tracks a per-request `next_check` with exponential backoff (initial 100 ms, multiplier 2.0, cap 1000 ms) and a `wait_deadline` from the client's timeout parameter. The safety-net timer fires the request again at `next_check` without holding any DB connection in between; this is what makes "30k idle long-pollers" actually fine.
HTTP shell, uWebSockets, acceptor + workers
The HTTP layer is small and uniform. main_acceptor.cpp is 76 lines;
acceptor_server.cpp wires everything together.
Acceptor / worker pattern
- One acceptor thread listens on `:6632`. When a TCP connection arrives, it `adoptSocket()`s the socket onto one of the worker `uWS::App` event loops. The acceptor never serves a request itself.
- N worker threads (default 10), each running its own `uWS::App` and its own `Queen` instance. Each worker has its own `ResponseRegistry`, so cross-worker response delivery has no shared mutex.
- Worker 0 bootstraps the schema and starts the background services: `MetricsCollector`, `RetentionService`, `EvictionService`, `StatsService`, `PartitionLookupReconcileService`, and `SharedStateManager`. The other workers wait on a condvar until worker 0 signals ready.
Cross-thread response delivery
This is the trickiest part of the C++ code, but the model is simple. When a route
handler calls queen->submit(job, callback), the callback runs on
libqueen's libuv thread, not the uWS thread that owns the
HttpResponse*. Touching that pointer from the wrong thread is undefined
behavior. So:
`uWS::Loop::defer()` queues the lambda on the worker's event loop. If the client disconnected in the meantime, `res->onAborted` has already invalidated the registry entry, so late deliveries become no-ops.
End-to-end walkthrough
One push, in detail
- Client sends `POST /api/v1/push` with `{items: [...]}`.
- The acceptor adopts the socket onto worker N.
- Auth middleware validates the JWT and stamps `producerSub` from the verified `sub` claim. Client-supplied `producerSub` is ignored; this is what closes the impersonation hole.
- If the queue has encryption enabled, `EncryptionService` replaces each payload with `{encrypted, iv, authTag}` (AES-256-GCM).
- The route registers the response in the per-worker `ResponseRegistry` and gets back a `request_id`.
- It stashes the items in `push_failover_storage[request_id]` in case the DB call fails.
- It builds a `JobRequest{op_type=PUSH, params=[items_json]}` and calls `queen->submit()`. The HTTP thread is unblocked.
- libqueen's drain orchestrator collects this push along with others queued nearby. When BatchPolicy says fire, it merges their item arrays and calls `PQsendQueryParams("SELECT queen.push_messages_v3($1::jsonb)", ...)`.
- The procedure runs three SQL statements (queues upsert, partitions upsert, messages insert) and returns `{items: [...], partition_updates: [...]}`.
- libqueen reads the response, splits it back into per-request results, and invokes each callback.
- The callback `defer()`s onto the uWS worker loop. There, the route writes the response. It also fires a UDP `MESSAGE_AVAILABLE` packet to peer instances and writes `partition_updates` via `update_partition_lookup_v1`.
A transactional pipeline
The path is the same up to `queen->submit`, but the JobType is TRANSACTION with `max_concurrent=1, max_batch=1`. libqueen runs the request alone, calling `execute_transaction_v2`. Inside the procedure, ack and push are inlined in one PL/pgSQL function, i.e. one PostgreSQL transaction. Any RAISE in either operation rolls back the whole bundle: the input message stays unacked, the output message never lands. On commit, both become durable at the same commit record. That is exactly-once across queues.
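In call form, a hypothetical bundle (the field names here are illustrative, not the documented contract; consult the procedure source):

```sql
-- Ack one message from 'orders' and push its successor to 'invoices',
-- atomically: if either half raises, neither happens.
SELECT queen.execute_transaction_v2('{
  "acks":   [{"queue": "orders", "partition": "Default",
              "messageId": 8812, "status": "completed"}],
  "pushes": [{"queue": "invoices", "partition": "Default",
              "transactionId": "inv-8812", "payload": {"orderId": 8812}}]
}'::jsonb);
```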
Failover, multi-instance sync, encryption
Disk failover for push
If PostgreSQL is unreachable mid-push, the route handler's callback parses
{success: false, error: …} and replays the items from
push_failover_storage into the FileBufferManager, a
rotating JSON-Lines file in FILE_BUFFER_DIR. A background scanner reads
the file every FILE_BUFFER_FLUSH_MS (default 100ms) and replays buffered
events into the database once it's reachable again. On startup, worker 0 runs a
recovery pass over leftover files before accepting traffic.
Pop and ack do not have a file-buffer path; they require the database to be live. The disk buffer is about not losing inbound writes, not about serving reads from disk. For full HA, the database itself must be HA (Patroni, RDS Multi-AZ, Cloud SQL, etc.); Queen just reconnects when the upstream reappears.
Multi-instance UDP sync
Queen servers are stateless above PostgreSQL: you can run as many as you want pointed at the same DB. Two optional UDP layers make horizontal scale fast:
- Peer notifications: when a server commits a push, it broadcasts a `MESSAGE_AVAILABLE` packet to peers. libqueen's POP backoff tracker hears this and resets the affected long-polls to fire immediately. Sub-millisecond cross-instance wake-up.
- Distributed cache (UDPSYNC): heartbeats + a shared-state cache for queue config, the partition-id LRU, and server-health tracking. Heartbeats authenticate via HMAC-SHA256 (`QUEEN_SYNC_SECRET`).
The cluster works fine without UDP; long-poll consumers just fall back to a slightly longer wake-up interval.
At-rest encryption
When a queue has `encryptionEnabled=true` and `QUEEN_ENCRYPTION_KEY` is set (32 bytes, hex-encoded), the push route encrypts each payload with AES-256-GCM before sending it to PostgreSQL. The stored payload becomes `{encrypted, iv, authTag}`. Pop transparently decrypts on the way out. The key never leaves the server process; PostgreSQL never sees plaintext.
Authentication
Optional JWT, supporting HS256, RS256, and EdDSA. Routes are gated by access level (read-only / read-write / admin) derived from a configurable role claim. When auth is enabled, the server stamps the validated `sub` claim onto every pushed message as `producerSub`; clients can't set this field, so even a compromised JWT cannot impersonate another producer in the message log.
Why this design
PostgreSQL is the queue
Every queue, partition, message, lease, and offset lives in PostgreSQL rows. There is no in-memory state on the server that survives a restart. The system is as durable as your database.
Locks live in Postgres
No ZooKeeper, no Raft cluster, no etcd. Two advisory-lock conventions inside the procedures handle all coordination: one transaction-scoped try-lock for partition claims, one blocking lock per (partition, group) for ack serialization.
One JobType, one procedure
Each batchable hot-path operation maps to exactly one named, versioned PostgreSQL procedure. libqueen fuses many concurrent requests into one procedure call: fewer round trips, better commit amortization, a simpler operational story.
Vegas instead of fixed pool size
Concurrency caps are ceilings, not operating points. The TCP-Vegas-inspired controller grows under stable RTT and shrinks when PostgreSQL queueing rises, so the system finds its own throughput sweet spot without manual tuning.
HTTP threads never block on libpq
Each worker has its own uWS event loop and its own libqueen libuv loop. Submit returns immediately; results hop back via a deferred callback. uWS responses are written from the loop that owns them, no shared mutex on the response path.
Disk buffer for inbound writes
If PostgreSQL is unreachable, pushes spill to a local file buffer and replay on recovery. Pop and ack require the database to be live, the disk buffer is about not losing inbound writes, not about serving reads from disk.
What we traded away
Every architectural choice has a counter-cost. Queen's choices buy strong durability, ordering at high partition cardinality, and operational simplicity, and they pay for it in six real ways. None of these are bugs; they're consequences of being a Postgres-backed message queue speaking HTTP/JSON. Knowing them upfront helps you decide whether Queen is the right tool for what you're building.
One Postgres is the limit
We measured Queen at ~104k msg/s peak on a 32-vCPU host with a well-tuned PostgreSQL. Past that, you start running into PG's own scaling limits (WAL throughput, autovacuum cycles, checkpoint pressure), not Queen's. There is no Queen-native multi-broker design. If your workload genuinely needs >200k msg/s sustained, Kafka is the right tool. Sharding Queen across multiple Postgres instances would work, but you'd be doing it at the application layer, not the queue layer.
HTTP/JSON tax on the hot path
We picked HTTP/JSON because every language has an HTTP client and the operational story is dead simple. The cost: ~5× more CPU per msg/s than RabbitMQ's AMQP, and ~80× more than Kafka's binary protocol. At 10k msg/s you don't notice. At 100k msg/s, the protocol cost is most of Queen's CPU budget. A future binary protocol would help here; today HTTP is what we ship.
PG logical replication is your only path
Kafka has MirrorMaker, RabbitMQ has federation. Queen has whatever you can build on top of `pg_logical`: async, eventually consistent, with the usual gotchas (lag during heavy writes, sequence collisions, no automatic failover). For multi-region active-active, Queen is the wrong tool unless you can accept the complexity of running it on top of a multi-region PG topology yourself.
messages_consumed needs aggressive retention
Each consumer-group delivery writes a row in `queen.messages_consumed`. We measured ~743 MB after 15 minutes at 10 consumer groups, extrapolated to ~70 GB/day on a sustained workload. Queen has retention controls (`completedRetentionSeconds`) but you have to set them: out of the box, a busy multi-tenant deployment will outgrow its disk in a week. This is operational, not architectural, but you have to know about it.
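A size check you can run before and after setting retention; these are stock PostgreSQL functions, nothing Queen-specific:

```sql
SELECT pg_size_pretty(pg_total_relation_size('queen.messages_consumed'))
    AS delivery_log_size;
```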
Default 60s for failed-consumer takeover
When a consumer dies mid-batch, the partition's lease has to expire before another
consumer can take over. Default is leaseTime=60 seconds, tunable
per-queue. Tighter leases mean faster recovery but more overhead from lease
renewals on long-running jobs. Compare to Kafka's consumer group rebalance
(typically 1–10s with modern static membership). If sub-second recovery from
consumer crashes is critical, this is a real constraint.
2× storage cost on the messages table
We measured `queen.messages` at 22 GB of indexes vs 13 GB of heap after 14 hours of sustained 28k msg/s. The indexes (`messages_pkey`, `idx_messages_partition_created`, `messages_partition_transaction_unique`) are what make per-partition pop fast, but they roughly double your storage cost compared to a single-index heap. Plan capacity accordingly.
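The heap-vs-index split is easy to watch with stock PostgreSQL size functions:

```sql
SELECT pg_size_pretty(pg_relation_size('queen.messages')) AS heap,
       pg_size_pretty(pg_indexes_size('queen.messages'))  AS indexes;
```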
These trade-offs are real, but they're scoped: they only matter if your workload sits at the edges of Queen's envelope. For most business workloads (<100k msg/s, single-region, single-Postgres), none of this is a problem. The benchmarks page has the full picture with measured numbers and side-by-side comparisons against Kafka and RabbitMQ.
Where to read more
lib/schema/procedures/
The hot-path stored procedures, one file per concept. Heavily commented.
lib/queen.hpp + lib/queen/
libqueen, the request-fusion engine. queen.hpp is the main header; queen/ has the per-component pieces.
server/src/
The HTTP shell, the routes, the background services, the file buffer, the auth middleware.
ENV_VARIABLES.md
Every configuration knob, with rationale.
cdocs/LIBQUEEN_IMPROVEMENTS.md
The 2026 rewrite of libqueen: adaptive batching, concurrency, scheduling. Has the perf data behind the choices.
Benchmarks
Where this architecture lands on real hardware.
