| name | async-messaging |
| description | Build reliable event-driven flows with the Transactional Outbox pattern — write state and event in one transaction, relay asynchronously, achieve at-least-once delivery + consumer idempotency. Use when an action must reliably trigger downstream work, or when events are lost on crash (dual-write problem). Not for simple background work without state+event reliability (use background-jobs) or outbound HTTP webhook specifics (use webhook-design). |
| license | MIT |
Async Messaging (Transactional Outbox)
Purpose
Make "change state AND notify the world" reliable. The naive approach (write DB, then publish to a queue) loses events on crash between the two steps — the dual-write problem. The Outbox pattern solves it.
Universal — the Outbox pattern, the dual-write problem, and "exactly-once = at-least-once + consumer idempotency" are distributed-systems fundamentals independent of broker/language.
Procedure
-
Recognize the dual-write problem
- "Write DB, then publish to Kafka/Redis" has no atomicity — crash after DB write but before publish = lost event; crash after publish but before commit = phantom event
- You cannot atomically write to two systems without a distributed transaction (which you should avoid)
-
Apply the Transactional Outbox
- Write the business state change AND an
outbox row in ONE local DB transaction (see transaction-management)
- Both commit together or neither does — atomicity guaranteed by the single DB
-
Relay the outbox asynchronously
- Polling publisher: a
background-jobs task reads unsent outbox rows, publishes to the broker, marks sent
- CDC (Change Data Capture, e.g., Debezium): tail the DB log, publish outbox inserts — lower latency, more infra
- Start with polling; move to CDC only if latency demands it
-
Design consumers for at-least-once (idempotency)
- The relay guarantees at-least-once → consumers WILL see duplicates
- "Exactly-once" delivery is a myth; achieve effective exactly-once via at-least-once + idempotent consumers (dedupe on event id / idempotency key — see
resilience-patterns)
4b. Version events from day one (additive-only)
- Include
eventType + eventVersion (or schemaVersion) on every payload — consumers will outlive any one publisher
- Additive-only changes on a live event type: new fields nullable, never remove or repurpose; bump the version and run old+new in parallel during cutover. Treat events like a public API contract (see
api-contract)
- Never publish two semantically-different shapes under the same
eventType + version
4c. Dead-letter queue (DLQ) + poison-message handling
- A retried-N-times failure is a poison message: route to a DLQ after the retry budget rather than blocking the queue forever
- DLQ needs alerting (a growing DLQ = silent failure) + a manual replay path with the original payload + reason
- Cap retry attempts and use backoff + jitter (see
resilience-patterns / background-jobs)
- Choose the broker by needs (criteria, not products)
- Simple in-app queue — low setup, good for in-process jobs
- Routing broker — work queues + routing topologies, moderate scale
- High-throughput log — event replay, multiple consumer groups, high volume
- Ordering needs a partition key: same-entity events go to the same partition (
order_id, user_id). Within a partition: ordered; across partitions: no guarantees. Choose the key by the invariant you actually need
- Don't reach for a heavyweight log broker by default — it's heavy infra (specific products in Implementation)
5b. Retain only what you need
- The outbox table grows on every write — schedule a cleanup job to delete (or archive) sent rows past N days, or it eats the DB
- For broker retention: replay needs ≠ storage forever; tune by recovery window (e.g., 7 days for backfill, not 1 year "just in case")
- Validate (validation loop)
- Crash the service between state write and relay → verify the event is still delivered (outbox row persisted)
- Deliver an event twice → verify the consumer processes it once (idempotent)
- If an event is lost → the write wasn't in the same transaction as the outbox row; fix and re-test
Anti-patterns
| ❌ Anti-pattern | ✅ Correct |
|---|
| Write DB then publish to broker (dual write) | Write state + outbox row in one transaction, relay async |
| Assuming the broker gives exactly-once | At-least-once + idempotent consumers |
| Publishing inside the business transaction (network I/O in tx) | Outbox row in tx; relay outside |
| Heavyweight log broker for a simple in-app job | Simple queue first; log broker only at real scale |
| Consumer with no dedup | Dedupe on event id |
Two payload shapes under the same eventType (silent break) | eventVersion + additive-only changes; run old+new in parallel on cutover |
| Poison messages blocking the queue indefinitely | DLQ after retry budget; alert on DLQ growth; manual replay |
| Outbox table growing unbounded | Scheduled cleanup of sent rows past retention window |
| Ordering assumed without a partition key | Choose a partition key matching the invariant (same-entity events to the same partition) |
Severity tiers
| Tier | Examples | Action SLA |
|---|
| Critical | Dual-write losing payment/order events on crash; consumer with no idempotency double-charging; breaking schema change shipped on a live eventType (consumers crash on next message) | Fix immediately |
| Major | Publishing inside a transaction (lock contention); no DLQ for poison messages; outbox table grows without cleanup (DB-eating); no partition key on a flow that requires ordering | Fix this sprint |
| Minor | Polling interval untuned; broker over-provisioned (Kafka for tiny load); broker retention not aligned with actual replay needs | Schedule within 2 sprints |
Completion Criteria
Output
- Outbox table + relay: schema + polling publisher (or CDC config)
- Consumer handlers: idempotent, with dedup
- Architecture doc: event catalog + delivery guarantees
- Commit format:
feat(events): add outbox for <event> / feat(events): idempotent consumer for <event>
Implementation
TypeScript + Prisma + BullMQ/Redis (default)
- Outbox:
outbox(id, type, payload, created_at, sent_at) written inside prisma.$transaction with the state change
- Relay: a BullMQ repeatable job polls
WHERE sent_at IS NULL, publishes, sets sent_at
- Consumers: BullMQ workers with dedup table on event id
- Scale path: Redis Streams → Kafka (KafkaJS) if replay/throughput demands
Other stacks
- Python: SQLAlchemy outbox + Celery relay; or Debezium CDC → Kafka
- Go: outbox + worker; Watermill library for messaging abstractions
- Universal: Outbox is DB-agnostic; Debezium provides CDC for Postgres/MySQL/Mongo; the at-least-once + idempotency principle is broker-independent
Related skills
transaction-management — the outbox row is written in the same transaction as the state change
background-jobs — the outbox relay / consumers run as jobs
webhook-design — outbound webhooks can be driven by the outbox
Reference
- Key insight encoded: Solve the dual-write problem by writing business state and the event to an outbox table in one local transaction, then relay asynchronously (polling or CDC) — this is what makes at-least-once delivery achievable; "exactly-once" is really at-least-once + consumer idempotency. Treat events like a public API: version from day one, additive-only changes, partition key on flows that need ordering. Operational hygiene matters: a DLQ for poison messages + an outbox-cleanup job (else the outbox eats the DB).