sysdesign-dead-letter-queue

Use when handling message failures in a queue or stream — installs a DLQ with retry policy, backoff, alerting on growth, and a reprocess-after-fix flow.


Updated: 23 April 2026, 02:35

Source: danilods/matilha-sysdesign-pack (creator: danilods)


name: sysdesign-dead-letter-queue
description: Use when handling message failures in a queue or stream. Installs a DLQ with retry policy, backoff, alerting on growth, and a reprocess-after-fix flow.
category: sysdesign
version: 1.0.0
requires: []
optional_companions: []

When this fires

Use when a service consumes messages (SQS, Kafka, RabbitMQ, Pub/Sub) and some messages will inevitably fail — malformed payloads, missing references, transient downstream errors, code bugs that surface only on specific inputs. Fires when someone says "we'll just retry" without naming the retry cap, backoff shape, where poison-pill messages go, or who is paged when the DLQ grows. The skill installs a concrete DLQ with retry policy, growth alerting, and a reprocessing flow so that failed messages isolate rather than stall the consumer.

Preconditions

A consumer exists (or is being designed) that processes messages from a queue or stream. Direct request/response work doesn't need this skill.
The messaging substrate supports DLQ primitives or the team is willing to implement one (most managed services do; Kafka requires a separate topic).
Someone owns the on-call rotation that will receive DLQ-growth pages. An unowned DLQ is a landfill.
The business has an opinion on "can we afford to drop a message after N retries?" — different answers drive very different retry policies.

Execution Workflow

Classify failure modes. Transient (downstream timeout, deploy blip) — retry will succeed. Permanent (malformed payload, missing foreign-key reference, deleted target) — retry will never succeed. Unknown (new exception in code) — usually permanent until a fix ships. The retry policy and DLQ behaviour differ per class.
Set the retry policy for transient failures. Exponential backoff with jitter (1s, 2s, 4s, 8s, 16s, 32s — with random jitter so consumers don't thunder together). Cap attempts at 5 to 10 for most workloads. More attempts mostly burn money; fewer miss legitimate recoveries.
After the retry cap, move the message to the DLQ. Do not keep retrying forever — a single poison message will pin one consumer thread indefinitely and starve the healthy work behind it. The DLQ is explicitly an isolation mechanism, not a trashcan.
Preserve context when moving to DLQ. Attach the original message, the exception trace, the attempt count, a timestamp, and the consumer version. A DLQ message without this context is unusable in an incident — no one can tell what went wrong or when.
Alert on DLQ growth, not absolute size. A DLQ with 3 messages that's been 3 for a month is healthy noise; a DLQ adding 50 messages per hour is an incident. Page on rate-of-change over a window. Any non-zero growth deserves human investigation within hours, not days.
Define the reprocess flow. After a fix ships, messages in the DLQ must be moveable back to the main queue. Ship this as a tested runbook (script, Lambda, CLI command), not an ad-hoc manual process. Expect to use it — not often, but reliably.
Handle idempotency at the consumer. Reprocessed DLQ messages may duplicate side effects if the consumer isn't idempotent. Pair this skill with sysdesign-idempotency-patterns and use dedup keys so safe replay is the default.
Bound DLQ retention. Unbounded DLQs accumulate forever and hide real trends. A retention window (14-30 days for most workloads) forces periodic review: if a message has been in the DLQ for a month, the organisation has decided to drop it; that decision should be explicit.
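
The first three workflow steps (classify, retry with backoff, isolate after the cap) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a prescribed implementation: the exception classes DownstreamTimeout and MalformedPayload and the process, send_to_dlq, and sleep callables are hypothetical stand-ins for a real consumer's pieces, and full jitter is one common jitter variant among several.

```python
import random
from enum import Enum


class FailureClass(Enum):
    TRANSIENT = "transient"   # retry is expected to succeed eventually
    PERMANENT = "permanent"   # retry will never succeed
    UNKNOWN = "unknown"       # treated as permanent until a fix ships


# Hypothetical exception types standing in for a real consumer's errors.
class DownstreamTimeout(Exception):
    pass


class MalformedPayload(Exception):
    pass


def classify(exc: Exception) -> FailureClass:
    if isinstance(exc, DownstreamTimeout):
        return FailureClass.TRANSIENT
    if isinstance(exc, MalformedPayload):
        return FailureClass.PERMANENT
    return FailureClass.UNKNOWN


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 32.0) -> float:
    """Full-jitter delay after failed attempt number `attempt` (1-based)."""
    ceiling = min(cap, base * 2 ** (attempt - 1))
    return random.uniform(0, ceiling)


def consume(message, process, send_to_dlq, sleep, max_attempts: int = 5) -> str:
    """Process one message; return "ok" or "dlq"."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            process(message)
            return "ok"
        except Exception as exc:
            # Only transient failures are worth another attempt, and only
            # while the retry cap has not been reached.
            if classify(exc) is not FailureClass.TRANSIENT or attempt == max_attempts:
                send_to_dlq(message, exc, attempt)
                return "dlq"
            sleep(backoff_delay(attempt))
```

Passing sleep in as a parameter keeps the loop testable. A real consumer might pass time.sleep, or hand the delay to the broker's delayed-redelivery mechanism where one exists.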

Rules: Do

Always pair a consumer with a DLQ. Consumers without a DLQ either retry forever (wasting resources and pinning partitions) or drop silently (losing data).
Use exponential backoff with jitter. Tight or fixed-interval retry loops hit the downstream in lockstep and extend outages.
Preserve full context (message body, exception, attempt count, timestamp, consumer version) when writing to the DLQ. Post-incident debugging depends on it.
Page on DLQ growth rate, not absolute depth. Growth is the signal an incident is active.
Build and test the reprocess flow before it's needed in production. A runbook you've never executed will fail at 3am.
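
As an illustration of the context-preservation rule, a DLQ record could be built like this. It is a sketch only: the field names, the JSON encoding, and the CONSUMER_VERSION constant are assumptions made for the example, not a schema the source prescribes.

```python
import json
import traceback
from datetime import datetime, timezone

CONSUMER_VERSION = "1.4.2"  # hypothetical; normally injected at build time


def build_dlq_record(original_body: bytes, exc: Exception, attempt_count: int,
                     consumer_version: str = CONSUMER_VERSION) -> str:
    """Wrap a failed message with the context needed to debug it later."""
    record = {
        # Keep the payload even if it is not valid UTF-8.
        "original_body": original_body.decode("utf-8", errors="replace"),
        "exception_type": type(exc).__name__,
        "exception_trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "attempt_count": attempt_count,
        "failed_at": datetime.now(timezone.utc).isoformat(),
        "consumer_version": consumer_version,
    }
    return json.dumps(record)
```

Call it from inside the except block of the consumer so that exc.__traceback__ still points at the failing frame.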

Rules: Don't

Don't let a poison message pin a consumer. If the consumer keeps retrying without a DLQ, healthy messages behind it starve.
Don't retry permanent errors. Malformed payloads and missing references never fix themselves through repetition; route them to the DLQ on the first detectable signal.
Don't ignore a growing DLQ. A non-zero DLQ growth rate is, by construction, messages the consumer was supposed to process and didn't — the business impact is already happening.
Don't reprocess from the DLQ into a non-idempotent consumer. Side effects will duplicate. Fix idempotency first; reprocess second.
Don't let the DLQ become a dumping ground. Set retention; if a message still sits there after the window, that is a decision to drop and must be made consciously.
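
One way to make the drop decision explicit is to triage DLQ records by age before retention removes them. The sketch below assumes each record is a dict with a timezone-aware failed_at datetime. The 14-day window is the lower end of the range given above, and the 3-day review lead time is a hypothetical value chosen for the example.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=14)       # lower end of the 14-30 day range
REVIEW_WARNING = timedelta(days=3)   # hypothetical lead time for review


def triage_by_age(records, now=None):
    """Split DLQ records into (keep, expiring_soon, expired) by age."""
    now = now or datetime.now(timezone.utc)
    keep, expiring_soon, expired = [], [], []
    for record in records:
        age = now - record["failed_at"]
        if age >= RETENTION:
            expired.append(record)        # past the window: dropped
        elif age >= RETENTION - REVIEW_WARNING:
            expiring_soon.append(record)  # needs a conscious decision now
        else:
            keep.append(record)
    return keep, expiring_soon, expired
```

On managed queues the deletion itself is usually a retention setting on the queue. A report like expiring_soon only adds the review step in front of it.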

Expected Behavior

After applying the skill, every consumer has an attached DLQ with a defined retry cap and backoff, full context preserved on every failed message, and a growth-rate alert that pages on incident-level behaviour. A tested reprocess-after-fix runbook exists. Incident response becomes: check DLQ growth (did we start rejecting messages?), inspect a sample DLQ message (what went wrong?), ship the fix, reprocess. Silent data loss stops; poison-pill messages stop stalling the consumer line.

Quality Gates

DLQ wired per consumer group (not shared across unrelated consumers; one group's DLQ is noise to another).
Retry policy documented: max attempts, initial backoff, multiplier, jitter, max backoff.
DLQ message schema includes original body, exception trace, attempt count, timestamp, consumer version.
Growth-rate alert exists with a page-worthy threshold; raw depth alert is dashboard-only.
Reprocess runbook exists and has been tested at least once in staging.
Retention policy set on the DLQ itself.
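
The documented retry policy can live in code as well as in the design doc. The dataclass below is a hypothetical shape for it, with defaults taken from the numbers used in this skill. It also answers a question worth writing down: how long can one message hold a consumer before it reaches the DLQ? With the defaults (5 attempts, 1s initial backoff, multiplier 2) the delay ceilings are 1s, 2s, 4s, and 8s, so at most 15 seconds of sleeping, excluding processing time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int = 5
    initial_backoff_s: float = 1.0
    multiplier: float = 2.0
    max_backoff_s: float = 32.0
    jitter: str = "full"  # assumed meaning: delays drawn from [0, ceiling]

    def ceilings(self) -> list[float]:
        """Upper bound of the delay after each failed attempt but the last."""
        return [
            min(self.max_backoff_s, self.initial_backoff_s * self.multiplier ** i)
            for i in range(self.max_attempts - 1)
        ]

    def worst_case_wait_s(self) -> float:
        """Longest total sleep before a message is moved to the DLQ."""
        return sum(self.ceilings())
```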

Companion Integration

Pairs with sysdesign-event-streaming-kafka (DLQ as a separate Kafka topic per consumer group), sysdesign-idempotency-patterns (safe reprocessing requires idempotent consumers), and sysdesign-monitoring-4-golden-signals (DLQ growth is a saturation signal variant on the consumer's dashboard). The matilha-harness-pack:harness-evaluator-optimizer-loop companion shares the retry-with-cap-then-isolate shape at the agent loop level.

Output Artifacts

Design-doc section "Message handling" naming retry policy, DLQ, alerting, and reprocess runbook.
Version-controlled Terraform / config entries for the DLQ resource(s).
A reprocess runbook (script or Lambda) checked into the repo with usage documentation.
Dashboard panel for DLQ depth and growth rate per consumer group.
Alert rule file entries for DLQ growth.

Example Constraint Language

Use "must" for: DLQ per consumer, exponential backoff with jitter, preserving context on failed messages, growth-rate alerting, tested reprocess runbook.
Use "should" for: 5-10 retry cap as a default, 14-30 day DLQ retention, pairing with idempotent consumer pattern, routing permanent errors to DLQ on first detection.
Use "may" for: unbounded retry of truly transient, low-cost failures when each attempt is tightly bounded (tight timeout, cheap downstream), separate DLQ consumers that auto-replay after a cooldown, additional tiered retry queues between the main queue and the DLQ.

Troubleshooting

"A malformed message stalled the consumer for two hours": retry cap is too high or absent, and there's no DLQ. Add DLQ, cap retries at 5-10, and route deserialisation failures to DLQ on attempt 1.
"DLQ has 50,000 messages and no one noticed": growth-rate alert missing; depth-based alerts rarely fire because depth climbs slowly. Add a rate-based page rule (e.g., > 10 new DLQ messages in 15 min).
"Reprocessed DLQ caused duplicate charges": consumer isn't idempotent. Ship dedup via sysdesign-idempotency-patterns before the next reprocess run.
"One consumer's failures are filling a shared DLQ": the DLQ is shared across consumer groups. Split it into one DLQ per consumer group so failures isolate by owner.
"DLQ grew indefinitely, storage cost is hurting": retention is unbounded. Set 14-30 day retention and make the monthly review a standing meeting until the flow is tuned.
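
The rate-based page rule mentioned above (more than 10 new DLQ messages in 15 minutes) can be sketched as a sliding-window counter. In practice this logic would normally live in the monitoring system's alert rules. The class below is a hypothetical in-process illustration of the rule's semantics, and it assumes timestamps arrive in non-decreasing order.

```python
from collections import deque


class DlqGrowthAlert:
    """Fires when more than `threshold` messages arrive within `window_s` seconds."""

    def __init__(self, threshold: int = 10, window_s: float = 15 * 60):
        self.threshold = threshold
        self.window_s = window_s
        self._arrivals = deque()  # timestamps of recent DLQ writes

    def record(self, timestamp: float) -> bool:
        """Register one DLQ write; return True if the page should fire."""
        self._arrivals.append(timestamp)
        # Drop arrivals that have fallen out of the sliding window.
        while self._arrivals and self._arrivals[0] <= timestamp - self.window_s:
            self._arrivals.popleft()
        return len(self._arrivals) > self.threshold
```

Three messages spread over a month never trip this rule, while an eleventh message inside the same 15 minutes does. That is the growth-versus-depth distinction in miniature.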

Concrete Example

A notifications service consumes Kafka events and sends email/SMS/push. A template regression ships that throws on a new locale code. Without a DLQ, one poison message would pin a consumer and stall thousands of notifications. Here, the team has wired retry with exponential backoff (starting at 1s, 2s, 4s, with jitter, capped at 5 attempts) and routes messages that exhaust the cap to notifications.dlq with the full exception trace. The growth-rate alert pages after 12 messages in 10 minutes. On-call reads a DLQ sample, identifies the bad template, and rolls back in 6 minutes. The reprocess runbook moves the DLQ messages back onto the main topic; the idempotent consumer (dedup by notification_id) prevents any duplicates. Total user impact: 12 delayed notifications.
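
The reprocess step of this example can be sketched as follows. Everything here is a simplified stand-in: the main queue is a Python list, the dedup store is an in-memory set, and the record layout (an original_body field holding JSON with a notification_id) is assumed for illustration. The point is that even an accidental second run of the runbook produces no duplicate sends.

```python
import json


class IdempotentNotifier:
    """Consumer that skips any notification_id it has already sent."""

    def __init__(self, send):
        self._send = send
        self._seen = set()  # in production this would be a durable dedup store

    def handle(self, body: str) -> bool:
        event = json.loads(body)
        notification_id = event["notification_id"]
        if notification_id in self._seen:
            return False  # duplicate: the side effect already happened
        self._send(event)
        self._seen.add(notification_id)
        return True


def reprocess(dlq_records, main_queue: list) -> int:
    """Move the original bodies from DLQ records back onto the main queue."""
    moved = 0
    for record in dlq_records:
        main_queue.append(record["original_body"])
        moved += 1
    return moved
```

Running reprocess twice over the same 12 DLQ records enqueues 24 messages, but the notifier sends only 12.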

Sources

[[concepts/design-cases]] — DLQ as recurring pattern (Notifications, Messaging)
[[concepts/nfr-system-design]] — fault-tolerance patterns including DLQ and Circuit Breaker
Zhiyong Tan, Acing the System Design Interview, Chapters 9 (Notifications) and 14 (Messaging). Growth-rate alerting framing is Danilo's synthesis — Tan mentions DLQ alerting without specifying the rate-vs-depth distinction.