| name | managing-dlqs |
| description | Use when an SQS dead-letter queue has accumulated messages and you need to investigate, redrive, or drain them — e.g. a DLQ-depth CloudWatch alarm fires, the importer is failing, or you're cleaning up after a known bug fix has shipped. |
Managing SQS Dead-Letter Queues
Overview
DLQ messages mean a consumer (Lambda, Batch job, EventBridge target) failed past its retry budget. Before redriving, you need to know why they failed, whether the root cause is fixed, and what shape the messages have — otherwise you'll either bounce them right back to the DLQ or invoke the wrong tool.
scripts/ops/dlq.py is the single entrypoint. It has three subcommands (peek, drain, redrive) and four redrive strategies. Every action is dry-run by default; pass --confirm to act.
Triage flow
peek first. Look at sent timestamps, statusReason, and ErrorMessage attributes. Are messages clustered (a single incident) or arriving steadily (ongoing failure)?
- Find the root cause. Read CloudWatch logs for the failing consumer (the failed Batch job, Lambda invocation, or EventBridge target). Don't redrive blind.
- Confirm the fix is deployed — a redrive against still-buggy code just bounces messages back through the retry budget into the DLQ again.
- Redrive with the right strategy (see below). Do not drain without explicit operator confirmation — see the "Draining is destructive" section.
Picking a redrive --target
| DLQ shape | Use | What it does |
|---|
| Lambda async-invoke DLQ (body is the original event) | --target lambda --function-name <name> | lambda.invoke(RequestResponse) with the body as payload, delete on success |
| AWS Batch DLQ (body is a Batch "Job State Change FAILED" event) | --target batch --job-queue <name> --job-definition <name> | Parse detail.container.command, batch.SubmitJob with the same args, delete on success. Dedups by command tuple. |
| DLQ attached via SQS RedrivePolicy | --target sqs-move | sqs.start_message_move_task — AWS handles the move natively |
| DLQ holding raw EventBridge events | --target eventbridge --event-bus <name> --event-source <s> --detail-type <t> | events.put_events with the body, delete on success |
When in doubt, peek -v shows raw bodies — match them against the table.
Examples
python scripts/ops/dlq.py peek \
--queue-url https://sqs.us-west-2.amazonaws.com/<acct>/<env>-hawk-eval-log-importer-batch-dlq
python scripts/ops/dlq.py redrive \
--queue-url https://sqs.us-west-2.amazonaws.com/<acct>/<env>-hawk-eval-log-importer-batch-dlq \
--target batch \
--job-queue <env>-hawk-eval-log-importer \
--job-definition <env>-hawk-eval-log-importer
python scripts/ops/dlq.py drain \
--queue-url https://sqs.us-west-2.amazonaws.com/<acct>/<env>-hawk-some-dlq \
--confirm
Draining is destructive — get explicit confirmation
Never run drain --confirm based on your own classification of messages as "unrecoverable" or "ignorable" without an explicit yes from the operator. Deleted DLQ messages are gone forever — no replay, no audit trail beyond what you already captured. The consequence of being wrong is permanent data loss.
Common ways agents get this wrong:
- Inferring "all the same" from a sample.
peek returns at most 10 messages. If 7 of 43 match a known-bad pattern, the other 36 might not. Sample the full set (multiple peek calls across visibility cycles, or count by ApproximateNumberOfMessages before/after) before generalizing.
- Treating a known error class as universally drainable. "Missing UUID" or "malformed eval" might be unrecoverable for the user's primary use case, but the message body still records which file failed — useful for backfill investigations or a future fix. The operator may want those messages preserved or exported before deletion.
- Draining to silence an alarm. If the alarm is noisy, fix the alarm threshold or add a redrive policy — don't delete the data that triggered it.
Before draining:
- Run
peek -v and report what you see (timestamps, error patterns, eval keys, message count).
- State explicitly: "I want to delete N messages from because . Confirm?"
- Wait for a clear yes. "ignore those" or "they're known bad" is NOT the same as "delete them" — ignoring leaves them in the DLQ for the operator to handle on their own timeline.
- Only then run
drain --confirm.
If in doubt, leave the messages alone. A CloudWatch alarm staying in ALARM for a few hours is recoverable; deleted messages aren't.
Discovering queue URLs
The script doesn't auto-discover queues — pass --queue-url explicitly. Two quick ways to find one:
aws sqs list-queues --queue-name-prefix <env> --query 'QueueUrls[?contains(@, `dlq`)]' --output table
Gotchas
- Visibility timeout cycling.
peek uses a 30s visibility timeout (so the same message isn't double-counted across batches within one peek). A peek immediately followed by redrive --confirm may skip the just-peeked messages for up to 30s — wait, or use a fresh shell.
- Dedup. The
batch redrive dedups by full command tuple — N retries of the same eval submit one job, not N.
- Lambda invoke is synchronous.
lambda.invoke(RequestResponse) blocks. For DLQs with hundreds of messages this serializes; parallelize externally if it matters.
sqs-move needs a source queue. It only works when the DLQ is wired as a source queue's RedrivePolicy.deadLetterTargetArn. Batch/Lambda async DLQs are NOT SQS-source — sqs-move will fail.
- Don't redrive without a known fix. Messages that fail the same way three more times will land back in the DLQ and consume the
maxReceiveCount budget on whatever wrapper queue they have.
Related
scripts/ops/check-dlq.py — eval-log-importer batch DLQ specifically, with async log fetching and error classification. Useful when you need the per-message log dump alongside the redrive.
- CloudWatch alarms named
<env>-*-dlq-messages-visible fire on ApproximateNumberOfMessagesVisible > 0 and are what brings you here.