تشغيل أي مهارة في Manus بنقرة واحدة

managing-dlqs

Use when an SQS dead-letter queue has accumulated messages and you need to investigate, redrive, or drain them — e.g. a DLQ-depth CloudWatch alarm fires, the importer is failing, or you're cleaning up after a known bug fix has shipped.

تشغيل في Manus

نظرة عامة

أمر التثبيت

npx skills add https://github.com/METR/hawk --skill managing-dlqs

انسخ والصق هذا الأمر في Claude Code لتثبيت المهارة

المصدر

METR/hawk

النجوم٢٢

التفرعات١٨

آخر تحديث٣ يونيو ٢٠٢٦ في ٢٠:٤٥

SKILL.md

readonly

name	managing-dlqs
description	Use when an SQS dead-letter queue has accumulated messages and you need to investigate, redrive, or drain them — e.g. a DLQ-depth CloudWatch alarm fires, the importer is failing, or you're cleaning up after a known bug fix has shipped.

Managing SQS Dead-Letter Queues

Overview

DLQ messages mean a consumer (Lambda, Batch job, EventBridge target) failed past its retry budget. Before redriving, you need to know why they failed, whether the root cause is fixed, and what shape the messages have — otherwise you'll either bounce them right back to the DLQ or invoke the wrong tool.

scripts/ops/dlq.py is the single entrypoint. It has three subcommands (peek, drain, redrive) and four redrive strategies. Every action is dry-run by default; pass --confirm to act.

Triage flow

peek first. Look at sent timestamps, statusReason, and ErrorMessage attributes. Are messages clustered (a single incident) or arriving steadily (ongoing failure)?
Find the root cause. Read CloudWatch logs for the failing consumer (the failed Batch job, Lambda invocation, or EventBridge target). Don't redrive blind.
Confirm the fix is deployed — a redrive against still-buggy code just bounces messages back through the retry budget into the DLQ again.
Redrive with the right strategy (see below). Do not drain without explicit operator confirmation — see the "Draining is destructive" section.

Picking a redrive `--target`

DLQ shape	Use	What it does
Lambda async-invoke DLQ (body is the original event)	`--target lambda --function-name <name>`	`lambda.invoke(RequestResponse)` with the body as payload, delete on success
AWS Batch DLQ (body is a Batch "Job State Change FAILED" event)	`--target batch --job-queue <name> --job-definition <name>`	Parse `detail.container.command`, `batch.SubmitJob` with the same args, delete on success. Dedups by command tuple.
DLQ attached via SQS RedrivePolicy	`--target sqs-move`	`sqs.start_message_move_task` — AWS handles the move natively
DLQ holding raw EventBridge events	`--target eventbridge --event-bus <name> --event-source <s> --detail-type <t>`	`events.put_events` with the body, delete on success

When in doubt, peek -v shows raw bodies — match them against the table.

Examples

# 1. Look first.
python scripts/ops/dlq.py peek \
    --queue-url https://sqs.us-west-2.amazonaws.com/<acct>/<env>-hawk-eval-log-importer-batch-dlq

# 2. Redrive (dry-run, then for real).
python scripts/ops/dlq.py redrive \
    --queue-url https://sqs.us-west-2.amazonaws.com/<acct>/<env>-hawk-eval-log-importer-batch-dlq \
    --target batch \
    --job-queue <env>-hawk-eval-log-importer \
    --job-definition <env>-hawk-eval-log-importer
# add --confirm to actually submit

# 3. Drain — ONLY after the operator has explicitly OK'd it (see Draining section).
python scripts/ops/dlq.py drain \
    --queue-url https://sqs.us-west-2.amazonaws.com/<acct>/<env>-hawk-some-dlq \
    --confirm

Draining is destructive — get explicit confirmation

Never run drain --confirm based on your own classification of messages as "unrecoverable" or "ignorable" without an explicit yes from the operator. Deleted DLQ messages are gone forever — no replay, no audit trail beyond what you already captured. The consequence of being wrong is permanent data loss.

Common ways agents get this wrong:

Inferring "all the same" from a sample. peek returns at most 10 messages. If 7 of 43 match a known-bad pattern, the other 36 might not. Sample the full set (multiple peek calls across visibility cycles, or count by ApproximateNumberOfMessages before/after) before generalizing.
Treating a known error class as universally drainable. "Missing UUID" or "malformed eval" might be unrecoverable for the user's primary use case, but the message body still records which file failed — useful for backfill investigations or a future fix. The operator may want those messages preserved or exported before deletion.
Draining to silence an alarm. If the alarm is noisy, fix the alarm threshold or add a redrive policy — don't delete the data that triggered it.

Before draining:

Run peek -v and report what you see (timestamps, error patterns, eval keys, message count).
State explicitly: "I want to delete N messages from because . Confirm?"
Wait for a clear yes. "ignore those" or "they're known bad" is NOT the same as "delete them" — ignoring leaves them in the DLQ for the operator to handle on their own timeline.
Only then run drain --confirm.

If in doubt, leave the messages alone. A CloudWatch alarm staying in ALARM for a few hours is recoverable; deleted messages aren't.

Discovering queue URLs

The script doesn't auto-discover queues — pass --queue-url explicitly. Two quick ways to find one:

# All DLQs in the current account/region:
aws sqs list-queues --queue-name-prefix <env> --query 'QueueUrls[?contains(@, `dlq`)]' --output table

# Lambda/Batch resource → its DLQ via the relevant describe-* call.

Gotchas

Visibility timeout cycling. peek uses a 30s visibility timeout (so the same message isn't double-counted across batches within one peek). A peek immediately followed by redrive --confirm may skip the just-peeked messages for up to 30s — wait, or use a fresh shell.
Dedup. The batch redrive dedups by full command tuple — N retries of the same eval submit one job, not N.
Lambda invoke is synchronous. lambda.invoke(RequestResponse) blocks. For DLQs with hundreds of messages this serializes; parallelize externally if it matters.
sqs-move needs a source queue. It only works when the DLQ is wired as a source queue's RedrivePolicy.deadLetterTargetArn. Batch/Lambda async DLQs are NOT SQS-source — sqs-move will fail.
Don't redrive without a known fix. Messages that fail the same way three more times will land back in the DLQ and consume the maxReceiveCount budget on whatever wrapper queue they have.

scripts/ops/check-dlq.py — eval-log-importer batch DLQ specifically, with async log fetching and error classification. Useful when you need the per-message log dump alongside the redrive.
CloudWatch alarms named <env>-*-dlq-messages-visible fire on ApproximateNumberOfMessagesVisible > 0 and are what brings you here.

المزيد من هذا المستودع

نفس المستودع

smoke-tests

METR/hawk

Run smoke tests on a deployed environment to ensure basic functionality.

2026-04-1722

full-stack-dev

METR/hawk

How to develop the frontend and backend together. When you want to make changes to the UI, use this.

2026-04-1522

database-migrations

METR/hawk

How to create and use our alembic database migration tool. Use when making changes to models.py.

2026-01-1022

المصدر

METR

METR/hawk

فتح مستودع GitHub عرض مستودعات المنشئ

أمر التثبيت

تنزيل

تشغيل في Manus

مفيد لـSOC

مديرو الشبكات وأنظمة الحاسوبمهن الحاسوب والرياضيات15-1244L4

name	managing-dlqs
description	Use when an SQS dead-letter queue has accumulated messages and you need to investigate, redrive, or drain them — e.g. a DLQ-depth CloudWatch alarm fires, the importer is failing, or you're cleaning up after a known bug fix has shipped.

Managing SQS Dead-Letter Queues

Overview

scripts/ops/dlq.py is the single entrypoint. It has three subcommands (peek, drain, redrive) and four redrive strategies. Every action is dry-run by default; pass --confirm to act.

Triage flow

peek first. Look at sent timestamps, statusReason, and ErrorMessage attributes. Are messages clustered (a single incident) or arriving steadily (ongoing failure)?
Find the root cause. Read CloudWatch logs for the failing consumer (the failed Batch job, Lambda invocation, or EventBridge target). Don't redrive blind.
Confirm the fix is deployed — a redrive against still-buggy code just bounces messages back through the retry budget into the DLQ again.
Redrive with the right strategy (see below). Do not drain without explicit operator confirmation — see the "Draining is destructive" section.

Picking a redrive `--target`

DLQ shape	Use	What it does
Lambda async-invoke DLQ (body is the original event)	`--target lambda --function-name <name>`	`lambda.invoke(RequestResponse)` with the body as payload, delete on success
AWS Batch DLQ (body is a Batch "Job State Change FAILED" event)	`--target batch --job-queue <name> --job-definition <name>`	Parse `detail.container.command`, `batch.SubmitJob` with the same args, delete on success. Dedups by command tuple.
DLQ attached via SQS RedrivePolicy	`--target sqs-move`	`sqs.start_message_move_task` — AWS handles the move natively
DLQ holding raw EventBridge events	`--target eventbridge --event-bus <name> --event-source <s> --detail-type <t>`	`events.put_events` with the body, delete on success

When in doubt, peek -v shows raw bodies — match them against the table.

Examples

# 1. Look first.
python scripts/ops/dlq.py peek \
    --queue-url https://sqs.us-west-2.amazonaws.com/<acct>/<env>-hawk-eval-log-importer-batch-dlq

# 2. Redrive (dry-run, then for real).
python scripts/ops/dlq.py redrive \
    --queue-url https://sqs.us-west-2.amazonaws.com/<acct>/<env>-hawk-eval-log-importer-batch-dlq \
    --target batch \
    --job-queue <env>-hawk-eval-log-importer \
    --job-definition <env>-hawk-eval-log-importer
# add --confirm to actually submit

# 3. Drain — ONLY after the operator has explicitly OK'd it (see Draining section).
python scripts/ops/dlq.py drain \
    --queue-url https://sqs.us-west-2.amazonaws.com/<acct>/<env>-hawk-some-dlq \
    --confirm

Draining is destructive — get explicit confirmation

Common ways agents get this wrong:

Inferring "all the same" from a sample. peek returns at most 10 messages. If 7 of 43 match a known-bad pattern, the other 36 might not. Sample the full set (multiple peek calls across visibility cycles, or count by ApproximateNumberOfMessages before/after) before generalizing.
Treating a known error class as universally drainable. "Missing UUID" or "malformed eval" might be unrecoverable for the user's primary use case, but the message body still records which file failed — useful for backfill investigations or a future fix. The operator may want those messages preserved or exported before deletion.
Draining to silence an alarm. If the alarm is noisy, fix the alarm threshold or add a redrive policy — don't delete the data that triggered it.

Before draining:

Run peek -v and report what you see (timestamps, error patterns, eval keys, message count).
State explicitly: "I want to delete N messages from because . Confirm?"
Wait for a clear yes. "ignore those" or "they're known bad" is NOT the same as "delete them" — ignoring leaves them in the DLQ for the operator to handle on their own timeline.
Only then run drain --confirm.

If in doubt, leave the messages alone. A CloudWatch alarm staying in ALARM for a few hours is recoverable; deleted messages aren't.

Discovering queue URLs

The script doesn't auto-discover queues — pass --queue-url explicitly. Two quick ways to find one:

# All DLQs in the current account/region:
aws sqs list-queues --queue-name-prefix <env> --query 'QueueUrls[?contains(@, `dlq`)]' --output table

# Lambda/Batch resource → its DLQ via the relevant describe-* call.

Gotchas

Visibility timeout cycling. peek uses a 30s visibility timeout (so the same message isn't double-counted across batches within one peek). A peek immediately followed by redrive --confirm may skip the just-peeked messages for up to 30s — wait, or use a fresh shell.
Dedup. The batch redrive dedups by full command tuple — N retries of the same eval submit one job, not N.
Lambda invoke is synchronous. lambda.invoke(RequestResponse) blocks. For DLQs with hundreds of messages this serializes; parallelize externally if it matters.
sqs-move needs a source queue. It only works when the DLQ is wired as a source queue's RedrivePolicy.deadLetterTargetArn. Batch/Lambda async DLQs are NOT SQS-source — sqs-move will fail.
Don't redrive without a known fix. Messages that fail the same way three more times will land back in the DLQ and consume the maxReceiveCount budget on whatever wrapper queue they have.

scripts/ops/check-dlq.py — eval-log-importer batch DLQ specifically, with async log fetching and error classification. Useful when you need the per-message log dump alongside the redrive.
CloudWatch alarms named <env>-*-dlq-messages-visible fire on ApproximateNumberOfMessagesVisible > 0 and are what brings you here.

managing-dlqs

Managing SQS Dead-Letter Queues

Overview

Triage flow

Picking a redrive `--target`

Examples

Draining is destructive — get explicit confirmation

Discovering queue URLs

Gotchas

Related

Managing SQS Dead-Letter Queues

Overview

Triage flow

Picking a redrive `--target`

Examples

Draining is destructive — get explicit confirmation

Discovering queue URLs

Gotchas

Related

managing-dlqs

Managing SQS Dead-Letter Queues

Overview

Triage flow

Picking a redrive --target

Examples

Draining is destructive — get explicit confirmation

Discovering queue URLs

Gotchas

Related

المزيد من هذا المستودع

المزيد من هذا المستودع

Managing SQS Dead-Letter Queues

Overview

Triage flow

Picking a redrive --target

Examples

Draining is destructive — get explicit confirmation

Discovering queue URLs

Gotchas

Related

Picking a redrive `--target`

Picking a redrive `--target`