aws-sqs-dlq-investigation
// Investigate SQS Dead Letter Queue alarms — systematic root cause analysis for why messages ended up in a DLQ. Covers Lambda-triggered SQS, event source mappings, and the common traps.
| name | aws-sqs-dlq-investigation |
| description | Investigate SQS Dead Letter Queue alarms — systematic root cause analysis for why messages ended up in a DLQ. Covers Lambda-triggered SQS, event source mappings, and the common traps. |
| tags | ["aws","sqs","dlq","lambda","cloudwatch","debugging","incident-response"] |
| trigger | User reports a DLQ alarm, asks why messages are in a DLQ, or asks about SQS message failures. |
Do NOT conclude "timeout exceeded" or "Lambda failed" without checking actual numbers. A common trap is seeing high message age in the queue and assuming a timeout was exceeded: message age (time spent waiting in the queue) is NOT the same as Lambda execution duration, and by itself it says nothing about whether the visibility timeout was exceeded.
Always compare the actual numbers: the Lambda Timeout configuration, the observed Lambda Duration maximum, the source queue's VisibilityTimeout, and the queue's ApproximateAgeOfOldestMessage.
# Get source queue attributes
sqs.get_queue_attributes(QueueUrl=src_url, AttributeNames=['All'])
Key values to extract: the RedrivePolicy (its maxReceiveCount and deadLetterTargetArn) and the VisibilityTimeout.
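A minimal sketch of pulling those out, assuming sqs is a boto3 SQS client (see the credentials note at the end) and src_url is the source queue URL as in the snippet above:

```python
import json

# Full attribute map for the source queue
attrs = sqs.get_queue_attributes(QueueUrl=src_url, AttributeNames=['All'])['Attributes']

# RedrivePolicy is a JSON string; assumes a redrive policy is configured (there is a DLQ)
redrive = json.loads(attrs['RedrivePolicy'])
max_receive_count = int(redrive['maxReceiveCount'])
dlq_arn = redrive['deadLetterTargetArn']
visibility_timeout = int(attrs['VisibilityTimeout'])   # seconds

print(max_receive_count, dlq_arn, visibility_timeout)
```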
lam.get_function_configuration(FunctionName=fn_name)
lam.get_function_concurrency(FunctionName=fn_name) # reserved concurrency
lam.list_event_source_mappings(FunctionName=fn_name) # SQS trigger config
Key values: Timeout, MemorySize, and reserved concurrency on the function; BatchSize, ScalingConfig.MaximumConcurrency, and FunctionResponseTypes on the event source mapping.
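A sketch of extracting the Lambda-side values and comparing them against the queue's visibility timeout; it assumes lam is a boto3 Lambda client, fn_name is the function name, and visibility_timeout comes from the queue step above:

```python
cfg = lam.get_function_configuration(FunctionName=fn_name)
fn_timeout = cfg['Timeout']   # seconds

# Assumes exactly one SQS trigger on this function
esm = lam.list_event_source_mappings(FunctionName=fn_name)['EventSourceMappings'][0]
batch_size = esm.get('BatchSize')
max_conc = esm.get('ScalingConfig', {}).get('MaximumConcurrency')       # None if not set
partial_batch = 'ReportBatchItemFailures' in esm.get('FunctionResponseTypes', [])

# AWS guidance: queue VisibilityTimeout should be roughly 6x the function timeout
if visibility_timeout < 6 * fn_timeout:
    print(f'VisibilityTimeout {visibility_timeout}s is below 6x Lambda timeout {fn_timeout}s')
print(batch_size, max_conc, partial_batch)
```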
If ReportBatchItemFailures is set, partial batch failures are possible.
Check these Lambda metrics: Errors, Throttles, Duration (Maximum), and ConcurrentExecutions.
Check these SQS metrics: ApproximateAgeOfOldestMessage, ApproximateNumberOfMessagesVisible, and NumberOfMessagesReceived vs NumberOfMessagesDeleted on the source queue, plus ApproximateNumberOfMessagesVisible on the DLQ.
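A sketch of pulling these with get_metric_statistics; the metric and dimension names are the standard AWS/Lambda and AWS/SQS ones, and session, fn_name, and src_queue_name (the source queue's name) are assumed from the other steps:

```python
from datetime import datetime, timedelta, timezone

cw = session.client('cloudwatch')
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)   # widen to bracket the alarm window

def stat(namespace, metric, dims, stat_name):
    # Returns 5-minute datapoints for one metric, oldest first
    pts = cw.get_metric_statistics(
        Namespace=namespace, MetricName=metric, Dimensions=dims,
        StartTime=start, EndTime=end, Period=300, Statistics=[stat_name],
    )['Datapoints']
    return sorted(pts, key=lambda p: p['Timestamp'])

fn_dims = [{'Name': 'FunctionName', 'Value': fn_name}]
q_dims = [{'Name': 'QueueName', 'Value': src_queue_name}]

errors    = stat('AWS/Lambda', 'Errors', fn_dims, 'Sum')
throttles = stat('AWS/Lambda', 'Throttles', fn_dims, 'Sum')
duration  = stat('AWS/Lambda', 'Duration', fn_dims, 'Maximum')   # milliseconds
age       = stat('AWS/SQS', 'ApproximateAgeOfOldestMessage', q_dims, 'Maximum')
```

Then pull the function's logs for the alarm window with this Logs Insights query: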
fields @timestamp, @message
| filter @message like /(?i)(timed out|timeout|error|exception|Task timed|REPORT.*Timeout)/
| filter @message not like /^(START|END|REPORT|INIT_START)/
| sort @timestamp asc
| limit 50
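To run that query programmatically rather than in the console, a sketch using the CloudWatch Logs Insights API via boto3 (it assumes the default /aws/lambda/<function> log group and a one-hour window; widen the window to bracket the alarm):

```python
import time

QUERY = """
fields @timestamp, @message
| filter @message like /(?i)(timed out|timeout|error|exception|Task timed|REPORT.*Timeout)/
| filter @message not like /^(START|END|REPORT|INIT_START)/
| sort @timestamp asc
| limit 50
"""

logs = session.client('logs')
q = logs.start_query(
    logGroupName=f'/aws/lambda/{fn_name}',   # default Lambda log group name
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString=QUERY,
)
# Poll until the query finishes
while True:
    res = logs.get_query_results(queryId=q['queryId'])
    if res['status'] not in ('Scheduled', 'Running'):
        break
    time.sleep(1)
for row in res.get('results', []):
    print({f['field']: f['value'] for f in row})
```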
Look for: "Task timed out" entries, the same exception repeating for the same payload, and whether the error timestamps line up with the DLQ alarm window.
# Peek without consuming (short visibility timeout)
sqs.receive_message(QueueUrl=dlq_url, MaxNumberOfMessages=10,
AttributeNames=['All'], VisibilityTimeout=5)
Key attributes on each message: ApproximateReceiveCount (how many delivery attempts it saw), SentTimestamp, ApproximateFirstReceiveTimestamp, and the message body for correlating with the logs.
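A sketch of reading those attributes off each peeked message, using the same sqs client and dlq_url as above:

```python
from datetime import datetime, timezone

resp = sqs.receive_message(QueueUrl=dlq_url, MaxNumberOfMessages=10,
                           AttributeNames=['All'], VisibilityTimeout=5)
for msg in resp.get('Messages', []):
    a = msg['Attributes']
    sent = datetime.fromtimestamp(int(a['SentTimestamp']) / 1000, tz=timezone.utc)
    print(a['ApproximateReceiveCount'],   # delivery attempts before it landed here
          sent.isoformat(),               # when the message was sent
          msg['Body'][:200])              # start of the payload, for correlation
```

Common failure modes to match against: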
Pattern: Lambda max duration is fine, Lambda errors = 0, but messages pile up. Mechanism: Event source mapping has a low MaximumConcurrency (e.g., 2). During traffic spikes, SQS poller receives messages but can't dispatch them fast enough. With maxReceiveCount=1, any message that gets received but not successfully processed on the first attempt goes straight to DLQ. Fix: Increase maxReceiveCount to 2-3, or increase MaximumConcurrency.
Pattern: Lambda Duration max ≈ Lambda Timeout, "Task timed out" in logs. Mechanism: Downstream service (Kinesis, DB, API) is slow or down. Fix: Increase Lambda timeout, fix downstream, add circuit breaker.
Pattern: Lambda Errors > 0, exception in logs. Mechanism: Code bug, permission error, resource not found. Fix: Fix the code/permissions.
Pattern: ApproximateReceiveCount > 1 on DLQ messages, Lambda duration close to VisibilityTimeout. Mechanism: Lambda takes longer than VisibilityTimeout, message becomes visible again, gets re-received, hits maxReceiveCount. Fix: Set VisibilityTimeout to 6x Lambda timeout (AWS recommendation).
Pattern: Lambda Throttles > 0, possibly account-level concurrent execution limit. Fix: Request limit increase, add reserved concurrency.
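For reference, a sketch of the two most common remediations above (bumping maxReceiveCount / VisibilityTimeout and raising MaximumConcurrency). It reuses src_url, fn_timeout, and esm from the earlier sketches, and the concrete values are illustrative placeholders to confirm with the queue's owners before changing anything:

```python
import json

# 1. Raise maxReceiveCount and align VisibilityTimeout with the Lambda timeout
attrs = sqs.get_queue_attributes(QueueUrl=src_url,
                                 AttributeNames=['RedrivePolicy'])['Attributes']
redrive = json.loads(attrs['RedrivePolicy'])
redrive['maxReceiveCount'] = '3'
sqs.set_queue_attributes(QueueUrl=src_url, Attributes={
    'RedrivePolicy': json.dumps(redrive),
    'VisibilityTimeout': str(6 * fn_timeout),   # AWS guidance: ~6x the function timeout
})

# 2. Raise MaximumConcurrency on the SQS event source mapping
lam.update_event_source_mapping(UUID=esm['UUID'],
                                ScalingConfig={'MaximumConcurrency': 10})
```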
AWS credentials are in /home/ubuntu/.hermes/.env — parse the file and create a boto3 Session explicitly. Shell env vars may not persist across terminal sessions.
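A sketch of doing that; it assumes the file uses the standard AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_SESSION_TOKEN / AWS_REGION key names (adjust if it does not), and falls back to us-east-1 only as a placeholder default:

```python
import boto3

# Parse simple KEY=VALUE lines, ignoring comments and blank lines
creds = {}
with open('/home/ubuntu/.hermes/.env') as f:
    for line in f:
        line = line.strip()
        if line and not line.startswith('#') and '=' in line:
            k, _, v = line.partition('=')
            creds[k.strip()] = v.strip().strip('"').strip("'")

session = boto3.Session(
    aws_access_key_id=creds.get('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=creds.get('AWS_SECRET_ACCESS_KEY'),
    aws_session_token=creds.get('AWS_SESSION_TOKEN'),   # may be absent
    region_name=creds.get('AWS_REGION', 'us-east-1'),
)
sqs, lam = session.client('sqs'), session.client('lambda')
```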