notifly-sqs-dlq-alarm-checking
// Check whether a Notifly SQS DLQ has CloudWatch alerting configured, verify it live in AWS, and add the missing alarm via Terraform + PR workflow.
| name | notifly-sqs-dlq-alarm-checking |
| description | Check whether a Notifly SQS DLQ has CloudWatch alerting configured, verify it live in AWS, and add the missing alarm via Terraform + PR workflow. |
| version | 1.0.0 |
| author | Hermes Agent |
| license | MIT |
| metadata | {"hermes":{"tags":["notifly","sqs","dlq","cloudwatch","terraform","github","alarms"]}} |
Use when asked whether a Notifly SQS DLQ has alerting, or to add missing DLQ alarms in team-michael/notifly-event.
If the user asks why an existing DLQ alarm fired rather than whether alerting exists, use notifly-alert-live-investigation as the umbrella and its references/sqs-dlq-root-cause.md checklist. In particular, do not stop at alarm history: inspect/preserve the DLQ message, decode Notifly base64+gzip bodies, compare SQS Received vs Deleted, and correlate VisibilityTimeout/maxReceiveCount with the DLQ-visible timestamp.
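The body-decoding step mentioned above can be sketched as follows. This is a minimal illustration that assumes the body is a base64 string wrapping gzip-compressed UTF-8 text (usually JSON); the function names `decode_notifly_body` and `encode_for_demo` are hypothetical and not part of any Notifly library.

```python
import base64
import gzip
import json


def decode_notifly_body(body: str):
    """Decode a message body assumed to be base64-encoded gzip data.

    Returns parsed JSON when the decompressed text is valid JSON,
    otherwise the decompressed text itself.
    """
    raw = gzip.decompress(base64.b64decode(body))
    text = raw.decode("utf-8")
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text


def encode_for_demo(payload) -> str:
    """Build a sample body in the assumed format (for local testing only)."""
    compressed = gzip.compress(json.dumps(payload).encode("utf-8"))
    return base64.b64encode(compressed).decode("ascii")


if __name__ == "__main__":
    sample = encode_for_demo({"event": "demo"})
    print(decode_notifly_body(sample))
```

If decoding fails with a gzip or base64 error, the assumption about the encoding order is probably wrong for that message, so keep the raw body untouched before experimenting.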
- infra/terraform/prod/ap-northeast-2/sqs
- infra/terraform/prod/ap-northeast-2/sqs/queues.tf
- local.inventory_queues
- local.inventory_metric_alarm_details
- resource "aws_cloudwatch_metric_alarm" "metric_alarms"
- for_each = local.inventory_metric_alarm_details
Important: a queue appearing in cloudwatch/dashboards.tf only means it is graphed on a dashboard. That does not imply an alarm exists.
Existing DLQ alarms follow this exact pattern:
- Alarm name: <queue-name> has been created
- Namespace / metric: AWS/SQS / ApproximateNumberOfMessagesVisible
- Comparison and threshold: GreaterThanOrEqualToThreshold, 1
- Period: 60
- Statistic: Sum
- evaluation_periods = 1
- datapoints_to_alarm = 1
- Alarm action: data.aws_sns_topic.topics["cloudwatch-incidents-notifly"].arn
Despite the name, "has been created" means "DLQ has visible messages", not queue resource creation.
Search queues.tf for the queue name in both sections:
- inventory_queues
- inventory_metric_alarm_details
If the queue exists but there is no matching "<queue>-dlq has been created" entry, Terraform is missing the alert.
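The search above can be done by hand, but a small script makes the result easy to paste into a PR description. The sketch below is a plain text search, not an HCL parser: it lists every line that mentions the queue so you can see which section each hit belongs to, and checks for the "has been created" alarm key. The helper names `find_mentions` and `has_alarm_entry` are hypothetical, and the queue name in the demo is only an example.

```python
from pathlib import Path

# Path taken from this skill; run from the repo root.
QUEUES_TF = Path("infra/terraform/prod/ap-northeast-2/sqs/queues.tf")


def find_mentions(tf_text: str, queue_name: str):
    """Return (line_number, line) pairs for every line mentioning the queue."""
    return [
        (number, line.strip())
        for number, line in enumerate(tf_text.splitlines(), start=1)
        if queue_name in line
    ]


def has_alarm_entry(tf_text: str, queue_name: str) -> bool:
    """True if the text contains the '"<queue> has been created"' alarm key."""
    return f'"{queue_name} has been created"' in tf_text


if __name__ == "__main__":
    queue = "cafe24-worker-queue-dlq"  # example queue name
    if QUEUES_TF.exists():
        text = QUEUES_TF.read_text(encoding="utf-8")
        for number, line in find_mentions(text, queue):
            print(f"{number}: {line}")
        print("alarm entry present:", has_alarm_entry(text, queue))
    else:
        print(f"{QUEUES_TF} not found; run from the repo root")
```

Because this is only a substring match, a hit inside a dimension value also counts as a mention; read the printed lines rather than relying on the count.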
Use Python via terminal, not execute_code, so shell AWS credentials are available.
Check queue attributes with boto3 SQS:
- ApproximateNumberOfMessages
- RedrivePolicy
- RedriveAllowPolicy
Check live alarms with boto3 CloudWatch:
- describe_alarms()
- Dimensions[].Name == "QueueName" and the exact queue name
If Terraform lacks the alarm and CloudWatch returns zero matching alarms, the DLQ is not alerting.
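A minimal sketch of the live check, assuming boto3 is installed, shell credentials are valid, and the queue lives in ap-northeast-2 (inferred from the Terraform path, not confirmed). The function names `alarms_for_queue` and `check_live` are hypothetical. The filter is kept separate from the AWS calls so it can be exercised without credentials.

```python
def alarms_for_queue(metric_alarms, queue_name):
    """Return alarms with a QueueName dimension equal to queue_name."""
    matches = []
    for alarm in metric_alarms:
        for dim in alarm.get("Dimensions", []):
            if dim.get("Name") == "QueueName" and dim.get("Value") == queue_name:
                matches.append(alarm)
                break
    return matches


def check_live(queue_name, region="ap-northeast-2"):
    """Fetch queue attributes and matching CloudWatch alarms (needs AWS access)."""
    import boto3  # imported lazily so alarms_for_queue works without boto3

    sqs = boto3.client("sqs", region_name=region)
    url = sqs.get_queue_url(QueueName=queue_name)["QueueUrl"]
    attrs = sqs.get_queue_attributes(
        QueueUrl=url,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "RedrivePolicy",
            "RedriveAllowPolicy",
        ],
    ).get("Attributes", {})

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    metric_alarms = []
    # describe_alarms is paginated; collect every page before filtering.
    for page in cloudwatch.get_paginator("describe_alarms").paginate():
        metric_alarms.extend(page.get("MetricAlarms", []))
    return attrs, alarms_for_queue(metric_alarms, queue_name)


if __name__ == "__main__":
    attributes, alarms = check_live("cafe24-worker-queue-dlq")  # example queue
    print(attributes)
    print([alarm["AlarmName"] for alarm in alarms] or "no matching alarms")
```

Attributes that are not set on the queue are simply absent from the response, so a missing RedrivePolicy key is not an error. Alarms built on metric math keep their dimensions under a nested Metrics list and would not be matched by this filter.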
Insert a new entry near the top of local.inventory_metric_alarm_details in queues.tf:
"cafe24-worker-queue-dlq has been created" = {
  "actions_enabled" = true
  "alarm_actions" = [
    data.aws_sns_topic.topics["cloudwatch-incidents-notifly"].arn,
  ]
  "comparison_operator" = "GreaterThanOrEqualToThreshold"
  "datapoints_to_alarm" = 1
  "dimensions" = [
    {
      "name" = "QueueName"
      "value" = "cafe24-worker-queue-dlq"
    },
  ]
  "evaluation_periods" = 1
  "extended_statistic" = null
  "insufficient_data_actions" = []
  "metric_name" = "ApproximateNumberOfMessagesVisible"
  "namespace" = "AWS/SQS"
  "ok_actions" = []
  "period" = 60
  "statistic" = "Sum"
  "threshold" = 1.0
  "treat_missing_data" = "missing"
  "alarm_description" = null
}
Adapt queue name as needed for other DLQs.
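To avoid copy-paste slips when adapting the entry, the queue name can be substituted from a single template. This is only a convenience sketch: the template mirrors the entry shown above, and the helper name `render_alarm_entry` is hypothetical. Run terraform fmt afterwards, since the spacing here is not guaranteed to match the repo's formatting.

```python
from string import Template

# Mirrors the cafe24-worker-queue-dlq entry; only the queue name varies.
ALARM_ENTRY = Template('''\
"$queue has been created" = {
  "actions_enabled" = true
  "alarm_actions" = [
    data.aws_sns_topic.topics["cloudwatch-incidents-notifly"].arn,
  ]
  "comparison_operator" = "GreaterThanOrEqualToThreshold"
  "datapoints_to_alarm" = 1
  "dimensions" = [
    {
      "name" = "QueueName"
      "value" = "$queue"
    },
  ]
  "evaluation_periods" = 1
  "extended_statistic" = null
  "insufficient_data_actions" = []
  "metric_name" = "ApproximateNumberOfMessagesVisible"
  "namespace" = "AWS/SQS"
  "ok_actions" = []
  "period" = 60
  "statistic" = "Sum"
  "threshold" = 1.0
  "treat_missing_data" = "missing"
  "alarm_description" = null
}
''')


def render_alarm_entry(queue_name: str) -> str:
    """Return the alarm map entry for the given DLQ name."""
    return ALARM_ENTRY.substitute(queue=queue_name)


if __name__ == "__main__":
    print(render_alarm_entry("cafe24-worker-queue-dlq"))
```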
Run from repo root:
terraform fmt infra/terraform/prod/ap-northeast-2/sqs/queues.tf
terraform -chdir=infra/terraform/prod/ap-northeast-2/sqs init -backend=false
terraform -chdir=infra/terraform/prod/ap-northeast-2/sqs validate
tflint --chdir=infra/terraform/prod/ap-northeast-2/sqs
Note on .terraform.lock.hcl: terraform init -backend=false may modify infra/terraform/prod/ap-northeast-2/sqs/.terraform.lock.hcl by adding an extra provider hash for the current platform. Do not commit that incidental lockfile change unless intended.
If you revert .terraform.lock.hcl after init, terraform validate may fail because the cached provider checksum no longer matches the reverted lock file. In that case, re-run terraform init -backend=false before validate/tflint.
Andrej Karpathy <team@greyboxhq.com>
For AWS/GitHub access in this environment:
- terminal + boto3. aws CLI may be unavailable.
- ~/.hermes/.env if gh is unavailable.