oncall-cloudwatch-alert-triage
// Triage CloudWatch console-error alerts for ECS services using CLI-based log search and frequency analysis.
| name | oncall-cloudwatch-alert-triage |
| description | Triage CloudWatch console-error alerts for ECS services using CLI-based log search and frequency analysis. |
| triggers | ["CloudWatch alarm for /aws/ecs/* console error","Slack alert from Amazon Q or CloudWatch about ERROR spikes","Need to determine if an error pattern is recurring, new, or benign"] |
Turn a raw CloudWatch console-error alarm into a concrete judgment (spike / recurring / needs_fix / urgent) with evidence from logs and infrastructure metrics.
Inputs to collect first:

- Region (`ap-northeast-2` for prod)
- `message_ts` (Slack) for precise time correlation

Convert `message_ts` to human UTC:
python3 -c "import datetime; print(datetime.datetime.fromtimestamp(<message_ts>, tz=datetime.timezone.utc))"
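For example, with a hypothetical `message_ts` of `1700000000.5` (the value is illustrative, not from a real alert):

```python
import datetime

# Slack message_ts is epoch seconds; convert to an aware UTC datetime.
ts = 1700000000.5
print(datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc))
# 2023-11-14 22:13:20.500000+00:00
```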
Build a recent window (here, the last two hours) to see what actually happened around the alert.
START=$(python3 -c "import time; print(int((time.time()-7200)*1000))")
END=$(python3 -c "import time; print(int(time.time()*1000))")
aws logs filter-log-events \
--region <region> \
--log-group-name <log-group> \
--start-time "$START" --end-time "$END" \
--filter-pattern "ERROR" \
--limit 100 --output json
Extract sample messages to identify the error signature.
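A minimal sketch of that extraction step, assuming the JSON from `filter-log-events` was captured; the `events`/`message` field names match the real CLI output, but the payload below is hypothetical:

```python
import collections
import json

def top_signatures(raw_json, n=5):
    """Tally the first line of each event message to surface the dominant error signature."""
    events = json.loads(raw_json)["events"]
    counts = collections.Counter(e["message"].splitlines()[0] for e in events)
    return counts.most_common(n)

# Hypothetical payload shaped like filter-log-events output:
sample = '{"events": [{"message": "ERROR db timeout"}, {"message": "ERROR db timeout"}, {"message": "ERROR parse fail"}]}'
print(top_signatures(sample))
# [('ERROR db timeout', 2), ('ERROR parse fail', 1)]
```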
Once you know the exact error text, count it to judge severity.
START=$(python3 -c "import time; print(int((time.time()-86400)*1000))")
END=$(python3 -c "import time; print(int(time.time()*1000))")
aws logs filter-log-events \
--region <region> \
--log-group-name <log-group> \
--start-time "$START" --end-time "$END" \
--filter-pattern '"EXACT ERROR TEXT"' \
--limit 100 --output json | jq '.events | length'
Pitfall: `filter-pattern` rejects colons (`:`) and some special characters in unquoted terms. Always wrap the search phrase in single-quoted double quotes: `'"Some error text"'`.
# 7 days
START=$(python3 -c "import time; print(int((time.time()-7*86400)*1000))")
END=$(python3 -c "import time; print(int(time.time()*1000))")
aws logs filter-log-events ... --start-time "$START" --end-time "$END" ... | jq '.events | length'
# 30 days (if signal is sparse)
START=$(python3 -c "import time; print(int((time.time()-30*86400)*1000))")
END=$(python3 -c "import time; print(int(time.time()*1000))")
aws logs filter-log-events ... --start-time "$START" --end-time "$END" ... | jq '.events | length'
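One way to read the resulting counts, sketched with hypothetical numbers and an illustrative threshold (not part of the runbook): compare the 24-hour count against the longer window's daily average. A roughly flat daily rate suggests a steady recurring error; a 24-hour count far above the average points to a fresh spike.

```python
def classify(count_24h, count_7d):
    """Rough heuristic (assumed, not prescribed): compare today's count
    against the 7-day daily average."""
    daily_avg = count_7d / 7
    if count_24h == 0:
        return "quiet"
    if count_24h > 3 * daily_avg:
        return "spike"      # concentrated burst, likely new
    return "recurring"      # roughly steady background rate

print(classify(count_24h=12, count_7d=80))   # steady ~11/day -> recurring
print(classify(count_24h=90, count_7d=100))  # burst today -> spike
```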
Find the prod target group name (it may not match the ECS service name exactly):
aws elbv2 describe-target-groups --region <region> \
--query 'TargetGroups[*].[TargetGroupName,TargetGroupArn]' --output text | grep <service>
Pull target 5xx counts:
aws cloudwatch get-metric-statistics \
--region <region> \
--namespace AWS/ApplicationELB \
--metric-name HTTPCode_Target_5XX_Count \
--dimensions Name=TargetGroup,Value=targetgroup/<name>/<suffix> \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 300 --statistics Sum --output json
An empty `Datapoints` array usually means the error is logged but not returned to clients (internal noise / non-fatal).
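A sketch for totaling the returned datapoints, assuming the `get-metric-statistics` JSON was captured; the `Datapoints`/`Sum` field names match the real CLI output, but the payload is hypothetical:

```python
import json

def total_5xx(raw_json):
    """Sum the Sum statistic across all returned datapoints."""
    datapoints = json.loads(raw_json)["Datapoints"]
    return sum(dp["Sum"] for dp in datapoints)

sample = '{"Datapoints": [{"Sum": 3.0}, {"Sum": 1.0}], "Label": "HTTPCode_Target_5XX_Count"}'
print(total_5xx(sample))                 # 4.0
print(total_5xx('{"Datapoints": []}'))   # 0 -> internal noise, not client-facing
```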
Map findings into the response framework:
| Evidence | Judgment | Status directive |
|---|---|---|
| 5xx == 0, low historical count, no recurrence | Spike / monitor | [[hermes:processing_status=no_action]] |
| Recurring, no 5xx, fixable in code/config | Needs fix | [[hermes:processing_status=needs_fix]] |
| 5xx > 0, rapid increase, data-loss risk, cascading | Urgent | [[hermes:processing_status=urgent]] + @engineers |
Use @engineers only for genuine emergencies (active outage, data loss, severe customer impact, cascading failures).
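The table above can be sketched as a simplified decision helper (the boolean inputs and the collapse of "rapid increase / cascading" into a single risk flag are assumptions for illustration):

```python
def status_directive(fivexx_count, recurring, high_risk=False):
    """Map triage evidence to the hermes status directive from the table.
    high_risk stands in for data-loss risk / rapid increase / cascading failure."""
    if fivexx_count > 0 or high_risk:
        return "[[hermes:processing_status=urgent]]"    # also page @engineers
    if recurring:
        return "[[hermes:processing_status=needs_fix]]"
    return "[[hermes:processing_status=no_action]]"     # spike / monitor

print(status_directive(0, recurring=False))
print(status_directive(0, recurring=True))
print(status_directive(5, recurring=True))
```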
Pitfalls:

- `filter-log-events` has a 1 MB response limit; use `--next-token` or CloudWatch Logs Insights for very large windows.
- `filter-pattern` silently fails on unquoted special characters. Quote aggressively.
- The `TargetGroup` dimension expects the `targetgroup/<name>/<suffix>` form, not the ECS service name.