| name | check |
| description | Investigate Notifly Slack/Amazon Q/CloudWatch alerts from live data sources using Hermes profile/global .env-backed AWS, GitHub, Postgres, DynamoDB, and Athena credentials. Start from pasted alert text or Slack subscription context, recover alarm/log context, and produce one concise Korean final answer. |
| version | 1.2.2 |
| author | Hermes Agent |
| license | MIT |
| metadata | {"hermes":{"tags":["notifly","alerts","cloudwatch","aws","dynamodb","github","postgres","slack","investigation"]}} |
Notifly Alert Check
Use this when the user pastes or a Slack message subscription delivers:
- a Slack thread root from Amazon Q / AWS Chatbot
- a CloudWatch alarm name
- a CloudWatch Logs URL or log group
- a Redis / SQS / RDS / segment-publisher error snippet
and wants the same live investigation pattern used in prior Notifly sessions.
This skill is not about replaying old Hermes session archives.
It is about using live data sources via credentials in the current Hermes profile .env, then global ~/.hermes/.env.
Automated Slack alert contract
When this skill is invoked by a Slack message_subscriptions prompt, operate silently until the investigation is complete.
Final response rules:
- Post exactly one final message; no acknowledgement or progress messages.
- Do not call send_message for Slack subscription alerts. Return the final answer as the assistant's final response so the gateway posts it back to the originating Slack thread.
- Never use bare slack or slack:<channel_id> as a fallback target for subscription alert results; that posts to the channel main timeline and breaks alert threading.
- Korean only. Use the fixed concise list format below; alert context completeness has priority over length.
- Helper/script timestamps remain UTC/ISO internally. In the final Slack/user-facing message only, convert every timestamp to KST (YYYY-MM-DD HH:mm KST or M/D HH:mm KST). Do not expose UTC as the primary time.
- If needed, slightly exceed the target length or bullet count to include the mandatory context. Do not omit important context just to satisfy the short format.
- Prioritize the fixed labels, mandatory scope attribution, alert metric/threshold context, DB instance/query context when DB-shaped, strongest evidence, 30d/7d/1d/10m trend, customer impact, and immediate action decision.
- Never abbreviate stable identifiers in the final Korean answer. Do not use ellipses for project IDs, product/project names, campaign IDs, user journey IDs, table names, constraint names, file paths, function names, alarm names, or log groups. If space is tight, remove prose first and keep identifiers complete.
- If a helper sample line already contains "...", do not copy that truncated sample into the final answer. Use structured full fields such as logs.current_error_details[].project_ids, table_names, table_refs, projects, and scope_attribution instead.
- For log-derived alarms, the strongest evidence is the current alarm-window error detail, not the alarm name or frequency. If logs.current_error_details exists, the final Korean answer must describe the concrete triggering error from that field before discussing 7d/30d frequency or threshold sensitivity.
- Always include one compact Korean scope field that names the related project/product and exactly one of campaign or user journey. Campaign and user journey are mutually exclusive: if campaign evidence exists, do not also print an unknown user-journey value; if neither can be tied to the alert after reasonable checks, explicitly say in Korean that campaign/user journey is unknown. If the alert is service/infra-wide, say so in Korean.
- Campaigns are project-scoped. Never list campaign IDs as a standalone flat list when a project can be known. Prefer project/campaign pairs such as fitpet/Zxj6Nx; if only a campaign ID is known, say in Korean that the project is unknown for that campaign.
- For DB-shaped alerts, always include one compact Korean DB field naming the concrete DB instance/role and top SQL family/query fingerprint. If unavailable, say in Korean that the instance or query is unknown with the shortest reason.
- For needs_fix or urgent, the implementation target must be concrete somewhere in the fixed labels: file/module/function, SQL/index/table family, or Terraform path/resource. Avoid generic advice like "threshold review" unless paired with the exact Terraform alarm/config location to change.
- If immediate action is not required, do not print 액션 아이템:; put the concrete non-urgent target briefly in 즉시 조치 필요 여부: 추적 필요 ....
Pitfall: before finalizing a no_action response, count visible bullets. It must be exactly five labels (원인, 범위, 빈도, 고객 영향도, 즉시 조치 필요 여부). Including 액션 아이템: under no_action breaks the Slack reaction contract and inflates perceived severity.
- If the exact code or Terraform location is not found, write the most specific next lookup target instead of a generic action target.
- Mention @engineers only for urgent issues requiring immediate engineering response.
- End with exactly one hidden directive: [[hermes:processing_status=no_action]], [[hermes:processing_status=needs_fix]], or [[hermes:processing_status=urgent]].
Final answer format:
- Use short Markdown bullet lines, not paragraph prose.
- Each visible line must start with "- <label>".
- Use five visible bullets by default; add the sixth 액션 아이템: bullet only when immediate action is needed.
- Use exactly these Korean labels, in this order:
원인: alarm-triggering system-level cause plus code-level cause. Include both in one compact line; if one is unknown, say why briefly.
범위: project/product plus exactly one of campaign or user journey. Campaign and user journey are mutually exclusive.
빈도: recent 30일 / 7일 / 1일 / 10분 occurrence counts. Prefer alert transition counts from history.alarm_count_30d, history.alarm_count_7d, history.alarm_count_1d, and history.alarm_count_10m; use log-event counts only when alert history is unavailable or the user explicitly asks for log volume. If any window is unavailable, mark only that window as 확인 불가(<short reason>).
고객 영향도: concrete customer-facing impact, data loss/delay/failure/noise status, and whether users/customers were likely affected.
즉시 조치 필요 여부: 필요, 불필요, or 추적 필요 plus the shortest reason.
액션 아이템: include this line only when immediate action is needed. Name the exact owner-facing implementation or infrastructure target.
- Do not add separate 판단, 근거, 조치, 현재 상태, or narrative summary labels.
- Keep each label to one line unless a single line would hide mandatory identifiers.
Status selection:
no_action: false positive, already recovered transient spike, known issue within the recent baseline, expected business rejection, noisy metric filter, or any case where no immediate owner action is required. Use this even when the final answer includes a later improvement suggestion, if the current alert is benign or already understood.
needs_fix: non-urgent but actionable engineering work should be tracked now because the signal is new, worsening, outside baseline, causing real failed work, repeated customer impact, data-loss risk, runaway cost/load, or materially harmful alert noise. Do not use needs_fix merely because a code/config/threshold improvement is possible someday.
urgent: immediate customer impact, data loss risk, sustained outage, runaway cost/load, or failed critical dependency.
Known-issue rule:
- If the alert matches a known recurring pattern, is already recovered or within baseline, and does not require immediate mitigation, choose no_action so Slack gets the checkmark reaction.
- Escalate from no_action to needs_fix only when the recurrence is increasing, the root cause is not understood, the alert creates real operational burden that should be scheduled now, or there is evidence of failed customer-facing work.
- If history.rapid_recurrence.status is rapid or there are two or more ALARM transitions within 10 minutes, investigate more deeply before deciding. Do not dismiss it as routine solely because 7d/30d history is recurring; cite the rapid recurrence and use needs_fix unless current load, impact, and dominant source are clearly benign.
Live data sources
Backed by env credentials already present in this environment:
- AWS: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION
- GitHub: GITHUB_TOKEN
- Postgres: POSTGRES_HOST, POSTGRES_PORT, POSTGRES_DB, POSTGRES_USER, POSTGRES_PASSWORD
- DynamoDB: project table via AWS creds
- Athena via AWS creds when log/query history is in Athena
Core pattern learned from prior sessions
The recurring investigation flow is:
- Start from the pasted alert text or Slack channel ID
- extract alarm name, region, log group, service name, queue name, project IDs, error phrases
- if the user gives a Slack channel ID and asks about Amazon Q / AWS Chatbot messages, use chatbot.describe_slack_channel_configurations in us-west-2 to map SlackChannelId -> SnsTopicArns; then inspect CloudWatch alarms whose AlarmActions/OKActions include those SNS topics. This reconstructs recent Amazon Q alert messages without needing Slack history access (see the sketch after this list).
- Use AWS first, not guesswork
- describe_alarms
- describe_alarm_history
- CloudWatch metric datapoints
- Logs Insights / metric filters
- RDS / Performance Insights / SQS / SNS / CloudTrail as needed
- If a project_id appears, always map it to product
- DynamoDB project table
- return project_id + product_id + project.name
- Postgres tables follow table_$project_id; use this naming convention when checking campaign/user_journey evidence.
- If the user asks when the issue started or which change caused it
- find earliest retained log evidence first
- then correlate to local git history / GitHub PRs with GITHUB_TOKEN
- If the alert is DB-shaped
- separate alarm sensitivity from actual workload
- identify writer vs reader
- identify the exact DB instance and role
- identify the SQL family/query fingerprint from Performance Insights or DB logs, not just the metric name
- If the alert is log-shaped
- inspect metric filter breadth first
- inspect notification routing drift (SNS subscribers / CloudTrail) if alert volume changed
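A minimal read-only sketch of the Slack-channel-to-alarm mapping above, assuming env-backed AWS credentials; the channel ID is a placeholder:

```python
import os
import boto3

session = boto3.Session(region_name=os.environ.get("AWS_DEFAULT_REGION", "ap-northeast-2"))

# AWS Chatbot Slack configurations are managed in us-west-2 regardless of the alarm region.
chatbot = session.client("chatbot", region_name="us-west-2")
slack_channel_id = "C0123456789"  # placeholder: the channel ID from the subscription context

topic_arns = set()
for cfg in chatbot.describe_slack_channel_configurations()["SlackChannelConfigurations"]:
    if cfg.get("SlackChannelId") == slack_channel_id:
        topic_arns.update(cfg.get("SnsTopicArns", []))

# Read-only: find CloudWatch alarms whose actions publish to those SNS topics.
cloudwatch = session.client("cloudwatch")
related = []
for page in cloudwatch.get_paginator("describe_alarms").paginate():
    for alarm in page["MetricAlarms"]:
        actions = set(alarm.get("AlarmActions", []) + alarm.get("OKActions", []))
        if actions & topic_arns:
            related.append(alarm["AlarmName"])
print(sorted(related))
```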
Mandatory scope attribution
Every investigation must identify which project and which single campaign or user journey the alert is related to, or explicitly say why that scope is not available.
Resolve scope in this order:
- Use IDs present in the alert text, alarm dimensions, log signatures, payload samples, table names, or helper output (project_id, campaign_id, user_journey_id, schedule IDs, journey session IDs).
- Map every project_id through DynamoDB project and report the product/name mapping, not just the raw ID.
- If mapping fails or the item is missing, report the full project_id and the exact failure reason from projects[].mapping_failure_reason or scope_attribution.project_mapping_failures.
- For campaign or user_journey IDs, use read-only DynamoDB/Postgres/Athena lookups to map names and owning project/product when available.
- For Postgres table names, infer project from the table_$project_id suffix and then map that project.
- For log lines containing both Project Id and Campaign Id, treat the pair as the primary campaign scope. Current alarm-window pairs outrank 7d/30d historical signatures.
- Also treat log-style campaign_id: <id>, project_id: <id> and project_id: <id>, campaign_id: <id> as primary project/campaign pairs.
- Also treat compact ECS log lines such as campaignId: UL1T00 (with or without accompanying projectId) as primary scope evidence when they appear in the current alarm-window stream.
- Never combine a standalone campaign ID with an unrelated sharded table suffix from another log line. IDs from relation "<table>_<project_id>" does not exist are table references, not campaign ownership evidence, unless that table error is the actual current trigger and no stronger project/campaign pair exists.
- For DB alerts, Performance Insights SQL statements are scope evidence: event_intermediate_counts_$project_id, users_$project_id, delivery_result_$project_id, message_events_$project_id, etc. mean the project is known and must not be reported as unknown.
- For campaign/user_journey scope, first check whether the SQL/table family can carry campaign_id, resource_type, or user_journey_id. If yes, run a read-only aggregate around the alarm/PI window to find the top campaign or user-journey contributor. If campaign evidence exists, stop there and do not also report user journey. If not available because the query is parameterized or the table family has no campaign/user_journey column, say that specific reason.
- For service-wide, Lambda/ECS, RDS, SQS, Redis, or broad metric-filter alerts with no per-project evidence, state in Korean that the project and campaign/user journey are unknown, and add a Korean service-wide or infra-wide marker when that is the correct scope.
Do not omit scope to stay under the target length. Compress wording first; if still necessary, exceed the target length.
Important tool discipline
One-pass first
For automated Slack alerts, run the helper first and treat its compact JSON as the primary evidence bundle.
Do not manually repeat helper-covered steps unless the helper explicitly reports missing data or an error.
The helper is expected to collect in one terminal call:
- answerability fields: can_answer_root_cause, missing_required_context, and required_followups
- CloudWatch alarm metadata and alarm history
- 7d/30d alarm transition counts
- metric filter configuration
- vetted Logs Insights 7d/30d counts
- top 5 sanitized log signatures with at most 3 sample lines each
- current-alarm-window signatures, trigger-centered sanitized log contexts, and compact concrete error details (logs.current_top_signatures, logs.current_trigger_contexts, logs.current_error_details) for log-derived alarms
- HTTP 4xx/5xx metric context when the alarm namespace/dimensions support it
- SQS/DLQ queue attributes, redrive source hints, and safe queue metrics when relevant
- Lambda configuration, event sources, async destination/retry config, and error/throttle/duration metrics when relevant
- RDS topology when relevant
- RDS Performance Insights top SQL by instance when relevant
- RDS current alarm focus-window project attribution (rds_performance_insights.detected_scope_ids.current_top_projects_by_load) when relevant
- project IDs inferred from RDS Performance Insights sharded table suffixes before SQL sanitization
- project mapping from DynamoDB when project_id is present
- project plus campaign-or-user_journey attribution, including explicit Korean unknown values when no specific scope is supported
- project-campaign pairs from logs (logs.current_project_campaign_pairs) when the current alarm-window payload contains both IDs
- campaign/user-journey narrowing hints from log payload IDs and campaign-capable table families
- related implementation and Terraform source locations with only 30-50 nearby lines for the top matches
Helper collectors must be selected from CloudWatch alarm metadata, metric namespace/name, dimensions, metric filters, and log group shape. Do not add service-name-specific branches for individual ECS services or alarm names; add generic namespace/metric/dimension collectors instead.
When a reusable collection step is needed, add it as a bounded collector in scripts/notifly_alert_context/collectors.py and keep pattern constants in scripts/notifly_alert_context/config.py so new monitoring patterns do not require CLI orchestration changes.
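A hypothetical sketch of that collector-registry shape; the names below are illustrative only and are not the actual contents of collectors.py or config.py:

```python
# Hypothetical shape only; the real collectors.py/config.py may differ.
from typing import Any, Callable, Dict

DLQ_NAME_SUFFIXES = ("-dlq",)  # config.py-style pattern constant (illustrative)

CollectorFn = Callable[[Dict[str, Any]], Dict[str, Any]]
COLLECTORS: Dict[str, CollectorFn] = {}

def register(name: str) -> Callable[[CollectorFn], CollectorFn]:
    """Register a bounded, read-only collector keyed by alarm/metric shape."""
    def wrap(fn: CollectorFn) -> CollectorFn:
        COLLECTORS[name] = fn
        return fn
    return wrap

@register("sqs_queue_context")
def collect_sqs_queue_context(alarm: Dict[str, Any]) -> Dict[str, Any]:
    # Selected from namespace/dimensions (AWS/SQS + QueueName), never from a
    # hard-coded service name; must return a small, capped, sanitized dict.
    return {"skipped": "not implemented in this sketch"}
```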
Prefer terminal + Python for AWS
Do not rely on execute_code for AWS calls when credentials may only exist in shell env.
Use:
- terminal("python - <<'PY' ... PY")
- explicit boto3.Session(...) from env vars
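For example, a minimal read-only heredoc body of this shape (the alarm name is a placeholder):

```python
# Body of a terminal("python - <<'PY' ... PY") call, so shell env credentials are visible.
import os
import boto3

session = boto3.Session(
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    region_name=os.environ.get("AWS_DEFAULT_REGION", "ap-northeast-2"),
)
cloudwatch = session.client("cloudwatch")

# Read-only: fetch one alarm's metadata (alarm name is a placeholder).
resp = cloudwatch.describe_alarms(AlarmNames=["notifly-db-prod-cluster-cpu-high"])
for alarm in resp["MetricAlarms"]:
    print(alarm["AlarmName"], alarm["StateValue"], alarm.get("StateReason", ""))
```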
When the helper skips Logs Insights despite a clear metric filter (e.g., filter_pattern is present but logs.skipped says "no stable filter terms inferred"), fall back to the bounded manual trace in references/ecs-log-manual-trace.md.
Safe defaults
- AWS: read-only
- Postgres: read-only
- never print secrets or full sensitive payloads
- avoid dumping raw Received event: logs if they contain sender credentials
- never print raw CloudWatch log dumps; use signature counts and sanitized samples only
- never paste full _aws metric JSON, access logs, or full event payloads into the conversation
- for source search, avoid broad file reads and AGENTS/SOUL/session context; read only the relevant function/file area when the helper result is insufficient
- when the action is alarm/threshold/routing/config related, search Terraform under infra/terraform and name the exact resource/path when found
Fast path helper
Use the helper script first when the user gives pasted alert text. Its default output is compact JSON designed to keep the prompt small:
python "${HERMES_HOME:-$HOME/.hermes}/skills/software-development/check/scripts/collect_notifly_alert_context.py" \
--text 'Amazon Q: CloudWatch Alarm | notifly-db-prod-cluster CPUUtilization too high | ap-northeast-2 | Account: 702197142747'
Or with a file:
python "${HERMES_HOME:-$HOME/.hermes}/skills/software-development/check/scripts/collect_notifly_alert_context.py" \
--text-file /tmp/alert.txt
If the helper fails to parse the alarm name from free-form text (detected.alarm_name is null), pass it explicitly:
python "${HERMES_HOME:-$HOME/.hermes}/skills/software-development/check/scripts/collect_notifly_alert_context.py" \
--text 'segment-publisher slow eic query' \
--alarm-name '/aws/ecs/notifly-services-prod/segment-publisher slow eic query' \
--region ap-northeast-2
Pitfall: alarm names with embedded priority tiers (e.g., ScheduledBatchDelivery-P2-FCMLatencyP99) may not be detected by the text parser. Pass --alarm-name explicitly in these cases.
Pitfall: When a metric filter pattern (e.g., took too long) differs materially from the alarm or metric name (e.g., segment-publisher-prod slow eic query), the helper may derive Logs Insights filter terms from the name and report count_7d: 0 / count_30d: 0 despite actual matches existing. Do not treat zero counts as absence of logs; fall back to the bounded manual trace using the exact filter_pattern string from metric_filters[].filter_pattern.
The script does the single-pass first investigation:
- parse alert text
- query live CloudWatch alarm metadata/history
- summarize 7d and 30d alarm history
- Pitfall: describe-alarm-history may return entries with StateValue: null and StateReason: null. When this happens, the helper cannot count ALARM transitions from history alone. Fall back to metric datapoint breach density and the alarm's current StateReason from describe-alarms (see the sketch after this list).
- fetch CloudWatch metric datapoints
- detect log groups / project IDs
- inspect metric filters
- run fixed Logs Insights query templates for counts and top signatures
- collect current alarm-window CloudWatch log contexts from the latest ALARM transition, then trigger-centered contexts from the exact log stream/time window
- summarize HTTP 4xx/5xx metrics when inferable
- summarize SQS/DLQ context when queue names/dimensions are present
- summarize Lambda config/event sources/runtime metrics when function names/dimensions are present. Pitfall: alarm name prefixes may contain priority tiers or other suffixes that do not match actual Lambda function names (e.g., ScheduledBatchDelivery-P2-... maps to scheduled-batch-delivery); when the collector fails with ResourceNotFoundException, fall back to manual name resolution. See references/lambda-name-mapping-gaps.md.
- map project IDs via DynamoDB project
- map or explicitly rule out project/campaign/user_journey scope
- inspect RDS topology if the alarm is RDS-shaped
- query Performance Insights for top SQL grouped by DB instance when the alarm is RDS-shaped
- search local repo for exact error/alarm strings and return only compact implementation/Terraform context
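A minimal sketch of the transition-count logic and its null-history fallback, assuming env-backed credentials; the alarm name is a placeholder and the helper's real parsing may differ:

```python
import json
import os
from datetime import datetime, timedelta, timezone

import boto3

session = boto3.Session(region_name=os.environ.get("AWS_DEFAULT_REGION", "ap-northeast-2"))
cloudwatch = session.client("cloudwatch")
alarm_name = "segment-publisher-prod slow eic query"  # placeholder
end = datetime.now(timezone.utc)
start = end - timedelta(days=7)

# Count transitions into ALARM from state-update history items.
transitions = 0
unparsed = 0
for page in cloudwatch.get_paginator("describe_alarm_history").paginate(
    AlarmName=alarm_name, HistoryItemType="StateUpdate", StartDate=start, EndDate=end
):
    for item in page["AlarmHistoryItems"]:
        try:
            new_state = json.loads(item["HistoryData"])["newState"]["stateValue"]
        except (KeyError, TypeError, ValueError):
            unparsed += 1
            continue
        if new_state == "ALARM":
            transitions += 1

if transitions == 0 and unparsed > 0:
    # History is unusable; fall back to the alarm's current StateReason and
    # metric datapoint breach density instead of reporting zero occurrences.
    alarm = cloudwatch.describe_alarms(AlarmNames=[alarm_name])["MetricAlarms"][0]
    print("fallback:", alarm["StateValue"], alarm.get("StateReason", ""))
else:
    print("alarm_transitions_7d:", transitions)
```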
Do not write custom Logs Insights syntax in the LLM loop unless the helper failed and the missing question cannot be answered otherwise.
If a manual Logs Insights query is unavoidable, keep it based on the helper's fixed query shape and return only aggregate counts or sanitized samples.
Helper answerability gate
If the helper returns can_answer_root_cause: true, produce the final answer from the helper output immediately. Do not run manual AWS, source-search, or CloudWatch Logs follow-up calls unless missing_required_context contains a blocking item whose absence makes the fixed final format impossible.
For service-wide or infra-wide metric alarms with no project/campaign/user journey evidence, do not search raw logs just to find a project. Use scope_attribution.required_final_field and state that the scope is service-wide and campaign/user journey is unknown.
Never run broad aws logs filter-log-events or raw log-dump commands in the LLM loop. If a log query is genuinely required, add or use a bounded helper collector that returns grouped counts and sanitized samples only.
After the helper returns, inspect these fields before composing the final answer:
- can_answer_root_cause
- missing_required_context
- required_followups
If can_answer_root_cause is false, do not finalize from the first helper output unless every listed follow-up is impossible, unsafe, or lacks credentials. Execute read-only required_followups in priority order, keeping the output compact, and use the new evidence in the final answer.
If missing_required_context is non-empty but can_answer_root_cause is true, answer the root cause from the available evidence but still fill safe follow-ups that affect mandatory final fields such as project/campaign/user journey, DB instance/query, or concrete code/Terraform action.
If a follow-up cannot be completed, the final answer must name the unavailable context briefly instead of implying it was checked.
For log-derived alarms, never finalize with only alarm frequency, threshold, or metric-filter wording when current trigger log details are available. Use logs.current_error_details[].likely_error, context_lines, and error_lines to explain what actually happened in the triggering request/job.
Continuous improvement loop
Every check execution should improve this skill over time:
- If you needed extra manual tool calls beyond the helper, decide whether that step is deterministic and reusable.
- If it is reusable, silently fold it into the helper package (scripts/notifly_alert_context/), preferably as a config entry, collector registry entry, or fixed query template, during the same session when safe.
- For any new alert pattern not covered by the helper, classify the missing context before finalizing: alarm family, AWS API needed, log query shape, source-search token, and final response field it should feed.
- Add a small bounded collector or fixed query template for the new pattern when it can be implemented read-only and compactly. Prefer structured fields over prose.
- If code changes are not safe during that Slack session, include a helper_gap note in the private reasoning and keep the final answer concrete with the best available evidence.
- Prefer adding compact helper fields, fixed query templates, or output caps over adding more prose instructions.
- If no reusable improvement is found, do not edit files just to create churn.
- For Slack automated alerts, keep this maintenance silent and still post exactly one final Korean response with the hidden status directive.
Investigation recipes
A. RDS / Aurora CPU / memory alarm
Pattern examples:
- CPUUtilization too high
- FreeableMemory
- notifly-db-prod-cluster
Flow:
- alarm metadata + exact thresholds
- alarm history (OK -> ALARM, ALARM -> OK)
- CloudWatch datapoints that actually breached
- instance topology (writer/readers)
- Performance Insights db.load.avg grouped by db.sql on the offending instance (see the sketch after this list)
- use the current alarm focus window first; report dominant current_top_projects_by_load instead of listing every project seen in the broader PI lookback
- if current_unattributed_top_sql has significant focus load, report it separately as unattributed DB load instead of assigning it to every detected project
- if sharded table suffix/project_id appears in SQL -> map via DynamoDB and include the project/product
- for campaign/user journey, inspect campaign-capable table families (delivery_result, message_events, scheduled_messages, campaign, user journey tables) with read-only aggregates around the alarm window; do not mark campaign/user journey unknown until this is impossible or inapplicable
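A minimal read-only sketch of that Performance Insights lookup; the DbiResourceId is a placeholder and the window should come from the current alarm focus window:

```python
import os
from datetime import datetime, timedelta, timezone

import boto3

session = boto3.Session(region_name=os.environ.get("AWS_DEFAULT_REGION", "ap-northeast-2"))
pi = session.client("pi")

# Identifier is the instance's DbiResourceId (db-XXXX), not its name;
# resolve it first via rds.describe_db_instances.
resource_id = "db-EXAMPLE123"  # placeholder
end = datetime.now(timezone.utc)
start = end - timedelta(minutes=30)  # use the current alarm focus window

resp = pi.get_resource_metrics(
    ServiceType="RDS",
    Identifier=resource_id,
    StartTime=start,
    EndTime=end,
    PeriodInSeconds=300,
    MetricQueries=[{"Metric": "db.load.avg", "GroupBy": {"Group": "db.sql", "Limit": 10}}],
)
for metric in resp["MetricList"]:
    dims = metric["Key"].get("Dimensions", {})
    load = sum(point.get("Value") or 0 for point in metric["DataPoints"])
    # Sharded table suffixes inside the SQL text (e.g. users_<project_id>) give project scope.
    print(round(load, 2), dims.get("db.sql.statement", "")[:120])
```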
Questions to answer:
- Why did the alarm fire?
- Which instance, writer, or reader caused it?
- Which SQL family/query fingerprint created the load?
- Which project/product is dominant in the current alarm focus window, and which projects are only background/minor contributors?
- Which project/product/campaign/user journey is connected to the SQL table suffix or aggregate? Do not print campaign and user journey together.
- Is this a noisy alert or a real incident signal?
B. ECS console/log-derived alarm
Pattern examples:
- /aws/ecs/notifly-services-prod/...
- console error
- slow eic query
- Processing took longer than expected
- Redis / CROSSSLOT / rate-limit errors
Flow:
- alarm + metric filter config
- live alarm history
- Logs Insights for the primary metric filter pattern and daily counts
- inspect logs.current_alarm_window, logs.current_top_signatures, and logs.current_trigger_contexts before writing the final answer; root cause must be based on the error that caused the latest ALARM transition, not a historical 7d/30d top signature, alarm name, or broad service name
- if the current alarm-window context shows DB errors, duplicate keys, deadlocks, dependency timeouts, or route/controller frames, treat those as the primary cause and map them to project/table/code context
- Helper fallback: if the helper skips Logs Insights (logs.skipped) despite a clear metric filter, use the bounded manual trace in references/ecs-log-manual-trace.md rather than running broad filter-log-events across the whole log group.
- if alert volume changed, inspect:
- metric filter drift
- alarm config drift
- SNS subscriber drift
- CloudTrail PutMetricFilter / PutMetricAlarm / Subscribe
- trace exact code path in notifly-event
- if user asks when it started, find earliest retained log and correlate to PR/commit
Do not claim a metric filter is matching unrelated messages unless the helper's primary filter terms and current alarm-window contexts prove it. Related metric filters, historical top signatures, and broad alarm words are only supporting context.
Pitfall — metric-filter name vs. actual trigger: an alarm may be named after a historic cause (e.g., slow eic query) while the current trigger is a different, coarser log pattern (e.g., [WARN] Processing took longer than expected). When the same log group already carries a purpose-built metric filter in a custom namespace (e.g., Custom/segment-publisher → SegmentPublisher.ExecutionTimeOverThreshold), the ConsoleErrors copy is likely redundant or stale. Always inspect the exact log line that breached the threshold and the full set of metric filters on the log group before letting the alarm name dictate the root cause. See references/segment-publisher-slow-eic-query-noise.md for a concrete example.
Pitfall — broad metric filter catching multiple unrelated causes: a coarse substring filter (e.g., took too long) may match both a benign WARN continuation and a real DB-query latency signal. The alarm name may be accurate for one pattern (e.g., EventCounterCteManager.extract slow EIC query) while a second pattern (batch-processing [WARN]) is noise. Always read the exact log line and surrounding context to determine which pattern fired and triage separately.
Pitfall — helper skipping literal substring metric filters: when metric_filters[].filter_pattern is a simple literal string (e.g., Processing took longer than expected) and the helper reports logs.skipped: "no stable filter terms inferred", the helper’s term extractor is failing on what should be a stable substring. Do not conclude "no logs exist." Fall back to the bounded manual trace in references/ecs-log-manual-trace.md with the exact literal string, or run a direct Logs Insights filter @message like 'Processing took longer than expected' query bounded to the alarm window. This commonly affects segment-publisher long running alam and similar alarms whose filter pattern is a plain phrase rather than a tokenized keyword list.
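A minimal bounded sketch of that direct Logs Insights fallback; the log group and window are placeholders, and only aggregate counts are returned:

```python
import os
import time
from datetime import datetime, timedelta, timezone

import boto3

session = boto3.Session(region_name=os.environ.get("AWS_DEFAULT_REGION", "ap-northeast-2"))
logs = session.client("logs")

log_group = "/aws/ecs/notifly-services-prod/segment-publisher"  # placeholder
end = datetime.now(timezone.utc)
start = end - timedelta(minutes=30)  # bound to the alarm window, not the whole retention

# Use the exact literal from metric_filters[].filter_pattern as the substring.
query = (
    "filter @message like 'Processing took longer than expected' "
    "| stats count() as hits by bin(5m)"
)
query_id = logs.start_query(
    logGroupName=log_group,
    startTime=int(start.timestamp()),
    endTime=int(end.timestamp()),
    queryString=query,
    limit=50,
)["queryId"]

while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(2)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})
```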
Pitfall — access-log benign substring matching coarse filter: a metric filter such as %ERROR|Exception% may match benign substrings embedded in HTTP access logs, e.g. a query parameter templateName=service_error or a path segment containing error. The resulting log line is a normal 200/304 request, not an application error. When logs.current_error_details is empty and the trigger context shows only access logs, run a follow-up Logs Insights query that filters out known benign patterns (e.g., service_error, error.html) to confirm whether any real ERROR logs exist in the alarm window. See references/ecs-console-error-false-positive-patterns.md.
Pitfall — Sentry email alert pipeline matching its own payload: The ops-email-receiver Lambda writes parsed Sentry email alerts to a dedicated log group (/aws/ecs/notifly-services-prod/web-console/sentry). A broad %ERROR% metric filter on that log group matches the S-issue title ("Error", "SyntaxError") and "level":"error" inside the JSON payload, causing the alarm to fire whenever any Sentry alert arrives. The Lambda itself is healthy; the real errors are web-console Next.js issues tracked in Sentry. When the helper returns empty current_trigger_contexts, fall back to aws logs filter-log-events bounded by the exact metric-datapoint window because Logs Insights lags metric-filter ingestion. See references/sentry-email-alert-pipeline-false-positives.md for scope extraction via productId and DynamoDB GSI lookup.
C. Console error log-level triage / bulk Amazon Q review
Use this when the user asks to review recent Amazon Q / AWS Chatbot console error alerts over a time range and decide which logs can be downgraded from ERROR to WARN/INFO in notifly-event.
Flow:
- If the user gives only a Slack channel ID, map SlackChannelId -> SnsTopicArns via AWS Chatbot in us-west-2, then find CloudWatch alarms whose actions use those SNS topics; this reconstructs Amazon Q alert scope without Slack history.
- For each relevant ConsoleErrors / log-derived alarm, use alarm history plus CloudWatch Logs Insights around recent ALARM windows to extract actual triggering log signatures, not just alarm names.
- Group signatures by service and code path, then trace the exact source location in notifly-event before recommending any log-level change.
- Apply a fail-closed downgrade rule:
- safe to downgrade only when the log is a handled expected/business/validation outcome and the invocation continues or exits normally;
- keep ERROR for unhandled exceptions, Lambda invocation failures, DLQ-producing paths, DB/SQS/Kinesis writes, provider unknown/network failures, data loss, or dependency failures.
- Prefer WARN for handled but operator-visible data/config quality issues; prefer INFO for normal empty-result/no-op outcomes.
- Remove or minimize full payload/request dumps while downgrading; log only compact non-sensitive context such as project/campaign IDs, metric names, dates, and counts. Treat recipient/device/request payloads as potential PII.
- Implement in small service-scoped PRs with tests that assert both the new level and that console.error is not called for the safe path; also assert non-suppressible failure paths remain ERROR.
- PR body should explicitly list out-of-scope ERROR paths so reviewers can see service-fault observability is preserved.
Good candidates seen before:
- empty result with explicit user notification and normal return -> INFO
- invalid recipient/device tokens already converted into delivery failure records -> WARN
- provider error branches already marked suppressible by code -> WARN
- duplicate non-fatal config/cache mappings where processing continues -> WARN
Bad candidates / keep ERROR:
- database/SQS/Kinesis write failures
- Lambda unhandled exceptions, timeout/OOM, DLQ/retry exhaustion
- provider unknown, network, auth, or rate-limit failures unless explicitly handled/suppressed
- cache initialization or delivery-policy missing data when it may hide a real initialization/data bug
D. SQS / DLQ alert
Pattern examples:
- ApproximateNumberOfMessagesVisible
- *-dlq
- retry / maxReceiveCount questions
Flow:
- queue attributes
- main queue vs DLQ metrics
- redrive policy and source queue hints
- Lambda/event source mapping for the consumer when inferable
- Lambda logs for retry phrases
- avoid receive_message unless explicitly approved because it changes message visibility; prefer read-only attribute and metric reads (see the sketch after this list)
- separate:
- retry broken
- retry working but poison messages still exhausting budget
- historical DLQ residue only
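A minimal read-only sketch of the queue-attribute and redrive inspection above; the queue name is a placeholder:

```python
import json
import os

import boto3

session = boto3.Session(region_name=os.environ.get("AWS_DEFAULT_REGION", "ap-northeast-2"))
sqs = session.client("sqs")

queue_url = sqs.get_queue_url(QueueName="example-service-dlq")["QueueUrl"]  # placeholder name

# Attribute reads do not change message visibility, unlike receive_message.
attrs = sqs.get_queue_attributes(
    QueueUrl=queue_url,
    AttributeNames=[
        "ApproximateNumberOfMessagesVisible",
        "ApproximateNumberOfMessagesNotVisible",
        "RedrivePolicy",
    ],
)["Attributes"]
print("visible:", attrs.get("ApproximateNumberOfMessagesVisible"))

# For a DLQ, the source queues that redrive into it hint at the failing consumer.
print("redrive sources:", sqs.list_dead_letter_source_queues(QueueUrl=queue_url).get("queueUrls", []))

# RedrivePolicy (set on the main queue) carries maxReceiveCount, i.e. the retry budget.
if "RedrivePolicy" in attrs:
    print("maxReceiveCount:", json.loads(attrs["RedrivePolicy"]).get("maxReceiveCount"))
```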
E. HTTP 4xx / 5xx / API error-rate alarm
Pattern examples:
- [api-service] 4xx error response is greater than 300 in 5m
- API Gateway / ALB 4XXError, 5XXError, HTTPCode_Target_4XX_Count
Flow:
- alarm metric/dimensions and exact threshold
- 4xx/5xx/request-count peer metrics over 7d
- Logs Insights or Athena/access-log aggregate by status, route/path, method, target service, and project/campaign IDs when available
- source search for the route/controller/error mapper that emits the dominant status
- distinguish customer/client input spikes from server-side regression
- if AI gateway or Workers AI dependency is suspected, verify Cloudflare status per references/cloudflare-workers-ai-status-check.md
F. Redis / CROSSSLOT / cache incident
Pattern examples:
- All keys in the pipeline should belong to the same slots allocation group
- CROSSSLOT
- enableAutoPipelining
Flow:
- exact error logs and first-seen time
- error daily trend vs traffic/command metrics
- inspect ElastiCache cluster shape and headroom
- trace repo call sites and redis client config
- correlate to PR/commit that changed cache behavior or redis config
- separate direct root cause from later traffic amplifier
G. Lambda latency / error / throttle alarm
Pattern examples:
- *-FCMLatencyP99, *-LatencyP99
- *-Errors, *-Throttles
- Metric namespace Notifly/ScheduledBatchDelivery, AWS/Lambda
Flow:
- alarm metadata + exact threshold and metric statistic (p99, Average, Sum)
- alarm history and recurrence pattern
- CloudWatch metric datapoints that breached, from the custom namespace if available
- resolve the real Lambda function name: alarm prefixes may include priority tiers (e.g., -P2) that are not part of the actual function name; see references/lambda-name-mapping-gaps.md
- Lambda configuration (MemorySize, Timeout, LastModified) from the actual function name
- AWS/Lambda Duration/Errors/Throttles metrics for the real function
- log group /aws/lambda/<actual_name> for ERROR lines or trigger context
- correlate LastModified deploy time to the alarm window; recurring alarms that spike right after a deploy are not purely baseline
- determine scope: these are usually service-wide unless log payloads carry project_id/campaign_id; do not force a project scope when none exists
Pitfall: do not assume the alarm name prefix equals the Lambda function name. When the helper Lambda collector fails with ResourceNotFoundException, manually list Lambdas and match by base service name, then verify LastModified.
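A minimal sketch of that manual name resolution; the base service name is a placeholder:

```python
import os

import boto3

session = boto3.Session(region_name=os.environ.get("AWS_DEFAULT_REGION", "ap-northeast-2"))
lam = session.client("lambda")

# Strip priority tiers like -P2 from the alarm prefix and match loosely.
base = "scheduled-batch-delivery"  # placeholder derived from ScheduledBatchDelivery-P2-...

candidates = []
for page in lam.get_paginator("list_functions").paginate():
    for fn in page["Functions"]:
        name = fn["FunctionName"]
        if base.replace("-", "") in name.lower().replace("-", "").replace("_", ""):
            candidates.append((name, fn["LastModified"], fn["MemorySize"], fn["Timeout"]))

for name, last_modified, memory, timeout in candidates:
    # LastModified near the alarm window suggests a deploy-correlated regression.
    print(name, last_modified, f"{memory}MB", f"{timeout}s")
```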
Distinguishing real bugs from metric-filter noise: The ConsoleErrors namespace is a coarse log substring filter. For Lambda functions, always cross-check the AWS/Lambda Errors metric. If Errors > 0, the alarm reflects a real invocation failure (unhandled exception, timeout, OOM). If Errors == 0 and Throttles == 0, the log line is likely benign text caught by the broad filter. See references/kds-consumer-event-timestamp-rangeerror.md for a concrete real-bug example where the RangeError in getValidEventTimestampInMilliseconds elevates both console ERROR logs and Lambda runtime Errors.
Percentile metric pitfall: the Statistics parameter of get-metric-statistics does not accept p99 or any percentile; its valid set is SampleCount | Average | Sum | Minimum | Maximum (percentiles go through --extended-statistics instead). For percentile alarms such as *-FCMLatencyP99, use Maximum as a conservative proxy, or switch to get-metric-data with Stat='p99' if the exact value is required. See references/scheduled-batch-delivery-fcm-latency.md for a concrete FCM latency triage recipe.
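A minimal sketch of the get-metric-data path for an exact p99; the namespace, metric name, and dimensions are placeholders to be taken from the alarm's metadata:

```python
import os
from datetime import datetime, timedelta, timezone

import boto3

session = boto3.Session(region_name=os.environ.get("AWS_DEFAULT_REGION", "ap-northeast-2"))
cloudwatch = session.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=3)

resp = cloudwatch.get_metric_data(
    StartTime=start,
    EndTime=end,
    MetricDataQueries=[
        {
            "Id": "p99",
            "MetricStat": {
                "Metric": {
                    "Namespace": "Notifly/ScheduledBatchDelivery",  # placeholder
                    "MetricName": "FCMLatency",  # placeholder: take from the alarm metadata
                    "Dimensions": [],
                },
                "Period": 300,
                "Stat": "p99",  # get_metric_data accepts percentile stats directly
            },
            "ReturnData": True,
        }
    ],
)
result = resp["MetricDataResults"][0]
for ts, value in zip(result["Timestamps"], result["Values"]):
    print(ts.isoformat(), round(value, 1))
```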
DynamoDB project mapping rule
Whenever you find a project_id, fetch from DynamoDB project table with a projection expression and report:
- id
- product_id
- name
- mapping status and failure reason when unavailable
Do not fetch full items because the table may contain sensitive sender credentials. A minimal projection-only lookup is sketched below.
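The sketch assumes the table's partition key is named id; verify against the real key schema before relying on it:

```python
import os

import boto3

session = boto3.Session(region_name=os.environ.get("AWS_DEFAULT_REGION", "ap-northeast-2"))
table = session.resource("dynamodb").Table("project")

project_id = "EXAMPLE_PROJECT_ID"  # placeholder
resp = table.get_item(
    Key={"id": project_id},  # assumption: partition key is "id"
    # Alias attribute names defensively ("name" is a DynamoDB reserved word).
    ProjectionExpression="#i, product_id, #n",
    ExpressionAttributeNames={"#i": "id", "#n": "name"},
)
item = resp.get("Item")
if item:
    print(item.get("id"), item.get("product_id"), item.get("name"))
else:
    print("mapping failure: project item not found for", project_id)
```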
Sentry alert scoping: When the alert originates from the web-console/sentry log group, the Sentry JSON payload may contain a Notifly product slug in request.url or tags.url (e.g. productId=hybiome). Use the project table GSI product_id-project_id-index to map this slug to a Notifly project.id. See references/sentry-email-alert-pipeline-false-positives.md for the exact query and duplicate-item pitfall.
Non-existent project edge case (api-service)
If the project_id is missing from the project table and no related event_list_<project_id> or sharded DB tables (e.g., campaign_statistics_<project_id>, users_<project_id>) exist, the project is effectively non-existent. When an api-service 42P01 (relation does not exist) error is tied to such a project, the root cause is usually an external caller using an invalid or stale project_id. Check the structured access/error log for:
- ip / userAgent (e.g., curl/7.81.0 indicates a manual/scripted call)
- path / method of the request
- Request volume and recurrence pattern
Single or sporadic curl requests from an unrecognized IP suggest a misconfigured client test rather than a service regression. See references/api-service-invalid-project-tracing.md for the full trace recipe.
GitHub correlation rule
If the user asks:
- "When did it start?"
- "Which commit or PR was related?"
- "Which change surfaced it?"
then:
- first find earliest retained log time
- then inspect local git history in ~/workspace or /home/ubuntu/notifly-event
- if needed, use GitHub API with GITHUB_TOKEN
- separate:
- first observed time
- direct enabling change
- later change that amplified / surfaced the issue
Postgres / DynamoDB / Athena step
Use read-only Postgres/DynamoDB/Athena when AWS logs/metrics identify a project, campaign, user journey, event family, or log table to verify.
Examples:
- confirm campaign / user journey / schedule relationship
- inspect schema/index shape for a known table family
- verify shard table existence for a discovered project_id (see the sketch after this list)
- compare recent 7d/30d event or error counts
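A minimal read-only sketch of the shard-existence check; the table family and project_id are placeholders:

```python
import os

import psycopg2

conn = psycopg2.connect(
    host=os.environ["POSTGRES_HOST"],
    port=os.environ.get("POSTGRES_PORT", "5432"),
    dbname=os.environ["POSTGRES_DB"],
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASSWORD"],
)
conn.set_session(readonly=True)  # enforce read-only at the session level

project_id = "EXAMPLE_PROJECT_ID"  # placeholder
with conn, conn.cursor() as cur:
    # to_regclass returns NULL when the sharded table does not exist.
    cur.execute("SELECT to_regclass(%s)", (f"users_{project_id}",))
    print("shard exists:", cur.fetchone()[0] is not None)
conn.close()
```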
Pitfall — sharded campaign lookup: Notifly campaigns are stored in 1,400+ sharded Postgres tables (campaigns_<project_id_hash>). Scanning all tables to map a campaign ID to its owning project is impractical and may hit command-length limits. Prefer:
- DynamoDB event_list_* tables for recent campaign-project relationships.
- Athena notifly_analytics.notifly_campaign_events for historical mapping (see the Athena sketch after this list).
- Accepting "project unknown for campaign" in the final answer when neither source is available.
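A minimal read-only sketch of the Athena mapping path; the column names and results OutputLocation are assumptions to verify against the actual notifly_analytics schema and workgroup configuration:

```python
import os
import time

import boto3

session = boto3.Session(region_name=os.environ.get("AWS_DEFAULT_REGION", "ap-northeast-2"))
athena = session.client("athena")

campaign_id = "EXAMPLE_CAMPAIGN_ID"  # placeholder
# Assumption: campaign_id/project_id columns exist in notifly_campaign_events;
# confirm via a Glue/Athena schema check first.
query = (
    "SELECT project_id, count(*) AS events "
    "FROM notifly_analytics.notifly_campaign_events "
    f"WHERE campaign_id = '{campaign_id}' GROUP BY project_id ORDER BY events DESC LIMIT 5"
)
execution_id = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "notifly_analytics"},
    # Assumption: a results bucket/workgroup is already configured for this account.
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/notifly-alert-check/"},
)["QueryExecutionId"]

while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"][1:]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```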
Never mutate data.
Output shape
For interactive user requests, answer in this order:
- direct conclusion
- mandatory scope: project/product and exactly one of campaign or user journey, or explicit Korean unknown/service-wide/infra-wide wording
- DB instance + SQL fingerprint when DB-shaped, or exact evidence from AWS/logs/metrics otherwise
- exact evidence from AWS/logs/metrics
- tradeoff: real issue vs noisy alert
- concrete next action naming the implementation file/function, SQL/index/table family, or Terraform resource/path to change
For automated Slack subscription alerts, obey the automated Slack alert contract instead of this longer shape.
Practical note
The helper script is only the first pass.
For full incident work, continue with the appropriate datasource-specific steps above.