# aws-cloudwatch-debugging-via-boto3
Investigate AWS alarms and related CloudWatch Logs using Python+boto3 when AWS CLI is unavailable or credentials only exist in shell env.
| field | value |
| --- | --- |
| name | aws-cloudwatch-debugging-via-boto3 |
| description | Investigate AWS alarms and related CloudWatch Logs using Python+boto3 when AWS CLI is unavailable or credentials only exist in shell env. |
| version | 1.0.0 |
| author | Hermes Agent |
| license | MIT |
| metadata | {"hermes":{"tags":["aws","cloudwatch","boto3","debugging","alarms","logs"]}} |
Use this when a user asks to investigate an AWS alarm from the agent environment, especially if:
- the `aws` CLI is not installed

In this environment:

- the `aws` CLI may be missing
- `execute_code` may not inherit `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY`
- `terminal` does see shell env vars

So prefer Python via `terminal` and explicitly construct a boto3 `Session` from env vars.
Reference for noisy p95/p99/p99.9 alarm tuning: references/noisy-percentile-alarm-tuning.md.
Check environment first
- `aws --version`
- `env | grep '^AWS_' | sort`

Create an explicit boto3 session in terminal
```python
import boto3, os

session = boto3.Session(
    aws_access_key_id=os.environ.get('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.environ.get('AWS_SECRET_ACCESS_KEY'),
    region_name=os.environ.get('AWS_DEFAULT_REGION', 'ap-northeast-2'),
)
```
Confirm identity
- `sts.get_caller_identity()`

Find the alarm
- `cloudwatch.describe_alarms()` with keyword filtering on `AlarmName`
- capture `AlarmName`, `StateValue`, `StateReason`, `StateUpdatedTimestamp`
- if the alarm metric comes from a log metric filter, use `logs.describe_metric_filters()` to recover the source log group and `filterPattern`
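A minimal sketch of this lookup (the keyword is a placeholder; the metric-filter step only applies to metric-filter-backed alarms):

```bash
python - <<'PY'
import boto3, os

session = boto3.Session(
    aws_access_key_id=os.environ.get('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.environ.get('AWS_SECRET_ACCESS_KEY'),
    region_name=os.environ.get('AWS_DEFAULT_REGION', 'ap-northeast-2'),
)
cw = session.client('cloudwatch')
logs = session.client('logs')

keyword = 'console error'  # hypothetical keyword from the user's report
for page in cw.get_paginator('describe_alarms').paginate():
    for a in page['MetricAlarms']:
        if keyword not in a['AlarmName']:
            continue
        print(a['AlarmName'], a['StateValue'], a['StateUpdatedTimestamp'])
        print('  reason:', a['StateReason'])
        # If the metric is produced by a log metric filter, recover the
        # source log group and filterPattern.
        if a.get('MetricName') and a.get('Namespace'):
            mf = logs.describe_metric_filters(
                metricName=a['MetricName'], metricNamespace=a['Namespace'],
            )
            for f in mf['metricFilters']:
                print('  log group:', f['logGroupName'])
                print('  pattern  :', f.get('filterPattern'))
PY
```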
Pull alarm history

- `describe_alarm_history(HistoryItemType='StateUpdate')`
- extract the OK -> ALARM transition timestamp and the datapoint value

Correlate CloudWatch Logs
- `filter_log_events` for tight windows
- `start_query` + `get_query_results` for broader error scans

Validate with metric datapoints
- `get_metric_statistics()` or `get_metric_data()` on the alarm metric
- `describe_alarm_history()` for month-scale counts, start-time clustering, and ALARM-state durations
- `get_metric_data()` for detailed minute-level reconstruction, only where the datapoints are still retained

When the user asks "when did this start?", find the first seen log and correlate it to git/PR history:
- `sort @timestamp asc | limit N` on the exact error string to find the earliest retained occurrence
- given an `@ptr` log record pointer, call `logs.get_log_record()` to recover the exact event metadata; this is often more reliable than trying to re-fetch the same event with a narrow `get_log_events` window (see the sketch below)
- `git log --follow -- <path>` for the logging/config file
- `git log --since <t0> --until <t1> -- <service-or-package>` for nearby service changes
- if `gh` is unavailable, use the GitHub REST API directly with `GITHUB_TOKEN` from `~/.hermes/.env` to inspect PR titles, merge times, bodies, and changed files
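Once a query returns an `@ptr`, a minimal sketch of the record recovery (the pointer value is a placeholder to paste in):

```bash
python - <<'PY'
import boto3, os, json

session = boto3.Session(
    aws_access_key_id=os.environ.get('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.environ.get('AWS_SECRET_ACCESS_KEY'),
    region_name=os.environ.get('AWS_DEFAULT_REGION', 'ap-northeast-2'),
)
logs = session.client('logs')

# Placeholder: paste an @ptr value returned by a Logs Insights query that
# selected `fields @ptr` and sorted `@timestamp asc | limit 1`.
ptr = 'CmEKJ...'
rec = logs.get_log_record(logRecordPointer=ptr)['logRecord']
print(json.dumps(rec, indent=2, default=str))
PY
```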
When the user says "it increased recently", validate that statistically before assuming a new regression:

- `stats count(*) by bin(1d)` to see the daily trend of the matched pattern
- compare ElastiCache command metrics: `GetTypeCmds`, `SetTypeCmds`, `NewConnections`, `CurrConnections`
- a step change in `GetTypeCmds` while `SetTypeCmds` stays roughly flat points at read-path behavior rather than write volume
- correlate with code history (`git log -- <cache file> <service path> <redis wrapper>`), and also verify whether an ECS deploy happened even if code did not change
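A sketch of the Get-vs-Set comparison at daily resolution, assuming a hypothetical `CacheClusterId`:

```bash
python - <<'PY'
import boto3, os, datetime

session = boto3.Session(
    aws_access_key_id=os.environ.get('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.environ.get('AWS_SECRET_ACCESS_KEY'),
    region_name=os.environ.get('AWS_DEFAULT_REGION', 'ap-northeast-2'),
)
cw = session.client('cloudwatch')

now = datetime.datetime.now(datetime.timezone.utc)
dims = [{'Name': 'CacheClusterId', 'Value': 'my-redis-001'}]  # hypothetical id
resp = cw.get_metric_data(
    MetricDataQueries=[
        {'Id': m.lower(), 'MetricStat': {
            'Metric': {'Namespace': 'AWS/ElastiCache', 'MetricName': m, 'Dimensions': dims},
            'Period': 86400, 'Stat': 'Sum'}}
        for m in ('GetTypeCmds', 'SetTypeCmds')
    ],
    StartTime=now - datetime.timedelta(days=28),
    EndTime=now,
)
for r in resp['MetricDataResults']:
    print(r['Id'])
    for t, v in zip(r['Timestamps'], r['Values']):
        print(' ', t.date(), int(v))
PY
```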
For RDS CPU alarms, correlate with Performance Insights

- `rds.describe_db_clusters()` / `describe_db_instances()` to identify writer vs readers and collect `DbiResourceId`
- Performance Insights (`pi.get_resource_metrics`, `pi.describe_dimension_keys`) on the offending instance
- start from `db.load.avg` and `os.cpuUtilization.total.avg`

Validate with metric datapoints
- `get_metric_statistics()` / `get_metric_data()` on the alarm metric

For RDS/Aurora instance alarms, pivot to the exact instance and query/load shape
- call `describe_db_clusters()` and `describe_db_instances()` first
- if `PerformanceInsightsEnabled=true`, use the PI client (`session.client('pi')`) against the instance `DbiResourceId`
- compare `db.load.avg` and `os.cpuUtilization.total.avg`, and break `db.sampledload.avg` down grouped by `db.wait_event_type`
- `describe_dimension_keys(... GroupBy={'Group':'db.sql_tokenized'})` for the top SQL families, then `GroupBy={'Group':'db.sql'}` to get concrete sampled statements + `db.sql.tokenized_id` (see the sketch after this list)
- cross-check `ReadIOPS`, `WriteIOPS`, `DatabaseConnections`, `ReadLatency`, `WriteLatency`, `SwapUsage`, and `DiskQueueDepth` to distinguish:
- a write-heavy workload: rising `WriteIOPS` + PI wait events dominated by IO (`IO:XactSync` / `IO:DataFileRead`), with the top `db.sql_tokenized` entries being INSERT/UPSERT
- memory pressure without user impact: `FreeableMemory` low but `SwapUsage` roughly flat, CPU moderate, latencies still normal, error rate low
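A sketch of the PI pivot, assuming a hypothetical `DbiResourceId` taken from `describe_db_instances()`:

```bash
python - <<'PY'
import boto3, os, datetime

session = boto3.Session(
    aws_access_key_id=os.environ.get('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.environ.get('AWS_SECRET_ACCESS_KEY'),
    region_name=os.environ.get('AWS_DEFAULT_REGION', 'ap-northeast-2'),
)
pi = session.client('pi')

resource_id = 'db-ABCDEFGHIJKL'  # hypothetical DbiResourceId
now = datetime.datetime.now(datetime.timezone.utc)
start, end = now - datetime.timedelta(hours=3), now

# Overall DB load vs OS CPU.
metrics = pi.get_resource_metrics(
    ServiceType='RDS', Identifier=resource_id,
    StartTime=start, EndTime=end, PeriodInSeconds=300,
    MetricQueries=[
        {'Metric': 'db.load.avg'},
        {'Metric': 'os.cpuUtilization.total.avg'},
    ],
)
for m in metrics['MetricList']:
    pts = [p['Value'] for p in m['DataPoints'] if p.get('Value') is not None]
    if pts:
        print(m['Key']['Metric'], 'max =', round(max(pts), 2))

# Which SQL families carry the load.
keys = pi.describe_dimension_keys(
    ServiceType='RDS', Identifier=resource_id,
    StartTime=start, EndTime=end,
    Metric='db.load.avg',
    GroupBy={'Group': 'db.sql_tokenized'},
)
for k in keys['Keys'][:10]:
    print(round(k['Total'], 2), k['Dimensions'].get('db.sql_tokenized.statement', '')[:100])
PY
```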
For FreeableMemory alarms specifically:

- plot `os.memory.total.avg` alongside the alarm metric
- check whether `os.swap.in.avg` / `os.swap.out.avg` are actually ramping, not just whether swap is non-zero

For per-project hot spots on tables like `users_<projectId>`, `campaigns_<projectId>`, `user_journey_sessions_<projectId>`:
- extract `ProjectId` from the API service logs (`/aws/ecs/notifly-services-prod/api-service`) and aggregate by `ProjectId` + `NormalizedPath` in the alarm window
- hot paths often look like `/user-state/:projectId/:userId`
- when a `user_journey_id` / `campaign_id` appears in SQL (example: `where user_journey_id = 'UL1T00'`), search service logs for that ID
- `/aws/ecs/notifly-services-prod/segment-publisher` logs often map that ID to a campaign name and publish batches via messages like:
```
campaignId: UL1T00, 460068 recipients published. (batch index: 9)
Received event: {"<projectId>":{"user_journeys":[{"id":"UL1T00","name":"[만보기] 매일 적립 리마인드"...}
```

When the user says alerts feel noisier, inspect alarm config drift vs. real flapping
Call `describe_alarms()` and capture:
- `Period`
- `EvaluationPeriods`
- `DatapointsToAlarm`
- `Threshold`
- `TreatMissingData`
- `AlarmActions` / `OKActions`

Then read `describe_alarm_history()` for three history types:
- `ConfigurationUpdate` -> when the alarm was created/changed
- `StateUpdate` -> how often it flips OK <-> ALARM
- `Action` -> how often notifications were actually published

Flapping typically shows up as dense `StateUpdate` / `Action` entries on alarms configured with `Period=60, EvaluationPeriods=1, DatapointsToAlarm=1`. When tuning `datapoints_to_alarm` / `evaluation_periods`:

- 2/3 over 5-minute periods is a common balanced default for transient external-API jitter: it filters isolated spikes but still pages on ~10 minutes of bad latency within 15 minutes
- treat notification counts from `describe_alarm_history()` as the actual user-visible frequency
- replay candidate configs against `get_metric_data()`, e.g. threshold=3000 1/1, 3500 1/1, 4000 1/1, 3000 2/2, 3000 2/3, 3000 3/3 (see the sketch below)
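A sketch of the candidate-config replay; the datapoints are synthetic demo values, and the m-of-n sliding window is a simplification of CloudWatch's exact evaluation semantics:

```python
def alarm_transitions(values, threshold, m, n):
    """Count OK -> ALARM transitions under an m-of-n breach rule."""
    transitions, in_alarm = 0, False
    for i in range(len(values)):
        window = values[max(0, i - n + 1): i + 1]
        breaching = sum(v > threshold for v in window) >= m
        if breaching and not in_alarm:
            transitions += 1
        in_alarm = breaching
    return transitions

# Synthetic demo datapoints; in practice, fetch per-period values for the
# alarm metric with get_metric_data() at the candidate Period.
values = [2800, 3200, 2900, 4100, 4300, 2700, 3100, 3050, 2600]

for threshold, m, n in [(3000, 1, 1), (3500, 1, 1), (4000, 1, 1),
                        (3000, 2, 2), (3000, 2, 3), (3000, 3, 3)]:
    print(f'threshold={threshold} {m}/{n}:',
          alarm_transitions(values, threshold, m, n), 'transitions')
```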
Trace who changed metric filters and alarms (PutMetricFilter and PutMetricAlarm)

- filter-pattern drift is visible in the `filterPattern` history (for example `%ERROR|Exception%` -> `%ERROR|Error%` -> `ERROR`)
- CloudTrail `lookup_events` with `EventName=PutMetricFilter` / `PutMetricAlarm`, filtering the request payload for the exact filter/alarm names (see the sketch below)
- `sns.list_subscriptions_by_topic()` to enumerate current subscribers
- CloudTrail `Subscribe`, `Unsubscribe`, `SetSubscriptionAttributes`, `ConfirmSubscription` events to identify when extra endpoints were added
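A sketch of the CloudTrail audit (the keyword is a placeholder; `lookup_events` only reaches back 90 days):

```bash
python - <<'PY'
import boto3, os, json, datetime

session = boto3.Session(
    aws_access_key_id=os.environ.get('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.environ.get('AWS_SECRET_ACCESS_KEY'),
    region_name=os.environ.get('AWS_DEFAULT_REGION', 'ap-northeast-2'),
)
ct = session.client('cloudtrail')

now = datetime.datetime.now(datetime.timezone.utc)
for event_name in ('PutMetricAlarm', 'PutMetricFilter'):
    resp = ct.lookup_events(
        LookupAttributes=[{'AttributeKey': 'EventName', 'AttributeValue': event_name}],
        StartTime=now - datetime.timedelta(days=90),
        EndTime=now,
    )
    for e in resp['Events']:
        detail = json.loads(e['CloudTrailEvent'])
        params = detail.get('requestParameters') or {}
        name = params.get('alarmName') or params.get('filterName')
        # Filter client-side for the alarm/filter under investigation.
        if name and 'console error' in name:  # hypothetical keyword
            print(e['EventTime'], event_name,
                  detail.get('userIdentity', {}).get('arn'), name)
PY
```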
State the root cause precisely

- distinguish alarm-config noise from a genuine workload change
- do not stop at "alarm fired"; identify the app behavior or SQL/query family that created the metric

Example snippets:
```bash
python - <<'PY'
import boto3, os, json

session = boto3.Session(
    aws_access_key_id=os.environ.get('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.environ.get('AWS_SECRET_ACCESS_KEY'),
    region_name=os.environ.get('AWS_DEFAULT_REGION', 'ap-northeast-2'),
)
print(json.dumps(session.client('sts').get_caller_identity(), indent=2, default=str))
PY
```
```bash
python - <<'PY'
import boto3, os, json

session = boto3.Session(
    aws_access_key_id=os.environ.get('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.environ.get('AWS_SECRET_ACCESS_KEY'),
    region_name=os.environ.get('AWS_DEFAULT_REGION', 'ap-northeast-2'),
)
cw = session.client('cloudwatch')
resp = cw.describe_alarm_history(
    AlarmName='/aws/ecs/notifly-services-prod/web-console console error',
    HistoryItemType='StateUpdate',
    MaxRecords=10,
)
print(json.dumps(resp['AlarmHistoryItems'], indent=2, default=str))
PY
```
```bash
python - <<'PY'
import boto3, os, datetime

session = boto3.Session(
    aws_access_key_id=os.environ.get('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.environ.get('AWS_SECRET_ACCESS_KEY'),
    region_name=os.environ.get('AWS_DEFAULT_REGION', 'ap-northeast-2'),
)
cw = session.client('cloudwatch')
resp = cw.get_metric_statistics(
    Namespace='ConsoleErrors',
    MetricName='/aws/ecs/notifly-services-prod/web-console console error',
    StartTime=datetime.datetime(2026, 4, 21, 5, 20, tzinfo=datetime.timezone.utc),
    EndTime=datetime.datetime(2026, 4, 21, 7, 45, tzinfo=datetime.timezone.utc),
    Period=60,
    Statistics=['Sum', 'SampleCount'],
)
for p in sorted(resp['Datapoints'], key=lambda x: x['Timestamp']):
    if p.get('Sum', 0) > 0:
        print(p['Timestamp'].isoformat(), p['Sum'], p.get('SampleCount'))
PY
```
A broad error-scan Logs Insights query:

```python
query = r'''
fields @timestamp, @message, @logStream
| filter @message like /Error|TypeError|ReferenceError|Exception|Unhandled/
| sort @timestamp desc
| limit 100
'''
```
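To run it, a sketch of the async `start_query` / `get_query_results` loop (log group and window are placeholders):

```bash
python - <<'PY'
import boto3, os, time, datetime

session = boto3.Session(
    aws_access_key_id=os.environ.get('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.environ.get('AWS_SECRET_ACCESS_KEY'),
    region_name=os.environ.get('AWS_DEFAULT_REGION', 'ap-northeast-2'),
)
logs = session.client('logs')

query = r'''
fields @timestamp, @message, @logStream
| filter @message like /Error|TypeError|ReferenceError|Exception|Unhandled/
| sort @timestamp desc
| limit 100
'''
now = datetime.datetime.now(datetime.timezone.utc)
qid = logs.start_query(
    logGroupName='/aws/ecs/notifly-services-prod/web-console',  # placeholder
    startTime=int((now - datetime.timedelta(hours=3)).timestamp()),
    endTime=int(now.timestamp()),
    queryString=query,
)['queryId']

# Logs Insights queries are async: poll until the query settles.
while True:
    res = logs.get_query_results(queryId=qid)
    if res['status'] in ('Complete', 'Failed', 'Cancelled'):
        break
    time.sleep(2)

for row in res.get('results', []):
    fields = {f['field']: f['value'] for f in row}
    print(fields.get('@timestamp'), fields.get('@message', '')[:160])
PY
```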
This same boto3-via-terminal approach also works well for queue-consumer incidents such as DLQ backlogs and retry storms.
Recommended workflow:
Read live queue attributes first
- `sqs.get_queue_attributes(..., AttributeNames=['All'])`
- key attributes: `ApproximateNumberOfMessages`, `ApproximateNumberOfMessagesNotVisible`, `ApproximateNumberOfMessagesDelayed`, `VisibilityTimeout`, `RedrivePolicy`, `RedriveAllowPolicy`
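A sketch, assuming a hypothetical queue name:

```bash
python - <<'PY'
import boto3, os

session = boto3.Session(
    aws_access_key_id=os.environ.get('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.environ.get('AWS_SECRET_ACCESS_KEY'),
    region_name=os.environ.get('AWS_DEFAULT_REGION', 'ap-northeast-2'),
)
sqs = session.client('sqs')

# Hypothetical queue name; get_queue_url resolves it to a URL.
url = sqs.get_queue_url(QueueName='cafe24-worker-dlq')['QueueUrl']
attrs = sqs.get_queue_attributes(QueueUrl=url, AttributeNames=['All'])['Attributes']
for k in ('ApproximateNumberOfMessages', 'ApproximateNumberOfMessagesNotVisible',
          'ApproximateNumberOfMessagesDelayed', 'VisibilityTimeout',
          'RedrivePolicy', 'RedriveAllowPolicy'):
    print(k, '=', attrs.get(k))
PY
```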
Pull CloudWatch SQS metrics over time

`cloudwatch.get_metric_data()` for:
- `ApproximateNumberOfMessagesVisible`
- `ApproximateNumberOfMessagesNotVisible`
- `ApproximateAgeOfOldestMessage`
- `NumberOfMessagesReceived`
- `NumberOfMessagesDeleted`
- `NumberOfMessagesSent`

Interpretation:

- `Received >> Deleted` on the main queue -> repeated retries / poison messages likely
- `ApproximateAgeOfOldestMessage` rising on the DLQ while the count stays flat -> stale unprocessed DLQ backlog

Sample a few DLQ messages carefully
- `sqs.receive_message()` with a tiny `VisibilityTimeout` (for example 1) and `MaxNumberOfMessages=10`
- record `MessageId`, `ApproximateReceiveCount`, `SentTimestamp` for each sampled message
- if `ApproximateReceiveCount = maxReceiveCount + 1`, that confirms the messages exhausted retries before landing in the DLQ
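A sketch of non-destructive sampling (queue name is hypothetical):

```bash
python - <<'PY'
import boto3, os, datetime

session = boto3.Session(
    aws_access_key_id=os.environ.get('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.environ.get('AWS_SECRET_ACCESS_KEY'),
    region_name=os.environ.get('AWS_DEFAULT_REGION', 'ap-northeast-2'),
)
sqs = session.client('sqs')

url = sqs.get_queue_url(QueueName='cafe24-worker-dlq')['QueueUrl']  # hypothetical
# Tiny VisibilityTimeout so sampled messages reappear almost immediately
# and the sampling stays non-destructive.
resp = sqs.receive_message(
    QueueUrl=url,
    MaxNumberOfMessages=10,
    VisibilityTimeout=1,
    AttributeNames=['ApproximateReceiveCount', 'SentTimestamp'],
)
for m in resp.get('Messages', []):
    sent = datetime.datetime.fromtimestamp(
        int(m['Attributes']['SentTimestamp']) / 1000, tz=datetime.timezone.utc)
    print(m['MessageId'],
          'receives =', m['Attributes']['ApproximateReceiveCount'],
          'sent =', sent.isoformat())
PY
```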
Correlate with Lambda logs

Query `/aws/lambda/<function-name>` with Logs Insights for phrases such as:
- "will retry via SQS"
- "rate-limited"
- "failed, will retry via SQS"
- use `parse ... | stats count() by ...` to aggregate instead of dumping raw events
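For example, a parse-and-aggregate query; the message shape mirrors the cafe24-worker pattern below and is otherwise an assumption:

```
fields @timestamp, @message
| filter @message like /rate-limited/
| parse @message "rate-limited for *, will retry via SQS" as mall
| stats count() as hits by mall
| sort hits desc
```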
State the result precisely

Practical example pattern observed in Notifly cafe24-worker:
- log lines like `rate-limited for <mall>, will retry via SQS`
- DLQ messages with `ApproximateReceiveCount = 7` while the queue `maxReceiveCount = 6`

A strong alarm investigation answer should include the trigger timeline, the validated metric evidence, and the precise root cause.
When the user asks for alarms that adapt automatically to instance churn or failover, prefer these rules of thumb:
Do not propose SEARCH expressions for alarms
- `SEARCH()` is useful for graphs but cannot back an alarm because it returns multiple time series.

Use Metrics Insights alarms for dynamic fleets
For per-resource identification, use multi-time-series alarms + contributors
- `DescribeAlarmContributors.ContributorAttributes` can identify the specific breaching resource (for example `DBInstanceIdentifier`).
- Wire notifications through `DescribeAlarmContributors` -> enriched Slack/page message (see the sketch below).
- Avoid `LIMIT` in fleet/cluster coverage queries unless you intentionally want to monitor only the top N series. Metrics Insights can return up to 500 time series and Aurora can have more than 10 instances, so `LIMIT 10` can silently drop cluster members and contributors.
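A sketch of a Metrics Insights alarm that tracks the worst instance in a churning fleet (alarm name and threshold are placeholders; the `GROUP BY` multi-time-series variant additionally requires `DescribeAlarmContributors` support):

```bash
python - <<'PY'
import boto3, os

session = boto3.Session(
    aws_access_key_id=os.environ.get('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.environ.get('AWS_SECRET_ACCESS_KEY'),
    region_name=os.environ.get('AWS_DEFAULT_REGION', 'ap-northeast-2'),
)
cw = session.client('cloudwatch')

# Single-series query: alarms on the worst instance in the fleet, and keeps
# working as instances churn. Add GROUP BY DBInstanceIdentifier only if the
# account supports multi-time-series alarms + DescribeAlarmContributors.
cw.put_metric_alarm(
    AlarmName='rds-fleet-cpu-high',  # hypothetical name
    ComparisonOperator='GreaterThanThreshold',
    Threshold=80.0,
    EvaluationPeriods=3,
    DatapointsToAlarm=2,
    TreatMissingData='missing',
    Metrics=[{
        'Id': 'q1',
        'Expression': 'SELECT MAX(CPUUtilization) FROM SCHEMA("AWS/RDS", DBInstanceIdentifier)',
        'Period': 300,
        'ReturnData': True,
    }],
)
PY
```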
For Aurora/RDS, verify actual metric dimensions before proposing a query

- Use `list_metrics` or `get_metric_data` to confirm which labels exist on the target metric (see the sketch below).
- `AWS/RDS` instance CPU metrics may expose `DBInstanceIdentifier` without `DBClusterIdentifier`, so a query cannot always dynamically filter to one cluster unless you add tags.
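A sketch of the dimension check:

```bash
python - <<'PY'
import boto3, os

session = boto3.Session(
    aws_access_key_id=os.environ.get('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.environ.get('AWS_SECRET_ACCESS_KEY'),
    region_name=os.environ.get('AWS_DEFAULT_REGION', 'ap-northeast-2'),
)
cw = session.client('cloudwatch')

# Enumerate the dimension sets actually present on the metric before
# writing any Metrics Insights query against it.
seen = set()
for page in cw.get_paginator('list_metrics').paginate(
        Namespace='AWS/RDS', MetricName='CPUUtilization'):
    for m in page['Metrics']:
        seen.add(tuple(sorted(d['Name'] for d in m['Dimensions'])))
for dims in sorted(seen):
    print(dims)
PY
```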
Tagging is often the cleanest solution

- Apply tags such as `Service=notifly`, `Env=prod`, or `DbCluster=notifly-db-prod-cluster` to every DB instance.
- Then scope the query with `WHERE tag.<key> = 'value' GROUP BY DBInstanceIdentifier`.

If tags are unavailable, recommend automation instead of clever math
Operational notes:

- Avoid `execute_code` for AWS API calls if credentials are only present in shell env; use `terminal` instead.
- Log payloads can contain user data (for example the `Received event:` messages). Prefer aggregated/parsing queries over dumping raw payloads, and redact any sensitive fields before reporting.

For web-console console-error alarms, a single OK -> ALARM transition was traced to a repeated application error like:
- `Error: The campaign was updated by another user: <id>`
- `PUT .../campaigns` returning 500

This indicates a likely optimistic-locking or stale-write conflict being surfaced as a server error. Recommended fix:
- return `409 Conflict` instead of 500