china-incident-rca

Name: China Incident Rca
Author: aws-samples

// Root cause analysis for a triaged incident in either China account (aws-cn or aws-cn-2). Use this skill after triage has produced a Triage Card, or when the user directly asks RCA, 根本原因, 根因, 为什么挂, why did X fail, deep dive, deep investigation, 深入分析, dig into, 调查. Correlates the CloudTrail API log window around the incident, recent deploy events (CloudFormation stack events, CodeDeploy, ECR pushes, Lambda updates), metric anomalies against prior-week baseline, and cross-account blast radius — specifically, whether the same failure pattern also hit the other China account around the same time, which would suggest a shared upstream cause (IAM partition-wide, AWS region event, or common dependency). Produces a single root-cause hypothesis plus the evidence chain. Does NOT execute remediation.

stars:1

forks:0

updated: 2026-05-28 03:56

SKILL.md

China Incident RCA

Routing is governed by china-region-multi-account-routing. Triage hand-off is governed by china-incident-triage. This skill picks up from a Triage Card and produces a single root-cause hypothesis with evidence.

Intended agent type

Upload with Agent Type Incident RCA selected.

When to use

A Triage Card exists and severity is SEV-3 or higher
User asks RCA-style questions: "根本原因" (root cause), "why did X fail", "深入分析" (in-depth analysis)
A recurring alarm needs explanation even if low-severity

Do not use this skill for:

Ambiguous or untriaged incidents (run triage first)
"How do I fix this?" questions (use mitigation skill)
Capacity or cost analysis (different skills)

Investigation framework — the 4 axes

Check these four axes in parallel. Most root causes surface on exactly one axis; the others serve as confirmation or elimination.

Axis 1 — CloudTrail API log

For the 30-minute window around incident_start_time (15 min before, 15 min after), in the affected account's region:

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=ResourceName,AttributeValue=<resource> \
  --start-time <incident_start - 15min> \
  --end-time <incident_start + 15min>

Look for:

Who (principal) made changes?
What was changed (Modify*, Delete*, Update*, Put*)?
Was it human (IAM user) or automated (role from CI/pipeline)?
Did the change occur within 5 minutes of the alarm start time?
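The window and the "within 5 minutes" check above can be sketched in a few lines. This is a minimal illustration, not part of the skill: the date is hypothetical, the event dicts are assumed to carry the `EventName`, `EventTime`, and `Username` fields that `lookup-events` returns, and the helper names `lookup_window` and `suspicious_events` are invented for this example.

```python
from datetime import datetime, timedelta, timezone

# Prefixes taken from the "What was changed" question above.
MUTATING_PREFIXES = ("Modify", "Delete", "Update", "Put")


def lookup_window(incident_start, half_width_min=15):
    """Return (start, end) of the 30-minute lookup window."""
    delta = timedelta(minutes=half_width_min)
    return incident_start - delta, incident_start + delta


def suspicious_events(events, incident_start, max_gap_min=5):
    """Keep mutating events within max_gap_min of incident_start, oldest first."""
    max_gap = timedelta(minutes=max_gap_min)
    hits = [
        ev for ev in events
        if ev["EventName"].startswith(MUTATING_PREFIXES)
        and abs(ev["EventTime"] - incident_start) <= max_gap
    ]
    return sorted(hits, key=lambda ev: ev["EventTime"])


# Hypothetical date; the 14:22 and 14:19 times mirror the first example in this skill.
incident_start = datetime(2026, 5, 28, 14, 22, tzinfo=timezone.utc)
events = [
    {"EventName": "ModifyTargetGroupAttributes", "Username": "terraform-ci",
     "EventTime": datetime(2026, 5, 28, 14, 19, tzinfo=timezone.utc)},
    {"EventName": "DescribeTargetHealth", "Username": "monitor",
     "EventTime": datetime(2026, 5, 28, 14, 21, tzinfo=timezone.utc)},
]

start, end = lookup_window(incident_start)
print("--start-time", start.isoformat(), "--end-time", end.isoformat())
for ev in suspicious_events(events, incident_start):
    print(ev["EventTime"].strftime("%H:%M"), ev["EventName"], ev["Username"])
```

Read-only calls such as `DescribeTargetHealth` are dropped by the prefix filter, so only the change event survives.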

Axis 2 — Deploy correlation

Query in parallel:

aws cloudformation describe-stack-events --stack-name <stack> (if resource is CFN-managed)
aws codedeploy list-deployments --application-name <app> --deployment-group-name <group> (if applicable; the deployment group filter requires the application name)
aws ecr describe-images --repository-name <repo> (image push times)
aws lambda list-versions-by-function --function-name <fn> (Lambda updates)
Git push events from the pipeline (if GitFarm/GitHub integration active)

A deploy within ±30 minutes of incident start is a top suspect.
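As a rough sketch of the ±30 minute rule, the deploy timestamps gathered from the calls above can be merged into one list and filtered by distance from incident start. The record shape (`source`, `target`, `time`), the sample targets other than `prod-alb`, and the function name `deploy_suspects` are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def deploy_suspects(deploys, incident_start, window_min=30):
    """Return deploys within +/- window_min of incident_start, closest first."""
    window = timedelta(minutes=window_min)
    near = [d for d in deploys if abs(d["time"] - incident_start) <= window]
    return sorted(near, key=lambda d: abs(d["time"] - incident_start))


# Hypothetical sample data; only prod-alb at 14:19 comes from this skill's example.
incident_start = datetime(2026, 5, 28, 14, 22)
deploys = [
    {"source": "cloudformation", "target": "prod-alb", "time": datetime(2026, 5, 28, 14, 19)},
    {"source": "ecr", "target": "api-image", "time": datetime(2026, 5, 28, 9, 5)},
    {"source": "lambda", "target": "resize-fn", "time": datetime(2026, 5, 28, 14, 45)},
]

for d in deploy_suspects(deploys, incident_start):
    print(d["source"], d["target"], d["time"].strftime("%H:%M"))
```

Time proximity only nominates a suspect. Whether the deploy touched the failing resource still has to be verified separately.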

Axis 3 — Metric anomaly vs. baseline

Pull two GetMetricData windows for the affected resource:

Current: incident_start - 1h to incident_start + 30min
Baseline: same 90-minute window, 7 days earlier
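The two windows follow directly from the definitions above. A minimal sketch, assuming a hypothetical helper name `metric_windows` and a made-up incident date:

```python
from datetime import datetime, timedelta


def metric_windows(incident_start):
    """Return ((cur_start, cur_end), (base_start, base_end)) for GetMetricData."""
    current = (incident_start - timedelta(hours=1),
               incident_start + timedelta(minutes=30))
    # Baseline is the same 90-minute window shifted back exactly 7 days.
    baseline = tuple(t - timedelta(days=7) for t in current)
    return current, baseline


current, baseline = metric_windows(datetime(2026, 5, 28, 14, 22))
print("current: ", current[0].isoformat(), "to", current[1].isoformat())
print("baseline:", baseline[0].isoformat(), "to", baseline[1].isoformat())
```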

Compute the delta. A metric that:

Stepped up/down sharply at incident_start → direct evidence
Was already anomalous before incident_start → earlier upstream cause
Is unchanged → not the cause, eliminate
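One possible way to turn the three cases above into a decision is to compare window averages before and after incident_start against the baseline. The 20% relative tolerance, the use of simple means, and the name `classify_metric` are all assumptions of this sketch. The skill does not prescribe a threshold.

```python
from statistics import mean


def classify_metric(cur_before, cur_after, base_before, base_after, tolerance=0.2):
    """Classify a metric as 'earlier-upstream', 'direct', or 'eliminate'.

    Each argument is a list of datapoints: current and baseline values,
    split at incident_start. tolerance is an assumed relative threshold.
    """
    def deviates(cur, base):
        base_avg = mean(base)
        scale = max(abs(base_avg), 1e-9)  # avoid dividing by zero for idle metrics
        return abs(mean(cur) - base_avg) / scale > tolerance

    if deviates(cur_before, base_before):
        return "earlier-upstream"  # already anomalous before incident_start
    if deviates(cur_after, base_after):
        return "direct"            # stepped up or down at incident_start
    return "eliminate"             # unchanged, not the cause


# HealthyHostCount shaped like the first example in this skill: 4 hosts, then 0, then 2.
print(classify_metric([4, 4, 4], [0, 0, 2], [4, 4, 4], [4, 4, 4]))
```

Means hide short spikes, so a real check would probably look at the raw datapoints near incident_start as well.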

Relevant metrics by incident class:

Network: ALB HTTPCode_Target_5XX_Count, TargetResponseTime, TargetGroup HealthyHostCount
Compute: EC2 StatusCheckFailed, ASG GroupInServiceInstances, Lambda Throttles/Errors
Identity: CloudTrail AccessDenied rate, STS AssumeRole failures
Data: RDS DatabaseConnections, CPUUtilization, ReplicaLag

Axis 4 — Cross-account blast radius

Unique to this dual-account setup. Run the same alarm/metric check on the other China account for the same time window. Three outcomes:

| Other account status | Interpretation |
| --- | --- |
| Also affected | Likely shared upstream: AWS region event, shared service dependency, or identical misconfiguration deployed to both |
| Not affected | Account-scoped cause: credentials, account-specific deploy, account-specific resource |
| Unknown (no comparable signal) | Inconclusive: continue with account-scoped investigation |

This axis is the single biggest value-add vs. single-account RCA. Always run it.
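The three outcomes can be encoded as a small lookup so the interpretation is applied consistently. This is only a sketch: the enum, the `other_account` helper, and the `interpret` function are hypothetical names, while the account names and interpretations are taken from this skill.

```python
from enum import Enum

ACCOUNTS = ("aws-cn", "aws-cn-2")


class OtherAccountStatus(Enum):
    AFFECTED = "also affected"
    NOT_AFFECTED = "not affected"
    UNKNOWN = "unknown"


INTERPRETATION = {
    OtherAccountStatus.AFFECTED: "Likely shared upstream cause",
    OtherAccountStatus.NOT_AFFECTED: "Account-scoped cause",
    OtherAccountStatus.UNKNOWN: "Inconclusive, continue account-scoped investigation",
}


def other_account(affected):
    """Return the China account that was NOT named on the Triage Card."""
    if affected not in ACCOUNTS:
        raise ValueError(f"unknown account: {affected!r}")
    return ACCOUNTS[1] if affected == ACCOUNTS[0] else ACCOUNTS[0]


def interpret(status):
    """Map the other account's status to the interpretation in the table."""
    return INTERPRETATION[status]


print(other_account("aws-cn"), "->", interpret(OtherAccountStatus.NOT_AFFECTED))
```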

Procedure

Step 1 — Consume the Triage Card

Parse account, class, severity, resource, first-seen. If no card exists, request one (or run triage first).
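A parsing sketch follows, assuming the card arrives in the slash-separated form used in the Examples section (`aws-cn / Network / SEV-2 / prod-alb / 14:22`). The real Triage Card format is defined by china-incident-triage and may differ. `TriageCard` and `parse_triage_card` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriageCard:
    account: str
    incident_class: str
    severity: str
    resource: str
    first_seen: str


def parse_triage_card(text):
    """Parse 'account / class / severity / resource / first-seen'."""
    # Separator is " / " with spaces, so times like 14:22 are unaffected.
    parts = [p.strip() for p in text.split(" / ")]
    if len(parts) != 5 or not all(parts):
        raise ValueError("expected: account / class / severity / resource / first-seen")
    card = TriageCard(*parts)
    if card.account not in ("aws-cn", "aws-cn-2"):
        raise ValueError(f"unknown account: {card.account!r}")
    return card


card = parse_triage_card("aws-cn / Network / SEV-2 / prod-alb / 14:22")
print(card)
```

A malformed card raises instead of being guessed at, which matches the instruction to request a card or run triage first.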

Step 2 — Launch 4-axis parallel investigation

Fire all 4 axes concurrently. Do not serialize — Axis 2 deploy correlation alone may involve 3–5 API calls.
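Concurrent launch could look like the following, using a thread pool from the standard library. The four axis functions are placeholders that return fixed strings. In a real run each would wrap the queries described for its axis. All function names here are invented for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor


# Placeholder axis functions; each would issue the read-only queries for its axis.
def axis_cloudtrail(card):
    return "axis 1 placeholder"


def axis_deploys(card):
    return "axis 2 placeholder"


def axis_metrics(card):
    return "axis 3 placeholder"


def axis_blast_radius(card):
    return "axis 4 placeholder"


AXES = {
    "cloudtrail": axis_cloudtrail,
    "deploys": axis_deploys,
    "metrics": axis_metrics,
    "blast_radius": axis_blast_radius,
}


def run_axes(card, axes=AXES):
    """Run every axis concurrently; a failing axis is recorded, not fatal."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(axes)) as pool:
        futures = {name: pool.submit(fn, card) for name, fn in axes.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                results[name] = f"failed: {exc}"
    return results


print(run_axes("aws-cn / Network / SEV-2 / prod-alb / 14:22"))
```

Capturing failures per axis keeps one throttled or denied API call from discarding the evidence the other three axes produced.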

Step 3 — Synthesize

Rank the evidence by directness:

Direct — a CloudTrail event or deploy that modified the exact failing resource in the right time window
Strong circumstantial — metric step change at incident_start with no deploy correlation
Weak — metric drift over hours/days, multiple possible causes

Pick the single most likely root cause. If two candidates are roughly equal, say so explicitly — don't pretend certainty.
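The ranking and the "say so if two candidates are roughly equal" rule can be sketched as below. Treating candidates with the same evidence rank as "roughly equal" is a simplifying assumption, and `pick_root_cause` is a hypothetical helper.

```python
RANK = {"Direct": 3, "Strong": 2, "Weak": 1}


def pick_root_cause(candidates):
    """candidates: list of (hypothesis, confidence).

    Returns (best, tied), where tied lists other candidates sharing the
    top rank and should be reported explicitly instead of hidden.
    """
    if not candidates:
        raise ValueError("no candidates to rank")
    ordered = sorted(candidates, key=lambda c: RANK[c[1]], reverse=True)
    top = RANK[ordered[0][1]]
    leaders = [c for c in ordered if RANK[c[1]] == top]
    return leaders[0], leaders[1:]


best, tied = pick_root_cause([
    ("Slow metric drift over several days", "Weak"),
    ("Deploy at 14:19 modified the target group", "Direct"),
])
print("root cause:", best)
print("tied candidates:", tied)
```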

Step 4 — Produce the RCA Report

## Root Cause (hypothesis)
**<One sentence root cause>**
Confidence: <Direct / Strong / Weak>

## Evidence
1. <Axis 1 finding with timestamps and ARNs>
2. <Axis 2 finding>
3. <Axis 3 finding>
4. <Axis 4 finding — blast radius>

## Eliminated hypotheses
- <Thing you considered and ruled out, with reason>

## Blast radius
- <Account-scoped / Region-scoped / Partition-scoped>
- <Other account status>

## Recommended next step
→ Hand off to china-incident-mitigation with this RCA as input.
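To keep reports uniform, the template above can be filled programmatically. A minimal sketch, assuming the findings are already plain strings. The function name and parameters are invented for illustration.

```python
def render_rca_report(root_cause, confidence, evidence, eliminated,
                      blast_radius, other_account_status):
    """Fill the RCA report template with plain-string findings."""
    if confidence not in ("Direct", "Strong", "Weak"):
        raise ValueError(f"unknown confidence: {confidence!r}")
    lines = [
        "## Root Cause (hypothesis)",
        f"**{root_cause}**",
        f"Confidence: {confidence}",
        "",
        "## Evidence",
    ]
    lines += [f"{i}. {item}" for i, item in enumerate(evidence, start=1)]
    lines += ["", "## Eliminated hypotheses"]
    lines += [f"- {item}" for item in eliminated]
    lines += [
        "",
        "## Blast radius",
        f"- {blast_radius}",
        f"- {other_account_status}",
        "",
        "## Recommended next step",
        "→ Hand off to china-incident-mitigation with this RCA as input.",
    ]
    return "\n".join(lines)


print(render_rca_report(
    "Deploy at 14:19 changed target group attributes",
    "Direct",
    ["ModifyTargetGroupAttributes at 14:19 by role terraform-ci",
     "HealthyHostCount dropped from 4 to 0 at 14:20"],
    ["Region event (aws-cn-2 unaffected)"],
    "Account-scoped",
    "aws-cn-2 prod-alb unaffected",
))
```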

Things not to do

Do not execute remediation actions — this skill is read-only. Mitigation is a separate skill.
Do not claim the cause without at least one piece of direct or strong evidence. "Probably X" with no evidence is a hypothesis, not a root cause.
Do not skip Axis 4 (cross-account blast radius). It is cheap to check and eliminates the "is this widespread or just us?" question that every SRE asks next.
Do not investigate beyond the 4 axes without stating why. Expanding scope during RCA dilutes the signal.
Do not blame a correlated deploy without verifying it actually touched the failing resource. Every deploy is temporally correlated with something.
Do not re-run triage inside RCA. Take the card as given.

Examples

Input: Triage card = aws-cn / Network / SEV-2 / prod-alb / 14:22

Action: Parallel run the 4 axes.

Findings:

Axis 1: CloudTrail shows ModifyTargetGroupAttributes at 14:19 by role terraform-ci — 3 min before alarm
Axis 2: Terraform pipeline deployed stack prod-alb at 14:19 (matches)
Axis 3: HealthyHostCount dropped from 4 to 0 at 14:20, recovered to 2 at 14:25
Axis 4: aws-cn-2 prod-alb unaffected

Hypothesis: "Terraform deploy at 14:19 changed target group deregistration_delay from 30s to 300s, causing healthy hosts to be marked draining and unavailable for ~5 min until connections drained." Confidence: Direct.

Blast radius: aws-cn only. aws-cn-2 unaffected → not a region event.

Input: "为什么今天两个中国区账号都 Lambda 调用 AuthFailure" ("Why are Lambda invocations in both China accounts failing with AuthFailure today?")

Action: The dual-account signal is itself Axis 4 evidence, so the investigation starts there. Skip directly to:

Axis 1: CloudTrail CreateAccessKey / UpdateAccessKey in both accounts
Axis 2: Did someone re-seed Secrets Manager in both within the same window?

Likely root cause: shared credential rotation event (e.g., operator re-ran aws secretsmanager update-secret for both /mcp/aws-cn and /mcp/aws-cn-2 with a stale/mis-typed AK/SK).

Blast radius: Partition-scoped — both China accounts affected simultaneously. Strong evidence of a shared-operator cause, not an AWS infra event.

related-skills.json

Same repository

china-account-prevention-checks.md

from "aws-samples/sample-skills-for-AWS-Devops-agent"

Proactive prevention and pre-alarm health checks across the two China region accounts (aws-cn and aws-cn-2). Use this skill when the user asks about prevention, 防护, 预防, proactive, 体检, health check, risk assessment, 潜在风险, 隐患, or "what might break soon", and when the Evaluation agent runs scheduled recommendation workflows. Looks for conditions that predict future incidents — single points of failure, service quotas nearing limits, stale AMIs, aging credentials, certificates expiring within 30 days, deprecated Lambda runtimes. This skill is distinct from cross-account-security-posture-check, which reports current-state security risk. Prevention predicts future failure; security posture describes current exposure.

2026-05-28

china-incident-mitigation.md

from "aws-samples/sample-skills-for-AWS-Devops-agent"

Draft step-by-step mitigation CLI commands for a root-caused incident in either China account (aws-cn or aws-cn-2). Use this skill after RCA has identified the root cause, when the user asks for mitigation, remediation, 缓解, 修复, 回滚, rollback, restore service, 怎么修, fix it, 怎么办. Covers common mitigation patterns such as credential rotation, Kubernetes pod rollout-restart, ALB target group reattach, security group rule revoke, IAM policy rollback, and safe CloudFormation stack rollback. Output always includes the exact CLI command, a one-line explanation of what it changes, a rollback/undo command, and an explicit human approval prompt. CRITICAL — this skill NEVER executes commands autonomously; every mitigation step requires explicit user approval before running.

2026-05-28

china-incident-triage.md

from "aws-samples/sample-skills-for-AWS-Devops-agent"

First-response triage for an incoming alarm, ticket, or failure report originating from either China account (aws-cn or aws-cn-2). Use this skill when the trigger is an alarm name, CloudWatch alarm payload, SIM ticket body, error log snippet, or a user phrase such as 告警, 出事了, 服务挂了, incident, triage, 分类, 初步判断, 看一下这个告警, what happened. Determines which of the two accounts is affected, classifies the incident into one of six classes (compute / network / identity-credentials / data / cost / unknown), estimates severity from the signal, and checks whether a similar incident fired recently so duplicates are marked. Output is a short triage card that hands off to RCA or mitigation depending on severity. This skill is the entry point of the incident response pipeline.

2026-05-28

china-region-multi-account-routing.md

from "aws-samples/sample-skills-for-AWS-Devops-agent"

Routing and disambiguation guidance for the two AWS China region MCP servers exposed by this Agent Space. Use this skill whenever the user's request mentions "中国区", "China", "cn-north-1", "cn-northwest-1", "Beijing", "Ningxia", or any AWS resource that must resolve to a specific China partition account. The skill explains which MCP endpoint maps to which account, how to pick when the user does not specify, and how to label cross-account results so the user can tell them apart.

2026-05-28

cn-partition-arn-routing.md

from "aws-samples/sample-skills-for-AWS-Devops-agent"

Diagnose and explain AWS partition ARN mismatches in China region accounts. Use this skill whenever an investigation in `cn-north-1` or `cn-northwest-1` involves ARNs starting with `arn:aws:` instead of `arn:aws-cn:`, or whenever symptoms include AccessDenied, AuthFailure, MalformedPolicyDocument, NoSuchEntity, or "principal cannot be assumed" errors against IAM roles, SNS topics, KMS keys, S3 buckets, or any other ARN-bearing resource. Triggers also include the user mentioning "partition", "aws-cn vs aws", "cross-partition", "中国区 ARN 不对", "global partition ARN in China account", "trust policy 写错了", or pasting any ARN that looks like `arn:aws:iam::*` while the account context is China. Importantly, use this skill BEFORE concluding that an IAM trust policy or resource policy is "missing permissions" — the more common root cause in China region accounts is a partition string mismatch that the agent and generic LLM debuggers consistently get wrong.

2026-05-28

cross-account-cost-attribution.md

from "aws-samples/sample-skills-for-AWS-Devops-agent"

Retrieve, compare, and attribute AWS spend across the two China region accounts (aws-cn in cn-northwest-1 and aws-cn-2 in cn-north-1). Use this skill when the user asks about cost, spend, billing, 花费, 成本, 账单, 多少钱, expensive, 贵, top services, cost breakdown, month-over-month, budget, or wants to know which China account spends more and on what. Covers month-to-date, last month, last 90 days, and custom time ranges. Also use when the user wants to correlate cost with specific resources or services (e.g. "哪个账号的 EC2 花费高"). Report totals per account, top services per account, and deltas vs previous period when available.

2026-05-28

package.json

"author": "aws-samples"

"repository": "aws-samples/sample-skills-for-AWS-Devops-agent"
