
china-incident-triage

Name: China Incident Triage
Author: aws-samples




Updated: 2026-05-28 03:56

SKILL.md


name: china-incident-triage
description: First-response triage for an incoming alarm, ticket, or failure report originating from either China account (aws-cn or aws-cn-2). Use this skill when the trigger is an alarm name, CloudWatch alarm payload, SIM ticket body, error log snippet, or a user phrase such as 告警, 出事了, 服务挂了, incident, triage, 分类, 初步判断, 看一下这个告警, what happened. Determines which of the two accounts is affected, classifies the incident into one of six classes (compute / network / identity-credentials / data / cost / unknown), estimates severity from the signal, and checks whether a similar incident fired recently so duplicates are marked. Output is a short triage card that hands off to RCA or mitigation depending on severity. This skill is the entry point of the incident response pipeline.

China Incident Triage

Account routing is governed by the china-region-multi-account-routing skill. This skill assumes an incident signal has just arrived and must be classified within 60 seconds of wall-clock time.

Intended agent type

Upload this skill with the Agent Type set to Incident Triage.

When to use

A CloudWatch alarm payload arrives
A SIM ticket is auto-cut to the resolver group
A user pastes an error log or says "XXX 挂了" ("XXX is down") / "something is broken"
User phrases: 告警 ("alarm"), 出事了 ("something happened"), incident, triage, 初步判断 ("initial assessment"), 看下这个 ("take a look at this"), what's wrong

Do not use this skill for incidents that are already classified. If the user has already said, for example, "this is a network issue, investigate", skip triage and go straight to RCA.

Output contract — the Triage Card

Every invocation produces a Triage Card with exactly these seven fields. Anything longer slows triage down.

╭─ Triage Card ─────────────────────────────────────────────────────────╮
│ Account:     aws-cn  (Ningxia, cn-northwest-1)                        │
│ Class:       Network                                                  │
│ Severity:    SEV-2 (customer-facing impact probable)                  │
│ Resource:    arn:aws-cn:elasticloadbalancing:.../app/prod-alb/...     │
│ First seen:  2026-05-11 14:22 CST                                     │
│ Duplicate?   No (no similar alarm in last 24h)                        │
│ Next step:   Hand off to china-incident-rca with this card            │
╰───────────────────────────────────────────────────────────────────────╯
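A minimal bash sketch of emitting the card, assuming the seven values have already been worked out by the procedure below. The function name `print_triage_card` is hypothetical, and it prints the unboxed layout used in the Examples section rather than the box border. The sample call uses the values from the card above, with the Resource taken from the first example.

```bash
#!/usr/bin/env bash
# print_triage_card: emit the seven Triage Card fields in contract order.
# Plain "Label  value" lines (no box border) so the output pastes cleanly into
# tickets and chat; the label column is padded to 13 characters like the card.
print_triage_card() {
  local -a labels=("Account:" "Class:" "Severity:" "Resource:" "First seen:" "Duplicate?" "Next step:")
  # The contract is exactly these fields: refuse to print a partial or padded card.
  if (( $# != ${#labels[@]} )); then
    echo "print_triage_card: expected ${#labels[@]} fields, got $#" >&2
    return 2
  fi
  local i=0 value
  for value in "$@"; do
    printf '%-13s%s\n' "${labels[i]}" "$value"
    i=$((i + 1))
  done
}

# Example call with the sample card's values.
print_triage_card "aws-cn (Ningxia, cn-northwest-1)" "Network" \
  "SEV-2 (customer-facing impact probable)" "prod-alb (ALB)" \
  "2026-05-11 14:22 CST" "No (no similar alarm in last 24h)" \
  "Hand off to china-incident-rca with this card"
```

Rejecting the wrong number of fields keeps the card from silently drifting away from the seven-field contract.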

Procedure

Step 1 — Determine affected account

From the signal, extract account attribution in this order (first match wins):

ARN in signal — check ARN partition (arn:aws-cn:...) and 12-digit account ID segment
Region in signal — cn-northwest-1 → aws-cn, cn-north-1 → aws-cn-2
Resource tag (if the signal includes tags like Account=aws-cn)
Host / endpoint hostname — aws-cn.yingchu.cloud vs aws-cn-2.yingchu.cloud
Ask the user if none of the above resolves
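The attribution order above can be sketched as a bash function. This is an illustration under stated assumptions: the account IDs in `account_for_id` are placeholders, `attribute_account` is a hypothetical name, and the hostname pattern simply mirrors the two hostnames listed above. It needs bash for `[[ =~ ]]` and `BASH_REMATCH`.

```bash
#!/usr/bin/env bash
# attribute_account: apply the Step 1 order (first match wins) to a raw signal
# string and print aws-cn or aws-cn-2; prints "ask-user" and returns 1 otherwise.

# Hypothetical placeholder IDs: replace with the real 12-digit IDs of each account.
account_for_id() {
  case "$1" in
    111111111111) echo "aws-cn" ;;
    222222222222) echo "aws-cn-2" ;;
    *) return 1 ;;
  esac
}

attribute_account() {
  local signal="$1"
  local arn_re='arn:aws-cn:[a-z0-9-]+:[a-z0-9-]*:([0-9]{12}):'
  local tag_re='Account=(aws-cn(-2)?)'
  local host_re='(aws-cn(-2)?)\.yingchu\.cloud'

  # 1. ARN: partition must be aws-cn; the account ID segment picks the account.
  if [[ $signal =~ $arn_re ]] && account_for_id "${BASH_REMATCH[1]}"; then
    return 0
  fi
  # 2. Region. cn-north-1 is not a substring of cn-northwest-1, so the checks cannot collide.
  if [[ $signal == *cn-northwest-1* ]]; then echo "aws-cn"; return 0; fi
  if [[ $signal == *cn-north-1* ]]; then echo "aws-cn-2"; return 0; fi
  # 3. Resource tag such as Account=aws-cn-2 (the optional -2 is matched greedily).
  if [[ $signal =~ $tag_re ]]; then echo "${BASH_REMATCH[1]}"; return 0; fi
  # 4. Endpoint hostname.
  if [[ $signal =~ $host_re ]]; then echo "${BASH_REMATCH[1]}"; return 0; fi
  # 5. Nothing resolved: the caller must ask the user.
  echo "ask-user"
  return 1
}
```

An unknown account ID in an ARN falls through to the region check instead of failing, because the ARN's region segment usually still resolves the account.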

Step 2 — Classify the incident

Bucket the signal into exactly one class:

Class	Typical signals
Compute	EC2 status check failed, ASG unable to launch, EKS pod crashloop, Lambda throttled, ECS task exits
Network	ALB 5xx spike, target unhealthy, NAT gateway errors, VPC Lattice disconnect, DNS resolution failure, Route 53 health check failed
Identity/Credentials	`AuthFailure`, `ExpiredToken`, `InvalidClientTokenId`, `SignatureDoesNotMatch`, IAM policy eval errors, STS AssumeRole failures
Data	RDS connection errors, DynamoDB throttle, S3 4xx/5xx, OpenSearch red status, replication lag
Cost	Budget alarm, anomaly detection trigger, quota-based alarm (API throttling from quota)
Unknown	Signal is ambiguous — log a note and request user clarification

If the signal spans multiple classes (for example, a compute failure caused by credentials), pick the class the evidence points to as the root cause (here, Identity/Credentials) and note the secondary class (Compute).
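A rough keyword-based sketch of the classification step, using terms from the Typical signals table. It is a heuristic, not a classifier that ships with the skill: `classify_signal` and `matches` are hypothetical names, the keyword lists are incomplete, and testing Identity/Credentials first is only one way to approximate the root-cause rule.

```bash
#!/usr/bin/env bash
# matches TEXT ERE: case-insensitive whole-word match; grep -w treats "-", ":", "/"
# and spaces as word boundaries, so "prod-alb-5xx" contains the word "alb".
matches() { printf '%s\n' "$1" | grep -qiwE "$2"; }

# classify_signal TEXT: print exactly one of the six classes.
# Identity/Credentials is tested first because credential failures often surface
# as compute or network symptoms downstream.
classify_signal() {
  local s="$1"
  if matches "$s" 'AuthFailure|ExpiredToken(Exception)?|InvalidClientTokenId|SignatureDoesNotMatch|AssumeRole'; then
    echo "Identity/Credentials"
  elif matches "$s" 'budget|cost|quota'; then
    echo "Cost"
  elif matches "$s" 'RDS|DynamoDB|S3|OpenSearch|replication|ReplicaLag'; then
    echo "Data"
  elif matches "$s" 'ALB|ELB|unhealthy|UnHealthyHostCount|NAT|NatGateway|Lattice|DNS|Route ?53'; then
    echo "Network"
  elif matches "$s" 'EC2|StatusCheckFailed|ASG|AutoScaling|EKS|crashloop|CrashLoopBackOff|Lambda|ECS'; then
    echo "Compute"
  else
    echo "Unknown"   # last resort: log a note and ask the user
  fi
}
```

Whole-word matching avoids false hits such as "nat" inside "destination", at the cost of missing keywords embedded in longer identifiers.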

Step 3 — Estimate severity

Use a simple rubric:

Severity	Criterion
SEV-1	Complete account outage, multiple services down, data loss in progress
SEV-2	Single service customer-facing impact, 5xx rate > 5%, MCP endpoint down
SEV-3	Degraded latency, single-AZ impact, partial feature failure
SEV-4	Monitoring-only, internal tool affected, no customer impact
SEV-5	Informational, metric threshold crossed without user-visible effect

Default to SEV-3 when the signal is ambiguous — escalate via RCA findings later if needed.
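One way to make the rubric mechanical, assuming the triage agent has already extracted three facts from the signal: how many services are confirmed down, whether customer-facing traffic is affected, and the 5xx rate. `estimate_severity` and its argument scheme are hypothetical, and SEV-1 criteria such as data loss in progress are not modeled.

```bash
#!/usr/bin/env bash
# rate_above_5 RATE: true when RATE (a percentage, possibly fractional) is > 5.
rate_above_5() {
  [[ -n $1 ]] && awk -v r="$1" 'BEGIN { exit !(r > 5) }'
}

# estimate_severity DOWN FACING RATE
#   DOWN    number of services confirmed down (corroborated, not one noisy metric)
#   FACING  yes | no | unknown: is customer-facing traffic affected?
#   RATE    5xx rate in percent, or "" when unavailable
estimate_severity() {
  local down="${1:-0}" facing="${2:-unknown}" rate="${3:-}"
  if (( down >= 2 )); then
    echo "SEV-1"   # multiple services down
  elif [[ $facing == yes ]] && { (( down == 1 )) || rate_above_5 "$rate"; }; then
    echo "SEV-2"   # one customer-facing service down, or 5xx rate above 5%
  elif [[ $facing == no ]] && (( down == 1 )); then
    echo "SEV-4"   # internal tool affected, no customer impact
  elif [[ $facing == no ]]; then
    echo "SEV-5"   # threshold crossed without user-visible effect
  else
    echo "SEV-3"   # degraded or ambiguous: the documented default
  fi
}
```

Any input the function cannot place, including an unknown customer impact, lands on SEV-3, which matches the default above.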

Step 4 — Dedup check

Before handing off, query:

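# Run against the account and region resolved in Step 1 (add --region cn-northwest-1 or cn-north-1).
aws cloudwatch describe-alarms --alarm-name-prefix <prefix> \
  --query 'MetricAlarms[].[AlarmName,StateValue,StateUpdatedTimestamp]' --output table
aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventName,AttributeValue=<event> \
  --max-items 20 --start-time "$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)"  # GNU date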

If a closely matching alarm fired within the last 24 hours and is still in ALARM or was recently resolved, set Duplicate? to Yes and include the original incident ID. This prevents duplicate RCA runs.
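A sketch of the dedup check for a single alarm, assuming AWS CLI v2 with credentials for the account found in Step 1, GNU `date`, and state-change history summaries that read like "Alarm updated from OK to ALARM". `prior_alarm_firings` and `dedup_verdict` are hypothetical helpers, and `--output json` is used so the count comes back as a single JSON number.

```bash
#!/usr/bin/env bash
# prior_alarm_firings NAME REGION: transitions into ALARM in the last 24h, minus the
# firing currently being triaged (which is itself one transition).
prior_alarm_firings() {
  local name="$1" region="$2" total
  total=$(aws cloudwatch describe-alarm-history --region "$region" \
    --alarm-name "$name" --history-item-type StateUpdate \
    --start-date "$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-date "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --query "length(AlarmHistoryItems[?contains(HistorySummary, 'to ALARM')])" \
    --output json) || return 1
  echo $(( total > 0 ? total - 1 : 0 ))
}

# dedup_verdict PRIOR ORIGINAL_ID: format the card's "Duplicate?" field.
dedup_verdict() {
  local prior="$1" original_id="${2:-unknown}"
  if (( prior > 0 )); then
    echo "Yes (duplicate of ${original_id}; ${prior} prior firing(s) in last 24h)"
  else
    echo "No (no similar alarm in last 24h)"
  fi
}
```

For example, `dedup_verdict "$(prior_alarm_firings prod-alb-5xx-rate cn-northwest-1)" INC-1234` would fill in the Duplicate? field; INC-1234 is a made-up ID, and mapping a prior firing to its real incident ID is left to the caller.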

Step 5 — Hand off

Emit the Triage Card. Based on severity:

SEV-1 / SEV-2 → "Escalating to RCA; will auto-invoke china-incident-rca."
SEV-3 → "Triage complete. Run RCA when ready."
SEV-4 / SEV-5 → "Low impact; log and monitor. Skip RCA unless trend develops."
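The severity-to-message mapping is small enough to encode directly. This bash `case` sketch reuses the exact strings above; `handoff_message` is a hypothetical name, and rejecting unknown severities is a design choice for illustration, not part of the skill.

```bash
#!/usr/bin/env bash
# handoff_message SEVERITY: the Step 5 closing line for a given severity.
# Returns 1 for anything outside SEV-1..SEV-5 so a malformed card is caught early.
handoff_message() {
  case "$1" in
    SEV-1|SEV-2) echo "Escalating to RCA; will auto-invoke china-incident-rca." ;;
    SEV-3)       echo "Triage complete. Run RCA when ready." ;;
    SEV-4|SEV-5) echo "Low impact; log and monitor. Skip RCA unless trend develops." ;;
    *)           echo "handoff_message: unknown severity '$1'" >&2; return 1 ;;
  esac
}
```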

Things not to do

Do not start RCA investigation during triage. Triage is < 60 seconds of classification, not root-cause discovery.
Do not speculate on cause. "Probably network" in the Class field is fine; "probably because someone changed the SG last night" is RCA's job.
Do not escalate to SEV-1 based on noisy signals alone. Check for corroborating indicators (multiple metrics, multiple alarm targets) before claiming account-wide outage.
Do not classify into Unknown without first checking account, region, and service fields in the signal. Unknown is a last resort.
Do not skip the dedup check. Repeated investigations on a flapping alarm waste agent time and user attention.
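A sketch of one corroboration check for the SEV-1 rule: count how many distinct CloudWatch namespaces currently have alarms in ALARM. The two-namespace threshold and the `distinct_alarm_namespaces` helper are assumptions for illustration; the commented usage shows how AWS CLI text output could be piped in.

```bash
#!/usr/bin/env bash
# distinct_alarm_namespaces: read "AlarmName<TAB>Namespace" lines on stdin and print
# how many distinct namespaces appear. Metric-math alarms have no Namespace and show
# up as "None" in AWS CLI text output, so they are skipped.
distinct_alarm_namespaces() {
  awk -F'\t' '$2 != "" && $2 != "None" { seen[$2] = 1 } END { n = 0; for (k in seen) n++; print n }'
}

# Hypothetical rule: treat SEV-1 as corroborated only when at least 2 namespaces
# (for example AWS/ApplicationELB and AWS/RDS) have alarms in ALARM.
# Usage against a live account (AWS CLI v2, region from Step 1):
#   aws cloudwatch describe-alarms --region cn-northwest-1 --state-value ALARM \
#     --query 'MetricAlarms[].[AlarmName,Namespace]' --output text | distinct_alarm_namespaces
```

Counting namespaces rather than alarms keeps one flapping metric with several alarms on it from looking like an account-wide outage.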

Examples

Input: CloudWatch alarm prod-alb-5xx-rate state=ALARM in cn-northwest-1

Action:

Account:     aws-cn (cn-northwest-1)
Class:       Network
Severity:    SEV-2
Resource:    prod-alb (ALB)
First seen:  <alarm.StateChangeTime>
Duplicate?   <check last 24h>
Next step:   Hand off to china-incident-rca

Input: "aws-cn-2 Lambda throwing AuthFailure since 14:00"

Action:

Account:     aws-cn-2 (cn-north-1)
Class:       Identity/Credentials
Severity:    SEV-2 (likely wide-impact — credentials power all calls)
Resource:    <Lambda function ARN if provided, else "Lambda fleet">
First seen:  ~14:00 today
Duplicate?   Check recent AuthFailure rate
Next step:   Hand off to china-incident-rca (prioritize: check recent
             Secrets Manager update, check IAM key age)

Input: "中国区好像不太对劲" ("something seems off in the China region")

Action: Signal too vague for triage. Ask: "Which account, aws-cn or aws-cn-2? And what symptom — slow response, error, or metric anomaly?" Do not emit a Triage Card on speculation.

Related skills (same repository)

china-account-prevention-checks.md

from "aws-samples/sample-skills-for-AWS-Devops-agent"

Proactive prevention and pre-alarm health checks across the two China region accounts (aws-cn and aws-cn-2). Use this skill when the user asks about prevention, 防护, 预防, proactive, 体检, health check, risk assessment, 潜在风险, 隐患, or "what might break soon", and when the Evaluation agent runs scheduled recommendation workflows. Looks for conditions that predict future incidents — single points of failure, service quotas nearing limits, stale AMIs, aging credentials, certificates expiring within 30 days, deprecated Lambda runtimes. This skill is distinct from cross-account-security-posture-check, which reports current-state security risk. Prevention predicts future failure; security posture describes current exposure.

2026-05-28

china-incident-mitigation.md

from "aws-samples/sample-skills-for-AWS-Devops-agent"

Draft step-by-step mitigation CLI commands for a root-caused incident in either China account (aws-cn or aws-cn-2). Use this skill after RCA has identified the root cause, when the user asks for mitigation, remediation, 缓解, 修复, 回滚, rollback, restore service, 怎么修, fix it, 怎么办. Covers common mitigation patterns such as credential rotation, Kubernetes pod rollout-restart, ALB target group reattach, security group rule revoke, IAM policy rollback, and safe CloudFormation stack rollback. Output always includes the exact CLI command, a one-line explanation of what it changes, a rollback/undo command, and an explicit human approval prompt. CRITICAL — this skill NEVER executes commands autonomously; every mitigation step requires explicit user approval before running.

2026-05-28

china-incident-rca.md

from "aws-samples/sample-skills-for-AWS-Devops-agent"

Root cause analysis for a triaged incident in either China account (aws-cn or aws-cn-2). Use this skill after triage has produced a Triage Card, or when the user directly asks RCA, 根本原因, 根因, 为什么挂, why did X fail, deep dive, deep investigation, 深入分析, dig into, 调查. Correlates the CloudTrail API log window around the incident, recent deploy events (CloudFormation stack events, CodeDeploy, ECR pushes, Lambda updates), metric anomalies against prior-week baseline, and cross-account blast radius — specifically, whether the same failure pattern also hit the other China account around the same time, which would suggest a shared upstream cause (IAM partition-wide, AWS region event, or common dependency). Produces a single root-cause hypothesis plus the evidence chain. Does NOT execute remediation.

2026-05-28

china-region-multi-account-routing.md

from "aws-samples/sample-skills-for-AWS-Devops-agent"

Routing and disambiguation guidance for the two AWS China region MCP servers exposed by this Agent Space. Use this skill whenever the user's request mentions "中国区", "China", "cn-north-1", "cn-northwest-1", "Beijing", "Ningxia", or any AWS resource that must resolve to a specific China partition account. The skill explains which MCP endpoint maps to which account, how to pick when the user does not specify, and how to label cross-account results so the user can tell them apart.

2026-05-28

cn-partition-arn-routing.md

from "aws-samples/sample-skills-for-AWS-Devops-agent"

Diagnose and explain AWS partition ARN mismatches in China region accounts. Use this skill whenever an investigation in `cn-north-1` or `cn-northwest-1` involves ARNs starting with `arn:aws:` instead of `arn:aws-cn:`, or whenever symptoms include AccessDenied, AuthFailure, MalformedPolicyDocument, NoSuchEntity, or "principal cannot be assumed" errors against IAM roles, SNS topics, KMS keys, S3 buckets, or any other ARN-bearing resource. Triggers also include the user mentioning "partition", "aws-cn vs aws", "cross-partition", "中国区 ARN 不对", "global partition ARN in China account", "trust policy 写错了", or pasting any ARN that looks like `arn:aws:iam::*` while the account context is China. Importantly, use this skill BEFORE concluding that an IAM trust policy or resource policy is "missing permissions" — the more common root cause in China region accounts is a partition string mismatch that the agent and generic LLM debuggers consistently get wrong.

2026-05-28

cross-account-cost-attribution.md

from "aws-samples/sample-skills-for-AWS-Devops-agent"

Retrieve, compare, and attribute AWS spend across the two China region accounts (aws-cn in cn-northwest-1 and aws-cn-2 in cn-north-1). Use this skill when the user asks about cost, spend, billing, 花费, 成本, 账单, 多少钱, expensive, 贵, top services, cost breakdown, month-over-month, budget, or wants to know which China account spends more and on what. Covers month-to-date, last month, last 90 days, and custom time ranges. Also use when the user wants to correlate cost with specific resources or services (e.g. "哪个账号的 EC2 花费高"). Report totals per account, top services per account, and deltas vs previous period when available.

2026-05-28

Repository: aws-samples/sample-skills-for-AWS-Devops-agent (author: aws-samples)
