china-incident-mitigation

Name: China Incident Mitigation
Author: aws-samples

Draft step-by-step mitigation CLI commands for a root-caused incident in either China account (aws-cn or aws-cn-2). Use this skill after RCA has identified the root cause, when the user asks for mitigation, remediation, 缓解 (mitigate), 修复 (repair), 回滚 (roll back), rollback, restore service, 怎么修 (how to fix it), fix it, 怎么办 (what should we do). Covers common mitigation patterns such as credential rotation, Kubernetes pod rollout-restart, ALB target group reattach, security group rule revoke, IAM policy rollback, and safe CloudFormation stack rollback. Output always includes the exact CLI command, a one-line explanation of what it changes, a rollback/undo command, and an explicit human approval prompt. CRITICAL: this skill NEVER executes commands autonomously; every mitigation step requires explicit user approval before running.

updated:May 28, 2026 at 03:56

China Incident Mitigation

Routing is governed by china-region-multi-account-routing. RCA hand-off is governed by china-incident-rca. This skill produces command drafts for human approval, not autonomous actions.

Intended agent type

Upload with Agent Type Incident Mitigation selected.

The approval contract

This skill only produces command drafts for human review. It never executes write operations autonomously.

Enterprise deployments: route approvals through your change management system (ServiceNow, Jira, OpsGenie, etc.). The agent outputs the command; a human approves and executes it in the change window.

Operator/demo deployments: the agent waits for an explicit "approve step N" before executing. Even then, treat this as a convenience shortcut and never use it in production without a change record.

Every mitigation output in this skill uses this exact 4-field format. No exceptions.

### Mitigation step N — <short name>

Command:
  aws <command> --region <region> --profile <account>  \
    <args>

What it does:
  <one sentence describing the state change>

Rollback:
  aws <undo command> --region <region> --profile <account>  \
    <args>

Approval required:
  This command changes production state in <account>.
  Enterprise: submit to change management system before executing.
  Operator: reply "approve step N" to execute, "skip N" to skip,
  or "stop" to abort.

The agent never executes without explicit approval. "yes", "do it", "go ahead" are NOT valid — must be "approve step N" or a documented change record approval.
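
The reply rule above is strict enough to sketch as a small classifier. The POSIX shell function below is a minimal illustration and not part of the skill: the name `classify_reply` and its output strings are hypothetical, and it assumes matching is exact and case-sensitive. Approval through a documented change record happens outside the chat and is not modelled here.

```shell
# Hypothetical helper: classify an operator reply under the approval contract.
# Prints "execute N", "skip N", "stop", or "ask-again".
classify_reply() {
  reply=$1
  case "$reply" in
    "approve step "*) verb=execute; num=${reply#"approve step "} ;;
    "skip "*)         verb=skip;    num=${reply#"skip "} ;;
    "stop")           echo "stop"; return 0 ;;
    *)                echo "ask-again"; return 0 ;;
  esac
  # The step number must be one or more digits and nothing else.
  case "$num" in
    ""|*[!0-9]*) echo "ask-again" ;;
    *)           echo "$verb $num" ;;
  esac
}

classify_reply "approve step 2"   # prints: execute 2
classify_reply "go ahead"         # prints: ask-again
```

Anything that is not one of the three recognised phrases falls through to `ask-again`, which matches the intent that ambiguous replies never trigger execution.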

Pattern library

Map RCA findings to known mitigation patterns. If the RCA does not match any pattern, fall back to general investigation (do not improvise mitigation).

Pattern A — Credential failure (AuthFailure / ExpiredToken)

Root cause signal: MCP pod logs show AuthFailure or ExpiredToken.

Steps:

Verify credentials in Secrets Manager are current

aws secretsmanager get-secret-value \
  --secret-id /mcp/aws-cn --region us-east-1

If stale, rotate via operator — user provides the new AK/SK, agent does not generate credentials

aws secretsmanager put-secret-value \
  --secret-id /mcp/aws-cn --region us-east-1 \
  --secret-string '{"AK":"<NEW>","SK":"<NEW>"}'

Force pod restart to pick up new secret

kubectl -n mcp rollout restart deploy/aws-cn

Pattern B — MCP pod crashloop

Root cause signal: pod RESTARTS > 0, recent image change.

Steps:

Inspect recent crash

kubectl -n mcp logs deploy/aws-cn --previous

If image rollout is the cause, roll back

kubectl -n mcp rollout undo deploy/aws-cn

Confirm healthy

kubectl -n mcp rollout status deploy/aws-cn --timeout=2m

Pattern C — ALB target unhealthy

Root cause signal: ALB 5xx spike, target group health check failing.

Steps:

Describe current target health

aws elbv2 describe-target-health \
  --target-group-arn <tg-arn> --region cn-northwest-1

If pods are ready but ALB marks them unhealthy, check recent ModifyTargetGroupAttributes from RCA evidence and revert:

aws elbv2 modify-target-group-attributes \
  --target-group-arn <tg-arn> --region cn-northwest-1 \
  --attributes Key=<attr>,Value=<previous-value>

Rollback command (if the revert itself breaks something):

aws elbv2 modify-target-group-attributes \
  --target-group-arn <tg-arn> --region cn-northwest-1 \
  --attributes Key=<attr>,Value=<current-value>

Pattern D — Overly-permissive SG (security incident)

Root cause signal: SG rule 0.0.0.0/0 on sensitive port added accidentally.

Steps:

Revoke the rule

aws ec2 revoke-security-group-ingress \
  --group-id <sg-id> --region <region> \
  --protocol tcp --port <port> --cidr 0.0.0.0/0

Rollback command (if this rule was intentional and you just revoked a production-needed rule):

aws ec2 authorize-security-group-ingress \
  --group-id <sg-id> --region <region> \
  --protocol tcp --port <port> --cidr 0.0.0.0/0

Verify

aws ec2 describe-security-groups \
  --group-ids <sg-id> --region <region>
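
Because the revoke and its rollback differ only in the subcommand, one way to keep them in sync is to build both from the same inputs. The sketch below only prints text and never runs `aws`; the function name `draft_sg_revoke` and the sample arguments are hypothetical.

```shell
# Hypothetical helper: print (never run) the Pattern D revoke command and its
# matching rollback, built from the same three inputs so they cannot drift.
draft_sg_revoke() {
  sg_id=$1
  region=$2
  port=$3
  rule="--group-id $sg_id --region $region --protocol tcp --port $port --cidr 0.0.0.0/0"
  printf 'Command:\n  aws ec2 revoke-security-group-ingress %s\n' "$rule"
  printf 'Rollback:\n  aws ec2 authorize-security-group-ingress %s\n' "$rule"
}

# Placeholder values for illustration only.
draft_sg_revoke sg-EXAMPLE cn-north-1 22
```

The output still needs the "What it does" and "Approval required" fields before it is a complete mitigation step.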

Pattern E — CloudFormation stack stuck

Root cause signal: CFN stack in UPDATE_ROLLBACK_FAILED, or stuck in UPDATE_IN_PROGRESS for too long.

Steps:

Inspect

aws cloudformation describe-stack-events \
  --stack-name <stack> --region <region> --max-items 20

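Continue the rollback, skipping only the resources that failed to roll back (surgical). CloudFormation accepts this command only when the stack is in UPDATE_ROLLBACK_FAILED: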

aws cloudformation continue-update-rollback \
  --stack-name <stack> --region <region> \
  --resources-to-skip <logical-id>

Rollback option: none, because the CFN rollback is itself the rollback. Do not run delete-stack unless the user explicitly authorizes it (data loss risk).
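
Since `continue-update-rollback` is only accepted for a stack in UPDATE_ROLLBACK_FAILED, a pre-check on the reported stack status can decide whether drafting it makes sense. This is a sketch under assumptions: the function name and the action strings are hypothetical, and what to do for a long-running UPDATE_IN_PROGRESS stack is left as further inspection rather than a prescribed command.

```shell
# Hypothetical pre-check: map a CloudFormation stack status to the next
# Pattern E action. Only UPDATE_ROLLBACK_FAILED leads to a command draft.
rollback_action_for() {
  case "$1" in
    UPDATE_ROLLBACK_FAILED) echo "draft continue-update-rollback" ;;
    UPDATE_IN_PROGRESS)     echo "keep inspecting stack events" ;;
    *)                      echo "no Pattern E action" ;;
  esac
}

rollback_action_for UPDATE_ROLLBACK_FAILED   # prints: draft continue-update-rollback
```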

Pattern F — Secrets Manager secret accidentally deleted

Root cause signal: ResourceNotFoundException when MCP pod reads secret.

Steps:

Restore the secret if it is still inside its recovery window (7 to 30 days, set when the secret was deleted)

aws secretsmanager restore-secret \
  --secret-id /mcp/aws-cn --region us-east-1

If past recovery window, recreate and repopulate (user provides AK/SK — agent does not generate)

aws secretsmanager create-secret \
  --name /mcp/aws-cn --region us-east-1 \
  --secret-string '{"AK":"<NEW>","SK":"<NEW>"}'

Procedure

1. Read the RCA output
2. Match to a pattern in the library
3. Produce each mitigation step in the 4-field format
4. Wait for explicit "approve step N" before executing
5. After each step, report result and wait for next approval
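
The matching step in this procedure could be sketched as a lookup from the RCA signal text to a pattern letter. The keywords come from the "Root cause signal" lines in the pattern library; the matching rules themselves and the name `match_pattern` are assumptions made for illustration, and a real matcher would likely need more context than substring checks.

```shell
# Hypothetical matcher: map an RCA signal string to a library pattern letter,
# or "none" when nothing fits.
match_pattern() {
  case "$1" in
    *AuthFailure*|*ExpiredToken*)                  echo A ;;
    *RESTARTS*|*crashloop*)                        echo B ;;
    *"target group"*|*5xx*)                        echo C ;;
    *0.0.0.0/0*)                                   echo D ;;
    *UPDATE_ROLLBACK_FAILED*|*UPDATE_IN_PROGRESS*) echo E ;;
    *ResourceNotFoundException*)                   echo F ;;
    *)                                             echo none ;;
  esac
}

pattern=$(match_pattern "MCP pod logs show ExpiredToken")
if [ "$pattern" = none ]; then
  # No match: state the limitation instead of improvising a mitigation.
  echo "This pattern is not in the mitigation library. Recommend manual operator action."
else
  echo "Matched Pattern $pattern"
fi
```

The `none` branch is the important one: it is where the agent stops and hands control back to the operator.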

Things not to do

Do not execute any command without the exact phrase "approve step N". Ambiguous responses ("sure", "yes", "ok") → ask again, do not execute.
Do not generate AWS credentials (AK/SK). The agent has no visibility into how the customer creates root-level credentials, so the user must provide them.
Do not bundle multiple steps into a single approval. Each step is approved individually so the user can abort at any point.
Do not improvise a mitigation if the RCA does not match a library pattern. Instead, state: "This pattern is not in the mitigation library. Recommend manual operator action." Then offer to help draft a one-off command with the user's confirmation of each field.
Do not run aws cloudformation delete-stack as a mitigation — too often used to "fix" stuck stacks when continue-update-rollback is safer.
Do not use kubectl delete pod when kubectl rollout restart deploy does the same thing with deployment-managed replacement.
Do not silently retry failed commands. If step N fails, report the failure verbatim and wait for user decision.
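
Two of these rules name specific commands, so they can be checked mechanically before a draft is shown. The guard below is a minimal sketch: `is_forbidden` and its messages are hypothetical, and substring matching is an assumption that would miss reworded or aliased commands.

```shell
# Hypothetical guard: return 0 and print a reason when a drafted command is
# one the skill rules out; return 1 when the command is not on the deny list.
is_forbidden() {
  case "$1" in
    *"cloudformation delete-stack"*)
      echo "forbidden: use continue-update-rollback instead"
      return 0 ;;
    *kubectl*"delete pod"*)
      echo "forbidden: use kubectl rollout restart instead"
      return 0 ;;
  esac
  return 1
}

if is_forbidden "kubectl -n mcp delete pod aws-cn-abc"; then
  echo "draft rejected"
fi
```

A return code of 1 only means the command is not explicitly denied; it still goes through the normal per-step approval.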

Examples

Input: RCA hypothesis = "Pattern A — Credential failure in aws-cn"

Output:

### Mitigation step 1 — Verify current credentials

Command:
  aws secretsmanager get-secret-value \
    --secret-id /mcp/aws-cn --region us-east-1

What it does:
  Reads the current AK/SK from Secrets Manager. Read-only.

Rollback:
  None needed (read-only).

Approval required:
  This command reads production credentials in aws-cn. Reply
  "approve step 1" to execute, "skip 1" to move on, or "stop" to abort.

After approval and execution, the agent reports the result and emits step 2.
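
Because every step must carry the same four fields, a drafted step can be checked for completeness before it is emitted. The following is a sketch only: `has_four_fields` is a hypothetical name, and it checks just that each field label from the format definition starts a line, not that the content is sensible.

```shell
# Hypothetical check: confirm a drafted step contains all four field labels.
# Prints "ok" and returns 0, or names the first missing field and returns 1.
has_four_fields() {
  for label in "Command:" "What it does:" "Rollback:" "Approval required:"; do
    printf '%s\n' "$1" | grep -q "^$label" || {
      echo "missing field: $label"
      return 1
    }
  done
  echo "ok"
}

draft='Command:
  aws secretsmanager get-secret-value --secret-id /mcp/aws-cn --region us-east-1
What it does:
  Reads the current AK/SK from Secrets Manager. Read-only.
Rollback:
  None needed (read-only).
Approval required:
  Reply "approve step 1" to execute.'

has_four_fields "$draft"   # prints: ok
```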

Input: RCA hypothesis = "unknown — signal doesn't fit any pattern"

Output:

The RCA finding does not match any pattern in the mitigation library
(Patterns A–F). I will not auto-draft a mitigation. Options:

1. Describe the failure in more detail — I may be able to match a
   pattern I missed.
2. Draft a one-off mitigation together, field-by-field, with explicit
   confirmation at each step.
3. Escalate to a human operator.

Which would you like?

related-skills.json

same repository

china-account-prevention-checks.md

from "aws-samples/sample-skills-for-AWS-Devops-agent"

Proactive prevention and pre-alarm health checks across the two China region accounts (aws-cn and aws-cn-2). Use this skill when the user asks about prevention, 防护, 预防, proactive, 体检, health check, risk assessment, 潜在风险, 隐患, or "what might break soon", and when the Evaluation agent runs scheduled recommendation workflows. Looks for conditions that predict future incidents — single points of failure, service quotas nearing limits, stale AMIs, aging credentials, certificates expiring within 30 days, deprecated Lambda runtimes. This skill is distinct from cross-account-security-posture-check, which reports current-state security risk. Prevention predicts future failure; security posture describes current exposure.

2026-05-28

china-incident-rca.md

from "aws-samples/sample-skills-for-AWS-Devops-agent"

Root cause analysis for a triaged incident in either China account (aws-cn or aws-cn-2). Use this skill after triage has produced a Triage Card, or when the user directly asks RCA, 根本原因, 根因, 为什么挂, why did X fail, deep dive, deep investigation, 深入分析, dig into, 调查. Correlates the CloudTrail API log window around the incident, recent deploy events (CloudFormation stack events, CodeDeploy, ECR pushes, Lambda updates), metric anomalies against prior-week baseline, and cross-account blast radius — specifically, whether the same failure pattern also hit the other China account around the same time, which would suggest a shared upstream cause (IAM partition-wide, AWS region event, or common dependency). Produces a single root-cause hypothesis plus the evidence chain. Does NOT execute remediation.

2026-05-28

china-incident-triage.md

from "aws-samples/sample-skills-for-AWS-Devops-agent"

First-response triage for an incoming alarm, ticket, or failure report originating from either China account (aws-cn or aws-cn-2). Use this skill when the trigger is an alarm name, CloudWatch alarm payload, SIM ticket body, error log snippet, or a user phrase such as 告警, 出事了, 服务挂了, incident, triage, 分类, 初步判断, 看一下这个告警, what happened. Determines which of the two accounts is affected, classifies the incident into one of six classes (compute / network / identity-credentials / data / cost / unknown), estimates severity from the signal, and checks whether a similar incident fired recently so duplicates are marked. Output is a short triage card that hands off to RCA or mitigation depending on severity. This skill is the entry point of the incident response pipeline.

2026-05-28

china-region-multi-account-routing.md

from "aws-samples/sample-skills-for-AWS-Devops-agent"

Routing and disambiguation guidance for the two AWS China region MCP servers exposed by this Agent Space. Use this skill whenever the user's request mentions "中国区", "China", "cn-north-1", "cn-northwest-1", "Beijing", "Ningxia", or any AWS resource that must resolve to a specific China partition account. The skill explains which MCP endpoint maps to which account, how to pick when the user does not specify, and how to label cross-account results so the user can tell them apart.

2026-05-28

cn-partition-arn-routing.md

from "aws-samples/sample-skills-for-AWS-Devops-agent"

Diagnose and explain AWS partition ARN mismatches in China region accounts. Use this skill whenever an investigation in `cn-north-1` or `cn-northwest-1` involves ARNs starting with `arn:aws:` instead of `arn:aws-cn:`, or whenever symptoms include AccessDenied, AuthFailure, MalformedPolicyDocument, NoSuchEntity, or "principal cannot be assumed" errors against IAM roles, SNS topics, KMS keys, S3 buckets, or any other ARN-bearing resource. Triggers also include the user mentioning "partition", "aws-cn vs aws", "cross-partition", "中国区 ARN 不对", "global partition ARN in China account", "trust policy 写错了", or pasting any ARN that looks like `arn:aws:iam::*` while the account context is China. Importantly, use this skill BEFORE concluding that an IAM trust policy or resource policy is "missing permissions" — the more common root cause in China region accounts is a partition string mismatch that the agent and generic LLM debuggers consistently get wrong.

2026-05-28

cross-account-cost-attribution.md

from "aws-samples/sample-skills-for-AWS-Devops-agent"

Retrieve, compare, and attribute AWS spend across the two China region accounts (aws-cn in cn-northwest-1 and aws-cn-2 in cn-north-1). Use this skill when the user asks about cost, spend, billing, 花费, 成本, 账单, 多少钱, expensive, 贵, top services, cost breakdown, month-over-month, budget, or wants to know which China account spends more and on what. Covers month-to-date, last month, last 90 days, and custom time ranges. Also use when the user wants to correlate cost with specific resources or services (e.g. "哪个账号的 EC2 花费高"). Report totals per account, top services per account, and deltas vs previous period when available.

2026-05-28

package.json

"author": "aws-samples"

"repository": "aws-samples/sample-skills-for-AWS-Devops-agent"


