aws-resource-health-diagnose

Analyze AWS resource health, diagnose issues from CloudWatch logs and metrics, and create a remediation plan for identified problems.

Stars: 35,260

Forks: 4,353

Last updated: June 10, 2026 at 04:43

Source: github/awesome-copilot (GitHub)

SKILL.md

name	aws-resource-health-diagnose
description	Analyze AWS resource health, diagnose issues from CloudWatch logs and metrics, and create a remediation plan for identified problems.

AWS Resource Health & Issue Diagnosis

This workflow analyzes a specific AWS resource to assess its health status, diagnose potential issues using CloudWatch logs and metrics, and develop a comprehensive remediation plan for any problems discovered.

Prerequisites

AWS CLI configured and authenticated
Target AWS resource identified (name, type, and optionally region/account)
CloudWatch logging and metrics enabled on the target resource
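
The first prerequisite can be checked before any diagnostic call is made. The following is a minimal sketch, assuming a POSIX-style shell; the function name `preflight` is hypothetical and not part of the skill. It relies on `aws sts get-caller-identity`, which fails when credentials are missing or expired.

```shell
# preflight: hypothetical helper, not part of the original skill.
# Returns 0 when the AWS CLI is available and the current credentials work,
# 1 when the CLI is missing, 2 when the credential check fails.
preflight() {
  if ! command -v aws >/dev/null 2>&1; then
    echo "aws CLI not found; install it, then run 'aws configure'" >&2
    return 1
  fi
  # get-caller-identity only needs valid credentials, no extra IAM permissions.
  if ! aws sts get-caller-identity --query Account --output text >/dev/null 2>&1; then
    echo "credentials missing or expired; run 'aws configure' or refresh the session" >&2
    return 2
  fi
  echo "ok"
}
```

Calling `preflight || exit 1` at the top of a diagnostic script stops early instead of failing halfway through Step 2.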

Workflow Steps

Step 1: Get AWS Diagnostic Best Practices

Fetch https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ for monitoring and troubleshooting guidance to inform the diagnostic approach.

Step 2: Resource Discovery & Identification

Locate the target resource using the appropriate AWS CLI command for its type:

# EC2
aws ec2 describe-instances --filters "Name=tag:Name,Values=<name>"
# Lambda
aws lambda get-function --function-name <name>
# RDS
aws rds describe-db-instances --db-instance-identifier <name>
# ECS
aws ecs describe-services --cluster <cluster> --services <name>
# ALB
aws elbv2 describe-load-balancers --names <name>
# DynamoDB
aws dynamodb describe-table --table-name <name>
# SQS
aws sqs get-queue-attributes --queue-url <url> --attribute-names All
# API Gateway (HTTP and WebSocket APIs; for REST APIs use: aws apigateway get-rest-apis)
aws apigatewayv2 get-apis

If multiple matches are found, prompt the user to specify region/account.
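
One way to pick the right discovery command is a small dispatcher keyed on resource type. This is a sketch only: `discover_cmd` and its short type names (`ec2`, `lambda`, and so on) are hypothetical, and the function prints the command rather than running it so the choice can be reviewed first.

```shell
# discover_cmd: hypothetical helper. Prints (does not run) the Step 2
# discovery command for a resource type.
# Arguments: type, name (or queue URL for sqs), optional ECS cluster.
discover_cmd() {
  local rtype="$1"
  local name="$2"
  local cluster="${3:-}"
  case "$rtype" in
    ec2)      echo "aws ec2 describe-instances --filters Name=tag:Name,Values=$name" ;;
    lambda)   echo "aws lambda get-function --function-name $name" ;;
    rds)      echo "aws rds describe-db-instances --db-instance-identifier $name" ;;
    ecs)      echo "aws ecs describe-services --cluster $cluster --services $name" ;;
    alb)      echo "aws elbv2 describe-load-balancers --names $name" ;;
    dynamodb) echo "aws dynamodb describe-table --table-name $name" ;;
    sqs)      echo "aws sqs get-queue-attributes --queue-url $name --attribute-names All" ;;
    *)        echo "unsupported resource type: $rtype" >&2; return 1 ;;
  esac
}
```

Appending `--region` to the printed command is one way to narrow the search when the same name exists in several regions.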

Step 3: Health Status Assessment

Run service-specific health checks:

# EC2
aws ec2 describe-instance-status --instance-ids <id>

# RDS
aws rds describe-db-instances --db-instance-identifier <name> \
  --query 'DBInstances[0].DBInstanceStatus'

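# Lambda - error count over 24h in hourly buckets (divide by the Invocations
# metric for an error rate; date -d is GNU syntax and differs on macOS/BSD)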
aws cloudwatch get-metric-statistics --namespace AWS/Lambda \
  --metric-name Errors --dimensions Name=FunctionName,Value=<name> \
  --start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 3600 --statistics Sum

# ECS
aws ecs describe-services --cluster <cluster> --services <name> \
  --query 'services[0].[status,runningCount,desiredCount,pendingCount]'
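
The `date -u -d '24 hours ago'` calls above depend on GNU `date`. A more portable sketch, assuming only that `date +%s` is available, is shown below; the helper names `epoch_hours_ago` and `iso_from_epoch` are hypothetical.

```shell
# epoch_hours_ago: epoch seconds for N hours before now.
# Suitable for the --start-time of `aws logs start-query`.
epoch_hours_ago() {
  echo $(( $(date -u +%s) - $1 * 3600 ))
}

# iso_from_epoch: format epoch seconds as an ISO 8601 UTC timestamp.
# Suitable for `aws cloudwatch get-metric-statistics`.
# Tries the GNU form first, then falls back to the BSD/macOS `-r` form.
iso_from_epoch() {
  date -u -d "@$1" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null ||
    date -u -r "$1" +%Y-%m-%dT%H:%M:%SZ
}
```

With these, a 24-hour window start becomes `iso_from_epoch "$(epoch_hours_ago 24)"`.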

Key health indicators by service type:

Lambda: Error rate, throttle rate, duration P99, concurrent executions
RDS: CPU utilization, FreeStorageSpace, DatabaseConnections, ReadLatency/WriteLatency
ECS: Running vs desired task count, task stop reason
ALB: TargetResponseTime, HTTPCode_ELB_5XX_Count, UnHealthyHostCount
SQS: ApproximateNumberOfMessagesVisible (backlog), ApproximateNumberOfMessagesNotVisible (in flight), ApproximateAgeOfOldestMessage
DynamoDB: ConsumedReadCapacityUnits, ThrottledRequests, SuccessfulRequestLatency
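
Two of these indicators are simple arithmetic over values the earlier commands return. The sketch below assumes the Lambda `Errors` and `Invocations` sums were fetched for the same window, and that the ECS counts come from the `describe-services` query in Step 3; both function names are hypothetical.

```shell
# error_rate_pct: errors / invocations as a percentage with two decimals.
# Prints 0.00 when there were no invocations, to avoid dividing by zero.
error_rate_pct() {
  awk -v e="$1" -v n="$2" \
    'BEGIN { if (n == 0) print "0.00"; else printf "%.2f\n", e * 100 / n }'
}

# ecs_tasks_healthy: succeeds when runningCount equals desiredCount
# and no tasks are pending. Arguments: running, desired, pending (default 0).
ecs_tasks_healthy() {
  [ "$1" -eq "$2" ] && [ "${3:-0}" -eq 0 ]
}
```

For example, `error_rate_pct 5 200` prints `2.50`, which is below the 5% threshold used for High severity in Step 5.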

Step 4: Log & Metrics Analysis

Find log groups and run CloudWatch Logs Insights queries:

# Find log groups
aws logs describe-log-groups --log-group-name-prefix /aws/<service>/<name>

# Start a query (last 24h errors)
aws logs start-query \
  --log-group-name /aws/lambda/<name> \
  --start-time $(date -u -d '24 hours ago' +%s) \
  --end-time $(date -u +%s) \
  --query-string 'filter @message like /ERROR/ | stats count(*) as errorCount by bin(1h)'

# Get results (queries run asynchronously; repeat until status is Complete)
aws logs get-query-results --query-id <id>

# Lambda cold starts
aws logs start-query \
  --log-group-name /aws/lambda/<name> \
  --start-time $(date -u -d '24 hours ago' +%s) \
  --end-time $(date -u +%s) \
  --query-string 'filter @type = "REPORT" | filter @initDuration > 0 | stats count() as coldStarts by bin(1h)'

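# RDS Performance Insights (if enabled). --identifier takes the DbiResourceId
# (db-XXXX), not the instance name; look it up with:
#   aws rds describe-db-instances --db-instance-identifier <name> \
#     --query 'DBInstances[0].DbiResourceId' --output text
aws pi get-resource-metrics \
  --service-type RDS --identifier <dbi-resource-id> \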
  --metric-queries '[{"Metric":"db.load.avg"}]' \
  --start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period-in-seconds 3600

Identify: recurring error patterns, correlation with deployments (CloudTrail), performance trends, dependency failures.
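
Because `start-query` returns immediately and `get-query-results` may report a query that is still running, the two calls in this step are usually joined by a polling loop. A minimal sketch follows; `wait_for_query` and its default retry settings are assumptions, not part of the skill.

```shell
# wait_for_query: hypothetical helper that polls a Logs Insights query.
# Arguments: query id, max attempts (default 30), delay in seconds (default 2).
# Prints the JSON results on success; returns 1 on failure or when attempts run out.
wait_for_query() {
  local query_id="$1"
  local max_tries="${2:-30}"
  local delay="${3:-2}"
  local tries=0
  local status
  while [ "$tries" -lt "$max_tries" ]; do
    status=$(aws logs get-query-results --query-id "$query_id" \
      --query status --output text) || return 1
    case "$status" in
      Complete)
        aws logs get-query-results --query-id "$query_id" --output json
        return 0 ;;
      Failed|Cancelled|Timeout)
        echo "query $query_id ended with status $status" >&2
        return 1 ;;
    esac
    tries=$((tries + 1))
    sleep "$delay"
  done
  echo "query $query_id still running after $max_tries attempts" >&2
  return 1
}
```

The query id can be captured by adding `--query queryId --output text` to the `start-query` call and passing the result to `wait_for_query`.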

Step 5: Issue Classification & Root Cause Analysis

Severity:

Critical: Service unavailable, data loss, security incidents
High: Performance degradation, error rates >5%, intermittent failures
Medium: Warnings, suboptimal configuration, minor performance issues
Low: Informational alerts, optimization opportunities
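
The severity levels above can be approximated mechanically from two inputs. In this sketch, only the "unavailable means Critical" and "error rate above 5% means High" rules come from the list; treating any other non-zero error rate as Medium is an assumption, as is the function name `classify_severity`.

```shell
# classify_severity: hypothetical helper.
# Arguments: available ("yes" or anything else), error rate in percent.
classify_severity() {
  local available="$1"
  local rate="$2"
  if [ "$available" != "yes" ]; then
    echo "Critical"            # service unavailable
  elif awk -v r="$rate" 'BEGIN { exit !(r > 5) }'; then
    echo "High"                # error rate above 5%
  elif awk -v r="$rate" 'BEGIN { exit !(r > 0) }'; then
    echo "Medium"              # assumption: any remaining non-zero error rate
  else
    echo "Low"
  fi
}
```

Data loss and security incidents still need human judgment; a numeric rule like this only covers the availability and error-rate cases.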

Root Cause Categories:

Configuration Issues: wrong settings, missing env vars, IAM permission denials
Resource Constraints: CPU/memory/disk limits, Lambda throttling, RDS connection exhaustion
Network Issues: security group rules, VPC routing, DNS, NACLs
Application Issues: code bugs, memory leaks, unhandled exceptions, slow queries
Dependency Issues: downstream timeouts, SQS/SNS failures, external API limits
Security Issues: KMS key issues, certificate expiration
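
A first-pass mapping from a log line to one of these categories can be done with pattern matching. The patterns below are illustrative assumptions chosen from common error strings, not an exhaustive or authoritative list, and `root_cause_category` is a hypothetical name. Anything unmatched falls back to Application.

```shell
# root_cause_category: hypothetical helper. Prints a category for one log line.
# Security patterns are checked first so KMS access errors are not
# swallowed by the broader AccessDenied pattern.
root_cause_category() {
  case "$1" in
    *KMS*|*"certificate has expired"*)
      echo "Security" ;;
    *AccessDenied*|*"not authorized"*)
      echo "Configuration" ;;
    *Throttl*|*"Rate exceeded"*|*"Too many connections"*)
      echo "Resource Constraints" ;;
    *ECONNREFUSED*|*"Connection timed out"*|*"Name or service not known"*)
      echo "Network" ;;
    *)
      echo "Application" ;;
  esac
}
```

The result is a starting hypothesis for the root cause analysis, to be confirmed against metrics and recent changes.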

Step 6: Generate Remediation Plan

Immediate Actions (Critical):

# Lambda throttling - raise reserved concurrency (must fit within the account concurrency limit)
aws lambda put-function-concurrency \
  --function-name <name> --reserved-concurrent-executions 100

# RDS connection exhaustion - reboot to reset connections (causes a brief outage; confirm with the user first)
aws rds reboot-db-instance --db-instance-identifier <name>
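
Since the immediate actions change live resources, it can help to route them through a wrapper that defaults to a dry run. This is a sketch under that assumption; `run_remediation` and the `APPLY` variable are hypothetical names.

```shell
# run_remediation: hypothetical wrapper. Prints the command by default and
# executes it only when APPLY=yes is set in the environment.
run_remediation() {
  if [ "${APPLY:-no}" = "yes" ]; then
    "$@"
  else
    echo "DRY RUN: $*"
  fi
}
```

For example, `run_remediation aws rds reboot-db-instance --db-instance-identifier my-db` only prints the command (`my-db` is a placeholder), while prefixing it with `APPLY=yes` runs it once the user has confirmed.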

Short-term Fixes (High/Medium): Configuration adjustments, right-sizing, CloudWatch alarm improvements, IAM corrections.

Long-term Improvements: Architectural changes for resilience, preventive monitoring, and AWS Health event notifications routed through EventBridge.

Step 7: Report & User Confirmation

Present findings:

🏥 AWS Resource Health Assessment

📊 Resource Overview:
• Resource: [Name] ([Type])
• Status: [Healthy/Warning/Critical]
• Region: [Region] | Account: [Account ID]

🚨 Issues Identified:
• Critical: X | High: Y | Medium: Z | Low: N

🔍 Top Issues:
1. [Issue]: [Description] — Impact: [High/Medium/Low]
2. [Issue]: [Description] — Impact: [High/Medium/Low]

🛠️ Remediation: X immediate, Y short-term, Z long-term actions

❓ Proceed with detailed remediation plan? (y/n)

Then generate a full markdown report covering: health metrics, issues with root cause analysis, phased remediation steps with AWS CLI commands, CloudWatch alarm recommendations, and validation checklist.
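
The summary can be filled in from collected values with a small rendering function. The sketch below is a plain-text variant of the template, without the emoji, and the function name and argument order are assumptions.

```shell
# render_summary: hypothetical helper that prints the findings summary.
# Arguments: name, type, status, region, account id,
#            then critical, high, medium, and low issue counts.
render_summary() {
  printf '%s\n' "AWS Resource Health Assessment"
  printf '\n'
  printf '%s\n' "Resource Overview:"
  printf '%s\n' "- Resource: $1 ($2)"
  printf '%s\n' "- Status: $3"
  printf '%s\n' "- Region: $4 | Account: $5"
  printf '\n'
  printf '%s\n' "Issues Identified:"
  printf '%s\n' "- Critical: $6 | High: $7 | Medium: $8 | Low: $9"
}
```

Redirecting the output to a file gives a starting point for the full markdown report.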

Error Handling

Resource Not Found: Ask user to clarify name/region
Authentication Issues: Guide through aws configure
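Insufficient Permissions: List the required IAM actions, preferring specific read-only ones (logs:DescribeLogGroups, logs:StartQuery, logs:GetQueryResults, cloudwatch:GetMetricStatistics, pi:GetResourceMetrics) over wildcards such as logs:*, cloudwatch:*, pi:*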
No Logs Available: Suggest enabling CloudWatch logging for the resource type
Query Timeouts: Use shorter time windows
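
These cases can be told apart by inspecting the error text the AWS CLI writes to stderr. The following sketch matches a few common error strings; the pattern list is an assumption and far from complete, and `suggest_fix` is a hypothetical name.

```shell
# suggest_fix: hypothetical helper. Maps AWS CLI error text to a next step.
suggest_fix() {
  case "$1" in
    *"Unable to locate credentials"*|*ExpiredToken*)
      echo "Authentication: run 'aws configure' or refresh the session" ;;
    *AccessDenied*|*UnauthorizedOperation*)
      echo "Permissions: grant the IAM action named in the error" ;;
    *NotFound*|*"does not exist"*)
      echo "Not found: confirm the resource name and region" ;;
    *)
      echo "Unrecognized: show the raw error to the user" ;;
  esac
}
```

Capturing stderr with `err=$(some-aws-command 2>&1 >/dev/null)` and passing `"$err"` to the function is one way to wire this in.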

Success Criteria

✅ Resource health accurately assessed across all key metrics
✅ All significant issues identified and classified by severity
✅ Root cause analysis completed for major problems
✅ Actionable remediation plan with AWS CLI commands
✅ CloudWatch monitoring recommendations included
✅ Implementation steps include validation and rollback procedures
