aws-troubleshoot

Name: Aws Troubleshoot
Author: incidentfox

// AWS service troubleshooting patterns. Use for EC2, ECS, Lambda, CloudWatch, RDS issues.

$ git log --oneline --stat

stars:1

forks:2

updated: February 23, 2026 at 06:12

SKILL.md

readonly

name	aws-troubleshoot
description	AWS service troubleshooting patterns. Use for EC2, ECS, Lambda, CloudWatch, RDS issues.
allowed-tools	Bash(aws , python )

AWS Troubleshooting Expertise

Investigation Methodology

Identify the AWS resource/service involved
Check resource status using describe functions
Review CloudWatch logs for errors
Check CloudWatch metrics for anomalies
Analyze configuration for misconfigurations
Synthesize and recommend

CloudWatch Logs Strategy

Partition First (CRITICAL)

Never dump all logs. Use aggregation queries first:

# Error rate over time
filter @message like /ERROR/
| stats count(*) as errors by bin(5m)

# Top error messages
filter @message like /Exception/
| stats count(*) as occurrences by @message
| sort occurrences desc
| limit 10

# Latency percentiles
stats pct(@duration, 50) as p50, pct(@duration, 99) as p99 by bin(5m)

# Unique error types
filter @message like /ERROR/
| parse @message /(?<error_type>[\w.]+Exception)/
| stats count(*) by error_type
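
The aggregation queries above can also be run programmatically. The following is a minimal sketch, assuming a client object that behaves like `boto3.client("logs")`; the helper name `run_insights_query` and the polling defaults are illustrative and not part of the skill.

```python
import time


def run_insights_query(logs_client, log_group, query, start_epoch_s, end_epoch_s,
                       poll_seconds=1.0, max_polls=60):
    """Run a Logs Insights query and return its rows as plain dicts.

    logs_client is assumed to expose start_query and get_query_results
    the way boto3.client("logs") does.
    """
    started = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_epoch_s,  # seconds since the epoch
        endTime=end_epoch_s,
        queryString=query,
    )
    query_id = started["queryId"]

    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} cells.
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not complete in time")


# The "error rate over time" query from above, as a single string.
ERROR_RATE_QUERY = "filter @message like /ERROR/ | stats count(*) as errors by bin(5m)"
```

Passing the client in as an argument keeps the helper testable without AWS credentials; in real use it would be called with `boto3.client("logs")`.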

Query Flow

Statistics first: Get error counts, distributions
Identify time windows: Find when errors spiked
Sample from spikes: Get specific examples
Compare to baseline: Query same period yesterday/last week
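
For the baseline comparison in the last step, the same window can be shifted back by one day and by one week. This is a small sketch; the function name `baseline_windows` and the dictionary keys are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def baseline_windows(spike_start, spike_end):
    """Return (start, end) epoch-second pairs for a spike and its baselines."""
    offsets = (
        ("spike", timedelta(0)),
        ("yesterday", timedelta(days=1)),
        ("last_week", timedelta(days=7)),
    )
    windows = {}
    for label, offset in offsets:
        # Shift both ends by the same offset so every window has equal length.
        windows[label] = (
            int((spike_start - offset).timestamp()),
            int((spike_end - offset).timestamp()),
        )
    return windows


# Example: a 15-minute spike window, using timezone-aware UTC datetimes.
example = baseline_windows(
    datetime(2026, 2, 23, 6, 0, tzinfo=timezone.utc),
    datetime(2026, 2, 23, 6, 15, tzinfo=timezone.utc),
)
```

Running the same statistics query over each window makes it easier to tell a real spike from normal daily or weekly variation.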

Service-Specific Patterns

EC2 Issues

Symptom	First Check	Typical Cause
Unreachable	`describe_ec2_instance`	Security group, stopped, status check failed
Performance	`get_cloudwatch_metrics` (CPUUtilization)	CPU exhaustion, network saturation
Disk full	`get_cloudwatch_metrics` (DiskSpaceUtilization; not a default EC2 metric, requires the CloudWatch agent or a custom metric)	Logs, temp files

Key CloudWatch metrics for EC2:

CPUUtilization
NetworkIn, NetworkOut
DiskReadOps, DiskWriteOps
StatusCheckFailed

Lambda Issues

Symptom	First Check	Typical Cause
Timeout	CloudWatch logs	External call slow, cold start, insufficient memory
Permission denied	CloudWatch logs	IAM role missing permissions
Memory error	CloudWatch metrics	Memory allocation too low
Cold starts	CloudWatch logs + metrics	Provisioned concurrency needed

Key CloudWatch metrics for Lambda:

Invocations
Duration
Errors
Throttles
ConcurrentExecutions

CloudWatch Insights for Lambda:

# Cold start analysis
filter @type = "REPORT"
| stats avg(@initDuration) as avg_cold_start,
        count(@initDuration) as cold_starts,
        count(*) as total_invocations
        by bin(5m)

# Timeout analysis
filter @message like /Task timed out/
| stats count(*) by bin(5m)
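
When sampled log lines are inspected outside Insights, the same fields can be pulled from a Lambda `REPORT` line. The sketch below assumes the usual tab-separated `REPORT` layout, in which `Init Duration` appears only on cold starts; the function name `parse_report_line` and the output keys are illustrative.

```python
import re

# "Duration" must not match the "Billed Duration" or "Init Duration" fields.
_PATTERNS = {
    "duration_ms": re.compile(r"(?<!Billed )(?<!Init )Duration: ([\d.]+) ms"),
    "init_duration_ms": re.compile(r"Init Duration: ([\d.]+) ms"),
    "memory_size_mb": re.compile(r"Memory Size: (\d+) MB"),
    "max_memory_used_mb": re.compile(r"Max Memory Used: (\d+) MB"),
}


def parse_report_line(line):
    """Extract timing and memory fields from one REPORT line, or return None."""
    if not line.startswith("REPORT"):
        return None
    parsed = {}
    for name, pattern in _PATTERNS.items():
        match = pattern.search(line)
        parsed[name] = float(match.group(1)) if match else None
    # An Init Duration field is only present when the invocation was a cold start.
    parsed["cold_start"] = parsed["init_duration_ms"] is not None
    return parsed
```

Comparing `max_memory_used_mb` against `memory_size_mb` is a quick way to check the "memory allocation too low" cause from the table above.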

ECS/Fargate Issues

Symptom	First Check	Typical Cause
Task failed	`list_ecs_tasks`	Container crash, resource limits, image pull
Service unhealthy	`list_ecs_tasks`	Health check failing, target group issues
Slow scaling	CloudWatch metrics	Insufficient capacity, service limits

Investigation flow:

list_ecs_tasks - See task status and health
Check stopped reason in task description
Review CloudWatch logs for the task
Check container insights metrics
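
The first two steps of this flow can be sketched as follows, assuming a client that behaves like `boto3.client("ecs")`. The helper name `stopped_task_reasons` and the shape of its return value are assumptions for illustration; the skill's own `list_ecs_tasks` tool may work differently.

```python
def stopped_task_reasons(ecs_client, cluster, service):
    """List stopped tasks of a service together with their stop reasons."""
    listed = ecs_client.list_tasks(
        cluster=cluster, serviceName=service, desiredStatus="STOPPED"
    )
    task_arns = listed.get("taskArns", [])
    if not task_arns:
        return []  # describe_tasks rejects an empty task list

    # describe_tasks accepts at most 100 tasks per call.
    described = ecs_client.describe_tasks(cluster=cluster, tasks=task_arns[:100])

    summaries = []
    for task in described.get("tasks", []):
        summaries.append({
            "task_arn": task["taskArn"],
            "stopped_reason": task.get("stoppedReason"),
            "containers": [
                {
                    "name": container.get("name"),
                    "exit_code": container.get("exitCode"),
                    "reason": container.get("reason"),
                }
                for container in task.get("containers", [])
            ],
        })
    return summaries
```

A non-zero container exit code points toward a crash inside the container, while a missing exit code with a populated reason often indicates a failure before start, such as an image pull problem.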

RDS Issues

Symptom	First Check	Typical Cause
Connection refused	`get_rds_instance_status`	Security group, stopped, maintenance
Slow queries	CloudWatch metrics	CPU, IOPS, connections
Storage full	CloudWatch metrics	Data growth, logs, snapshots

Key CloudWatch metrics for RDS:

CPUUtilization
DatabaseConnections
ReadIOPS, WriteIOPS
FreeStorageSpace
ReadLatency, WriteLatency

Common AWS Errors

Permission Errors

AccessDeniedException
UnauthorizedAccess

→ Check IAM role/policy attached to the service

Throttling

Throttling
Rate exceeded
TooManyRequestsException

→ Implement exponential backoff; request a limit increase if throttling persists
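
A minimal sketch of exponential backoff with full jitter is shown below. It detects throttling by looking for the three markers listed above in the exception text, which is a simplification; the function names and default delays are assumptions. Note that boto3 and the AWS CLI already retry throttled calls to some extent, so a wrapper like this is mainly useful when those built-in retries are exhausted.

```python
import random
import time

THROTTLE_MARKERS = ("Throttling", "Rate exceeded", "TooManyRequestsException")


def is_throttle(error):
    """Return True if the exception text contains a known throttling marker."""
    text = str(error)
    return any(marker in text for marker in THROTTLE_MARKERS)


def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=20.0,
                      sleep=time.sleep):
    """Call operation(), retrying throttling errors with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as error:
            # Re-raise anything that is not throttling, and the final failure.
            if not is_throttle(error) or attempt == max_attempts - 1:
                raise
            # Full jitter: wait a random time up to the capped exponential delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists so the wait can be replaced in tests; in normal use the default `time.sleep` applies.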

Resource Not Found

ResourceNotFoundException
NoSuchEntity

→ Verify resource name, region, account
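
The three error families above can be turned into a small lookup that maps an error message to a first action. This is a sketch only: the table covers just the codes named in this section, matching is by substring, and the name `classify_aws_error` is illustrative.

```python
# Category -> (markers to look for in the error text, suggested first action).
ERROR_PLAYBOOK = {
    "permission": (
        ("AccessDeniedException", "UnauthorizedAccess"),
        "Check the IAM role/policy attached to the service",
    ),
    "throttling": (
        ("Throttling", "Rate exceeded", "TooManyRequestsException"),
        "Back off exponentially; request a limit increase if it persists",
    ),
    "not_found": (
        ("ResourceNotFoundException", "NoSuchEntity"),
        "Verify resource name, region, and account",
    ),
}


def classify_aws_error(message):
    """Return (category, suggested_action) for an AWS error message."""
    for category, (markers, action) in ERROR_PLAYBOOK.items():
        if any(marker in message for marker in markers):
            return category, action
    return "unknown", None
```

Anything that falls into `unknown` still needs manual inspection; the lookup only shortcuts the common cases.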

Practical AWS CLI Commands

EC2

aws ec2 describe-instances --filters "Name=instance-state-name,Values=running" --query 'Reservations[].Instances[].{ID:InstanceId,Type:InstanceType,State:State.Name,Name:Tags[?Key==`Name`].Value|[0]}'

aws ec2 describe-instance-status --instance-ids <id>

ECS

aws ecs list-clusters

aws ecs list-services --cluster <cluster>

aws ecs describe-services --cluster <cluster> --services <service>

aws ecs list-tasks --cluster <cluster> --service-name <service> --desired-status STOPPED

CloudWatch Logs

aws logs describe-log-groups --log-group-name-prefix /ecs/

aws logs filter-log-events --log-group-name <group> --start-time <epoch-ms> --filter-pattern "ERROR"

aws logs start-query --log-group-name <group> --start-time <epoch-seconds> --end-time <epoch-seconds> --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 50'
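
The two commands above take time in different units: `filter-log-events` expects epoch milliseconds, while `start-query` expects epoch seconds. A small sketch for producing both from one lookback window follows; the function name `lookback_window` and the returned key names are assumptions.

```python
from datetime import datetime, timedelta, timezone


def lookback_window(minutes, now=None):
    """Return start/end times for the last N minutes in both epoch units."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    start_s, end_s = int(start.timestamp()), int(end.timestamp())
    return {
        # aws logs start-query: seconds since the epoch
        "start_query": {"start": start_s, "end": end_s},
        # aws logs filter-log-events: milliseconds since the epoch
        "filter_log_events": {"start": start_s * 1000, "end": end_s * 1000},
    }
```

Mixing the units up usually does not raise an error; it silently yields an empty or badly shifted result, which is easy to mistake for "no errors found".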

EKS

aws eks list-clusters

aws eks describe-cluster --name <cluster>

aws eks update-kubeconfig --name <cluster> --region <region>

RDS

aws rds describe-db-instances --query 'DBInstances[].{ID:DBInstanceIdentifier,Engine:Engine,Status:DBInstanceStatus,Class:DBInstanceClass}'

aws cloudwatch get-metric-statistics --namespace AWS/RDS --metric-name CPUUtilization --dimensions Name=DBInstanceIdentifier,Value=<id> --start-time <iso> --end-time <iso> --period 300 --statistics Average
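
The same metric lookup can be done from Python. The sketch below assumes a client that behaves like `boto3.client("cloudwatch")`; the helper name `average_metric` is illustrative. It sorts the result because `get_metric_statistics` does not guarantee that datapoints come back in time order.

```python
def average_metric(cloudwatch_client, namespace, metric, dimension_name,
                   dimension_value, start, end, period=300):
    """Return (timestamp, average) pairs for one metric, oldest first."""
    response = cloudwatch_client.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric,
        Dimensions=[{"Name": dimension_name, "Value": dimension_value}],
        StartTime=start,
        EndTime=end,
        Period=period,  # seconds per datapoint, matching --period 300 above
        Statistics=["Average"],
    )
    # Datapoints may arrive unordered, so sort them before analysis.
    points = sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
    return [(point["Timestamp"], point["Average"]) for point in points]
```

With a real client, the RDS example above corresponds to `average_metric(client, "AWS/RDS", "CPUUtilization", "DBInstanceIdentifier", "<id>", start, end)`, where `start` and `end` are datetime objects.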

related-skills.json

same repository

confluence-docs.md

from "incidentfox/self-learning-ai-agent"

Search and read Confluence documentation. Use when looking for internal docs, knowledge base articles, runbooks, or team documentation stored in Confluence.

2026-02-23, 1 star

database-postgresql.md

from "incidentfox/self-learning-ai-agent"

PostgreSQL database inspection and queries. Use when investigating table schemas, running queries, checking locks, replication status, or long-running queries.

2026-02-23, 1 star

deployment-correlation.md

from "incidentfox/self-learning-ai-agent"

Correlate incidents with recent deployments and code changes. Use when investigating if a deployment caused an issue, finding what changed, or identifying the commit that introduced a bug.

2026-02-23, 1 star

github-code.md

from "incidentfox/self-learning-ai-agent"

GitHub code search, file reading, PR review, branch/file management, and commit operations. Use when you need to search code patterns, read repository files, review pull requests, create branches, commit files, or open PRs.

2026-02-23, 1 star

kubernetes-debug.md

from "incidentfox/self-learning-ai-agent"

Kubernetes debugging methodology and scripts. Use for pod crashes, CrashLoopBackOff, OOMKilled, deployment issues, resource problems, or container failures.

2026-02-23, 1 star

jira-tasks.md

from "incidentfox/self-learning-ai-agent"

Jira issue tracking and project management. Use for creating, searching, updating, and commenting on Jira issues. Supports JQL queries for advanced searching.

2026-02-23, 1 star

package.json

"author": "incidentfox"

"repository": "incidentfox/self-learning-ai-agent"

$ useful --forSOC

Network and Computer Systems Administrators, Computer and Mathematical Occupations, 15-1244, L4
