aws-troubleshoot
AWS service troubleshooting patterns. Use for EC2, ECS, Lambda, CloudWatch, RDS issues.
| Field | Value |
|---|---|
| name | aws-troubleshoot |
| description | AWS service troubleshooting patterns. Use for EC2, ECS, Lambda, CloudWatch, RDS issues. |
| allowed-tools | Bash(aws *, python *) |
Never dump all logs. Use aggregation queries first:
# Error rate over time
filter @message like /ERROR/
| stats count(*) as errors by bin(5m)
# Top error messages
filter @message like /Exception/
| stats count(*) as occurrences by @message
| sort occurrences desc
| limit 10
# Latency percentiles
stats pct(@duration, 50) as p50, pct(@duration, 99) as p99 by bin(5m)
# Unique error types
filter @message like /ERROR/
| parse @message /(?<error_type>[\w.]+Exception)/
| stats count(*) by error_type
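Before running the "unique error types" query against a large log group, it can help to check the pattern on a handful of lines locally. The sketch below mirrors the `filter` and `parse` steps in Python; the function name `count_error_types` and the sample lines are hypothetical and only illustrate the idea.

```python
import re
from collections import Counter

# Same idea as the Insights `parse` step: a dotted name ending in "Exception".
ERROR_TYPE = re.compile(r"(?P<error_type>[\w.]+Exception)")


def count_error_types(lines):
    """Count exception types in lines containing ERROR (mirrors filter + parse + stats)."""
    counts = Counter()
    for line in lines:
        if "ERROR" not in line:
            continue  # equivalent of: filter @message like /ERROR/
        match = ERROR_TYPE.search(line)
        if match:
            counts[match.group("error_type")] += 1
    return counts


# Hypothetical sample lines, not taken from a real log group.
sample = [
    "2024-01-01 ERROR java.lang.NullPointerException at Foo.bar",
    "2024-01-01 INFO request ok",
    "2024-01-01 ERROR java.io.IOException: broken pipe",
    "2024-01-01 ERROR java.lang.NullPointerException at Baz.qux",
]

if __name__ == "__main__":
    for error_type, count in count_error_types(sample).most_common():
        print(error_type, count)
```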
EC2 issues:
| Symptom | First Check | Typical Cause |
|---|---|---|
| Unreachable | describe_ec2_instance | Security group, stopped, status check failed |
| Performance | get_cloudwatch_metrics (CPUUtilization) | CPU exhaustion, network saturation |
| Disk full | get_cloudwatch_metrics (DiskSpaceUtilization) | Logs, temp files |
Key CloudWatch metrics for EC2:
Lambda issues:
| Symptom | First Check | Typical Cause |
|---|---|---|
| Timeout | CloudWatch logs | External call slow, cold start, insufficient memory |
| Permission denied | CloudWatch logs | IAM role missing permissions |
| Memory error | CloudWatch metrics | Memory allocation too low |
| Cold starts | CloudWatch logs + metrics | Provisioned concurrency needed |
Key CloudWatch metrics for Lambda:
CloudWatch Insights for Lambda:
# Cold start analysis
filter @type = "REPORT"
| stats avg(@initDuration) as avg_cold_start,
count(@initDuration) as cold_starts,
count(*) as total_invocations
by bin(5m)
# Timeout analysis
filter @message like /Task timed out/
| stats count(*) by bin(5m)
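The cold start query relies on the fact that Lambda only writes `Init Duration` into the `REPORT` line of invocations that initialized a new execution environment. A minimal Python sketch of the same calculation over exported log lines follows; `cold_start_stats` is a hypothetical helper, and it assumes the standard `REPORT` line layout.

```python
import re

# "Init Duration" appears only in REPORT lines of cold-start invocations.
INIT_DURATION = re.compile(r"Init Duration: ([\d.]+) ms")


def cold_start_stats(lines):
    """Return invocation count, cold start count, and mean init duration in ms."""
    reports = [line for line in lines if line.startswith("REPORT")]
    init_durations = []
    for line in reports:
        match = INIT_DURATION.search(line)
        if match:
            init_durations.append(float(match.group(1)))
    average = sum(init_durations) / len(init_durations) if init_durations else None
    return {
        "total_invocations": len(reports),
        "cold_starts": len(init_durations),
        "avg_cold_start_ms": average,
    }
```

A high ratio of `cold_starts` to `total_invocations` in a time window is the signal that the "Provisioned concurrency needed" row in the table refers to.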
ECS issues:
| Symptom | First Check | Typical Cause |
|---|---|---|
| Task failed | list_ecs_tasks | Container crash, resource limits, image pull |
| Service unhealthy | list_ecs_tasks | Health check failing, target group issues |
| Slow scaling | CloudWatch metrics | Insufficient capacity, service limits |
Investigation flow:
list_ecs_tasks - See task status and health
RDS issues:
| Symptom | First Check | Typical Cause |
|---|---|---|
| Connection refused | get_rds_instance_status | Security group, stopped, maintenance |
| Slow queries | CloudWatch metrics | CPU, IOPS, connections |
| Storage full | CloudWatch metrics | Data growth, logs, snapshots |
Key CloudWatch metrics for RDS:
AccessDeniedException
UnauthorizedAccess
→ Check IAM role/policy attached to the service
Throttling
Rate exceeded
TooManyRequestsException
→ Retry with exponential backoff; if throttling persists, request a service quota increase
ResourceNotFoundException
NoSuchEntity
→ Verify resource name, region, account
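For the throttling pattern above, a retry wrapper with exponential backoff might look like the following Python sketch. The names `is_throttled`, `backoff_delays`, and `call_with_backoff` are hypothetical, matching on error text is a simplification, and the AWS SDKs and CLI already retry throttled calls on their own, so a wrapper like this mainly applies to custom scripts.

```python
import random
import time

# Substrings taken from the throttling pattern listed above.
THROTTLE_MARKERS = ("Throttling", "Rate exceeded", "TooManyRequestsException")


def is_throttled(error_text):
    """True if the error text looks like a throttling response."""
    return any(marker in error_text for marker in THROTTLE_MARKERS)


def backoff_delays(retries, base=0.5, cap=30.0):
    """Upper bound of each retry's sleep: base * 2**attempt, capped at `cap` seconds."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def call_with_backoff(fn, retries=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Call fn(), retrying throttling errors with full-jitter exponential backoff."""
    for limit in backoff_delays(retries, base, cap):
        try:
            return fn()
        except Exception as exc:
            if not is_throttled(str(exc)):
                raise  # not a throttling error: surface it immediately
            sleep(random.uniform(0, limit))  # full jitter within the current bound
    return fn()  # final attempt; any error propagates to the caller
```

The `sleep` parameter exists so that the wrapper can be exercised without real waiting, for example by passing `sleep=lambda seconds: None`.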
aws ec2 describe-instances --filters "Name=instance-state-name,Values=running" --query 'Reservations[].Instances[].{ID:InstanceId,Type:InstanceType,State:State.Name,Name:Tags[?Key==`Name`].Value|[0]}'
aws ec2 describe-instance-status --instance-ids <id>
aws ecs list-clusters
aws ecs list-services --cluster <cluster>
aws ecs describe-services --cluster <cluster> --services <service>
aws ecs list-tasks --cluster <cluster> --service-name <service> --desired-status STOPPED
aws logs describe-log-groups --log-group-name-prefix /ecs/
aws logs filter-log-events --log-group-name <group> --start-time <epoch-ms> --filter-pattern "ERROR"
aws logs start-query --log-group-name <group> --start-time <epoch-seconds> --end-time <epoch-seconds> --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 50'
aws logs get-query-results --query-id <query-id>
aws eks list-clusters
aws eks describe-cluster --name <cluster>
aws eks update-kubeconfig --name <cluster> --region <region>
aws rds describe-db-instances --query 'DBInstances[].{ID:DBInstanceIdentifier,Engine:Engine,Status:DBInstanceStatus,Class:DBInstanceClass}'
aws cloudwatch get-metric-statistics --namespace AWS/RDS --metric-name CPUUtilization --dimensions Name=DBInstanceIdentifier,Value=<id> --start-time <iso> --end-time <iso> --period 300 --statistics Average
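The commands above expect three different time formats: `filter-log-events` takes epoch milliseconds, `start-query` takes epoch seconds, and `get-metric-statistics` accepts ISO 8601 timestamps. A small Python helper can produce all three for the same window; `time_window` is a hypothetical name and the sketch assumes UTC.

```python
from datetime import datetime, timedelta, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def time_window(minutes, now=None):
    """Return (start, end) pairs for the last `minutes` minutes in three formats."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    return {
        # aws logs start-query --start-time / --end-time
        "epoch_s": (int(start.timestamp()), int(end.timestamp())),
        # aws logs filter-log-events --start-time
        "epoch_ms": (int(start.timestamp() * 1000), int(end.timestamp() * 1000)),
        # aws cloudwatch get-metric-statistics --start-time / --end-time
        "iso": (start.strftime(ISO_FORMAT), end.strftime(ISO_FORMAT)),
    }


if __name__ == "__main__":
    window = time_window(60)
    print("start-query:", *window["epoch_s"])
    print("filter-log-events:", *window["epoch_ms"])
    print("get-metric-statistics:", *window["iso"])
```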