aws-investigation

Name: Aws Investigation
Author: mock-server

// Investigates AWS infrastructure issues affecting Buildkite build agents (EC2, AutoScaling, Lambda). Returns structured JSON to the parent for formatting. Triggers when users ask about build agents not running, EC2 issues, ASG scaling problems, or infrastructure health.

updated: 14 May 2026 at 09:58


package.json

"author": "mock-server"

"repository": "mock-server/mockserver-monorepo"


name	aws-investigation
description	Investigates AWS infrastructure issues affecting Buildkite build agents (EC2, AutoScaling, Lambda). Returns structured JSON to the parent for formatting. Triggers when users ask about build agents not running, EC2 issues, ASG scaling problems, or infrastructure health.

AWS Infrastructure Investigation

Investigate AWS infrastructure issues affecting Buildkite build agents. Covers EC2 instances, AutoScaling Groups, the autoscaling Lambda, and related resources.

Prerequisites

AWS CLI installed (brew install awscli)
AWS SSO profile mockserver-build configured (SSO region: eu-west-2)
Active SSO session: aws sso login --profile mockserver-build
Corporate TLS proxy (if applicable): export AWS_CA_BUNDLE=$NODE_EXTRA_CA_CERTS (only if NODE_EXTRA_CA_CERTS is set)
macOS + Python 3.14 + Homebrew: if you get pyexpat symbol errors, export DYLD_LIBRARY_PATH=/opt/homebrew/opt/expat/lib

Infrastructure Overview

There are two build agent stacks. Investigate the current stack first; fall back to the legacy stack only if the current one has not been provisioned yet.

Current: Terraform-managed (eu-west-2)

Managed by terraform/buildkite-agents/ using the official Buildkite Elastic CI Stack module.

| Property | Value |
| --- | --- |
| Region | `eu-west-2` |
| Instance type | Read from `terraform/buildkite-agents/terraform.tfvars` (`instance_types`) |
| Scaling | Read from Terraform variables (`min_size`, `max_size`, `on_demand_percentage`) |
| Scaler version | `buildkite-agent-scaler` v1.11.2 |
| Scaler runtime | `provided.al2023` |
| Queue | `default` |
| IaC | `terraform/buildkite-agents/` |

Resource names are generated by Terraform with a random suffix. To find them:

# Get ASG name from Terraform state
cd terraform/buildkite-agents
terraform output auto_scaling_group_name

# Or find ASGs with the Buildkite tag (the nested filter skips ASGs that have no Stack tag
# instead of failing on them)
aws autoscaling describe-auto-scaling-groups \
  --region eu-west-2 --profile mockserver-build \
  --query 'AutoScalingGroups[?Tags[?Key==`Stack` && contains(Value, `buildkite-mockserver`)]].{Name:AutoScalingGroupName,Desired:DesiredCapacity,Min:MinSize,Max:MaxSize,Instances:Instances[*].{ID:InstanceId,State:LifecycleState}}'

Legacy: CloudFormation-managed (us-east-1)

Being replaced by the Terraform stack above. May still be active during migration.

| Resource | Identifier | Region |
| --- | --- | --- |
| AutoScaling Group | `buildkite-AgentAutoScaleGroup-VGG28FR0DE6Q` | `us-east-1` |
| CloudFormation Stack | `buildkite` | `us-east-1` |
| Instance Type | Inspect live ASG launch template via AWS CLI | `us-east-1` |
| Autoscaling Lambda | Use discovery query below (name generated by CloudFormation) | `us-east-1` |

AWS CLI Prefix

All commands require --region and --profile flags:

# Current stack (eu-west-2)
aws ... --region eu-west-2 --profile mockserver-build

# Legacy stack (us-east-1)
aws ... --region us-east-1 --profile mockserver-build
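Because every command repeats the same two flags, a small wrapper can reduce copy-paste mistakes. The following is only a sketch: `stack_flags` is a hypothetical helper name that does not exist in the repository, and it does nothing more than print the flags listed above for each stack.

```shell
# Hypothetical helper (not part of the repository): prints the --region and
# --profile flags for the requested stack so they can be appended to a command.
stack_flags() {
  case "$1" in
    current) printf '%s\n' "--region eu-west-2 --profile mockserver-build" ;;
    legacy)  printf '%s\n' "--region us-east-1 --profile mockserver-build" ;;
    *)       echo "unknown stack: $1 (expected: current | legacy)" >&2; return 1 ;;
  esac
}

# Example (needs an active SSO session). The unquoted expansion is deliberate
# so that the flags split into separate arguments:
# aws autoscaling describe-auto-scaling-groups $(stack_flags current)
```

An unknown stack name returns a non-zero status, so a typo fails loudly rather than silently querying the wrong region.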

Investigation Workflow

Step 1: Determine Active Stack

Check which stack is currently running agents:

# Check current stack (eu-west-2): look for ASGs tagged with buildkite-mockserver
# (the nested filter skips ASGs that have no Stack tag instead of failing on them)
aws autoscaling describe-auto-scaling-groups \
  --region eu-west-2 --profile mockserver-build \
  --query 'AutoScalingGroups[?Tags[?Key==`Stack` && contains(Value, `buildkite-mockserver`)]].{Name:AutoScalingGroupName,Desired:DesiredCapacity,Instances:Instances[*].{ID:InstanceId,State:LifecycleState}}'

# Check legacy stack (us-east-1)
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "buildkite-AgentAutoScaleGroup-VGG28FR0DE6Q" \
  --region us-east-1 --profile mockserver-build \
  --query 'AutoScalingGroups[0].{Name:AutoScalingGroupName,Desired:DesiredCapacity,Instances:Instances[*].{ID:InstanceId,State:LifecycleState}}'

Use whichever stack has instances (or non-zero desired capacity) for the remaining steps. Substitute the correct --region and ASG name accordingly.
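The selection rule can be written down as a small decision function. This is a sketch under assumptions: `pick_stack` is a hypothetical name, the four numbers are read by hand from the two query results, and preferring the current stack when both show activity follows the "investigate the current stack first" guidance rather than anything enforced by the repository. The returned labels match the `active_stack` values used in the output schema.

```shell
# Hypothetical helper: decides which stack to investigate.
# Args: desired capacity and instance count for the current stack,
# then the same pair for the legacy stack.
pick_stack() {
  cur_desired=$1; cur_instances=$2; leg_desired=$3; leg_instances=$4
  if [ "$cur_desired" -gt 0 ] || [ "$cur_instances" -gt 0 ]; then
    echo "terraform-eu-west-2"
  elif [ "$leg_desired" -gt 0 ] || [ "$leg_instances" -gt 0 ]; then
    echo "legacy-us-east-1"
  else
    # Neither stack shows activity; this sketch leaves the choice to the reader.
    echo "undetermined"
  fi
}

pick_stack 0 0 2 2   # prints legacy-us-east-1
```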

Step 2: Quick Health Check

aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "<ASG_NAME>" \
  --region <REGION> --profile mockserver-build \
  --query 'AutoScalingGroups[0].{Desired:DesiredCapacity,Min:MinSize,Max:MaxSize,Instances:Instances[*].{ID:InstanceId,State:LifecycleState,Health:HealthStatus}}'

Expected healthy state:

If queue is empty: Desired = 0 can be healthy (scale-to-zero)
If queue has pending jobs: desired capacity should increase above 0 within 1-2 scaler intervals
Active instances should be InService and Healthy

Problem indicators:

Desired: 0 while jobs are pending, so no agents are being requested (scaler not seeing jobs, or Lambda not running)
Desired > 0 but no instances — launch failures
Instances in Pending for >5 min — launch issues
Instances Unhealthy — failing health checks
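As an illustration, the healthy and problem states can be folded into one triage function. This is a sketch, not the skill's own logic: `classify_asg` is a hypothetical name, the pending-job count has to come from Buildkite rather than from the AWS query, and the "Pending for >5 min" indicator is left out because it needs timestamps.

```shell
# Hypothetical helper: maps the numbers from the health check to a verdict.
# Args: desired capacity, instance count, pending Buildkite jobs, unhealthy instances.
classify_asg() {
  desired=$1; instances=$2; pending_jobs=$3; unhealthy=$4
  if [ "$desired" -eq 0 ] && [ "$pending_jobs" -eq 0 ]; then
    echo "healthy: scaled to zero, queue empty"
  elif [ "$desired" -eq 0 ]; then
    # Only a problem if it persists beyond 1-2 scaler intervals.
    echo "check: jobs pending but desired capacity is 0"
  elif [ "$instances" -eq 0 ]; then
    echo "problem: desired > 0 but no instances (launch failures)"
  elif [ "$unhealthy" -gt 0 ]; then
    echo "problem: unhealthy instances (failing health checks)"
  else
    echo "healthy: instances present and none unhealthy"
  fi
}

classify_asg 2 0 5 0   # prints problem: desired > 0 but no instances (launch failures)
```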

Step 3: Check EC2 Instance Status

aws ec2 describe-instances \
  --filters "Name=tag:aws:autoscaling:groupName,Values=<ASG_NAME>" \
  --region <REGION> --profile mockserver-build \
  --query 'Reservations[].Instances[].{ID:InstanceId,State:State.Name,Type:InstanceType,Launch:LaunchTime,AZ:Placement.AvailabilityZone}'

For running instances, check system/instance status:

aws ec2 describe-instance-status \
  --instance-ids <instance-id-1> <instance-id-2> \
  --region <REGION> --profile mockserver-build

Step 4: Check Scaling Activities

aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name "<ASG_NAME>" \
  --region <REGION> --profile mockserver-build \
  --max-items 10

Look for:

"user request explicitly set group desired capacity" — the Lambda scaler adjusted capacity
"an instance was taken out of service" — scale-in event
Failed status codes — launch failures (AMI issues, capacity, subnet exhaustion)

Step 5: Check the Autoscaling Lambda

Find the scaler Lambda by listing functions with a Buildkite-related name:

aws lambda list-functions \
  --region <REGION> --profile mockserver-build \
  --query 'Functions[?contains(FunctionName, `buildkite`) && (contains(FunctionName, `scaler`) || contains(FunctionName, `caling`))].{Name:FunctionName,Runtime:Runtime,State:State,LastModified:LastModified}'

Then check its logs:

# Recent invocations (last hour)
aws logs filter-log-events \
  --log-group-name "/aws/lambda/<LAMBDA_FUNCTION_NAME>" \
  --region <REGION> --profile mockserver-build \
  --start-time $(python3 -c "import time; print(int((time.time() - 3600) * 1000))") \
  --limit 20

# Error logs (last hour)
aws logs filter-log-events \
  --log-group-name "/aws/lambda/<LAMBDA_FUNCTION_NAME>" \
  --region <REGION> --profile mockserver-build \
  --start-time $(python3 -c "import time; print(int((time.time() - 3600) * 1000))") \
  --filter-pattern "ERROR" \
  --limit 10
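The `--start-time` value is a Unix timestamp in milliseconds. If Python is unavailable, the same number can be produced with shell arithmetic. The helper below is a sketch: `epoch_ms_ago` is a hypothetical name, and it assumes a `date` that supports `+%s` (GNU and BSD/macOS both do, although it is not strictly POSIX).

```shell
# Hypothetical helper: milliseconds since the Unix epoch for a point N seconds
# in the past. The optional second argument overrides "now" (in seconds),
# which makes the result reproducible.
epoch_ms_ago() {
  seconds_back=$1
  now=${2:-$(date +%s)}
  echo $(( (now - seconds_back) * 1000 ))
}

epoch_ms_ago 3600 1700000000   # prints 1699996400000
```

With this in place, `--start-time "$(epoch_ms_ago 3600)"` would select the last hour of log events.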

Step 6: Check EC2 Console Output

For instances that are running but not registering as Buildkite agents:

aws ec2 get-console-output \
  --instance-id <instance-id> \
  --region <REGION> --profile mockserver-build \
  --query 'Output' --output text

Step 7: Check Suspended ASG Processes

aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "<ASG_NAME>" \
  --region <REGION> --profile mockserver-build \
  --query 'AutoScalingGroups[0].SuspendedProcesses'

Note: AZRebalance is intentionally suspended to prevent killing running builds. Other suspended processes may indicate problems.

Failure Patterns

| Symptom | Likely Cause | Investigation |
| --- | --- | --- |
| ASG desired=0, no instances | No Buildkite jobs pending, or Lambda not invoking | Check Step 5 (Lambda logs) |
| ASG desired>0, no instances launching | Launch template issue, AMI missing, capacity error | Check Step 4 (scaling activities for errors) |
| Instances running but builds stuck | Buildkite agent not starting on instance, token issue | Check Step 6 (console output) |
| Lambda not invoking | EventBridge rule disabled | Check Step 5 (Lambda and EventBridge) |
| Lambda invoking but not scaling | Buildkite API auth failure (expired token) | Check Step 5 (Lambda error logs) |
| Instances cycle rapidly (launch/terminate) | Health check failures, instance crashing on boot | Check Steps 3, 4, 6 |
| Agents run briefly then terminate | Normal: `MIN_SIZE=0`, scaler scales down when jobs finish | Not a bug |

Emergency: Manually Scale Up Agents

If the Lambda is broken and you need agents immediately:

aws autoscaling set-desired-capacity \
  --auto-scaling-group-name "<ASG_NAME>" \
  --desired-capacity <TEMP_CAPACITY_LEQ_MAX_SIZE> \
  --region <REGION> --profile mockserver-build

Choose a temporary capacity that does not exceed the ASG MaxSize from Step 2.

Warning: The Lambda scaler may override this on its next invocation if it sees no pending jobs.
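To guard against requesting more than the group allows, the capacity can be capped before it is passed to the command. A minimal sketch, assuming the MaxSize value has already been read in Step 2; `clamp_capacity` is a hypothetical name, not something defined in the repository.

```shell
# Hypothetical helper: caps a requested capacity at the ASG MaxSize.
# Args: requested capacity, MaxSize.
clamp_capacity() {
  requested=$1; max_size=$2
  if [ "$requested" -gt "$max_size" ]; then
    echo "$max_size"
  else
    echo "$requested"
  fi
}

clamp_capacity 8 5   # prints 5
```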

Output — Structured Data Return

Return this structure in your final message:

{
  "schema": "aws-investigation/v1",
  "timestamp": "<ISO8601>",
  "active_stack": "terraform-eu-west-2 | legacy-us-east-1",
  "asg": {
    "name": "<ASG name>",
    "region": "<region>",
    "desired_capacity": 0,
    "min_size": 0,
    "max_size": "<max_size>",
    "instances": [
      {
        "instance_id": "<id>",
        "state": "InService|Pending|Terminating",
        "health": "Healthy|Unhealthy",
        "availability_zone": "<az>"
      }
    ],
    "suspended_processes": ["<process names>"]
  },
  "lambda": {
    "function_name": "<name>",
    "state": "Active|Inactive",
    "runtime": "<runtime>",
    "recent_errors": ["<error messages>"],
    "last_invocation": "<ISO8601 or null>"
  },
  "root_cause": {
    "summary": "<one-line description>",
    "detail": "<technical explanation>",
    "category": "<category from failure patterns>",
    "evidence": "<relevant log lines or CLI output>"
  },
  "recommended_fix": "<actionable steps>",
  "warnings": ["<deprecation notices, capacity concerns, etc.>"]
}

After returning the JSON, provide a brief summary (2-3 lines).

Notes

Always run Step 1 first to determine which stack is active
The Lambda scaler logs are the most valuable data source for understanding scaling decisions
Always check if the Buildkite agent token is still valid if agents start but don't register
