| name | aws-investigation |
| description | Investigates AWS infrastructure issues affecting Buildkite build agents (EC2, AutoScaling, Lambda). Returns structured JSON to the parent for formatting. Triggers when users ask about build agents not running, EC2 issues, ASG scaling problems, or infrastructure health. |
AWS Infrastructure Investigation
Investigate AWS infrastructure issues affecting Buildkite build agents. Covers EC2 instances, AutoScaling Groups, the autoscaling Lambda, and related resources.
Prerequisites
- AWS CLI installed (brew install awscli)
- AWS SSO profile mockserver-build configured (SSO region: eu-west-2)
- Active SSO session: aws sso login --profile mockserver-build
- Corporate TLS proxy (if applicable): export AWS_CA_BUNDLE=$NODE_EXTRA_CA_CERTS (only if NODE_EXTRA_CA_CERTS is set)
- macOS + Python 3.14 + Homebrew: if you get pyexpat symbol errors, export DYLD_LIBRARY_PATH=/opt/homebrew/opt/expat/lib
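The corporate proxy prerequisite is conditional, so it can be wrapped in a small guard rather than exported unconditionally. A minimal sketch, assuming a bash-compatible shell; the function name configure_ca_bundle is hypothetical and not part of this repository:

```shell
# Hypothetical helper: export AWS_CA_BUNDLE only when NODE_EXTRA_CA_CERTS is set.
# Leaves the environment untouched on machines without a corporate TLS proxy.
configure_ca_bundle() {
  if [ -n "${NODE_EXTRA_CA_CERTS:-}" ]; then
    export AWS_CA_BUNDLE="$NODE_EXTRA_CA_CERTS"
  fi
}

configure_ca_bundle
```

Running this before the first aws command keeps the proxy setting out of sessions that do not need it.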
Infrastructure Overview
There are two build agent stacks. Investigate the current stack first; fall back to the legacy stack only if the current one has not been provisioned yet.
Current: Terraform-managed (eu-west-2)
Managed by terraform/buildkite-agents/ using the official Buildkite Elastic CI Stack module.
| Property | Value |
|---|---|
| Region | eu-west-2 |
| Instance type | Read from terraform/buildkite-agents/terraform.tfvars (instance_types) |
| Scaling | Read from Terraform variables (min_size, max_size, on_demand_percentage) |
| Scaler version | buildkite-agent-scaler v1.11.2 |
| Scaler runtime | provided.al2023 |
| Queue | default |
| IaC | terraform/buildkite-agents/ |
Resource names are generated by Terraform with a random suffix. To find them:
cd terraform/buildkite-agents
terraform output auto_scaling_group_name
aws autoscaling describe-auto-scaling-groups \
--region eu-west-2 --profile mockserver-build \
--query 'AutoScalingGroups[?contains(Tags[?Key==`Stack`].Value | [0], `buildkite-mockserver`)].{Name:AutoScalingGroupName,Desired:DesiredCapacity,Min:MinSize,Max:MaxSize,Instances:Instances[*].{ID:InstanceId,State:LifecycleState}}'
Legacy: CloudFormation-managed (us-east-1)
Being replaced by the Terraform stack above. May still be active during migration.
| Resource | Identifier | Region |
|---|---|---|
| AutoScaling Group | buildkite-AgentAutoScaleGroup-VGG28FR0DE6Q | us-east-1 |
| CloudFormation Stack | buildkite | us-east-1 |
| Instance Type | Inspect live ASG launch template via AWS CLI | us-east-1 |
| Autoscaling Lambda | Use discovery query below (name generated by CloudFormation) | us-east-1 |
AWS CLI Prefix
All commands require --region and --profile flags:
aws ... --region eu-west-2 --profile mockserver-build
aws ... --region us-east-1 --profile mockserver-build
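Because every command needs the same two flags, a thin wrapper can reduce the chance of querying the wrong region or profile. The sketch below is one possible shape, assuming a bash-compatible shell; the name bk_aws is hypothetical:

```shell
# Hypothetical wrapper: bk_aws <region> <aws args...>
# Appends the region and the mockserver-build profile to every call.
bk_aws() {
  local region="$1"
  shift
  aws "$@" --region "$region" --profile mockserver-build
}

# Usage (requires an active SSO session):
#   bk_aws eu-west-2 autoscaling describe-auto-scaling-groups   # current stack
#   bk_aws us-east-1 autoscaling describe-auto-scaling-groups   # legacy stack
```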
Investigation Workflow
Step 1: Determine Active Stack
Check which stack is currently running agents:
aws autoscaling describe-auto-scaling-groups \
--region eu-west-2 --profile mockserver-build \
--query 'AutoScalingGroups[?contains(Tags[?Key==`Stack`].Value | [0], `buildkite-mockserver`)].{Name:AutoScalingGroupName,Desired:DesiredCapacity,Instances:Instances[*].{ID:InstanceId,State:LifecycleState}}'
aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names "buildkite-AgentAutoScaleGroup-VGG28FR0DE6Q" \
--region us-east-1 --profile mockserver-build \
--query 'AutoScalingGroups[0].{Name:AutoScalingGroupName,Desired:DesiredCapacity,Instances:Instances[*].{ID:InstanceId,State:LifecycleState}}'
Use whichever stack has instances (or non-zero desired capacity) for the remaining steps. Substitute the correct --region and ASG name accordingly.
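The selection rule can be written down as a small decision function. This is a sketch only: pick_active_stack is a hypothetical name, and the four arguments are the desired capacity and instance count read from the two queries in this step.

```shell
# Hypothetical helper:
#   pick_active_stack <tf_desired> <tf_instances> <legacy_desired> <legacy_instances>
# Prints which stack to investigate, preferring the current Terraform stack.
pick_active_stack() {
  local tf_desired="$1"
  local tf_instances="$2"
  local legacy_desired="$3"
  local legacy_instances="$4"
  if [ "$tf_desired" -gt 0 ] || [ "$tf_instances" -gt 0 ]; then
    echo "terraform-eu-west-2"
  elif [ "$legacy_desired" -gt 0 ] || [ "$legacy_instances" -gt 0 ]; then
    echo "legacy-us-east-1"
  else
    # Both stacks are idle; this can be normal with scale-to-zero.
    echo "none"
  fi
}
```

When the result is none, the guidance in Infrastructure Overview still applies: start with the current stack and fall back to the legacy one only if the current stack has not been provisioned.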
Step 2: Quick Health Check
aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names "<ASG_NAME>" \
--region <REGION> --profile mockserver-build \
--query 'AutoScalingGroups[0].{Desired:DesiredCapacity,Min:MinSize,Max:MaxSize,Instances:Instances[*].{ID:InstanceId,State:LifecycleState,Health:HealthStatus}}'
Expected healthy state:
- If queue is empty: Desired = 0 can be healthy (scale-to-zero)
- If queue has pending jobs: desired capacity should increase above 0 within 1-2 scaler intervals
- Active instances should be InService and Healthy
Problem indicators:
- Desired = 0 while jobs are pending: no agents requested (scaler not seeing jobs, or Lambda not running)
- Desired > 0 but no instances: launch failures
- Instances in Pending for >5 min: launch issues
- Instances Unhealthy: failing health checks
Step 3: Check EC2 Instance Status
aws ec2 describe-instances \
--filters "Name=tag:aws:autoscaling:groupName,Values=<ASG_NAME>" \
--region <REGION> --profile mockserver-build \
--query 'Reservations[].Instances[].{ID:InstanceId,State:State.Name,Type:InstanceType,Launch:LaunchTime,AZ:Placement.AvailabilityZone}'
For running instances, check system/instance status:
aws ec2 describe-instance-status \
--instance-ids <instance-id-1> <instance-id-2> \
--region <REGION> --profile mockserver-build
Step 4: Check Scaling Activities
aws autoscaling describe-scaling-activities \
--auto-scaling-group-name "<ASG_NAME>" \
--region <REGION> --profile mockserver-build \
--max-items 10
Look for:
- "user request explicitly set group desired capacity": the Lambda scaler adjusted capacity
- "an instance was taken out of service": scale-in event
- Failed status codes: launch failures (AMI issues, capacity, subnet exhaustion)
Step 5: Check the Autoscaling Lambda
Find the scaler Lambda by listing functions with a Buildkite-related name:
aws lambda list-functions \
--region <REGION> --profile mockserver-build \
--query 'Functions[?contains(FunctionName, `buildkite`) && (contains(FunctionName, `scaler`) || contains(FunctionName, `caling`))].{Name:FunctionName,Runtime:Runtime,State:State,LastModified:LastModified}'
Then check its logs:
aws logs filter-log-events \
--log-group-name "/aws/lambda/<LAMBDA_FUNCTION_NAME>" \
--region <REGION> --profile mockserver-build \
--start-time $(python3 -c "import time; print(int((time.time() - 3600) * 1000))") \
--limit 20
aws logs filter-log-events \
--log-group-name "/aws/lambda/<LAMBDA_FUNCTION_NAME>" \
--region <REGION> --profile mockserver-build \
--start-time $(python3 -c "import time; print(int((time.time() - 3600) * 1000))") \
--filter-pattern "ERROR" \
--limit 10
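The --start-time value in both commands is an epoch timestamp in milliseconds. If python3 is unavailable or broken (see the pyexpat note in Prerequisites), the same value can be computed in the shell. A minimal sketch, assuming a date command that supports +%s; the helper name ms_ago is hypothetical:

```shell
# Hypothetical helper: ms_ago <seconds>
# Prints the epoch time <seconds> ago, in milliseconds.
ms_ago() {
  echo $(( ($(date +%s) - $1) * 1000 ))
}

# Usage: --start-time "$(ms_ago 3600)"    # last hour
#        --start-time "$(ms_ago 86400)"   # last 24 hours
```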
Step 6: Check EC2 Console Output
For instances that are running but not registering as Buildkite agents:
aws ec2 get-console-output \
--instance-id <instance-id> \
--region <REGION> --profile mockserver-build \
--query 'Output' --output text
Step 7: Check Suspended ASG Processes
aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names "<ASG_NAME>" \
--region <REGION> --profile mockserver-build \
--query 'AutoScalingGroups[0].SuspendedProcesses'
Note: AZRebalance is intentionally suspended to prevent killing running builds. Other suspended processes may indicate problems.
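To separate the expected suspension from unexpected ones, the process names can be filtered. The sketch below assumes the names are passed as arguments, for example from the same command with the query extended to AutoScalingGroups[0].SuspendedProcesses[].ProcessName and --output text; unexpected_suspended is a hypothetical name.

```shell
# Hypothetical helper: unexpected_suspended <process names...>
# Prints every suspended process except AZRebalance, one per line.
# Empty output means only the intentional suspension (or none) is present.
unexpected_suspended() {
  local name
  for name in "$@"; do
    if [ "$name" != "AZRebalance" ]; then
      echo "$name"
    fi
  done
}
```

Any name this prints, such as Launch or Terminate, is worth recording under warnings in the structured output.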
Failure Patterns
| Symptom | Likely Cause | Investigation |
|---|
| ASG desired=0, no instances | No Buildkite jobs pending, or Lambda not invoking | Check Step 5 (Lambda logs) |
| ASG desired>0, no instances launching | Launch template issue, AMI missing, capacity error | Check Step 4 (scaling activities for errors) |
| Instances running but builds stuck | Buildkite agent not starting on instance, token issue | Check Step 6 (console output) |
| Lambda not invoking | EventBridge rule disabled | Check Step 5 (Lambda and EventBridge) |
| Lambda invoking but not scaling | Buildkite API auth failure (expired token) | Check Step 5 (Lambda error logs) |
| Instances cycle rapidly (launch/terminate) | Health check failures, instance crashing on boot | Check Steps 3, 4, 6 |
| Agents run briefly then terminate | Normal: with MIN_SIZE=0, the scaler scales down when jobs finish | Not a bug |
Emergency: Manually Scale Up Agents
If the Lambda is broken and you need agents immediately:
aws autoscaling set-desired-capacity \
--auto-scaling-group-name "<ASG_NAME>" \
--desired-capacity <TEMP_CAPACITY_LEQ_MAX_SIZE> \
--region <REGION> --profile mockserver-build
Choose a temporary capacity that does not exceed the ASG MaxSize from Step 2.
Warning: The Lambda scaler may override this on its next invocation if it sees no pending jobs.
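The MaxSize constraint can be enforced before the call is made rather than relying on the API to reject it. A sketch under stated assumptions: scale_up_guarded is a hypothetical name, and max_size is the value read manually from Step 2.

```shell
# Hypothetical guard: scale_up_guarded <asg_name> <region> <capacity> <max_size>
# Refuses to call set-desired-capacity when capacity exceeds MaxSize.
scale_up_guarded() {
  local asg="$1"
  local region="$2"
  local capacity="$3"
  local max_size="$4"
  if [ "$capacity" -gt "$max_size" ]; then
    echo "capacity $capacity exceeds MaxSize $max_size" >&2
    return 1
  fi
  aws autoscaling set-desired-capacity \
    --auto-scaling-group-name "$asg" \
    --desired-capacity "$capacity" \
    --region "$region" --profile mockserver-build
}
```

The guard only validates the number; it does not stop the Lambda scaler from resetting capacity on its next invocation.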
Output — Structured Data Return
Return this structure in your final message:
{
"schema": "aws-investigation/v1",
"timestamp": "<ISO8601>",
"active_stack": "terraform-eu-west-2 | legacy-us-east-1",
"asg": {
"name": "<ASG name>",
"region": "<region>",
"desired_capacity": 0,
"min_size": 0,
"max_size": "<max_size>",
"instances": [
{
"instance_id": "<id>",
"state": "InService|Pending|Terminating",
"health": "Healthy|Unhealthy",
"availability_zone": "<az>"
}
],
"suspended_processes": ["<process names>"]
},
"lambda": {
"function_name": "<name>",
"state": "Active|Inactive",
"runtime": "<runtime>",
"recent_errors": ["<error messages>"],
"last_invocation": "<ISO8601 or null>"
},
"root_cause": {
"summary": "<one-line description>",
"detail": "<technical explanation>",
"category": "<category from failure patterns>",
"evidence": "<relevant log lines or CLI output>"
},
"recommended_fix": "<actionable steps>",
"warnings": ["<deprecation notices, capacity concerns, etc.>"]
}
After returning the JSON, provide a brief summary (2-3 lines).
Notes
- Always run Step 1 first to determine which stack is active
- The Lambda scaler logs are the most valuable data source for understanding scaling decisions
- If agents start but don't register, check whether the Buildkite agent token is still valid