| name | ops-pipelines |
| description | Internal — for Boundless team members only. Monitor Boundless deployment pipelines (AWS CodePipeline + CodeBuild) on the ops account. Use when the user wants to track a deployment after merging a PR, check whether a commit has rolled out to staging/prod, diagnose a failed deployment, watch the status of a specific pipeline, or get prompted to approve a production rollout once staging succeeds. Do NOT use for service runtime debugging (use ops-logs-query) or for deploying dev infrastructure (use ops-infra-deploy). |
Ops Pipelines
Monitor Boundless service deployments running through AWS CodePipeline /
CodeBuild in the ops account. Designed for the post-merge workflow: track a
commit through staging, surface failures with build logs, and prompt the user
to approve the production rollout.
Setup
Read network_secrets.toml from the repo root and extract [aws.ops]
(access_key_id, secret_access_key). These are read-only and can query
CodePipeline / CodeBuild / CloudWatch but cannot approve, start, or retry
pipelines. If the file is missing, point the user at the Boundless
runbook.
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_DEFAULT_REGION="us-west-2"
All Boundless pipelines live in us-west-2 in account 968153779208
(BoundlessOps).
Pipelines
Pipeline definitions live in infra/pipelines/pipelines/ — read that dir to
see what's deployed, branch config, stage layout per service. The boundless
repo's pipelines are the l-* ones; signal-pipeline,
kailua-order-generator-pipeline, zeth-requestor-pipeline come from
different repos.
Discover the live list any time:
aws codepipeline list-pipelines --query 'pipelines[].name' --output table
Typical stage layout: Source → DeployStaging (parallel CodeBuild per
chain) → DeployProduction (manual approval, then parallel CodeBuild per
chain). l-prover-ansible-pipeline adds a DeployNightly stage between
staging and production.
CodeBuild project names have a Pulumi resource hash suffix
(l-indexer-staging-167000-build-aee3645); always derive the suffix from
pipeline state, never hardcode it.
Core workflows
1. Monitor a deployment for a freshly merged commit
Primary use case. Given a commit SHA (or PR), find the pipeline executions
for that SHA and poll until staging finishes.
Resolve the SHA if needed:
gh pr view <number> --json mergeCommit --jq '.mergeCommit.oid'
Find the matching execution per pipeline (executions store the SHA in
sourceRevisions[0].revisionId):
SHA="..."
for P in $(aws codepipeline list-pipelines --query 'pipelines[?starts_with(name, `l-`)].name' --output text); do
EXEC=$(aws codepipeline list-pipeline-executions --pipeline-name "$P" --max-items 20 \
--query "pipelineExecutionSummaries[?sourceRevisions[0].revisionId=='$SHA'] | [0].pipelineExecutionId" \
--output text)
echo "$P -> $EXEC"
sleep 1
done
Default to all l-* pipelines unless the user specifies a service. Skip
l-prover-ansible-pipeline unless the change touched ansible/ or
infra/cw-monitoring/.
If the SHA's execution is Superseded, a later commit took over and that
SHA will not deploy to prod on its own. Call this out — common source of
confusion when merging PRs back-to-back.
Use the Monitor tool to poll in the background so the user can keep
working while staging runs (CodePipeline deployments take 10–30+ minutes).
The Monitor tool runs a script in the background and feeds each output line
back, so the agent can interject as soon as a stage transitions.
Run one Monitor per tracked pipeline with a script that prints a status
heartbeat every 30s and exits on a terminal event. Use the execution's
overall status from get-pipeline-execution, and get-pipeline-state to
detect the approval gate (filtered to the inbound exec at DeployProduction
so a newer superseding exec doesn't trigger a false approval signal):
PIPELINE="l-indexer-pipeline"
EXEC="<pipelineExecutionId from step above>"
while true; do
STATUS=$(aws codepipeline get-pipeline-execution \
--pipeline-name "$PIPELINE" --pipeline-execution-id "$EXEC" \
--query 'pipelineExecution.status' --output text 2>/dev/null || echo Unknown)
echo "$(date -u +%H:%M:%SZ) $PIPELINE exec=$EXEC status=$STATUS"
case "$STATUS" in
Succeeded)
echo "DONE pipeline=$PIPELINE exec=$EXEC"; break ;;
Failed|Stopped|Cancelled|Superseded)
echo "ALERT pipeline=$PIPELINE status=$STATUS exec=$EXEC"; break ;;
InProgress)
APPROVAL=$(aws codepipeline get-pipeline-state --name "$PIPELINE" --output json \
| jq -r --arg E "$EXEC" '
.stageStates[]
| select(.stageName=="DeployProduction"
and .inboundExecution.pipelineExecutionId == $E)
| .actionStates[] | select(.actionName=="ApproveDeployToProduction")
| .latestExecution.status // empty' | head -1)
if [ "$APPROVAL" = "InProgress" ]; then
echo "READY-TO-APPROVE pipeline=$PIPELINE exec=$EXEC"; break
fi ;;
esac
sleep 30
done
Each tracked pipeline gets its own Monitor (run them in parallel — the
boundless ops account handles the call rate fine at 30s intervals). React
when a line starting with ALERT, READY-TO-APPROVE, or DONE arrives:
READY-TO-APPROVE → prompt the user to approve production (workflow 2).
ALERT ... Failed → surface the failure (workflow 3).
ALERT ... Superseded → a newer commit took over; tell the user this SHA
will not deploy to prod on its own.
DONE ... Succeeded → pipeline fully complete (rare without approval).
If the user cancels the run, moves on to unrelated work, or asks to stop
monitoring, cancel the monitors — they cost API calls and clutter context.
If the Monitor tool isn't available, fall back to manual polling with
get-pipeline-state every 30s.
2. Approving production deploys
When staging completes, summarise: commit SHA + subject, which pipelines are
waiting, and per-pipeline AWS Console links:
https://us-west-2.console.aws.amazon.com/codesuite/codepipeline/pipelines/<pipeline-name>/view?region=us-west-2
Ask the user explicitly whether to approve. The user approves via the AWS
Console (or Slack — pipelines emit manual-approval-needed events to the
boundless-alerts-launch and boundless-alerts-staging-launch channels).
NEVER attempt the approval call yourself — the read-only ops creds will fail
with AccessDenied. After the user approves, optionally keep polling
production stages.
3. Diagnose a failed deployment
Find the failed action and its CodeBuild build:
aws codepipeline list-action-executions --pipeline-name "$P" \
--filter pipelineExecutionId="$EXEC" \
--query "actionExecutionDetails[?status=='Failed'].{stage:stageName, action:actionName, build:output.executionResult.externalExecutionId, project:input.configuration.ProjectName, url:output.executionResult.externalExecutionUrl}" \
--output json
build is <project>:<uuid>. Get the failed phase + log location:
aws codebuild batch-get-builds --ids "$BUILD_ID" \
--query 'builds[].{status:buildStatus, phase:currentPhase, group:logs.groupName, stream:logs.streamName, deepLink:logs.deepLink, start:startTime, end:endTime, failures:phases[?phaseStatus==`FAILED`].[phaseType,contexts[].message]}' \
--output json
Pull log lines around the failure (CodeBuild logs go to
/aws/codebuild/<project-name>):
aws logs filter-log-events \
--log-group-name "$LOG_GROUP" \
--log-stream-names "$LOG_STREAM" \
--start-time "$START_MS" --end-time "$END_MS" \
--filter-pattern '?ERROR ?error ?Failed ?failed ?"exit code"' \
--output json | jq '.events[] | {ts: (.timestamp/1000|todate), msg: .message}'
Common Boundless-specific failure patterns:
Still using ops account — assume-role didn't take effect; usually
transient, retry the stage.
pulumi cancel / update is in progress — previous run was killed;
next run usually self-recovers via pulumi cancel --yes in the buildspec.
Resource ... already exists — Pulumi state drift; manual fix.
401 Unauthorized from ghcr.io / docker.io — token rotation issue.
AccessDenied — IAM problem in the target account.
unhealthy / failed to start: container (prover-ansible) —
cross-reference with ops-logs-query on the bento prover log group.
Surface the failed phase + 10–30 most relevant log lines + console deep
link. Don't dump the full build log.
4. Fleet status overview
for P in $(aws codepipeline list-pipelines --query 'pipelines[].name' --output text); do
echo "=== $P ==="
aws codepipeline get-pipeline-state --name "$P" \
--query 'stageStates[].{stage:stageName, status:latestExecution.status}' \
--output table
sleep 1
done
Highlight: any Failed stage, any DeployProduction waiting on approval,
pipelines with no recent runs.
Status reference
| Status | Meaning |
|---|
InProgress | Currently running. |
Succeeded | Finished successfully. |
Failed | Failed; pipeline halted. |
Stopped / Stopping | Manually stopped. |
Superseded | Newer execution took over; this one will not progress further. Common — webhook fires on every push. |
Cancelled | Cancelled before completion (rare). |
CodeBuild: SUCCEEDED, FAILED, FAULT, TIMED_OUT, IN_PROGRESS,
STOPPED.
Tips
sleep 1 between AWS calls — CodePipeline TPS limits are low.
- Prefer
get-pipeline-state over list-action-executions for live polling
(one call, everything needed).
- Always show full pipeline name + execution ID + console deep link in any
status report.
- A stage can show
Failed while most chain-specific actions inside it
succeeded — identify which chain(s) actually failed.
Important
- NEVER attempt to approve, start, retry, or stop a pipeline — the
read-only ops creds fail with
AccessDenied. Direct the user to the AWS
Console.
- NEVER fabricate a
pipelineExecutionId, actionExecutionId, approval
token, or CodeBuild ID. They must come from a live AWS query.
- This skill is read-only. To modify pipelines themselves, see
infra/pipelines/.