| name | arc-runner-troubleshooting |
| description | Troubleshoot ARC (Actions Runner Controller) runners on Rackspace Spot Kubernetes. Diagnose stuck jobs, scaling issues, and cluster access. Activates on "runner", "ARC", "stuck job", "queued", "GitHub Actions", or "CI stuck". |
| allowed-tools | Read, Grep, Glob, Bash |
project-beta uses self-hosted GitHub Actions runners deployed via ARC (Actions Runner Controller) on Rackspace Spot Kubernetes. This guide covers common issues and troubleshooting procedures.
```
GitHub Actions
      ↓
ARC (Actions Runner Controller)   ← Watches for queued jobs
      ↓
AutoscalingRunnerSet              ← Scales runner pods 0→N
      ↓
Runner Pods                       ← Execute GitHub Actions jobs
      ↓
Rackspace Spot Kubernetes         ← Underlying infrastructure
```
| Pool | Target | Namespace | Repository |
|---|---|---|---|
| arc-beta-runners | Org-level | arc-runners | All project-beta repos |
Note: As of Dec 12, 2025, all workflows use the `arc-beta-runners` label. The `runnerScaleSetName` and the ArgoCD `releaseName` must both be `arc-beta-runners`.
A CI validation check now prevents `releaseName`/`runnerScaleSetName` mismatches:

- `scripts/validate-release-names.sh` - Validation script
- `.github/workflows/validate.yaml` - Runs on PRs touching config files

Run locally:

```bash
./scripts/validate-release-names.sh
```
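For reference, a minimal sketch of the kind of alignment check the script performs. The real script may differ; the `yq` v4 usage and the exact manifest structure here are assumptions:

```bash
#!/usr/bin/env bash
# Sketch: compare the ArgoCD releaseName with the Helm runnerScaleSetName.
# File paths are the ones documented in this guide.
set -euo pipefail

release=$(yq '.spec.source.helm.releaseName' argocd/apps-live/arc-runners.yaml)
scale_set=$(yq '.["gha-runner-scale-set"].runnerScaleSetName' examples/beta-runners-values.yaml)

if [ "$release" != "$scale_set" ]; then
  echo "MISMATCH: releaseName=$release runnerScaleSetName=$scale_set" >&2
  exit 1
fi
echo "OK: releaseName and runnerScaleSetName are both '$release'"
```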
Repository layout:

```
matchpoint-github-runners-helm/
├── examples/
│   ├── beta-runners-values.yaml      ← DEPLOYED Helm values (org-level)
│   └── frontend-runners-values.yaml  ← DEPLOYED Helm values (frontend)
├── values/
│   └── repositories.yaml             ← Documentation (NOT deployed)
├── charts/
│   └── github-actions-runners/       ← Helm chart
└── terraform/
    └── modules/                      ← Infrastructure as Code
```
Primary Root Cause: ArgoCD release name ≠ `runnerScaleSetName` (mismatch causes tracking failure)

Secondary Root Cause: the `ACTIONS_RUNNER_LABELS` environment variable does not work with ARC
CRITICAL: ArgoCD/Helm Alignment Issue (Dec 12, 2025 Discovery)
If the ArgoCD Helm release name doesn't match `runnerScaleSetName`, ArgoCD loses track of the deployed resources and jobs stay queued against a stale scale set.
Fix:

```yaml
# argocd/apps-live/arc-runners.yaml
helm:
  releaseName: arc-beta-runners  # MUST match runnerScaleSetName!
```

```yaml
# examples/runners-values.yaml
gha-runner-scale-set:
  runnerScaleSetName: "arc-beta-runners"  # MUST match releaseName!
```
Diagnosis tip: Check runner pod names:

- `arc-runners-*-runner-*` → OLD ARS still active (problem!)
- `arc-beta-runners-*-runner-*` → NEW ARS deployed (correct!)
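A quick way to classify the running pods by scale-set prefix (namespace and label selector assumed from this guide):

```bash
# Flag any pods still created by the old AutoscalingRunnerSet.
kubectl get pods -n arc-runners -l app.kubernetes.io/component=runner \
  --no-headers -o custom-columns=NAME:.metadata.name |
while read -r pod; do
  case "$pod" in
    arc-beta-runners-*) echo "OK    $pod" ;;
    arc-runners-*)      echo "STALE $pod (old AutoscalingRunnerSet still active)" ;;
    *)                  echo "?     $pod" ;;
  esac
done
```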
Symptoms:

- Runner labels show `[]` in GitHub
- `os: "unknown"` in the GitHub API

Diagnosis:
```bash
# Check runner labels via GitHub API
gh api /orgs/Matchpoint-AI/actions/runners --jq '.runners[] | {name, status, labels: [.labels[].name], os}'

# Bad output (empty labels):
{
  "name": "arc-runners-w74pg-runner-2xppt",
  "status": "online",
  "labels": [],
  "os": "unknown"
}

# Good output (proper labels):
{
  "name": "arc-beta-runners-xxxxx-runner-yyyyy",
  "status": "online",
  "labels": ["arc-beta-runners", "self-hosted", "Linux", "X64"],
  "os": "Linux"
}
```
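A one-liner to count mislabeled runners; anything above zero suggests the old scale set is still registering pods:

```bash
# Count registered runners that have no labels at all.
gh api /orgs/Matchpoint-AI/actions/runners \
  --jq '[.runners[] | select((.labels | length) == 0)] | length'
```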
Root Cause Explanation:
Per GitHub's official documentation:
"You cannot use additional labels to target runners created by ARC. You can only use the installation name of the runner scale set that you specified during the installation or by defining the value of the
runnerScaleSetNamefield in your values.yaml file."
How ARC Labels Work:

- ARC uses `runnerScaleSetName` as the GitHub label
- ARC ignores the `ACTIONS_RUNNER_LABELS` environment variable
- ARC automatically adds the `self-hosted`, OS, and architecture labels

Fix:
```yaml
# examples/runners-values.yaml or frontend-runners-values.yaml
gha-runner-scale-set:
  runnerScaleSetName: "arc-beta-runners"  # This becomes the GitHub label
  template:
    spec:
      containers:
        - name: runner
          env:
            # DO NOT SET ACTIONS_RUNNER_LABELS - it's ignored by ARC!
            # Only runnerScaleSetName matters
            - name: RUNNER_NAME_PREFIX
              value: "arc-beta"
```
Deployment Steps:

1. Verify `runnerScaleSetName` matches the workflow `runs-on:` labels
2. Remove any `ACTIONS_RUNNER_LABELS` env vars
3. Restart the runner pods, then verify with the sketch below: `kubectl delete pods -n arc-runners -l app.kubernetes.io/component=runner`
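A hedged post-fix check, assuming the pods recreate within the timeout (the `sleep` gives the controller time to spawn replacements):

```bash
# Wait for replacement runner pods, then confirm their labels in GitHub.
sleep 10  # give the controller time to create replacement pods
kubectl wait pods -n arc-runners -l app.kubernetes.io/component=runner \
  --for=condition=Ready --timeout=180s
gh api /orgs/Matchpoint-AI/actions/runners \
  --jq '.runners[] | select(.name | startswith("arc-beta-runners")) | {name, labels: [.labels[].name]}'
```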
References: see #89 (empty runner labels investigation), #97, and #98 in the Related Issues & PRs table below.
Root Cause: The ArgoCD ApplicationSet injects Helm parameters that conflict with the values file.
CRITICAL: Dec 12, 2025 Discovery
The ApplicationSet (`argocd/applicationset.yaml`) was injecting:

```yaml
parameters:
  - name: gha-runner-scale-set.githubConfigSecret.github_token
    value: "$ARGOCD_ENV_GITHUB_TOKEN"
```

But the values file uses:

```yaml
githubConfigSecret: arc-org-github-secret  # String reference to pre-created secret
```
The Conflict:

- Helm's `--set gha-runner-scale-set.githubConfigSecret.github_token=` expects `githubConfigSecret` to be a map
- The values file defines `githubConfigSecret` as a string (a secret name reference)
- Helm fails with: `interface conversion: interface {} is string, not map[string]interface {}`

Symptoms:

- `ComparisonError` in the Application conditions
- `Unknown` sync status

Diagnosis:
```bash
# Check ArgoCD Application status
kubectl get application arc-runners -n argocd -o jsonpath='{.status.conditions[*]}'
# Look for error like:
# "failed parsing --set data: unable to parse key: interface conversion: interface {} is string, not map[string]interface {}"

# Check ApplicationSet for conflicting parameters
kubectl get applicationset github-runners -n argocd -o jsonpath='{.spec.template.spec.source.helm}'
```
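A scriptable version of the same check; this sketch assumes `jq` is installed and fails loudly if a `parameters` section is still present:

```bash
# Exit non-zero if the ApplicationSet still injects Helm parameters.
kubectl get applicationset github-runners -n argocd -o json |
  jq -e '.spec.template.spec.source.helm | has("parameters") | not' >/dev/null \
  && echo "OK: no injected parameters" \
  || echo "FAIL: parameters section still present"
```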
Fix:

Remove the `parameters` section from `argocd/applicationset.yaml`:

```yaml
# argocd/applicationset.yaml - DO NOT include parameters
helm:
  releaseName: '{{name}}'
  valueFiles:
    - '../../{{valuesFile}}'
  # NO parameters section - values file handles secrets
```
```yaml
# examples/runners-values.yaml
githubConfigSecret: arc-org-github-secret  # Pre-created in cluster
```
Apply Fix to Cluster:

```bash
# kubectl apply may not remove fields - use replace
kubectl replace -f argocd/applicationset.yaml --force

# Verify parameters removed
kubectl get applicationset github-runners -n argocd -o jsonpath='{.spec.template.spec.source.helm}'
```
Secret Setup:

```bash
# Create the secret manually in the cluster
kubectl create secret generic arc-org-github-secret \
  --namespace=arc-runners \
  --from-literal=github_token='ghp_...'
```
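To sanity-check the secret without printing the token, a hedged sketch that only reports the token's byte length:

```bash
# Verify the secret exists and the token key is non-empty (prints length only).
kubectl get secret arc-org-github-secret -n arc-runners \
  -o jsonpath='{.data.github_token}' | base64 -d | wc -c
```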
References: see #94 (remove ApplicationSet parameters, merged) in the Related Issues & PRs table below.
Root Cause: `minRunners: 0` causes cold-start delays

Symptoms:

- Jobs sit in `Queued` for 2-5 minutes before a runner picks them up

Diagnosis:
```bash
# Check current Helm values
grep minRunners /home/pselamy/repositories/matchpoint-github-runners-helm/examples/beta-runners-values.yaml

# Check if issue is minRunners: 0
# If minRunners: 0 → cold start on every job
```
Fix:

```yaml
# examples/beta-runners-values.yaml
minRunners: 2   # Changed from 0 - keep 2 runners pre-warmed
maxRunners: 20
```
Why This Happens:

With `minRunners: 0`:

```
Job Queued → ARC detects → Schedule pod → Pull image →
Start container → Register runner → Job starts
Total: 120-300 seconds
```

With `minRunners: 2`:

```
Job Queued → Assign to pre-warmed runner → Job starts
Total: 5-10 seconds
```
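To measure the actual queued-to-started gap on recent runs, a sketch using `gh run list` (GNU `date` assumed; repo name taken from this guide):

```bash
# Rough queued→started latency for recent completed runs.
gh run list --repo Matchpoint-AI/project-beta-api --status completed --limit 10 \
  --json createdAt,startedAt,displayTitle \
  --jq '.[] | [.createdAt, .startedAt, .displayTitle] | @tsv' |
while IFS=$'\t' read -r created started title; do
  echo "$(( $(date -d "$started" +%s) - $(date -d "$created" +%s) ))s  $title"
done
```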
Problem: Cannot connect to the Rackspace Spot cluster

Common Errors:

```
error: You must be logged in to the server (the server has asked for the client to provide credentials)
error: unknown command "oidc-login" for "kubectl"
dial tcp: lookup hcp-xxx.spot.rackspace.com: no such host
```
CRITICAL: Kubeconfig Token Expiration (Dec 13, 2025 Discovery)

Rackspace Spot kubeconfig JWT tokens expire after 3 days. This is why a kubeconfig that worked a few days ago suddenly starts returning credential errors.
Verify token expiration:
# Decode JWT to check expiration
TOKEN=$(grep "token:" kubeconfig.yaml | head -1 | awk '{print $2}')
echo $TOKEN | cut -d. -f2 | base64 -d 2>/dev/null | python3 -c "
import json, sys
from datetime import datetime
payload = json.load(sys.stdin)
exp = datetime.fromtimestamp(payload['exp'])
print(f'Token expires: {exp}')
print(f'Expired: {datetime.now() > exp}')
"
Solutions:

Option A: Get kubeconfig from Terraform State (RECOMMENDED)

This is the preferred method: it pulls a fresh kubeconfig from the Terraform state.

```bash
# 1. Get GitHub token from gh CLI config
export TF_HTTP_PASSWORD=$(grep oauth_token ~/.config/gh/hosts.yml | awk '{print $2}')

# 2. Navigate to terraform directory
cd /home/pselamy/repositories/matchpoint-github-runners-helm/terraform

# 3. Initialize terraform (reads state only, no changes)
terraform init -input=false

# 4. Get kubeconfig from terraform output (read-only operation)
terraform output -raw kubeconfig_raw > /tmp/runners-kubeconfig.yaml

# 5. Use the kubeconfig
export KUBECONFIG=/tmp/runners-kubeconfig.yaml
kubectl get pods -A
```
Why this works:

- `terraform output` reads from the cached state in tfstate.dev, so no valid cluster credentials are needed

Note: A scheduled workflow (`refresh-kubeconfig.yml`) runs every 2 days to refresh the token in the Terraform state before the 3-day expiration.
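A convenience sketch that re-pulls the kubeconfig only when the token is within a day of expiry; the paths, backend auth, and output name are the Option A assumptions above:

```bash
# Refresh /tmp/runners-kubeconfig.yaml only when the token is near expiry.
KUBECONFIG_FILE=/tmp/runners-kubeconfig.yaml

needs_refresh() {
  [ -f "$KUBECONFIG_FILE" ] || return 0
  exp=$(grep "token:" "$KUBECONFIG_FILE" | head -1 | awk '{print $2}' | python3 -c "
import base64, json, sys
p = sys.stdin.read().strip().split('.')[1]
p += '=' * (-len(p) % 4)
print(int(json.loads(base64.urlsafe_b64decode(p))['exp']))")
  [ "$(date +%s)" -ge "$((exp - 86400))" ]  # refresh if <24h of validity left
}

if needs_refresh; then
  export TF_HTTP_PASSWORD=$(grep oauth_token ~/.config/gh/hosts.yml | awk '{print $2}')
  (cd /home/pselamy/repositories/matchpoint-github-runners-helm/terraform &&
    terraform init -input=false >/dev/null &&
    terraform output -raw kubeconfig_raw > "$KUBECONFIG_FILE")
fi
export KUBECONFIG="$KUBECONFIG_FILE"
```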
Option B: Use token-based auth (ngpc-user)

```bash
# Check if the current token is expired
# (raw base64 -d may need padding - see the Python snippet above)
kubectl config view --minify -o jsonpath='{.users[0].user.token}' | cut -d. -f2 | base64 -d | jq .exp
# Compare with the current timestamp

# Get a new kubeconfig from the Rackspace Spot console:
# 1. Log in to https://spot.rackspace.com
# 2. Select the cloudspace
# 3. Download the kubeconfig
```
Option C: Install oidc-login plugin

```bash
# Install krew (kubectl plugin manager)
brew install krew  # or appropriate package manager

# Install oidc-login
kubectl krew install oidc-login

# Use OIDC context
kubectl config use-context tradestreamhq-tradestream-cluster-oidc
```
Option D: Use ngpc CLI

```bash
# Install ngpc CLI from Rackspace
pip install ngpc-cli

# Login and refresh credentials
ngpc login
ngpc kubeconfig get <cloudspace-name>
```
Problem: Cluster hostname not resolving

```
dial tcp: lookup hcp-xxx.spot.rackspace.com: no such host
```

Causes: the cluster was recreated, so the old hostname in a stale kubeconfig no longer exists.
Solution: Use Terraform to get the kubeconfig for the CURRENT active cluster:

```bash
# Get fresh kubeconfig from terraform (see Option A above)
export TF_HTTP_PASSWORD="<github-token>"
cd /home/pselamy/repositories/matchpoint-github-runners-helm/terraform
terraform init
terraform output -raw kubeconfig_raw > /tmp/runners-kubeconfig.yaml
```

Note: The `kubeconfig-matchpoint-runners-prod.yaml` file in the repo root may be stale if the cluster was recreated. Always use `terraform output` to get the current kubeconfig.
Diagnosis:

```bash
# Check terraform state for the current cloudspace
cd /home/pselamy/repositories/matchpoint-github-runners-helm/terraform
export TF_HTTP_PASSWORD="<github-token>"
terraform init
terraform state list | grep cloudspace

# View cloudspace details
terraform state show module.cloudspace.spot_cloudspace.main
```
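To confirm DNS against the kubeconfig you just pulled, a sketch that extracts the API server host (assumes `yq` and the Option A output path):

```bash
# Resolve the API server hostname from the fresh kubeconfig.
server=$(yq '.clusters[0].cluster.server' /tmp/runners-kubeconfig.yaml)
host=${server#https://}; host=${host%%:*}; host=${host%%/*}
getent hosts "$host" && echo "DNS OK: $host" || echo "DNS FAIL: $host (stale kubeconfig?)"
```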
Problem: CI workflows fail with "command not found" for common tools
CRITICAL: Custom Runner Image (Dec 13, 2025 Discovery)
Two runner images exist:

- `ghcr.io/actions/actions-runner:latest` - Generic (missing many tools)
- `ghcr.io/matchpoint-ai/arc-runner:latest` - Custom (has all tools)

Symptoms:
```
/bin/bash: wget: command not found
/bin/bash: docker: command not found
```
Root Cause: The configuration may be using the generic image instead of the custom one.
Diagnosis:

```bash
# Check which image is configured
grep -r "ghcr.io" examples/*.yaml values/*.yaml | grep -v "#"

# Check which image is actually running
kubectl get pods -n arc-runners -o jsonpath='{.items[0].spec.containers[0].image}'
```
Custom Image Includes:
| Tool | Version |
|---|---|
| wget, curl, jq | latest |
| Node.js | 20 LTS |
| Python | 3.12 + pip + poetry |
| Docker CLI | 24.x |
| Terraform | 1.9.x |
| PostgreSQL client | 16 |
| Build tools | make, gcc, etc. |
Fix:

```yaml
# examples/runners-values.yaml
containers:
  - name: runner
    image: ghcr.io/matchpoint-ai/arc-runner:latest  # NOT actions-runner!
```
Note: The custom image is built from `images/arc-runner/Dockerfile` in this repo. The build workflow runs on pushes to `images/arc-runner/**`.
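To spot-check that a live pod really has the advertised tools, a hedged loop over the table above (tool names taken from the table; the pod selector is assumed):

```bash
# Verify each expected tool exists inside the first runner pod.
POD=$(kubectl get pods -n arc-runners -l app.kubernetes.io/component=runner \
      -o jsonpath='{.items[0].metadata.name}')
for tool in wget curl jq node python3 docker terraform psql make gcc; do
  kubectl exec -n arc-runners "$POD" -c runner -- sh -c "command -v $tool" >/dev/null \
    && echo "ok      $tool" || echo "MISSING $tool"
done
```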
Reference: Issue #135, PR #138
Problem: Docker commands fail even though the DinD sidecar is configured

Symptoms:

```
Cannot connect to the Docker daemon at tcp://localhost:2375
```
Diagnosis:
# Check pod has 2 containers (runner + dind)
kubectl get pods -n arc-runners -o jsonpath='{.items[*].spec.containers[*].name}'
# Should show: runner dind
# Check DinD logs
kubectl logs -n arc-runners <pod-name> -c dind --tail=50
# Should show: "API listen on [::]:2375"
# Verify DOCKER_HOST env var
kubectl get pods -n arc-runners -o jsonpath='{.items[0].spec.containers[0].env}' | jq '.[] | select(.name=="DOCKER_HOST")'
# Should show: tcp://localhost:2375
Common Issues:

- `DOCKER_HOST` not set to `tcp://localhost:2375` in the runner container

Verify DinD is healthy:
```bash
kubectl exec -n arc-runners <pod-name> -c runner -- docker version
kubectl exec -n arc-runners <pod-name> -c runner -- docker info
```
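As an end-to-end smoke test through the DinD daemon (requires the pod to be able to pull the `hello-world` image):

```bash
# Run a throwaway container via the sidecar daemon.
kubectl exec -n arc-runners <pod-name> -c runner -- docker run --rm hello-world
```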
Problem: Documentation says one thing, the deployed config is different

Key Insight: The `examples/*.yaml` files are what actually gets deployed. The `values/repositories.yaml` file is documentation/reference only.
Audit Configuration:

```bash
# Check what's ACTUALLY deployed
grep -E "(minRunners|maxRunners)" examples/beta-runners-values.yaml

# vs what the documentation says
grep -E "(minRunners|maxRunners)" values/repositories.yaml
```
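The same audit as a single drift check, assuming the keys of interest are `minRunners`/`maxRunners`:

```bash
# Show deployed-vs-documented drift directly.
diff <(grep -E "(minRunners|maxRunners)" examples/beta-runners-values.yaml) \
     <(grep -E "(minRunners|maxRunners)" values/repositories.yaml) \
  && echo "In sync" || echo "Drift detected"
```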
```bash
# List queued workflows
gh run list --repo Matchpoint-AI/project-beta-api --status queued

# List in-progress workflows
gh run list --repo Matchpoint-AI/project-beta-api --status in_progress

# View specific run
gh run view <RUN_ID> --repo Matchpoint-AI/project-beta-api
```
```bash
# Set kubeconfig
export KUBECONFIG=/path/to/kubeconfig.yaml

# Check runner scale set
kubectl get autoscalingrunnerset -n arc-runners

# Check runner pods
kubectl get pods -n arc-runners -l app.kubernetes.io/component=runner

# Check ARC controller logs
kubectl logs -n arc-systems deployment/arc-gha-rs-controller --tail=50

# Check for scaling events
kubectl get events -n arc-runners --sort-by='.lastTimestamp' | tail -20
```
```bash
# List registered runners
gh api /orgs/Matchpoint-AI/actions/runners --jq '.runners[] | {name, status, busy}'

# Check runner groups
gh api /orgs/Matchpoint-AI/actions/runner-groups --jq '.runner_groups[].name'
```
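For triage, a hedged one-shot snapshot combining the commands above (assumes `KUBECONFIG` is set and `gh` is authenticated):

```bash
# One-shot health snapshot of the runner stack.
echo "== AutoscalingRunnerSets =="; kubectl get autoscalingrunnerset -A
echo "== Runner pods ==";           kubectl get pods -n arc-runners -l app.kubernetes.io/component=runner
echo "== Queued workflow runs ==";  gh run list --repo Matchpoint-AI/project-beta-api --status queued --limit 5
echo "== Registered runners ==";    gh api /orgs/Matchpoint-AI/actions/runners --jq '.runners[] | "\(.name) \(.status) busy=\(.busy)"'
```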
Also check `minRunners` in the deployed Helm values.

Related Issues & PRs:

| Issue/PR | Repository | Description |
|---|---|---|
| #135 | matchpoint-github-runners-helm | Epic: ARC runner environment limitations (Dec 2025) |
| #136 | matchpoint-github-runners-helm | Document troubleshooting learnings in skills |
| #137 | matchpoint-github-runners-helm | PR: Auto-refresh kubeconfig token workflow |
| #138 | matchpoint-github-runners-helm | PR: Use custom runner image with pre-installed tools |
| #112 | matchpoint-github-runners-helm | CI jobs stuck - PR #98 broke alignment |
| #113 | matchpoint-github-runners-helm | CI validation feature request |
| #114 | matchpoint-github-runners-helm | PR: Fix releaseName alignment + CI validation |
| #89 | matchpoint-github-runners-helm | Empty runner labels investigation |
| #91 | matchpoint-github-runners-helm | PR: Change release name (superseded) |
| #93 | matchpoint-github-runners-helm | PR: Revert to arc-runners naming - MERGED |
| #94 | matchpoint-github-runners-helm | PR: Remove ApplicationSet parameters - MERGED |
| #97 | matchpoint-github-runners-helm | PR: Standardize labels to arc-beta-runners |
| #98 | matchpoint-github-runners-helm | PR: Update runnerScaleSetName (broke alignment!) |
| #798 | project-beta-api | PR: Update workflow labels to arc-runners |
| #72 | matchpoint-github-runners-helm | Root cause analysis for queuing |
| #77 | matchpoint-github-runners-helm | Fix PR (minRunners: 0 โ 2) - MERGED |
| #76 | matchpoint-github-runners-helm | Investigation state file |
| #1624 | project-beta | ARC runners stuck (closed) |
| #1577 | project-beta | P0: ARC unavailable (closed) |
| #1521 | project-beta | Runners stuck (closed) |
| Setting | Cost Impact | Recommendation |
|---|---|---|
| `minRunners: 0` | Lowest ($0 idle) | Development/low-traffic |
| `minRunners: 2` | ~$150-300/mo | Production/high-traffic |
| `minRunners: 5` | ~$400-700/mo | Enterprise/critical CI |
ROI Calculation: weigh the ~$150-300/mo cost of `minRunners: 2` against the 2-5 minutes of cold start it removes from every job. As a rough illustration (job volume assumed): 50 jobs/day each waiting ~3 minutes is ~2.5 hours/day, or ~50 engineer-hours per month of queue time, which typically dwarfs the cost of two warm runners.

Emergency Recovery:

- Re-sync the ArgoCD app: `argocd app sync arc-beta-runners`
- Restart the runner deployment: `kubectl rollout restart deployment -n arc-runners`
- Temporarily switch workflows to GitHub-hosted:

```yaml
# Temporarily switch workflow to GitHub-hosted
jobs:
  build:
    runs-on: ubuntu-latest  # Instead of self-hosted
```

Related Files:

- `matchpoint-github-runners-helm/charts/github-actions-runners/` - Helm chart
- `matchpoint-github-runners-helm/terraform/` - Infrastructure as Code