| name | rackspace-spot-best-practices |
| description | Best practices for running GitHub Actions ARC runners on Rackspace Spot Kubernetes. Covers spot instance management, preemption handling, cost optimization, and resilience strategies. Activates on "rackspace spot", "spot instances", "preemption", "cost optimization", or "spot interruption". |
| allowed-tools | Read, Grep, Glob, Bash |
This guide covers best practices for running GitHub Actions ARC (Actions Runner Controller) runners on Rackspace Spot Kubernetes. Rackspace Spot is a managed Kubernetes platform that leverages spot pricing through a unique auction-based marketplace.
Rackspace Spot is the world's only open market auction for cloud servers, delivered as turnkey, fully managed Kubernetes clusters.
| Feature | Rackspace Spot | AWS EC2 Spot | Other Cloud Spot |
|---|---|---|---|
| Pricing Model | True market auction | AWS-controlled floor price | Provider-controlled |
| Cost Savings | Up to 90%+ | 50-90% | 50-90% |
| Control Plane | Free (included) | $72/month (EKS) | $70-150/month |
| Interruption Notice | Managed transparently | 2 minutes | 2-30 seconds (varies) |
| Management | Fully managed K8s | Self-managed | Self-managed or managed |
| Lock-in | Multi-cloud capable | AWS-specific | Cloud-specific |
| Min Cluster Cost | $0.72/month | Much higher | Much higher |
GitHub Actions
↓
ARC (Actions Runner Controller)
↓
Kubernetes Cluster (Rackspace Spot)
↓
Spot Instance Pool (Auction-based)
↓
Rackspace Global Datacenters
Key Features:
Market-Driven Auction:
ROI Examples:
| Configuration | Rackspace Spot Cost | AWS EKS Equivalent | Savings |
|---|---|---|---|
| Control Plane | $0 (included) | $72/month | 100% |
| 2 t3.medium runners (24/7) | ~$15-30/month | ~$100-150/month | 75-85% |
| 5 t3.large runners (24/7) | ~$50-80/month | ~$300-400/month | 75-85% |
| Full ARC setup (minRunners: 2, scale to 20) | ~$150-300/month | ~$800-1200/month | 75-80% |
Best Practice Bidding:
Analyze Historical Capacity:
Set Safe Bid Thresholds:
Conservative: Set bid at 50% of on-demand equivalent
Balanced: Set bid at 70% of on-demand equivalent
Aggressive: Set bid at 90% of on-demand equivalent
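A quick way to turn those percentages into concrete hourly bid prices (the on-demand rate below is a placeholder; substitute the current rate for the instance class you are bidding on):
# Placeholder on-demand $/hr for the chosen instance class
ON_DEMAND_HOURLY=0.0416
for pct in 50 70 90; do
  printf "Bid at %d%% of on-demand: \$%.4f/hr\n" "$pct" "$(echo "$ON_DEMAND_HOURLY * $pct / 100" | bc -l)"
done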
Ensure Capacity Availability:
minRunners Configuration:
| Strategy | minRunners | Cost Impact | Use Case |
|---|---|---|---|
| Zero Idle Cost | 0 | $0 when idle, 2-5 min cold start | Low-traffic repos, dev |
| Fast Start | 2 | ~$15-30/month, <10 sec start | Production, active repos |
| Enterprise | 5 | ~$50-80/month, instant | High-volume CI/CD |
Recommendation for ARC on Rackspace Spot:
Cost Comparison Example:
AWS EKS with minRunners: 2
- Control plane: $72/month
- 2 t3.medium spot: ~$30-50/month
- Total: ~$100-120/month
Rackspace Spot with minRunners: 2
- Control plane: $0/month (included)
- 2 equivalent runners: ~$15-30/month
- Total: ~$15-30/month
Savings: 75-85%
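To sanity-check that figure for your own numbers, plug the monthly totals into the same calculation (values below are the midpoints of the example ranges):
# Midpoints of the example ranges above
awk -v eks=110 -v rs=22.5 'BEGIN { printf "Savings: %.1f%%\n", (1 - rs/eks) * 100 }'   # prints Savings: 79.5%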
Key Differences:
Unlike AWS, GCP, and Azure, where spot instances can be terminated with as little as 30 seconds to 2 minutes of notice:
User Testimonials:
"Running Airflow in Spot with fault-tolerant Kubernetes components means preemption wasn't a concern" - OpsMx case study
The Challenge:
GitHub Actions jobs don't fit typical Kubernetes workload patterns:
Best Practices for ARC on Spot:
# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: arc-runner-pdb
namespace: arc-beta-runners-new
spec:
minAvailable: 1
selector:
matchLabels:
app.kubernetes.io/component: runner
Note: PDBs protect against voluntary evictions (drains, upgrades) but NOT spot interruptions. However, Rackspace Spot's transparent migration may handle this better than traditional spot providers.
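To confirm the budget is in place and being honored, check its status in the runner namespace:
kubectl get pdb arc-runner-pdb -n arc-beta-runners-new
kubectl describe pdb arc-runner-pdb -n arc-beta-runners-new | grep -E 'Min available|Allowed disruptions'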
If using Karpenter for node provisioning:
# runner-template.yaml
apiVersion: v1
kind: Pod
metadata:
name: runner
annotations:
karpenter.sh/do-not-disrupt: "true"
spec:
# runner spec
When this helps:
# arc-runner-values.yaml
spec:
template:
spec:
tolerations:
- key: "spot"
operator: "Equal"
value: "true"
effect: "NoSchedule"
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: "capacity-type"
operator: In
values:
- spot
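The toleration and node affinity above assume spot nodes carry a spot=true:NoSchedule taint and a capacity-type=spot label; those names are cluster-specific, so confirm what the nodes actually expose before relying on them:
kubectl get nodes -L capacity-type
kubectl describe nodes | grep -E '^Name:|Taints:'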
Benefits:
# arc-runner-values.yaml
spec:
template:
spec:
terminationGracePeriodSeconds: 300 # 5 minutes
containers:
- name: runner
env:
- name: RUNNER_GRACEFUL_STOP_TIMEOUT
value: "300"
Limitations:
Strategy: Mix of instance types
# Multiple instance types for availability (nodeSelector requires an exact label match, so use node affinity instead)
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - { key: node.kubernetes.io/instance-type, operator: In, values: [t3.medium, t3a.medium, t2.medium] }
Benefits:
Why Rackspace Spot is Better for CI/CD:
Transparent Migration:
Managed Kubernetes:
Auction Model:
Cost Efficiency:
When to Use:
Implementation:
# Create separate runner scale sets
---
# Spot runners (most jobs)
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
name: spot-runners
spec:
replicas: 2  # baseline of 2; scale up to 20 with a HorizontalRunnerAutoscaler
template:
spec:
nodeSelector:
capacity-type: spot
labels:
- "self-hosted"
- "linux"
- "spot"
---
# On-demand runners (critical jobs)
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
name: ondemand-runners
spec:
replicas: 0  # scale from 0 up to 5 with a HorizontalRunnerAutoscaler
template:
spec:
nodeSelector:
capacity-type: on-demand
labels:
- "self-hosted"
- "linux"
- "on-demand"
Workflow Usage:
# .github/workflows/deploy-production.yml
jobs:
build:
runs-on: [self-hosted, linux, spot] # Cost-effective
deploy-production:
runs-on: [self-hosted, linux, on-demand] # Reliable
Spread pods across failure domains:
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app.kubernetes.io/component: runner
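To see how the scheduler actually spread the runner pods across zones:
kubectl get nodes -L topology.kubernetes.io/zone
kubectl get pods -n arc-beta-runners-new -l app.kubernetes.io/component=runner -o wide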
Benefits:
Best Practice: Separate namespaces for isolation
arc-systems → ARC controllers
arc-beta-runners-new → Org-level runners
arc-frontend-runners → Frontend-specific
arc-api-runners → API-specific
Benefits:
apiVersion: v1
kind: ResourceQuota
metadata:
name: runner-quota
namespace: arc-beta-runners-new
spec:
hard:
requests.cpu: "40"
requests.memory: "80Gi"
pods: "25"
Prevents:
1. Runner Availability:
# Check available runners
kubectl get pods -n arc-beta-runners-new -l app.kubernetes.io/component=runner
# Check scaling events
kubectl get events -n arc-beta-runners-new --sort-by='.lastTimestamp' | grep -i scale
2. Spot Interruptions:
# Track pod evictions/terminations
kubectl get events -A --field-selector reason=Evicted
# Check node status
kubectl get nodes -o wide | grep -i spot
3. Job Queue Times:
# GitHub Actions queue times
gh run list --repo Matchpoint-AI/project-beta-api --status queued --json createdAt,name
# Calculate average queue time
gh run list --json createdAt,startedAt,conclusion --jq '.[] | select(.conclusion != null) | (.startedAt | fromdateiso8601) - (.createdAt | fromdateiso8601)'
4. Cost Tracking:
# Via Rackspace Spot console
# - View real-time billing
# - Track instance hours
# - Compare bid vs actual prices paid
| Metric | Normal | Warning | Critical | Action |
|---|---|---|---|---|
| Avg Queue Time | <30s | 30-120s | >120s | Check minRunners, scaling |
| Available Runners | 2+ | 1 | 0 | Scale up, check capacity |
| Pod Restart Rate | <1/day | 1-5/day | >5/day | Investigate interruptions |
| Failed Jobs % | <2% | 2-5% | >5% | Check resource limits, OOM |
| Spot Interruptions | 0-2/week | 2-5/week | >5/week | Review bid strategy, diversify |
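A small sketch that ties the queue-time commands above to these thresholds; the repo and the 50-run sample size are just examples:
# Average queue time over the last 50 completed runs, flagged against the 30s/120s thresholds
AVG=$(gh run list --repo Matchpoint-AI/project-beta-api --limit 50 \
  --json createdAt,startedAt,conclusion \
  --jq '[.[] | select(.conclusion != null) | ((.startedAt | fromdateiso8601) - (.createdAt | fromdateiso8601))] | add / length')
echo "Average queue time: ${AVG}s"
awk -v avg="$AVG" 'BEGIN { if (avg+0 > 120) print "CRITICAL: check minRunners and scaling"; else if (avg+0 > 30) print "WARNING: queue times elevated"; else print "OK" }'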
# examples/beta-runners-values.yaml
githubConfigUrl: "https://github.com/Matchpoint-AI"
githubConfigSecret: arc-runner-token
minRunners: 2 # Keep 2 pre-warmed (RECOMMENDED for Rackspace Spot)
maxRunners: 20 # Scale to 20 for parallel jobs
runnerGroup: "default"
template:
spec:
containers:
- name: runner
image: summerwind/actions-runner:latest
resources:
limits:
cpu: "2"
memory: "4Gi"
requests:
cpu: "1"
memory: "2Gi"
# Spot tolerations
tolerations:
- key: "spot"
operator: "Equal"
value: "true"
effect: "NoSchedule"
# Node affinity for spot
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: "capacity-type"
operator: In
values:
- spot
# Graceful shutdown
terminationGracePeriodSeconds: 300
# terraform/main.tf
module "cloudspace" {
source = "./modules/cloudspace"
name = "matchpoint-runners-prod"
region = "us-east-1"
kubernetes_version = "1.28"
# Spot configuration
node_pools = {
spot_runners = {
instance_type = "t3.medium"
capacity_type = "spot"
min_size = 2
max_size = 20
spot_max_price = "0.05" # Bid price
}
}
}
# Get kubeconfig from spot provider
data "spot_kubeconfig" "runners" {
cloudspace_id = module.cloudspace.cloudspace_id
}
output "kubeconfig_raw" {
value = data.spot_kubeconfig.runners.raw
sensitive = true
}
Getting Kubeconfig:
# Always get fresh kubeconfig from terraform
export TF_HTTP_PASSWORD="<github-token>"
cd matchpoint-github-runners-helm/terraform
terraform init
terraform output -raw kubeconfig_raw > /tmp/runners-kubeconfig.yaml
export KUBECONFIG=/tmp/runners-kubeconfig.yaml
kubectl get pods -A
Issue #112 Root Cause (Dec 12, 2025)
When ArgoCD releaseName doesn't match runnerScaleSetName in values files, runners register with empty labels, causing all CI jobs to queue indefinitely.
Symptoms:
{"labels":[],"name":"arc-beta-runners-xxxxx","os":"unknown","status":"offline"}
Root Cause:
releaseName in the ArgoCD Application
runnerScaleSetName in the Helm values
CRITICAL: These MUST match:
# argocd/apps-live/arc-runners.yaml
helm:
releaseName: arc-beta-runners # MUST match runnerScaleSetName!
# examples/runners-values.yaml
gha-runner-scale-set:
runnerScaleSetName: "arc-beta-runners" # MUST match releaseName!
CI Validation Added:
scripts/validate-release-names.sh - Validates alignment
.github/workflows/validate.yaml - Runs on PRs to prevent recurrence
Diagnosis:
# Check runner labels
gh api /orgs/Matchpoint-AI/actions/runners --jq '.runners[] | {name, labels: [.labels[].name], os}'
# Run validation script
cd matchpoint-github-runners-helm
./scripts/validate-release-names.sh
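The script is not reproduced here, but a minimal sketch of the check it performs might look like this (file paths and the grep-based parsing are assumptions; the real scripts/validate-release-names.sh may differ):
# Compare the ArgoCD releaseName with the Helm values runnerScaleSetName
RELEASE_NAME=$(grep -m1 'releaseName:' argocd/apps-live/arc-runners.yaml | awk '{print $2}')
SCALE_SET_NAME=$(grep -m1 'runnerScaleSetName:' examples/runners-values.yaml | awk '{print $2}' | tr -d '"')
if [ "$RELEASE_NAME" != "$SCALE_SET_NAME" ]; then
  echo "MISMATCH: releaseName=$RELEASE_NAME runnerScaleSetName=$SCALE_SET_NAME" >&2
  exit 1
fi
echo "OK: releaseName and runnerScaleSetName both resolve to $RELEASE_NAME"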
Fix:
Align releaseName in the ArgoCD Application with runnerScaleSetName in values
Update the ignoreDifferences secret name (format: {releaseName}-gha-rs-github-secret)
Related Issues: #89, #91, #93, #97, #98, #112, #113
Symptoms:
Diagnosis:
# Check pod restart counts
kubectl get pods -n arc-beta-runners-new -o json | jq '.items[] | {name: .metadata.name, restarts: .status.containerStatuses[0].restartCount}'
# Check node events
kubectl get events -A --field-selector involvedObject.kind=Node | grep -i "spot"
Solutions:
Symptoms:
Diagnosis:
# Check pending pods
kubectl get pods -n arc-beta-runners-new -o wide | grep Pending
# Check node status
kubectl describe node <node-name>
Solutions:
Symptoms:
Diagnosis:
# Check persistent volumes
kubectl get pv,pvc -A
# Check pod events
kubectl describe pod <pod-name> -n arc-beta-runners-new
Solutions:
apiVersion: v1
kind: Namespace
metadata:
name: arc-beta-runners-new
labels:
pod-security.kubernetes.io/enforce: restricted
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: runner-network-policy
namespace: arc-beta-runners-new
spec:
podSelector:
matchLabels:
app.kubernetes.io/component: runner
policyTypes:
- Egress
egress:
- to:
- namespaceSelector: {}
ports:
- protocol: TCP
port: 443 # HTTPS only
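Verify the policy is applied and scoped to the runner pods:
kubectl describe networkpolicy runner-network-policy -n arc-beta-runners-new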
apiVersion: v1
kind: ServiceAccount
metadata:
name: arc-runner
namespace: arc-beta-runners-new
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: arc-runner
namespace: arc-beta-runners-new
rules:
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list"] # Minimal permissions
# Use GitHub secrets for ARC token
kubectl create secret generic arc-runner-token \
--from-literal=github_token=$GITHUB_TOKEN \
-n arc-beta-runners-new
# Seal secrets for GitOps
kubeseal --format=yaml < secret.yaml > sealed-secret.yaml
Set up Rackspace Spot Cloudspace:
cd matchpoint-github-runners-helm/terraform
terraform init
terraform plan
terraform apply
Deploy ARC to Rackspace Spot:
# Get kubeconfig
terraform output -raw kubeconfig_raw > /tmp/rs-kubeconfig.yaml
export KUBECONFIG=/tmp/rs-kubeconfig.yaml
# Deploy ARC controllers
helm install arc-controller oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
-n arc-systems --create-namespace
# Deploy runner scale sets
helm install arc-beta-runners ./charts/github-actions-runners \
-n arc-beta-runners-new --create-namespace \
-f examples/beta-runners-values.yaml
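After both installs, confirm the controller and runner scale set come up before pointing workflows at the new runners:
# Verify controller and runner scale set
helm list -n arc-systems
kubectl get pods -n arc-systems
kubectl get pods -n arc-beta-runners-new
kubectl get autoscalingrunnerset -n arc-beta-runners-new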
Parallel Testing Period:
Gradual Migration:
Cost Validation:
Before (EKS):
- Control plane: $72/month
- Spot instances: ~$200/month
- Total: ~$272/month
After (Rackspace Spot):
- Control plane: $0/month
- Spot instances: ~$50/month
- Total: ~$50/month
Savings: ~$220/month (81%)
/home/pselamy/repositories/project-beta-dev-workspace/.claude/skills/arc-runner-troubleshooting/SKILL.md - ARC troubleshooting
/home/pselamy/repositories/project-beta-dev-workspace/.claude/skills/project-beta-infrastructure/SKILL.md - Infrastructure patterns
Rackspace Spot kubeconfig JWT tokens expire after 3 days. This causes:
# Decode the JWT payload to check when the token expires
TOKEN=$(grep "token:" kubeconfig.yaml | head -1 | awk '{print $2}')
python3 -c "
import base64, json, sys
from datetime import datetime
payload = sys.argv[1].split('.')[1]
payload += '=' * (-len(payload) % 4)  # restore stripped base64url padding
claims = json.loads(base64.urlsafe_b64decode(payload))
exp = datetime.fromtimestamp(claims['exp'])
print(f'Token expires: {exp}')
print(f'Expired: {datetime.now() > exp}')
" "$TOKEN"
A GitHub Actions workflow (refresh-kubeconfig.yml) runs every 2 days to refresh the terraform state before the 3-day token expiration.
How it works:
Runs terraform refresh with RACKSPACE_SPOT_API_TOKEN
data.spot_kubeconfig fetches a fresh token from the Rackspace API
The refreshed kubeconfig is then available via terraform output
Reference: Issue #135, PR #137
Option A: From Terraform State (RECOMMENDED)
The state is auto-refreshed every 2 days:
# Get GitHub token from gh CLI config
export TF_HTTP_PASSWORD=$(cat ~/.config/gh/hosts.yml | grep oauth_token | awk '{print $2}')
cd matchpoint-github-runners-helm/terraform
terraform init -input=false
terraform output -raw kubeconfig_raw > /tmp/kubeconfig.yaml
export KUBECONFIG=/tmp/kubeconfig.yaml
kubectl get nodes
Option B: Manual Refresh (if token expired)
If the scheduled workflow hasn't run recently:
# Requires RACKSPACE_SPOT_API_TOKEN
cd matchpoint-github-runners-helm/terraform
terraform refresh -var="rackspace_spot_token=$RACKSPACE_SPOT_API_TOKEN"
terraform output -raw kubeconfig_raw > /tmp/kubeconfig.yaml
Option C: OIDC Context (interactive)
The kubeconfig includes an OIDC context that auto-refreshes via browser login:
kubectl --kubeconfig=kubeconfig.yaml --context=matchpoint-ai-matchpoint-runners-oidc get pods
Requires: kubectl oidc-login plugin (kubectl krew install oidc-login)
# Get fresh kubeconfig (using gh CLI token)
export TF_HTTP_PASSWORD=$(cat ~/.config/gh/hosts.yml | grep oauth_token | awk '{print $2}')
cd matchpoint-github-runners-helm/terraform
terraform init -input=false
terraform output -raw kubeconfig_raw > /tmp/rs-kubeconfig.yaml
export KUBECONFIG=/tmp/rs-kubeconfig.yaml
# Check runner status
kubectl get pods -n arc-beta-runners-new -l app.kubernetes.io/component=runner
# Check scaling
kubectl get autoscalingrunnerset -A
# View spot node status
kubectl get nodes -o wide | grep spot
# Check for interruptions
kubectl get events -A --field-selector reason=Evicted --sort-by='.lastTimestamp'
# Monitor GitHub Actions queue
gh run list --repo Matchpoint-AI/project-beta-api --status queued
# Check runner registration
gh api /orgs/Matchpoint-AI/actions/runners --jq '.runners[] | {name, status, busy}'