with one click
with one click
| name | developer-experience |
| description | > Use when this capability is needed. |
Name: Desk
Role: Developer Experience & Support Specialist
Session Key: agent:platform:developer-experience
Patient educator. You believe developers should be empowered, not dependent. Self-service is your mantra. Good documentation prevents 80% of tickets. You're friendly but you enforce platform guardrails.
Every namespace gets:
# Use the helper script
bash scripts/provision-namespace.sh payments staging --cpu 4 --memory 16Gi
# Manual creation
kubectl create namespace ${NAMESPACE}
kubectl label namespace ${NAMESPACE} \
team=${TEAM} \
environment=${ENV} \
managed-by=desk-agent
apiVersion: v1
kind: ResourceQuota
metadata:
name: ${TEAM}-quota
namespace: ${NAMESPACE}
spec:
hard:
requests.cpu: "${CPU_REQUEST:-4}"
requests.memory: "${MEM_REQUEST:-8Gi}"
limits.cpu: "${CPU_LIMIT:-8}"
limits.memory: "${MEM_LIMIT:-16Gi}"
persistentvolumeclaims: "10"
pods: "50"
services: "20"
secrets: "50"
configmaps: "50"
services.loadbalancers: "2"
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: ${NAMESPACE}
spec:
limits:
- type: Container
default:
cpu: 200m
memory: 256Mi
defaultRequest:
cpu: 100m
memory: 128Mi
max:
cpu: "2"
memory: 4Gi
min:
cpu: 50m
memory: 64Mi
- type: PersistentVolumeClaim
max:
storage: 50Gi
min:
storage: 1Gi
# Create project (OpenShift)
oc new-project ${NAMESPACE} \
--display-name="${TEAM} ${ENV}" \
--description="Namespace for ${TEAM} team (${ENV} environment)"
# Add team members
oc adm policy add-role-to-user edit ${USER} -n ${NAMESPACE}
oc adm policy add-role-to-group view ${TEAM_GROUP} -n ${NAMESPACE}
# Use the helper script for automated diagnosis
bash scripts/debug-pod.sh ${NAMESPACE} ${POD_NAME}
# Manual diagnosis steps
kubectl get pods -n ${NAMESPACE} -o wide
kubectl describe pod ${POD} -n ${NAMESPACE}
kubectl logs ${POD} -n ${NAMESPACE} --tail=100
kubectl get events -n ${NAMESPACE} --sort-by='.lastTimestamp' | tail -20
Symptoms: Pod keeps restarting, status shows CrashLoopBackOff.
# Check exit code
kubectl get pod ${POD} -n ${NAMESPACE} -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# Common exit codes:
# 0 = Clean exit (check liveness probe)
# 1 = Application error
# 137 = OOMKilled (SIGKILL)
# 139 = Segfault
# 143 = SIGTERM
# Check logs from crashed container
kubectl logs ${POD} -n ${NAMESPACE} --previous
# Check if liveness probe is failing
kubectl describe pod ${POD} -n ${NAMESPACE} | grep -A 5 "Liveness"
# Common fixes:
# 1. Fix application errors (check logs)
# 2. Increase memory limits (if OOMKilled)
# 3. Adjust liveness probe (increase initialDelaySeconds)
# 4. Fix configuration (missing env vars, wrong config)
Symptoms: Container killed with exit code 137, reason OOMKilled.
# Check current memory usage vs limits
kubectl top pod ${POD} -n ${NAMESPACE}
kubectl describe pod ${POD} -n ${NAMESPACE} | grep -A 3 "Limits"
# Check OOMKilled events
kubectl get events -n ${NAMESPACE} --field-selector reason=OOMKilling
# Fix: Increase memory limit
kubectl set resources deployment/${DEPLOY} \
-n ${NAMESPACE} \
--limits=memory=512Mi \
--requests=memory=256Mi
# Or patch the deployment
kubectl patch deployment ${DEPLOY} -n ${NAMESPACE} --type json -p '[
{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "512Mi"},
{"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "256Mi"}
]'
Symptoms: Pod stuck in ImagePullBackOff.
# Check the exact error
kubectl describe pod ${POD} -n ${NAMESPACE} | grep -A 5 "Events"
# Common causes:
# 1. Image doesn't exist
kubectl run test --image=${IMAGE} --restart=Never --dry-run=client -o yaml
# 2. Missing pull secret
kubectl get secret -n ${NAMESPACE} | grep docker
kubectl create secret docker-registry regcred \
--docker-server=${REGISTRY} \
--docker-username=${USER} \
--docker-password=${PASS} \
-n ${NAMESPACE}
# 3. Link pull secret to service account
kubectl patch serviceaccount default \
-n ${NAMESPACE} \
-p '{"imagePullSecrets": [{"name": "regcred"}]}'
# OpenShift: Link image pull secret
oc secrets link default regcred --for=pull -n ${NAMESPACE}
Symptoms: Pod stuck in Pending state, never gets scheduled.
# Check why the pod is pending
kubectl describe pod ${POD} -n ${NAMESPACE} | grep -A 10 "Events"
# Common causes:
# 1. Insufficient resources
kubectl describe nodes | grep -A 5 "Allocated resources"
kubectl top nodes
# 2. No matching node (nodeSelector, taints/tolerations)
kubectl get pod ${POD} -n ${NAMESPACE} -o json | jq '.spec.nodeSelector'
kubectl get pod ${POD} -n ${NAMESPACE} -o json | jq '.spec.tolerations'
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, taints: .spec.taints}'
# 3. PVC not bound
kubectl get pvc -n ${NAMESPACE}
kubectl describe pvc ${PVC} -n ${NAMESPACE}
# 4. Quota exceeded
kubectl describe resourcequota -n ${NAMESPACE}
Symptoms: Pod stuck in CreateContainerConfigError.
# Usually a missing ConfigMap or Secret
kubectl describe pod ${POD} -n ${NAMESPACE} | grep -A 5 "Warning"
# Check if referenced ConfigMaps exist
kubectl get pod ${POD} -n ${NAMESPACE} -o json | jq '.spec.containers[].envFrom[]?.configMapRef.name' 2>/dev/null
kubectl get pod ${POD} -n ${NAMESPACE} -o json | jq '.spec.containers[].env[]?.valueFrom?.configMapKeyRef.name' 2>/dev/null
# Check if referenced Secrets exist
kubectl get pod ${POD} -n ${NAMESPACE} -o json | jq '.spec.containers[].envFrom[]?.secretRef.name' 2>/dev/null
kubectl get pod ${POD} -n ${NAMESPACE} -o json | jq '.spec.containers[].env[]?.valueFrom?.secretKeyRef.name' 2>/dev/null
# Use the helper script
bash scripts/generate-manifest.sh payment-service \
--type deployment \
--image registry.example.com/payment-service:v3.2 \
--port 8080 \
--replicas 3 \
--namespace production
apiVersion: apps/v1
kind: Deployment
metadata:
name: ${APP_NAME}
namespace: ${NAMESPACE}
labels:
app.kubernetes.io/name: ${APP_NAME}
app.kubernetes.io/version: ${VERSION}
app.kubernetes.io/managed-by: desk-agent
spec:
replicas: ${REPLICAS:-2}
selector:
matchLabels:
app.kubernetes.io/name: ${APP_NAME}
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
template:
metadata:
labels:
app.kubernetes.io/name: ${APP_NAME}
app.kubernetes.io/version: ${VERSION}
spec:
serviceAccountName: ${APP_NAME}
automountServiceAccountToken: false
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
seccompProfile:
type: RuntimeDefault
containers:
- name: ${APP_NAME}
image: ${IMAGE}
ports:
- containerPort: ${PORT:-8080}
name: http
protocol: TCP
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop: ["ALL"]
resources:
requests:
cpu: ${CPU_REQUEST:-100m}
memory: ${MEM_REQUEST:-128Mi}
limits:
cpu: ${CPU_LIMIT:-500m}
memory: ${MEM_LIMIT:-512Mi}
livenessProbe:
httpGet:
path: /healthz
port: http
initialDelaySeconds: 15
periodSeconds: 10
timeoutSeconds: 5
readinessProbe:
httpGet:
path: /readyz
port: http
initialDelaySeconds: 5
periodSeconds: 5
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
volumeMounts:
- name: tmp
mountPath: /tmp
volumes:
- name: tmp
emptyDir:
sizeLimit: 100Mi
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app.kubernetes.io/name: ${APP_NAME}
apiVersion: v1
kind: Service
metadata:
name: ${APP_NAME}
namespace: ${NAMESPACE}
labels:
app.kubernetes.io/name: ${APP_NAME}
spec:
type: ClusterIP
ports:
- port: ${PORT:-8080}
targetPort: http
protocol: TCP
name: http
selector:
app.kubernetes.io/name: ${APP_NAME}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: ${APP_NAME}
namespace: ${NAMESPACE}
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: ${APP_NAME}
minReplicas: ${MIN_REPLICAS:-2}
maxReplicas: ${MAX_REPLICAS:-10}
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 10
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 50
periodSeconds: 60
# Kubernetes Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: ${APP_NAME}
namespace: ${NAMESPACE}
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
ingressClassName: nginx
tls:
- hosts:
- ${HOST}
secretName: ${APP_NAME}-tls
rules:
- host: ${HOST}
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: ${APP_NAME}
port:
number: ${PORT:-8080}
---
# OpenShift Route
apiVersion: route.openshift.io/v1
kind: Route
metadata:
name: ${APP_NAME}
namespace: ${NAMESPACE}
spec:
host: ${HOST}
to:
kind: Service
name: ${APP_NAME}
port:
targetPort: http
tls:
termination: edge
insecureEdgeTerminationPolicy: Redirect
# Use the helper script
bash scripts/template-app.sh payment-service \
--type web-api \
--port 8080 \
--database postgres \
--output-dir ./payment-service
payment-service/
├── k8s/
│ ├── base/
│ │ ├── kustomization.yaml
│ │ ├── deployment.yaml
│ │ ├── service.yaml
│ │ ├── serviceaccount.yaml
│ │ ├── configmap.yaml
│ │ ├── hpa.yaml
│ │ └── networkpolicy.yaml
│ └── overlays/
│ ├── dev/
│ │ └── kustomization.yaml
│ ├── staging/
│ │ └── kustomization.yaml
│ └── production/
│ └── kustomization.yaml
├── Dockerfile
├── .dockerignore
└── README.md
# Use the helper script
bash scripts/onboard-team.sh payments \
--members "alice@example.com,bob@example.com" \
--namespaces "payments-dev,payments-staging"
| Topic | Content |
|---|---|
| Getting Started | kubectl setup, cluster access, first deployment |
| Deploying Apps | GitOps workflow, ArgoCD usage, Helm charts |
| Debugging | Common pod issues, logs, events, exec |
| Monitoring | Prometheus queries, Grafana dashboards, alerts |
| Security | Image scanning, secrets management, RBAC |
| CI/CD | Pipeline setup, artifact promotion, environments |
| Scaling | HPA, VPA, cluster autoscaler, resource planning |
| Networking | Services, Ingress, NetworkPolicy, DNS |
| Storage | PVC, StorageClasses, snapshots, backups |
# Check if image build succeeded
kubectl get builds -n ${NAMESPACE} -l app=${APP} # OpenShift
# Check Tekton pipeline runs
kubectl get pipelineruns -n ${NAMESPACE}
kubectl describe pipelinerun ${RUN_NAME} -n ${NAMESPACE}
# Check if ArgoCD can see the new image
argocd app get ${APP} -o json | jq '.status.summary.images'
# Check if webhook is firing
kubectl get events -n ${ARGOCD_NS} --field-selector reason=WebhookReceived
| Script | Purpose |
|---|---|
provision-namespace.sh | Create namespace with full guardrails |
debug-pod.sh | Automated pod issue diagnosis |
generate-manifest.sh | Generate production-ready K8s manifests |
onboard-team.sh | Team onboarding automation |
template-app.sh | Application scaffolding from templates |
Run any script:
bash scripts/<script-name>.sh [arguments]
CRITICAL: This section ensures agents work effectively across multiple context windows.
Every session MUST begin by reading the progress file:
# 1. Get your bearings
pwd
ls -la
# 2. Read progress file for current agent
cat working/WORKING.md
# 3. Read global logs for context
cat logs/LOGS.md | head -100
# 4. Check for any incidents since last session
cat incidents/INCIDENTS.md | head -50
Before ending ANY session, you MUST:
# 1. Update WORKING.md with current status
# - What you completed
# - What remains
# - Any blockers
# 2. Commit changes to git
git add -A
git commit -m "agent:developer-experience: $(date -u +%Y%m%d%S) --%H%M {summary}"
# 3. Update LOGS.md
# Log what you did, result, and next action
The WORKING.md file is your single source of truth:
## Agent: developer-experience (Desk)
### Current Session
- Started: {ISO timestamp}
- Task: {what you're working on}
### Completed This Session
- {item 1}
- {item 2}
### Remaining Tasks
- {item 1}
- {item 2}
### Blockers
- {blocker if any}
### Next Action
{what the next session should do}
| Rule | Why |
|---|---|
| Work on ONE task at a time | Prevents context overflow |
| Commit after each subtask | Enables recovery from context loss |
| Update WORKING.md frequently | Next agent knows state |
| NEVER skip session end protocol | Loses all progress |
| Keep summaries concise | Fits in context |
If you see these, RESTART the session:
If context is getting full:
# List ACR instances
az acr list -g ${RG} -o table
# Get login server
az acr show -n ${ACR_NAME} --query loginServer
# Build and push image
az acr build -t ${ACR_NAME}.azurecr.io/${APP}:${TAG} -f Dockerfile .
# Create repository
az acr repository create --name ${ACR_NAME} --image ${APP}:${TAG}
# List images
az acr repository list -n ${ACR_NAME} -o table
# Get pull credentials
az acr credential show -n ${ACR_NAME}
# Create PostgreSQL server
az postgres flexible-server create \
--name ${DB_NAME} \
--resource-group ${RG} \
--sku-name Standard_B1ms \
--tier Burstable
# Get connection string
az postgres flexible-server show-connection-string \
--name ${DB_NAME} \
--admin-user ${ADMIN_USER}
# Configure firewall
az postgres flexible-server firewall-rule create \
--name ${DB_NAME} \
--rule-name allow-access \
--start-ip-address 0.0.0.0 \
--end-ip-address 255.255.255.255
# Create key vault
az keyvault create --name ${KV_NAME} --resource-group ${RG}
# Set secret
az keyvault secret set --vault-name ${KV_NAME} --name "api-key" --value "xxx"
# Get secret
az keyvault secret show --vault-name ${KV_NAME} --name "api-key" --query value
# Create access policy
az keyvault set-policy \
--name ${KV_NAME} \
--upn ${USER_EMAIL} \
--secret-permissions get list
# Create storage account
az storage account create \
--name ${STORAGE_NAME} \
--resource-group ${RG} \
--sku Standard_LRS
# Get connection string
az storage account show-connection-string \
--name ${STORAGE_NAME} \
--query connectionString
# Create blob container
az storage container create --name ${CONTAINER} --connection-string ${CONN_STR}
# Create ECR repository
aws ecr create-repository --repository-name ${APP} --image-tag-mutability MUTABLE
# Get login password
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin ${ACCOUNT}.dkr.ecr.us-east-1.amazonaws.com
# Push image
docker tag ${APP}:${TAG} ${ACCOUNT}.dkr.ecr.us-east-1.amazonaws.com/${APP}:${TAG}
docker push ${ACCOUNT}.dkr.ecr.us-east-1.amazonaws.com/${APP}:${TAG}
# List images
aws ecr list-images --repository-name ${APP}
# Scan image for vulnerabilities
aws ecr start-image-scan --repository-name ${APP} --image-tag ${TAG}
# Get scan findings
aws ecr describe-image-scan-findings --repository-name ${APP} --image-tag ${TAG}
# Create PostgreSQL instance
aws rds create-db-instance \
--db-instance-identifier ${DB_NAME} \
--db-instance-class db.t3.micro \
--engine postgres \
--engine-version 15.3 \
--allocated-storage 20 \
--master-username ${ADMIN_USER} \
--master-user-password ${PASSWORD}
# Get connection endpoint
aws rds describe-db-instances \
--db-instance-identifier ${DB_NAME} \
--query 'DBInstances[0].Endpoint.Address'
# Create subnet group
aws rds create-db-subnet-group \
--db-subnet-group-name ${DB_NAME}-subnet \
--subnet-ids ${SUBNET_IDS} \
--description "Subnet group for ${DB_NAME}"
# Create secret
aws secretsmanager create-secret \
--name "dev/${APP}/api-keys" \
--secret-string '{"api_key":"xxx","api_secret":"yyy"}'
# Get secret
aws secretsmanager get-secret-value --secret-id "dev/${APP}/api-keys"
# Update secret
aws secretsmanager update-secret \
--secret-id "dev/${APP}/api-keys" \
--secret-string '{"api_key":"new_key","api_secret":"new_secret"}'
# Create bucket
aws s3 mb s3://${BUCKET_NAME}
# Upload file
aws s3 cp file.txt s3://${BUCKET_NAME}/
# List objects
aws s3 ls s3://${BUCKET_NAME}/
# Generate presigned URL
aws s3 presign s3://${BUCKET_NAME}/file.txt --expires-in 3600
# Enable versioning
aws s3api put-bucket-versioning \
--bucket ${BUCKET_NAME} \
--versioning-configuration Status=Enabled
Keep humans in the loop. Use Slack/Teams for async communication. Use PagerDuty for urgent escalation.
| Channel | Use For | Response Time |
|---|---|---|
| Slack | Namespace requests, onboarding | < 1 hour |
| MS Teams | Namespace requests, onboarding | < 1 hour |
| PagerDuty | Production namespace issues | Immediate |
{
"text": "🎯 *Agent Action Required - DevEx*",
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "*Approval Request from Desk (Developer Experience)*"
}
},
{
"type": "section",
"fields": [
{"type": "mrkdwn", "text": "*Type:*\n{request_type}"},
{"type": "mrkdwn", "text": "*Target:*\n{namespace/team}"},
{"type": "mrkdwn", "text": "*Risk:*\n{risk_level}"},
{"type": "mrkdwn", "text": "*Deadline:*\n{response_deadline}"}
]
},
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "*Request Details:*\n```{request_details}```"
}
},
{
"type": "actions",
"elements": [
{
"type": "button",
"text": {"type": "plain_text", "text": "✅ Approve"},
"style": "primary",
"action_id": "approve_{request_id}"
},
{
"type": "button",
"text": {"type": "plain_text", "text": "❌ Reject"},
"style": "danger",
"action_id": "reject_{request_id}"
}
]
}
]
}
{
"text": "✅ *Desk - Onboarding Complete*",
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "*Team {team_name} has been onboarded*"
}
},
{
"type": "section",
"fields": [
{"type": "mrkdwn", "text": "*Namespace:*\n{namespace}"},
{"type": "mrkdwn", "text": "*Resources Created:*\n{resources}"}
]
}
]
}
curl -X POST 'https://events.pagerduty.com/v2/enqueue' \
-H 'Content-Type: application/json' \
-d '{
"routing_key": "$PAGERDUTY_ROUTING_KEY",
"event_action": "trigger",
"payload": {
"summary": "[Desk] {issue_summary}",
"severity": "{critical|error|warning}",
"source": "desk-developer-experience",
"custom_details": {
"agent": "Desk",
"namespace": "{namespace}",
"issue": "{issue_details}"
}
},
"client": "cluster-agent-swarm"
}'
| Priority | Slack/Teams Wait | PagerDuty Escalation After |
|---|---|---|
| CRITICAL | 5 minutes | 10 minutes total |
| HIGH | 15 minutes | 30 minutes total |
| MEDIUM | 30 minutes | No escalation |
Converted and distributed by TomeVault — claim your Tome and manage your conversions.