| name | k8s-operations |
| description | Kubernetes and OpenShift cluster operations, maintenance, and lifecycle management. Use this skill when:
(1) Performing cluster upgrades (K8s, OCP, EKS, GKE, AKS)
(2) Backup and disaster recovery (etcd, Velero, cluster state)
(3) Node management: drain, cordon, scaling, replacement
(4) Capacity planning and cluster scaling
(5) Certificate rotation and management
(6) etcd maintenance and health checks
(7) Resource quota and limit range management
(8) Namespace lifecycle management
(9) Cluster migration and workload portability
(10) Monitoring and alerting configuration
(11) Log aggregation setup
(12) Cost optimization and resource rightsizing
|
Kubernetes / OpenShift Cluster Operations
Current Versions & Documentation (January 2026)
Key Tools & Versions
| Tool | Version | Install | Purpose |
|---|---|---|---|
| kubeadm | 1.31.x | Package manager | Cluster bootstrap |
| Velero | 1.15.x | Helm/CLI | Backup & restore |
| kube-prometheus-stack | v67.x | Helm | Monitoring |
| VPA | 1.3.x | kubectl apply | Vertical scaling |
| Cluster Autoscaler | 1.31.x | Helm | Node autoscaling |
| Karpenter | 1.1.x | Helm | AWS node provisioning |
Command Usage Convention
IMPORTANT: This skill uses kubectl as the primary command in all examples. When working with:
- OpenShift/ARO clusters: Replace all kubectl commands with oc
- Standard Kubernetes clusters (AKS, EKS, GKE, etc.): Use kubectl as shown
The agent will automatically detect the cluster type and use the appropriate command.
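How that detection happens is not spelled out here. One plausible sketch, assuming an OpenShift cluster can be recognised by the config.openshift.io API group in the output of kubectl api-versions; the function name cli_for_api_versions is hypothetical:

```shell
# Hypothetical helper: read `kubectl api-versions` output on stdin and
# print the CLI that the examples should be run with.
cli_for_api_versions() {
  if grep -q '^config\.openshift\.io/'; then
    echo oc
  else
    echo kubectl
  fi
}

# Usage against a live cluster (needs a working kubeconfig):
#   CLI=$(kubectl api-versions | cli_for_api_versions)
#   "$CLI" get nodes
```

Keeping the decision in a function that only reads text makes it easy to check without a cluster.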
Day-2 operations, maintenance, and lifecycle management for production clusters.
Node Operations
Node Lifecycle
kubectl get nodes -o wide
kubectl describe node ${NODE_NAME}
kubectl top nodes
kubectl get nodes --show-labels
kubectl describe node ${NODE} | grep -A 5 Taints
Drain and Cordon
kubectl cordon ${NODE_NAME}
kubectl drain ${NODE_NAME} \
--ignore-daemonsets \
--delete-emptydir-data \
--grace-period=60 \
--timeout=300s
kubectl drain ${NODE_NAME} \
--ignore-daemonsets \
--delete-emptydir-data \
--force \
--grace-period=30
kubectl uncordon ${NODE_NAME}
Node Maintenance Script
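#!/bin/bash
# Cordon and drain a node ahead of maintenance.
set -euo pipefail
NODE=${1:?"Usage: $0 <node-name>"}
echo "Starting maintenance for node: $NODE"
# Stop new pods from being scheduled onto the node
kubectl cordon "$NODE"
echo "Draining node..."
# Evict pods; the script aborts here if the drain fails or times out
kubectl drain "$NODE" \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=120 \
  --timeout=600s
echo "Remaining pods on node:"
kubectl get pods -A --field-selector spec.nodeName="$NODE"
echo "Node ready for maintenance"
echo "Run 'kubectl uncordon $NODE' when complete"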
Node Scaling
Manual Node Addition (kubeadm)
kubeadm token create --print-join-command
kubeadm join ${CONTROL_PLANE}:6443 --token ${TOKEN} \
--discovery-token-ca-cert-hash sha256:${HASH}
Cluster Autoscaler Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: cluster-autoscaler
          # Use the Cluster Autoscaler release that matches the cluster's Kubernetes minor version
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.31.0
          command:
            - ./cluster-autoscaler
            - --v=4
            - --cloud-provider=${CLOUD_PROVIDER}
            - --nodes=${MIN}:${MAX}:${NODE_GROUP}
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=10m
            - --scale-down-utilization-threshold=0.5
            - --skip-nodes-with-local-storage=false
            - --skip-nodes-with-system-pods=true
            - --balance-similar-node-groups=true
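With --scale-down-utilization-threshold=0.5, a node becomes a scale-down candidate when the pod requests on it add up to less than half of its allocatable capacity. A worked sketch of that arithmetic for CPU, in integer millicores; the function name and the sample numbers are illustrative and not part of Cluster Autoscaler:

```shell
# Illustrative only: succeeds when requested/allocatable is below the
# threshold (given as a percentage to stay in integer arithmetic).
below_scale_down_threshold() {
  local requested_milli="$1"
  local allocatable_milli="$2"
  local threshold_pct="$3"
  [ $(( requested_milli * 100 )) -lt $(( allocatable_milli * threshold_pct )) ]
}

# 1500m requested on a node with 3920m allocatable is about 38%, below 50%
if below_scale_down_threshold 1500 3920 50; then
  echo "candidate for scale-down"
fi
```

Cluster Autoscaler applies further conditions before removing a node, for example whether its pods can be rescheduled elsewhere.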
Backup and Recovery
etcd Backup
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db --write-out=table
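Timestamped snapshots accumulate on disk. A minimal retention sketch, assuming the etcd-YYYYmmdd-HHMMSS.db naming used by the snapshot save command above; prune_etcd_snapshots and the keep count are hypothetical:

```shell
# Hypothetical helper: keep only the newest <keep> snapshots in <dir>.
# Timestamped names sort chronologically, so a lexical sort is oldest-first.
prune_etcd_snapshots() {
  local dir="$1"
  local keep="$2"
  local total excess
  total=$(find "$dir" -maxdepth 1 -name 'etcd-*.db' | wc -l)
  excess=$(( total - keep ))
  [ "$excess" -gt 0 ] || return 0
  find "$dir" -maxdepth 1 -name 'etcd-*.db' | sort | head -n "$excess" |
    while IFS= read -r snapshot; do
      rm -f -- "$snapshot"
      echo "removed $snapshot"
    done
}

# Usage, e.g. from the same cron job that takes the snapshot:
#   prune_etcd_snapshots /backup 7
```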
etcd Restore
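# Stop the API server and etcd static pods before restoring
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
mv /etc/kubernetes/manifests/etcd.yaml /tmp/
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --name=${ETCD_NAME} \
  --initial-cluster=${ETCD_NAME}=https://${ETCD_IP}:2380 \
  --initial-cluster-token=etcd-cluster-1 \
  --initial-advertise-peer-urls=https://${ETCD_IP}:2380 \
  --data-dir=/var/lib/etcd-restored
# Point the etcd hostPath volume at the restored data directory (kubeadm manifest layout assumed)
sed -i 's|path: /var/lib/etcd$|path: /var/lib/etcd-restored|' /tmp/etcd.yaml
# Restart etcd first, then the API server
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/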
Velero Backup
Current Version: Velero 1.15.x (January 2026)
brew install velero
curl -LO https://github.com/vmware-tanzu/velero/releases/download/v1.15.0/velero-v1.15.0-linux-amd64.tar.gz
velero install \
--provider aws \
--bucket ${BUCKET_NAME} \
--secret-file ./credentials-velero \
--backup-location-config region=${REGION} \
--snapshot-location-config region=${REGION} \
--plugins velero/velero-plugin-for-aws:v1.10.0 \
--use-node-agent
velero install \
--provider azure \
--bucket ${CONTAINER_NAME} \
--secret-file ./credentials-velero \
--backup-location-config resourceGroup=${RG},storageAccount=${STORAGE_ACCOUNT} \
--snapshot-location-config resourceGroup=${RG} \
--plugins velero/velero-plugin-for-microsoft-azure:v1.10.0 \
--use-node-agent
velero install \
--provider gcp \
--bucket ${BUCKET_NAME} \
--secret-file ./credentials-velero \
--plugins velero/velero-plugin-for-gcp:v1.10.0 \
--use-node-agent
velero backup create ${BACKUP_NAME} \
--include-namespaces ${NAMESPACES} \
--ttl 720h \
--default-volumes-to-fs-backup
velero schedule create daily-backup \
--schedule="0 2 * * *" \
--include-namespaces ${NAMESPACES} \
--ttl 168h \
--default-volumes-to-fs-backup
velero restore create --from-backup ${BACKUP_NAME}
velero backup describe ${BACKUP_NAME} --details
velero backup logs ${BACKUP_NAME}
velero backup get
velero schedule get
Velero Backup Manifest
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: ${BACKUP_NAME}
  namespace: velero
spec:
  includedNamespaces:
    - ${NAMESPACE_1}
    - ${NAMESPACE_2}
  excludedResources:
    - events
    - events.events.k8s.io
  storageLocation: default
  volumeSnapshotLocations:
    - default
  ttl: 720h0m0s
  snapshotVolumes: true
  defaultVolumesToFsBackup: false
  hooks:
    resources:
      - name: backup-hook
        includedNamespaces:
          - ${NAMESPACE}
        labelSelector:
          matchLabels:
            app: database
        pre:
          - exec:
              container: database
              command:
                - /bin/sh
                - -c
                - "pg_dump -U postgres > /backup/pre-backup.sql"
              onError: Fail
              timeout: 120s
OpenShift Backup
oc debug node/${CONTROL_PLANE_NODE} -- chroot /host \
/usr/local/bin/cluster-backup.sh /home/core/backup
oc get all -A -o yaml > cluster-resources.yaml
oc get pv -o yaml > persistent-volumes.yaml
oc get sc -o yaml > storage-classes.yaml
Cluster Upgrades
Pre-Upgrade Checklist
#!/bin/bash
echo "=== Cluster Version ==="
kubectl version
echo -e "\n=== Node Status ==="
kubectl get nodes
echo -e "\n=== Pods Not Running ==="
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
echo -e "\n=== PDBs That May Block Drain ==="
kubectl get pdb -A
echo -e "\n=== Pending PVCs ==="
kubectl get pvc -A --no-headers | awk '$3 == "Pending"'
echo -e "\n=== Deprecated APIs in Use ==="
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis
echo -e "\n=== etcd Health ==="
kubectl get pods -n kube-system -l component=etcd
echo -e "\n=== Backup Status ==="
ls -la /backup/etcd-*.db 2>/dev/null || echo "No local etcd backups found"
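kubectl get pdb -A lists every PodDisruptionBudget, but only those with zero allowed disruptions will actually stall a drain. A small filter sketch, assuming input lines of the form "namespace name disruptionsAllowed"; blocking_pdbs is a hypothetical name:

```shell
# Hypothetical helper: print "namespace/name" for each PDB whose
# status.disruptionsAllowed is 0. Reads three whitespace-separated
# columns on stdin.
blocking_pdbs() {
  awk '$3 == 0 { print $1 "/" $2 }'
}

# Usage against a live cluster:
#   kubectl get pdb -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed \
#     | blocking_pdbs
```

Any PDB this prints needs more replicas, or a relaxed budget, before its node can be drained.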
Kubernetes Upgrade (kubeadm)
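# --- First control plane node ---
# List available kubeadm packages. The apt package version usually carries a
# revision suffix (e.g. 1.31.0-1.1), while kubeadm upgrade apply takes the bare version.
apt update && apt-cache madison kubeadm
apt-mark unhold kubeadm && apt-get update && apt-get install -y kubeadm=${VERSION} && apt-mark hold kubeadm
# Review the upgrade plan, then apply it
kubeadm upgrade plan
kubeadm upgrade apply v${VERSION}
# Upgrade kubelet and kubectl, then restart the kubelet
apt-mark unhold kubelet kubectl && apt-get update && apt-get install -y kubelet=${VERSION} kubectl=${VERSION} && apt-mark hold kubelet kubectl
systemctl daemon-reload && systemctl restart kubelet
# --- Each worker node, one at a time ---
# From a machine with cluster access: drain the node
kubectl drain ${NODE} --ignore-daemonsets --delete-emptydir-data
# On the worker node: upgrade kubeadm, the node configuration, and the kubelet
apt-mark unhold kubeadm && apt-get update && apt-get install -y kubeadm=${VERSION} && apt-mark hold kubeadm
kubeadm upgrade node
apt-mark unhold kubelet kubectl && apt-get update && apt-get install -y kubelet=${VERSION} kubectl=${VERSION} && apt-mark hold kubelet kubectl
systemctl daemon-reload && systemctl restart kubelet
# From a machine with cluster access: return the node to service
kubectl uncordon ${NODE}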
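kubeadm upgrades move one minor version at a time, so it is worth checking the gap before running kubeadm upgrade apply. A sketch of that check; minor_upgrade_ok is a hypothetical helper and expects versions such as 1.30.4 or v1.31.0:

```shell
# Hypothetical helper: succeeds when the target is the same minor version
# (a patch upgrade) or exactly one minor version ahead of the current one.
minor_upgrade_ok() {
  local current="${1#v}"
  local target="${2#v}"
  local cur_major="${current%%.*}"
  local tgt_major="${target%%.*}"
  local cur_rest="${current#*.}"
  local tgt_rest="${target#*.}"
  local cur_minor="${cur_rest%%.*}"
  local tgt_minor="${tgt_rest%%.*}"
  [ "$cur_major" -eq "$tgt_major" ] || return 1
  local gap=$(( tgt_minor - cur_minor ))
  [ "$gap" -ge 0 ] && [ "$gap" -le 1 ]
}

# Usage:
#   minor_upgrade_ok 1.30.4 v1.31.0 && echo "supported upgrade path"
```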
AKS Upgrade (Azure)
Documentation: https://learn.microsoft.com/azure/aks/upgrade-cluster
az aks get-versions --location ${LOCATION} -o table
az aks show --resource-group ${RG} --name ${CLUSTER} --query kubernetesVersion
az aks get-upgrades --resource-group ${RG} --name ${CLUSTER} -o table
az aks upgrade --resource-group ${RG} --name ${CLUSTER} \
--kubernetes-version 1.31.0
az aks upgrade --resource-group ${RG} --name ${CLUSTER} \
--kubernetes-version 1.31.0 \
--control-plane-only
az aks nodepool upgrade --resource-group ${RG} --cluster-name ${CLUSTER} \
--name ${NODEPOOL} --kubernetes-version 1.31.0
az aks nodepool upgrade --resource-group ${RG} --cluster-name ${CLUSTER} \
--name ${NODEPOOL} --kubernetes-version 1.31.0 \
--max-surge 33%
az aks update --resource-group ${RG} --name ${CLUSTER} \
--auto-upgrade-channel stable
az aks show --resource-group ${RG} --name ${CLUSTER} \
--query 'provisioningState'
GKE Upgrade (Google Cloud)
Documentation: https://cloud.google.com/kubernetes-engine/docs/how-to/upgrading-a-cluster
gcloud container get-server-config --region ${REGION}
gcloud container clusters list --format="table(name,currentMasterVersion,currentNodeVersion)"
gcloud container clusters describe ${CLUSTER} --region ${REGION} \
--format="get(currentMasterVersion)"
gcloud container clusters upgrade ${CLUSTER} --region ${REGION} \
--master --cluster-version 1.31
gcloud container clusters upgrade ${CLUSTER} --region ${REGION} \
--node-pool ${POOL} \
--cluster-version 1.31
gcloud container node-pools update ${POOL} --cluster ${CLUSTER} \
--region ${REGION} \
--enable-blue-green-upgrade \
--node-pool-soak-duration 3600s
gcloud container clusters update ${CLUSTER} --region ${REGION} \
--release-channel regular
gcloud container clusters update ${CLUSTER} --region ${REGION} \
--maintenance-window-start 2026-01-01T02:00:00Z \
--maintenance-window-end 2026-01-01T06:00:00Z \
--maintenance-window-recurrence "FREQ=DAILY"
gcloud container operations list --filter="targetLink~${CLUSTER}"
OpenShift Upgrade
Documentation: https://docs.openshift.com/container-platform/4.17/updating/index.html
oc adm upgrade
oc get clusterversion
oc get clusterversion version -o jsonpath='{.spec.channel}'
oc adm upgrade channel stable-4.17
oc adm upgrade channel eus-4.16
oc adm upgrade --to-latest
oc adm upgrade --to=4.17.5
oc adm upgrade --to-multi-arch --to=4.16.0
watch -n 10 'oc get clusterversion && echo && oc get clusteroperators | grep -v "True.*False.*False"'
oc get clusterversion version -o jsonpath='{.status.conditions}' | jq
oc get clusteroperators
oc get nodes
oc get mcp
oc adm upgrade --include=current
oc describe clusterversion version
az aro update --resource-group ${RG} --name ${CLUSTER}
rosa upgrade cluster --cluster ${CLUSTER} --version 4.17.5
rosa describe upgrade --cluster ${CLUSTER}
EKS Upgrade
Documentation: https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html
aws eks describe-cluster --name ${CLUSTER_NAME} --query 'cluster.version'
aws eks describe-addon-versions --kubernetes-version 1.31 | jq '.addons[].addonName'
brew upgrade awscli eksctl
aws eks update-cluster-version \
--name ${CLUSTER_NAME} \
--kubernetes-version 1.31
aws eks wait cluster-active --name ${CLUSTER_NAME}
aws eks describe-update --name ${CLUSTER_NAME} --update-id ${UPDATE_ID}
for addon in vpc-cni coredns kube-proxy eks-pod-identity-agent; do
aws eks update-addon --cluster-name ${CLUSTER_NAME} \
--addon-name $addon \
--resolve-conflicts PRESERVE
done
aws eks update-nodegroup-version \
--cluster-name ${CLUSTER_NAME} \
--nodegroup-name ${NODEGROUP_NAME}
helm upgrade karpenter oci://public.ecr.aws/karpenter/karpenter \
--version 1.1.0 \
--namespace karpenter \
--set settings.clusterName=${CLUSTER_NAME}
kubectl get nodeclaims
kubectl get nodes
kubectl version
Resource Management
Resource Quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: ${NAMESPACE}
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
    persistentvolumeclaims: "10"
    requests.storage: 100Gi
    count/deployments.apps: "20"
    count/services: "20"
    count/secrets: "50"
    count/configmaps: "50"
Limit Ranges
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: ${NAMESPACE}
spec:
  limits:
    - type: Container
      default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      max:
        cpu: "4"
        memory: 8Gi
      min:
        cpu: 50m
        memory: 64Mi
    - type: PersistentVolumeClaim
      max:
        storage: 50Gi
      min:
        storage: 1Gi
Check Resource Usage
kubectl describe quota -n ${NAMESPACE}
kubectl top pods -n ${NAMESPACE} --sort-by=memory
kubectl top pods -n ${NAMESPACE} --sort-by=cpu
kubectl describe nodes | grep -A 5 "Allocated resources"
kubectl get pods -n ${NAMESPACE} -o custom-columns=\
NAME:.metadata.name,\
CPU_REQ:.spec.containers[*].resources.requests.cpu,\
CPU_LIM:.spec.containers[*].resources.limits.cpu,\
MEM_REQ:.spec.containers[*].resources.requests.memory,\
MEM_LIM:.spec.containers[*].resources.limits.memory
Certificate Management
Check Certificate Expiry
kubeadm certs check-expiration
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates
for cert in /etc/kubernetes/pki/*.crt; do
  echo "=== $cert ==="
  openssl x509 -in "$cert" -noout -dates
done
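The loop above prints raw dates, which still have to be compared by eye. A sketch that turns an openssl notAfter value into days remaining, assuming GNU date (for the -d option); days_until_expiry is a hypothetical helper and takes the reference time as an argument so the arithmetic is easy to check:

```shell
# Hypothetical helper: print whole days between <now_epoch> and the
# certificate's notAfter date (negative once the certificate has expired).
days_until_expiry() {
  local not_after="$1"
  local now_epoch="$2"
  local expiry_epoch
  expiry_epoch=$(date -u -d "$not_after" +%s) || return 1
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Usage on a control plane node:
#   end=$(openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -enddate | cut -d= -f2)
#   days_until_expiry "$end" "$(date +%s)"
```

Comparing the result against a threshold such as 30 gives a simple early warning that can run from cron.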
Rotate Certificates (kubeadm)
kubeadm certs renew all
kubeadm certs renew apiserver
crictl pods --name kube-apiserver -q | xargs crictl stopp
crictl pods --name kube-controller-manager -q | xargs crictl stopp
crictl pods --name kube-scheduler -q | xargs crictl stopp
OpenShift Certificate Management
oc get certificates -A
oc get certificatesigningrequests
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
Monitoring Setup
Prometheus Stack (kube-prometheus-stack)
Current Version: Chart v67.x, Prometheus v2.55.x, Grafana v11.x (January 2026)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--version 67.x \
--set prometheus.prometheusSpec.retention=30d \
--set prometheus.prometheusSpec.retentionSize=50GB \
--set prometheus.prometheusSpec.replicas=2 \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=gp3 \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi \
--set prometheus.prometheusSpec.resources.requests.memory=2Gi \
--set prometheus.prometheusSpec.resources.limits.memory=4Gi \
--set alertmanager.alertmanagerSpec.replicas=3 \
--set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.storageClassName=gp3 \
--set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.resources.requests.storage=10Gi \
--set grafana.persistence.enabled=true \
--set grafana.persistence.size=10Gi \
--set grafana.persistence.storageClassName=gp3 \
--set grafana.adminPassword="${GRAFANA_PASSWORD}" \
--set grafana.sidecar.dashboards.enabled=true \
--set grafana.sidecar.dashboards.searchNamespace=ALL
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
kubectl get servicemonitor -A
kubectl get prometheusrules -A
Custom ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ${APP_NAME}
  namespace: monitoring
  labels:
    release: prometheus
spec:
  namespaceSelector:
    matchNames:
      - ${NAMESPACE}
  selector:
    matchLabels:
      app.kubernetes.io/name: ${APP_NAME}
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scheme: http
PrometheusRule (Alerts)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ${APP_NAME}-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: ${APP_NAME}.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{job="${APP_NAME}",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="${APP_NAME}"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"
        - alert: PodCrashLooping
          expr: |
            rate(kube_pod_container_status_restarts_total{namespace="${NAMESPACE}"}[15m]) * 60 * 15 > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod crash looping"
            description: "Pod {{ $labels.pod }} is crash looping"
Logging Setup
Fluent Bit DaemonSet
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush 5
        Log_Level info
        Daemon off
        Parsers_File parsers.conf
    [INPUT]
        Name tail
        Tag kube.*
        Path /var/log/containers/*.log
        Parser cri
        DB /var/log/flb_kube.db
        Mem_Buf_Limit 5MB
        Skip_Long_Lines On
        Refresh_Interval 10
    [FILTER]
        Name kubernetes
        Match kube.*
        Kube_URL https://kubernetes.default.svc:443
        Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
        Merge_Log On
        K8S-Logging.Parser On
        K8S-Logging.Exclude On
    [OUTPUT]
        Name es
        Match *
        Host ${ELASTICSEARCH_HOST}
        Port 9200
        Logstash_Format On
        Retry_Limit False
  parsers.conf: |
    [PARSER]
        Name cri
        Format regex
        Regex ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<message>.*)$
        Time_Key time
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z
Cost Optimization
Resource Rightsizing Script
#!/bin/bash
# Flag pods whose usage is far below their requests (summed across containers).
to_milli() {  # CPU quantity to millicores: "250m" -> 250, "2" -> 2000
  awk -v q="$1" 'BEGIN { n = q + 0; if (q !~ /m$/) n = n * 1000; printf "%d", n }'
}
to_mi() {  # memory quantity to Mi: handles Ki, Mi, Gi and plain bytes only
  awk -v q="$1" 'BEGIN { n = q + 0
    if (q ~ /Ki$/) n = n / 1024; else if (q ~ /Gi$/) n = n * 1024
    else if (q !~ /Mi$/) n = n / 1048576
    printf "%d", n }'
}
sum_requests() {  # args: namespace pod resource converter
  local total=0 q
  for q in $(kubectl get pod "$2" -n "$1" -o jsonpath="{.spec.containers[*].resources.requests.$3}" 2>/dev/null); do
    total=$(( total + $("$4" "$q") ))
  done
  echo "$total"
}
echo "=== Pods with CPU Usage < 10% of Requests ==="
kubectl top pods -A --no-headers | while read -r ns pod cpu mem; do
  cpu_milli=$(to_milli "$cpu")
  requests=$(sum_requests "$ns" "$pod" cpu to_milli)
  if [ "$requests" -gt 0 ]; then
    usage_pct=$(( cpu_milli * 100 / requests ))
    if [ "$usage_pct" -lt 10 ]; then
      echo "$ns/$pod: ${usage_pct}% CPU utilization (${cpu_milli}m / ${requests}m requested)"
    fi
  fi
done
echo -e "\n=== Pods with Memory Usage < 20% of Requests ==="
kubectl top pods -A --no-headers | while read -r ns pod cpu mem; do
  mem_mi=$(to_mi "$mem")
  requests_mi=$(sum_requests "$ns" "$pod" memory to_mi)
  if [ "$requests_mi" -gt 0 ]; then
    usage_pct=$(( mem_mi * 100 / requests_mi ))
    if [ "$usage_pct" -lt 20 ]; then
      echo "$ns/$pod: ${usage_pct}% Memory utilization (${mem_mi}Mi / ${requests_mi}Mi requested)"
    fi
  fi
done
VerticalPodAutoscaler
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: ${APP_NAME}-vpa
  namespace: ${NAMESPACE}
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ${APP_NAME}
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
        controlledResources: ["cpu", "memory"]
Namespace Lifecycle
Namespace Template
apiVersion: v1
kind: Namespace
metadata:
  name: ${NAMESPACE}
  labels:
    app.kubernetes.io/managed-by: cluster-code
    environment: ${ENVIRONMENT}
    team: ${TEAM}
    cost-center: ${COST_CENTER}
  annotations:
    owner: ${OWNER_EMAIL}
    description: "${DESCRIPTION}"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: default-quota
  namespace: ${NAMESPACE}
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: ${NAMESPACE}
spec:
  limits:
    - type: Container
      default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: ${NAMESPACE}
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
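The ${...} placeholders in this template have to be filled in before it can be applied. A minimal rendering sketch, assuming values are passed as KEY=value arguments; render_placeholders is a hypothetical helper, and envsubst from gettext does a similar job where it is installed:

```shell
# Hypothetical helper: replace each literal ${KEY} on stdin with its value.
# Placeholders without a matching argument are left untouched.
# Note: awk -v interprets backslash escapes in the value.
render_placeholders() {
  local content kv
  content=$(cat)
  for kv in "$@"; do
    content=$(printf '%s\n' "$content" |
      awk -v key="\${${kv%%=*}}" -v val="${kv#*=}" '
        { out = ""
          while ((i = index($0, key)) > 0) {
            out = out substr($0, 1, i - 1) val
            $0 = substr($0, i + length(key))
          }
          print out $0 }')
  done
  printf '%s\n' "$content"
}

# Usage (the file name is an assumption):
#   render_placeholders NAMESPACE=payments TEAM=core < namespace-template.yaml | kubectl apply -f -
```

The replacement is a literal string match rather than a regular expression, so values containing characters such as | or / need no escaping.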
Safe Namespace Deletion
#!/bin/bash
# Review a namespace's contents, then delete it after explicit confirmation.
NS=${1:?"Usage: $0 <namespace>"}
echo "=== Resources in namespace $NS ==="
kubectl get all -n "$NS"
echo -e "\n=== PVCs (data will be lost!) ==="
kubectl get pvc -n "$NS"
echo -e "\n=== Secrets ==="
kubectl get secrets -n "$NS"
read -r -p "Delete namespace $NS? (yes/no): " confirm
if [ "$confirm" = "yes" ]; then
  # Clear finalizers on workload resources so they cannot block deletion.
  # This skips any cleanup those finalizers would have performed.
  kubectl get all -n "$NS" -o name | xargs -I {} kubectl patch {} -n "$NS" -p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null
  kubectl delete namespace "$NS" --timeout=120s
  if kubectl get namespace "$NS" &>/dev/null; then
    echo "Namespace stuck, removing finalizers..."
    kubectl get namespace "$NS" -o json | jq '.spec.finalizers = []' | kubectl replace --raw "/api/v1/namespaces/$NS/finalize" -f -
  fi
fi
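Because the script strips finalizers and force-finalizes a stuck namespace, running it against a system namespace would be hard to undo. A guard sketch that could run before the confirmation prompt; the list of protected names is an assumption and is_protected_namespace is hypothetical:

```shell
# Hypothetical guard: succeeds for namespaces that should never be deleted
# by this script. Extend the pattern list to suit the cluster.
is_protected_namespace() {
  case "$1" in
    default|kube-system|kube-public|kube-node-lease|openshift|openshift-*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage near the top of the deletion script:
#   if is_protected_namespace "$NS"; then
#     echo "Refusing to delete protected namespace: $NS" >&2
#     exit 1
#   fi
```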
Disaster Recovery Runbook
Full Cluster Recovery Checklist
- Restore etcd
- Verify Control Plane
  kubectl get nodes
  kubectl get pods -n kube-system
  kubectl cluster-info
- Restore Workloads (if using Velero)
  velero restore create --from-backup ${BACKUP_NAME}
  velero restore describe ${RESTORE_NAME}
- Verify Application Health
  kubectl get pods -A
  kubectl get svc -A
  kubectl get ingress -A
- Restore Secrets (if external)
  kubectl annotate externalsecret -A force-sync=$(date +%s) --overwrite
- Verify DNS and Networking
  kubectl run dns-test --image=busybox --rm -it --restart=Never -- nslookup kubernetes
- Validate Data Integrity
- Update DNS/Load Balancers