name

kubernetes-skill

description

Prevent Kubernetes hallucinations by diagnosing and fixing failure modes: insecure workload defaults, resource starvation, network exposure, privilege sprawl, fragile rollouts, and API drift. Use when generating, reviewing, refactoring, or migrating manifests, Helm charts, Kustomize overlays, cluster policies, and platform-specific Kubernetes work for EKS, GKE, AKS, OpenShift, GitOps controllers, or observability stacks.

KubeShark: Failure-Mode Workflow for Kubernetes

Run this workflow top to bottom.

1) Capture execution context

Record before writing manifests:

cluster version (e.g. 1.30, 1.31) and distribution (EKS, GKE, AKS, k3s, vanilla)
target namespace and environment criticality (dev/staging/prod)
workload type (Deployment, StatefulSet, Job, CronJob, DaemonSet)
deployment method (raw YAML, Helm, Kustomize, operator-managed)
policy enforcement (Pod Security Admission level, Kyverno, OPA/Gatekeeper)
cloud provider and CNI (affects networking, storage classes, load balancers)
platform controllers/add-ons (GitOps, observability, ingress, service mesh, autoscaling)

If unknown, state assumptions explicitly.
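One way to make the "state assumptions explicitly" rule mechanical is to hold the context in a small record and list whichever fields are still empty. The sketch below is illustrative only: the `ExecutionContext` class, its field names, and the `unknowns` helper are hypothetical and not part of any Kubernetes tooling.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ExecutionContext:
    """Hypothetical record of the context items listed above."""
    cluster_version: Optional[str] = None     # e.g. "1.30"
    distribution: Optional[str] = None        # e.g. "EKS", "k3s"
    namespace: Optional[str] = None
    criticality: Optional[str] = None         # "dev", "staging", or "prod"
    workload_type: Optional[str] = None       # e.g. "Deployment"
    deployment_method: Optional[str] = None   # e.g. "Helm"
    policy_enforcement: Optional[str] = None  # e.g. "PSA restricted"
    cloud_and_cni: Optional[str] = None
    platform_addons: Optional[str] = None

    def unknowns(self) -> list[str]:
        """Return the names of fields that still need a stated assumption."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


ctx = ExecutionContext(
    cluster_version="1.30",
    distribution="EKS",
    workload_type="Deployment",
    deployment_method="Helm",
)
for name in ctx.unknowns():
    print(f"ASSUMPTION NEEDED: {name} is unknown")
```

Anything the loop prints should appear in the final answer as an explicit assumption rather than being silently defaulted.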

2) Diagnose likely failure mode(s)

Select one or more based on user intent and risk:

insecure workload defaults: missing security contexts, PSS violations, host access
resource starvation: missing requests/limits, no PDB, unpredictable scheduling and eviction
network exposure: flat networking, missing policies, wrong Service types, DNS issues
privilege sprawl: overly permissive RBAC, leaked secrets, excess ServiceAccount rights
fragile rollouts: misconfigured probes, mutable tags, unsafe update strategies
API drift: wrong apiVersion, deprecated APIs, schema violations, tool-specific errors
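As a rough illustration of how some of these failure modes show up in a manifest, the sketch below inspects a Deployment that has already been parsed into a Python dict. The `diagnose` function and its heuristics are assumptions made for this example; it covers only four of the six modes (network exposure and privilege sprawl need other resources to judge) and does not replace a policy engine.

```python
def diagnose(deployment: dict) -> set[str]:
    """Return failure modes suggested by a Deployment manifest given as a dict."""
    findings = set()
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})

    # Host namespace access is one signal of insecure workload defaults.
    if pod_spec.get("hostNetwork") or pod_spec.get("hostPID"):
        findings.add("insecure workload defaults")

    for container in pod_spec.get("containers", []):
        sc = container.get("securityContext", {})
        # Privilege escalation is allowed unless explicitly set to False.
        if sc.get("privileged") or sc.get("allowPrivilegeEscalation", True):
            findings.add("insecure workload defaults")

        resources = container.get("resources", {})
        if not resources.get("requests") or not resources.get("limits"):
            findings.add("resource starvation")

        # Heuristic: a digest pins the image; a missing tag or ":latest" does not.
        image = container.get("image", "")
        name = image.rsplit("/", 1)[-1]
        if "@" not in image and (":" not in name or name.endswith(":latest")):
            findings.add("fragile rollouts")
        if not container.get("readinessProbe"):
            findings.add("fragile rollouts")

    # Deployments are served from apps/v1; older groups were removed in 1.16.
    if deployment.get("apiVersion") != "apps/v1":
        findings.add("API drift")
    return findings


deployment = {
    "apiVersion": "extensions/v1beta1",
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {"name": "web", "image": "nginx:latest"},
    ]}}},
}
print(sorted(diagnose(deployment)))
```

The example manifest trips all four checks, which is typical of a manifest written from memory without a security context, resource settings, or probes.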

3) Load only the relevant reference file(s)

Primary failure-mode references:

references/insecure-workload-defaults.md
references/resource-starvation.md
references/network-exposure.md
references/privilege-sprawl.md
references/fragile-rollouts.md
references/api-drift.md

Supplemental references (only when needed):

references/deployment-patterns.md
references/stateful-patterns.md
references/job-patterns.md
references/daemonset-operator-patterns.md
references/security-hardening.md
references/observability.md
references/multi-tenancy.md
references/storage-and-state.md
references/helm-patterns.md
references/kustomize-patterns.md
references/validation-and-policy.md
references/examples-good.md
references/examples-bad.md
references/do-dont-patterns.md

Conditional Reference Retrieval (CRR) references (load only when the signal is detected):

references/conditional/eks-patterns.md for EKS, AWS, IRSA, EKS Pod Identity, AWS Load Balancer Controller, EBS/EFS CSI, Karpenter
references/conditional/gke-patterns.md for GKE, Autopilot, Workload Identity Federation for GKE, Dataplane V2, GCE Ingress, Config Sync
references/conditional/aks-patterns.md for AKS, Microsoft Entra Workload ID, Azure CNI, AGIC, Azure Disk/File/Blob CSI
references/conditional/openshift-patterns.md for OpenShift, OKD, ROSA, ARO, Routes, SCCs, OLM, oc
references/conditional/gitops-controllers.md for Argo CD, ApplicationSet, Flux, GitOps reconciliation, sync waves
references/conditional/observability-stacks.md for Prometheus Operator, ServiceMonitor, PodMonitor, OpenTelemetry, Loki, Grafana

Do not load multiple CRR files unless the task spans multiple detected platforms/tools.
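The signal-detection rule can be pictured as a keyword lookup over the task description. The sketch below is a hypothetical implementation: the `CRR_SIGNALS` table holds only a subset of the signals listed above, and the whole-word, case-insensitive matching is an assumption rather than a documented behaviour.

```python
import re

# Subset of the signals listed above, keyed by the reference file they trigger.
CRR_SIGNALS = {
    "references/conditional/eks-patterns.md": ["EKS", "AWS", "IRSA", "Karpenter"],
    "references/conditional/gke-patterns.md": ["GKE", "Autopilot", "Config Sync"],
    "references/conditional/aks-patterns.md": ["AKS", "Azure CNI", "AGIC"],
    "references/conditional/openshift-patterns.md": ["OpenShift", "OKD", "SCCs", "OLM"],
    "references/conditional/gitops-controllers.md": ["Argo CD", "ApplicationSet", "Flux"],
    "references/conditional/observability-stacks.md": [
        "Prometheus Operator", "ServiceMonitor", "PodMonitor", "OpenTelemetry",
    ],
}


def select_crr_files(task: str) -> list[str]:
    """Return the CRR files whose signals appear as whole words in the task text."""
    selected = []
    for path, signals in CRR_SIGNALS.items():
        pattern = r"\b(?:" + "|".join(re.escape(s) for s in signals) + r")\b"
        if re.search(pattern, task, flags=re.IGNORECASE):
            selected.append(path)
    return selected


print(select_crr_files("Review this Helm chart for EKS with Karpenter"))
print(select_crr_files("Tune probes on a Deployment"))
```

A task that names no platform or tool selects nothing, which matches the instruction to load CRR files only when a signal is detected.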

4) Propose fix path with explicit risk controls

For each fix, include:

why this addresses the failure mode
what could still go wrong at deploy time or runtime
guardrails (validation commands, policy checks, rollback path)

5) Generate implementation artifacts

When applicable, output:

Kubernetes manifests (YAML with security contexts, resource limits, labels)
Helm values/templates or Kustomize overlays
NetworkPolicies, RBAC resources, PodDisruptionBudgets
Policy rules (Kyverno/OPA) and admission controls

6) Validate before finalize

Always provide validation steps tailored to deployment method and risk tier:

kubectl apply --dry-run=server or kubectl diff
kubeconform for schema validation against target cluster version
cross-resource consistency check (label/selector/port alignment)
policy scan (PSS profile check, Kyverno/OPA audit)

Never recommend direct production apply without reviewed diff and approval.
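The cross-resource consistency check is the step least covered by off-the-shelf commands, so a minimal sketch may help. It assumes the Service and Deployment have already been parsed into dicts; the function name `check_service_matches_deployment` is hypothetical, and a numeric `targetPort` that is not declared as a `containerPort` is treated as a warning worth reviewing, since declared container ports are informational.

```python
def check_service_matches_deployment(service: dict, deployment: dict) -> list[str]:
    """Report label, selector, and port mismatches between a Service and a Deployment."""
    problems = []
    template = deployment["spec"]["template"]
    pod_labels = template["metadata"].get("labels", {})

    # The Deployment's own selector must match its pod template labels.
    match_labels = deployment["spec"]["selector"].get("matchLabels", {})
    for key, value in match_labels.items():
        if pod_labels.get(key) != value:
            problems.append(f"Deployment selector {key}={value} not in pod labels")

    # Every Service selector pair must appear in the pod labels.
    for key, value in service["spec"].get("selector", {}).items():
        if pod_labels.get(key) != value:
            problems.append(f"Service selector {key}={value} not in pod labels")

    # Collect declared container ports by number and by name.
    numbers, names = set(), set()
    for container in template["spec"]["containers"]:
        for port in container.get("ports", []):
            numbers.add(port["containerPort"])
            if "name" in port:
                names.add(port["name"])

    # targetPort defaults to the Service port when omitted.
    for port in service["spec"].get("ports", []):
        target = port.get("targetPort", port["port"])
        if target not in numbers and target not in names:
            problems.append(f"Service targetPort {target} matches no declared containerPort")
    return problems


deployment = {
    "spec": {
        "selector": {"matchLabels": {"app": "web"}},
        "template": {
            "metadata": {"labels": {"app": "web"}},
            "spec": {"containers": [
                {"name": "web", "ports": [{"name": "http", "containerPort": 8080}]},
            ]},
        },
    }
}
service = {
    "spec": {
        "selector": {"app": "webapp"},
        "ports": [{"port": 80, "targetPort": "https"}],
    }
}
problems = check_service_matches_deployment(service, deployment)
for problem in problems:
    print(problem)
```

Both mismatches in this example pass schema validation, which is why the alignment check is listed separately from kubeconform and the server-side dry run.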

7) Output contract

Return:

assumptions and the minimum cluster version the output targets
selected failure mode(s)
chosen remediation and tradeoffs
validation/test plan
rollback/recovery notes (rollout undo, revision history, data safety)