| name | gpu-node-orphans |
| description | Use when GPU pods or PVCs linger with no owner reference and accelerator spend keeps climbing, when a finished training run left volumes behind, or when someone asks to "reclaim idle GPUs" or "clean up orphaned GPU nodes." Finds true orphans (no owner ref AND no recent activity), posts them to Slack for a soak period, and only after explicit human confirmation performs a guarded, cascading cleanup. Pushes back on deleting anything merely idle. |
GPU Node Orphan Reclamation
Overview
GPU capacity is the most expensive resource in the fleet, and the easiest to leak.
A training job crashes, its pod loses its owning Job, the PVC keeps its disk, and
the accelerator sits reserved but unused — quietly billing. This skill finds those
genuinely orphaned GPU pods and PersistentVolumeClaims, broadcasts the list for
human review, waits out a soak period, and then performs a guarded cascading
cleanup that cannot fire without an explicit confirmation token.
The hard rule: idle is not orphaned. A resource qualifies only when it has no
owner reference AND has shown no activity within the soak window. Both checks, every
time.
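The two-signal rule above can be sketched as a single predicate. This is a minimal illustration, not the contents of scripts/find_orphans.py: the function name `is_orphan` is hypothetical, and how "last activity" is actually measured is an assumption left to the caller. Only the `metadata.ownerReferences` field comes from the Kubernetes object schema.

```python
from datetime import datetime, timedelta, timezone

SOAK_WINDOW = timedelta(hours=48)  # default soak window used by this skill


def is_orphan(resource: dict, last_activity: datetime, now: datetime,
              window: timedelta = SOAK_WINDOW) -> bool:
    """Return True only when BOTH orphan signals hold (hypothetical helper)."""
    # Signal 1: the object carries no owner reference at all.
    owners = resource.get("metadata", {}).get("ownerReferences") or []
    unowned = len(owners) == 0
    # Signal 2: nothing has happened within the soak window.
    # ASSUMPTION: the caller supplies last_activity from whatever source it trusts.
    idle = (now - last_activity) >= window
    return unowned and idle


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
paused_but_owned = {"metadata": {"ownerReferences": [{"kind": "Job", "name": "train"}]}}
unowned_pod = {"metadata": {"name": "trainer-0"}}

# Idle and owned: not an orphan. Unowned but recently active: not an orphan either.
print(is_orphan(paused_but_owned, now - timedelta(hours=72), now))
print(is_orphan(unowned_pod, now - timedelta(hours=2), now))
print(is_orphan(unowned_pod, now - timedelta(hours=72), now))
```

Combining the signals with `and` is the design point: either check alone produces false positives.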
When to Use
Use this skill when you see symptoms like:
- GPU utilization dashboards show reserved-but-0%-busy accelerators for hours/days.
- kubectl get pods shows GPU pods with no ownerReferences and no recent restarts.
- Accelerator spend is climbing with no matching increase in active training runs.
- Old PVCs remain after their jobs completed, holding disks (and cost).
- Someone asks to "reclaim idle GPUs," "find orphaned GPU nodes," or "clean up after that failed run."
Do NOT use this skill when:
- A long-running job is merely idle between steps — it may be owned and wanted.
- You cannot confirm a soak period has elapsed (do not shortcut the wait).
- The user has not yet seen and approved the candidate list (never cascade blind).
- The work is general cost analysis, not GPU resource cleanup — use
cost-investigation.
The soak-then-confirm flow
This skill is deliberately slow. Speed is how you delete someone's three-day run.
1. Detect. Run scripts/find_orphans.py to list candidate orphaned GPU pods and
PVCs. The script flags a resource only when it has no owner reference AND has
been idle past the soak window (default 48h). The output is a JSON report and
nothing is deleted.
python scripts/find_orphans.py --all-namespaces --idle-hours 48 --json /tmp/orphans.json
2. Broadcast. Post the candidate list to the team's Slack channel (e.g.
#ml-infra) for visibility. Anyone who recognizes a resource as still-wanted can
veto it. Keep the report as the record of what was proposed.
3. Soak. Wait out the soak period (default 48h) after broadcasting. The point is
to give owners time to object and to let a "quiet" job prove it really is dead. Do
not collapse this wait because a resource "looks obviously dead."
4. Confirm. The operator reviews the (possibly trimmed) list and explicitly
approves the exact resources to delete. This is a human decision, not a model one.
5. Cleanup, guarded. Only after confirmation, perform the cascading delete. The
guard hook (below) blocks every destructive command until the operator sets the
confirmation token for this session:
export GPU_ORPHAN_CONFIRM=1
kubectl delete pod <name> -n <ns>
kubectl delete pvc <name> -n <ns>
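How the Soak, Confirm, and Cleanup steps gate one another can be sketched as a small filter. This is an illustrative sketch under stated assumptions: the function name `approved_for_deletion` and the `(namespace, kind, name)` tuple shape are hypothetical, not the report format the skill emits.

```python
from datetime import datetime, timedelta, timezone


def approved_for_deletion(candidates, vetoed, approved, broadcast_at, now,
                          soak=timedelta(hours=48)):
    """Return the resources that may be deleted (hypothetical helper).

    Returns an empty list while the soak period is still running, no matter
    how "obviously dead" a candidate looks.
    """
    if now - broadcast_at < soak:
        return []
    vetoed, approved = set(vetoed), set(approved)
    # A resource survives only if a human approved it AND nobody vetoed it.
    return [c for c in candidates if c in approved and c not in vetoed]


# ASSUMPTION: candidates are identified as (namespace, kind, name) tuples.
candidates = [("ml", "pod", "trainer-0"), ("ml", "pvc", "trainer-data"), ("ml", "pod", "eval-3")]
vetoed = [("ml", "pod", "eval-3")]                               # an owner objected in Slack
approved = [("ml", "pod", "trainer-0"), ("ml", "pod", "eval-3")]  # operator sign-off
posted = datetime(2024, 1, 1, tzinfo=timezone.utc)

print(approved_for_deletion(candidates, vetoed, approved, posted, posted + timedelta(hours=47)))
print(approved_for_deletion(candidates, vetoed, approved, posted, posted + timedelta(hours=48)))
```

A veto wins over an approval here, and anything the operator did not explicitly approve is left alone; both choices err toward keeping resources.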
On-demand guard hook
While this skill is active, hooks/guard.sh is registered as a PreToolUse hook
for this session only — an on-demand guardrail, not a global one. It is removed
when the session ends.
The hook reads each Bash command and denies anything destructive
(kubectl delete, kubectl ... --force, rm -rf, force-push) unless
GPU_ORPHAN_CONFIRM=1 is set in the environment. That token is the same human gate
from step 4: the model cannot reach it on its own, so it cannot cascade-delete on a
whim. Set the token only after the soak has elapsed and the list is approved; unset
it (or end the session) the moment cleanup is done.
The guard emits a permissionDecision: "deny" and also exits non-zero, so a blocked
command is stopped both ways.
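The decision the guard makes can be sketched as follows. This is a Python illustration of the logic described above, not the contents of hooks/guard.sh: the `decide` function and the regular expressions are assumptions, and a real hook would also need to read the command from its input and report the decision in the format the hook runner expects.

```python
import re

# ASSUMPTION: these patterns approximate the destructive commands named above.
# They deliberately over-match; a false "deny" is cheaper than a false "allow".
DESTRUCTIVE = [
    re.compile(r"\bkubectl\b.*\bdelete\b"),
    re.compile(r"\bkubectl\b.*--force\b"),
    re.compile(r"\brm\s+-(?=[a-zA-Z]*r)(?=[a-zA-Z]*f)[a-zA-Z]+"),  # rm -rf, rm -fr, rm -rvf
    re.compile(r"\bgit\s+push\b.*(--force\b|\s-f\b)"),
]


def decide(command: str, env: dict) -> str:
    """Return "deny" for a destructive command unless the token is set."""
    destructive = any(p.search(command) for p in DESTRUCTIVE)
    if destructive and env.get("GPU_ORPHAN_CONFIRM") != "1":
        return "deny"
    return "allow"


print(decide("kubectl get pods -A", {}))
print(decide("kubectl delete pod trainer-0 -n ml", {}))
print(decide("kubectl delete pod trainer-0 -n ml", {"GPU_ORPHAN_CONFIRM": "1"}))
```

Passing the environment in as a dict (for example `dict(os.environ)`) keeps the decision testable without touching the real session.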
Gotchas
- ALWAYS check owner refs AND last activity — idle alone is a false positive. An
idle GPU pod can be a paused-but-owned training run. Deleting on idleness alone is
how you destroy live work. Require both signals.
- ALWAYS enforce the soak period before deleting. The broadcast-then-wait window
exists so owners can object. Skipping it turns a cleanup into an incident. Do not
shorten it because something "looks dead."
- ALWAYS account for the PV reclaim policy. A PVC bound to a PV with
reclaimPolicy: Retain leaves the disk (and its cost) behind after the claim is
deleted, so "cleanup" then needs a separate PV/disk step. With reclaimPolicy: Delete,
the backing disk is removed automatically once the claim is deleted. The detection
script reports the policy; read it before you act.
- ALWAYS have the user confirm the exact list before any cascade. Never
cascade-delete from the model's own judgment. The confirmation token gates the
guard hook precisely so a human signs off on the specific resources first.
- ALWAYS verify kubectl config current-context. Running the cleanup against the
wrong cluster is unrecoverable. Confirm the context before setting the token.
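The reclaim-policy gotcha above can be checked mechanically before cleanup. The sketch below assumes the PV is available as a dict shaped like `kubectl get pv <name> -o json` output, where the policy lives at `spec.persistentVolumeReclaimPolicy`; the helper name `needs_manual_disk_cleanup` is hypothetical.

```python
def needs_manual_disk_cleanup(pv: dict) -> bool:
    """Return True when deleting the PVC will leave the backing disk behind."""
    # ASSUMPTION: a missing field is treated as "Retain", the conservative reading,
    # so an unknown policy is flagged for a human to check rather than ignored.
    policy = pv.get("spec", {}).get("persistentVolumeReclaimPolicy", "Retain")
    return policy != "Delete"


retained = {"spec": {"persistentVolumeReclaimPolicy": "Retain"}}
auto_deleted = {"spec": {"persistentVolumeReclaimPolicy": "Delete"}}

print(needs_manual_disk_cleanup(retained))      # disk stays: plan a separate PV/disk step
print(needs_manual_disk_cleanup(auto_deleted))  # disk goes away with the claim
```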
Files
- SKILL.md: this file; the soak-then-confirm flow and guardrails.
- scripts/find_orphans.py: lists candidate orphaned GPU pods/PVCs (no owner ref
AND idle past the soak window); read-only, emits a JSON report. Never deletes.
- hooks/guard.sh: on-demand PreToolUse hook; blocks destructive cleanup commands
unless GPU_ORPHAN_CONFIRM=1 is set. Session-scoped.