gpu-node-orphans

Stars: 0

Forks: 0

Updated: June 24, 2026, 03:26

Use when GPU pods or PVCs linger with no owner reference and accelerator spend keeps climbing, when a finished training run left volumes behind, or when someone asks to "reclaim idle GPUs" or "clean up orphaned GPU nodes." Finds true orphans (no owner ref AND no recent activity), posts them to Slack for a soak period, and only after explicit human confirmation performs a guarded, cascading cleanup. Pushes back on deleting anything merely idle.

Source: az9713/skill-best-practices (author: az9713)


name

gpu-node-orphans

description

GPU Node Orphan Reclamation

Overview

GPU capacity is the most expensive resource in the fleet, and the easiest to leak. A training job crashes, its pod loses its owning Job, the PVC keeps its disk, and the accelerator sits reserved but unused — quietly billing. This skill finds those genuinely orphaned GPU pods and PersistentVolumeClaims, broadcasts the list for human review, waits out a soak period, and then performs a guarded cascading cleanup that cannot fire without an explicit confirmation token.

The hard rule: idle is not orphaned. A resource qualifies only when it has no owner reference AND has shown no activity within the soak window. Both checks, every time.
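
The two-signal rule can be written as a single predicate. This is a minimal sketch, not the logic of scripts/find_orphans.py: the function names are hypothetical, and how "last activity" is measured is an assumption left to the caller (the source does not say which signal the script uses).

```python
from datetime import datetime, timedelta, timezone


def has_owner(resource: dict) -> bool:
    """True when metadata.ownerReferences is present and non-empty."""
    return bool(resource.get("metadata", {}).get("ownerReferences"))


def is_orphan(resource: dict, last_activity: datetime, now: datetime,
              soak_hours: int = 48) -> bool:
    """Both checks, every time: no owner reference AND idle past the soak window.

    `last_activity` is supplied by the caller (assumption: some timestamp of
    the most recent observed work, however the cluster measures it).
    """
    idle_for = now - last_activity
    return (not has_owner(resource)) and idle_for >= timedelta(hours=soak_hours)


now = datetime(2026, 6, 24, tzinfo=timezone.utc)
owned_idle = {"metadata": {"name": "trainer-0",
                           "ownerReferences": [{"kind": "Job", "name": "trainer"}]}}
unowned_idle = {"metadata": {"name": "trainer-1"}}

print(is_orphan(unowned_idle, now - timedelta(hours=72), now))  # True: unowned and idle
print(is_orphan(owned_idle, now - timedelta(hours=72), now))    # False: idle but owned
print(is_orphan(unowned_idle, now - timedelta(hours=2), now))   # False: unowned but recently active
```

The owned-but-idle case returning False is the point of the rule: idleness alone never qualifies a resource.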

When to Use

Use this skill when you see symptoms like:

- GPU utilization dashboards show reserved-but-0%-busy accelerators for hours/days.
- kubectl get pods -o json shows GPU pods with no metadata.ownerReferences and no recent restarts (the default table output does not include owner references).
- Accelerator spend is climbing with no matching increase in active training runs.
- Old PVCs remain after their jobs completed, holding disks (and cost).
- Someone asks to "reclaim idle GPUs," "find orphaned GPU nodes," or "clean up after that failed run."
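
The pod symptom in this list can be triaged from a saved `kubectl get pods -A -o json` dump. The sketch below is illustrative only: the helper names are hypothetical, it assumes GPUs are requested through the `nvidia.com/gpu` resource key (other vendors use other keys), and it looks at regular containers only. A hit is a symptom worth investigating, not proof of an orphan.

```python
def requests_gpu(pod: dict, resource_key: str = "nvidia.com/gpu") -> bool:
    """True when any container in the pod has a non-zero GPU limit."""
    for container in pod.get("spec", {}).get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        # Quantities arrive as strings in the API JSON, e.g. "1".
        if int(limits.get(resource_key, 0)) > 0:
            return True
    return False


def unowned_gpu_pods(pod_list: dict) -> list[str]:
    """Return namespace/name for GPU pods that carry no ownerReferences."""
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        if requests_gpu(pod) and not meta.get("ownerReferences"):
            hits.append(f"{meta.get('namespace')}/{meta.get('name')}")
    return hits


sample = {"items": [
    {"metadata": {"name": "train-a", "namespace": "ml"},
     "spec": {"containers": [{"resources": {"limits": {"nvidia.com/gpu": "1"}}}]}},
    {"metadata": {"name": "train-b", "namespace": "ml",
                  "ownerReferences": [{"kind": "Job", "name": "train-b"}]},
     "spec": {"containers": [{"resources": {"limits": {"nvidia.com/gpu": "4"}}}]}},
    {"metadata": {"name": "web", "namespace": "default"},
     "spec": {"containers": [{"resources": {}}]}},
]}

print(unowned_gpu_pods(sample))  # ['ml/train-a']
```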

Do NOT use this skill when:

- A long-running job is merely idle between steps; it may be owned and wanted.
- You cannot confirm a soak period has elapsed (do not shortcut the wait).
- The user has not yet seen and approved the candidate list (never cascade blind).
- The work is general cost analysis, not GPU resource cleanup; use cost-investigation instead.

The soak-then-confirm flow

This skill is deliberately slow. Speed is how you delete someone's three-day run.

1. Detect. Run scripts/find_orphans.py to list candidate orphaned GPU pods and PVCs. The script flags a resource only when it has no owner reference AND has been idle past the soak window (default 48h). It writes a JSON report and deletes nothing.
   ```
   python scripts/find_orphans.py --all-namespaces --idle-hours 48 --json /tmp/orphans.json
   ```
2. Broadcast. Post the candidate list to the team's Slack channel (e.g. #ml-infra) for visibility. Anyone who recognizes a resource as still-wanted can veto it. Keep the report as the record of what was proposed.
3. Soak. Wait out the soak period (default 48h) after broadcasting. The point is to give owners time to object and to let a "quiet" job prove it really is dead. Do not collapse this wait because a resource "looks obviously dead."
4. Confirm. The operator reviews the (possibly trimmed) list and explicitly approves the exact resources to delete. This is a human decision, not a model one.
5. Cleanup, guarded. Only after confirmation, perform the cascading delete. The guard hook (below) blocks every destructive command until the operator sets the confirmation token for this session:
   ```
   kubectl config current-context   # confirm the target cluster first
   export GPU_ORPHAN_CONFIRM=1      # operator-set token; the guard denies deletes without it
   kubectl delete pod <name> -n <ns>
   kubectl delete pvc <name> -n <ns>
   unset GPU_ORPHAN_CONFIRM         # drop the token once cleanup is done
   ```
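
The gating in the steps above can be sketched as a planner that produces delete commands only when the soak has elapsed and only for resources the operator approved. This is an illustration under stated assumptions, not part of the skill: the function names and the report fields `kind`, `namespace`, and `name` are hypothetical, since the source does not document the JSON report schema. The sketch builds command strings and runs nothing.

```python
from datetime import datetime, timedelta, timezone


def soak_elapsed(broadcast_at: datetime, now: datetime, soak_hours: int = 48) -> bool:
    """True once the full soak period has passed since the Slack broadcast."""
    return now - broadcast_at >= timedelta(hours=soak_hours)


def plan_deletes(candidates: list[dict], approved: set[tuple[str, str, str]],
                 broadcast_at: datetime, now: datetime,
                 soak_hours: int = 48) -> list[str]:
    """Return kubectl delete commands for approved candidates, or nothing at all.

    A candidate is included only if the operator approved that exact
    (kind, namespace, name) triple; anything vetoed or unlisted is skipped.
    """
    if not soak_elapsed(broadcast_at, now, soak_hours):
        return []
    commands = []
    for c in candidates:
        key = (c["kind"], c["namespace"], c["name"])
        if key in approved:
            commands.append(f"kubectl delete {c['kind']} {c['name']} -n {c['namespace']}")
    return commands


candidates = [
    {"kind": "pod", "namespace": "ml", "name": "train-a"},
    {"kind": "pvc", "namespace": "ml", "name": "train-a-data"},
    {"kind": "pod", "namespace": "ml", "name": "train-c"},  # vetoed in Slack
]
approved = {("pod", "ml", "train-a"), ("pvc", "ml", "train-a-data")}
broadcast_at = datetime(2026, 6, 22, 9, 0, tzinfo=timezone.utc)

too_early = plan_deletes(candidates, approved, broadcast_at,
                         broadcast_at + timedelta(hours=24))
after_soak = plan_deletes(candidates, approved, broadcast_at,
                          broadcast_at + timedelta(hours=48))
print(too_early)   # []
print(after_soak)  # two commands: train-a pod and train-a-data pvc
```

Returning an empty plan before the soak ends, rather than a partial one, keeps the "looks obviously dead" shortcut unavailable by construction.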

On-demand guard hook

While this skill is active, hooks/guard.sh is registered as a PreToolUse hook for this session only — an on-demand guardrail, not a global one. It is removed when the session ends.

The hook reads each Bash command and denies anything destructive (kubectl delete, kubectl ... --force, rm -rf, force-push) unless GPU_ORPHAN_CONFIRM=1 is set in the environment. That token is the same human gate from step 4: the model cannot reach it on its own, so it cannot cascade-delete on a whim. Set the token only after the soak has elapsed and the list is approved; unset it (or end the session) the moment cleanup is done.

The guard emits a permissionDecision: "deny" and also exits non-zero, so a blocked command is stopped both ways.
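
The decision the guard makes can be sketched in Python for illustration. hooks/guard.sh itself is a shell script whose contents are not shown here, so everything below is an assumption: the regular expressions, the function name, the `reason` field, and the choice of exit code 2 as the non-zero status. The patterns are deliberately broad, so a harmless command that merely mentions "delete" after `kubectl` would also be blocked.

```python
import json
import re
from typing import Optional

# Hypothetical patterns for the command families the guard is described as blocking.
DESTRUCTIVE = [
    re.compile(r"\bkubectl\b.*\bdelete\b"),
    re.compile(r"\bkubectl\b.*--force\b"),
    re.compile(r"\brm\s+-(?=\w*r)(?=\w*f)\w+"),          # rm -rf, rm -fr
    re.compile(r"\bgit\s+push\b.*(--force|\s-f\b)"),     # force-push
]


def guard(command: str, env: dict) -> tuple[Optional[dict], int]:
    """Return (decision, exit_code) for one Bash command.

    A destructive command without GPU_ORPHAN_CONFIRM=1 gets a deny decision
    and a non-zero exit code; everything else passes with no decision.
    """
    confirmed = env.get("GPU_ORPHAN_CONFIRM") == "1"
    destructive = any(p.search(command) for p in DESTRUCTIVE)
    if destructive and not confirmed:
        decision = {
            "permissionDecision": "deny",
            "reason": "destructive command blocked: GPU_ORPHAN_CONFIRM is not set",
        }
        return decision, 2
    return None, 0


decision, code = guard("kubectl delete pod train-a -n ml", {})
print(json.dumps(decision), code)
print(guard("kubectl delete pod train-a -n ml", {"GPU_ORPHAN_CONFIRM": "1"}))  # (None, 0)
print(guard("kubectl get pods -A", {}))                                        # (None, 0)
```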

Gotchas

- ALWAYS check owner refs AND last activity; idle alone is a false positive. An idle GPU pod can be a paused-but-owned training run. Deleting on idleness alone is how you destroy live work. Require both signals.
- ALWAYS enforce the soak period before deleting. The broadcast-then-wait window exists so owners can object. Skipping it turns a cleanup into an incident. Do not shorten it because something "looks dead."
- ALWAYS account for the PV reclaim policy. A PVC bound to a PV whose persistentVolumeReclaimPolicy is Retain leaves the disk (and its cost) behind after the claim is deleted, so "cleanup" then needs a separate PV/disk step. With a policy of Delete, the backing volume is removed automatically once the claim is gone. The detection script reports the policy; read it before you act.
- ALWAYS have the user confirm the exact list before any cascade. Never cascade-delete from the model's own judgment. The confirmation token gates the guard hook precisely so a human signs off on the specific resources first.
- ALWAYS verify kubectl config current-context. Running the cleanup against the wrong cluster is unrecoverable. Confirm the context before setting the token.
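
The reclaim-policy gotcha can be checked from a PersistentVolume object fetched with `kubectl get pv <name> -o json`. A minimal sketch with a hypothetical helper name: it reads `spec.persistentVolumeReclaimPolicy` and, as a cautious assumption, treats a missing or unrecognized value the same as Retain so that a follow-up step is flagged rather than silently skipped.

```python
def disk_survives_pvc_delete(pv: dict) -> bool:
    """True when deleting the bound PVC will leave the backing disk behind.

    Only a reclaim policy of "Delete" removes the volume automatically;
    anything else (including a missing field) is treated as needing a
    separate PV/disk cleanup step.
    """
    policy = pv.get("spec", {}).get("persistentVolumeReclaimPolicy", "Retain")
    return policy != "Delete"


retained = {"metadata": {"name": "pv-train-a"},
            "spec": {"persistentVolumeReclaimPolicy": "Retain"}}
auto_deleted = {"metadata": {"name": "pv-train-b"},
                "spec": {"persistentVolumeReclaimPolicy": "Delete"}}

print(disk_survives_pvc_delete(retained))      # True: plan a separate PV/disk step
print(disk_survives_pvc_delete(auto_deleted))  # False: volume goes with the claim
```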

Files

- SKILL.md: this file; the soak-then-confirm flow and guardrails.
- scripts/find_orphans.py: lists candidate orphaned GPU pods/PVCs (no owner ref AND idle past the soak window); read-only, emits a JSON report. Never deletes.
- hooks/guard.sh: on-demand PreToolUse hook; blocks destructive cleanup commands unless GPU_ORPHAN_CONFIRM=1 is set. Session-scoped.

