| name | cluster-ops |
| description | Operate the BGU CIS (formerly ISE-CS-DT) SLURM cluster for the PDDL copilot sweep — queue + pending-reason, submit/cancel, sync results, post-mortem completed jobs (right-size --mem from sacct/MaxRSS), prioritize pending cells via per-task Nice. Operations only; aggregation, plotting, tables, and drift detection live in the sibling `analyzer` skill. |
| argument-hint | ["status | preflight | sync | postmortem | prioritize"] |
User asked for: $ARGUMENTS — pick the matching recipe below.
Why this skill exists
Triggers (so the skill auto-matches): "cluster status", "what's running", "why is it pending", "submit sweep", "cancel jobs", "sync results", "check ollama", "postmortem", "memory headroom", "prioritize", "deprioritize", "nice value", "let cell X finish first".
Every session we re-derive the same SSH queue queries, .out-file grep patterns, rsync invocations, and sacct memory-headroom recipes. The cluster state is persistent but Claude's working set isn't. This skill pins the conventions in one place and exposes five short helper scripts. Read it before running SSH/rsync commands ad-hoc.
Read this skill alongside analyzer — cluster-ops gets results onto disk via sync.sh and tells you what's running via status.sh; analyzer (under .claude/skills/analyzer/) turns the synced results into Markdown tables, paper plots, the master pivot, and drift-vs-baseline checks. The two skills compose via the recipes below.
Cluster & repo conventions that matter here:
- Login node: `omereliy@slurm.bgu.ac.il` — SSH is pre-authed for the user.
- Remote repo root: `~/pddl-copilot-experiments` on the login node.
- Job submission — single backend: `cluster-experimenting/submit_with_rtx.sh <model> [<model>...]` is the only submit path. GPU sbatch, self-deploys Ollama via Apptainer on a single dedicated GPU (default `rtx_pro_6000:1`, 96 GB; `--gpu-type rtx_6000` is the opt-in 48 GB escape hatch). Loops MODELS × THINK_MODES × CONDITIONS in-process so weights stay resident. `--all` packs the 4 active models (Qwen3.5:0.8B, qwen3.6:27b, qwen3.6:35b, gemma4:31b — post 2026-04-30 roster trim; nemotron-3-nano:30b dropped after Hermes XML parse failures proved content-dependent) into one job under MAX_LOADED_MODELS=1. The `--no-tools` flag pins the run to the discriminative no-tools matrix (CONDITIONS=no-tools, THINK_MODES=off, TASKS=solve+validate_*, --time=4h × N).
- The cis-ollama path (submit_all.sh waves, submit_120b_cis.sh, run_condition.sbatch) was retired 2026-04-27. gpt-oss:120b is no longer in the active sweep; the large-model band is held by qwen3.6:35b (A3B MoE) as of the 2026-04-29 refresh (previously Qwen3.5:35b since 2026-04-27).
- Log file: `cluster-experimenting/logs/pddl_rtx_<model>-<jobid>.out`. Legacy formats from earlier sweeps: `pddl_<model>_<think>-<jobid>.out` (cis path, retired) and `pddl_<model>_<cond>-<jobid>.out` (pre-2026-04-21).
- Results dir: `results/slurm_<model>_<think>_<cond>_<jobid>/` (schema unchanged — distinguish runs by the job's meta.host in summary.json). See the globbing sketch after this list.
- Ollama server: `http://localhost:11434` inside the allocated compute node (Apptainer-served, no TLS, unique port per job set by the sbatch).
- Routing rules (from CLAUDE.md): MCP-tool bugs → `../pddl-copilot/plugins/<name>/server/`. Scoring/prompt/GT → here. This skill is read-only over experiment state.
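These naming conventions are what every recipe below globs against. As a minimal local sketch (assuming results have already been synced into a subdir under results/ via sync.sh, described later), counting trials per cell looks like this:

```bash
# Count trials per synced cell, relying only on the naming convention above:
# results/<synced-subdir>/slurm_<model>_<think>_<cond>_<jobid>/trials.jsonl
for d in results/*/slurm_*/; do
  f="${d}trials.jsonl"
  [ -f "$f" ] || continue
  printf '%6d  %s\n' "$(wc -l < "$f")" "${d#results/}"
done | sort -k2
```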
Safety
- Destructive ops require explicit user consent: `scancel -u omereliy` (kills all jobs), `rm` on logs or results. Confirm with the user before each.
- Never mutate `run_experiment.py`, `run_condition_rtx.sbatch`, or `submit_with_rtx.sh` from this skill.
- Preflight before submit: run `scripts/preflight.sh` first — it pulls both repos, refreshes the plugin venvs, and surfaces GPU pool capacity in one shot. Submitting with a stale venv or against a saturated pool wastes time.
Operations scripts (under scripts/)
All paths are relative to the repo root /Users/omereliyahu/personal/pddl-copilot-experiments. This skill retains five operations scripts: status.sh, sync.sh, preflight.sh, postmortem.sh, prioritize.sh. The five analysis scripts (aggregate.py, plot.py, plot_focused.py, table.py, drift_check.py) moved to .claude/skills/analyzer/scripts/ on 2026-05-01 — see the analyzer skill for those.
scripts/status.sh — cluster status snapshot
One SSH call (squeue + per-cell `wc -l trials.jsonl`; sketched at the end of this subsection). Local Python diffs against `~/.cache/cluster-ops-status.json` (overridable via the STATE_FILE env var) and renders the following sections, in this order:
- Header — `## Status — ~Xh since last check` (or "first run" when the cache file is absent).
- What changed — bullets for cells that flipped to ✓ this window and cells that newly started accumulating trials. Omitted if nothing changed.
- Per-cell progress matrix — 4 active models × 5 think×cond cells. Each cell shows N/D (P%) plus an icon: ✓ done · ▶ growing · ⏸ has trials but no growth · PD↻ pending rerun (count > 0) · PD pending fresh · _-_ empty/no match.
- Δ since last status — only cells whose count grew this window. Columns: Cell | Prev → Now | Δ | pace (s/trial) | ETA. Pace is window-averaged, so a cell that started mid-window will appear slower than reality.
- Roll-up — Done X/20 · Trial coverage % · Running N cells (job IDs) · Watch list (cells where elapsed + ETA > 0.9 × 72h --time).
- Queue (compact) — Running job IDs, Pending grouped by REASON. See the REASON cheat-sheet below.
The cache file is local-only and pure scratch — rm ~/.cache/cluster-ops-status.json to reset (next run will be a "first run" with no Δ).
Pending array tasks whose per-cell name hasn't materialised yet (still showing the parent template like pddl_rtx_qwen3_6_27b) are matched to all 5 cells of that model — so PD icons appear before the array fans out.
Output mode auto-selects based on whether stdout is a TTY: ANSI-coloured aligned text in a real terminal, GitHub-flavoured markdown when piped or run via the Bash tool. Override with flags:
bash .claude/skills/cluster-ops/scripts/status.sh
bash .claude/skills/cluster-ops/scripts/status.sh --md
bash .claude/skills/cluster-ops/scripts/status.sh --terminal
bash .claude/skills/cluster-ops/scripts/status.sh --no-color
The two modes share data computation; they differ only in rendering, so the metrics, Δ window, and watch-list logic are identical.
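For orientation, here is a hedged sketch of the single SSH round-trip described above. The exact query and output format that status.sh parses are not pinned down in this skill; this only shows the shape of the call.

```bash
# Illustrative only: queue view plus per-cell trial counts in one SSH call.
ssh omereliy@slurm.bgu.ac.il '
  squeue --me -o "%i %j %T %r %M"
  echo "---"
  cd ~/pddl-copilot-experiments
  for f in results/slurm_*/trials.jsonl; do
    [ -f "$f" ] && printf "%s %s\n" "$(dirname "$f")" "$(wc -l < "$f")"
  done
'
```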
scripts/sync.sh — pull results locally
rsync -av --update from the cluster results/slurm_* into a local subdir under results/.
bash .claude/skills/cluster-ops/scripts/sync.sh
bash .claude/skills/cluster-ops/scripts/sync.sh results/my-custom-run
Never deletes anything. To clear cancelled-job .out files on the remote side, tell the user explicitly what IDs you intend to delete and wait for confirmation before ssh … rm.
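A minimal sketch of the pull sync.sh performs, assuming only the flags and paths stated above (the dated destination subdir name is illustrative):

```bash
# Pull every slurm_* run dir into a dated local subdir; --update never clobbers
# newer local files and nothing is ever deleted.
DEST="results/cluster-$(date +%F)"
mkdir -p "$DEST"
rsync -av --update \
  "omereliy@slurm.bgu.ac.il:pddl-copilot-experiments/results/slurm_*" \
  "$DEST/"
```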
scripts/preflight.sh — pre-submit cluster refresh + capacity
Run this before every submit_with_rtx.sh. Does, in one SSH call:
1. `git pull` both repos (this one + ../pddl-copilot).
2. `pip install --upgrade -r requirements.txt` in each plugin's .venv — setup_env.sh deliberately skips existing venvs, so a pinned dependency bump in ../pddl-copilot/plugins/<plugin>/requirements.txt stays silently stale until something explicitly upgrades it.
3. GPU pool capacity — `sinfo -p rtx6000 -t idle,mix` and the same for rtx_pro_6000 (sketched below). The free-node count tells you whether submit_with_rtx.sh will queue immediately or sit in PENDING(Resources). If rtx_pro_6000 is 0/6 and you can't wait, --gpu-type rtx_6000 is the opt-in escape hatch.
4. `sres` snapshot (Mar-26 guide §"Resources Usage") — one-glance cluster utilization view. sres's "6000" column conflates rtx_6000 and rtx_pro_6000, so trust step 3 for routing decisions.
bash .claude/skills/cluster-ops/scripts/preflight.sh
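A hedged sketch of the capacity check in step 3, useful if you want to run it by hand without the full preflight (partition names as listed above):

```bash
# Idle/mixed node counts per GPU pool.
ssh omereliy@slurm.bgu.ac.il '
  for p in rtx6000 rtx_pro_6000; do
    echo "== $p =="
    sinfo -p "$p" -t idle,mix -o "%D %T %N"
  done
'
```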
scripts/prioritize.sh — bias which pending cells run next
Per-array-task `scontrol update Nice=N` driven by the manifest written at submit time (cluster-experimenting/logs/<jobid>.cells.tsv; columns idx, model, think, cond). Listed models keep Nice=0; every other pending cell gets Nice=500 so the listed cells grab the next free GPU slot. Already-running tasks are skipped (Nice has no effect once dispatched).
Direction is one-way: negative Nice (raise priority above default) is admin-only on this cluster — verified by probe 2026-05-08 (nice=100 accepted, nice=-1000 denied). The only lever is deprioritizing the rest.
submit_with_rtx.sh already auto-applies this on a fresh --all submission (deprioritizes everything outside PDDL_SLOW_MODELS={gemma4:31b, qwen3.6:35b}). The skill script is the manual lever for: (a) --continue-partial / single-model resubmits where the auto-gate intentionally doesn't fire, and (b) mid-sweep when you want one specific high-progress cell to finish next so partial results are ready for analyst handoff.
bash .claude/skills/cluster-ops/scripts/prioritize.sh <jobid>
bash .claude/skills/cluster-ops/scripts/prioritize.sh <jobid> gemma4:31b
bash .claude/skills/cluster-ops/scripts/prioritize.sh <jobid> --reset
bash .claude/skills/cluster-ops/scripts/prioritize.sh <jobid> --dry-run gemma4:31b
Idempotent — safe to re-run with a different keep-list. If the manifest is missing (job submitted before the prioritize feature landed), the script exits 2 and tells you so; in that case fall back to manual scontrol update JobId=<master>_<idx> Nice=500 per cluster-experimenting/README.md:280-287.
If you want progress-aware ordering (rank pending cells by current trials.jsonl count and apply a Nice ladder so the closest-to-done cell wins ties), do it manually for now: pull progress with status.sh, then scontrol update JobId=<master>_<idx> Nice=N per task with N rising as progress falls (e.g. 0 / 100 / 200 / … / 700 — Nice values up to 700 are accepted unprivileged). Codifying it into the script is on the table when there's a second concrete need.
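A manual sketch of that Nice ladder, assuming you have already paired pending array indices with their current trial counts; the ranked.tsv file and master job id here are hypothetical and not produced by any script in this skill.

```bash
# ranked.tsv: one "<array_idx>\t<trial_count>" line per PENDING task.
MASTER=17130166                       # hypothetical master job id
sort -t$'\t' -k2,2nr ranked.tsv | cut -f1 | nl -v0 -w1 | \
while read -r rank idx; do
  nice=$(( rank * 100 ))              # closest-to-done cell gets Nice=0
  [ "$nice" -gt 700 ] && nice=700     # values above 700 may need privileges
  ssh -n omereliy@slurm.bgu.ac.il \
    "scontrol update JobId=${MASTER}_${idx} Nice=${nice}"
done
```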
scripts/postmortem.sh — completed-job introspection (sacct)
Closes the loop on the Mar-26 guide's "use minimum possible RAM" rule (§Allocating Resources). Pulls sacct for completed pddl_* jobs, merges parent + .batch step rows so MaxRSS lands in the same row as State/Elapsed/ExitCode, then computes a memory-headroom recommendation across the window.
Use it after a sweep finishes to: spot OOMs (Comment = OOM-Kill), find jobs that approached --time (Elapsed close to 3-00:00:00), and right-size --mem for the next sweep without manual sacct per job.
bash .claude/skills/cluster-ops/scripts/postmortem.sh
bash .claude/skills/cluster-ops/scripts/postmortem.sh --since 2026-04-22
bash .claude/skills/cluster-ops/scripts/postmortem.sh --jobs 17130166,17130167
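For reference, a hedged sketch of the underlying sacct pull. postmortem.sh is described as additionally merging each parent row with its .batch step row (MaxRSS is only recorded on steps) and computing the headroom recommendation; that merge is not shown here.

```bash
# Raw per-job and per-step accounting rows for the window.
ssh omereliy@slurm.bgu.ac.il \
  'sacct --user omereliy --starttime 2026-04-22 --parsable2 \
         --format=JobID,JobName%30,State,Elapsed,ExitCode,ReqMem,MaxRSS'
```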
Recipes
"What's the cluster status?"
bash .claude/skills/cluster-ops/scripts/status.sh — progress matrix, Δ window, and queue snapshot for everything running or pending.
- If any job has been stuck at the same progress for >30 min → tail the .out file to see the last line and surface it to the user:
ssh omereliy@slurm.bgu.ac.il 'tail -50 pddl-copilot-experiments/cluster-experimenting/logs/*-<jobid>.out'
"Sync and plot the results"
This recipe spans both skills — sync + sacct here, aggregate + plot + table in analyzer.
bash .claude/skills/cluster-ops/scripts/sync.sh — rsync into results/cluster-<today>/.
- Hand off to the analyzer skill's "Sync, aggregate, plot, table" recipe for steps 3–5 (aggregate.py, plot.py, table.py against the synced dir).
bash .claude/skills/cluster-ops/scripts/postmortem.sh — sacct table + memory-headroom recommendation. Surface any OOM rows or jobs that approached --time to the user.
- Report to user with the plot paths (from analyzer) and 3–5 key numbers.
"Is the in-flight sweep drifting?"
After a sweep is submitted, periodically verify it's not regressing vs a baseline before letting it eat more GPU-hours. The drift logic lives in analyzer/scripts/drift_check.py; the status + sync glue is here.
bash .claude/skills/cluster-ops/scripts/status.sh — confirm jobs are running (not stuck, not OOM-killed).
bash .claude/skills/cluster-ops/scripts/sync.sh results/cluster-<today> — pull whatever cells have started writing. Cells without a summary_*.json yet still ship trials.jsonl, which drift_check.py aggregates as a fallback.
- Hand off to the analyzer skill's "Is the in-flight sweep consistent with the last one?" recipe — it picks a baseline and runs drift_check.py.
"Submit the sweep"
The path was validated 2026-04-25: bulk jobs queued in 8 seconds, and the full 4-model sweep packed into ONE rtx_pro_6000 job under MAX_LOADED_MODELS=1.
bash .claude/skills/cluster-ops/scripts/preflight.sh — pulls both repos, refreshes plugin venvs, surfaces GPU pool capacity for rtx6000 and rtx_pro_6000. Halts on any failure.
- Dry-run, then submit. --all packs the 4 active models into one job; per-model invocations submit independent jobs that queue in parallel:
ssh omereliy@slurm.bgu.ac.il "cd ~/pddl-copilot-experiments && \
bash cluster-experimenting/submit_with_rtx.sh --all --dry-run"
- If approved, run the same command without --dry-run.
--no-tools shorthand: for the baseline-only run, bash submit_with_rtx.sh --all --no-tools pins CONDITIONS=no-tools, THINK_MODES=off, TASKS="solve validate_domain validate_problem validate_plan", and scales --time linearly with model count (4h base + 4h per extra model).
GPU class: default rtx_pro_6000:1 (96 GB, --mem=80G). The sweep is hard-pinned to this class for consistency; --gpu-type rtx_6000 is the opt-in 48 GB escape hatch (use only if rtx_pro_6000 is saturated). Think modes auto-select to on off (both run sequentially in one job so weights stay resident); override with --think-modes "default" for a model that lacks the think kwarg.
VRAM safety: the sbatch pins OLLAMA_NUM_PARALLEL=4, MAX_LOADED_MODELS=1, CONTEXT_LENGTH=16384 (raised from 8192 on 2026-04-29). After warmup, a runtime guard aborts the offending model if VRAM usage > 85% (loop continues with the next model). Never raise NUM_PARALLEL without re-measuring KV-cache allocation.
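A hedged sketch of what such a guard can look like; the real check lives in run_condition_rtx.sbatch, and the nvidia-smi query and 85% threshold here are illustrative.

```bash
# Fraction of the allocated GPU's memory in use after warmup.
used_pct=$(nvidia-smi --query-gpu=memory.used,memory.total \
                      --format=csv,noheader,nounits |
           awk -F', ' 'NR==1 {printf "%d", $1 * 100 / $2}')
if [ "$used_pct" -gt 85 ]; then
  echo "VRAM at ${used_pct}% after warmup, skipping this model" >&2
  # the sbatch loop would move on to the next model here
fi
```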
"Prioritize a cell during a contended sweep"
The auto-prioritize logic in submit_with_rtx.sh covers the common case (fresh --all → heavy models grab slots first). For everything else:
bash .claude/skills/cluster-ops/scripts/status.sh — identify which pending cell you want to win the next free slot. Note its model.
- Decide the keep-list. Examples:
- "let qwen3.6:35b finish for the Friday meeting" →
prioritize.sh <jid> qwen3.6:35b
- "this resubmit, just keep the slow set up front" →
prioritize.sh <jid> (default = PDDL_SLOW_MODELS)
- "I want everything back to default after the contention clears" →
prioritize.sh <jid> --reset
- Run
--dry-run first if you want to inspect the plan; then re-run without --dry-run to apply.
Caveats:
- Nice ordering only matters while tasks are PENDING. RUNNING tasks are past the scheduling decision; the script skips them and reports the count.
- If pending tasks are stuck on a node-specific reservation (REASON=ReqNodeNotAvail/Reservation), Nice ordering changes who-goes-first but doesn't dislodge the reservation. Check status.sh's queue table and the Pending REASON cheat-sheet below.
"Cancel jobs"
Prefer explicit job IDs. When the user wants the whole sweep gone, the squeue → awk → scancel pipe is the safer middle ground:
ssh omereliy@slurm.bgu.ac.il 'scancel <id> <id> …'
ssh omereliy@slurm.bgu.ac.il "squeue --me -h -o '%i %j' | awk '\$2 ~ /^pddl_/ {print \$1}' \
| xargs --no-run-if-empty scancel"
Do NOT use scancel --name=pddl_* — verified 2026-04-25 on SLURM 25.11.4: --name is exact-string match (comma-separated list of literal names), not a glob/regex, so pddl_* silently matches zero jobs and the cancel is a no-op with no error. Use the squeue→awk→xargs pipe above to filter by name prefix.
scancel -u omereliy (nuke all, no name filter) needs an explicit user request — it will terminate jobs that have been running for hours and may not be sweep-related. Confirm first.
Pending REASON cheat sheet (Mar-26 guide §FAQ)
When status.sh's Pending table shows a non-trivial REASON, here's what to do:
| REASON | What it means | Action |
|---|---|---|
| Resources | The requested partition pool is full. | Wait, or fall back to --gpu-type rtx_6000 if rtx_pro_6000 is saturated and the model set fits 48 GB. |
| Priority | Preempted by a Golden-Ticket QoS job (Mar-26 guide §"High Priority Jobs"). | Wait — usually clears in minutes. |
| QOSMaxJobsPerUserLimit | Per-user concurrent-job cap reached. | Wait for one of your other jobs to finish, or scancel a low-priority one. |
| MaxGRESPerAccount | Per-account GPU cap (relevant for high-priority QoS). | Wait. Not applicable on plain --partition main. |
| PartitionTimeLimit | --time exceeds the partition's max (main ≤ 7 days). | Edit the #SBATCH --time line in the sbatch and resubmit. |
"Debug a FAIL (exception) cluster"
These are real MCP/chat failures, most often FD-stdout pollution on tool use (ISS-016, fixed 2026-04-21 in pddl-copilot as bb23ad0).
The stderr lines added in commit cea5ae0 (run_experiment.py:951–971) print the exception type + message live in the .out. For older jobs, the message only exists in single_task_*.json.
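A hedged sketch for pulling the failure text out; it assumes nothing about the JSON schema beyond the exception message appearing somewhere in the file, and the job id is a placeholder.

```bash
JOBID=17130166   # placeholder
# Post-cea5ae0 jobs: the exception type + message is in the live .out.
ssh omereliy@slurm.bgu.ac.il \
  "grep -nEi 'exception|traceback|error' \
     pddl-copilot-experiments/cluster-experimenting/logs/*-${JOBID}.out | tail -40"
# Older jobs: search the synced per-task JSONs instead.
grep -lEi 'exception|traceback' results/*/slurm_*/single_task_*.json 2>/dev/null | head
```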
Things this skill does NOT do
- Edit experiment code, plugin code, or sbatch scripts (those have their own routing rules).
- Launch an agent; everything here is direct tool calls.
- Resolve an ISS-###; it just references them in diagnostics.