rl-isaaclab-cluster-ops

Stars: 1

Forks: 0

Updated: May 24, 2026 at 09:01

Submit, monitor, sync, and ledger Moleworks IsaacLab RL runs in `moleworks_ext`. Use when preparing a smoke test, launching Euler jobs, debugging failed Slurm runs, syncing experiments, comparing run configs, or updating `docs/EXPERIMENTS_ONGOING.md` and `docs/EXPERIMENTS_RUN.md`.

Source

Idate96

Idate96/codex_skills

SKILL.md

name	rl-isaaclab-cluster-ops
description	Submit, monitor, sync, and ledger Moleworks IsaacLab RL runs in `moleworks_ext`. Use when preparing a smoke test, launching Euler jobs, debugging failed Slurm runs, syncing experiments, comparing run configs, or updating `docs/EXPERIMENTS_ONGOING.md` and `docs/EXPERIMENTS_RUN.md`.

IsaacLab Cluster Ops

Use this skill for cluster-side RL operations in moleworks_ext.

Source Of Truth

Read the repo workflow/docs before improvising:

docs/AI_RESEARCHER_WORKFLOW.md
docs/EXPERIMENTS_ONGOING.md
docs/EXPERIMENTS_RUN.md
docs/MULTI_GPU_TRAINING.md
docker/cluster/cluster_interface.sh
docker/cluster/sync_experiments.sh
scripts/utils/cluster_run_report.sh
scripts/utils/compare_run_configs.py

Hard Rules

Local smoke first, with W&B disabled.
Real training runs use W&B.
Real Euler training runs should usually request JOB_TIME=24h. Use 30m/4h only for smoke gates, startup validation, queue probes, or intentionally bounded debugging.
If a 24h run has not saturated by its last evaluated checkpoint, continue from the best/latest checkpoint with another long run instead of treating the partial curve as converged. On Euler, chained 24h continuations are preferred over one fragile oversized job.
Do not talk about "pulling a policy" unless the job actually trained one.
Keep experiment docs current in the same work session as submit/sync/closeout.
Prefer targeted live diagnostics and targeted sync before broad full-log sync.

Local Smoke Gate

Prefer preparing or running a tiny bounded smoke command before cluster launch:

export WANDB_MODE=disabled
/workspace/isaaclab/isaaclab.sh -p scripts/rsl_rl/train.py \
  --task <TASK> \
  --num_envs 4 \
  --max_iterations 3 \
  --headless
unset WANDB_MODE
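
The export/unset pair above clears WANDB_MODE even if it was already set to something else before the smoke run. A minimal sketch of an alternative, assuming a POSIX shell: scope the variable to a subshell so the caller's environment is left untouched. The helper name with_wandb_disabled is hypothetical and not part of moleworks_ext.

```shell
# Run any command with W&B disabled, without touching the caller's environment.
# with_wandb_disabled is a hypothetical helper name, not a repo script.
with_wandb_disabled() {
  # The subshell scopes the export, so a pre-existing WANDB_MODE survives.
  ( export WANDB_MODE=disabled; "$@" )
}

# Usage (same flags as the smoke command above):
# with_wandb_disabled /workspace/isaaclab/isaaclab.sh -p scripts/rsl_rl/train.py \
#   --task <TASK> --num_envs 4 --max_iterations 3 --headless
```

The helper returns the exit status of the wrapped command, so it can still gate a cluster launch.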

Submit

Default cluster entrypoint:

JOB_TIME=24h NUM_GPUS=2 GPU_TYPE=rtx_3090 \
./docker/cluster/cluster_interface.sh job \
  --task <TASK> \
  --num_envs 64000 \
  --max_iterations 10000

Notes:

In multi-GPU mode, --num_envs is the total across GPUs.
With resume, --max_iterations is additional learning iterations beyond the loaded checkpoint.
If using a git worktree, launch from that worktree root and prove the staged code copy matches the worktree.
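
The first two notes are simple arithmetic, sketched below. The helper names are hypothetical, and the even split of environments across GPUs is an assumption that docs/MULTI_GPU_TRAINING.md should confirm. The resume total is approximate, since the exact checkpoint numbering depends on the trainer.

```shell
# Hypothetical helpers illustrating the notes above; not repo scripts.

# Environments per GPU when --num_envs is the total across GPUs.
# Assumes an even split; refuses totals that do not divide cleanly.
per_gpu_envs() {
  if [ $(( $1 % $2 )) -ne 0 ]; then
    echo "num_envs=$1 is not divisible by NUM_GPUS=$2" >&2
    return 1
  fi
  echo $(( $1 / $2 ))
}

# Approximate final iteration after a resume:
# loaded checkpoint iteration + additional --max_iterations.
final_iteration() {
  echo $(( $1 + $2 ))
}

# Usage:
# per_gpu_envs 64000 2        # default submit command: 32000 per GPU
# final_iteration 10000 10000 # resume from iteration 10000 for 10000 more
```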

Monitor And Debug

Use narrow checks first:

ssh euler 'squeue -u $USER'
ssh euler 'sacct -j <jobid> --format=JobID,State,Elapsed,Timelimit,Partition%12,ExitCode -P'
ssh euler 'tail -100 /cluster/scratch/<user>/moleworks_ext_<timestamp>/slurm-<jobid>.out'
ssh euler 'tail -50 /cluster/scratch/<user>/moleworks_ext_<timestamp>/slurm-<jobid>.err'
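
Because the sacct command above uses -P, its output is pipe-delimited with a header row, which makes it easy to filter. A minimal sketch, assuming the column order from that command (JobID first, State second, ExitCode last); unhealthy_jobs is a hypothetical helper name.

```shell
# Filter pipe-delimited `sacct -P` output (header row first) down to rows
# whose State is not exactly COMPLETED, RUNNING, or PENDING.
# Assumes JobID is column 1, State is column 2, and ExitCode is the last column.
unhealthy_jobs() {
  awk -F'|' 'NR > 1 && $2 !~ /^(COMPLETED|RUNNING|PENDING)$/ { print $1 " " $2 " " $NF }'
}

# Usage:
# ssh euler 'sacct -j <jobid> --format=JobID,State,Elapsed,Timelimit,Partition%12,ExitCode -P' \
#   | unhealthy_jobs
```

Job steps such as the batch step appear as separate rows, so a failed job may print more than one line.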

Preferred live diagnostic helper:

scripts/utils/cluster_run_report.sh
scripts/utils/cluster_run_report.sh --job-ids <jobA> <jobB> --sync-params

Use compare_run_configs.py before manual YAML archaeology:

python3 scripts/utils/compare_run_configs.py --run-a <runA> --run-b <runB>

Sync

Full sync when needed:

./docker/cluster/sync_experiments.sh
./docker/cluster/sync_experiments.sh --remove

Targeted-first workflow:

cluster_run_report.sh --job-ids ... --sync-params
compare configs if needed
full sync only when you need checkpoints or broader artifacts

Experiment Docs

Treat these as mandatory:

docs/EXPERIMENTS_ONGOING.md for live runs
docs/EXPERIMENTS_RUN.md for benchmarked/completed/stopped runs

For every new real run, record:

name
run_name
wandb_run
wandb_url
environment
date in UTC
short intention
job id and key artifact path when known
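
One way to avoid forgetting a field is to generate the entry from a helper. The sketch below is illustrative only: ledger_entry and the bullet layout it prints are assumptions, so adapt the output to whatever layout the experiment docs already use. The date is taken in UTC with date -u.

```shell
# Hypothetical helper: print one ledger entry with the fields listed above.
# The layout is an assumption, not the format used in the experiment docs.
ledger_entry() {
  if [ $# -lt 6 ]; then
    echo "usage: ledger_entry NAME RUN_NAME WANDB_RUN WANDB_URL ENVIRONMENT INTENTION [JOB_ID] [ARTIFACT_PATH]" >&2
    return 2
  fi
  printf '%s\n' \
    "- name: $1" \
    "  run_name: $2" \
    "  wandb_run: $3" \
    "  wandb_url: $4" \
    "  environment: $5" \
    "  date_utc: $(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    "  intention: $6"
  # Job id and artifact path are optional because they may not be known yet.
  if [ -n "${7:-}" ]; then printf '%s\n' "  job_id: $7"; fi
  if [ -n "${8:-}" ]; then printf '%s\n' "  artifact_path: $8"; fi
}

# Usage:
# ledger_entry <name> <run_name> <wandb_run> <wandb_url> <environment> "<intention>" <jobid> \
#   >> docs/EXPERIMENTS_ONGOING.md
```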

Before each monitoring report:

reconcile EXPERIMENTS_ONGOING.md against live squeue
if a listed job is no longer live, treat it as finished by default
benchmark/archive it, then move it to EXPERIMENTS_RUN.md
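
The reconcile step is a set difference: job IDs listed as ongoing minus job IDs still in the queue. A minimal sketch, assuming both sides have already been reduced to one job ID per line. stale_jobs is a hypothetical helper, and how IDs are extracted from EXPERIMENTS_ONGOING.md depends on that file's layout, so it is left out here.

```shell
# Print job IDs that are listed as ongoing but absent from the live queue.
# $1 = file of listed IDs (one per line), $2 = file of live IDs (one per line).
stale_jobs() {
  # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file.
  # grep exits 1 when it prints nothing; treat that as success, keep real errors.
  grep -vxFf "$2" "$1" || [ $? -eq 1 ]
}

# Usage (squeue -h drops the header, -o %i prints only the job ID):
# ssh euler 'squeue -h -u $USER -o %i' > live.txt
# stale_jobs listed.txt live.txt
```

Every ID this prints is a candidate for benchmark/archive and a move to EXPERIMENTS_RUN.md.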

Reporting Format

Use one-line snapshots:

<job_id> | <run_name> | <task> | wandb_run=<...> | wandb_url=<...> | timeout=<...> | full=<...> | close=<...>

Always include run_name, task, and W&B identity, not just job ids.
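
The snapshot line can be produced with a small printf wrapper so that no field is dropped. This is a sketch with a hypothetical helper name; the timeout, full, and close values are passed through as plain strings because their exact meaning is defined by the repo workflow, not here.

```shell
# Hypothetical helper: print one snapshot line in the format shown above.
# args: job_id run_name task wandb_run wandb_url timeout full close
snapshot_line() {
  if [ $# -ne 8 ]; then
    echo "usage: snapshot_line JOB_ID RUN_NAME TASK WANDB_RUN WANDB_URL TIMEOUT FULL CLOSE" >&2
    return 2
  fi
  printf '%s | %s | %s | wandb_run=%s | wandb_url=%s | timeout=%s | full=%s | close=%s\n' "$@"
}

# Usage:
# snapshot_line <job_id> <run_name> <task> <wandb_run> <wandb_url> <timeout> <full> <close>
```

Requiring all eight arguments enforces the rule that run_name, task, and W&B identity always accompany the job id.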
