Name: Monitor Run
Author: PrimeIntellect-ai

Monitor an ongoing prime-rl training run: find the output directory, tail logs, check key metrics, inspect SLURM jobs, and restart safely. Use when asked to check on a run, debug training, or investigate performance.

stars: 1,413

forks: 302

updated: 20 May 2026, 21:27

SKILL.md

name	monitor-run
description	Monitor an ongoing prime-rl training run — find the output directory, tail logs, check key metrics, inspect SLURM jobs, and restart safely. Use when asked to check on a run, debug training, or investigate performance.

Monitor a run

Runbook

On launch

1. Find the output dir and read the resolved configs at {output_dir}/configs/ (start with rl.toml).
2. Confirm all processes are alive and the run is making progress.
3. Write the initial summary into {output_dir}/STATUS.md.

Recurring check-ins

Default cadence: 1 hour (researcher can override). At each check-in:

1. Confirm processes are alive.
2. Grep logs for errors/warnings; note current step and key metrics.
3. Append an entry to {output_dir}/STATUS.md (never overwrite):

## YYYY-MM-DD HH:MM UTC

**Step**: {current_step} / {max_steps}
**Health**: {Healthy | Degraded | Down}

**Progress**: reward/mean, seq_len, truncation, eval scores, env-specific metrics.
**Stability**: entropy, mismatch_kl, grad_norm — flag spikes.
**Performance**: trainer vs orchestrator step time, env lag, inference pressure.

**Notes**: anything unusual (errors, restarts, hangs). Omit if nothing notable.
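The append-only rule for the template above can be sketched in Python. The function name `append_status_entry` and its parameters are hypothetical; the point is that the file is opened in append mode, so earlier entries are never rewritten.

```python
from datetime import datetime, timezone
from pathlib import Path

HEALTH_STATES = {"Healthy", "Degraded", "Down"}


def append_status_entry(output_dir, step, max_steps, health,
                        progress, stability, performance, notes=""):
    """Append one check-in entry to STATUS.md, leaving earlier entries intact."""
    if health not in HEALTH_STATES:
        raise ValueError(f"health must be one of {sorted(HEALTH_STATES)}")
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    lines = [
        f"## {stamp}",
        "",
        f"**Step**: {step} / {max_steps}",
        f"**Health**: {health}",
        "",
        f"**Progress**: {progress}",
        f"**Stability**: {stability}",
        f"**Performance**: {performance}",
    ]
    if notes:  # the Notes line is omitted when there is nothing notable
        lines += ["", f"**Notes**: {notes}"]
    path = Path(output_dir) / "STATUS.md"
    with path.open("a", encoding="utf-8") as f:  # "a" = append, never truncate
        f.write("\n".join(lines) + "\n\n")
    return path
```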

Restarting a run

Never restart unless the researcher has explicitly asked for it. Before restarting, confirm with them the exact restart command and the conditions that warrant a restart.

Never run kill or launch commands from your own shell. Dispatch them to the tmux Launcher window so the researcher sees what was executed:

SESSION=$(tmux display-message -p '#S')
tmux send-keys -t "$SESSION:Launcher" 'your command here' Enter

After a restart, verify all processes are back up and progress resumed before the next check-in.
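If the dispatch is scripted rather than typed, the same two tmux calls can be built as argument lists, which avoids shell quoting problems in the command string. This is a sketch under that assumption; `current_session` and `launcher_argv` are hypothetical helper names.

```python
import subprocess


def current_session():
    """Return the name of the tmux session this process is running in."""
    result = subprocess.run(
        ["tmux", "display-message", "-p", "#S"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def launcher_argv(session, command, window="Launcher"):
    """Build the tmux argv that types `command` into the Launcher window."""
    return ["tmux", "send-keys", "-t", f"{session}:{window}", command, "Enter"]
```

A dispatch would then look like `subprocess.run(launcher_argv(current_session(), "your command here"), check=True)`, so the command still appears in the Launcher window where the researcher can see it.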

Reference

Where to find things

scripts/tmux.sh launches the run with a Launcher window in the named tmux session. The Claude window receives the output dir and session name in its appended prompt — if either is missing, ask rather than guess.
{output_dir}/configs/ — resolved TOMLs (rl.toml has the full picture).
{output_dir}/logs/ — see below.
{output_dir}/rollouts/step_N/ — saved rollouts.

Logs

{output_dir}/logs/
├── trainer.log                # rank 0 stdout
├── orchestrator.log           # orchestrator stdout
├── inference.log              # vLLM stdout
├── trainer/
│   ├── node_*.log             # per-node (multi-node only)
│   └── torchrun/              # per-rank stdout/stderr
├── inference/
│   ├── node_*.log             # per-node (multi-node only)
│   └── router_0.log           # vllm-router per replica (multi-node only)
└── envs/{train,eval}/{env_name}/
    ├── env_server.log
    └── env_worker_*.log

Usually, tailing trainer.log, orchestrator.log, and inference.log is enough. Drop into the per-node or per-rank logs only when debugging. All logs are written by loguru in the format HH:mm:ss LEVEL message, with levels DEBUG, INFO, SUCCESS, WARNING, and ERROR.

Scan for problems:

grep -E "WARNING|ERROR" {output_dir}/logs/{trainer,orchestrator,inference}.log
grep -E "WARNING|ERROR" {output_dir}/logs/envs/train/*/env_{server,worker_*}.log
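To compare check-ins, it can help to count matches per level rather than read raw grep output. The sketch below assumes each line starts with the `HH:mm:ss LEVEL message` layout described above; lines that do not match (tracebacks, ANSI-coloured output) are ignored. `count_levels` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches the assumed loguru layout: "HH:mm:ss LEVEL message".
LEVEL_RE = re.compile(r"^\d{2}:\d{2}:\d{2}\s+(DEBUG|INFO|SUCCESS|WARNING|ERROR)\b")


def count_levels(lines, levels=("WARNING", "ERROR")):
    """Count log lines whose level is in `levels`."""
    counts = Counter()
    for line in lines:
        match = LEVEL_RE.match(line)
        if match and match.group(1) in levels:
            counts[match.group(1)] += 1
    return counts
```

Passing an open file works directly, for example `count_levels(open(log_path, encoding="utf-8"))`, where `log_path` points at one of the logs listed above.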

Metrics

All metrics print to the console log (and W&B when configured).

Progress — orchestrator log:

Metric	Description
`reward/{all,env}/mean`	mean training reward
`seq_len/{all,env}/mean`	avg sequence length (tokens)
`num_turns/{all,env}/mean`	avg turns per rollout (multi-turn only)
`is_truncated/{all,env}/mean`	fraction truncated
`empty_rollouts/{all,env}`, `errored_rollouts/{all,env}`	fraction empty/errored
`metrics/{env}/{metric}`	env-specific (e.g. pass rate)
`eval/{env}/{avg@k,pass@k}`	eval scores when configured

Stability — trainer log:

Metric	Description
`mismatch_kl/{all,env}/{mean,std,max}`	KL between trainer and (old) inference policy over trainable tokens
`entropy/{all,env}/{mean,std,max}`	policy entropy over trainable tokens
`masked_advantage_{positive,negative}/mean`	fraction of DPPO-masked tokens with +/- advantage
`optim/grad_norm`	spikes may precede divergence
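One simple way to flag spikes in a series such as `optim/grad_norm` is to compare each value against the median of the values just before it. This is a generic heuristic, not something prime-rl prescribes; the window size and factor below are arbitrary assumptions, and `find_spikes` is a hypothetical helper.

```python
from statistics import median


def find_spikes(values, window=20, factor=3.0):
    """Return indices whose value exceeds `factor` times the median of the
    preceding `window` values."""
    spikes = []
    for i in range(1, len(values)):
        history = values[max(0, i - window):i]
        baseline = median(history)
        if baseline > 0 and values[i] > factor * baseline:
            spikes.append(i)
    return spikes
```

Using the median rather than the mean keeps a single earlier spike from inflating the baseline and hiding the next one.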

Performance — trainer and orchestrator step independently, so comparing step times shows who's waiting on whom.

Source	Metric	Description
trainer	`time/step`	total trainer step
trainer	`time/wait_for_batch`	high → orchestrator is bottleneck
trainer	`time/forward_backward`, `time/broadcast_weights`, `time/save_ckpt`	phase timings
trainer	`perf/throughput`, `perf/mfu`	tokens/s and MFU %
orchestrator	`time/step`, `time/generate_completions`, `time/update_weights`	phase timings
orchestrator	`time/wait_for_ckpt`	high → trainer is bottleneck
orchestrator	`scheduler/async_level`, `scheduler/inflight_rollouts`	scheduler state
env server	event loop lag (min/mean/p90/p99/max), active task distribution	periodic
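The "who's waiting on whom" comparison can be written down as a small rule over the two wait metrics. The sketch assumes the metrics have already been read into dicts keyed by the names in the table, and the 0.3 threshold is an arbitrary illustration, not a prime-rl default.

```python
def diagnose_bottleneck(trainer, orchestrator, threshold=0.3):
    """Return "orchestrator", "trainer", or "balanced" from the share of
    each step spent waiting on the other side."""
    trainer_wait = trainer["time/wait_for_batch"] / trainer["time/step"]
    orch_wait = orchestrator["time/wait_for_ckpt"] / orchestrator["time/step"]
    if max(trainer_wait, orch_wait) < threshold:
        return "balanced"
    # A trainer that mostly waits for batches is starved by the orchestrator,
    # and an orchestrator that mostly waits for checkpoints is held up by the trainer.
    return "orchestrator" if trainer_wait >= orch_wait else "trainer"
```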

For live vLLM stats, query the server's Prometheus-format metrics endpoint directly:

curl -s http://localhost:8000/metrics | grep -E "num_requests|gpu_cache_usage"
# vllm:num_requests_running, vllm:num_requests_waiting, vllm:gpu_cache_usage_perc (→1.0 = KV cache saturated)
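For recording these values in STATUS.md, the endpoint's text output can be parsed instead of grepped. The sketch below handles the plain Prometheus text format (comment lines, optional label sets) and only looks for the three metric names listed above; whether a given vLLM version exposes exactly those names is an assumption. `parse_metrics` is a hypothetical helper.

```python
VLLM_METRICS = (
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:gpu_cache_usage_perc",
)


def parse_metrics(text, names=VLLM_METRICS):
    """Map each requested metric name to the list of sample values found."""
    found = {name: [] for name in names}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name = line.split("{", 1)[0].split(None, 1)[0]
        if name not in found:
            continue
        # The value follows the label set if there is one, else the name.
        tail = line.rsplit("}", 1)[1] if "}" in line else line[len(name):]
        found[name].append(float(tail.split()[0]))
    return found
```

The text itself can be fetched with `urllib.request.urlopen("http://localhost:8000/metrics").read().decode()`. Each metric maps to a list because the endpoint may report one sample per label set.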

Rollouts

{output_dir}/rollouts/step_N/
├── train_rollouts.jsonl   # all train rollouts (vf.RolloutOutput, trajectory excluded)
├── eval_rollouts.jsonl    # only present when eval ran
└── train_rollouts.bin     # binary batch consumed by the trainer

wc -l {output_dir}/rollouts/step_42/train_rollouts.jsonl
head -1 {output_dir}/rollouts/step_42/train_rollouts.jsonl | python -m json.tool
jq '.reward' {output_dir}/rollouts/step_42/train_rollouts.jsonl
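The same inspection can be done in Python when a quick summary is more useful than a column of numbers. The sketch relies only on the top-level `reward` field that the `jq` command above reads; `summarize_rewards` is a hypothetical helper.

```python
import json
from statistics import mean


def summarize_rewards(path):
    """Return count, mean, min, and max of the `reward` field in a JSONL file."""
    rewards = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # tolerate blank lines
                rewards.append(float(json.loads(line)["reward"]))
    if not rewards:
        return {"count": 0, "mean": None, "min": None, "max": None}
    return {
        "count": len(rewards),
        "mean": mean(rewards),
        "min": min(rewards),
        "max": max(rewards),
    }
```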

Common failure modes

A few warnings are normal. Escalate when errors are persistent, growing, or hit a large fraction of rollouts.

Env workers: exceptions in env code, timeouts, sandbox errors, OOM kills (most common source — runs user code).
Orchestrator: empty/errored rollout spikes, weight-broadcast failures, checkpoint errors.
Trainer: NCCL/CUDA errors, OOM, NaN loss or gradients.
Inference: NCCL/CUDA errors, OOM, request timeouts.
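The escalation rule (persistent, growing, or affecting a large fraction of rollouts) can be made concrete over the errored-rollout fraction noted at each check-in. The thresholds here are arbitrary assumptions for illustration, and `should_escalate` is a hypothetical helper; the real cut-offs are the researcher's call.

```python
def should_escalate(error_fractions, large=0.2, checks=3):
    """Decide whether to escalate from the errored-rollout fraction recorded
    at each check-in (oldest first)."""
    if not error_fractions:
        return False
    recent = error_fractions[-checks:]
    enough_history = len(recent) == checks
    large_now = error_fractions[-1] >= large
    persistent = enough_history and all(x > 0 for x in recent)
    growing = enough_history and all(b > a for a, b in zip(recent, recent[1:]))
    return large_now or persistent or growing
```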

Process tree

All processes use setproctitle so they're visible in ps/htop/pstree:

PRIME-RL::Launcher
├── PRIME-RL::Inference          (vLLM server, GPU 0)
├── PRIME-RL::Orchestrator       (CPU-only)
│   └── Verifiers::EnvServer     (ZMQ env server per environment)
│       └── Verifiers::EnvWorker0..N
├── torchrun
│   └── PRIME-RL::Trainer        (GPU 1+)
└── tail trainer.log

For multi-node runs, trainer and inference processes are on separate nodes — use srun or ssh to inspect them.
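Because the process titles are fixed, a liveness check can be a simple substring match over `ps` output. The sketch below takes the output as a string so it can be fed from a local `ps`, or from `srun`/`ssh` on another node; `missing_processes` is a hypothetical helper, and the expected list covers only the titles shown in the tree above.

```python
EXPECTED_TITLES = (
    "PRIME-RL::Launcher",
    "PRIME-RL::Inference",
    "PRIME-RL::Orchestrator",
    "PRIME-RL::Trainer",
)


def missing_processes(ps_output, expected=EXPECTED_TITLES):
    """Return the expected process titles that do not appear in `ps_output`."""
    lines = ps_output.splitlines()
    return [title for title in expected
            if not any(title in line for line in lines)]
```

Locally, the input could come from `subprocess.run(["ps", "-eo", "args"], capture_output=True, text=True).stdout`. On a multi-node run, a node that hosts only the trainer or only inference will report the other titles as missing, so the expected list should be narrowed per node.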

related-skills.json

same repository

configs.md

from "PrimeIntellect-ai/prime-rl"

How the prime-rl config system works — TOML files, CLI overrides, composition, and special patterns. Use when creating configs, debugging config errors, or overriding values via CLI.

2026-05-22 · 1.4k

release.md

from "PrimeIntellect-ai/prime-rl"

How to prepare and publish GitHub releases for prime-rl. Use when drafting release notes, tagging versions, or publishing releases.

2026-05-20 · 1.4k

install.md

from "PrimeIntellect-ai/prime-rl"

How to install prime-rl and its optional dependencies. Use when setting up the project, installing extras like deep-gemm for FP8 models, or troubleshooting dependency issues.

2026-05-20 · 1.4k

training.md

from "PrimeIntellect-ai/prime-rl"

Launch and monitor prime-rl training runs. Use when starting, supervising, or debugging an RL/SFT run. Routes to `start-run` (entrypoints + how to launch) and `monitor-run` (logs, metrics, check-ins).

2026-05-20 · 1.4k

start-run.md

from "PrimeIntellect-ai/prime-rl"

How to launch prime-rl training runs — the `rl`, `sft`, and `inference` entrypoints, their config classes, and single-node/SLURM/dry-run modes. Use when starting a run or picking the right entrypoint.

2026-05-20 · 1.4k

package.json

"author": "PrimeIntellect-ai"

"repository": "PrimeIntellect-ai/prime-rl"
