babysit-job
Monitor an Iris job and recover it on failure. Use when asked to babysit or watch a job or run.
| name | babysit-job |
| description | Monitor an Iris job and recover it on failure. Use when asked to babysit or watch a job or run. |
Monitor a job continuously and recover on failure. For Zephyr pipelines, delegate to babysit-zephyr instead. Otherwise, follow this skill — Iris is the execution backend.
- `job_id`: Iris job ID in canonical format `/<user>/<job>` (e.g., `/dlwh/iris-run-train_tiny_model_tpu-20260302-185630`)
- `config`: Iris config path (e.g., `lib/iris/config/marin.yaml`). When the user refers to a cluster by shorthand name (e.g., "marin_dev", "marin-dev", "marin", "coreweave"), resolve it to the matching config file under `lib/iris/config/`. Common mappings:
  - marin / marin_prod -> `lib/iris/config/marin.yaml`
  - marin_dev / marin-dev -> `lib/iris/config/marin-dev.yaml`
  - coreweave -> `lib/iris/config/coreweave.yaml`
- `resubmit_command`: exact Iris submit command for resubmission; must include:
  - `--no-wait`
  - `--extra marin-core:tpu` (not `--extra marin-core:cpu`)
  - `--tpu <variant>`. `--reserve <variant>` only holds capacity; it does not attach TPU devices to the task container.

Example resubmit command:
uv run iris --config lib/iris/config/marin.yaml job run --no-wait --extra marin-core:tpu --tpu v5litepod-16 -- python experiments/tutorials/train_tiny_model_tpu.py
If any required field is missing, ask for it before proceeding.
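The shorthand resolution and the flag requirements above can be checked mechanically before monitoring starts. The following is a minimal sketch, not part of Iris: `resolve_config` and `resubmit_problems` are hypothetical helper names, the mapping table is copied from the list above, and the flag check assumes a TPU job whose flags are written as separate tokens (`--extra marin-core:tpu`, not `--extra=marin-core:tpu`).

```python
import shlex

# Shorthand -> config path, copied from the mappings listed above.
CLUSTER_CONFIGS = {
    "marin": "lib/iris/config/marin.yaml",
    "marin_prod": "lib/iris/config/marin.yaml",
    "marin_dev": "lib/iris/config/marin-dev.yaml",
    "marin-dev": "lib/iris/config/marin-dev.yaml",
    "coreweave": "lib/iris/config/coreweave.yaml",
}


def resolve_config(name: str) -> str:
    """Return a config path for a shorthand name; pass explicit paths through."""
    if name.endswith(".yaml"):
        return name
    try:
        return CLUSTER_CONFIGS[name.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown cluster shorthand: {name!r}") from None


def resubmit_problems(command: str) -> list:
    """List the flag requirements that the resubmit command does not meet."""
    args = shlex.split(command)
    # Only inspect Iris flags: everything after the bare "--" belongs to the task.
    if "--" in args:
        args = args[: args.index("--")]
    problems = []
    if "--no-wait" not in args:
        problems.append("missing --no-wait")
    # Collect the value that follows each "--extra" token.
    extras = [args[i + 1] for i, a in enumerate(args[:-1]) if a == "--extra"]
    if "marin-core:tpu" not in extras:
        problems.append("missing --extra marin-core:tpu")
    if "--tpu" not in args:
        problems.append("missing --tpu <variant>")
    return problems
```

An empty list from `resubmit_problems` means the command passes these checks; a command that uses only `--reserve <variant>` is reported as missing `--tpu <variant>`, matching the note that `--reserve` does not attach TPU devices.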
- Track `job_id`, latest error/signal, W&B link(s), and resubmission metadata.
- After a submit or restart, sleep 120 once and check for immediate failure; if the job is still alive, switch to the normal 570 cadence.
- Treat "monitor stale" separately from "run unhealthy".

When using marin-mcp-babysitter, keep the MCP server resident and verify the job through MCP tools, not only Iris CLI commands.
- Keep the tunnel and the server in persistent sessions (screen, tmux, or one long-running exec session). Record session names, ports, and log paths in the state file.
- Server command: `uv run --package marin-core marin-mcp-babysitter --controller-url <URL> --cluster <CLUSTER> --transport streamable-http --host 127.0.0.1 --port <PORT>`
- Verify the job with the MCP tools `iris_job_summary` and `iris_tail_logs`. For heartbeat monitoring, report: job state, latest progress/tick/log line, timestamp, error signal.
- Keep tunnel and server logs under `scratch/`.

Write to `scratch/<create_timestamp>_monitoring_state.json` (create `scratch/` if needed); `<create_timestamp>` has format `YYYYMMDD-HHMM`. Track `restart_count` to detect flapping. Add MCP fields when a resident MCP server is part of the setup. The state file allows resume after context reset.
{
"ts": <timestamp_ms>,
"job_id": "<JOB_ID>",
"config": "<IRIS_CONFIG_PATH>",
"mcp_url": "http://127.0.0.1:<PORT>/mcp",
"tunnel_session": "<SESSION_NAME>",
"server_session": "<SESSION_NAME>",
"tunnel_log": "scratch/<TUNNEL_LOG>",
"server_log": "scratch/<SERVER_LOG>",
"resubmit_command": "<IRIS_JOB_RUN_COMMAND_WITH_NO_WAIT>",
"restart_count": 0
}
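Reading and writing the state file shown above could look like the following sketch. The helper names are hypothetical, and the flapping threshold of 3 restarts is an assumption chosen for illustration; the source only says that `restart_count` is tracked to detect flapping.

```python
import json
import time
from datetime import datetime
from pathlib import Path
from typing import Optional


def new_state_path(scratch: Path = Path("scratch"),
                   now: Optional[datetime] = None) -> Path:
    """Build scratch/<create_timestamp>_monitoring_state.json, creating scratch/."""
    scratch.mkdir(parents=True, exist_ok=True)
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M")  # YYYYMMDD-HHMM
    return scratch / f"{stamp}_monitoring_state.json"


def save_state(path: Path, state: dict) -> None:
    """Refresh "ts" (milliseconds) and rewrite the whole file."""
    state["ts"] = int(time.time() * 1000)
    path.write_text(json.dumps(state, indent=2))


def load_state(path: Path) -> dict:
    """Reload the state, e.g. to resume after a context reset."""
    return json.loads(path.read_text())


def record_restart(path: Path, new_job_id: str) -> dict:
    """Apply the post-resubmit update: new job_id, restart_count += 1."""
    state = load_state(path)
    state["job_id"] = new_job_id
    state["restart_count"] = state.get("restart_count", 0) + 1
    save_state(path, state)
    return state


def is_flapping(state: dict, max_restarts: int = 3) -> bool:
    """Assumed policy: more than max_restarts restarts counts as flapping."""
    return state.get("restart_count", 0) > max_restarts
```

Rewriting the whole file on each update keeps the logic simple; the file is small, and one writer (the babysitting session) is assumed.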
1. SLEEP
- if just submitted/restarted: sleep 120 once
- otherwise: sleep 570
2. CHECK LOGS
uv run iris --config <CONFIG> job logs --since-seconds 900 <JOB_ID> | rg -i -e "loss|error|traceback|exception|resource_exhausted|oom|compiler_base\.cc:2587|program hbm requirement|largest program allocations|ownerdiederror|dead node|node death|autoscaler unsatisfied resources|no accelerator found|failed_precondition|device or resource busy"
`iris job logs <JOB_ID>` includes child-job task logs by default.
3. CHECK STATUS
uv run iris --config <CONFIG> job list --json --prefix <JOB_ID>
Terminal success: JOB_STATE_SUCCEEDED
Terminal non-success: JOB_STATE_FAILED, JOB_STATE_KILLED, JOB_STATE_WORKER_FAILED, JOB_STATE_UNSCHEDULABLE
Non-terminal: JOB_STATE_PENDING, JOB_STATE_BUILDING, JOB_STATE_RUNNING
If `pending_reason` indicates worker scale-up/capacity wait, treat as scheduler
capacity wait — do not run cluster update/recreate/restart actions. Continue
waiting on cadence, or stop+resubmit only if user explicitly asks.
Treat RUNNING as controller-level signal only; confirm allocation via expected
W&B run when possible.
3a. ON TERMINAL STATE / OOM-LIKE SIGNAL — get a structured per-task summary
(final state, exit, duration, peak memory) instead of grepping logs:
uv run iris --config <CONFIG> job summary --json <JOB_ID>
Fast postmortem: e.g. "13/14 shards peaked near the container memory limit
and failed with exit 137" → cgroup OOM, raise `--memory` on resubmit.
4. PRINT W&B RUN IDS/LINKS (once per training run)
- For normal runs, record the active W&B run id/display name/link when W&B is
available; many runs use autoassigned ids.
- When the launch workflow provides an intended W&B identity, validate the
active run id/display name, state, `_timestamp`, `global_step`, and key
losses against it. Do not rely only on a stored URL.
- During resume catch-up, W&B and checkpoint progress may be stale. Live
training-progress log lines with advancing timestamps are sufficient
liveness until W&B appears; once W&B is active, require W&B
timestamps/steps to keep moving.
5. REPORT PROGRESS (format: ~<current>/<exact_max>)
- Resolve `<exact_max>` from the launched config/code, not from progress-bar display text.
6. EVALUATE (terminal? error? stalled? -> recover or continue)
7. RECOVER (STOP -> RESUBMIT)
- If current job is still non-terminal, stop it first:
uv run iris --config <CONFIG> job stop <JOB_ID>
- Then resubmit:
<RESUBMIT_COMMAND>
- Capture `job_id` from output (line like `Job submitted: /<user>/<job>`).
- Iris nuance:
- if `resubmit_command` omits `--job-name`, Iris auto-generates a fresh id each resubmission.
- if `resubmit_command` uses a fixed `--job-name`, Iris may reuse the same id
after terminal completion by replacing the finished job.
- Update state file: `job_id=<NEW_JOB_ID>`, `restart_count += 1`.
- Go to step 1.
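The decision points in the loop above (sleep cadence, state classification, and capturing the new job ID after a resubmit) can be sketched as small pure functions. This is an illustration under stated assumptions: the function names and action labels are hypothetical, and the state string is passed in directly because the field layout of `job list --json` output is not shown in this document.

```python
import re
from typing import Optional

TERMINAL_SUCCESS = {"JOB_STATE_SUCCEEDED"}
TERMINAL_NON_SUCCESS = {
    "JOB_STATE_FAILED",
    "JOB_STATE_KILLED",
    "JOB_STATE_WORKER_FAILED",
    "JOB_STATE_UNSCHEDULABLE",
}
NON_TERMINAL = {"JOB_STATE_PENDING", "JOB_STATE_BUILDING", "JOB_STATE_RUNNING"}


def sleep_seconds(just_submitted: bool) -> int:
    """Step 1: 120 once after a submit/restart, otherwise 570."""
    return 120 if just_submitted else 570


def next_action(job_state: str) -> str:
    """Step 6: map a job state to a (hypothetical) action label."""
    if job_state in TERMINAL_SUCCESS:
        return "verify_and_finish"
    if job_state in TERMINAL_NON_SUCCESS:
        return "recover"
    if job_state in NON_TERMINAL:
        return "continue"
    raise ValueError(f"unrecognized job state: {job_state!r}")


# Matches a line like "Job submitted: /<user>/<job>" anywhere in the output.
JOB_SUBMITTED = re.compile(
    r"^Job submitted:[ \t]*(/[^/\s]+/\S+)[ \t]*$", re.MULTILINE
)


def parse_job_id(submit_output: str) -> Optional[str]:
    """Step 7: capture the canonical job ID, or None if no such line exists."""
    match = JOB_SUBMITTED.search(submit_output)
    return match.group(1) if match else None
```

`next_action` raises on unknown states so that a new state name is surfaced instead of being silently treated as healthy. A `continue` result for `JOB_STATE_RUNNING` is still only a controller-level signal and does not replace the W&B check in step 4.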
When EVALUATE detects an error, before recovery:
- Look for `Traceback`, `Error`, `Exception`. Identify file and line.
- For a small code bug (`NameError`, `ImportError`, `SyntaxError`, obvious `KeyError`): fix it, then RECOVER.
- HBM out-of-memory signals: `Program hbm requirement ...`, `Largest program allocations in hbm`.
- `OwnerDiedError`, dead node, or unsatisfied resources -> mark degraded and notify user.

Before declaring the job complete:
- Check the checkpoint `metadata.json` when the run is expected to write a checkpoint.
- `iris task exec` -> debug
- `job list --prefix` requires canonical job names (`/<user>/<job>`), not short names.
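The error triage described above can be sketched as an ordered rule table applied to each log line. The labels and the `classify_log_line` helper are hypothetical; the patterns come from the signals named in this document. `KeyError` is left to the generic rule because deciding whether it is "obvious" needs human judgement, and the HBM rule only labels the signal because the source does not state the follow-up action.

```python
import re
from typing import Optional

# Ordered from most to least specific; the first matching rule wins.
RULES = [
    ("fix_and_recover",
     re.compile(r"\b(NameError|ImportError|SyntaxError)\b")),
    ("hbm_oom",
     re.compile(r"program hbm requirement|largest program allocations", re.I)),
    ("degraded_notify_user",
     re.compile(r"ownerdiederror|dead node|unsatisfied resources", re.I)),
    ("inspect",
     re.compile(r"traceback|error|exception", re.I)),
]


def classify_log_line(line: str) -> Optional[str]:
    """Return the label of the first matching rule, or None for a clean line."""
    for label, pattern in RULES:
        if pattern.search(line):
            return label
    return None
```

Rule order matters: `OwnerDiedError` also contains the word "error", so the degraded rule has to run before the generic one.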