| name | babysit-job |
| description | Monitor an Iris job and recover it on failure. Use when asked to babysit or watch a job or run. |
Skill: Babysit Job
Monitor a job continuously and recover on failure. For Zephyr pipelines,
delegate to babysit-zephyr instead. Otherwise, follow this skill — Iris is
the execution backend.
Required Info
job_id — Iris job ID in canonical format /<user>/<job> (e.g., /dlwh/iris-run-train_tiny_model_tpu-20260302-185630)
config — Iris config path (e.g., lib/iris/config/marin.yaml). When the user
refers to a cluster by shorthand name (e.g., "marin_dev", "marin-dev", "marin",
"coreweave"), resolve it to the matching config file under lib/iris/config/.
Common mappings:
marin / marin_prod -> lib/iris/config/marin.yaml
marin_dev / marin-dev -> lib/iris/config/marin-dev.yaml
coreweave -> lib/iris/config/coreweave.yaml
resubmit_command — exact Iris submit command for resubmission; must include --no-wait
- For Marin TPU training jobs, use
--extra marin-core:tpu (not --extra marin-core:cpu)
- For TPU jobs, the resubmit command must request TPU resources with
--tpu <variant>.
--reserve <variant> only holds capacity; it does not attach TPU devices to the task container.
Example resubmit command:
uv run iris --config lib/iris/config/marin.yaml job run --no-wait --extra marin-core:tpu --tpu v5litepod-16 -- python experiments/tutorials/train_tiny_model_tpu.py
If any required field is missing, ask for it before proceeding.
Scope
- Recovery is stop then resubmit at the job level.
- Cluster-level actions are out of scope. Do not restart, recreate, or otherwise
mutate the cluster unless the user gives explicit consent in the current thread.
- For TPU bad-node errors, escalate to debug.
Monitoring Ownership and Duration
- Assign a single monitoring owner when the loop starts.
- Keep the loop running until: the job reaches a terminal state and the user has
acknowledged next action; a user-specified stopping point is reached; or an
unrecoverable error is found and reported.
- Do not stop early after first loss lines, first eval, or first W&B link.
- Ferry-scale runs commonly take 4-5 hours.
- Do not end the turn for a status update while continuous monitoring is active;
continue until terminal state, a stopping point, or an unrecoverable error.
- For handoff, transfer ownership explicitly with: current
job_id, latest
error/signal, W&B link(s), and resubmission metadata.
Cadence and Tooling Notes
- After submit/resubmit: sleep
120 once, check for immediate failure; if still
alive, switch to the normal 570 cadence.
- Tool-runtime workaround: keep one long-running monitor session; poll it in
~30s chunks as tool limits require — repeated no-output polls are expected
while waiting for the next 570s check.
- Run only one active monitor loop per job (duplicate loops cause SSH tunnel and
port-binding conflicts).
- Sleep must be foreground (max ~10 min due to tool timeout). Loop control is at
agent level, not bash.
- Screen/process alive is not enough. Check state-file freshness plus
stdout/event-log mtime when a monitor writes them; if no monitor state or
event update occurs for more than 2 cadences, report
monitor stale
separately from run unhealthy.
- If an Iris/orchestrator query is blocked or inconclusive, do not assume job
failure. Cross-check W&B freshness, live logs, checkpoint movement,
worker/TPU health, and latest monitor state.
MCP-Assisted Monitoring
When using marin-mcp-babysitter, keep the MCP server resident and verify the
job through MCP tools, not only Iris CLI commands.
- Keep the controller tunnel and MCP server in named, restartable sessions
(
screen, tmux, or one long-running exec session). Record session names,
ports, and log paths in the state file.
- Start MCP with a stable local controller URL and streamable HTTP transport:
uv run --package marin-core marin-mcp-babysitter --controller-url <URL> --cluster <CLUSTER> --transport streamable-http --host 127.0.0.1 --port <PORT>
- Verify with
iris_job_summary and iris_tail_logs. For heartbeat monitoring,
report: job state, latest progress/tick/log line, timestamp, error signal.
- If the MCP server is reachable but tool calls fail with connection refused to
the controller URL, restart only the smoke-test tunnel/session — do not mutate
the Iris cluster.
- If a sandbox blocks localhost TCP probes, run the probe inside an existing
long-lived session and write a small JSON result under
scratch/.
- For bounded smoke tests, create a thread heartbeat only after the job is
submitted, MCP is reachable, and one expected log/progress line has appeared.
Delete the heartbeat and stop smoke-test sessions when the job reaches the
expected terminal state.
State File
Write to scratch/<create_timestamp>_monitoring_state.json (create scratch/
if needed); <create_timestamp> has format YYYYMMDD-HHMM. Track
restart_count to detect flapping. Add MCP fields when a resident MCP server is
part of the setup. The state file allows resume after context reset.
{
"ts": <timestamp_ms>,
"job_id": "<JOB_ID>",
"config": "<IRIS_CONFIG_PATH>",
"mcp_url": "http://127.0.0.1:<PORT>/mcp",
"tunnel_session": "<SESSION_NAME>",
"server_session": "<SESSION_NAME>",
"tunnel_log": "scratch/<TUNNEL_LOG>",
"server_log": "scratch/<SERVER_LOG>",
"resubmit_command": "<IRIS_JOB_RUN_COMMAND_WITH_NO_WAIT>",
"restart_count": 0
}
Loop
1. SLEEP
- if just submitted/restarted: sleep 120 once
- otherwise: sleep 570
2. CHECK LOGS
uv run iris --config <CONFIG> job logs --since-seconds 900 <JOB_ID> | rg -i -e "loss|error|traceback|exception|resource_exhausted|oom|compiler_base\.cc:2587|program hbm requirement|largest program allocations|ownerdiederror|dead node|node death|autoscaler unsatisfied resources|no accelerator found|failed_precondition|device or resource busy"
`iris job logs <JOB_ID>` includes child-job task logs by default.
3. CHECK STATUS
uv run iris --config <CONFIG> job list --json --prefix <JOB_ID>
Terminal success: JOB_STATE_SUCCEEDED
Terminal non-success: JOB_STATE_FAILED, JOB_STATE_KILLED, JOB_STATE_WORKER_FAILED, JOB_STATE_UNSCHEDULABLE
Non-terminal: JOB_STATE_PENDING, JOB_STATE_BUILDING, JOB_STATE_RUNNING
If `pending_reason` indicates worker scale-up/capacity wait, treat as scheduler
capacity wait — do not run cluster update/recreate/restart actions. Continue
waiting on cadence, or stop+resubmit only if user explicitly asks.
Treat RUNNING as controller-level signal only; confirm allocation via expected
W&B run when possible.
3a. ON TERMINAL STATE / OOM-LIKE SIGNAL — get a structured per-task summary
(final state, exit, duration, peak memory) instead of grepping logs:
uv run iris --config <CONFIG> job summary --json <JOB_ID>
Fast postmortem: e.g. "13/14 shards peaked near the container memory limit
and failed with exit 137" → cgroup OOM, raise `--memory` on resubmit.
4. PRINT W&B RUN IDS/LINKS (once per training run)
- For normal runs, record the active W&B run id/display name/link when W&B is
available; many runs use autoassigned ids.
- When the launch workflow provides an intended W&B identity, validate the
active run id/display name, state, `_timestamp`, `global_step`, and key
losses against it. Do not rely only on a stored URL.
- During resume catch-up, W&B and checkpoint progress may be stale. Live
training-progress log lines with advancing timestamps are sufficient
liveness until W&B appears; once W&B is active, require W&B
timestamps/steps to keep moving.
5. REPORT PROGRESS (format: ~<current>/<exact_max>)
- Resolve `<exact_max>` from the launched config/code, not from progress-bar display text.
6. EVALUATE (terminal? error? stalled? -> recover or continue)
7. RECOVER (STOP -> RESUBMIT)
- If current job is still non-terminal, stop it first:
uv run iris --config <CONFIG> job stop <JOB_ID>
- Then resubmit:
<RESUBMIT_COMMAND>
- Capture `job_id` from output (line like `Job submitted: /<user>/<job>`).
- Iris nuance:
- if `resubmit_command` omits `--job-name`, Iris auto-generates a fresh id each resubmission.
- if `resubmit_command` uses a fixed `--job-name`, Iris may reuse the same id
after terminal completion by replacing the finished job.
- Update state file: `job_id=<NEW_JOB_ID>`, `restart_count += 1`.
- Go to step 1.
Fixing Small Bugs
When EVALUATE detects an error, before recovery:
- Analyze logs for
Traceback, Error, Exception. Identify file and line.
- Small fix (
NameError, ImportError, SyntaxError, obvious KeyError):
fix it, then RECOVER.
- Complex (OOM, TPU/XLA HBM exhaustion, distributed-training failures, data
loading, unclear multi-file stack traces): report to user, exit loop.
Error Patterns
- Treat TPU/XLA HBM reports as failure even without literal OOM:
Program hbm requirement ...
Largest program allocations in hbm
- If progress stalls across multiple intervals with
OwnerDiedError, dead node,
or unsatisfied resources -> mark degraded and notify user.
- If same error repeats after one fix attempt, do not retry blindly; report to user.
- Noisy shutdown traces are not decisive by themselves. Terminal Iris/orchestrator
status, driver/process exit code, final checkpoint state, and W&B state
determine whether a run succeeded.
Completion
Before declaring the job complete:
- Verify terminal state is successful.
- Verify W&B is finished or has the expected final state and metrics when W&B is
part of the run.
- Verify the final checkpoint has
metadata.json when the run is expected to
write a checkpoint.
- Capture final metrics, final step, W&B run id/display name, output root, final
checkpoint path, and caveats in the monitoring state or handoff note.
- Stop/delete monitor heartbeats and resident monitoring sessions that are no
longer needed.
When to Escalate
- Zephyr pipeline issues, TPU bad-node errors, or debugging running tasks with
iris task exec -> debug
Notes
- Iris
job list --prefix requires canonical job names (/<user>/<job>), not short names.
- Iris monitoring is job-level; cluster updates are not part of normal recovery.