reserve-tpu
Reserve an Iris-backed TPU worker for fast debugging with dev_tpu.py.
| name | reserve-tpu |
| description | Reserve an Iris-backed TPU worker for fast debugging with dev_tpu.py. |
Use this skill for the standard fast TPU debugging loop without wiring a full training job each time.
scripts/iris/dev_tpu.py reserves a TPU-backed worker through Iris, waits for the worker VM to come up, and lets you SSH into it or run commands directly against it. It uses gcloud SSH and SCP against the worker Iris assigned to the holder job. There is no persistent ~/.ssh/config alias.
Run at most one TPU job at a time on a given dev TPU VM. Do not launch concurrent TPU commands on the same worker from multiple shells, tmux panes, or background jobs.
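One way to enforce the one-job rule from your own machine is a small lock wrapper around whatever command you launch. This is only a sketch: `run_exclusive` and the lock path are hypothetical names, not part of dev_tpu.py, and the lock only guards launches from the same local machine. It does not see jobs started by other hosts or from inside an SSH session on the worker.

```shell
# Hypothetical helper (not part of the repo): refuse to start a second
# TPU command while one launched from this machine is still running.
# Uses mkdir as an atomic lock; the lock path is an illustrative choice.
run_exclusive() {
  _lockdir="${TMPDIR:-/tmp}/dev-tpu-${TPU_NAME:-default}.lock"
  if ! mkdir "$_lockdir" 2>/dev/null; then
    echo "another TPU command appears to be running (lock: $_lockdir)" >&2
    return 1
  fi
  "$@"                 # run the wrapped command
  _status=$?
  rmdir "$_lockdir"    # release the lock once the command finishes
  return "$_status"
}
```

If the wrapped command is killed hard, the lock directory can be left behind; remove it by hand before retrying.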
- allocate: submit a TPU holder job, resolve the assigned worker VM(s), optionally sync the repo, and block until release
- status: show the active local session metadata
- connect: open an interactive SSH session to one reserved worker
- setup_env: sync the repo by default, then install/refresh the remote uv environment on one or all reserved workers
- execute: sync local files to ~/marin on the reserved worker(s), then run one command
- watch: sync all reserved workers and rerun a command on the selected worker when local files change
- release: terminate the holder job and remove the local session file

Before first use, authenticate with gcloud and set up the repo:
gcloud auth login
gcloud config set project hai-gcp-models
gcloud auth application-default login
make dev_setup
Ensure the Iris controller is running for your cluster. On shared Marin clusters this is usually already true; only start it yourself for a fresh or local cluster.
Use a cluster config that can actually provision the TPU type you want.
All invocations share this shape; only the subcommand and its flags change:
uv run scripts/iris/dev_tpu.py \
--config lib/iris/config/marin.yaml \
--tpu-name "$USER-v5p8" \
<subcommand> [flags]
Subcommands and distinctive flags:
- allocate --tpu-type v5p-8: reserves a single-host TPU and holds it until Ctrl-C. --tpu-type is required (the config may expose many variants; the script does not guess). Add --zone us-east5-b to pin the holder job to a zone (the config must expose that TPU family there). Add --no-setup-env to skip remote env setup.
- status: show the active session.
- connect: interactive SSH to the reserved worker. After connecting, run: cd ~/marin && source ~/.local/bin/env
- setup_env: install/refresh the remote env. Run it before the first execute/watch if you allocated with --no-setup-env.
- execute -- <cmd>: sync local files, then run <cmd>. Add --no-sync to skip the sync step in a tight inner loop. Example: execute -- uv run --package levanter --group test pytest lib/levanter/tests/kernels/test_pallas_fused_cross_entropy_loss.py
- watch -- <cmd>: rerun <cmd> on local file changes.
- release: terminate the holder job and clear the session file.

Multi-host TPU types reserve more than one worker VM. Use --worker <index> with connect, execute, or watch to target a specific worker (execute and watch default to worker 0):
uv run scripts/iris/dev_tpu.py --config lib/iris/config/marin.yaml \
--tpu-name "$USER-v5p16" connect --worker 1
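Since every invocation repeats the same --config and --tpu-name flags, a small shell function can carry them for you. The sketch below is a convenience wrapper, not something the repo ships: the function name `dev_tpu` and the variables `DEV_TPU_CONFIG` and `DEV_TPU_DRY_RUN` are hypothetical. Only the underlying command line comes from this document.

```shell
# Hypothetical wrapper (not part of the repo) around the shared invocation shape.
# DEV_TPU_CONFIG and DEV_TPU_DRY_RUN are illustrative variable names.
dev_tpu() {
  _config="${DEV_TPU_CONFIG:-lib/iris/config/marin.yaml}"
  _name="${TPU_NAME:-${USER:-dev}-v5p8}"
  if [ "${DEV_TPU_DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it, to check the flags first.
    echo "uv run scripts/iris/dev_tpu.py --config $_config --tpu-name $_name $*"
  else
    uv run scripts/iris/dev_tpu.py --config "$_config" --tpu-name "$_name" "$@"
  fi
}
```

With this loaded, `dev_tpu allocate --tpu-type v5p-8` in one terminal and `dev_tpu connect --worker 1` in another expand to the full commands shown above. Run it from the repo root so the relative script and config paths resolve.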
Use normal Iris tooling to inspect the backing cluster and holder job:
uv run iris --config=lib/iris/config/marin.yaml cluster dashboard
uv run iris --config=lib/iris/config/marin.yaml cluster vm status
uv run iris --config=lib/iris/config/marin.yaml job list --prefix /$USER/dev-tpu
uv run iris --config=lib/iris/config/marin.yaml job logs /$USER/dev-tpu-<name>
If worker bootstrap fails:
uv run iris --config=lib/iris/config/marin.yaml cluster vm logs <worker-id>
- Local session state is stored under ~/.cache/marin/dev_tpu_iris/.
- If the allocate terminal dies unexpectedly, use release to terminate the holder job and clear the stale state file.
- execute and watch already wrap the remote command in bash -lc; do not pass your own bash -c.
- Always pass --tpu-name to avoid collisions with other agents:
export TPU_NAME="${USER}-$(git rev-parse --abbrev-ref HEAD | tr '/' '-')"
uv run scripts/iris/dev_tpu.py --config lib/iris/config/marin.yaml --tpu-name "$TPU_NAME" allocate --tpu-type v5p-8
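Branch names can contain uppercase letters or underscores as well as slashes. If you want a more conservative name, a helper along these lines normalizes the result. This is a sketch under an assumption: this document does not state which characters Iris accepts in a TPU name, so restricting it to lowercase letters, digits, and hyphens is a cautious guess. `derive_tpu_name` is a hypothetical function.

```shell
# Hypothetical helper (not part of the repo): build "<user>-<branch>" and
# normalize it to lowercase letters, digits, and hyphens.
# Assumption: that character set is safe for TPU names; the source does not say.
derive_tpu_name() {
  # $1 = user name, $2 = branch name
  printf '%s-%s' "$1" "$2" | tr 'A-Z' 'a-z' | tr -c 'a-z0-9-' '-'
}
```

Usage: `export TPU_NAME="$(derive_tpu_name "$USER" "$(git rev-parse --abbrev-ref HEAD)")"`. On a detached HEAD, `git rev-parse --abbrev-ref HEAD` prints `HEAD`, so the name will not be unique per branch; pick a name by hand in that case.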
Normal cleanup is Ctrl-C in the allocate terminal. To clean up from another shell, run the release subcommand.