debug
Debug code bugs or Iris/Zephyr/TPU infrastructure faults with a structured debug log.
| Field | Value |
|---|---|
| name | debug |
| description | Debug code bugs or Iris/Zephyr/TPU infrastructure faults with a structured debug log. |
Systematic debugging for code-level bugs and Marin infrastructure faults.
For infrastructure symptoms, route to the right OPS.md section first; for
code bugs, keep a structured debug log.
Read lib/iris/AGENTS.md or lib/zephyr/AGENTS.md for context, then follow
the matching OPS.md section:
| Symptom | Read |
|---|---|
| Stuck job, scheduling failure, resource leak, controller stalled | lib/iris/OPS.md → SQL Queries, Process Inspection & Profiling, Known Bugs, Troubleshooting |
| Iris task misbehaving, container inspection, profiling a running task | lib/iris/OPS.md → Task Operations, Process Inspection & Profiling |
| Zephyr pipeline slow / stragglers / data skew / worker failures | lib/zephyr/OPS.md → Diagnostic Patterns, Observability |
| TPU bad node (No accelerator found, FAILED_PRECONDITION, Device or resource busy) | lib/iris/OPS.md → TPU Bad-Node Recovery |
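
The routing table above can be read as a simple lookup from symptom to document sections. The sketch below is illustrative only: the `route` function and the keyword lists are hypothetical and not part of the Marin tooling. Only the file paths and section names come from the table.

```python
# Hypothetical helper: maps an observed symptom string to the OPS.md
# sections named in the routing table. The keyword lists are assumptions
# chosen for illustration; adjust them to the symptoms you actually see.
ROUTES = [
    (
        ("no accelerator found", "failed_precondition", "device or resource busy"),
        "lib/iris/OPS.md",
        ["TPU Bad-Node Recovery"],
    ),
    (
        ("stuck job", "scheduling", "resource leak", "controller stalled"),
        "lib/iris/OPS.md",
        ["SQL Queries", "Process Inspection & Profiling", "Known Bugs", "Troubleshooting"],
    ),
    (
        ("task misbehaving", "container", "profil"),
        "lib/iris/OPS.md",
        ["Task Operations", "Process Inspection & Profiling"],
    ),
    (
        ("straggler", "data skew", "worker fail", "pipeline slow"),
        "lib/zephyr/OPS.md",
        ["Diagnostic Patterns", "Observability"],
    ),
]


def route(symptom):
    """Return (ops_file, sections) for the first matching row, or None.

    None means no infrastructure row matched, so the symptom should be
    treated as a code-level bug and tracked in a debug log instead.
    """
    text = symptom.lower()
    for keywords, ops_file, sections in ROUTES:
        if any(keyword in text for keyword in keywords):
            return ops_file, sections
    return None
```

A `None` result corresponds to the code-bug path: nothing in the table applies, so start a debug log rather than opening OPS.md.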
Operational guardrails (never modify the controller DB, prefer
iris process profile over SSH, never run a full iris cluster restart
without approval) live next to the relevant commands in OPS.md — read those
sections. After a TPU recovery or zephyr fix, return to the active babysit
loop (babysit-job or babysit-zephyr).
For code-level bugs that are not infrastructure faults, maintain a debug log
at docs/debug-log-<task-name>.md:
# Debugging log for <task>
<goal>
## Initial status
<initial status, as reported or observed>
## <Hypothesis N>
The suspected source of the bug, or a change needed to isolate it.
## Changes to make
Which files you are altering and how.
## Results
Test results and any new hypotheses. Repeat the Hypothesis/Results cycle as needed.
## Future work
- [ ] Cleanups observed along the way
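
Creating the log file from the template above could be scripted roughly as follows. This is a minimal sketch, not an existing Marin utility: the function name `start_debug_log` and its parameters are hypothetical, and only the `docs/debug-log-<task-name>.md` path and the section headings come from the text.

```python
from pathlib import Path

# Skeleton mirroring the debug log template; the {placeholders} are filled
# in when the file is first created.
TEMPLATE = """\
# Debugging log for {task}

{goal}

## Initial status

{status}

## Hypothesis 1

## Changes to make

## Results

## Future work

- [ ] Cleanups observed along the way
"""


def start_debug_log(task_name, goal, status, root="."):
    """Create docs/debug-log-<task_name>.md under root if it is missing.

    An existing log is never overwritten, so the function is safe to call
    again when resuming a debugging session. Returns the log path.
    """
    path = Path(root) / "docs" / f"debug-log-{task_name}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():
        path.write_text(
            TEMPLATE.format(task=task_name, goal=goal, status=status),
            encoding="utf-8",
        )
    return path
```

Later Hypothesis/Results cycles are appended by hand (or by the agent) as additional `## Hypothesis N` sections; the sketch only writes the initial skeleton.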