debug
Debug code bugs or Iris/Zephyr/TPU infrastructure faults with a structured debug log.
| Field | Value |
|---|---|
| name | debug |
| description | Debug code bugs or Iris/Zephyr/TPU infrastructure faults with a structured debug log. |
Systematic debugging for code-level bugs and Marin infrastructure faults.
For infrastructure symptoms, route to the right OPS.md section first; for
code bugs, keep a structured debug log.
Read lib/iris/AGENTS.md or lib/zephyr/AGENTS.md for context, then follow
the matching OPS.md section:
| Symptom | Read |
|---|---|
| Stuck job, scheduling failure, resource leak, controller stalled | lib/iris/OPS.md → SQL Queries, Process Inspection & Profiling, Known Bugs, Troubleshooting |
| Iris task misbehaving, container inspection, profiling a running task | lib/iris/OPS.md → Task Operations, Process Inspection & Profiling |
| Zephyr pipeline slow / stragglers / data skew / worker failures | lib/zephyr/OPS.md → Diagnostic Patterns, Observability |
| TPU bad node (No accelerator found, FAILED_PRECONDITION, Device or resource busy) | lib/iris/OPS.md → TPU Bad-Node Recovery |
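
The routing table above can be read as a simple keyword lookup. The following is a minimal sketch of that idea, not a tool that ships with Marin: `Route`, `ROUTES`, and `route_symptom` are hypothetical names, and the keyword lists are illustrative guesses taken from the table's wording. A symptom that matches nothing is treated as a code-level bug.

```python
from typing import NamedTuple, Optional


class Route(NamedTuple):
    """Where to read for a given symptom (hypothetical helper type)."""
    doc: str
    sections: tuple


# Keyword lists are illustrative assumptions drawn from the table's wording.
ROUTES = [
    (("stuck job", "scheduling failure", "resource leak", "controller stalled"),
     Route("lib/iris/OPS.md",
           ("SQL Queries", "Process Inspection & Profiling",
            "Known Bugs", "Troubleshooting"))),
    (("task misbehaving", "container inspection", "profiling a running task"),
     Route("lib/iris/OPS.md",
           ("Task Operations", "Process Inspection & Profiling"))),
    (("pipeline slow", "stragglers", "data skew", "worker failures"),
     Route("lib/zephyr/OPS.md", ("Diagnostic Patterns", "Observability"))),
    (("no accelerator found", "failed_precondition", "device or resource busy"),
     Route("lib/iris/OPS.md", ("TPU Bad-Node Recovery",))),
]


def route_symptom(symptom: str) -> Optional[Route]:
    """Return the first matching route, or None for a probable code bug."""
    text = symptom.lower()
    for keywords, route in ROUTES:
        if any(keyword in text for keyword in keywords):
            return route
    return None


if __name__ == "__main__":
    print(route_symptom("RuntimeError: No accelerator found"))
    print(route_symptom("off-by-one in tokenizer"))
```

A `None` result is the cue to skip the OPS.md sections and start a debug log instead.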
Operational guardrails live next to the relevant commands in OPS.md, so read
those sections before acting: never modify the controller DB, prefer
iris process profile over SSH, and never run a full iris cluster restart
without approval. After a TPU recovery or Zephyr fix, return to the active
babysit loop (babysit-job or babysit-zephyr).
For code-level bugs that are not infrastructure faults, maintain a debug log
at docs/debug-log-<task-name>.md:
# Debugging log for <task>
<goal>
## Initial status
<initial status, as reported or observed>
## <Hypothesis N>
The suspected source of the bug, or a change needed to isolate it.
## Changes to make
Which files you are altering and how.
## Results
Test results and any new hypotheses. Repeat the Hypothesis/Results cycle as needed.
## Future work
- [ ] Cleanups observed along the way
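
To start a log without retyping the template, a small scaffold script can write the skeleton to docs/debug-log-<task-name>.md. This is a minimal sketch under stated assumptions: `debug_log_path` and `init_debug_log` are hypothetical helpers, and turning the task name into a lowercase, hyphen-separated slug is an assumed convention that the skill itself does not specify.

```python
from pathlib import Path

# Skeleton mirroring the template above; placeholders are filled by the caller.
TEMPLATE = """\
# Debugging log for {task}

{goal}

## Initial status

{status}

## Hypothesis 1

## Changes to make

## Results

## Future work

- [ ] Cleanups observed along the way
"""


def debug_log_path(task_name: str, root: Path = Path(".")) -> Path:
    """Build docs/debug-log-<task-name>.md (slug format is an assumption)."""
    slug = "-".join(task_name.lower().split())
    return root / "docs" / f"debug-log-{slug}.md"


def init_debug_log(task_name: str, goal: str, status: str,
                   root: Path = Path(".")) -> Path:
    """Create the log if it is missing; never overwrite an existing log."""
    path = debug_log_path(task_name, root)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(
            TEMPLATE.format(task=task_name, goal=goal, status=status),
            encoding="utf-8",
        )
    return path


if __name__ == "__main__":
    created = init_debug_log(
        "flaky loader",
        goal="Find why the loader test fails intermittently.",
        status="Fails roughly one run in five, as reported.",
    )
    print(created)
```

Refusing to overwrite an existing file keeps earlier hypotheses and results intact when the script is run again for the same task.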