write-ops-log
Write a postmortem ops log to .agents/ops/. Use after an infrastructure-debugging session.
| name | write-ops-log |
| description | Write a postmortem ops log to .agents/ops/. Use after an infrastructure-debugging session. |
Summarize a debugging / incident-response conversation as a standalone
postmortem entry under .agents/ops/. The audience is a future sysops
engineer (probably another Claude session) with no memory of this conversation
who must reconstruct: what broke, what was tried, what the user steered, what
fixed it, and how OPS.md guidance could have shortened the investigation.
Write the entry to .agents/ops/YYYY-MM-DD-<slug>.md.
The .agents/ops/ directory is checked into git; .agents/ops/logs/ is
gitignored (matched by the global logs/ pattern), so do not nest logs
under a logs/ subdirectory.
<slug> is 3-6 words, kebab-case, naming the system + symptom. Examples:
iris-scheduler-freeze, coreweave-nodepool-stuck-delete,
zephyr-coordinator-oom.

Write a single markdown file with these sections in order. Don't invent extra sections; omit any section that genuinely has nothing to say.
---
date: YYYY-MM-DD
system: iris | zephyr | fray | coreweave | gcp | <component>
severity: outage | degraded | near-miss | diagnostic-only
resolution: fixed | mitigated | wontfix | investigating
pr: <url or "none">
issue: <url or "none">
---
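The naming rule and front matter above can be sketched in a few lines of Python. This is only an illustration, not part of the skill: the helper names `ops_log_path` and `front_matter` are hypothetical, and the slug regex is one possible reading of "3-6 words, kebab-case".

```python
import re
from datetime import date
from pathlib import PurePosixPath

# Values copied from the front matter template; everything else is illustrative.
OPS_DIR = PurePosixPath(".agents/ops")
SEVERITIES = {"outage", "degraded", "near-miss", "diagnostic-only"}
RESOLUTIONS = {"fixed", "mitigated", "wontfix", "investigating"}

# One lowercase word followed by two to five more, joined by hyphens (3-6 words).
SLUG_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+){2,5}")


def ops_log_path(day: date, slug: str) -> PurePosixPath:
    """Return .agents/ops/YYYY-MM-DD-<slug>.md, never nested under logs/."""
    if not SLUG_RE.fullmatch(slug):
        raise ValueError(f"slug must be 3-6 kebab-case words: {slug!r}")
    return OPS_DIR / f"{day.isoformat()}-{slug}.md"


def front_matter(day: date, system: str, severity: str, resolution: str,
                 pr: str = "none", issue: str = "none") -> str:
    """Render the front matter block, rejecting unknown enum values."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"unknown resolution: {resolution!r}")
    lines = [
        "---",
        f"date: {day.isoformat()}",
        f"system: {system}",
        f"severity: {severity}",
        f"resolution: {resolution}",
        f"pr: {pr}",
        f"issue: {issue}",
        "---",
    ]
    return "\n".join(lines) + "\n"
```

Validating the slug and the enum fields up front keeps entries uniform, which matters because the logs are found by directory listing and full-text search rather than through an index.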
One screen. A future engineer should know from this alone whether the log is relevant. Include: user-visible symptom, real root cause, fix applied, any lingering caveat.
Quote or paraphrase the user's opening message — what they observed, not the real bug. This is the pattern-match hook: preserve the exact error string / dashboard text / command the next engineer will grep for.
Narrative, not a log dump. What was checked, in order, and why. Include dead ends — they teach what not to spend time on. Cite file:line for code read, commit/log timestamps for live data.
Format as a numbered list of short paragraphs. Five to twelve steps; longer means you're narrating tool calls instead of decisions.
Explicit list of points where the user redirected the investigation. Each entry: what the model was about to do, what the user said instead, and why it was right. The most load-bearing section — it captures judgment the model lacked.
Tight technical description, one or two paragraphs. Cite the specific file:line of the bug. If there's a class of bug (e.g. "invariant violation between tables X and Y"), name it.
What changed, with file paths and a short diff-style excerpt if subtle. Include the migration / data repair step separately from the code change. If the fix is deferred, say so and link the tracking issue.
Concrete, actionable suggestions for lib/<component>/OPS.md edits. Each
suggestion: the section to edit, the sentence or command to add, and the signal
it would have unblocked.
Generic patterns only. OPS.md is for recurring diagnostic workflows across many incidents — not this one bug. Before writing a suggestion, ask: would this help an engineer debugging a completely different incident in the same subsystem? If not, drop it.
Good OPS.md addition: a recurring smell mapped to a class of cause — e.g. "same pending-reason text on many jobs → diagnostic cache has stopped updating" (applies to any future cache-update bug, not just this one). Also good: a new tool/workflow the investigation relied on, or a default that burned time.
Bad OPS.md addition: a troubleshooting row keyed on the exact literal string this incident produced — Known-Bugs material at best, noise after the fix ships. Also bad: SQL queries that only detect this specific invariant violation, or "watch out for bug X" entries that duplicate the git log.
If a lesson is genuinely incident-specific, put it in "Root cause"/"Fix" here; don't propose it for OPS.md. Avoid vague "improve documentation" — name the exact section and text to add. This section pays forward; take it seriously.
Links or paths to supporting evidence, in order of usefulness:
- … /tmp paths already gone.
- … lib/iris/src/iris/cluster/controller/transitions.py:2167).

Do not add an index file or update MEMORY.md. Logs are discoverable by
ls .agents/ops/ and full-text search. If the directory gets unwieldy (>30
entries), flag it to the user.
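The "more than 30 entries" check can be done mechanically. A minimal sketch, assuming every entry is a top-level .md file in the directory; the function name `ops_dir_is_unwieldy` is hypothetical:

```python
from pathlib import Path

MAX_ENTRIES = 30  # threshold taken from the guidance above


def ops_dir_is_unwieldy(ops_dir: str = ".agents/ops", limit: int = MAX_ENTRIES) -> bool:
    """Return True when the directory holds more than `limit` top-level .md entries."""
    entries = [p for p in Path(ops_dir).glob("*.md") if p.is_file()]
    return len(entries) > limit
```

The check only reports; what to do with an oversized directory is left to the user, as the text says.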