Run any Skill in Manus with one click

Get Started

write-ops-log

Stars1,129

Forks133

UpdatedMay 19, 2026 at 23:30

Write a postmortem ops log to .agents/ops/. Use after an infrastructure-debugging session.

Installation

Install with Codex or Claude Copy this prompt, paste it into Codex, Claude, or another assistant, and let it review the skill page and install it for you.

Run Skill in Manus

Source

marin-community

marin-community/marin

View GitHub Repository View Creator Repositories

Download

Run Skill in Manus

Related occupationsSOC

Based on SOC occupation classification

Network and Computer Systems AdministratorsComputer and Mathematical Occupations·SOC 15-1244

SKILL.md

readonly

name	write-ops-log
description	Write a postmortem ops log to .agents/ops/. Use after an infrastructure-debugging session.

Skill: Ops Log

Summarize a debugging / incident-response conversation as a standalone postmortem entry under .agents/ops/. The audience is a future sysops engineer (probably another Claude session) with no memory of this conversation who must reconstruct: what broke, what was tried, what the user steered, what fixed it, and how OPS.md guidance could have shortened the investigation.

Filename

.agents/ops/YYYY-MM-DD-<slug>.md

The .agents/ops/ directory is checked into git; .agents/ops/logs/ is gitignored (matched by the global logs/ pattern), so do not nest logs under a logs/ subdirectory.

Use the date the issue was investigated, not today's arbitrary date.
<slug> is 3-6 words, kebab-case, naming the system + symptom. Examples: iris-scheduler-freeze, coreweave-nodepool-stuck-delete, zephyr-coordinator-oom.
If a log already exists for this incident, extend it with a new section rather than creating a parallel file.

Structure

Write a single markdown file with these sections in order. Don't invent extra sections; omit any section that genuinely has nothing to say.

Frontmatter

---
date: YYYY-MM-DD
system: iris | zephyr | fray | coreweave | gcp | <component>
severity: outage | degraded | near-miss | diagnostic-only
resolution: fixed | mitigated | wontfix | investigating
pr: <url or "none">
issue: <url or "none">
---

TL;DR (3-6 bullets)

One screen. A future engineer should know from this alone whether the log is relevant. Include: user-visible symptom, real root cause, fix applied, any lingering caveat.

Original problem report

Quote or paraphrase the user's opening message — what they observed, not the real bug. This is the pattern-match hook: preserve the exact error string / dashboard text / command the next engineer will grep for.

Investigation path

Narrative, not a log dump. What was checked, in order, and why. Include dead ends — they teach what not to spend time on. Cite file:line for code read, commit/log timestamps for live data.

Format as a numbered list of short paragraphs. Five to twelve steps; longer means you're narrating tool calls instead of decisions.

User course corrections

Explicit list of points where the user redirected the investigation. Each entry: what the model was about to do, what the user said instead, and why it was right. The most load-bearing section — it captures judgment the model lacked.

Root cause

Tight technical description, one or two paragraphs. Cite the specific file:line of the bug. If there's a class of bug (e.g. "invariant violation between tables X and Y"), name it.

Fix

What changed, with file paths and a short diff-style excerpt if subtle. Include the migration / data repair step separately from the code change. If the fix is deferred, say so and link the tracking issue.

How OPS.md could have shortened this

Concrete, actionable suggestions for lib/<component>/OPS.md edits. Each suggestion: the section to edit, the sentence or command to add, and the signal it would have unblocked.

Generic patterns only. OPS.md is for recurring diagnostic workflows across many incidents — not this one bug. Before writing a suggestion, ask: would this help an engineer debugging a completely different incident in the same subsystem? If not, drop it.

Good OPS.md addition: a recurring smell mapped to a class of cause — e.g. "same pending-reason text on many jobs → diagnostic cache has stopped updating" (applies to any future cache-update bug, not just this one). Also good: a new tool/workflow the investigation relied on, or a default that burned time.

Bad OPS.md addition: a troubleshooting row keyed on the exact literal string this incident produced — Known-Bugs material at best, noise after the fix ships. Also bad: SQL queries that only detect this specific invariant violation, or "watch out for bug X" entries that duplicate the git log.

If a lesson is genuinely incident-specific, put it in "Root cause"/"Fix" here; don't propose it for OPS.md. Avoid vague "improve documentation" — name the exact section and text to add. This section pays forward; take it seriously.

Artifacts

Links or paths to supporting evidence, in order of usefulness:

Local files staged during the investigation — only if still present; don't link to /tmp paths already gone.
GCS/S3 paths to parquet / sqlite / log bundles.
Grafana / dashboard URLs.
Parent PR and follow-up issues.

Writing style

Past tense, third person; you're writing for someone who wasn't there.
Short, dense sentences. The reader is busy.
No praise, no "we" language, no apology.
No emojis.
Absolute file paths relative to the repo root (e.g. lib/iris/src/iris/cluster/controller/transitions.py:2167).
Command-line snippets the next engineer can paste verbatim.
If you cite a log message, preserve its exact text — that's the grep string.

What to skip

Minute-by-minute narration of tool calls.
Code already obvious from the fix diff.
Generic reminders ("remember to check logs first").
Hedged conclusions. If the root cause is known, state it; if not, say so under "Investigating".

After writing

Do not add an index file or update MEMORY.md. Logs are discoverable by ls .agents/ops/ and full-text search. If the directory gets unwieldy (>30 entries), flag it to the user.

Skill: Ops Log

Filename

.agents/ops/YYYY-MM-DD-<slug>.md

The .agents/ops/ directory is checked into git; .agents/ops/logs/ is gitignored (matched by the global logs/ pattern), so do not nest logs under a logs/ subdirectory.

Use the date the issue was investigated, not today's arbitrary date.
<slug> is 3-6 words, kebab-case, naming the system + symptom. Examples: iris-scheduler-freeze, coreweave-nodepool-stuck-delete, zephyr-coordinator-oom.
If a log already exists for this incident, extend it with a new section rather than creating a parallel file.

Structure

Write a single markdown file with these sections in order. Don't invent extra sections; omit any section that genuinely has nothing to say.

Frontmatter

---
date: YYYY-MM-DD
system: iris | zephyr | fray | coreweave | gcp | <component>
severity: outage | degraded | near-miss | diagnostic-only
resolution: fixed | mitigated | wontfix | investigating
pr: <url or "none">
issue: <url or "none">
---

TL;DR (3-6 bullets)

One screen. A future engineer should know from this alone whether the log is relevant. Include: user-visible symptom, real root cause, fix applied, any lingering caveat.

Original problem report

Investigation path

Narrative, not a log dump. What was checked, in order, and why. Include dead ends — they teach what not to spend time on. Cite file:line for code read, commit/log timestamps for live data.

Format as a numbered list of short paragraphs. Five to twelve steps; longer means you're narrating tool calls instead of decisions.

User course corrections

Root cause

Tight technical description, one or two paragraphs. Cite the specific file:line of the bug. If there's a class of bug (e.g. "invariant violation between tables X and Y"), name it.

Fix

How OPS.md could have shortened this

Concrete, actionable suggestions for lib/<component>/OPS.md edits. Each suggestion: the section to edit, the sentence or command to add, and the signal it would have unblocked.

Artifacts

Links or paths to supporting evidence, in order of usefulness:

Local files staged during the investigation — only if still present; don't link to /tmp paths already gone.
GCS/S3 paths to parquet / sqlite / log bundles.
Grafana / dashboard URLs.
Parent PR and follow-up issues.

Writing style

Past tense, third person; you're writing for someone who wasn't there.
Short, dense sentences. The reader is busy.
No praise, no "we" language, no apology.
No emojis.
Absolute file paths relative to the repo root (e.g. lib/iris/src/iris/cluster/controller/transitions.py:2167).
Command-line snippets the next engineer can paste verbatim.
If you cite a log message, preserve its exact text — that's the grep string.

What to skip

Minute-by-minute narration of tool calls.
Code already obvious from the fix diff.
Generic reminders ("remember to check logs first").
Hedged conclusions. If the root cause is known, state it; if not, say so under "Investigating".

After writing

Do not add an index file or update MEMORY.md. Logs are discoverable by ls .agents/ops/ and full-text search. If the directory gets unwieldy (>30 entries), flag it to the user.

write-ops-log

Skill: Ops Log

Filename

Structure

Frontmatter

TL;DR (3-6 bullets)

Original problem report

Investigation path

User course corrections

Root cause

Fix

How OPS.md could have shortened this

Artifacts

Writing style

What to skip

After writing

More from this repository

More from this repository

Skill: Ops Log

Filename

Structure

Frontmatter

TL;DR (3-6 bullets)

Original problem report

Investigation path

User course corrections

Root cause

Fix

How OPS.md could have shortened this

Artifacts

Writing style

What to skip

After writing