supervisor

Stars: 0

Forks: 0

Updated: June 13, 2026 at 04:05

The single authoritative supervision process for any delegate-and-verify work — at every scale: one epic, a release spanning many epics (portfolio), or conversational orchestration of background workers (`/goal` "don't get involved yourself, make sure it gets done", `/dogfood`). Stateless tick driven by `/loop`; cross-tick state lives in the task body. Junior MUST invoke this skill for supervision; never hand-roll it inline.

Source: nicsuzor/aops (GitHub repository)

id	supervisor-c41c35d6
name	supervisor
description	The single authoritative supervision process for any delegate-and-verify work — at every scale: one epic, a release spanning many epics (portfolio), or conversational orchestration of background workers (`/goal` "don't get involved yourself, make sure it gets done", `/dogfood`). Stateless tick driven by `/loop`; cross-tick state lives in the task body. Junior MUST invoke this skill for supervision; never hand-roll it inline.
triggers	["supervise","supervisor","shepherd","coordinate epic","get these done","make sure it gets done","don't get involved yourself","delegate this and verify","supervise these agents","dogfood","ready the release","drive the release","portfolio supervision"]
modifies_files	true
needs_task	false
mode	iterative
domain	["operations"]

Supervisor — The Supervision Process

This skill is the framework's supervision process, at every scale. The discipline below is identical across all three contexts; only the unit of state changes:

Epic — own one PKB epic across /loop ticks; cross-tick state lives in the epic body.
Portfolio / release — drive a release-level goal spanning many epics: advance ONE epic per tick, surface escalations, file missing epics. State lives in the release task body (## Constituent Epics, ## Escalations). See Portfolio / Release Supervision.
Conversational orchestration — run as the main conversation agent delegating to background workers (/goal "don't get involved yourself, make sure it gets done", /dogfood); still open a task node for the ledger (chat is not durable state).

There are no deterministic halt brakes or merge-gate mechanics in this process: you are a trusted agent. Halt, escalate, and promote by judgment, on the proof discipline below — not by row counters. Merge gating is owned by infrastructure (branch protection + Nic's per-SHA approval); you never simulate or manage it.

When to Invoke (mandatory)

Junior (and any orchestrator) MUST run supervision through this skill — never hand-rolled in the main conversation — whenever delegating work and verifying it gets done. This includes the conversational orchestrator case: a /goal that says "delegate this, don't get involved yourself, make sure it actually gets done", a /dogfood run, or any delegate-and-verify loop over background Agent() workers. "I'm just the conversational orchestrator" is not an exemption — that is exactly when this skill is required. Hand-rolling supervision inline is how confident-but-unproofed verdicts and single-part PRs reach the user.

Holding Delegated Work to Proof

This is the supervisor's core discipline, and it applies in every mode, on every tick — a single epic, a release of many epics, or running as the main conversation agent who delegates everything and verifies it. It is not an optional extra and not a separate read. Your value is not trusting any single agent: proof claims, isolate confounds, and never relay a conclusion you have not made falsifiable — applied to the workers' claims and to your own. It is dispatch-surface independent — identical whether workers are polecat containers or Agent-tool background subagents; polecat mechanics elsewhere in this skill are one surface's implementation of the generic step.

Posture: supervise, don't do. "Don't get involved yourself" is literal — delegate the work (investigate, code, QA) to workers; your context is a scarce, principal-facing resource. Hold the conclusion, not the file dumps: read a deliverable through its output file (grep/Read the parts you need) and hand anything bulky to the cheap summarizer agent (§7) — never absorb a 30k-token narrative to lift a one-line verdict. This is the single biggest context leak.

§1 — Orient before the FIRST dispatch (mandatory, no exceptions). Dispatching before you have the map costs full QA cycles and gets briefs killed and re-issued. Four steps:

PKB semantic search — prior diagnoses, recorded harnesses, related tasks, known confounds.
Prior-art sweep — open and merged PRs/branches (gh pr list --state all --search "<terms>" + the branch list); a merged fix or in-flight branch rewrites the brief.
Identify the SANCTIONED QA harness and require it in the brief — refuse ad-hoc substitutes. It is recorded in the epic ledger's ORIENT output, populated from (i) the PKB search, (ii) the artifact's task/spec body, (iii) memory notes. If that chain yields no designated harness, HALT and [ATTN] Nic to designate one — a worker never invents the gate it is judged by.
Cross-vendor surface → FETCH THE VENDOR'S AUTHORITATIVE DOCS first. Reverse-engineering binaries/configs/strace is a fallback only. (The motivating bug was a deviation from a public docs page nobody had fetched for days.)

§2 — Proof, not claims; state the acceptance gate up front. A change is not a fix until a runtime observation confirms the user-facing behaviour; code edits, green unit tests, "the router emits X" are floor, not ceiling. Before dispatching, state the falsifiable acceptance gate in the brief — the observable that must be true in a real run, and what would prove it false. "Tests pass" is never the gate for a behaviour bug. A worker that reports success without exercising the gate has not finished.

§2a — Capstone = done. Final acceptance is ONE check with all clauses true at once: the exact previously-failing user-facing runtime check (the supervisor supplies it from the epic ledger — the capstone agent does not reconstruct "what failing meant"); on a fresh instance/session; by an agent who is NOT the implementer; with the sanctioned harness; hallucination ruled out by byte-matching observed output to source (content that could only come from the system under test, not echoed from the prompt). On the single-PR-epic surface this is the one cumulative marsha pass at promotion (brief composition: marsha — Verify; marsha's own [[../verify/SKILL.md]] enforces the fresh-instance / non-implementer / source-trace posture). Only this justifies promoting the PR to ready; a miss means it is not done — record it in the ledger and send it back, never promote.

§3 — The confound rule (the headline). A verdict that blames anything you don't own — "platform," "upstream," "external blocker," "agy/library/OS does X" — is not believable and must not be relayed until a differential control has ruled out our own code/config:

The control is a clean-room isolation: reproduce with our contribution removed (vanilla, plugin-free, stock) plus a positive control in the same harness to prove it can detect success. Vanilla works ⇒ the fault is ours.
Derive the control from the AUTHORITATIVE SPEC, never by copying the suspect. A control that imitates the suspect's config replicates its bug and "confirms" it. (Motivating incident: an adjudicator's sentinel hook copied our plugin's broken registration shape and falsely confirmed a platform bug; only the vanilla, docs-derived repro overturned it.)
Convergent confidence is not the control — N agents sharing one confound is worth nothing. (Two workers + one QA agent agreed "platform no-op" with strace + sentinel proof, all wrong: every one tested with our plugin installed; the bug was our hooks.json shape. One vanilla control flipped it instantly.)
This applies to your own relayed conclusions most of all. A worker verdict that blames what we don't own and arrives CONFOUND CHECK: NOT RUN is not relayed — note it in the ledger and commission the control first.
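
The decision logic of the confound rule can be sketched as a small function. This is an illustration under stated assumptions, not part of the skill: `ControlResult`, its field names, and `attribute_fault` are hypothetical, and each field stands for an observation made in the sanctioned harness.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ControlResult:
    """Observations from one differential control (all names are hypothetical)."""
    positive_control_detects_success: bool  # harness proven able to see a pass
    vanilla_passes: Optional[bool]          # run with our contribution removed; None = not run
    derived_from_spec: bool                 # built from the authoritative docs, not copied from the suspect


def attribute_fault(result: ControlResult) -> str:
    """Classify an 'external blocker' verdict using the clean-room control."""
    if result.vanilla_passes is None:
        return "NOT RUN: do not relay, commission the control"
    if not result.derived_from_spec:
        return "INVALID: control copied the suspect, rebuild it from the spec"
    if not result.positive_control_detects_success:
        return "INVALID: harness cannot detect success, fix the harness"
    if result.vanilla_passes:
        return "OURS: vanilla works, so the fault is in our code/config"
    return "EXTERNAL: vanilla also fails, our contribution is ruled out"
```

The ordering is deliberate: a control that was never run, or that was built by imitating the suspect, is rejected before its result is even read.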

§4 — Don't trust convergence. Independently QA each worker's strongest claim, not its summary — a "green" journal of the wrong evidence (PreToolUse allow records) does not prove the thing in question (PreInvocation injection). When two agents contradict, do not pick one; adjudicate with methodology-independent evidence (sentinel files + strace -f follow-forks), naming the exact trap (strace without -f misses forked children). Treat a tidy, confident narrative as a prompt to find the missing control, not as closure.

§5 — Catch mis-briefed workers early; never pre-seed skip permission. A worker re-deriving known intelligence (a recorded harness, a merged fix) is wasted context — stop it and relaunch with a surgical brief. You usually cannot steer a running background worker, so front-load the brief (gate + known intelligence + "escalate, don't fake-pass" + handback contract); the brief is your only steering wheel. State every assumption as a testable hypothesis ("check whether X; if yes, run the check") — never as licence to skip ("you likely can't test X, so escalate"). A stale "no-auth" assumption once made a worker punt the one check that mattered.

§6 — Report up honestly. Every claim to the principal carries a source and confidence level — "high confidence" is a promise you proofed it (spend it only after §3–§4). Correct your own prior conclusions out loud and supersede the record (PKB note/memory) so no agent inherits a stale verdict. Escalate genuine frontiers; never fake-pass — hand over the exact one-line check instead of manufacturing a green.

§7 — Context-economy contract (mandatory, every mode). The orchestrator's context is the bottleneck (the motivating interactive session burned ~170k tokens):

Capped structured handback, every brief — the worker ends with this and you read that, not the narrative:

VERDICT: <PASS | FAIL | BLOCKED | NEEDS-PRINCIPAL>
CLAIM: <one sentence — the conclusion>
GATE: <the acceptance gate, and the observed result against it>
EVIDENCE: <pointers — session id, log path, line refs — NOT pasted dumps>
CONFIDENCE: <high|med|low> + <what single control/test would falsify this>
CONFOUND CHECK: <did a clean-room/differential control run? result? — or "NOT RUN">

CONFOUND CHECK is mandatory whenever the verdict blames what we don't own; NOT RUN ⇒ do not relay, commission the control (§3).

Cheap summarizer agent for all bulk reading (large bodies, transcripts, log dumps): a haiku/sonnet general-purpose Agent-tool dispatch (or its jr/polecat equivalent), briefed "read <pointer>, return the ≤N facts relevant to <question>." It reads the bulk so your context never does.
The ledger lives in the epic body — always open an epic node, even when supervising from an interactive conversation with no pre-existing epic (chat context is not durable state). Mechanics: mcp_pkb_create_task type=epic seeded with the ## Work Items / ## Pattern Memory / ## Ledger skeleton (see Pattern Memory Format); capture ORIENT findings into ## Ledger and the failing observable into ## Work Items on tick 1 — that is where the capstone (§2a) later reads the "exact previously-failing check."
Capped chat updates — one short paragraph (verdict + next action) between phases, never a transcript replay. Preload predictable tool schemas once (task get/update, memory create, stop/monitor) to avoid ToolSearch / parameter-retry churn.
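
The capped handback and its CONFOUND CHECK rule lend themselves to a mechanical first pass before any judgment is applied. The sketch below is one possible reading, assuming each field sits on its own `KEY: value` line; `parse_handback` and `may_relay` are hypothetical names, and `blames_external` is the supervisor's own judgment passed in by the caller.

```python
REQUIRED_FIELDS = ("VERDICT", "CLAIM", "GATE", "EVIDENCE", "CONFIDENCE", "CONFOUND CHECK")
VALID_VERDICTS = {"PASS", "FAIL", "BLOCKED", "NEEDS-PRINCIPAL"}


def parse_handback(text: str) -> dict:
    """Parse a capped handback block into a dict; raise ValueError if malformed."""
    fields = {}
    for line in text.strip().splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip() in REQUIRED_FIELDS:
            fields[key.strip()] = value.strip()
    missing = [name for name in REQUIRED_FIELDS if name not in fields]
    if missing:
        raise ValueError(f"handback missing fields: {missing}")
    if fields["VERDICT"] not in VALID_VERDICTS:
        raise ValueError(f"unknown verdict: {fields['VERDICT']!r}")
    return fields


def may_relay(fields: dict, blames_external: bool) -> bool:
    """A verdict blaming what we don't own is relayable only if a control ran."""
    if not blames_external:
        return True
    return not fields["CONFOUND CHECK"].upper().startswith("NOT RUN")
```

A handback that fails to parse is treated here as an unfinished worker report rather than something to repair by reading the narrative.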

One-line test before you report a conclusion: Have I proofed this against a falsifiable gate, and — if it blames anything I don't own — has a clean-room control ruled out our code as the confound? If not, I am relaying a claim, not a finding.

Conversational Orchestration Mode

When you reach this skill from a /goal / /dogfood "delegate this, don't get involved yourself, make sure it gets done" — there is no epic task or polecat. The discipline above is unchanged; only these mechanics differ:

Workers are background Agent(subagent_type=…, run_in_background=True) calls (general-purpose for build/investigate, marsha for runtime QA); results arrive as <task-notification>. The §7 context-economy contract still binds — front-load every brief (§5) because you cannot steer a running worker, and require the capped handback.
Still open an epic node for the ledger (§7) — needs_task being off means you are not required to be handed one, not that state may live in chat. Chat context is not durable state.
When the work produces code/PRs, the one-epic-one-PR pattern applies unchanged: one shared-branch PR, promoted only when all delegated work has landed and the capstone passes.

Portfolio / Release Supervision

When the goal spans many epics ("ready the release", "drive <project>"), you are the top-level coordinator. The proof discipline above is unchanged; you simply operate one level up, and you do not micromanage leaves — each epic runs its own supervision.

One epic per tick. Each tick, advance the single most-blocking epic by one decision (its own supervision step). Never run two workers on the same task-id — concurrent worktree creation races the worktree-lock and container-name. Grow concurrency with more ticks, not more dispatches per tick.
State lives in the release task body under ## Constituent Epics (each epic + its status) and ## Escalations (pending approvals, blocked epics, merge-ready PRs). Commit and push each tick. Surface only actionable items there — never worker threads or tool-call play-by-play.
File missing epics. If a release requirement has no epic, create one parented under the release task and add it to ## Constituent Epics.
Premise gate still binds at every dispatch (here and in the epic step it drives): read the leaf body and judge whether it carries a genuine premise judgment; if not, bounce it to the promoter and spend no compute. This is an agent judgment by reading, never a field check ([[../remember/references/premise-gate.md]]).
Terminal: done-pending-Nic. When every epic is at its review surface and the only remaining work is decisions/approvals/merges that are structurally Nic's, you are autonomously complete — N items surfaced for Nic. Set the release task to review, write the N items to ## Escalations, and stop. This is not a failure; it is the correct end of an autonomous loop.

Reporting Posture

Operate in decide-and-report mode. Exit in one of three states:

Silent: No user-facing output. Commit/push checkpoint advances the tick.
[ATTN] block: Emit a single YAML block (see User Attention Notification) for decisions requiring explicit user authorization.
Halt summary: Terminal state reached. Emit a one-line summary in plain English.

Escalation Criteria

Escalate only if:

Action is irreversible or modifies external systems without authorization.
Involves methodology, citation, or claims published under the user's name.
No defensible default exists.
Your judgment says stop — the same failure keeps recurring, workers are stalled, or you cannot proof a verdict. There is no row counter; if it smells stuck, halt and escalate rather than burning more compute.

Per-Tick Checklist

Execute the loop exactly once per tick:

ORIENT: Retrieve the task body (mcp_pkb_get_task(<id>)) and read the ledger. Before the first dispatch on a problem, run the orient-before-dispatch checklist (Holding Delegated Work to Proof §1): PKB search, prior-art PR/branch sweep, sanctioned-harness identification, and vendor-docs fetch for cross-vendor surfaces. Don't dispatch blind; if you can't complete orient, note it and escalate.
JUDGE: Read the ledger. If the same failure keeps recurring, workers are stalled, or the premise no longer holds, halt and escalate — your call, not a counter.
DECIDE: Invoke subagent(s) to obtain a structured verdict. Chaining is permitted only for compose-then-dispatch (compose-agent followed by fresh dispatch-agent).
ACT: Sanity-check the verdict (one coherent action, consistent with the body); if it doesn't hold up, don't act on it — note why and exit. Otherwise execute the action (Bash, file task via mcp_pkb_create_task, promote, or exit).
CHECKPOINT: Append a ledger row to the task body, commit, and push.
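
The five steps above can be sketched as one stateless function. Every callable here is a hypothetical stand-in (for example, `read_body` for the `mcp_pkb_get_task` call and `checkpoint` for the append, commit, and push), and writing a ledger row on halt is an assumption of this sketch rather than something the checklist spells out.

```python
from typing import Callable, Optional


def run_tick(
    read_body: Callable[[], str],
    judge: Callable[[str], Optional[str]],
    decide: Callable[[str], str],
    sanity_check: Callable[[str, str], bool],
    act: Callable[[str], str],
    checkpoint: Callable[[str], None],
) -> str:
    """Run one tick; all state comes from the body and goes back via checkpoint."""
    body = read_body()                     # ORIENT: task body + ledger
    halt_reason = judge(body)              # JUDGE: None means keep going
    if halt_reason is not None:
        checkpoint(f"halt: {halt_reason}")
        return "halt"
    verdict = decide(body)                 # DECIDE: structured subagent verdict
    if not sanity_check(verdict, body):    # ACT begins with the sanity check
        checkpoint(f"verdict rejected: {verdict}")
        return "exit"
    outcome = act(verdict)
    checkpoint(f"{verdict} -> {outcome}")  # CHECKPOINT: one ledger row
    return "advanced"
```

Nothing is carried between calls: a fresh process can run the next tick as long as the task body is readable.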

Prohibited Main Agent Actions

Do not:

Proactively scan files, diffs, transcripts, or run test probes (rely on subagent verdicts; only cheap local environment status checks like gh auth status are permitted).
Author code edits or fixes.
Persist state outside the epic body.
Prompt the user if a defensible default exists.
Modify or expand the verification brief.
Evaluate visual or QA artifacts directly (delegate to marsha).

Subagent Contracts

Egress Constraints

Anonymize PKB-derived information (titles, IDs, project names) before writing to public PRs, commits, issues, or verification briefs. Use priority class, due-date bucket, status, count, or masked identifiers (task-XXXX).

pauli — Preflight & React

Role: Determine next action, handle worker exits, and react to verification failures.
Verdict Shape: A single paragraph specifying exactly one action:
- dispatch <worker> on <task-id> in <project>
- brief composed on <task-id>
- file fix-task <title> under <parent>
- halt: <reason>
Verification Brief Assembly:
- Read original brief/spec and ## Fitness Rubric.
- Output one paragraph containing: artifact location/link + goal + spec link.
- Do not include history, reviewer notes, dimensions, or manual check steps.
- Halt if ## Fitness Rubric is missing for user-facing artifacts.

marsha — Verify (Review Surface)

Role: Review deliverables for work items.
Review Surface Shift:
- Cohesive Single-PR-Epic (Default): The supervisor review surface shifts from PR-per-task to single-PR-at-end. The supervisor does NOT run marsha verification on separate PRs or individual work items as each intermediate worker finishes. Instead, intermediate tasks are verified using local outcome-based verification (checking remote commit existence and inspecting the diff on the shared branch). Once verified, they are transitioned to merge_ready to unblock dependent tasks. The supervisor invokes marsha to review exactly ONE cumulative PR when the final stage promotes it. That single cumulative pass IS the capstone verification (Holding Delegated Work to Proof §2a). The marsha brief the supervisor composes MUST carry the three capstone specifics from §2a — the sanctioned QA harness (identified at ORIENT, never invented; if none is recorded, HALT and [ATTN]), the exact previously-failing user-facing check (supplied by the supervisor from the epic ledger, not reconstructed by marsha), and the byte-match hallucination rule-out — while marsha's own [[../verify/SKILL.md]] enforces the fresh-instance / non-implementer / source-trace posture. A capstone the prompt could have produced without the system running is not a pass; record any miss in the ledger and send it back.
- Standalone / Independent Tasks: Keep the legacy branch-per-task behavior and verify each task's PR individually.
Verdict: PASS, FAIL, or REVISE.

| Verdict | Action                                                     |
| :------ | :--------------------------------------------------------- |
| PASS    | Mark item `merge_ready`; checkpoint                        |
| FAIL    | Call pauli (`role=react`, context=`marsha-fail: <reason>`) |
| REVISE  | File verification subtask; checkpoint                      |

Compose-then-Dispatch Separation

The agent authoring a brief must not dispatch against it (agent-identity separation).
If the brief was modified during the tick, Pauli must output brief composed on <task-id>. The main agent must persist the brief, then invoke a fresh subagent context (dispatch-agent) to validate and emit the dispatch verdict.
If the brief is stable PKB content, Pauli emits dispatch directly.

Verdict Sanity Check

Before acting on a subagent's verdict, satisfy yourself it holds up: one coherent action, internally consistent, grounded in the actual task-body state. If it doesn't, don't act on it — note why in the ledger and exit. This is a read-and-judge, not a shape-validator.

Cohesive Single-PR-Epic Pattern (Default)

The framework defaults to the cohesive single-PR-epic pattern for all epics whose subtasks are meant to land together. The only exception is when subtasks must genuinely ship and be deployed independently, in which case they keep the legacy branch-per-task behavior. This default pattern coordinates development on ONE shared branch backing ONE draft PR.

Live Mechanism (PR #1749 / aops-613690b5)

This pattern is executable today via the live shared-branch mechanism:

is_shared_branch Detection: The manager automatically detects shared branches by looking for custom branch overrides. If the branch name does not match the default polecat/task-<task-id> pattern (e.g. polecat/epic-<epic-id>), it is treated as a shared branch.
Cooperative Sync: Workers on a shared branch perform cooperative pulls and rebases (git fetch followed by git rebase origin/<branch-name>) to integrate other workers' in-flight commits rather than resetting to main.
Force-with-lease: Push operations use --force-with-lease to push changes to the shared branch, accepting a low-concurrency contract.
No Deletion: Shared branches bypass staleness and nuke-delete cleanup sequences, preserving in-flight contributions.
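
A minimal sketch of the detection and sync steps described above. It is not the manager's actual code: the function names are hypothetical, and comparing against the one task's default branch name (rather than any `polecat/task-*` name) is an assumption about how "does not match the default pattern" is evaluated.

```python
def is_shared_branch(branch: str, task_id: str) -> bool:
    """Treat any branch other than the task's default name as shared."""
    default_branch = f"polecat/task-{task_id}"
    return branch != default_branch


def cooperative_sync_commands(branch: str) -> list[list[str]]:
    """Git commands a worker would run to integrate and publish on a shared branch."""
    return [
        ["git", "fetch"],
        ["git", "rebase", f"origin/{branch}"],
        ["git", "push", "--force-with-lease", "origin", branch],
    ]
```

Returning argument lists instead of running the commands keeps the sketch side-effect free; a caller could pass each list to `subprocess.run(..., check=True)`.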

Dispatch and Concurrency Rules

Shared Branch Default: Every worker dispatched for a subtask of a cohesive epic must use the exact same branch name via the override flag: --branch polecat/epic-<epic-id>.
Decomposition Structure:
- The epic must be decomposed into parallel-able units (which have no inter-dependency and can execute concurrently on the shared branch) and sequential-dependency units (which carry explicit depends_on: [<id>] edges).
- The supervisor dispatches parallel units concurrently, while sequential units are blocked until their predecessor tasks are marked complete.
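
The split between parallel-able and sequential units amounts to a readiness check over the `depends_on` edges. A minimal sketch, assuming the edges are available as a plain mapping from unit id to predecessor ids (`ready_units` is a hypothetical helper, and it does not track units that are already in flight):

```python
def ready_units(depends_on: dict[str, list[str]], complete: set[str]) -> list[str]:
    """Return units not yet complete whose predecessors are all complete."""
    return sorted(
        unit
        for unit, predecessors in depends_on.items()
        if unit not in complete and all(p in complete for p in predecessors)
    )
```

Units with an empty predecessor list are ready immediately, which is what makes them dispatchable concurrently on the shared branch.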

One Epic, One PR — promote at the capstone

One epic ships as ONE pull request. No per-task / single-part PRs reach the merge pipeline or the user — they spend review attention and CI for a fraction of an epic. Your single PR-state action is the promotion at the end: flip it ready once all work items are done and the capstone (the one cumulative marsha pass) is green. A PR with outstanding work items is the normal mid-epic state — do not promote early to "show progress".

You do not manage merge mechanics. The single PR materialises automatically when the first worker on the shared branch finishes; workers never create PRs, and you never hand-create one. Draft-vs-ready enforcement and the merge gate are infrastructure's job — branch protection holds the line (no merge without Nic's per-SHA APPROVED), polecat handles draft creation. Don't re-draft PRs, don't simulate approvals, don't add merge-gate banners to PR bodies. If a worker's push conflicts on the shared branch it rebases and retries; if that can't resolve, set the task blocked and escalate.

Canonical Dispatch Commands

The discipline is dispatch-surface independent (see Holding Delegated Work to Proof). The commands below are the polecat surface's implementation; on the Agent-tool surface the same generic step (dispatch a worker against a task on the shared epic branch with a capped-handback brief) is a background subagent launch instead.

# Local dispatch (polecat surface)
uv run --project ~/src/academicOps polecat run -t <task-id> -p <project> --branch polecat/epic-<epic-id> --model <name>

--model <name> is the canonical flag. Use --model claude (config-default), --model opus (Claude family alias), or --model gemini-3.1-pro-preview for Gemini. --opus is not a valid flag and will error — use --model opus.
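
For callers that assemble the command programmatically, the flags above can be built as an argument list. This is a sketch only: `build_dispatch_argv` is a hypothetical helper, and the guard against flag-shaped model values is an added convenience, not documented polecat behaviour.

```python
import os


def build_dispatch_argv(task_id: str, project: str, epic_id: str, model: str) -> list[str]:
    """Build the polecat dispatch command for a subtask on the shared epic branch."""
    if model.startswith("-"):
        # Catches mistakes such as passing "--opus" where a model name belongs.
        raise ValueError("pass a model name such as 'opus', not a flag")
    return [
        "uv", "run", "--project", os.path.expanduser("~/src/academicOps"),
        "polecat", "run",
        "-t", task_id,
        "-p", project,
        "--branch", f"polecat/epic-{epic_id}",
        "--model", model,
    ]
```

The resulting list can be handed to `subprocess.run(argv, check=True)` without shell quoting concerns.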

Pattern Memory Format

The ledger is your cross-tick memory, not a trigger. Append one row per tick (cap ~16, drop oldest): the decision and its outcome, in plain terms, so the next tick — or a fresh you after a /loop gap — can read what happened and judge what to do next. There is no fixed class vocabulary and no row-counting brake; if a pattern of failure is building, you notice it on read (Per-Tick step 2) and halt by judgment.

## Pattern Memory

| Tick (ISO)           | Decision                    | Outcome / Notes                          |
| :------------------- | :-------------------------- | :--------------------------------------- |
| 2026-05-08T02:14:00Z | dispatch task-abc to claude | preflight clean                          |
| 2026-05-08T02:43:11Z | marsha FAIL on task-abc     | tests red on docker — re-dispatching fix |
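
Appending a row while dropping the oldest can be sketched as follows. The helper names are hypothetical, and the hard cap of 16 is an assumption: the text only says "cap ~16".

```python
LEDGER_CAP = 16  # assumption: the source gives an approximate cap, not a hard limit

Row = tuple[str, str, str]  # (tick ISO timestamp, decision, outcome / notes)


def append_ledger_row(rows: list[Row], tick: str, decision: str, outcome: str) -> list[Row]:
    """Return a new ledger with the row appended and the oldest rows dropped."""
    updated = rows + [(tick, decision, outcome)]
    return updated[-LEDGER_CAP:]


def render_ledger(rows: list[Row]) -> str:
    """Render the rows as the Pattern Memory markdown table."""
    lines = [
        "| Tick (ISO) | Decision | Outcome / Notes |",
        "| :--------- | :------- | :-------------- |",
    ]
    lines += [f"| {tick} | {decision} | {outcome} |" for tick, decision, outcome in rows]
    return "\n".join(lines)
```

The rendered table is what gets written back into the task body at the CHECKPOINT step.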

Design Principles

Task File Is the Only State: Persist all status inside the epic body (## Pattern Memory, ## Work Items, ## Supervisor Log).
Halt-on-Substitute: Halt if worker type, deliverable type, target repository, or scope limits change. Do not auto-substitute.
Drive-by Fix Policy: Bundle unrelated trivial fixes only if blocking, obvious, and describable in one sentence. Otherwise, file a separate task.
Keep the Pipe Flowing: Delegate decomposition and planning to workers. Restrict supervisor concurrency dynamically based on rate limits.
Intent Authority: When filing or decomposing tasks, leave priority at the uncurated default band — never originate a non-default band from importance or urgency. Only Nic sets intent, by express per-request instruction. Canonical rule: [[framework-conventions-summary#intent-authority]].
PR Body Hygiene: PR bodies describe the change for the reviewer — never carry do-not-merge / merge-gate / "awaiting Nic" banners. Branch protection is the enforced gate. Canonical rule: [[framework-conventions-summary#pr-body-conventions]].
Engineering Integrity: Failing tests/validations must be resolved, not bypassed.
Confound Rule: Never relay an "external blocker / not our code" verdict until a clean-room differential control has ruled out our own code as the confound. Full rule: Holding Delegated Work to Proof §3.
Critic Gate: High-risk tasks must undergo preflight validation by Pauli before dispatch.
Academic Integrity: Decisions affecting anything published under the user's name require human confirmation.

Phases

| Phase      | Subagent | Execution                                                                    |
| :--------- | :------- | :--------------------------------------------------------------------------- |
| Orient     | (none)   | Read task body and ledger; judge whether to advance or halt; select phase.   |
| Decompose  | pauli    | Propose subtasks; run RBG axiomcheck. Set `superseded_by` on retired tasks.  |
| Review     | (none)   | Halt; await human promotion to `queued`.                                     |
| Dispatch   | pauli    | Preflight brief, execute dispatch or chain compose/dispatch.                 |
| Pre-verify | pauli    | Assemble minimal brief (artifact, goal, spec link).                          |
| Verify     | marsha   | Run validation. Return PASS, FAIL, or REVISE.                                |
| React      | pauli    | Recommend fix-task or halt after FAIL.                                       |
| Halt       | (none)   | Terminal state reached; emit summary and exit.                               |

Deliverable Subworkflows

| Deliverable Type | Subworkflow                       | Status |
| :--------------- | :-------------------------------- | :----- |
| Code change      | [[instructions/code-deliverable]] | active |

Status Display Surfaces

Read-only projections. Do not write local JSON tracking files.

gh pr list / gh pr checks
gh run list
$AOPS_SESSIONS/tasks.json
$AOPS_SESSIONS/state/pr-state.json
GitHub Issues with halt label
docker events

User Attention Notification

Emit a single fenced YAML block for user attention when escalation conditions are met.

[ATTN]
---
id: <epic-id>:<tick-sequence>
urgency: now | today | whenever
action_required: decision | review | info
one_line: <=80-char summary
context_ref: <task-id | PR-url | issue-url>
dismiss_if: <one-line condition under which this no longer needs attention>
suggested_response: <the supervisor's default if user says "you decide">
---

All text fields (one_line, suggested_response) must use plain English. Push one_line to slack/discord/email only if urgency is now or today and action_required is decision.
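
The block layout and the push rule can be sketched in a few lines. `render_attn` and `should_push` are hypothetical helpers; the sketch assumes every value is a plain single-line string and does not escape YAML special characters.

```python
ATTN_KEYS = (
    "id", "urgency", "action_required", "one_line",
    "context_ref", "dismiss_if", "suggested_response",
)


def should_push(urgency: str, action_required: str) -> bool:
    """Push one_line to external channels only for urgent decisions."""
    return urgency in {"now", "today"} and action_required == "decision"


def render_attn(fields: dict[str, str]) -> str:
    """Render the [ATTN] block, enforcing the 80-character one_line limit."""
    missing = [key for key in ATTN_KEYS if key not in fields]
    if missing:
        raise ValueError(f"missing ATTN fields: {missing}")
    if len(fields["one_line"]) > 80:
        raise ValueError("one_line must be at most 80 characters")
    lines = ["[ATTN]", "---"]
    lines += [f"{key}: {fields[key]}" for key in ATTN_KEYS]
    lines.append("---")
    return "\n".join(lines)
```

Rejecting an over-long `one_line` at render time keeps the summary usable as a notification title.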

Multi-Tick Supervision (notify-watch)

In interactive sessions, arm the Docker events Monitor on the first polecat dispatch so that each container exit triggers a tick.

Local Monitor Command

Monitor(
  description: "polecat exits",
  persistent: true,
  command: "while true; do docker events --filter event=die --filter 'name=polecat-' --format '{{.Time}} {{.Actor.Attributes.name}} exit={{.Actor.Attributes.exitCode}}'; sleep 2; done"
)

Filter out crew containers by checking container env for POLECAT_CREW_NAME. Stop the monitor using TaskStop once in-flight tasks resolve.

Mechanism Selection

| Situation             | Mechanism                                  |
| :-------------------- | :----------------------------------------- |
| Single worker outcome | Bash `run_in_background` with polling loop |
| Async PR states       | `Monitor` on `gh pr checks`                |
| Idle / fallback       | `ScheduleWakeup` (>= 1800s)                |
| Interactive session   | `Monitor` on `docker events`               |

Lifecycle Trigger Hooks

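| Hook          | Trigger       | What it does                             |
| :------------ | :------------ | :--------------------------------------- |
| `queue-drain` | cron / manual | Starts supervisor session.               |
| `stale-check` | cron / manual | Resets timed-out tasks.                  |
| `pr-merge`    | James         | James closes completed tasks post-merge. |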

Task Assignment & Handover

Assign tasks to the appropriate worker; never assign a task to a human unless it is a binary decision.
Always leave a follow-up task when releasing mid-flow (mcp_pkb_append / mcp_pkb_release_task).

Known Limitations

Gemini 429 QUOTA_EXHAUSTED is treated as a transient rate-limit (typically a 45-minute timeout), not a hard quota lockout.
Pauli diagnosis tree for Gemini exits with code 1:
1. Task ran > 45 minutes -> Decompose.
2. Stuck in loop -> File fix-task, re-dispatch.
3. Real 429 rate limit -> Wait and re-dispatch.
4. Other -> Re-dispatch immediately.
Do not substitute Gemini with Claude automatically (Halt-on-substitute).
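
The diagnosis tree reads as an ordered series of checks. A minimal sketch, assuming the three inputs have already been established by Pauli (the function and parameter names are hypothetical):

```python
def diagnose_gemini_exit(runtime_minutes: float, stuck_in_loop: bool, real_429: bool) -> str:
    """Walk the exit-code-1 diagnosis tree in its documented order."""
    if runtime_minutes > 45:
        return "decompose"
    if stuck_in_loop:
        return "file fix-task, re-dispatch"
    if real_429:
        return "wait and re-dispatch"
    return "re-dispatch immediately"
```

No branch returns a model substitution, which mirrors the halt-on-substitute rule.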

Convergent confidence is not the control — N agents sharing one confound is worth nothing. (Two workers + one QA agent agreed "platform no-op" with strace + sentinel proof, all wrong: every one tested with our plugin installed; the bug was our hooks.json shape. One vanilla control flipped it instantly.)
This applies to your own relayed conclusions most of all. A worker verdict that blames what we don't own and arrives CONFOUND CHECK: NOT RUN is not relayed — note it in the ledger and commission the control first.

§7 — Context-economy contract (mandatory, every mode). The orchestrator's context is the bottleneck (the motivating interactive session burned ~170k tokens):

Capped structured handback, every brief — the worker ends with this and you read that, not the narrative:

VERDICT: <PASS | FAIL | BLOCKED | NEEDS-PRINCIPAL>
CLAIM: <one sentence — the conclusion>
GATE: <the acceptance gate, and the observed result against it>
EVIDENCE: <pointers — session id, log path, line refs — NOT pasted dumps>
CONFIDENCE: <high|med|low> + <what single control/test would falsify this>
CONFOUND CHECK: <did a clean-room/differential control run? result? — or "NOT RUN">

CONFOUND CHECK is mandatory whenever the verdict blames what we don't own; NOT RUN ⇒ do not relay, commission the control (§3).
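The handback contract and the CONFOUND CHECK rule can be sketched in code. This is a minimal illustration, not the framework's actual parser: the function names `parse_handback` and `may_relay` are hypothetical, single-line field values are assumed, and whether a verdict "blames what we don't own" remains a supervisor judgment passed in as a flag.

```python
FIELDS = ("VERDICT", "CLAIM", "GATE", "EVIDENCE", "CONFIDENCE", "CONFOUND CHECK")
VERDICTS = {"PASS", "FAIL", "BLOCKED", "NEEDS-PRINCIPAL"}


def parse_handback(text: str) -> dict:
    """Parse 'FIELD: value' lines; reject a handback with missing fields."""
    parsed = {}
    for line in text.strip().splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip() in FIELDS:
            parsed[key.strip()] = value.strip()
    missing = [field for field in FIELDS if field not in parsed]
    if missing:
        raise ValueError(f"handback missing fields: {missing}")
    if parsed["VERDICT"] not in VERDICTS:
        raise ValueError(f"unknown verdict: {parsed['VERDICT']!r}")
    return parsed


def may_relay(handback: dict, blames_external: bool) -> bool:
    """A verdict blaming what we don't own is relayable only if a control ran."""
    if not blames_external:
        return True
    check = handback["CONFOUND CHECK"].strip('"').upper()
    return not check.startswith("NOT RUN")
```

When `may_relay` returns False, the next step per the rule above is to note the verdict in the ledger and commission the control, not to pass the conclusion on.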

Cheap summarizer agent for all bulk reading (large bodies, transcripts, log dumps): a haiku/sonnet general-purpose Agent-tool dispatch (or its jr/polecat equivalent), briefed "read <pointer>, return the ≤N facts relevant to <question>." It reads the bulk so your context never does.
The ledger lives in the epic body — always open an epic node, even when supervising from an interactive conversation with no pre-existing epic (chat context is not durable state). Mechanics: mcp_pkb_create_task type=epic seeded with the ## Work Items / ## Pattern Memory / ## Ledger skeleton (see Pattern Memory Format); capture ORIENT findings into ## Ledger and the failing observable into ## Work Items on tick 1 — that is where the capstone (§2a) later reads the "exact previously-failing check."
Capped chat updates — one short paragraph (verdict + next action) between phases, never a transcript replay. Preload predictable tool schemas once (task get/update, memory create, stop/monitor) to avoid ToolSearch / parameter-retry churn.

Conversational Orchestration Mode

Workers are background Agent(subagent_type=…, run_in_background=True) calls (general-purpose for build/investigate, marsha for runtime QA); results arrive as <task-notification>. The §7 context-economy contract still binds — front-load every brief (§5) because you cannot steer a running worker, and require the capped handback.
Still open an epic node for the ledger (§7) — needs_task being off means you are not required to be handed one, not that state may live in chat. Chat context is not durable state.
When the work produces code/PRs, the one-epic-one-PR pattern applies unchanged: one shared-branch PR, promoted only when all delegated work has landed and the capstone passes.

Portfolio / Release Supervision

One epic per tick. Each tick, advance the single most-blocking epic by one decision (its own supervision step). Never run two workers on the same task-id — concurrent worktree creation races the worktree-lock and container-name. Grow concurrency with more ticks, not more dispatches per tick.
State lives in the release task body under ## Constituent Epics (each epic + its status) and ## Escalations (pending approvals, blocked epics, merge-ready PRs). Commit and push each tick. Surface only actionable items there — never worker threads or tool-call play-by-play.
File missing epics. If a release requirement has no epic, create one parented under the release task and add it to ## Constituent Epics.
Premise gate still binds at every dispatch (here and in the epic step it drives): read the leaf body and judge whether it carries a genuine premise judgment; if not, bounce it to the promoter and spend no compute. This is an agent judgment by reading, never a field check ([[../remember/references/premise-gate.md]]).
Terminal: done-pending-Nic. When every epic is at its review surface and the only remaining work is decisions/approvals/merges that are structurally Nic's, you are autonomously complete — N items surfaced for Nic. Set the release task to review, write the N items to ## Escalations, and stop. This is not a failure; it is the correct end of an autonomous loop.

Reporting Posture

Operate in decide-and-report mode. Exit in one of three states:

Silent: No user-facing output. Commit/push checkpoint advances the tick.
[ATTN] block: Emit a single YAML block (see User Attention Notification) for decisions requiring explicit user authorization.
Halt summary: Terminal state reached. Emit a one-line summary in plain English.

Escalation Criteria

Escalate only if:

Action is irreversible or modifies external systems without authorization.
Involves methodology, citation, or claims published under the user's name.
No defensible default exists.
Your judgment says stop: the same failure keeps recurring, workers are stalled, or you cannot prove a verdict. There is no row counter; if it smells stuck, halt and escalate rather than burning more compute.

Per-Tick Checklist

Execute the loop exactly once per tick:

ORIENT: Retrieve the task body (mcp_pkb_get_task(<id>)) and read the ledger. Before the first dispatch on a problem, run the orient-before-dispatch checklist (Holding Delegated Work to Proof §1): PKB search, prior-art PR/branch sweep, sanctioned-harness identification, and vendor-docs fetch for cross-vendor surfaces. Don't dispatch blind; if you can't complete orient, note it and escalate.
JUDGE: Read the ledger. If the same failure keeps recurring, workers are stalled, or the premise no longer holds, halt and escalate — your call, not a counter.
DECIDE: Invoke subagent(s) to obtain a structured verdict. Chaining is permitted only for compose-then-dispatch (compose-agent followed by fresh dispatch-agent).
ACT: Sanity-check the verdict (one coherent action, consistent with the body); if it doesn't hold up, don't act on it — note why and exit. Otherwise execute the action (Bash, file task via mcp_pkb_create_task, promote, or exit).
CHECKPOINT: Append a ledger row to the task body, commit, and push.

Prohibited Main Agent Actions

Do not:

Proactively scan files, diffs, or transcripts, or run test probes (rely on subagent verdicts; the only exception is cheap local environment status checks such as gh auth status).
Author code edits or fixes.
Persist state outside the epic body.
Prompt the user if a defensible default exists.
Modify or expand the verification brief.
Evaluate visual or QA artifacts directly (delegate to marsha).

Subagent Contracts

Egress Constraints

pauli — Preflight & React

Role: Determine next action, handle worker exits, and react to verification failures.
Verdict Shape: A single paragraph specifying exactly one action:
- dispatch <worker> on <task-id> in <project>
- brief composed on <task-id>
- file fix-task <title> under <parent>
- halt: <reason>
Verification Brief Assembly:
- Read original brief/spec and ## Fitness Rubric.
- Output one paragraph containing: artifact location/link + goal + spec link.
- Do not include history, reviewer notes, dimensions, or manual check steps.
- Halt if ## Fitness Rubric is missing for user-facing artifacts.

marsha — Verify (Review Surface)

Role: Review deliverables for work items.
Review Surface Shift:
- Cohesive Single-PR-Epic (Default): The supervisor review surface shifts from PR-per-task to single-PR-at-end. The supervisor does NOT run marsha verification on separate PRs or individual work items as each intermediate worker finishes. Instead, intermediate tasks are verified using local outcome-based verification (checking remote commit existence and inspecting the diff on the shared branch). Once verified, they are transitioned to merge_ready to unblock dependent tasks. The supervisor invokes marsha to review exactly ONE cumulative PR when the final stage promotes it. That single cumulative pass IS the capstone verification (Holding Delegated Work to Proof §2a). The marsha brief the supervisor composes MUST carry the three capstone specifics from §2a — the sanctioned QA harness (identified at ORIENT, never invented; if none is recorded, HALT and [ATTN]), the exact previously-failing user-facing check (supplied by the supervisor from the epic ledger, not reconstructed by marsha), and the byte-match hallucination rule-out — while marsha's own [[../verify/SKILL.md]] enforces the fresh-instance / non-implementer / source-trace posture. A capstone the prompt could have produced without the system running is not a pass; record any miss in the ledger and send it back.
- Standalone / Independent Tasks: Keep the legacy branch-per-task behavior and verify each task's PR individually.
Verdict: PASS, FAIL, or REVISE.

Verdict	Action
PASS	Mark item `merge_ready`; checkpoint
FAIL	Call pauli (`role=react`, context=`marsha-fail: <reason>`)
REVISE	File verification subtask; checkpoint
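The verdict table above amounts to a three-way dispatch. The sketch below is illustrative only: `next_action` is a hypothetical name, and the returned strings merely describe the action the supervisor would take.

```python
def next_action(verdict: str, reason: str = "") -> str:
    """Map a marsha verdict to the supervisor's follow-up, per the table."""
    verdict = verdict.strip().upper()
    if verdict == "PASS":
        return "mark item merge_ready; checkpoint"
    if verdict == "FAIL":
        # FAIL hands control back to pauli in its react role.
        return f"call pauli role=react context='marsha-fail: {reason}'"
    if verdict == "REVISE":
        return "file verification subtask; checkpoint"
    raise ValueError(f"unexpected marsha verdict: {verdict!r}")
```

Raising on an unknown verdict keeps a malformed handback from being silently treated as a pass.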

Compose-then-Dispatch Separation

The agent authoring a brief must not dispatch against it (agent-identity separation).
If the brief was modified during the tick, Pauli must output brief composed on <task-id>. The main agent must persist the brief, then invoke a fresh subagent context (dispatch-agent) to validate and emit the dispatch verdict.
If the brief is stable PKB content, Pauli emits dispatch directly.

Verdict Sanity Check

Cohesive Single-PR-Epic Pattern (Default)

Live Mechanism (PR #1749 / aops-613690b5)

This pattern is executable today via the live shared-branch mechanism:

is_shared_branch Detection: The manager automatically detects shared branches by looking for custom branch overrides. If the branch name does not match the default polecat/task-<task-id> pattern (e.g. polecat/epic-<epic-id>), it is treated as a shared branch.
Cooperative Sync: Workers on a shared branch perform cooperative pulls and rebases (git fetch followed by git rebase origin/<branch-name>) to integrate other workers' in-flight commits rather than resetting to main.
Force-with-lease: Push operations use --force-with-lease to push changes to the shared branch, accepting a low-concurrency contract.
No Deletion: Shared branches bypass staleness and nuke-delete cleanup sequences, preserving in-flight contributions.
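The detection and sync rules above can be approximated as follows. This is a sketch under stated assumptions, not the manager's real code: the function names are hypothetical, the remote is assumed to be `origin` (as the rebase target suggests), and the commands are returned as argument lists, not executed.

```python
def is_shared_branch(branch: str, task_id: str) -> bool:
    """A branch that is not the default polecat/task-<task-id> is shared."""
    return branch != f"polecat/task-{task_id}"


def sync_commands(branch: str) -> list:
    """Cooperative sync: fetch, then rebase onto the shared branch."""
    return [["git", "fetch"], ["git", "rebase", f"origin/{branch}"]]


def push_command(branch: str) -> list:
    """Push with --force-with-lease (remote name 'origin' is an assumption)."""
    return ["git", "push", "--force-with-lease", "origin", branch]
```

For example, `is_shared_branch("polecat/epic-e1", "t1")` is true, so a worker on that branch would run `sync_commands("polecat/epic-e1")` before pushing.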

Dispatch and Concurrency Rules

Shared Branch Default: Every worker dispatched for a subtask of a cohesive epic must use the exact same branch name via the override flag: --branch polecat/epic-<epic-id>.
Decomposition Structure:
- The epic must be decomposed into parallel-able units (which have no inter-dependency and can execute concurrently on the shared branch) and sequential-dependency units (which carry explicit depends_on: [<id>] edges).
- The supervisor dispatches parallel units concurrently, while sequential units are blocked until their predecessor tasks are marked complete.
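One way to picture the decomposition rule is as a readiness check over the depends_on edges. The sketch below assumes a plain mapping from unit id to its depends_on list; `dispatchable` is a hypothetical helper, not part of the framework.

```python
def dispatchable(units: dict, complete: set, in_flight: set) -> list:
    """Return unit ids whose predecessors are all marked complete.

    units maps each unit id to its depends_on list; parallel-able
    units simply have an empty list.
    """
    return sorted(
        unit_id
        for unit_id, depends_on in units.items()
        if unit_id not in complete
        and unit_id not in in_flight
        and all(dep in complete for dep in depends_on)
    )
```

With `units = {"a": [], "b": [], "c": ["a"]}`, units `a` and `b` are ready together, while `c` stays blocked until `a` is marked complete.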

One Epic, One PR — promote at the capstone

Canonical Dispatch Commands

# Local dispatch (polecat surface)
uv run --project ~/src/academicOps polecat run -t <task-id> -p <project> --branch polecat/epic-<epic-id> --model <name>

--model <name> is the canonical flag. Use --model claude (config-default), --model opus (Claude family alias), or --model gemini-3.1-pro-preview for Gemini. --opus is not a valid flag and will error — use --model opus.
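When the dispatch command is assembled programmatically, building it as an argument list avoids quoting mistakes. The helper below is a hypothetical sketch that mirrors the command shown above; it only constructs the list and does not run anything.

```python
import os


def polecat_command(task_id: str, project: str, epic_id: str, model: str) -> list:
    """Build the local dispatch argv for a shared-branch epic worker."""
    if not model or model.startswith("-"):
        # Guards against passing a flag such as --opus as the model name.
        raise ValueError(f"model must be a name for --model, got {model!r}")
    return [
        "uv", "run", "--project", os.path.expanduser("~/src/academicOps"),
        "polecat", "run",
        "-t", task_id,
        "-p", project,
        "--branch", f"polecat/epic-{epic_id}",
        "--model", model,
    ]
```

The resulting list can be handed to `subprocess.run` unchanged, so no shell interpolation is involved.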

Pattern Memory Format

## Pattern Memory

| Tick (ISO)           | Decision                    | Outcome / Notes                          |
| :------------------- | :-------------------------- | :--------------------------------------- |
| 2026-05-08T02:14:00Z | dispatch task-abc to claude | preflight clean                          |
| 2026-05-08T02:43:11Z | marsha FAIL on task-abc     | tests red on docker — re-dispatching fix |
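A row in this format could be produced with a small formatter like the one below. It is a sketch only: `ledger_row` is a hypothetical name, and replacing pipe characters inside cells is an assumption made here so that a cell cannot break the table.

```python
from datetime import datetime, timezone
from typing import Optional


def ledger_row(decision: str, outcome: str, when: Optional[datetime] = None) -> str:
    """Format one Pattern Memory row with a UTC ISO-8601 tick timestamp."""
    when = when or datetime.now(timezone.utc)
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    def clean(cell: str) -> str:
        # Keep each cell on one line and free of column separators.
        return cell.replace("|", "/").replace("\n", " ").strip()

    return f"| {stamp} | {clean(decision)} | {clean(outcome)} |"
```

The returned line is what the CHECKPOINT step would append to the task body before committing and pushing.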

Design Principles

Task File Is the Only State: Persist all status inside the epic body (## Pattern Memory, ## Work Items, ## Supervisor Log).
Halt-on-Substitute: Halt if worker type, deliverable type, target repository, or scope limits change. Do not auto-substitute.
Drive-by Fix Policy: Bundle unrelated trivial fixes only if blocking, obvious, and describable in one sentence. Otherwise, file a separate task.
Keep the Pipe Flowing: Delegate decomposition and planning to workers. Restrict supervisor concurrency dynamically based on rate limits.
Intent Authority: When filing or decomposing tasks, leave priority at the uncurated default band — never originate a non-default band from importance or urgency. Only Nic sets intent, by express per-request instruction. Canonical rule: [[framework-conventions-summary#intent-authority]].
PR Body Hygiene: PR bodies describe the change for the reviewer — never carry do-not-merge / merge-gate / "awaiting Nic" banners. Branch protection is the enforced gate. Canonical rule: [[framework-conventions-summary#pr-body-conventions]].
Engineering Integrity: Failing tests/validations must be resolved, not bypassed.
Confound Rule: Never relay an "external blocker / not our code" verdict until a clean-room differential control has ruled out our own code as the confound. Full rule: Holding Delegated Work to Proof §3.
Critic Gate: High-risk tasks must undergo preflight validation by Pauli before dispatch.
Academic Integrity: Surfaced decisions that affect anything published under the user's name require human confirmation.

Phases

Phase	Subagent	Execution
Orient	(none)	Read task body and ledger; judge whether to advance or halt; select phase.
Decompose	pauli	Propose subtasks; run RBG axiomcheck. Set `superseded_by` on retired tasks.
Review	(none)	Halt; await human promotion to `queued`.
Dispatch	pauli	Preflight brief, execute dispatch or chain compose/dispatch.
Pre-verify	pauli	Assemble minimal brief (artifact, goal, spec link).
Verify	marsha	Run validation. Return PASS, FAIL, or REVISE.
React	pauli	Recommend fix-task or halt after FAIL.
Halt	(none)	Terminal state reached; emit summary and exit.

Deliverable Subworkflows

Deliverable Type	Subworkflow	Status
Code change	[[instructions/code-deliverable]]	active

Status Display Surfaces

Read-only projections. Do not write local JSON tracking files.

gh pr list / gh pr checks
gh run list
$AOPS_SESSIONS/tasks.json
$AOPS_SESSIONS/state/pr-state.json
GitHub Issues with halt label
docker events

User Attention Notification

Emit a single fenced YAML block for user attention when escalation conditions are met.

[ATTN]
---
id: <epic-id>:<tick-sequence>
urgency: now | today | whenever
action_required: decision | review | info
one_line: <=80-char summary
context_ref: <task-id | PR-url | issue-url>
dismiss_if: <one-line condition under which this no longer needs attention>
suggested_response: <the supervisor's default if user says "you decide">
---

All text fields (one_line, suggested_response) must use plain English. Push one_line to slack/discord/email only if urgency is now or today and action_required is decision.
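The field constraints and the push rule can be checked mechanically. The sketch below assumes the [ATTN] block has already been loaded into a dict; `validate_attn` and `should_push` are hypothetical names.

```python
URGENCIES = {"now", "today", "whenever"}
ACTIONS = {"decision", "review", "info"}


def validate_attn(block: dict) -> None:
    """Raise ValueError if the block violates the documented field limits."""
    if block["urgency"] not in URGENCIES:
        raise ValueError(f"bad urgency: {block['urgency']!r}")
    if block["action_required"] not in ACTIONS:
        raise ValueError(f"bad action_required: {block['action_required']!r}")
    if len(block["one_line"]) > 80:
        raise ValueError("one_line exceeds 80 characters")


def should_push(block: dict) -> bool:
    """Push one_line only for now/today urgency that needs a decision."""
    return block["urgency"] in {"now", "today"} and block["action_required"] == "decision"
```

A `whenever` item, or one that only asks for review, would stay in the task body and never be pushed out.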

Multi-Tick Supervision (notify-watch)

In interactive sessions, arm the Docker events Monitor at the first polecat dispatch so that each container exit triggers a tick.

Local Monitor Command

Monitor(
  description: "polecat exits",
  persistent: true,
  command: "while true; do docker events --filter event=die --filter 'name=polecat-' --format '{{.Time}} {{.Actor.Attributes.name}} exit={{.Actor.Attributes.exitCode}}'; sleep 2; done"
)

Filter out crew containers by checking container env for POLECAT_CREW_NAME. Stop the monitor using TaskStop once in-flight tasks resolve.
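Each line the monitor emits follows the --format template above, so it can be parsed into fields before a tick is triggered. This sketch assumes that `{{.Time}}` renders as integer Unix seconds and that the container environment has already been fetched into a dict (for example from docker inspect); `parse_exit_event` and `is_crew` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

EVENT_RE = re.compile(r"^(?P<time>\d+) (?P<name>polecat-\S+) exit=(?P<code>-?\d+)$")


class ExitEvent(NamedTuple):
    time: int
    name: str
    exit_code: int


def parse_exit_event(line: str) -> Optional[ExitEvent]:
    """Parse one monitor line; return None for anything unrecognised."""
    match = EVENT_RE.match(line.strip())
    if match is None:
        return None
    return ExitEvent(int(match["time"]), match["name"], int(match["code"]))


def is_crew(env: dict) -> bool:
    """Crew containers are the ones that carry POLECAT_CREW_NAME."""
    return "POLECAT_CREW_NAME" in env
```

A tick would then fire only for events where `parse_exit_event` succeeds and `is_crew` is false for that container.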

Mechanism Selection

Situation	Mechanism
Single worker outcome	Bash `run_in_background` with polling loop
Async PR states	`Monitor` on `gh pr checks`
Idle / fallback	`ScheduleWakeup` (>= 1800s)
Interactive session	`Monitor` on `docker events`

Lifecycle Trigger Hooks

Hook	Trigger	What it does
`queue-drain`	cron / manual	Starts supervisor session.
`stale-check`	cron / manual	Resets timed-out tasks.
`pr-merge`	James	James closes completed tasks post-merge.

Task Assignment & Handover

Assign each task to the appropriate worker; assign a task to a human only when it requires a binary decision.
Always leave a follow-up task when releasing mid-flow (mcp_pkb_append / mcp_pkb_release_task).

Known Limitations

Gemini 429 QUOTA_EXHAUSTED is treated as a transient rate-limit (typically a 45-minute timeout), not a hard quota lockout.
Pauli diagnosis tree for Gemini code 1 exits:
1. Task ran > 45 minutes -> Decompose.
2. Stuck in loop -> File fix-task, re-dispatch.
3. Real 429 rate limit -> Wait and re-dispatch.
4. Other -> Re-dispatch immediately.
Do not substitute Gemini with Claude automatically (Halt-on-substitute).
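The diagnosis tree reads as an ordered sequence of checks. The sketch below assumes the branches are evaluated top to bottom and that the three inputs have already been determined from the worker's logs; the function name and parameters are hypothetical.

```python
def diagnose_gemini_exit(runtime_minutes: float, stuck_in_loop: bool,
                         real_rate_limit: bool) -> str:
    """Walk the diagnosis tree for a Gemini code 1 exit, first match wins."""
    if runtime_minutes > 45:
        return "decompose"
    if stuck_in_loop:
        return "file fix-task, re-dispatch"
    if real_rate_limit:
        return "wait and re-dispatch"
    return "re-dispatch immediately"
```

None of the branches switches the worker to Claude, which is consistent with the Halt-on-substitute rule.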