| name | promode-audit |
| description | Audit how well a repository's codebase and practices align with the promode methodology, then produce a prioritised, actionable improvement plan. Fans out parallel assessors (one per dimension) and synthesises their findings. Use when the user wants to assess promode alignment/fit, audit a repo against the methodology, or get a plan to bring a codebase in line with promode. Also flags stale per-project install leftovers — promode ships its own SessionStart hook, so nothing should be copied into a project. |
You (the main agent) run this audit: fan out parallel assessors, one per dimension, then synthesise their findings into a prioritised, actionable plan. The assessors gather evidence; the prioritisation and plan are yours — that judgement is not delegated.
This audits the repo's alignment with promode's working practices, plus a pre-flight setup check. The promode plugin delivers its brief via its own SessionStart hook, so nothing is installed per-project — the only setup issue to catch is stale leftovers from the retired copy-install.
Each dimension is a reusable lens, not just a step in the full sweep — it mirrors what a working agent is already responsible for. The same standard runs at three cadences: every agent upholds its dimension continuously while working; a single owning agent can be asked to audit just its dimension for a targeted spot-check (e.g. promode:code-reviewer for tests or architecture); or you run this skill for the full parallel sweep. Auditing and doing are the same standard, zoomed differently.
1. **Frame** — Skim `CLAUDE.md` and `README` to understand the repo (language, stack, size, test setup). Pick which dimensions apply and scale the assessor count to the repo (a small library may merge dimensions; a large system warrants all of them, possibly split by area). Also do a quick **setup check**: the promode plugin delivers its brief via its own SessionStart hook, so a project needs nothing installed — flag any stale leftovers from the retired per-project copy-install (`.claude/PROMODE_MAIN_AGENT.md`, `.claude/hooks/promode-main-context.sh`, or a promode `SessionStart` entry in `.claude/settings.json`); they now double-inject the brief and should be removed.
2. **Fan out assessors in parallel** — Dispatch one background agent per dimension using the standard fire-and-forget pattern (all in one turn, then end the turn; their completion notifications wake you). Each is **read-only — instruct it to modify nothing**. Give each the assessor brief below. Wait for all to report.
3. **Synthesise** — Merge findings into one assessment and a single prioritised plan (format below). Resolve overlaps between dimensions; rank across all of them. For a *contested or high-stakes* rating, judge deliberately: score the dimension against an explicit rubric, compare options **pairwise** when "better" can't be stated in the abstract, and for a close call re-run the assessor (a different model/provider catches different failures) and take the **majority** rather than trusting one pass.
4. **Deliver & offer to capture** — Present the report. Offer to turn the plan into tracked work (e.g. `KANBAN_BOARD.md` / `IDEAS.md`) or a saved report file.
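The setup check in step 1 can be sketched as a small read-only script. This is a minimal sketch, not promode tooling: the three leftover locations come from the step above, while the function name and the assumption that hook entries sit under `hooks` → `SessionStart` in `.claude/settings.json` are illustrative.

```python
import json
from pathlib import Path

# Leftover paths named in the setup check (step 1).
STALE_FILES = (
    ".claude/PROMODE_MAIN_AGENT.md",
    ".claude/hooks/promode-main-context.sh",
)


def find_stale_leftovers(repo_root: str) -> list:
    """Return human-readable findings for retired copy-install leftovers.

    Hypothetical helper: it only reads files and never modifies the repo.
    """
    root = Path(repo_root)
    findings = [f"stale file: {rel}" for rel in STALE_FILES if (root / rel).exists()]

    settings_path = root / ".claude" / "settings.json"
    if settings_path.is_file():
        try:
            settings = json.loads(settings_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            return findings + ["unreadable .claude/settings.json (check by hand)"]
        # Assumption: hooks are nested as settings["hooks"]["SessionStart"].
        hooks = settings.get("hooks") if isinstance(settings, dict) else None
        session_start = hooks.get("SessionStart", []) if isinstance(hooks, dict) else []
        # Search the serialised entry rather than depend on its exact shape.
        if "promode" in json.dumps(session_start).lower():
            findings.append("promode SessionStart entry in .claude/settings.json")
    return findings
```

An empty list means there is nothing to flag; any entry is a removal candidate to report, since the leftovers double-inject the brief.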
The promode-alignment axes. Each is one assessor's deliverable.
| Dimension | Assesses (promode principle) | Suggested assessor |
|---|---|---|
| Framing & traceability | Apply the framing & traceability check (below): does a top-down document hierarchy (goals/risks/priorities, realistic customer profiles / personas → marketing → feature definitions → feature tests) make the repo self-describing, each layer explaining WHY and linking up to a goal? Are the user needs/journeys/workflows that features rest on cited or flagged as assumptions, rather than asserted as fact (a silent one is the highest-severity finding)? (feature knowledge-base, self-describing repo, evidence-based user needs) | general-purpose |
| Tests & feedback loops | Behaviour-focused tests on critical paths; tested through public interfaces vs coupled to implementation / over-mocked; speed & determinism (can an agent get a fast pass/fail signal?); a documented one-command way to run tests, lint, typecheck, and the app. Where a UI fronts real logic, apply the Discovery → Determinism layered-coverage check (below) — including whether the acceptance tier reads as domain-language user stories traceable to a cited/flagged need and whether a product-doc↔acceptance-test seam exists (tool-agnostic, not BDD-mandated). (TDD, tests-as-documentation, fast feedback, verifier-readiness, traceable acceptance) | promode:code-reviewer |
| Agent knowledge & orientation | Apply the CLAUDE.md health check (below), including the orientation hierarchy: root launchpad, optional subtree launchpads, and adjacent AGENTS.md symlinks. Beyond it, three questions: (1) capture — is durable reusable knowledge captured as linked nodes reachable from the root or relevant subtree CLAUDE.md, or left tribal — across the kinds the knowledge-graph model names: subsystem orientation, non-obvious build/run gotchas, decisions (ADR-style), and runbooks (repeatable operational procedures — deploy, migration, env bring-up, recovery, recurring incidents)? Flag any kind that's missing, unlinked, or lives only in someone's head. (2) discoverability — are crucial design constraints (invariants, prohibitions, required patterns, load-bearing decisions) reaching context by a guaranteed mechanism — inline in the governing orientation, or @import-transcluded from it — rather than only reachable by a plain link an agent may never follow? A constraint reachable only by a plain link is the dimension's highest-severity finding (see the health check). (3) skill quality — are SKILL.md skills tight landmines-not-docs (project gotchas the model wouldn't know, with exact conventions inlined at point of use), or bloated comprehensive manuals that dilute signal and can worsen behaviour? Flag over-stuffed skills (see agent-knowledge-wiki → Authoring skills). (CLAUDE.md-rooted knowledge graph; hierarchy; decision/runbook/operational-knowledge capture; critical-constraint discoverability; skill quality) | general-purpose |
| Architecture & navigability | Module depth vs shallowness; testability (dependency injection, seams, return-values over side-effects); oversized files that burn agent context; tangled coupling, dead code, misleading names. (small diffs, testability, context-frugality) | promode:code-reviewer |
| Observability & traceability | Apply the observability & traceability check (below): do runtime logs carry a correlation/tracer ID that threads one request client→backend (and across service hops), filterable on both sides, so an agent can pull a single request's whole trace instead of slurping unfiltered logs? (context is precious applied to runtime; cheap agent debugging) | promode:code-reviewer |
| Design system & visual feedback loop | Apply the design system & visual feedback-loop check (below): is a design source-of-truth established (two-layer DESIGN_SYSTEM.md, tokens + rationale, linked from CLAUDE.md, traceable to personas/goals), does a lookbook exist and cover the components/screens, and is there a live-refresh preview loop for design + marketing artifacts? (visual feedback loop = the design analogue of the operator-seam test loop; crystallise taste into determinism) | promode:product-design-expert |
| Change hygiene (optional) | Commit focus & size; messages explain why; do tests land with the code they cover? (small focused commits, explain-why, visible TDD) | general-purpose |
The repo should be **self-describing top-down** — a reader (human or agent) can start at the high-level goals and follow links down to the tests that implement them. The Framing & traceability assessor checks:
- The hierarchy exists — are there docs for high-level goals/risks/priorities, realistic customer profiles / personas (`docs/product/PERSONAS.md`), product/marketing framing, feature definitions, and feature tests? Note which layers are missing or thin.
- Every layer explains WHY and links up — each artifact states why it exists and links to the layer above, ultimately to a goal. Flag docs that describe only WHAT/HOW with no WHY, and layers that don't connect upward.
- No orphans, no drift, no sprawl — can each significant feature be traced up to a goal? Flag orphaned features (no link to any goal → likely superfluous, or the goals doc is stale) and goals nothing implements. Also flag goal sprawl — an unfocused or ballooning goals/risks/priorities list, or goals that look invented post-hoc to justify a feature; too many goals is itself a finding (focus is the scarce resource). A broken chain is a diagnostic signal, not just a missing file — say which interpretation is likely and recommend the fix (cut the work, sharpen the goal, or update a genuinely stale goals doc).
- User-facing features trace to a realistic persona — can each user-facing feature name the documented customer profile / persona it serves? Flag features with no persona (who is this actually for?) and personas that look invented or flattered to justify a feature rather than grounded in real customer evidence — the same post-hoc-justification trap as a stretched goal. A persona with no evidence behind it is a finding, not a framing artifact.
- User needs / journeys / workflows are evidence-bearing CLAIMS, not givens — the user needs, journeys, and operational workflows that features rest on should each be backed by a cited signal (research, support tickets, usage data) or explicitly flagged as an assumption with a validation path — not asserted silently as fact. A silent (uncited, unflagged) user-need assumption is the HIGHEST-SEVERITY finding in this dimension: rank it above orphaned goals and personas, because it propagates down into the domain model and architecture — the costliest layer to unwind once the wrong need is baked in. The evidence bar is graded — the finding is a silent or fabricated assumption, not "lacks a formal citation"; do not push toward cargo-cult citation demands or penalise a need that's honestly flagged as an assumption with a path to validate it.
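The orphan, unimplemented-goal, and silent-assumption checks above can be pictured as set operations over the link graph. The sketch below assumes the assessor has already extracted goals and feature links from the docs by reading them; the `Feature` record and `trace_findings` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Feature:
    name: str
    goals: list = field(default_factory=list)  # goals this feature links up to
    need_evidence: Optional[str] = None        # cited signal behind the user need, if any
    flagged_assumption: bool = False           # need is explicitly marked as an assumption


def trace_findings(goals: set, features: list) -> dict:
    """Classify broken chains in the goal -> feature hierarchy."""
    implemented = {g for f in features for g in f.goals}
    return {
        # Features that link to no documented goal.
        "orphaned_features": [f.name for f in features if not set(f.goals) & goals],
        # Goals that nothing implements.
        "unimplemented_goals": sorted(goals - implemented),
        # Needs that are neither cited nor flagged: the highest-severity finding.
        "silent_assumptions": [
            f.name for f in features if not f.need_evidence and not f.flagged_assumption
        ],
    }
```

A feature flagged as an assumption is deliberately not reported as silent, matching the graded evidence bar: the finding is an unflagged need, not a missing formal citation.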
`CLAUDE.md` is loaded project orientation, so it is the highest-leverage project-owned context. The Agent-knowledge assessor evaluates the root file and any subtree `CLAUDE.md` files against four tests, then recommends a concrete restructure (what to cut, what to link, what signposts to add, what subtree orientation or symlink to create) where the hierarchy falls short. Some checks are artifact checks you can run by reading the orientation files; the inline-criticality test (#2) is a **traversal** check — you cannot tell what's *missing* from loaded orientation by reading it alone, so walk the linked docs, ADRs, workflow docs, and load-bearing code/comments and ask of each crucial constraint you find: *is this surfaced inline in the orientation that governs the affected area?*
- Concise launchpads — loaded orientation enters agent context, so every extra line is a token tax and dilutes attention. Flag bloat: long prose, duplicated content, detail that belongs in a linked doc, or repo-root content that belongs in a subtree `CLAUDE.md`. Each orientation file should be a launchpad, not a manual.
- Critical knowledge reaches context by a guaranteed mechanism, never a bare link. An agent acts only on what's already in its context, and it won't follow links it doesn't know exist — so anything an agent would fail or do harm without must be guaranteed-loaded into the governing orientation, by one of two mechanisms (the rule is criticality, not topic): (a) inline in the root `CLAUDE.md` (repo-wide) or nearest subtree CLAUDE.md (local), or (b) @import-transcluded from that CLAUDE.md when the content is too long to inline or shared across areas (the harness pulls its full contents into context; a plain [link] does not — it's a pointer the agent may never open). This bites hardest for crucial design constraints and critical workflow/build/test rules that habitually hide in ADRs, runbooks, workflow docs, or code comments. A critical rule reachable only by a plain link (or buried with no guaranteed-load path) is the highest-severity finding in this dimension — rank it Now and recommend the reinforce-design-constraints skill, the write action that either hoists the rule inline or converts the plain link to an @import. (Note the locality trade-off: a root @import taxes every session, so a rule critical only in a subtree belongs in that subtree's CLAUDE.md — auto-loaded on-interaction — not a root import.)
- Orientation hierarchy fits the repo — the root `CLAUDE.md` should link to major subtree orientation files when present, and major subsystems with materially different local commands, concepts, or landmines should have their own concise CLAUDE.md instead of bloating the root. Flag both missing subtree orientation and root bloat that should split downward.
- Entry point and compatibility — does the root and subtree orientation link out to key docs so an agent can reach any major area in a hop or two? Flag orphaned docs (reachable from nothing = invisible), missing links, and missing adjacent `AGENTS.md -> CLAUDE.md` symlinks beside orientation files. If symlinks are impossible, flag absence of a minimal lockstep fallback.
Recommended write action: use reinforce-design-constraints for buried critical constraints. For broad shape problems, create or split concise CLAUDE.md orientation files, link them from the parent/root, and add adjacent AGENTS.md symlinks where supported. Do not overwrite project orientation or move promode orchestration methodology into it.
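The symlink part of the compatibility check is mechanical enough to script. A minimal sketch, assuming orientation files are named exactly `CLAUDE.md` and that the directories in `SKIP_DIRS` are safe to ignore; the function name and the skip list are illustrative, not part of promode.

```python
import os
from pathlib import Path

SKIP_DIRS = {".git", "node_modules"}  # illustrative; adjust per repo


def missing_agents_symlinks(repo_root: str) -> list:
    """List AGENTS.md paths that should be symlinks to an adjacent CLAUDE.md but are not."""
    root = Path(repo_root)
    missing = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        if "CLAUDE.md" not in filenames:
            continue
        claude = Path(dirpath) / "CLAUDE.md"
        agents = Path(dirpath) / "AGENTS.md"
        # A regular-file copy does not count: it can drift out of lockstep.
        if not (agents.is_symlink() and agents.resolve() == claude.resolve()):
            missing.append(agents.relative_to(root).as_posix())
    return sorted(missing)
```

Where symlinks are impossible, a reported path is a prompt to look for the lockstep fallback by hand rather than an automatic failure.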
When a UI sits over real logic, the test suite should be **layered, not flat** — the bulk of coverage runs fast and headless below the UI; the UI itself is exercised only for what *only* surfaces there. Agent discoveries should harden into deterministic artifacts instead of manual lore. The Tests assessor checks:
- A below-UI operator seam exists — is there an interaction/operator layer beneath the UI that a test (or an agent) can drive end-to-end against real logic and persistence with no GUI? Flag suites that can only reach business logic through the UI: that forces slow, flaky tests to carry coverage a fast headless test could own. The same seam that makes the system headless-testable is also what would make it agent-operable — but that second payoff is an unproven prediction (n=1), so the test-speed gap is the finding here; note any agent-operability upside only as a secondary, speculative observation, never as a required-capability gap.
- Coverage is layered, not duplicated — the headless tier should hold the bulk of acceptance coverage; any UI-level tests should be surgical, reserved for behaviour that manifests only through the running GUI (navigation gating, view/provider/persistence wiring, render defects). Flag slow UI tests re-checking logic the headless tier already covers — that merge is the central anti-pattern. (UI-tier tests are verification-only and surgical by design — they do not invert promode's "fast feedback = unit/integration; system tests are for verification, not debugging"; the headless seam is what keeps the bulk of feedback fast.)
- Discoveries are crystallised — when an agent explores an unknown surface, is the finding hardened into deterministic, self-checking code (a map, recognizer, fixture, or test) rather than left as one-off manual knowledge? Flag exploration whose result wasn't captured as a re-runnable check, and UI checks pinned to coordinates/screenshots rather than stable selectors (they drift). Determinism that breaks imprecisely is a finding too: a failure that can't say which state/edge broke can't tell an agent where to re-discover.
- The acceptance tier reads as evidence-based user stories, traceable to a need — does the top acceptance layer read as user-visible behaviour in the high-level language of the domain (Given-When-Then-style scenarios), each traceable up to a cited (or explicitly-flagged) user need — rather than implementation-coupled assertions orphaned from any product need? Check there's a seam from the product docs (user stories) to the executable acceptance suite — the scenario is the bottom-layer test; a missing product-doc↔acceptance-test bridge is the finding. This is tool-agnostic: the finding is absence of domain-language, traceable acceptance coverage — NOT absence of Gherkin/Cucumber. Do not penalise a repo for skipping a BDD framework, nor reward one for using Cucumber if its scenarios are still implementation-coupled; Gherkin is one option and a plain high-level test whose name/steps read as the user story serves equally. See `discovery-to-determinism`'s `<scenario-vs-seam>` for the mechanics (scenario = the what, traceable to the need; operator seam = the how/where, headless).
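To make "a plain high-level test whose name/steps read as the user story" concrete, here is one possible shape. Everything in it is hypothetical: `CartOperator` stands in for a below-UI operator seam, and the discount code, prices, and cited need are invented fixture data, not taken from any real project.

```python
class CartOperator:
    """Hypothetical below-UI operator seam: drives the cart logic with no GUI."""

    def __init__(self) -> None:
        self._lines = {}  # sku -> (unit price in cents, quantity)
        self._discount_pct = 0

    def add_item(self, sku: str, unit_price_cents: int, qty: int = 1) -> None:
        price, existing = self._lines.get(sku, (unit_price_cents, 0))
        self._lines[sku] = (price, existing + qty)

    def apply_discount_code(self, code: str) -> bool:
        codes = {"WELCOME10": 10}  # illustrative fixture data
        if code not in codes:
            return False
        self._discount_pct = codes[code]
        return True

    def total_cents(self) -> int:
        subtotal = sum(price * qty for price, qty in self._lines.values())
        return subtotal * (100 - self._discount_pct) // 100


def test_shopper_with_a_valid_code_pays_the_discounted_total() -> None:
    # Traces to: "Shoppers expect promo codes to lower the price" (hypothetical
    # need, flagged as an assumption until backed by a cited signal).
    # Given a shopper with two notebooks in their cart
    cart = CartOperator()
    cart.add_item("notebook", unit_price_cents=1500, qty=2)
    # When they redeem a valid discount code
    accepted = cart.apply_discount_code("WELCOME10")
    # Then the code is accepted and the total drops by ten percent
    assert accepted
    assert cart.total_cents() == 2700
```

The test name and its Given/When/Then comments carry the user story, the trace comment carries the link up to the need, and the operator object is the headless seam; no BDD framework is involved.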
Runtime traceability is **"context is precious" applied to debugging**: an agent that can filter a whole request by one correlation/tracer ID across client and backend never has to slurp megabytes of unfiltered logs into context — cheaper tokens, faster root-causing. The Observability & traceability assessor checks:
- Correlation IDs exist and propagate — does a request carry a correlation/tracer ID that threads from the client through to the backend (and across service hops), rather than each tier logging in isolation? Flag boundaries where the ID is dropped or regenerated, so a single request can't be followed end-to-end.
- Logs are filterable by that ID on both sides — are client and backend logs emitted in a form (a structured field or a consistent greppable tag) that lets an agent isolate one request's complete trace with a single filter? Flag unstructured or untagged logging that forces a full-log read to reconstruct one request.
- The discipline is built in, not bolted on — is tracing present in the code paths that matter (boundary crossings, error paths) and asserted where it's load-bearing, rather than absent until someone needs to debug? Flag boundary-crossing code with no traceability as a debugging feedback-loop gap, not a cosmetic one — it's the runtime analogue of a missing test seam.
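One way the three checks above could look in code, as a sketch only: the header name `X-Correlation-ID`, the JSON log shape, and all function names are assumptions chosen for illustration, built on Python's standard `logging` and `contextvars` modules.

```python
import contextvars
import json
import logging
import uuid

HEADER = "X-Correlation-ID"  # assumption: a common convention, not mandated by the source
_correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Stamp every record with the current request's correlation ID."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = _correlation_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so logs are filterable by a structured field."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "correlation_id": record.correlation_id,
            "msg": record.getMessage(),
        })


def build_logger(name: str, stream) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(CorrelationFilter())
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


def begin_request(headers: dict) -> str:
    """Reuse the caller's ID if present; mint one only at the edge."""
    cid = headers.get(HEADER) or uuid.uuid4().hex
    _correlation_id.set(cid)
    return cid


def outgoing_headers() -> dict:
    """Headers to attach to downstream calls so the ID survives the hop."""
    return {HEADER: _correlation_id.get()}


def trace_for(log_text: str, cid: str) -> list:
    """Isolate one request's complete trace with a single filter."""
    return [e for e in map(json.loads, log_text.splitlines()) if e["correlation_id"] == cid]
```

The finding the assessor looks for is the inverse of `begin_request`: a boundary that ignores the incoming ID and regenerates one, which breaks the end-to-end filter.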
The **visual feedback loop** is the design analogue of the operator-seam test loop: a design source-of-truth (≈ feature tests), a lookbook that renders it (≈ headless client), and a live-refresh preview server (≈ fast test runner) together **crystallise taste into determinism** — capture an aesthetic decision once, replay it deterministically across every prompt and session. The Design system & visual feedback-loop assessor checks:
- Design source-of-truth exists and is traceable — a two-layer `docs/product/DESIGN_SYSTEM.md` (YAML token front-matter for the normative what + ## rationale sections for the why), linked from CLAUDE.md as a graph node (not a project-root DESIGN.md, not inlined into CLAUDE.md), tracing up to the documented personas/goals. Flag fabricated or ad-hoc tokens (a hex here, a spacing value there) with no source-of-truth behind them.
- A lookbook exists and renders the system — a lookbook at `docs/product/lookbook/` that renders the tokens: living (imports the real components, so it can't drift) where a component system exists, or reference (curated, tagged screenshots) before one does. It should cover the key components and screen states and trace up to the source-of-truth. Flag a stale or absent lookbook.
- A live-refresh feedback loop exists — design AND marketing artifacts (the lookbook, landing-page proposals, decks, one-pagers) have an edit→see-instantly loop — the project's existing HMR where it has one, else the reference static server — rather than a manual rebuild-and-reload tax. Flag visual artifacts with no live-preview loop as a visual feedback-loop gap — the design analogue of a missing test seam.
Applicability: like the layered-coverage check, this dimension is conditional — it applies when the project has a user-facing visual surface or produces marketing artifacts. Skip it for a pure backend or library with no visual output.
Give each assessor:
- **Scope** — its one dimension, and which part of the repo to read (whole repo, or an area for large codebases).
- **Read-only** — "Assess and report only. Do not modify, create, or commit any files."
- **Rubric** — the dimension's checklist (from the table), framed as "rate how well the repo does this, with evidence."
- **Required output** (so you can synthesise cleanly):
  - **Rating** — `Strong` / `Partial` / `Weak` for the dimension.
  - **Findings** — concrete, each citing file/path evidence (not "tests are weak" but "`src/checkout.ts` has no tests; `auth.test.ts` mocks the internal `UserStore`, coupling to implementation").
  - **Recommendations** — specific changes, each with rough effort (S/M/L) and the promode principle it serves.
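The required output can be thought of as a small typed record, which also makes the majority rule from the synthesis step easy to state. This is a sketch under assumptions: the class and function names are hypothetical, and assessors report in prose, so a structure like this would only exist in the synthesiser's own notes or tooling.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

RATINGS = ("Strong", "Partial", "Weak")
EFFORTS = ("S", "M", "L")


@dataclass
class Finding:
    claim: str
    evidence: list  # file paths backing the claim


@dataclass
class Recommendation:
    change: str
    effort: str     # one of EFFORTS
    principle: str  # the promode principle it serves


@dataclass
class AssessorReport:
    dimension: str
    rating: str
    findings: list = field(default_factory=list)
    recommendations: list = field(default_factory=list)

    def problems(self) -> list:
        """Return reasons this report cannot be synthesised cleanly."""
        issues = []
        if self.rating not in RATINGS:
            issues.append(f"unknown rating: {self.rating}")
        issues += [f"no evidence for: {f.claim}" for f in self.findings if not f.evidence]
        issues += [
            f"unknown effort for: {r.change}"
            for r in self.recommendations
            if r.effort not in EFFORTS
        ]
        return issues


def majority_rating(ratings: list) -> Optional[str]:
    """Return the rating held by a strict majority of passes, else None."""
    if not ratings:
        return None
    rating, count = Counter(ratings).most_common(1)[0]
    return rating if count * 2 > len(ratings) else None
```

A `None` from `majority_rating` means the re-runs disagree with no clear winner, which is a signal to judge the dimension deliberately against the rubric rather than pick one pass.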
Synthesise into:
# Promode Methodology Audit — <repo>
## Overall alignment
<2–4 sentences. Per-dimension rating: Framing <R> · Tests <R> · Knowledge <R> · Architecture <R> · Observability <R> · Design <R> · Hygiene <R>>
## Findings by dimension
### <Dimension> — <rating>
- <finding with file evidence>
## Prioritised action plan
Ranked across all dimensions by impact × effort. Each item:
**[Now/Next/Later] <change>** — why it matters for promode · effort S/M/L · suggested executor (`promode:<agent>`)
- Now = high impact, low/medium effort (unblocks agents working effectively — e.g. "no fast test loop", "no loaded CLAUDE.md orientation", "a critical workflow rule reachable only by a link an agent may never follow"). Next = high impact, higher effort. Later = lower impact / nice-to-have.
- Lead with the items that most improve an agent's ability to work the codebase (a fast feedback loop and a usable knowledge root usually rank highest — they compound).
- Keep every item actionable and concrete; tie each to the promode agent that would execute it.
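The Now/Next/Later rule above reduces to a two-input decision, sketched below. The two-level impact scale and the tie-break by effort are simplifying assumptions for illustration; real ranking is a judgement call, and `PlanItem`, `bucket`, and `prioritise` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PlanItem:
    change: str
    impact: str    # "high" or "low" (assumption: a two-level scale for illustration)
    effort: str    # "S", "M", or "L"
    executor: str  # the agent suggested to carry out the change


def bucket(item: PlanItem) -> str:
    """Now = high impact, low/medium effort; Next = high impact, higher effort; Later = the rest."""
    if item.impact != "high":
        return "Later"
    return "Now" if item.effort in ("S", "M") else "Next"


def prioritise(items: list) -> list:
    """Order items by bucket, then by effort so cheaper wins lead within a bucket."""
    bucket_rank = {"Now": 0, "Next": 1, "Later": 2}
    effort_rank = {"S": 0, "M": 1, "L": 2}
    ranked = sorted(items, key=lambda i: (bucket_rank[bucket(i)], effort_rank[i.effort]))
    return [(bucket(i), i.change) for i in ranked]
```

For example, a missing fast test loop rated high impact with effort M lands in Now, while a high-impact restructure with effort L lands in Next.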