framework-health

Evaluate Mycelium's own process effectiveness. Measures cycle velocity, discard trends, confidence calibration, gate effectiveness, and regression rate. Run quarterly or every 20 cycles.


Install command

npx skills add https://github.com/haabe/mycelium --skill framework-health

Copy and paste this command into Claude Code to install the skill

Source

haabe/mycelium

Stars: 30

Forks: 3

Updated: May 30, 2026 at 13:12

SKILL.md

Framework Health Check

Mycelium evaluates its own process. This is triple-loop learning — the framework assessing whether it is getting better at producing good outcomes.

When to Use

Quarterly review (scheduled)
After 20 completed leaf cycles (triggered by cycle-history.yml count)
When process friction is suspected
Before major framework changes (baseline measurement)

Workflow

1. Load Cycle Data

Read .claude/canvas/cycle-history.yml.

Framework-self-host detection (per engine/cycle-learning.md#framework-on-framework-exemption): if the project root contains plugins/mycelium/plugin.json AND CLAUDE.md begins with # Mycelium:, this is the framework dogfooding itself. Skip the cycle-count gate and route to a corrections-graduation summary:

Count entries in .claude/memory/corrections.md (total, and ×graduated-to-mechanism in the last 90 days).
Read .claude/memory/cluster-instances.md and list clusters at-or-above their graduation criterion that are not yet graduated (this is the framework analogue of "actual outcome vs predicted ICE").
Skip cycle-derived dimensions (velocity, discard rate, confidence calibration, regression rate) — they do not apply. Still run Steps 2b, 4b, 4c, 4d.

Otherwise (product project, not framework-self-host): if fewer than 5 cycles recorded, report: "Insufficient cycle data for framework health assessment. [N] cycles recorded; minimum 5 needed. Continue recording outcomes."
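
The routing in this step can be sketched as follows. This is a minimal illustration, not the framework's implementation: the function names and the returned route labels are hypothetical, and only the two marker checks and the five-cycle minimum come from the text above.

```python
from pathlib import Path

MIN_CYCLES = 5  # minimum stated in the step above


def is_framework_self_host(root: Path) -> bool:
    """True when both self-host markers described above are present."""
    plugin_manifest = root / "plugins" / "mycelium" / "plugin.json"
    claude_md = root / "CLAUDE.md"
    if not (plugin_manifest.is_file() and claude_md.is_file()):
        return False
    return claude_md.read_text(encoding="utf-8").startswith("# Mycelium:")


def route(root: Path, cycle_count: int) -> str:
    """Pick the branch of step 1 to follow (route labels are hypothetical)."""
    if is_framework_self_host(root):
        # Self-host skips the cycle-count gate entirely.
        return "corrections-graduation-summary"
    if cycle_count < MIN_CYCLES:
        return (
            "Insufficient cycle data for framework health assessment. "
            f"{cycle_count} cycles recorded; minimum {MIN_CYCLES} needed. "
            "Continue recording outcomes."
        )
    return "full-assessment"
```

The self-host check runs first so that a framework repository with few or no recorded cycles is never reported as having insufficient data.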

2. Measure Five Dimensions

For each dimension, compute the metric and compare against trend (if prior assessments exist):

Cycle Velocity:

Average days from diamond creation to completion, grouped by scale
Trend: improving / stable / degrading
If degrading: flag for investigation
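
As a rough sketch of the velocity computation, assuming each cycle record carries `scale`, `created`, and `completed` fields (hypothetical names; the real cycle-history.yml schema may differ) and assuming a 10% dead band for the trend, which the framework does not specify:

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def velocity_by_scale(cycles):
    """Average days from diamond creation to completion, grouped by scale."""
    durations = defaultdict(list)
    for cycle in cycles:
        elapsed = (cycle["completed"] - cycle["created"]).days
        durations[cycle["scale"]].append(elapsed)
    return {scale: mean(days) for scale, days in durations.items()}


def velocity_trend(previous_avg, current_avg, tolerance=0.10):
    """Fewer days is better. The 10% dead band is an assumed value."""
    if current_avg < previous_avg * (1 - tolerance):
        return "improving"
    if current_avg > previous_avg * (1 + tolerance):
        return "degrading"  # per the text above: flag for investigation
    return "stable"


# Hypothetical records for illustration only.
example_cycles = [
    {"scale": "L1", "created": date(2026, 1, 1), "completed": date(2026, 1, 5)},
    {"scale": "L1", "created": date(2026, 1, 10), "completed": date(2026, 1, 16)},
    {"scale": "L2", "created": date(2026, 2, 1), "completed": date(2026, 2, 11)},
]
averages = velocity_by_scale(example_cycles)
```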

Discard Rate:

Count of discards per lifecycle phase
Average discard phase (1-10 scale)
Trend: shifting earlier (good) / shifting later (bad) / stable
If >50% of discards at Phase 7+: flag "late discard pattern"
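
The discard metrics above could be computed along these lines. This is a sketch that takes a plain list of phase numbers (1-10), one per discard; how those phases are read from cycle history is left out, and the function and key names are hypothetical.

```python
from collections import Counter

LATE_PHASE = 7          # "Phase 7+" from the rule above
LATE_SHARE_LIMIT = 0.5  # flag when more than 50% of discards are late


def discard_summary(discard_phases):
    """Summarise discards; returns None when there is nothing to measure."""
    if not discard_phases:
        return None
    late = sum(1 for phase in discard_phases if phase >= LATE_PHASE)
    late_share = late / len(discard_phases)
    return {
        "per_phase": dict(Counter(discard_phases)),
        "average_phase": sum(discard_phases) / len(discard_phases),
        "late_share": late_share,
        "late_discard_pattern": late_share > LATE_SHARE_LIMIT,
    }
```

Exactly half of discards at Phase 7 or later does not trigger the flag, since the rule says more than 50%.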

Confidence Calibration:

For all cycles with predicted confidence and actual outcome:
- Compute: actual success rate per confidence band (0.3-0.5, 0.5-0.7, 0.7-0.9)
- Compare with expected rate (confidence 0.7 should succeed ~70%)
- Report calibration factor: actual/expected
If calibration factor < 0.8 or > 1.2: flag miscalibration
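
A minimal sketch of the calibration check follows. Two choices here are assumptions rather than framework rules: the expected rate for a band is taken as the mean predicted confidence of the cycles in it, and bands are half-open (a confidence of exactly 0.7 lands in the 0.7-0.9 band). The `confidence` and `success` field names are hypothetical.

```python
BANDS = [(0.3, 0.5), (0.5, 0.7), (0.7, 0.9)]


def calibration_report(cycles):
    """Per-band actual vs expected success rate and calibration factor.

    Cycles whose confidence falls outside every band are ignored.
    """
    report = {}
    for low, high in BANDS:
        in_band = [c for c in cycles if low <= c["confidence"] < high]
        if not in_band:
            continue  # no data for this band
        actual = sum(1 for c in in_band if c["success"]) / len(in_band)
        expected = sum(c["confidence"] for c in in_band) / len(in_band)
        factor = actual / expected
        report[(low, high)] = {
            "actual": actual,
            "expected": expected,
            "factor": factor,
            "miscalibrated": factor < 0.8 or factor > 1.2,
        }
    return report
```

For instance, four cycles predicted at 0.8 of which two succeeded give a factor of 0.5 / 0.8 = 0.625, which falls below 0.8 and is flagged.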

Gate Effectiveness:

For each theory gate, count: times checked, times passed, times failed
Compute hit rate: failures / total checks
Flag rubber stamps (0% failure rate) and hard blocks (>80% failure rate)
Theory X/Y audit (per ${CLAUDE_PLUGIN_ROOT}/harness/theory-tensions.md Tension 7): for any hard-block gate, check it is scaffolding (surfaces its why, an escape hatch exists, leaves the user more capable), not coercion (compliance for its own sake, no surfaced reason, no escape). A high-block gate that fails this audit is a Theory-X drift to remediate, not just a strict gate.
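
The mechanical part of the gate check (not the Theory X/Y audit, which needs human judgment) might look like this. The function name and the "no data" label are hypothetical; the 0% and 80% boundaries come from the text above.

```python
def classify_gate(times_checked: int, times_failed: int) -> str:
    """Classify a theory gate by its hit rate (failures / total checks)."""
    if times_checked == 0:
        return "no data"
    hit_rate = times_failed / times_checked
    if hit_rate == 0:
        return "rubber stamp"  # never fails: may not be checking anything
    if hit_rate > 0.8:
        return "hard block"    # candidate for the Theory X/Y audit
    return "ok"
```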

Regression Rate:

Count diamonds that regressed at least once / total diamonds
Trend: decreasing (good) / increasing (bad) / stable

2b. Re-run Deferred Design-Verification Eval Scenarios

Re-run any eval scenario tagged regression AND router-discipline from .claude/evals/scenarios/integration/. These are deferred design-time decisions that need periodic re-verification (the AGENTS.md router design is the canonical case — see agents-md-router-discipline.yml).

For each scenario:

Run via /mycelium:eval-runner against the scenario file
Compare result against the scenario's baseline_reference field
Report:
- Same outcome → design holding; no action
- Improved → either the design got better OR the model improved; investigate which (a model improvement that hides a design regression is a Goodhart trap)
- Regressed → design drifted; flag for remediation in this assessment

If a scenario fails its success_criteria for the first time, log to corrections.md as a new generalizable correction with the scenario name as evidence. Do not auto-remediate — surface the regression for human review.

3. Run Threshold Calibration

If cycle count ≥ minimum_n for any threshold in .claude/canvas/thresholds.yml:

Apply calibration rules from ${CLAUDE_PLUGIN_ROOT}/engine/adaptive-thresholds.md
Update calibrated values
Log changes in .claude/harness/decision-log.md
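
The minimum_n guard can be illustrated with a small filter. The dictionary shape below is an assumption about how thresholds.yml might look once parsed; only the `minimum_n` field name comes from the text, and the threshold names are made up for the example.

```python
def eligible_thresholds(thresholds, cycle_count):
    """Names of thresholds with enough cycles recorded to be calibrated."""
    return [
        name
        for name, spec in thresholds.items()
        if cycle_count >= spec["minimum_n"]
    ]


# Hypothetical parsed content of thresholds.yml.
example_thresholds = {
    "ice_advance": {"minimum_n": 10},
    "bakeoff_delta": {"minimum_n": 30},
}
```

With 20 recorded cycles, only `ice_advance` would be calibrated; `bakeoff_delta` stays at its current value until its own minimum is reached.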

4. Check Goodhart Counter-Metrics

For each dimension, verify the counter-metric is not degrading:

Velocity improving BUT outcome quality declining? Flag.
Earlier discards BUT false positive rate rising? Flag.
Better calibration BUT decision speed dropping? Flag.
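
One way to express the three pairings above is a lookup from each dimension to its counter-metric. The identifiers are hypothetical labels, and deciding whether something is "improving" or "degrading" is assumed to happen upstream.

```python
COUNTER_METRICS = {
    "cycle_velocity": "outcome_quality",
    "discard_rate": "false_positive_rate",
    "confidence_calibration": "decision_speed",
}


def goodhart_flags(improving_dimensions, degrading_counter_metrics):
    """Flag dimensions that improved while their counter-metric degraded."""
    return [
        f"{dimension} improving but {counter} degrading"
        for dimension, counter in COUNTER_METRICS.items()
        if dimension in improving_dimensions
        and counter in degrading_counter_metrics
    ]
```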

4b. Cluster Graduation-Readiness (added 2026-05-08)

Read .claude/memory/cluster-instances.md. For each cluster:

Compare instance count to graduation criterion. If a cluster has reached or exceeded its stated criterion without being graduated to the corresponding mechanism (e.g., 6+ instances with spec-only status when promotion bar requires implemented detection rules), surface as a graduation-readiness flag.
For spec-status clusters with linked spec docs (e.g., ${CLAUDE_PLUGIN_ROOT}/engine/consistency-check-spec.md): check whether the spec's promotion-bar conditions have been met. Concretely: count detection rules drafted vs. required, FP-rate measurements available vs. needed.
Recursive check: if a cluster's stated graduation criterion has been met for >30 days without graduation action, that's itself an instance of the documented-rule-diverges-from-enforcement cluster — log it.
Output: include cluster status in the dashboard under a new "Cluster Graduation Status" section.

This step closes the recursion the cluster log was created to address: graduation criteria become mechanically auditable rather than promises stored in commit messages.
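
The count comparison and the 30-day recursive check could be sketched as below. The record fields (`instances`, `criterion`, `graduated`, `criterion_met_on`) are hypothetical; the format of cluster-instances.md is not shown here, so parsing is left out.

```python
from datetime import date

OVERDUE_AFTER_DAYS = 30  # from the recursive check above


def graduation_flags(clusters, today):
    """List clusters that met their criterion but have not graduated."""
    flags = []
    for cluster in clusters:
        if cluster["graduated"] or cluster["instances"] < cluster["criterion"]:
            continue
        met_on = cluster["criterion_met_on"]  # may be None if unknown
        overdue = (
            met_on is not None
            and (today - met_on).days > OVERDUE_AFTER_DAYS
        )
        # overdue=True is itself an instance of the
        # documented-rule-diverges-from-enforcement cluster.
        flags.append({"name": cluster["name"], "overdue": overdue})
    return flags
```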

4c. Receipts Highlights Rotation Cadence (added 2026-05-08)

The README's "How Mycelium got smarter" section shows 5 case headers; the full list lives in docs/receipts/cases/. Stale README highlights are a Goodhart signal: if the receipts surface freezes, the framework's "we get smarter with each cycle" claim degrades to "we got smarter once".

For each case currently on the README:

Check git-log staleness: when did the case header last change? If >90 days, flag as a rotation candidate.
Check for newer cases: are there cases under docs/receipts/cases/ newer than the rotation candidate that better demonstrate the framework's recent behavior?
Recommend rotation: surface specific rotate-out / rotate-in pairs in the dashboard. Rotation is a maintainer decision, not automatic — but the flag forces the decision rather than letting it drift.
Highlight gap signal: if no case has been added to docs/receipts/cases/ in >60 days, flag as a possible-low-friction signal — either the framework genuinely caused no recent friction (rare), or the dogfood loop has weakened (usually).

Per docs/contributing/style.md#highlights-rotation. Cases stay in docs/receipts/cases/ even when rotated off README; only the README mention rotates.
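
The two staleness thresholds in this step reduce to simple date arithmetic. The sketch below assumes the dates have already been collected (for example from git history, which is not shown), and the function names are hypothetical.

```python
from datetime import date

STALE_AFTER_DAYS = 90  # README case header unchanged for longer than this
GAP_AFTER_DAYS = 60    # no new case added for longer than this


def rotation_candidates(header_last_changed, today):
    """header_last_changed maps a README case name to its last-change date."""
    return sorted(
        name
        for name, changed in header_last_changed.items()
        if (today - changed).days > STALE_AFTER_DAYS
    )


def highlight_gap(newest_case_added, today):
    """True when no case has been added for more than GAP_AFTER_DAYS."""
    return (today - newest_case_added).days > GAP_AFTER_DAYS
```

The output is only a list of candidates; choosing rotate-out and rotate-in pairs remains a maintainer decision, as stated above.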

4d. Docs Health Cross-Surface (added 2026-05-08)

Run a lightweight version of /mycelium:canvas-health step 9b on docs/:

Stub freshness (any forthcoming-doc stub whose Last updated date is more than 60 days old)
Length budget compliance (hard caps)
Marketing-voice scan
Information-scent scan on links

Surface in the dashboard. Full details delegate to /mycelium:canvas-health.

4e. Chat-UX Axiom Audit of Skill Output Templates (added 2026-05-30)

The chat-UX nudges in ${CLAUDE_PLUGIN_ROOT}/harness/design-principles.md ("the chat is a UI") shape live output, which has no stored corpus to audit retroactively. What is auditable is the static surface that pre-shapes live output: the ## Output/## Output Format blocks in ${CLAUDE_PLUGIN_ROOT}/skills/*/SKILL.md. Scan each for two axiom violations:

Hick's Law — an output template that instructs the agent to present a list of options/recommendations with no "recommend one" / "priority" / "top-N" cue. A template that emits N equally-weighted choices manufactures decision-tax on every invocation. Flag templates with option-lists lacking a recommendation cue.
Von Restorff (isolation) — an output template that renders a blocker, gate, error, or STOP condition as undifferentiated prose rather than a visually distinct marker (ON HOLD, Gated by:, a leading verdict line). Flag blocker-bearing templates whose blocker does not visually pop.

This is the buildable form of the self-audit; the live-output version is unenforceable (no corpus). Surface counts + offending skills in the dashboard. Do not auto-edit skills — flag for maintainer review (a template's flat option-list may be deliberate). Graduation path: if the same skill is flagged across two assessments, promote to a mechanical tests/bash check (then it inherits G-V12 / Check 37).

5. Generate Dashboard

Output

## Framework Health Dashboard

Assessment date: [date]
Cycles analyzed: [N]
Period: [date range]

### Dimensions

| Dimension | Current | Trend | Status | Counter-Metric |
|-----------|---------|-------|--------|----------------|
| Cycle velocity | [X days avg] | [improving/stable/degrading] | [healthy/warning/critical] | Outcome quality: [OK/degrading] |
| Discard rate | [avg phase X] | [earlier/stable/later] | [healthy/warning/critical] | False positive rate: [OK/rising] |
| Confidence calibration | [factor X.XX] | [improving/stable/diverging] | [healthy/warning/critical] | Decision speed: [OK/slowing] |
| Gate effectiveness | [see detail] | — | [healthy/warning/critical] | Flow speed: [OK/slowing] |
| Regression rate | [X%] | [decreasing/stable/increasing] | [healthy/warning/critical] | Innovation rate: [OK/declining] |

### Threshold Calibration

| Threshold | Default | Calibrated | Based On | Change |
|-----------|---------|-----------|----------|--------|
| ICE advance | 100 | [value or "insufficient data"] | N cycles | [+/-] |
| Confidence factor | 1.0 | [value or "insufficient data"] | N cycles | [+/-] |
| Bakeoff delta | 20% | [value or "insufficient data"] | N bakeoffs | [+/-] |

### Pattern Signals Active

[List any active pattern detector signals from ${CLAUDE_PLUGIN_ROOT}/engine/pattern-detector.md]

### Recommendations

[Specific actions based on findings — not generic advice]

Rules

Never modify thresholds without sufficient data (respect minimum_n)
Always check counter-metrics before celebrating improvement
Log all threshold changes in .claude/harness/decision-log.md
If all dimensions are healthy, say so and suggest next review date

Theory Citations

Argyris: Triple-loop learning (learning how to learn)
Forsgren: Accelerate (measuring capabilities, not just outputs)
Goodhart: Counter-metrics for every metric
Deming: Statistical process control (data-driven threshold adjustment)

name	framework-health
description	Evaluate Mycelium's own process effectiveness. Measures cycle velocity, discard trends, confidence calibration, gate effectiveness, regression rate. Run quarterly or every 20 cycles.
metadata	{"instruction_budget":"50","framework_dependency":"mycelium","framework_dependency_note":"This skill is designed to run within the Mycelium framework (https://github.com/haabe/mycelium). Standalone use will skip the canvas state, theory gates, and harness behavior the skill assumes. Install: /plugin install mycelium@haabe/mycelium."}
