Exécutez n'importe quel Skill dans Manus
en un clic

Exécutez n'importe quel Skill dans Manus en un clic

autonomous-improve

Scores against a rubric, attacks the highest-leverage axis, verifies no regressions, documents the learning, and loops. Re-scores from scratch each iteration.

Exécuter dans Manus

Aperçu

Scores against a rubric, attacks the highest-leverage axis, verifies no regressions, documents the learning, and loops. Re-scores from scratch each iteration.

Commande d'installation

npx skills add https://github.com/oldmangrizzz/HUGH-Sovereign-Guardian --skill autonomous-improve

Copiez et collez cette commande dans Claude Code pour installer le skill

Source

oldmangrizzz/HUGH-Sovereign-Guardian

Étoiles0

Forks0

Mis à jour11 avril 2026 à 20:25

SKILL.md

readonly

name	autonomous-improve
description	Scores against a rubric, attacks the highest-leverage axis, verifies no regressions, documents the learning, and loops. Re-scores from scratch each iteration.

Autonomous Quality Loop — Score, Attack, Verify, Repeat

Goal

Quality doesn't improve through sprints. It improves through iteration — scoring what exists, picking the worst thing, fixing it, and doing it again. The problem is that iteration requires sustained attention. Sessions end. Context resets. The loop stalls.

This kit automates the iteration. It runs a tight cycle: score everything against a rubric, select the single highest-leverage axis, attack it with a real fix, verify the fix worked and nothing regressed, document what was learned, then loop. Each iteration re-scores from scratch — because what you fixed this loop changes the priority order for the next.

The rubric is the key design choice. It's not a checklist. Each axis has three anchors (what a 0 looks like, what a 5 looks like, what a 10 looks like), a weight, and programmatic verification specs where possible. This makes scores arguable, not vibes-based. Three independent evaluators score each axis and the minimum score wins — because a low score from any evaluator is an unresolved problem, and averaging would hide it.

The loop stops when you tell it to, when a quality ceiling is reached (all axes >= 8.0), or when the rubric has been extracted to its ceiling and needs to be re-anchored at a higher standard.

When to Use

A product, module, or codebase that needs systematic quality improvement over time
After shipping a fast MVP that you know has rough edges but aren't sure which to address first
When you want to improve a specific dimension (documentation, test coverage, security posture) but aren't sure where the gaps actually are
As a continuous improvement process across multiple sessions or overnight runs (pairs with the Daemon kit for unattended operation)

Not for: quick bug fixes (use Systematic Debugging), one-off code reviews (use the Review kit), or simple refactors. The loop has overhead. It earns its cost on targets where the improvement space is wide and the priority isn't obvious.

Good first run: use --score-only to get your baseline before committing to improvement loops. You'll see where you actually stand, which makes the case for how many loops are warranted.

Setup

Create a rubric before running your first loop.

The rubric is a markdown file at .planning/rubrics/{target}.md. If it doesn't exist, the kit's Phase 0 drafts one and asks for your approval.

A good rubric has:

8-14 axes organized into 3-5 categories (don't go narrow, don't go sprawling)
Each axis with a weight (0.0-1.0), a category, and three anchors (0/5/10)
Programmatic verification specs for at least half the axes (things that can be checked with a script, not just judged by eye)

Why approval is required: the rubric defines what "better" means. If the axes are wrong, the loop optimizes the wrong things. Phase 0 drafts; you decide.

Directory structure: the kit uses .planning/ for state. Create it at your project root if it doesn't exist, or let Phase 0 create it on first run.

Steps

Phase 0: Rubric Bootstrap (first run only, requires your approval)

Runs only when .planning/rubrics/{target}.md doesn't exist.

If research on comparable products exists in .planning/research/, reads it
If not, runs a multi-scout research pass to understand the quality landscape
Drafts 8-14 axes with weights, categories, anchors, and verification specs
Presents the draft with rationale for each axis
Stops. Waits for your approval. You edit the rubric, confirm it, and the loop begins.

This is the only mandatory human gate. Everything after this is autonomous.

Phase 1: Score (every loop, no exceptions, no caching)

Three independent evaluator agents score every axis in the rubric simultaneously. Each evaluator receives:

The full rubric with axis definitions and anchors
Read access to the target
Their assigned persona (Experienced User, Newcomer, or Critical Reviewer)
Instructions to score independently — they don't see each other's scores

For each axis, the final score is the minimum of the three evaluators. Not the average. A score of 6, 8, 7 produces a final score of 6. A low score from any evaluator means there's a real unresolved problem somewhere. The minimum surfaces it.

Programmatic checks run in parallel with evaluator scoring. A failed programmatic check caps the axis at 5, regardless of evaluator scores.

The scorecard:

Axis                    | A  | B  | C  | Prog | Final | Delta
------------------------|----|----|----|----- |-------|------
security_posture        | 7  | 8  | 6  | PASS |  6.0  |
test_coverage           | 4  | 3  | 5  | FAIL |  3.0  | cap
documentation_accuracy  | 6  | 6  | 7  | PASS |  6.0  |

Phase 2: Select

Choose the single axis to attack this loop using a selection formula:

score(axis) = (10 - current_score) × weight × effort_multiplier × recency_penalty

effort_multiplier: low-effort changes = 1.0, medium = 0.7, high-effort = 0.4 (lower-effort improvements get priority — more ground covered per loop)
recency_penalty: 0.5 if this axis was attacked in either of the last two loops (prevents fixation on one axis while others deteriorate)

The selection and its rationale are shown before any attack begins. You can override with --axis {name} to force a specific axis regardless of scoring.

Phase 3: Attack

Execute the improvement. Strategy depends on the axis category:

Technical axes (test coverage, reliability, API consistency): run targeted experiments, measure before/after, commit only if improvement is verified
Documentation axes: read current docs, identify specific gaps or inaccuracies, rewrite the specific sections — not wholesale rewrites unless the score is below 3
Experience axes (onboarding, error recovery, discoverability): structural fixes plus documentation updates plus verification of the actual user path
Security axes: read the specific code involved, make targeted changes, run programmatic checks before committing

Each approach is tried in isolation before committing. If multiple approaches are evaluated, the decision record is written to the loop log — future loops can read why the winner won.

Phase 4: Verify

After the attack, re-score only the targeted axis (not a full re-score — that's Phase 1 of the next loop):

Programmatic: run the axis's specific checks — do they now pass?
Structural: verify structural requirements are met
Perceptual: spawn a single Newcomer evaluator (the hardest-to-satisfy persona) to score the targeted axis independently
Behavioral simulation (for onboarding/UX axes): follow the actual user path in a clean environment with no prior knowledge — does it work?

Regression check: re-run programmatic checks on every axis that shares files with the changes. If anything that was passing now fails: revert the changes. Don't commit improvements that break something else.

A behavioral FAIL overrides a passing perceptual score. If the user path is broken, the loop doesn't commit regardless of what the evaluator said.

Phase 5: Document

Every loop gets a log at .planning/improvement-logs/{target}/loop-{n}.md. Even loops that end in no-change or revert. Especially those.

The log records:

The full scorecard with deltas from the previous loop
What was changed and why that approach was chosen
Verification results across all four tiers
What was learned that future loops should know

This institutional memory is what makes later loops smarter than earlier ones.

Phase 6: Loop or Exit

The loop exits when:

A loop count (--loops N) is specified and N loops have completed
All axes score >= 8.0 (quality ceiling reached)
No axis improved > 0.5 in either of the last two loops and at least 3 loops have completed (plateau — Level-Up Protocol is triggered)
You say stop

On plateau (Level-Up Protocol): the current rubric has been extracted to its ceiling. The kit freezes the current state, proposes re-anchored axes where the current 10 becomes the new 5, and waits for your approval before continuing. This is how the quality bar gets raised, not just approached.

Constraints

Phase 0 requires your approval. The rubric is the only thing that requires human judgment. Once it's set, the loop runs autonomously.
The loop never writes to the live rubric. Proposed axis additions or re-anchorings go to .planning/rubrics/{target}-proposals.md. You approve before they take effect.
Level-Up Protocol requires your approval. The loop halts and waits. It doesn't self-upgrade the rubric.
Regressions block commits. If the fix breaks something that was passing, the change is reverted. The loop continues without a commit for that iteration.
Security axis failures are blocking. A failed security programmatic check halts the entire loop — it doesn't get treated as "just a low score."

Safety Notes

Use --score-only first to see your baseline before spending tokens on attack loops. Knowing your scores in advance helps you calibrate how many loops are warranted.
Each loop commits its changes separately. If a loop produces an unwanted result, git revert <commit> undoes that loop's changes without touching others.
For overnight / unattended runs, pair this kit with the Daemon kit, which handles session continuity and budget enforcement automatically.
Set a loop count (--loops 3) for your first run rather than running until ceiling. It's easier to add loops than to undo them.

Plus depuis ce dépôt

même dépôt

code-review-5pass

oldmangrizzz/HUGH-Sovereign-Guardian

5 passes: correctness, security, performance, readability, consistency. Every finding: file, line, severity, fix. Ends with PASS/CONDITIONAL/FAIL verdict.

2026-04-110

codebase-map

oldmangrizzz/HUGH-Sovereign-Guardian

Structural codebase index: files, exports, imports, dependency graph, roles. Keyword search. Inject compact slices into agents to cut token usage 60-80%.

2026-04-110

second-brain

oldmangrizzz/HUGH-Sovereign-Guardian

Persistent memory engine for AI agents with graph-first retrieval, contradiction tracking, temporal decay, and 3D brain visualization.

2026-04-110

structured-research

oldmangrizzz/HUGH-Sovereign-Guardian

Turns questions into findings with confidence levels and citations. 2-4 queries, 3-6 sources, persistent doc. Scales to parallel multi-scout investigation.

2026-04-110

systematic-debugging

oldmangrizzz/HUGH-Sovereign-Guardian

4-phase debug protocol: observe → hypothesize → verify → fix. No code changes without a confirmed root cause. Emergency stop after two failed attempts.

2026-04-110

client-health-monitor

oldmangrizzz/HUGH-Sovereign-Guardian

Daily automated relationship health check for small business operators running OpenClaw.

2026-04-110

Source

oldmangrizzz

oldmangrizzz/HUGH-Sovereign-Guardian

Ouvrir le dépôt GitHub Voir les dépôts du créateur

Commande d'installation

Téléchargement

Exécuter dans Manus

Utile pourSOC

Développeurs de logicielsProfessions informatiques et mathématiques15-1252L4

name	autonomous-improve
description	Scores against a rubric, attacks the highest-leverage axis, verifies no regressions, documents the learning, and loops. Re-scores from scratch each iteration.

Autonomous Quality Loop — Score, Attack, Verify, Repeat

Goal

The loop stops when you tell it to, when a quality ceiling is reached (all axes >= 8.0), or when the rubric has been extracted to its ceiling and needs to be re-anchored at a higher standard.

When to Use

A product, module, or codebase that needs systematic quality improvement over time
After shipping a fast MVP that you know has rough edges but aren't sure which to address first
When you want to improve a specific dimension (documentation, test coverage, security posture) but aren't sure where the gaps actually are
As a continuous improvement process across multiple sessions or overnight runs (pairs with the Daemon kit for unattended operation)

Good first run: use --score-only to get your baseline before committing to improvement loops. You'll see where you actually stand, which makes the case for how many loops are warranted.

Setup

Create a rubric before running your first loop.

The rubric is a markdown file at .planning/rubrics/{target}.md. If it doesn't exist, the kit's Phase 0 drafts one and asks for your approval.

A good rubric has:

8-14 axes organized into 3-5 categories (don't go narrow, don't go sprawling)
Each axis with a weight (0.0-1.0), a category, and three anchors (0/5/10)
Programmatic verification specs for at least half the axes (things that can be checked with a script, not just judged by eye)

Why approval is required: the rubric defines what "better" means. If the axes are wrong, the loop optimizes the wrong things. Phase 0 drafts; you decide.

Directory structure: the kit uses .planning/ for state. Create it at your project root if it doesn't exist, or let Phase 0 create it on first run.

Steps

Phase 0: Rubric Bootstrap (first run only, requires your approval)

Runs only when .planning/rubrics/{target}.md doesn't exist.

If research on comparable products exists in .planning/research/, reads it
If not, runs a multi-scout research pass to understand the quality landscape
Drafts 8-14 axes with weights, categories, anchors, and verification specs
Presents the draft with rationale for each axis
Stops. Waits for your approval. You edit the rubric, confirm it, and the loop begins.

This is the only mandatory human gate. Everything after this is autonomous.

Phase 1: Score (every loop, no exceptions, no caching)

Three independent evaluator agents score every axis in the rubric simultaneously. Each evaluator receives:

The full rubric with axis definitions and anchors
Read access to the target
Their assigned persona (Experienced User, Newcomer, or Critical Reviewer)
Instructions to score independently — they don't see each other's scores

Programmatic checks run in parallel with evaluator scoring. A failed programmatic check caps the axis at 5, regardless of evaluator scores.

The scorecard:

Axis                    | A  | B  | C  | Prog | Final | Delta
------------------------|----|----|----|----- |-------|------
security_posture        | 7  | 8  | 6  | PASS |  6.0  |
test_coverage           | 4  | 3  | 5  | FAIL |  3.0  | cap
documentation_accuracy  | 6  | 6  | 7  | PASS |  6.0  |

Phase 2: Select

Choose the single axis to attack this loop using a selection formula:

score(axis) = (10 - current_score) × weight × effort_multiplier × recency_penalty

effort_multiplier: low-effort changes = 1.0, medium = 0.7, high-effort = 0.4 (lower-effort improvements get priority — more ground covered per loop)
recency_penalty: 0.5 if this axis was attacked in either of the last two loops (prevents fixation on one axis while others deteriorate)

The selection and its rationale are shown before any attack begins. You can override with --axis {name} to force a specific axis regardless of scoring.

Phase 3: Attack

Execute the improvement. Strategy depends on the axis category:

Technical axes (test coverage, reliability, API consistency): run targeted experiments, measure before/after, commit only if improvement is verified
Documentation axes: read current docs, identify specific gaps or inaccuracies, rewrite the specific sections — not wholesale rewrites unless the score is below 3
Experience axes (onboarding, error recovery, discoverability): structural fixes plus documentation updates plus verification of the actual user path
Security axes: read the specific code involved, make targeted changes, run programmatic checks before committing

Each approach is tried in isolation before committing. If multiple approaches are evaluated, the decision record is written to the loop log — future loops can read why the winner won.

Phase 4: Verify

After the attack, re-score only the targeted axis (not a full re-score — that's Phase 1 of the next loop):

Programmatic: run the axis's specific checks — do they now pass?
Structural: verify structural requirements are met
Perceptual: spawn a single Newcomer evaluator (the hardest-to-satisfy persona) to score the targeted axis independently
Behavioral simulation (for onboarding/UX axes): follow the actual user path in a clean environment with no prior knowledge — does it work?

A behavioral FAIL overrides a passing perceptual score. If the user path is broken, the loop doesn't commit regardless of what the evaluator said.

Phase 5: Document

Every loop gets a log at .planning/improvement-logs/{target}/loop-{n}.md. Even loops that end in no-change or revert. Especially those.

The log records:

The full scorecard with deltas from the previous loop
What was changed and why that approach was chosen
Verification results across all four tiers
What was learned that future loops should know

This institutional memory is what makes later loops smarter than earlier ones.

Phase 6: Loop or Exit

The loop exits when:

A loop count (--loops N) is specified and N loops have completed
All axes score >= 8.0 (quality ceiling reached)
No axis improved > 0.5 in either of the last two loops and at least 3 loops have completed (plateau — Level-Up Protocol is triggered)
You say stop

Constraints

Phase 0 requires your approval. The rubric is the only thing that requires human judgment. Once it's set, the loop runs autonomously.
The loop never writes to the live rubric. Proposed axis additions or re-anchorings go to .planning/rubrics/{target}-proposals.md. You approve before they take effect.
Level-Up Protocol requires your approval. The loop halts and waits. It doesn't self-upgrade the rubric.
Regressions block commits. If the fix breaks something that was passing, the change is reverted. The loop continues without a commit for that iteration.
Security axis failures are blocking. A failed security programmatic check halts the entire loop — it doesn't get treated as "just a low score."

Safety Notes

Use --score-only first to see your baseline before spending tokens on attack loops. Knowing your scores in advance helps you calibrate how many loops are warranted.
Each loop commits its changes separately. If a loop produces an unwanted result, git revert <commit> undoes that loop's changes without touching others.
For overnight / unattended runs, pair this kit with the Daemon kit, which handles session continuity and budget enforcement automatically.
Set a loop count (--loops 3) for your first run rather than running until ceiling. It's easier to add loops than to undo them.