name

aeon-skill-evals

description

Validate the output of any installed skill against an assertion manifest — word counts, required patterns, forbidden phrases, required sections, source citation. Detects regressions by diffing vs prior runs (NEW_FAIL / NEW_PASS / CHRONIC / STABLE_FAIL). Bootstrap mode generates a starter manifest from a skill's recent successful runs so manifests aren't written speculatively. Triggers: "evaluate this skill's output", "check skill X for regressions", "bootstrap evals for Y", "did this skill output pass quality gates".

aeon-skill-evals

Quality net for installed skills. Each skill can declare an assertion manifest; outputs are checked against it; failing assertions surface regressions and route concrete fixes.

Manifest format

token-movers:
  min_words: 200
  required_patterns: ["Top movers", "24h"]
  forbidden_patterns: ["I cannot", "as an AI"]
  must_cite_source: true
  min_distinct_items: 5

narrative-tracker:
  min_words: 400
  required_sections: ["TRANSITIONS", "POSITIONS", "MAP"]
  forbidden_patterns: ["exciting", "consider"]
  must_have_position_call: true

Supported assertions: min_words / max_words, required_patterns / forbidden_patterns, required_sections, must_cite_source, min_distinct_items, output_pattern (regex), and per-skill-family custom binary checks.

Operations

eval — run every manifest-defined skill against its latest output.
eval --skill=NAME — one skill.
bootstrap --skill=NAME — generate a starter manifest from recent successful runs.

Regression states

State	Action
`NEW_FAIL`	Passing last run, failing now. Severity scales with pass streak.
`NEW_PASS`	Failing last run, passing now. Log the win.
`CHRONIC`	Failing > 3 consecutive runs. Recommend operator review.
`STABLE_FAIL`	Always failing. Manifest assertion mismatch — flag for review.

State in local evals-state.json.

Bootstrap mode

Samples last 5 successful runs of a skill. Computes:

min_words at p25 of historical runs.
Required patterns from common section headers.
Forbidden patterns from default list (refusals, hedging filler).

Emits the proposed manifest for review. Never auto-commits — assertions need a human signoff.

Rules

Assertions are observations, not specifications. Bootstrap before writing speculatively.
Forbidden patterns catch hallucination markers and refusals. Keep the list tight; don't lint stylistic choices.
Chronic failures get a recommendation, not a re-file.
Manifest changes are reviewed; never auto-edited by this skill.

aeon-skill-evals

Quality net for installed skills. Each skill can declare an assertion manifest; outputs are checked against it; failing assertions surface regressions and route concrete fixes.

Manifest format

token-movers: min_words: 200 required_patterns: ["Top movers", "24h"] forbidden_patterns: ["I cannot", "as an AI"] must_cite_source: true min_distinct_items: 5 narrative-tracker: min_words: 400 required_sections: ["TRANSITIONS", "POSITIONS", "MAP"] forbidden_patterns: ["exciting", "consider"] must_have_position_call: true

Operations

eval — run every manifest-defined skill against its latest output.

eval --skill=NAME — one skill.

bootstrap --skill=NAME — generate a starter manifest from recent successful runs.

Regression states

State

Action

NEW_FAIL

Passing last run, failing now. Severity scales with pass streak.

NEW_PASS

Failing last run, passing now. Log the win.

CHRONIC

Failing > 3 consecutive runs. Recommend operator review.

STABLE_FAIL

Always failing. Manifest assertion mismatch — flag for review.

State in local evals-state.json.

Bootstrap mode

Samples last 5 successful runs of a skill. Computes:

min_words at p25 of historical runs.

Required patterns from common section headers.

Forbidden patterns from default list (refusals, hedging filler).

Emits the proposed manifest for review. Never auto-commits — assertions need a human signoff.

Rules

Assertions are observations, not specifications. Bootstrap before writing speculatively.

Forbidden patterns catch hallucination markers and refusals. Keep the list tight; don't lint stylistic choices.

Chronic failures get a recommendation, not a re-file.

Manifest changes are reviewed; never auto-edited by this skill.

aeon-skill-evals

aeon-skill-evals

Manifest format

Operations

Regression states

Bootstrap mode

Rules

同仓库更多 Skills

同仓库更多 Skills

aeon-skill-evals

Manifest format

Operations

Regression states

Bootstrap mode

Rules