earos-calibrate

Updated March 22, 2026 at 15:34

Run EAROS calibration exercises to validate rubric reliability before production use. Use this skill whenever someone wants to calibrate a rubric, validate inter-rater reliability, compare scores against gold-standard artifacts, measure scoring consistency, or says "calibrate this rubric", "run calibration", "check if the rubric is reliable", "compare my scores to the gold set", "test this profile against examples", "is this rubric ready for production", "what is our kappa", "measure agreement between reviewers", "validate a new profile", or "how well does the rubric score consistently". Calibration is required before any new profile can move from draft to candidate status.

Source: ThomasRohde/EAROS

EAROS Calibrate Skill

You are running an EAROS calibration exercise. Calibration validates that a rubric produces consistent, reliable scores across reviewers and artifacts before it enters a governance process.

Why calibration matters: A rubric that produces inconsistent scores is not a quality gate — it is noise. Without calibration, two reviewers applying the same rubric will score the same artifact differently, governance decisions will be arbitrary, and the framework loses credibility. Calibration makes the rubric trustworthy by measuring and improving its reproducibility.

Target reliability metrics:

Binary agreement (exact match): > 95%
Ordinal Cohen's κ: > 0.70 for well-defined criteria; > 0.50 for subjective criteria
Spearman ρ (overall score correlation across artifacts): > 0.80
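
As a rough illustration of the κ target above, here is a minimal sketch of Cohen's kappa for two raters on an ordinal scale. The function name, the `levels` argument, and the default of linear weights are assumptions made for this example; the formulas in references/agreement-metrics.md take precedence if they differ.

```python
from collections import Counter


def weighted_kappa(scores_a, scores_b, levels, weighting="linear"):
    """Cohen's kappa for two raters; weighting is 'none', 'linear', or 'quadratic'."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both raters must score the same non-empty set of items")
    if len(levels) < 2:
        raise ValueError("at least two scale levels are required")
    if weighting not in ("none", "linear", "quadratic"):
        raise ValueError("unknown weighting: " + repr(weighting))

    n = len(scores_a)
    k = len(levels)
    position = {level: i for i, level in enumerate(levels)}

    def disagreement(i, j):
        # Penalty for rater A choosing level i while rater B chose level j.
        if weighting == "none":
            return 0.0 if i == j else 1.0
        distance = abs(i - j) / (k - 1)
        return distance if weighting == "linear" else distance ** 2

    pairs = Counter((position[a], position[b]) for a, b in zip(scores_a, scores_b))
    totals_a = Counter(position[a] for a in scores_a)
    totals_b = Counter(position[b] for b in scores_b)

    # Observed and chance-expected weighted disagreement.
    observed = sum(disagreement(i, j) * c / n for (i, j), c in pairs.items())
    expected = sum(
        disagreement(i, j) * totals_a[i] * totals_b[j] / (n * n)
        for i in range(k)
        for j in range(k)
    )
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one identical level")
    return 1.0 - observed / expected
```

A call such as `weighted_kappa(gold_scores, agent_scores, levels=range(5))` assumes a five-level scale, which the text here does not confirm. With `weighting="none"` the function reduces to plain (unweighted) Cohen's kappa.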

Critical: Do NOT look at gold-set benchmark scores until after completing your independent assessment. True calibration requires independent scoring first.

Step 0 — Load Calibration Inputs

Read these files:

core/core-meta-rubric.yaml
The profile or overlay being calibrated (ask if not specified; scan profiles/ and overlays/)
calibration/gold-set/ — scan for existing reference artifacts and their benchmark scores
calibration/results/ — scan for prior calibration runs (to understand trends)
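
The input scan above could be sketched as follows. The helper name and the returned dictionary keys are hypothetical; only the directory and file paths come from the list above, and the sketch lists files without parsing them.

```python
from pathlib import Path


def find_calibration_inputs(repo_root):
    """List the files Step 0 asks for, relative to an EAROS repository root."""
    root = Path(repo_root)

    def files_under(relative):
        # Return every file below the folder, or an empty list if it is absent.
        folder = root / relative
        if not folder.is_dir():
            return []
        return sorted(p for p in folder.rglob("*") if p.is_file())

    core = root / "core" / "core-meta-rubric.yaml"
    return {
        "core_rubric": core if core.is_file() else None,
        "profiles": files_under("profiles"),
        "overlays": files_under("overlays"),
        "gold_set": files_under("calibration/gold-set"),
        "prior_results": files_under("calibration/results"),
    }
```

An empty `gold_set` list corresponds to the stop condition in Step 1.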

Ask the user:

Which rubric/profile is being calibrated? (if not specified)
Are there artifacts to calibrate against, or should I use the gold-set?
Solo calibration (agent vs. gold-set) or multi-evaluator reconciliation?

Step 1 — Artifact Inventory

List available calibration artifacts. For each:

Artifact ID, title, type
Expected quality category: strong (≥3.2), adequate (2.4–3.19), weak (<2.4), borderline
Known benchmark scores (if prior calibration exists)
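
The numeric thresholds above map onto a small helper like the one below. This is a sketch with a hypothetical function name; "borderline" has no numeric definition in the list, so it is left to reviewer judgment and not returned here.

```python
def quality_category(overall_score):
    """Map an overall score to the expected quality category."""
    if overall_score >= 3.2:
        return "strong"
    if overall_score >= 2.4:
        return "adequate"
    return "weak"
```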

If no gold-set artifacts exist, stop and tell the user:

"Calibration requires at least 3 artifacts: 1 strong (should score ≥3.2), 1 weak (should score <2.4), and 1 ambiguous (borderline case). The spread across quality levels is important — calibration against only strong artifacts doesn't test whether the rubric correctly identifies weaknesses. Please provide these artifacts or their paths."

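The minimum gold-set requirement quoted above can be checked mechanically. The sketch below assumes the inventory is available as a mapping from artifact ID to expected category label; that shape, the function name, and the artifact IDs in the usage note are illustrative only.

```python
REQUIRED_CATEGORIES = ("strong", "weak", "borderline")


def gold_set_gaps(expected_categories):
    """Return a list of reasons the gold set is insufficient (empty if it is fine)."""
    problems = []
    if len(expected_categories) < 3:
        problems.append(
            f"need at least 3 artifacts, found {len(expected_categories)}"
        )
    present = set(expected_categories.values())
    for category in REQUIRED_CATEGORIES:
        if category not in present:
            problems.append(f"missing a {category} artifact")
    return problems
```

For example, `gold_set_gaps({"A1": "strong", "A2": "strong", "A3": "strong"})` reports the missing weak and borderline artifacts, which is the situation the message warns about.
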
Step 2 — Independent Scoring

For each calibration artifact, run a full EAROS assessment using the earos-assess skill protocol:

Follow the full 8-step DAG
Score every criterion independently before looking at gold-set benchmark scores
Record evidence anchors, evidence classes, confidence, and rationale for every score

This step cannot be skipped or abbreviated. Independent scoring is the entire point of calibration. If you score after seeing the benchmark, you measure nothing.

For the full assessment protocol, see .agents/skills/earos-assess/SKILL.md.

Step 3 — Score Comparison

After completing independent scoring for all artifacts, compare against the gold-set:

artifact_id: [ID]
criterion_id: [ID]
gold_score: [benchmark]
agent_score: [your score]
delta: [gold - agent]  # positive = agent under-scored; negative = agent over-scored
delta_abs: [abs(delta)]
agreement: exact | within_1 | disagreement  # disagreement = delta_abs >= 2
evidence_quality_match: yes | partial | no

Read references/calibration-protocol.md for the full comparison procedure and how to handle cases where you believe the gold-set benchmark may itself be wrong.
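
The derived fields of the comparison record follow directly from the two scores. A minimal sketch, assuming numeric scores; the function name is hypothetical, and `evidence_quality_match` is passed through unchanged because it is a reviewer judgment, not something computed.

```python
def compare_scores(artifact_id, criterion_id, gold_score, agent_score,
                   evidence_quality_match="partial"):
    """Build one comparison record for a single criterion on a single artifact."""
    delta = gold_score - agent_score  # positive = agent under-scored
    delta_abs = abs(delta)
    if delta_abs == 0:
        agreement = "exact"
    elif delta_abs >= 2:
        agreement = "disagreement"
    else:
        agreement = "within_1"
    return {
        "artifact_id": artifact_id,
        "criterion_id": criterion_id,
        "gold_score": gold_score,
        "agent_score": agent_score,
        "delta": delta,
        "delta_abs": delta_abs,
        "agreement": agreement,
        "evidence_quality_match": evidence_quality_match,
    }
```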

Step 4 — Agreement Metric Computation

Read references/agreement-metrics.md before this step. It contains the formulas, computation steps, and interpretation guidance.

Key metrics to compute:

Binary agreement: (exact matches) / (total scored criteria)

Per-criterion reliability flag (max_delta is the largest delta_abs for the criterion across all artifacts):

reliable: max_delta ≤ 1 across all artifacts
moderate: max_delta = 2 in isolated cases
unreliable: max_delta ≥ 2 systematically

Overall Spearman ρ: rank correlation of overall scores across artifacts

Verdict: pass_for_production / borderline / not_ready
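
The first three metrics above can be sketched in plain Python. The function names are hypothetical. The text does not define "isolated" versus "systematic", so the `isolated_limit` parameter (at most one artifact with a delta of exactly 2) is an assumption. The verdict rule is not implemented because its thresholds live in references/agreement-metrics.md.

```python
def binary_agreement(deltas_abs):
    """Share of scored criteria where gold and agent scores match exactly."""
    return sum(1 for d in deltas_abs if d == 0) / len(deltas_abs)


def reliability_flag(deltas_abs, isolated_limit=1):
    """Flag one criterion from its delta_abs values across all artifacts."""
    large = [d for d in deltas_abs if d >= 2]
    if not large:
        return "reliable"
    # Assumption: "isolated" means at most isolated_limit artifacts at delta 2.
    if max(large) == 2 and len(large) <= isolated_limit:
        return "moderate"
    return "unreliable"


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for idx in order[start:end + 1]:
            ranks[idx] = shared
        start = end + 1
    return ranks


def spearman_rho(gold_overall, agent_overall):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(gold_overall), average_ranks(agent_overall)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined when all scores in one list are tied")
    return cov / (vx * vy) ** 0.5
```

With only three artifacts, ρ can take very few distinct values, so a larger gold set gives a more informative correlation.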

Step 5 — Root Cause Analysis

For each disagreement (delta_abs ≥ 2) or unreliable criterion, investigate:

Ambiguous level descriptor? — Does scoring_guide clearly distinguish adjacent levels?
Missing decision tree? — Does the criterion have a decision_tree?
Evidence classification issue? — Observed vs. inferred disagreement?
Anti-pattern match? — Did the artifact exhibit an anti-pattern scored differently by the gold-set?
Context sensitivity? — Is this criterion's meaning different for different artifact sub-types?

For each root cause, recommend a specific rubric improvement (which field to change and how).

Step 6 — Calibration Report

Save the report to calibration/results/[rubric-id]-calibration-[YYYY-MM-DD].yaml

Read references/calibration-protocol.md#report-format for the full YAML and markdown report templates.
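
The report filename pattern above can be built as shown in this small sketch. The function name and the `my-profile` rubric ID in the usage note are placeholders; the report contents themselves should follow the templates in the reference file.

```python
from datetime import date
from pathlib import Path


def report_path(rubric_id, run_date):
    """Return calibration/results/[rubric-id]-calibration-[YYYY-MM-DD].yaml."""
    filename = f"{rubric_id}-calibration-{run_date.isoformat()}.yaml"
    return Path("calibration") / "results" / filename
```

For example, `report_path("my-profile", date.today())` yields a path under calibration/results/ stamped with today's date.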

Step 7 — Recalibration Triggers

After saving results, check whether recalibration is needed sooner than the standard 6-month cycle. Recalibrate when:

Profile criteria change materially (version bump)
New overlay introduced
Agreement drops below targets on any criterion
New artifact formats appear (new diagramming tools, document formats)
Agent model changes materially
Governance expectations change

Tell the user: "Schedule next calibration check: [6 months from today or at next profile revision]."
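
The "6 months from today" date in that message can be computed with the standard library alone. This is a sketch with a hypothetical helper name; clamping to the last day of a shorter month is a design choice made here, not something the skill specifies.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Shift a date forward by whole months, clamping the day if needed."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def next_calibration_check(today=None):
    """Standard cycle: six months after the calibration run."""
    return add_months(today or date.today(), 6)
```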

Non-Negotiable Rules

Score independently first. Never look at gold-set benchmark scores before producing your own assessment.
Don't calibrate to pass. If you systematically disagree with the gold-set, flag it — the gold-set may need review too.
Unreliable criteria must be fixed before production. κ < 0.50 = not usable in governance.
Calibration is ongoing. A profile that passes today must be recalibrated after any material change.

When to Read Which Reference File

| When | Read |
| --- | --- |
| Before Step 3 (always) | `references/calibration-protocol.md` |
| Computing agreement metrics (Step 4) | `references/agreement-metrics.md` |
| Interpreting κ values | `references/agreement-metrics.md#interpretation` |
| Writing the calibration report | `references/calibration-protocol.md#report-format` |
| Investigating disagreements (Step 5) | `references/calibration-protocol.md#root-cause-analysis` |
| Unsure if gold-set benchmark is correct | `references/calibration-protocol.md#gold-set-disagreement` |
