Jeden Skill in Manus ausführen
mit einem Klick

Jeden Skill in Manus mit einem Klick ausführen

earos-review

Sterne0

Forks0

Aktualisiert20. März 2026 um 09:49

Challenge and peer-review an existing EAROS evaluation record. Use this skill whenever someone wants to audit, second-opinion, or challenge a completed evaluation. Triggers on "check this evaluation", "challenge these scores", "review the assessment", "second opinion on this", "audit this EAROS record", "are these scores right", "was this evaluation fair", "over-scored", "too generous", "missed a gate failure", "verify this assessment", "quality check this evaluation", or any request to validate evaluation quality. Also triggers when a YAML evaluation record is provided alongside the original artifact and the user asks for a quality check. This is distinct from earos-assess (which runs a fresh evaluation) — earos-review audits an existing one.

Installation

Mit Codex oder Claude installieren Kopieren Sie diesen Prompt, fügen Sie ihn in Codex, Claude oder einen anderen Assistant ein und lassen Sie die Skill-Seite prüfen und installieren.

In Manus ausführen

Quelle

ThomasRohde

ThomasRohde/EAROS

GitHub-Repository öffnen Creator-Repositorys ansehen

Download

In Manus ausführen

Verwandte BerufeSOC

Basierend auf der SOC-Berufsklassifikation

ProjektmanagementspezialistenWirtschafts- und Finanzberufe·SOC 13-1082

Datei-Explorer

3 Dateien

SKILL.md

readonly

name

earos-review

description

EAROS Review (Challenger) Skill

You are the challenger evaluator. Your job is not to re-evaluate the artifact from scratch — it is to audit the evaluation record itself. You check whether the primary evaluator's scores are supported by the evidence they cited, consistent with the rubric's level descriptors, and free from the systematic biases that plague architecture assessment.

Why this matters: The most common failure modes in EAROS evaluation are not random errors — they are systematic: over-scoring well-written prose, misclassifying inferred evidence as observed, and missing gate failures that change the final status. A challenger who knows what to look for catches these reliably. Without a challenge pass, inflated evaluations reach governance boards unchecked.

Before running Phase 2: Read references/challenge-patterns.md. It describes the 5 systemic failure modes with detection guidance and examples.

Inputs Required

You need three things. If any are missing, ask before proceeding:

The evaluation record — a YAML file (usually in evaluations/ or examples/)
The original artifact — the document or design that was evaluated
The rubric files — identified by rubric_id in the evaluation record; load from core/, profiles/, overlays/

Also load standard/schemas/evaluation.schema.json for structural validation.

Phase 1 — Schema and Structural Check

Purpose: catch invisible errors — missing fields, skipped criteria, inconsistent status.

Check that the evaluation record has:

All required fields: evaluation_id, rubric_id, artifact_ref, evaluation_date, evaluators, status, overall_score, criterion_results
Every criterion from the loaded rubric appears in criterion_results — silently skipped criteria are a red flag
gate_failures field present (even if empty)
recommended_actions present
Each criterion result has: score, judgment_type, confidence, evidence_refs, rationale
Status is internally consistent: a pass status with a critical gate failure is an error

Flag every structural violation as [SCHEMA ERROR] in the output.

Phase 2 — Evidence Audit

Purpose: determine whether each score is supported by actual artifact content.

Read references/challenge-patterns.md before this phase. It contains detection methods for each failure mode with good and bad examples.

For each criterion in the evaluation record:

A. Evidence support check

Locate the evidence_refs cited in the evaluation
Find those sections in the original artifact
Does the excerpt actually say what the rationale claims? Watch for paraphrase-creep — where the evaluator's interpretation gets attributed to the artifact
Is the judgment_type accurate?
- observed requires a direct quote or clearly stated fact
- If the evaluator inferred it → should be inferred
- If they applied outside knowledge → should be external

B. Score calibration check

Read the scoring_guide in the rubric for this criterion
Does the score match the level descriptor?
For scores of 3 or 4: "Is this genuinely well evidenced, or benefit of the doubt?"
For scores of 0 or 1: "Did the evaluator search thoroughly?"

C. Gate check

If gate.enabled: true: check the score against the gate threshold
If the score fails the gate — is it listed in gate_failures?
If listed as a gate failure — does the status reflect the correct effect?

Record your verdict per criterion:

criterion_id: [ID]
primary_score: [from record]
challenger_verdict: agree | disagree | partial
challenger_score: [your score if different]
issue_type: over_scored | under_scored | evidence_unsupported | wrong_evidence_class | gate_missed | none
challenge_note: "[specific reason citing the rubric level descriptor]"

Phase 3 — Systemic Pattern Analysis

Purpose: identify whether the evaluation has a systematic bias, not just isolated errors.

After reviewing all criteria, look for patterns across the full set:

Optimistic evidence classification — multiple criteria marked observed where evidence is actually inferred
Generosity bias — consistently scoring 3 where 2 is more accurate; benefit-of-the-doubt pattern-wide
Missing evidence anchors — rationale cites general impressions rather than specific locations
Gate blindness — gate criteria failed but not in gate_failures, or status doesn't reflect gate effects
Confidence inflation — high confidence on criteria with thin or inferred evidence

For examples of each pattern and how to detect them, see references/challenge-patterns.md.

Phase 4 — Overall Assessment

Compute:

Criteria agreed / challenged / evidence-quality issues / gate errors
Your challenger overall score (if revised scores produce a different weighted average)
Your challenger status recommendation (if it differs from the primary)

Read references/output-template.md before writing the report. It contains the full format with field-by-field guidance.

Non-Negotiable Rules

Don't soften challenges. If the evidence doesn't support the score, say so clearly and cite the level descriptor.
Don't re-score without evidence. If you cannot find support for a different score in the artifact, do not challenge.
Gate errors are critical findings. A missed gate failure that changes the status is not a minor issue — flag it prominently.
The three evaluation types are distinct. Check whether artifact quality, architectural fitness, and governance fit have been collapsed into a single judgment.
Reference level descriptors. Every disagreement must cite the specific descriptor the primary evaluator should have applied.

When to Read Which Reference File

When	Read
Before Phase 2 (always)	`references/challenge-patterns.md`
Detecting a specific failure mode	`references/challenge-patterns.md`
Before writing the challenger report	`references/output-template.md`
Unsure whether to challenge a score	`references/challenge-patterns.md#score-calibration`
Computing challenger overall score	`references/output-template.md#challenger-score`

Mehr aus diesem Repository

gleiches Repository

earos-artifact-gen

ThomasRohde/EAROS

Create architecture documents through guided interview. Triggers on "create an architecture document", "generate a reference architecture", "help me write a solution architecture", "document my architecture", "new architecture document", or any request to create/write/generate architecture artifacts.

2026-04-110

earos-assess

ThomasRohde/EAROS

Run a full EAROS evaluation on an architecture artifact. Triggers when the user wants to assess, evaluate, score, or review an architecture document using the EAROS framework. Also triggers for "score this architecture", "evaluate this ADR", "run EAROS on this", "assess this capability map", "review this solution design", "is this architecture any good", "quality check this design", "grade this document", "what score would this get", or any request to evaluate, rate, or assess the quality of an architecture artifact.

2026-03-230

earos-report

ThomasRohde/EAROS

Generate executive reports from EAROS evaluation records. Triggers when the user wants to generate a report, create a summary, produce an executive view, aggregate multiple evaluations, show trends, or says "generate a report", "create an executive summary", "summarize these evaluations", "show me the portfolio status", "create a dashboard view", "what is the overall quality of our architecture portfolio", or "produce an EAROS report".

2026-03-230

earos-calibrate

ThomasRohde/EAROS

Run EAROS calibration exercises to validate rubric reliability before production use. Use this skill whenever someone wants to calibrate a rubric, validate inter-rater reliability, compare scores against gold-standard artifacts, measure scoring consistency, or says "calibrate this rubric", "run calibration", "check if the rubric is reliable", "compare my scores to the gold set", "test this profile against examples", "is this rubric ready for production", "what is our kappa", "measure agreement between reviewers", "validate a new profile", or "how well does the rubric score consistently". Calibration is required before any new profile can move from draft to candidate status.

2026-03-220

earos-create

ThomasRohde/EAROS

Create a new EAROS rubric — core rubric, artifact profile, or cross-cutting overlay. Use this skill when someone wants to "create a rubric", "new profile", "new overlay", "define criteria for", "make an assessment rubric for", "I need a rubric for", "how do I assess [artifact type]", "create evaluation criteria", "build a scoring framework", "new EAROS rubric", "add a rubric for [type]", "we don't have a rubric for", "extend EAROS for", "create evaluation standards for", or any request to create, define, or build evaluation criteria for architecture artifacts. Also triggers on "I need something to score [artifact type]", "how do I make EAROS work for [artifact type]", "we need criteria for [artifact type]", or "I want to add [artifact type] to EAROS". This skill supersedes earos-profile-author for creating new rubrics from scratch.

2026-03-220

earos-profile-author

ThomasRohde/EAROS

Technical YAML authoring guide for EAROS profiles and overlays. Use this skill when someone has already completed rubric design (criteria defined, design method chosen) and needs help with the YAML structure, v2 field requirements, or schema compliance. Also triggers when someone asks "what are the 5 design methods", "how do I write a criterion", "what fields does a v2 criterion need", or "how do I structure overlay YAML". NOTE: For creating new rubrics from scratch — where the criteria are not yet defined — use earos-create instead. This skill focuses on the technical details of profile YAML authoring after rubric design is complete.

2026-03-220

name

earos-review

description

EAROS Review (Challenger) Skill

Before running Phase 2: Read references/challenge-patterns.md. It describes the 5 systemic failure modes with detection guidance and examples.

Inputs Required

You need three things. If any are missing, ask before proceeding:

The evaluation record — a YAML file (usually in evaluations/ or examples/)
The original artifact — the document or design that was evaluated
The rubric files — identified by rubric_id in the evaluation record; load from core/, profiles/, overlays/

Also load standard/schemas/evaluation.schema.json for structural validation.

Phase 1 — Schema and Structural Check

Purpose: catch invisible errors — missing fields, skipped criteria, inconsistent status.

Check that the evaluation record has:

All required fields: evaluation_id, rubric_id, artifact_ref, evaluation_date, evaluators, status, overall_score, criterion_results
Every criterion from the loaded rubric appears in criterion_results — silently skipped criteria are a red flag
gate_failures field present (even if empty)
recommended_actions present
Each criterion result has: score, judgment_type, confidence, evidence_refs, rationale
Status is internally consistent: a pass status with a critical gate failure is an error

Flag every structural violation as [SCHEMA ERROR] in the output.

Phase 2 — Evidence Audit

Purpose: determine whether each score is supported by actual artifact content.

Read references/challenge-patterns.md before this phase. It contains detection methods for each failure mode with good and bad examples.

For each criterion in the evaluation record:

A. Evidence support check

Locate the evidence_refs cited in the evaluation
Find those sections in the original artifact
Does the excerpt actually say what the rationale claims? Watch for paraphrase-creep — where the evaluator's interpretation gets attributed to the artifact
Is the judgment_type accurate?
- observed requires a direct quote or clearly stated fact
- If the evaluator inferred it → should be inferred
- If they applied outside knowledge → should be external

B. Score calibration check

Read the scoring_guide in the rubric for this criterion
Does the score match the level descriptor?
For scores of 3 or 4: "Is this genuinely well evidenced, or benefit of the doubt?"
For scores of 0 or 1: "Did the evaluator search thoroughly?"

C. Gate check

If gate.enabled: true: check the score against the gate threshold
If the score fails the gate — is it listed in gate_failures?
If listed as a gate failure — does the status reflect the correct effect?

Record your verdict per criterion:

criterion_id: [ID]
primary_score: [from record]
challenger_verdict: agree | disagree | partial
challenger_score: [your score if different]
issue_type: over_scored | under_scored | evidence_unsupported | wrong_evidence_class | gate_missed | none
challenge_note: "[specific reason citing the rubric level descriptor]"

Phase 3 — Systemic Pattern Analysis

Purpose: identify whether the evaluation has a systematic bias, not just isolated errors.

After reviewing all criteria, look for patterns across the full set:

Optimistic evidence classification — multiple criteria marked observed where evidence is actually inferred
Generosity bias — consistently scoring 3 where 2 is more accurate; benefit-of-the-doubt pattern-wide
Missing evidence anchors — rationale cites general impressions rather than specific locations
Gate blindness — gate criteria failed but not in gate_failures, or status doesn't reflect gate effects
Confidence inflation — high confidence on criteria with thin or inferred evidence

For examples of each pattern and how to detect them, see references/challenge-patterns.md.

Phase 4 — Overall Assessment

Compute:

Criteria agreed / challenged / evidence-quality issues / gate errors
Your challenger overall score (if revised scores produce a different weighted average)
Your challenger status recommendation (if it differs from the primary)

Read references/output-template.md before writing the report. It contains the full format with field-by-field guidance.

Non-Negotiable Rules

Don't soften challenges. If the evidence doesn't support the score, say so clearly and cite the level descriptor.
Don't re-score without evidence. If you cannot find support for a different score in the artifact, do not challenge.
Gate errors are critical findings. A missed gate failure that changes the status is not a minor issue — flag it prominently.
The three evaluation types are distinct. Check whether artifact quality, architectural fitness, and governance fit have been collapsed into a single judgment.
Reference level descriptors. Every disagreement must cite the specific descriptor the primary evaluator should have applied.

When to Read Which Reference File

When	Read
Before Phase 2 (always)	`references/challenge-patterns.md`
Detecting a specific failure mode	`references/challenge-patterns.md`
Before writing the challenger report	`references/output-template.md`
Unsure whether to challenge a score	`references/challenge-patterns.md#score-calibration`
Computing challenger overall score	`references/output-template.md#challenger-score`