judge-evaluate

Stars: 0

Forks: 0

Updated: February 16, 2026 at 18:34

Evaluate a proposed implementation patch and handoff artifact against task scope and verification checks, then produce a machine-readable `verdict.json`. Use when a doer has delivered `patch.diff` and `handoff.json` and a separate judging role must accept, reject, or escalate without editing source files.

Source

OilProducts

OilProducts/agent-skills

Related occupations (SOC)

Based on the SOC occupational classification

Software Quality Assurance Analysts and Testers · Computer and Mathematical Occupations · SOC 15-1253

SKILL.md

name	judge-evaluate
description	Evaluate a proposed implementation patch and handoff artifact against task scope and verification checks, then produce a machine-readable `verdict.json`. Use when a doer has delivered `patch.diff` and `handoff.json` and a separate judging role must accept, reject, or escalate without editing source files.

Judge Evaluate

Assess doer output and produce a structured verdict artifact.

Role Boundaries

Evaluate; do not implement fixes.
Base conclusions on concrete evidence (commands, logs, artifacts).
Write verdict output only.
Do not claim merge/release readiness unless explicitly asked as a separate gate.

Required Inputs

task_id and task statement
patch.diff
handoff.json
output path for verdict.json

Optional:

explicit eval command list
scoring rubric overrides

Workflow

1) Verify handoff integrity

Validate that required doer artifacts exist.
Read handoff.json for touched files, assumptions, and smoke checks.
Flag malformed or missing handoff fields as needs-human or reject depending on severity.
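
The integrity check above could be sketched in Python roughly as follows. This is a minimal illustration, not the skill's own code: the field names in `REQUIRED_HANDOFF_FIELDS` are assumptions derived from "touched files, assumptions, and smoke checks", and the severity mapping (missing artifacts reject, missing fields escalate) is one possible reading of "depending on severity".

```python
import json
from pathlib import Path

# Hypothetical field names; the real handoff schema is defined elsewhere.
REQUIRED_HANDOFF_FIELDS = ("touched_files", "assumptions", "smoke_checks")


def check_handoff(artifact_dir: Path) -> tuple[str, list[str]]:
    """Return (status, problems); status is "ok", "needs-human", or "reject"."""
    problems: list[str] = []
    patch = artifact_dir / "patch.diff"
    handoff = artifact_dir / "handoff.json"

    # Missing artifacts leave nothing to judge, so treat them as severe.
    for path in (patch, handoff):
        if not path.is_file():
            problems.append(f"missing artifact: {path.name}")
    if problems:
        return "reject", problems

    try:
        data = json.loads(handoff.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return "reject", [f"handoff.json is not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return "reject", ["handoff.json must contain a JSON object"]

    # Missing fields are treated as recoverable and escalated to a human.
    for field in REQUIRED_HANDOFF_FIELDS:
        if field not in data:
            problems.append(f"missing handoff field: {field}")
    return ("needs-human" if problems else "ok"), problems
```

Returning the list of problems alongside the status keeps the evidence available for the verdict reasons later in the workflow.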

2) Evaluate implementation evidence

Run deterministic checks when commands are provided:

python3 <path-to-skill>/scripts/run_eval.py \
  --repo <repo-root> \
  --output <artifact-path>/eval-results.json \
  --command "pytest -q" \
  --command "ruff check ."

Keep outputs in machine-readable form.
Treat failed eval commands as evidence for rejection unless the failure is outside the task scope.
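
To illustrate what "deterministic checks" with machine-readable output might look like, here is a small Python sketch that runs each command and records its exit code. It is not the implementation of `scripts/run_eval.py`, whose internals are not shown here; the function names and the result keys (`command`, `exit_code`, `passed`, `stdout`, `stderr`) are assumptions chosen for the example.

```python
import json
import subprocess
import sys
from pathlib import Path


def run_commands(commands: list[str], repo: Path) -> list[dict]:
    """Run each shell command inside `repo` and record its outcome."""
    results = []
    for command in commands:
        proc = subprocess.run(
            command, shell=True, cwd=repo, capture_output=True, text=True
        )
        results.append(
            {
                "command": command,
                "exit_code": proc.returncode,
                "passed": proc.returncode == 0,  # non-zero exit counts as failure
                "stdout": proc.stdout,
                "stderr": proc.stderr,
            }
        )
    return results


def save_results(results: list[dict], output: Path) -> None:
    """Persist results as JSON so later steps can read them as evidence."""
    output.write_text(json.dumps({"results": results}, indent=2), encoding="utf-8")
```

For instance, `run_commands(["pytest -q", "ruff check ."], Path("."))` would yield one record per command, and any record with `passed` set to `False` is a candidate reason for rejection.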

3) Build verdict

Run:

python3 <path-to-skill>/scripts/write_verdict.py \
  --task-id <task-id> \
  --output <artifact-path>/verdict.json \
  --eval-results <artifact-path>/eval-results.json \
  --verdict reject \
  --reason "tests::2 failures in auth reset flow" \
  --requirement-checked "RQ-0102" \
  --requirement-missing "NFR-0001" \
  --required-change "Handle invalid token branch" \
  --suggested-test "pytest -q tests/test_reset.py::test_invalid_token"

Use one of the following values for --verdict:

pass
reject
needs-human
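
The `--reason` value in the command above appears to follow a `check::details` pattern. As a hedged sketch, such a string could be split into the two parts like this; the `::` separator is inferred from the example, and `parse_reason` is a hypothetical helper rather than part of the skill's scripts.

```python
def parse_reason(raw: str) -> dict[str, str]:
    """Split "check::details" into its two parts (assumed format)."""
    # partition splits on the first "::" only, so details may contain "::".
    check, sep, details = raw.partition("::")
    if not sep or not check.strip() or not details.strip():
        raise ValueError(f"expected 'check::details', got {raw!r}")
    return {"check": check.strip(), "details": details.strip()}
```

With the example above, `parse_reason("tests::2 failures in auth reset flow")` gives a `check` of `tests` and the failure summary as `details`.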

4) Validate verdict artifact

Run:

python3 <path-to-skill>/scripts/validate_verdict.py \
  --input <artifact-path>/verdict.json
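
As an illustration of what validating "shape and enums" can involve, the sketch below checks a verdict dictionary in Python. The canonical schema lives in references/artifact-contract.md and is not reproduced here, so treating every field below as required is an assumption; the field names themselves are taken from the rules in this document.

```python
ALLOWED_VERDICTS = {"pass", "reject", "needs-human"}
LIST_FIELDS = (
    "reasons",
    "requirements_checked",
    "requirements_missing",
    "required_changes",
    "suggested_tests",
)


def validate_verdict(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks valid."""
    errors: list[str] = []
    task_id = doc.get("task_id")
    if not isinstance(task_id, str) or not task_id:
        errors.append("task_id must be a non-empty string")
    verdict = doc.get("verdict")
    if not isinstance(verdict, str) or verdict not in ALLOWED_VERDICTS:
        errors.append(f"verdict must be one of {sorted(ALLOWED_VERDICTS)}")
    for field in LIST_FIELDS:
        if not isinstance(doc.get(field), list):
            errors.append(f"{field} must be a list")
    reasons = doc.get("reasons")
    if isinstance(reasons, list):
        # Each reason is expected to carry both a check and its details.
        for index, reason in enumerate(reasons):
            if not isinstance(reason, dict) or not {"check", "details"}.issubset(reason):
                errors.append(f"reasons[{index}] needs 'check' and 'details'")
    return errors
```

Collecting all problems instead of stopping at the first one makes a single validation run more useful when the artifact has several defects.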

Evaluation Rules

Prefer reject when evidence shows unmet behavior or failing checks.
Use needs-human for ambiguous requirements, missing context, or conflicting constraints.
Keep required_changes implementation-neutral and actionable.
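
The first two rules can be read as a simple precedence order, sketched below in Python. The function and parameter names are hypothetical, and letting concrete failures outrank open questions is an assumption about how the rules combine; the scoring rubric may refine this.

```python
def choose_verdict(
    failed_checks: list[str],
    unmet_requirements: list[str],
    open_questions: list[str],
) -> str:
    """Map collected evidence to one of pass, reject, or needs-human."""
    # Concrete evidence of failure takes priority over ambiguity.
    if failed_checks or unmet_requirements:
        return "reject"
    # Ambiguous requirements, missing context, or conflicting constraints.
    if open_questions:
        return "needs-human"
    return "pass"
```

For example, a run with one failing test and one unclear requirement would still come out as `reject` under this ordering.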

Output Rules

Always produce verdict.json.
Include concrete reasons, each with a check and details.
Keep suggested_tests runnable.
Keep requirements_checked and requirements_missing in stable ID form (RQ-####, NFR-####, ASMP-####, ADR-####).
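
A quick way to check the stable ID form is a regular expression, as in this Python sketch. It assumes that `####` means exactly four digits; `invalid_ids` is a hypothetical helper name.

```python
import re

# Assumption: "####" stands for exactly four digits.
STABLE_ID = re.compile(r"(RQ|NFR|ASMP|ADR)-\d{4}")


def invalid_ids(ids: list[str]) -> list[str]:
    """Return the IDs that do not match the stable ID form."""
    return [item for item in ids if not STABLE_ID.fullmatch(item)]
```

Running it over both requirements_checked and requirements_missing before writing verdict.json would surface malformed IDs early.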

Resources

scripts/

scripts/run_eval.py: execute eval commands and persist structured results.
scripts/write_verdict.py: synthesize verdict.json from evidence and explicit findings.
scripts/validate_verdict.py: enforce required verdict shape and enums.

references/

references/artifact-contract.md: canonical verdict schema and examples.
references/scoring-rubric.md: lightweight scoring framework for consistent decisions.
