evaluation-anchor-checker

Stars: 472

Forks: 36

Updated: May 27, 2026 at 16:37

Audit and rewrite evaluation/numeric claims to ensure they carry minimal protocol context (task + metric + constraint) and avoid underspecified model naming. **Trigger**: evaluation anchor checker, numeric claim hygiene, underspecified numbers, protocol context, 评测锚点检查, 数字断言, 指标上下文. **Use when**: before final merge/polish, or when reviewers would likely flag claims as underspecified (numbers without task/metric/budget), or `pipeline-auditor` warns about suspicious model naming. **Skip if**: evidence is too thin to justify numeric claims (route upstream to C3/C4), or you are pre-C2 (NO PROSE). **Network**: none. **Guardrail**: do not invent numbers; do not add/remove/move citation keys; if protocol context is missing, weaken/remove the numeric claim rather than guessing.

Source

WILLOSCAR

WILLOSCAR/research-units-pipeline-skills

name

evaluation-anchor-checker

description

Evaluation Anchor Checker (make numbers reviewer-safe)

Purpose: fix a reviewer-magnet failure mode in agent surveys:

- strong numeric/performance statements appear
- but the minimal evaluation context is missing

This skill treats numeric claims as contracts:

- if a number stays, the same sentence must contain enough protocol context to interpret it
- if that context is not in evidence, the claim must be downgraded (no guessing)

Inputs

Preferred (pre-merge, keeps anchoring intact):

- the affected sections/*.md files

Optional context (read-only; helps you avoid guessing):

- outline/writer_context_packs.jsonl (look for evaluation_anchor_minimal, evaluation_protocol, anchor_facts)
- outline/evidence_drafts.jsonl / outline/anchor_sheet.jsonl
- citations/ref.bib

Outputs

- Updated sections/*.md (or output/DRAFT.md if you are post-merge), with safer evaluation anchoring
- output/EVAL_ANCHOR_REPORT.md (always; short report with files checked / changed / weakened sentences)
- Optional completion marker: output/eval_anchors_checked.refined.ok
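The report is described only as a short summary of files checked, files changed, and weakened sentences. As a rough sketch of what such a report could look like, the following Python renders those three lists as text. The `SweepResult` class, the `render_report` function, and the heading and bullet wording are all hypothetical; the actual layout produced by the skill's script is not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class SweepResult:
    """Hypothetical container for the three things the report must mention."""
    checked: list[str] = field(default_factory=list)
    changed: list[str] = field(default_factory=list)
    weakened: list[tuple[str, str]] = field(default_factory=list)  # (file, sentence)


def render_report(result: SweepResult) -> str:
    """Render an assumed report layout; the real format may differ."""
    lines = [
        "Evaluation anchor report",
        "",
        f"- Files checked: {len(result.checked)}",
        f"- Files changed: {len(result.changed)}",
        f"- Weakened sentences: {len(result.weakened)}",
    ]
    for path, sentence in result.weakened:
        lines.append(f"  - {path}: {sentence}")
    return "\n".join(lines) + "\n"


example = SweepResult(
    checked=["sections/S1.md", "sections/S2.md"],
    changed=["sections/S2.md"],
    weakened=[("sections/S2.md", "Reported gains vary [@SomeBench].")],
)
report_text = render_report(example)
```

Writing `report_text` to output/EVAL_ANCHOR_REPORT.md would then give the pipeline a non-empty artifact even when no sentence was changed.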

Recommended slot in the survey pipeline

Use this as the last section-level numeric hygiene sweep before merge:

- after paragraph-curator + style-harmonizer + opener-variator
- before transition-weaver / section-merger

Reason:

- earlier section-level rewrite passes can legitimately rephrase or fuse numeric sentences
- if you only wait for pipeline-auditor, numeric-context issues are discovered too late in the merged draft
- section-scoped fixes are cheaper and preserve citation anchoring better than post-merge patching

Read Order

Always read:

references/numeric_hygiene.md

Machine-readable asset:

assets/numeric_hygiene.json

The asset defines the keyword families and qualitative fallback templates. Keep the script deterministic and let the policy live in the asset/reference pair.

Role prompt: Reviewer-minded Editor (evaluation hygiene)

You are a reviewer-minded editor for evaluation claims in a technical survey.

Goal:
- make every numeric/performance claim interpretable and reviewer-safe

Hard constraints:
- do not invent numbers
- do not add/remove/move citation keys
- if protocol context is missing, weaken or remove the numeric claim

Minimum context to include when keeping a number:
- task / setting (what kind of task)
- metric (what is being measured)
- constraint (budget/cost/tool access/horizon/seed/logging) when relevant

Avoid:
- ambiguous model naming that looks hallucinated (e.g., “GPT-5”) unless the cited paper uses it verbatim

Workflow (explicit inputs)

- Use outline/writer_context_packs.jsonl to locate the subsection's allowed citations and any extracted evaluation_protocol/anchor_facts.
- Cross-check outline/evidence_drafts.jsonl and outline/anchor_sheet.jsonl for task/metric/constraint context before touching numbers.
- Validate every cited key against citations/ref.bib (do not introduce new keys).
- Write output/EVAL_ANCHOR_REPORT.md so the pipeline has an auditable completion artifact for this sweep.
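The key-validation step in the workflow above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual script: it assumes Pandoc-style citations such as `[@SomeBench]` or `[@A; @B]` and conventional BibTeX entry headers such as `@article{SomeBench,`. The function names `cited_keys`, `bib_keys`, and `unknown_keys` are hypothetical.

```python
import re

# Assumption: citations are bracketed groups containing @keys, e.g. [@A; @B].
CITE_GROUP = re.compile(r"\[([^\]]*@[^\]]*)\]")
CITE_KEY = re.compile(r"@([A-Za-z0-9_:-]+)")
# Assumption: BibTeX entries start like "@article{Key," (entry type, brace, key, comma).
BIB_ENTRY = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def cited_keys(markdown_text: str) -> set[str]:
    """Collect every key that appears inside a bracketed citation group."""
    keys: set[str] = set()
    for group in CITE_GROUP.findall(markdown_text):
        keys.update(CITE_KEY.findall(group))
    return keys


def bib_keys(bib_text: str) -> set[str]:
    """Collect the entry keys declared in a BibTeX file."""
    return set(BIB_ENTRY.findall(bib_text))


def unknown_keys(markdown_text: str, bib_text: str) -> list[str]:
    """Return cited keys that have no matching BibTeX entry, sorted."""
    return sorted(cited_keys(markdown_text) - bib_keys(bib_text))


section = "Success rates differ across settings [@SomeBench; @Other2024]."
bib = "@article{SomeBench,\n  title = {An example entry}\n}\n"
missing = unknown_keys(section, bib)
```

A non-empty `missing` list signals a key that should be reported rather than silently fixed, since the skill forbids adding or removing citation keys.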

What to enforce (the “minimum protocol trio”)

When a sentence contains a numeric claim (a percentage, a multiplier such as 2x, or any other number):

Keep the number only if you can attach at least 2 of the following in the same sentence without guessing:
- task family / benchmark name
- metric definition
- constraint (budget, tool access, cost model, retries, horizon)

If you cannot, downgrade:

- remove the number and rewrite as qualitative (“often”, “can”, “may”) with the same citation
- or move the specificity into a verification target (“evaluations need to report …”) without adding new facts
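The keep-or-downgrade rule above can be approximated with a simple keyword check. The sketch below is illustrative only: the `KEYWORDS` table is a hypothetical stand-in for the keyword families that the skill keeps in assets/numeric_hygiene.json, and substring matching is a deliberately crude proxy for "protocol context is present". Citation groups are stripped first so that digits inside keys such as `@Survey2024` do not count as numeric claims.

```python
import re

# Hypothetical keyword families; the real ones live in assets/numeric_hygiene.json.
KEYWORDS = {
    "task": ("benchmark", "task", "dataset"),
    "metric": ("accuracy", "success rate", "pass@", "f1"),
    "constraint": ("budget", "retries", "tool access", "horizon", "cost"),
}
HAS_DIGIT = re.compile(r"\d")
CITATION = re.compile(r"\[[^\]]*@[^\]]*\]")


def context_hits(sentence: str) -> set[str]:
    """Return which of task / metric / constraint are mentioned in the sentence."""
    body = CITATION.sub("", sentence).lower()
    return {
        family
        for family, words in KEYWORDS.items()
        if any(word in body for word in words)
    }


def verdict(sentence: str) -> str:
    """Classify a sentence as 'no-number', 'keep', or 'downgrade'."""
    body = CITATION.sub("", sentence)
    if not HAS_DIGIT.search(body):
        return "no-number"
    # The rule in the text: keep the number only with at least 2 context elements.
    return "keep" if len(context_hits(sentence)) >= 2 else "downgrade"


bad = "Model X achieves 75% exact performance [@SomeBench]."
better = ("On the ExampleBench benchmark, Model X reaches ~75% success rate "
          "under a fixed budget [@SomeBench].")
qualitative = "Reported gains vary when retry policies are not reported [@Survey2024]."
```

A check like this can only flag candidates. Whether the attached context is actually supported by the evidence files still has to be verified by hand, because the skill forbids guessing.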

Mini examples (paraphrase; do not copy)

Bad (underspecified):

Model X achieves 75% exact performance [@SomeBench].

Better (minimal context):

On <task/benchmark>, Model X reaches ~75% <metric>, under <constraint/budget/tool access> [@SomeBench].

Better (downgrade when context is missing):

Reported gains vary, but comparisons remain fragile when budgets and retry policies are not reported [@SomeBench].

Done checklist

- output/EVAL_ANCHOR_REPORT.md exists and reports a non-zero file count.
- No numeric claim remains without minimal protocol context.
- No ambiguous model naming remains unless explicitly supported by citations.
- Citation keys are unchanged.
- If you removed/downgraded numbers, the paragraph still makes a defensible, evidence-bounded point.
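The "citation keys are unchanged" item can be checked mechanically by comparing the keys of a section before and after the sweep. The sketch below is one possible way to do that, assuming Pandoc-style `[@key]` citations; `key_sequence` and `citation_diff` are hypothetical helper names, and the check looks only at key order within a file, not at which sentence a key is attached to.

```python
import re
from collections import Counter

# Assumption: citations are bracketed groups containing @keys, e.g. [@A; @B].
CITE_GROUP = re.compile(r"\[([^\]]*@[^\]]*)\]")
CITE_KEY = re.compile(r"@([A-Za-z0-9_:-]+)")


def key_sequence(text: str) -> list[str]:
    """Return citation keys in document order, duplicates included."""
    sequence: list[str] = []
    for group in CITE_GROUP.findall(text):
        sequence.extend(CITE_KEY.findall(group))
    return sequence


def citation_diff(before: str, after: str) -> dict:
    """Report keys that were added, removed, or reordered between two versions."""
    seq_before, seq_after = key_sequence(before), key_sequence(after)
    count_before, count_after = Counter(seq_before), Counter(seq_after)
    return {
        "added": sorted((count_after - count_before).elements()),
        "removed": sorted((count_before - count_after).elements()),
        # Same multiset of keys but a different order suggests a moved citation.
        "reordered": count_before == count_after and seq_before != seq_after,
    }


original = "Claim A [@K1]. Claim B [@K2; @K3]."
reworded = "Claim A, softened [@K1]. Claim B [@K2; @K3]."
swapped = "Claim A [@K1]. Claim B [@K3; @K2]."
edited = "Claim A. Claim B [@K2; @K3; @K4]."
```

An all-empty result such as the one for `reworded` is the expected outcome of a clean sweep.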

Script

Quick Start

python .codex/skills/evaluation-anchor-checker/scripts/run.py --workspace workspaces/<ws>

All Options

- --workspace <dir>: workspace containing sections/*.md or merged draft artifacts
- --unit-id <id>: optional harness metadata
- --inputs <semicolon-separated>: optional override from UNITS.csv
- --outputs <semicolon-separated>: optional output override; default includes output/EVAL_ANCHOR_REPORT.md
- --checkpoint <C*>: optional harness metadata
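For readers who want to see how the options above fit together, here is a minimal `argparse` sketch of a command line with the same flags. It is not the source of scripts/run.py: treating `--workspace` as required, defaulting the other flags to empty strings, and the `split_list` helper are all assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the five documented flags (defaults are assumed)."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("--workspace", required=True, help="workspace directory")
    parser.add_argument("--unit-id", default="", help="optional harness metadata")
    parser.add_argument("--inputs", default="", help="semicolon-separated input override")
    parser.add_argument("--outputs", default="", help="semicolon-separated output override")
    parser.add_argument("--checkpoint", default="", help="optional harness metadata")
    return parser


def split_list(value: str) -> list[str]:
    """Split a semicolon-separated flag value, dropping empty items."""
    return [item.strip() for item in value.split(";") if item.strip()]


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(
    ["--workspace", "workspaces/demo", "--inputs", "sections/*.md;citations/ref.bib"]
)
inputs = split_list(args.inputs)
outputs = split_list(args.outputs)
```

Note that glob patterns such as `sections/*.md` arrive as literal strings when quoted on the command line, so any expansion would have to happen inside the script, relative to the workspace.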

Examples

Run the numeric hygiene sweep before merge:
- python .codex/skills/evaluation-anchor-checker/scripts/run.py --workspace workspaces/<ws> --inputs 'sections/*.md;outline/writer_context_packs.jsonl;citations/ref.bib' --outputs 'sections/*.md;output/EVAL_ANCHOR_REPORT.md;output/eval_anchors_checked.refined.ok'
