tooluniverse-self-review

Stars: 1,503

Forks: 231

Updated: June 20, 2026, 00:02

Generate the success criteria for a task or question, then review work against them. Given a task, goal, or open-ended question, decompose it into scenarios, evaluation perspectives, and fine-grained weighted YES/NO criteria using the Recursive Expansion Tree (RET) method; if work is supplied, score it criterion-by-criterion and surface what is missing or could be better. Use when asked to self-review or check your own work, judge whether a task is done well or completely, build a definition-of-done or completeness checklist, create an evaluation rubric or grading criteria, score or grade answers to a question, set up an LLM-as-judge rubric, or when the user mentions self-review, completeness check, success criteria, evaluation criteria, scoring rubric, Qworld, or the RET algorithm.

Source

mims-harvard

mims-harvard/ToolUniverse

Self-Review: Task Success Criteria + Work Review

Derive the criteria that define a good result for a specific task, then review work against them. Each task defines its own "evaluation world" -- a set of scenarios, perspectives, and criteria that capture what matters for judging results for that specific task. Built on the Qworld Recursive Expansion Tree (RET).

When to Use

An agent (or person) is about to do, is doing, or has finished a task and wants to know what a good result must cover, or what is missing or weak.
You need a definition-of-done or completeness checklist for a task or goal.
You need an evaluation rubric / grading criteria for answers to an open-ended question.
You want to score or grade responses: LLM-as-judge, exam answers, candidate responses.

Inputs

task (required): the task, goal, or question to evaluate. If it is a multi-turn conversation, treat the whole exchange as context and the last user message as the primary intent. If it includes an image or retrieved web context, factor that in.
work (optional): a planned approach or a completed result to review. If no work is supplied, run in checklist mode -- produce the criteria only, no scoring.

Core Principles

Task-specific: Every scenario, perspective, and criterion must be derived from the task's content, intent, and context. Never reuse a fixed set of dimensions across tasks.
Binary and verifiable: Each final criterion must be answerable YES or NO by a reviewer.
Non-redundant: At every level, check for overlap before adding new items.
Balanced: Include both positive criteria (what a good result must do) and negative criteria (what constitutes harmful or misleading content). Total positive points must outweigh total negative points.
Coverage-driven: At each level, repeatedly ask "what's missing?" to ensure the evaluation space implied by the task is fully explored.

Algorithm Overview

The RET builds a 3-level tree from the task:

Task / goal / question
  -> Level 1: Scenarios (contextual framings that change what "good" means)
       -> Level 2: Perspectives (evaluation dimensions per scenario)
            -> Level 3: Criteria (concrete, binary rubric items with scores)

At each level, two operators apply:

Hierarchical decomposition: break a node into finer-grained children at the next level.
Horizontal expansion: ask "what else is missing?" and add sibling nodes for coverage.

After the tree is built, Phase B reviews supplied work against the leaf criteria.
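The three levels can be pictured as nested records. The following is a minimal sketch, not the skill's own implementation: the class names, the `leaf_criteria` helper, and the toy task are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Criterion:
    text: str          # binary YES/NO statement
    points: int        # signed weight; negative means meeting it is bad
    reasoning: str = ""


@dataclass
class Perspective:
    name: str
    description: str
    criteria: list[Criterion] = field(default_factory=list)


@dataclass
class Scenario:
    name: str
    description: str
    perspectives: list[Perspective] = field(default_factory=list)


def leaf_criteria(scenarios: list[Scenario]) -> list[Criterion]:
    """Flatten the tree into the Level 3 criteria that Phase B reviews."""
    return [c for s in scenarios for p in s.perspectives for c in p.criteria]


# Toy tree for a hypothetical task ("summarize a clinical trial report").
tree = [
    Scenario(
        name="Clinician audience",
        description="Reader needs dosing and safety details.",
        perspectives=[
            Perspective(
                name="Safety reporting",
                description="Covers adverse events and contraindications.",
                criteria=[
                    Criterion("States the most common adverse events", 8),
                    Criterion("Recommends a dose not supported by the report", -10),
                ],
            )
        ],
    )
]
leaves = leaf_criteria(tree)
print(len(leaves))
```

Keeping criteria attached to their perspective and scenario makes it easy to trace a leaf criterion back to the context that motivated it.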

Step-by-Step Workflow

Copy this checklist and track progress as you work:

Task Progress:
- [ ] Step 1: Scenario grounding
- [ ] Step 2: Scenario expansion (3 rounds)
- [ ] Step 3: Perspective generation
- [ ] Step 4: Perspective expansion (4 rounds)
- [ ] Step 5: Perspective review and consolidation
- [ ] Step 6: Criteria generation
- [ ] Step 7: Criteria expansion (3 rounds)
- [ ] Step 8: Criteria review and consolidation
- [ ] Step 9: Polarity check
- [ ] Step 10: Score calibration
- [ ] Phase B: Review work against criteria (or emit checklist if no work supplied)

Step 1: Scenario Grounding

Goal: Identify the distinct real-world contexts in which this task could arise, where each context would materially change what constitutes a good result.

Method:

Read the task carefully. Infer its domain, intent, audience, stakes, and implicit constraints.
Ask: "In what different situations could someone pursue this task, and how would the situation change what a good result looks like?"
Produce a minimal, non-redundant set of scenarios. Merge any that would lead to the same evaluation criteria.

Output format for each scenario:

scenario_name: short descriptive label
scenario_description: 3-5 sentences explaining what is unique about this context and why it changes what "good" means

Step 2: Scenario Expansion (3 rounds)

Goal: Ensure comprehensive coverage of the evaluation space at the scenario level.

Method (repeat 3 times):

Review all existing scenarios.
Ask: "What contexts or framings are missing that would require different evaluation criteria?"
Identify gaps along context axes (audience, setting, stakes, constraints, domain variation, etc.).
Generate ONLY new scenarios that differ materially from all existing ones. Do not repeat or rephrase existing scenarios.

After each round, append new scenarios to the list.
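The expansion rounds follow one loop shape at every level: propose, filter out repeats, append. A rough sketch of that loop is below. The `expand` and `toy_propose` names are hypothetical, and the case-insensitive string match is only a crude stand-in for the "differs materially" judgment, which in practice needs a reviewer or model.

```python
from typing import Callable


def expand(
    items: list[str],
    propose: Callable[[list[str]], list[str]],
    rounds: int,
) -> list[str]:
    """Run a fixed number of 'what is missing?' rounds, appending only new items."""
    result = list(items)
    for _ in range(rounds):
        seen = {item.strip().lower() for item in result}
        for candidate in propose(result):
            key = candidate.strip().lower()
            if key not in seen:  # skip repeats of existing items
                seen.add(key)
                result.append(candidate)
    return result


# Hypothetical stand-in for whoever answers "what is missing?".
# It deliberately re-proposes an existing item to show the filter at work.
def toy_propose(existing: list[str]) -> list[str]:
    pool = [
        "Expert audience",
        "Novice audience",
        "High-stakes setting",
        "Time-limited setting",
    ]
    fresh = [name for name in pool if name not in existing][:2]
    return fresh + existing[:1]


scenarios = expand(["Expert audience"], toy_propose, rounds=3)
print(scenarios)
```

The same function could be reused with `rounds=4` for perspectives and `rounds=3` for criteria, since only the proposal step changes between levels.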

Step 3: Perspective Generation

Goal: For each scenario, derive the evaluation dimensions that matter for judging a result in that context.

Method:

For EACH scenario, produce 4-7 non-overlapping evaluation perspectives.
Each perspective must be grounded in the scenario and the task. Derive perspectives entirely from the task's content and context -- do not apply a pre-set list of dimensions.
Aim for maximally diverse themes, including unconventional angles that are specific to this task.
Each perspective should be high-level enough to spawn multiple criteria, yet narrow enough to avoid overlap with other perspectives.

Output format for each perspective:

perspective_name: 2-5 word descriptive label
perspective_description: 3-5 sentences explaining what this perspective evaluates and listing 3-5 specific sub-aspects it covers
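The mechanical parts of this format (4-7 perspectives per scenario, 2-5 word labels) can be checked automatically, while overlap and grounding still need judgment. A small sketch, assuming perspectives are stored as dicts keyed by the field names above; the `check_perspectives` helper is hypothetical.

```python
def check_perspectives(perspectives: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means these checks pass."""
    problems = []
    if not 4 <= len(perspectives) <= 7:
        problems.append(f"expected 4-7 perspectives, got {len(perspectives)}")
    seen = set()
    for p in perspectives:
        name = p["perspective_name"]
        words = len(name.split())
        if not 2 <= words <= 5:
            problems.append(f"{name!r}: label should be 2-5 words, got {words}")
        if name.lower() in seen:
            problems.append(f"{name!r}: duplicate label")
        seen.add(name.lower())
    return problems


# Toy draft for one scenario; descriptions are placeholders.
draft = [
    {"perspective_name": "Safety reporting completeness",
     "perspective_description": "Placeholder description."},
    {"perspective_name": "Accuracy",
     "perspective_description": "Placeholder description."},
    {"perspective_name": "Audience-appropriate language",
     "perspective_description": "Placeholder description."},
]
problems = check_perspectives(draft)
for line in problems:
    print(line)
```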

Step 4: Perspective Expansion (4 rounds)

Goal: Fill coverage gaps in the perspective set.

Method (repeat 4 times):

Review all perspectives across all scenarios.
Map coverage across quality dimensions implied by the task.
Ask: "What evaluation angles are missing that would yield materially different criteria?"
Generate ONLY new perspectives that provide distinct evaluation value. Do not repeat existing ones.

After each round, append new perspectives to the collection.

Step 5: Perspective Review and Consolidation

Goal: Produce a clean, non-redundant set of perspectives ready for criteria generation.

Method:

If two perspectives target the same evaluation angle under different scenarios, keep them as separate perspectives and fold the scenario-specific details into each description.
Remove perspectives that are off-topic, redundant with others, or too vague to guide concrete criteria generation.
Each kept perspective must contribute unique evaluation value for this task.
Assign each kept perspective a unique ID (p0, p1, p2, ...).

Step 6: Criteria Generation

Goal: For each reviewed perspective, generate concrete, binary evaluation criteria.

Method:

For EACH perspective, generate criteria that a reviewer can check YES or NO against a result.
Cover all sub-aspects listed in the perspective description.

Criteria rules:

Binary: Each criterion must be answerable YES/NO.
Scenario-specific: Explicitly reference features of the task and scenario. Make criteria as detailed and specific as possible.
Self-contained: Each criterion is a standalone statement in the form [Verb + Specific Requirement]. Start with one clear action verb, then state the exact required or forbidden content with qualifiers.
Balanced: Include positive criteria (required content/behavior) and negative criteria (harmful, misleading, or critically wrong content). Only add negative criteria when the issue represents harmful, dangerous, or significantly quality-reducing behavior -- not minor stylistic concerns.
Diverse: Cover all sub-aspects of the perspective.

Negative criteria phrasing: Describe the bad behavior directly. Instead of "Avoids doing X", write "Does X" (where X is the harmful behavior). The criterion text states the behavior; its negative score indicates that meeting it is bad.

Scoring standard:

Positive: 1-10 (10 = critical safety/core requirement; 8-9 = important completeness; 5-7 = quality enhancer; 1-4 = minor nice-to-have)
Negative: -1 to -10 (-10 = dangerous/harmful; -8 to -9 = major omission or error; -5 to -7 = quality issue; -1 to -4 = minor issue)

Output format for each criterion:

criterion: the criterion text
points: integer score (positive or negative)
reasoning: 2-3 sentences explaining why this criterion matters for this task and why it received this weight
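The scoring standard above maps each signed score to a tier. The lookup below is a sketch of that mapping; the `tier` function and the table names are assumptions for illustration, while the ranges and labels are taken from the scoring standard.

```python
# (lowest magnitude in tier, label), ordered from most to least severe.
POSITIVE_TIERS = [
    (10, "critical safety/core requirement"),
    (8, "important completeness"),
    (5, "quality enhancer"),
    (1, "minor nice-to-have"),
]
NEGATIVE_TIERS = [
    (10, "dangerous/harmful"),
    (8, "major omission or error"),
    (5, "quality issue"),
    (1, "minor issue"),
]


def tier(points: int) -> str:
    """Map a signed score to its tier label; reject 0 and out-of-range values."""
    if not isinstance(points, int) or points == 0 or abs(points) > 10:
        raise ValueError("points must be an integer in 1..10 or -10..-1")
    tiers = POSITIVE_TIERS if points > 0 else NEGATIVE_TIERS
    for floor, label in tiers:
        if abs(points) >= floor:
            return label
    raise AssertionError("unreachable: every magnitude 1..10 has a tier")


print(tier(9))   # important completeness
print(tier(-6))  # quality issue
```

Rejecting a score of 0 reflects the output format, where every criterion carries a positive or negative integer.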

Step 7: Criteria Expansion (3 rounds)

Goal: Fill coverage gaps in the criteria set.

Method (repeat 3 times):

Review all existing criteria.
Ask: "What concrete content or behavior, if present or absent in a result, would change whether it passes or fails -- and is not yet covered?"
Generate ONLY new criteria that are non-overlapping with existing ones.
Follow the same rules and scoring standard as Step 6.

After each round, append new criteria to the collection.

Step 8: Criteria Review and Consolidation

Goal: Produce a concise, non-redundant final rubric.

Method:

Deduplicate and merge: Combine criteria that assess the same aspect. Keep the most precise wording and include all distinct details from merged criteria.
Positive/negative overlap: If a positive and a negative criterion cover the same aspect, keep only the positive one.
Neutralize fixed facts: Replace hard-coded numbers, dates, versions, limits, or placeholders with a requirement that the result states the current/official/latest value or standard.
Balance check: Ensure total positive points outweigh total negative points. Retain all distinct, non-overlapping items.
Assign each final criterion a unique ID (c0, c1, c2, ...).
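Two of these consolidation rules are mechanical once each criterion is tagged with the aspect it assesses. The sketch below assumes such a tag exists: the `aspect` field and both helper names are hypothetical, and deciding which criteria share an aspect remains a reviewer judgment.

```python
def consolidate(criteria: list[dict]) -> list[dict]:
    """Drop negatives whose aspect a positive already covers, then assign IDs."""
    positive_aspects = {c["aspect"] for c in criteria if c["points"] > 0}
    kept = [
        c for c in criteria
        if c["points"] > 0 or c["aspect"] not in positive_aspects
    ]
    return [{**c, "id": f"c{i}"} for i, c in enumerate(kept)]


def is_balanced(criteria: list[dict]) -> bool:
    """True when total positive points outweigh total negative points."""
    positive = sum(c["points"] for c in criteria if c["points"] > 0)
    negative = sum(-c["points"] for c in criteria if c["points"] < 0)
    return positive > negative


# Toy criteria for a hypothetical task.
raw = [
    {"criterion": "States the trial's primary endpoint",
     "points": 9, "aspect": "primary endpoint"},
    {"criterion": "Omits the trial's primary endpoint",
     "points": -8, "aspect": "primary endpoint"},
    {"criterion": "Recommends a dose not supported by the report",
     "points": -10, "aspect": "dosing"},
    {"criterion": "Lists the most common adverse events",
     "points": 7, "aspect": "adverse events"},
]
final = consolidate(raw)
print([c["id"] for c in final], is_balanced(final))
```

In this toy set the negative "primary endpoint" criterion is dropped because a positive one covers the same aspect, leaving 16 positive points against 10 negative.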

Step 9: Polarity Check

Goal: Verify that every criterion's score sign correctly reflects whether meeting it is good or bad.

Method:

For each criterion, determine: does meeting this criterion improve the result (positive) or indicate a problem (negative)?
A criterion that describes desirable behavior should have a positive score.
A criterion that describes harmful, wrong, or misleading behavior should have a negative score.
Phrasing like "Avoids doing X" describes desirable behavior (positive). Phrasing like "Does X" (where X is bad) describes undesirable behavior (negative).
Adjust the sign of points if misclassified. Do not change the criterion text.
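Once a reviewer has decided whether a criterion describes desirable behavior, the sign fix itself is a one-liner. A minimal sketch; the `fix_polarity` name and its boolean argument are assumptions, and the judgment behind that boolean is not automated here.

```python
def fix_polarity(points: int, describes_desirable_behavior: bool) -> int:
    """Keep the magnitude, set the sign from the reviewer's judgment."""
    magnitude = abs(points)
    return magnitude if describes_desirable_behavior else -magnitude


# "Avoids recommending unsupported doses" is desirable, so -6 becomes 6.
print(fix_polarity(-6, describes_desirable_behavior=True))
# "Recommends a dose not supported by the report" is harmful, so 10 becomes -10.
print(fix_polarity(10, describes_desirable_behavior=False))
```

Only the sign changes; the criterion text and the magnitude are left alone, matching the rule above.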

Step 10: Score Calibration

Goal: Ensure score magnitudes accurately reflect importance.

Method:

Review each criterion's absolute score against the scoring standard from Step 6.
A more important or severe issue must always have a higher absolute score than a less important one.
Adjust magnitudes where needed. Do not change the sign (positive/negative direction) or the criterion text.
Verify the scoring standard is applied consistently:
- Positive: 10 = critical, 8-9 = important, 5-7 = quality enhancer, 1-4 = minor
- Negative: -10 = dangerous, -8 to -9 = major, -5 to -7 = quality issue, -1 to -4 = minor
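The ordering rule (a more important item must have a strictly higher absolute score) can be checked pairwise. The sketch below assumes each criterion carries a reviewer-assigned `importance` rank, which is a hypothetical field introduced only for this example.

```python
def calibration_violations(criteria: list[dict]) -> list[tuple[str, str]]:
    """Return (more_important_id, less_important_id) pairs that break the ordering."""
    violations = []
    for a in criteria:
        for b in criteria:
            more_important = a["importance"] > b["importance"]
            if more_important and abs(a["points"]) <= abs(b["points"]):
                violations.append((a["id"], b["id"]))
    return violations


# Toy criteria; a higher importance value means more important.
rubric = [
    {"id": "c0", "points": 9, "importance": 3},
    {"id": "c1", "points": -9, "importance": 2},
    {"id": "c2", "points": 4, "importance": 1},
]
print(calibration_violations(rubric))
```

Here c0 is ranked above c1 but both have magnitude 9, so the pair is flagged and one of the magnitudes should be adjusted.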

Phase B: Review Work Against Criteria

Goal: Apply the finalized criteria to the actual work, or emit a checklist if none was supplied.

If no work was supplied (checklist mode):

State explicitly: "No work supplied -- criteria only (definition-of-done checklist)."
Present the criteria as a checklist the user can run against any future result.
Stop here; do not invent or assume a result to score.

If work was supplied (review mode):

For EACH criterion, judge whether the work meets it: YES or NO, with a one-line cite of the specific evidence in the work (quote or location) that justifies the verdict.
Compute the score: sum the points of every criterion marked YES, so that met positive criteria add to the total and met negative criteria subtract from it. Report earned points, maximum positive points, and the net total.
Produce a ranked gap list: the unmet positive criteria and any met negative criteria, ordered by absolute points (most important first). For each, give a one-line concrete fix.
Keep the review evidence-based. Do not credit a criterion the work does not actually satisfy, and do not penalize for criteria outside the stated task.
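The scoring and gap-ranking steps of review mode can be sketched as follows. This is one reading of the rules above, under stated assumptions: "earned" is taken to mean points from met positive criteria, "net" adds the penalties from met negative criteria, and the `review` helper with its dict layout is hypothetical.

```python
def review(criteria: list[dict], verdicts: dict[str, bool]) -> dict:
    """Score YES/NO verdicts and rank the gaps by absolute points."""
    earned = sum(
        c["points"] for c in criteria if c["points"] > 0 and verdicts[c["id"]]
    )
    penalties = sum(
        c["points"] for c in criteria if c["points"] < 0 and verdicts[c["id"]]
    )
    max_positive = sum(c["points"] for c in criteria if c["points"] > 0)
    # A gap is an unmet positive criterion or a met negative criterion.
    gaps = [
        c for c in criteria
        if (c["points"] > 0 and not verdicts[c["id"]])
        or (c["points"] < 0 and verdicts[c["id"]])
    ]
    gaps.sort(key=lambda c: abs(c["points"]), reverse=True)
    return {
        "earned": earned,
        "max_positive": max_positive,
        "net": earned + penalties,  # penalties are already negative
        "gaps": [c["id"] for c in gaps],
    }


# Toy rubric and verdicts (True = YES, False = NO).
rubric = [
    {"id": "c0", "points": 9},
    {"id": "c1", "points": -10},
    {"id": "c2", "points": 7},
    {"id": "c3", "points": 3},
]
verdicts = {"c0": True, "c1": True, "c2": False, "c3": True}
result = review(rubric, verdicts)
print(result)
```

In this toy run the work earns 12 of 19 positive points, loses 10 to the met negative criterion, and the gap list puts c1 ahead of c2 because its absolute weight is larger. A full review would also attach the evidence quote and a one-line fix to each entry.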

Final Output Format

Present results in this order:

Mode line: "review mode" or "checklist mode (no work supplied)".
Scenarios: list with IDs, names, descriptions.
Reviewed perspectives: list with IDs, names, descriptions.
Criteria table: each row has criterion_id, criterion, points, reasoning.
Summary statistics: number of scenarios, perspectives, raw criteria, final criteria; total positive and total negative points.
Review (review mode only): per-criterion YES/NO + evidence, earned/max/net score, and the ranked gap list with fixes.

Handling Special Inputs

Multi-turn conversations: If the task is a conversation (user-assistant exchange), treat the full conversation as context. The last user message is typically the primary intent; earlier messages provide context that may affect evaluation dimensions.

Tasks with images: If the task includes an image, factor image content into scenario analysis and criteria generation. Reference visual elements where relevant in criteria.

Tasks with retrieved web context: If web-retrieved context is provided alongside the task, use it to inform factual grounding of scenarios and criteria. Do not limit analysis to only the retrieved content -- also apply general reasoning.

Key Constraints

No fixed dimension lists: Never apply a pre-defined set of evaluation dimensions. Every scenario, perspective, and criterion must be freshly derived from the task.
Expansion counts matter: Complete all specified expansion rounds (3 for scenarios, 4 for perspectives, 3 for criteria). Skipping rounds reduces coverage.
Deduplication is mandatory: After each expansion phase and before finalization, check for and remove redundant items.
Criteria must be binary: Every criterion in the final set must be checkable as YES or NO. Do not produce criteria that require subjective degree judgments.
Evidence-based review: In review mode, every YES/NO must be backed by specific evidence from the supplied work.

Citation

This skill ports the Qworld method. If you use it in your work, please cite:

@misc{gao2026qworldquestionspecificevaluationcriteria,
      title={Qworld: Question-Specific Evaluation Criteria for LLMs},
      author={Shanghua Gao and Yuchang Su and Pengwei Sui and Curtis Ginder and Marinka Zitnik},
      year={2026},
      eprint={2603.23522},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.23522},
}

name	tooluniverse-self-review
disable-model-invocation	true
