Ejecuta cualquier Skill en Manus
con un clic

Ejecuta cualquier Skill en Manus con un clic

$pwd:

qworld-criteria-generation

Name: Qworld Criteria Generation
Author: mims-harvard

// Generate question-specific evaluation criteria using the Qworld Recursive Expansion Tree (RET) algorithm. Decomposes any open-ended question into scenarios, perspectives, and fine-grained binary criteria through structured horizontal and hierarchical expansion. Use when asked to create evaluation rubrics, generate grading criteria, build scoring rubrics for LLM responses, assess answer quality for open-ended questions, or when the user mentions "criteria generation", "evaluation rubrics", "evaluation criteria", "Qworld", or "RET algorithm".

Ejecutar en Manus

$ git log --oneline --stat

stars:27

forks:4

updated:26 de marzo de 2026, 01:27

SKILL.md

readonly

package.json

"author": "mims-harvard"

"repository": "mims-harvard/Qworld"

Abrir repositorio de GitHub Ver repositorios del creador

$ install --global

$ download --local

Ejecutar en Manus

$ useful --forSOC

Científicos de datosOcupaciones informáticas y matemáticas15-2051L4

Ejecuta cualquier Skill con un clic

name

qworld-criteria-generation

description

Generate question-specific evaluation criteria using the Qworld Recursive Expansion Tree (RET) algorithm. Decomposes any open-ended question into scenarios, perspectives, and fine-grained binary criteria through structured horizontal and hierarchical expansion. Use when asked to create evaluation rubrics, generate grading criteria, build scoring rubrics for LLM responses, assess answer quality for open-ended questions, or when the user mentions "criteria generation", "evaluation rubrics", "evaluation criteria", "Qworld", or "RET algorithm".

Qworld: Question-Specific Criteria Generation

Generate evaluation criteria tailored to a specific question using the Recursive Expansion Tree (RET) algorithm. Each question defines its own "evaluation world" -- a set of scenarios, perspectives, and criteria that capture what matters for judging answers to that specific question.

When to Use

User asks to create evaluation criteria or rubrics for one or more questions
User wants to assess or score LLM responses to open-ended questions
User needs fine-grained, binary (YES/NO) evaluation dimensions for a question
User mentions Qworld, RET, or criteria generation

Core Principles

Question-specific: Every scenario, perspective, and criterion must be derived from the question's content, intent, and context. Never reuse a fixed set of dimensions across questions.
Binary and verifiable: Each final criterion must be answerable YES or NO by an evaluator.
Non-redundant: At every level, check for overlap before adding new items.
Balanced: Include both positive criteria (what a good answer must do) and negative criteria (what constitutes harmful or misleading content). Positive total points must outweigh negative.
Coverage-driven: At each level, repeatedly ask "what's missing?" to ensure the evaluation space implied by the question is fully explored.

Algorithm Overview

The RET builds a 3-level tree from the question:

Question
  -> Level 1: Scenarios (contextual framings that change what "good" means)
       -> Level 2: Perspectives (evaluation dimensions per scenario)
            -> Level 3: Criteria (concrete, binary rubric items with scores)

At each level, two operators apply:

Hierarchical decomposition: break a node into finer-grained children at the next level
Horizontal expansion: ask "what else is missing?" and add sibling nodes for coverage

Step-by-Step Workflow

Copy this checklist and track progress as you work:

Task Progress:
- [ ] Step 1: Scenario grounding
- [ ] Step 2: Scenario expansion (3 rounds)
- [ ] Step 3: Perspective generation
- [ ] Step 4: Perspective expansion (4 rounds)
- [ ] Step 5: Perspective review and consolidation
- [ ] Step 6: Criteria generation
- [ ] Step 7: Criteria expansion (3 rounds)
- [ ] Step 8: Criteria review and consolidation
- [ ] Step 9: Polarity check
- [ ] Step 10: Score calibration

Step 1: Scenario Grounding

Goal: Identify the distinct real-world contexts in which this question could arise, where each context would materially change what constitutes a good answer.

Method:

Read the question carefully. Infer its domain, intent, audience, stakes, and implicit constraints.
Ask: "In what different situations could someone ask this question, and how would the situation change what a good answer looks like?"
Produce a minimal, non-redundant set of scenarios. Merge any that would lead to the same evaluation criteria.

Output format for each scenario:

scenario_name: short descriptive label
scenario_description: 3-5 sentences explaining what is unique about this context and why it changes what "good" means

Step 2: Scenario Expansion (3 rounds)

Goal: Ensure comprehensive coverage of the evaluation space at the scenario level.

Method (repeat 3 times):

Review all existing scenarios.
Ask: "What contexts or framings are missing that would require different evaluation criteria?"
Identify gaps along context axes (audience, setting, stakes, constraints, domain variation, etc.).
Generate ONLY new scenarios that differ materially from all existing ones. Do not repeat or rephrase existing scenarios.

After each round, append new scenarios to the list.

Step 3: Perspective Generation

Goal: For each scenario, derive the evaluation dimensions that matter for judging an answer in that context.

Method:

For EACH scenario, produce 4-7 non-overlapping evaluation perspectives.
Each perspective must be grounded in the scenario and the question. Derive perspectives entirely from the question's content and context -- do not apply a pre-set list of dimensions.
Aim for maximally diverse themes, including unconventional angles that are specific to this question.
Each perspective should be high-level enough to spawn multiple criteria, yet narrow enough to avoid overlap with other perspectives.

Output format for each perspective:

perspective_name: 2-5 word descriptive label
perspective_description: 3-5 sentences explaining what this perspective evaluates and listing 3-5 specific sub-aspects it covers

Step 4: Perspective Expansion (4 rounds)

Goal: Fill coverage gaps in the perspective set.

Method (repeat 4 times):

Review all perspectives across all scenarios.
Map coverage across quality dimensions implied by the question.
Ask: "What evaluation angles are missing that would yield materially different criteria?"
Generate ONLY new perspectives that provide distinct evaluation value. Do not repeat existing ones.

After each round, append new perspectives to the collection.

Step 5: Perspective Review and Consolidation

Goal: Produce a clean, non-redundant set of perspectives ready for criteria generation.

Method:

If two perspectives target the same evaluation angle but under different scenarios, combine scenario-specific details into each description but keep them as separate perspectives.
Remove perspectives that are off-topic, redundant with others, or too vague to guide concrete criteria generation.
Each kept perspective must contribute unique evaluation value for this question.
Assign each kept perspective a unique ID (p0, p1, p2, ...).

Step 6: Criteria Generation

Goal: For each reviewed perspective, generate concrete, binary evaluation criteria.

Method:

For EACH perspective, generate criteria that an evaluator can check YES or NO against a response.
Cover all sub-aspects listed in the perspective description.

Criteria rules:

Binary: Each criterion must be answerable YES/NO.
Scenario-specific: Explicitly reference features of the question and scenario. Make criteria as detailed and specific as possible.
Self-contained: Each criterion is a standalone statement in the form [Verb + Specific Requirement]. Start with one clear action verb, then state the exact required or forbidden content with qualifiers.
Balanced: Include positive criteria (required content/behavior) and negative criteria (harmful, misleading, or critically wrong content). Only add negative criteria when the issue represents harmful, dangerous, or significantly quality-reducing behavior -- not minor stylistic concerns.
Diverse: Cover all sub-aspects of the perspective.

Negative criteria phrasing: Describe the bad behavior directly. Instead of "Avoids doing X", write "Does X" (where X is the harmful behavior). The criterion text states the behavior; its negative score indicates that meeting it is bad.

Scoring standard:

Positive: 1-10 (10 = critical safety/core requirement; 8-9 = important completeness; 5-7 = quality enhancer; 1-4 = minor nice-to-have)
Negative: -1 to -10 (-10 = dangerous/harmful; -8 to -9 = major omission or error; -5 to -7 = quality issue; -1 to -4 = minor issue)

Output format for each criterion:

criterion: the criterion text
points: integer score (positive or negative)
reasoning: 2-3 sentences explaining why this criterion matters for this question and why it received this weight

Step 7: Criteria Expansion (3 rounds)

Goal: Fill coverage gaps in the criteria set.

Method (repeat 3 times):

Review all existing criteria.
Ask: "What concrete content or behavior, if present or absent in a response, would change whether the response passes or fails -- and is not yet covered?"
Generate ONLY new criteria that are non-overlapping with existing ones.
Follow the same rules and scoring standard as Step 6.

After each round, append new criteria to the collection.

Step 8: Criteria Review and Consolidation

Goal: Produce a concise, non-redundant final rubric.

Method:

Deduplicate and merge: Combine criteria that assess the same aspect. Keep the most precise wording and include all distinct details from merged criteria.
Positive/negative overlap: If a positive and negative criterion cover the same aspect, keep only the positive.
Neutralize fixed facts: Replace hard-coded numbers, dates, versions, limits, or placeholders with a requirement that the response states the current/official/latest value or standard.
Balance check: Ensure total positive points outweigh total negative points. Retain all distinct, non-overlapping items.
Assign each final criterion a unique ID (c0, c1, c2, ...).

Step 9: Polarity Check

Goal: Verify that every criterion's score sign correctly reflects whether meeting it is good or bad.

Method:

For each criterion, determine: does meeting this criterion improve the response (positive) or indicate a problem (negative)?
A criterion that describes desirable behavior should have a positive score.
A criterion that describes harmful, wrong, or misleading behavior should have a negative score.
Phrasing like "Avoids doing X" describes desirable behavior (positive). Phrasing like "Does X" (where X is bad) describes undesirable behavior (negative).
Adjust the sign of points if misclassified. Do not change the criterion text.

Step 10: Score Calibration

Goal: Ensure score magnitudes accurately reflect importance.

Method:

Review each criterion's absolute score against the scoring standard from Step 6.
A more important or severe issue must always have a higher absolute score than a less important one.
Adjust magnitudes where needed. Do not change the sign (positive/negative direction) or the criterion text.
Verify the scoring standard is applied consistently:
- Positive: 10 = critical, 8-9 = important, 5-7 = quality, 1-4 = minor
- Negative: -10 = dangerous, -8 to -9 = major, -5 to -7 = moderate, -1 to -4 = minor

Final Output Format

Present the final criteria as a structured list. Each item contains:

criterion_id: unique identifier (c0, c1, ...)
criterion: the full criterion text
points: calibrated integer score
reasoning: why this criterion matters for this specific question and why this weight

Also include intermediate outputs for transparency:

List of scenarios (with IDs, names, descriptions)
List of reviewed perspectives (with IDs, names, descriptions)
Summary statistics: number of scenarios, perspectives, raw criteria, and final criteria

Handling Special Inputs

Multi-turn conversations: If the question is a conversation (user-assistant exchange), treat the full conversation as the question context. The last user message is typically the primary intent; earlier messages provide context that may affect evaluation dimensions.

Questions with images: If the question includes an image, factor image content into scenario analysis and criteria generation. Reference visual elements where relevant in criteria.

Questions with retrieved web context: If web-retrieved context is provided alongside the question, use it to inform factual grounding of scenarios and criteria. Do not limit analysis to only the retrieved content -- also apply general reasoning.

Key Constraints

No fixed dimension lists: Never apply a pre-defined set of evaluation dimensions. Every scenario, perspective, and criterion must be freshly derived from the question.
Expansion counts matter: Complete all specified expansion rounds (3 for scenarios, 4 for perspectives, 3 for criteria). Skipping rounds reduces coverage.
Deduplication is mandatory: After each expansion phase and before finalization, check for and remove redundant items.
Criteria must be binary: Every criterion in the final set must be checkable as YES or NO. Do not produce criteria that require subjective degree judgments.

Citation

If you use this skill or the Qworld method in your work, please cite our paper:

@misc{gao2026qworldquestionspecificevaluationcriteria,
      title={Qworld: Question-Specific Evaluation Criteria for LLMs},
      author={Shanghua Gao and Yuchang Su and Pengwei Sui and Curtis Ginder and Marinka Zitnik},
      year={2026},
      eprint={2603.23522},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.23522},
}

name

qworld-criteria-generation

description

Qworld: Question-Specific Criteria Generation

When to Use

User asks to create evaluation criteria or rubrics for one or more questions
User wants to assess or score LLM responses to open-ended questions
User needs fine-grained, binary (YES/NO) evaluation dimensions for a question
User mentions Qworld, RET, or criteria generation

Core Principles

Question-specific: Every scenario, perspective, and criterion must be derived from the question's content, intent, and context. Never reuse a fixed set of dimensions across questions.
Binary and verifiable: Each final criterion must be answerable YES or NO by an evaluator.
Non-redundant: At every level, check for overlap before adding new items.
Balanced: Include both positive criteria (what a good answer must do) and negative criteria (what constitutes harmful or misleading content). Positive total points must outweigh negative.
Coverage-driven: At each level, repeatedly ask "what's missing?" to ensure the evaluation space implied by the question is fully explored.

Algorithm Overview

The RET builds a 3-level tree from the question:

Question
  -> Level 1: Scenarios (contextual framings that change what "good" means)
       -> Level 2: Perspectives (evaluation dimensions per scenario)
            -> Level 3: Criteria (concrete, binary rubric items with scores)

At each level, two operators apply:

Hierarchical decomposition: break a node into finer-grained children at the next level
Horizontal expansion: ask "what else is missing?" and add sibling nodes for coverage

Step-by-Step Workflow

Copy this checklist and track progress as you work:

Task Progress:
- [ ] Step 1: Scenario grounding
- [ ] Step 2: Scenario expansion (3 rounds)
- [ ] Step 3: Perspective generation
- [ ] Step 4: Perspective expansion (4 rounds)
- [ ] Step 5: Perspective review and consolidation
- [ ] Step 6: Criteria generation
- [ ] Step 7: Criteria expansion (3 rounds)
- [ ] Step 8: Criteria review and consolidation
- [ ] Step 9: Polarity check
- [ ] Step 10: Score calibration

Step 1: Scenario Grounding

Goal: Identify the distinct real-world contexts in which this question could arise, where each context would materially change what constitutes a good answer.

Method:

Read the question carefully. Infer its domain, intent, audience, stakes, and implicit constraints.
Ask: "In what different situations could someone ask this question, and how would the situation change what a good answer looks like?"
Produce a minimal, non-redundant set of scenarios. Merge any that would lead to the same evaluation criteria.

Output format for each scenario:

scenario_name: short descriptive label
scenario_description: 3-5 sentences explaining what is unique about this context and why it changes what "good" means

Step 2: Scenario Expansion (3 rounds)

Goal: Ensure comprehensive coverage of the evaluation space at the scenario level.

Method (repeat 3 times):

Review all existing scenarios.
Ask: "What contexts or framings are missing that would require different evaluation criteria?"
Identify gaps along context axes (audience, setting, stakes, constraints, domain variation, etc.).
Generate ONLY new scenarios that differ materially from all existing ones. Do not repeat or rephrase existing scenarios.

After each round, append new scenarios to the list.

Step 3: Perspective Generation

Goal: For each scenario, derive the evaluation dimensions that matter for judging an answer in that context.

Method:

For EACH scenario, produce 4-7 non-overlapping evaluation perspectives.
Each perspective must be grounded in the scenario and the question. Derive perspectives entirely from the question's content and context -- do not apply a pre-set list of dimensions.
Aim for maximally diverse themes, including unconventional angles that are specific to this question.
Each perspective should be high-level enough to spawn multiple criteria, yet narrow enough to avoid overlap with other perspectives.

Output format for each perspective:

perspective_name: 2-5 word descriptive label
perspective_description: 3-5 sentences explaining what this perspective evaluates and listing 3-5 specific sub-aspects it covers

Step 4: Perspective Expansion (4 rounds)

Goal: Fill coverage gaps in the perspective set.

Method (repeat 4 times):

Review all perspectives across all scenarios.
Map coverage across quality dimensions implied by the question.
Ask: "What evaluation angles are missing that would yield materially different criteria?"
Generate ONLY new perspectives that provide distinct evaluation value. Do not repeat existing ones.

After each round, append new perspectives to the collection.

Step 5: Perspective Review and Consolidation

Goal: Produce a clean, non-redundant set of perspectives ready for criteria generation.

Method:

If two perspectives target the same evaluation angle but under different scenarios, combine scenario-specific details into each description but keep them as separate perspectives.
Remove perspectives that are off-topic, redundant with others, or too vague to guide concrete criteria generation.
Each kept perspective must contribute unique evaluation value for this question.
Assign each kept perspective a unique ID (p0, p1, p2, ...).

Step 6: Criteria Generation

Goal: For each reviewed perspective, generate concrete, binary evaluation criteria.

Method:

For EACH perspective, generate criteria that an evaluator can check YES or NO against a response.
Cover all sub-aspects listed in the perspective description.

Criteria rules:

Binary: Each criterion must be answerable YES/NO.
Scenario-specific: Explicitly reference features of the question and scenario. Make criteria as detailed and specific as possible.
Self-contained: Each criterion is a standalone statement in the form [Verb + Specific Requirement]. Start with one clear action verb, then state the exact required or forbidden content with qualifiers.
Balanced: Include positive criteria (required content/behavior) and negative criteria (harmful, misleading, or critically wrong content). Only add negative criteria when the issue represents harmful, dangerous, or significantly quality-reducing behavior -- not minor stylistic concerns.
Diverse: Cover all sub-aspects of the perspective.

Scoring standard:

Positive: 1-10 (10 = critical safety/core requirement; 8-9 = important completeness; 5-7 = quality enhancer; 1-4 = minor nice-to-have)
Negative: -1 to -10 (-10 = dangerous/harmful; -8 to -9 = major omission or error; -5 to -7 = quality issue; -1 to -4 = minor issue)

Output format for each criterion:

criterion: the criterion text
points: integer score (positive or negative)
reasoning: 2-3 sentences explaining why this criterion matters for this question and why it received this weight

Step 7: Criteria Expansion (3 rounds)

Goal: Fill coverage gaps in the criteria set.

Method (repeat 3 times):

Review all existing criteria.
Ask: "What concrete content or behavior, if present or absent in a response, would change whether the response passes or fails -- and is not yet covered?"
Generate ONLY new criteria that are non-overlapping with existing ones.
Follow the same rules and scoring standard as Step 6.

After each round, append new criteria to the collection.

Step 8: Criteria Review and Consolidation

Goal: Produce a concise, non-redundant final rubric.

Method:

Deduplicate and merge: Combine criteria that assess the same aspect. Keep the most precise wording and include all distinct details from merged criteria.
Positive/negative overlap: If a positive and negative criterion cover the same aspect, keep only the positive.
Neutralize fixed facts: Replace hard-coded numbers, dates, versions, limits, or placeholders with a requirement that the response states the current/official/latest value or standard.
Balance check: Ensure total positive points outweigh total negative points. Retain all distinct, non-overlapping items.
Assign each final criterion a unique ID (c0, c1, c2, ...).

Step 9: Polarity Check

Goal: Verify that every criterion's score sign correctly reflects whether meeting it is good or bad.

Method:

For each criterion, determine: does meeting this criterion improve the response (positive) or indicate a problem (negative)?
A criterion that describes desirable behavior should have a positive score.
A criterion that describes harmful, wrong, or misleading behavior should have a negative score.
Phrasing like "Avoids doing X" describes desirable behavior (positive). Phrasing like "Does X" (where X is bad) describes undesirable behavior (negative).
Adjust the sign of points if misclassified. Do not change the criterion text.

Step 10: Score Calibration

Goal: Ensure score magnitudes accurately reflect importance.

Method:

Review each criterion's absolute score against the scoring standard from Step 6.
A more important or severe issue must always have a higher absolute score than a less important one.
Adjust magnitudes where needed. Do not change the sign (positive/negative direction) or the criterion text.
Verify the scoring standard is applied consistently:
- Positive: 10 = critical, 8-9 = important, 5-7 = quality, 1-4 = minor
- Negative: -10 = dangerous, -8 to -9 = major, -5 to -7 = moderate, -1 to -4 = minor

Final Output Format

Present the final criteria as a structured list. Each item contains:

criterion_id: unique identifier (c0, c1, ...)
criterion: the full criterion text
points: calibrated integer score
reasoning: why this criterion matters for this specific question and why this weight

Also include intermediate outputs for transparency:

List of scenarios (with IDs, names, descriptions)
List of reviewed perspectives (with IDs, names, descriptions)
Summary statistics: number of scenarios, perspectives, raw criteria, and final criteria

Handling Special Inputs

Questions with images: If the question includes an image, factor image content into scenario analysis and criteria generation. Reference visual elements where relevant in criteria.

Key Constraints

No fixed dimension lists: Never apply a pre-defined set of evaluation dimensions. Every scenario, perspective, and criterion must be freshly derived from the question.
Expansion counts matter: Complete all specified expansion rounds (3 for scenarios, 4 for perspectives, 3 for criteria). Skipping rounds reduces coverage.
Deduplication is mandatory: After each expansion phase and before finalization, check for and remove redundant items.
Criteria must be binary: Every criterion in the final set must be checkable as YES or NO. Do not produce criteria that require subjective degree judgments.

Citation

If you use this skill or the Qworld method in your work, please cite our paper:

@misc{gao2026qworldquestionspecificevaluationcriteria,
      title={Qworld: Question-Specific Evaluation Criteria for LLMs},
      author={Shanghua Gao and Yuchang Su and Pengwei Sui and Curtis Ginder and Marinka Zitnik},
      year={2026},
      eprint={2603.23522},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.23522},
}