scenario-methodology

Last updated: 21 February 2026 at 17:26

Framework for authoring, structuring, and managing scenarios — end-to-end user stories validated probabilistically by LLM-as-judge. Covers the holdout principle, scenario anatomy, versioning, composition, and anti-reward-hacking patterns.

Source: dlabs/claude-marketplace

SKILL.md

name	scenario-methodology
description	Framework for authoring, structuring, and managing scenarios — end-to-end user stories validated probabilistically by LLM-as-judge. Covers the holdout principle, scenario anatomy, versioning, composition, and anti-reward-hacking patterns.

Scenario Methodology

This skill provides the conceptual and practical framework for scenario-based validation. A scenario is not a test — it is a structured user story that describes what a user wants to accomplish and what outcomes would satisfy them.

When to Use

/scenario-testing:st:scenario — authoring new scenarios
/scenario-testing:st:review — reviewing and refining scenarios
/scenario-testing:st:catalog — managing the scenario catalog
When any agent needs to understand what a scenario is and how to write one

Key Distinction: Scenario vs. Test

| Aspect | Traditional Test | Scenario |
|---|---|---|
| Stored | In the codebase | Outside the codebase (holdout) |
| Written in | Code (assertions) | YAML (natural language criteria) |
| Evaluator | Test runner (boolean) | LLM-as-judge (probabilistic) |
| Measures | Correctness | Satisfaction |
| Deterministic | Yes (same input → same result) | No (same scenario → distribution of trajectories) |
| Reward-hackable | Yes (code can be shaped to pass) | Resistant (holdout + LLM judgment) |
| Who understands it | Developers | Anyone (product, design, QA, developers) |
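The "Evaluator" and "Deterministic" rows imply that a scenario result is a rate over many trajectories rather than a single pass/fail. A minimal sketch of that idea, assuming a hypothetical `run_and_judge` callable that executes the scenario once and returns the judge's verdict as a boolean. The function names and the simulated judge are illustrative and not part of the skill.

```python
import random
from typing import Callable


def satisfaction_rate(run_and_judge: Callable[[], bool], runs: int) -> float:
    """Run a scenario `runs` times and return the fraction judged satisfactory."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    verdicts = [run_and_judge() for _ in range(runs)]
    return sum(verdicts) / runs


# Stand-in for "execute the scenario, then ask the LLM judge" (hypothetical).
_rng = random.Random(0)


def fake_run_and_judge() -> bool:
    # Simulates a judge that is satisfied most, but not all, of the time.
    return _rng.random() < 0.8


rate = satisfaction_rate(fake_run_and_judge, runs=50)
print(f"satisfied in {rate:.0%} of trajectories")
```

A traditional test would reduce this to a single boolean; here the number of interest is the rate and how it moves between runs.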

The Holdout Principle

Scenarios are stored outside the version-controlled codebase by default (the .scenarios/ directory is gitignored). This mirrors the holdout set concept in machine learning:

Training data = your codebase, including unit and integration tests
Holdout data = your scenarios, stored separately
Evaluation = running scenarios against the code and measuring satisfaction

In this analogy the code plays the role of the model: it is developed against the tests (training data) but validated against the scenarios (holdout data). This prevents overfitting, because the code can't be shaped to trivially pass scenarios it never sees.
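One way to sanity-check the holdout setup is to confirm that the scenarios directory is actually ignored by git. The sketch below is deliberately naive: it only looks for a literal `.scenarios` entry in the top-level `.gitignore` and does not implement full gitignore pattern matching. The function name is hypothetical.

```python
from pathlib import Path

# Literal spellings of the entry this naive check recognizes.
_IGNORE_ENTRIES = {".scenarios", ".scenarios/", "/.scenarios", "/.scenarios/"}


def scenarios_ignored(repo_root: Path) -> bool:
    """Return True if the top-level .gitignore lists the .scenarios directory."""
    gitignore = repo_root / ".gitignore"
    if not gitignore.exists():
        return False
    lines = gitignore.read_text(encoding="utf-8").splitlines()
    entries = {line.strip() for line in lines}
    return bool(entries & _IGNORE_ENTRIES)
```

For an authoritative answer, `git check-ignore` evaluates the real ignore rules, including nested `.gitignore` files.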

When to Use In-Repo Scenarios

Not all scenarios need to be holdout. Use in-repo storage when:

The team needs to collaboratively edit scenarios
Scenarios are tied to specific features and should version with the code
You trust the development process not to game the scenarios

Configure via .scenarios/config.json, setting "storage" to either "holdout" (the default) or "in-repo":

{
  "storage": "in-repo"
}
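A tool consuming this setting might read it as follows. This is a sketch under assumptions: the file is strict JSON (Python's `json` module rejects comments), a missing file or missing key falls back to "holdout", and the function name `load_storage_mode` is hypothetical.

```python
import json
from pathlib import Path

VALID_STORAGE = {"holdout", "in-repo"}


def load_storage_mode(config_path: Path) -> str:
    """Return the configured storage mode, defaulting to "holdout"."""
    if not config_path.exists():
        return "holdout"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    mode = config.get("storage", "holdout")
    if mode not in VALID_STORAGE:
        raise ValueError(f"unknown storage mode: {mode!r}")
    return mode
```

Rejecting unknown values up front keeps a typo from silently switching scenarios out of holdout storage.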

Scenario Anatomy

Every scenario has 7 required sections and 2 optional sections:

Required

id — unique identifier, kebab-case (e.g., sso-login, export-to-sheets)
domain — category grouping (e.g., auth, onboarding, integrations)
version — integer version number, bumped on changes
persona — who is the user (role, expertise, goals)
context — starting state (data, permissions, environment, services)
intent — what the user wants to accomplish (1-2 sentences)
satisfaction_criteria — list of outcomes that would satisfy the user

Optional

anti_patterns — outcomes that would definitely NOT satisfy the user
chaos — failure conditions to inject during execution
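The nine fields above can be modeled as a small data structure. This sketch is not the YAML template from references/scenario-template.md. The field types are assumptions (persona and context are plain strings here), as are the rules that the version starts at 1 and that at least one criterion must be present. The example values are illustrative only.

```python
import re
from dataclasses import dataclass, field

# Lowercase words or digits separated by single hyphens, e.g. "sso-login".
KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


@dataclass
class Scenario:
    # Required fields
    id: str
    domain: str
    version: int
    persona: str
    context: str
    intent: str
    satisfaction_criteria: list[str]
    # Optional fields
    anti_patterns: list[str] = field(default_factory=list)
    chaos: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not KEBAB_CASE.match(self.id):
            raise ValueError(f"id must be kebab-case: {self.id!r}")
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(self.version, bool) or not isinstance(self.version, int):
            raise ValueError("version must be an integer")
        if self.version < 1:
            raise ValueError("version must be at least 1 (assumed rule)")
        if not self.satisfaction_criteria:
            raise ValueError("at least one satisfaction criterion is required")


example = Scenario(
    id="sso-login",
    domain="auth",
    version=1,
    persona="Office manager, non-technical, wants to reach the dashboard",
    context="Company SSO is configured; the user has a valid account",
    intent="Sign in with the company account without creating a password.",
    satisfaction_criteria=["User lands on the dashboard after signing in"],
    anti_patterns=["Raw error message shown to user"],
)
```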

Writing Satisfaction Criteria

Good criteria are:

Specific enough to judge — "Ticket has a descriptive title" is judgeable; "Ticket is good" is not
Flexible enough to allow valid variation — "Priority is set appropriately (High or Medium for regression bugs)" allows judgment; "Priority is exactly High" is too rigid
User-centered — describe what the user would notice, not what the code does internally
Independent — each criterion can be evaluated separately

Writing Anti-Patterns

Anti-patterns are the inverse of satisfaction criteria — they describe outcomes that are clearly wrong:

"Raw error message shown to user"
"Agent enters infinite retry loop"
"Data is written to wrong account"
"User is asked more than 3 clarifying questions before any action"

A trajectory that matches any anti-pattern is automatically judged "unsatisfactory", no matter how many satisfaction criteria it meets.
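The override rule can be expressed as a short verdict function. Only the anti-pattern override comes from the text above. The `Judgment` structure, the "satisfactory" label, and the `threshold` on the fraction of criteria met are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """Per-trajectory output of a judge (hypothetical structure)."""
    criteria_met: dict[str, bool]
    anti_patterns_matched: list[str]


def verdict(judgment: Judgment, threshold: float = 1.0) -> str:
    """Any anti-pattern match overrides the criteria; otherwise apply the threshold."""
    if judgment.anti_patterns_matched:
        return "unsatisfactory"
    if not judgment.criteria_met:
        return "unsatisfactory"
    met = sum(judgment.criteria_met.values()) / len(judgment.criteria_met)
    return "satisfactory" if met >= threshold else "unsatisfactory"


all_met_but_leaky = Judgment(
    criteria_met={"Ticket has a descriptive title": True},
    anti_patterns_matched=["Raw error message shown to user"],
)
print(verdict(all_met_but_leaky))  # prints "unsatisfactory"
```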

Scenario Lifecycle

Draft → Review → Active → Versioned
                   ↑          │
                   └──────────┘ (update + version bump)

Draft — authored via /st:scenario, may be incomplete
Review — reviewed via /st:review by the scenario-reviewer agent
Active — in the catalog, used for validation runs
Versioned — updated with changelog, satisfaction history preserved per version
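The lifecycle diagram reads naturally as a small state machine. The sketch below allows only the transitions drawn in the diagram. Bumping the version at the moment a scenario enters Versioned is an assumption, since the diagram attaches "update + version bump" to the loop without fixing the exact step. All names are hypothetical.

```python
from enum import Enum


class Stage(Enum):
    DRAFT = "draft"
    REVIEW = "review"
    ACTIVE = "active"
    VERSIONED = "versioned"


# Transitions drawn in the diagram, including the loop back to Active.
ALLOWED = {
    Stage.DRAFT: {Stage.REVIEW},
    Stage.REVIEW: {Stage.ACTIVE},
    Stage.ACTIVE: {Stage.VERSIONED},
    Stage.VERSIONED: {Stage.ACTIVE},
}


def advance(stage: Stage, target: Stage, version: int) -> tuple[Stage, int]:
    """Move to `target` if the diagram allows it; bump version on entering Versioned."""
    if target not in ALLOWED[stage]:
        raise ValueError(f"cannot move from {stage.value} to {target.value}")
    if target is Stage.VERSIONED:
        version += 1  # assumed placement of the version bump
    return target, version


stage, version = Stage.DRAFT, 1
for target in (Stage.REVIEW, Stage.ACTIVE, Stage.VERSIONED, Stage.ACTIVE):
    stage, version = advance(stage, target, version)
print(stage.value, version)  # prints "active 2"
```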

Composition

Scenarios can be composed for complex workflows:

Sequential — scenario A's end state is scenario B's start state
Parallel — scenarios A and B run independently, overall satisfaction is the aggregate
Conditional — scenario B only runs if scenario A's satisfaction meets a threshold
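The three composition modes can be sketched as combinators over scenario runners. Several choices here are assumptions, not taken from the skill: a runner maps a start state to an end state plus a satisfaction score in [0, 1]; sequential and conditional chains report the minimum score; parallel reports the mean as its "aggregate" and merges the two end states; and the conditional runner starts B from A's end state. Parallel runs are independent in the sense that each gets its own copy of the state, though this sketch executes them one after the other.

```python
import copy
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    state: dict
    satisfaction: float


Runner = Callable[[dict], Result]


def sequential(a: Runner, b: Runner) -> Runner:
    """B starts from A's end state."""
    def run(state: dict) -> Result:
        first = a(state)
        second = b(first.state)
        return Result(second.state, min(first.satisfaction, second.satisfaction))
    return run


def parallel(a: Runner, b: Runner) -> Runner:
    """A and B each run on their own copy of the start state."""
    def run(state: dict) -> Result:
        ra = a(copy.deepcopy(state))
        rb = b(copy.deepcopy(state))
        merged = {**ra.state, **rb.state}
        return Result(merged, (ra.satisfaction + rb.satisfaction) / 2)
    return run


def conditional(a: Runner, b: Runner, threshold: float) -> Runner:
    """B runs only if A's satisfaction meets the threshold."""
    def run(state: dict) -> Result:
        first = a(state)
        if first.satisfaction < threshold:
            return first
        second = b(first.state)
        return Result(second.state, min(first.satisfaction, second.satisfaction))
    return run


# Hypothetical runners standing in for real scenario executions.
def login(state: dict) -> Result:
    return Result({**state, "logged_in": True}, 1.0)


def export(state: dict) -> Result:
    return Result({**state, "exported": state.get("logged_in", False)}, 0.5)


print(sequential(login, export)({}).satisfaction)  # prints 0.5
print(parallel(login, export)({}).satisfaction)    # prints 0.75
```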

References

references/scenario-template.md — YAML template with field documentation
references/criteria-patterns.md — Patterns for writing effective satisfaction criteria and anti-patterns
