satisfaction-metrics

Stars: 1

Forks: 1

Updated February 21, 2026 at 17:26

Framework for measuring, aggregating, and trending satisfaction scores across scenarios. Covers LLM-as-judge methodology, trajectory evaluation, threshold configuration, comparison analysis, and reporting.

Source

dlabs

dlabs/claude-marketplace

Satisfaction Metrics

This skill provides the measurement framework for scenario validation. Satisfaction is a probabilistic metric — the fraction of observed trajectories through scenarios that an LLM judge deems "satisfactory" for the described user.

When to Use

/scenario-testing:st:satisfy — judging trajectories
/scenario-testing:st:report — generating satisfaction reports
/scenario-testing:st:validate — full pipeline validation
When any agent needs to understand how satisfaction is computed

Why Satisfaction, Not Correctness

Traditional tests measure correctness: does the output match the expected output?

Satisfaction measures something different: would the user described in the scenario consider this outcome acceptable?

This distinction matters because:

Agentic software has many valid execution paths — there is no single "correct" output
LLM-powered features produce non-deterministic outputs — running the same scenario twice yields different (but potentially equally good) results
User satisfaction is context-dependent — the persona's expertise, goals, and expectations shape what counts as "good enough"

The Satisfaction Computation

Per-Trajectory

Each trajectory is judged independently:

judge(trajectory, scenario) → {
  verdict: "satisfactory" | "unsatisfactory",
  reasoning: string,
  criteria_matched: string[],
  anti_patterns_matched: string[],
  confidence: float  // 0.0 to 1.0
}

A trajectory is "satisfactory" when ALL of:

At least ONE satisfaction criterion is matched
ZERO anti-patterns are matched
The judge believes the described persona would accept this outcome
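
The three conditions above can be sketched as a single predicate. This is a minimal illustration, not the skill's implementation: the `Judgment` class and its `persona_accepts` field are hypothetical names, since the judge output shown earlier does not expose persona acceptance as a separate field.

```python
from dataclasses import dataclass, field


@dataclass
class Judgment:
    """Hypothetical container for the inputs to the verdict rule."""
    persona_accepts: bool  # assumed: the judge's view of persona acceptance
    criteria_matched: list[str] = field(default_factory=list)
    anti_patterns_matched: list[str] = field(default_factory=list)


def is_satisfactory(judgment: Judgment) -> bool:
    """Return True only when all three conditions hold."""
    return (
        len(judgment.criteria_matched) >= 1          # at least one criterion
        and len(judgment.anti_patterns_matched) == 0  # zero anti-patterns
        and judgment.persona_accepts                  # persona would accept
    )
```

A single matched anti-pattern is enough to make the trajectory unsatisfactory, however many criteria are matched.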

Per-Scenario

satisfaction(scenario) = count(satisfactory) / count(total_trajectories)

The number of trajectories per scenario is configurable (default: 50). More trajectories give a tighter confidence interval on the satisfaction estimate.
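
As a rough sketch, the per-scenario formula reduces to counting verdict strings. The function name `scenario_satisfaction` is illustrative and not part of the skill.

```python
def scenario_satisfaction(verdicts: list[str]) -> float:
    """Fraction of trajectories whose verdict is "satisfactory"."""
    if not verdicts:
        raise ValueError("need at least one judged trajectory")
    satisfactory = sum(1 for verdict in verdicts if verdict == "satisfactory")
    return satisfactory / len(verdicts)
```

For example, 93 satisfactory verdicts out of 100 trajectories give a score of 0.93.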

Per-Domain

satisfaction(domain) = mean(satisfaction(scenario) for scenario in domain)

Scenarios within a domain are weighted equally by default. Override this with explicit weights in the config if some scenarios matter more than others.

Overall

satisfaction(overall) = weighted_mean(satisfaction(domain) for domain in all_domains)
weight(domain) = total_trajectory_count(domain)

Domains are weighted by trajectory count so that small domains do not disproportionately affect the overall score.
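
The two aggregation levels can be sketched as follows. This assumes equal scenario weights within a domain (the default case) and represents each domain as a hypothetical `(score, trajectory_count)` pair.

```python
from statistics import fmean


def domain_satisfaction(scenario_scores: list[float]) -> float:
    """Unweighted mean of the scenario scores in one domain."""
    return fmean(scenario_scores)


def overall_satisfaction(domains: dict[str, tuple[float, int]]) -> float:
    """Mean of domain scores, weighted by each domain's trajectory count."""
    total = sum(count for _, count in domains.values())
    if total == 0:
        raise ValueError("no trajectories recorded")
    return sum(score * count for score, count in domains.values()) / total
```

With a domain at 1.0 over 100 trajectories and another at 0.5 over 300, the overall score is 0.625 rather than the unweighted 0.75.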

LLM-as-Judge Methodology

Default Judge Configuration

model: opus
temperature: 0.0  # deterministic judgment
max_tokens: 2000

system_prompt: |
  You are a satisfaction judge evaluating whether a software interaction trajectory
  would satisfy the user described in the scenario.

  You will receive:
  1. The scenario (persona, intent, satisfaction criteria, anti-patterns)
  2. The trajectory (actions taken, state transitions, final outcome)

  Evaluate:
  - Does the outcome match ANY of the satisfaction criteria?
  - Does the trajectory match ANY anti-patterns?
  - Would the described persona consider this outcome acceptable?

  Respond with a JSON object:
  {
    "verdict": "satisfactory" or "unsatisfactory",
    "reasoning": "2-3 sentence explanation",
    "criteria_matched": ["list of matched criteria"],
    "anti_patterns_matched": ["list of matched anti-patterns"],
    "confidence": 0.0 to 1.0
  }
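
Because the judge replies in free text that is only expected to be JSON, it may help to validate the reply before counting it. The sketch below checks the fields named in the prompt above; `parse_judgment` is a hypothetical helper, and rejecting malformed replies (rather than retrying) is an assumption.

```python
import json

VERDICTS = {"satisfactory", "unsatisfactory"}


def parse_judgment(raw: str) -> dict:
    """Parse a judge reply and check the fields requested in the prompt."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    if data.get("verdict") not in VERDICTS:
        raise ValueError(f"unexpected verdict: {data.get('verdict')!r}")
    if not isinstance(data.get("reasoning"), str):
        raise ValueError("reasoning must be a string")
    for key in ("criteria_matched", "anti_patterns_matched"):
        if not isinstance(data.get(key), list):
            raise ValueError(f"{key} must be a list")
    confidence = data.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    return data
```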

Why Temperature 0?

The judge should be as repeatable as possible: the same trajectory judged twice should get the same verdict. Temperature 0 minimizes sampling variance, although it does not guarantee identical verdicts across model updates. To measure judge variance, run multiple judgment passes and compare the verdicts.
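
One simple way to compare passes is the share of trajectories whose verdict changes between them. This is an illustrative metric, not one defined by the skill; `flip_rate` and its input layout are assumptions.

```python
def flip_rate(passes: list[list[str]]) -> float:
    """Share of trajectories whose verdict differs across judgment passes.

    passes[i][j] is the verdict of pass i on trajectory j.
    """
    if len(passes) < 2:
        raise ValueError("need at least two judgment passes")
    if len({len(p) for p in passes}) != 1 or not passes[0]:
        raise ValueError("passes must cover the same non-empty trajectory set")
    per_trajectory = list(zip(*passes))  # regroup verdicts by trajectory
    unstable = sum(1 for verdicts in per_trajectory if len(set(verdicts)) > 1)
    return unstable / len(per_trajectory)
```

A flip rate of 0.0 means every pass agreed on every trajectory.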

Custom Judges

Override the default judge per-scenario or per-domain:

# Per-scenario (in the scenario YAML)
judge_config:
  system_prompt: |
    You are a security-focused judge. Weight data protection
    and access control heavily in your evaluation.
  strict_mode: true  # any anti-pattern = unsatisfactory regardless

// Per-domain (in config.json)
{
  "judge_overrides": {
    "auth": {
      "system_prompt": "...",
      "strict_mode": true
    }
  }
}
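
The source does not state how per-scenario and per-domain overrides combine. The sketch below assumes the most specific setting wins (scenario over domain over default), mirroring the threshold hierarchy; both that precedence and the `resolve_judge_config` helper are assumptions.

```python
# Defaults copied from the default judge configuration above.
DEFAULT_JUDGE = {"model": "opus", "temperature": 0.0, "max_tokens": 2000}


def resolve_judge_config(default, domain_override=None, scenario_override=None):
    """Merge overrides onto the default; later (more specific) layers win."""
    merged = dict(default)                   # copy so the default is not mutated
    merged.update(domain_override or {})     # assumed: domain beats default
    merged.update(scenario_override or {})   # assumed: scenario beats domain
    return merged
```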

Thresholds

Configuration

{
  "thresholds": {
    "global_minimum": 0.80,
    "domains": {
      "auth": 0.95,
      "payments": 0.98,
      "onboarding": 0.75,
      "integrations": 0.85
    },
    "scenarios": {
      "sso-login": 0.98,
      "password-reset": 0.90
    }
  }
}

Threshold Hierarchy

Scenario-specific > domain-specific > global minimum. If a scenario has its own threshold, that takes precedence.
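
The lookup order can be sketched as a short fallback chain over the configuration shown above. `resolve_threshold` is an illustrative name, not part of the skill.

```python
def resolve_threshold(thresholds: dict, domain: str, scenario: str) -> float:
    """Pick the most specific threshold: scenario, then domain, then global."""
    scenarios = thresholds.get("scenarios", {})
    if scenario in scenarios:
        return scenarios[scenario]
    domains = thresholds.get("domains", {})
    if domain in domains:
        return domains[domain]
    return thresholds["global_minimum"]


# Values taken from the example configuration above.
THRESHOLDS = {
    "global_minimum": 0.80,
    "domains": {"auth": 0.95, "payments": 0.98, "onboarding": 0.75, "integrations": 0.85},
    "scenarios": {"sso-login": 0.98, "password-reset": 0.90},
}
```

With this configuration, `sso-login` resolves to 0.98, any other `auth` scenario to 0.95, and a scenario in an unlisted domain to 0.80.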

Threshold Semantics

Above threshold — the feature is working well enough for users
Below threshold — investigation needed, satisfaction has degraded
Below global minimum — critical alert, feature may be broken
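
These three outcomes might be mapped to status labels as follows. The labels are hypothetical, and treating a score exactly equal to the threshold as passing is an assumption, since the source only describes "above" and "below".

```python
def classify(score: float, threshold: float, global_minimum: float) -> str:
    """Map a satisfaction score to one of three hypothetical status labels."""
    if score < global_minimum:
        return "critical"      # below global minimum: feature may be broken
    if score < threshold:
        return "investigate"   # below threshold: satisfaction has degraded
    return "ok"                # assumed: at or above threshold passes
```

The global minimum is checked first, so a score of 0.78 counts as critical even against a domain threshold of 0.75.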

Comparison Analysis

Run-over-Run Comparison

/scenario-testing:st:report --compare-to 2026-02-15

Computes delta per-scenario and highlights:

Improvements (delta > +0.02)
Regressions (delta < -0.02)
Stable (delta within ±0.02)
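
A minimal sketch of the delta classification, assuming a delta of exactly 0.02 in either direction counts as stable. The delta is rounded before comparison so that floating-point noise (for example `0.93 - 0.91`) does not push a boundary case over the limit.

```python
def classify_delta(previous: float, current: float, tolerance: float = 0.02) -> str:
    """Label a run-over-run change in a scenario's satisfaction score."""
    delta = round(current - previous, 6)  # rounding guards against float noise
    if delta > tolerance:
        return "improvement"
    if delta < -tolerance:
        return "regression"
    return "stable"
```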

Trend Analysis

/scenario-testing:st:report --trend --days 30

Shows satisfaction over time, helping identify:

Gradual degradation (satisfaction slowly declining)
Step-function changes (satisfaction dropped after a specific date/deploy)
Recovery patterns (satisfaction dropped and came back)
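
The source does not say how the report detects these patterns. As one possible approach, a least-squares slope hints at gradual degradation and the largest single drop hints at a step change. The function below is a hypothetical sketch over a chronologically ordered list of scores, one per run.

```python
def trend_summary(scores: list[float]) -> dict:
    """Summarize a chronological series of satisfaction scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs")
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    # Ordinary least-squares slope with run index as the x-axis.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    changes = [scores[i] - scores[i - 1] for i in range(1, n)]
    worst = min(range(len(changes)), key=changes.__getitem__)
    return {
        "slope_per_run": numerator / denominator,  # negative means declining
        "largest_drop": changes[worst],            # most negative single change
        "largest_drop_at": worst + 1,              # index of the run after the drop
    }
```

A small negative slope with no large single drop suggests gradual degradation, while one large drop at a specific run points to a step change.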

Statistical Considerations

Sample Size

The number of trajectories affects confidence in the satisfaction estimate:

Trajectories	95% Margin of Error (worst case, p = 0.5)
10	±0.31
25	±0.20
50	±0.14
100	±0.10
200	±0.07
500	±0.04

For critical scenarios (auth, payments), use 100+ trajectories. For exploratory features, 25-50 is sufficient.
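
The figures in the table are consistent with the normal-approximation margin of error for a proportion at the worst case p = 0.5. The sketch below reproduces them under that assumption; the source does not state which interval method was used.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval for a proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(p * (1 - p) / n)


for n in (10, 25, 50, 100, 200, 500):
    print(n, f"±{margin_of_error(n):.2f}")
```

The margin shrinks as the observed satisfaction moves away from 0.5, but the normal approximation becomes less reliable near 0 or 1, which is where high thresholds such as 0.98 sit.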

Judge Reliability

The LLM judge is itself non-deterministic (even at temperature 0, model updates can change behavior). Periodically:

Re-judge a sample of past trajectories and compare verdicts
Compute inter-judge agreement if using multiple judge configurations
Review the judge's reasoning for edge cases
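
Agreement between two sets of verdicts on the same trajectories can be sketched with raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The choice of kappa is an assumption; the source does not name an agreement statistic.

```python
def agreement(a: list[str], b: list[str]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two verdict lists."""
    if len(a) != len(b) or not a:
        raise ValueError("verdict lists must be non-empty and equally long")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    # Chance agreement: probability both judges pick the same label independently.
    expected = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)
    return observed, kappa
```

The same function applies to re-judging past trajectories: pass the original verdicts as `a` and the new verdicts as `b`.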

Report Format

Terminal Report

═══════════════════════════════════════════════
  SATISFACTION REPORT — 2026-02-20
═══════════════════════════════════════════════

  Overall: 0.91 (1365/1500 trajectories)

  Domain          Satisfaction  Threshold  Status
  ─────────────   ────────────  ─────────  ──────
  auth            0.96          0.95       PASS
  onboarding      0.87          0.75       PASS
  integrations    0.88          0.85       PASS
  payments        0.93          0.98       FAIL ▼

  Scenarios Below Threshold:
    payments/checkout-flow  0.93 (target: 0.98)
      → 7/100 unsatisfactory: timeout during payment confirmation
        not communicated to user (no retry UI)

═══════════════════════════════════════════════

JSON Report (for CI)

{
  "date": "2026-02-20",
  "overall_satisfaction": 0.91,
  "total_trajectories": 1500,
  "satisfactory_trajectories": 1365,
  "pass": false,
  "domains": {
    "auth": { "satisfaction": 0.96, "threshold": 0.95, "pass": true },
    "payments": { "satisfaction": 0.93, "threshold": 0.98, "pass": false }
  },
  "failing_scenarios": [
    {
      "id": "checkout-flow",
      "domain": "payments",
      "satisfaction": 0.93,
      "threshold": 0.98,
      "failure_summary": "timeout handling not communicated to user"
    }
  ]
}
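
A CI job could turn this report into an exit code along the following lines. This is a sketch that relies only on the `pass` and `failing_scenarios` fields shown above; the helper name and the message format are hypothetical.

```python
import json


def ci_gate(report_json: str) -> tuple[int, list[str]]:
    """Return (exit_code, messages) for a JSON satisfaction report."""
    report = json.loads(report_json)
    messages = [
        f"{item['domain']}/{item['id']}: "
        f"{item['satisfaction']:.2f} < {item['threshold']:.2f}"
        for item in report.get("failing_scenarios", [])
    ]
    # Treat a missing or non-true "pass" field as a failure.
    exit_code = 0 if report.get("pass") is True else 1
    return exit_code, messages
```

A wrapper script would print the messages and call `sys.exit` with the returned code so the pipeline fails whenever `pass` is not `true`.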

References

references/judge-prompt-template.md — Default and domain-specific judge system prompts
references/report-schemas.md — JSON schemas for reports, judgments, and run manifests