Run any Skill in Manus with one click

prompt-evaluator

Stars22

Forks3

UpdatedMay 15, 2026 at 04:37

Evaluate how you use Claude Code — analyze prompt patterns, feature utilization, and get improvement suggestions. Trigger: /prompts:evaluate, prompt analysis, usage evaluation, how am I using Claude

Installation

Install with Codex or Claude Copy this prompt, paste it into Codex, Claude, or another assistant, and let it review the skill page and install it for you.

Run Skill in Manus

Source

nguyenthienthanh

nguyenthienthanh/aura-frog

View GitHub Repository View Creator Repositories

Download

Run Skill in Manus

Related occupationsSOC

Based on SOC occupation classification

Market Research Analysts and Marketing SpecialistsBusiness and Financial Operations Occupations·SOC 13-1161

SKILL.md

readonly

More from this repository

same repository

agent-detector

nguyenthienthanh/aura-frog

CRITICAL: MUST run for EVERY message. Detects agent, complexity, AND model automatically. Without this, tasks route to wrong agents and use wrong models, degrading quality and wasting tokens.

2026-06-1022

problem-solving

nguyenthienthanh/aura-frog

5 techniques for different problem types. Use when stuck or facing complex challenges.

2026-06-1022

sequential-thinking

nguyenthienthanh/aura-frog

Structured thinking process for complex analysis. Supports revision, branching, and dynamic adjustment.

2026-06-1022

angular-expert

nguyenthienthanh/aura-frog

Angular 17+ gotchas and decision criteria. Covers signals vs observables, standalone patterns, and common pitfalls Claude gets wrong.

2026-05-1522

api-designer

nguyenthienthanh/aura-frog

Designs RESTful APIs with endpoint naming, versioning strategies (URL path, header-based), pagination (offset and cursor), error response schemas, and OpenAPI conventions. Use when the user asks about REST API design, creating endpoints, URL structure, API versioning, status codes, Swagger, or OpenAPI specs.

2026-05-1522

bugfix-quick

nguyenthienthanh/aura-frog

Fast bug fixes with root cause investigation + TDD. Enforces 'no fix without root cause' discipline and verification protocol. Without this skill, fixes are applied at symptoms instead of sources, and bugs return.

2026-05-1522

name	prompt-evaluator
description	Evaluate how you use Claude Code — analyze prompt patterns, feature utilization, and get improvement suggestions. Trigger: /prompts:evaluate, prompt analysis, usage evaluation, how am I using Claude
autoInvoke	false
priority	30
triggers	["/prompts:evaluate","evaluate my prompts","how am I using Claude","usage analysis","prompt patterns","improve my workflow","evaluate this prompt","optimize this prompt"]
context	fork
allowed-tools	Read, Bash, Glob, Grep
user-invocable	false

AI-consumed reference. Optimized for Claude to read during execution. Human-readable explanation: see docs/architecture/HIERARCHICAL_PLANNING.md or docs/getting-started/ depending on topic.

Prompt Evaluator

Two modes: Usage Analytics (how you use Claude Code) and Prompt Quality (evaluate/optimize a specific prompt).

For task-intake validation (before executing a user request), use the 6-dimension benchmark in rules/core/prompt-validation.md — that's a different check (task completeness, not prompt craftsmanship) and should run at /run pre-execution per rules/core/no-assumption.md.

Mode 1: Usage Analytics

Trigger: /prompts:evaluate [--days N]

node "${CLAUDE_PLUGIN_ROOT}/scripts/metrics/evaluate-prompts.cjs" --days 7

Analyzes prompt logs from .claude/metrics/prompts/{date}.jsonl. Reports: intent distribution, feature utilization, complexity profile, suggestions, usage score (0-100).

If no data: prompt-logger hook collects automatically. Return after a few sessions.

Mode 2: Prompt Quality Evaluation

Trigger: "evaluate this prompt", "optimize this prompt", or user pastes a prompt for review.

Process

Step 1 — Classify. Infer intent, task type (coding/creative/RAG/agent/reasoning), constraints (format, tone, tools), target model.

Step 2 — Score. Rate 0-10 on 5 dimensions:

dimensions[5]{dimension,what_to_check}:
  Clarity,"Is intent unambiguous? Can the model misinterpret?"
  Instruction Quality,"Are steps specific? Is the task decomposed well?"
  Efficiency,"Token waste? Redundant phrasing? Could say the same in fewer words?"
  Robustness,"Edge cases handled? Hallucination controls? Fallback behavior?"
  Output Alignment,"Format specified? Easy to parse? Deterministic output?"

Calibration: 0-3 poor, 4-6 acceptable, 7-8 good, 9-10 excellent.

Step 3 — Detect Issues. List specific problems:

Ambiguities that could cause wrong output
Missing constraints (format, length, tone)
Redundant instructions (same thing said twice)
Hallucination risks (no grounding, no source constraint)
Weak structure (wall of text vs clear sections)
Token waste (filler phrases, over-explanation of obvious behavior)

Step 4 — Optimize. Rewrite in two versions:

Minimal Fix — preserve structure, fix issues only
Production Version — fully optimized for token efficiency + determinism

Goals: preserve intent, reduce tokens, improve structure, add missing constraints, ensure parseable output.

Output Format

{
  "task_type": "...",
  "score": 0-10,
  "breakdown": {
    "clarity": 0-10,
    "instruction": 0-10,
    "efficiency": 0-10,
    "robustness": 0-10,
    "output_alignment": 0-10
  },
  "issues": ["specific issue 1", "specific issue 2"],
  "suggestions": ["actionable suggestion 1"],
  "optimized_prompt": {
    "minimal": "...",
    "production": "..."
  }
}

Evaluation Principles

principles[6]{principle}:
  Principle > checklist — 3 clear rules beat 20 vague ones
  Show don't tell — one example > paragraph of explanation
  Structured output > prose — JSON/TOON/tables when parseable output needed
  Remove what the model already knows — don't teach coding basics to Claude
  Constraint what varies — only specify behavior the model wouldn't do by default
  Token budget awareness — every word costs money at scale

Mode 3: Output Variance Check

Trigger: user suspects a prompt is unstable, or before shipping a prompt to production.

Run the prompt N = 3 times (separate contexts, identical input). Compare outputs.

Variance Scoring

variance_level  =  percentage_of_non_matching_content_across_runs

- <10%   STABLE       — ship as-is
- 10-30% LOW VARIANCE — minor differences, usually acceptable
- 30-60% UNSTABLE     — prompt needs constraints to reduce ambiguity
- >60%   CHAOS        — rewrite the prompt before any use

What to Check

Dimension	What "same" means
Structure	Same sections, same formatting, same order
Key facts	Same numbers, names, file paths
Decision	Same conclusion/recommendation
Format	Same JSON keys, same table headers

Example Report

Prompt: "Review this PR and list issues"
Runs: 3
Variance: 42% (UNSTABLE)

Divergence:
- Run 1 listed 5 issues (security, perf, style)
- Run 2 listed 3 issues (missed perf and style findings)
- Run 3 listed 7 issues (added subjective style nitpicks)

Root cause: No constraint on issue categories or severity threshold.

Recommendation: Change prompt to: "List issues at severity >= warning. Categories: security, correctness, performance. Skip style/formatting."

When to Use

Before shipping a prompt that will run many times (automation, CI, customer-facing)
When user complains "sometimes it does X, sometimes Y"
Before locking a skill's SKILL.md content (stability of future invocations)

Cost: 3× single-run. Worth it for prompts that will run 50+ times.

Anti-Patterns to Flag

antipatterns[6]{pattern,fix}:
  "You are an expert...",Remove — model capability is fixed by model choice
  "Please note that...",Remove — filler phrase
  "It is important to always...",Rewrite as direct instruction
  Repeating the same rule in 3 sections,Deduplicate — state once
  Explaining what JSON is,Remove — model knows JSON
  Step-by-step for trivial tasks,Remove steps — just state the goal