sc-evaluate

LLM pipeline evaluation with oracle judge scoring. Runs prompts against gold standard datasets, evaluates output quality via LLM-as-judge, and generates scored reports with improvement recommendations.

Install command

npx skills add https://github.com/Tony363/SuperClaude --skill sc-evaluate

Copy and paste this command into Claude Code to install the skill

Source

Tony363/SuperClaude

Stars: 18

Forks: 2

Updated: March 2, 2026 at 00:23

name	sc-evaluate
description	LLM pipeline evaluation with oracle judge scoring. Runs prompts against gold standard datasets, evaluates output quality via LLM-as-judge, and generates scored reports with improvement recommendations.

LLM Evaluation Skill

Run LLM pipeline evaluation against gold standard datasets using oracle LLM-as-judge scoring. Measures output quality across weighted dimensions, identifies weak steps, and suggests prompt improvements.

Quick Start

# Full evaluation (all test cases, all steps)
/sc:evaluate

# Quick spot check
/sc:evaluate --cases=case_1,case_2 --steps=1,2,3

# Re-evaluate existing results without re-running pipeline
/sc:evaluate --skip-pipeline

# Generate outputs only (no evaluation)
/sc:evaluate --skip-eval

# Specify judge model
/sc:evaluate --judge-model=gpt-4o

# Dry run to preview plan
/sc:evaluate --dry-run

Behavioral Flow

Discover - Find evaluation script, gold standards, and prompt files
Configure - Parse scope (cases, steps, model overrides)
Execute - Run pipeline on gold standard inputs
Evaluate - Score outputs against gold standards via LLM-as-judge
Analyze - Identify weak steps, dimension breakdowns, patterns
Recommend - Suggest specific prompt improvements for low-scoring steps
Report - Generate JSON + Markdown evaluation reports

Flags

Flag	Type	Default	Description
`--cases`	string	all	Comma-separated test case IDs to evaluate
`--steps`	string	all	Comma-separated step numbers to evaluate
`--model`	string	env default	Override pipeline model
`--judge-model`	string	env default	Override judge/oracle model
`--skip-pipeline`	bool	false	Skip pipeline execution, evaluate existing results
`--skip-eval`	bool	false	Run pipeline only, skip evaluation
`--dry-run`	bool	false	Preview execution plan without API calls
`--output`	string	`eval_runs/YYYYMMDD_HHMMSS/`	Output directory
`--concurrency`	int	5	Parallel judge calls
`--threshold`	int	70	Scores below this value are flagged as "needs improvement"
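
The flags table can be mirrored in an argument parser. The following is a minimal sketch, assuming the evaluation script is written in Python and accepts the same flag names; `build_parser` and `split_csv` are hypothetical helpers, not part of the skill.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose flags and defaults follow the table above."""
    parser = argparse.ArgumentParser(prog="sc-evaluate")
    parser.add_argument("--cases", default="all", help="Comma-separated test case IDs")
    parser.add_argument("--steps", default="all", help="Comma-separated step numbers")
    # None stands in for "env default": the caller resolves it from the environment.
    parser.add_argument("--model", default=None, help="Override pipeline model")
    parser.add_argument("--judge-model", default=None, help="Override judge/oracle model")
    parser.add_argument("--skip-pipeline", action="store_true")
    parser.add_argument("--skip-eval", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--output", default=None, help="Output directory")
    parser.add_argument("--concurrency", type=int, default=5, help="Parallel judge calls")
    parser.add_argument("--threshold", type=int, default=70)
    return parser


def split_csv(value: str) -> Optional[List[str]]:
    """Return None for 'all' (no filter), otherwise the comma-separated items."""
    if value == "all":
        return None
    return [item.strip() for item in value.split(",") if item.strip()]


args = build_parser().parse_args(["--cases=case_1,case_2", "--steps=1,2,3", "--dry-run"])
print(split_csv(args.cases), split_csv(args.steps), args.dry_run, args.threshold)
```

Returning `None` for `all` lets downstream code treat "no filter" differently from an explicit list.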

Phase 1: Discover Project Structure

Locate evaluation components:

Component	Common Locations	Purpose
Evaluation script	`scripts/run_eval.py`, `eval/run.py`	Orchestrates pipeline + scoring
Gold standards	`gold_standards/`, `test_data/`, `fixtures/`	Expected outputs
Prompts	`prompts/`, `templates/`	Pipeline prompt templates
Rubrics	`eval/rubrics.py`, `config/rubrics.yaml`	Scoring dimensions and weights

If no standard structure is found, ask the user to specify the paths.
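
The discovery step could be sketched as a lookup over the common locations listed above. The `discover` and `missing_components` helpers are hypothetical, and the sketch assumes the first existing candidate wins for each component.

```python
from pathlib import Path
from typing import Dict, List, Optional

# Candidate locations copied from the table above, relative to the project root.
CANDIDATES: Dict[str, List[str]] = {
    "eval_script": ["scripts/run_eval.py", "eval/run.py"],
    "gold_standards": ["gold_standards", "test_data", "fixtures"],
    "prompts": ["prompts", "templates"],
    "rubrics": ["eval/rubrics.py", "config/rubrics.yaml"],
}


def discover(root: Path) -> Dict[str, Optional[Path]]:
    """Return the first existing candidate per component, or None if none exists."""
    found: Dict[str, Optional[Path]] = {}
    for component, candidates in CANDIDATES.items():
        found[component] = next(
            (root / rel for rel in candidates if (root / rel).exists()), None
        )
    return found


def missing_components(found: Dict[str, Optional[Path]]) -> List[str]:
    """List the components the user would have to be asked about."""
    return [name for name, path in found.items() if path is None]
```

A caller would run `discover(Path.cwd())` and prompt the user for each name that `missing_components` returns.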

Phase 2: Configure Scope

Parse arguments to determine:

Which test cases to run (default: all discovered)
Which pipeline steps to evaluate (default: all)
Model overrides for pipeline and judge
Output directory (default: timestamped)

Create output directory:

OUTPUT_DIR="${output:-eval_runs/$(date +%Y%m%d_%H%M%S)}"
mkdir -p "$OUTPUT_DIR"

Phase 3: Execute Pipeline

Run the pipeline on gold standard inputs:

python <eval_script> \
  --output "$OUTPUT_DIR" \
  --verbose \
  [--cases CASES] \
  [--steps STEPS] \
  [--model MODEL] \
  [--skip-pipeline] \
  [--skip-eval]

API call estimation:

Pipeline: steps x cases API calls
Evaluation: scored_dimensions x cases judge calls

For quick validation, suggest running on 1-2 cases with 2-3 steps first.
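
As a worked example of the estimate above, the sketch below applies the two formulas exactly as written. `estimate_api_calls` is a hypothetical helper, and it assumes `scored_dimensions` is the total number of dimensions scored per case.

```python
from typing import Dict


def estimate_api_calls(steps: int, cases: int, scored_dimensions: int) -> Dict[str, int]:
    """Estimate API calls using the formulas from this section."""
    pipeline = steps * cases                 # one pipeline call per step per case
    evaluation = scored_dimensions * cases   # one judge call per scored dimension per case
    return {"pipeline": pipeline, "evaluation": evaluation, "total": pipeline + evaluation}


# Quick validation run: 2 cases, 3 steps, 4 scored dimensions.
print(estimate_api_calls(steps=3, cases=2, scored_dimensions=4))
# → {'pipeline': 6, 'evaluation': 8, 'total': 14}
```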

Phase 4: Evaluate with LLM-as-Judge

For each step output, compare against the gold standard using the oracle LLM-as-judge.

Evaluation dimensions (customizable per project):

Dimension	What It Measures
Content Agreement	Do outputs cover the same key points?
Structure Match	Is the organization/format similar?
Detail Accuracy	Are specific claims and data correct?
Completeness	Are all expected elements present?

Each dimension has a weight between 0.0 and 1.0, and the weights for a step sum to 1.0.
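
A step score can then be computed as a weighted sum of its dimension scores. This is a minimal sketch: the `weighted_score` helper, the dimension keys, and the example weights are illustrative assumptions, not values defined by the skill.

```python
import math
from typing import Dict


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-dimension scores (0-100) into one step score."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-6):
        raise ValueError("weights must sum to 1.0")
    return sum(scores[dim] * weights[dim] for dim in weights)


# Hypothetical weights for one step; a real project would load them from its rubric.
weights = {
    "content_agreement": 0.4,
    "structure_match": 0.2,
    "detail_accuracy": 0.3,
    "completeness": 0.1,
}
scores = {
    "content_agreement": 80,
    "structure_match": 90,
    "detail_accuracy": 70,
    "completeness": 60,
}
# 80*0.4 + 90*0.2 + 70*0.3 + 60*0.1, which is about 77
print(round(weighted_score(scores, weights), 1))
```

Validating the weight sum up front catches rubric typos before any judge calls are spent.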

Phase 5: Analyze Results

Read and analyze the evaluation report:

Overall similarity score across all cases and steps
Per-step scores — highlight any below threshold (default: 70/100)
Per-case scores — identify consistently weak test cases
Dimension breakdowns for weak steps
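
The aggregation above might look like the following sketch. It assumes scores are keyed by `(case, step)` and aggregated with a plain unweighted mean; the skill does not specify the aggregation method, and `summarize` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Tuple


def summarize(scores: Dict[Tuple[str, int], float], threshold: float = 70) -> dict:
    """Compute overall, per-step, and per-case means, and flag weak steps."""
    by_step = defaultdict(list)
    by_case = defaultdict(list)
    for (case, step), score in scores.items():
        by_step[step].append(score)
        by_case[case].append(score)
    per_step = {step: mean(values) for step, values in by_step.items()}
    per_case = {case: mean(values) for case, values in by_case.items()}
    return {
        "overall": mean(scores.values()),
        "per_step": per_step,
        "per_case": per_case,
        # Steps whose mean score falls below the threshold need attention.
        "weak_steps": sorted(step for step, avg in per_step.items() if avg < threshold),
    }


example = {
    ("case_1", 1): 90, ("case_1", 2): 60,
    ("case_2", 1): 80, ("case_2", 2): 70,
}
print(summarize(example)["weak_steps"])
```

In this example step 2 averages 65 and is the only step flagged.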

Score interpretation:

Score Range	Assessment	Action
85-100	Excellent	No changes needed
70-84	Good	Minor tuning possible
60-69	Needs improvement	Prompt revision recommended
Below 60	Poor	Prompt likely needs rewrite
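
The score bands can be expressed as a small lookup. The `assess` helper is hypothetical; it assumes fractional scores fall into the band whose lower bound they meet, so 84.5 counts as "Good".

```python
from typing import Tuple


def assess(score: float) -> Tuple[str, str]:
    """Map a 0-100 score to the (assessment, action) pair from the table above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 85:
        return ("Excellent", "No changes needed")
    if score >= 70:
        return ("Good", "Minor tuning possible")
    if score >= 60:
        return ("Needs improvement", "Prompt revision recommended")
    return ("Poor", "Prompt likely needs rewrite")


for value in (92, 70, 65, 40):
    print(value, assess(value))
```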

Phase 6: Recommend Improvements

For each step scoring below threshold:

Read the current prompt template
Read the gold standard output (expected)
Read the pipeline output (actual)
Compare and identify gaps:
- Missing instructions that gold standard captures
- Overly broad instructions causing divergent output
- Format/structure differences
- Specificity gaps

Present actionable suggestions:

### Step N: <step_name> (Score: XX/100)

**Weakest Dimension**: <dimension> (XX/100)

**Gap Analysis**:
- Gold standard includes <X> but prompt doesn't instruct it
- Output format diverges: gold uses <format>, output uses <other>

**Suggested Prompt Changes**:
1. Add instruction: "<specific instruction>"
2. Clarify format: "<format guidance>"
3. Add example: "<example output snippet>"

Output Structure

eval_runs/YYYYMMDD_HHMMSS/
  results/                    # Pipeline outputs
    case_1/
      step_01_<name>.md
      step_02_<name>.md
      ...
    case_2/
      ...
  evaluation/                 # Judge scores
    evaluation_report.json
    evaluation_report.md
    per_step_scores.csv
    per_case_scores.csv

MCP Integration

PAL MCP (Optional)

Tool	When	Purpose
`mcp__pal__thinkdeep`	Low-scoring steps	Deep analysis of why outputs diverge
`mcp__pal__consensus`	Prompt revision	Multi-model validation of proposed changes
`mcp__pal__codereview`	Eval script	Review evaluation pipeline code

Rube MCP (Optional)

Tool	When	Purpose
`mcp__rube__RUBE_REMOTE_WORKBENCH`	Large eval runs	Process results in Python sandbox
`mcp__rube__RUBE_MULTI_EXECUTE_TOOL`	Notifications	Report results to Slack/email

Error Handling

Scenario	Action
No eval script found	Ask user for script path
No gold standards found	Ask user for gold standard directory
API rate limit	Reduce concurrency, add delays
Pipeline step fails	Log error, continue with remaining steps
Judge returns invalid score	Retry once, then flag for manual review
Output directory exists	Append timestamp suffix
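
The "retry once, then flag for manual review" rule for invalid judge scores could be sketched as follows. It assumes the judge returns a raw string that should parse to a number in the 0-100 range; `parse_score`, `judge_with_retry`, and the `call_judge` callable are hypothetical names.

```python
from typing import Callable, Optional


def parse_score(raw: str) -> Optional[float]:
    """Return the score if it is a number in [0, 100], otherwise None."""
    try:
        value = float(raw.strip())
    except ValueError:
        return None
    # The range check also rejects NaN and infinity.
    return value if 0 <= value <= 100 else None


def judge_with_retry(call_judge: Callable[[], str], max_retries: int = 1) -> dict:
    """Call the judge, retrying on invalid scores, then flag for manual review."""
    for _ in range(1 + max_retries):
        score = parse_score(call_judge())
        if score is not None:
            return {"score": score, "needs_manual_review": False}
    return {"score": None, "needs_manual_review": True}


# Stand-in judge: an invalid reply followed by a valid one.
replies = iter(["N/A", "82"])
print(judge_with_retry(lambda: next(replies)))
```

Returning a flag instead of raising keeps the rest of the run going, in line with the "continue with remaining steps" behavior above.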

Guardrails

Always pass --verbose for progress visibility
Warn about API call counts before full runs
Suggest quick validation on subset before full evaluation
Preserve all intermediate outputs for debugging
Never modify gold standard files

Tool Coordination

Bash - Run evaluation scripts
Read - Inspect prompts, gold standards, outputs, reports
Write - Generate reports
Grep - Search for patterns in outputs
PAL MCP - Deep analysis of score gaps