
agentic-eval-first-development

Name: Agentic Eval First Development
Author: vishalsachdev

Architect, execute, and iterate on AI evaluations using the Data-Task-Score framework. Treats evals as the modern, quantifiable version of a PRD. Use when the user asks to "build an eval," "improve model quality," "test an agent workflow," "quantify product intuition," "move beyond vibe checks," "measure AI output," "score LLM responses," "benchmark a prompt," or "set up evaluation infrastructure." Also triggers on phrases like "how do I know if this is working," "is the model getting better," or "eval-driven development."


stars:4

forks:1

updated:2026-05-02 17:28

SKILL.md


Agentic Eval-First Development

Evals are infrastructure, not afterthoughts. Define success criteria before writing prompts or task logic. The eval becomes the spec.

Framework: Data → Task → Scores

Every eval has exactly three components:

Data — Golden dataset of inputs (the test cases)
Task — The operation being evaluated (LLM call, agent workflow, MCP pipeline)
Scores — Categorical rubric that maps outputs to normalized 0–1 values
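The three components can be sketched as a small data structure. This is a minimal illustration, not the skill's own code: the names `Score`, `Eval`, `run_eval`, and `mean_score` are hypothetical, and the toy task and scorer stand in for a real model call and rubric.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Score:
    grade: str      # categorical label such as "A", "B", or "C"
    value: float    # normalized to the 0-1 range
    rationale: str  # written explanation for the grade


@dataclass(frozen=True)
class Eval:
    data: list[str]                      # golden dataset inputs
    task: Callable[[str], str]           # operation under evaluation
    scorer: Callable[[str, str], Score]  # (input, output) -> Score


def run_eval(ev: Eval) -> list[Score]:
    """Run the task on every input and score each output."""
    return [ev.scorer(item, ev.task(item)) for item in ev.data]


def mean_score(scores: list[Score]) -> float:
    """Average the normalized values; an empty run scores 0.0."""
    return sum(s.value for s in scores) / len(scores) if scores else 0.0


# Toy stand-ins (assumptions) so the sketch runs without a model call.
def shout(text: str) -> str:
    return text.upper()


def exact_upper(item: str, output: str) -> Score:
    if output == item.upper():
        return Score("A", 1.0, "output matches the uppercased input")
    return Score("C", 0.0, "output differs from the uppercased input")


demo = Eval(data=["hello", "eval first"], task=shout, scorer=exact_upper)
```

Calling `mean_score(run_eval(demo))` gives one number per run, which is what later steps compare across prompts and models.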

Step 1: Define the PRD (Data & Scores)

Build the Golden Dataset

Collect or generate 10–20 representative inputs covering the full range of expected usage.

Use a high-reasoning model to autogenerate diverse test cases if manual examples are unavailable
Intentionally include inputs expected to fail — these map current model limitations
Store as JSON or JSONL for reproducibility. See references/golden-dataset-template.md for the format
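A golden dataset stored as JSONL might be read and written as below. The actual schema lives in references/golden-dataset-template.md, which is not reproduced here, so the field names `id`, `input`, and `expect_failure` are illustrative assumptions only.

```python
import json


def to_jsonl(cases: list[dict]) -> str:
    """Serialize one test case per line."""
    return "".join(json.dumps(c, ensure_ascii=False) + "\n" for c in cases)


def from_jsonl(text: str) -> list[dict]:
    """Parse JSONL text, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Hypothetical cases; field names are assumptions, not the template's schema.
golden = [
    {"id": "case-001", "input": "Summarize this refund policy.", "expect_failure": False},
    {"id": "case-002", "input": "Summarize an empty document.", "expect_failure": True},
]
```

Writing `to_jsonl(golden)` to a file under version control keeps each run reproducible, and flagging expected failures lets a later run show whether a known limitation has gone away.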

Define the Scoring Rubric

Use categorical scoring (Options A/B/C) rather than asking for raw numbers. Raw numeric scores drift across evaluators and models.

Every score must include a written rationale explaining the grade
All scores normalize to 0–1 for cross-model comparison. See references/scoring-rubrics.md for rubric templates
Run scripts/normalize_scores.py to convert categorical results to normalized values

Example categorical scorer:

A (1.0) — Fully correct, well-structured, addresses all aspects
B (0.5) — Partially correct or missing key elements
C (0.0) — Incorrect, off-topic, or harmful
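The A/B/C mapping above can be expressed as a lookup that also enforces the written rationale. This is a sketch under stated assumptions: it is not the repository's scripts/normalize_scores.py (whose contents are not shown here), and the function name `normalize` and the returned dictionary shape are hypothetical.

```python
# Grade-to-value mapping taken from the example scorer above.
GRADE_VALUES = {"A": 1.0, "B": 0.5, "C": 0.0}


def normalize(grade: str, rationale: str) -> dict:
    """Convert a categorical grade into a normalized 0-1 score record."""
    label = grade.strip().upper()
    if label not in GRADE_VALUES:
        raise ValueError(f"unknown grade: {grade!r}")
    if not rationale.strip():
        raise ValueError("every score needs a written rationale")
    return {
        "grade": label,
        "value": GRADE_VALUES[label],
        "rationale": rationale.strip(),
    }
```

Rejecting unknown grades and empty rationales at this point keeps malformed judge output from silently counting as a score.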

Step 2: Configure the Task (The Harness)

The task is the operation under evaluation.

Tool Pruning — If using MCP, limit available tools to only what's necessary. Models select incorrect tools when overwhelmed with options
System Prompt — Define initial instructions based on success criteria from Step 1 (e.g., "don't ask clarifying questions," "respond in JSON")
Isolation — Each eval run must be independent. No shared state between test cases
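One way to picture tool pruning and isolation together is a harness that builds a fresh context for every test case. Everything named here (`RunContext`, `prune_tools`, `run_case`, the `scratch` field) is a hypothetical illustration; a real MCP client would expose its own way of restricting tools.

```python
from dataclasses import dataclass, field
from typing import Callable

Tool = Callable[[str], str]


@dataclass
class RunContext:
    system_prompt: str
    tools: dict[str, Tool]
    scratch: list[str] = field(default_factory=list)  # per-case state only


def prune_tools(registry: dict[str, Tool], allowed: set[str]) -> dict[str, Tool]:
    """Keep only the tools the task needs; fail loudly on unknown names."""
    missing = allowed - set(registry)
    if missing:
        raise KeyError(f"unknown tools: {sorted(missing)}")
    return {name: registry[name] for name in sorted(allowed)}


def run_case(
    case_input: str,
    system_prompt: str,
    tools: dict[str, Tool],
    task: Callable[[str, RunContext], str],
) -> str:
    """Run one test case in a brand-new context so no state is shared."""
    ctx = RunContext(system_prompt=system_prompt, tools=dict(tools))
    return task(case_input, ctx)
```

Because `run_case` constructs the context itself, a task cannot carry scratch state from one test case into the next.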

Step 3: Execute the Flywheel Loop

┌─────────────────────────────────────────┐
│  OFFLINE: Run golden dataset locally    │
│  → Identify gaps → Refine prompt/tools  │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│  ONLINE: Deploy scorers to production   │
│  → Monitor real user logs               │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│  CLOSE THE LOOP: Production failures    │
│  → Add back to golden dataset           │
└─────────────────────────────────────────┘

Offline iteration: Run experiments locally against the golden dataset. Iterate on prompts, tools, and model selection until scores stabilize
Online validation: Deploy the scorers to production and run them against real user logs
Close the loop: When the online score (e.g., 0.3) falls below the offline score (e.g., 0.75), identify the production failures and add them to the golden dataset
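The close-the-loop step can be sketched as a function that folds failing production inputs back into the golden dataset. The log record shape (`input`, `score`), the `fail_below` threshold of 0.5, and the `source` tag are assumptions made for illustration.

```python
from statistics import mean


def close_the_loop(
    golden: list[dict],
    logs: list[dict],
    offline_mean: float,
    fail_below: float = 0.5,  # assumed cutoff for "a production failure"
) -> list[dict]:
    """Return the golden dataset extended with new failing production inputs."""
    if not logs:
        return list(golden)
    online_mean = mean(record["score"] for record in logs)
    if online_mean >= offline_mean:
        return list(golden)  # no gap to close
    known = {case["input"] for case in golden}
    additions = []
    for record in logs:
        if record["score"] < fail_below and record["input"] not in known:
            known.add(record["input"])  # de-duplicate repeated failures
            additions.append({"input": record["input"], "source": "production"})
    return list(golden) + additions
```

The function returns a new list rather than mutating `golden`, so the previous dataset version stays available for comparison.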

When to Stop Iterating

Offline scores plateau across 3+ consecutive runs
The gap between online and offline scores is below 0.1
Remaining failures are edge cases outside the product's scope
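The first two stopping criteria are numeric and can be checked mechanically; the third needs human judgment. The sketch below assumes that "plateau" means the last three offline means stay within a small tolerance (0.02 is an arbitrary choice) and that both numeric criteria must hold together. The function names are hypothetical.

```python
def has_plateaued(offline_means: list[float], runs: int = 3, tolerance: float = 0.02) -> bool:
    """True when the last `runs` offline means differ by at most `tolerance`."""
    if len(offline_means) < runs:
        return False
    recent = offline_means[-runs:]
    return max(recent) - min(recent) <= tolerance


def should_stop(offline_means: list[float], online_mean: float, max_gap: float = 0.1) -> bool:
    """Check the plateau and the online/offline gap; scope review stays manual."""
    if not has_plateaued(offline_means):
        return False
    return abs(offline_means[-1] - online_mean) < max_gap
```

For example, `should_stop([0.6, 0.74, 0.75, 0.75], online_mean=0.70)` holds under these assumptions, while the same offline history with an online mean of 0.3 does not.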

Troubleshooting

Symptom	Likely Cause	Fix
All scores are 0	Scorer criteria too strict	Do a manual vibe check — if you disagree with the scorer, update the rubric
Scores are always 1.0	Scorer criteria too lenient or test cases too easy	Add adversarial inputs and tighten rubric
Online ≪ Offline	Golden dataset doesn't represent real usage	Add production failure cases to dataset
Scores vary wildly between runs	Non-deterministic task or scorer	Pin temperature=0, add more specific rubric criteria
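Three of the symptoms in the table can be detected from the scores alone. The sketch below assumes each run is a list of normalized scores and treats a spread of more than 0.2 between run means as "varying wildly"; that limit and the function name `diagnose` are assumptions, not part of the skill.

```python
from statistics import mean


def diagnose(runs: list[list[float]], spread_limit: float = 0.2) -> list[str]:
    """Flag the score patterns listed in the troubleshooting table."""
    scores = [s for run in runs for s in run]
    if not scores:
        return ["no scores recorded"]
    findings = []
    if all(s == 0.0 for s in scores):
        findings.append("all zero: scorer criteria may be too strict")
    if all(s == 1.0 for s in scores):
        findings.append("all 1.0: scorer too lenient or test cases too easy")
    run_means = [mean(run) for run in runs if run]
    if max(run_means) - min(run_means) > spread_limit:
        findings.append("unstable: task or scorer is non-deterministic")
    return findings
```

An empty result means none of these patterns appeared; it does not by itself show that the rubric is well calibrated.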

Key Principle

The eval is the durable asset. Models change, prompts evolve, agent frameworks get replaced — but a well-built eval survives all of it. When switching models, re-run the eval; don't re-do the product thinking.

related-skills.json

Same repository

llm-council.md

from "vishalsachdev/claude-code-skills"

Convene a 3-model council (Claude + GPT via codex CLI + Gemini CLI) on a high-stakes decision. Forces cross-critique between members and surfaces where they actually disagree, breaking Claude's default agreeableness. Use when the user asks to "convene a council", "get a second opinion", "ask GPT and Gemini", "what would other models say", or has an architecture / strategy / hiring / pricing decision where being wrong is expensive. Skip for factual questions, code with one right answer, or anything premortem-shaped (route to premortem skill instead).

2026-05-02, stars:4

codebase-singularity.md

from "vishalsachdev/claude-code-skills"

Apply the codebase singularity approach: reliable codebase understanding and change with a repeatable workflow, guardrails, and verification gates. Use for repo work (feature, bugfix, refactor, migration) when you need high trust, minimal diffs, and explicit validation and exit criteria.

2026-05-02, stars:4

start-session.md

from "vishalsachdev/claude-code-skills"

Use when user says "let's get started", "where are we", or at beginning of a session. Reads project context from CLAUDE.md, checks git status and recent commits, and provides orientation for the session. Works across all repo types (code, research, mixed).

2026-05-02, stars:4

tweet-series-extractor.md

from "vishalsachdev/claude-code-skills"

Extract tweet series from X/Twitter profiles with full content, links, and engagement metrics. Use when: (1) User provides a tweet URL and wants similar tweets by the same author, (2) User asks to "extract tweets", "collect tweet series", or "scrape tweets", (3) User wants to analyze a user's recurring tweet format (e.g., daily updates, weekly roundups), (4) User needs tweet content with embedded links preserved for analysis. Requires Chrome browser automation tools (mcp__claude-in-chrome__*).

2026-05-02, stars:4

wrap-up-session.md

from "vishalsachdev/claude-code-skills"

Use when user says "let's wrap up", "close shop", "done for today", or wants to end a session. Handles session wrap-up including git operations, documentation updates, roadmap updates, and preparing for next session. Works across all repo types.

2026-05-02, stars:4

premortem.md

from "vishalsachdev/claude-code-skills"

Run a premortem on a plan, launch, product, hire, strategy, or decision — assumes it failed 6 months from now and works backward to find every reason why, then produces a revised plan. Use when the user has a concrete plan or commitment with high cost-of-being-wrong and asks to "premortem", "stress test", "find blind spots", "poke holes", "what could kill this", or "what am I missing". Skip for vague ideas without a plan, simple feedback requests, factual questions, or already-irreversible decisions.

2026-05-02, stars:4

package.json

"author": "vishalsachdev"

"repository": "vishalsachdev/claude-code-skills"


