$pwd:

agentic-eval-first-development

Name: Agentic Eval First Development
Author: vishalsachdev

// Architect, execute, and iterate on AI evaluations using the Data-Task-Score framework. Treats evals as the modern, quantifiable version of a PRD. Use when the user asks to "build an eval," "improve model quality," "test an agent workflow," "quantify product intuition," "move beyond vibe checks," "measure AI output," "score LLM responses," "benchmark a prompt," or "set up evaluation infrastructure." Also triggers on phrases like "how do I know if this is working," "is the model getting better," or "eval-driven development."

$ git log --oneline --stat

stars:4

forks:1

updated: May 2, 2026 at 17:28

File explorer

4 files

SKILL.md

readonly

Agentic Eval-First Development

Evals are infrastructure, not afterthoughts. Define success criteria before writing prompts or task logic. The eval becomes the spec.

Framework: Data → Task → Scores

Every eval has exactly three components:

Data — Golden dataset of inputs (the test cases)
Task — The operation being evaluated (LLM call, agent workflow, MCP pipeline)
Scores — Categorical rubric that maps outputs to normalized 0–1 values
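
The three components can be sketched as a tiny harness. This is only an illustration under stated assumptions: `run_eval`, `EvalResult`, `toy_task`, and `exact_match_scorer` are hypothetical names that do not come from the skill, and the toy task stands in for a real LLM call or agent workflow.

```python
from dataclasses import dataclass

# Hypothetical grade map, mirroring an A/B/C categorical rubric.
GRADE_VALUES = {"A": 1.0, "B": 0.5, "C": 0.0}


@dataclass
class EvalResult:
    case_id: str
    grade: str
    rationale: str
    score: float


def run_eval(data, task, scorer):
    """Run each case through the task, grade the output, normalize the grade."""
    results = []
    for case in data:
        output = task(case["input"])
        grade, rationale = scorer(case, output)
        results.append(EvalResult(case["id"], grade, rationale, GRADE_VALUES[grade]))
    return results


def mean_score(results):
    return sum(r.score for r in results) / len(results)


# Data: a (very small) golden dataset with made-up cases.
data = [
    {"id": "case-1", "input": "2 + 2", "expected": "4"},
    {"id": "case-2", "input": "3 * 3", "expected": "9"},
]


def toy_task(prompt):
    # Task: stand-in for the operation under evaluation; it always answers "4".
    return "4"


def exact_match_scorer(case, output):
    # Scores: a categorical grade plus a written rationale.
    if output == case["expected"]:
        return "A", "Output matches the expected answer exactly."
    return "C", "Output does not match the expected answer."


results = run_eval(data, toy_task, exact_match_scorer)
print(mean_score(results))  # 0.5
```

Keeping the task and scorer as plain callables means either can be swapped (a new model, a stricter rubric) without touching the dataset.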

Step 1: Define the PRD (Data & Scores)

Build the Golden Dataset

Collect or generate 10–20 representative inputs covering the full range of expected usage.

Use a high-reasoning model to autogenerate diverse test cases if manual examples are unavailable
Intentionally include inputs expected to fail — these map current model limitations
Store as JSON or JSONL for reproducibility. See references/golden-dataset-template.md for the format
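
As a rough sketch of JSONL storage, the snippet below writes and reloads a small dataset. The field names (`id`, `input`, `expected_to_fail`) and the sample cases are assumptions made for illustration; the actual record layout is defined in references/golden-dataset-template.md, which is not reproduced here.

```python
import io
import json

# Hypothetical records; one of them is deliberately expected to fail.
cases = [
    {"id": "refund-001", "input": "How do I get a refund?", "expected_to_fail": False},
    {"id": "refund-002", "input": "Refund an order I never placed", "expected_to_fail": True},
]


def dump_jsonl(records, fh):
    """Write one JSON object per line."""
    for record in records:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(fh):
    """Read one JSON object per non-empty line."""
    return [json.loads(line) for line in fh if line.strip()]


# An in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
dump_jsonl(cases, buffer)
buffer.seek(0)
loaded = load_jsonl(buffer)
expected_failures = [c["id"] for c in loaded if c["expected_to_fail"]]
```

One record per line keeps diffs small when production failures are appended later.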

Define the Scoring Rubric

Use categorical scoring (Options A/B/C) rather than asking for raw numbers. Raw numeric scores drift across evaluators and models.

Every score must include a written rationale explaining the grade
All scores normalize to 0–1 for cross-model comparison. See references/scoring-rubrics.md for rubric templates
Run scripts/normalize_scores.py to convert categorical results to normalized values

Example categorical scorer:

A (1.0) — Fully correct, well-structured, addresses all aspects
B (0.5) — Partially correct or missing key elements
C (0.0) — Incorrect, off-topic, or harmful
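
A minimal sketch of the categorical-to-numeric mapping follows. It is not the repository's scripts/normalize_scores.py, whose contents are not shown here; `normalize` and `GRADE_TO_SCORE` are hypothetical names, and rejecting empty rationales is one possible way to enforce the rationale rule.

```python
# Grade values taken from the example scorer: A = 1.0, B = 0.5, C = 0.0.
GRADE_TO_SCORE = {"A": 1.0, "B": 0.5, "C": 0.0}


def normalize(grade, rationale):
    """Map a categorical grade to a 0-1 score, refusing grades without a rationale."""
    if not rationale.strip():
        raise ValueError("every score needs a written rationale")
    key = grade.strip().upper()
    if key not in GRADE_TO_SCORE:
        raise ValueError(f"unknown grade: {grade!r}")
    return GRADE_TO_SCORE[key]


graded = [
    ("A", "Covers every aspect of the question."),
    ("b", "Correct but omits the refund deadline."),
    ("C", "Answers a different question."),
]
scores = [normalize(grade, rationale) for grade, rationale in graded]
average = sum(scores) / len(scores)
print(scores, average)  # [1.0, 0.5, 0.0] 0.5
```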

Step 2: Configure the Task (The Harness)

The task is the operation under evaluation.

Tool Pruning — If using MCP, expose only the tools the task needs. Models are more likely to pick the wrong tool when offered many options
System Prompt — Define initial instructions based on success criteria from Step 1 (e.g., "don't ask clarifying questions," "respond in JSON")
Isolation — Each eval run must be independent. No shared state between test cases
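
The three harness settings above might look like this in code. Everything here is assumed for illustration: the tool names, `TaskConfig`, `build_config`, and `run_case` are hypothetical, and the stub does not call a model or an MCP server.

```python
from dataclasses import dataclass

# Hypothetical tool names standing in for whatever a tool server exposes.
AVAILABLE_TOOLS = {"search_orders", "issue_refund", "send_email", "delete_account"}


@dataclass(frozen=True)
class TaskConfig:
    system_prompt: str
    allowed_tools: frozenset


def build_config(needed_tools):
    """Tool pruning: keep only the tools the task needs, and reject unknown ones."""
    unknown = set(needed_tools) - AVAILABLE_TOOLS
    if unknown:
        raise ValueError(f"unknown tools requested: {sorted(unknown)}")
    return TaskConfig(
        system_prompt="Respond in JSON. Don't ask clarifying questions.",
        allowed_tools=frozenset(needed_tools),
    )


def run_case(config, case_input):
    """Isolation: state is created inside the call, so nothing leaks between cases."""
    state = {"messages": [config.system_prompt, case_input]}
    # A real harness would call the model here; this stub only reports its setup.
    return {"tools": sorted(config.allowed_tools), "message_count": len(state["messages"])}


config = build_config({"search_orders", "issue_refund"})
first = run_case(config, "Where is my order?")
second = run_case(config, "Refund order 42")
```

Making the config immutable (`frozen=True`) is one way to keep a test case from altering the setup seen by the next one.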

Step 3: Execute the Flywheel Loop

┌─────────────────────────────────────────┐
│  OFFLINE: Run golden dataset locally    │
│  → Identify gaps → Refine prompt/tools  │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│  ONLINE: Deploy scorers to production   │
│  → Monitor real user logs               │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│  CLOSE THE LOOP: Production failures    │
│  → Add back to golden dataset           │
└─────────────────────────────────────────┘

Offline iteration — Run experiments locally against the golden dataset. Iterate on prompts, tools, and model selection until scores stabilize
Online validation — Deploy scorers to production and run them against real user logs
Close the loop — When the online score (e.g., 0.3) falls well below the offline score (e.g., 0.75), identify the failing production cases and add them to the golden dataset
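
Closing the loop can be sketched as a merge of failing production cases into the golden dataset. The log format, the function names, and the 0.5 failure threshold are assumptions for illustration only; the source does not define what counts as a production failure.

```python
def score_gap(offline_mean, online_mean):
    """Positive when production scores lag behind offline scores."""
    return offline_mean - online_mean


def close_the_loop(golden, production_logs, failure_threshold=0.5):
    """Return the golden dataset extended with failing production inputs not yet covered."""
    known_inputs = {case["input"] for case in golden}
    additions = []
    for log in production_logs:
        if log["score"] < failure_threshold and log["input"] not in known_inputs:
            additions.append({"input": log["input"], "source": "production"})
            known_inputs.add(log["input"])
    return golden + additions


golden = [{"input": "How do I get a refund?"}]
production_logs = [
    {"input": "How do I get a refund?", "score": 0.0},       # already in the dataset
    {"input": "refund pls, order number lost", "score": 0.0},  # new failure
    {"input": "Where is my parcel?", "score": 1.0},           # success, not added
]

gap = score_gap(0.75, 0.3)
updated = close_the_loop(golden, production_logs)
```

Deduplicating on the input keeps repeated production failures from inflating the dataset with copies of the same case.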

When to Stop Iterating

Offline scores plateau across 3+ consecutive runs
Online/offline gap is < 0.1
Remaining failures are edge cases outside the product's scope
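
The stopping criteria could be checked along these lines. Two things are assumed here rather than taken from the source: that all three criteria must hold together, and that a "plateau" means the last three offline means stay within 0.02 of each other. The scope judgment stays manual and is passed in as a count.

```python
def has_plateaued(offline_scores, window=3, tolerance=0.02):
    """True when the last `window` run means differ by at most `tolerance`."""
    if len(offline_scores) < window:
        return False
    recent = offline_scores[-window:]
    return max(recent) - min(recent) <= tolerance


def should_stop(offline_scores, online_score, max_gap=0.1, in_scope_failures=0):
    """Combine the three criteria: plateau, small online/offline gap, no in-scope failures."""
    if not has_plateaued(offline_scores):
        return False
    gap_ok = abs(offline_scores[-1] - online_score) < max_gap
    return gap_ok and in_scope_failures == 0


history = [0.6, 0.8, 0.81, 0.8]  # mean offline score per run (made-up values)
print(should_stop(history, online_score=0.75))  # True
print(should_stop(history, online_score=0.3))   # False
```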

Troubleshooting

Symptom	Likely Cause	Fix
All scores are 0	Scorer criteria too strict	Do a manual vibe check — if you disagree with the scorer, update the rubric
Scores are always 1.0	Scorer criteria too lenient or test cases too easy	Add adversarial inputs and tighten rubric
Online ≪ Offline	Golden dataset doesn't represent real usage	Add production failure cases to dataset
Scores vary wildly between runs	Non-deterministic task or scorer	Pin temperature=0 (this reduces run-to-run variance but may not remove it entirely) and add more specific rubric criteria
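
The symptoms in the table can be detected mechanically before anyone reads individual outputs. The sketch below is one possible check; `diagnose` is a hypothetical helper, and the thresholds (a 0.1 gap, a 0.15 spread between run means) are assumptions, not values from the skill.

```python
from statistics import mean, pstdev


def diagnose(runs, online_mean=None, max_gap=0.1, max_spread=0.15):
    """Return troubleshooting hints for a list of runs, each a list of 0-1 scores."""
    flat = [score for run in runs for score in run]
    if not flat:
        return []
    hints = []
    if all(score == 0.0 for score in flat):
        hints.append("all zero: scorer may be too strict, check the rubric by hand")
    if all(score == 1.0 for score in flat):
        hints.append("all one: scorer too lenient or cases too easy")
    run_means = [mean(run) for run in runs if run]
    if len(run_means) > 1 and pstdev(run_means) > max_spread:
        hints.append("unstable: task or scorer is non-deterministic")
    if online_mean is not None and mean(run_means) - online_mean > max_gap:
        hints.append("online below offline: dataset may not reflect real usage")
    return hints


print(diagnose([[0.0, 0.0], [0.0]]))
print(diagnose([[1.0, 0.5], [0.0, 0.5]]))
print(diagnose([[1.0, 0.5]], online_mean=0.3))
```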

Key Principle

The eval is the durable asset. Models change, prompts evolve, agent frameworks get replaced — but a well-built eval survives all of it. When switching models, re-run the eval; don't re-do the product thinking.

related-skills.json

same repository

llm-council.md

from "vishalsachdev/claude-code-skills"

Convene a 3-model council (Claude + GPT via codex CLI + Gemini CLI) on a high-stakes decision. Forces cross-critique between members and surfaces where they actually disagree, breaking Claude's default agreeableness. Use when the user asks to "convene a council", "get a second opinion", "ask GPT and Gemini", "what would other models say", or has an architecture / strategy / hiring / pricing decision where being wrong is expensive. Skip for factual questions, code with one right answer, or anything premortem-shaped (route to premortem skill instead).

2026-05-02, 4 stars

codebase-singularity.md

from "vishalsachdev/claude-code-skills"

Apply the codebase singularity approach: reliable codebase understanding and change with a repeatable workflow, guardrails, and verification gates. Use for repo work (feature, bugfix, refactor, migration) when you need high trust, minimal diffs, and explicit validation and exit criteria.

2026-05-02, 4 stars

start-session.md

from "vishalsachdev/claude-code-skills"

Use when user says "let's get started", "where are we", or at beginning of a session. Reads project context from CLAUDE.md, checks git status and recent commits, and provides orientation for the session. Works across all repo types (code, research, mixed).

2026-05-02, 4 stars

tweet-series-extractor.md

from "vishalsachdev/claude-code-skills"

Extract tweet series from X/Twitter profiles with full content, links, and engagement metrics. Use when: (1) User provides a tweet URL and wants similar tweets by the same author, (2) User asks to "extract tweets", "collect tweet series", or "scrape tweets", (3) User wants to analyze a user's recurring tweet format (e.g., daily updates, weekly roundups), (4) User needs tweet content with embedded links preserved for analysis. Requires Chrome browser automation tools (mcp__claude-in-chrome__*).

2026-05-02, 4 stars

wrap-up-session.md

from "vishalsachdev/claude-code-skills"

Use when user says "let's wrap up", "close shop", "done for today", or wants to end a session. Handles session wrap-up including git operations, documentation updates, roadmap updates, and preparing for next session. Works across all repo types.

2026-05-02, 4 stars

premortem.md

from "vishalsachdev/claude-code-skills"

Run a premortem on a plan, launch, product, hire, strategy, or decision — assumes it failed 6 months from now and works backward to find every reason why, then produces a revised plan. Use when the user has a concrete plan or commitment with high cost-of-being-wrong and asks to "premortem", "stress test", "find blind spots", "poke holes", "what could kill this", or "what am I missing". Skip for vague ideas without a plan, simple feedback requests, factual questions, or already-irreversible decisions.

2026-05-02, 4 stars

package.json

"author": "vishalsachdev"

"repository": "vishalsachdev/claude-code-skills"

$ useful --forSOC

Software Quality Assurance Analysts and Testers; Computer and Mathematical Occupations; 15-1253; L4
