agentic-eval

Updated May 13, 2026 at 10:26

Adversarial evaluation patterns for any output or decision where you want a devil's advocate challenge before committing. Use when: you want to challenge a decision, get an adversarial second opinion, implement self-critique loops, apply rubric-based scoring, run evaluator-optimizer pipelines, judge-and-refine cycles, verify a downstream agent can consume an upstream artifact, or check inter-stage artifact consistency. Triggers on "devil's advocate", "challenge this decision", "挑戰這個決策", "扮演反對者", "adversarial review", "self-critique", "質疑這個方案". Do NOT use for standard code review (use code-security-review skill), general refactoring (use refactor skill), or security audits.


Source

forgivesam168

forgivesam168/ai-dev-workflow



name	agentic-eval
description	Adversarial evaluation patterns for any output or decision where you want a devil's advocate challenge before committing. Use when: you want to challenge a decision, get an adversarial second opinion, implement self-critique loops, apply rubric-based scoring, run evaluator-optimizer pipelines, judge-and-refine cycles, verify a downstream agent can consume an upstream artifact, or check inter-stage artifact consistency. Triggers on "devil's advocate", "challenge this decision", "挑戰這個決策", "扮演反對者", "adversarial review", "self-critique", "質疑這個方案". Do NOT use for standard code review (use code-security-review skill), general refactoring (use refactor skill), or security audits.
allowed-tools	agent sql
compatibility	Tier 1 (self-critique/rubric) works in all environments (VS Code, CLI, cloud). Tier 2 (external critic via subagent) requires the agent tool: in CLI use task tool with agent_type; in VS Code use runSubagent / agent tool with explicit model selection for multi-model diversity. Rubber-duck (CLI) requires RUBBER_DUCK_AGENT experimental flag. Tier 3 (tracked iterations via sql) is CLI-only; uses per-session DB.

Agentic Evaluation Patterns

Iterative critique-and-refine loops for quality-critical agent outputs.

Generate → Evaluate → Critique → Refine → Output
    ↑                              │
    └──────────────────────────────┘
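The loop in the diagram can be sketched in Python as follows. This is a minimal illustration, not the skill's own implementation: `refine_loop` and its three callables are hypothetical names, and the evaluator is assumed to return `None` when it finds no weakness.

```python
from typing import Callable, Optional, Tuple


def refine_loop(
    generate: Callable[[], str],
    evaluate: Callable[[str], Optional[str]],
    refine: Callable[[str, str], str],
    max_iterations: int = 3,
) -> Tuple[str, int, bool]:
    """Run generate -> evaluate -> refine until no critique remains.

    `evaluate` returns a critique string, or None when it finds no weakness
    (an assumption of this sketch). Returns (output, evaluations_run, accepted).
    """
    output = generate()
    for iteration in range(1, max_iterations + 1):
        critique = evaluate(output)
        if critique is None:
            return output, iteration, True
        # Only refine when another evaluation pass is still allowed.
        if iteration < max_iterations:
            output = refine(output, critique)
    return output, max_iterations, False
```

The iteration cap is passed in rather than hard-coded because the ceiling depends on whether the loop is a stage gate or a general-purpose refinement.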

Rubber Duck Spirit

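Challenge your own output with opposing arguments until no rebuttal can be found.

"If you can't find a counter-argument, the output is ready. If you can, fix the weakness first."

Core behavior pattern:

After generating an output, immediately ask: "What argument could refute this decision?"
If a counter-argument is found → revise the output, then ask again
If no strong counter-argument is found → the output is robust enough; move on
Do not stop challenging just because it "feels right" or "has already been revised many times"

agentic-eval is the toolset that gives this spirit structure. It can be applied to any decision or output, without waiting for a particular stage.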

Pre-Decision Mode (Doubt Before Deciding)

Use before committing to a high-risk decision: architecture choices, irreversible operations (DB schema, API contracts, security design), or any decision flagged as High Risk by brainstorm-agent.

Mandatory trigger: a High Risk decision, an architecture choice, or an irreversible operation → Pre-Decision Mode must be run; do not go straight to implementation.

Five-Step Protocol

Step	Action
CLAIM	State the decision in one sentence: "I will [X] because [Y]"
EXTRACT	List all assumptions the decision depends on (≥ 3)
DOUBT	Apply Sequential Specialist Lens — each perspective states ≥ 1 challenge or confirmation
RECONCILE	Self-score 0–10; describe what "10" looks like; edit until target score reached
STOP	Score ≥ 8 → proceed; Score < 8 after 1 re-score → escalate to user with DOUBT findings
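The STOP rule in the table can be expressed as a small decision function. The sketch below is illustrative only: the `PreDecision` record and the return labels (`proceed`, `rescore`, `escalate`, `incomplete`) are hypothetical names, and it assumes `scores` holds the initial self-score followed by any re-scores.

```python
from dataclasses import dataclass, field
from typing import List

PROCEED_THRESHOLD = 8  # from the STOP row: a score of 8 or more proceeds


@dataclass
class PreDecision:
    claim: str                      # CLAIM: "I will [X] because [Y]"
    assumptions: List[str]          # EXTRACT: at least 3 expected
    doubts: List[str] = field(default_factory=list)   # DOUBT findings
    scores: List[int] = field(default_factory=list)   # RECONCILE scores, in order


def stop_decision(record: PreDecision) -> str:
    """Apply the STOP step; the labels returned are this sketch's own."""
    if len(record.assumptions) < 3 or not record.scores:
        return "incomplete"
    if record.scores[-1] >= PROCEED_THRESHOLD:
        return "proceed"
    if len(record.scores) >= 2:
        return "escalate"  # still below 8 after one re-score
    return "rescore"
```

On `escalate`, the DOUBT findings stored in the record are what would be handed to the user.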

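Sequential Specialist Lens (DOUBT Step)

Review in order; each perspective raises at least one challenge or confirmation:

Security: does this decision open an attack surface? Authentication, authorization, or injection risks?
Performance: does it introduce N+1 queries, unbounded data sets, or blocking calls?
Architecture: does it violate layering boundaries, DDD aggregate rules, or create circular dependencies?
Maintainability: will a new team member understand it six months from now? Are the tests maintainable?
Operability: are the error messages meaningful? Are the logs observable? Can the deployment be rolled back?

RECONCILE Self-Score Format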

Score: X/10
What 10 looks like: [one sentence]
Gap to close: [specific action needed]
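If the self-score is produced as text in the three-line format above, it can be parsed before applying the STOP rule. The following is a sketch under that assumption; `parse_reconcile` is a hypothetical helper, and it expects the three lines to appear consecutively in this order.

```python
import re
from typing import Dict, Optional

# Matches the three consecutive lines of the self-score format.
_RECONCILE = re.compile(
    r"Score:\s*(?P<score>\d+(?:\.\d+)?)/10[ \t]*\n"
    r"What 10 looks like:\s*(?P<ideal>.+)\n"
    r"Gap to close:\s*(?P<gap>.+)"
)


def parse_reconcile(text: str) -> Optional[Dict[str, object]]:
    """Return score, ideal, and gap, or None if the format is not found."""
    match = _RECONCILE.search(text)
    if match is None:
        return None
    return {
        "score": float(match["score"]),
        "ideal": match["ideal"].strip(),
        "gap": match["gap"].strip(),
    }
```

Returning `None` instead of raising lets the caller decide whether to request a re-score in the expected format.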

When to Use

Pre-Decision (before a high-risk decision):

Architecture choices, DB schema changes, API contract design, security design
Any decision flagged as High Risk by brainstorm-agent
Irreversible operations where rollback is costly or impossible

Post-Output (after the output is complete):

Output is high-stakes or externally visible (code shipped to prod, published reports)
An objective rubric or measurable criteria can be defined upfront
A second perspective is needed beyond linting or syntax checks
You want to verify that a downstream agent can consume an upstream artifact

When NOT to Use

Trivial or low-risk exploratory drafts
Standard code quality/security review → use code-review or code-security-review
Refactoring existing code → use refactor
Simple formatting or lint issues

Integration with 6-Stage AI Development Workflow

This skill can be used at any stage where adversarial challenge adds value. Evaluation strictness is risk-adaptive — tied to the brainstorm-agent's Low/Med/High classification. It complements the repo's shared execution-guardrails layer: guardrails shape behavior before and during work, while agentic-eval provides on-demand adversarial challenge.

The table below shows common usage patterns (advisory, not mandatory):

Stage Transition	Common Use	Tier	Risk Level
Before Spec → handoff	spec-agent self-challenge	1	All
Spec → Plan	plan-agent cross-validation	1	Med / High
After Plan complete	architect-agent external critique	2 (Optional)	Med / High
Before code-reviewer	coder-agent self-eval	1	All
Review completeness check	architect-agent meta-review	1 (Optional)	High only

Architect-Agent Trigger Conditions

When invoked by architect-agent for cross-stage quality arbitration:

Invocation Point	Rubric	Tier	Risk Threshold
After `spec-agent`	`#spec`	1	High risk only
After `plan-agent`	`#plan`	1; Tier 2 if ≥2 FAIL	Med / High
After `code-reviewer`	`#review` meta-rubric	1	High risk only

FAIL path: All PASS → REVIEW ACCEPTED. 1 FAIL → targeted re-review. ≥2 FAIL or Financial Precision FAIL → route to coder then full re-review. Max 2 iterations; unresolved → escalate to human.
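The FAIL path reads naturally as a routing function. This sketch makes two assumptions that the text does not state explicitly: `iteration` is the 1-based number of the review pass just completed, and rubric results arrive as a mapping from dimension name to `"PASS"` or `"FAIL"`. The function name and return labels are hypothetical.

```python
from typing import Dict


def route_review(results: Dict[str, str], iteration: int, max_iterations: int = 2) -> str:
    """Choose the next step from per-dimension PASS/FAIL verdicts."""
    failed = [dim for dim, verdict in results.items() if verdict == "FAIL"]
    if not failed:
        return "REVIEW_ACCEPTED"
    if iteration >= max_iterations:
        return "ESCALATE_TO_HUMAN"  # unresolved after the iteration ceiling
    if len(failed) >= 2 or "Financial Precision" in failed:
        return "ROUTE_TO_CODER_THEN_FULL_REREVIEW"
    return "TARGETED_REREVIEW"
```

Checking the iteration ceiling before the severity of the failure means that no branch can start a third pass.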

Subagent Status Protocol

Status	Meaning
`DONE`	Completed; no blocking concerns
`DONE_WITH_CONCERNS`	Completed; 1+ concerns flagged in output
`NEEDS_CONTEXT`	Blocked; awaiting input artifact
`BLOCKED`	Hard blocker; requires human decision
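An orchestrating script might model these four statuses as an enum so that unknown values fail loudly. This is a sketch only; `SubagentStatus`, `parse_status`, and `is_complete` are hypothetical names, and treating both `DONE` variants as "complete" follows the Meaning column above.

```python
from enum import Enum


class SubagentStatus(Enum):
    DONE = "DONE"
    DONE_WITH_CONCERNS = "DONE_WITH_CONCERNS"
    NEEDS_CONTEXT = "NEEDS_CONTEXT"
    BLOCKED = "BLOCKED"


def parse_status(text: str) -> SubagentStatus:
    """Map a reported status string to the enum; unknown values raise ValueError."""
    return SubagentStatus(text.strip().upper())


def is_complete(status: SubagentStatus) -> bool:
    """True for the two statuses whose meaning is 'Completed'."""
    return status in (SubagentStatus.DONE, SubagentStatus.DONE_WITH_CONCERNS)
```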

Guardrail-aware scoring:

Use rubric dimensions such as Assumption Management, Simplicity / Overengineering Risk, Diff Scope Hygiene, and Verification Strength where they materially affect handoff quality.
Prefer stage-specific wording rather than a generic global checklist; the same guardrail should look different in brainstorm, plan, code, and review.

Context isolation rules (apply everywhere):

Pass summaries and key excerpts — NEVER full document blobs
For specs/plans: AC list + constraints, not full text (≤800 words to any critic)
For code: diff + test result summary, not full file content
Never include brainstorm conversation history in critic context
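One way to enforce the word budget from the rules above is to assemble the critic context from a summary plus excerpts and stop adding excerpts once the limit would be exceeded. The helper below is a hypothetical sketch; counting words by whitespace splitting is an assumption, since the text does not define how the 800 words are measured.

```python
from typing import List


def build_critic_context(summary: str, excerpts: List[str], max_words: int = 800) -> str:
    """Join a summary and as many excerpts as fit within the word budget."""
    used = len(summary.split())
    if used > max_words:
        raise ValueError("summary alone exceeds the word budget; shorten it first")
    parts = [summary.strip()]
    for excerpt in excerpts:
        cost = len(excerpt.split())
        if used + cost > max_words:
            break  # stop at the first excerpt that would not fit
        parts.append(excerpt.strip())
        used += cost
    return "\n\n".join(parts)
```

Ordering the excerpts by importance before calling the helper keeps the most relevant ones inside the budget.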

See stage-rubrics.md for per-stage rubric dimensions and adversarial prompt templates.

3-Tier Evaluation Framework

Choose the tier that matches your environment and quality target:

Tier	Requires	CLI	VS Code	Best For
1 — Self-Critique	Nothing	✅	✅	Quick rubric check
2 — External Critic	`agent` tool	✅ rubber-duck or general-purpose	✅ `runSubagent` + model selection	Adversarial second opinion
3 — Tracked Evaluation	`agent` + `sql`	✅	❌ (no sql tool)	Multi-iteration history
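The Requires column can be turned into a simple availability check. In this sketch the tool names `agent` and `sql` come from the table; the function name and the idea of passing the environment's tools as a set are assumptions for illustration.

```python
from typing import List, Set


def available_tiers(tools: Set[str]) -> List[int]:
    """Return the tiers usable with the given tools, per the Requires column."""
    tiers = [1]  # Tier 1 requires nothing
    if "agent" in tools:
        tiers.append(2)
        if "sql" in tools:
            tiers.append(3)  # Tier 3 needs both agent and sql
    return tiers
```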

Iteration Ceilings (NFR-05)

Two distinct ceilings apply depending on context:

Stage-Transition Gating Loops (max 2 iterations)

Applies when agentic-eval is used as a stage gate — i.e., at any of:

Spec → Plan handoff (spec-agent self-eval)
Plan → Code handoff (plan-agent cross-eval)
Code → Review handoff (coder-agent self-eval)
Review completeness check (architect-agent meta-review)

Ceiling: max 2 iterations. After 2 iterations at a stage gate without all dimensions resolving to PASS:

Terminate the loop immediately
Surface all unresolved FAIL dimensions to the human using this structured format
Do NOT initiate a third iteration autonomously

Structured escalation message (required format):

## ⛔ Stage Gate Blocked — Human Decision Required
Unresolved dimensions after 2 iterations:
- [DIMENSION]: [one-sentence FAIL reason] → [specific line/excerpt as evidence]

Root cause type (pick one per dimension):
  UPSTREAM_GAP   — problem is in the input artifact (spec/brainstorm), not this artifact
  CONTENT_GAP    — this artifact is missing required content
  AMBIGUITY      — rubric cannot resolve without more context from user

Recommended actions:
  A. Fix upstream artifact (UPSTREAM_GAP) → re-run this agent after fix
  B. Add missing content to this artifact (CONTENT_GAP) → targeted edit
  C. Override this FAIL with explicit user approval and stated rationale (last resort)
  D. Stop this work package — revisit requirements

Rationale: Stage gates must not become infinite loops. 2 iterations provide one self-correction opportunity. If unresolved after 2, human judgment is required — but the human needs structured information to decide, not a raw list of failures.
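The per-dimension lines of the escalation message could be generated from structured data so that no FAIL is surfaced without evidence or a root cause type. This is a sketch, not the skill's own tooling: `UnresolvedDimension` and `render_unresolved` are hypothetical, and appending the root cause in parentheses is this sketch's layout choice.

```python
from dataclasses import dataclass
from typing import List

ROOT_CAUSE_TYPES = ("UPSTREAM_GAP", "CONTENT_GAP", "AMBIGUITY")


@dataclass
class UnresolvedDimension:
    name: str
    reason: str      # one-sentence FAIL reason
    evidence: str    # specific line or excerpt
    root_cause: str  # one of ROOT_CAUSE_TYPES


def render_unresolved(dimensions: List[UnresolvedDimension]) -> str:
    """Render the 'Unresolved dimensions' section; reject incomplete entries."""
    lines = ["Unresolved dimensions after 2 iterations:"]
    for dim in dimensions:
        if dim.root_cause not in ROOT_CAUSE_TYPES:
            raise ValueError(f"unknown root cause type: {dim.root_cause}")
        if not dim.evidence.strip():
            raise ValueError(f"{dim.name} has no evidence")
        lines.append(
            f"- {dim.name}: {dim.reason} → {dim.evidence} (root cause: {dim.root_cause})"
        )
    return "\n".join(lines)
```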

General-Purpose Refinement Loops (max 3–5 iterations)

Applies when agentic-eval is used for non-gate iterative improvement — e.g.:

Draft document improvement
Report quality refinement
Iterative clarification outside stage transitions

Ceiling: max 3–5 iterations (Tier 1 self-critique default; adjust based on quality target).

Summary Table

Loop Type	Context	Max Iterations	Unresolved Action
Stage-transition gate	spec/plan/code/review handoff	2	Terminate; surface to human
General-purpose refinement	draft improvement, non-gate loops	3–5	Stop; output best available

Tier 1: Self-Critique / Rubric (Always Available)

Define a rubric, score the output, refine on FAIL dimensions. Max 3–5 iterations (general-purpose) or max 2 iterations (stage gate — see NFR-05 above).

Steps:

Define criteria and a score threshold (e.g., 0.8 on a normalized 0–1 scale, which corresponds to 4 on a 5-point scale)
Adopt adversarial persona before scoring (required for stage gates):
"You are a skeptical external auditor. Your job is to find problems, not confirm quality. For every dimension you score PASS, state the strongest counter-argument you can. If no counter-argument exists, write 'no counter-argument found'."
This combats Anchoring and RLHF sycophancy bias — LLMs systematically self-score high (7–8/10). The persona switch must precede scoring.
Score output against each dimension using structured JSON
If any dimension FAIL → refine with targeted feedback; evidence must cite a specific line, hunk, or excerpt — not a general statement
Stop when all PASS or max iterations reached
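The scoring and evidence rules in the steps above might be checked in code as follows. The JSON shape (a list of objects with `dimension`, `verdict`, and `evidence` keys) is an assumption of this sketch; the skill's own structured format may differ.

```python
import json
from typing import Dict, List


def failed_dimensions(raw_json: str) -> List[Dict[str, str]]:
    """Return the FAIL entries from a scorer's JSON reply.

    A FAIL without cited evidence is rejected, since feedback must point
    at a specific line, hunk, or excerpt.
    """
    entries = json.loads(raw_json)
    failures = []
    for entry in entries:
        if entry["verdict"] == "FAIL":
            if not entry.get("evidence", "").strip():
                raise ValueError(f"FAIL on {entry['dimension']} has no cited evidence")
            failures.append(entry)
    return failures
```

An empty result means all dimensions passed and the loop can stop; otherwise only the returned dimensions are targeted in the next refinement.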

See Python implementation patterns for code examples.

Tier 2: External Critic via Subagent (Optional — Requires `agent` tool)

Delegate evaluation to a separate subagent using a different model perspective for adversarial critique.

Steps:

Generate output (code, report, design)
Extract a focused excerpt or summary — do NOT pass entire blobs; pass key sections, diff, or rubric context
Call critic subagent:
- CLI — rubber-duck available (RUBBER_DUCK_AGENT flag on): use task(agent_type: "rubber-duck")
- CLI — no rubber-duck: use task(agent_type: "general-purpose") with adversarial system prompt
- VS Code: use agent tool (runSubagent); explicitly request a different model in the prompt: "Use [GPT-4o / Claude Haiku] to critique this..." — model diversity is the key mechanism
Parse critique → identify weak points
Goodhart's Law mitigation: Supplement rubric scoring with at least one user-intent question:

"Ignore the rubric. In one paragraph: what problem does this artifact appear to be solving? Does this match what the user originally requested?"
This surfaces drift that structured rubric dimensions cannot catch (model optimizes rubric format at generation time).
Refine output targeting identified weaknesses
Repeat up to 3 iterations
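The prompt handed to the critic in the steps above can be assembled so that the user-intent question is always included. This sketch only builds the prompt text; how it is passed to a subagent depends on the environment and is not shown. `build_critic_prompt` and the wording of the adversarial instruction are hypothetical.

```python
from typing import List

# The user-intent question quoted in the Goodhart's Law mitigation step.
INTENT_QUESTION = (
    "Ignore the rubric. In one paragraph: what problem does this artifact "
    "appear to be solving? Does this match what the user originally requested?"
)


def build_critic_prompt(excerpt: str, rubric: List[str]) -> str:
    """Combine an adversarial instruction, rubric, excerpt, and intent question."""
    rubric_lines = "\n".join(f"- {dimension}" for dimension in rubric)
    return (
        "You are an adversarial critic. Find weaknesses; do not confirm quality.\n\n"
        f"Rubric dimensions:\n{rubric_lines}\n\n"
        f"Artifact excerpt:\n{excerpt.strip()}\n\n"
        f"{INTENT_QUESTION}"
    )
```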

⚠️ rubber-duck availability (CLI only): Requires RUBBER_DUCK_AGENT experimental flag. Enable via /experimental on or enabledFeatureFlags.RUBBER_DUCK_AGENT: true in ~/.copilot/config.json. In VS Code or without the flag, use general-purpose subagent with adversarial prompt — model diversity can be achieved by requesting a specific model different from the main conversation model.

Context efficiency rules (apply in all environments):

For code: pass file path + diff excerpt, not full file content
For reports: pass the relevant paragraph + evaluation rubric
For designs/plans: pass key decisions + constraints, not full spec

⚠️ Critic reliability: A Tier 2 critic is also an LLM and can hallucinate. Treat critic positive validations ("X is correct", "function Y exists") as non-authoritative. Only critic-identified gaps and failures require action. Correctness is confirmed by running code, querying APIs, or consulting authoritative sources — not by asking a critic.

See Evaluation Workflow for step-by-step guide (CLI + VS Code).

Tier 3: Tracked Evaluation (Optional — Requires `task` + `sql` tools)

Persist iteration scores to the per-session database for convergence analysis and audit trail. Use only when tracking history across 3+ iterations is meaningful.

⚠️ Use sql with database: "session" (the per-session DB). Do NOT use database: "session_store" — it is read-only.

Minimal schema:

CREATE TABLE IF NOT EXISTS eval_iterations (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    iteration INTEGER NOT NULL,
    dimension TEXT    NOT NULL,
    score     REAL    NOT NULL,
    critique  TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

Store only: iteration number, dimension, score, brief critique summary. Never store full output blobs.
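To show how the schema supports convergence analysis, the sketch below uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for the per-session DB. That substitution is an assumption for illustration; in the CLI the statements would go through the `sql` tool instead. The helper names are hypothetical.

```python
import sqlite3
from typing import List, Tuple

SCHEMA = """
CREATE TABLE IF NOT EXISTS eval_iterations (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    iteration INTEGER NOT NULL,
    dimension TEXT    NOT NULL,
    score     REAL    NOT NULL,
    critique  TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
"""


def record_score(conn: sqlite3.Connection, iteration: int, dimension: str,
                 score: float, critique: str) -> None:
    """Insert one dimension score with a brief critique summary."""
    conn.execute(
        "INSERT INTO eval_iterations (iteration, dimension, score, critique) "
        "VALUES (?, ?, ?, ?)",
        (iteration, dimension, score, critique),
    )


def average_by_iteration(conn: sqlite3.Connection) -> List[Tuple[int, float]]:
    """Return (iteration, mean score) pairs in iteration order."""
    rows = conn.execute(
        "SELECT iteration, AVG(score) FROM eval_iterations "
        "GROUP BY iteration ORDER BY iteration"
    )
    return rows.fetchall()
```

A rising mean across iterations suggests the loop is converging; a flat or falling mean is a signal to stop.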

See CLI evaluation workflow for full query patterns.

Evaluation Strategies

Strategy	When to Use
Rubric-Based	Clear weighted dimensions exist (accuracy, clarity, completeness)
Outcome-Based	Evaluate against expected end result
LLM-as-Judge	Compare two candidate outputs head-to-head
Adversarial	Find edge cases, failure modes, security/logic flaws
Test-Driven	Code: write tests first, iterate until all pass

Quick Start Checklist

## Evaluation Setup
- [ ] Choose tier (1 / 2 / 3) based on environment and stakes
- [ ] Define rubric dimensions and score threshold
- [ ] Set max iterations (general-purpose default: 3, max: 5; stage gates: 2)

## Execution
- [ ] Generate initial output
- [ ] Score against rubric (Tier 1) or delegate to critic (Tier 2+)
- [ ] Refine targeting failed dimensions only
- [ ] Check convergence: stop if score not improving

## Safety
- [ ] Enforce iteration limit to prevent infinite loops
- [ ] Pass summaries/excerpts to critics — not full blobs
- [ ] Handle parse failures gracefully (fallback to full re-score)
- [ ] Log final score and iteration count
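The convergence and iteration-limit items in the checklist can be combined into one stopping test. This is a minimal sketch; `should_stop` and the `min_gain` parameter are hypothetical, and "not improving" is interpreted here as the latest score gaining no more than `min_gain` over the previous one.

```python
from typing import List


def should_stop(scores: List[float], threshold: float, max_iterations: int,
                min_gain: float = 0.0) -> bool:
    """Decide whether to end the loop, given one overall score per iteration."""
    if not scores:
        return False
    if scores[-1] >= threshold:
        return True   # quality target reached
    if len(scores) >= max_iterations:
        return True   # iteration limit reached
    if len(scores) >= 2 and scores[-1] - scores[-2] <= min_gain:
        return True   # score is not improving
    return False
```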

References

Python Application Patterns — Tier 1 code examples (self-critique, evaluator-optimizer, code reflection)
CLI Evaluation Workflow — Tier 2 & 3 step-by-step guide (task subagent, rubber-duck, SQL tracking)
