skeptic-agent
Define and run skeptic exit criteria for non-trivial tasks — independent verification agent with inverted incentive to find gaps
| name | skeptic-agent |
| description | Define and run skeptic exit criteria for non-trivial tasks — independent verification agent with inverted incentive to find gaps |
Proactively activate when:
- The user invokes `/skeptic` or asks for skeptic verification

Before starting the task, ask:
"This looks non-trivial. Want to define skeptic exit criteria? A separate agent will independently verify these when you think you're done. What does 'actually done' look like for this task?"
If the user declines, proceed without. If they define criteria, save them to specs/exit-criteria.md in the workspace.
# specs/exit-criteria.md
## Task: [task name]
### Criterion A: [name]
What to verify: [natural language description]
Command to run (if applicable): [exact command]
What PASS looks like: [expected output/state]
What FAIL looks like: [common proxy substitutions to watch for]
### Criterion B: [name]
...
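The template above is regular enough to be read by a script. The following is a minimal sketch of a parser for it, assuming the file follows the template exactly (one `### Criterion X: name` heading per criterion, one field per line). The names `Criterion`, `FIELD_LABELS`, and `parse_exit_criteria` are hypothetical and not part of the skill.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Criterion:
    """One '### Criterion X: name' block from the exit-criteria file."""
    label: str
    name: str
    fields: dict = field(default_factory=dict)


# Field labels copied from the template; the parser itself is a hypothetical helper.
FIELD_LABELS = (
    "What to verify",
    "Command to run (if applicable)",
    "What PASS looks like",
    "What FAIL looks like",
)

HEADING = re.compile(r"^### Criterion (\w+): (.+)$")


def parse_exit_criteria(text: str) -> list:
    """Return the criteria found in the text, in document order."""
    criteria = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADING.match(line)
        if match:
            # A new criterion heading starts a new block.
            current = Criterion(label=match.group(1), name=match.group(2))
            criteria.append(current)
            continue
        if current is None:
            continue  # ignore the title and task lines before the first criterion
        for label in FIELD_LABELS:
            prefix = label + ":"
            if line.startswith(prefix):
                current.fields[label] = line[len(prefix):].strip()
                break
    return criteria
```

Lines that match none of the known field labels are ignored, so free-form notes inside a criterion do not break parsing.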
The coding agent works normally. It does NOT see specs/exit-criteria.md.
It signals readiness by stating "I believe the task is complete" or similar.
When the coder signals completion, spawn or switch to a skeptic session. The skeptic's system instructions:
You are a QA Skeptic. Your job is to FIND GAPS in the implementation.
INVERTED INCENTIVE: You are rewarded for finding missing evidence.
A false PASS is YOUR failure. A thorough FAIL report is success.
Rules:
1. Read specs/exit-criteria.md
2. For each criterion, run the EXACT verification specified
3. Do NOT accept the coder's claims — verify independently
4. Unit tests do NOT satisfy E2E criteria
5. Manual tool calls do NOT satisfy pipeline criteria
6. "Code compiles" does NOT mean "feature works"
7. EVIDENCE MUST BE VIDEO: For UI criteria, ONLY accept .gif, .mp4, .webm, or .mov videos. For interactive terminal criteria, ONLY accept .gif, .mp4, .webm, .mov, or .cast recordings. All evidence must be tied to a commit SHA. Static screenshots FAIL. No exceptions.
Output format per criterion:
CRITERION: [quote verbatim from specs/exit-criteria.md]
EVIDENCE FOUND: [what you actually observed — commands run, output seen]
EVIDENCE MISSING: [what should exist but doesn't]
VERDICT: PASS | FAIL | INSUFFICIENT
REASON: [specific gap or confirmation]
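Rule 7 of the skeptic instructions above can be partly mechanized. The sketch below only checks the file extension and that a commit SHA was supplied. It does not inspect the recording itself, so it is a pre-filter rather than a replacement for the skeptic. The function name and the `"ui"` / `"terminal"` labels are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional

# Extensions taken from rule 7; .cast is accepted for terminal evidence only.
UI_VIDEO_EXTENSIONS = {".gif", ".mp4", ".webm", ".mov"}
TERMINAL_VIDEO_EXTENSIONS = UI_VIDEO_EXTENSIONS | {".cast"}


def evidence_is_acceptable(path: str, kind: str, commit_sha: Optional[str]) -> bool:
    """Return True if the evidence file has an allowed video extension.

    kind is "ui" or "terminal" (hypothetical labels for the two criterion types).
    """
    if not commit_sha:
        return False  # evidence must be tied to a commit SHA
    allowed = {
        "ui": UI_VIDEO_EXTENSIONS,
        "terminal": TERMINAL_VIDEO_EXTENSIONS,
    }.get(kind)
    if allowed is None:
        raise ValueError(f"unknown criterion kind: {kind!r}")
    # Static screenshots (.png, .jpg, ...) fall through to False here.
    return Path(path).suffix.lower() in allowed
```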
If ANY criterion is FAIL or INSUFFICIENT:
## Skeptic Verification Report
Task: [name]
Date: [date]
Coder model: [model]
Skeptic model: [model]
Iterations: [N]
| Criterion | Verdict | Evidence |
|---|---|---|
| A | PASS | [brief evidence] |
| B | FAIL | [what's missing] |
Overall: PASS / FAIL
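The report above could be assembled from per-criterion verdicts along these lines. This is a sketch, not the skill's implementation: `Verdict`, `overall`, and `render_report` are hypothetical names, and treating INSUFFICIENT (or an empty verdict list) as an overall FAIL is an assumption based on the report offering only PASS / FAIL as the overall result.

```python
from dataclasses import dataclass

VALID_VERDICTS = {"PASS", "FAIL", "INSUFFICIENT"}


@dataclass
class Verdict:
    criterion: str  # e.g. "A"
    verdict: str    # one of VALID_VERDICTS
    evidence: str   # brief evidence, or what is missing


def overall(verdicts) -> str:
    """PASS only if there is at least one verdict and every verdict is PASS.

    Assumption: INSUFFICIENT counts as FAIL for the overall result.
    """
    if verdicts and all(v.verdict == "PASS" for v in verdicts):
        return "PASS"
    return "FAIL"


def render_report(task, date, coder_model, skeptic_model, iterations, verdicts) -> str:
    """Render the report in the same layout as the template."""
    lines = [
        "## Skeptic Verification Report",
        f"Task: {task}",
        f"Date: {date}",
        f"Coder model: {coder_model}",
        f"Skeptic model: {skeptic_model}",
        f"Iterations: {iterations}",
        "| Criterion | Verdict | Evidence |",
        "|---|---|---|",
    ]
    for v in verdicts:
        if v.verdict not in VALID_VERDICTS:
            raise ValueError(f"invalid verdict: {v.verdict!r}")
        lines.append(f"| {v.criterion} | {v.verdict} | {v.evidence} |")
    lines.append(f"Overall: {overall(verdicts)}")
    return "\n".join(lines)
```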
- Agent tool with `subagent_type="pair-verifier"` and skeptic system prompt
- `codex exec` with task prompt
- `codex exec` with skeptic prompt + workspace access
- `ao spawn` worker session
- `ao spawn --skeptic` (new flag, spawns with skeptic agentRules)
- `/pair` verifier phase to use skeptic LLM instead of `verifyCommand` bash
- `verifyCommand` first (fast, deterministic), then skeptic for nuanced criteria

RLHF makes agents want to complete tasks. The Skeptic's "task" IS finding gaps. Its RLHF bias pushes it toward thoroughness in criticism, not toward premature approval. This turns RLHF from a bug into a feature.
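The hybrid ordering mentioned above (`verifyCommand` first, then the skeptic) might look like the sketch below. It assumes the verify command is a shell string whose exit code signals success, and that the skeptic is reachable through some callable returning a verdict string. `hybrid_verify` and `run_skeptic` are hypothetical names, and how the skeptic session is actually spawned is left out.

```python
import subprocess
from typing import Callable


def hybrid_verify(verify_command: str, run_skeptic: Callable[[], str], cwd: str = ".") -> str:
    """Run the fast deterministic check first, then the skeptic.

    verify_command: shell command; a non-zero exit code is treated as FAIL.
    run_skeptic: hypothetical callable that runs the skeptic and returns
                 "PASS", "FAIL", or "INSUFFICIENT".
    """
    result = subprocess.run(
        verify_command,
        shell=True,
        cwd=cwd,
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # No point spending a skeptic run on a build that fails the cheap check.
        return "FAIL"
    # The deterministic check passed; the skeptic handles the nuanced criteria.
    return run_skeptic()
```

Running the cheap check first means the skeptic is only invoked on candidates that already pass the deterministic gate, and a passing `verifyCommand` alone is never reported as PASS.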