agent-acceptance-gate
| name | agent-acceptance-gate |
| description | Use when a multi-agent round needs a pre-merge gate, pre-commit check, verification before push, or a PASS/FAIL decision after reconciliation. |
The last gate before merging multi-agent output. The reconciler describes what each agent did; the acceptance-gate decides whether the round is mergeable.
It runs the success_criteria declared in .coord/plan.yml,
aggregates risks, optionally audits prose, checks budget, and
produces a single PASS / FAIL / RETRY verdict per task and overall.
The gate replaces ad-hoc inline shell verification with one structured
verdict. Concrete differences observed in production (measured-benefits.md
R5 + Phase D):
| Inline shell verification | This skill (R5 measured) |
|---|---|
| Main session runs 5+ grep / diff commands, interprets each output, drafts a manual PASS/FAIL: ~10k tokens of shell-debugging in context | Subagent runs all checks and returns a ~2k-token structured report: ~5× token saving in main session |
| Easy to forget a check ("did I grep for banned phrases this time?") | Preset YAML encodes the full check set, so no checks are missed |
| Catches one drift case; misses cross-language echoes | cross_document_link_text_parity v0.2.3 caught 9+ real drift bugs the human audit missed |
2026-05-13 session: Codex generated a paragraph claiming "DeepSeek-R2
reaches 94.2% on GPQA Diamond", a pure fabrication. Manual grep on the
diff missed it because the claim was internally consistent. Only a
third-party reviewer agent doing a live `gh api` search caught it.
The fact-check-frontier-models preset was built directly from this
incident. Running the preset would have caught the fabrication
automatically by hitting arxiv + GitHub for each (model, benchmark, %)
triple in the diff.
F14: the preset existed by 2026-05-14, but a subsequent operator skipped it because "the task feels simple". The Phase D dogfood (49 files × 3 locales, 50k tokens inline vs 16k with gate = ~3× saving) proved the gate catches drift the operator misses. The cost of running the gate is much smaller than the cost of one missed drift.
These triggers are mechanical: if any fire, invoke the matching preset before commit. Skipping is F14 territory.
| Trigger condition | MUST invoke |
|---|---|
| Diff touches ≥ 2 locale variants of same file stem (e.g., 06-x.md + 06-x.en.md + 06-x.zh-Hans.md) | --preset=multi-locale-mirror-sync --stem=<stem> |
| Diff adds entries to any catalog file (project listings, framework comparisons) | --preset=catalog-entry-add --catalog-file=<path> |
| Diff touches a "frontier model" claim (model name within 3 lines of a benchmark %, e.g., "GPT-5.5 reaches 94% GPQA") | --preset=fact-check-frontier-models --file=<path> --models=<csv> |
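The first trigger can be checked mechanically from a list of changed paths. The following is a minimal sketch, not the preset's actual implementation: it assumes locale variants follow the `<stem>.md` / `<stem>.<locale>.md` naming shown in the table's example, and the function names `locale_stems` and `mirror_sync_triggers` are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def locale_stems(paths):
    """Map each Markdown file stem to the set of locale suffixes seen.

    Assumption: variants are named <stem>.md or <stem>.<locale>.md,
    e.g. 06-x.md, 06-x.en.md, 06-x.zh-Hans.md. Stems that themselves
    contain dots are not handled by this sketch.
    """
    groups = defaultdict(set)
    for raw in paths:
        p = PurePosixPath(raw)
        if p.suffix != ".md":
            continue
        parts = p.name[: -len(".md")].split(".")
        stem = str(p.with_name(parts[0]))
        locale = parts[1] if len(parts) > 1 else "(default)"
        groups[stem].add(locale)
    return groups


def mirror_sync_triggers(paths):
    """Return the stems touched in two or more locale variants."""
    groups = locale_stems(paths)
    return sorted(stem for stem, locales in groups.items() if len(locales) >= 2)
```

Feeding it the output of `git diff --name-only` (one path per line) yields the stems for which `--preset=multi-locale-mirror-sync --stem=<stem>` would apply.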
## Acceptance gate rule (enforced)
Before any commit that:
- touches ≥ 2 locale variants of the same file stem → invoke
  `Skill("agent-collab-workspace:agent-acceptance-gate",
  args="--preset=multi-locale-mirror-sync --stem=<stem>")`
- adds catalog entries → invoke `--preset=catalog-entry-add ...`
- includes a frontier model + benchmark % claim → invoke
  `--preset=fact-check-frontier-models ...`

Skip only when the trigger does not apply (single-file diff, pure
mechanical rename, etc.). NEVER skip when the trigger fires:
this is the F14 anti-pattern that shipped the DeepSeek-R2
fabrication.

When the preset FAILS:
1. Read the FAIL reasons in the gate report.
2. Either fix-and-rerun (cheap drift) or re-delegate the task with
   tighter constraints (systemic drift).
3. Do NOT override the FAIL by hand. The whole point of the gate is
   that the operator's judgment failed earlier; adding more operator
   judgment on top defeats it.
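The frontier-model trigger ("model name within 3 lines of a benchmark %") is also mechanical. A rough sketch of that proximity test follows, assuming any `NN%` figure counts as a benchmark percentage and that the caller supplies the model names to look for; `frontier_claim_lines` is a hypothetical helper, not part of the preset.

```python
import re

# Assumption: any number followed by "%" is treated as a benchmark figure.
PERCENT = re.compile(r"\d+(?:\.\d+)?\s*%")


def frontier_claim_lines(text, models, window=3):
    """Return 1-based line numbers where a model name appears within
    `window` lines of a percentage."""
    lines = text.splitlines()
    pct_lines = [i for i, line in enumerate(lines) if PERCENT.search(line)]
    hits = []
    for i, line in enumerate(lines):
        lowered = line.lower()
        if any(model.lower() in lowered for model in models):
            if any(abs(i - j) <= window for j in pct_lines):
                hits.append(i + 1)
    return hits
```

A non-empty result means the `fact-check-frontier-models` trigger fires for that file. The sketch only detects candidate claims; it does not verify them.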
Trigger phrases:
Not for:
- Reconciling or comparing agent outputs (use `agent-output-reconciler`).
- Work without a `.coord/plan.yml`: just run
  pytest and call it a day.

Inputs:
- `.coord/plan.yml`: round, tasks, `success_criteria` per
  task, `budget` if declared, and `context_policy` if declared.
- `.ai/<agent>_log_<NNN>_<slug>.txt.result.json`: token usage,
  risks, files_changed. Filename uses double extension by design
  (codex-delegate appends `.result.json` to the log path; see
  `examples/codex_log_001_*.txt.result.json.sample`).
- `.coord/reconciliation_<NNN>.md`: reconciler's verdict; if
  reconciler said "retry", gate respects that.
- `.coord/context_<NNN>.md` (optional, if `agent-context-budget`
  ran): declared per-task context budgets for this round. Gate
  checks that actual summary sizes / log tail counts honored the
  declared budgets. Absence is OK (the round may not have used
  the context-budget skill); presence means the gate enforces it.
- If any `result.json` shows `files_changed`
  matching `*.md`, `*.tex`, `*.docx`, the gate optionally invokes
  the `academic-writing-skills` banned-word + claim-evidence audit.
  (Skipped silently if `academic-writing-skills` is not installed.)

Instead of hand-writing acceptance criteria every time, invoke a preset that codifies a tested set of checks:
| Preset | When to use | Invocation |
|---|---|---|
| multi-locale-mirror-sync | After zh-TW → en + zh-Hans mirror sync (or any N-locale fan-out) | agent-acceptance-gate --preset=multi-locale-mirror-sync --stem=stages/06-foo --required-terms="A,B,C" |
| catalog-entry-add | Added entries to a catalog file | agent-acceptance-gate --preset=catalog-entry-add --catalog-file=resources/foo.md --new-entries="org/repo1,org/repo2" |
| fact-check-frontier-models | Touched a frontier-model table | agent-acceptance-gate --preset=fact-check-frontier-models --file=stages/06-foo.md --models="GPT-5.5,Claude Opus 4.7" |
Preset YAMLs live in presets/. Each codifies failure modes
observed in real dogfooding (see docs/observed-failure-modes.md).
Don't hand-write checks if a preset covers your case.
Mandatory preset trigger conditions (gate auto-suggests if you don't specify):
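- Diff touches ≥ 2 locale variants of the same file stem → `multi-locale-mirror-sync` MUST be invoked.
- Diff adds entries to a file under `resources/` matching catalog
  shape → `catalog-entry-add` MUST be invoked.
- Diff touches a frontier-model claim → `fact-check-frontier-models` MUST be
  invoked.

The presets above are not "consider running"; they are must run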
before commit when their trigger condition matches. The F14
incident (docs/observed-failure-modes.md) is the cautionary tale:
a Phase D run on awesome-agentic-ai-zh touched 49 files across 3
locale variants (textbook multi-locale-mirror-sync trigger),
skipped the preset, used a code-reviewer subagent instead, and
shipped a drift the preset's cross_document_link_text_parity check
was designed to catch.
Why skipping is tempting: when the work feels "just a title sweep, surely nothing can go wrong", the operator short-circuits the mandatory invocation. The presets exist precisely because that intuition is wrong: drift hides in the "obvious" cases.
Anti-patterns:
- Substituting a `code-reviewer` subagent for the preset.
  The subagent is a reasonable backup but cannot substitute for
  the codified checks, which encode observed failure modes. Run
  both, not one-instead-of-the-other.

Enforcement options (in increasing strength):
- `CLAUDE.md` rule.
  Held in 5 of 6 Phase B rounds, failed in Phase D.
- … `docs/observed-failure-modes.md` F14.
- … `.coord/acceptance_<NNN>.md` file exists in the commit.
  Use this in repos where the cost of drift is high (curriculum,
  public docs, anything user-facing).

Read `.coord/plan.yml`. Default to highest round. User can
override.
For each task with `agent: codex` or `agent: gemini`:
- `success_criteria` is either a runnable command or a
  checkable assertion.
- Runnable command (e.g., `pytest tests/auth`, `mypy src/`, `npm test`):
  execute it.
- Assertion (e.g., "src/auth/interfaces.py exists and defines AuthProvider ABC"):
  check it (file existence, grep, etc.).

For `agent: claude` tasks:
- `success_criteria` is usually "explicit YES/NO verdict in chat".

Read `.coord/reconciliation_<NNN>.md`. If the reconciler's
"Recommended action" section says anything other than "merge all"
(e.g., "retry T2", "escalate", "manual merge needed"), the gate's
verdict is at most CONDITIONAL PASS: the user is responsible
for resolving the reconciler's flagged issue.
Concat all `risks` arrays from `result.json` files. Group by
severity (gate makes its own call if not labeled: failed test =
high; legacy compat concern noted but not breaking = medium).
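A minimal sketch of the aggregation step, under a loud assumption: the source does not specify the shape of a `risks` entry, so this treats each entry as either a plain string or a dict with hypothetical `severity` and `note` keys. Unlabeled entries land in an `unlabeled` bucket for the gate to grade itself.

```python
from collections import defaultdict


def group_risks(results):
    """Concatenate the `risks` arrays of already-parsed result.json
    dicts and group them by severity.

    Assumption: a risk is a string, or a dict with optional
    "severity" and "note" keys (hypothetical field names).
    """
    grouped = defaultdict(list)
    for result in results:
        for risk in result.get("risks", []):
            if isinstance(risk, dict):
                severity = risk.get("severity", "unlabeled")
                note = risk.get("note", "")
            else:
                severity, note = "unlabeled", str(risk)
            grouped[severity].append(note)
    return dict(grouped)
```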
If any task changed `*.md` / `*.tex` files AND
`academic-writing-skills` is installed:
- Run the banned-word audit.
- Run the claim-evidence audit if `.paper/claims.yml` exists in
  the project.

If `academic-writing-skills` isn't installed, skip silently: don't
fail the gate just because prose audit isn't available.
If `.coord/plan.yml` declared a `budget.tokens`:
- Sum the `tokens` field across all `result.json` files for this round
  (if present; some delegate wrappers don't write tokens, so handle
  missing values gracefully).

If no budget is declared, skip: don't invent one.
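The budget step could look roughly like the following. This is a sketch only: it assumes each parsed `result.json` carries `tokens` as a top-level integer, and `budget_check` is a hypothetical name.

```python
def budget_check(declared_tokens, results):
    """Sum `tokens` across parsed result.json dicts and compare the
    total against the declared budget.

    `declared_tokens` is None when plan.yml declares no budget, in
    which case the check is skipped rather than inventing a limit.
    """
    if declared_tokens is None:
        return {"status": "skipped", "used": 0, "missing": 0}
    used, missing = 0, 0
    for result in results:
        tokens = result.get("tokens")
        if isinstance(tokens, int) and not isinstance(tokens, bool):
            used += tokens
        else:
            # Some delegate wrappers don't write tokens; count, don't crash.
            missing += 1
    status = "ok" if used <= declared_tokens else "exceeded"
    return {"status": status, "used": used, "missing": missing}
```

Reporting the `missing` count alongside the total keeps an "ok" status honest when some wrappers did not record usage.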
If `.coord/plan.yml` declares `context_policy` OR
`.coord/context_<NNN>.md` exists, enforce both:

From `context_policy` (plan-wide defaults):
- Result summaries must respect `result_summary_word_budget` (default 250 words).
- `.coord/memory.yml` entries must be promoted facts: decisions, open
  questions, artifact pointers, or session outcomes. Long analysis in
  memory is a context violation.

From `context_<NNN>.md` (per-task overrides, if present):
- … `task_packet_token_budget` declarations.
- Each `<task-id>` packet (the actual `.ai/<agent>_task_<NNN>_<slug>.md`)
  did not exceed declared budget (rough char/word count, no need for
  exact tokenizer; flag at >120% of declared).
- `result_summary_word_budget` per-task (overrides plan-wide
  default if specified).
- `raw_logs_inline: path-only` was honored: any log file
  pasted inline in reconciliation / acceptance is a violation.

Debate caps (if `.coord/debate_*.md` files exist and are linked from `plan.yml`):
- … `plan.yml` declares `debate_rounds: N` override).

Violations make the verdict at most CONDITIONAL PASS. If the violation hides acceptance evidence, mark FAIL and require a bounded summary rewrite.
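The packet-size rule ("rough char/word count, flag at >120% of declared") can be sketched as follows. The choice of a whitespace word count as the size estimate is an assumption, and both function names are hypothetical.

```python
def rough_size(text):
    """Rough size estimate: whitespace-separated word count.

    Assumption: a word count is an acceptable stand-in for tokens,
    since the check explicitly needs no exact tokenizer.
    """
    return len(text.split())


def packet_violation(packet_text, declared_budget, tolerance=1.2):
    """True when the rough size exceeds 120% of the declared budget."""
    return rough_size(packet_text) > declared_budget * tolerance
```

The tolerance absorbs the imprecision of the estimate, so only clear overruns are flagged.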
Compare `git diff --name-only` against each task's declared `files_in_scope`
(from `.coord/plan.yml` task entries).
For each modified file F in the diff:
- F matches some task's `files_in_scope` → in-scope ✅
- Otherwise, report: `Scope violation: <file> not in any task's files_in_scope; agent went outside brief`

This is the file-level enforcement for the W1 work boundary discipline. Without this check, a brief that says "files in scope: [a, b, c]" is just guidance: the agent may still touch [d, e, f] and only manual diff review would catch it.
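The scope comparison above can be sketched in a few lines. This assumes `files_in_scope` entries are literal paths or shell-style globs (the source does not say which), and that `changed_files` is the output of `git diff --name-only` split into lines; `scope_violations` is a hypothetical name.

```python
from fnmatch import fnmatch


def scope_violations(changed_files, tasks):
    """Return one violation message per changed file that matches no
    task's files_in_scope.

    Assumption: files_in_scope holds literal paths or fnmatch-style
    patterns; each task is a dict parsed from plan.yml.
    """
    allowed = [pat for task in tasks for pat in task.get("files_in_scope", [])]
    return [
        f"Scope violation: {path} not in any task's files_in_scope; "
        "agent went outside brief"
        for path in changed_files
        if not any(fnmatch(path, pat) for pat in allowed)
    ]
```

An empty list means every modified file falls inside some task's declared scope.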
What this check does NOT verify (be honest):
- Whether the agent emitted the
  `Confirmed scope:` echo block before editing. `git diff --name-only` only tells you which
  files got modified; it can't reconstruct whether the echo was the
  agent's first action. If you need echo verification, add an
  optional secondary check that greps the agent's `result.md` for the
  `Confirmed scope:` sentinel string.
- Sub-file scope (e.g., the brief said to edit `foo.md` section §3 but the agent also edited §1). Use
  finer-grained acceptance criteria (per-section grep / line-range
  check) for that.

F11 + F12 specific catches:
- Diff touches `resources/style-guide*.md` AND the change replaces
  a literal term in a contrast table → FAIL (F11 violation)
- Diff adds a line matching `^>?\s*(Attribution|Source|Credits|Citation)s?:\s*` that wasn't requested in brief → FAIL
  (F12 violation)

| Condition | Verdict |
(F12 violation)| Condition | Verdict |
|---|---|
| All success_criteria PASS, no risks, reconciler says merge, prose audit clean, budget ok | ✅ PASS |
| All success_criteria PASS but reconciler flagged something | ⚠ CONDITIONAL PASS: user resolves reconciler's issue, then re-run gate |
| Context contract violated but evidence is still checkable | ⚠ CONDITIONAL PASS: rewrite bounded summaries before next round |
| Any success_criterion FAIL | ❌ FAIL: list which task / criterion |
| Risks include unresolved blockers | ❌ FAIL: must address before merge |
| Budget exceeded | ❌ FAIL: over budget (user explicitly OK can override by editing plan.yml) |
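Read literally, the table reduces to a small decision function. The sketch below is one possible encoding, not the skill's actual logic: the parameter names are hypothetical, and cases the table does not cover (non-blocking risks, prose audit findings) are deliberately left out.

```python
def overall_verdict(criteria_pass, has_blockers, budget_ok,
                    reconciler_merge_all, context_ok):
    """Apply the verdict table: any FAIL row wins over CONDITIONAL
    PASS rows, which in turn win over PASS."""
    if not criteria_pass or has_blockers or not budget_ok:
        return "FAIL"
    if not reconciler_merge_all or not context_ok:
        return "CONDITIONAL PASS"
    return "PASS"
```

Checking the FAIL rows first keeps a reconciler flag or context violation from masking a failed criterion.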
`.coord/acceptance_<NNN>.md`

# Acceptance gate - round 1
**Verdict:** ⚠ CONDITIONAL PASS
**Run:** 2026-04-28T11:50:00Z
**Tasks gated:** 4
**Reconciliation report:** .coord/reconciliation_001.md
## Per-task results
### T1 - codex - extract-interfaces
- ✅ "src/auth/interfaces.py exists and defines AuthProvider ABC": file present, grep matches.
- ✅ "no other source files modified": git diff scope confirmed.
### T2 - codex - refactor-providers
- ❌ "pytest tests/auth/test_providers.py passes": test_legacy_compat FAILED.
- ✅ "no imports of src.auth.legacy from other modules": grep clean.
### T3 - gemini - review-doc-coverage
- ✅ "every public symbol in src/auth has a docstring": gemini's report confirms.
- ✅ "report flags any docstring still mentioning the legacy class": 12 flagged.
### T4 - claude - design-review
- ✅ "explicit YES/NO verdict + rationale in chat": said YES with conditional concerns.
## Risks
- **High:** test_legacy_compat failing in T2. Means backwards compat
is broken under the refactor.
## Budget
Declared: 200,000 tokens. Used: 142,000. ✅ Under budget.
## Reconciler verdict
The reconciler flagged a cross-agent contradiction: T2's fallback
status conflicts with T4's "design is sound" verdict. The reconciler
recommended either retrying T2 with a deprecation shim, or accepting
the breaking change.
## Decision
⚠ **CONDITIONAL PASS.** Don't merge T2 in its current state.
**To unblock:**
Path 1 - Retry T2 with deprecation shim:
1. Edit `.coord/plan.yml` round 1, T2: add to constraints
"preserve test_legacy_compat backward compatibility via shim".
2. Re-run T2 (`bash .claude/skills/codex-delegate/scripts/run_codex.sh ...`).
3. Re-run reconciler + gate.
Path 2 - Accept breaking change:
1. Update `tests/auth/test_legacy_compat.py` to reflect new
architecture.
2. Re-run pytest manually to confirm green.
3. Manually mark this gate PASS by replacing the verdict above
   with ✅ PASS + your override rationale.
T1, T3, T4 are individually mergeable; only T2 is blocked.
[agent-acceptance-gate]
Round: 1
Verdict: ⚠ CONDITIONAL PASS
Tasks gated: 4 (3 pass, 1 fail)
Risks: 1 high (test_legacy_compat)
Budget: 142k / 200k tokens, under
Report: .coord/acceptance_001.md
Don't push yet. Resolve T2 (see report for two paths).
Then re-invoke this skill.
If `.coord/plan.yml` doesn't
declare any success_criteria for a task, that's a planning bug: flag it, don't
invent assertions.

When: ≥ 5 success_criteria checks to run, OR ≥ 4 `result.json`
files to aggregate, OR prose audit on ≥ 3 changed `.md` files.
Why: The acceptance gate naturally pulls a lot of data into the main session (test outputs, all result.json files, reconciliation report, optional banned-word audits). Delegating the mechanical checks to a subagent and keeping only the structured verdict in main session cuts context cost roughly 3-5×.
Pattern:
Spawn `code-reviewer` subagent with:
- Read .coord/plan.yml + .coord/reconciliation_<NNN>.md + every
.ai/*.result.json + .coord/context_<NNN>.md (if present)
- For each success_criterion in plan.yml: execute the command
or check the assertion (file existence, grep, etc.)
- Aggregate risks; sum tokens against budget
- Verify context contract (summary word budgets, raw-log-paths-only,
memory promotion rules, debate caps)
- Return: structured verdict
{ verdict: PASS/CONDITIONAL_PASS/FAIL,
per_task: [...], risks: [...], budget_used: N,
context_violations: [...], next_actions: [...] }
Main session reads the structured verdict and writes
.coord/acceptance_<NNN>.md from it without re-reading the result.json /
test output files.
This makes the gate auditable AND keeps the gating session itself under the context_policy main_session_token_budget.
Every agent boundary is a commit boundary (see global rule:
`~/.claude/CLAUDE.md`, "Commit Discipline for Multi-Agent Work"). This
makes multi-agent work auditable (commit log = agent log) and enables
surgical rollback via `git revert <hash>` of just one agent's commit.
Specific to this skill: this gate IS the final pre-merge commit check. It reads the per-agent commits between the round's plan commit and HEAD, verifies they collectively satisfy success_criteria, and writes its PASS/FAIL verdict to .coord/acceptance_<NNN>.md as a final commit. Only after PASS does the round get merged to main.