agent-output-reconciler
// Use when multiple agents have completed a round and the user asks to reconcile outputs, compare Codex and Gemini, synthesize run results, identify conflicts, or decide what should be retried.
| name | agent-output-reconciler |
| description | Use when multiple agents have completed a round and the user asks to reconcile outputs, compare Codex and Gemini, synthesize run results, identify conflicts, or decide what should be retried. |
Cross-agent diff + synthesizer. After agent-task-splitter plans a
round and the delegate skills have run their tasks, this skill reads
everything they produced and reports:
Trigger phrases:
Not for:
- codex-delegate / gemini-delegate.
- agent-acceptance-gate. The reconciler describes; the acceptance gate decides.
- Rounds without a `.coord/plan.yml`: there's nothing to reconcile.

Inputs:

- `.coord/plan.yml`: the round's plan. Use the `round` field to identify which task files are in scope.
- `.ai/<agent>_log_<NNN>_<slug>.txt.result.json`: one per non-Claude task. Schema (from codex-delegate's contract):
```json
{
  "status": "success|fallback|error",
  "delegate": "codex|gemini",
  "model": "...",
  "log_file": "...",
  "output_file": "...",
  "summary": "...",
  "risks": [],
  "files_changed": [],
  "tests_run": [],
  "timestamp_utc": "...",
  // v0.2.2+ optional fields (see §2.6 promise/delivery contract):
  "promised": [],   // artifacts this task surfaces for downstream consumers
  "consumed": []    // artifacts this task used from upstream tasks
}
```
Each entry in `risks` must be a single sentence (≤ 30 words). Long-form analysis belongs in `output_file`, not here: verbose `risks` entries are a context-contract violation and the reconciler should flag them.

- `.ai/<agent>_result_<NNN>_<slug>.md`: agent-written summary (referenced from each task file's Acceptance section).
- `.ai/<agent>_log_<NNN>_<slug>.txt`: full log path; read only the configured tail for error context if `status: error`.
- `agent: claude` tasks: read the Claude session's in-conversation output (whatever Claude said in the chat for that task, treated as the equivalent of `result.md`).
- `.coord/context_<NNN>.md` (optional, if agent-context-budget ran): declared per-task context budgets. The reconciler uses these to flag oversized summaries / unbounded `risks` arrays at per-task granularity, not just against the plan-wide default. Absence is OK.

If the user passes specific paths, use those instead of auto-discovery.
Read .coord/plan.yml. If multiple rounds exist, default to the
highest round number unless the user specifies. Collect the list
of (task_id, agent, slug).
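As a rough sketch of this step, assume `plan.yml` has already been parsed into a dict with a `tasks` list whose entries carry `id`, `agent`, `slug`, and `round` keys. Those key names are hypothetical; the real schema is whatever agent-task-splitter writes.

```python
from typing import Optional


def tasks_in_round(plan: dict, round_no: Optional[int] = None) -> list[tuple[str, str, str]]:
    """Return (task_id, agent, slug) for one round of an already-parsed plan.

    Assumed (hypothetical) shape: {"tasks": [{"id", "agent", "slug", "round"}, ...]}.
    """
    tasks = plan.get("tasks", [])
    if not tasks:
        return []
    if round_no is None:
        # Default to the highest round number when the user did not pick one.
        round_no = max(t["round"] for t in tasks)
    return [(t["id"], t["agent"], t["slug"]) for t in tasks if t["round"] == round_no]
```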
For each task:

- `result.json` + `result_<NNN>_<slug>.md` + log tail only when `status: error` (default max: last 50 lines).

Flag any task where:

- `result.json` is missing → run never completed.
- `status: "error"` → run failed.
- `status: "fallback"` → run completed but in degraded mode.
- `result_<NNN>_<slug>.md` missing → agent didn't write the required summary (acceptance criterion violated).
- `result_<NNN>_<slug>.md` exceeds `context_policy.result_summary_word_budget` (default 250 words) → context contract violated.

When the round's outputs include ≥ 2 locale variants of the same file stem (e.g., `06-memory-rag.md` + `06-memory-rag.en.md` + `06-memory-rag.zh-Hans.md`), verify they actually stayed in lockstep:
- `grep -c '^## '` returns identical count across all locales.
- (From `docs/observed-failure-modes.md`: Gemini merged a 5-column Projects table's rows into a 3-column Tools table.)
- If `plan.yml` declares `required_terms` for this round, each must appear in every locale variant. Missing in one locale = mirror sync dropped content.

If any of these fail, defer to the `multi-locale-mirror-sync` preset of agent-acceptance-gate rather than computing it manually; it's already codified there.
Surface each lockstep failure as HIGH in "Aggregated risks": the gate will demand a re-run before merging.
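To make the lockstep checks concrete, here is an illustrative sketch of the heading-count and required-terms comparisons. It is not the `multi-locale-mirror-sync` preset, which remains the place where these checks are codified; `lockstep_failures` and its arguments are hypothetical names, and the table-shape check is omitted.

```python
import re
from typing import Sequence

H2 = re.compile(r"^## ", re.MULTILINE)  # same pattern as grep -c '^## '


def lockstep_failures(variants: dict[str, str], required_terms: Sequence[str] = ()) -> list[str]:
    """variants maps a locale label to that file's full text."""
    failures = []
    counts = {locale: len(H2.findall(text)) for locale, text in variants.items()}
    if len(set(counts.values())) > 1:
        failures.append(f"HIGH: '## ' heading counts differ across locales: {counts}")
    for term in required_terms:
        missing = sorted(loc for loc, text in variants.items() if term not in text)
        if missing:
            failures.append(f"HIGH: required term {term!r} missing in {missing}")
    return failures
```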
Real multi-agent runs sometimes produce outputs that claim to be about different tasks than they actually were. Common cause: Gemini under inline-prompt mode (no file-system access) hallucinates plausible-but-wrong task slugs and agent assignments because it doesn't have `plan.yml` loaded.

For each agent's `result.md` summary, verify:

- If the agent was assigned T2 per `plan.yml`, but its summary references "T1, T2, T3, T4" as if doing all of them, that's drift: the agent restated the entire plan instead of reporting on its own task.
- If the task slug is `scaffold-provider-core` but the `result.md` mentions `implement-auth-middleware` (a slug not in the plan), the agent invented context.

Surface each drift as a HIGH severity item in the "Aggregated risks" section. Don't quietly ignore it; the gate needs to see it.
If drift is severe (entire summary is about a different task / scenario), recommend re-running that task with file-system access or with the prompt body itself containing all critical context (rather than just paths to read).
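The two drift checks can be approximated with simple string matching. This is a heuristic sketch under stated assumptions: task ids look like `T<number>`, slugs appear in backticks in the summary, and `detect_drift` is a hypothetical helper. Legitimate references to an upstream task are not flagged unless the summary names every task in the plan.

```python
import re

TASK_ID = re.compile(r"\bT\d+\b")
QUOTED_SLUG = re.compile(r"`([a-z0-9]+(?:-[a-z0-9]+)+)`")  # assumes slugs are backticked


def detect_drift(own_id: str, summary: str, plan: dict[str, str]) -> list[str]:
    """plan maps task id -> slug, e.g. {"T2": "scaffold-provider-core"}."""
    findings = []
    plan_ids = set(plan)
    mentioned = set(TASK_ID.findall(summary))
    unknown_ids = sorted(mentioned - plan_ids)
    if unknown_ids:
        findings.append(f"HIGH: {own_id} mentions task ids not in plan: {unknown_ids}")
    if len(plan_ids) > 1 and mentioned >= plan_ids:
        findings.append(f"HIGH: {own_id} references every task in the plan (restated plan?)")
    unknown_slugs = sorted(set(QUOTED_SLUG.findall(summary)) - set(plan.values()))
    if unknown_slugs:
        findings.append(f"HIGH: {own_id} mentions slugs not in plan: {unknown_slugs}")
    return findings
```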
When tasks form a sequential chain (e.g., research-agent → write-agent →
verify-agent), each task's result.json MAY declare a promised field
listing artifacts the downstream consumer is expected to use:
```json
{
  "status": "success",
  "summary": "...",
  "promised": [
    {"kind": "video_url", "count": 5,
     "detail": ["lVdajtNpaGI", "M2Yg1kwPpts", "bJFtcwLSNxI", "abc123", "def456"]},
    {"kind": "concept_mapping", "count": 19}
  ]
}
```
The downstream consumer's result.json MAY declare a matching consumed field:
```json
{
  "consumed": [
    {"kind": "video_url", "count": 3,
     "detail": ["lVdajtNpaGI", "M2Yg1kwPpts", "bJFtcwLSNxI"]}
  ]
}
```
The reconciler computes the diff:

- Promised but not consumed (dropped): WARN.
- Consumed but not promised (invented): HIGH (not in the upstream `promised` list, likely hallucination).

Severity asymmetry rationale: dropped artifacts (promised but not consumed) are common in real workflows; for example, research surfaces 10 facts and the write pass uses 7 because 3 weren't relevant. Invented artifacts (consumed but not promised) are a different signal entirely: the agent claims to use something nobody surfaced, which strongly indicates fabrication. WARN vs HIGH reflects this difference in expected legitimacy.
This is the contract-driven hand-off check. It catches the common case where upstream research surfaces N facts, but downstream write pass only uses M < N. Without this check, the contract is implicit and silently breakable.
Backward compatibility: `promised` / `consumed` are optional fields.
Tasks without them skip this check (no FAIL). Adoption is incremental:
agents that opt in benefit from the contract verification.
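A minimal sketch of the promised/consumed diff, assuming entries are matched by `kind` and compared item by item through their `detail` lists. `handoff_diff` is a hypothetical name, and the handling of count-only entries (no `detail`) is an assumption: they are only checked for whether the kind was consumed at all.

```python
def handoff_diff(promised: list[dict], consumed: list[dict]) -> list[str]:
    """Compare upstream `promised` with downstream `consumed`, keyed by `kind`."""
    up = {p["kind"]: set(p.get("detail", [])) for p in promised}
    down = {c["kind"]: set(c.get("detail", [])) for c in consumed}
    findings = []
    for kind, items in up.items():
        if kind not in down:
            findings.append(f"WARN: {kind} promised but not consumed at all")
            continue
        dropped = sorted(items - down[kind])
        if dropped:
            findings.append(f"WARN: {kind} promised but not consumed: {dropped}")
    for kind, items in down.items():
        if kind not in up:
            findings.append(f"HIGH: {kind} consumed but never promised")
            continue
        invented = sorted(items - up[kind])
        if invented:
            findings.append(f"HIGH: {kind} consumed but not promised: {invented}")
    return findings
```

Run against the two JSON examples above, this reports the two unused video ids and the unused `concept_mapping` kind as WARN, and nothing as HIGH.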
Build three views:
(a) Agreement table — per task pair, did agents converge?
| | T1 (codex) | T2 (codex) | T3 (gemini) |
|---|---|---|---|
| T1 | — | overlap: src/auth/providers.py | no overlap |
| T2 | overlap | — | no overlap |
| T3 | no overlap | no overlap | — |
Two tasks "overlap" if their files_changed lists share any path.
Two tasks "agree" if they overlap AND their changes don't
contradict (heuristic: same file changed by two agents → flag for
manual review unless one is git mv-style and the other is content
edit).
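The overlap rule is a set intersection over `files_changed`. The sketch below covers only that part; the "agree" heuristic still needs a look at the actual diffs. `overlaps` is a hypothetical helper name.

```python
from itertools import combinations


def overlaps(files_changed: dict[str, list[str]]) -> dict[tuple[str, str], list[str]]:
    """Map each task pair to the sorted paths both tasks changed (pairs with no overlap are omitted)."""
    result = {}
    for a, b in combinations(sorted(files_changed), 2):
        shared = sorted(set(files_changed[a]) & set(files_changed[b]))
        if shared:
            result[(a, b)] = shared
    return result
```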
(b) Conflict heatmap — which files were touched by multiple tasks?
```text
src/auth/providers.py     [T1, T2]   ⚠ conflict — both edit same file
src/auth/interfaces.py    [T1]       ok
docs/auth.md              [T3]       ok
tests/test_auth.py        [T2]       ok
```
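A heatmap like the one above can be built by inverting `files_changed` from task → paths to path → tasks. This is a sketch with a hypothetical helper name, not the skill's own implementation.

```python
from collections import defaultdict


def conflict_heatmap(files_changed: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each touched path to the sorted task ids that changed it."""
    touched = defaultdict(set)
    for task_id, paths in files_changed.items():
        for path in paths:
            touched[path].add(task_id)
    return {path: sorted(ids) for path, ids in sorted(touched.items())}
```

Any path whose list has more than one task id is a conflict candidate.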
For conflicts, read the actual diffs and either:
(c) Aggregated risks — concat all risks arrays from
result.json + risks mentioned in the .md summaries.
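A sketch of the `result.json` half of this view, which also applies the ≤ 30-word rule for `risks` entries. Risks mentioned only in the `.md` summaries are not covered here, since extracting them needs reading the prose. `aggregate_risks` is a hypothetical name.

```python
def aggregate_risks(results: dict[str, dict], max_words: int = 30) -> tuple[list[str], list[str]]:
    """Return (all risks tagged by task id, context-contract violations)."""
    risks, violations = [], []
    for task_id, result in results.items():
        for entry in result.get("risks", []):
            risks.append(f"{task_id}: {entry}")
            if len(entry.split()) > max_words:
                violations.append(f"{task_id}: risks entry exceeds {max_words} words")
    return risks, violations
```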
Based on the analysis:
| Situation | Recommendation |
|---|---|
| All tasks status: success, no conflicts, no risks | "Merge all in dependency order (T1 → T2 → T3 → T4)." |
| All success but one conflict on file X | "Merge T1, T3, T4. Manually merge T2's edits to X with T1's." |
| One task status: error | "Retry T2 (failure reason: ). Don't merge other tasks until T2 succeeds, since T4 depends on T2." |
| One task status: fallback | "Review T3's degraded output before merging. Acceptance criteria may not have been met." |
| Risks flagged | "Address risks before merging: ..." |
| Cross-agent contradiction (e.g., Codex says X, Gemini's review says X is wrong) | "Escalate: invoke agent-debate on the contested point before deciding." |
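The table can be read as a decision function. The sketch below is one possible encoding; the order in which situations are checked (errors first, then contradictions, fallbacks, conflicts, risks) is an assumption, since the table does not rank them, and `recommend` is a hypothetical name.

```python
def recommend(statuses: dict[str, str], conflicts: list[str], risks: list[str],
              contradiction: bool = False) -> str:
    """Pick one recommendation; the precedence of the checks is an assumption."""
    failed = sorted(t for t, s in statuses.items() if s == "error")
    degraded = sorted(t for t, s in statuses.items() if s == "fallback")
    if failed:
        return f"Retry {', '.join(failed)} before merging dependent tasks."
    if contradiction:
        return "Escalate: invoke agent-debate on the contested point."
    if degraded:
        return f"Review degraded output of {', '.join(degraded)} before merging."
    if conflicts:
        return f"Merge non-conflicting tasks; manually merge edits to {', '.join(sorted(conflicts))}."
    if risks:
        return "Address risks before merging."
    return "Merge all in dependency order."
```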
Write the report to `.coord/reconciliation_<NNN>.md`.

Format (full template: `references/reconciliation_template.md`):
```markdown
# Multi-agent reconciliation — round 1

**Goal:** Refactor the auth module into plugin-based architecture
**Created:** 2026-04-28T10:30:00Z
**Tasks:** 4 (2 codex, 1 gemini, 1 claude)

## Per-task summary

### T1 — codex — extract-interfaces ✅ success
Files: src/auth/interfaces.py (+47 lines)
Tests: pytest tests/auth/test_interfaces.py PASS
Risks: none reported.

### T2 — codex — refactor-providers ⚠ fallback
Files: src/auth/providers/google.py, src/auth/providers/saml.py (+93 / -41)
Tests: pytest tests/auth/test_providers.py — 1 FAIL (test_legacy_compat)
Risks:
- Backwards compat with legacy.py possibly broken; test_legacy_compat is failing.

### T3 — gemini — review-doc-coverage ✅ success
Output: 12 docstrings flagged as outdated (still mention legacy class).
Risks: none reported.

### T4 — claude — design-review ✅ success
Verdict: YES, the refactor is sound. Specific concerns: the
test_legacy_compat failure suggests we may need a deprecation
shim before removing legacy.py.

## Cross-task analysis

### Agreement
- T1 and T2 share scope on src/auth/* — both touched only files in
  their declared in-scope globs. ✅
- T3's doc review aligns with T2's implementation: T3 flagged the
  same legacy references that T2 should have removed.
- T4's design review confirms T1's interface choice.

### Conflicts
- None on file paths.
- One contradiction: T2 succeeded with a fallback (legacy compat
  test failing), T4 said design is sound; T4 didn't see T2's test
  failure. Flag for user decision.

### Aggregated risks
1. test_legacy_compat is failing — backwards compat possibly broken.

## Recommended action

⚠ **Don't merge yet.** Two paths:
1. **Keep legacy.py as deprecation shim** (T4's suggestion). Re-run T2
   with this constraint, retry, then re-reconcile.
2. **Accept the breaking change.** Update test_legacy_compat to
   reflect the new architecture, then merge T1 + T2 + T3 (and treat
   T4's verdict as conditional-pass).

If you want a third opinion, invoke `agent-debate` on "should we
keep a deprecation shim for legacy auth?" before deciding.
```
End with:
```text
[agent-output-reconciler]
Round: 1
Tasks reconciled: 4 (2 success, 1 fallback, 1 success-claude)
Conflicts: 0 file-level, 1 cross-agent contradiction (T2 failure ↔ T4 verdict)
Risks: 1 (legacy compat)
Report: .coord/reconciliation_001.md
Recommended next: review the report and either retry T2 with shim,
or accept breaking change. After deciding, run agent-acceptance-gate
for the merge decision.
```
error. The summary `.md` files and `result.json` are the primary inputs.

When: ≥ 3 agent outputs to reconcile, OR any agent's `result_*.md` exceeds 200 words.
Why: Reading 4 × full result.md files inline costs ~10 KB of main session context. A per-agent subagent can pre-digest each output and return only a structured ≤ 150-word verdict so the reconciler works from compact digests, not raw summaries.
Pattern (parallel, one subagent per agent output):
```text
For each task in plan.yml round:
  Spawn `code-reviewer` subagent with:
    - Read .ai/<agent>_result_<NNN>_<slug>.md
    - Read .ai/<agent>_log_<NNN>_<slug>.txt.result.json (status,
      risks, files_changed)
    - Return: ≤ 150-word digest = { status, key claim, risks,
      slug consistency vs plan.yml, missing-output flags }

Main session reads only the digests and writes reconciliation_<NNN>.md
from them. Never inlines the original result.md / result.json content.
```
This is the canonical pattern for "fan-out / fan-in" multi-agent reconciliation. Main-session context then grows with N compact digests (N = number of agents, each ≤ 150 words) instead of with N × the average full result size.
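The trigger condition for this pattern (≥ 3 outputs, or any summary over 200 words) is easy to state as a predicate. The sketch assumes word counts are taken by whitespace splitting; `should_fan_out` is a hypothetical name.

```python
def should_fan_out(summaries: dict[str, str], min_outputs: int = 3, max_words: int = 200) -> bool:
    """summaries maps task id -> full text of that task's result_*.md."""
    too_long = any(len(text.split()) > max_words for text in summaries.values())
    return len(summaries) >= min_outputs or too_long
```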
Every agent boundary is a commit boundary (see global rule:
~/.claude/CLAUDE.md → "Commit Discipline for Multi-Agent Work"). This
makes multi-agent work auditable (commit log = agent log) and enables
surgical rollback via git revert <hash> of just one agent's commit.
Specific to this skill: the reconciler reads each agent's output as a separate commit. If agents share an uncommitted working tree, the reconciler cannot disentangle which change came from which agent. Always commit each agent's output before invoking the reconciler.