| name | review-council |
| description | Multi-agent review council. Invoke for: review council, fellowship review, multi-agent review, panel review, diverse model review. |
Review Council
A multi-model code review panel. You orchestrate 3 phases: build a briefing pack, dispatch 15 parallel assessment agents (5 dimensions × 3 models), and consolidate findings.
Phase 1: Briefing Pack
Auto-detect the PR from the current branch. Build a self-contained briefing document containing all of the following:
- PR metadata: title, body, labels
- Linked issues: for every issue referenced in the PR body or commits, fetch the full issue body and all comments — this is where the real requirements and context live
- Full diff: merge-base diff only (
git diff $(git merge-base HEAD <base>)..HEAD) — shows only what the PR adds, not what main gained since the branch point
- Changed files list: with change type (added/modified/deleted), test files called out separately
- Commit messages: useful for correlating claims to specific code changes, but not a source of requirements
Use claude-opus-4.6 for this phase. The output is a single structured briefing document. Every assessment agent in Phase 2 receives this briefing verbatim — they should not need to re-fetch any of this context.
Phase 2: Parallel Assessment
Dispatch 15 agents in parallel: each of the 5 dimensions below assessed independently by each of 3 models.
Models: claude-opus-4.6, gemini-3-pro-preview, gpt-5.2-codex
Each agent receives:
- The full briefing pack from Phase 1
- Its specific dimension prompt (below)
- Instruction to return findings as a list, each with: location (file:line), description, severity (critical/high/medium/low)
Dimension 1: Claims Coverage
Are the goals met? Cross-reference every claim in the PR description, linked issues, and commit messages against actual code changes. Flag:
- Orphan claims: stated in PR/issue but no corresponding code change implements it
- Orphan changes: code changed but not mentioned in any claim — why is this here?
- Partial implementations: claim says X, code does X-minus-something — what's missing?
Before flagging, apply these gates:
- Execution context: understand how and where the code runs before judging it. A test harness, a standalone project, and the main compiler have different constraints. Do not assume the worst-case context.
- Direction: classify each behavioral change as regression (breaks something), improvement (fixes bug, removes race, simplifies), or unclear. Only regressions are findings. Improvements are observations at most. For unclear, state uncertainty — do not present as a defect.
Dimension 2: Test Coverage
Is every feature, fix, or behavioral change covered with tests? Check for:
- Happy path: does the basic intended usage work and is it tested end-to-end?
- Negative path: what happens with invalid input, malformed syntax, error conditions? Are diagnostics tested?
- Feature interactions: this is a compiler — edge cases are about how the change interacts with other F# features. Example: a codegen change should test both reference types and value types. A syntax change should test interaction with generics, constraints, computation expressions, etc.
- Assertion quality: tests must actually assert the claimed behavior, not just "compiles and runs without throwing". A test that calls the function but doesn't check the result is not a test.
Flag any behavioral change in the diff that lacks a corresponding test. Tests should be in the appropriate layer — pick based on what the issue is and what changed:
- Typecheck tests: the bulk of coverage — type inference, constraint solving, overload resolution, expected compiler warnings and errors
- SyntaxTreeTests: parser/syntax changes
- EmittedIL tests: codegen/IL shape changes
- compileAndRun tests: end-to-end behavioral correctness that absolutely needs proper execution on the .NET runtime
- Service.Tests: FCS API, editor features
- FSharp.Core.Tests: core library changes
A PR can and often should have tests in multiple categories.
Dimension 3: Code Quality
Assess structural quality of the changes:
- Logical layer placement: is the code in the right module/file, or shoved somewhere convenient? Would a reader expect to find this logic here?
- Ad-hoc "if/then" patches: flag any
if condition then specialCase else normalPath that looks like a band-aid rather than a systematic fix. These are symptoms of not understanding the root cause — the fix should be at the source, not patched at a consumer. A conditional that exists only to work around a bug elsewhere is a code smell.
- Duplicated logic within the PR diff: same or near-same code appearing in multiple places in the changeset
- Error handling: not swallowing exceptions, not ignoring Result values, not using
failwith where a typed error would be appropriate
- Respect intentional deviations: some projects (standalone tests, isolated builds, special harnesses) deliberately diverge from repo-wide conventions. Before flagging a pattern as wrong, check whether the project has a structural reason to be different. Intentional isolation is not inconsistency.
Dimension 4: Code Reuse & Higher-Order Patterns
Search the codebase for existing patterns that match the new code's structure. F# allows extracting logic into composable pieces — mappable structures, foldable, walkers, visitors. Look for:
- Highly similar nested pattern matches: a familiar structure of nested
match ... with but with a minor tweak. This is the #1 symptom — slight differences can almost always be extracted into a higher-order function or otherwise parameterized or made generic.
- Copy-paste-modify: new code that duplicates an existing function with small changes. The difference should be a parameter, a generic type argument, or a function argument (higher-order function).
- Missed abstractions: where two pieces of code share structure but differ in a specific operation — that operation should be a parameter, a generic type argument, or a function argument.
- Existing utilities ignored: the codebase may already have helpers, combinators, or active patterns that do what the new code reimplements from scratch. Search for them.
Dimension 5: Cyclomatic Complexity
Assess complexity of added/changed code:
- Pyramid of nested doom: deeply nested
if/then/else, heavily nested matchor interleaving with for and other branching constructs. Any nesting beyond 2 levels should be questioned — is there a flatter way?
- F# offers better tools: pattern matching and active patterns for non-trivial branching logic — flatter and easier to read than chains of
if/elif/else. Suggest them as alternatives.
- Active patterns for complex conditions: when a match guard or if-condition encodes domain logic, an active pattern names it and makes it reusable.
- Pipelines over nesting: sequential operations should be pipelined, not nested. Collections, Result, Option, ValueOption all support this — use
|>, bind, map chains instead of nested match or if/then.
- High branch count: functions with many
match arms or if branches — consider whether the cases can be grouped, or whether the function is doing too much and should be split.
- Flatter is better: a flat pattern match with 10 arms is easier to read than 4 levels of nesting with 10 combinations. Prefer wide over deep.
Phase 3: Consolidation
Collect all findings from the 15 agents. Then:
- Deduplicate: multiple agents will find the same issue. Merge findings that point at the same location and describe the same problem. Keep the best-written description.
- Filter false positives before classifying:
- Wrong context assumed: finding assumes a different execution context than the code actually runs in → discard.
- Correct convention: finding flags a pattern that is idiomatic for the tool/context in use → discard.
- Improvement, not regression: change makes things better (bugfix, simplification, race fix) → downgrade to informational, do not include in actionable findings.
- "Unexplained" ≠ "wrong": if the only concern is missing rationale in the commit message, that is a documentation gap — not a defect. Convert to comment, not finding.
- Speculation consensus: ≥2 models agree but reasoning relies on "could cause" without evidence it does → LOW confidence.
- Classify into three buckets:
- Behavioral: missing feature coverage, missing tests, incorrect logic, claims not met — things that affect correctness
- Quality: code structure, readability, complexity, reuse opportunities — it works, but could be better
- Nitpick: typos, naming, formatting, minor style — agents love producing these to have something to say. Low priority. Only surface if there are no higher-level findings.
- Rank within each bucket: prefer findings flagged by more agents (cross-model agreement = higher confidence).
- Present: Behavioral first, then Quality, then Nitpicks (if any). For each finding: location, dimension, description, how many agents flagged it.
If a model is unavailable at runtime, proceed with the remaining models. Minimum viable council = 2 models.