| name | backpack-code-review-checklist |
| description | Self-improving multi-agent review orchestrator for Backpack component PRs. Runs 6 parallel specialist agents (including a learned-patterns agent that mines past PR comments), then confidence-scores findings. Use for PR review, Constitution compliance checks, and pre-merge validation. Run with "learn" to mine recent PRs and auto-update checklist rules. |
Backpack Code Review — Multi-Agent Orchestrator (v3.1.3)
Dispatches 6 parallel specialist agents, each with its own prompt file in agents/.
A confidence-scoring pass filters false positives.
Threshold: default 75, override via /backpack-code-review-checklist threshold=80.
Execution Flow
Phase 0 Detect run mode — learn vs review; early-exit check (review only)
Phase 1 Metadata + RESOLVED/RAISED_SET ┐ run in parallel; Phase 1.5
Phase 1.5 Mine learned patterns ┘ uses parallel tool calls internally, no nested subagents
Phase 2a Launch Agents 1–5 immediately after Phase 1 (no wait for Phase 1.5)
Phase 2b Launch Agent 6 when Phase 1.5 completes (if applicable)
Phase 3 Independent confidence scoring — after ALL agents finish
Phase 4 Filter, format, and output
Phase 5 Orchestrator self-check (internal)
Phase 0: Detect Run Mode + Early Exit
Step 0.0 — Detect Learn Mode:
If the invocation contains learn, read and follow learn-mode.md. Stop here.
Step 0.1 — Determine review mode:
Step 0.2 — Early exit (PR mode only). Skip review if:
- PR is closed, merged, or draft
- PR is trivial/automated (only changelog or dependency bumps)
Phase 1 + Phase 1.5: Run in Parallel
Launch both steps at the same time — Phase 1 runs directly in the orchestrator (bash commands); Phase 1.5 is dispatched as a background subagent. They are independent and must not block each other.
Phase 1: Metadata + Review Thread Collection
Inline review comment fetch (PR mode only — skip in local mode):
Fetch the current PR's own inline review comments (line-level diff threads):
gh api repos/Skyscanner/backpack/pulls/[NUMBER]/comments --paginate \
--jq '[.[] | {id:.id, path:.path, line:(.line // .original_line), body:.body, in_reply_to_id:.in_reply_to_id}]'
Note: .line is null for outdated comments (position became stale after a subsequent push).
Fall back to .original_line so these threads are still captured in RESOLVED_SET/RAISED_SET.
Group comments into threads by in_reply_to_id (top-level = no in_reply_to_id; replies share the root's id).
Classify each thread into one of two sets:
- RESOLVED_SET: thread has at least one reply containing a resolution signal — a commit URL
(
github.com/Skyscanner/backpack/pull/NNN/commits/ or github.com/Skyscanner/backpack/commit/),
or the words "Done", "Fixed", "Fixed in", "updated in", "addressed in", "resolved"
(case-insensitive). Record {path, line, summary} for each thread.
If line is still null after the fallback, record with line: null and match by path only
(skip line-range check) — still suppresses the false positive.
- RAISED_SET: thread has a top-level comment but NO resolution signal in any reply.
Record
{path, line, summary} for each thread.
Store both sets for use in Phase 4.
Phase 1.5: Mine Learned Patterns
Always run as a single background subagent using parallel tool calls internally — no nested subagents.
When dispatching, pass the changed file list (newline-separated full paths) from Phase 0.
Step 1 — parallel gh pr list calls (one per component, all in one message):
Extract component names from changed paths (max 3).
- Normal case (≤70% brand-new files): search by component name — finds PRs for the same component:
gh pr list --repo Skyscanner/backpack --state merged --limit 5 \
--search "[COMPONENT_NAME]" --json number,title,mergedAt
- All-new-files case (>70% brand-new): widen the query to recent PRs generally, because new component
authors are the highest-risk group for anti-patterns they haven't seen yet. Search a broader set:
gh pr list --repo Skyscanner/backpack --state merged --limit 20 \
--search "bpk-component" --json number,title,mergedAt
Step 2 — parallel gh api calls (all PRs across all components, all in one message):
Collect every PR number returned from Step 1 (up to 15 total). For each PR, issue two parallel calls — one for
PR-level comments and one for inline diff comments — all in a single message:
gh api repos/Skyscanner/backpack/issues/[N]/comments --paginate \
--jq '[.[].body]'
gh api repos/Skyscanner/backpack/pulls/[N]/comments --paginate \
--jq '[.[].body]'
Collect and concatenate all returned body strings into raw comment text.
Pass combined raw comment text to Agent 6 as [INSERT RAW COMMENT TEXT].
Always launch Agent 6 when Phase 1.5 completes, even if few comments were found — Agent 6 handles empty input gracefully.
Phase 2: Launch Agents in Parallel
Phase 2a — Launch Agents 1–5 immediately after Phase 1
Do not wait for Phase 1.5. Dispatch in a single message as soon as Phase 1 is complete.
| Agent | Prompt file | Launch when | Self-fetched scope |
|---|
| Agent 1 | agents/agent1-constitution.md | Always | TSX/TS |
| Agent 2 | agents/agent2-sass.md | .scss files in changed list | SCSS |
| Agent 3 | agents/agent3-a11y.md | Always | TSX/TS |
| Agent 4 | agents/agent4-history.md | Files exist on main | Full diff |
| Agent 5 | agents/agent5-bugs.md | Always | Full diff |
Phase 2b — Launch Agent 6 when Phase 1.5 completes
| Agent | Prompt file | Launch when | Self-fetched scope |
|---|
| Agent 6 | agents/agent6-learned.md | Always (after Phase 1.5) | Full diff |
Always launch Agent 6 after Phase 1.5 completes. Agent 6 handles the case where no historical comments were found by returning [].
Template variables to fill in for every agent: [NUMBER], [SHA], [INSERT LIST], [INSERT] (PR summary).
Agent 6 additionally receives [INSERT RAW COMMENT TEXT] from Phase 1.5.
Each agent self-fetches its own scoped diff using the commands documented in its prompt.
Agents may also use the Read tool for deeper file inspection.
Wait for all running agents (1–5, and 6 if launched) to complete before starting Phase 3.
Each agent returns a JSON array of issues (no confidence score — Phase 3 scores independently):
[
{
"title": "Brief issue title (max 10 words)",
"explanation": "What is wrong, why it matters, what to use instead",
"file": "packages/bpk-component-foo/src/BpkFoo.tsx",
"startLine": 42,
"endLine": 45,
"source": "constitution|sass-tokens|a11y-testing|history|bug-scan|learned-patterns",
"rule_id": "constitution.xi.classname-restriction",
"rule": "Constitution XI — className restriction",
"supporting_lines": [{ "file": "...", "startLine": 42, "endLine": 45 }]
}
]
After aggregation — deduplication: If two issues share the same file AND overlapping
startLine–endLine AND similar title, merge into one. Keep the more specific rule_id;
priority order: constitution > sass-tokens > a11y-testing > history > bug-scan > learned-patterns.
Phase 3: Confidence Scoring
After all agents return and deduplication is done, collect every issue into a single list.
Scoring dispatch policy:
- If
len(issues) <= 15: launch parallel scoring agents — one per issue, all in one message, using model: haiku. Scoring is a classification task with an explicit rubric; it does not require tool calls or complex reasoning.
- If
len(issues) > 15: score all issues directly in a single orchestrator pass (no sub-agents).
Use the same scoring prompt below, but wrap all issues in one request:
Score each issue 0–100 using the rubric below. Return a JSON array:
[{"index": 0, "score": N, "confidence_explanation": "...", "rule_id": "...", "supporting_lines": [...]}]
Issues: [LIST ALL ISSUES WITH TITLE, EXPLANATION, CODE SNIPPET, RULE REFERENCE]
Each scoring agent (or the orchestrator in batch mode) receives:
- The issue title + explanation
- The relevant code snippet from the diff
- The relevant Constitution/decision rule text (if applicable)
- The issue metadata (
rule_id, supporting_lines)
Scoring prompt:
Score this issue 0–100 for confidence that it is a real, actionable issue in this PR:
Issue: [TITLE + EXPLANATION]
Code: [RELEVANT SNIPPET]
Rule reference: [CONSTITUTION/DECISION SECTION, if any]
Metadata: [RULE_ID + SUPPORTING_LINES]
Rubric:
- 0: False positive, or pre-existing issue not introduced by this PR
- 25: Might be real; stylistic, not explicitly required by Constitution/decisions
- 50: Real but minor nitpick; unlikely to matter in practice
- 75: Verified — Constitution/decisions explicitly requires this; PR contradicts it
- 100: Certain — NON-NEGOTIABLE violation (license header, className leak, missing a11y test)
For Constitution issues: verify the rule verbatim. If you cannot find it, score 0.
If the rule exists but violation requires interpretation, score ≤ 50.
Score ≥ 75 only after reading the exact rule text AND confirming the changed code contradicts it.
For bug issues: verify the bug can actually occur given the surrounding code context.
For history issues: verify the past feedback is relevant to the current change.
Return ONLY: {"score": NUMBER, "confidence_explanation": "brief explanation", "rule_id": "string", "supporting_lines": [{"file":"...","startLine":1,"endLine":2}]}
False positive patterns — always score 0:
- Pre-existing issues not introduced in this PR
- Something that looks like a bug but isn't, given context
- Pedantic nitpicks a senior engineer wouldn't flag
- Issues a linter/typechecker would catch
After scoring, attach confidence and confidence_explanation to each issue.
Phase 4: Filter, Format, and Output
Filter — step 1, confidence: split issues into three buckets:
confidence >= threshold → blocking issues — go through steps 2–3 and into output
60 <= confidence < threshold → observations — skip steps 2–3; conversation-only, never posted to GitHub
confidence < 60 → drop — false positive or noise
Issues with threshold <= confidence < 90 are human-gated — include them in output with
a "Gate rationale" line showing confidence_explanation, and require explicit user
confirmation before posting to GitHub.
Filter — step 2, RESOLVED_SET (PR mode only): For each remaining issue, check whether
it matches a RESOLVED_SET thread — same file (path) AND (thread line falls within
startLine–endLine of the issue, OR thread line is null — match by path only).
If it matches, drop the issue entirely (already caught and fixed by human reviewers in-PR).
Do not report it.
Filter — step 3, RAISED_SET annotation (PR mode only): For each remaining issue,
check whether it matches a RAISED_SET thread — same file, AND (overlapping line range OR
thread line is null). If it matches, keep the issue but prepend the label:
⚠️ Already raised in PR discussion — still open.
Output differs by mode:
Local mode output
Print the ### Code review block to the conversation. No GitHub posting.
If no issues and no observations:
### Code review
No issues found. Checked by N/6 agents (Constitution, Sass, A11y, History, Bug Scanner, Learned Patterns).
[Note any failed agents]
🤖 Generated with [Claude Code](https://claude.ai/code)
If issues found (with optional observations appended):
### Code review
Found N issues (reviewed by M/6 agents, filtered by confidence scoring):
[Note any failed agents]
1. [Title] *(recurring pattern)* — [explanation: what, why, fix. Link to codebase precedent if one exists.]
[path/file.tsx:42](path/file.tsx#L42)
[if human-gated] Gate rationale: [confidence_explanation]
[if observations exist]
Below-threshold observations (confidence 60–(threshold-1), not blocking):
- **[Title]** (score: N) — [one-line explanation: what, why]
[path/file.tsx:42](path/file.tsx#L42)
🤖 Generated with [Claude Code](https://claude.ai/code)
If no issues but observations exist:
### Code review
No issues met the threshold. Checked by N/6 agents (Constitution, Sass, A11y, History, Bug Scanner, Learned Patterns).
[Note any failed agents]
Below-threshold observations (confidence 60–(threshold-1), not blocking):
- **[Title]** (score: N) — [one-line explanation: what, why]
[path/file.tsx:42](path/file.tsx#L42)
🤖 Generated with [Claude Code](https://claude.ai/code)
PR mode output
Conversation: Print the same ### Code review block as local mode (for human review),
but links use full SHA permalinks:
https://github.com/Skyscanner/backpack/blob/[SHA]/[PATH]#L[START]-L[END]
Observations (if any) appear in the conversation block only — they are never included in the GitHub payload.
Autopost: OFF by default in PR mode. Post only when user explicitly says "post" / "post to GitHub" / BACKPACK_REVIEW_AUTOPOST=true.
Post a single GitHub PR review using the Reviews API (NOT a plain PR comment).
This places each issue as an inline comment on the relevant diff line, grouped under one review.
cat > /tmp/bpk_review_[NUMBER].json << 'ENDJSON'
{
"commit_id": "[SHA]",
"event": "COMMENT",
"body": "### Code review\n\nFound N issues (reviewed by M/6 agents).\n\n🤖 Generated with [Claude Code](https://claude.ai/code)",
"comments": [
{
"path": "packages/bpk-component-foo/src/BpkFoo.tsx",
"line": 45,
"start_line": 42,
"side": "RIGHT",
"body": "**[Title]** *(recurring pattern)*\n\n[explanation: what, why, fix]\n\n[if human-gated] Gate rationale: [confidence_explanation]"
}
]
}
ENDJSON
gh api repos/Skyscanner/backpack/pulls/[NUMBER]/reviews \
--method POST \
--input /tmp/bpk_review_[NUMBER].json \
--jq '{id: .id, state: .state, html_url: .html_url}'
Inline comment body format (per issue):
**[Title]** [*(recurring pattern)*]
[explanation: (a) what is wrong, (b) why, (c) what to use instead]
[if human-gated] ⚠️ Gate rationale: [confidence_explanation]
[if RAISED_SET match] ⚠️ Already raised in PR discussion — still open.
Line mapping rules:
- Single-line issue (
startLine == endLine): set "line": startLine, omit start_line
- Multi-line issue: set
"start_line": startLine, "line": endLine, "side": "RIGHT"
- If a line falls outside the PR diff (not a changed line), fall back to the nearest
changed line in the same file, or skip the inline comment and include that issue only
in the review body summary instead.
Human gate: Issues with threshold <= confidence < 90 require explicit user confirmation
before the review is posted. Show the full preview (inline comments + review body) and ask
"Post this review? (y/n)". Do not post any part of it until confirmed.
No partial posting: Either post all issues as one review, or don't post at all.
Shared output rules (both modes):
*(recurring pattern)* label for source: learned-patterns issues
- Explanation always covers: (a) what is wrong, (b) why, (c) what to use instead
- No strengths section, compliance table, or required-actions checklist
Phase 5: Orchestrator Self-Check (Internal — do not print)
- Check
className/style leakage for new components
- Verify design approval evidence is present and substantive
- Check private token misuse and token semantic-name correctness
- Check license headers on changed source files
- Perform mixin investigation for direct CSS properties
- Perform package-export investigation for imported Backpack helpers
- Verify token colour match AND semantic meaning
- Include diagnostics for any failed agents
- Enforce autopost guardrails
Privacy and Access Control
- Never include secrets/tokens/credentials in agent prompts or output
- Agents use only local repo files and
gh CLI — no external services beyond GitHub
- GitHub token scopes: read (diff/comments/history); write (PR comments, autopost only)
References