fill-cards
Investigation-driven card grooming — investigate across data sources, synthesize findings into card content, present for approval
| name | fill-cards |
| description | Investigation-driven card grooming — investigate across data sources, synthesize findings into card content, present for approval |
| disable-model-invocation | true |
Find Shortcut stories with empty template sections, investigate across data sources (Intercom, PostHog, codebase, Slack, Jam), synthesize findings into card content, and present a complete draft for Paul's approval.
The goal is an approved draft, not a pushed card. Pushing to Shortcut is a separate step that happens only after Paul has reviewed the full card text and said to ship it. "Let me push it" is not approval. Present the wording, wait for explicit go-ahead.
Supports voice mode if available.
$ARGUMENTS determines the mode:
- SC-NNN (default): Single card, standard investigation. Run the full protocol below on the named card.
- --lead: Parallel fill (3a) lead role. Find candidates, propose a product-area cluster split, brief the partner instance in an agenterminal thread, work your own cards, own session-end documentation.
- --partner: Parallel fill (3a) partner role. Join the agenterminal thread, read your assignment, acknowledge, then work your assigned cards.

All three modes share the same per-card investigation protocol. The modes differ only in how cards are selected and how coordination happens.
Before starting, read these shared sections from box/shortcut-ops.md:
These are shared across plays and maintained in one place. Don't skip them.
agenterminal.execute_approved or present the command for the user to run.
Retry-After.
intercom-search.py, intercom-evidence.py; Shortcut: see Mutation Scripts table in box/shortcut-ops.md). If no wrapper exists for the specific operation, grep reference/tooling-logistics.md for the recipe. Don't re-derive payload shapes or endpoint paths.

Before anything else:
ls .agent/fill-cards/ 2>/dev/null
If sc{NNN}.md exists for the card you are about to work:
If the user chooses "start fresh": archive the existing file before proceeding:
mv .agent/fill-cards/sc{NNN}.md .agent/fill-cards/sc{NNN}.bak.$(date +%s).md
Then continue to Step 1. Preflight will create a new session file.
If no session file exists, continue to Step 1.
Trustworthy result files on resume: Only read sidecar .json files that have a
matching ### {label} result saved receipt in the current session file — path
matches and file exists on disk. Any sc{NNN}-*.json files on disk without a
matching receipt in the current session file are stale and must be ignored.
If multiple receipts exist for the same path, the one with the latest timestamp
is authoritative; earlier receipts for the same path are superseded.
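
The receipt rule above can be sketched as a small filter. This is a minimal illustration, not the play's actual tooling: it assumes a receipt is a `### {label} result saved` heading followed by a line carrying `Path: <file>`, and the function names `receipted_paths` and `split_sidecars` are hypothetical. It does not model the latest-timestamp tiebreak, because the timestamp's position in a receipt is not specified here.

```python
import re

# Assumed receipt shape: a "### {label} result saved" heading followed by a
# line that carries "Path: <file>". The real layout may differ.
RECEIPT_HEADING = re.compile(r"^### .+ result saved")
PATH_FIELD = re.compile(r"\bPath:\s*(\S+)")


def receipted_paths(session_text):
    """Collect every path named under a 'result saved' receipt heading."""
    paths = set()
    in_receipt = False
    for line in session_text.splitlines():
        if line.startswith("#"):
            # Any heading starts a new section; only receipt headings count.
            in_receipt = bool(RECEIPT_HEADING.match(line))
            continue
        if in_receipt:
            match = PATH_FIELD.search(line)
            if match:
                paths.add(match.group(1))
    return paths


def split_sidecars(session_text, files_on_disk):
    """Return (trusted, stale) sidecar paths, each sorted."""
    receipts = receipted_paths(session_text)
    trusted = sorted(p for p in files_on_disk if p in receipts)
    stale = sorted(p for p in files_on_disk if p not in receipts)
    return trusted, stale
```

Only the `trusted` list should be read on resume; everything in `stale` is ignored even though it exists on disk.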
If $ARGUMENTS is SC-NNN, skip candidate discovery — you already have the card.
Otherwise, find cards with empty sections:
python3 box/shortcut-cards.py --state "In Definition" --empty-sections --summary
This ranks by most empty sections first, then oldest.
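
As a rough illustration of that ordering (not the script's actual implementation), the ranking is a two-key sort. The field names `empty_sections` and `created` and the card IDs are assumptions made up for this sketch.

```python
from datetime import date


def rank_candidates(cards):
    """Most empty sections first; ties broken by oldest creation date."""
    return sorted(cards, key=lambda c: (-c["empty_sections"], c["created"]))


# Hypothetical cards, used only to show the ordering.
cards = [
    {"id": "SC-3", "empty_sections": 2, "created": date(2026, 1, 5)},
    {"id": "SC-1", "empty_sections": 4, "created": date(2026, 3, 1)},
    {"id": "SC-2", "empty_sections": 4, "created": date(2026, 2, 1)},
]
order = [c["id"] for c in rank_candidates(cards)]
```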
Run these before starting investigation. Not optional.
Fetch the card: python3 box/shortcut-cards.py --id SC-NNN --description. Check the
workflow state: Need Requirements cards get context to support a stakeholder
conversation. In Definition cards get the full treatment (investigation,
solution sketch, all sections).
Bug or feature? Bug cards use a leaner template (skip blank Monetization, UI, Reporting, Release sections).
New feature or extension? Frame as extension when possible. Identify the closest existing feature surface.
Visual/UX cards: Ask for screenshots of the actual behavior before investigating the code mechanism. Code tells you what can happen; a screenshot tells you what does happen. Screenshots are the primary source for visual issues; code is secondary.
Run pre-flight checks (hard gate):
python3 box/preflight.py --product-area AREA
Replace AREA with the card's product area (e.g. SMARTPIN, TURBO). This
verifies all tokens, DB connectivity, sync freshness, and extracts known
PostHog event names for the product area. Results are automatically logged
to the preflight_log audit table.
If preflight exits non-zero, stop. Do not proceed to investigation. A stale Intercom index means any "no signal" finding is unreliable. Report the failure to the user and wait for resolution before continuing. This is not advisory — investigating on stale data produces false negatives that are silent, permanent, and undetectable after the fact.
Preflight output includes PostHog event names for the product area. Use
these in step 4 instead of reading box/posthog-events.md separately.
After preflight passes, scan .claude/rules/product-knowledge.md for entries
relevant to this card's product area. Note any feature state, entity scope,
or vocabulary facts that apply to the investigation.
Run preflight with session file. Pass --session-file .agent/fill-cards/sc{NNN}.md
to the preflight command. Preflight creates the file if it does not exist:
python3 box/preflight.py --product-area AREA --card-id SC-NNN --session-file .agent/fill-cards/sc{NNN}.md
Pass --session-file .agent/fill-cards/sc{NNN}.md to all subsequent script
calls in this session. Pass session_file and save_path to all
agenterminal.delegate calls. No separate file creation step needed.
Fill-cards does not use collect. Disk persistence is handled by
save_path on delegate (push-first, PR #210). The checkpoint-approval gate
that collect provides is not needed — approve_content at step 7 is the
human-in-the-loop point. Push delivery keeps the agent responsive (no blocking
call to intercept). [Delegate completed] and [Delegate terminal] messages
arrive as session turns, not conversation thread entries.
Before dispatching anything, plan which data sources this card needs and what question each answers. Not every card needs all sources. A card about a UI change may not need PostHog. A card originating from a Slack idea with no user-facing reports may not need Intercom. Write the plan as a short list before starting. TodoWrite the plan items so they persist through step 4b (investigation accounting). Narrated plan items are invisible at accounting time; pending todos are not. Proved 2026-04-30: routing question planned in step 3, abandoned after one failed grep, accounting reported "Planned but NOT executed: None."
Write the plan to the session file. After setting todos, append to the session file:
## Investigation Plan
- {source 1}: {question}
- {source 2}: {question}
This is the cheapest recovery artifact: a fresh agent can restore todos from this section without replaying any investigation steps.
Record the delegation decision and expected read count for each data source. The investigation plan must explicitly state how each source will be handled. For codebase, direct investigation reads are capped at 3 per card. If the expected count exceeds 3, the plan must use delegation.
## Investigation Plan
- Intercom: delegate filter (Sonnet mode, >8 expected candidates)
- Codebase: delegate explore (3 axes: autosave architecture, event emission, FormResetHandler)
- PostHog: direct (iterative queries)
or:
## Investigation Plan
- Intercom: direct search (≤8 expected, narrow keyword)
- Codebase: direct (2 reads: grep for event name, read emit file)
- PostHog: direct (single funnel query)
The declared read count makes the delegation decision visible at planning time and auditable at accounting time (Step 4b). If actual reads exceed the declared count, the accounting must flag the deviation.
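
A minimal sketch of how the declared count could drive both the planning decision and the later deviation check. The function names and the `DEVIATION` wording are hypothetical; only the cap of 3 and the plan-line shapes come from the text above.

```python
DIRECT_READ_BUDGET = 3  # direct investigation reads allowed per card


def codebase_handling(expected_reads):
    """Pick the handling mode at planning time from the expected read count."""
    return "direct" if expected_reads <= DIRECT_READ_BUDGET else "delegate"


def plan_line(expected_reads, note):
    """Render the codebase entry for the Investigation Plan section."""
    if codebase_handling(expected_reads) == "direct":
        return f"- Codebase: direct ({expected_reads} reads: {note})"
    return f"- Codebase: delegate explore ({note})"


def deviation_flag(declared_reads, actual_reads):
    """Return a flag string for accounting when actual exceeds declared."""
    if actual_reads > declared_reads:
        return f"DEVIATION: declared {declared_reads}, actual {actual_reads}"
    return None
```

Running `deviation_flag` at accounting time (Step 4b) keeps the comparison mechanical rather than remembered.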
Bug cards: verify the code mechanism before searching Intercom. Understanding the failure mode tells you what symptoms to look for and how to distinguish this bug from related ones. Without code grounding, you evaluate Intercom conversations against the card's claims — proxy trust on the thing you're supposed to verify. Proved SC-1517: assessed conversation relevance against card description before reading any code; human caught it.
Multi-card sessions: compare investigation scope. If this is the Nth card in a session, list the data sources and search breadth used for the previous card before planning this one. The quality gate checklist measures artifact existence, not investigation depth — a card built from reused adjacent findings passes the same checklist as one with a full standalone investigation. Proved 2026-04-08: Ghostwriter card drafted from SC-1238 codebase findings with 2 Intercom searches vs 7+ for SC-1238; checklist passed; human caught it.
Work through the data sources identified in the plan. The investigation phase is complete when every data source in the plan has findings (or an honest "no signal" result), and those findings came through the defined channel: delegated exploration or within-budget direct reads for codebase (see Step 4 budget rule), direct search or filter subagent for Intercom, saved insights for PostHog.
Three-phase action through scripts. No raw SQL.
Search: python3 box/intercom-search.py "terms" (add --since,
--no-canned, --limit as needed). The script has a mechanical freshness
gate — it blocks with exit code 2 if the index is stale (>36h).
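
The freshness gate can be pictured as a single comparison. This is a sketch under assumptions, not the code in intercom-search.py: the function name is hypothetical, and treating exactly 36 hours as still fresh is a guess based on the ">36h" wording.

```python
from datetime import datetime, timedelta, timezone

MAX_INDEX_AGE = timedelta(hours=36)
EXIT_OK, EXIT_STALE = 0, 2


def freshness_exit_code(last_sync, now):
    """Return 2 when the index is older than 36 hours, else 0."""
    return EXIT_STALE if now - last_sync > MAX_INDEX_AGE else EXIT_OK


now = datetime(2026, 4, 30, 12, 0, tzinfo=timezone.utc)
fresh = freshness_exit_code(now - timedelta(hours=12), now)
stale = freshness_exit_code(now - timedelta(hours=48), now)
```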
Delegation decision tree (based on expected volume from Step 3 plan):
intercom-search.py in primary context). Primary reads all results. No delegation overhead.
intercom-filter (Sonnet mode): python3 box/compose-delegation.py intercom-filter --var SEARCH_GOAL="..." --var NOISE_PATTERNS="..." --var MODEL_OVERRIDE=sonnet. Delegate searches, reads, classifies. Primary reads SIGNAL conversations only. Intercom audit (Step 8) catches missed signal.

For multi-phase investigations with file_path, read the template directly (box/intercom-filter-prompt.md).
Session persistence for intercom-filter delegation. Call agenterminal.delegate, passing:
- conversation_id: (the active session conversation)
- session_file: ".agent/fill-cards/sc{NNN}.md"
- session_label: "intercom-filter"
- save_path: ".agent/fill-cards/sc{NNN}-intercom-filter-result.json"

The delegate tool writes the task_id and label to the session file automatically.
The server writes the result JSON to save_path on completion, before the push.
Continue other investigation work. When a [Delegate completed] or
[Delegate terminal] message arrives for this delegation:
Check the header for saved_to= — if present, the file is on disk.
Verify: ls .agent/fill-cards/sc{NNN}-intercom-filter-result.json
If the file exists, append to the session file:
Task ID: {task_id} Path: .agent/fill-cards/sc{NNN}-intercom-filter-result.json Status: {completed|timed_out}
If saved_to is absent from the header, append:
Task ID: {task_id} Expected path: .agent/fill-cards/sc{NNN}-intercom-filter-result.json
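
The branch on `saved_to=` can be sketched as below. The header layout is an assumption (a `saved_to=<path>` token somewhere in the push message), and `receipt_body` is a hypothetical helper. The case where `saved_to` is present but the file is missing is not covered by the steps above, so the sketch returns `None` and leaves that decision to the caller.

```python
import re

# Assumed header token; the real push header format may differ.
SAVED_TO = re.compile(r"saved_to=(\S+)")


def receipt_body(header, task_id, expected_path, status, file_exists):
    """Build the line to append to the session file for one delegate push."""
    match = SAVED_TO.search(header)
    if match is None:
        # No saved_to in the header: record where the file was expected.
        return f"Task ID: {task_id} Expected path: {expected_path}"
    if file_exists:
        return f"Task ID: {task_id} Path: {match.group(1)} Status: {status}"
    return None  # saved_to present but nothing on disk: unspecified case


path = ".agent/fill-cards/sc12-intercom-filter-result.json"
saved = receipt_body(f"[Delegate completed] saved_to={path}", "t1", path, "completed", True)
missing = receipt_body("[Delegate terminal]", "t2", path, "timed_out", False)
```

The same shape applies to the codebase and verification delegates, with only the path changing.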
Vocabulary shift (after first pass). Card titles anchor search terms toward product framing ("wrong site," "duplicate draft"). Users say "combined," "jumbled," "can't find," "manually made copies." After the first search pass, pause and ask: "how would a user describe this to support?" Run a second pass with user-vocabulary terms before declaring search complete. Logged 5 times (SC-201, SC-874, SC-1108, SC-67, SC-984) — the audit agent catches it every time because it only sees the What section, not the card title.
Vocabulary category checklist (second pass). Before declaring Intercom search complete, check each category:
Check shipped-card timeline. Before citing Intercom conversations as
current evidence, check whether a related card shipped since those
conversations. Use --since to filter to post-ship dates. Pre-ship
conversations about a now-fixed issue inflate the evidence count for a
problem that no longer exists. Proved 2026-04-08: multiple Jan-Feb credit
purchase complaints predated SC-470 (shipped Mar 18).
Read: python3 box/intercom-search.py --read ID1,ID2 for the
conversations going on the card. Search snippets are not evidence.
Read the actual conversation text before citing it. Long conversations
(>30 parts): scan the final 5-10 exchanges separately. Topic transitions
cluster in later parts as customers report new issues in existing threads.
Classifying by opening topic misses these. Proved SC-1541: conversation
classified as "engagement tracking" from opening messages; later parts
contained support confirming the pin-add bug.
Cross-reference ambiguous conversations with PostHog. When a bug has a trigger condition visible in PostHog person properties (billing version, subscription state, feature flags), look up the Intercom contact's email in PostHog to confirm they match the bug's profile. Resolves "could be this bug" into confirmed or ruled out. Proved SC-1517: turned 1 confirmed
Evidence block: Use python3 box/intercom-evidence.py to generate the
structured evidence block for the card. Records search date, index freshness,
terms used, and links to every signal conversation. See the Intercom evidence
schema in box/shortcut-ops.md for format details. Every card gets one of
three blocks: searched/signal, searched/no-signal, or not-searched with
rationale. Link all signal conversations classified as evidence, not a sample.
Codebase exploration is delegated by default. Multi-file exploration
(architecture discovery, feature surface mapping, instrumentation tracing)
goes through compose-delegation.py codebase-explore:
python3 box/compose-delegation.py codebase-explore --var QUESTION="..." --var CODEBASE_PATH="/Users/paulyokota/Dev/aero". For the file-writing
variant, read the template directly (box/verified-explore-prompt.md,
File-writing variant section). Keep to 3 search axes max per delegation.
After collecting: read the specific files the subagent identified that bear on card claims. Primary context is for targeted verification, not broad exploration.
Aggregate budget: 3 direct investigation reads per card. Direct codebase reads in primary context are capped at 3 per card investigation. A "direct investigation read" is one tool call that reads codebase content for the purpose of discovering facts: a Read or git show of a file/section, a Grep that returns file content, or a Bash command that outputs code. File-path-only greps (output_mode: files_with_matches) do not count.
Examples of what fits within 3: a single-file lookup (1 read), a stub-level validation (grep for event name + read emit file = 2 reads), a short chained lookup (2-3 reads). If the investigation would exceed 3, delegate.
Post-delegation verification reads are exempt. Reading files the subagent identified to verify claims against primary sources is expected and does not count against the budget.
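
The counting rules above can be expressed compactly. This is an illustrative sketch only: the `ToolCall` record and its field names are hypothetical, while the cap of 3, the files_with_matches exclusion, and the verification exemption come from the text.

```python
from dataclasses import dataclass

BUDGET = 3  # direct investigation reads per card


@dataclass
class ToolCall:
    target: str
    paths_only: bool = False    # grep with output_mode: files_with_matches
    verification: bool = False  # post-delegation verification read


def counts_against_budget(call):
    """Only content-returning, non-verification reads consume budget."""
    return not call.paths_only and not call.verification


def budget_status(calls):
    used = sum(counts_against_budget(c) for c in calls)
    label = "WITHIN BUDGET" if used <= BUDGET else "OVER BUDGET"
    return f"Budget status: {label} ({used}/{BUDGET})"


calls = [
    ToolCall("file1.ts"),
    ToolCall("file2.ts"),
    ToolCall("src/", paths_only=True),
    ToolCall("file3.ts", verification=True),
]
status = budget_status(calls)
```

Keeping the `verification` flag per call is what lets a reviewer challenge a read that was labelled exempt but was really independent investigation.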
Cross-cutting concerns use targeted grep delegation. Cross-cutting patterns (auth, session management, storage) touch many subsystems and timeout in explore delegates (SC-399). Delegate via agenterminal.delegate with a targeted grep-first prompt: instruct the delegate to grep for specific patterns and read only matching files. Do not use primary context for cross-cutting investigation.
When delegation fails (timeout, empty result, bad output): diagnose why.
Common causes: too many search axes (split and re-dispatch), timeout too short
(increase timeout_ms), wrong search terms (refine and re-dispatch). Do not
absorb exploration into primary context. If you're tempted to "just read the
files yourself," that's the signal to fix the delegation, not skip it.
Session persistence for codebase delegation. Call agenterminal.delegate, passing:
- conversation_id: (the active session conversation)
- session_file: ".agent/fill-cards/sc{NNN}.md"
- session_label: "codebase-explore"
- save_path: ".agent/fill-cards/sc{NNN}-codebase-result.json"

The delegate tool writes the task_id and label to the session file automatically.
The server writes the result JSON to save_path on completion, before the push.
Continue other investigation work. When a [Delegate completed] or
[Delegate terminal] message arrives for this delegation:
Check the header for saved_to= — if present, the file is on disk.
Verify: ls .agent/fill-cards/sc{NNN}-codebase-result.json
If the file exists, append to the session file:
Task ID: {task_id} Path: .agent/fill-cards/sc{NNN}-codebase-result.json Status: {completed|timed_out}
If saved_to is absent from the header, append:
Task ID: {task_id} Expected path: .agent/fill-cards/sc{NNN}-codebase-result.json
Query for relevant events. Save queries as insights at query time, not
after drafting. Every number on the card needs a linkable saved insight. Use
the PostHog event names from preflight output (step 2). If preflight reported
NO_MATCH for the product area, use the PostHog MCP event-definitions-list
tool to discover events. For richer schema notes beyond event names, grep
box/posthog-events.md for the product area section.
Write insight permalinks to the session file. After saving each insight, append to the session file:
## PostHog
- {event name} ({date}): {permalink}
Required manual write. The permalink is the only PostHog artifact that survives compaction.
Read thread content for cards that originated from Slack ideas. Use:
python3 box/slack-scanner.py --channel C0ADJ4ATJE4 --threads <permalink>
Reply text is in the thread_context array of the JSON output. For ideas with
cross-channel links, the scanner's Play 1 output already includes thread content
inline — check cross_channel_links[].thread_context before making a second call.
Check for file attachments. The scanner's files array lists images and
screenshots. Download and view any screenshots:
curl -s -H "Authorization: Bearer $TOKEN" <url_private> -o /tmp/file.png
For visual/UX issues, screenshots are primary sources that code alone cannot substitute.
If the card's Slack thread, Intercom evidence, or description contains a Jam URL
(jam.dev/c/...), pull structured debug data via Jam MCP. getDetails first
(overview + investigation guide), then getNetworkRequests (filter by
statusCode="4xx" or "5xx"), getConsoleLogs, and getUserEvents for the
reproduction timeline. This surfaces technical root causes that text descriptions
miss. Especially valuable for bug cards: the Jam contains the reproduction
evidence, not just a link to it.
Before synthesizing, produce a structured accounting block:
git show returned [...Nch omitted...], the file is partially read: note the omitted range and read it before marking complete. Proved 2026-04-23: delegate line-number labels filled in for unread sections, survived the accounting step, and produced three monitor escalations.

### Budget check
Investigation reads (independent):
- [x] file1.ts (grep for event name)
- [x] file2.ts (read emit site)
Total: 2 (budget: 3)
Verification reads (delegate-identified):
- [x] file3.ts (confirmed delegate finding)
Total: 1 (exempt)
Budget status: WITHIN BUDGET (2/3)
If investigation reads exceed 3: flag as OVER BUDGET. The per-read categorization makes the count auditable: a reviewer can check whether a read classified as "verification" was actually independent investigation.

This is not optional; it is the gate between investigation and synthesis. The accounting verifies investigation completeness before synthesis starts. Proved 2026-04-15: monitor flagged investigation-to-synthesis transition; accounting produced after the flag but should have been spontaneous.
Write the accounting block to the session file before advancing to synthesis. Do not proceed to Step 5 until this write is complete. Append verbatim:
## Accounting Block ({timestamp})
{full accounting block}
Then append:
## STATUS: accounting ({timestamp})
A fresh agent reading this file can determine exactly what was investigated without replaying any steps. This is the highest-value recovery artifact.
URL::route() calls that would 500 on route removal). These hit different files.
The second pass can be a separate delegation.

After filling a card, search Shortcut for non-archived cards that share infrastructure, prerequisites, or overlapping scope. Propose links with the right verb:
| Verb | Direction | When to use |
|---|---|---|
| blocks | Subject blocks Object | Object cannot ship without Subject being done first. Test: "Could Object ship if Subject didn't exist?" If no, it blocks. |
| relates to | Bidirectional | Cards share infrastructure, overlap in scope, or inform each other, but neither is a prerequisite. |
| duplicates | Subject duplicates Object | Used in Find Dupes play. Subject is the loser (gets archived). |
Default to relates to unless there's a genuine prerequisite dependency. Present
proposed links alongside the card draft for approval.
For Released cards, read the description to confirm whether the fix addressed root cause or a surface symptom before including in the card landscape. Title + Released state is insufficient — ongoing Intercom complaints post-release indicate partial fixes. Proved SC-1798: SC-226 (Released) incorporated from title alone; description would reveal whether SmartPin multi-profile fix was root cause or patch.
For story link creation, use shortcut-mutate.py story-link:
python3 box/shortcut-mutate.py story-link SUBJECT_ID OBJECT_ID "relates to"
The script handles mutation + read-back verification in one call (exit 0 =
verified). Route through execute_approved. Proved 2026-04-22: raw curl via
reference/tooling-logistics.md recipe works but requires a separate
verification call; the wrapper script eliminates that gap.
Verify between each story link mutation. The script's built-in verification
covers the link itself. Still run shortcut-cards.py --id SC-NNN between links
to confirm cumulative state before creating the next one. Batch momentum erodes
per-mutation verification on sequential similar mutations — proved twice
2026-04-01.
Batch mutation rule. When executing 3+ sequential Shortcut mutations (card creates, updates, story links), verify each one independently before proceeding to the next. "Verify" means a GET request confirming the change, not reading the execute_approved response. Narrate the verification result explicitly — not just "verified." This pattern co-activates batch momentum (narration compresses) and proxy trust (API response treated as verification). Proved 2026-04-08: verification dropped by card 3 of 7, caught by human at card 4.
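
One way to picture the batch rule is a loop that never trusts the mutation's own response. This is a hedged sketch with an in-memory stand-in for Shortcut: `run_batch`, `apply_mutation`, and `fetch_card` are hypothetical names, and the real flow goes through execute_approved and the box scripts rather than these callables.

```python
def run_batch(mutations, apply, fetch):
    """Apply each mutation, then confirm it with an independent read."""
    log = []
    total = len(mutations)
    for index, (card_id, field, value) in enumerate(mutations, start=1):
        apply(card_id, field, value)  # the response is deliberately ignored
        observed = fetch(card_id).get(field)
        if observed != value:
            log.append(f"{index}/{total} {card_id}: FAILED, {field} is {observed!r}")
            break  # stop the batch; do not carry momentum past a failure
        log.append(f"{index}/{total} {card_id}: read back {field}={observed!r}")
    return log


# In-memory stand-in used only to exercise the loop.
store = {}


def apply_mutation(card_id, field, value):
    store.setdefault(card_id, {})[field] = value


def fetch_card(card_id):
    return dict(store.get(card_id, {}))


log = run_batch(
    [("SC-1", "area", "TURBO"), ("SC-2", "area", "TURBO")],
    apply_mutation,
    fetch_card,
)
```

Each log line states what was read back, which is the explicit narration the rule asks for.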
Sequence: Write → Read → spot-check → quality gate → approve_content. After writing the draft to a file, read it back before evaluating the quality gate. The quality gate's checkmarks are claims about the draft content: verify them against what was actually written, not what you intended to write.

Spot-check gate: before declaring the quality gate passed, run at least 2 fresh Grep or Read calls against the card's highest-stakes factual claims (file paths, specific code properties, render order). In-context memory of prior reads is not verification; it is reconstruction. Proved 2026-04-22 and 2026-04-23: quality gate passed from memory both times; fresh spot-checks confirmed accuracy but process was wrong both times.

Select spot-checks targeting the card's most novel assertion and the file that is ground truth for it. Event emission claims: the emit site file. Data model claims: the schema/type file. Don't spot-check adjacent confirming evidence when the highest-risk claim has a specific ground-truth file. Proved 2026-04-24: spot-checked data-layer code (import-processor.ts, smart-pin-repository.ts) instead of the event emit site (csv-import-page.tsx) for a novel "CsvImport lacks generationFrequency" claim.
Present the completed checklist inline first (this is the due diligence evidence).
Then present the card description via approve_content with
content_type: "card-draft", filename: "scNNN". The user reviews and can edit
in the modal.
After approve_content returns, append to the session file:
## Draft approved ({timestamp})
Path: .agenterminal/approved/card-draft/sc{NNN}.md
## STATUS: approved ({timestamp})
The checklist is not presented without the draft, and the draft is not presented without the checklist. Any checklist item that fails must be resolved before presenting. If you can't resolve it, say so explicitly rather than presenting with a known gap.
### Pre-approval checklist
Related cards (all story links): SC-15, SC-68, SC-44
PostHog insights saved: [SmartPin adds](link), [Add distribution](link)
Intercom evidence block: schema-compliant (searched/no-signal/not-searched with date, index freshness, terms, all signal conversation IDs)
Codebase: all file paths on card read directly, no subagent-only claims
Card metadata: product area set, story links created
Severity (bug cards): Sev N (Level) — state the discriminator answer (see `reference/severity-framework.md`). Must be assessed here, not deferred to ship time.
Quality gate:
- Problem before solution
- Scoping-ready
- Verifiable evidence
- Observable done state
In multi-card sessions: copy-paste the checklist structure from card 1 for every subsequent card. Do not reconstruct it from memory — a template resists compression; a remembered format invites shortcuts. The checklist erodes predictably: manual verification drops first, then the checklist itself.
Checklist field definitions:
box/shortcut-ops.md. One of three formats: searched/signal,
searched/no-signal, or not-searched with rationale. Must include: search date,
index freshness date, all search terms used, and links to all signal conversations
classified as evidence (not a sample). Generated via python3 box/intercom-evidence.py.
Every cited conversation was read directly by the primary instance.
IDs must be linked as full Intercom URLs
(https://app.intercom.com/a/apps/2t3d8az2/inbox/inbox/all/conversations/{ID}),
not plain text. "Verifiable" means the reader can click through, not construct
the URL themselves. Subagent-classified conversations not yet verified by the
primary instance are listed separately as unverified candidates and do not
appear in the card's Evidence section.
Verify counts before asserting. Run grep -c on the evidence file or comment file to confirm conversation counts match what you claim in the checklist and card body. Do not recall counts from memory: three count discrepancies in one session (2026-04-22) all came from recalling rather than counting.

Dispatch two independent verification agents in parallel. State which verifiers you're dispatching and why; if skipping one, state the rationale explicitly. Silent omission is batch momentum (proved 2026-03-25: SC-348's skipped Intercom audit found genuinely valuable evidence when the human insisted on it).
a) Codebase verification (existing):
python3 box/compose-delegation.py codebase-verify --var card_draft_path="$SAVED_PATH". Parse the JSON output and pass prompt, output_instructions, and model to agenterminal.delegate. Also pass:
- conversation_id: a unique ID for this delegate (e.g. verify-codebase-{NNN})
- session_file: ".agent/fill-cards/sc{NNN}.md"
- session_label: "codebase-verify"
- save_path: ".agent/fill-cards/sc{NNN}-codebase-verify-result.json"

b) Intercom evidence audit (multi-pass):
python3 box/compose-delegation.py intercom-audit --var card_what_section="$(cat /tmp/what.txt)". Parse the JSON output and pass prompt, output_instructions, and model to agenterminal.delegate. Pass ONLY the card's What section; the auditor must reason independently about search terms. Do not pass the Evidence section. Also pass:
- conversation_id: a unique ID for this delegate (e.g. verify-intercom-{NNN})
- session_file: ".agent/fill-cards/sc{NNN}.md"
- session_label: "intercom-audit"
- save_path: ".agent/fill-cards/sc{NNN}-intercom-audit-result.json"

Dispatch both in the same message (they have no dependency). Each delegate must have a unique conversation_id; the server rejects a second delegate with an already-active conversation_id. Any unique string works (the server creates the conversation on the fly).
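
The per-delegate arguments for the two verifiers can be assembled in one place so the uniqueness rule is checked before dispatch. This is a sketch only: `verifier_dispatches` is a hypothetical helper, and the prompt, output_instructions, and model values from compose-delegation.py are left out.

```python
def verifier_dispatches(card_number):
    """Build the session-persistence arguments for both verifier delegates."""
    session_file = f".agent/fill-cards/sc{card_number}.md"
    specs = [
        ("codebase-verify", "verify-codebase"),
        ("intercom-audit", "verify-intercom"),
    ]
    dispatches = [
        {
            "conversation_id": f"{prefix}-{card_number}",
            "session_file": session_file,
            "session_label": label,
            "save_path": f".agent/fill-cards/sc{card_number}-{label}-result.json",
        }
        for label, prefix in specs
    ]
    ids = [d["conversation_id"] for d in dispatches]
    if len(set(ids)) != len(ids):
        # Guard against reusing an ID, which the server would reject.
        raise ValueError("each delegate needs its own conversation_id")
    return dispatches


codebase, intercom = verifier_dispatches(1760)
```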
After dispatching both delegates, append to the session file:
## STATUS: verifying ({timestamp})
The delegate tool writes each task_id and label to the session file automatically.
The server writes each result JSON to save_path on completion, before the push.
As each [Delegate completed] or [Delegate terminal] push arrives:
Check the header for saved_to= — if present, the file is on disk.
Verify: ls {save_path}
If the file exists, append to the session file:
Task ID: {task_id} Path: {save_path} Status: {completed|timed_out}
If saved_to is absent from the header, append:
Task ID: {task_id} Expected path: {save_path}
Codebase: If all_verified: true — pick 1-2 claim numbers from the
delegate's report and run a fresh Read/Grep against the delegate's specific
source_file and exact_evidence. Pre-delegation reads that confirm the
same underlying facts do not count — the spot-check verifies the delegate's
work product, not the facts themselves. Proved 2026-04-23: SC-1760, pre-delegation
greps confirmed top-level claims but were substituted for post-collect
verification of the delegate's 17 specific claim-evidence pairs.
If all_verified: false —
read the cited source_file yourself for each failed claim before presenting.
The verifier's verified: false is a pointer to investigate, not a verdict
to forward. If you already read the file earlier in the session, re-check your
own reading against the claim — don't restructure your assessment around the
verifier's conclusion. Then present findings. User decides whether to fix or
ship as-is. Proved 2026-04-06: SC-1172, verifier flagged simplified param
shape as failure; agent forwarded verdict without checking own prior read;
human caught it.
If the codebase verifier times out: do not simply proceed on the basis of
files you read earlier. Run the inverse check: list every factual claim in
the Architecture Context, then confirm each was verified against a primary
source in this session. Any unverified claim is a reason to re-dispatch a
narrower verifier (one codebase, fewer claims) or read the file yourself.
For dense Architecture Context sections (20+ claims, multiple codebases),
prefer splitting into two focused verifier dispatches from the start.
Include negative assertions in the claim list. Claims like "no X exists
in the codebase" from the delegate's not_found section need at least one
targeted grep to convert trusted absence into verified absence. The inverse
check pattern naturally lists positive claims but lets negative delegate
claims pass unchecked. Proved 2026-04-20: "no bot actions" assertion from
delegate passed unverified through the inverse check.
Proved twice 2026-04-07: SC-64 and SC-1288 both timed out at 10 min.
Intercom audit: Two phases, in order. Do not assess before reading.
Phase 1 — Read all. Run the pre-composed read_commands from the
auditor's output sequentially. These are --read ID1,ID2,ID3 commands
grouping all IDs into batches of 3-5. Run every command; do not skip,
truncate, or substitute partial reads (head, limit, first-N-lines). Do not
assess relevance during this phase — read first, think second. The auditor's
snippets and relevance classifications are proxies — you cannot judge what a
conversation adds from a one-line snippet. Proved 2026-04-15: three effort
substitution activations on the same surface (partial reads, filtering to
"strongest 6," pre-judging from snippets). Human corrected all three.
Phase 1 gate — count and narrate before proceeding. After each batch
command, state: "Batch N of M: [conversation IDs read]. [Per-conversation
signal classification: strong/weak/tangential + what it adds]." Then state
the running count: "Read N of M audit batch commands." If N < M, run the
remaining before proceeding to Phase 2. The per-batch summary serves two
functions: it makes batch momentum visible (narration dropping from
substantive to silent is the tell), and it forces per-conversation judgment
rather than inheriting the auditor's relevance labels. Proved 2026-04-20
(session dd8e8dc1): assessed batch 1 substantively, went silent for batches
2-6 despite naming the completion bias risk in writing before starting. Also
proved 2026-04-15 (session 278a42e5): advisory "run every command" was in
the play text, read 6/27, concluded from proxies. Full reads added 10
meaningful conversations.
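One way to picture the Phase 1 gate is as a counter that refuses to open Phase 2 until every batch has a non-empty narration. The sketch below is illustrative only: `ReadGate` and its methods are hypothetical names, and the play itself enforces this through narration in the conversation, not through code.

```python
from dataclasses import dataclass, field


@dataclass
class ReadGate:
    """Tracks narrated audit batches (hypothetical helper, not part of the play)."""

    total_batches: int
    narrated: dict = field(default_factory=dict)  # batch number -> summary text

    def record(self, batch_number, conversation_ids, summary):
        """Store one batch narration and return the status line to state aloud."""
        if not 1 <= batch_number <= self.total_batches:
            raise ValueError(f"batch {batch_number} is outside 1..{self.total_batches}")
        if not summary.strip():
            # A silent batch is exactly the failure this gate is meant to expose.
            raise ValueError("empty summary: every batch needs per-conversation judgment")
        self.narrated[batch_number] = summary
        ids = ", ".join(str(i) for i in conversation_ids)
        return (
            f"Batch {batch_number} of {self.total_batches}: {ids}. {summary} "
            f"Read {len(self.narrated)} of {self.total_batches} audit batch commands."
        )

    def missing(self):
        """Batch numbers that have not been narrated yet."""
        return [n for n in range(1, self.total_batches + 1) if n not in self.narrated]

    def may_start_phase_2(self):
        """True only when N == M, i.e. no batch is missing."""
        return not self.missing()
```

Rejecting empty summaries mirrors the "narration dropping from substantive to silent" tell: a batch that was run but not assessed does not count as read.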
Phase 2 — Assess what the evidence says about the recommendation. Only
after reading all conversations, determine two things: (1) which add distinct
information to the evidence block, and (2) whether the evidence collectively
says anything about whether the card's recommendation is correct. Artifact
completeness (how many conversations to link) is not the only output of
verification — investigation depth (what do these conversations mean for the
card) matters more. Link every signal conversation classified as evidence per the schema
("link all signal conversations, not a sample"). Present findings to user. User
decides whether to add them to the card.
After assessment, regenerate the Intercom evidence block with all
signal conversation IDs from filter + audit. Noise exclusions from the
Sonnet-mode filter carry forward; audit additions are included. Exclude
only false positives (wrong feature, wrong product). The schema handles
more than 10 conversations via the comment format. Don't filter to "strongest": the selectivity instinct conflicts with schema completeness. Proved 2026-04-20: presented 9 curated of 29 found; human redirected "follow the schema fully."
If total_found == 0, note "Multi-pass audit: no additional signal" in the checklist.
Proved 2026-04-06: assessed 9 conversations from auditor snippets, recommended ship without reading any. Human caught it twice. Proved 2026-04-14: read 5 of 22, declared rest redundant from snippets. Human challenge unlocked 3 material findings (Safari confirmed broken by eng, returning user case, international domains).
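The regeneration rule above amounts to a union with one narrow exclusion. A minimal sketch follows, assuming the filter output already has noise removed and that conversation IDs are plain strings; `build_evidence_ids`, `needs_comment`, and `COMMENT_THRESHOLD` are hypothetical names, and the threshold of 10 is taken from the evidence-comment rule in this play.

```python
COMMENT_THRESHOLD = 10  # assumption: lists longer than this move to a card comment


def build_evidence_ids(filter_ids, audit_ids, false_positives):
    """Union filter and audit IDs in first-seen order, dropping only false positives.

    There is deliberately no "strongest N" step: every signal conversation stays.
    """
    excluded = set(false_positives)
    seen = set()
    ordered = []
    for conv_id in list(filter_ids) + list(audit_ids):
        if conv_id in excluded or conv_id in seen:
            continue
        seen.add(conv_id)
        ordered.append(conv_id)
    return ordered


def needs_comment(ids):
    """True when the full list should go in a comment instead of the card body."""
    return len(ids) > COMMENT_THRESHOLD
```

The function has no ranking or sampling parameter on purpose, so curation has to be an explicit false-positive decision rather than a silent trim.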
Spot-check before recommending ship. Before saying "ship as-is," make at least one Read or Grep call that could falsify the recommendation. Delegate reports that read well are exactly when proxy trust activates. Proved 2026-04-03: recommended ship based entirely on delegate output without a single primary-source check.
Do not auto-fix failed codebase claims or auto-add Intercom conversations. Present findings, let the user decide.
If the card was modified after the initial approve_content (verification fixes,
evidence additions, schema restructuring, any edits), re-present via
approve_content before proceeding to ship. The initial approval covered the
pre-verification draft, not the post-fix version. The user needs to see and approve
what will actually be pushed.
Before editing, enumerate ALL changes. List every change from both (1) verifier failures and (2) audit findings, numbered. Then edit. Check each off. If the edit count < the list count, something was dropped. "Just fixing line numbers" after an audit that found substantive new content is effort substitution — the assessment scope must carry through to the execution scope. Proved 2026-04-20: SC-1657, Anna's conversation assessed as "important context" then silently excluded from the re-draft.
Re-approval cycle: Read → Edit → Read → Submit. Read the saved file at the
saved_path from the previous approval. Edit the specific change. Re-read to
verify no other instances of the problem remain (e.g., grep for all local file
paths, not just the one flagged). Then submit. Do not reconstruct the card
content from context memory — context memory is a proxy for the file, and proxy
trust activates on your own prior output. Proved 2026-04-08: SC-1311.
Quantitative consistency gate. When any count, date, or quantitative claim changes in the re-edit, grep the full document for all instances of the old value before re-submitting. Downstream claims that derive from the changed number (e.g., "12 of 21" when the total changed to 30) need recalculation, not just find-and-replace. This is a mechanical pre-step, not a post-flag recovery. Proved 2026-05-15: evidence block updated from 21 to 30 conversations; "12 of 21" in Architecture Context and "12 Intercom conversations from ~11 distinct users" in What section left stale; required a third approval cycle after monitor flag.
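The mechanical pre-step in the quantitative consistency gate can be sketched as a whole-token search for the old value. This is an illustration under stated assumptions: `find_stale_values` is a hypothetical helper, and it only locates candidates. Derived figures such as the "12" in "12 of 21" still need manual recalculation.

```python
import re


def find_stale_values(document, old_value):
    """Return (line_number, line) for each line where old_value appears as a whole token.

    The lookarounds keep "21" from matching inside "1211" or "2.21".
    """
    pattern = re.compile(r"(?<![\w.])" + re.escape(old_value) + r"(?!\w)")
    return [
        (number, line.strip())
        for number, line in enumerate(document.splitlines(), start=1)
        if pattern.search(line)
    ]
```

Run it once per changed value before re-submitting; an empty result for the old value is the condition for moving on.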
All four steps, every time. This is mechanical, not conditional.
If the title needs to change, plan the title mutation alongside the ship
command before executing either. shortcut-ship.py handles description +
state + unassign but not title. A separate PUT is needed. Build the complete
mutation set first, then execute sequentially with verification between each.
Proved 2026-04-08: SC-1238 shipped with old title, required a follow-up
mutation to correct.
Run python3 box/ship-gate.py enter before the first production
execute_approved call (the hook will block production mutations until the
plan is declared). The dry-run in step 1 below is exempt.
Pre-ship check (dry-run). Re-fetches the card's current state from Shortcut and shows what the ship command will change. Present the output to the user and wait for go-ahead before proceeding.
python3 box/shortcut-ship.py SC-NNN <saved_path> --severity N --dry-run
Use --description-only to update content without changing state. Do not
ship without explicit user confirmation of the state change.
Ship. Submit via agenterminal.execute_approved:
python3 box/shortcut-ship.py SC-NNN <saved_path>
The script strips YAML frontmatter automatically. The saved file at
.agenterminal/approved/card-draft/scNNN.md is compaction insurance and
audit trail. The script handles all four operations in a single call:
update description, move to Backlog, unassign owners, then re-fetch
and verify (state, owners, section headers, description length). Exit code
0 = verified, 1 = verification failed, 2 = mutation failed, 3 = backward
state transition blocked (use --force to override). Do not construct
payloads or curl commands manually.
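The exit codes listed above can be kept as a small lookup when reporting a ship result. The sketch does not run the ship script or touch Shortcut; it only interprets a return code that has already been reported back. `SHIP_EXIT_CODES` and `interpret_ship_exit` are hypothetical names, and the handling of unknown codes is an assumption.

```python
# Meanings copied from the play text; the table itself is illustrative.
SHIP_EXIT_CODES = {
    0: "verified",
    1: "verification failed",
    2: "mutation failed",
    3: "backward state transition blocked",
}


def interpret_ship_exit(code):
    """Return (ok, message) for an exit code reported by the ship command."""
    meaning = SHIP_EXIT_CODES.get(code)
    if meaning is None:
        # Assumption: an undocumented code is treated as a failure to surface.
        return False, f"unknown exit code {code}: stop and report to the user"
    return code == 0, meaning
```

Only code 0 counts as shipped. Code 3 is the one case with a documented override (`--force`), and that remains a user decision.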
Evidence comment (when >10 conversations). If the Intercom evidence
block says "full list in comment," post the comment via
shortcut-comment.py SC-NNN "text". Route through execute_approved.
This is part of shipping, not a follow-up. Verify the comment appears on
the card. Proved 2026-04-28: declared "shipped" with evidence comment
still pending; human correction required.
If any new PostHog event names were discovered during investigation, add them
to the appropriate section in box/posthog-events.md (Grep for the section
header). If the product area has no section, create one at the bottom and add
it to the index at the top of the file.
When $ARGUMENTS is --lead, you are the lead instance in a two-instance
parallel fill session.
3-4 cards to fill, at least two product-area clusters, agenterminal available.
For 1-2 cards, single-instance /fill-cards SC-NNN is simpler. For 5+ cards,
run multiple rounds rather than overloading one session.
Three straightforward cards per instance. The original 2-card limit was calibrated for ~200K context where both instances compacted at card 3. With 1M context, both instances completed 3 cards without compaction. The qualifier matters: cards were pre-selected for low ambiguity (narrow scope, clear investigation path). High-ambiguity cards (pricing strategy, feature sunset, cross-cutting architecture) consume significantly more context per card. For mixed batches, count high-ambiguity cards as 2 toward the limit.
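The capacity rule above is simple weighted arithmetic. A minimal sketch, assuming each card has already been labeled high-ambiguity or not; the function names are hypothetical and the card IDs in any usage are placeholders.

```python
CARD_LIMIT_PER_INSTANCE = 3  # limit stated in the play text
HIGH_AMBIGUITY_WEIGHT = 2    # high-ambiguity cards count as 2 toward the limit


def batch_load(cards):
    """cards maps card ID -> True if high-ambiguity. Returns the weighted load."""
    return sum(HIGH_AMBIGUITY_WEIGHT if high else 1 for high in cards.values())


def fits_one_instance(cards):
    """True when the weighted load stays within the per-instance limit."""
    return batch_load(cards) <= CARD_LIMIT_PER_INSTANCE
```

For example, one straightforward card plus one high-ambiguity card has a load of 3 and fills an instance, while two high-ambiguity cards have a load of 4 and should be split across instances or rounds.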
Cluster by product area, not random. Architecture knowledge compounds across cards in the same area: the second card is faster because the mental model from the first card carries over. If there's a natural "dense cluster + everything else" shape, that's the split. Example: SmartPin cluster (SC-135, SC-51, SC-68, SC-132) vs mixed bag (SC-90, SC-118, SC-131).
Two instances means twice the compaction exposure. The compaction gate hooks are the primary protection: if an instance compacts, its tools lock until explicitly resumed. The 3-card limit accounts for this; don't exceed it even if context feels comfortable mid-session.
Each instance saves its own log observations and session notes. If an instance compacts before session-end, its takeaways are lost. This is acceptable for 3 cards of observations. The other instance and the human together can reconstruct what matters.
When $ARGUMENTS is --partner, you are the partner instance in a two-instance
parallel fill session.
Same rules as the lead instance:
Everything else is the standard per-card protocol. Pre-flight, investigation, verification bar, quality gate, approval flow, completion steps: all identical to single-card mode.