| name | fill-cards |
| description | Investigation-driven card grooming — investigate across data sources, synthesize findings into card content, present for approval |
| disable-model-invocation | true |
Fill Cards
Find Shortcut stories with empty template sections, investigate across data sources
(Intercom, PostHog, codebase, Slack, Jam), synthesize findings into card content,
and present a complete draft for Paul's approval.
The goal is an approved draft, not a pushed card. Pushing to Shortcut is a
separate step that happens only after Paul has reviewed the full card text and
said to ship it. "Let me push it" is not approval. Present the wording, wait
for explicit go-ahead.
Supports voice mode if available.
Arguments
$ARGUMENTS determines the mode:
SC-NNN (default): Single card, standard investigation. Run the full
protocol below on the named card.
--lead: Parallel fill (3a) lead role. Find candidates, propose a
product-area cluster split, brief the partner instance in an agenterminal
thread, work your own cards, own session-end documentation.
--partner: Parallel fill (3a) partner role. Join the agenterminal
thread, read your assignment, acknowledge, then work your assigned cards.
All three modes share the same per-card investigation protocol. The modes
differ only in how cards are selected and how coordination happens.
Also Read
Before starting, read these shared sections from box/shortcut-ops.md:
- Story Template: card format for drafting
- Card Quality Gate: the four criteria checked before presenting
- Verification Bar: factual accuracy checks
- Intercom Search Patterns: search syntax and patterns
- General Constraints: mutation cap, rate limiting
These are shared across plays and maintained in one place. Don't skip them.
Constraints
- Mutation gate: A PreToolUse hook blocks all Slack and Shortcut mutations
through Bash. Route through agenterminal.execute_approved or present the
command for the user to run.
- Human-in-the-loop: Present one card at a time with full content rendered.
Wait for Paul's decision. Don't batch-execute without review.
- Context protection: Delegate codebase exploration and high-volume Intercom
filtering to subagents. Primary context is for targeted verification and
synthesis, not broad exploration.
- Mutation cap: 25 mutations per run. Each description update, state change,
story link, or archive = 1. Finish the current item before checking the cap.
- Rate limiting: 0.5s delay between sequential API calls. Respect
Retry-After.
- Before Shortcut or Slack API calls: use the wrapper scripts named in the
play steps (Intercom: intercom-search.py, intercom-evidence.py; Shortcut:
see Mutation Scripts table in box/shortcut-ops.md). If no wrapper exists for
the specific operation, grep reference/tooling-logistics.md for the recipe.
Don't re-derive payload shapes or endpoint paths.
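The mutation cap and rate-limit constraints above can be sketched as a small pacing helper. This is an illustration only, not one of the wrapper scripts: `MutationBudget` and `wait_seconds` are hypothetical names, and the sketch assumes `Retry-After` arrives as a number of seconds rather than an HTTP date.

```python
MUTATION_CAP = 25        # mutations per run, from the constraint above
MIN_DELAY_SECONDS = 0.5  # pause between sequential API calls


class MutationBudget:
    """Hypothetical helper: counts mutations and reports when the cap is reached."""

    def __init__(self, cap=MUTATION_CAP):
        self.cap = cap
        self.used = 0

    def record(self, count=1):
        # Each description update, state change, story link, or archive counts as 1.
        self.used += count

    def exhausted(self):
        # Checked only after the current item is finished, so `used` may pass the cap.
        return self.used >= self.cap


def wait_seconds(retry_after_header=None):
    """Return how long to pause before the next sequential API call."""
    if retry_after_header is None:
        return MIN_DELAY_SECONDS
    try:
        # Assumption: the header is a delay in seconds, not an HTTP date.
        return max(MIN_DELAY_SECONDS, float(retry_after_header))
    except ValueError:
        return MIN_DELAY_SECONDS
```

Checking `exhausted()` after each finished item, rather than before each mutation, matches the rule to finish the current item before checking the cap.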
Steps
0. Resume check (single-card mode only)
Before anything else:
ls .agent/fill-cards/ 2>/dev/null
If sc{NNN}.md exists for the card you are about to work:
- Read it in full.
- Re-set TodoWrite items from the Investigation Plan section — these do not
survive compaction and must be restored for step 4b accounting to work.
- Report current state to the user based only on what is written in the
session file. Do not use the compaction summary. Name which steps completed,
what was found, and where the session ended.
- Ask whether to resume from that point or start fresh.
- Wait for explicit direction before proceeding.
If the user chooses "start fresh": archive the existing file before proceeding:
mv .agent/fill-cards/sc{NNN}.md .agent/fill-cards/sc{NNN}.bak.$(date +%s).md
Then continue to Step 1. Preflight will create a new session file.
If no session file exists, continue to Step 1.
Trustworthy result files on resume: Only read sidecar .json files that have a
matching ### {label} result saved receipt in the current session file — path
matches and file exists on disk. Any sc{NNN}-*.json files on disk without a
matching receipt in the current session file are stale and must be ignored.
If multiple receipts exist for the same path, the one with the latest timestamp
is authoritative; earlier receipts for the same path are superseded.
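The receipt rule above can be sketched as a filter over the session file text. This is a minimal sketch under stated assumptions: the receipt heading reads `{label} result saved ({timestamp})` with an optional `###` prefix, a `Path:` line follows it, and timestamps are ISO 8601 strings that sort lexicographically. `trusted_result_files` is a hypothetical name, not an existing script.

```python
import os
import re

# Assumed receipt heading: "### {label} result saved ({timestamp})" (prefix optional).
RECEIPT_RE = re.compile(
    r"^(?:#{1,6} )?(?P<label>.+?) result saved \((?P<ts>[^)]+)\)\s*$"
)


def trusted_result_files(session_text, exists=os.path.exists):
    """Map each receipted path to its latest receipt, keeping only files on disk."""
    latest = {}
    current = None
    for line in session_text.splitlines():
        stripped = line.strip()
        match = RECEIPT_RE.match(stripped)
        if match:
            current = match.groupdict()
            continue
        if stripped.startswith("#"):
            current = None  # a different section started; drop any pending receipt
            continue
        if current and stripped.startswith("Path:"):
            path = stripped.split(":", 1)[1].strip()
            previous = latest.get(path)
            # Assumes ISO 8601 timestamps, which compare correctly as strings.
            if previous is None or current["ts"] >= previous["ts"]:
                latest[path] = current
            current = None
    return {path: receipt for path, receipt in latest.items() if exists(path)}
```

Any `sc{NNN}-*.json` file that is on disk but missing from the returned mapping is stale under this rule and should be ignored.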
1. Find candidates (single-card mode)
If $ARGUMENTS is SC-NNN, skip candidate discovery — you already have the card.
Otherwise, find cards with empty sections:
python3 box/shortcut-cards.py --state "In Definition" --empty-sections --summary
This ranks by most empty sections first, then oldest.
2. Pre-flight
Run these before starting investigation. Not optional.
- Fetch the card: python3 box/shortcut-cards.py --id SC-NNN --description. Check the
workflow state: Need Requirements cards get context to support a stakeholder
conversation. In Definition cards get the full treatment (investigation,
solution sketch, all sections).
- Bug or feature? Bug cards use a leaner template (skip blank Monetization,
UI, Reporting, Release sections).
- New feature or extension? Frame as extension when possible. Identify the
closest existing feature surface.
- Same-day sibling cards: If other cards in the same product area were
created the same day (brainstorm batch, investigation batch, or related
bug filings), read all of them — not just those already in story links.
Scope overlap and disambiguation needs are common and not caught by the
story-links check alone. Proved SC-1517: SC-1518 created same day, same
area, required explicit disambiguation but wasn't in story links.
- Check In Build cards in the same product area. A card actively changing
pricing or feature behavior (e.g. SC-46 adding credit costs) invalidates
assumptions on sibling cards. Fetch In Build cards for the product area
before starting investigation.
- Visual/UX cards: Ask for screenshots of the actual behavior before
investigating the code mechanism. Code tells you what can happen; a
screenshot tells you what does happen. Screenshots are the primary source
for visual issues; code is secondary.
- Run pre-flight checks (hard gate):
python3 box/preflight.py --product-area AREA
Replace AREA with the card's product area (e.g. SMARTPIN, TURBO). This
verifies all tokens, DB connectivity, sync freshness, and extracts known
PostHog event names for the product area. Results are automatically logged
to the preflight_log audit table.
If preflight exits non-zero, stop. Do not proceed to investigation.
A stale Intercom index means any "no signal" finding is unreliable.
Report the failure to the user and wait for resolution before continuing.
This is not advisory — investigating on stale data produces false negatives
that are silent, permanent, and undetectable after the fact.
Preflight output includes PostHog event names for the product area. Use
these in step 4 instead of reading box/posthog-events.md separately.
After preflight passes, scan .claude/rules/product-knowledge.md for entries
relevant to this card's product area. Note any feature state, entity scope,
or vocabulary facts that apply to the investigation.
Run preflight with session file. Pass --session-file .agent/fill-cards/sc{NNN}.md
to the preflight command. Preflight creates the file if it does not exist:
python3 box/preflight.py --product-area AREA --card-id SC-NNN --session-file .agent/fill-cards/sc{NNN}.md
Pass --session-file .agent/fill-cards/sc{NNN}.md to all subsequent script
calls in this session. Pass session_file and save_path to all
agenterminal.delegate calls. No separate file creation step needed.
Fill-cards does not use collect. Disk persistence is handled by
save_path on delegate (push-first, PR #210). The checkpoint-approval gate
that collect provides is not needed — approve_content at step 7 is the
human-in-the-loop point. Push delivery keeps the agent responsive (no blocking
call to intercept). [Delegate completed] and [Delegate terminal] messages
arrive as session turns, not conversation thread entries.
3. Plan the investigation
Before dispatching anything, plan which data sources this card needs and what
question each answers. Not every card needs all sources. A card about a UI
change may not need PostHog. A card originating from a Slack idea with no
user-facing reports may not need Intercom. Write the plan as a short list
before starting. TodoWrite the plan items so they persist through
step 4b (investigation accounting). Narrated plan items are invisible at
accounting time; pending todos are not. Proved 2026-04-30: routing question
planned in step 3, abandoned after one failed grep, accounting reported
"Planned but NOT executed: None."
Write the plan to the session file. After setting todos, append to the
session file:
## Investigation Plan
- {source 1}: {question}
- {source 2}: {question}
This is the cheapest recovery artifact: a fresh agent can restore todos from
this section without replaying any investigation steps.
Record the delegation decision and expected read count for each data source.
The investigation plan must explicitly state how each source will be handled.
For codebase, direct investigation reads are capped at 3 per card. If the
expected count exceeds 3, the plan must use delegation.
## Investigation Plan
- Intercom: delegate filter (Sonnet mode, >8 expected candidates)
- Codebase: delegate explore (3 axes: autosave architecture, event emission, FormResetHandler)
- PostHog: direct (iterative queries)
or:
## Investigation Plan
- Intercom: direct search (≤8 expected, narrow keyword)
- Codebase: direct (2 reads: grep for event name, read emit file)
- PostHog: direct (single funnel query)
The declared read count makes the delegation decision visible at planning time
and auditable at accounting time (Step 4b). If actual reads exceed the declared
count, the accounting must flag the deviation.
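The delegation decision recorded in the plan can be sketched as three small checks. This is a sketch, not part of the play's tooling: the function names are hypothetical, and the thresholds come from the plan examples above (3 direct codebase reads, 8 expected Intercom candidates).

```python
CODEBASE_DIRECT_READ_CAP = 3  # direct investigation reads per card
INTERCOM_DIRECT_MAX = 8       # expected candidates that still fit a direct search


def codebase_mode(expected_reads):
    """Return 'direct' or 'delegate' for a declared codebase read count."""
    return "direct" if expected_reads <= CODEBASE_DIRECT_READ_CAP else "delegate"


def intercom_mode(expected_results=None):
    """Return the Intercom handling; unknown volume (None) defaults to delegation."""
    if expected_results is None:
        return "delegate"
    return "direct" if expected_results <= INTERCOM_DIRECT_MAX else "delegate"


def read_count_deviation(declared, actual):
    """Return a flag line for the Step 4b accounting, or None if within plan."""
    if actual > declared:
        return f"DEVIATION: declared {declared} direct reads, performed {actual}"
    return None
```

For example, a plan declaring 2 direct reads that ends up performing 4 yields a deviation line for the accounting block instead of passing silently.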
Bug cards: verify the code mechanism before searching Intercom. Understanding
the failure mode tells you what symptoms to look for and how to distinguish this
bug from related ones. Without code grounding, you evaluate Intercom conversations
against the card's claims — proxy trust on the thing you're supposed to verify.
Proved SC-1517: assessed conversation relevance against card description before
reading any code; human caught it.
Multi-card sessions: compare investigation scope. If this is the Nth card
in a session, list the data sources and search breadth used for the previous
card before planning this one. The quality gate checklist measures artifact
existence, not investigation depth — a card built from reused adjacent findings
passes the same checklist as one with a full standalone investigation. Proved
2026-04-08: Ghostwriter card drafted from SC-1238 codebase findings with 2
Intercom searches vs 7+ for SC-1238; checklist passed; human caught it.
4. Investigate
Work through the data sources identified in the plan. The investigation phase
is complete when every data source in the plan has findings (or an honest
"no signal" result), and those findings came through the defined channel:
delegated exploration or within-budget direct reads for codebase (see Step 4
budget rule), direct search or filter subagent for Intercom, saved insights
for PostHog.
Intercom
Three-phase action through scripts. No raw SQL.
- Search: python3 box/intercom-search.py "terms" (add --since,
--no-canned, --limit as needed). The script has a mechanical freshness
gate: it blocks with exit code 2 if the index is stale (>36h).
Delegation decision tree (based on expected volume from Step 3 plan):
- ≤8 expected results → Search directly (intercom-search.py in primary
context). Primary reads all results. No delegation overhead.
- >8 expected results → Delegate intercom-filter (Sonnet mode):
python3 box/compose-delegation.py intercom-filter --var SEARCH_GOAL="..." --var NOISE_PATTERNS="..." --var MODEL_OVERRIDE=sonnet. Delegate searches,
reads, classifies. Primary reads SIGNAL conversations only. Intercom audit
(Step 8) catches missed signal.
- Unknown volume → Default to delegation (Sonnet mode). The >8 path is
safe for small result sets (just more overhead). The ≤8 path is unsafe for
large result sets (fills primary context).
For multi-phase investigations with file_path, read the template directly
(box/intercom-filter-prompt.md).
Session persistence for intercom-filter delegation. Call agenterminal.delegate,
passing:
conversation_id: (the active session conversation)
session_file: ".agent/fill-cards/sc{NNN}.md"
session_label: "intercom-filter"
save_path: ".agent/fill-cards/sc{NNN}-intercom-filter-result.json"
The delegate tool writes the task_id and label to the session file automatically.
The server writes the result JSON to save_path on completion, before the push.
Continue other investigation work. When a [Delegate completed] or
[Delegate terminal] message arrives for this delegation:
- Check the header for saved_to=; if present, the file is on disk.
- Verify: ls .agent/fill-cards/sc{NNN}-intercom-filter-result.json
- If the file exists, append to the session file:
intercom-filter result saved ({timestamp})
Task ID: {task_id}
Path: .agent/fill-cards/sc{NNN}-intercom-filter-result.json
Status: {completed|timed_out}
- If saved_to is absent from the header, append:
intercom-filter result save failed ({timestamp})
Task ID: {task_id}
Expected path: .agent/fill-cards/sc{NNN}-intercom-filter-result.json
Vocabulary shift (after first pass). Card titles anchor search terms
toward product framing ("wrong site," "duplicate draft"). Users say
"combined," "jumbled," "can't find," "manually made copies." After the first
search pass, pause and ask: "how would a user describe this to support?"
Run a second pass with user-vocabulary terms before declaring search
complete. Logged 5 times (SC-201, SC-874, SC-1108, SC-67, SC-984) — the
audit agent catches it every time because it only sees the What section,
not the card title.
Vocabulary category checklist (second pass). Before declaring Intercom
search complete, check each category:
- Symptom language ("scrambled," "mixed up," "wrong order," "lost")
- Use-case framing ("editorial calendar," "campaign," "seasonal content")
- Non-English terms if the feature has international users (German, Spanish)
- Workaround language ("manually," "one by one," "how do I fix")
The primary search anchors to product framing from the card title; these
categories reach users who describe the need, not the feature. Proved
2026-04-22: primary search (6 queries, English product terms) found 1
conversation; audit (20 queries including German, use-case framing) found 5
additional.
Check shipped-card timeline. Before citing Intercom conversations as
current evidence, check whether a related card shipped since those
conversations. Use --since to filter to post-ship dates. Pre-ship
conversations about a now-fixed issue inflate the evidence count for a
problem that no longer exists. Proved 2026-04-08: multiple Jan-Feb credit
purchase complaints predated SC-470 (shipped Mar 18).
- Read: python3 box/intercom-search.py --read ID1,ID2 for the
conversations going on the card. Search snippets are not evidence.
Read the actual conversation text before citing it. Long conversations
(>30 parts): scan the final 5-10 exchanges separately. Topic transitions
cluster in later parts as customers report new issues in existing threads.
Classifying by opening topic misses these. Proved SC-1541: conversation
classified as "engagement tracking" from opening messages; later parts
contained support confirming the pin-add bug.
Cross-reference ambiguous conversations with PostHog. When a bug has a
trigger condition visible in PostHog person properties (billing version,
subscription state, feature flags), look up the Intercom contact's email
in PostHog to confirm they match the bug's profile. Resolves "could be
this bug" into confirmed or ruled out. Proved SC-1517: turned 1 confirmed
+ 2 ambiguous conversations into 4 confirmed matches across 8 months.
- Evidence block: Use python3 box/intercom-evidence.py to generate the
structured evidence block for the card. Records search date, index freshness,
terms used, and links to every signal conversation. See the Intercom evidence
schema in box/shortcut-ops.md for format details. Every card gets one of
three blocks: searched/signal, searched/no-signal, or not-searched with
rationale. Link all signal conversations classified as evidence, not a sample.
Codebase
Codebase exploration is delegated by default. Multi-file exploration
(architecture discovery, feature surface mapping, instrumentation tracing)
goes through compose-delegation.py codebase-explore:
python3 box/compose-delegation.py codebase-explore --var QUESTION="..." --var CODEBASE_PATH="/Users/paulyokota/Dev/aero". For the file-writing
variant, read the template directly (box/verified-explore-prompt.md,
File-writing variant section). Keep to 3 search axes max per delegation.
After collecting: read the specific files the subagent identified that bear
on card claims. Primary context is for targeted verification, not broad
exploration.
Aggregate budget: 3 direct investigation reads per card. Direct codebase
reads in primary context are capped at 3 per card investigation. A "direct
investigation read" is one tool call that reads codebase content for the purpose
of discovering facts: a Read or git show of a file/section, a Grep that returns
file content, or a Bash command that outputs code. File-path-only greps
(output_mode: files_with_matches) do not count.
Examples of what fits within 3: a single-file lookup (1 read), a stub-level
validation (grep for event name + read emit file = 2 reads), a short chained
lookup (2-3 reads). If the investigation would exceed 3, delegate.
Post-delegation verification reads are exempt. Reading files the subagent
identified to verify claims against primary sources is expected and does not
count against the budget.
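The counting rule above can be sketched as a tally over logged tool calls. `ToolCall` is a hypothetical record type introduced for illustration; the sketch assumes every logged Bash call outputs code, which a real log would need to distinguish.

```python
from dataclasses import dataclass

DIRECT_READ_BUDGET = 3  # per card investigation


@dataclass
class ToolCall:
    """Hypothetical record of one codebase tool call in primary context."""
    tool: str                      # "Read", "Grep", "Bash", or "git show"
    output_mode: str = "content"   # Grep only; "files_with_matches" returns paths only
    post_delegation: bool = False  # verification read of a file the subagent identified


def counts_against_budget(call):
    if call.post_delegation:
        return False  # exempt: verifying subagent claims against primary sources
    if call.tool == "Grep" and call.output_mode == "files_with_matches":
        return False  # file-path-only greps do not count
    return True


def direct_reads_used(calls):
    """Count the calls that consume the per-card direct read budget."""
    return sum(1 for call in calls if counts_against_budget(call))
```

When `direct_reads_used` would pass `DIRECT_READ_BUDGET`, the remaining exploration belongs in a delegation.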
Cross-cutting concerns use targeted grep delegation. Cross-cutting patterns
(auth, session management, storage) touch many subsystems and time out in explore
delegates (SC-399). Delegate via agenterminal.delegate with a targeted grep-first
prompt: instruct the delegate to grep for specific patterns and read only
matching files. Do not use primary context for cross-cutting investigation.
When delegation fails (timeout, empty result, bad output): diagnose why.
Common causes: too many search axes (split and re-dispatch), timeout too short
(increase timeout_ms), wrong search terms (refine and re-dispatch). Do not
absorb exploration into primary context. If you're tempted to "just read the
files yourself," that's the signal to fix the delegation, not skip it.
Session persistence for codebase delegation. Call agenterminal.delegate,
passing:
conversation_id: (the active session conversation)
session_file: ".agent/fill-cards/sc{NNN}.md"
session_label: "codebase-explore"
save_path: ".agent/fill-cards/sc{NNN}-codebase-result.json"
The delegate tool writes the task_id and label to the session file automatically.
The server writes the result JSON to save_path on completion, before the push.
Continue other investigation work. When a [Delegate completed] or
[Delegate terminal] message arrives for this delegation:
- Check the header for saved_to=; if present, the file is on disk.
- Verify: ls .agent/fill-cards/sc{NNN}-codebase-result.json
- If the file exists, append to the session file:
codebase-explore result saved ({timestamp})
Task ID: {task_id}
Path: .agent/fill-cards/sc{NNN}-codebase-result.json
Status: {completed|timed_out}
- If saved_to is absent from the header, append:
codebase-explore result save failed ({timestamp})
Task ID: {task_id}
Expected path: .agent/fill-cards/sc{NNN}-codebase-result.json
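The receipt steps above can be sketched as one function that picks the saved or failed entry. `delegate_receipt` is a hypothetical helper, and the header text is assumed to carry a literal `saved_to=` marker. One case is not specified above: `saved_to=` present but the file missing on disk. The sketch treats it as a failed save, which is an assumption.

```python
import os


def delegate_receipt(label, task_id, save_path, header, timestamp,
                     status="completed", exists=os.path.exists):
    """Build the session-file entry for one delegate completion message."""
    # "saved_to=" in the header signals that the server wrote the result file;
    # the existence check mirrors the `ls` verification step.
    if "saved_to=" in header and exists(save_path):
        return (
            f"{label} result saved ({timestamp})\n"
            f"Task ID: {task_id}\n"
            f"Path: {save_path}\n"
            f"Status: {status}\n"
        )
    # Assumption: a missing file is recorded as a failed save even if the
    # header claimed success.
    return (
        f"{label} result save failed ({timestamp})\n"
        f"Task ID: {task_id}\n"
        f"Expected path: {save_path}\n"
    )
```

The returned text is what gets appended to the session file, so the same function covers both the intercom-filter and codebase-explore delegations.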
PostHog
Query for relevant events. Save queries as insights at query time, not
after drafting. Every number on the card needs a linkable saved insight. Use
the PostHog event names from preflight output (step 2). If preflight reported
NO_MATCH for the product area, use the PostHog MCP event-definitions-list
tool to discover events. For richer schema notes beyond event names, grep
box/posthog-events.md for the product area section.
Write insight permalinks to the session file. After saving each insight,
append to the session file:
## PostHog
- {event name} ({date}): {permalink}
Required manual write. The permalink is the only PostHog artifact that survives
compaction.
Slack
Read thread content for cards that originated from Slack ideas. Use:
python3 box/slack-scanner.py --channel C0ADJ4ATJE4 --threads <permalink>
Reply text is in the thread_context array of the JSON output. For ideas with
cross-channel links, the scanner's Play 1 output already includes thread content
inline — check cross_channel_links[].thread_context before making a second call.
Check for file attachments. The scanner's files array lists images and
screenshots. Download and view any screenshots:
curl -s -H "Authorization: Bearer $TOKEN" <url_private> -o /tmp/file.png
For visual/UX issues, screenshots are primary sources that code alone cannot
substitute.
Jam recordings
If the card's Slack thread, Intercom evidence, or description contains a Jam URL
(jam.dev/c/...), pull structured debug data via Jam MCP. getDetails first
(overview + investigation guide), then getNetworkRequests (filter by
statusCode="4xx" or "5xx"), getConsoleLogs, and getUserEvents for the
reproduction timeline. This surfaces technical root causes that text descriptions
miss. Especially valuable for bug cards: the Jam contains the reproduction
evidence, not just a link to it.
4b. Investigation accounting
Before synthesizing, produce a structured accounting block:
This is not optional — it is the gate between investigation and synthesis.
The accounting verifies investigation completeness before synthesis starts.
Proved 2026-04-15: monitor flagged investigation-to-synthesis transition;
accounting produced after the flag but should have been spontaneous.
Write the accounting block to the session file before advancing to synthesis.
Do not proceed to Step 5 until this write is complete. Append verbatim:
## Accounting Block ({timestamp})
{full accounting block}
Then append:
## STATUS: accounting ({timestamp})
A fresh agent reading this file can determine exactly what was investigated
without replaying any steps. This is the highest-value recovery artifact.
5. Synthesize
- Synthesize into clear, concise product writing (bullet points, no jargon inflation)
- Don't add ideas Paul didn't express, don't drop ideas he did
- Removal/cleanup cards need a two-pass audit. Pass 1: "where is X mentioned?"
(content inventory). Pass 2: "what breaks if we remove X?" (risk scan, e.g.
URL::route() calls that would 500 on route removal). These hit different files.
The second pass can be a separate delegation.
- Black hat test on acceptance criteria. After writing completion criteria,
ask: "could someone pass these checks without doing the work?" Narrow
definitions and require human approval for judgment calls. Allowlists need
closed boundaries.
6. Story links
After filling a card, search Shortcut for non-archived cards that share
infrastructure, prerequisites, or overlapping scope. Propose links with the
right verb:
| Verb | Direction | When to use |
|---|---|---|
| blocks | Subject blocks Object | Object cannot ship without Subject being done first. Test: "Could Object ship if Subject didn't exist?" If no, it blocks. |
| relates to | Bidirectional | Cards share infrastructure, overlap in scope, or inform each other, but neither is a prerequisite. |
| duplicates | Subject duplicates Object | Used in Find Dupes play. Subject is the loser (gets archived). |
Default to relates to unless there's a genuine prerequisite dependency. Present
proposed links alongside the card draft for approval.
For Released cards, read the description to confirm whether the fix addressed
root cause or a surface symptom before including in the card landscape. Title +
Released state is insufficient — ongoing Intercom complaints post-release indicate
partial fixes. Proved SC-1798: SC-226 (Released) incorporated from title alone;
description would reveal whether SmartPin multi-profile fix was root cause or patch.
For story link creation, use shortcut-mutate.py story-link:
python3 box/shortcut-mutate.py story-link SUBJECT_ID OBJECT_ID "relates to"
The script handles mutation + read-back verification in one call (exit 0 =
verified). Route through execute_approved. Proved 2026-04-22: raw curl via
reference/tooling-logistics.md recipe works but requires a separate
verification call; the wrapper script eliminates that gap.
Verify between each story link mutation. The script's built-in verification
covers the link itself. Still run shortcut-cards.py --id SC-NNN between links
to confirm cumulative state before creating the next one. Batch momentum erodes
per-mutation verification on sequential similar mutations — proved twice
2026-04-01.
Batch mutation rule. When executing 3+ sequential Shortcut mutations
(card creates, updates, story links), verify each one independently before
proceeding to the next. "Verify" means a GET request confirming the change,
not reading the execute_approved response. Narrate the verification result
explicitly — not just "verified." This pattern co-activates batch momentum
(narration compresses) and proxy trust (API response treated as verification).
Proved 2026-04-08: verification dropped by card 3 of 7, caught by human at card 4.
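The batch mutation rule can be sketched as a loop that refuses to advance without an independent read-back. This is a sketch only: `run_verified_batch` is a hypothetical name, and `mutate` and `verify` are caller-supplied callables (for example a mutation routed through execute_approved and a separate GET), not existing functions.

```python
def run_verified_batch(mutations, mutate, verify):
    """Apply mutations one at a time; stop at the first that fails verification.

    `verify` must re-fetch state independently and return what it observed,
    or None when the change is not visible. It must not reuse the mutation
    response.
    """
    narration = []
    total = len(mutations)
    for index, mutation in enumerate(mutations, start=1):
        mutate(mutation)
        observed = verify(mutation)  # independent read-back, not the API response
        if observed is None:
            narration.append(f"{index}/{total} {mutation}: NOT confirmed")
            break  # do not carry batch momentum past an unverified change
        # Narrate what was actually seen, not just "verified".
        narration.append(f"{index}/{total} {mutation}: confirmed, saw {observed}")
    return narration
```

Returning the narration lines keeps the verification result explicit for every mutation, including the one that stopped the batch.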
7. Present for approval
Sequence: Write → Read → spot-check → quality gate → approve_content. After
writing the draft to a file, read it back before evaluating the quality gate.
The quality gate’s checkmarks are claims about the draft content — verify them
against what was actually written, not what you intended to write. Spot-check
gate: before declaring the quality gate passed, run at least 2 fresh Grep or
Read calls against the card’s highest-stakes factual claims (file paths, specific
code properties, render order). In-context memory of prior reads is not
verification — it’s reconstruction. Proved 2026-04-22 and 2026-04-23: quality
gate passed from memory both times; fresh spot-checks confirmed accuracy but
process was wrong both times.
Select spot-checks targeting the card's most novel assertion and the file
that is ground truth for it. Event emission claims: the emit site file.
Data model claims: the schema/type file. Don't spot-check adjacent
confirming evidence when the highest-risk claim has a specific
ground-truth file. Proved 2026-04-24: spot-checked data-layer code
(import-processor.ts, smart-pin-repository.ts) instead of the event emit
site (csv-import-page.tsx) for a novel "CsvImport lacks generationFrequency"
claim.
Present the completed checklist inline first (this is the due diligence evidence).
Then present the card description via approve_content with
content_type: "card-draft", filename: "scNNN". The user reviews and can edit
in the modal.
After approve_content returns, append to the session file:
## Draft approved ({timestamp})
Path: .agenterminal/approved/card-draft/sc{NNN}.md
## STATUS: approved ({timestamp})
The checklist is not presented without the draft, and the draft is not presented
without the checklist. Any checklist item that fails must be resolved before
presenting. If you can't resolve it, say so explicitly rather than presenting
with a known gap.
### Pre-approval checklist
Related cards (all story links): SC-15, SC-68, SC-44
PostHog insights saved: [SmartPin adds](link), [Add distribution](link)
Intercom evidence block: schema-compliant (searched/no-signal/not-searched with date, index freshness, terms, all signal conversation IDs)
Codebase: all file paths on card read directly, no subagent-only claims
Card metadata: product area set, story links created
Severity (bug cards): Sev N (Level) — state the discriminator answer (see `reference/severity-framework.md`). Must be assessed here, not deferred to ship time.
Quality gate:
- Problem before solution
- Scoping-ready
- Verifiable evidence
- Observable done state
In multi-card sessions: copy-paste the checklist structure from card 1 for
every subsequent card. Do not reconstruct it from memory — a template resists
compression; a remembered format invites shortcuts. The checklist erodes
predictably: manual verification drops first, then the checklist itself.
Checklist field definitions:
- Card metadata: product area is set on the card (ship script requires it).
Story links are proposed with verbs. Both verified before declaring ready to ship.
- Related cards: list every card in the story links. All must be checked.
- PostHog insights saved: every number cited on the card maps to a saved
insight with a permalink. No ad-hoc query results without a saved link.
- Intercom evidence block: the card uses the structured Intercom evidence
schema from box/shortcut-ops.md. One of three formats: searched/signal,
searched/no-signal, or not-searched with rationale. Must include: search date,
index freshness date, all search terms used, and links to all signal conversations
classified as evidence (not a sample). Generated via python3 box/intercom-evidence.py.
Every cited conversation was read directly by the primary instance.
IDs must be linked as full Intercom URLs
(https://app.intercom.com/a/apps/2t3d8az2/inbox/inbox/all/conversations/{ID}),
not plain text. "Verifiable" means the reader can click through, not construct
the URL themselves. Subagent-classified conversations not yet verified by the
primary instance are listed separately as unverified candidates and do not
appear in the card's Evidence section.
Verify counts before asserting. Run grep -c on the evidence file or
comment file to confirm conversation counts match what you claim in the
checklist and card body. Do not recall counts from memory — three count
discrepancies in one session (2026-04-22) all came from recalling rather
than counting.
- Codebase: every file path named on the card was read directly in this
session. No claims backed only by subagent output.
- Quality gate: the four criteria from the Card Quality Gate section.
8. Verification gate (after approve_content, before ship)
Dispatch two independent verification agents in parallel. State which
verifiers you're dispatching and why — if skipping one, state the rationale
explicitly. Silent omission is batch momentum (proved 2026-03-25: SC-348's
skipped Intercom audit found genuinely valuable evidence when the human
insisted on it).
a) Codebase verification (existing):
- Compose the prompt:
python3 box/compose-delegation.py codebase-verify --var card_draft_path="$SAVED_PATH". Parse the JSON output and pass
prompt, output_instructions, and model to agenterminal.delegate.
Also pass:
conversation_id: a unique ID for this delegate (e.g. verify-codebase-{NNN})
session_file: ".agent/fill-cards/sc{NNN}.md"
session_label: "codebase-verify"
save_path: ".agent/fill-cards/sc{NNN}-codebase-verify-result.json"
b) Intercom evidence audit (multi-pass):
- Compose the prompt:
python3 box/compose-delegation.py intercom-audit --var card_what_section="$(cat /tmp/what.txt)". Parse the JSON output and
pass prompt, output_instructions, and model to
agenterminal.delegate. Pass ONLY the card's What section — the auditor
must reason independently about search terms. Do not pass the Evidence
section. Also pass:
conversation_id: a unique ID for this delegate (e.g. verify-intercom-{NNN})
session_file: ".agent/fill-cards/sc{NNN}.md"
session_label: "intercom-audit"
save_path: ".agent/fill-cards/sc{NNN}-intercom-audit-result.json"
Dispatch both in the same message (they have no dependency). Each delegate
must have a unique conversation_id — the server rejects a second delegate
with an already-active conversation_id. Any unique string works (the server
creates the conversation on the fly).
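The session-related arguments for the two verifiers can be sketched as follows. `verifier_delegate_args` is a hypothetical helper; it only builds the values listed above and does not call agenterminal.delegate. The prompt, output_instructions, and model fields from compose-delegation.py would be merged in separately.

```python
def verifier_delegate_args(card_number):
    """Build conversation and persistence arguments for both Step 8 delegates."""
    session_file = f".agent/fill-cards/sc{card_number}.md"
    # session_label -> conversation_id prefix, following the examples above
    specs = {
        "codebase-verify": "verify-codebase",
        "intercom-audit": "verify-intercom",
    }
    return [
        {
            "conversation_id": f"{prefix}-{card_number}",  # must be unique per delegate
            "session_file": session_file,
            "session_label": label,
            "save_path": f".agent/fill-cards/sc{card_number}-{label}-result.json",
        }
        for label, prefix in specs.items()
    ]
```

Deriving each conversation_id from a distinct prefix keeps the two delegates from colliding on an already-active conversation.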
After dispatching both delegates, append to the session file:
## STATUS: verifying ({timestamp})
The delegate tool writes each task_id and label to the session file automatically.
The server writes each result JSON to save_path on completion, before the push.
- Narrate the dispatch. Tell the user what you dispatched and roughly how
long verification takes ("two verification agents running — codebase
fact-check and Intercom audit — results will arrive as push messages").
Silence >2 minutes removes the user's ability to steer. Proved 2026-04-03:
22-minute silence, user had to ask "where are we."
As each [Delegate completed] or [Delegate terminal] push arrives:
- Check the header for saved_to=. If present, the file is on disk.
- Verify: ls {save_path}
- If the file exists, append to the session file:
{session_label} result saved ({timestamp})
Task ID: {task_id}
Path: {save_path}
Status: {completed|timed_out}
- If saved_to is absent from the header, append:
{session_label} result save failed ({timestamp})
Task ID: {task_id}
Expected path: {save_path}
- Codebase: If all_verified: true, pick 1-2 claim numbers from the
delegate's report and run a fresh Read/Grep against the delegate's specific
source_file and exact_evidence. Pre-delegation reads that confirm the
same underlying facts do not count: the spot-check verifies the delegate's
work product, not the facts themselves. Proved 2026-04-23: SC-1760, pre-delegation
greps confirmed top-level claims but were substituted for post-collect
verification of the delegate's 17 specific claim-evidence pairs.
If all_verified: false, read the cited source_file yourself for each failed
claim before presenting.
The verifier's verified: false is a pointer to investigate, not a verdict
to forward. If you already read the file earlier in the session, re-check your
own reading against the claim — don't restructure your assessment around the
verifier's conclusion. Then present findings. User decides whether to fix or
ship as-is. Proved 2026-04-06: SC-1172, verifier flagged simplified param
shape as failure; agent forwarded verdict without checking own prior read;
human caught it.
If the codebase verifier times out: do not simply proceed on the basis of
files you read earlier. Run the inverse check: list every factual claim in
the Architecture Context, then confirm each was verified against a primary
source in this session. Any unverified claim is a reason to re-dispatch a
narrower verifier (one codebase, fewer claims) or read the file yourself.
For dense Architecture Context sections (20+ claims, multiple codebases),
prefer splitting into two focused verifier dispatches from the start.
Include negative assertions in the claim list. Claims like "no X exists
in the codebase" from the delegate's not_found section need at least one
targeted grep to convert trusted absence into verified absence. The inverse
check pattern naturally lists positive claims but lets negative delegate
claims pass unchecked. Proved 2026-04-20: "no bot actions" assertion from
delegate passed unverified through the inverse check.
Verifier timeouts proved twice 2026-04-07: SC-64 and SC-1288 both timed out at 10 min.
- Intercom audit: Two phases, in order. Do not assess before reading.
Phase 1 — Read all. Run the pre-composed read_commands from the
auditor's output sequentially. These are --read ID1,ID2,ID3 commands
grouping all IDs into batches of 3-5. Run every command; do not skip,
truncate, or substitute partial reads (head, limit, first-N-lines). Do not
assess relevance during this phase — read first, think second. The auditor's
snippets and relevance classifications are proxies — you cannot judge what a
conversation adds from a one-line snippet. Proved 2026-04-15: three effort
substitution activations on the same surface (partial reads, filtering to
"strongest 6," pre-judging from snippets). Human corrected all three.
Phase 1 gate — count and narrate before proceeding. After each batch
command, state: "Batch N of M: [conversation IDs read]. [Per-conversation
signal classification: strong/weak/tangential + what it adds]." Then state
the running count: "Read N of M audit batch commands." If N < M, run the
remaining before proceeding to Phase 2. The per-batch summary serves two
functions: it makes batch momentum visible (narration dropping from
substantive to silent is the tell), and it forces per-conversation judgment
rather than inheriting the auditor's relevance labels. Proved 2026-04-20
(session dd8e8dc1): assessed batch 1 substantively, went silent for batches
2-6 despite naming the completion bias risk in writing before starting. Also
proved 2026-04-15 (session 278a42e5): advisory "run every command" was in
the play text, read 6/27, concluded from proxies. Full reads added 10
meaningful conversations.
Phase 2 — Assess what the evidence says about the recommendation. Only
after reading all conversations, determine two things: (1) which add distinct
information to the evidence block, and (2) whether the evidence collectively
says anything about whether the card's recommendation is correct. Artifact
completeness (how many conversations to link) is not the only output of
verification — investigation depth (what do these conversations mean for the
card) matters more. Link every signal conversation classified as evidence per the schema
("link all signal conversations, not a sample"). Present findings to user. User
decides whether to add them to the card.
After assessment, regenerate the Intercom evidence block with all
signal conversation IDs from filter + audit. Noise exclusions from the
Sonnet-mode filter carry forward; audit additions are included. Exclude
only false positives (wrong feature, wrong product). The schema handles
>10 conversations via the comment format. Don't filter to 'strongest': the
selectivity instinct conflicts with schema completeness. Proved 2026-04-20:
presented 9 curated of 29 found; human redirected 'follow the schema fully.'
If total_found == 0 — note "Multi-pass audit: no additional signal" in
the checklist.
Proved 2026-04-06: assessed 9 conversations from auditor snippets,
recommended ship without reading any. Human caught it twice.
Proved 2026-04-14: read 5 of 22, declared rest redundant from snippets.
Human challenge unlocked 3 material findings (Safari confirmed broken by
eng, returning user case, international domains).
- Spot-check before recommending ship. Before saying "ship as-is," make
at least one Read or Grep call that could falsify the recommendation.
Delegate reports that read well are exactly when proxy trust activates.
Proved 2026-04-03: recommended ship based entirely on delegate output
without a single primary-source check.
Do not auto-fix failed codebase claims or auto-add Intercom conversations.
Present findings, let the user decide.
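The Phase 1 gate of the Intercom audit above (count and narrate before proceeding) can be sketched as a small tracker. ReadGate and its methods are hypothetical names. The sketch assumes read_commands is the list of pre-composed --read commands from the auditor's output and that the per-conversation summary is written by the agent after actually reading each batch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReadGate:
    """Tracks which audit read commands have been run and narrated."""

    read_commands: List[str]
    completed: List[str] = field(default_factory=list)

    def record(self, command: str, summary: str) -> str:
        """Record one batch and return the narration line for it."""
        if command not in self.read_commands:
            raise ValueError(f"not an audit read command: {command}")
        if not summary.strip():
            # Silent batches are the tell for batch momentum.
            raise ValueError("each batch needs a per-conversation summary")
        if command not in self.completed:
            self.completed.append(command)
        n, m = len(self.completed), len(self.read_commands)
        return f"Batch {n} of {m}: {summary} Read {n} of {m} audit batch commands."

    def remaining(self) -> List[str]:
        """Commands that still have to run before Phase 2."""
        return [c for c in self.read_commands if c not in self.completed]

    def may_start_phase_2(self) -> bool:
        return not self.remaining()
```

Refusing an empty summary makes the gate structural rather than advisory: a batch cannot be counted as read without a stated per-conversation judgment.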
9. Re-approval (if the card changed after initial approve_content)
If the card was modified after the initial approve_content — verification fixes,
evidence additions, schema restructuring, any edits — re-present via
approve_content before proceeding to ship. The initial approval covered the
pre-verification draft, not the post-fix version. The user needs to see and approve
what will actually be pushed.
Before editing, enumerate ALL changes. List every change from both (1)
verifier failures and (2) audit findings, numbered. Then edit. Check each off.
If the edit count < the list count, something was dropped. "Just fixing line
numbers" after an audit that found substantive new content is effort
substitution — the assessment scope must carry through to the execution scope.
Proved 2026-04-20: SC-1657, Anna's conversation assessed as "important context"
then silently excluded from the re-draft.
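The enumerate-then-check-off discipline can be sketched as a numbered checklist. This is a hypothetical helper, not part of the play's tooling. Inputs are assumed to be short text descriptions of each verifier failure and audit finding.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Change:
    number: int
    source: str       # "verifier" or "audit"
    description: str
    done: bool = False


def enumerate_changes(verifier_failures: List[str], audit_findings: List[str]) -> List[Change]:
    """Number every change from both sources before any edit is made."""
    items = [("verifier", d) for d in verifier_failures]
    items += [("audit", d) for d in audit_findings]
    return [Change(i, source, desc) for i, (source, desc) in enumerate(items, start=1)]


def dropped(changes: List[Change]) -> List[Change]:
    """Changes that were listed but never checked off."""
    return [c for c in changes if not c.done]
```

A non-empty dropped() result before submitting is the "edit count < list count" condition described above.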
Re-approval cycle: Read → Edit → Read → Submit. Read the saved file at the
saved_path from the previous approval. Edit the specific change. Re-read to
verify no other instances of the problem remain (e.g., grep for all local file
paths, not just the one flagged). Then submit. Do not reconstruct the card
content from context memory — context memory is a proxy for the file, and proxy
trust activates on your own prior output. Proved 2026-04-08: SC-1311.
Quantitative consistency gate. When any count, date, or quantitative claim
changes in the re-edit, grep the full document for all instances of the old
value before re-submitting. Downstream claims that derive from the changed
number (e.g., "12 of 21" when the total changed to 30) need recalculation,
not just find-and-replace. This is a mechanical pre-step, not a post-flag
recovery. Proved 2026-05-15: evidence block updated from 21 to 30
conversations; "12 of 21" in Architecture Context and "12 Intercom
conversations from ~11 distinct users" in What section left stale; required
a third approval cycle after monitor flag.
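The mechanical pre-step can be sketched as a scan for stale instances of the old value. find_stale_values is a hypothetical name. The sketch only flags candidate lines; recalculating derived claims is still a manual step.

```python
import re
from typing import List, Tuple


def find_stale_values(text: str, old_value: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs that still mention old_value.

    The value must stand alone: it is not matched inside a longer number
    or word (e.g. "21" does not match inside "1210" or "2021").
    """
    pattern = re.compile(rf"(?<![\w.]){re.escape(old_value)}(?!\w)")
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if pattern.search(line)
    ]
```

A search for the old total finds "12 of 21" but not a derived claim such as "12 Intercom conversations from ~11 distinct users", so each hit should prompt a read of the surrounding claims, not just a substitution.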
10. Completion (after verification passes or user accepts gaps)
All four steps, every time. This is mechanical, not conditional.
If the title needs to change, plan the title mutation alongside the ship
command before executing either. shortcut-ship.py handles description +
state + unassign but not title. A separate PUT is needed. Build the complete
mutation set first, then execute sequentially with verification between each.
Proved 2026-04-08: SC-1238 shipped with old title, required a follow-up
mutation to correct.
Run python3 box/ship-gate.py enter before the first production
execute_approved call (the hook will block production mutations until the
plan is declared). The dry-run in step 1 below is exempt.
- Pre-ship check (dry-run). Re-fetches the card's current state from
Shortcut and shows what the ship command will change. Present the output
to the user and wait for go-ahead before proceeding.
python3 box/shortcut-ship.py SC-NNN <saved_path> --severity N --dry-run
- Exit 0: clean. Present the output, wait for go-ahead.
- Exit 3: backward state transition — the card has moved past Backlog
since you last fetched it (e.g. someone moved it to Near Term, In Build,
or In Test while you were investigating). Present the warning. Suggest
--description-only to update content without changing state. Do not
ship without explicit user confirmation of the state change.
- Ship. Submit via agenterminal.execute_approved:
python3 box/shortcut-ship.py SC-NNN <saved_path>
The script strips YAML frontmatter automatically. The saved file at
.agenterminal/approved/card-draft/scNNN.md is compaction insurance and
audit trail. The script handles all four operations in a single call:
update description, move to Backlog, unassign owners, then re-fetch
and verify (state, owners, section headers, description length). Exit code
0 = verified, 1 = verification failed, 2 = mutation failed, 3 = backward
state transition blocked (use --force to override). Do not construct
payloads or curl commands manually.
- Evidence comment (when >10 conversations). If the Intercom evidence
block says "full list in comment," post the comment via
shortcut-comment.py SC-NNN "text". Route through execute_approved.
This is part of shipping, not a follow-up. Verify the comment appears on
the card. Proved 2026-04-28: declared "shipped" with evidence comment
still pending; human correction required.
- If any new PostHog event names were discovered during investigation, add them
to the appropriate section in box/posthog-events.md (Grep for the section
header). If the product area has no section, create one at the bottom and add
it to the index at the top of the file.
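For the exit codes listed under the pre-ship check and Ship steps above, a minimal lookup sketch follows. The tables restate the codes from the text; describe_exit and is_shipped are hypothetical names, and the fallback for an undocumented code is an assumption rather than documented behavior of shortcut-ship.py.

```python
# Exit codes documented for the --dry-run invocation.
DRY_RUN_CODES = {
    0: "clean: present the output and wait for go-ahead",
    3: "backward state transition: present the warning and suggest --description-only",
}

# Exit codes documented for the real ship invocation.
SHIP_CODES = {
    0: "verified",
    1: "verification failed",
    2: "mutation failed",
    3: "backward state transition blocked (--force overrides)",
}


def describe_exit(code: int, dry_run: bool = False) -> str:
    """Translate an exit code into the meaning given in the step text."""
    table = DRY_RUN_CODES if dry_run else SHIP_CODES
    # Assumption: anything undocumented is treated as a stop condition.
    return table.get(code, f"undocumented exit code {code}: stop and show the output to the user")


def is_shipped(code: int) -> bool:
    """Only a verified exit counts as shipped; the evidence comment is separate."""
    return code == 0
```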
Idempotency
- Re-fetch current description from Shortcut before each update (don't use stale cache)
- Only update sections Paul explicitly changed — preserve everything else
Parallel Fill Mode (--lead)
When $ARGUMENTS is --lead, you are the lead instance in a two-instance
parallel fill session.
When to use parallel fill
3-4 cards to fill, at least two product-area clusters, agenterminal available.
For 1-2 cards, single-instance /fill-cards SC-NNN is simpler. For 5+ cards,
run multiple rounds rather than overloading one session.
Card limit
Three straightforward cards per instance. The original 2-card limit was
calibrated for ~200K context where both instances compacted at card 3. With 1M
context, both instances completed 3 cards without compaction. The qualifier
matters: cards were pre-selected for low ambiguity (narrow scope, clear
investigation path). High-ambiguity cards (pricing strategy, feature sunset,
cross-cutting architecture) consume significantly more context per card. For
mixed batches, count high-ambiguity cards as 2 toward the limit.
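The weighting rule for mixed batches can be sketched as a simple load count. The names are hypothetical, and whether a card is high-ambiguity is assumed to be a judgment made by the lead before the split.

```python
from typing import List, Tuple

CARD_LIMIT = 3  # straightforward cards per instance, per the text above


def weighted_load(cards: List[Tuple[str, bool]]) -> int:
    """Sum card weights: high-ambiguity cards count as 2, others as 1."""
    return sum(2 if high_ambiguity else 1 for _, high_ambiguity in cards)


def within_limit(cards: List[Tuple[str, bool]]) -> bool:
    return weighted_load(cards) <= CARD_LIMIT
```

Under this rule, one high-ambiguity card plus one straightforward card fills an instance, and two high-ambiguity cards exceed the limit.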
Splitting
Cluster by product area, not random. Architecture knowledge compounds across
cards in the same area: the second card is faster because the mental model from
the first card carries over. If there's a natural "dense cluster + everything
else" shape, that's the split. Example: SmartPin cluster (SC-135, SC-51, SC-68,
SC-132) vs mixed bag (SC-90, SC-118, SC-131).
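The "dense cluster + everything else" split can be sketched as grouping by product area and peeling off the largest group. split_by_area is a hypothetical name, and the area labels for the mixed-bag cards in the usage example are placeholders, since the text does not name them.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_by_area(card_areas: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (dense_cluster, everything_else) given a card -> area mapping.

    Ties between equally large areas go to the area seen first.
    """
    clusters: Dict[str, List[str]] = defaultdict(list)
    for card, area in card_areas.items():
        clusters[area].append(card)
    dense_area = max(clusters, key=lambda area: len(clusters[area]))
    rest = [
        card
        for area, cards in clusters.items()
        if area != dense_area
        for card in cards
    ]
    return clusters[dense_area], rest


# Usage with the cards from the example above; non-SmartPin areas are placeholders.
dense, rest = split_by_area({
    "SC-135": "SmartPin", "SC-51": "SmartPin", "SC-68": "SmartPin", "SC-132": "SmartPin",
    "SC-90": "area-a", "SC-118": "area-b", "SC-131": "area-c",
})
```

The result is a proposal for Paul, not an assignment; the card limit still applies to each side.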
Lead coordination protocol
- Find candidates using Step 1 above.
- Propose the product-area cluster split to Paul.
- Create an agenterminal conversation thread and brief both assignments.
- Join the thread and post the briefing before starting pre-flight. Do not
begin pre-flight until the partner instance has acknowledged in the thread.
- Work your own assigned cards using Steps 2-10 above.
- Own session-end documentation for this parallel session.
Coordination rules
- The conversation thread is for the initial briefing and session-end sync.
Don't post running status updates to the thread. Every message costs context
in both instances for information the user already has from the session panes.
- When a conversation notification arrives, respond before continuing tool
work. A 10-second acknowledgment costs less than the trust damage from
silence. This applies even mid-investigation.
- Intercom search index refresh only needs to happen once. First instance to
start runs it; second instance skips.
- Paul sees both session panes directly and reviews/approves cards in whatever
order he prefers.
Compaction risk
Two instances means twice the compaction exposure. The compaction gate hooks are
the primary protection: if an instance compacts, its tools lock until explicitly
resumed. The 3-card limit accounts for this; don't exceed it even if context
feels comfortable mid-session.
Session-end
Each instance saves its own log observations and session notes. If an instance
compacts before session-end, its takeaways are lost. This is acceptable for 3
cards of observations. The other instance and the human together can reconstruct
what matters.
Parallel Fill Mode (--partner)
When $ARGUMENTS is --partner, you are the partner instance in a two-instance
parallel fill session.
Partner coordination protocol
- Join the agenterminal conversation thread.
- Read the lead's briefing to get your assigned cards.
- Acknowledge in the thread before starting pre-flight. The lead instance
waits for this before beginning its own work.
- Work your assigned cards using Steps 2-10 above.
Coordination rules
Same rules as the lead instance:
- Thread is for briefing and session-end sync only. No running status updates.
- Respond to conversation notifications before continuing tool work.
- Skip Intercom search index refresh if the lead already ran it.
Everything else is the standard per-card protocol. Pre-flight, investigation,
verification bar, quality gate, approval flow, completion steps: all identical
to single-card mode.