| name | bug-discovery |
| description | Find untracked product bugs by searching Intercom for symptom keywords, cross-referencing PostHog failure events, and tracing codebase failure paths |
| disable-model-invocation | true |
Bug Discovery
Find untracked product bugs by searching Intercom for symptom keywords,
cross-referencing reporters against PostHog failure events, and tracing failure
paths in the codebase. Produces a lean bug card in Backlog.
This is the bug counterpart to Recurring Request (Play 4). Play 4 hunts for
untracked feature requests using topic nouns. This play hunts for untracked bugs
using symptom language.
Arguments
$ARGUMENTS is optional:
- (no args): Start from Phase 1 — broad symptom keyword search across the
Intercom index.
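- {keyword or symptom}: Skip broad keyword scoring and focus the search on
the specified symptom term(s).
- --telemetry: Start from PostHog failure dashboards instead of Intercom
symptom searches (see Telemetry-First Mode).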
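The argument handling can be sketched as a small dispatcher. This is a minimal illustration, not the skill's actual implementation: the function name `select_mode` and the mode labels "broad", "focused", and "telemetry" are hypothetical, and it assumes the whole argument string is treated as one symptom phrase.

```python
def select_mode(arguments: str) -> tuple[str, str]:
    """Map the raw $ARGUMENTS string to a (mode, symptom) pair.

    Mode labels are hypothetical names used only for this sketch.
    """
    text = arguments.strip()
    if not text:
        # No args: start from Phase 1, the broad symptom keyword search.
        return ("broad", "")
    if text == "--telemetry":
        # Telemetry-first mode: start from the PostHog failure dashboard.
        return ("telemetry", "")
    # Anything else is treated as the symptom term(s) to focus on.
    return ("focused", text)
```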
Also Read
Before starting, read these shared sections from box/shortcut-ops.md:
- Pre-flight Steps: index refresh, token checks, DB connectivity
- Card Quality Gate: the four criteria checked before presenting
- Verification Bar: factual accuracy checks
- Intercom Search Patterns: search syntax and patterns
- General Constraints: mutation cap, rate limiting
Constraints
- Mutation gate: A PreToolUse hook blocks all Slack and Shortcut mutations
through Bash. Route through agenterminal.execute_approved or present the
command for the user to run.
- Human-in-the-loop: Present one bug card at a time with full content
rendered. Wait for Paul's decision. Don't batch-execute without review.
- Context protection: Delegate codebase exploration and high-volume Intercom
filtering to subagents. Primary context is for targeted verification and
synthesis, not broad exploration.
- Before Shortcut API calls: check reference/tooling-logistics.md for
tested recipes. Don't re-derive payload shapes or endpoint paths.
- Don't stack unreliable methods. DB theme classifications + keyword matching
against Shortcut titles compounds error. Go to primary sources and use
reasoning.
Steps
1. Signal hunting
- Refresh the Intercom search index (pre-flight step 6).
- Search the local Intercom index for symptom keywords: "disappeared", "lost",
"broken", "stuck", "failed", "error", "crash", "missing", "gone", "won't load":
python3 box/intercom-search.py "keyword"
- Score each keyword by volume and signal-to-noise. Low volume + high signal
("disappeared": 7 hits, all real bugs) beats high volume + noise ("error":
98 hits, mostly billing/spam).
- Check recency: are there instances in the last 30 days? If the most recent
instance is 6+ months old, flag as historical and deprioritize. Bugs that
stopped being reported may already be fixed.
- Check Shortcut to confirm the bug isn't already tracked:
python3 box/shortcut-cards.py --query 'title:"symptom keyword"' --summary
A title match is a candidate, not confirmation. If the search returns
results, read the matched card (shortcut-cards.py --id) and verify the
card's scope actually covers the discovered bug. Title keyword overlap
between different-scope bugs ("stuck in queue" vs "stuck spinning") silently
kills valid discoveries. Proved 2026-04-29 in sweep-channels: symptom
similarity ("scrape failure") was assumed to match a specific card (Lambda
timeout) without verifying the mechanism.
Gate: Mark each cluster as tracked/untracked before proceeding to Step 2.
Do not read Intercom conversations for tracked clusters. Proved twice
(2026-03-30, 2026-04-07): reading conversations then discovering existing
Shortcut cards wastes 10-20 min per cluster.
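The keyword scoring and recency check in this step can be sketched as below. It assumes the hit counts, the number of hits judged to be real bugs, and the newest conversation date have already been gathered by hand. The `KeywordStats` class, the 180-day cutoff standing in for "6+ months", and the sort order are illustrative assumptions, not part of the existing tooling.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class KeywordStats:
    keyword: str
    hits: int          # total search hits for the keyword
    real_bugs: int     # hits judged to be genuine bug reports
    most_recent: date  # date of the newest matching conversation


def signal_ratio(stats: KeywordStats) -> float:
    """Fraction of hits that are real bug reports (0.0 when there are no hits)."""
    return stats.real_bugs / stats.hits if stats.hits else 0.0


def is_historical(stats: KeywordStats, today: date) -> bool:
    """True when the newest instance is roughly 6+ months old (180 days assumed)."""
    return today - stats.most_recent >= timedelta(days=180)


def rank_keywords(stats_list: list[KeywordStats], today: date) -> list[KeywordStats]:
    """Order keywords: recent before historical, then by signal ratio, highest first."""
    return sorted(
        stats_list,
        key=lambda s: (is_historical(s, today), -signal_ratio(s)),
    )
```

Sorting on signal ratio rather than raw volume reflects the rule above: a keyword with 7 hits that are all real bugs ranks ahead of one with 98 mostly noisy hits.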
2. Read and cluster
- Read 8-12 actual conversations behind the best keyword hits. Confirm they
describe the same symptom:
python3 box/intercom-search.py --read ID1,ID2,ID3
Search snippets are not evidence. Read the actual conversation text
before citing it.
- Cluster into candidate bugs. One keyword search may surface multiple distinct
issues.
- If any conversation references a Jam recording URL (jam.dev/c/...), pull
structured debug data via Jam MCP: getDetails first (overview +
investigation guide), then getNetworkRequests (filter by statusCode="4xx"
or "5xx"), getConsoleLogs for errors, getMetadata for browser/OS
context, and getUserEvents for the reproduction timeline. This can surface
the technical root cause directly from the user's session.
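Spotting Jam recordings in conversation text can be sketched with a regular expression. The helper name `find_jam_ids` is hypothetical, and the pattern assumes the recording ID after jam.dev/c/ consists of letters, digits, and hyphens, which should be checked against real links before relying on it.

```python
import re

# Assumption: recording IDs contain only letters, digits, and hyphens.
# Adjust the character class if real Jam IDs differ.
JAM_ID = re.compile(r"jam\.dev/c/([A-Za-z0-9-]+)")


def find_jam_ids(conversation_text: str) -> list[str]:
    """Return distinct Jam recording IDs in order of first appearance."""
    seen: dict[str, None] = {}
    for match in JAM_ID.finditer(conversation_text):
        seen.setdefault(match.group(1), None)
    return list(seen)
```

Each ID found this way is a candidate for the getDetails call that starts the Jam MCP sequence described above.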
3. Cross-reference to PostHog
- Pull contact emails from the Intercom conversations.
- Query PostHog for failure/error events filtered by those emails. The
Intercom-to-PostHog email cross-reference is the strongest evidence
technique: it turns "users say X happens" into "users who say X happens
have Y failure events."
- Check volume: query the failure event unfiltered for total count, unique
users, and weekly trend over 90 days. Spiky patterns suggest infrastructure
issues. Steady patterns suggest code bugs.
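The email cross-reference and the spiky-versus-steady reading can be sketched as follows. Both function names are hypothetical. The inputs are assumed to be already exported from Intercom and PostHog, and the coefficient-of-variation cutoff of 1.0 is an arbitrary illustration rather than a tested threshold.

```python
from statistics import mean, pstdev


def cross_reference(reporter_emails: set[str], failure_emails: set[str]) -> dict:
    """Compare Intercom reporters against users who have the failure event."""
    reporters = {e.strip().lower() for e in reporter_emails}
    failures = {e.strip().lower() for e in failure_emails}
    matched = reporters & failures
    return {
        "matched": sorted(matched),
        "unmatched": sorted(reporters - matched),
        "match_rate": len(matched) / len(reporters) if reporters else 0.0,
    }


def trend_shape(weekly_counts: list[int], spiky_cv: float = 1.0) -> str:
    """Label a weekly series 'spiky' or 'steady' by coefficient of variation.

    The 1.0 cutoff is an assumption for illustration only.
    """
    if not weekly_counts or sum(weekly_counts) == 0:
        return "steady"
    avg = mean(weekly_counts)
    return "spiky" if pstdev(weekly_counts) / avg > spiky_cv else "steady"
```

A high match rate supports the claim that reporters actually hit the failure event; unmatched reporters are worth a second look because they may be describing a different bug.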
4. Codebase trace
- Grep the aero codebase for the failure reason or error string.
- Check whether the failure has user-facing UI copy (error messages, status
indicators). Unmapped failure reasons that fall through to generic fallbacks
are higher severity.
- Trace the failure path: where does the error originate, how does it
propagate, what does the user see?
Delegate exploration using agenterminal.delegate with the verified explore
prompt (box/verified-explore-prompt.md) for broad traces. For single-file
lookups ("does route X exist", "what does function Y do"), grep or direct read
is faster.
5. Card
- Re-read before citing. If the card will reference specific line numbers
or code structure, re-read those lines now (targeted offset/limit). In-context
memory of code is not sufficient for line-number claims on a durable surface.
Proved 2026-04-27: the monitor had to prompt a re-read that should have been
triggered by the card-drafting transition itself.
- Use the lean bug template: What, Evidence, Architecture Context. Skip
Monetization, UI Representation, Reporting, Release Strategy unless they
have real content.
- Evidence should include:
- Intercom conversation links with verbatim quotes
- PostHog event counts with saved insight links
- The email cross-reference results
- If Jam recordings provided debug data, include the key findings (specific
errors, failed requests)
- Architecture Context for bugs can be more prescriptive than for features:
root cause location, failure mechanism, specific code paths. The fix path
is typically more deterministic.
- Present the completed checklist inline, then present the card description
via approve_content with content_type: "card-draft", filename: "scNNN".
6. Verification gate (after approve_content, before ship)
Dispatch two independent verification agents in parallel. State which
verifiers you're dispatching and why — if skipping one, state the rationale
explicitly. Silent omission is batch momentum.
a) Codebase verification:
agenterminal.delegate with the prompt from box/card-verification-prompt.md,
filling {card_draft_path} with the saved file path.
b) Intercom evidence audit (multi-pass):
agenterminal.delegate with the prompt from box/intercom-evidence-audit-prompt.md,
filling {card_what_section} with ONLY the card's What section. Do not pass
the Evidence section — the auditor must reason independently about search terms.
Dispatch both in the same message (they have no dependency on each other), then
collect both results with agenterminal.collect (checkpoints will fire).
- Codebase: If all_verified: true, pick 1-2 claim numbers from the
delegate's report and run a fresh Read/Grep against the delegate's specific
source_file and exact_evidence. Pre-delegation reads that confirm the
same underlying facts do not count: the spot-check verifies the delegate's
work product, not the facts themselves. Proved 2026-04-23: SC-1760.
If all_verified: false, present the verification report. User decides
whether to fix or ship as-is.
- Intercom audit: If total_relevant > 0, present the found conversations
alongside existing evidence. Read any promising ones via --read. User
decides whether to add them to the card. If total_relevant == 0, note
"Multi-pass audit: no additional signal" in the checklist.
Do not auto-fix failed codebase claims or auto-add Intercom conversations.
Present findings, let the user decide.
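The branching after both verifiers return can be sketched as a pure function. The field names all_verified and total_relevant come from the steps above; the assumption that they sit at the top level of each report dict, and the function name `next_actions`, are illustrative only.

```python
def next_actions(codebase_report: dict, intercom_report: dict) -> list[str]:
    """List the follow-up for each verifier. Nothing is fixed or added automatically."""
    actions = []
    if codebase_report.get("all_verified") is True:
        actions.append("spot-check 1-2 delegate claims with a fresh Read/Grep")
    else:
        actions.append("present the verification report; user decides fix or ship as-is")
    if intercom_report.get("total_relevant", 0) > 0:
        actions.append("present found conversations; user decides whether to add them")
    else:
        actions.append('note "Multi-pass audit: no additional signal" in the checklist')
    return actions
```

Every branch ends in something presented to the user, which mirrors the rule that failed claims are never auto-fixed and conversations are never auto-added.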
7. Re-approval (if the card changed after initial approve_content)
If the card was modified after the initial approve_content — verification fixes,
evidence additions, any edits — re-present via approve_content before proceeding
to ship. The initial approval covered the pre-verification draft, not the post-fix
version. The user needs to see and approve what will actually be pushed.
Re-approval cycle: Read → Edit → Read → Submit. Read the saved file at the
saved_path from the previous approval. Edit the specific change. Re-read to
verify no other instances of the problem remain (e.g., grep for all local file
paths, not just the one flagged). Then submit. Do not reconstruct the card
content from context memory — context memory is a proxy for the file, and proxy
trust activates on your own prior output. Proved 2026-04-08: SC-1311.
8. Ship (after verification passes or user accepts gaps)
- Submit via agenterminal.execute_approved:
python3 box/shortcut-ship.py SC-NNN <saved_path> --severity N
The script executes the PUT, re-fetches the card, and verifies inline
(state, owners, section headers, description length). Exit 0 = verified,
1 = verification failed, 2 = mutation failed. Set the Severity custom
field and record DIC severity via
python3 box/framework-rank.py score SC-NNN --severity N.
- If any new PostHog event names were discovered during investigation, add
them to the appropriate section in box/posthog-events.md.
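The exit codes documented for shortcut-ship.py can be turned into readable outcomes with a small lookup. This sketch only interprets a return code that the approved execution has already produced; it does not run the script, and the helper name `describe_ship_exit` is hypothetical.

```python
# Exit-code meanings as documented for shortcut-ship.py in this step.
SHIP_EXIT_MEANINGS = {
    0: "verified",
    1: "verification failed",
    2: "mutation failed",
}


def describe_ship_exit(returncode: int) -> str:
    """Translate a shortcut-ship.py exit code into its documented meaning."""
    return SHIP_EXIT_MEANINGS.get(returncode, f"unexpected exit code {returncode}")
```

Treating any other code as unexpected, rather than as success or failure, keeps an undocumented exit path from being silently misread.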
Telemetry-First Mode (--telemetry)
When $ARGUMENTS is --telemetry, start from PostHog failure dashboards
instead of Intercom symptom searches.
Play 5 starts from Intercom (what users say). This mode starts from PostHog
(what the product does). They share Steps 4-8 but differ in discovery.
Play 5 catches bugs users report in words we guessed. This mode catches bugs
the telemetry records whether or not anyone reports them.
Also Read (additional)
- Verification Bar: factual accuracy checks
Dashboard
"Bug Discovery: Failure Event Monitor" (ID 1465582)
https://us.posthog.com/project/161414/dashboard/1465582
Each insight tracks one failure event as daily unique users over 90 days.
Insight descriptions carry machine-readable thresholds in this format:
MEDIAN:N/d P90:N/d SPIKE:N/d TREND:up|down|stable|METHOD:...|UPDATED:...|WINDOW:90d
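Reading the threshold string back into structured values can be sketched as below. The parser assumes fields are separated by whitespace or pipes and that no value contains either character. The function names and the sample description in the usage note are hypothetical.

```python
import re


def parse_thresholds(description: str) -> dict[str, str]:
    """Split an insight description into {FIELD: value} pairs.

    Assumes whitespace or '|' between fields and an upper-case field name
    before the first ':' of each token.
    """
    fields: dict[str, str] = {}
    for token in re.split(r"[\s|]+", description.strip()):
        key, sep, value = token.partition(":")
        if sep and key.isupper():
            fields[key] = value
    return fields


def per_day(value: str) -> float:
    """Convert a rate such as '4/d' to a float."""
    return float(value.removesuffix("/d"))
```

For a made-up description such as `MEDIAN:3/d P90:7/d SPIKE:11/d TREND:stable|WINDOW:90d`, `per_day(parse_thresholds(text)["SPIKE"])` yields the numeric spike threshold to compare daily values against.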
T1. Dashboard scan
- Pull the dashboard insights via mcp__posthog__dashboard-get (ID 1465582).
- For each insight, query current data via mcp__posthog__insight-query.
- Parse the description thresholds. Flag any insight where:
- Spike: any day in the last 7 exceeds the SPIKE threshold
- Trend: 14-day trailing mean exceeds prior-14-day mean by >25%
- Composition shift: for events with sub-types (e.g. $exception by
$exception_types), break down by type even if the top-line is stable.
A declining category can mask a rising one. Proved 2026-04-14: $exception
at stable ~1/d was actually ChunkLoadError disappearing while DOMException
spiked 3x.
- New event: pull mcp__posthog__event-definitions-list with q=fail
and q=error, compare against dashboard insights. Flag failure-shaped
events not yet on the dashboard.
- Present flagged events to the user with: event name, current rate vs
threshold, trend direction, and whether it's a spike or a slow climb.
Wait for user to select which to investigate. Clustering is analysis;
choosing which failure to pursue is product judgment.
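The spike and composition-shift checks from this scan can be sketched as below, assuming the daily values and per-type rates have already been pulled from PostHog. The function names are hypothetical, and the 2x ratio used to call a sub-type rising or falling is an illustrative assumption, not a documented threshold.

```python
def spike_days(daily_users: list[float], spike_threshold: float, window: int = 7) -> list[float]:
    """Return the values from the last `window` days that exceed the SPIKE threshold."""
    return [v for v in daily_users[-window:] if v > spike_threshold]


def composition_shifts(
    prior: dict[str, float], recent: dict[str, float], ratio: float = 2.0
) -> dict[str, str]:
    """Flag sub-types whose rate moved even when the top-line total is stable.

    `prior` and `recent` map a sub-type name to its mean daily rate in each
    window. The 2x ratio is an assumed cutoff for this sketch.
    """
    shifts: dict[str, str] = {}
    for kind in sorted(set(prior) | set(recent)):
        before = prior.get(kind, 0.0)
        after = recent.get(kind, 0.0)
        if before == 0 and after == 0:
            continue
        if before == 0:
            shifts[kind] = "appeared"
        elif after == 0:
            shifts[kind] = "disappeared"
        elif after / before >= ratio:
            shifts[kind] = "rising"
        elif before / after >= ratio:
            shifts[kind] = "falling"
    return shifts
```

Comparing sub-types separately is what catches a case like the one noted above, where one exception type vanishes while another climbs and the total barely moves.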
T2. Threshold maintenance
After scanning (whether or not any anomaly was found), recalculate thresholds
for all insights:
- MEDIAN (the baseline): median of 90-day daily values (exclude partial today)
- P90: 90th percentile of the same daily values
- SPIKE: P90 x 1.5 (round up)
- TREND: compare mean of the last 14 complete days vs the prior 14 days; a
change of more than 25% in either direction = up/down, otherwise stable
- Write updated descriptions via mcp__posthog__insight-update
- Add any newly discovered failure events as new insights on the dashboard
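The recalculation can be sketched with the standard library. The nearest-rank percentile method is an assumption, since the document does not say which method the METHOD field records, and the function names are hypothetical. The input is assumed to contain complete days only, oldest first.

```python
import math
from statistics import median


def nearest_rank_percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile (assumed method; others give slightly different P90s)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def trend_label(daily: list[float], pct: float = 0.25, span: int = 14) -> str:
    """Compare the mean of the last `span` days with the `span` days before them."""
    if len(daily) < 2 * span:
        raise ValueError("need at least two full spans of complete days")
    recent_mean = sum(daily[-span:]) / span
    prior_mean = sum(daily[-2 * span:-span]) / span
    if prior_mean == 0:
        return "up" if recent_mean > 0 else "stable"
    change = (recent_mean - prior_mean) / prior_mean
    if change > pct:
        return "up"
    if change < -pct:
        return "down"
    return "stable"


def recalc_thresholds(daily: list[float]) -> dict:
    """Recompute the threshold fields from complete daily values (partial today excluded)."""
    p90 = nearest_rank_percentile(daily, 90)
    return {
        "MEDIAN": median(daily),
        "P90": p90,
        "SPIKE": math.ceil(p90 * 1.5),  # round up, per the rule above
        "TREND": trend_label(daily),
    }
```

Rounding SPIKE up rather than to nearest keeps low-volume events from flagging on a single extra user.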
T3. Codebase trace
Same as Step 4 above:
- Grep the aero codebase for the failure event name, error string, or
failure reason.
- Check whether the failure has user-facing UI copy (error messages, status
indicators). Unmapped failure reasons that fall through to generic fallbacks
are higher severity.
- Trace the failure path: where does the error originate, how does it
propagate, what does the user see?
Delegate exploration using agenterminal.delegate with the verified explore
prompt (box/verified-explore-prompt.md) for broad traces.
T4. Cross-reference to Intercom
Reverse of the standard flow — check Intercom after the telemetry signal:
- Search the local Intercom index for symptom language matching the failure.
Use mechanism knowledge from the codebase trace to sharpen search terms
(the code tells you what words users would use).
- If conversations exist, read them. This adds qualitative context to the
quantitative signal: how users describe it, how severe it feels to them,
whether CS has a workaround.
- If no conversations exist, that's itself a finding — users are hitting
the failure but not reporting it. Note this on the card.
T5. Card
Same as Step 5 above:
- Lean bug template: What, Evidence, Architecture Context.
- Evidence should include: PostHog insight links with daily unique user
counts, trend description, the threshold that triggered investigation.
Intercom conversation links if any exist (with note if none found).
Codebase file paths for the failure mechanism.
- Assess severity per reference/severity-framework.md.
- Present via approve_content with content_type: "card-draft".
- After approval, create/ship the card. For new cards use sc-create-story.py;
for existing cards use shortcut-ship.py. Both via execute_approved.
T6. Closing checklist
Verify each mutation independently before moving to the next. Completion bias
activates here ("card is done, rest is cleanup"):
Each step: check --help before first invocation of any tool not yet used.