| name | kb-audit |
| description | Find and fix Intercom knowledge base gaps — either wrong article content (accuracy) or missing coverage for a product area (coverage). Verify against the codebase, ship corrected articles. |
| disable-model-invocation | true |
/kb-audit
Find and fix knowledge base gaps that cause Fin (Gabby) to give users wrong or
missing information. Two modes depending on the signal:
- Accuracy audit: Fin gave a wrong answer traceable to an article claim.
- Coverage audit: A product area has inadequate KB coverage — Fin has
nothing to draw from, so it fabricates or falls back to irrelevant articles.
Both modes share the same fix engine (steps 3-9).
Quick reference
- Tracker: box/research/kb-audit-tracker.json (triaged conversations, audited articles, last search date)
- Per-run artifacts: box/research/kb-audit-article-comparison-run{N}.md, box/research/kb-audit-corrected-article-run{N}.html
- Run log / development history: box/research/kb-accuracy-audit-brief.md
- Delivery: PUT /articles/{id} via Intercom API (production mutation, via execute_approved)
- Deliverable: one updated Intercom help center article per run
Error taxonomy
Four classes of article error, requiring different fixes:
- Commission: Article says something factually wrong (e.g., "go to Settings →
Pin Spacing Rules" when the path doesn't exist, or "Most Popular button in the
top right corner" when the button has been removed). Fix: correct or remove the
wrong claim.
- Omission: Article is correct for what it covers but missing context that
causes Fin to misapply it (e.g., image crop article applied to Reels because
it never says "images only"). Fix: add clarifying section, preserve existing
content.
- Omission enabling hallucination: Article's framing is vague enough that
Fin infers capabilities that don't exist (e.g., "community" framing led Fin
to fabricate messaging features). Fix: add explicit "what this does NOT
include" clarification.
- Missing coverage: No article covers the topic at all, forcing Fin to
fabricate or fall back to irrelevant articles. Fix: add a section to the most
relevant existing article, or flag for new article creation. This is the
typical finding in coverage audits.
Mode 1: Accuracy Audit
Signal: Fin gave a specific wrong answer. Goal: trace it to the article, fix
the article.
1a. Find Fin wrong answers
Load the tracker (box/research/kb-audit-tracker.json) to get the last search
date and skip already-triaged conversations.
Two discovery paths:
Path A — Daily digest bot observations. Check the #daily-digest channel
for recent digests. The :robot_face: Bot observations section contains
pre-classified candidates. Check those conversation IDs against the tracker's
triaged_conversations to skip already-processed items.
Important: digest bot observations are intermediaries written by a prior Claude
session. They are leads, not findings — read the actual conversation before
classifying bot behavior as incorrect or article-traceable.
After triaging a digest's bot observations, add ALL conversations to the
tracker (fixed, deferred, not_actionable) — not just the ones that produce
fixes. This prevents re-triaging the same digest items in future sessions.
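The skip-and-record rule above can be sketched in Python. This is a minimal illustration, assuming triaged_conversations is a JSON object keyed by conversation ID with the status string as its value; the real tracker schema may differ, and the function names are hypothetical.

```python
import json
from pathlib import Path

VALID_STATUSES = {"fixed", "deferred", "not_actionable"}


def load_tracker(path="box/research/kb-audit-tracker.json"):
    """Load the tracker file; the path comes from the quick reference."""
    return json.loads(Path(path).read_text())


def untriaged(candidate_ids, tracker):
    """Return candidate IDs that are not yet in the tracker, preserving order."""
    seen = tracker.get("triaged_conversations", {})  # assumed: dict of id -> status
    return [cid for cid in candidate_ids if cid not in seen]


def record_triage(tracker, outcomes):
    """Record every triaged conversation, not only the ones that led to a fix."""
    triaged = tracker.setdefault("triaged_conversations", {})
    for cid, status in outcomes.items():
        if status not in VALID_STATUSES:
            raise ValueError(f"unknown status for {cid}: {status}")
        triaged[cid] = status
    return tracker
```

Calling record_triage with the full outcome map, including not_actionable items, is what keeps the next session from re-reading the same digest entries.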
Path B — Structural database query. Query the conversation search index
for conversations matching structural signals of KB errors. This catches
candidates the daily digest missed or that predate digest coverage.
SELECT conversation_id, part_count, created_at::date
FROM conversation_search_index
WHERE created_at >= '{since_date}'
  AND part_count >= 10  -- floor: shorter conversations match noise
  AND (
    -- any one signal qualifies (OR, not AND)
    full_text LIKE '%Sources:%'
    OR full_text LIKE '%connect you with a human%'
    OR full_text ~* '(can.t find|doesn.t exist|not there|where is the|that.s not|I don.t see)'
  )
  AND conversation_id NOT IN ({triaged_ids})
ORDER BY part_count DESC;
Why these signals: Validated against 8 fixed conversations from runs 1-5.
No single signal catches all errors — they're complementary (OR, not AND).
The 10-part floor filters for conversations long enough that the error cycle
plays out (user asks → Fin answers wrong → user pushes back → human takes
over). Below 10, signals match incidental text in short exchanges and marketing
emails.
Signal behavior by conversation length:
- 20+ parts: all three signals reliable, high hit rate
- 15-19 parts: signals still work (42% hit rate in run 5 validation sample)
- 10-14 parts: escalation signal degrades (matches Fin auto-follow-up on
abandoned conversations, not real failure), other signals still valid but
lower hit rate
- Below 10: signals match noise
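For reading candidates offline, the three signals and the length tiers above could be approximated like this. It is a sketch only: the signal labels ("sources", "escalation", "pushback") are names chosen here, and dropping the escalation signal entirely in the 10-14 tier is one possible reading of "degrades", not a rule stated in the runs.

```python
import re

# Same pattern as the SQL regex; "." stands in for straight or curly apostrophes.
PUSHBACK = re.compile(
    r"(can.t find|doesn.t exist|not there|where is the|that.s not|I don.t see)",
    re.IGNORECASE,
)


def matched_signals(full_text):
    """Return the names of the structural signals present in a conversation."""
    signals = []
    if "Sources:" in full_text:
        signals.append("sources")
    if "connect you with a human" in full_text:
        signals.append("escalation")
    if PUSHBACK.search(full_text):
        signals.append("pushback")
    return signals


def usable_signals(full_text, part_count):
    """Drop signals that the length tiers describe as unreliable."""
    if part_count < 10:
        return []  # below the floor, matches are treated as noise
    signals = matched_signals(full_text)
    if part_count < 15:
        # Assumption: ignore escalation here, since it also matches
        # Fin's auto-follow-up on abandoned conversations.
        signals = [s for s in signals if s != "escalation"]
    return signals
```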
Processing Path B results: Delegate a subagent to read candidates and
classify as KB_ERROR / BOT_BEHAVIOR / NOT_FIN. Read each KB_ERROR candidate
yourself before accepting the classification — subagent output is a filter,
not a finding.
Triage criteria for each conversation (both paths):
- Is this Fin giving wrong info (vs. user confusion, vs. product bug)?
- Can the wrong claim be traced to a specific KB article?
- Is Fin fabricating because no article covers the topic? (missing coverage —
this is actionable, not a reason to close)
- Is the article still published and unchanged since the incident?
Present triaged candidates to user for selection before proceeding.
Then proceed to step 3 (shared fix engine).
Mode 2: Coverage Audit
Signal: a product area may have inadequate KB coverage. Goal: understand the
product, check what the KB covers, fix the gaps.
Triggers: CS pattern in Slack, feature release/migration, proactive review,
customer-comms tracker, or an accuracy audit that reveals missing coverage
rather than a wrong claim.
Hallucination risk on the proactive path. Coverage audits triggered by
releases or the comms tracker (rather than wrong Fin answers) lack a grounding
conversation. Without user messages to constrain the content, there's more
room for assumption-based claims: definitions, attributions ("Pinterest
requires X"), and behavioral descriptions that sound plausible but aren't
code-verified. Every factual claim in new article content needs verification
against a primary source (codebase or platform docs), not just plausibility.
(Run 14: fabricated Simplified Pin definition survived to draft review.)
1b. Understand the product area
Build context on how the feature actually works. The depth here determines the
quality of the article content downstream.
- From a CS pattern (Slack thread, Intercom cluster): Read the thread/
conversations to understand what users are experiencing. Then trace the
mechanism in the codebase — how the feature works, not just what the symptom
is. Search Intercom for the conversation cluster to gauge scope.
- From a feature release: Read the PR(s) or commit history to understand
what changed. Trace the user-facing behavior in the codebase.
- From a proactive review: Start from the product area's code. Understand
the key user-facing behaviors, settings, and edge cases.
- From customer-comms tracker: Pick a recently released or communicated
feature from reference/customer-comms.md. Search the article API for
existing coverage, then read the codebase to check completeness. Focus on
things users would encounter and ask about: limits, requirements,
interactions with other features, and what happens when something goes
wrong. This is not a comprehensive technical audit; the question is "would
Fin have an answer when a user asks about this?"
2b. Check existing KB coverage
Search the Intercom article API to find what coverage exists. Use the API,
not the public help center — the public site shows collections, not the full
article corpus Fin draws from.
INTERCOM_TOKEN=$(grep 'INTERCOM_ACCESS_TOKEN' .env | cut -d= -f2-)
curl -s "https://api.intercom.io/articles/search?phrase={url_encoded_phrase}" \
  -H "Authorization: Bearer $INTERCOM_TOKEN" \
  -H "Accept: application/json" \
  -H "Intercom-Version: 2.11"
Search multiple phrases — feature name, user symptom language, related
concepts. The search is fuzzy and returns ranked results; different phrasings
surface different articles.
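Searching several phrasings and merging the hits could look roughly like the sketch below. It only builds the requests, using the endpoint and headers from the curl command above. The helper names are hypothetical, and merge_hits assumes each hit is a dict with an "id" key; extracting hits from the response is left out because the response shape is not shown in this document.

```python
from urllib.parse import quote
from urllib.request import Request

API_BASE = "https://api.intercom.io"


def search_request(phrase, token):
    """Build (but do not send) one article-search request for a phrase."""
    url = f"{API_BASE}/articles/search?phrase={quote(phrase)}"
    return Request(url, headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/json",
        "Intercom-Version": "2.11",
    })


def merge_hits(result_lists):
    """Combine hits from several searches, keeping the first copy of each article."""
    merged = {}
    for hits in result_lists:
        for article in hits:
            merged.setdefault(article["id"], article)  # assumed: hits carry "id"
    return list(merged.values())
```

A run would typically pass the feature name, the user's symptom wording, and a related concept as three separate phrases, then review the merged list once.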
For each relevant article found, fetch the full body via GET /articles/{id}
and check whether it covers the behaviors identified in step 1b. Map the gaps:
what does the product do that no article explains?
If coverage is adequate: report findings and stop. Not every audit produces
a fix.
If gaps exist: identify the best article to add coverage to (usually the
most relevant existing article in the same product area), or flag that a new
article is needed. Optionally search Intercom conversations for user-impact
evidence to strengthen the case and inform back-testing (step 7).
Then proceed to step 3 (shared fix engine).
Shared Fix Engine (Steps 3-9)
Both modes converge here. You have: an article to fix, an understanding of
what's wrong (or missing), and codebase context.
3. Trace to source article
- Search Intercom article API: GET /articles/search?phrase=...
- Pull the full article: GET /articles/{id}
- For accuracy audits: map Fin's specific claims to specific article text
- For coverage audits: confirm the gap — the topic is genuinely absent
- Save original article to /tmp/ff-article-{id}-original.html
Don't make article-content claims before reading. During triage, describe
what the bot said and what the user experienced. Claims about what articles
do or don't contain belong here, after reading the article.
4. Verify against codebase
This is where the value is. Don't shortcut it.
For each claim in the article (existing or planned new content), trace to the
actual code. Use git -C /Users/paulyokota/Dev/aero show origin/main:path/to/file
(local checkout may be behind remote).
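When many files need checking, the git show call above can be wrapped in a small helper. This is a sketch with hypothetical function names; it reads whatever origin/main pointed to at the last fetch, so it assumes the remote has been fetched recently.

```python
import subprocess


def git_show_command(repo, path, ref="origin/main"):
    """Build the git command that prints a file as it exists on the given ref."""
    return ["git", "-C", repo, "show", f"{ref}:{path}"]


def read_from_remote(repo, path, ref="origin/main"):
    """Return the file contents at the ref; raises CalledProcessError if git fails."""
    result = subprocess.run(
        git_show_command(repo, path, ref),
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```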
4a. Name the exact surface AND the rendering template. The article
describes a user workflow. Identify which UI surface it's about, then find
the specific template/component that renders it. A component existing in a
different view doesn't confirm the article's claims. If the article has
screenshots, download and inspect them for specific UI text, button labels,
or layout patterns to search the codebase for.
Aero codebase domains:
- destination-posts: new publisher / "Upload or Create a Post" flow
- scheduler: legacy Pin Scheduler
- advanced-scheduler: advanced scheduling
- draft-gallery: draft management
- create: Tailwind Create (design tool, NOT the scheduler)
- smartpin / smartpin-v2: SmartPin features (check which is GA)
- turbo: Turbo features
Note: packages/extension/ is the Turbo browser extension, NOT the
scheduling extension. The scheduling extension loads its UI from the legacy
app's views (e.g., draft_pin.blade.php for the draft card, NOT
post_preview_editable.blade.php).
Browser screenshots are zero-trust for article content. If Chrome tools
were used to view the product, treat the screenshot as showing one account's
flag-gated state. Every UI element, layout, and navigation path visible in a
screenshot must be verified against code — especially feature flags on the
rendering component and its parent layout. (Run 17: screenshot showed sidebar
nav from product_focused_nav_enabled flag, nearly shipped nav-specific
instructions to an article read by users without that flag.)
4b. Feature flag gate. Grep for feature flags in the file AND parent
components. A behavior behind a flag may not be available to all users.
- featureFlag / feature_flag / isEnabled / useFeature
- -v2 directories or component names (may be beta-only)
- Seed data for flag defaults (default '1' = GA)
For each flag found, write a GA verdict before proceeding: flag name, code
location, rollout evidence (PostHog pageview split, seed data, or plan cascade
default). If any documented behavior is behind a non-GA flag: STOP and present
the verdict to the user before proceeding to step 5. The user decides whether
to scope content to flagged users, wait for GA, or adjust approach. Step 5
cannot start with unresolved flag questions.
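A first pass over a file's contents for the flag patterns listed above might look like this. It is a sketch with hypothetical helper names, and a hit only means a GA verdict still has to be written; it says nothing about rollout status.

```python
import re

# The helper names listed in step 4b.
FLAG_PATTERN = re.compile(r"featureFlag|feature_flag|isEnabled|useFeature")


def find_flag_references(source_text):
    """Return (line_number, line) pairs that mention a feature-flag helper."""
    hits = []
    for number, line in enumerate(source_text.splitlines(), start=1):
        if FLAG_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


def needs_ga_verdict(source_text, file_path):
    """True when the file mentions a flag helper or lives under a -v2 path."""
    return bool(find_flag_references(source_text)) or "-v2" in file_path
```

The same scan has to be repeated on parent components and layouts, since a flag on the parent gates everything rendered beneath it.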
The verifier (step 6) cannot catch this. A feature behind a flag IS factually
correct code — the verifier will return all_verified: true because the claims
are true for users who have the flag. GA status is orthogonal to claim accuracy.
(Run 1: SmartPin v2 beta-only. Run 18: SmartPin CSV import v2-only, ~43%
rollout. Both passed verification. Two sessions wasted on Run 18.)
4c. Check all surfaces. If the article describes a workflow that could
happen in multiple places, verify behavior in each. An article claim might be
true in one surface and false in another.
4d. Distinguish article error from Fin error. (Accuracy audits only.) Is
the article itself wrong, or is Fin hallucinating beyond what the article says?
If the article is correct for its scope but Fin misapplied it, the fix is
adding clarification (omission), not changing existing content.
When codebase verification is inconclusive: code may live in a different
repo or service, or be configured via ops/infra. Defer with explicit note in
tracker, don't guess.
Product knowledge. Check .claude/rules/product-knowledge.md for known
facts about this product area before investigating — it may already have the
answer. If codebase verification reveals undocumented product mechanics (score
formulas, feature interactions, UI label mappings), document in
.claude/rules/product-knowledge.md and reference/tailwind-product.md.
Show content to user before writing — .claude/ files are sensitive.
5. Write corrected article
- Identify error type (commission / omission / omission enabling hallucination /
missing coverage)
- For commission: change only the wrong sections, preserve everything else
- For omission: add clarifying section, don't change existing correct content
- For missing coverage: add section to most relevant existing article
- If linking to another article, verify the URL resolves before including
- Show the new section text before writing — don't compose in a single
Write call without previewing the content
- Save corrected HTML to
box/research/kb-audit-corrected-article-run{N}.html
- Optimize for LLM retrieval. Fin retrieves article sections independently —
reason about the likely retrieval path and ensure each section works as a
standalone answer. Explicitly state what the feature does (including "schedule,"
"publish," "create") rather than relying on context from other sections to imply
it. Ground claims to the product surface name ("in Pin Scheduler") so Fin doesn't
interpret them as statements about the external platform. The article still needs
to read well for humans browsing the help center, but explicit capability
statements serve both audiences. (Run 15: the article described the carousel
creation workflow without stating that carousels publish; Fin inferred that
publishing wasn't supported.)
6. Codebase verification delegate
Before shipping, delegate an independent codebase verification of every
factual claim in the new/changed sections. Use the card verification prompt
pattern (box/card-verification-prompt.md) adapted for article content:
- Scope: new/changed sections only (identify by HTML heading IDs)
- Claim types: behavioral claims, UI claims, feature claims, mechanism claims,
conditional claims, negative claims
- Model: claude-sonnet-4-6
- Gate: all_verified: true → proceed. all_verified: false → present
report to user, user decides.
After collecting the report, spot-check at least one or two claims from the verifier's results
against primary sources. Structured JSON with all_verified: true suppresses
the instinct to check.
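The gate and the spot-check can be made mechanical so that a green report does not end the review. The sketch below assumes the verifier returns JSON with an all_verified boolean (named above) and a "claims" list; the "claims" key and both function names are assumptions.

```python
import random


def gate(report):
    """Map the verifier report to the next action described above."""
    return "proceed" if report.get("all_verified") is True else "present_to_user"


def spot_check_sample(report, k=2, seed=None):
    """Pick up to k claims to re-check by hand against primary sources."""
    claims = report.get("claims", [])  # assumed key for the per-claim results
    rng = random.Random(seed)
    return rng.sample(claims, min(k, len(claims)))
```

Sampling runs regardless of the gate result, which is the point: the manual check happens even when all_verified is true.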
6b. External platform verification
Articles often make claims about external platforms (Pinterest, Instagram,
Facebook). These are not verifiable from the codebase: the code shows what
Tailwind does, not what the platform requires or supports.
For each claim in the new/changed sections, tag as codebase (verifiable in
aero) or external_platform (requires platform docs). For external claims,
check the platform's help center or developer docs via Claude in Chrome
(preferred for JS-rendered or auth-gated docs), tavily_extract, or
WebFetch. If a claim can't be verified externally, flag it explicitly rather
than shipping as fact. Don't attribute requirements to external platforms based
on Tailwind's implementation choices: "Tailwind forces X" ≠ "Pinterest
requires X." (Run 14: Tailwind's code forced Simplified Pin on product-tagged
Pins; this was nearly shipped as "This is a Pinterest requirement.")
7. Back-test against user input
Check whether user messages that describe the problem contain enough signal
to route Fin to the new content. Compare:
- User's original messages (exact words, error strings, feature names)
- The new section's heading and key phrases
For coverage audits: if Intercom conversations were read, use those messages.
If the trigger was a release (no conversations yet), imagine likely user
phrasings based on the product behavior and write headings accordingly.
If the user's language doesn't overlap with how the fix is written, adjust
the section heading or opening sentence to match how users describe the
problem. Codebase verification confirms the content is correct; this step
confirms it's findable.
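A crude way to quantify the comparison above is keyword overlap between the user's message and the new heading. This is only a lexical proxy under the assumption that shared vocabulary helps retrieval; it does not model how Fin actually ranks content, and the stopword list is illustrative.

```python
import re

STOPWORDS = {"a", "an", "the", "to", "my", "i", "is", "it", "in", "on", "of",
             "and", "how", "do", "can", "for", "why", "what"}


def keywords(text):
    """Lowercase word set with common filler words removed."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {w for w in words if w not in STOPWORDS}


def heading_overlap(user_message, heading):
    """Fraction of the heading's keywords that also appear in the user's message."""
    heading_words = keywords(heading)
    if not heading_words:
        return 0.0
    return len(heading_words & keywords(user_message)) / len(heading_words)
```

A low score against several real user messages is a prompt to reword the heading or opening sentence in the users' own terms.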
8. Pre-ship checklist
Before presenting the corrected article for approval:
- ☐ Feature flag check on all verified behaviors
- ☐ Multi-surface check — article's workflow verified in the specific
surface(s) it describes, not just "this code exists somewhere"
- ☐ Diff only the changed sections against original — verify unchanged
sections are byte-identical
- ☐ Cross-article links verified (HTTP 200)
- ☐ Error type identified and fix matches type
- ☐ Back-test passed — user's original message matches new section language
- ☐ Codebase verifier passed (or failures reviewed with user)
- ☐ Any unverified UI claims in the article noted explicitly — don't declare
them unverifiable without exhausting leads
- ☐ External platform claims verified against platform docs (or flagged as
unverifiable)
- ☐ Independent verifier ran as delegate (not inline reads) — "Was the
verifier a fresh agent with no access to my session's reasoning?"
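The byte-identical check on unchanged sections could be sketched as follows. It splits the HTML at heading tags with a regular expression rather than a real HTML parser, which is an assumption that holds only for simply structured article bodies; the function names are hypothetical.

```python
import re

# Zero-width split point in front of every <h1>..<h6> opening tag.
HEADING = re.compile(r"(?=<h[1-6][\s>])", re.IGNORECASE)


def split_sections(html):
    """Split article HTML into chunks that each start at a heading tag."""
    return [chunk for chunk in HEADING.split(html) if chunk]


def diff_sections(original_html, corrected_html):
    """List sections without a byte-identical counterpart on the other side."""
    before = split_sections(original_html)
    after = split_sections(corrected_html)
    return {
        "new_or_changed": [s for s in after if s not in before],
        "removed_or_changed": [s for s in before if s not in after],
    }
```

Anything that shows up in either list but was not an intended change means an unrelated section was touched.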
9. Ship
- Show exact proposed HTML changes to user (before/after for changed sections,
summary of added sections). Get explicit approval.
- Ship via PUT /articles/{id} through execute_approved
- Include Accept: application/json header (406 without it)
- Auth: Authorization: Bearer {INTERCOM_ACCESS_TOKEN}
- Version: Intercom-Version: 2.11
- Post-mutation verification: Re-fetch the article via GET /articles/{id}.
Check that key strings from each change are present in the live body. Do not
rely solely on the PUT response status.
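The key-string check could be sketched as below. Because Intercom rewrites HTML on PUT (see Known gotchas), both sides are normalized before comparing. The three rewrites mirror the ones listed in the gotchas and are an approximation, not a full model of Intercom's normalization; the function names are hypothetical.

```python
import re


def normalize(html):
    """Approximate the rewrites Intercom applies on PUT."""
    html = re.sub(r"<(/?)h2\b", r"<\1h1", html)  # <h2> becomes <h1>
    html = re.sub(r"<p>\s*</p>", "", html)        # empty paragraphs are dropped
    # Straight apostrophes become curly. This also touches single-quoted
    # attributes, which is harmless because both sides get the same treatment.
    return html.replace("'", "\u2019")


def missing_key_strings(live_body, key_strings):
    """Return key strings from the change that are absent from the live body."""
    live = normalize(live_body)
    return [s for s in key_strings if normalize(s) not in live]
```

An empty result for every changed section is the signal to move on to the tracker update; a non-empty one means the live article does not match what was approved.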
10. Update tracker
Update box/research/kb-audit-tracker.json:
- Add fixed article to audited_articles with article_updated_at from the
post-mutation re-fetch (not the PUT response)
- Add all triaged conversations from this run to triaged_conversations
(fixed, deferred, not_actionable) — all items, not just fixes
- fixed: Article was updated or created this run to address this conversation
- not_actionable: Read and triaged; not a KB content error (bot behavior,
product bug, hallucination not traceable to article)
- deferred: Identified as a candidate but not fully investigated this run —
pick up in a future session
- Update last_search_date
- For coverage audits: record discovery_source (e.g.,
slack_thread_CK4TLM9TR_p1777298260939679, release_pr_3426)
Write comparison doc to box/research/kb-audit-article-comparison-run{N}.md.
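The article-side tracker update from this step might look like the sketch below. It assumes audited_articles is a list of objects, that the re-fetched article carries an updated_at field, and that discovery_source is stored on the article entry; all three are assumptions about a schema this document does not spell out.

```python
def update_tracker_after_run(tracker, article_id, refetched_article,
                             search_date, discovery_source=None):
    """Record the shipped article and advance the search date."""
    entry = {
        "article_id": article_id,
        # Timestamp comes from the post-mutation GET, not the PUT response.
        "article_updated_at": refetched_article["updated_at"],
    }
    if discovery_source is not None:
        entry["discovery_source"] = discovery_source  # coverage audits only
    tracker.setdefault("audited_articles", []).append(entry)
    tracker["last_search_date"] = search_date
    return tracker
```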
11. Continue or end
If untriaged or deferred candidates remain from this session's discovery
(digest observations, structural query results, or coverage gaps identified
during investigation), present the remaining list. Do not suggest ending —
present the choice neutrally.
Known gotchas
- Intercom search freshness: intercom-search.py has a 36h freshness gate.
If stale, sync must run first.
- ai_agent_participated is on conversation metadata, not searchable via
full-text index — need API or DB queries for this filter.
- Article update is a production mutation — blocked by mutation gate, must
route through execute_approved.
- Intercom API version 2.11. Requires Accept: application/json on PUT.
- Use the article API, not WebFetch, to enumerate KB coverage. The public
help center shows collections; the API shows everything Fin can draw from.
Search multiple phrasings — different terms surface different articles. (Run 6)
- Fin draws from multiple articles for a single answer — tracing to one
article requires matching specific claims, not just topic.
- Feature flags: Code behind a flag may not be GA. SmartPin v2 code looked
like article errors but was beta-only. (Run 1)
- Multiple surfaces: A feature may exist in multiple UI surfaces with
different behavior. Crop existed in Create but not scheduler Drafts. (Run 2)
- Conversation evidence is a pointer, not a finding. "This doesn't exist"
tells you the user's experience, not what the system does.
- Intercom article url field may be null. Construct URL as
support.tailwindapp.com/en/articles/{id}-{slug} and verify with HTTP request.
- Digest bot observations are intermediaries. Written by a prior Claude
session. Read the actual conversation before classifying. (Run 5)
- Escalation signal degrades below 15 parts. "Connect you with a human"
matches Fin's auto-follow-up on abandoned conversations, not just real
failure-to-resolve. (Run 5 threshold validation)
- Hallucination often signals missing coverage. When Fin fabricates, check
whether any article covers the topic — absence of coverage is an actionable
KB gap, not a reason to mark not_actionable. (Run 5)
- Match article language to product's own wording. Check in-app
announcements, tooltips, and labels for how the product describes the feature.
Use that language in the article, not your own phrasing. (Run 6: "navigation
bar" matched the in-app announcement, not "Turbo navigation bar")
- When wrapping mutations in scripts, show the script contents before
execute_approved, not just the command path. execute_approved displays the
command but not the script body; if the user approved curl commands and you
switch to Python, re-show. Consent is to a specific artifact, not a category
of action. (Run 7)
- Intercom normalizes HTML on PUT: <h2> → <h1>, removes empty <p> tags,
converts straight apostrophes to curly quotes. Post-mutation verification
scripts must normalize before comparing, or they'll show false MISSING
results. (Run 7)
- Don't define external concepts inline without checking. When an article
references a platform concept (Simplified Pin, Rich Pin, carousel), check
the existing KB or platform docs for the definition. Inventing a plausible
definition is the highest-risk hallucination pattern on the proactive path.
(Run 14)
- Implied capabilities don't work for LLM retrieval. An article that describes
a creation workflow without explicitly stating the feature publishes will be
interpreted by Fin as "you can build it but not publish it." Each article section
must answer the question "can Tailwind do this?" directly, not by implication.
(Run 15)
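The null-url fallback and link check from the gotchas above can be sketched as follows. The https:// scheme and the idea that the caller supplies the slug are assumptions; this document gives only the host and path pattern, and the function names are hypothetical.

```python
from urllib.request import Request, urlopen

HELP_CENTER = "https://support.tailwindapp.com/en/articles"  # scheme assumed


def article_url(article_id, slug, url=None):
    """Use the API's url field when present, otherwise build the public URL."""
    if url:
        return url
    return f"{HELP_CENTER}/{article_id}-{slug}"


def resolves(url, timeout=10):
    """True when a HEAD request for the URL returns HTTP 200."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError and HTTPError are both OSError subclasses
        return False
```

The same resolves check covers the cross-article link item in the pre-ship checklist.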