Best practices for building an unauthenticated public Q&A chatbot widget. Covers rate limiting, security hardening, cost optimization, semantic caching, observability, UX patterns, chat scroll behavior, and architecture. Tech-agnostic with concrete examples from a production implementation.
license: MIT
metadata: {"author":"aidotengineer","version":"1.1","category":"chatbot","compatibility":"Any web framework with a server-side API route","tags":"rate-limiting, security, caching, observability, LLM, chat-ui, virtualization"}
Public Q&A Chatbot - Best Practices
A comprehensive skill for building unauthenticated, public-facing Q&A chatbot widgets on marketing sites, conference pages, documentation portals, and similar contexts where you need to serve anonymous visitors while controlling cost and abuse.
For a runnable React/TanStack Virtual demo of long chat scroll behavior plus an expanded bottom command shelf, use assets/vite-react-tanstack-chat-demo. The demo includes hover/double-click message controls, subtle token/latency stats, tool-call and multimodal examples, assistant response variants via left/right swipe, and a Realtime voice capture strip with live transcription and an audiogram.
Run the demo locally:
cd assets/vite-react-tanstack-chat-demo
npm install
npm run dev -- --port 5179
To exercise the Realtime voice path, start the dev server with OPENAI_API_KEY set. Keep the standard API key on the server side only; the browser should receive an ephemeral Realtime client secret.
When to use this skill
Embedding a chatbot widget on a public website (no user login required)
Answering questions from a known FAQ / knowledge base
Serving anonymous visitors with LLM-powered responses
Needing to protect against abuse, cost overruns, and API quota exhaustion
Building a constrained Q&A bot (not a general-purpose assistant)
Reviewing a public chatbot's widget UX, streaming behavior, scroll anchoring, or history loading
Tech stack choices
This skill is written to be tech-agnostic. The reference implementation uses the stack below, but each component is swappable:
| Component | Reference choice | Alternatives |
| --- | --- | --- |
| LLM provider | Gemini 3.1 Flash-Lite (via @ai-sdk/google) | OpenAI GPT-4o-mini, Anthropic Claude Haiku, Mistral, Llama via Groq/Together |
| Chat list virtualization | TanStack Virtual | Native scroll for short widgets, react-virtuoso, custom virtual list only when already proven |
Do not require virtualization for every public FAQ widget. A short, bounded chat can stay as a simple DOM list. Reach for a virtualized chat list when conversations can grow long, rows have dynamic heights, older history prepends, or streaming output makes scroll anchoring fragile. When using React and a virtualized list is justified, prefer TanStack Virtual's chat support over custom scroll math.
Do not require agentic document browsing for every public FAQ widget either. Plain RAG is sufficient for short, stable FAQs. Add a virtual docs filesystem when answers live across multiple pages, users ask for exact syntax, docs have a meaningful hierarchy, or top-k retrieval often misses the section an expert would grep for.
1. Rate Limiting
Multi-layer rate limits
Apply limits at multiple granularities to prevent abuse:
Per-turn: Cap messages per conversation (e.g. 9 turns/session)
Per-visitor per day: Cap sessions per IP per day (e.g. 15/day)
Global per day: Cap total sessions across all visitors (e.g. 3000/day)
Only increment the session counter after the server confirms a successful response, not when the user submits. This prevents phantom session counts from failed requests, network errors, or aborted streams:
// Client-side: count after first assistant response arrives
useEffect(() => {
  const hasAssistantMessage = messages.some(m => m.role === "assistant");
  if (hasAssistantMessage && !sessionCounted.current) {
    sessionCounted.current = true;
    incrementSessionCount();
  }
}, [messages]);
Non-new session handling
When a request is not a new session (i.e. a follow-up turn in an existing conversation), skip daily session counter increments entirely. Only the first turn of a conversation should count as a "session" for rate limiting purposes:
if (!isNewSession) {
return { allowed: true }; // Skip session counting for follow-up turns
}
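The layered limits and the follow-up-turn rule above can be combined into a single check. What follows is a minimal in-memory sketch, not the reference implementation: `LIMITS`, `checkRateLimit`, and `recordSession` are hypothetical names, the numbers are the example values from this section, and a production deployment would keep the counters in a shared store rather than process memory.

```typescript
// Hypothetical limits, taken from the example numbers in this section.
const LIMITS = { turnsPerSession: 9, sessionsPerIpPerDay: 15, sessionsGlobalPerDay: 3000 };

type LimitResult = { allowed: true } | { allowed: false; reason: string };

// In-memory counters, reset when the UTC day changes (illustrative only).
const sessionsPerIp = new Map<string, number>();
let globalSessions = 0;
let currentDay = "";

function rollDay(now: Date): void {
  const day = now.toISOString().slice(0, 10);
  if (day !== currentDay) {
    currentDay = day;
    sessionsPerIp.clear();
    globalSessions = 0;
  }
}

function checkRateLimit(ip: string, turnCount: number, isNewSession: boolean, now: Date = new Date()): LimitResult {
  rollDay(now);
  if (turnCount > LIMITS.turnsPerSession) {
    return { allowed: false, reason: "Turn limit reached for this conversation." };
  }
  // Follow-up turns never touch the daily session counters.
  if (!isNewSession) return { allowed: true };
  if ((sessionsPerIp.get(ip) ?? 0) >= LIMITS.sessionsPerIpPerDay) {
    return { allowed: false, reason: "Daily limit reached for this visitor." };
  }
  if (globalSessions >= LIMITS.sessionsGlobalPerDay) {
    return { allowed: false, reason: "Daily global limit reached." };
  }
  return { allowed: true };
}

// Call only after a successful response, so failed requests are not counted.
function recordSession(ip: string, now: Date = new Date()): void {
  rollDay(now);
  sessionsPerIp.set(ip, (sessionsPerIp.get(ip) ?? 0) + 1);
  globalSessions += 1;
}
```

Keeping the check and the increment as separate functions is what lets the counter move only after a response has actually been served.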
BYOK (Bring Your Own Key) fallback
When rate-limited, let users input their own API key to continue chatting. This turns abuse into the user's own cost while preserving good UX:
// Skip rate limiting when user provides their own key
if (!userApiKey) {
  const limit = await checkRateLimit(ip, turnCount, isNewSession);
  if (!limit.allowed) {
    return res.status(429).json({ error: limit.reason, rateLimited: true });
  }
}
const apiKey = userApiKey || serverKey;
2. Security Hardening
Check the Origin or Referer header against an allowlist. This prevents cross-site request abuse where third parties embed scripts that burn your API quota:
Note: Substring matching (origin.includes(h)) is acceptable for v1 but could theoretically match crafted domains. For stricter validation, parse the URL and compare the hostname.
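The stricter variant mentioned in the note can be sketched as follows. This is an illustrative example, not the reference implementation: `ALLOWED_HOSTS` and `isAllowedOrigin` are hypothetical names and the hostnames are placeholders.

```typescript
// Hypothetical allowlist; replace with your own hostnames.
const ALLOWED_HOSTS = new Set(["example.com", "www.example.com"]);

// Parse the header as a URL and compare the exact hostname,
// so "example.com.evil.io" cannot pass the way a substring match would allow.
function isAllowedOrigin(origin: string | undefined, referer: string | undefined): boolean {
  const candidate = origin ?? referer;
  if (!candidate) return false;
  try {
    return ALLOWED_HOSTS.has(new URL(candidate).hostname);
  } catch {
    return false; // malformed header value
  }
}
```

Requests that send neither header are rejected here; whether that is acceptable depends on which clients you need to support.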
Input size limits
Cap both the number of messages and individual message length to prevent token-stuffing attacks that run up your LLM bill:
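A minimal sketch of such a check, assuming a request body that carries an array of `{ role, content }` messages. `MAX_MESSAGES`, `MAX_MESSAGE_CHARS`, and `validateMessages` are hypothetical names, and the limits are placeholder values to tune for your own prompts.

```typescript
type ChatMessage = { role: "user" | "assistant"; content: string };
type Validation = { ok: true; messages: ChatMessage[] } | { ok: false; error: string };

const MAX_MESSAGES = 20; // hypothetical cap on conversation length
const MAX_MESSAGE_CHARS = 2000; // hypothetical cap per message

function validateMessages(input: unknown): Validation {
  if (!Array.isArray(input) || input.length === 0) {
    return { ok: false, error: "messages must be a non-empty array" };
  }
  if (input.length > MAX_MESSAGES) {
    return { ok: false, error: "too many messages" };
  }
  for (const item of input) {
    if (typeof item !== "object" || item === null) {
      return { ok: false, error: "invalid message" };
    }
    const { role, content } = item as Record<string, unknown>;
    if (role !== "user" && role !== "assistant") {
      return { ok: false, error: "invalid role" };
    }
    if (typeof content !== "string" || content.length > MAX_MESSAGE_CHARS) {
      return { ok: false, error: "message too long or not a string" };
    }
  }
  return { ok: true, messages: input as ChatMessage[] };
}
```

Rejecting oversized input before any model call means an attacker pays only the cost of an HTTP request.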
For documentation-backed chatbots, access control must happen before retrieval, not after answer generation. If the bot exposes semantic search, exact search, or a virtual docs filesystem, apply the same visibility filter to every surface:
Exclude unpublished, draft, internal, customer-only, or role-gated pages before building any path tree the model can browse.
Apply the same filter to vector, keyword, and chunk queries. Do not rely on hiding paths in the UI while leaving chunks searchable.
Prefer omitting inaccessible paths entirely. A model should not be able to mention "there is an internal billing page, but you cannot access it."
Include isPublic, groups, tenantId, docsVersion, or equivalent metadata with indexed chunks so filters are cheap and testable.
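One way to keep every retrieval surface behind the same filter is to route them all through a single function. The sketch below assumes chunks carry `isPublic` and `docsVersion` metadata as suggested above; `visibleChunks`, `listPaths`, and `keywordSearch` are hypothetical helpers operating on an in-memory array instead of a real datastore.

```typescript
type DocChunk = { path: string; text: string; isPublic: boolean; docsVersion: string };

// The single visibility filter that every surface must pass through.
function visibleChunks(chunks: DocChunk[], docsVersion: string): DocChunk[] {
  return chunks.filter((c) => c.isPublic && c.docsVersion === docsVersion);
}

// Path tree input: built only from chunks that survived the filter.
function listPaths(chunks: DocChunk[], docsVersion: string): string[] {
  return Array.from(new Set(visibleChunks(chunks, docsVersion).map((c) => c.path))).sort();
}

// Keyword search: same filter applied before matching, never after.
function keywordSearch(chunks: DocChunk[], docsVersion: string, needle: string): DocChunk[] {
  return visibleChunks(chunks, docsVersion).filter((c) => c.text.includes(needle));
}
```

Because inaccessible chunks never reach the matching step, the model cannot learn that a hidden page exists from either its path or its text.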
Safe error handling
Never leak raw SDK error strings to the client (may contain API keys from BYOK)
Never log full error objects (may contain sensitive data)
Return generic error messages:
} catch {
console.error("Chat API error");
return res.status(500).json({
error: "An error occurred processing your request. Please try again.",
});
}
IP resolution on serverless platforms
Use the platform's trusted headers. On Vercel: x-real-ip > x-vercel-forwarded-for > x-forwarded-for. The standard x-forwarded-for is spoofable by clients.
Alternatives:
Cloudflare: CF-Connecting-IP
AWS ALB/CloudFront: X-Forwarded-For (trust only the entry appended by AWS, which is the rightmost one; entries to its left can be set by the client)
Fastly: Fastly-Client-IP
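The Vercel header order described above can be sketched as a small resolver. This is an illustrative example under stated assumptions: `resolveClientIp` is a hypothetical name, header keys are assumed to be lowercased (as Node.js delivers them), and the order must be adapted for other platforms.

```typescript
type HeaderBag = Record<string, string | undefined>;

// Vercel-specific priority order from the text; adjust per platform.
const TRUSTED_HEADER_ORDER = ["x-real-ip", "x-vercel-forwarded-for", "x-forwarded-for"];

function resolveClientIp(headers: HeaderBag): string | null {
  for (const name of TRUSTED_HEADER_ORDER) {
    const raw = headers[name];
    if (!raw) continue;
    // Forwarded-style headers may hold a comma-separated list; take the first entry.
    const first = raw.split(",")[0].trim();
    if (first) return first;
  }
  return null; // caller decides how to treat unknown clients
}
```

Returning `null` instead of a shared placeholder avoids lumping every unidentified client into one rate-limit bucket by accident.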
Disable non-text modalities
If you only need text responses, explicitly restrict the model:
const model = provider("gemini-3.1-flash-lite", {
  responseModalities: ["TEXT"], // Gemini-specific
  // For OpenAI: modalities: ["text"]
});
Also state "text-only assistant" in the system prompt as a defense-in-depth measure.
3. Cost Optimization
Semantic caching
Use vector similarity search to cache and reuse responses for semantically similar questions. Most effective for FAQ-style chatbots where users ask the same questions in different words.
Similarity threshold: 0.92+ to avoid returning wrong cached answers. Lower values increase hit rate but risk incorrect responses.
Embedding dimensions: 128 dimensions are usually sufficient for FAQ similarity and are cheaper to compute and store than the full 768/1536/3072.
Cache scope: Cache first-turn questions only (highest hit rate, simplest implementation).
TTL: 7 days is reasonable; stale answers are better than no cache.
Alternatives:
Pinecone - managed vector DB with metadata filtering
pgvector - if you already have PostgreSQL
Cloudflare Vectorize - edge-native, pairs with Workers
Qdrant/Weaviate - self-hosted or cloud, richer query capabilities
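The threshold and TTL settings above can be illustrated with a small in-memory lookup. This is a sketch under stated assumptions, not the reference implementation: embeddings are assumed to be computed elsewhere, `lookupCache` and `CacheEntry` are hypothetical names, and a real deployment would delegate the similarity search to the vector database.

```typescript
type CacheEntry = { embedding: number[]; answer: string; cachedAt: number };

const SIMILARITY_THRESHOLD = 0.92;
const CACHE_TTL_MS = 7 * 24 * 60 * 60 * 1000; // 7 days

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Returns the best non-expired answer at or above the threshold, else null (cache miss).
function lookupCache(query: number[], entries: CacheEntry[], now: number = Date.now()): string | null {
  let best: CacheEntry | null = null;
  let bestScore = SIMILARITY_THRESHOLD;
  for (const entry of entries) {
    if (now - entry.cachedAt > CACHE_TTL_MS) continue; // stale: treat as a miss
    const score = cosineSimilarity(query, entry.embedding);
    if (score >= bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best ? best.answer : null;
}
```

A `null` result simply falls through to the live LLM call, so a cache outage or miss never blocks an answer.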
Cache TTL enforcement
Always store a cachedAt timestamp in cache entry metadata. On lookup, reject entries older than your TTL (e.g. 7 days). This prevents stale answers from persisting indefinitely, especially when FAQ content changes:
const CACHE_TTL_MS = 7 * 24 * 60 * 60 * 1000; // 7 days
if (Date.now() - result.metadata.cachedAt > CACHE_TTL_MS) {
  // Stale - treat as cache miss
}
Stream protocol consistency for cache hits
When returning a cached response, use the same streaming protocol as live LLM responses. Don't switch to a different response format (e.g. manual Data Stream Protocol vs. UI Message Stream). Inconsistent formats cause client-side parsing errors and broken UX:
// BAD: different format for cache hits
res.write(`0:${JSON.stringify(cachedText)}\n`); // Manual Data Stream Protocol

// GOOD: same format for both paths
const stream = createUIMessageStream({ /* ... */ });
pipeUIMessageStreamToResponse({ response: res, stream });
Optimized exact search
For docs assistants that expose grep-style tools, avoid scanning every page or chunk over the network. Use a two-stage exact-search path:
Coarse filter: ask the document database for pages whose metadata or text might contain the fixed string or regex. Use datastore-native filters where available, such as $contains, full-text indexes, trigram search, or metadata filters by section/path.
Bulk prefetch: fetch all candidate chunks for the matching pages in one batch, sorted by page and chunk_index.
Fine filter: run exact string or regex matching in memory and return only final hit paths/snippets.
Cache: store prefetched page chunks by { path, docsVersion } so repeated grep/cat workflows do not hit the database twice.
Log candidate count and final hit count. If the coarse filter returns too many pages, ask the model to narrow the query instead of silently running an expensive full-corpus scan.
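The stages above can be sketched over an in-memory chunk array. This is illustrative only: `coarseCandidates` stands in for a datastore-native filter, `grepDocs` and `maxCandidates` are hypothetical names, and matching runs per chunk, so a match spanning a chunk boundary would be missed.

```typescript
type Chunk = { path: string; chunkIndex: number; text: string };
type GrepResult = { candidateCount: number; tooBroad: boolean; hits: { path: string; snippet: string }[] };

// Stage 1 stand-in: in production this would be a full-text, trigram, or $contains query.
function coarseCandidates(store: Chunk[], literal: string): string[] {
  const needle = literal.toLowerCase();
  const paths = new Set<string>();
  for (const chunk of store) {
    if (chunk.text.toLowerCase().includes(needle)) paths.add(chunk.path);
  }
  return Array.from(paths);
}

function grepDocs(store: Chunk[], literal: string, pattern: RegExp, maxCandidates = 50): GrepResult {
  const candidates = coarseCandidates(store, literal);
  if (candidates.length > maxCandidates) {
    // Too many pages: report it so the model can narrow the query.
    return { candidateCount: candidates.length, tooBroad: true, hits: [] };
  }
  // Stage 2: bulk prefetch of all chunks for candidate pages, ordered by page and chunk index.
  const wanted = new Set(candidates);
  const fetched = store
    .filter((c) => wanted.has(c.path))
    .sort((a, b) => a.path.localeCompare(b.path) || a.chunkIndex - b.chunkIndex);
  // Stage 3: exact matching in memory. Drop g/y flags so test() is stateless.
  const re = new RegExp(pattern.source, pattern.flags.replace(/[gy]/g, ""));
  const hits = fetched
    .filter((c) => re.test(c.text))
    .map((c) => ({ path: c.path, snippet: c.text.slice(0, 120) }));
  return { candidateCount: candidates.length, tooBroad: false, hits };
}
```

Returning `candidateCount` alongside the hits gives you the two numbers this section recommends logging.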
FAQ list view
Offer a browsable FAQ list alongside the chat interface. This serves users who have common questions without making any LLM calls at all:
// Structured FAQ data for UI rendering
export const FAQ_QUESTIONS: Array<{
  category: string;
  question: string;
  answer: string;
}> = [
  { category: "Ticketing", question: "Can I get a refund?", answer: "Yes, per our refund policy..." },
  // ...
];
Organize by category with expandable sections. Clicking a question can either show the pre-written answer directly or send it to the chat for a more detailed LLM response.
Use the cheapest sufficient model
For a constrained Q&A chatbot, you rarely need the most powerful model:
| Model | Input cost (per 1M tokens) | Output cost (per 1M tokens) | Best for |
| --- | --- | --- | --- |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 | Low cost, good for FAQ |
| GPT-4o-mini | $0.15 | $0.60 | Good balance of cost/quality |
| Claude Haiku | $0.25 | $1.25 | Fast, good at following instructions |
| Llama 3.3 70B (via Groq) | Free tier available | Free tier available | Cost-sensitive prototypes |
Short output limits
Set maxOutputTokens to the minimum needed (e.g. 500 tokens for 2-4 sentence answers). This caps cost per request and keeps responses concise.
Context caching
Pre-build and cache the system prompt context at module level. This avoids re-computing expensive string concatenations on every request:
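A minimal sketch of the idea, assuming the knowledge base is available as plain strings at load time. `FAQ_TEXT`, `VENUE_TEXT`, `buildSystemPrompt`, and `getSystemPrompt` are hypothetical names, and the prompt wording is placeholder text.

```typescript
// Placeholder context; in practice these come from your knowledge base module.
const FAQ_TEXT = "Q: Can I get a refund?\nA: Yes, per our refund policy...";
const VENUE_TEXT = "Venue details go here.";

let buildCount = 0; // only here to make the one-time build observable

function buildSystemPrompt(): string {
  buildCount += 1;
  return [
    "You are a text-only Q&A assistant for this site.",
    "Answer only from the reference data below.",
    "## FAQ\n" + FAQ_TEXT,
    "## Logistics\n" + VENUE_TEXT,
  ].join("\n\n");
}

// Evaluated once when the module loads, then reused by every request.
const SYSTEM_PROMPT = buildSystemPrompt();

function getSystemPrompt(): string {
  return SYSTEM_PROMPT;
}
```

On serverless platforms the module is re-evaluated on each cold start, so the saving applies to warm invocations.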
4. Observability
OpenTelemetry - vendor-neutral, export to Datadog/Honeycomb/Grafana
Datadog LLM Observability - if already using Datadog
Log semantic cache hits
Track cache hit rates to understand cost savings and tune the similarity threshold. A cache hit is a "free" response that saved an LLM call.
Trace retrieval tool calls
Trace retrieval tools separately from LLM calls. For semantic search, exact search, and virtual filesystem tools, log:
Tool name, query/pattern, requested path, and docs version.
Latency, cache hit/miss, database round trips, chunks fetched, candidate count, and final result count.
Whether the model escalated from broad semantic search to exact grep/cat/ls exploration.
User-visible outcome signals such as cited-answer rate, "I don't know" rate, thumbs up/down, and handoff/escalation rate.
This tells you whether agentic retrieval is improving answer quality or just adding cost and latency.
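One way to capture these fields is to wrap each retrieval tool in a small tracing helper. The sketch below is synchronous for brevity and records into an in-memory array. `traceToolCall`, `ToolTrace`, and `toolTraces` are hypothetical names, and a real implementation would await the tool and export the record to your tracing backend.

```typescript
type ToolTrace = {
  tool: string;
  query: string;
  docsVersion: string;
  latencyMs: number;
  cacheHit: boolean;
  resultCount: number;
};

const toolTraces: ToolTrace[] = [];

// Runs the tool, then records latency, cache status, and result count separately from LLM traces.
function traceToolCall<T>(
  meta: { tool: string; query: string; docsVersion: string },
  run: () => { results: T[]; cacheHit: boolean },
): T[] {
  const start = Date.now();
  const { results, cacheHit } = run();
  toolTraces.push({
    ...meta,
    latencyMs: Date.now() - start,
    cacheHit,
    resultCount: results.length,
  });
  return results;
}
```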
Don't log sensitive data
Avoid logging full error objects, API keys, or user PII. Log just enough to debug (error type, status codes, IP hashes).
5. UX Patterns
Markdown rendering
Enable markdown in chat responses and instruct the model to use it via the system prompt:
You may use markdown formatting in your responses when appropriate:
- Use **bold** for emphasis on key information like dates, prices, or venue names
- Use [links](url) when referencing websites
- Use bullet points for lists of speakers, sessions, or options
- Keep formatting light and readable
React: react-markdown + remark-gfm
Vue: vue-markdown-render
Vanilla JS: marked or markdown-it
Draggable and resizable window
Let users reposition and resize the chat window. Persist geometry to localStorage so it survives page reloads. Clamp positions to viewport bounds:
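The clamping step can be sketched as a pure function that is applied both when restoring saved geometry and on every drag or resize. `clampToViewport` and the `minSize` default are hypothetical; persisting the result to localStorage is left to the caller.

```typescript
type Rect = { x: number; y: number; width: number; height: number };
type Viewport = { width: number; height: number };

// Keeps the window fully inside the viewport and no smaller than minSize (when the viewport allows).
function clampToViewport(rect: Rect, viewport: Viewport, minSize = 200): Rect {
  const width = Math.min(Math.max(rect.width, minSize), viewport.width);
  const height = Math.min(Math.max(rect.height, minSize), viewport.height);
  const x = Math.min(Math.max(rect.x, 0), viewport.width - width);
  const y = Math.min(Math.max(rect.y, 0), viewport.height - height);
  return { x, y, width, height };
}
```

Clamping on restore matters because geometry saved on a large monitor may place the window off-screen on a smaller one.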
Always stream responses for perceived speed. Use your SDK's streaming API rather than waiting for the full response. The first token appearing quickly matters more than total latency.
For the UI, update one in-progress assistant message as tokens arrive. Do not append a new message row per token. Token-level rows are expensive, break transcript semantics, and make scroll anchoring harder.
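The single-row update can be expressed as a pure state transition. A minimal sketch, assuming messages carry stable IDs; `appendToken` and the `Message` shape are hypothetical.

```typescript
type Message = { id: string; role: "user" | "assistant"; text: string };

// Grows the in-progress assistant message; the number of rows never changes.
function appendToken(messages: Message[], messageId: string, token: string): Message[] {
  return messages.map((m) => (m.id === messageId ? { ...m, text: m.text + token } : m));
}
```

In React this would typically be called as `setMessages((current) => appendToken(current, id, token))` for each streamed chunk.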
Chat scroll and virtualization
Public Q&A widgets often start short, so a plain scroll container is fine until there is evidence it is not. Add virtualization when the widget can hold long histories, rich markdown, tool results, images, code blocks, history pagination, or token-streaming messages that grow in height.
When virtualization is warranted, recommend TanStack Virtual's chat support for React implementations, but keep it optional and swappable. The important lessons are the scroll contracts:
Treat chat as an end-anchored reverse feed, not a normal top-anchored list.
Keep message data in normal chronological order; avoid flex-direction: column-reverse, inverted transforms, and hand-maintained scrollTop += delta bookkeeping.
Use stable message IDs as row keys. Index keys cannot preserve position after prepending older history.
Loading older history should prepend messages with ordinary array updates, such as setMessages((current) => [...olderMessages, ...current]).
Follow appended messages only when the user was already near the latest message. If the user scrolled up to read history, incoming output must not yank them back to the bottom.
Use an explicit "near latest" threshold, e.g. about 80px, rather than exact-bottom checks that are brittle across browsers and dynamic heights.
Expose a "Latest" or "Jump to bottom" affordance when the user is away from the end.
Dynamic row heights are the default for real chat. Markdown, links, code, tool output, and streamed text should be measured or allowed to reflow without overlap.
Prefer instant/auto follow for high-frequency token streaming. Smooth scroll can look nice for discrete appends, but validate it because animation targets can fight dynamic measurement.
Keep pagination cursors, hasMoreHistory, loading flags, and request dedupe in app state. The virtualizer should receive the current ordered message array, not own data fetching.
TanStack Virtual maps these lessons to anchorTo: 'end', followOnAppend, scrollEndThreshold, stable getItemKey, measureElement, isAtEnd(), getDistanceFromEnd(), and scrollToEnd(). These APIs are useful defaults, not a hard dependency.
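Independent of any virtualizer, two of these contracts reduce to small pure functions. The sketch below is library-agnostic and does not use TanStack Virtual's API; `isNearLatest` and `prependHistory` are hypothetical helpers.

```typescript
// "Near latest" check with an explicit pixel threshold instead of an exact-bottom comparison.
function isNearLatest(scrollTop: number, clientHeight: number, scrollHeight: number, thresholdPx = 80): boolean {
  return scrollHeight - (scrollTop + clientHeight) <= thresholdPx;
}

// Prepends older history in chronological order, skipping IDs already present (request dedupe).
function prependHistory<T extends { id: string }>(older: T[], current: T[]): T[] {
  const seen = new Set(current.map((m) => m.id));
  return [...older.filter((m) => !seen.has(m.id)), ...current];
}
```

Evaluate `isNearLatest` before applying an append, then follow only if it returned true; checking after the append would measure against the already-grown content.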
Bottom command shelf
Do not treat the composer as only a textbox. For AI applications, the bottom of the screen is valuable thumb-reachable space for the actions users need while forming a prompt: attach, tools, model, voice, send, mode, reasoning depth, runtime context, and tool launchers.
Use a progressive bottom command shelf when the app has enough controls to justify it:
Keep the default composer compact: input, add/attach, tools toggle, model chip, mic, and send.
Expand into a bottom sheet for secondary controls rather than putting all controls in the default composer.
Keep send one tap away in both compact and expanded states.
Keep the main canvas visually calm; let the bottom shelf become the command plane.
Show mode and execution state as compact chips, e.g. Plan / Build, effort level, device/project/branch, and budget or usage.
Place tool launchers in the expanded shelf when they are likely to be used mid-prompt, e.g. terminal, file search, web search, docs, or attachments.
Add an explicit close/collapse affordance above the expanded shelf so the user can reclaim vertical space.
Avoid this pattern for simple public FAQ widgets with no tools or settings. For constrained Q&A, a compact composer plus FAQ chips is often enough.
Reserve layout space for the compact composer, then let expanded toolbar controls move upward as an overlay. Opening tools should not resize, jump, or re-anchor the chat transcript.
Test keyboard open/close, safe-area insets, shelf expand/collapse, message streaming, and history reading with the shelf in both states.
Graceful degradation
Every optional service should have a fallback:
| Service | If unavailable... |
| --- | --- |
| Redis (rate limiting) | Fall back to in-memory counters |
| Vector DB (cache) | Skip semantic caching, always call LLM |
| Observability (tracing) | Skip tracing, log locally |
| Server API key | Prompt user for BYOK |
| Virtualized chat list | Fall back to a bounded native scroll list with transcript limits |
// Pattern: optional service with graceful fallback
const vectorIndex = vectorUrl && vectorToken
  ? new Index({ url: vectorUrl, token: vectorToken })
  : null; // null = skip caching
if (vectorIndex) { /* try cache */ }
// Always falls through to LLM call
Hover previews
Show top FAQ questions on hover over the chat bubble. This gives users an immediate sense of what the chatbot can help with and reduces "what do I ask?" friction.
Theme-aware / adaptive theming
When embedding a chatbot widget on a page that supports dark/light mode, make the chatbot colors contrast with the page background:
Dark page -> white/light chatbot
Light page -> black/dark chatbot
Accept the page's theme state (e.g. isDark prop) and derive all colors from a single theme palette function. Use useMemo to avoid recalculating on every render:
Define comprehensive color tokens so every UI element adapts. This avoids hardcoded colors scattered throughout the component and makes the entire widget respond to theme changes in one place.
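A single palette function might look like the sketch below. The token names and hex values are hypothetical placeholders, not the reference implementation's palette.

```typescript
type Palette = { background: string; text: string; border: string; accent: string };

// Hypothetical color values; swap in your own design tokens.
const LIGHT_WIDGET = { background: "#FFFFFF", text: "#111111", border: "#E5E5E5" };
const DARK_WIDGET = { background: "#111111", text: "#FFFFFF", border: "#333333" };

// Inverts against the page: a dark page gets a light widget and vice versa.
function getPalette(isDark: boolean, accentColor: string): Palette {
  const base = isDark ? LIGHT_WIDGET : DARK_WIDGET;
  return { ...base, accent: accentColor };
}
```

In a React component this would usually be wrapped as `useMemo(() => getPalette(isDark, accentColor), [isDark, accentColor])` so the object identity stays stable between renders.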
6. Architecture
Pluggable component
Design the chatbot as a single component that accepts props so it can be dropped into any page with different branding/context:
<Chatbot
page="europe"
accentColor="#7C3AED"
title="AI Engineer Europe Assistant"
/>
Tool calls instead of context stuffing
Instead of stuffing all data into the system prompt, expose tools that the model can call on-demand. This keeps the context window smaller and responses more accurate:
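The shape of such a tool can be sketched without committing to a particular SDK. Everything here is illustrative: `SPEAKERS` holds placeholder data, and `tools`, `searchSpeakers`, and `callTool` are hypothetical names that you would adapt to your SDK's tool-definition format.

```typescript
type Speaker = { name: string; talk: string };

// Placeholder data; in practice this is loaded from your schedule source.
const SPEAKERS: Speaker[] = [
  { name: "Speaker A", talk: "Scaling semantic caches" },
  { name: "Speaker B", talk: "Virtualized chat lists" },
];

type Tool = { description: string; run: (args: { query: string }) => unknown };

const tools: Record<string, Tool> = {
  searchSpeakers: {
    description: "Find speakers whose name or talk title contains the query.",
    run: ({ query }) => {
      const q = query.toLowerCase();
      return SPEAKERS.filter((s) => s.name.toLowerCase().includes(q) || s.talk.toLowerCase().includes(q));
    },
  },
};

// Dispatches a model-requested tool call; unknown tool names are rejected.
function callTool(name: string, args: { query: string }): unknown {
  const tool = tools[name];
  if (!tool) throw new Error(`Unknown tool: ${name}`);
  return tool.run(args);
}
```

Only the matching rows enter the conversation, so the full speaker list never has to live in the system prompt.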
Virtual documentation filesystem for agentic retrieval
Top-k RAG works for simple FAQ questions, but it breaks down when the answer spans several pages, the user needs exact syntax, or the correct page does not land in the nearest embedding results. For documentation-backed chatbots, consider exposing the knowledge base as a read-only virtual filesystem so the model can explore with familiar tools such as ls, cat, find, and grep.
The important idea is to give the model the filesystem workflow, not necessarily a real filesystem. Mintlify's ChromaFs pattern maps shell commands onto an existing docs index instead of booting a sandbox for every visitor. That matters for public chatbot latency and cost: their article reports p90 session creation dropping from about 46s with sandbox/repo setup to about 100ms with a virtual filesystem over Chroma.
Recommended shape:
Store a path tree for the docs site, e.g. page slugs and section paths, as a compact JSON artifact in the same datastore as the indexed content.
On session init, load the path tree into memory as Set<path> plus Map<directory, children> so ls, cd, and basic find do not need network calls.
Apply access control before the tree reaches the model. For public widgets this usually means pruning unpublished, private, draft, customer-only, or admin-only pages. The model should not see paths it cannot read.
Implement cat /path/page.mdx by fetching all chunks for that page, sorting by chunk_index, and reassembling the full page. Cache page reads during the session so repeated inspection is cheap.
Support lazy file pointers for large artifacts such as OpenAPI specs, generated API reference JSON, changelogs, or versioned docs. Show the file in ls, but fetch content only when the model runs cat.
Make the filesystem explicitly read-only. Any write-like operation should fail with an EROFS-style error so the assistant can explore freely without state cleanup or cross-user mutation risk.
Optimize recursive grep as a two-stage search: use the vector/document database as a coarse filter to identify candidate pages, then run exact string or regex matching in memory over the fetched candidates. This gives exact-match behavior without scanning every file over the network.
Expose the virtual filesystem as narrow tools rather than a general shell when possible:
Use this pattern when the chatbot needs to behave like a docs expert. Keep normal semantic search as the first-pass tool for broad questions, then let the model escalate to grep/cat/ls when it needs exact wording, syntax, cross-page synthesis, or source-grounded citations.
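A narrow, read-only tool surface along these lines can be sketched over an in-memory page map. The tool names `list_docs` and `read_doc` follow the names used elsewhere in this document. `PAGES`, `WRITE_LIKE_TOOLS`, and `runFsTool` are hypothetical, and the pages are placeholder content.

```typescript
// Placeholder pages; in practice content is fetched from the docs index.
const PAGES = new Map<string, string>([
  ["/guides/quickstart.mdx", "# Quickstart"],
  ["/guides/auth.mdx", "# Auth"],
  ["/api/chat.mdx", "# Chat API"],
]);

// ls equivalent: immediate children of a directory, with a trailing slash on subdirectories.
function listDocs(dir: string): string[] {
  const prefix = dir.endsWith("/") ? dir : dir + "/";
  const children = new Set<string>();
  for (const path of Array.from(PAGES.keys())) {
    if (!path.startsWith(prefix)) continue;
    const rest = path.slice(prefix.length);
    const slash = rest.indexOf("/");
    children.add(slash === -1 ? rest : rest.slice(0, slash) + "/");
  }
  return Array.from(children).sort();
}

// cat equivalent: full page text, or an ENOENT-style error.
function readDoc(path: string): string {
  const text = PAGES.get(path);
  if (text === undefined) throw new Error(`ENOENT: no such file: ${path}`);
  return text;
}

const WRITE_LIKE_TOOLS = new Set(["write_doc", "delete_doc", "move_doc"]);

function runFsTool(tool: string, arg: string): string | string[] {
  if (tool === "list_docs") return listDocs(arg);
  if (tool === "read_doc") return readDoc(arg);
  if (WRITE_LIKE_TOOLS.has(tool)) throw new Error("EROFS: read-only file system");
  throw new Error(`Unknown tool: ${tool}`);
}
```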
System prompt structure
Structure the system prompt with these sections in order:
Role and constraints - "You are the conference assistant..."
Formatting instructions - "Use markdown when appropriate..."
Tool usage guidance - "Use tools to search speakers/sessions..."
Hard constraints - "Text-only, no images/audio..."
Fallback instructions - "If you don't know, suggest emailing..."
Reference data - FAQ text, speaker list, session list
API route, not edge function
For chatbot endpoints that need streaming + external service calls (Redis, Vector DB, observability), use a standard API route / serverless function rather than edge functions. Edge functions have stricter size/dependency limits and cold start characteristics that can cause issues with multiple SDK imports.
7. Knowledge Base Management
Path tree index
For large documentation sites, generate a docs manifest alongside the chunk index:
At runtime, load the access-pruned manifest into memory:
Set<string> for valid file paths.
Map<string, string[]> for directory-to-children lookup.
Optional title/path aliases for forgiving find_docs behavior.
This makes list_docs, find_docs, and path validation memory-only. Rebuild or invalidate the tree when docsVersion changes.
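Building the in-memory structures from a manifest can be sketched as follows, assuming the manifest has already been access-pruned and reduced to a list of absolute page paths. `buildPathTree` and `PathTree` are hypothetical names.

```typescript
type PathTree = { files: Set<string>; children: Map<string, string[]> };

function buildPathTree(paths: string[]): PathTree {
  const files = new Set(paths);
  const childSets = new Map<string, Set<string>>();
  for (const path of paths) {
    const parts = path.split("/").filter(Boolean);
    let dir = "/";
    parts.forEach((part, i) => {
      const isFile = i === parts.length - 1;
      // Directories are recorded with a trailing slash, files without.
      const name = isFile ? part : part + "/";
      if (!childSets.has(dir)) childSets.set(dir, new Set());
      childSets.get(dir)!.add(name);
      if (!isFile) dir = dir + part + "/";
    });
  }
  const children = new Map<string, string[]>();
  childSets.forEach((set, dir) => children.set(dir, Array.from(set).sort()));
  return { files, children };
}
```

Path validation then becomes `tree.files.has(path)` and directory listing becomes `tree.children.get(dir)`, both without a network call.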
Structured FAQ data
Maintain two representations of FAQ data:
Flat text for the system prompt - a single string the model reads as context
Structured objects for the UI - typed array with question, answer, category fields for rendering the FAQ list view
// System prompt context (flat text)
export const FAQ_KNOWLEDGE_BASE = `
## TICKETING & PRICING
Q: Can I get a refund?
A: Yes, per our refund policy...
`;

// UI list view (structured)
export const FAQ_QUESTIONS = [
  { category: "Ticketing", question: "Can I get a refund?", answer: "Yes..." },
];
Full-page reassembly from chunks
Chunked vector results are good for discovery, but they are often too lossy for final answers. For read_doc(path) or citation verification, fetch the whole page:
Cache full-page reads by { path, docsVersion }. This lets the model answer exact syntax, multi-section, and "compare these pages" questions with the same source material a human docs reader would inspect.
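Reassembly with a version-aware cache can be sketched as below. The example is synchronous and takes the chunk fetcher as a parameter so it stays datastore-agnostic. `readDoc`, `pageCache`, and the `PageChunk` shape are hypothetical, and a real fetcher would be asynchronous.

```typescript
type PageChunk = { path: string; docsVersion: string; chunkIndex: number; text: string };
type ChunkFetcher = (path: string, docsVersion: string) => PageChunk[];

const pageCache = new Map<string, string>();

function readDoc(path: string, docsVersion: string, fetchChunks: ChunkFetcher): string {
  // Cache key includes the docs version so a new release never serves old pages.
  const key = `${docsVersion}:${path}`;
  const cached = pageCache.get(key);
  if (cached !== undefined) return cached;

  const chunks = fetchChunks(path, docsVersion);
  if (chunks.length === 0) throw new Error(`ENOENT: no such file: ${path}`);

  // Restore original page order before joining.
  const page = chunks
    .slice()
    .sort((a, b) => a.chunkIndex - b.chunkIndex)
    .map((c) => c.text)
    .join("\n");
  pageCache.set(key, page);
  return page;
}
```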
Include venue/logistics details
Always include practical information (venue name, address, dates, ticket URLs) directly in the context. These are the most common questions and should never require a tool call.
8. Common Pitfalls
Avoid DOM-manipulating libraries in React chat widgets
Libraries like html2canvas that clone and manipulate the DOM can interfere with React's virtual DOM reconciliation, causing page reloads, lost state, or broken event handlers. If you need page screenshots, use native browser APIs (navigator.mediaDevices.getDisplayMedia) or capture at the server level instead.
Don't make chat scroll a pile of special cases
The failure mode for long chatbot widgets is usually scattered scroll math: column-reverse, inverted transforms, manual offset deltas, unconditional scrollToBottom, and index-based keys. These hacks often pass short manual tests and then fail when older history loads, the assistant streams a long markdown answer, or the user reads history while new output arrives.
Prefer a single scroll contract:
Ordered messages in data.
Stable IDs for rows.
Prepend history without changing the user's visible anchor.
Append/follow only when already near latest.
Grow the active assistant row during streaming.
Test "reading history" and "pinned at latest" as different states.
Verify exact model identifiers before deploying
LLM model IDs change frequently and may require suffixes like -preview. A wrong model ID can return a 200 OK response with an empty or errored stream body, making it look like a frontend bug. Always verify the exact model ID against the provider's docs and test with a real API call before deploying.
Always run a local build before pushing
Never skip pnpm build / npm run build before pushing to a branch. TypeScript errors, import issues, and other compilation failures caught locally are much faster to fix than waiting for CI. This is especially important when multiple people are editing the same files.
9. Checklist
Use this checklist when building a new public Q&A chatbot:
Rate limiting: per-turn, per-visitor, and global limits
Distributed rate limiter for production (not in-memory only)
Session counter increments only after server confirms response