| name | confidence-layer-design |
| description | Expose when a judgment is high-confidence vs thin-evidence without undermining the platform's authority — covering the confidence trilemma, per-tier UI patterns, data model, copy patterns, and hard rules on when NOT to show confidence indicators. |
Confidence Layer Design
Review Checklist
The Confidence Trilemma — Why This Is Hard
Three approaches all fail in different ways:
Show confidence everywhere:
Users see a medium badge on a 7.5 score and immediately ask "why is this medium confidence? Is the 7.5 wrong?" The badge creates doubt about the score itself, not just the evidence. Users start gaming the system — submitting longer text to get high confidence badges rather than solving the actual challenge better.
Hide confidence entirely:
Platform feels like a black box. Sophisticated users (enterprise buyers, serious competitors) want to know the basis for scores. When a judge scores 3.0 on a lane where they had one thin sentence of evidence, hiding that feels dishonest. Trust erosion is slower but deeper.
Fake high confidence uniformly:
Catastrophic. Users will eventually submit something and see a confident 8.0 score on a lane where the submission had zero relevant content. One visible wrong high-confidence assessment destroys trust permanently. Don't do this.
The resolution:
Show confidence ONLY where it adds information — specifically, only when confidence is medium or low. High confidence is the baseline state. Showing a badge for high adds no information ("the platform is working normally"). Showing an indicator for medium or low says "something is different here, pay attention."
This is the same principle as error/warning UI: you don't show a green "success" indicator on every form field that has valid input. You only show a red indicator when something needs attention.
When NOT to show confidence indicators (enforced by code, not just guidelines):
- Aggregate/composite scores
- Final rankings and leaderboard positions
- List/card views of bouts (only show in detail view)
- When ALL lanes are high confidence (suppress the entire layer)
- For the final prize pool determination — the system acts on the score; confidence is informational only
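
One way to make the rules above code-enforced rather than a guideline is a single gate that every render path consults. The following is a minimal sketch under stated assumptions: `DisplaySurface`, its member names, and `shouldShowConfidence` are hypothetical and do not come from the existing codebase.

```typescript
// Hypothetical sketch: one gate that every render path consults.
// DisplaySurface, its members, and shouldShowConfidence are illustrative
// names, not part of the existing codebase.
type ConfidenceTier = 'high' | 'medium' | 'low';

type DisplaySurface =
  | 'lane-detail'      // per-lane row in the bout detail view
  | 'composite-score'  // aggregate/composite score
  | 'leaderboard'      // final rankings and leaderboard positions
  | 'list-card'        // list/card views of bouts
  | 'prize-pool';      // final prize pool determination

function shouldShowConfidence(
  surface: DisplaySurface,
  tier: ConfidenceTier,
  hasAnyLowOrMedium: boolean,
): boolean {
  // Only the per-lane detail view may carry an indicator.
  if (surface !== 'lane-detail') return false;
  // All lanes high: suppress the entire layer.
  if (!hasAnyLowOrMedium) return false;
  // High is the baseline state and never renders.
  return tier !== 'high';
}
```

Routing every surface through one function keeps the suppression rules auditable in a single place instead of scattered across components.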
TypeScript Confidence Tier Types
export type ConfidenceTier = 'high' | 'medium' | 'low';
export interface LaneConfidence {
tier: ConfidenceTier;
evidenceCount: number;
caveatText: string | null;
suppressDisplay: boolean;
}
export interface JudgeOutputConfidence {
laneConfidences: Record<string, LaneConfidence>;
hasAnyLowOrMedium: boolean;
}
export function computeConfidenceTier(params: {
evidenceCount: number;
feedbackWordCount: number;
specificityScore: number;
hasDirectQuotes: boolean;
}): ConfidenceTier {
const { evidenceCount, feedbackWordCount, specificityScore, hasDirectQuotes } = params;
if (evidenceCount === 0) return 'low';
if (feedbackWordCount < 30 && evidenceCount <= 1) return 'low';
if (specificityScore < 0.25) return 'low';
if (evidenceCount >= 3 && specificityScore >= 0.7 && hasDirectQuotes) return 'high';
if (evidenceCount >= 5 && specificityScore >= 0.6) return 'high';
if (evidenceCount >= 4 && feedbackWordCount >= 100 && specificityScore >= 0.6) return 'high';
return 'medium';
}
export function buildJudgeOutputConfidence(
laneData: Record<string, {
evidenceCount: number;
feedbackWordCount: number;
specificityScore: number;
hasDirectQuotes: boolean;
}>,
getCopyForLane: (lane: string, tier: ConfidenceTier) => string | null
): JudgeOutputConfidence {
const laneConfidences: Record<string, LaneConfidence> = {};
let hasAnyLowOrMedium = false;
for (const [lane, signals] of Object.entries(laneData)) {
const tier = computeConfidenceTier(signals);
if (tier !== 'high') hasAnyLowOrMedium = true;
laneConfidences[lane] = {
tier,
evidenceCount: signals.evidenceCount,
caveatText: tier === 'high' ? null : getCopyForLane(lane, tier),
suppressDisplay: tier === 'high',
};
}
return { laneConfidences, hasAnyLowOrMedium };
}
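
To see how the thresholds above play out, the sketch below runs the same tier rules over a few made-up inputs. The tier function is repeated so the snippet runs on its own; `LaneSignals` is a hypothetical name for the params shape, and the sample values are invented for illustration.

```typescript
// Standalone copy of the tier rules so this snippet runs on its own.
type ConfidenceTier = 'high' | 'medium' | 'low';

// Hypothetical name for the params shape used by computeConfidenceTier.
interface LaneSignals {
  evidenceCount: number;
  feedbackWordCount: number;
  specificityScore: number;
  hasDirectQuotes: boolean;
}

function computeConfidenceTier(s: LaneSignals): ConfidenceTier {
  if (s.evidenceCount === 0) return 'low';
  if (s.feedbackWordCount < 30 && s.evidenceCount <= 1) return 'low';
  if (s.specificityScore < 0.25) return 'low';
  if (s.evidenceCount >= 3 && s.specificityScore >= 0.7 && s.hasDirectQuotes) return 'high';
  if (s.evidenceCount >= 5 && s.specificityScore >= 0.6) return 'high';
  if (s.evidenceCount >= 4 && s.feedbackWordCount >= 100 && s.specificityScore >= 0.6) return 'high';
  return 'medium';
}

// Sample inputs are made up for illustration.
const samples: Array<{ lane: string; signals: LaneSignals }> = [
  // Three specific, quoted evidence points: high.
  { lane: 'planning', signals: { evidenceCount: 3, feedbackWordCount: 80, specificityScore: 0.75, hasDirectQuotes: true } },
  // Same signals without direct quotes: no high rule matches, so medium.
  { lane: 'execution', signals: { evidenceCount: 3, feedbackWordCount: 80, specificityScore: 0.75, hasDirectQuotes: false } },
  // One evidence point and under 30 words of feedback: low.
  { lane: 'reasoning', signals: { evidenceCount: 1, feedbackWordCount: 20, specificityScore: 0.5, hasDirectQuotes: false } },
  // Zero evidence is always low, whatever the other signals say.
  { lane: 'adaptability', signals: { evidenceCount: 0, feedbackWordCount: 200, specificityScore: 0.9, hasDirectQuotes: true } },
];

const tiers = samples.map(({ lane, signals }) => ({
  lane,
  tier: computeConfidenceTier(signals),
}));

// True here because at least one lane is not high, so the layer renders.
const hasAnyLowOrMedium = tiers.some((t) => t.tier !== 'high');
```

Note that the low rules are checked first, so a lane with zero evidence stays low even when its specificity score is high.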
Copy Pattern Library
import type { ConfidenceTier } from '@/types/confidence';
const LANE_DISPLAY_NAMES: Record<string, string> = {
planning: 'Planning',
execution: 'Execution',
reasoning: 'Reasoning',
communication: 'Communication',
adaptability: 'Adaptability',
};
export function getConfidenceCopy(lane: string, tier: ConfidenceTier): string | null {
const laneName = LANE_DISPLAY_NAMES[lane] ?? lane;
if (tier === 'high') return null;
if (tier === 'medium') {
return `Score is based on moderate evidence in this lane.`;
}
if (tier === 'low') {
return `Based on limited evidence in the ${laneName} lane. The submission had few clear examples for this dimension — the score reflects what was present.`;
}
return null;
}
export function getLowConfidenceExplainer(lane: string, evidenceCount: number): string {
const laneName = LANE_DISPLAY_NAMES[lane] ?? lane;
if (evidenceCount === 0) {
return `The ${laneName} score is based on inferred signals only — no direct evidence was identified in the submission for this dimension.`;
}
return `Based on ${evidenceCount === 1 ? 'a single evidence point' : `${evidenceCount} evidence points`} in the ${laneName} lane. Limited visibility means this score has lower precision than other lanes.`;
}
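
Because the copy lives in one module, the "limited evidence" framing can be checked mechanically, for example in a unit test. The sketch below is a hypothetical helper, not part of the existing codebase: `BANNED_PHRASES` and `findBannedPhrases` are illustrative names, and the phrase list is an assumption drawn from the wording this document warns against.

```typescript
// Hypothetical copy audit helper. The phrase list is an assumption based on
// the "uncertain" / "not sure" wording this document advises against.
const BANNED_PHRASES: string[] = ['uncertain', 'not sure', 'not fully confident'];

// Returns every banned phrase found in the copy (case-insensitive).
function findBannedPhrases(copy: string): string[] {
  const lower = copy.toLowerCase();
  return BANNED_PHRASES.filter((phrase) => lower.indexOf(phrase) !== -1);
}

// Copy that questions the score itself is flagged.
const flagged = findBannedPhrases(
  "We're not fully confident in this score due to limited submission content.",
);

// Evidence-framed copy passes with no findings.
const clean = findBannedPhrases('Score is based on moderate evidence in this lane.');
```

A test could run this check over the output of `getConfidenceCopy` and `getLowConfidenceExplainer` for every lane and tier and fail when the result is non-empty.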
SQL Confidence Storage Schema
Confidence is stored per-claim (per-lane per-judge output) in the judge_outputs.confidence_scores JSONB column AND materialized in a queryable table for analytics.
CREATE TABLE judge_lane_confidence (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
judge_output_id UUID NOT NULL REFERENCES judge_outputs(id) ON DELETE CASCADE,
submission_id UUID NOT NULL REFERENCES submissions(id) ON DELETE CASCADE,
judge_id TEXT NOT NULL,
lane TEXT NOT NULL,
confidence_tier TEXT NOT NULL CHECK (confidence_tier IN ('high', 'medium', 'low')),
evidence_count INTEGER NOT NULL DEFAULT 0,
specificity_score FLOAT,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_lane_confidence_submission ON judge_lane_confidence(submission_id);
CREATE INDEX idx_lane_confidence_tier ON judge_lane_confidence(judge_id, lane, confidence_tier);
CREATE INDEX idx_lane_confidence_created ON judge_lane_confidence(created_at DESC);
Analytics query: confidence tier distribution per judge and lane over the last 7 days
SELECT
judge_id,
lane,
COUNT(*) AS total,
COUNT(*) FILTER (WHERE confidence_tier = 'low') AS low_count,
COUNT(*) FILTER (WHERE confidence_tier = 'medium') AS medium_count,
COUNT(*) FILTER (WHERE confidence_tier = 'high') AS high_count,
ROUND(
COUNT(*) FILTER (WHERE confidence_tier = 'low')::NUMERIC / COUNT(*) * 100,
1
) AS pct_low
FROM judge_lane_confidence
WHERE created_at >= NOW() - INTERVAL '7 days'
GROUP BY judge_id, lane
ORDER BY judge_id, pct_low DESC;
Trigger to populate judge_lane_confidence when a judge_outputs row is inserted or its confidence_scores column is updated
CREATE OR REPLACE FUNCTION sync_lane_confidence()
RETURNS TRIGGER LANGUAGE plpgsql AS $$
BEGIN
DELETE FROM judge_lane_confidence WHERE judge_output_id = NEW.id;
IF NEW.confidence_scores IS NOT NULL THEN
INSERT INTO judge_lane_confidence (
judge_output_id, submission_id, judge_id, lane, confidence_tier
)
SELECT
NEW.id,
NEW.submission_id,
NEW.judge_id,
key AS lane,
value::TEXT AS confidence_tier
FROM jsonb_each_text(NEW.confidence_scores);
END IF;
RETURN NEW;
END;
$$;
CREATE TRIGGER trg_sync_lane_confidence
AFTER INSERT OR UPDATE OF confidence_scores ON judge_outputs
FOR EACH ROW EXECUTE FUNCTION sync_lane_confidence();
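
The trigger reads `confidence_scores` with `jsonb_each_text`, so it expects a flat object mapping each lane to its tier string. A minimal sketch of producing that payload on the TypeScript side follows; `toConfidenceScores` and `LaneConfidenceLike` are hypothetical names, and the exact write path to `judge_outputs` is not shown.

```typescript
type ConfidenceTier = 'high' | 'medium' | 'low';

// Hypothetical subset of LaneConfidence: only the tier is persisted here.
interface LaneConfidenceLike {
  tier: ConfidenceTier;
}

// Flattens per-lane confidence into the { lane: tier } shape that the
// trigger unpacks with jsonb_each_text.
function toConfidenceScores(
  laneConfidences: Record<string, LaneConfidenceLike>,
): Record<string, ConfidenceTier> {
  const out: Record<string, ConfidenceTier> = {};
  for (const lane of Object.keys(laneConfidences)) {
    const entry = laneConfidences[lane];
    if (entry) out[lane] = entry.tier;
  }
  return out;
}

// Serialized value to store in judge_outputs.confidence_scores.
const payload = JSON.stringify(
  toConfidenceScores({
    planning: { tier: 'high' },
    execution: { tier: 'low' },
  }),
);
```

Any value other than 'high', 'medium', or 'low' would violate the CHECK constraint on `judge_lane_confidence`, so keeping the payload typed as `ConfidenceTier` catches that before the insert.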
TSX Confidence Indicator Components
'use client';
import { InformationCircleIcon } from '@heroicons/react/24/outline';
import { Tooltip, TooltipContent, TooltipTrigger } from '@/components/ui/tooltip';
import type { LaneConfidence } from '@/types/confidence';
interface ConfidenceIndicatorProps {
confidence: LaneConfidence;
lane: string;
}
export function ConfidenceIndicator({ confidence, lane }: ConfidenceIndicatorProps) {
if (confidence.suppressDisplay || confidence.tier === 'high') return null;
if (confidence.tier === 'medium') {
return (
<Tooltip>
<TooltipTrigger asChild>
<span className="inline-flex items-center gap-0.5 text-xs text-gray-400 cursor-help">
<InformationCircleIcon className="h-3.5 w-3.5" />
<span>Moderate evidence</span>
</span>
</TooltipTrigger>
<TooltipContent side="top" className="max-w-xs text-xs">
{confidence.caveatText}
</TooltipContent>
</Tooltip>
);
}
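// Low tier: render nothing rather than an empty caveat box.
if (!confidence.caveatText) return null;
return (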
<div className="mt-2 px-3 py-2 bg-gray-50 border border-gray-200 rounded text-xs text-gray-600">
<div className="flex items-start gap-2">
<InformationCircleIcon className="h-4 w-4 text-gray-400 shrink-0 mt-0.5" />
<span>{confidence.caveatText}</span>
</div>
</div>
);
}
LaneScoreWithConfidence — Full lane row component
'use client';
import { ConfidenceIndicator } from '@/components/confidence/ConfidenceIndicator';
import type { LaneConfidence } from '@/types/confidence';
import { cn } from '@/lib/utils';
interface LaneScoreWithConfidenceProps {
lane: string;
displayName: string;
score: number;
maxScore: number;
confidence: LaneConfidence;
feedbackText?: string;
}
export function LaneScoreWithConfidence({
lane,
displayName,
score,
maxScore,
confidence,
feedbackText,
}: LaneScoreWithConfidenceProps) {
const scorePercent = (score / maxScore) * 100;
const scoreColor = scorePercent >= 80 ? 'text-green-600'
: scorePercent >= 60 ? 'text-yellow-600'
: 'text-red-600';
return (
<div className="space-y-2 py-3 border-b border-gray-100 last:border-0">
<div className="flex items-center justify-between">
<div className="flex items-center gap-2">
<span className="text-sm font-medium text-gray-800">{displayName}</span>
{/* Confidence indicator for medium tier — subtle, inline */}
{confidence.tier === 'medium' && (
<ConfidenceIndicator confidence={confidence} lane={lane} />
)}
</div>
<span className={cn('text-sm font-semibold font-mono', scoreColor)}>
{score.toFixed(1)}/{maxScore}
</span>
</div>
{/* Progress bar */}
<div className="h-1.5 bg-gray-100 rounded-full overflow-hidden">
<div
className={cn(
'h-full rounded-full transition-all duration-700',
scorePercent >= 80 ? 'bg-green-500'
: scorePercent >= 60 ? 'bg-yellow-500'
: 'bg-red-400'
)}
style={{ width: `${scorePercent}%` }}
/>
</div>
{/* Feedback text */}
{feedbackText && (
<p className="text-sm text-gray-600">{feedbackText}</p>
)}
{/* Low confidence caveat — explicit, below feedback */}
{confidence.tier === 'low' && (
<ConfidenceIndicator confidence={confidence} lane={lane} />
)}
</div>
);
}
Anti-Patterns
❌ Anti-Pattern 1: Showing confidence badge on composite score
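{/* ❌ Wrong: confidence badge attached to the composite score */}
<div className="flex items-center gap-2">
<span className="text-2xl font-bold">{compositeScore}</span>
<ConfidenceBadge tier={overallConfidence} />
</div>
{/* ✅ Right: the composite score renders with no confidence indicator */}
<div>
<span className="text-2xl font-bold">{compositeScore}</span>
</div>
{/* Confidence appears only in the per-lane breakdown */}
{lanes.map(lane => (
<LaneScoreWithConfidence key={lane.id} {...lane} confidence={lane.confidence} />
))}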
❌ Anti-Pattern 2: Using "we're not sure" copy language
// ❌ Wrong: "not confident" framing casts doubt on the score itself
const caveatText = confidence.tier === 'low'
? "We're not fully confident in this score due to limited submission content."
: null;
// ✅ Right: "limited evidence" framing keeps the score's authority intact
const caveatText = confidence.tier === 'low'
? `Based on limited evidence in the ${laneName} lane. The submission had few clear examples for this dimension — the score reflects what was present.`
: confidence.tier === 'medium'
? `Score is based on moderate evidence in this lane.`
: null;
❌ Anti-Pattern 3: Showing confidence indicator when all lanes are high confidence
{/* ❌ Wrong: the section always renders, even when every lane is high confidence */}
<section>
<h3>Score Confidence</h3>
{lanes.map(lane => (
<ConfidenceIndicator key={lane.id} confidence={lane.confidence} />
))}
</section>
// ✅ Right: render the section only when at least one lane is medium or low
const hasAnyConfidenceToShow = judgeConfidence.hasAnyLowOrMedium;
return hasAnyConfidenceToShow ? (
<section>
<h3>Score Notes</h3>
{lanes.map(lane => (
<ConfidenceIndicator key={lane.id} confidence={lane.confidence} />
))}
</section>
) : null;
Common Failures to Catch in Review
| Failure | Symptom | Fix |
|---|---|---|
| Confidence badge shown on leaderboard rank position | "You ranked #3 (medium confidence)" — implies ranking itself is uncertain | Only show confidence in the detailed breakdown view, never in list/rank views |
| All three tiers rendered with visual badges | High confidence gets a green badge — wastes cognitive space, de-emphasizes real warnings | High = no render; medium = subtle icon only; low = explicit caveat |
| Copy says "uncertain" or "not sure" | Users question whether to trust the score value, not just the evidence | Use "limited evidence" framing; keep score authority intact |
| Confidence stored as a single float (0.0–1.0) per submission | Can't show per-lane confidence; must bucketize at render time which is lossy | Store as JSONB { "planning": "high", "execution": "low" } per lane |
| hasAnyLowOrMedium computed in TSX render instead of server | Triggers re-render cascade; complex conditional logic in JSX | Pre-compute in TypeScript layer, pass as boolean prop |
| Trigger doesn't handle UPDATE — only INSERT | Judge retry updates confidence_scores; old judge_lane_confidence rows persist | Use AFTER INSERT OR UPDATE OF confidence_scores in trigger definition |
| computeConfidenceTier called with specificityScore = undefined | Every specificity comparison evaluates to false, so no lane can reach 'high' and most fall through to 'medium' by accident | Default the value with specificityScore ?? 0 before the threshold checks |
| Low confidence caveat shown with empty string | Renders as <div> with no text — confusing whitespace | Guard: if (confidence.tier === 'low' && confidence.caveatText) before rendering caveat |
| Confidence copy hardcoded in JSX string literals | 47 different places have slight copy variations; impossible to audit | Co-locate all copy in components/confidence/copy.ts, import everywhere |
| Medium confidence uses same red color as low | Users treat all non-high scores as warnings; expand rate spikes on low-value lanes | Medium = gray/muted; Low = amber border box (never red — red implies error, not thin evidence) |
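
The undefined `specificityScore` failure in the table above can be guarded against by normalizing signals before they reach `computeConfidenceTier`. The sketch below is one possible approach, not the existing implementation: `LaneSignals`, `RawLaneSignals`, `finiteOr`, and `normalizeSignals` are hypothetical names, and defaulting missing values to 0 is an assumption.

```typescript
// Hypothetical name for the params shape used by computeConfidenceTier.
interface LaneSignals {
  evidenceCount: number;
  feedbackWordCount: number;
  specificityScore: number;
  hasDirectQuotes: boolean;
}

// Signals as they might arrive from an untyped source (e.g. parsed JSON):
// any field may be missing or null.
type RawLaneSignals = { [K in keyof LaneSignals]?: LaneSignals[K] | null };

// Returns the value when it is a finite number, otherwise the fallback.
function finiteOr(value: number | null | undefined, fallback: number): number {
  return typeof value === 'number' && isFinite(value) ? value : fallback;
}

// Assumption: a missing signal counts as 0 (or false for the quote flag).
function normalizeSignals(raw: RawLaneSignals): LaneSignals {
  return {
    evidenceCount: finiteOr(raw.evidenceCount, 0),
    feedbackWordCount: finiteOr(raw.feedbackWordCount, 0),
    specificityScore: finiteOr(raw.specificityScore, 0),
    hasDirectQuotes: raw.hasDirectQuotes === true,
  };
}

// A missing specificity score becomes 0 instead of undefined.
const normalized = normalizeSignals({ evidenceCount: 4, feedbackWordCount: 120 });
```

With a default of 0, the `specificityScore < 0.25` rule classifies the lane as low, which is a visible and conservative outcome instead of an accidental medium.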
Changelog
- 2026-03-31: Created for Bouts confidence layer design build