| name | trust-calibration |
| description | Helping users form warranted trust in the AI — neither overtrust nor undertrust — through deliberate confidence and source signalling. |
Trust Calibration
Calibrated trust is the difference between an AI that augments user judgment and one that displaces it. Overtrust causes harm when the AI is wrong. Undertrust wastes the AI when it's right. Both failure modes are common, and neither shows up in standard accuracy metrics.
Designing for trust means giving users the information they need to update their trust appropriately, turn by turn.
What shapes user trust in the moment
- Surface confidence — how certain the AI sounds, regardless of whether it should
- Track record — prior interactions in this and previous sessions
- Stakes legibility — how clearly the user understands what could go wrong
- Source visibility — whether the AI shows reasoning, sources, or alternatives
- Persona fit — a "professional" persona gets more trust than a "friendly" one for the same content
These shape trust whether you design for them or not. Designing for them deliberately is what trust calibration is.
Trust failure modes
- Sycophancy-driven overtrust: AI tells the user what they want to hear; user trusts the agreement and acts on it
- Confidence-mismatch overtrust: AI sounds certain about something it shouldn't be (hallucinations, edge cases)
- Defensive undertrust: AI hedges everything ("might be", "could possibly") even when right; user tunes out the qualifier
- Authority-collapse undertrust: one wrong answer in a high-stakes context destroys trust for the whole product
- Trust laundering: low-confidence outputs presented with high-confidence formatting (bold headers, decisive bullets) — visual authority disconnected from epistemic authority
Calibration signals from the AI side
The AI shapes trust deliberately through:
- Confidence markers proportionate to actual epistemic state: "I'm fairly sure" / "I'd verify this" / "I don't know" — used because they're true, not as decoration
- Source attribution: "According to [X]" rather than unsourced assertion. Cite when possible; flag the gap when not.
- Alternative surfacing: "Two interpretations: A and B. I went with A because…" — shows the model's working
- Failure transparency: "I got that wrong earlier — here's the correction." Long-term trust gain at short-term cost.
- Capability fence-posting: "I can help with X but not Y." Defines the boundary so trust isn't tested in the wrong place.
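A minimal sketch of how these signals could be assembled into a response prefix, assuming the product exposes some confidence estimate. The `EpistemicState` type, the thresholds, and the exact phrasings are illustrative placeholders, not a prescribed implementation:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EpistemicState:
    """Illustrative container for what the model actually knows about its answer."""
    confidence: float                                   # 0.0-1.0, however the product estimates it
    source: Optional[str] = None                        # citation, if one exists
    alternatives: list = field(default_factory=list)    # interpretations that were considered

def calibration_prefix(state: EpistemicState) -> str:
    """Choose confidence and source framing that is true of the epistemic state.

    Thresholds are placeholders to tune per product.
    """
    parts = []
    if state.confidence < 0.3:
        parts.append("Best guess only - I'd verify this before acting on it.")
    elif state.confidence < 0.7:
        parts.append("I'm fairly sure, but worth checking.")
    # High confidence: no hedge; the answer stands on its own.

    if state.source:
        parts.append(f"According to {state.source}.")
    elif state.confidence < 0.7:
        parts.append("I couldn't find a source for this.")  # flag the gap rather than assert

    if state.alternatives:
        parts.append("I also considered: " + ", ".join(state.alternatives) + ".")

    return " ".join(parts)
```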
Calibration signals from the user side
Trust is two-way. The AI also helps the user calibrate:
- Showing the cost of being wrong: "If this is wrong, you'd want to verify against [source] before [action]"
- Recommending verification thresholds: "Low-stakes: this is probably fine. High-stakes: double-check."
- Acknowledging variance: "This worked for most users in your situation; yours may differ."
Decision rules
- High stakes + low confidence → bias toward undertrust by default. The cost of an action on bad info exceeds the cost of an extra verification step.
- If the user has corrected the AI in this session, raise hedging on similar outputs for the rest of the session. Show the AI updating.
- Never inflate confidence to match the user's apparent expectation. Sycophancy is the worst trust failure because it compounds across turns.
- Prefer "I don't know" over a confident wrong answer. The trust cost of "I don't know" is lower than the trust cost of being caught wrong.
- If the AI must guess, flag it. "Best guess: X. Reasoning: Y. Confidence: low — verify if this matters."
- If the AI changes position based on user pushback, name the update. Silent flips destroy trust faster than disagreement does.
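A sketch of how the session-level rules above might be tracked, assuming a per-session state object. `SessionTrustState`, `choose_stance`, the topic-set model, and the 0.5 threshold are all illustrative choices, not fixed values:

```python
from dataclasses import dataclass, field

@dataclass
class SessionTrustState:
    """Tracks in-session signals that should shift how much the AI hedges."""
    user_corrections: int = 0                          # times the user corrected the AI this session
    corrected_topics: set = field(default_factory=set)

    def record_correction(self, topic: str) -> None:
        self.user_corrections += 1
        self.corrected_topics.add(topic)

    def should_raise_hedging(self, topic: str) -> bool:
        # Rule: once corrected, hedge more on similar outputs for the
        # rest of the session, and name the update explicitly.
        return topic in self.corrected_topics

def choose_stance(stakes: str, confidence: float,
                  session: SessionTrustState, topic: str) -> str:
    """Apply the decision rules above to pick a response stance."""
    if stakes == "high" and confidence < 0.5:
        return "decline_or_redirect"       # bias toward undertrust by default
    if confidence < 0.5:
        return "flagged_guess"             # "Best guess: X. Confidence: low."
    if session.should_raise_hedging(topic):
        return "answer_with_extra_hedge"   # show the AI updating after a correction
    return "direct_answer"
```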
Anti-patterns
- Universal hedging: every output ends with "but you should verify". The qualifier loses signal and reads as legal cover.
- Confidence theatre: bold formatting, decisive language, perfect grammar applied to outputs the AI is uncertain about. Visual confidence ≠ epistemic confidence.
- Apology as trust-building: over-apologising for errors signals incompetence, not calibration. One clean acknowledgement is more trustworthy than five.
- Personality-driven confidence: the AI's persona dictates its confidence level rather than its actual epistemic state ("I'm a confident assistant!").
- Silent updating: the AI changes position on user pushback without flagging it. Users notice; trust drops sharply.
- Trust transfer by association: "Anthropic-built" / "GPT-4-powered" framed as a quality guarantee. Pedigree ≠ correctness in a specific case.
When not to use this
- Low-stakes recreational AI (creative writing, brainstorming) where calibrated trust isn't the operating mode. Reach for tone-calibration or progressive-disclosure instead.
- Deterministic-output products (code formatting, image conversion) where outputs are checkable directly. Verification is cheap; trust calibration matters less.
- First-touch onboarding, where the user has no prior relationship. Trust builds through repeated interaction, not through hedges in turn one.
See also
- transparency-patterns — the operational mechanism (showing reasoning, sources, alternatives). Trust calibration is the outcome; transparency is one tool.
- error-personality — how the AI handles being wrong is the highest-leverage trust signal in the product.
- consent-and-agency — closely related: trust calibration without preserved user agency tips into manipulation.
- escalation-design — knowing when to hand off is itself a trust-building behaviour ("I won't pretend to handle this").
Design Artefacts
A trust calibration matrix:
| Stakes | AI confidence | Recommended language |
|---|---|---|
| Low | Low | "Best guess: X. Easy to check by [Y]." |
| Low | High | Direct answer, no hedge. |
| Medium | Low | "I'd suggest X based on [Y]. If your situation differs, talk to [Z]." |
| Medium | High | Direct answer + one-line source. |
| High | Low | "I'm not the right tool for this. Try [Z]." |
| High | High | Direct answer + source + recommended verification step. |
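One way to keep these templates consistent across a product is a plain lookup keyed on stakes and confidence. This sketch mirrors the table above; `RESPONSE_TEMPLATES`, `render_response`, and the placeholder fields are illustrative names, not a required API:

```python
# The matrix above as a lookup table. Placeholder fields ({answer}, {source},
# {check}, {expert}) are filled per response.
RESPONSE_TEMPLATES = {
    ("low", "low"):     "Best guess: {answer}. Easy to check by {check}.",
    ("low", "high"):    "{answer}",
    ("medium", "low"):  "I'd suggest {answer} based on {source}. "
                        "If your situation differs, talk to {expert}.",
    ("medium", "high"): "{answer} (Source: {source}.)",
    ("high", "low"):    "I'm not the right tool for this. Try {expert}.",
    ("high", "high"):   "{answer} (Source: {source}. Recommended check: {check}.)",
}

def render_response(stakes: str, confidence: str, **fields) -> str:
    """Render the recommended language for a given stakes/confidence cell."""
    return RESPONSE_TEMPLATES[(stakes, confidence)].format(**fields)

# Example: high stakes, low confidence - redirect rather than answer.
print(render_response("high", "low", expert="your prescriber or a pharmacist"))
```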
Worked example:
- User asks an AI assistant: "is it safe to take ibuprofen with my blood thinner?"
- Stakes: high. AI confidence: low (medical question without patient record).
- Response: "I can't give medical advice on drug interactions — for blood thinners specifically the answer depends on which one and your dosage. Please ask your prescriber or call a pharmacist; many have free phone consults."
Other artefacts:
- A confidence-language rubric: a banned-phrase list ("I'm 100% sure", "always", "never" without justification) and an approved-phrase set tied to confidence levels (a rough check sketch follows this list)
- Sycophancy stress tests: inputs designed to elicit agreement; measure whether expressed confidence inflates when the user pushes
- A trust-loss audit: classify sessions where users push back as "AI right, user wrong" / "AI wrong, user right" / "ambiguous". Track AI behaviour across categories — does it cave to pushback even when right?
- Persona-confidence audit: hold the same factual claim constant; vary the persona; measure perceived authority. Tune until persona doesn't drive trust independent of content.
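A rough sketch of the confidence-language rubric check from the first bullet above. The phrase lists and function names are placeholders for the product's own rubric, and a naive regex cannot tell a justified "always" from an unjustified one, so flagged matches still need human review:

```python
import re

# Example phrase lists - replace with the product's own rubric.
BANNED_PATTERNS = [r"\bI'm 100% sure\b", r"\balways\b", r"\bnever\b"]
APPROVED_LOW_CONFIDENCE = ["best guess", "i'd verify", "i don't know", "worth checking"]

def lint_output(text: str, declared_confidence: str) -> list:
    """Return rubric issues for one AI output given its declared confidence level."""
    issues = []
    for pattern in BANNED_PATTERNS:
        if re.search(pattern, text, re.IGNORECASE):
            # "always"/"never" are only banned without justification; a regex
            # can't check justification, so treat matches as review flags.
            issues.append(f"banned phrase matched: {pattern}")
    if declared_confidence == "low" and not any(
        phrase in text.lower() for phrase in APPROVED_LOW_CONFIDENCE
    ):
        issues.append("low-confidence output carries no approved hedge marker")
    return issues
```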
Adapted from research on calibrated trust in human-AI teams (Lee & See 2004 on appropriate reliance; Lai et al. on trust in machine learning systems).