llm-calibration-logprobs
LLM token logprobs and calibration: per-decision confidence, ECE, Brier, reliability diagrams, low-confidence triage.
| name | llm-calibration-logprobs |
| description | LLM token logprobs and calibration: per-decision confidence, ECE, Brier, reliability diagrams, low-confidence triage. |
This skill covers within-model confidence: how sure a single model is about each decision it makes, read off the token log-probabilities it emits, and whether that internal confidence is calibrated against ground truth. It pairs with the model-council-voting skill, which handles the complement — between-coder agreement across several independent models. Use both: one model's high self-reported confidence on an item, and three models independently agreeing on that item, are different kinds of evidence, and a careful pipeline reports both. This skill does not cover codebook design or human-validation statistics (κ, F1) — those live in the text-classification skill; cross-reference it rather than re-deriving them here.
A token logprob is log P(token | context); exponentiating gives a probability in [0, 1]. A logprob of 0.0 means probability 1.0 (the model treated that token as certain); -2.30 means probability ≈ 0.10. This is the model's own assessment of how likely that token was, conditional on everything before it.

On the OpenAI API, request logprobs with logprobs=True and top_logprobs=K (K ≤ 20 at time of writing); the response carries a per-token list with .token, .logprob, and a .top_logprobs list of the K most likely alternatives at that position. vLLM exposes the same fields through its OpenAI-compatible server (logprobs=K). Ollama returns per-token logprobs through its API as well. analysis/classify_openai.py in the open-text surveys project is a worked example: it sets api_kwargs["logprobs"] = True; api_kwargs["top_logprobs"] = 5 and stashes resp.choices[0].logprobs.content for every response.

Use temperature=0 and a fixed seed (that script uses temperature=0, seed=42). Temperature 0 makes the label deterministic on most backends, but it does not by itself make logprob values bit-identical across hosted-API calls (see §6). Record both settings.

A label such as civic_commitment tokenizes into multiple sub-word pieces (civic, _comm, itment, …). You must combine their logprobs into one per-item confidence. The robust way to find the right tokens is to reconstruct the full output string, locate the label inside it (e.g., the value after "code":), build a character-offset map for each token, and select exactly the tokens overlapping the label's character span. This is what extract_code_logprobs() does in classify_openai.py, rather than guessing token indices. Aggregation choices:
- Sum: the summed token logprobs equal log P(whole label string), the joint sequence probability. This is the principled "how likely was this exact label" number but is length-confounded: longer labels score lower simply by having more factors, so it is only comparable across labels of similar token length. classify_openai.py stores this as logprob_code_sum.
- Mean: the sum divided by the number of label tokens, which normalizes for label length.
- First token: the logprob of the first label token does not grow with label length. classify_openai.py stores logprob_first_token for exactly this reason. Use it only after checking that your labels actually disambiguate early: if two codes share a long prefix, the first token is uninformative.
- Top alternatives: keep the top_logprobs list at the first label position (the top_alternatives_json column in classify_openai.py). The gap between the chosen token and the runner-up, the margin, is a separate and often more discriminating uncertainty signal than the absolute logprob (see §3).

Constrain the output to a fixed schema (for example, JSON with a "code" field) so you can regex-locate the label. Free-form prose makes the label-token span ambiguous. This is the same structured-output discipline the text-classification skill recommends for parsing.

Convert each logprob to a probability and assign confidence tiers on p = exp(logprob):
- p ≥ 0.90 → auto-accept (high confidence)
- 0.65 ≤ p < 0.90 → accept but flag for spot-check
- p < 0.65 → route to human review (low confidence)

A small margin (p_top − p_second < 0.10) marks a genuine coin-toss between two specific labels. Margin and absolute confidence disagree often enough that flagging on either catches more true errors than flagging on absolute confidence alone. The runner-up token also tells you which alternative the model was torn between, which is diagnostic for codebook ambiguity.

This triage parallels the text-classification skill's hybrid workflow: there, items are flagged when models disagree; here, when a single model is internally unsure.

A model is calibrated if, among all items it labels with confidence p, a fraction p are actually correct. Confidence tiers are only trustworthy once you have measured this against held-out ground truth.
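The label-token extraction, aggregation, and tiering steps described so far can be sketched in a few functions. This is a minimal illustration, not the code in classify_openai.py: the function names, the regex for the "code" field, the token split, and all logprob values are made up for the example. The 0.90 / 0.65 tiers and the 0.10 margin threshold are the ones given in the text.

```python
import math
import re

def label_token_logprobs(tokens, logprobs, pattern=r'"code"\s*:\s*"([^"]+)"'):
    """Return (label, logprobs of the tokens overlapping the label's character span)."""
    text = "".join(tokens)                       # reconstruct the full output string
    match = re.search(pattern, text)
    if match is None:
        return None, []
    start, end = match.span(1)                   # character span of the label itself
    selected, offset = [], 0
    for tok, lp in zip(tokens, logprobs):
        tok_start, tok_end = offset, offset + len(tok)
        if tok_start < end and tok_end > start:  # token overlaps the label span
            selected.append(lp)
        offset = tok_end
    return match.group(1), selected

def aggregate(label_lps):
    """Sum (joint log-probability), mean (length-normalized), and first-token logprob."""
    total = sum(label_lps)
    return {"sum": total, "mean": total / len(label_lps), "first": label_lps[0]}

def tier(p, high=0.90, low=0.65):
    """Map a probability to the three confidence tiers."""
    if p >= high:
        return "auto-accept"
    if p >= low:
        return "spot-check"
    return "human-review"

def margin(top_alternatives):
    """p_top - p_second over (token, logprob) pairs at the first label position."""
    probs = sorted((math.exp(lp) for _, lp in top_alternatives), reverse=True)
    return probs[0] - probs[1] if len(probs) > 1 else probs[0]

# Made-up response: the token split and logprob values are illustrative only.
tokens = ['{"', 'code', '":', ' "', 'civic', '_comm', 'itment', '"}']
logprobs = [-0.01, -0.02, -0.01, -0.03, -0.20, -0.05, -0.01, -0.02]
top_alternatives = [("civic", -0.20), ("political", -1.90), ("other", -3.50)]

label, label_lps = label_token_logprobs(tokens, logprobs)
scores = aggregate(label_lps)
decision = tier(math.exp(scores["first"]))
# Flag on either signal: low absolute confidence or a small margin.
needs_review = decision == "human-review" or margin(top_alternatives) < 0.10
print(label, scores, decision, needs_review)
```

Here the first-token probability is about 0.82, so the item lands in the spot-check tier, and the wide margin over the runner-up means the margin rule does not flag it.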
Reliability diagram: bin items by predicted confidence (p), and plot mean predicted confidence (x) against empirical accuracy (y) per bin. Perfect calibration lies on the 45° diagonal. Points below the diagonal indicate overconfidence (the model claims more certainty than it earns); points above indicate underconfidence. Modern deep networks, including large transformers, are typically miscalibrated toward overconfidence (Guo et al. 2017).

Expected calibration error: ECE = Σ_b (n_b / N) · |acc(b) − conf(b)| (Guo et al. 2017), where n_b is the number of items in bin b, N is the total number of items, and acc(b) and conf(b) are the accuracy and mean confidence in that bin. Lower is better; report it with the bin count, since ECE is sensitive to binning. Treat ECE as a scalar companion to the diagram, never a replacement: two very different reliability curves can share an ECE.

Brier score: BS = (1/N) Σ (p_i − y_i)², where y_i is 1 if item i was labeled correctly and 0 otherwise. It is a proper scoring rule (it rewards honest probabilities) and jointly captures calibration and resolution (the spread of confidences), so it penalizes a model that hedges everything at 0.5. Report Brier alongside ECE; they answer different questions.

High within-model confidence and high between-model agreement (model-council-voting) can both look reassuring while the labels are jointly wrong: correlated model errors inflate both.

Temperature scaling: fitting a single scalar T on a held-out set to divide the logits before softmax often largely corrects ECE without changing the argmax label (Guo et al. 2017). Fit T on a calibration split, report ECE/Brier before and after, and never fit it on the same data you evaluate on.

… the text-classification skill flags for classifier-as-variable designs.

temperature=0 fixes the sampled token on most backends but does not guarantee bit-identical logprob values on hosted infrastructure (batching, kernel non-determinism, and silent model updates all perturb them). Pin the exact model version, record the run date, and re-measure calibration after any model update.
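The calibration metrics and the temperature-scaling step above can be sketched in plain Python. This is an illustration under stated assumptions, not a reference implementation: the data are made up, the bins are equal-width with the lower edge inclusive (other binning conventions give slightly different ECE values), and temperature scaling is applied to stored candidate-label logprobs as a stand-in for logits. That stand-in is exact only when the full distribution is available, and the grid search is a simple substitute for a proper optimizer.

```python
import math

def reliability_bins(conf, correct, n_bins=10):
    """Equal-width bins on [0, 1]; returns (count, mean_conf, accuracy) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(conf, correct):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    out = []
    for items in bins:
        if items:
            n = len(items)
            out.append((n, sum(p for p, _ in items) / n, sum(y for _, y in items) / n))
    return out

def ece(conf, correct, n_bins=10):
    """Expected calibration error: bin-size-weighted |accuracy - mean confidence|."""
    total = len(conf)
    return sum(n / total * abs(acc - mc) for n, mc, acc in reliability_bins(conf, correct, n_bins))

def brier(conf, correct):
    """Mean squared difference between confidence and the 0/1 correctness outcome."""
    return sum((p - y) ** 2 for p, y in zip(conf, correct)) / len(conf)

def softmax_t(logprobs, T):
    """Softmax of logprobs / T over the candidate labels (numerically stabilized)."""
    scaled = [lp / T for lp in logprobs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def fit_temperature(items, grid=None):
    """Grid-search T minimizing mean negative log-likelihood of the true label.
    items: list of (candidate_logprobs, index_of_true_label)."""
    grid = grid or [0.5 + 0.05 * i for i in range(91)]   # 0.5 .. 5.0
    def nll(T):
        return -sum(math.log(softmax_t(lps, T)[k]) for lps, k in items) / len(items)
    return min(grid, key=nll)

# Made-up validation data: confidence per item and whether the label was correct.
conf = [0.95, 0.95, 0.95, 0.95, 0.75, 0.75, 0.55, 0.55]
correct = [1, 1, 1, 0, 1, 0, 1, 0]
ece_value = ece(conf, correct)
brier_value = brier(conf, correct)

# Made-up calibration split: the model says 0.95 but is right only 3 times out of 4.
lps = [math.log(0.95), math.log(0.05)]
calibration_items = [(lps, 0)] * 3 + [(lps, 1)]
T_hat = fit_temperature(calibration_items)
p_scaled = softmax_t(lps, T_hat)[0]
print(ece_value, brier_value, T_hat, p_scaled)
```

In a real run, T would be fitted on a calibration split and ECE/Brier reported on a separate evaluation split. The toy example skips the split only to keep the arithmetic easy to follow.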
Open-weight models served locally (vLLM, Ollama) give far more stable and inspectable logprobs: hpc/DESKTOP_LOGPROBS.md runs Llama 3.1 8B and Qwen 2.5 3B locally via Ollama precisely to get reproducible per-token logprobs across models.

Report the decoding settings (temperature, seed, top_logprobs K), the output schema used to locate label tokens, and the aggregation function (first-token / sum / mean) with its rationale. Store the raw per-token logprobs and the top-K alternatives, not just the derived scalar; the logprob_first_token, logprob_code_sum, and top_alternatives_json columns in classify_openai.py are a reasonable minimal schema.

For reporting conventions, see the methods-reporting skill. For the between-model agreement statistics that complement this within-model confidence, use the model-council-voting skill; for codebook and human-validation metrics, the text-classification skill.

- Logprobs requested (logprobs=True, top_logprobs=K) and temperature=0 with a fixed seed recorded
- Label tokens located by character span, as in extract_code_logprobs()
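One way to keep the raw logprobs and run settings together is a per-item record like the one below. It is a hypothetical sketch: only logprob_first_token, logprob_code_sum, and top_alternatives_json are column names taken from the text. The class name, the remaining fields, and every value (including the model identifier) are placeholders chosen for illustration.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class LogprobRecord:
    item_id: str
    model_version: str              # exact pinned model identifier
    run_date: str                   # ISO date of the run
    temperature: float
    seed: int
    top_logprobs_k: int
    aggregation: str                # "first-token", "sum", or "mean"
    code: str                       # predicted label
    logprob_first_token: float
    logprob_code_sum: float
    top_alternatives_json: str      # JSON list of [token, logprob] at the first label position
    label_token_logprobs_json: str  # raw per-token logprobs over the label span

record = LogprobRecord(
    item_id="resp-0001",
    model_version="example-model-2024-01-01",   # placeholder, not a real model name
    run_date="2024-01-01",
    temperature=0.0,
    seed=42,
    top_logprobs_k=5,
    aggregation="first-token",
    code="civic_commitment",
    logprob_first_token=-0.20,
    logprob_code_sum=-0.26,
    top_alternatives_json=json.dumps([["civic", -0.20], ["political", -1.90]]),
    label_token_logprobs_json=json.dumps(
        [["civic", -0.20], ["_comm", -0.05], ["itment", -0.01]]
    ),
)

row = asdict(record)     # flat dict, ready for a CSV or JSONL writer
line = json.dumps(row)
print(line)
```

Storing the raw per-token values alongside the derived scalars means the aggregation choice can be revisited later without re-querying the model.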