model-council-voting
LLM council/panel voting: multi-model coders, consensus rules, inter-rater agreement (kappa, alpha), correlated-error diagnostics.
| name | model-council-voting |
| description | LLM council/panel voting: multi-model coders, consensus rules, inter-rater agreement (kappa, alpha), correlated-error diagnostics. |
| argument-hint | [describe your coding/discovery task, candidate models, and what agreement you want to measure] |
A "council" runs the same labeling, scoring, or term-discovery task through several language models independently and reads their (dis)agreement as data. This skill sits on top of the single-model codebook-and-validation workflow in the text-classification skill: build and validate the codebook there first, then escalate to a council only when one model is not enough. It is a companion to topic-modeling (an independent, non-LLM method for cross-checking what a council finds), to llm-calibration-logprobs (per-item confidence from one model's token probabilities, a different signal than cross-model agreement), and to methods-reporting (the reporting standards the final write-up must meet).
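Reading agreement between two jurors as data usually means correcting raw agreement for chance. The block below is a minimal sketch of Cohen's κ for two jurors who labeled the same items, using only the standard library. The function name `cohens_kappa`, the juror lists, and the label values are hypothetical illustrations, not taken from the project files.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two jurors' labels on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both jurors gave the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the jurors labeled independently,
    # computed from each juror's marginal label counts.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both jurors used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels for eight items (assumption, for illustration only).
juror_1 = ["civic", "civic", "other", "other", "civic", "other", "civic", "other"]
juror_2 = ["civic", "civic", "other", "civic", "civic", "other", "other", "other"]

kappa = cohens_kappa(juror_1, juror_2)
print(f"kappa = {kappa:.3f}")
```

With more than two jurors, one option is to report κ for every pair; Krippendorff's α handles many coders and missing labels in a single statistic, but it is not implemented in this sketch.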
04_term_discovery/discovery/ollama_discover_terms.py runs a zero-shot extraction prompt over sampled text windows; requiring several model families to independently surface the same term (see §4) is what separates a real corpus signal from one model's idiosyncrasy.

… text-classification workflow). For high-volume, well-defined coding, the cost of N model passes plus the agreement bookkeeping rarely buys anything. Reserve the council for the genuinely contested decisions and the discovery steps.

… appendix_a.tex). Picking four checkpoints from one family is a near-useless council.

… vlm-ocr-pipeline makes the same point for OCR).

… appendix_de.tex). That is the right move when you want sampling diversity within a window but still need a reproducible final set: pin the seed, fix the inputs, and let the consensus rule (not the raw generation) be the stability criterion.

… appendix_a.tex's per-model precision table (the Korean-primary EXAONE reaches 56% precision and 100% recall against the nine-term reference; the English-primary Gemma 3 misses two terms). The reference coder is a yardstick, not a tie-breaker; keep it out of the vote count itself, or you reintroduce a single point of failure.

… exaone-deep:32b-q4_K_M, aya-expanse:32b, qwen2.5:32b, gemma3:27b in appendix_de.tex. Family-name-only reporting ("we used Qwen and Gemma") is not reproducible.

… appendix_a.tex); only the model varies, so any disagreement is attributable to the model and not to a moving input.

CLASSIFICATION_FINDINGS.md treats agreement between two independent coders as the robustness check. Common defaults: 3-of-4, 4-of-6, or a simple majority. Fix k as a function of how conservative you need to be: higher k buys precision at the cost of recall (see the vote-rule sensitivity below). These specific k values are house conventions, not cited thresholds.

… wf ≥ 50): the mean+2SD rule "is sensitive to high-frequency demonym terms that inflate the distribution in some model runs," whereas "an absolute floor yields more comparable cut-offs across models that differ in extraction volume" (appendix_a.tex, appendix_de.tex). When jurors differ in output volume, a relative cutoff silently moves the bar per juror; an absolute floor keeps the bar fixed and comparable.

… appendix_a.tex, the cross-model voting table). Decide which categories are eligible before counting votes, so the filter is a stated rule rather than a post-hoc rescue.

The appendix_de.tex weighted-frequency sensitivity table shows the published wf ≥ 50 is "the highest value at which all nine final terms remain while also being the lowest value at which the only additional entrant is a general-register term," and that moving from 3-of-4 to unanimous 4-of-4 drops three substantive terms. A council whose output set swings wildly with a small threshold change is not a stable instrument; report the band over which the conclusion holds.

… text-classification skill makes the same point about NAs as informative missingness).

CLASSIFICATION_FINDINGS.md reports overall agreement 80.9% with Cohen's κ = 0.730 between Llama 3.1 8B and Qwen 2.5 3B.

… CLASSIFICATION_FINDINGS.md the lowest per-code agreement was civic_commitment at 66.5%; collapsing two overlapping codes raised overall agreement from 73.0% to 80.9% and κ from 0.634 to 0.730. Low council agreement usually signals a codebook problem (fix it in the text-classification workflow), not a model problem.

… appendix_a.tex), examining which terms each model alone chose, is the kind of diagnostic that surfaces shared vs. idiosyncratic behavior.

… text-classification skill specifies 50–100 items, two independent human coders, Cohen's κ or Krippendorff's α for inter-coder reliability) and report each juror's and the consensus's precision/recall/F1 against it. This is the only step that speaks to validity.

… appendix_a.tex, the term-recovery table). For the topic-model side of this triangulation, see the topic-modeling skill; CLASSIFICATION_FINDINGS.md shows the parallel move, triangulating an LLM classifier against an STM ("two independent analytical approaches … converge on the same substantive story").

… appendix_de.tex records all of these). Family-name-only reporting is not reproducible.

The appendix_a.tex cross-model voting table gives one row per term with a check/dash per model, the vote tally, and the final status. A reader must be able to see the votes, not just the aggregate.

… methods-reporting skill.

… text-classification)

… appendix_a.tex)

… appendix_de.tex)

… appendix_a.tex)

… appendix_a.tex)

… appendix_de.tex, weighted-frequency sensitivity table)

… topic-modeling); the council never grades its own work

… methods-reporting)
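The k-of-n consensus rule, the absolute weighted-frequency floor, and the vote-rule sensitivity check described in this skill can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's script: the function names (`apply_floor`, `consensus`, `sensitivity`), the four juror names, and every term and frequency value are hypothetical.

```python
from collections import Counter


def apply_floor(weighted_freq, floor):
    """Keep terms whose weighted frequency meets an absolute floor."""
    return {term for term, wf in weighted_freq.items() if wf >= floor}


def consensus(terms_by_model, k):
    """Return terms surfaced by at least k jurors (one vote per model per term)."""
    tally = Counter()
    for terms in terms_by_model.values():
        tally.update(set(terms))
    return {term for term, votes in tally.items() if votes >= k}


def sensitivity(wf_by_model, floors, ks):
    """Consensus set for every (floor, k) pair, to show where the result is stable."""
    grid = {}
    for floor in floors:
        # The same absolute floor is applied to every juror, so the bar
        # does not move with each model's extraction volume.
        passed = {m: apply_floor(wf, floor) for m, wf in wf_by_model.items()}
        for k in ks:
            grid[(floor, k)] = consensus(passed, k)
    return grid


# Hypothetical jurors, terms, and weighted frequencies (illustration only).
WF_BY_MODEL = {
    "model_a": {"alpha": 120, "beta": 80, "gamma": 55, "delta": 45},
    "model_b": {"alpha": 90, "beta": 60, "gamma": 40},
    "model_c": {"alpha": 200, "beta": 52, "delta": 70},
    "model_d": {"alpha": 75, "gamma": 65, "beta": 30},
}

grid = sensitivity(WF_BY_MODEL, floors=[40, 50, 60], ks=[3, 4])
for (floor, k), terms in sorted(grid.items()):
    print(f"wf >= {floor}, {k}-of-4: {sorted(terms)}")
```

Printing the whole grid rather than a single cell is the point of the exercise: a reader can see over which range of floors and vote rules the consensus set stays the same, and where it starts to shed or admit terms.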