[describe your coding/discovery task, candidate models, and what agreement you want to measure]
Model Council Voting: Panels of Language Models as Independent Coders
Instructions
A "council" runs the same labeling, scoring, or term-discovery task through several language models independently and reads their (dis)agreement as data. This skill sits on top of the single-model codebook-and-validation workflow in the text-classification skill: build and validate the codebook there first, then escalate to a council only when one model is not enough. It is a companion to topic-modeling (an independent, non-LLM method for cross-checking what a council finds), to llm-calibration-logprobs (per-item confidence from one model's token probabilities, a different signal than cross-model agreement), and to methods-reporting (the reporting standards the final write-up must meet).
1. When a Council Beats a Single Model
Use a council when the labeling decision is contested or ambiguous — categories with fuzzy boundaries, stance or frame coding, or constructs where reasonable coders disagree. The council's disagreement rate is itself a measurable property of the task, not noise to be averaged away.
Use a council for corpus-driven discovery where the output set is not fixed in advance — e.g., extracting which identity terms a corpus foregrounds. The user's 04_term_discovery/discovery/ollama_discover_terms.py runs a zero-shot extraction prompt over sampled text windows; requiring several model families to independently surface the same term (see §4) is what separates a real corpus signal from one model's idiosyncrasy.
Use a council for robustness reporting when results must survive a skeptical reader. Showing a finding holds across models from different training traditions answers the "would this replicate with a different model?" objection directly. Open-weight panels run locally also have lower and more predictable cross-run variance than proprietary APIs (Barrie, Palmer & Spirling 2025).
A council is overkill when the task is unambiguous and a single validated model already agrees with humans at the level your downstream analysis needs (validate first via the text-classification workflow). For high-volume, well-defined coding, the cost of N model passes plus the agreement bookkeeping rarely buys anything. Reserve the council for the genuinely contested decisions and the discovery steps.
A council is not a substitute for human validation (§7). N models agreeing tells you the label is reliable, not that it is correct (§5). Decide up front which role the council plays — reliability evidence, robustness check, or discovery filter — and say so in the write-up.
2. Assembling a Diverse Panel
Diversify training families and origins to decorrelate errors. The threat a council guards against is a shared blind spot: models trained on overlapping data or by the same lab tend to make the same mistakes, so they vote together for the wrong reason. The user's term-discovery ensemble deliberately spans four origins (EXAONE-Deep from LG AI Research, Korea; Aya Expanse from Cohere, Canada; Qwen2.5 from Alibaba, China; and Gemma 3 from Google, US) precisely because "none shares training data or architecture with the others in any direct way, which makes cross-family agreement a conservative test" (appendix_a.tex). Four checkpoints from one family make a near-useless council.
Aim for 3–6 jurors. Below 3 you cannot compute many-rater agreement (§5) or run a meaningful k-of-N rule (§4). Above ~6 the marginal decorrelation falls off and the bookkeeping grows. These bounds are house defaults, not a cited optimum; the binding constraint is family diversity, not raw count (§6).
Prefer open-weight models for reproducibility. Open weights can be pinned to an exact revision and re-run locally; proprietary APIs change underneath you and show high, unpredictable run-to-run variance even at temperature 0 (Barrie, Palmer & Spirling 2025). If a proprietary model is in the council, record its exact dated identifier and treat its votes as the least reproducible (vlm-ocr-pipeline makes the same point for OCR).
Set decoding to be as deterministic as the stack allows. Temperature 0 (greedy) is the default for coding tasks so a juror's vote does not wobble between runs. The user's discovery pipeline runs at temperature 0.3 with a fixed window-sampling seed and notes that "because decoding at non-zero temperature is not bit-for-bit deterministic, reproducibility is enforced at the level of the term set rather than the raw generations" (appendix_de.tex). That is the right move when you want sampling diversity within a window but still need a reproducible final set: pin the seed, fix the inputs, and let the consensus rule (not the raw generation) be the stability criterion.
Consider an optional reference coder. One stronger or domain-specialized model (or the human-coded gold sample of §7) can serve as a reference against which each juror's precision and recall are reported, as the user does in appendix_a.tex's per-model precision table (the Korean-primary EXAONE reaches 56% precision and 100% recall against the nine-term reference; the English-primary Gemma 3 misses two terms). The reference coder is a yardstick, not a tie-breaker — keep it out of the vote count itself, or you reintroduce a single point of failure.
Record the exact model tag for every juror. Give version, quantization, and revision, as appendix_de.tex does with exaone-deep:32b-q4_K_M, aya-expanse:32b, qwen2.5:32b, and gemma3:27b. Family-name-only reporting ("we used Qwen and Gemma") is not reproducible.
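The panel rules above (exact tags, one juror per family, 3 to 6 jurors) can be written down as a small manifest with a self-check. This is a minimal sketch, not the user's pipeline: the `Juror` dataclass, the `check_panel` helper, and the seed value are hypothetical; only the model tags, origins, and the 0.3 temperature come from the text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Juror:
    tag: str           # exact model tag, including size and quantization if any
    family: str        # training family, used for the diversity check
    origin: str        # developing lab and country
    open_weight: bool  # True if weights can be pinned and re-run locally


# Tags and origins are the ones named in the text; the field layout is assumed.
PANEL = [
    Juror("exaone-deep:32b-q4_K_M", "EXAONE", "LG AI Research, Korea", True),
    Juror("aya-expanse:32b", "Aya", "Cohere, Canada", True),
    Juror("qwen2.5:32b", "Qwen", "Alibaba, China", True),
    Juror("gemma3:27b", "Gemma", "Google, US", True),
]

# Shared decoding settings. Temperature is from the text; the seed is a placeholder.
DECODING = {"temperature": 0.3, "seed": 0}


def check_panel(panel, min_jurors=3, max_jurors=6):
    """Return a list of warnings; an empty list means the house defaults are met."""
    warnings = []
    if not min_jurors <= len(panel) <= max_jurors:
        warnings.append(f"panel size {len(panel)} outside {min_jurors}-{max_jurors}")
    for family, n in Counter(j.family for j in panel).items():
        if n > 1:
            warnings.append(f"{n} jurors share family {family!r}")
    for j in panel:
        if ":" not in j.tag:
            warnings.append(f"{j.tag!r} lacks a version tag")
    return warnings
```

Calling `check_panel(PANEL)` on the four-family panel returns an empty list; a panel of two Qwen checkpoints would trigger both the size and the shared-family warning.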
3. Keeping Votes Independent
Run each juror in isolation. No model may see another model's output, and there is no multi-round "discussion." The disagreement you are trying to measure only exists if the votes are cast independently; debate or chained prompting collapses it and manufactures a consensus that reflects persuasion order, not the corpus.
Do not pool jurors into one prompt. Asking a single model to "play the role of four experts" yields one model's guess at what four models would say — fully correlated by construction. Independence requires N separate inference runs.
Hold every shared input constant across jurors. Same prompt, same sampled windows, same seed, same post-processing. The user's pipeline gives all four models "the same Korean zero-shot prompt, the same adaptive window sampling, and the same wf ≥ 50 absolute floor" (appendix_a.tex); only the model varies, so any disagreement is attributable to the model and not to a moving input.
If you want a deliberation design, that is a different instrument. Multi-agent debate can improve a single answer, but it is not a panel of independent raters and its agreement statistics are not interpretable as inter-rater reliability. Do not report debate-derived consensus as if it were independent-coder agreement. Pick one design and name it.
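The independence requirement can be sketched as a loop in which the model tag is the only thing that varies. The code below is an illustration under stated assumptions: `run_council` and `fake_infer` are hypothetical names, and the inference backend is injected as a plain function so the sketch makes no claim about any particular inference API.

```python
from typing import Callable, Dict, List

# infer(model_tag, prompt, window, seed) -> list of labels or terms for one window.
InferFn = Callable[[str, str, str, int], List[str]]


def run_council(
    model_tags: List[str],
    prompt: str,
    windows: List[str],
    seed: int,
    infer: InferFn,
) -> Dict[str, List[List[str]]]:
    """Run each juror separately on identical inputs; return votes[tag][window_index]."""
    votes: Dict[str, List[List[str]]] = {}
    for tag in model_tags:
        # Every call sees only the shared prompt, window, and seed.
        # No juror's output is ever passed to another juror.
        votes[tag] = [infer(tag, prompt, window, seed) for window in windows]
    return votes


def fake_infer(tag: str, prompt: str, window: str, seed: int) -> List[str]:
    """Stand-in backend for demonstration: keeps words at or above a per-model length."""
    min_len = 4 if tag == "model-a" else 6
    return [tok for tok in window.split() if len(tok) >= min_len]


if __name__ == "__main__":
    result = run_council(
        ["model-a", "model-b"], "extract terms", ["alpha beta gamma", "delta epsilon"], 0, fake_infer
    )
    print(result)
```

Because the prompt, windows, and seed are passed through unchanged, any difference between `votes["model-a"]` and `votes["model-b"]` is attributable to the model alone.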
4. Consensus Rules
State a k-of-N rule before looking at outputs. The user's discovery study keeps a term only if "at least three of the four models selected it" (3-of-4); the classification study in CLASSIFICATION_FINDINGS.md treats agreement between two independent coders as the robustness check. Common defaults: 3-of-4, 4-of-6, or a simple majority. Fix k as a function of how conservative you need to be — higher k buys precision at the cost of recall (see the vote-rule sensitivity below). These specific k values are house conventions, not cited thresholds.
Prefer absolute floors over distribution-relative thresholds. A per-juror selection threshold expressed as mean + 2 SD is distorted whenever a few high-frequency items inflate the distribution. The user hit exactly this and switched to an absolute weighted-frequency floor (wf ≥ 50): the mean+2SD rule "is sensitive to high-frequency demonym terms that inflate the distribution in some model runs," whereas "an absolute floor yields more comparable cut-offs across models that differ in extraction volume" (appendix_a.tex, appendix_de.tex). When jurors differ in output volume, a relative cutoff silently moves the bar per juror; an absolute floor keeps the bar fixed and comparable.
Add category filters that encode theory, not vote count. A term can win 4-of-4 and still be excluded on principled grounds. The user excludes proper nouns (kingdoms, dynasties, the country itself) "regardless of vote count" because they "function as referential labels rather than contested conceptual constructs" — five terms reach 4/4 consensus but are filtered out as proper nouns (appendix_a.tex, the cross-model voting table). Decide which categories are eligible before counting votes, so the filter is a stated rule rather than a post-hoc rescue.
Run a sensitivity analysis on both dials. Report how the selected set changes as you vary (a) the per-juror floor and (b) k. The user's appendix_de.tex weighted-frequency sensitivity table shows the published wf ≥ 50 is "the highest value at which all nine final terms remain while also being the lowest value at which the only additional entrant is a general-register term," and that moving from 3-of-4 to unanimous 4-of-4 drops three substantive terms. A council whose output set swings wildly with a small threshold change is not a stable instrument — report the band over which the conclusion holds.
Treat principled abstention as informative, not as a tie. If jurors may return "none / insufficient evidence," count and report the abstention rate per juror; do not silently recode it as a vote (the text-classification skill makes the same point about NAs as informative missingness).
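The consensus rules in this section (an absolute per-juror floor, a k-of-N vote, a theory-driven exclusion list, and a sensitivity sweep over both dials) fit in a few lines. The sketch below is illustrative: `select_terms` and `sensitivity` are hypothetical helpers, and the scores in `TOY` are invented for the example, not taken from the user's tables.

```python
def select_terms(wf_by_juror, k, floor, excluded=frozenset()):
    """Keep terms that at least k jurors score at or above the absolute floor.

    wf_by_juror maps juror -> {term: weighted frequency}.
    excluded holds terms ruled out on theoretical grounds, whatever their vote count.
    """
    tally = {}
    for scores in wf_by_juror.values():
        for term, wf in scores.items():
            if wf >= floor:  # same fixed bar for every juror
                tally[term] = tally.get(term, 0) + 1
    return {t for t, n in tally.items() if n >= k and t not in excluded}


def sensitivity(wf_by_juror, ks, floors, excluded=frozenset()):
    """Selected set for every (k, floor) pair, to show the band over which results hold."""
    return {
        (k, floor): sorted(select_terms(wf_by_juror, k, floor, excluded))
        for k in ks
        for floor in floors
    }


# Toy scores, invented for illustration only.
TOY = {
    "juror_a": {"nation": 120, "people": 80, "KingdomX": 300, "citizen": 40},
    "juror_b": {"nation": 90, "people": 55, "KingdomX": 250},
    "juror_c": {"nation": 70, "people": 45, "KingdomX": 400, "citizen": 60},
    "juror_d": {"nation": 65, "KingdomX": 310, "citizen": 52},
}
PROPER_NOUNS = frozenset({"KingdomX"})

if __name__ == "__main__":
    for setting, terms in sensitivity(TOY, [3, 4], [40, 50], PROPER_NOUNS).items():
        print(setting, terms)
```

In the toy data, "KingdomX" wins 4-of-4 at every floor yet never enters the selected set because the exclusion is applied independently of the tally, and lowering the floor from 50 to 40 admits two extra terms under 3-of-4. That kind of swing is what the sensitivity report is meant to expose.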
5. Reading Agreement as Reliability, Not Validity
Agreement measures reliability — reproducibility of the coding — not validity. A council that agrees perfectly with itself can be perfectly, consistently wrong. The panel typically agrees with itself more than it agrees with humans, so high inter-model agreement must never be read as evidence the labels are correct (that is what §7 is for).
Report chance-corrected agreement, not raw percent agreement. Raw agreement is inflated by the base rate: two coders who each assign nearly every item to the majority class will agree most of the time by chance alone. Use a chance-corrected statistic matched to your design:
Cohen's κ for exactly two coders / two jurors (Cohen 1960). The user's CLASSIFICATION_FINDINGS.md reports overall agreement 80.9% with Cohen's κ = 0.730 between Llama 3.1 8B and Qwen 2.5 3B.
Fleiss' κ when you have 3 or more raters assigning nominal categories (Fleiss 1971) — the natural statistic for a council of 3+ jurors.
Krippendorff's α for ordinal labels or more than two coders, and when you have missing votes/abstentions; it generalizes across measurement levels and rater counts (Krippendorff 2004/2019). Prefer α when jurors abstain or when labels are ranked.
Interpret κ/α with the Landis & Koch (1977) bands, treating them as rough guides, not bright lines: < 0.00 poor, 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect. The user's κ = 0.730 lands in the "substantial" band. The bands are a convention from one paper, proposed for κ and applied to α only by analogy; report the raw κ/α value alongside the label so readers can judge.
Diagnose low agreement before trusting the consensus. Inspect which categories or items drive disagreement. In CLASSIFICATION_FINDINGS.md the lowest per-code agreement was civic_commitment at 66.5%; collapsing two overlapping codes raised overall agreement from 73.0% to 80.9% and κ from 0.634 to 0.730. Low council agreement usually signals a codebook problem (fix it in the text-classification workflow), not a model problem.
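To make the chance correction concrete, here is a plain-Python sketch of Cohen's κ (two coders), Fleiss' κ (three or more raters, nominal labels, no missing votes), and the Landis & Koch label lookup. The function names are hypothetical and the inputs are toy data; Krippendorff's α is left out because handling missing votes and ordinal distances needs more than a short example.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two coders' labels over the same items."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[c] * cb[c] for c in set(ca) | set(cb)) / (n * n)  # chance agreement
    return float("nan") if p_e == 1 else (p_o - p_e) / (1 - p_e)


def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings is one list of labels per item, same rater count each."""
    n_items, n_raters = len(ratings), len(ratings[0])
    totals, p_bar = Counter(), 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Share of rater pairs that agree on this item.
        p_bar += (sum(c * c for c in counts.values()) - n_raters) / (n_raters * (n_raters - 1))
    p_bar /= n_items
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    return float("nan") if p_e == 1 else (p_bar - p_e) / (1 - p_e)


def landis_koch(value):
    """Map a coefficient to the Landis & Koch (1977) label quoted in the text."""
    if value < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial")]:
        if value <= upper:
            return label
    return "almost perfect"


if __name__ == "__main__":
    # Base-rate trap: 90% raw agreement, yet kappa is 0 because one coder never varies.
    coder_1 = ["x"] * 9 + ["y"]
    coder_2 = ["x"] * 10
    print(cohen_kappa(coder_1, coder_2))
    print(landis_koch(0.730))
```

The two toy coders agree on 9 of 10 items, but their chance agreement is also 0.9, so κ comes out at 0: the raw figure reflected the base rate, not coding skill.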
6. The Correlated-Errors Caveat
N jurors carry fewer than N independent votes when their errors are correlated. A majority vote only beats the best single classifier when members err independently; when members are dependent, the ensemble can be no better — or worse — than its best member (Kuncheva & Whitaker 2003). Counting heads as if each were an independent draw overstates how much evidence a consensus represents.
Models from the same family share blind spots. Four checkpoints of one model, or four models distilled from a common teacher, will tend to be wrong together — so their unanimous vote is closer to one vote than to four. This is the mechanism behind §2's insistence on family diversity: diversity is the lever that raises the effective number of votes toward the nominal N.
Watch for shared-error signatures. If two jurors miss or hallucinate the same items, treat them as partially redundant and down-weight their joint vote, or report the council with and without one of the correlated pair. The user's per-model unique-selection analysis (appendix_a.tex) — examining which terms each model alone chose — is the kind of diagnostic that surfaces shared vs. idiosyncratic behavior.
Report a diversity check, not just an agreement number. High agreement with low family diversity is weak evidence (correlated jurors); high agreement across diverse families is strong evidence. State the families represented so a reader can judge the effective independence. (The general dependence result is well established; treating it as a precise "effective sample size" calculation for LLM juries goes beyond what can be cleanly cited, so report the diversity check rather than a single effective-N number.)
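One way to look for shared-error signatures is to compare jurors' error sets pairwise against a reference. The sketch below assumes a reference set exists (for instance the human-coded gold sample); `error_set` and `shared_error_overlap` are hypothetical helpers, and Jaccard overlap is one reasonable choice of diagnostic, not a statistic the source prescribes.

```python
from itertools import combinations


def error_set(selected, reference):
    """Items a juror got wrong, tagged as false positive ('fp') or miss."""
    false_pos = {("fp", t) for t in selected - reference}
    misses = {("miss", t) for t in reference - selected}
    return false_pos | misses


def shared_error_overlap(selections, reference):
    """Jaccard overlap of error sets for every juror pair (1.0 means identical errors)."""
    errors = {juror: error_set(sel, reference) for juror, sel in selections.items()}
    overlap = {}
    for a, b in combinations(sorted(errors), 2):
        union = errors[a] | errors[b]
        overlap[(a, b)] = len(errors[a] & errors[b]) / len(union) if union else 0.0
    return overlap


if __name__ == "__main__":
    # Toy selections, invented for illustration.
    reference = {"t1", "t2", "t3"}
    selections = {
        "j1": {"t1", "t2", "x"},        # misses t3, hallucinates x
        "j2": {"t1", "t2", "x"},        # same two errors as j1
        "j3": {"t1", "t2", "t3", "y"},  # one unrelated error
    }
    print(shared_error_overlap(selections, reference))
```

A pair with overlap near 1.0, like j1 and j2 here, is behaving as closer to one vote than two, which is the cue to report the council with and without one of them.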
7. Validating Beyond the Panel
The council must never grade its own work. Reliability among models says nothing about correctness, so validate against at least one source outside the council.
Human-coded gold sample. Hand-code a stratified sample (the text-classification skill specifies 50–100 items, two independent human coders, Cohen's κ or Krippendorff's α for inter-coder reliability) and report each juror's and the consensus's precision/recall/F1 against it. This is the only step that speaks to validity.
An independent, non-LLM method. Cross-check council output against a method that imposes no LLM prior. The user runs BERTopic and LDA over the same corpus and asks whether they independently recover the council's nine terms: BERTopic recovers 9/9, LDA 5/9, with the LDA misses explained by known properties of document-level bag-of-words modeling (appendix_a.tex, the term-recovery table). For the topic-model side of this triangulation, see the topic-modeling skill; CLASSIFICATION_FINDINGS.md shows the parallel move, triangulating an LLM classifier against an STM ("two independent analytical approaches … converge on the same substantive story").
Triangulation is the goal. Convergence of a model council, a human sample, and an independent method is far stronger than any one alone. Where they diverge, report the divergence — it is usually substantively informative (e.g., LDA misses corpus-wide terms precisely because they are corpus-wide).
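Both out-of-council checks reduce to set comparisons: precision, recall, and F1 against the gold sample, and a recovery count against an independent method. A minimal sketch follows, with hypothetical function names and toy sets rather than the user's data.

```python
def prf(selected, gold):
    """Precision, recall, and F1 of a selected set against a gold set."""
    tp = len(selected & gold)
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def recovery(council_terms, method_terms):
    """How many council terms an independent method also surfaces, and which it misses."""
    found = council_terms & method_terms
    return len(found), len(council_terms), sorted(council_terms - method_terms)


if __name__ == "__main__":
    gold = {"a", "b", "c"}
    juror = {"a", "b", "x", "y"}
    print(prf(juror, gold))
    print(recovery({"a", "b", "c"}, {"a", "c", "z"}))
```

Running `prf` once per juror and once on the consensus set gives the per-juror and council rows of a validation table; `recovery` gives the "k of n recovered" figure together with the list of misses that the write-up should explain.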
8. Reporting and Reproducibility
Report every juror's exact tag, quantization, revision, decoding parameters, and seed (appendix_de.tex records all of these). Family-name-only reporting is not reproducible.
Publish the per-item / per-term vote table. The unit-level record of which juror voted which way is the core evidence; the user's appendix_a.tex cross-model voting table gives one row per term with a check/dash per model, the vote tally, and the final status. A reader must be able to see the votes, not just the aggregate.
Report the consensus rule, the absolute floors, the category filters, and the sensitivity bands (§4) so the selection is fully specified rather than merely described.
Report the agreement statistic with its method and band (§5): which coefficient (Cohen/Fleiss/Krippendorff), the value, the Landis–Koch label, and per-category agreement where relevant.
State the council's role explicitly — reliability evidence, robustness check, or discovery filter — and report the out-of-council validation (human sample and/or independent method, §7). Distinguish discovery from confirmation: if the codebook or term set was revised after seeing council output, report the revision trajectory, since undocumented post-hoc revision is a researcher degree of freedom that can inflate findings (Simmons, Nelson & Simonsohn 2011; Nosek et al. 2018).
Archive prompts, sampling seeds, the merge/voting code, and the raw per-juror generations so the council can be re-run. For the broader methods-section checklist (APSA/JARS/DA-RT), compose with the methods-reporting skill.
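The per-term vote table can be generated directly from the jurors' selections so that the published table and the archived votes cannot drift apart. This is a sketch under assumptions: `vote_table` and `to_csv` are hypothetical helpers, the mark characters and status labels are arbitrary, and the selections are toy data.

```python
import csv
import io


def vote_table(selections, k, excluded=frozenset()):
    """Build a header and one row per term: marks per juror, tally, final status."""
    jurors = sorted(selections)
    terms = sorted(set().union(*selections.values()))
    rows = []
    for term in terms:
        marks = ["x" if term in selections[j] else "-" for j in jurors]
        tally = marks.count("x")
        if term in excluded:
            status = "excluded"   # category filter applies regardless of votes
        elif tally >= k:
            status = "selected"
        else:
            status = "dropped"
        rows.append([term, *marks, f"{tally}/{len(jurors)}", status])
    return ["term", *jurors, "votes", "status"], rows


def to_csv(header, rows):
    """Serialize the table to CSV text for archiving alongside prompts and seeds."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    selections = {"m1": {"a", "b"}, "m2": {"a"}, "m3": {"a", "c"}}
    header, rows = vote_table(selections, k=2, excluded=frozenset({"c"}))
    print(to_csv(header, rows))
```

Keeping the status column in the same file as the raw marks lets a reader see at a glance which terms were dropped by the vote and which by a category filter.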
Quality Checks
Council justified over a single model: task is contested/ambiguous, discovery-driven, or needs robustness — not an unambiguous high-volume task already handled by a validated single model (escalated from text-classification)
Panel spans diverse training families/origins to decorrelate errors, not multiple checkpoints of one family (EXAONE/Aya/Qwen/Gemma-style spread; appendix_a.tex)
3–6 jurors, enough for a many-rater agreement statistic and a meaningful k-of-N rule (house default)
Open-weight jurors pinned to exact revisions where reproducibility matters; any proprietary juror's dated identifier recorded (Barrie, Palmer & Spirling 2025)
Decoding pinned: temperature 0 by default, or, when sampling at a non-zero temperature, a fixed seed with reproducibility enforced at the level of the final set (appendix_de.tex)
Votes cast independently — no cross-talk, no deliberation, no single-prompt role-play; one inference run per juror on identical inputs
Consensus rule (k-of-N) stated before inspecting outputs (e.g., 3-of-4; house convention)
Absolute floors used in place of distribution-relative (mean+2SD) thresholds when jurors differ in output volume (appendix_a.tex)
Theory-driven category filters (e.g., conceptual noun vs. proper noun) applied as stated rules independent of vote count (appendix_a.tex)
Sensitivity analysis reported on both the per-juror floor and k; stability band stated (appendix_de.tex, weighted-frequency sensitivity table)
Agreement read as reliability, not validity; panel-self-agreement not presented as correctness
Chance-corrected agreement reported with the right coefficient (Cohen for 2; Fleiss for ≥3 nominal; Krippendorff α for ordinal/>2/abstentions) and the Landis–Koch band (Cohen 1960; Fleiss 1971; Krippendorff 2004/2019; Landis & Koch 1977)
Correlated-error caveat addressed: family diversity documented; shared-error signatures inspected; effective vs. nominal votes acknowledged (Kuncheva & Whitaker 2003)
Validated beyond the panel: human-coded gold sample (precision/recall/F1) and/or an independent non-LLM method (topic-modeling); the council never grades its own work
Per-item/per-term vote table published; exact model tags, seeds, prompts, voting code, and raw generations archived; discovery vs. confirmation framing and any revision trajectory stated (Simmons et al. 2011; Nosek et al. 2018; compose with methods-reporting)