decontamination-check
// Detect training-eval overlap against benchmark sets before dataset publication
| Field | Value |
|---|---|
| name | decontamination-check |
| description | Detect training-eval overlap against benchmark sets before dataset publication |
| namespace | training-complete |
| category | quality |
| platforms | ["claude","copilot","cursor","factory","windsurf","warp","codex","opencode","openclaw","hermes"] |
| commandHint | {"argumentHint":"<dataset-version> [--targets <file>] [--mode <exact-ngram\|fuzzy\|semantic>] [--threshold <n>]"} |
Detect overlap between training examples and benchmark evaluation sets (MMLU, GSM8K, HumanEval, HELM, MT-Bench, AlpacaEval) before a dataset version is published. Produces a per-target overlap report and feeds the decontamination-gate lint rule.
Per ADR-022 D8, decontamination is a first-class pipeline stage — not an optional post-hoc check. Eval execution (running benchmarks against trained models) is out of scope for this skill; that work is delegated to the separate matric-eval project (see #849 and docs/matric-eval-integration.md).
dataset-version seals a new dataset release

<dataset-version> (required)
Either a dataset version ID (e.g., v2026.4) or a filesystem path to a directory of example records.

--targets <file> (optional)
Path to a decontamination-targets.yaml file. Default: training-complete/schemas/decontamination-targets.yaml, which ships with the 6 default targets below. User-declared targets are unioned with defaults (not replaced) unless the config explicitly sets override_defaults: true.

--mode <exact-ngram|fuzzy|semantic> (optional)
Detection strategy. Default: exact-ngram.

- exact-ngram: hash-based n-gram overlap with configurable N. Fast and deterministic. Recommended default.
- fuzzy: edit-distance-based overlap (Levenshtein ratio). Catches near-duplicates with minor paraphrasing, whitespace drift, or typo variants.
- semantic: embedding cosine similarity threshold (default ≥ 0.95). Catches deeper paraphrases and translation variants. Requires an embedding model; slowest mode.

--threshold <n> (optional)
Maximum acceptable overlap count per target. Default: 0 (zero tolerance). Override per-target in the targets config.

--ngram-size <n> (optional)
N-gram size for exact-ngram mode. Default: 13 per REF-442 convention.

--report <path> (optional)
Output report path. Default: .aiwg/training/reports/decontamination-<version>.md.
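
To make the exact-ngram mode and the --threshold check concrete, here is a minimal sketch, not the skill's actual implementation. It assumes whitespace tokenization with lowercasing as the normalization step and SHA-1 as the hash; the function names (`ngram_hashes`, `count_overlaps`, `passes`) are hypothetical and chosen only for illustration.

```python
import hashlib


def normalize(text: str) -> list[str]:
    """Assumed normalization: lowercase, then split on any whitespace run."""
    return text.lower().split()


def ngram_hashes(text: str, n: int = 13) -> set[str]:
    """Return the set of hashed word n-grams in `text` (empty if fewer than n tokens)."""
    tokens = normalize(text)
    return {
        hashlib.sha1(" ".join(tokens[i:i + n]).encode("utf-8")).hexdigest()
        for i in range(len(tokens) - n + 1)
    }


def count_overlaps(examples: list[str], eval_items: list[str], n: int = 13) -> int:
    """Count training examples that share at least one n-gram with the eval set."""
    eval_hashes: set[str] = set()
    for item in eval_items:
        eval_hashes |= ngram_hashes(item, n)
    return sum(1 for example in examples if ngram_hashes(example, n) & eval_hashes)


def passes(overlap_count: int, threshold: int = 0) -> bool:
    """A target passes when its overlap count does not exceed the threshold."""
    return overlap_count <= threshold


if __name__ == "__main__":
    eval_items = [
        "the quick brown fox jumps over the lazy dog and then runs far away home"
    ]
    examples = [
        "Note: The  quick brown fox jumps over the lazy dog and then runs far away today",
        "completely unrelated sentence about cooking pasta",
    ]
    overlaps = count_overlaps(examples, eval_items)
    print(overlaps, passes(overlaps))
```

Hashing each n-gram keeps the eval-side index compact and makes the comparison a set intersection, which is why this mode is deterministic. With the default threshold of 0, a single shared 13-gram is enough to fail a target.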
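1. Load the target set (defaults unioned with any user-declared targets, unless override_defaults: true).
2. Load examples from <dataset-version> via the memory-ingest consumer interface; normalize whitespace, casing per target config.
3. Load each target's eval set (eval_set_path).
4. Compute overlap per target in the selected mode and compare the count against --threshold.
5. Render templates/decontamination-report.md with per-target pass/fail and top-10 overlap samples.
6. Append a decontamination-check event to .aiwg/activity.log via memory-log-append. The lint rule (#843) consumes this event.

Ships in schemas/decontamination-targets.yaml: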
| ID | Name | Source |
|---|---|---|
| MMLU | Massive Multitask Language Understanding | Hendrycks et al. 2021 |
| GSM8K | Grade School Math 8K | Cobbe et al. 2021 |
| HumanEval | Code synthesis benchmark | Chen et al. 2021 |
| HELM | Holistic Evaluation of Language Models | Liang et al. 2022 |
| MT-Bench | Multi-turn chat benchmark | Zheng et al. 2023 |
| AlpacaEval | Instruction-following leaderboard | REF-450 |
User-declared targets are unioned into this set. Eval set data itself is NOT shipped — each target's eval_set_path points at a HuggingFace dataset identifier or local path that the operator must provision.
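
The union-versus-override behaviour described above can be sketched as follows. This is an illustration under assumptions, not the shipped loader: the `targets` and `id` keys, the `resolve_targets` and `unprovisioned` helpers, and the rule that a user entry replaces a default with the same ID are all hypothetical. Only `override_defaults` and `eval_set_path` come from the text.

```python
def resolve_targets(user_config: dict, defaults: list[dict]) -> list[dict]:
    """Return the effective target list for a run.

    Defaults and user-declared targets are unioned unless the user config
    sets override_defaults: true, in which case only user targets are used.
    """
    user_targets = user_config.get("targets", [])  # "targets" key is an assumption
    if user_config.get("override_defaults", False):
        return list(user_targets)

    # Key by ID so the union has no duplicates; dicts keep insertion order.
    merged = {target["id"]: target for target in defaults}
    for target in user_targets:
        merged[target["id"]] = target  # assumption: user entry wins on an ID clash
    return list(merged.values())


def unprovisioned(targets: list[dict]) -> list[str]:
    """List target IDs whose eval_set_path is missing or empty."""
    return [t["id"] for t in targets if not t.get("eval_set_path")]


if __name__ == "__main__":
    defaults = [
        {"id": "MMLU", "eval_set_path": ""},   # placeholder: operator must provision
        {"id": "GSM8K", "eval_set_path": ""},  # placeholder: operator must provision
    ]
    user_config = {
        "targets": [{"id": "INTERNAL-QA", "eval_set_path": "data/internal-qa.jsonl"}]
    }
    effective = resolve_targets(user_config, defaults)
    print([t["id"] for t in effective])
    print(unprovisioned(effective))
```

Because eval set data is not shipped, a pre-flight check along the lines of `unprovisioned` is one way to surface targets that would otherwise be silently skipped.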
This skill detects contamination only. Running benchmark evaluations against a trained model is delegated to the separate matric-eval project (#849). See docs/matric-eval-integration.md for the integration contract. The two projects share the same target config schema to keep "what we checked for leakage" and "what we evaluate against" in lockstep.
.aiwg/training/reports/decontamination-<version>.md rendered from templates/decontamination-report.md, containing:
- test/fixtures/decontamination/seeded-overlap.jsonl)
- exact-ngram mode, 13-gram)
- decontamination-check log event and blocks dataset-version publication on failure

# Default: check v2026.4 against shipped target set
decontamination-check v2026.4
# Fuzzy mode with custom threshold
decontamination-check v2026.4 --mode fuzzy --threshold 2
# Semantic mode with user-extended targets
decontamination-check v2026.4 --mode semantic --targets config/my-targets.yaml
# On-demand check of a directory of examples
decontamination-check examples/candidate-batch/ --ngram-size 13

- @agentic/code/frameworks/training-complete/schemas/decontamination-targets.yaml: target set schema and defaults
- @agentic/code/frameworks/training-complete/schemas/example-record.yaml: candidate example record format
- @agentic/code/frameworks/training-complete/templates/decontamination-report.md: output report template
- matric-eval project (#849)
- @agentic/code/addons/semantic-memory/skills/memory-lint/SKILL.md
- @agentic/code/addons/semantic-memory/skills/memory-log-append/SKILL.md
- @agentic/code/addons/aiwg-utils/rules/human-authorization.md