dataset-reproduce
// Deterministically rebuild a dataset from its manifest and verify fixity equivalence
| name | dataset-reproduce |
| description | Deterministically rebuild a dataset from its manifest and verify fixity equivalence |
| namespace | training-complete |
| category | publication |
| platforms | ["claude","copilot","cursor","factory","windsurf","warp","codex","opencode","openclaw","hermes"] |
| commandHint | {"argumentHint":"<manifest-path> [--compare-fixity] [--workdir <path>]"} |
Deterministically rebuild a training dataset from a published manifest and verify that the rebuild matches the original via SHA-256 fixity comparison. Used to validate that a dataset published by dataset-version is genuinely reproducible, and to let third parties reconstruct a dataset from its manifest without access to the original artifacts.
Per the ML Reproducibility Checklist (REF-475), a dataset is "reproducible" only if an independent rebuild from the manifest produces byte-identical fixity hashes. This skill is the verifier.
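As a minimal sketch of what "byte-identical" means here, the following builds a tiny dataset twice from the same seed and compares SHA-256 digests. The `build_examples` and `serialize` helpers are hypothetical stand-ins for illustration only; they are not part of this skill or of dataset-version, and the canonical JSONL serialization is an assumption.

```python
import hashlib
import json
import random
from typing import Dict, List


def build_examples(seed: int, n: int = 5) -> List[Dict[str, int]]:
    """Hypothetical stand-in for a seeded generation step."""
    rng = random.Random(seed)  # all randomness flows from the given seed
    return [{"id": i, "score": rng.randint(0, 999)} for i in range(n)]


def serialize(examples: List[Dict[str, int]]) -> bytes:
    """Assumed canonical JSONL: sorted keys and fixed separators keep bytes stable."""
    lines = [json.dumps(e, sort_keys=True, separators=(",", ":")) for e in examples]
    return ("\n".join(lines) + "\n").encode("utf-8")


def fixity(data: bytes) -> str:
    """SHA-256 hex digest of the serialized dataset."""
    return hashlib.sha256(data).hexdigest()


original = fixity(serialize(build_examples(seed=42)))
rebuild = fixity(serialize(build_examples(seed=42)))

print(original == rebuild)  # True: same seed, same bytes, same hash
```

Any source of variation that reaches the serialized bytes (key order, whitespace, an unseeded random call) changes the digest, which is why the comparison is a strict test of reproducibility.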
After dataset-version publishes, rebuild and compare to catch non-determinism before the dataset is used for training.

<manifest-path> (required): Path to a datasets/<version>.yaml manifest published by dataset-version. The YAML is authoritative; sibling .json exports are ignored by this skill.
--compare-fixity (optional): Compare the rebuilt SHA-256 hashes against the original fixity_manifest. Default: true. Disable only for partial rebuilds where comparison is not meaningful.
--workdir <path> (optional): Scratch directory for the rebuild. Default: .aiwg/training/reproduce/<version>-<timestamp>/. Must be empty; the skill refuses to overwrite.
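The --workdir rule (default location, must be empty, never overwrite) could be enforced along these lines. This is a sketch under assumptions: `resolve_workdir` is a hypothetical helper, and the UTC timestamp format is a guess, since the source does not specify how <timestamp> is rendered.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def resolve_workdir(version: str, workdir: Optional[str] = None) -> Path:
    """Return an empty scratch directory, refusing to reuse a non-empty one."""
    if workdir is None:
        # Assumed timestamp format; the skill's real format is not documented here.
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        path = Path(".aiwg/training/reproduce") / f"{version}-{stamp}"
    else:
        path = Path(workdir)

    # An existing file, or a directory with any entry in it, is rejected.
    if path.exists() and (not path.is_dir() or any(path.iterdir())):
        raise FileExistsError(f"workdir {path} is not empty; refusing to overwrite")

    path.mkdir(parents=True, exist_ok=True)
    return path
```

Calling `resolve_workdir("2026.4.0", "/tmp/repro-2026.4.0")` on a fresh path creates and returns it; calling it again after the rebuild has written files raises `FileExistsError` rather than mixing two rebuilds in one directory.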
1. Parse <manifest-path> as YAML; validate against the schema at @agentic/code/frameworks/training-complete/schemas/dataset-manifest.yaml. Resolve sources[], reproduction_recipe, seed, and split_counts.
2. Check reproduction_recipe.aiwg_version and training_complete_version against the current runtime. On mismatch, emit a WARN: reproducibility across versions is not guaranteed. Proceeding is allowed but flagged in the report.
3. For each entry in sources[], invoke acquire-training-source using the declared ref_id, license, and format. Fixity of each acquired source is checked against any upstream checksum; a mismatch is a hard failure (the source has drifted).
4. Replay reproduction_recipe step-for-step: generator_configs via synthetic-data-generator and example-synthesizer; preference_config via preference-generator; filter_thresholds via quality filters; decontamination_thresholds via decontamination-check. Format exports via the declared format_exports adapters.
5. Run every stochastic step with the manifest seed. No new entropy is introduced. The same seed + same inputs + same configs MUST produce the same outputs.
6. Invoke integrity-verification to emit a fresh SHA-256 manifest over the rebuilt dataset.
7. Compare the fresh manifest against the original fixity_manifest. Emit per-file match/mismatch with a summary verdict (MATCH, PARTIAL, MISMATCH). Write the report.

A mismatch does not always mean a bug. Known non-determinism sources to document in the report:

- reproduction_recipe.generator_configs
- created_at in a per-example record will always differ; these are excluded from fixity scope.

The report's "Mismatch Analysis" section classifies each divergence against this list.
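The manifest and comparison steps above can be sketched as follows. This is an illustration, not the integrity-verification skill's implementation: `fixity_manifest` and `compare_fixity` are hypothetical names, and the verdict thresholds are an assumption (MATCH when every file matches, PARTIAL when only some do, MISMATCH when none do), since the source does not define them.

```python
import hashlib
from pathlib import Path
from typing import Dict, Tuple


def fixity_manifest(root: Path) -> Dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 hex digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def compare_fixity(
    original: Dict[str, str], rebuilt: Dict[str, str]
) -> Tuple[str, Dict[str, str]]:
    """Return (verdict, per-file status); verdict rules are assumed, see above."""
    per_file = {}
    for name in sorted(set(original) | set(rebuilt)):
        # A file missing from either side counts as a mismatch.
        same = name in original and original.get(name) == rebuilt.get(name)
        per_file[name] = "match" if same else "mismatch"

    matches = sum(1 for status in per_file.values() if status == "match")
    if per_file and matches == len(per_file):
        verdict = "MATCH"
    elif matches > 0:
        verdict = "PARTIAL"
    else:
        verdict = "MISMATCH"
    return verdict, per_file


verdict, detail = compare_fixity(
    {"train.jsonl": "aa", "eval.jsonl": "bb"},
    {"train.jsonl": "aa", "eval.jsonl": "cc"},
)
print(verdict, detail)
```

Fields that are known to differ on every rebuild, such as created_at, would need to be stripped before hashing for this comparison to be meaningful; that normalization is not shown here.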
The report is written to reports/reproduce-<version>-<timestamp>.md, containing:

- Summary verdict (MATCH / PARTIAL / MISMATCH) with example counts

The rebuild depends on the following skills:

- acquire-training-source (training-complete)
- example-synthesizer, synthetic-data-generator (training-complete)
- preference-generator (training-complete)
- format-adapter-alpaca, format-adapter-sharegpt, format-adapter-chatml, format-adapter-jsonl, format-adapter-parquet
- decontamination-check (training-complete)
- @agentic/code/frameworks/media-curator/skills/integrity-verification/SKILL.md

# Self-verify a freshly-published dataset
dataset-reproduce datasets/2026.4.0.yaml
# Reproduce with explicit workdir and no fixity comparison (partial rebuild)
dataset-reproduce datasets/2026.4.0.yaml \
--workdir /tmp/repro-2026.4.0 \
--compare-fixity false
@agentic/code/frameworks/training-complete/schemas/dataset-manifest.yaml: manifest schema and validation rules