بنقرة واحدة
dataset-docs
// Generate Datasheet, Model Card, and Data Statement from a dataset manifest
// Generate Datasheet, Model Card, and Data Statement from a dataset manifest
| name | dataset-docs |
| description | Generate Datasheet, Model Card, and Data Statement from a dataset manifest |
| namespace | training-complete |
| category | publication |
| platforms | ["claude","copilot","cursor","factory","windsurf","warp","codex","opencode","openclaw","hermes"] |
| commandHint | {"argumentHint":"<manifest-path> [--type <datasheet|model-card|data-statement|all>] [--interactive]"} |
Generate standards-compliant dataset documentation — Datasheet (Gebru et al. 2021), Model Card (Mitchell et al. 2019), and Data Statement (Bender & Friedman 2018) — by auto-populating templates from a dataset manifest and related AIWG training artifacts.
Invoke this skill after dataset-version has produced a finalized manifest and the downstream artifacts (quality report, license ledger, decontamination report, provenance record) exist in .aiwg/training/. The skill produces the compliance documentation bundle required by ADR-022 D9 before a dataset is released or used to train a published model.
Typical trigger points:
| Parameter | Required | Default | Description |
|---|---|---|---|
manifest-path | yes | — | Path to the dataset manifest YAML (e.g., .aiwg/training/datasets/v1.2.0-manifest.yaml). |
--type | no | all | One of datasheet, model-card, data-statement, or all. |
--interactive | no | false | If set, prompt the operator to fill <!-- HUMAN FILL --> fields inline; otherwise leave markers for later review. |
manifest-path. Validate required top-level fields (dataset_name, version, modality, instance_count, license_id). Fail fast with a clear error if the manifest is malformed..aiwg/training/ gather the quality report, license ledger, decontamination report, and W3C PROV provenance record keyed by {{version}}. Record missing artifacts as warnings — do not hard-fail, but flag the affected template fields as unknown.--type, load one or more of:
templates/datasheet-for-datasets.mdtemplates/model-card.mdtemplates/data-statement.md{{field_name}} placeholders. Substitute values from the manifest and related artifacts. Target ≥60% of fields auto-filled on a typical well-instrumented dataset (per REF-451 feasibility study). Unresolved placeholders are replaced with UNKNOWN — see manifest rather than left literal.--interactive is set, prompt the operator once per <!-- HUMAN FILL --> marker using the platform-native UX tool (see native-ux-tools rule); otherwise leave the markers in place for downstream editorial review..aiwg/training/datasets/<version>-{datasheet,model-card,data-statement}.md. Update manifest.yaml with documentation: block pointing at the generated files. Append an entry to .aiwg/activity.log per the activity-log rule.The skill aims for the ≥60% auto-fill rate documented in REF-451 as the threshold at which datasheets become practical to maintain. Fields mapped from the manifest include dataset identity, composition counts, splits, source URLs, license, collection window, preprocessing pipeline references, IRB identifiers, retention policy, and provenance links. Fields requiring human judgment (bias analysis, intended users, ethical considerations, out-of-scope uses) remain explicit <!-- HUMAN FILL --> markers.
Generated datasheets validate against the HuggingFace dataset card schema (YAML frontmatter fields: license, language, task_categories, size_categories, pretty_name). The skill emits a post-write validation report listing any HuggingFace fields that could not be derived from the manifest so the operator can decide whether to source them manually before upload.
.aiwg/training/datasets/<version>-datasheet.md.aiwg/training/datasets/<version>-model-card.md.aiwg/training/datasets/<version>-data-statement.md.aiwg/training/datasets/<version>-manifest.yaml documentation block.aiwg/activity.logdataset-version (produces manifest), provenance-create (produces PROV record), grade-on-ingest (produces quality report).Deterministically rebuild a dataset from its manifest and verify fixity equivalence
Create a versioned training dataset with manifest, fixity, provenance, and archive snapshot
End-to-end training dataset pipeline — acquire sources through publication
Detect training-eval overlap against benchmark sets before dataset publication
Generate SFT training examples from raw sources using Self-Instruct / Evol-Instruct / SQuAD / STaR patterns
Convert canonical training examples to Alpaca format for training frameworks