flow-dataset-build
// End-to-end training dataset pipeline — acquire sources through publication
| name | flow-dataset-build |
| description | End-to-end training dataset pipeline — acquire sources through publication |
| namespace | training-complete |
| category | flow |
| platforms | ["claude","copilot","cursor","factory","windsurf","warp","codex","opencode","openclaw","hermes"] |
| commandHint | {"argumentHint":"<config-file> [--stages <stage1,stage2,...>] [--dry-run] [--version <v>] [--interactive]"} |
End-to-end orchestrator that runs the full corpus-to-dataset pipeline in a single invocation, chaining every downstream training-complete skill from acquisition through published dataset version.
Do NOT use this flow for ad hoc experiments; invoke individual stage skills directly when iterating on a single stage.
Arguments:

- <config-file>: path to the pipeline config YAML specifying sources, synthesis patterns, preference mode, format targets, and decontamination targets (see schema below)
- --stages <list>: comma-separated subset of stages to run (default: all). Example: --stages acquire,quality-assess
- --dry-run: simulate every stage; validate config, print intended actions, write no artifacts
- --version <v>: target version string for the published dataset (overrides the config.version_pattern default)
- --interactive: pause at every stage boundary for human approval (per @native-ux-tools rule)
- --continue-on-warn: do not block on WARNING-level lint findings (default: strict; ERROR always blocks)
- --acknowledge-license-risk: bypass the license-check gate on ERROR (requires explicit acknowledgement in the pipeline report)
- --acknowledge-contamination: bypass the decontamination gate on ERROR (same requirement)

Stages:

1. acquire: runs acquire-training-source once per source declared in config.sources. Writes the raw corpus to .aiwg/training/working/<run-id>/raw/.
2. quality-assess: runs example-quality-assess against sources and raw examples. Emits GRADE-style quality scores per source and per example.
3. license-check: blocks on ERROR unless --acknowledge-license-risk is supplied. Incompatible licenses fail the pipeline here.
4. synthesize: runs example-synthesizer if config.synthesis declares patterns. Optional; skipped silently if absent.
5. synthetic-bulk: runs synthetic-data-generator if config.synthetic_generator_config is present. Optional.
6. preference: runs preference-generator if config.preference_generation declares a mode (DPO/RLHF/constitutional). Optional.
7. format: runs one format adapter per entry in config.format_exports: any of alpaca, sharegpt, chatml, jsonl, parquet. Each adapter is a separate skill.
8. decontamination: runs decontamination-check against config.decontamination_targets plus default targets (MMLU, HumanEval, GSM8K, HellaSwag, TruthfulQA, ARC, Winogrande).
9. decontamination-gate: blocks on ERROR unless --acknowledge-contamination is supplied.
10. publish: runs dataset-version. Creates manifest, SHA-256 fixity, W3C PROV provenance record, and archive snapshot at datasets/<version>.yaml.

Stage selection:

- --stages acquire,quality-assess runs only those two stages.
- Running a later stage without an earlier one (e.g., format without quality-assess) is permitted but flagged.
- Set skip_stages: [synthetic-bulk] in the config to default-skip a stage every run.
- Stage order: acquire → quality-assess → license-check → {synthesize, synthetic-bulk, preference} → format → decontamination → decontamination-gate → publish.

Per @human-authorization rule, the pipeline pauses for explicit human approval at these points:

- At every stage boundary when --interactive is set.

Gates use the platform-native question tool when available (AskUserQuestion on Claude Code; fallback to formatted stdout elsewhere).
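The stage order and the --stages / skip_stages behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the flow's actual implementation: the function name resolve_stages, the OPTIONAL set, and the choice that an explicit --stages list overrides skip_stages are all assumptions made for the example.

```python
# Stage names in the order given by the flow description.
CANONICAL_ORDER = [
    "acquire", "quality-assess", "license-check",
    "synthesize", "synthetic-bulk", "preference",
    "format", "decontamination", "decontamination-gate", "publish",
]

# Stages the description marks as optional (assumed never to trigger a flag).
OPTIONAL = {"synthesize", "synthetic-bulk", "preference"}


def resolve_stages(requested=None, skip_stages=()):
    """Return (stages_to_run, warnings), always in canonical order.

    Assumption: skip_stages only applies when no explicit list is requested.
    """
    if requested is None:
        selected = [s for s in CANONICAL_ORDER if s not in skip_stages]
    else:
        unknown = [s for s in requested if s not in CANONICAL_ORDER]
        if unknown:
            raise ValueError(f"unknown stage(s): {', '.join(unknown)}")
        selected = [s for s in CANONICAL_ORDER if s in requested]

    # Permitted but flagged: a non-optional earlier stage that was left out.
    warnings = []
    if selected:
        last = CANONICAL_ORDER.index(selected[-1])
        for stage in CANONICAL_ORDER[:last]:
            if stage not in selected and stage not in OPTIONAL:
                warnings.append(f"{stage} skipped before {selected[-1]}")
    return selected, warnings


if __name__ == "__main__":
    print(resolve_stages(["format", "acquire"]))
```

Returning warnings instead of raising keeps the "permitted but flagged" behaviour: the caller decides whether to print them or stop.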
# pipeline-config.yaml
version_pattern: "v{major}.{minor}.{patch}"  # overridden by --version
split_ratios:
  train: 0.9
  validation: 0.05
  test: 0.05
sources:
  - uri: "hf://datasets/example/source1"
    license: "apache-2.0"
  - uri: "https://example.com/corpus.jsonl"
    license: "cc-by-4.0"
license_policy:
  allowlist: ["apache-2.0", "mit", "cc-by-4.0", "cc0-1.0"]
  blocklist: ["cc-by-nc-4.0", "proprietary"]
synthesis:  # optional; omit to skip stage 4
  patterns: ["qa-rewrite", "chain-of-thought"]
  max_examples: 5000
synthetic_generator_config:  # optional; omit to skip stage 5
  backend: "openai:gpt-4o"
  target_count: 10000
preference_generation:  # optional; omit to skip stage 6
  mode: "dpo"  # dpo | rlhf | constitutional
  rater_model: "claude-opus-4"
format_exports: ["alpaca", "jsonl", "parquet"]
decontamination_targets:
  - "mmlu"
  - "humaneval"
  - "custom:./eval/internal-holdout.jsonl"
skip_stages: []  # e.g., ["synthetic-bulk"] to default-skip
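As a rough illustration of what validating this config and applying the license gate might involve, the sketch below checks that split_ratios sums to 1 and compares each source license against license_policy. The helper names are hypothetical. The severity mapping (blocklisted license is an ERROR, a license missing from the allowlist is a WARNING) is an assumption; the source only states that incompatible licenses fail the pipeline, that ERROR can be bypassed with an acknowledgement flag, and that WARNING blocks unless --continue-on-warn is set.

```python
import math

# Values copied from the example config; only the fields checked here.
config = {
    "split_ratios": {"train": 0.9, "validation": 0.05, "test": 0.05},
    "sources": [
        {"uri": "hf://datasets/example/source1", "license": "apache-2.0"},
        {"uri": "https://example.com/corpus.jsonl", "license": "cc-by-4.0"},
    ],
    "license_policy": {
        "allowlist": ["apache-2.0", "mit", "cc-by-4.0", "cc0-1.0"],
        "blocklist": ["cc-by-nc-4.0", "proprietary"],
    },
}


def split_ratios_ok(split_ratios):
    """True when the ratios sum to 1, tolerating float rounding."""
    return math.isclose(sum(split_ratios.values()), 1.0, rel_tol=1e-9)


def license_findings(sources, policy):
    """Return (uri, severity, message) tuples; severity mapping is assumed."""
    findings = []
    for src in sources:
        lic = src["license"].lower()
        if lic in policy["blocklist"]:
            findings.append((src["uri"], "ERROR", f"{lic} is blocklisted"))
        elif lic not in policy["allowlist"]:
            findings.append((src["uri"], "WARNING", f"{lic} is not on the allowlist"))
    return findings


def gate_blocks(findings, acknowledged=False, continue_on_warn=False):
    """Strict by default: ERROR needs an acknowledgement, WARNING needs continue_on_warn."""
    severities = {severity for _, severity, _ in findings}
    if "ERROR" in severities and not acknowledged:
        return True
    if "WARNING" in severities and not continue_on_warn:
        return True
    return False


if __name__ == "__main__":
    found = license_findings(config["sources"], config["license_policy"])
    print(split_ratios_ok(config["split_ratios"]), found, gate_blocks(found))
```

Here acknowledged stands in for --acknowledge-license-risk; the same gate_blocks shape would fit the decontamination gate with --acknowledge-contamination.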
Error handling and resumption:

- --continue-on-warn downgrades WARNING-level lint findings to informational; ERRORs still abort.
- On failure, partial artifacts are kept under .aiwg/training/working/<run-id>/ for post-mortem and resumption.
- To resume, rerun with --stages <remaining-stages> pointing at the same run-id.

Logging and outputs:

- Stage events are appended via memory-log-append to the run-scoped log at .aiwg/training/working/<run-id>/events.jsonl.
- A pipeline report is written to .aiwg/training/reports/pipeline-<version>-<timestamp>.md summarizing each stage, gate decisions, authorization records, and final artifact pointers.
- The datasets/<version>.yaml manifest plus sibling outputs (provenance, fixity, archive snapshot) are produced by dataset-version.
- Per @activity-log rule, one line is appended to .aiwg/activity.log.

# Full pipeline
aiwg flow-dataset-build ./configs/instruct-v3.yaml --version v3.1.0
# Subset: acquire and quality-assess only (for quick iteration; add --dry-run to simulate)
aiwg flow-dataset-build ./configs/instruct-v3.yaml \
--stages acquire,quality-assess
# Dry-run the whole pipeline to validate config before committing compute
aiwg flow-dataset-build ./configs/instruct-v3.yaml \
--dry-run --version v3.1.0-rc.1
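Resuming a failed run means rerunning with --stages set to the stages that have not completed. The sketch below shows one way to derive that argument from a list of completed stages. The helper names are hypothetical, and how the run-id is passed to the command is not specified in this document, so it is deliberately left out of the generated argument list.

```python
# Stage names in the order given by the flow description.
CANONICAL_ORDER = [
    "acquire", "quality-assess", "license-check",
    "synthesize", "synthetic-bulk", "preference",
    "format", "decontamination", "decontamination-gate", "publish",
]


def remaining_stages(completed, planned=None):
    """Stages still to run, in canonical order.

    Assumption: a stage that failed is not in `completed`, so it is rerun.
    """
    planned = CANONICAL_ORDER if planned is None else planned
    done = set(completed)
    return [s for s in CANONICAL_ORDER if s in planned and s not in done]


def resume_command(config_path, completed, planned=None):
    """Build the argv for a resumed run, or None when nothing is left."""
    stages = remaining_stages(completed, planned)
    if not stages:
        return None
    return ["aiwg", "flow-dataset-build", config_path, "--stages", ",".join(stages)]


if __name__ == "__main__":
    done = ["acquire", "quality-assess", "license-check"]
    print(" ".join(resume_command("./configs/instruct-v3.yaml", done)))
```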
Downstream skills invoked by this flow:
- acquire-training-source: stage 1
- example-quality-assess: stage 2
- example-synthesizer: stage 4
- synthetic-data-generator: stage 5
- preference-generator: stage 6
- format-adapter-alpaca, format-adapter-sharegpt, format-adapter-chatml, format-adapter-jsonl, format-adapter-parquet: stage 7
- decontamination-check: stage 8
- dataset-version: stage 10

Rules referenced by this flow:

- @human-authorization rule: authorization gate requirements
- @native-ux-tools rule: interactive prompt patterns
- @activity-log rule: post-run logging requirement