| name | synthetic-data-generation |
| description | Complete reference for the SynthChat synthetic dataset generation system. Covers CLI commands (generate, improve, validate), scenario YAML authoring, rubric YAML authoring, settings configuration, evaluation, and full workflow. Use when generating datasets, writing rubrics/scenarios, configuring models/workers, improving dataset quality, or running evaluations. This skill is about USING the system via CLI and YAML — never modifying source code. |
| allowed-tools | Read, Bash, Write, Grep, Glob |
SynthChat: Synthetic Data Generation
Generate, improve, validate, sanitize, and evaluate synthetic training datasets via CLI and YAML configuration.
Quick Reference
| Task | Command |
|---|---|
| Generate dataset | python -m SynthChat.run generate [options] |
| Generate with prompt optimization overlay | python -m SynthChat.run generate --prompt-opt-config configs/prompt_optimization/NAME.yaml [options] |
| Standalone prompt optimization | python tuner.py prompt-optimize --prompt-opt-config configs/prompt_optimization/NAME.yaml |
| Generate with environment runtime checks | python -m SynthChat.run generate --env-backend local [options] |
| Generate with custom tool schema/rules | python -m SynthChat.run generate --env-backend local --env-tool-schema path/to/tool_schema.yaml --env-exec-config path/to/environment_execution.yaml [options] |
| Debug environment generation only | python -m SynthChat.run env-generate --scenario SCENARIO --debug-artifacts [path] [options] |
| Improve dataset | python -m SynthChat.run improve -i FILE [options] |
| Validate dataset | python -m SynthChat.run validate -i FILE [options] |
| Sanitize docs or JSONL | python -m SynthChat.run sanitize -i PATH --privacy-profile PROFILE [options] |
| Evaluate model | python -m Evaluator.cli --model NAME [options] |
| Structural check | python3 scripts/validate_syngen.py FILE |
| JSONL → Markdown | ./scripts/jsonl_to_markdown.sh data.jsonl |
| Combine datasets | ./scripts/combine_datasets.sh -o out.jsonl FILE1 FILE2 |
| Interactive menu | ./run.sh |
Key Directories
SynthChat/scenarios/ — Generation templates (6 files, ~30 scenarios)
SynthChat/rubrics/ — Quality rubrics (17 files)
SynthChat/config/ — settings.yaml, validation.yaml
Evaluator/config/environment_execution.yaml — Runtime action inference rules (config-driven)
Datasets/synthchat/ — Generated datasets go here (dry-runs and full runs)
SynthChat/interactions/ — Judge/improve logs
Progressive Reference
Load the specific reference you need:
| Reference | When to Load | Path |
|---|---|---|
| CLI Commands | Running generate/improve/validate/sanitize/eval | reference/cli-commands.md |
| Settings Config | Configuring providers, models, workers, targets | reference/settings-config.md |
| Scenario Authoring | Writing or modifying scenario YAMLs | reference/scenario-authoring.md |
| Rubric Authoring | Writing or modifying rubric YAMLs | reference/rubric-authoring.md |
| Testing Protocol | After creating/modifying scenarios or rubrics — MUST dry-run before full generation | reference/testing-protocol.md |
| Manual Editing | Hand-crafting individual dataset lines | reference/manual-editing.md |
For environment-backed multi-turn tool data, also load:
reference/scenario-authoring.md for config-first scenario structure
reference/testing-protocol.md for dry-run, raw artifact inspection, and failure triage
MANDATORY: Dry-Run Before Full Generation
NEVER go straight from writing a scenario/rubric to a full generation run.
After creating or modifying any scenario or rubric YAML:
- Dry-run 3-5 examples → show user → get feedback
- Iterate on YAML based on feedback
- Only after user approves → run full generation
See reference/testing-protocol.md for the full protocol and dry-run script.
Default Quality Gates
In this repo, the default assumptions for scenario authoring are:
- scenarios should include rubrics
- scenarios should use judge/final_judge where appropriate
- tool scenarios should use environment execution when practical
- environment-backed tool scenarios should default to a full repair loop:
initial assistant response -> response/schema judge -> response improver ->
response re-judge -> rerun environment with structured errors -> final judge
Do not assume generate is automatically running a judge just because
--max-iterations is set. That only matters when the scenario actually defines
rubrics/judge config. If a scenario has no rubrics, judge, or
final_judge, generation is just raw sampling plus any enabled environment
checks.
Treat judge-less scenarios as explicit smoke/plumbing exceptions, not the
default production pattern.
For environment-backed multi-turn tool data, response rubrics are not optional.
If an assistant turn fails response schema validation, the in-loop turn judge
will not run because the environment loop stops before executing that malformed
turn. The response rubric is the repair path that receives validation errors,
judge feedback, and environment issues, then produces the corrected assistant
message. The final_judge is a terminal acceptance gate, not the improver.
Environment-backed rows should persist the complete replay bundle by default:
generated fixture, assertions, resolved environment config, task context,
stage reviews, and enough source metadata to reconstruct the episode later.
When projecting rollouts into per-turn training rows, keep a pointer to the
canonical replay row or carry the replay bundle forward. A tool-turn-only
projection is useful for static supervised examples, but it is not enough for
later live environment replay unless the original environment artifact is still
available.
Use at least 3 retries/iterations for the response repair stages by default:
set CLI --max-iterations 3, keep scenario judge/final_judge max_retries: 3,
and make response rubrics strict enough to fail runtime misses such as missing
expected tools, malformed wrapper JSON, unsupported shell commands, or required
CLI arguments omitted by the model. If any stage fails, the default behavior
should be retry/repair with the structured failure context before accepting or
saving the example.
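The retry/repair default described above can be sketched as follows. This is a minimal illustration, not SynthChat's implementation: `run_env`, `judge`, and `improve` are hypothetical callables standing in for the environment rerun, the response judge, and the response improver, and the shape of the failure context is an assumption.

```python
from typing import Callable, Optional


def repair_loop(
    response: str,
    run_env: Callable[[str], list],       # hypothetical: structured environment errors
    judge: Callable[[str, list], list],   # hypothetical: judge failures for this response
    improve: Callable[[str, dict], str],  # hypothetical: returns a repaired response
    max_iterations: int = 3,
) -> Optional[str]:
    """Return an accepted response, or None when every attempt fails."""
    for attempt in range(1, max_iterations + 1):
        # Rerun the environment for the current response so the judge never
        # sees a stale failure from an earlier round.
        env_errors = run_env(response)
        judge_failures = judge(response, env_errors)
        if not env_errors and not judge_failures:
            return response
        if attempt == max_iterations:
            break
        # Hand the improver the structured failure context for this round.
        failure_context = {
            "attempt": attempt,
            "environment_errors": env_errors,
            "judge_feedback": judge_failures,
        }
        response = improve(response, failure_context)
    return None
```

Returning `None` rather than the last response keeps a failed example from being accepted silently; whether it is then dropped or saved as a negative is left to the caller.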
For multi-turn environment loops, keep single-response repair paths separate
from agentic rollouts. The normal path is:
- generate assistant turn
- if configured, return schema/format validation feedback to the model instead
of ending the episode before any environment step
- run the environment step
- show model-facing tool feedback
- run in-loop judge feedback when configured
- continue only when a tool call, recoverable runtime error, or failed judge
feedback requires another assistant action
- run final gates and final judge on the whole trajectory
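The normal path listed above can be sketched as a small driver loop. This is an illustrative sketch under stated assumptions: all four callables and the dictionary keys (`tool_called`, `recoverable_error`, `passed`) are hypothetical names, and the final gates and final judge are assumed to run afterwards on the returned trace.

```python
def run_episode(generate_turn, validate_schema, env_step, in_loop_judge, max_turns=6):
    """Drive one multi-turn episode and return its trace as a list of dicts."""
    trace = []
    feedback = None
    for _ in range(max_turns):
        turn = generate_turn(feedback)
        errors = validate_schema(turn)
        if errors:
            # Return validation feedback to the model instead of ending
            # the episode before any environment step.
            trace.append({"turn": turn, "schema_errors": errors})
            feedback = {"schema_errors": errors}
            continue
        result = env_step(turn)                # hypothetical result dict
        verdict = in_loop_judge(turn, result)  # hypothetical verdict dict
        trace.append({"turn": turn, "result": result, "verdict": verdict})
        needs_action = (
            result["tool_called"]
            or result["recoverable_error"]
            or not verdict["passed"]
        )
        if not needs_action:
            break  # nothing left to correct; do not force another turn
        feedback = {"tool_feedback": result, "judge_feedback": verdict}
    # Final gates and the final judge would evaluate this whole trace.
    return trace
```

The loop never rewrites earlier turns; every attempt, including schema failures, stays in the trace so the episode can be debugged from it.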
Do not let a post-environment single-response improver flatten a multi-turn
episode unless the scenario explicitly opts into that behavior. A failed
multi-turn episode should usually be debugged from the episode trace, not
rewritten as one assistant message that tries to complete every step at once.
Likewise, do not run post-loop response validation/improvement on successful
agentic rollouts unless the scenario explicitly opts in. The in-loop schema
validation, environment execution, in-loop judge, final gates, and final judge
should be the normal acceptance path. A response-stage improver after the loop
can accidentally rewrite the saved terminal message and corrupt an otherwise
valid trajectory.
For config-driven tool schemas, distinguish internal executor identifiers from
model-facing command names. The model prompt, tool feedback, judges, rubrics,
and scenario prose should use the configured user-facing tool surface. Internal
executor names may still appear in runtime records, labels, or diagnostics, but
they should not leak into generated conversations unless that is the configured
surface being trained.
Common Patterns
Generate with parallel workers:
python -m SynthChat.run generate --workers 4
Improve specific rubrics on a line range:
python -m SynthChat.run improve -i data.jsonl --rubrics thinking_quality,factuality --start-line 1 --end-line 50
Regenerate arbitrary rows from a JSONL file:
python -m SynthChat.run improve \
-i data.jsonl \
--rubrics prompt_tools \
--lines 7,12,20-25 \
--workers 8 \
-o Datasets/synthchat/regen_slice.jsonl
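As a rough illustration of what a value such as `--lines 7,12,20-25` selects, the sketch below expands a spec into line numbers. It assumes 1-based line numbers and inclusive ranges; `expand_line_spec` is a hypothetical helper, and the real CLI parser may differ in details such as duplicate or whitespace handling.

```python
def expand_line_spec(spec: str) -> list[int]:
    """Expand '7,12,20-25' into a sorted list of unique line numbers."""
    lines: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            lines.update(range(start, end + 1))  # ranges assumed inclusive
        else:
            lines.add(int(part))
    return sorted(lines)


print(expand_line_spec("7,12,20-25"))  # [7, 12, 20, 21, 22, 23, 24, 25]
```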
Use a checked-in line manifest for targeted regeneration:
python -m SynthChat.run improve \
-i data.jsonl \
--rubrics content_tools \
--line-file Datasets/tools_datasets/reports/cli_schema/regen_lines.txt \
--workers 12 \
-o Datasets/synthchat/regen_slice.jsonl
Switch provider/model at CLI:
python -m SynthChat.run generate --provider openrouter --model MODEL_ID
Generate with prompt optimization overlays:
python -m SynthChat.run generate \
--prompt-opt-config configs/prompt_optimization/synthchat_smoke.yaml \
--targets-file SynthChat/config/targets_cli_existing_tools_quickcheck.json \
--output Datasets/synthchat/prompt_opt_dryrun.jsonl
--prompt-opt-config runs the prompt optimizer as an optional pre-generation
step, loads the produced overlays.json, applies matching prompt overlays in
memory, and records the artifact path plus selected candidate ID in each
generated row's metadata. Use
--prompt-opt-artifact path/to/artifact_or_overlays.json to reuse an existing
artifact without rerunning optimization. This is opt-in only: SynthChat does not
write optimized prompt text back into scenario/config YAML.
Run prompt optimization standalone:
python tuner.py prompt-optimize \
--prompt-opt-config configs/prompt_optimization/labkit_epistemic_humility_evaluator_smoke.yaml
Prompt optimization is config-first. Keep prompt subjects, operators, evaluator
scenarios, objective metrics, and stopping thresholds in
configs/prompt_optimization/*.yaml; do not hardcode dataset-specific prompt
surfaces into runtime code. The practical evaluator target threshold is
stopping.target_score: 0.8.
The checked-in evaluator smoke
configs/prompt_optimization/labkit_epistemic_humility_evaluator_smoke.yaml
must remain CI/local safe with evaluation.evaluator.dry_run: true. It is sized
for a minimal real smoke (population_size: 3, max_generations: 1), but real
LM Studio/evaluator execution should be done with an explicit local override or
manual copy that flips dry_run outside the checked-in smoke config.
Generate from docs:
python -m SynthChat.run generate --docs "path/to/essays/" --scenarios essay_outline --per-doc 1
Generate from raw docs with privacy preprocessing:
python -m SynthChat.run generate --docs "tests/fixtures/privacy/raw_seed_docs" --targets-file SynthChat/config/targets_privacy_docs_smoke.json --privacy-profile realistic_pseudonyms
Sanitize a docs folder or JSONL dataset:
python -m SynthChat.run sanitize -i tests/fixtures/privacy/raw_seed_docs --privacy-profile realistic_pseudonyms -o tmp/privacy_docs
python -m SynthChat.run sanitize -i tests/fixtures/privacy/raw_seed_dataset.jsonl --privacy-profile mask_only -o tmp/privacy_dataset.jsonl
Dry-run a checked-in smoke target manifest:
python -m SynthChat.run generate \
--targets-file SynthChat/config/targets_cli_existing_tools_quickcheck.json \
--max-iterations 3 \
--output Datasets/synthchat/dryrun_cli_existing_tools_quickcheck.jsonl
Isolate generated environment setup before a full rollout:
python -m SynthChat.run env-generate \
--scenario scenario_key \
--provider openrouter \
--model MODEL_ID \
--llm-timeout 30 \
--max-retries 1 \
--max-tokens 2048 \
--output SynthChat/output/env_generation_debug.json \
--debug-artifacts SynthChat/output/env_generation_debug.debug_events.jsonl
Use this when a full multi-turn generation smoke appears stuck before the
first assistant turn. It exercises only the configured environment_generation
stage, writes the generated environment/resolved context bundle, and streams
raw debug events so request starts, retries, errors, reviews, and generated
keys are inspectable without running the agent loop. For hang triage, set
--max-retries 1 and a short --llm-timeout first; otherwise the configured
retry envelope can make one failing environment request look like a long stall.
Use --max-tokens when isolating whether latency is caused by a large
structured environment response.
If a provider-routed request stalls but minimal requests work, rerun with
--disable-provider-routing to distinguish scenario/schema problems from a
specific hosted provider route.
For generated environments, prefer deterministic stage gates before relying on
an LLM judge. Useful generic gates include json_schema with
schema: canonical_environment, no_placeholder_strings,
required_mapping_keys, and min_fixture_items. The judge should explain
semantic quality, but gates should reject malformed schemas, placeholders,
missing hidden anchors, and too-small fixtures.
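To make the gate names above concrete, here is a minimal sketch of the kind of check each one might perform. It borrows the gate names for its failure messages, but the function, the placeholder pattern, and the `fixture` key are assumptions for illustration; the real gates are configured in scenario YAML and may behave differently.

```python
import re

# Hypothetical placeholder markers; a real gate would take these from config.
PLACEHOLDER = re.compile(r"\b(TODO|TBD|placeholder|lorem ipsum)\b", re.IGNORECASE)


def iter_strings(value):
    """Yield every string found in a nested dict/list structure."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for key, item in value.items():
            yield from iter_strings(key)
            yield from iter_strings(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_strings(item)


def run_gates(env: dict, required_keys: list[str], min_fixture_items: int) -> list[str]:
    """Return a list of gate failures; an empty list means the gates passed."""
    failures = []
    missing = [key for key in required_keys if key not in env]
    if missing:
        failures.append(f"required_mapping_keys: missing {missing}")
    if any(PLACEHOLDER.search(text) for text in iter_strings(env)):
        failures.append("no_placeholder_strings: placeholder text found")
    fixture = env.get("fixture", [])  # "fixture" key is an assumed layout
    if len(fixture) < min_fixture_items:
        failures.append(f"min_fixture_items: {len(fixture)} < {min_fixture_items}")
    return failures
```

Because these checks are deterministic, a failure can be retried or reported without spending a judge call.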
Treat environment generation, assistant turn generation, in-loop judging, and
final judging as separate model-selection surfaces. It can be correct to use a
more reliable structured-output model for fixture/answer-key authoring and a
different model for the actual rollout being trained or evaluated. Keep that
split in scenario YAML (environment_generation.provider/model,
assistant_generation.provider/model, judge.*.provider/model,
final_judge.provider/model) rather than code. When changing stage models,
rerun env-generation-only first, then a small full smoke, because a pass in one
stage does not prove the other stages are healthy.
For OpenRouter or other routed providers, choose the environment-generation
response format per stage. Use response_format: json_object for loose dynamic
maps. Use response_format: json_schema when the scenario supplies an inline
schema that fully constrains required fields, allowed extra fields, command
counts, and ASCII/path rules; this can prevent malformed JSON and corrupted
hidden anchors before the agent loop starts. Response-healing plugins and
fallback models are useful diagnostics, but do not rely on them instead of
deterministic gates and retryable stage reviews.
Use the same format decision for assistant turns and in-loop judges. If strict
provider schema mode returns provider-internal artifacts, malformed nested
arrays, or long routed stalls, set assistant_generation.response_format: json_object or judge.in_loop.response_format: json_object in scenario YAML
and let the local schema validator, environment executor, and judge feedback
drive retries. Keep this as config, not scenario-specific parser logic.
When a tool trajectory fails, inspect the raw debug artifact before changing
scenario prose. Check the exact tool string emitted by the model, the
executor's parsed arguments, the tool_results, and the in-loop judge feedback.
Executor normalization can hide model mistakes such as non-ASCII whitespace or
markdown backticks unless those are rejected through config-driven validation
rules such as invalid_cli_patterns in the environment execution config.
Likewise, generated task_context.expected_command_sequence should be gated
against shell syntax or stale command examples when the trained surface is a
configured CLI/tool wrapper. Reject those through scenario gates/config, not
runtime string repairs.
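The idea of rejecting such mistakes on the raw tool string, before any normalization hides them, can be sketched like this. The pattern names and regexes are hypothetical examples; in SynthChat they would come from `invalid_cli_patterns` in the environment execution config, whose exact schema is not shown here.

```python
import re

# Hypothetical rules, keyed by a label that can be surfaced in feedback.
INVALID_CLI_PATTERNS = {
    "markdown_backtick": re.compile(r"`"),
    "non_ascii": re.compile(r"[^\x00-\x7F]"),
    "shell_operator": re.compile(r"\|\||&&|[|;><]"),
}


def check_raw_tool_string(raw: str) -> list[str]:
    """Return the labels of every rule the raw, un-normalized string violates."""
    return [
        label
        for label, pattern in INVALID_CLI_PATTERNS.items()
        if pattern.search(raw)
    ]


print(check_raw_tool_string("notes read --path a.md"))  # []
print(check_raw_tool_string("`notes read`"))            # ['markdown_backtick']
```

Running the check against the exact string the model emitted is the point: a non-breaking space or stray backtick is invisible once the executor has already parsed the arguments.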
Validate then fix:
python -m SynthChat.run validate -i Datasets/synthchat/data.jsonl --rubrics system_prompt_format
python -m SynthChat.run improve -i Datasets/synthchat/data.jsonl --rubrics system_prompt_format
Always save outputs to Datasets/synthchat/:
python -m SynthChat.run generate -o Datasets/synthchat/my_dataset.jsonl
Environment Variables
OPENROUTER_API_KEY=sk-or-...
LMSTUDIO_HOST=localhost
LMSTUDIO_PORT=1234
OLLAMA_HOST=http://localhost:11434
HF_TOKEN=hf_...
OPF_CHECKPOINT=/path/to/privacy-filter-checkpoint
TIKTOKEN_CACHE_DIR=/path/to/tiktoken-cache
VLLM_HOST=127.0.0.1
VLLM_PORT=8000
Privacy Setup
Use the privacy preprocess path when raw docs or JSONL may contain PII or secrets and you want SynthChat to sanitize that content before it reaches the generation/improvement model.
Runtime split:
openai/privacy-filter is the local span-detection/redaction model
opf is the runtime wrapper used to load and run that model
vllm is only for the optional post-sanitize llm_polish step, not for OPF itself
Profiles live in:
SynthChat/config/privacy_profiles.yaml
Global defaults live in:
SynthChat/config/settings.yaml under privacy_preprocess
Scenario-level opt-in lives in:
seed_data.preprocess_profile
Recommended first-use setup when auto-download is unreliable:
python - <<'PY'
from pathlib import Path
import shutil
from huggingface_hub import snapshot_download
target = Path(r"F:\Code\Toolset-Training\tmp\opf_privacy_filter")
if not target.exists():
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(
        repo_id="openai/privacy-filter",
        local_dir=str(target),
        allow_patterns=["original/*"],
    )
    original = target / "original"
    for path in original.iterdir():
        shutil.move(str(path), str(target / path.name))
    original.rmdir()
print(target)
PY
OPF also needs the o200k_base.tiktoken encoding file. If that does not auto-download cleanly, cache it locally:
New-Item -ItemType Directory -Force -Path F:\Code\Toolset-Training\tmp\tiktoken_cache | Out-Null
Invoke-WebRequest `
-Uri "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken" `
-OutFile "F:\Code\Toolset-Training\tmp\tiktoken_cache\fb374d419588a4632f3f557e76b4b70aebbca790"
Then set:
$env:OPF_CHECKPOINT="F:\Code\Toolset-Training\tmp\opf_privacy_filter"
$env:TIKTOKEN_CACHE_DIR="F:\Code\Toolset-Training\tmp\tiktoken_cache"
Once those are set, the real OPF-backed sanitize flow can run fully from local files.
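Before running the sanitize flow, a small preflight check along these lines can confirm the local files are in place. It is a sketch under one notable assumption: that tiktoken names cache entries by the SHA-1 hex digest of the download URL, which is consistent with the 40-character file name used above but should be verified against your installed tiktoken version. `tiktoken_cache_name` and `preflight` are hypothetical helpers, not part of SynthChat or OPF.

```python
import hashlib
from pathlib import Path

O200K_URL = "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"


def tiktoken_cache_name(url: str) -> str:
    """Assumed cache file name: SHA-1 hex digest of the source URL."""
    return hashlib.sha1(url.encode()).hexdigest()


def preflight(env) -> list[str]:
    """Return human-readable problems; an empty list means the setup looks ready."""
    problems = []
    checkpoint = env.get("OPF_CHECKPOINT")
    if not checkpoint or not Path(checkpoint).is_dir():
        problems.append("OPF_CHECKPOINT is unset or not a directory")
    cache_dir = env.get("TIKTOKEN_CACHE_DIR")
    cached = Path(cache_dir) / tiktoken_cache_name(O200K_URL) if cache_dir else None
    if cached is None or not cached.is_file():
        problems.append("TIKTOKEN_CACHE_DIR is unset or missing the o200k_base file")
    return problems
```

Call it as `preflight(os.environ)` (after `import os`) and print any returned problems before starting a long sanitize run.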
Config-Driven Architecture
SynthChat is fully config-driven — all tool-call formats, workspace structures, label mappings, and dataset-specific wrapper assumptions must be defined in YAML/config, not hardcoded in code.
Important discipline for this repo:
- the current CLI/tool wrapper is only one example dataset format, not a runtime truth
- do not encode wrapper names, top-level fields, or command assumptions in parser/executor/judge code unless that behavior is driven from config
- if a generation/eval issue seems specific to the current toy/example format, fix config, scenarios, rubrics, or format definitions first
- environment/runtime failures should be lifted into judge/improver context as structured payload data, not handled primarily with ad hoc format-specific code repairs
- prompt optimization overlays are an in-memory generation aid; promote them
into canonical YAML only after explicit human review
Key config files:
SynthChat/config/tool_call_formats.yaml — Tool-call response schemas (wrapper name, context fields, call structure)
SynthChat/config/workspace_formats.yaml — System prompt sections and structure
SynthChat/config/label_mappings.yaml — Issue classification and label rollups
SynthChat/config/settings.yaml — Generation settings, model config, output paths
To add a new tool-call format, add a named entry to tool_call_formats.yaml
and reference it from your scenario YAML. No code changes should be needed
unless you are adding a genuinely reusable runtime capability that cannot be
expressed in config.
Tips
- Use --workers 4 for parallel generation (each worker gets its own LLM client)
- Use improve --lines 3,8,10-15 or --line-file path.txt for targeted regeneration of arbitrary dataset rows
- Improve reports preserve original input line_number values even when you select a subset, so merge workflows can patch the source file deterministically
- Set save_failures: true in settings to keep failed examples as KTO negatives
- Interactions log in SynthChat/interactions/ shows judge/improve exchanges
- Progress checkpoints save to .synthchat_checkpoint.json on interruption
- Stop at the first error: kill the run early, fix, retest
- Environment traces are stored under example.metadata.environment when enabled
- When environment validation is enabled, make sure the active judge/improver path receives those errors through prompt variables/config rather than relying on format-specific code patches
- If a response-stage retry is driven by environment feedback, each retry round must be judged against a freshly rerun environment result. Do not carry a prior round's environment failure forward into later judgments after the response has changed.
- For multi-turn tool scenarios, inspect raw conversation_trace, metadata.environment.episode_trace, and SynthChat/interactions/*.jsonl before changing prompts. These usually reveal whether the problem is prompt rendering, tool syntax, environment execution, response repair, or final judging.
- If a model correctly answers after a successful environment pass, successful in-loop judge feedback should not force another assistant turn. Another turn should be requested only when the latest assistant action still needs correction.
- When feeding an invalid assistant tool response back to the model for repair, render prior tool calls as JSON, never Python repr or pseudo-JSON. Single-quoted dict/list strings in validation feedback teach the model to repeat malformed tool-call arguments.
- Local environment runtimes should use a controlled runtime temp root rather than relying on Python tempfile behavior if the host creates unwritable temp directories. Keep this generic and configurable; do not special-case a scenario.
- For non-default tool names, provide --env-tool-schema and --env-exec-config
- Prefer checked-in SynthChat/config/targets_*.json manifests over ad hoc inline JSON when running smoke tests or repeatable generation slices
- For privacy smoke tests, prefer the checked-in fixtures under tests/fixtures/privacy/, the target manifest SynthChat/config/targets_privacy_docs_smoke.json, and the runbook docs/plans/synthchat-privacy-smoke-runbook.md
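The tip above about rendering prior tool calls as JSON rather than Python repr can be illustrated with the standard library alone. The `prior_call` payload is a made-up example, not a SynthChat tool-call format.

```python
import json

# Hypothetical prior tool call that needs to be shown back to the model.
prior_call = {
    "tool": "notes",
    "arguments": {"path": "docs/plan.md", "tags": ["a", "b"], "dry_run": True},
}

as_repr = repr(prior_call)        # single quotes and True: not valid JSON
as_json = json.dumps(prior_call)  # double quotes and true: valid JSON


def is_valid_json(text: str) -> bool:
    """Report whether text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(as_repr))  # False
print(is_valid_json(as_json))  # True
```

Feedback built from `as_repr` would show the model single-quoted keys and a capitalized `True`, which is exactly the malformed shape the repair step is trying to eliminate.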