generate-synthetic-data

Guides creation of synthetic eval cases that expand coverage without drifting away from real usage. Use when the current eval set is too small, too repetitive, or missing edge cases, and you need more diverse prompts, distractors, or structured scenarios.

Stars: 3

Forks: 0

Updated: April 29, 2026, 14:29

Source

itseffi

itseffi/ai-product-evals

Applicable occupation (SOC)

Data Scientists, Computer and Mathematical Occupations, 15-2051, L4

SKILL.md

name	generate-synthetic-data
description	Guides creation of synthetic eval cases that expand coverage without drifting away from real usage. Use when the current eval set is too small, too repetitive, or missing edge cases, and you need more diverse prompts, distractors, or structured scenarios.

Generate Synthetic Data

Overview

Start from real failure modes or real task dimensions, not random prompt generation.
Generate synthetic examples to cover missing combinations systematically.
Filter synthetic examples for realism, uniqueness, and evaluability.
Add only examples that improve coverage or expose a real blind spot.
Validate synthetic examples against real usage periodically.

Prerequisites

Inspect the current eval suite and recent traces first. Determine what dimensions are under-covered. Use evals/, traces/, and evaluators/index.mjs to understand what kinds of examples the repo can score well.

Synthetic data should feed the label workflow: generate inputs, run them through the actual system, review traces, store human labels, then promote durable examples into eval suites.
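The workflow above can be sketched as a small state model. This is only an illustration: the `Candidate` fields, the `Stage` names, and the example trace path are assumptions made for this sketch, not the repo's actual label schema (see docs/schemas/labels.md for that).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Stage(Enum):
    GENERATED = 1  # synthetic input exists, not yet run
    TRACED = 2     # run through the actual system, trace stored
    LABELED = 3    # a human reviewed the trace and stored a label
    PROMOTED = 4   # added to an eval suite


@dataclass
class Candidate:
    case_id: str
    prompt: str
    trace_path: Optional[str] = None   # hypothetical location under traces/
    human_label: Optional[str] = None  # hypothetical label value such as "pass"
    promoted: bool = False

    @property
    def stage(self) -> Stage:
        """Derive the current stage from which fields are filled in."""
        if self.promoted:
            return Stage.PROMOTED
        if self.human_label is not None:
            return Stage.LABELED
        if self.trace_path is not None:
            return Stage.TRACED
        return Stage.GENERATED


def promote(candidate: Candidate) -> Candidate:
    """Promote only candidates that already have a trace and a human label."""
    if candidate.trace_path is None or candidate.human_label is None:
        raise ValueError(f"{candidate.case_id}: needs a trace and a human label first")
    candidate.promoted = True
    return candidate
```

The guard in `promote` encodes the ordering the paragraph describes: nothing reaches an eval suite before it has been run and reviewed by a person.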

Core Instructions

Start From Dimensions, Not Random Prompts

Define the dimensions that matter for the task, such as:

difficulty
ambiguity
required format
domain
tool/no-tool decision
retrieval dependency
single-hop vs multi-hop

Generate examples by combining those dimensions intentionally.
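One way to make "combining dimensions intentionally" concrete is to write each planned case down as a spec before any prompt text exists. The sketch below is a hypothetical illustration: the dimension values, the `CaseSpec` class, and the wording of `brief()` are all assumptions, and only four of the dimensions listed above are shown.

```python
from dataclasses import dataclass

# Hypothetical dimension values; replace them with the ones your task actually has.
DIMENSIONS = {
    "difficulty": ["easy", "hard"],
    "ambiguity": ["clear", "ambiguous"],
    "required_format": ["free_text", "json"],
    "tool_decision": ["tool", "no_tool"],
}


@dataclass(frozen=True)
class CaseSpec:
    difficulty: str
    ambiguity: str
    required_format: str
    tool_decision: str

    def __post_init__(self):
        # Reject values that are not part of the declared dimensions.
        for name, allowed in DIMENSIONS.items():
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name}={value!r} is not one of {allowed}")

    def brief(self) -> str:
        """Render the spec as a short instruction for a human or model writer."""
        return (
            f"Write one {self.difficulty}, {self.ambiguity} user request "
            f"that expects a {self.required_format} answer "
            f"and should result in a {self.tool_decision} decision."
        )


spec = CaseSpec("hard", "ambiguous", "json", "tool")
print(spec.brief())
```

Keeping the spec separate from the generated text also means each example carries its dimension values with it, which later coverage checks can rely on.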

Use Tuple-Based Coverage

Create combinations of task dimensions and ensure each combination is represented by at least one eval case. This prevents synthetic generation from overproducing the easiest examples.
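A minimal sketch of the coverage check, assuming each existing eval case is a dict that records its dimension values. The dimension names, values, and sample cases are made up for illustration; the repo's real case format may differ.

```python
from collections import Counter
from itertools import product

# Hypothetical dimensions and existing cases, for illustration only.
DIMENSIONS = {
    "difficulty": ["easy", "hard"],
    "hops": ["single", "multi"],
    "tool_decision": ["tool", "no_tool"],
}

existing_cases = [
    {"id": "c1", "difficulty": "easy", "hops": "single", "tool_decision": "no_tool"},
    {"id": "c2", "difficulty": "easy", "hops": "single", "tool_decision": "no_tool"},
    {"id": "c3", "difficulty": "hard", "hops": "multi", "tool_decision": "tool"},
]


def coverage(cases, dimensions):
    """Count cases per dimension tuple and list the tuples that have no case."""
    names = list(dimensions)
    counts = Counter(tuple(case[n] for n in names) for case in cases)
    all_tuples = list(product(*(dimensions[n] for n in names)))
    missing = [t for t in all_tuples if counts[t] == 0]
    return counts, missing


counts, missing = coverage(existing_cases, DIMENSIONS)
print(f"{len(missing)} tuples have no case")
for combo in missing:
    print(dict(zip(DIMENSIONS, combo)))
```

The `missing` list is the generation backlog, and `counts` shows where the suite is already saturated (here the easy, single-hop, no-tool tuple appears twice). With many dimensions the full product grows quickly, so in practice you may want to restrict it to combinations that can occur in real usage.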

Generate Evaluable Cases

A synthetic example is useful only if you can score it reliably. Prefer cases that support:

exact checks
regex checks
tool-call checks
clearly grounded subjective evaluation

Do not generate cases whose “correct answer” is too vague to evaluate.
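The three deterministic check types above could look like this in outline. The case shape (`check`, `expected`) and the tool-call record shape (`name`) are assumptions for this sketch, not the interface of evaluators/index.mjs; subjective evaluation is left out because it needs a judge rather than a fixed rule.

```python
import re

KNOWN_CHECKS = {"exact", "regex", "tool_call"}


def exact_check(output: str, expected: str) -> bool:
    """Pass when the output equals the expected text, ignoring outer whitespace."""
    return output.strip() == expected.strip()


def regex_check(output: str, pattern: str) -> bool:
    """Pass when the pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None


def tool_call_check(tool_calls, expected_tool: str) -> bool:
    """Pass when any recorded call used the expected tool name."""
    return any(call.get("name") == expected_tool for call in tool_calls)


def is_scorable(case: dict) -> bool:
    """A case is scorable here only if it names a known check and a non-empty expectation."""
    return case.get("check") in KNOWN_CHECKS and bool(case.get("expected"))


def score(case: dict, output: str, tool_calls=()) -> bool:
    if not is_scorable(case):
        raise ValueError("case has no usable check; do not add it to the suite")
    kind, expected = case["check"], case["expected"]
    if kind == "exact":
        return exact_check(output, expected)
    if kind == "regex":
        return regex_check(output, expected)
    return tool_call_check(tool_calls, expected)
```

Running `is_scorable` at generation time, before a case is ever executed, is a cheap way to drop examples whose expected answer was never pinned down.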

Filter Generated Examples

Reject examples that are:

duplicates
unrealistic
too similar to the prompt template
trivially easy
impossible to score cleanly

Keep only synthetic data that expands meaningful coverage.
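Some of these rejection rules can be automated; a rough sketch follows. It covers duplicates, near-copies of the prompt template, and cases with no expected answer. Realism and trivial ease still need human review. The candidate dict shape and the 0.9 similarity threshold are arbitrary assumptions for illustration.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def filter_candidates(candidates, template, max_template_similarity=0.9):
    """Split candidates into kept cases and (case, reason) rejections."""
    kept, rejected = [], []
    seen = set()
    template_norm = normalize(template)
    for cand in candidates:
        prompt = normalize(cand["prompt"])
        similarity = SequenceMatcher(None, prompt, template_norm).ratio()
        if prompt in seen:
            rejected.append((cand, "duplicate"))
        elif similarity >= max_template_similarity:
            rejected.append((cand, "too similar to template"))
        elif not cand.get("expected"):
            rejected.append((cand, "impossible to score"))
        else:
            kept.append(cand)
        seen.add(prompt)
    return kept, rejected
```

Recording a reason with every rejection makes it easy to see which rule removes the most candidates, which in turn hints at what to change in the generation step.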

Repo Files To Inspect

evals/
dataset.mjs
run-eval.mjs
evaluators/index.mjs
traces/
labels/
docs/schemas/labels.md

Anti-Patterns

Generating synthetic data without first identifying missing dimensions.
Adding large volumes of low-quality examples.
Treating synthetic data as a substitute for real user data.
Creating examples that cannot be scored clearly.
Overfitting the suite to synthetic phrasing.
