一键导入
azure-ai-fine-tuning
// Use when the user wants to fine-tune a model on Azure AI Foundry — including dataset preparation, training, evaluation, and deployment.
// Use when the user wants to fine-tune a model on Azure AI Foundry — including dataset preparation, training, evaluation, and deployment.
| name | azure-ai-fine-tuning |
| description | Use when the user wants to fine-tune a model on Azure AI Foundry — including dataset preparation, training, evaluation, and deployment. |
| argument-hint | ["describe the model you want to fine-tune and the task it should perform"] |
Help the user fine-tune a model on Azure AI Foundry. This covers the full lifecycle:
$ARGUMENTS
Read the workflow file that matches the user's current stage:
workflows/auto-finetune.mdworkflows/quickstart.mdworkflows/full-pipeline.mdworkflows/dataset-creation.mdworkflows/traces-to-dataset.mdworkflows/synthetic-datagen.mdworkflows/iterative-training.mdworkflows/diagnose-poor-results.mdworkflows/experiment-review.mdIf the user wants full control, start with workflows/full-pipeline.md. If they want automation, try workflows/auto-finetune.md (experimental).
Read the relevant reference file before performing any step:
| File | When to read |
|---|---|
references/training-types.md | Choosing between SFT, DPO, and RFT |
references/hyperparameters.md | Setting learning rate, batch size, epochs |
references/dataset-formats.md | Preparing or converting training data |
references/data-generation-api.md | Foundry Data Generation API (preview) — sources, recipes, scenarios for traces→dataset and synthetic datagen |
references/deployment-formats.md | Deploying a fine-tuned model |
references/evaluation-methodology.md | Designing an eval rubric |
references/training-curve-analysis.md | Reading training logs and curves |
references/foundry-cli.md | Using the azd ai finetuning CLI for submit/deploy |
references/vision-fine-tuning.md | Fine-tuning with image data (gpt-4o, gpt-4.1) |
references/cost-management.md | Training costs, hosting tiers, budget planning |
references/distillation.md | Teacher-student model distillation workflow |
references/agentic-rft.md | Tool calling + endpoint graders for agentic RFT |
references/grader-design.md | Designing effective RFT graders (type selection, partial credit, threshold calibration) |
references/reward-hacking-prevention.md | Preventing reward hacking in RFT (grader alignment, monitoring, iteration) |
references/platform-bugs.md | Known platform bugs and workarounds |
references/large-file-uploads.md | Uploading large training files (>100MB) via chunked Uploads API |
Reusable Python scripts in scripts/. Each is self-contained with inline documentation.
| Script | Purpose |
|---|---|
auto_finetune.py | Autonomous orchestrator (experimental, SFT only) — runs the full loop: analyze → generate → prepare → baseline → train → evaluate → review → iterate. Good for exploration; use individual scripts for production workflows or RFT. |
auto_rft.py | Autonomous RFT orchestrator (experimental) — full loop for reinforcement fine-tuning: validate → prepare → calibrate → baseline → submit → monitor → evaluate → iterate. Use this for RFT; use auto_finetune.py for SFT. |
submit_training.py | Submit SFT, DPO, or RFT jobs (SDK + REST fallback) |
monitor_training.py | Poll a running job until completion, streaming events in real time |
generate_dataset.py | Generate fine-tuning or eval data via the Foundry Data Generation API (preview) — traces → SFT/eval, doc → Q&A, OpenAPI spec → tool-use. SDK + REST modes. Also includes --tools-from/--tools-to-openapi-out converter for OpenAI tool-spec → OpenAPI 3.0. |
calibrate_grader.py | Run base model through your RFT grader to find optimal pass_threshold |
generate_distillation_data.py | Generate training data from a teacher model for distillation (legacy custom-script approach) |
check_training.py | Pull training curves, detect overfitting, list checkpoints |
deploy_model.py | Deploy fine-tuned models via ARM REST API |
cleanup.py | List and delete old deployments, files, and pending jobs to reclaim quota |
evaluate_model.py | Run held-out eval with 2-dimension LLM judge |
convert_dataset.py | Convert between SFT, DPO, and RFT JSONL formats |
score_dataset.py | LLM-judge quality scoring on training data |
common.py | Shared auth helper — get_clients() tries /v1/, Foundry SDK, AzureOpenAI in order |
validate/validate_sft.py | Validate SFT JSONL: schema, roles, token limits, system prompt consistency |
validate/validate_dpo.py | Validate DPO JSONL: schema, identical-pair detection, DPO epoch warnings |
validate/validate_rft.py | Validate RFT JSONL: schema, grader escaping warnings, content moderation risk |
validate/data_stats.py | Dataset stats: token counts, format detection, cost estimates per model family |
content_safety_check.py | Debugging tool — when Azure FT fails with "User data has failed data safety check", run this to identify which specific rows tripped the classifier. Uses Azure Content Safety API (Hate/Sexual/SelfHarm/Violence severity 0-7) and can write a --drop-out file with only passing rows. Not part of the normal flow; reach for it only when preprocessing rejects an otherwise-clean file. |
transform_traces_jsonl.py | Required for traces-to-SFT distillation of tool-using agents — Foundry's Traces datagen emits data that fails Azure FT preprocessing in five independent ways (overlapping snapshots, fragments, content="null" on tool-call rows, consecutive asst tool_calls, missing system+tools). This script applies all five fixes per the canonical example notebook. Takes the raw *_dg.jsonl from Foundry + the agent's system prompt + tool definitions; writes Azure FT-ready JSONL. |
chunk_and_generate.py | Workaround for SimpleQnA saturation on large source docs — Foundry's QnA generator saturates at ~100-150 unique pairs per source file regardless of max_samples. This script chunks a local text file into N pieces (with overlap, snapping to paragraph boundaries), uploads each, runs N parallel datagen jobs, and concatenates the results. Use when your reference doc is too large to be summarized by a single ~100-question pass (e.g. distilling a 800-page technical standard). |
quality_filter.py | LLM-judge per-row quality filter for generated data. Scores each prompt/response pair on non_fragmented / non_empty / on_topic (1-5) and drops rows below threshold. Wired into the autopilot via --quality-filter. |
diagnose_iteration.py | When the autopilot returns ITERATE, this asks a strong LLM to look at the task description, eval rubric, dataset sizes, and 3 sample rows from train/test, then picks one of seven root-cause buckets (data_quality, train_test_mismatch, rubric_mismatch, task_genuinely_hard, needs_more_data, wrong_hps, wrong_base_model) plus a concrete next-step. Wired into cmd_review via --deep-diagnose. |
Always validate data before submitting jobs — run validate_sft.py / validate_dpo.py / validate_rft.py first, then data_stats.py for the overview.
Sample data: examples/sample-data/ contains sft_sample.jsonl, dpo_sample.jsonl, and rft_sample.jsonl — use these as format references.
CLI alternative: For quick single-job workflows, the azd ai finetuning CLI can replace submit_training.py and deploy_model.py. See references/foundry-cli.md.
| Task | Command |
|---|---|
| Validate SFT data | python scripts/validate/validate_sft.py data.jsonl |
Triage "User data has failed data safety check" errors | python scripts/content_safety_check.py --jsonl train.jsonl --endpoint https://<resource>.cognitiveservices.azure.com --api-key $env:AZURE_CONTENT_SAFETY_KEY --drop-out clean.jsonl |
| Transform Foundry traces output into Azure FT-ready JSONL (for traces-to-SFT distillation) | python scripts/transform_traces_jsonl.py --jsonl raw_traces_dg.jsonl --system-prompt-file system.md --tools-file tools.json --out sft.jsonl |
| Full autopilot for traces distillation (auto-runs the transform) | python scripts/auto_finetune.py auto --description "..." --task-name <name> --model <student> --teacher <strong> --datagen-agent-name <agent> --datagen-agent-version <v> --datagen-hours 720 --traces-system-prompt-file system.md --traces-tools-file tools.json |
| Generate dataset from agent traces | python scripts/generate_dataset.py --source traces --agent-name <name> --agent-version <v> --recipe traces --scenario sft --max-samples 200 --train-split 0.8 --hours 24 --download |
| Generate Q&A from a doc | python scripts/generate_dataset.py --source prompt-file --prompt-file policy.md --recipe qna --scenario sft --teacher gpt-4.1-mini --max-samples 100 --train-split 0.9 --download |
| Generate Q&A from a large doc (chunked workaround for SimpleQnA per-source saturation) | python scripts/chunk_and_generate.py --source-text big-doc.txt --chunks 10 --teacher gpt-4.1 --recipe qna --scenario sft --max-samples-per-chunk 100 --concurrency 2 --out merged.jsonl --cleanup-uploads |
| Filter generated data on quality (LLM judge) | python scripts/quality_filter.py --jsonl generated.jsonl --judge gpt-4.1-mini --threshold 4 --drop-out filtered.jsonl |
| Deep-diagnose why an autopilot iteration didn't ship | python scripts/diagnose_iteration.py --work-dir ./auto_ft_run --judge gpt-4.1 |
| Convert OpenAI tools to OpenAPI 3.0 (for tool-use) | python scripts/generate_dataset.py --tools-from openai_tools.json --tools-to-openapi-out openapi.json |
| Generate tool-use SFT (upload openapi.json first) | python scripts/generate_dataset.py --source file --file-id <openapi-file-id> --recipe tool-use --scenario sft --teacher gpt-4.1-mini --max-samples 50 --train-split 0.8 --download |
| Submit SFT job | python scripts/submit_training.py --model gpt-4.1-mini --training-file train.jsonl --validation-file val.jsonl --type sft |
| Monitor job | python scripts/monitor_training.py --job-id ftjob-xxx |
| Analyze curves | python scripts/check_training.py --job-id ftjob-xxx |
| Deploy model | python scripts/deploy_model.py --model-id ft:gpt-4.1-mini:... --name my-eval |
| Evaluate model | python scripts/evaluate_model.py --deployment-name my-eval --test-file test.jsonl |
| Auto fine-tune | python scripts/auto_finetune.py auto --data data.jsonl --description "task" --model gpt-4.1-mini |
| Auto (prompt only) | python scripts/auto_finetune.py auto --description "task" --model gpt-4.1-nano |
| Error | Cause | Fix |
|---|---|---|
| "API version not supported" | Older openai SDK on /v1/ endpoint | Upgrade to openai>=1.0 |
| "does not support fine-tuning with Standard TrainingType" | OSS model needs globalStandard | Use --use-rest flag or set trainingType: "globalStandard" |
| Job stuck in post-training eval | Under-provisioned tool endpoint (RFT) | Scale to S2+, enable Always On |
| "DeploymentNotReady" / 500 on deploy | Too many deployments or ARM race condition | Clean old eval deployments, retry after 5 min |
| Content safety block at deployment | PII-dense training data | Remove problematic document types |
| "BadRequestForDependentService" | Deployment still warming up | Wait 5+ minutes after deployment creation |
| Queue stuck ("jobs ahead") | Standard tier capacity exhausted | Cancel and resubmit on developerTier or globalStandard |
gpt-4.1 must be deployed in your resource before you can use it as a teacher or student.item.get('field_name') calls match the actual field names in your JSONL data. A mismatch (e.g., grader reads reference_answer but data uses answer) silently returns 0.0 for every sample — the grader never raises an error, it just gets empty strings. Always print and diff the grader source vs. the first line of your training JSONL.Resources may span multiple subscriptions. Always verify both the subscription AND resource before submitting jobs or querying status. The az CLI only searches the active subscription.
Map your resources before starting:
| Resource | Subscription | RG | Endpoint | Use |
|---|---|---|---|---|
<your-primary-resource> | <subscription-name> | <resource-group> | https://<your-primary-resource>.cognitiveservices.azure.com/ | Primary FT resource |
<your-secondary-resource> | <subscription-name> | <resource-group> | https://<your-secondary-resource>.cognitiveservices.azure.com/ | Secondary / overflow |
Tip: Run
az cognitiveservices account list --query "[].{name:name, rg:resourceGroup, endpoint:properties.endpoint}" -o tableto discover all your resources across subscriptions.
Before querying or submitting jobs:
az account set --subscription "<sub name>" to switch to the correct subscriptionaz account show --query name -o tsvCommon mistake: Forgetting to switch subscriptions before querying. If a job returns 404, try the other subscription before assuming it's lost.
Verify you're submitting to the correct resource AND subscription: Azure AI Foundry projects connect to a specific AIServices/OpenAI resource, which lives on a specific subscription. Jobs submitted to a different resource won't appear in the portal or telemetry. Symptoms: jobs show via API but not in the Foundry UI; "phantom" failures that the team can't reproduce; 404s when querying a valid job ID. Always (1) switch to the correct subscription first, (2) use the project endpoint (https://<resource>.services.ai.azure.com/api/projects/<project>/openai/v1/) or verify the OAI endpoint matches the resource connected to your Foundry project. A common mistake is submitting to <resource-A> instead of <resource-B> — all "platform 500" failures were actually jobs on the wrong resource.
Transient HTTP 500 failures: Azure AI Foundry FT jobs can fail with "A system error was encountered, please try again later" (HTTP 500). Single retries often succeed. Retry once or twice, then wait and check. If failures persist, verify you're hitting the correct resource endpoint before filing a support ticket.
All OSS FT jobs require trainingType: globalStandard: The Python SDK fails with "does not support fine-tuning with Standard TrainingType" for all OSS models (Ministral-3B, Qwen-32B, Llama-3.3-70B-Instruct, gpt-oss-20b). Use the REST API with "trainingType": "globalStandard" in the JSON payload. See scripts/submit_training.py for the fallback.
OSS model FT uses Global deployment tier: Ministral-3B, Qwen-32B, Llama-3.3-70B-Instruct, and gpt-oss-20b support fine-tuning via Global (not Standard regional). Any regional resource can use Global. The model catalog API incorrectly reports capabilities.fine_tune = false for these models — ignore the flag. Developer tier is only available for OpenAI models, not OSS. Model ID format: Ministral-3B (not gpt-*); FT output format: Ministral-3B.ft-{jobid}-suffix. Note: gpt-oss-20b is the model name but the versioned ID on the platform is gpt-oss-20b-11.
Deployment format matters: A wrong model.format gives an unhelpful HTTP 500. See references/deployment-formats.md for the exact mapping. For OSS models, use --model-format "OpenAI-OSS" with --sku-name "GlobalStandard". Deploy via CLI: az cognitiveservices account deployment create --name <resource> --resource-group <rg> --deployment-name <name> --model-name <model> --model-version "1" --model-format "OpenAI-OSS" --sku-capacity 100 --sku-name "GlobalStandard".
OSS endpoint matrix:
| Operation | Recommended path |
|---|---|
| FT job submission (OSS) | OAI REST endpoint with "trainingType": "globalStandard" — SDK rejects OSS models |
| FT job submission (OpenAI models) | SDK or /v1/ project endpoint |
| OSS FT model deployment | ARM management plane (management.azure.com) with format: "OpenAI-OSS" — data-plane PUT returns 404 |
| OSS inference | OAI endpoint with api-key header works — project endpoint also works |
| File upload | SDK client.files.create() + client.files.wait_for_processing() |
Project endpoint grading: The /v1/ path does NOT accept api-version query params. Use openai.OpenAI() client, NOT AzureOpenAI(), when calling project endpoints.
Try the /v1/ project endpoint first for fine-tuning operations. It supports features like Python graders without API version strings. Use openai.OpenAI(base_url="https://<resource>.services.ai.azure.com/api/projects/<project>/openai/v1/", api_key=KEY). If you encounter "API version not supported" errors on file uploads or job management, fall back to the non-project endpoint with 2025-04-01-preview (see Bug #2 in references/platform-bugs.md).
RFT grader escaping: Python code embedded in grader JSON must escape \n → \\n, \t → \\t, quotes → \".
RFT API version: Python graders for RFT require api-version=2025-04-01-preview or later. The 2025-03-01-preview API rejects type: "python".
RFT data format: The last message in RFT training data must be role: "user". Reference answers go in extra fields (e.g., "answer": 42), not in assistant messages.
RFT grader field name mismatch is SILENT: If the grader reads item.get('reference_answer') but training data uses answer, the grader gets empty string and returns 0.0 for every sample. No error is raised. The training UI will show 0% reward — or if combined with other scoring logic, misleading results. Always verify the grader source's item.get() field names match your JSONL's extra fields exactly. This was discovered during math RFT testing — the deployed grader read reference_answer but the training data had answer, causing zero reward signal.
RFT 100% pass rate from rollout 1 = no learning signal: If every sample passes from the very first rollout, the model has no gradient to improve. Common causes: (1) task too easy for the base model (o4-mini aces simple math/QA), (2) grader too lenient (word overlap scoring passes anything close), (3) grader broken (field mismatch returning default score). Fix: run the base model through your grader before submitting — if pass rate > 90%, the task is too easy for RFT.
RFT 0% pass rate from rollout 1 = no learning signal: If no samples pass, the model gets only negative reward and has no positive examples to learn from. Common causes: (1) task too hard for the base model, (2) pass_threshold too strict, (3) grader returning 0 for all outputs (broken parsing, field mismatch). Fix: lower the pass_threshold, simplify the task, or check the grader on base model outputs. Target 30-50% failure rate — the model needs some successes to learn what good looks like.
Recalibrate pass_threshold when changing datasets: A threshold that worked for a small dataset may be too strict or too lenient after adding more examples. The base model's score distribution shifts with different data composition. Always re-run threshold calibration after changing dataset size or content. See references/grader-design.md for the calibration workflow.
Grader template syntax: Template variables must have no spaces inside braces: {{item.answer}} ✅, {{ item.answer }} ❌.
RFT grader alignment is critical (reward hacking is the #1 RFT failure mode): If training grader ≠ eval grader, reward hacking is guaranteed. Symptom: train-val gap > 0.10. One RFT experiment showed a 0.24+ gap when using a Python AST grader for training but LLM judge for eval. Fix: use identical grading logic for both training and eval, or use endpoint graders. See references/reward-hacking-prevention.md and microsoft-foundry/fine-tuning RFT demos.
RFT content moderation: RFT training data must pass Azure content moderation. Prompts asking the model to "show your reasoning step by step" or "explain your chain of thought" may be flagged as "model reasoning extraction" and rejected. Use simpler instructions like "Solve this problem" or "Give your final answer."
DPO overtraining / degeneration: DPO is prone to model collapse when overtrained. Symptoms: repetitive token output ("I I I I I..."), especially on sensitive topics. Mitigation: (1) Use 1–2 epochs max, not 3; (2) Monitor for near-zero training loss early — if loss hits ~1e-6 before epoch 2, stop early or reduce learning rate; (3) Always evaluate on adversarial/edge-case prompts, not just average quality; (4) If the base model already handles the task well (>9/10), DPO may hurt more than help — consider whether fine-tuning is needed at all.
DPO default epochs: Azure Foundry defaults DPO to 3 epochs, not 2. For small datasets (<500 pairs), explicitly set n_epochs=1 or n_epochs=2 to avoid overtraining. The hyperparameters field on the job object may show None even when defaults are applied.
File upload quota: Max 100 files per resource. Delete old uploads when approaching the limit.
Token refresh: ARM tokens expire. Always call az account get-access-token immediately before each request.
Val loss overfitting ≠ worse quality: A model significantly above its best val_loss can still outperform an earlier checkpoint on downstream evals. Don't blindly deploy epoch-1 checkpoints — always evaluate with a held-out test set before deciding.
Small datasets teach format, not domain: With <100 examples (e.g., 73 tool-calling samples), a fine-tuned model learns mechanical patterns (always call a tool, produce valid JSON args) but does NOT improve task-specific accuracy (correct tool selection). Need 200+ examples for domain knowledge.
Dataset size sweet spot: 200–500 examples is the sweet spot to get started. Evaluate results, then decide if you need more data — quality matters more than quantity. Larger datasets (4K+) can actually hurt OSS models. For distillation tasks, 200–300 high-quality examples is often sufficient.
Distillation sweet spot: SFT distillation (e.g., mini→nano) routinely achieves high teacher gap closure on well-defined tasks (code generation, structured extraction) with just 200–300 examples and 2 epochs. This is the most reliable fine-tuning pattern. For classification tasks, direct SFT on gold labels works better than distillation when ground truth is clean.
Latency benchmarking: Always measure p50/p90/p95/p99 latency for base vs fine-tuned models, not just accuracy. Fine-tuned models often have lower latency + tighter variance — see the Image Breed Classification demo for an example showing mean and p99 latency improvements.
Content filters on PII/security data: Generating synthetic data containing SSNs, credit cards, or security-sensitive content can trigger Azure's jailbreak filter. Expect ~14% rejection rate. Generate extra examples to compensate.
Data Designer categories vs values: CategorySamplerParams uses values=, NOT categories=. Using the wrong field name causes a cryptic Field required pydantic error.
Data Designer LLMColumnConfig doesn't exist: Use LLMTextColumnConfig for text generation and LLMStructuredColumnConfig for structured output. The DD SKILL.md and data-designer agent context command have the correct types.
Data Designer Score requires options: dd.Score(name=..., description=..., options={"1": "Poor", "5": "Average", "10": "Excellent"}). The options dict maps score values to descriptions.
Data Designer model alias conflicts: If a model alias is configured globally (~/.data-designer/model_configs.yaml), do NOT also call add_model_config() in the script — it will fail with "alias already exists".
Data Designer + GPT-5 series: GPT-5.x models use max_completion_tokens, not max_tokens. Set it via extra_body in the DD model config, not max_tokens directly.
Data Designer Windows encoding: Set $env:PYTHONIOENCODING = "utf-8" on Windows before running DD CLI commands. Rich terminal output contains emoji that cp1252 encoding can't handle.
File upload processing delay: After client.files.create(), call client.files.wait_for_processing(file_id) before submitting a training job. This polls until the file is ready. Without it, immediate submission fails with "file import not completed".
DPO hyperparameters go inside method config: When submitting DPO jobs, n_epochs, learning_rate_multiplier, and beta must be inside method.dpo.hyperparameters, NOT at the top-level hyperparameters field. Top-level HPs cause invalidPayload error. Example: method={"type": "dpo", "dpo": {"hyperparameters": {"n_epochs": 2, "beta": 0.1, "learning_rate_multiplier": 1.0}}}.
Azure AI Eval SDK type field required: When using OpenAIModelConfiguration with the project /v1/ endpoint, you must include type="openai". Without it, the SDK throws '' is not a supported connection type.
SDK generic evaluators are degradation guardrails, not FT metrics: Built-in evaluators (Coherence, Fluency, TaskAdherence) measure general quality and may show no difference between base and fine-tuned models even when domain-specific eval shows clear improvement. Use them only to verify the model didn't regress. For actual FT evaluation, use the SDK's custom graders: AzureOpenAIScoreModelGrader (LLM judge with task-specific rubric), AzureOpenAIPythonGrader (code-based exact match), or AzureOpenAIStringCheckGrader (pattern matching).
Content safety rejection on FT models: Training may succeed but the resulting model can be rejected at deployment time with "model scores above acceptable thresholds for [Hate/Fairness]". This can happen even with innocuous data (e.g., entity extraction from medical records, legal documents, resumes with PII). Workaround: Remove document types containing sensitive attributes (medical, legal, HR), reduce PII density, or rephrase to use clearly synthetic names/data. There is no appeal process — you must resubmit with cleaner data.
Data Designer CLI syntax: The create command takes a positional config arg: data-designer create <config.py>, NOT --config. The config script must define load_config_builder() -> DataDesignerConfigBuilder.
Data Designer API key env var is per-shell: AZURE_FOUNDRY_API_KEY must be set in each new terminal. DD health check fails with "API key invalid or expired" if missing.
FT deployment "DeploymentNotReady" after ARM Succeeded: The data plane can lag behind the control plane. If a deployment stays stuck in DeploymentNotReady despite ARM showing Succeeded, delete and recreate it.
OpenAI SDK API version mismatch with project endpoints: The project /v1/ endpoint may not support all API versions for all operations. Use REST API directly (requests) for file uploads and job management when the SDK throws "API version not supported". The non-project endpoint (/openai/) with 2025-04-01-preview works for both files and jobs.
Tool calling eval must use per-example tools: When evaluating tool-calling models, use the tools field from each test example — NOT a hardcoded global tool list. DD generates different tool schemas per training scenario.
Vision FT training data is large: Base64-encoded JPEG images produce ~80KB per training example. A 2,000-example vision dataset is ~165MB. File upload may timeout — consider splitting into chunks or using the REST API with longer timeouts.
Vision FT requires image-capable models: Only gpt-4o and gpt-4.1 support vision fine-tuning. gpt-4.1-mini and gpt-4.1-nano do NOT support image inputs for fine-tuning.
ChartQA dataset filtering: HuggingFaceM4/ChartQA contains both human-labeled and machine-labeled examples. Filter to human_or_machine == 0 (human) for higher quality. The label column may be a list — extract the first element.
Classification FT: Small OSS models can't memorize large label sets: Small models (3B parameters) fail at many-class classification (50+ classes) — they invent synonym labels instead of learning the exact vocabulary. Increasing training data does not help; this is a model capacity limit. Larger models (20B+) perform significantly better. For classification with many classes, use ≥20B models or reduce label count.
Classification eval MUST include the system prompt: Training data with a system prompt ("You are a classifier...") teaches the model to output labels. Without the same system prompt at eval time, the FT model reverts to generic helpful assistant behavior (0% accuracy). Always replay the exact system prompt from training data.
These patterns are based on extensive end-to-end testing across SFT, DPO, and RFT.
references/reward-hacking-prevention.md for the #1 RFT failure mode| Pitfall | What happens | Fix |
|---|---|---|
| Skipping baseline evaluation | You can't measure improvement | Always evaluate the base model first |
| Too few examples (<100) | Model learns format but not domain knowledge | Use 200–500 examples minimum |
| DPO on strong base model | Quality degrades | Use SFT instead, or skip fine-tuning |
| Misaligned RFT grader | Reward hacking — model games the grader | Use same grading logic for training and eval |
| Small OSS models on large label sets | Model invents synonym labels (capacity limit) | Use ≥20B parameter models for 50+ classes |
The skill ships with two test suites under tests/:
cd Skills
# Fast checks (default — skips anything that hits the live service)
python -m pytest tests/ -v
# End-to-end suite against a real Foundry project
$env:FOUNDRY_PROJECT_ENDPOINT = "https://<resource>.services.ai.azure.com/api/projects/<project>"
$env:FOUNDRY_TEACHER_MODEL = "gpt-4.1" # any deployed chat model
$env:FOUNDRY_AGENT_NAME = "<your-agent>" # for traces/agent tests
$env:FOUNDRY_AGENT_VERSION = "1"
$env:E2E_JOB_TIMEOUT = "900" # 15 min per non-tool-use job
$env:E2E_TOOL_USE_TIMEOUT = "2700" # 45 min cap on tool-use jobs
python -m pytest tests/test_data_generation_e2e.py -m live -v
-m live selects the 9 live tests; the default -m "not live" runs only the 8 CLI/argparse tests plus the 48 existing skill-consistency tests. Authentication uses DefaultAzureCredential — run az login first.