azure-ml-model-evaluation
// Evaluate generative AI applications and models locally or in the cloud using Azure AI Evaluation SDK. Measure quality, safety, and performance with built-in and custom evaluators.
| name | azure-ml-model-evaluation |
| description | Evaluate generative AI applications and models locally or in the cloud using Azure AI Evaluation SDK. Measure quality, safety, and performance with built-in and custom evaluators. |
| license | See repository root |
Evaluate generative AI applications using the Azure AI Evaluation SDK with built-in quality and safety metrics. Evaluations run locally or in the cloud and can be integrated into CI/CD pipelines.
Three evaluation approaches:
Use this skill when:
Packages: azure-ai-evaluation, azure-identity
Authentication: az login
These are templates in the examples/ directory. Copy and adapt them for your project:
examples/
├── local_evaluation.py # Template: Evaluate with built-in metrics
├── cloud_evaluation.py # Template: Cloud-scale evaluation job
└── utils.py # Template: Helper functions
Do NOT reference these files directly. Copy and adapt them for your project structure.
Copy examples/local_evaluation.py and examples/utils.py to your project
Run python local_evaluation.py
Input: test_data.jsonl
Output: evaluation_results.json
Copy examples/cloud_evaluation.py and examples/utils.py to your project
Run python cloud_evaluation.py
{"query": "What is Azure ML?", "response": "Azure ML is...", "context": "...", "ground_truth": "..."}
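The single-turn record above is one line of a JSONL file. A minimal sketch of writing and checking such a file with the standard library follows; the helper names `write_jsonl` and `load_jsonl` and the choice of required fields are assumptions for illustration, not part of the SDK or the templates.

```python
import json
from pathlib import Path

# Assumption: every row needs at least these two fields. Individual
# evaluators may also need "context" or "ground_truth".
REQUIRED_FIELDS = ("query", "response")


def write_jsonl(rows, path):
    """Write one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def load_jsonl(path, required=REQUIRED_FIELDS):
    """Read a JSONL file and fail early on rows that lack required fields."""
    rows = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            missing = [key for key in required if key not in row]
            if missing:
                raise ValueError(f"line {line_no}: missing fields {missing}")
            rows.append(row)
    return rows
```

Validating the file before starting a run surfaces malformed rows up front instead of partway through an evaluation.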
{
"messages": [
{"role": "user", "content": "..."},
{"role": "assistant", "content": "..."},
{"role": "user", "content": "..."},
{"role": "assistant", "content": "..."}
]
}
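A conversation record like the one above can be checked before it is evaluated. The sketch below assumes strictly alternating user and assistant turns, as in the sample; real datasets may contain system messages or other roles, and `validate_conversation` is a hypothetical helper.

```python
def validate_conversation(conversation):
    """Check the multi-turn shape shown above and return the number of turns.

    Assumption: messages alternate user -> assistant, starting with the user.
    """
    messages = conversation.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("conversation needs a non-empty 'messages' list")
    for index, message in enumerate(messages):
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message.get("role") != expected_role:
            raise ValueError(
                f"message {index}: expected role {expected_role!r}, "
                f"got {message.get('role')!r}"
            )
        if "content" not in message:
            raise ValueError(f"message {index}: missing 'content'")
    return len(messages) // 2  # complete user/assistant pairs
```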
{
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "What is in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
]
}
]
}
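The `image_url` entry above carries the image as a base64 data URL. A small sketch of building such a message from raw bytes follows; `image_message` is a hypothetical helper, and the default MIME type is an assumption taken from the sample.

```python
import base64


def image_message(text, image_bytes, mime_type="image/jpeg"):
    """Build a user message with a text part and an inline base64 image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
            },
        ],
    }


# Usage: wrap the message in the conversation structure shown above.
conversation = {
    "messages": [image_message("What is in this image?", b"\xff\xd8\xff")]
}
```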
| Evaluator | Inputs | Description |
|---|---|---|
| GroundednessEvaluator | query, response, context | Response supported by context |
| RelevanceEvaluator | query, response | Response addresses query |
| CoherenceEvaluator | query, response | Logical flow and clarity |
| FluencyEvaluator | response | Language quality |
| RetrievalEvaluator | query, context | Context relevance to query |
| IntentResolutionEvaluator | conversation | User intent resolved |
| TaskAdherenceEvaluator | conversation | Adherence to instructions |
| Evaluator | Inputs | Description |
|---|---|---|
| F1ScoreEvaluator | response, ground_truth | Token overlap F1 |
| SimilarityEvaluator | query, response, ground_truth | AI-assisted semantic similarity to ground truth |
| BleuScoreEvaluator | response, ground_truth | Translation quality |
| RougeScoreEvaluator | response, ground_truth | Summarization quality |
| MeteorScoreEvaluator | response, ground_truth | Semantic similarity |
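To make "token overlap F1" concrete, here is a plain-Python sketch of the idea. It assumes lowercasing and whitespace tokenization; the SDK's own evaluator may normalize text differently, so scores need not match exactly.

```python
from collections import Counter


def token_f1(response, ground_truth):
    """Harmonic mean of token precision and recall between two strings."""
    predicted = response.lower().split()
    reference = ground_truth.lower().split()
    if not predicted or not reference:
        # Both empty counts as a match; one empty counts as a miss.
        return float(predicted == reference)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)
```

For example, a five-token response that is fully contained in a six-token ground truth has precision 1 and recall 5/6, giving an F1 of 10/11.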
| Evaluator | Description | Severity |
|---|---|---|
| ViolenceEvaluator | Violent content | Very Low / Low / Medium / High |
| SexualEvaluator | Sexual content | Very Low / Low / Medium / High |
| SelfHarmEvaluator | Self-harm content | Very Low / Low / Medium / High |
| HateUnfairnessEvaluator | Hate/discrimination | Very Low / Low / Medium / High |
| IndirectAttackEvaluator | XPIA jailbreak attempts | True / False |
| ProtectedMaterialEvaluator | Copyrighted content | True / False |
| ContentSafetyEvaluator | Composite safety eval | Combined metrics |
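The severity labels in the table can be derived from a numeric score. The sketch below assumes a 0 to 7 integer scale bucketed in pairs (0-1, 2-3, 4-5, 6-7); confirm this against the scores your SDK version returns. `severity_label` and `is_acceptable` are hypothetical helpers, and the default limit is an arbitrary example value.

```python
SEVERITY_LABELS = ("Very Low", "Low", "Medium", "High")


def severity_label(score):
    """Map an integer severity score to a label (assumed 0-7 scale)."""
    if not isinstance(score, int) or not 0 <= score <= 7:
        raise ValueError(f"score must be an integer from 0 to 7, got {score!r}")
    return SEVERITY_LABELS[score // 2]


def is_acceptable(score, max_allowed=3):
    """Return True when the score does not exceed a project-chosen limit."""
    return score <= max_allowed
```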
See examples/local_evaluation.py for complete implementation with quality, NLP, and similarity evaluators.
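As a rough illustration of what a local run can look like, the sketch below calls `evaluate` with two evaluators that need no model deployment. It is not the content of examples/local_evaluation.py; the function name and default output path are assumptions, and the SDK import is deferred so the module loads even where the package is not installed.

```python
def run_local_evaluation(data_path, output_path="evaluation_results.json"):
    """Run NLP evaluators over a JSONL file and return aggregate metrics.

    Assumes each row has "response" and "ground_truth" fields.
    """
    # Deferred import: requires the azure-ai-evaluation package at call time.
    from azure.ai.evaluation import BleuScoreEvaluator, F1ScoreEvaluator, evaluate

    result = evaluate(
        data=data_path,
        evaluators={
            "f1_score": F1ScoreEvaluator(),
            "bleu": BleuScoreEvaluator(),
        },
        output_path=output_path,
    )
    return result["metrics"]
```

AI-assisted evaluators such as RelevanceEvaluator additionally need a model configuration, which is omitted here to keep the sketch free of credentials.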
See examples/local_evaluation.py for safety evaluator setup and execution.
See examples/cloud_evaluation.py for Azure cloud-based evaluation with dataset upload and evaluator configuration.
Define code-based evaluators as Python functions with the @tool decorator. Each function receives the row's inputs and returns a dict containing the score or metric.
Create .prompty YAML files that contain the model configuration and the evaluation prompt. Load them via Prompty.load() and pass the result to evaluate().
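A code-based evaluator can be as small as the sketch below. The decorator mentioned above is left out so the example has no dependencies, and the evaluator name, the `max_words` parameter, and the returned keys are all hypothetical.

```python
def answer_length_evaluator(*, response, max_words=100, **kwargs):
    """Score a response by length.

    Extra row fields (query, context, ...) are accepted and ignored so the
    same function works on rows with different columns.
    """
    word_count = len(response.split())
    return {
        "word_count": word_count,
        "within_limit": 1.0 if word_count <= max_words else 0.0,
    }


# Usage: call it directly on one row to check the output shape.
print(answer_length_evaluator(query="What is Azure ML?", response="Azure ML is a service."))
```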
See examples/local_evaluation.py for custom evaluator patterns and integration.
See examples/local_evaluation.py for composite evaluator usage.
Pass a callable target function to evaluate() to generate responses automatically during the run. The function receives the query and returns a response dict. See examples/local_evaluation.py for the implementation.
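The target contract can be sketched without the SDK. Here `answer_query` stands in for a real application call and `apply_target` mimics, in simplified form, how target outputs are merged into each row before evaluators run; both names are assumptions.

```python
def answer_query(query):
    """Hypothetical target: replace the body with a call to your application."""
    return {"response": f"Echo: {query}"}


def apply_target(target, rows):
    """Call the target once per row and merge its output into the row."""
    return [{**row, **target(row["query"])} for row in rows]


rows = [{"query": "What is Azure ML?", "ground_truth": "..."}]
enriched = apply_target(answer_query, rows)
```

With the SDK, the same function would be handed to the run through the `target` argument of `evaluate()` instead of being applied by hand.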
Evaluate the baseline model first to establish reference metrics.
Evaluate the fine-tuned model on the same data and compare its metrics to the baseline.
Add quality gates by checking evaluation metrics against thresholds before deployment.
See examples/local_evaluation.py for evaluation and metric comparison patterns.
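The baseline comparison and the quality gate can both be expressed over plain metric dicts. The sketch below uses hypothetical helper names and made-up threshold values; choose thresholds that suit your own metrics.

```python
def compare_metrics(baseline, candidate):
    """Return candidate minus baseline for every metric present in both."""
    return {
        name: candidate[name] - baseline[name]
        for name in baseline
        if name in candidate
    }


def failed_gates(metrics, thresholds):
    """Return {metric: (actual, minimum)} for gates that are missing or too low."""
    failures = {}
    for name, minimum in thresholds.items():
        actual = metrics.get(name)
        if actual is None or actual < minimum:
            failures[name] = (actual, minimum)
    return failures


baseline = {"relevance": 4.0, "f1_score": 0.80}
fine_tuned = {"relevance": 4.5, "f1_score": 0.92}
deltas = compare_metrics(baseline, fine_tuned)
failures = failed_gates(fine_tuned, {"relevance": 4.0, "groundedness": 4.0})
```

In a CI/CD job, a non-empty `failures` dict would typically translate into a non-zero exit code so the deployment step does not run.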
{
"metrics": {
"relevance": 4.5,
"groundedness": 4.8,
"coherence": 4.6,
"fluency": 4.9,
"f1_score": 0.92,
},
"rows": [
{
"inputs.query": "...",
"inputs.response": "...",
"outputs.relevance.relevance": 5.0,
"outputs.groundedness.groundedness": 4.5,
},
...
],
"studio_url": "https://ai.azure.com/...",
}
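Given a result shaped like the one above, per-row scores can be filtered to find weak responses. This is a sketch with a hypothetical helper name; the column name follows the `outputs.<evaluator>.<metric>` pattern shown in the sample, and the data below is invented for illustration.

```python
def low_scoring_rows(result, column, minimum):
    """Return the rows whose score in `column` is below `minimum`."""
    return [
        row
        for row in result["rows"]
        if row.get(column) is not None and row[column] < minimum
    ]


sample_result = {
    "metrics": {"relevance": 4.0},
    "rows": [
        {"inputs.query": "q1", "outputs.relevance.relevance": 5.0},
        {"inputs.query": "q2", "outputs.relevance.relevance": 3.0},
    ],
}
weak = low_scoring_rows(sample_result, "outputs.relevance.relevance", 4.0)
```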
View results in Azure AI Foundry: Navigate to Evaluation → Evaluation runs and click your run ID.
Cloud evaluation stuck in "Running" → The Azure OpenAI model deployment lacks capacity; cancel the job, increase capacity, and retry
"Model not found" error → Verify that the deployment exists: az cognitiveservices account deployment list --name <account-name> --resource-group <resource-group>
Safety evaluator "Region not supported" → Create the project in East US 2, France Central, UK South, or Sweden Central
"Storage account not connected" → Follow the storage account setup steps