Run any Skill in Manus with one click

llm-evaluator-authoring

Author or refine a Phoenix LLM-as-a-judge evaluator — design the judge prompt, classification labels, input mapping, and test payload. Load before proposing edits to an LLM-evaluator draft, including single-shot judge rewrites.

Run Skill in Manus

Stars10,074

Forks914

UpdatedJune 9, 2026 at 21:30

Source

Arize-ai

Arize-ai/phoenix

View GitHub Repository View Creator Repositories

Install command

Download

Run Skill in Manus

SKILL.md

readonly

name	llm-evaluator-authoring
description	Author or refine a Phoenix LLM-as-a-judge evaluator — design the judge prompt, classification labels, input mapping, and test payload. Load before proposing edits to an LLM-evaluator draft, including single-shot judge rewrites.
summary	Design or refine an LLM-as-a-judge evaluator, including labels, rubric, prompt, and test cases.

LLM-as-a-Judge Evaluator Authoring

A Phoenix LLM evaluator is an LLM-as-a-judge: a judge prompt (a sequence of messages) sent to a model that returns one or more named classification annotations over a run's input, output, reference, and metadata. Authoring a good evaluator is mostly prompt design plus a clear, mutually exclusive set of labels — the same iteration loop as a playground prompt, scoped to a grading task.

Use this skill whenever you are creating a new LLM evaluator or revising an existing one's judge prompt, labels, input mapping, or test payload. The draft-edit operations and field shapes live in the edit_llm_evaluator_draft tool instructions; this skill is the authoring methodology that decides what to put in those operations.

Workflow: Create A New LLM Evaluator

Use this when the user wants a new judge for a dataset-backed experiment or for scoring run output.

Clarify the grading task: what does "good" mean for this output, what are the failure modes, and what evidence in the run distinguishes pass from fail. Ask for one or two concrete examples of a correct and an incorrect output when the user can give them.
Decide the labels first. LLM evaluators return classification annotations, so choose a small, mutually exclusive label set (often two: correct/incorrect, pass/fail, relevant/ irrelevant). Assign scores that match the optimization direction — higher score for the better label when maximizing. Add more labels only when the user needs to distinguish failure modes.
Identify which run field carries the signal. For dataset-backed evaluators, output is the new experiment run output at runtime and the dataset example output is passed as reference. Treat the dataset example shape as evidence for where the signal lives — chat-style messages arrays, assistant content parts, tool_calls/toolCalls, or function_call — rather than assuming a top-level key.
Write the judge prompt. A system message states the grader's role and the rubric; a user message presents the run fields via template variables ({{input}}, {{output}}, {{reference}}, {{metadata}}) and asks for the judgment. Make the rubric concrete: name the labels, say what each one means, and tie the decision to observable evidence in the fields.
Set the test payload to a representative case so the form preview exercises the judge. Shape testPayload.output from the dataset output shape or the user's concrete target case; treat it as representative evidence, not a fixed schema.
Propose the draft with edit_llm_evaluator_draft so the user reviews an accept/reject diff. Emit operations only for fields you intend to change.
After the user accepts a populated draft, offer to run test_llm_evaluator_draft (when it is available) before they click Create. Use the preview result to iterate: if the judge mislabels the test case, refine the rubric, the labels, or the test payload — not all three at once.
Repeat the refine-and-preview loop until the judge labels the representative cases correctly. Persistence stays the user's Create action in the form; do not claim the evaluator is saved.

Workflow: Refine An Existing LLM Evaluator

Use this when an LLM-evaluator form is already mounted (create or edit mode) and the user wants to improve the judge.

Call read_llm_evaluator_draft first to capture the current judge prompt, labels, input mapping, and testPayload. Reason about what the draft does before changing it.
Form a specific hypothesis for the change — a vague or under-specified rubric, the wrong field read, an ambiguous label set, or a test payload that misses the signal — rather than rewriting the whole judge.
Make the smallest edit that tests the hypothesis. Whole-list operations replace the judge prompt messages, the output configs, or the test payload; scoped operations change a single field. Keep the diff small so the user can read it.
When preview failures show the test case is missing the signal the judge should score, update the test payload to a representative case before concluding the judge prompt is wrong.
Re-run test_llm_evaluator_draft when available and compare. Keep the change only when the representative cases are labeled correctly and the tradeoff is acceptable.

Designing The Rubric And Labels

Keep labels mutually exclusive and collectively exhaustive for the task. A judge that cannot map a real output to exactly one label will be inconsistent.
State the rubric in the judge prompt, not just in the label names. "Mark correct only when the answer matches the reference's key facts; mark incorrect otherwise" beats a bare correct/ incorrect choice.
Prefer a binary judgment unless the user needs graded distinctions. More labels means more ambiguity at the boundaries and noisier results.
When the user wants a rationale, enable an explanation alongside the label so the judge justifies its choice — useful for spotting rubric ambiguity during iteration.

Things To Avoid

Don't propose edits without calling read_llm_evaluator_draft first — you will overwrite fields blindly and produce a noisy diff.
Don't assume the signal is at a top-level key. Inspect the dataset example shape and point the judge prompt at the field that actually carries the answer.
Don't change the judge model through set_judge_prompt; the model is a separate operation.
Don't hand-author the judge prompt tools or toolChoice. They are derived from the labels and the explanation setting and regenerated when the edit applies.
Don't claim the evaluator is created or updated. Persistence is the user's Create/Update action in the form.

More from this repository

same repository

annotate-spans

Arize-ai/phoenix

Write effective, consistent annotations on LLM/agent spans and traces, and coach the user on annotation practice. Load this whenever you are about to record structured feedback with the `batch_span_annotate` tool, or when the user asks how to annotate, label, score, or review spans/traces, build a failure taxonomy, or set up human/LLM review. Do NOT load for: pure analysis with no intent to save feedback (use debug-trace), latency or cost statistics, or prompt authoring (use playground).

2026-06-0910.1k

debug-trace

Arize-ai/phoenix

Diagnose failure modes by systematically investigating traces. Trigger when the user explicitly asks for cross-trace diagnosis: "what's going wrong?", "were there errors?", "debug this", "where is my agent struggling?". Do NOT trigger on: (1) advice questions ("what should I do?"), (2) statistical questions ("what's the average latency?"), (3) summarize requests, (4) trace filtering ("show me traces with errors"), (5) vague questions ("is there a problem?"), (6) unrelated requests.

2026-06-0910.1k

playground

Arize-ai/phoenix

Author, edit, or iterate on prompts in the Phoenix prompt playground, including running experiments over a dataset. Load before any playground tool call, including single-shot prompt rewrites.

2026-06-0910.1k

phoenix-github

Arize-ai/phoenix

Manage GitHub issues, labels, and project boards for the Arize-ai/phoenix repository. Use when filing roadmap issues, triaging bugs, applying labels, managing the Phoenix roadmap project board, or querying issue/project state via the GitHub CLI.

2026-06-0510.1k

arize-phoenix

Arize-ai/phoenix

Open-source AI observability platform for tracing, evaluating, and improving LLM applications with OpenTelemetry integration

2026-06-0310.1k

phoenix-frontend

Arize-ai/phoenix

Frontend development guidelines for the Phoenix AI observability platform. Use when writing, reviewing, or modifying React components, TypeScript code, styles, or UI features in the app/ directory. Triggers on any frontend task — new components, UI changes, styling, accessibility fixes, form handling, or component refactoring. Also use when the user asks about frontend conventions or component patterns for this project. For design system rules (error display, layout, dialogs, tokens), use the phoenix-design skill.

2026-06-0310.1k

LLM-as-a-Judge Evaluator Authoring

Workflow: Create A New LLM Evaluator

Use this when the user wants a new judge for a dataset-backed experiment or for scoring run output.

Clarify the grading task: what does "good" mean for this output, what are the failure modes, and what evidence in the run distinguishes pass from fail. Ask for one or two concrete examples of a correct and an incorrect output when the user can give them.

Decide the labels first. LLM evaluators return classification annotations, so choose a small, mutually exclusive label set (often two: correct/incorrect, pass/fail, relevant/ irrelevant). Assign scores that match the optimization direction — higher score for the better label when maximizing. Add more labels only when the user needs to distinguish failure modes.

Identify which run field carries the signal. For dataset-backed evaluators, output is the new experiment run output at runtime and the dataset example output is passed as reference. Treat the dataset example shape as evidence for where the signal lives — chat-style messages arrays, assistant content parts, tool_calls/toolCalls, or function_call — rather than assuming a top-level key.

Write the judge prompt. A system message states the grader's role and the rubric; a user message presents the run fields via template variables ({{input}}, {{output}}, {{reference}}, {{metadata}}) and asks for the judgment. Make the rubric concrete: name the labels, say what each one means, and tie the decision to observable evidence in the fields.

Set the test payload to a representative case so the form preview exercises the judge. Shape testPayload.output from the dataset output shape or the user's concrete target case; treat it as representative evidence, not a fixed schema.

Propose the draft with edit_llm_evaluator_draft so the user reviews an accept/reject diff. Emit operations only for fields you intend to change.

After the user accepts a populated draft, offer to run test_llm_evaluator_draft (when it is available) before they click Create. Use the preview result to iterate: if the judge mislabels the test case, refine the rubric, the labels, or the test payload — not all three at once.

Repeat the refine-and-preview loop until the judge labels the representative cases correctly. Persistence stays the user's Create action in the form; do not claim the evaluator is saved.

Workflow: Refine An Existing LLM Evaluator

Use this when an LLM-evaluator form is already mounted (create or edit mode) and the user wants to improve the judge.

Call read_llm_evaluator_draft first to capture the current judge prompt, labels, input mapping, and testPayload. Reason about what the draft does before changing it.

Form a specific hypothesis for the change — a vague or under-specified rubric, the wrong field read, an ambiguous label set, or a test payload that misses the signal — rather than rewriting the whole judge.

Make the smallest edit that tests the hypothesis. Whole-list operations replace the judge prompt messages, the output configs, or the test payload; scoped operations change a single field. Keep the diff small so the user can read it.

When preview failures show the test case is missing the signal the judge should score, update the test payload to a representative case before concluding the judge prompt is wrong.

Re-run test_llm_evaluator_draft when available and compare. Keep the change only when the representative cases are labeled correctly and the tradeoff is acceptable.

Designing The Rubric And Labels

Keep labels mutually exclusive and collectively exhaustive for the task. A judge that cannot map a real output to exactly one label will be inconsistent.

State the rubric in the judge prompt, not just in the label names. "Mark correct only when the answer matches the reference's key facts; mark incorrect otherwise" beats a bare correct/ incorrect choice.

Prefer a binary judgment unless the user needs graded distinctions. More labels means more ambiguity at the boundaries and noisier results.

When the user wants a rationale, enable an explanation alongside the label so the judge justifies its choice — useful for spotting rubric ambiguity during iteration.

Things To Avoid

Don't propose edits without calling read_llm_evaluator_draft first — you will overwrite fields blindly and produce a noisy diff.

Don't assume the signal is at a top-level key. Inspect the dataset example shape and point the judge prompt at the field that actually carries the answer.

Don't change the judge model through set_judge_prompt; the model is a separate operation.

Don't hand-author the judge prompt tools or toolChoice. They are derived from the labels and the explanation setting and regenerated when the edit applies.

Don't claim the evaluator is created or updated. Persistence is the user's Create/Update action in the form.