build-review-interface
Guides building or improving interfaces for human review of eval traces. Use when humans need to inspect failures, label outputs, compare model behavior, or audit evaluator decisions at scale.
| name | build-review-interface |
| description | Guides building or improving interfaces for human review of eval traces. Use when humans need to inspect failures, label outputs, compare model behavior, or audit evaluator decisions at scale. |
Inspect the current review surface in app.html, the existing trace structure in traces/, and the runner output format in run-eval.mjs and tracer.mjs. Determine what human reviewers need to decide and what information they are currently missing.
Decide whether the reviewer is being asked to:
The UI should be built around one or two explicit review tasks.
For each record, consider showing:
Do not hide the evidence that explains why a label should be applied.
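One way to keep the evidence next to the decision is to build a flat review record from each trace before rendering it. The following is a minimal sketch, not the repository's actual code: the trace shape (`id`, `input`, `output`, `evaluator.name`, `evaluator.score`, `evaluator.rationale`) and the function name `toReviewRecord` are assumptions made for illustration.

```javascript
// Hypothetical trace shape, assumed for illustration only:
// { id, input, output, evaluator: { name, score, rationale } }
function toReviewRecord(trace) {
  const evaluator = trace.evaluator ?? {};
  return {
    traceId: trace.id,
    // Missing fields are shown explicitly instead of being silently dropped.
    input: trace.input ?? "(missing input)",
    output: trace.output ?? "(missing output)",
    evaluatorName: evaluator.name ?? "(unknown evaluator)",
    evaluatorScore: evaluator.score ?? null,
    // Keep the evaluator's stated reasoning visible next to its score.
    evaluatorRationale: evaluator.rationale ?? "(no rationale recorded)",
  };
}

const example = toReviewRecord({
  id: "trace-001",
  input: "Summarize the refund policy.",
  output: "Refunds are available within 30 days.",
  evaluator: { name: "faithfulness", score: 0.4 },
});
console.log(example.evaluatorRationale);
```

Showing an explicit placeholder for a missing rationale tells the reviewer that the evidence is absent, which is different from the evidence being hidden by the UI.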
Prefer structured fields to free-form notes alone, such as:
These labels should be reusable later for evaluator validation or dataset cleanup.
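A structured label is easier to reuse if the interface checks it before saving. The sketch below assumes a hypothetical label with `traceId`, `verdict`, `failureCategory`, and optional `notes`; the real field names and allowed values are defined in docs/schemas/labels.md, which is not reproduced here, so the vocabulary below is purely illustrative.

```javascript
// Hypothetical vocabularies; the real ones belong to docs/schemas/labels.md.
const VERDICTS = ["pass", "fail", "unsure"];
const FAILURE_CATEGORIES = ["hallucination", "format", "refusal", "other"];

// Returns a list of human-readable problems; an empty list means the label is valid.
function validateLabel(label) {
  const errors = [];
  if (typeof label.traceId !== "string" || label.traceId.length === 0) {
    errors.push("traceId must be a non-empty string");
  }
  if (!VERDICTS.includes(label.verdict)) {
    errors.push(`verdict must be one of: ${VERDICTS.join(", ")}`);
  }
  // A failure category is only required when the verdict is "fail".
  if (label.verdict === "fail" && !FAILURE_CATEGORIES.includes(label.failureCategory)) {
    errors.push(`failureCategory must be one of: ${FAILURE_CATEGORIES.join(", ")}`);
  }
  // Notes stay optional so they supplement the structured fields.
  if (label.notes !== undefined && typeof label.notes !== "string") {
    errors.push("notes must be a string when present");
  }
  return errors;
}

console.log(validateLabel({ traceId: "trace-001", verdict: "fail" }));
```

In this sketch a "fail" verdict without a category is rejected, which nudges reviewers toward labels that can later be grouped and counted.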
Use the schema in docs/schemas/labels.md. Review exports should be consumable by scripts/validate-evaluator.mjs, scripts/promote-labels-to-eval.mjs, and scripts/error-analysis.mjs.
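To illustrate an export that downstream scripts could read, the sketch below assumes JSON Lines (one JSON object per line). This is an assumption: the actual export format is whatever docs/schemas/labels.md specifies, and the function names `labelsToJsonl` and `jsonlToLabels` are hypothetical rather than taken from the repository.

```javascript
// Serialize labels as JSON Lines: one JSON object per line.
// JSON Lines is assumed here for illustration only.
function labelsToJsonl(labels) {
  return labels.map((label) => JSON.stringify(label) + "\n").join("");
}

// Parse an export back into label objects, skipping blank lines.
function jsonlToLabels(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

const exported = labelsToJsonl([
  { traceId: "trace-001", verdict: "fail", failureCategory: "format" },
  { traceId: "trace-002", verdict: "pass" },
]);
console.log(jsonlToLabels(exported).length);
```

A round trip like this is a cheap check that an export survives being written and read again before it is handed to other scripts.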
The interface should help answer:
app.html, traces/, tracer.mjs, run-eval.mjs, evaluators/index.mjs, labels/schema.mjs, docs/schemas/labels.md