build-review-interface

Guides building or improving interfaces for human review of eval traces. Use when humans need to inspect failures, label outputs, compare model behavior, or audit evaluator decisions at scale.

Stars: 3
Forks: 0
Updated: April 29, 2026 at 14:29
Source: itseffi/ai-product-evals

SKILL.md

name	build-review-interface
description	Guides building or improving interfaces for human review of eval traces. Use when humans need to inspect failures, label outputs, compare model behavior, or audit evaluator decisions at scale.

Build Review Interface

Overview

Start from the human review task, not from UI widgets.
Optimize the interface for fast, correct judgment on traces.
Show the evidence needed to make a decision without forcing excessive context switching.
Capture structured labels that can later validate evaluators or improve datasets.
Use the interface to support error analysis, not just browsing.

Prerequisites

Inspect the current review surface in app.html, the existing trace structure in traces/, and the runner output format in run-eval.mjs and tracer.mjs. Then determine what human reviewers need to decide and which information is currently missing.
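One way to find out which information is missing is to measure how often each field a reviewer would need is actually present in the loaded traces. The sketch below assumes traces are already parsed into plain objects; the helper name `fieldCoverage` and the field names used with it are hypothetical and are not taken from the repository.

```javascript
// Hypothetical helper: for each field, report the share of traces that
// carry a usable (non-empty) value. The field names are chosen by the
// caller; nothing here reflects the real trace format in traces/.
function fieldCoverage(traces, fields) {
  const coverage = {};
  for (const field of fields) {
    const present = traces.filter((trace) => {
      const value = trace[field];
      return value !== undefined && value !== null && value !== "";
    }).length;
    // An empty trace set yields 0 coverage rather than NaN.
    coverage[field] = traces.length === 0 ? 0 : present / traces.length;
  }
  return coverage;
}
```

Fields with low coverage are candidates for tracer changes before any UI work starts.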

Core Instructions

Define The Review Task Clearly

Decide whether the reviewer is being asked to:

accept or reject a model response
identify a failure category
compare two model outputs
validate a judge decision
annotate retrieval versus generation failures

The UI should be built around one or two explicit review tasks.
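The task list above can be written down as data before any UI is built. This is a minimal sketch, assuming a small registry of task definitions; the task ids, questions, and field names are illustrative and do not come from the repository.

```javascript
// Hypothetical registry of review tasks, one entry per task listed above.
// Ids, questions, and field names are assumptions made for illustration.
const REVIEW_TASKS = {
  accept_reject: { question: "Is this response acceptable?", fields: ["verdict"] },
  failure_category: { question: "Which failure category applies?", fields: ["failureCategory"] },
  compare_outputs: { question: "Which output is better?", fields: ["preferredOutput"] },
  validate_judge: { question: "Was the judge decision correct?", fields: ["evaluatorDisagreement"] },
  retrieval_vs_generation: { question: "Did retrieval or generation fail?", fields: ["failureStage"] },
};

// Resolve the tasks a review session is built around and enforce the
// "one or two explicit review tasks" guideline.
function selectReviewTasks(taskIds) {
  if (taskIds.length < 1 || taskIds.length > 2) {
    throw new Error(`Expected 1 or 2 review tasks, got ${taskIds.length}`);
  }
  return taskIds.map((id) => {
    if (!Object.hasOwn(REVIEW_TASKS, id)) {
      throw new Error(`Unknown review task: ${id}`);
    }
    return { id, ...REVIEW_TASKS[id] };
  });
}
```

Calling `selectReviewTasks(["accept_reject", "validate_judge"])` returns the two task definitions, while an empty list or a list of three or more throws, which keeps scope creep visible early.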

Show The Minimum Necessary Context

For each record, consider showing:

prompt
system prompt
retrieved context if relevant
model response
evaluator decision
trace metadata

Do not hide the evidence that explains why a label should be applied.
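As a sketch of how a single review card might gather this evidence, the function below maps a trace object onto the fields listed above and records which pieces are absent, so the UI can flag gaps instead of silently hiding them. The trace property names (`prompt`, `systemPrompt`, `retrievedContext`, `response`, `evaluatorDecision`, `metadata`) are assumptions; the actual shape produced by tracer.mjs may differ.

```javascript
// Hypothetical view model for one review card. All trace property names
// are assumed for illustration and may not match the real trace format.
function buildReviewRecord(trace) {
  const record = {
    prompt: trace.prompt ?? null,
    systemPrompt: trace.systemPrompt ?? null,
    retrievedContext: trace.retrievedContext ?? null, // only relevant for RAG traces
    response: trace.response ?? null,
    evaluatorDecision: trace.evaluatorDecision ?? null,
    metadata: trace.metadata ?? {},
  };
  // Evidence a reviewer cannot judge without; list what is missing
  // explicitly so the interface can surface the gap.
  const required = ["prompt", "response", "evaluatorDecision"];
  record.missing = required.filter((key) => record[key] === null);
  return record;
}
```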

Capture Structured Labels

Prefer structured fields over free-form notes alone. Useful fields include:

pass/fail
failure category
severity
corrected answer
evaluator disagreement

These labels should be reusable later for evaluator validation or dataset cleanup.

Use the schema in docs/schemas/labels.md. Review exports should be consumable by scripts/validate-evaluator.mjs, scripts/promote-labels-to-eval.mjs, and scripts/error-analysis.mjs.
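The sketch below shows what validating a label before export could look like. It does not reproduce docs/schemas/labels.md: the property names, the category list, and the severity levels are all placeholders chosen for illustration and should be replaced with whatever the real schema defines.

```javascript
// Placeholder vocabularies; the real values belong to docs/schemas/labels.md.
const FAILURE_CATEGORIES = ["retrieval", "generation", "formatting", "other"];
const SEVERITIES = ["low", "medium", "high"];

// Hypothetical validator: returns a list of problems, empty when the label
// is well formed under the assumed shape.
function validateLabel(label) {
  const errors = [];
  if (typeof label.traceId !== "string" || label.traceId === "") {
    errors.push("traceId is required");
  }
  if (label.verdict !== "pass" && label.verdict !== "fail") {
    errors.push('verdict must be "pass" or "fail"');
  }
  // A failing verdict needs a category and severity to be useful later.
  if (label.verdict === "fail") {
    if (!FAILURE_CATEGORIES.includes(label.failureCategory)) {
      errors.push("failureCategory is missing or unknown");
    }
    if (!SEVERITIES.includes(label.severity)) {
      errors.push("severity is missing or unknown");
    }
  }
  if (
    label.evaluatorDisagreement !== undefined &&
    typeof label.evaluatorDisagreement !== "boolean"
  ) {
    errors.push("evaluatorDisagreement must be a boolean when present");
  }
  return errors;
}
```

Rejecting malformed labels at capture time is cheaper than discovering them when a downstream script tries to consume the export.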

Use Review To Improve The Eval Pipeline

The interface should help answer:

which failures are real model failures
which are evaluator mistakes
which test cases need rewriting
which missing cases should be added
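These four questions can be answered by aggregating exported labels. The following is a rough sketch under assumed property names (`humanVerdict`, `evaluatorVerdict`, `testCaseNeedsRewrite`, `suggestedCase`), none of which are confirmed by the repository. It treats any mismatch between the human verdict and the evaluator verdict as an evaluator mistake, which is a simplification.

```javascript
// Hypothetical triage over exported labels. Property names are assumed.
function triageLabels(labels) {
  const summary = {
    modelFailures: 0,     // human reviewer judged the response a failure
    evaluatorMistakes: 0, // evaluator verdict differs from the human verdict
    casesToRewrite: 0,    // reviewer flagged the test case itself
    suggestedCases: [],   // reviewer proposals for missing cases
  };
  for (const label of labels) {
    if (label.humanVerdict === "fail") summary.modelFailures += 1;
    if (label.humanVerdict !== label.evaluatorVerdict) summary.evaluatorMistakes += 1;
    if (label.testCaseNeedsRewrite === true) summary.casesToRewrite += 1;
    if (label.suggestedCase) summary.suggestedCases.push(label.suggestedCase);
  }
  return summary;
}
```

A rising `evaluatorMistakes` count relative to `modelFailures` suggests the evaluator needs attention before prompts or datasets are changed.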

Repo Files To Inspect

app.html
traces/
tracer.mjs
run-eval.mjs
evaluators/index.mjs
labels/schema.mjs
docs/schemas/labels.md

Anti-Patterns

Building UI before defining the review task.
Showing too little context for a human to judge accurately.
Capturing only free-form notes with no structured labels.
Treating review as a one-off dashboard instead of a data-generation tool.
Making reviewers navigate multiple screens to answer one simple question.

More from this repository

write-judge-prompt

itseffi/ai-product-evals

Guides design of LLM-as-judge prompts for subjective evaluation criteria. Use when deterministic checks are insufficient and you need a judge prompt for quality dimensions like helpfulness, faithfulness, clarity, or tone.

2026-05-24, 3 stars

error-analysis

itseffi/ai-product-evals

Guides systematic analysis of eval failures using traces. Use when a suite is failing, model outputs seem inconsistent, evaluator behavior is suspect, or you need to classify failures before changing prompts, metrics, or datasets.

2026-04-29, 3 stars

evaluate-rag

itseffi/ai-product-evals

Guides evaluation of RAG pipeline retrieval and generation quality. Use when evaluating a retrieval-augmented generation system, measuring retrieval quality, assessing generation faithfulness or relevance, generating synthetic QA pairs for retrieval testing, or optimizing chunking strategies.

2026-04-29, 3 stars

generate-synthetic-data

itseffi/ai-product-evals

Guides creation of synthetic eval cases that expand coverage without drifting away from real usage. Use when the current eval set is too small, too repetitive, or missing edge cases, and you need more diverse prompts, distractors, or structured scenarios.

2026-04-29, 3 stars

propose-judge-patch

itseffi/ai-product-evals

Drafts a reviewable judge-template patch from evaluator validation disagreements.

2026-04-29, 3 stars

validate-evaluator

itseffi/ai-product-evals

Guides validation of evaluators, especially LLM judges, against labeled examples. Use when evaluator quality is uncertain, judge scores seem inconsistent, or you need to check whether the evaluator is biased, noisy, or misaligned.

2026-04-29, 3 stars
