$pwd:

exploring-llm-evaluations

// Investigate AI observability evaluations of both types — `hog` (deterministic code-based) and `llm_judge` (LLM-prompt-based). Find existing evaluations, inspect their configuration, run them against specific generations, query individual pass/fail results, and generate AI-powered summaries of patterns across many runs. Use when the user asks to debug why an evaluation is failing, surface common failure modes, compare results across filters, dry-run a Hog evaluator, prototype a new LLM-judge prompt, or manage the evaluation lifecycle (create, update, enable/disable, delete).

$ git log --oneline --stat
stars:34,659
forks:2,753
updated:May 21, 2026 20:11
SKILL.md
readonly