updated: February 26, 2026 at 00:02
SKILL.md
| Field | Value |
|---|---|
| name | phoenix-evals |
| description | Build and run evaluators for AI/LLM applications using Phoenix. |
| license | Apache-2.0 |
| metadata | {"author":"oss@arize.com","version":"1.0.0","languages":"Python, TypeScript"} |
Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.
| Task | Files |
|---|---|
| Setup | setup-python, setup-typescript |
| Build code evaluator | evaluators-code-{python\|typescript} |
| Build LLM evaluator | evaluators-llm-{python\|typescript}, evaluators-custom-templates |
| Batch evaluate DataFrame | evaluate-dataframe-python |
| Run experiment | experiments-running-{python\|typescript} |
| Create dataset | experiments-datasets-{python\|typescript} |
| Validate evaluator | validation, validation-calibration-{python\|typescript} |
| Analyze errors | error-analysis, axial-coding |
| RAG evals | evaluators-rag |
| Avoid common mistakes | common-mistakes-python |
| Production | production-overview, production-guardrails |
Starting Fresh:
observe-tracing-setup → error-analysis → axial-coding → evaluators-overview
Building Evaluator:
fundamentals → common-mistakes-python → evaluators-{code\|llm}-{python\|typescript} → validation-calibration-{python\|typescript}
RAG Systems:
evaluators-rag → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness)
Production:
production-overview → production-guardrails → production-continuous
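For the RAG Systems path above, the retrieval step is the part that can usually be checked deterministically before any LLM judge is involved. The following is a minimal sketch in plain Python, not the Phoenix API: `retrieval_hit` and `hit_rate` are hypothetical helper names, and the document IDs and relevance labels are assumed to come from your own annotated dataset.

```python
# Hypothetical helpers for illustration; these are not part of Phoenix.

def retrieval_hit(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> bool:
    """Binary check: pass if any of the top-k retrieved documents is labelled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return any(doc_id in relevant_ids for doc_id in retrieved_ids[:k])


def hit_rate(examples: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Fraction of examples whose top-k retrieval contains a relevant document."""
    if not examples:
        raise ValueError("examples must not be empty")
    hits = sum(retrieval_hit(retrieved, relevant, k) for retrieved, relevant in examples)
    return hits / len(examples)


# Each example pairs the retriever's ranked output with the human-labelled relevant IDs.
examples = [
    (["d1", "d2", "d3"], {"d3"}),  # relevant document appears at rank 3
    (["d4", "d5", "d6"], {"d9"}),  # relevant document was never retrieved
]
print(hit_rate(examples, k=3))
```

A check like this returns pass/fail per example, so failures can be inspected one by one; faithfulness of the generated answer is the part left to an LLM evaluator.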
| Prefix | Description |
|---|---|
| fundamentals-* | Types, scores, anti-patterns |
| observe-* | Tracing, sampling |
| error-analysis-* | Finding failures |
| axial-coding-* | Categorizing failures |
| evaluators-* | Code, LLM, RAG evaluators |
| experiments-* | Datasets, running experiments |
| validation-* | Calibrating judges |
| production-* | CI/CD, monitoring |
| Principle | Action |
|---|---|
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges | Require >80% TPR and TNR against human labels |
| Binary > Likert | Pass/fail, not 1-5 |
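The "Validate judges" threshold can be made concrete with a small calculation. The sketch below is plain Python rather than Phoenix code, and it rests on two assumptions: "pass" is treated as the positive class, and the human labels are treated as ground truth. `JudgeAgreement`, `judge_agreement`, and `is_calibrated` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class JudgeAgreement:
    tpr: float  # share of human-labelled passes that the judge also passed
    tnr: float  # share of human-labelled fails that the judge also failed


def judge_agreement(human: list[bool], judge: list[bool]) -> JudgeAgreement:
    """Compare binary judge verdicts with human labels (True = pass, assumed positive class)."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must have the same length")
    positives = sum(human)
    negatives = len(human) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one human pass and one human fail")
    true_pos = sum(h and j for h, j in zip(human, judge))
    true_neg = sum((not h) and (not j) for h, j in zip(human, judge))
    return JudgeAgreement(tpr=true_pos / positives, tnr=true_neg / negatives)


def is_calibrated(agreement: JudgeAgreement, threshold: float = 0.80) -> bool:
    """Apply the >80% rule to both rates; a judge must clear both to count as validated."""
    return agreement.tpr > threshold and agreement.tnr > threshold


# Ten human-labelled examples: five passes followed by five fails.
human = [True] * 5 + [False] * 5
# The judge agrees on every pass but wrongly passes one of the fails.
judge = [True] * 5 + [False] * 4 + [True]

agreement = judge_agreement(human, judge)
print(agreement, is_calibrated(agreement))
```

Reporting the two rates separately matters because a judge that passes everything scores a perfect TPR while catching no failures; binary labels keep both rates well defined, which a 1-5 scale does not.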