evals

Stars: 16,098

Forks: 2,216

Updated: April 30, 2026 at 16:09

Source

danielmiessler

danielmiessler/LifeOS

name	Evals
description	Comprehensive AI agent evaluation framework with three grader types (code-based: deterministic/fast; model-based: nuanced/LLM rubric; human: gold standard) and pass@k / pass^k scoring. Evaluates agent transcripts, tool-call sequences, and multi-turn conversations — not just single outputs. Supports capability evals (~70% pass target) and regression evals (~99% pass target). Workflows: RunEval, CompareModels, ComparePrompts, CreateJudge, CreateUseCase, RunScenario, CreateScenario, ViewResults. Integrates with THE ALGORITHM ISC rows for automated verification. Domain patterns pre-configured for coding, conversational, research, and computer-use agent types in Data/DomainPatterns.yaml. Tools: AlgorithmBridge.ts (ISC integration), FailureToTask.ts (failures → tasks), SuiteManager.ts (create/graduate/saturation-check), ScenarioRunner.ts (multi-turn simulated-user), TranscriptCapture.ts, PAIAgentAdapter.ts (wraps Inference.ts), ScenarioToTranscript.ts. Code-based graders: string_match, regex_match, binary_tests, static_analysis, state_check, tool_calls. Model-based graders: llm_rubric, natural_language_assert, pairwise_comparison. USE WHEN eval, evaluate, benchmark, regression test, run eval, compare models, compare prompts, create judge, test agent, quality check, pass@k, grader, agent transcript, scenario simulation, capability test, before/after comparison, suite saturation, failure to task, graduate suite. NOT FOR general research or web investigation (use Research) or scientific method framing (use Science).
effort	high
context	fork

Customization

Before executing, check for user customizations at: ~/.claude/PAI/USER/SKILLCUSTOMIZATIONS/Evals/

If this directory exists, load and apply any PREFERENCES.md, configurations, or resources found there. These override default behavior. If the directory does not exist, proceed with skill defaults.

🚨 MANDATORY: Voice Notification (REQUIRED BEFORE ANY ACTION)

You MUST send this notification BEFORE doing anything else when this skill is invoked.

Send voice notification:

curl -s -X POST http://localhost:31337/notify \
  -H "Content-Type: application/json" \
  -d '{"message": "Running the WORKFLOWNAME workflow in the Evals skill to ACTION"}' \
  > /dev/null 2>&1 &

Output text notification:

Running the **WorkflowName** workflow in the **Evals** skill to ACTION...

This is not optional. Execute this curl command immediately upon skill invocation.

Evals - AI Agent Evaluation Framework

Comprehensive agent evaluation system based on Anthropic's "Demystifying Evals for AI Agents" (Jan 2026).

Key differentiator: Evaluates agent workflows (transcripts, tool calls, multi-turn conversations), not just single outputs.

When to Activate

"run evals", "test this agent", "evaluate", "check quality", "benchmark"
"regression test", "capability test"
"run scenario", "multi-turn eval", "simulated user test"
"create scenario", "simulate conversation"
Compare agent behaviors across changes
Validate agent workflows before deployment
Verify ALGORITHM ISC rows
Create new evaluation tasks from failures

Core Concepts

Three Grader Types

Type	Strengths	Weaknesses	Use For
Code-based	Fast, cheap, deterministic, reproducible	Brittle, lacks nuance	Tests, state checks, tool verification
Model-based	Flexible, captures nuance, scalable	Non-deterministic, expensive	Quality rubrics, assertions, comparisons
Human	Gold standard, handles subjectivity	Expensive, slow	Calibration, spot checks, A/B testing

Evaluation Types

Type	Pass Target	Purpose
Capability	~70%	Stretch goals, measuring improvement potential
Regression	~99%	Quality gates, detecting backsliding

Key Metrics

pass@k: Probability of at least 1 success in k trials (measures capability)
pass^k: Probability all k trials succeed (measures consistency/reliability)
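
As a rough illustration of the two metrics: if each trial succeeds independently with the same probability p, both reduce to closed forms. The sketch below makes that independence assumption and estimates p from observed trials. The function names are hypothetical and are not taken from `Tools/TrialRunner.ts`.

```typescript
// Hypothetical helpers; not the skill's actual API.
// Assumes trials are independent and share one success probability p.

/** Fraction of trials that passed (an empirical estimate of p). */
function successRate(trials: boolean[]): number {
  if (trials.length === 0) throw new Error("need at least one trial");
  return trials.filter(Boolean).length / trials.length;
}

/** pass@k: probability that at least one of k trials succeeds. */
function passAtK(p: number, k: number): number {
  return 1 - Math.pow(1 - p, k);
}

/** pass^k: probability that all k trials succeed. */
function passHatK(p: number, k: number): number {
  return Math.pow(p, k);
}

const observedRate = successRate([true, false, true, true]); // 0.75
console.log(passAtK(observedRate, 3).toFixed(3));  // 0.984
console.log(passHatK(observedRate, 3).toFixed(3)); // 0.422
```

With p = 0.75 and k = 3, pass@k is high while pass^k is low: the same agent looks capable but not yet reliable, which is why the two metrics are reported separately.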

Workflow Routing

Request Pattern	Route To
Run eval, evaluate suite, run tests, benchmark	`Workflows/RunEval.md`
Compare models, model comparison, A/B test models	`Workflows/CompareModels.md`
Compare prompts, prompt comparison, test prompts	`Workflows/ComparePrompts.md`
Create judge, model grader, evaluation judge	`Workflows/CreateJudge.md`
Create use case, new eval, test case, create suite	`Workflows/CreateUseCase.md`
Run scenario, multi-turn eval, simulated user test	`Workflows/RunScenario.md`
Create scenario, new multi-turn eval, simulate conversation	`Workflows/CreateScenario.md`
View results, eval results, scores, pass rate	`Workflows/ViewResults.md`

CLI Quick Reference

Trigger	Tool
Run suite	`Tools/AlgorithmBridge.ts`
Log failure	`Tools/FailureToTask.ts log`
Convert failures	`Tools/FailureToTask.ts convert-all`
Create suite	`Tools/SuiteManager.ts create`
Check saturation	`Tools/SuiteManager.ts check-saturation`
Run scenario	`Tools/ScenarioRunner.ts --scenario <path>`

Quick Reference

CLI Commands

# Run an eval suite
bun run ${CLAUDE_SKILL_DIR}/Tools/AlgorithmBridge.ts -s <suite>

# Log a failure for later conversion
bun run ${CLAUDE_SKILL_DIR}/Tools/FailureToTask.ts log "description" -c category -s severity

# Convert failures to test tasks
bun run ${CLAUDE_SKILL_DIR}/Tools/FailureToTask.ts convert-all

# Manage suites
bun run ${CLAUDE_SKILL_DIR}/Tools/SuiteManager.ts create <name> -t capability -d "description"
bun run ${CLAUDE_SKILL_DIR}/Tools/SuiteManager.ts list
bun run ${CLAUDE_SKILL_DIR}/Tools/SuiteManager.ts check-saturation <name>
bun run ${CLAUDE_SKILL_DIR}/Tools/SuiteManager.ts graduate <name>

ALGORITHM Integration

Evals is a verification method for THE ALGORITHM ISC rows:

# Run eval and update ISC row
bun run ${CLAUDE_SKILL_DIR}/Tools/AlgorithmBridge.ts -s regression-core -r 3 -u

ISC rows can specify eval verification:

| # | What Ideal Looks Like | Verify |
|---|----------------------|--------|
| 1 | Auth bypass fixed | eval:auth-security |
| 2 | Tests all pass | eval:regression |

Available Graders

Code-Based (Fast, Deterministic)

Grader	Use Case
`string_match`	Exact substring matching
`regex_match`	Pattern matching
`binary_tests`	Run test files
`static_analysis`	Lint, type-check, security scan
`state_check`	Verify system state after execution
`tool_calls`	Verify specific tools were called
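
To make the `tool_calls` idea concrete, here is a minimal sketch of one possible check. It assumes the grader accepts an ordered-subsequence match (expected tools appear in order, with unrelated calls allowed in between); the real grader may be stricter. The function name and the `list_dir` tool are hypothetical.

```typescript
// Hypothetical sketch of a tool_calls-style check; the real grader may differ.
// Passes when every expected tool appears in the transcript in the given order,
// allowing unrelated calls in between (an ordered-subsequence match).
function toolCallsInOrder(actual: string[], expected: string[]): boolean {
  let next = 0; // index of the next expected tool still to be found
  for (const tool of actual) {
    if (next < expected.length && tool === expected[next]) next++;
  }
  return next === expected.length;
}

// `list_dir` is an illustrative extra call, not a tool named by the skill.
const calls = ["read_file", "list_dir", "edit_file", "run_tests"];
console.log(toolCallsInOrder(calls, ["read_file", "edit_file", "run_tests"])); // true
console.log(toolCallsInOrder(calls, ["edit_file", "read_file"]));              // false
```

A subsequence match tolerates exploratory calls, which fits the principle of grading outcomes rather than exact paths.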

Model-Based (Nuanced)

Grader	Use Case
`llm_rubric`	Score against detailed rubric
`natural_language_assert`	Check assertions are true
`pairwise_comparison`	Compare to reference with position swap
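
The "position swap" in `pairwise_comparison` can be sketched as follows. This is an assumption about the general technique, not the skill's implementation: the judge is asked twice with the order reversed, and a winner is accepted only when both orderings agree. The `Judge` type stands in for an LLM call (which would normally be asynchronous) and is hypothetical.

```typescript
// Hypothetical sketch; `Judge` stands in for an LLM call and is not a real API.
type Verdict = "first" | "second" | "tie";
type Judge = (first: string, second: string) => Verdict;
type Winner = "candidate" | "reference" | "tie";

// Ask the judge twice with the order swapped. Accept a winner only if both
// orderings agree; otherwise report a tie so position bias cancels out.
function pairwiseWithSwap(judge: Judge, candidate: string, reference: string): Winner {
  const forward = judge(candidate, reference);  // candidate shown first
  const backward = judge(reference, candidate); // candidate shown second
  if (forward === "first" && backward === "second") return "candidate";
  if (forward === "second" && backward === "first") return "reference";
  return "tie";
}

// Toy judges for illustration: one prefers the longer text, one always picks slot 1.
const longerWins: Judge = (a, b) =>
  a.length === b.length ? "tie" : a.length > b.length ? "first" : "second";
const positionBiased: Judge = () => "first";

console.log(pairwiseWithSwap(longerWins, "a detailed answer", "short"));     // candidate
console.log(pairwiseWithSwap(positionBiased, "a detailed answer", "short")); // tie
```

The second call shows the point of the swap: a judge that always favors the first slot produces contradictory verdicts, which collapse to a tie instead of a false win.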

Domain Patterns

Pre-configured grader stacks for common agent types:

Domain	Primary Graders
`coding`	binary_tests + static_analysis + tool_calls + llm_rubric
`conversational`	llm_rubric + natural_language_assert + state_check
`research`	llm_rubric + natural_language_assert + tool_calls
`computer_use`	state_check + tool_calls + llm_rubric

See Data/DomainPatterns.yaml for full configurations.

Task Schema (YAML)

task:
  id: "fix-auth-bypass_1"
  description: "Fix authentication bypass when password is empty"
  type: regression  # or capability
  domain: coding

  graders:
    - type: binary_tests
      required: [test_empty_pw.py]
      weight: 0.30

    - type: tool_calls
      weight: 0.20
      params:
        sequence: [read_file, edit_file, run_tests]

    - type: llm_rubric
      weight: 0.50
      params:
        rubric: prompts/security_review.md

  trials: 3
  pass_threshold: 0.75
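
The schema does not state how `weight` and `pass_threshold` combine. One plausible reading, sketched here as an assumption, is a weighted average of per-grader scores compared against the threshold. The `GraderResult` shape and both function names are hypothetical; the real definitions live in `Types/index.ts`.

```typescript
// Hypothetical sketch: assumes a trial's score is the weighted average of
// grader scores, and the trial passes when that score meets pass_threshold.
interface GraderResult {
  type: string;
  weight: number; // relative weight, as in the task schema
  score: number;  // grader score in [0, 1]
}

function weightedScore(results: GraderResult[]): number {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight <= 0) throw new Error("weights must sum to a positive value");
  const weighted = results.reduce((sum, r) => sum + r.weight * r.score, 0);
  return weighted / totalWeight; // normalize in case weights do not sum to 1
}

function trialPasses(results: GraderResult[], passThreshold: number): boolean {
  return weightedScore(results) >= passThreshold;
}

// Scores are made up for illustration; weights mirror the schema above.
const exampleResults: GraderResult[] = [
  { type: "binary_tests", weight: 0.30, score: 1.0 },
  { type: "tool_calls",   weight: 0.20, score: 1.0 },
  { type: "llm_rubric",   weight: 0.50, score: 0.6 },
];
console.log(weightedScore(exampleResults).toFixed(2)); // 0.80
console.log(trialPasses(exampleResults, 0.75));        // true
```

Under this reading, the `llm_rubric` grader carries half the weight, so a weak rubric score can fail a trial even when the tests and tool sequence both pass.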

Resource Index

Resource	Purpose
`Types/index.ts`	Core type definitions
`Graders/CodeBased/`	Deterministic graders
`Graders/ModelBased/`	LLM-powered graders
`Tools/TranscriptCapture.ts`	Capture agent trajectories
`Tools/TrialRunner.ts`	Multi-trial execution with pass@k
`Tools/SuiteManager.ts`	Suite management and saturation
`Tools/FailureToTask.ts`	Convert failures to test tasks
`Tools/AlgorithmBridge.ts`	ALGORITHM integration
`Tools/ScenarioRunner.ts`	Multi-turn scenario runner (langwatch/scenario)
`Tools/PAIAgentAdapter.ts`	Wraps PAI Inference.ts as scenario AgentAdapter
`Tools/ScenarioToTranscript.ts`	Scenario result → Evals Transcript/Trial/GraderResult
`Scenarios/`	Authored multi-turn scenarios (`.scenario.ts`)
`Data/DomainPatterns.yaml`	Domain-specific grader configs

Key Principles (from Anthropic)

Start with 20-50 real failures - Don't overthink, capture what actually broke
Unambiguous tasks - Two experts should reach identical verdicts
Balanced problem sets - Test both "should do" AND "should NOT do"
Grade outputs, not paths - Don't penalize valid creative solutions
Calibrate LLM judges - Against human expert judgment
Check transcripts regularly - Verify graders work correctly
Monitor saturation - Graduate to regression when hitting 95%+
Build infrastructure early - Evals shape how quickly you can adopt new models
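
The saturation principle can be sketched as a simple check. The 95% threshold comes from the list above; the idea of requiring it across a window of recent runs is an assumption added here, and `Tools/SuiteManager.ts` may define saturation differently. The function name is hypothetical.

```typescript
// Hypothetical sketch; SuiteManager.ts may define saturation differently.
// Assumption: a capability suite is "saturated" once each of its most recent
// `window` runs reaches the 95% pass rate named in the principle above.
function isSaturated(passRates: number[], window = 3, threshold = 0.95): boolean {
  if (passRates.length < window) return false; // not enough history yet
  return passRates.slice(-window).every((rate) => rate >= threshold);
}

console.log(isSaturated([0.70, 0.96, 0.97, 0.99])); // true
console.log(isSaturated([0.96, 0.97, 0.90]));       // false
console.log(isSaturated([0.99]));                   // false
```

Requiring several consecutive runs guards against graduating a suite on one lucky result.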

Related

ALGORITHM: Evals is a verification method
Science: Evals implements the scientific method
Browser: For visual verification graders

Gotchas

Choose the right grader type: Code-based for deterministic checks (fast, cheap). Model-based for nuanced quality (flexible, expensive). Human for calibration (gold standard, slow).
pass@k scoring requires multiple runs. A single run cannot estimate a success rate, let alone consistency. Default to at least 3 trials (pass@3).
Transcript capture must be enabled BEFORE the test run. Transcripts cannot be captured retroactively.
Eval results go to the current work directory — not a global location. Tie evals to the work item.
Don't evaluate skills with trivial prompts. Simple one-liners may not trigger skill usage. Test prompts must be substantive.

Examples

Example 1: Compare two prompts

User: "evaluate which prompt produces better summaries"
→ Creates eval suite with 3+ test cases
→ Runs both prompts against test cases
→ Model-based grader scores quality
→ Reports pass@k and comparative analysis

Example 2: Regression test a skill change

User: "run evals on the Research skill after the update"
→ Uses existing test fixtures for Research
→ Before/after comparison
→ Reports any quality regressions

Execution Log

After completing any workflow, append a single JSONL entry:

echo '{"ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","skill":"Evals","workflow":"WORKFLOW_USED","input":"8_WORD_SUMMARY","status":"ok|error","duration_s":SECONDS}' >> ~/.claude/PAI/MEMORY/SKILLS/execution.jsonl

Replace WORKFLOW_USED with the workflow executed, 8_WORD_SUMMARY with a brief (roughly eight-word) description of the input, and SECONDS with the approximate wall-clock time in seconds. Set status to "ok" on success, or to "error" if the workflow failed.
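
For workflows driven from TypeScript rather than the shell, the same entry could be built as below. This is a minimal sketch: the field names are taken from the echo command above, while `ExecutionEntry` and `buildLogLine` are hypothetical and not part of the skill.

```typescript
// Hypothetical helper mirroring the echo command above; field names come from
// that command, but the function itself is not part of the skill.
interface ExecutionEntry {
  ts: string;
  skill: "Evals";
  workflow: string;
  input: string;
  status: "ok" | "error";
  duration_s: number;
}

function buildLogLine(
  workflow: string,
  input: string,
  status: "ok" | "error",
  durationS: number,
  now: Date = new Date(),
): string {
  const entry: ExecutionEntry = {
    // Drop milliseconds to match `date -u +%Y-%m-%dT%H:%M:%SZ`.
    ts: now.toISOString().replace(/\.\d{3}Z$/, "Z"),
    skill: "Evals",
    workflow,
    input,
    status,
    duration_s: durationS,
  };
  return JSON.stringify(entry); // one JSON object per line (JSONL)
}

console.log(
  buildLogLine("RunEval", "ran regression-core suite with three trials", "ok", 42,
    new Date("2026-01-01T00:00:00Z")),
);
```

Unlike the raw echo template, `JSON.stringify` escapes quotes and backslashes in the input summary, so each appended line stays valid JSON.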
