evaluate-presets
Use when testing Ralph's hat collection presets, validating preset configurations, or auditing the preset library for bugs and UX issues.
| Field | Value |
|---|---|
| name | evaluate-presets |
| description | Use when testing Ralph's hat collection presets, validating preset configurations, or auditing the preset library for bugs and UX issues. |
| metadata | {"internal":true} |
Systematically test all hat collection presets using shell scripts. Direct CLI invocation—no meta-orchestration complexity.
Evaluate a single preset:
./tools/evaluate-preset.sh tdd-red-green claude
Evaluate all presets:
./tools/evaluate-all-presets.sh claude
Arguments:
- Preset name (without the .yml extension)
- Agent: claude or kiro (defaults to claude)

IMPORTANT: When invoking these scripts via the Bash tool, use timeout: 600000 (10 minutes max) and run_in_background: true. Since preset evaluations can run for hours (especially the full suite), always run in background mode and use the TaskOutput tool to check progress periodically.
Example invocation pattern:
Bash tool with:
command: "./tools/evaluate-preset.sh tdd-red-green claude"
timeout: 600000
run_in_background: true
After launching, use TaskOutput with block: false to check status without waiting for completion.
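For reference, the same launch-and-poll pattern looks roughly like this in plain shell. This is a sketch of the idea only; the preset name and log path are illustrative, and inside Claude the Bash tool settings above are what actually matter.

```sh
# Rough plain-shell equivalent of "launch in background, poll for progress".
# Preset name and log path are illustrative.
./tools/evaluate-preset.sh tdd-red-green claude > /tmp/eval-tdd.log 2>&1 &
eval_pid=$!

# Poll without blocking: peek at the log while the evaluation is still running.
while kill -0 "$eval_pid" 2>/dev/null; do
  tail -n 5 /tmp/eval-tdd.log
  sleep 60
done

wait "$eval_pid"
echo "evaluation finished with exit code $?"
```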
evaluate-preset.sh:
- Uses the test task from tools/preset-test-tasks.yml (if yq is available)
- Runs with --record-session for metrics capture

Output structure:
.eval/
├── logs/<preset>/<timestamp>/
│ ├── output.log # Full stdout/stderr
│ ├── session.jsonl # Recorded session
│ ├── metrics.json # Extracted metrics
│ ├── environment.json # Runtime environment
│ └── merged-config.yml # Config used
└── logs/<preset>/latest -> <timestamp>
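A quick way to inspect the most recent run of a preset, assuming jq is installed (the preset name below is illustrative):

```sh
# Peek at the most recent run of a preset via the `latest` symlink.
run=.eval/logs/tdd-red-green/latest
jq . "$run/metrics.json"          # extracted metrics
tail -n 20 "$run/output.log"      # end of the full stdout/stderr log
```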
evaluate-all-presets.sh:
Runs all 12 presets sequentially and generates a summary:
.eval/results/<suite-id>/
├── SUMMARY.md # Markdown report
├── <preset>.json # Per-preset metrics
└── latest -> <suite-id>
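To skim a finished suite, the summary plus per-preset JSON can be checked like this. A sketch only: it assumes the per-preset JSON carries the same completed field as metrics.json, which is not guaranteed.

```sh
# Read the suite summary, then flag presets whose metrics say they never completed.
# Assumes per-preset JSON mirrors metrics.json (an assumption, not guaranteed).
cat .eval/results/latest/SUMMARY.md
for f in .eval/results/latest/*.json; do
  jq -e '.completed == true' "$f" > /dev/null || echo "incomplete: $f"
done
```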
| Preset | Test Task |
|---|---|
| tdd-red-green | Add is_palindrome() function |
| adversarial-review | Review user input handler for security |
| socratic-learning | Understand HatRegistry |
| spec-driven | Specify and implement StringUtils::truncate() |
| mob-programming | Implement a Stack data structure |
| scientific-method | Debug failing mock test assertion |
| code-archaeology | Understand history of config.rs |
| performance-optimization | Profile hat matching |
| api-design | Design a Cache trait |
| documentation-first | Document RateLimiter |
| incident-response | Respond to "tests failing in CI" |
| migration-safety | Plan v1 to v2 config migration |
Exit codes from evaluate-preset.sh:
- 0: Success (LOOP_COMPLETE reached)
- 124: Timeout (preset hung or took too long)
- Other non-zero: Failure (see output.log)

Metrics in metrics.json:

- iterations: How many event loop cycles
- hats_activated: Which hats were triggered
- events_published: Total events emitted
- completed: Whether the completion promise was reached

Critical: Validate that hats get fresh context per Tenet #1 ("Fresh Context Is Reliability").
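Before digging into the fresh-context check, the metrics themselves can be pulled quickly. Field names come from the list above; the preset name is illustrative.

```sh
# Pull the key metrics for the latest run and fail loudly if completion was not reached.
m=.eval/logs/tdd-red-green/latest/metrics.json
jq '{iterations, hats_activated, events_published, completed}' "$m"
jq -e '.completed == true' "$m" > /dev/null || echo "completion promise NOT reached"
```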
Each hat should execute in its own iteration:
Iter 1: Ralph → publishes starting event → STOPS
Iter 2: Hat A → does work → publishes next event → STOPS
Iter 3: Hat B → does work → publishes next event → STOPS
Iter 4: Hat C → does work → LOOP_COMPLETE
BAD: Multiple hat personas in one iteration:
Iter 2: Ralph does Blue Team + Red Team + Fixer work
^^^ All in one bloated context!
1. Count iterations vs events in session.jsonl:
# Count iterations
grep -c "_meta.loop_start\|ITERATION" .eval/logs/<preset>/latest/output.log
# Count events published
grep -c "bus.publish" .eval/logs/<preset>/latest/session.jsonl
Expected: iterations ≈ events published (one event per iteration).
Bad sign: 2-3 iterations but 5+ events (all work in a single iteration).
2. Check for same-iteration hat switching in output.log:
grep -E "ITERATION|Now I need to perform|Let me put on|I'll switch to" \
.eval/logs/<preset>/latest/output.log
Red flag: Hat-switching phrases WITHOUT an ITERATION separator between them.
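As a rough heuristic, comparing the count of switching phrases to the count of ITERATION separators can flag this automatically. A sketch only; the phrase list is just the one used above and will miss other wordings.

```sh
# If hat-switching phrases outnumber ITERATION separators, some switches
# likely happened inside a single iteration. Phrase list is not exhaustive.
log=.eval/logs/tdd-red-green/latest/output.log
switches=$(grep -cE "Now I need to perform|Let me put on|I'll switch to" "$log")
iters=$(grep -c "ITERATION" "$log")
if [ "$switches" -gt "$iters" ]; then
  echo "possible same-iteration hat switching ($switches switches vs $iters iterations)"
fi
```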
3. Check event timestamps in session.jsonl:
cat .eval/logs/<preset>/latest/session.jsonl | jq -r '.ts'
Red flag: Multiple events with identical timestamps (published in same iteration).
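Duplicate timestamps can be surfaced directly (this assumes every event line carries a .ts field, as in the command above):

```sh
# Print any timestamp that appears more than once; any output here is a red flag.
jq -r '.ts' .eval/logs/tdd-red-green/latest/session.jsonl | sort | uniq -d
```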
| Pattern | Diagnosis | Action |
|---|---|---|
| iterations ≈ events | ✅ Good | Hat routing working |
| iterations << events | ⚠️ Same-iteration switching | Check prompt has STOP instruction |
| iterations >> events | ⚠️ Recovery loops | Agent not publishing required events |
| 0 events | ❌ Broken | Events not being read from JSONL |
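The table can be applied mechanically. Below is a sketch that reuses the grep patterns from check 1; the 2x thresholds standing in for "<<" and ">>" are arbitrary.

```sh
# Mechanical version of the diagnosis table. Thresholds are illustrative.
dir=.eval/logs/tdd-red-green/latest
iters=$(grep -cE "_meta.loop_start|ITERATION" "$dir/output.log")
events=$(grep -c "bus.publish" "$dir/session.jsonl")

if [ "$events" -eq 0 ]; then
  echo "broken: no events read from JSONL"
elif [ "$events" -gt $((iters * 2)) ]; then
  echo "same-iteration switching: check the prompt has a STOP instruction"
elif [ "$iters" -gt $((events * 2)) ]; then
  echo "recovery loops: agent not publishing required events"
else
  echo "ok: iterations ≈ events (hat routing working)"
fi
```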
If hat routing is broken:

1. Check the workflow prompt in hatless_ralph.rs.
2. Check hat instructions propagation:
   - Does HatInfo include the instructions field?
   - Does the prompt include a ## HATS section?
3. Check the events context:
   - Is build_prompt(context) actually using the context parameter?
   - Does the prompt include a ## PENDING EVENTS section?

After evaluation, delegate fixes to subagents:
Read .eval/results/latest/SUMMARY.md and identify:
- ❌ FAIL → Create code tasks for fixes
- ⏱️ TIMEOUT → Investigate infinite loops
- ⚠️ PARTIAL → Check for edge cases

For each issue, spawn a Task agent:
"Use /code-task-generator to create a task for fixing: [issue from evaluation]
Output to: tasks/preset-fixes/"
For each created task:
"Use /code-assist to implement: tasks/preset-fixes/[task-file].code-task.md
Mode: auto"
Then re-run the evaluation to verify:

./tools/evaluate-preset.sh <fixed-preset> claude
yq (brew install yq) is used to load test task definitions when available.

Resources:
- tools/evaluate-preset.sh: Single preset evaluation
- tools/evaluate-all-presets.sh: Full suite evaluation
- tools/preset-test-tasks.yml: Test task definitions
- tools/preset-evaluation-findings.md: Manual findings doc
- presets/: The preset collection being evaluated
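If yq is installed, the test task definitions can be inspected directly. The per-preset key shown below is a guess at the file's layout, not something confirmed by this skill.

```sh
# Dump all test task definitions, then (guessing the layout) one preset's task.
yq '.' tools/preset-test-tasks.yml
yq '.["tdd-red-green"]' tools/preset-test-tasks.yml   # key layout is an assumption
```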