evaluate-presets
| name | evaluate-presets |
| description | Use when testing Ralph's hat collection presets, validating preset configurations, or auditing the preset library for bugs and UX issues. |
| metadata | {"internal":true} |
Systematically test all hat collection presets using shell scripts. Direct CLI invocation—no meta-orchestration complexity.
Evaluate a single preset:
./tools/evaluate-preset.sh tdd-red-green claude
Evaluate all presets:
./tools/evaluate-all-presets.sh claude
Arguments:
- Preset name (without the `.yml` extension)
- Backend (`claude` or `kiro`, defaults to `claude`)
IMPORTANT: When invoking these scripts via the Bash tool, use these settings:
- `timeout: 600000` (10 minutes max) and `run_in_background: true`, for both the single-preset and the full-suite script
Since preset evaluations can run for hours (especially the full suite), always run in background mode and use the TaskOutput tool to check progress periodically.
Example invocation pattern:
Bash tool with:
command: "./tools/evaluate-preset.sh tdd-red-green claude"
timeout: 600000
run_in_background: true
After launching, use TaskOutput with block: false to check status without waiting for completion.
evaluate-preset.sh
- Reads the test task from `tools/preset-test-tasks.yml` (if `yq` is available)
- Runs with `--record-session` for metrics capture
Output structure:
.eval/
├── logs/<preset>/<timestamp>/
│ ├── output.log # Full stdout/stderr
│ ├── session.jsonl # Recorded session
│ ├── metrics.json # Extracted metrics
│ ├── environment.json # Runtime environment
│ └── merged-config.yml # Config used
└── logs/<preset>/latest -> <timestamp>
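As a quick sanity check on the layout above, a run directory can be tested for the five listed files. This is a minimal sketch: `missing_run_files` is a hypothetical helper name, and it assumes every run writes all five files, which the layout suggests but does not guarantee.

```shell
#!/bin/sh
# Sketch: list the expected files that are missing from one evaluation log directory.
# The file names come from the output structure above; the function name is hypothetical.
missing_run_files() (
  run_dir="$1"
  for name in output.log session.jsonl metrics.json environment.json merged-config.yml; do
    # Print only the names that are absent.
    [ -e "$run_dir/$name" ] || echo "$name"
  done
)
```

For example, `missing_run_files .eval/logs/tdd-red-green/latest` prints nothing when the run directory is complete.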
evaluate-all-presets.sh
Runs all 12 presets sequentially and generates a summary:
.eval/results/<suite-id>/
├── SUMMARY.md # Markdown report
├── <preset>.json # Per-preset metrics
└── latest -> <suite-id>
| Preset | Test Task |
|---|---|
| tdd-red-green | Add is_palindrome() function |
| adversarial-review | Review user input handler for security |
| socratic-learning | Understand HatRegistry |
| spec-driven | Specify and implement StringUtils::truncate() |
| mob-programming | Implement a Stack data structure |
| scientific-method | Debug failing mock test assertion |
| code-archaeology | Understand history of config.rs |
| performance-optimization | Profile hat matching |
| api-design | Design a Cache trait |
| documentation-first | Document RateLimiter |
| incident-response | Respond to "tests failing in CI" |
| migration-safety | Plan v1 to v2 config migration |
Exit codes from evaluate-preset.sh:
- `0`: Success (LOOP_COMPLETE reached)
- `124`: Timeout (preset hung or took too long)
- Any other code: failure (see `output.log`)
Metrics in metrics.json:
- `iterations`: How many event loop cycles
- `hats_activated`: Which hats were triggered
- `events_published`: Total events emitted
- `completed`: Whether completion promise was reached
Critical: Validate that hats get fresh context per Tenet #1 ("Fresh Context Is Reliability").
Each hat should execute in its own iteration:
Iter 1: Ralph → publishes starting event → STOPS
Iter 2: Hat A → does work → publishes next event → STOPS
Iter 3: Hat B → does work → publishes next event → STOPS
Iter 4: Hat C → does work → LOOP_COMPLETE
BAD: Multiple hat personas in one iteration:
Iter 2: Ralph does Blue Team + Red Team + Fixer work
^^^ All in one bloated context!
1. Count iterations (in output.log) vs events (in session.jsonl):
# Count iterations
grep -c "_meta.loop_start\|ITERATION" .eval/logs/<preset>/latest/output.log
# Count events published
grep -c "bus.publish" .eval/logs/<preset>/latest/session.jsonl
Expected: iterations ≈ events published (one event per iteration).
Bad sign: 2-3 iterations but 5+ events (all work in a single iteration).
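The two counts can be gathered in one step so they are easy to compare side by side. The sketch below reuses the grep patterns from the commands above; `count_iterations_and_events` is a hypothetical name, and it assumes both files sit in the same log directory, as in the output structure.

```shell
#!/bin/sh
# Sketch: print iteration and event counts for one evaluation log directory.
# Patterns mirror the grep commands above; the function name is hypothetical.
count_iterations_and_events() (
  log_dir="$1"
  # grep -c exits non-zero when nothing matches, so the status is ignored.
  iterations=$(grep -cE '_meta\.loop_start|ITERATION' "$log_dir/output.log" || true)
  events=$(grep -c 'bus\.publish' "$log_dir/session.jsonl" || true)
  echo "iterations=$iterations events=$events"
)
```

Calling it with `.eval/logs/<preset>/latest` (with the preset name filled in) gives a single line to read against the expectation above.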
2. Check for same-iteration hat switching in output.log:
grep -E "ITERATION|Now I need to perform|Let me put on|I'll switch to" \
.eval/logs/<preset>/latest/output.log
Red flag: Hat-switching phrases WITHOUT an ITERATION separator between them.
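Spotting a missing separator by eye gets tedious on long logs. One possible way to automate the check is the awk sketch below, which flags a second hat-switching phrase that appears before the next ITERATION line. The function name is hypothetical and the phrase list is just the one from the grep above, so it may miss other wordings.

```shell
#!/bin/sh
# Sketch: print hat-switching lines that follow an earlier switch in the same iteration.
# The function name is hypothetical; the phrases are the ones grepped for above.
find_same_iteration_switches() {
  awk '
    # A new iteration resets the switch counter.
    /ITERATION/ { switches = 0; next }
    # "I.ll" uses a regex dot in place of the apostrophe to keep shell quoting simple.
    /Now I need to perform|Let me put on|I.ll switch to/ {
      switches++
      if (switches > 1) print FNR ": " $0
    }
  ' "$1"
}
```

Any output means at least two switching phrases shared one iteration; the leading number is the line in `output.log`.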
3. Check event timestamps in session.jsonl:
jq -r '.ts' .eval/logs/<preset>/latest/session.jsonl
Red flag: Multiple events with identical timestamps (published in same iteration).
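Rather than scanning the timestamp list by eye, repeated values can be filtered out directly. This is a small sketch that reads one timestamp per line on standard input; `duplicate_timestamps` is a hypothetical name, and it assumes identical timestamps are written as identical strings.

```shell
#!/bin/sh
# Sketch: print each timestamp that occurs more than once on stdin.
# The function name is hypothetical.
duplicate_timestamps() {
  # uniq -d only compares adjacent lines, so the input is sorted first.
  sort | uniq -d
}
```

It can be fed from the jq command above, for example `jq -r '.ts' session.jsonl | duplicate_timestamps`; empty output means no two events share a timestamp.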
| Pattern | Diagnosis | Action |
|---|---|---|
| iterations ≈ events | ✅ Good | Hat routing working |
| iterations << events | ⚠️ Same-iteration switching | Check prompt has STOP instruction |
| iterations >> events | ⚠️ Recovery loops | Agent not publishing required events |
| 0 events | ❌ Broken | Events not being read from JSONL |
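The table can be turned into a rough classifier once both counts are known. The sketch below is illustrative only: the source does not define how far apart the counts must be for "<<" or ">>", so the 1.5x threshold is an assumption, chosen so that the "2-3 iterations but 5+ events" example counts as same-iteration switching. `diagnose_routing` is a hypothetical name.

```shell
#!/bin/sh
# Sketch: map (iterations, events) to a diagnosis from the table above.
# ASSUMPTION: "<<" and ">>" are read as a ratio of at least 1.5; the source gives no threshold.
diagnose_routing() (
  iterations="$1"
  events="$2"
  if [ "$events" -eq 0 ]; then
    echo "broken: events not being read from JSONL"
  elif [ $((events * 2)) -ge $((iterations * 3)) ]; then
    echo "same-iteration switching: check prompt has STOP instruction"
  elif [ $((iterations * 2)) -ge $((events * 3)) ]; then
    echo "recovery loops: agent not publishing required events"
  else
    echo "good: hat routing working"
  fi
)
```

For example, `diagnose_routing 3 5` reports same-iteration switching, while `diagnose_routing 4 4` reports good routing.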
If hat routing is broken:
- Check workflow prompt in `hatless_ralph.rs`:
  - Does it contain the STOP instruction?
- Check hat instructions propagation:
  - Does `HatInfo` include an `instructions` field?
  - Does the prompt contain a `## HATS` section?
- Check events context:
  - Is `build_prompt(context)` using the `context` parameter?
  - Does the prompt contain a `## PENDING EVENTS` section?
After evaluation, delegate fixes to subagents:
Read .eval/results/latest/SUMMARY.md and identify:
- ❌ FAIL → Create code tasks for fixes
- ⏱️ TIMEOUT → Investigate infinite loops
- ⚠️ PARTIAL → Check for edge cases
For each issue, spawn a Task agent:
"Use /code-task-generator to create a task for fixing: [issue from evaluation]
Output to: tasks/preset-fixes/"
For each created task:
"Use /code-assist to implement: tasks/preset-fixes/[task-file].code-task.md
Mode: auto"
Re-run the evaluation for each fixed preset:
./tools/evaluate-preset.sh <fixed-preset> claude
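When several presets were fixed, the re-run can be looped and each result labelled by exit code. This is a sketch under stated assumptions: `rerun_presets` and the `EVALUATE_CMD` override are hypothetical (the override exists only to make the helper testable), and any code other than 0 or 124 is treated as a failure.

```shell
#!/bin/sh
# Sketch: re-run a list of presets and label each by the evaluator's exit code.
# ASSUMPTIONS: rerun_presets and EVALUATE_CMD are hypothetical; EVALUATE_CMD defaults
# to the script used above. Codes other than 0 and 124 are treated as failures.
rerun_presets() (
  backend="$1"
  shift
  for preset in "$@"; do
    status=0
    "${EVALUATE_CMD:-./tools/evaluate-preset.sh}" "$preset" "$backend" >/dev/null 2>&1 || status=$?
    case "$status" in
      0)   echo "$preset: success" ;;
      124) echo "$preset: timeout" ;;
      *)   echo "$preset: failure (see output.log)" ;;
    esac
  done
)
```

A call such as `rerun_presets claude tdd-red-green spec-driven` runs the presets one after another, so it can take a long time and is best launched in background mode.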
- yq (optional, used to read the test task definitions): `brew install yq`
Related files:
- `tools/evaluate-preset.sh`: Single preset evaluation
- `tools/evaluate-all-presets.sh`: Full suite evaluation
- `tools/preset-test-tasks.yml`: Test task definitions
- `tools/preset-evaluation-findings.md`: Manual findings doc
- `presets/`: The preset collection being evaluated