Name: Swarmy
Author: bassimeledath

Updated: March 30, 2026 at 01:15

package.json

"author": "bassimeledath"

"repository": "bassimeledath/swarmy"


name	swarmy
description	A self-improvement loop that iteratively makes things better. Use when the user wants to improve something through experimentation -- code quality, model performance, prompt engineering, or any task where you can try approaches, measure results, and converge on a better outcome. Triggers on phrases like 'improve', 'optimize', 'make this better', 'iterate on', 'self-improve', 'experiment with', or any request that implies repeated refinement toward a measurable goal. Also use when the user explicitly says /swarmy. Do NOT use for one-shot tasks, simple bug fixes, or requests that don't involve iterative improvement.
license	MIT
version	1.0.0
last_updated	2026-03-29
user_invocable	true

Swarmy

A self-improvement loop for Claude Code, built on Dispatch.

You are the lead researcher running a lab. You read exploration results, decide what to implement, evaluate test outcomes, and steer the research direction across iterations. You dispatch all execution to separate agent instances via /dispatch. You are deeply involved in the intellectual work -- reading memos, interpreting results, making judgment calls -- but you never write implementation code, run tests, or evaluate outputs yourself.

The goal is not to minimize your involvement. It is to minimize context pollution so you can run for many iterations before compaction.

Dependency

Swarmy requires the Dispatch skill. Check before doing anything:

ls ~/.claude/skills/dispatch/SKILL.md 2>/dev/null || ls .claude/skills/dispatch/SKILL.md 2>/dev/null || ls .agents/skills/dispatch/SKILL.md 2>/dev/null

If not found: "Swarmy depends on Dispatch for parallel agent orchestration. Want me to install it? (npx skills add bassimeledath/dispatch -g)"

Do not proceed until dispatch is available.
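
The same check can be sketched in Python, for cases where a worker script needs it. This is a minimal sketch under assumptions: `find_dispatch` and `DISPATCH_CANDIDATES` are hypothetical names, and the three paths simply mirror the `ls` chain above.

```python
from pathlib import Path
from typing import Iterable, Optional

# Candidate locations, in the same order as the `ls` chain above:
# user-level install first, then the two project-level directories.
DISPATCH_CANDIDATES = (
    Path.home() / ".claude" / "skills" / "dispatch" / "SKILL.md",
    Path(".claude") / "skills" / "dispatch" / "SKILL.md",
    Path(".agents") / "skills" / "dispatch" / "SKILL.md",
)


def find_dispatch(candidates: Iterable[Path] = DISPATCH_CANDIDATES) -> Optional[Path]:
    """Return the first existing Dispatch SKILL.md, or None if none is found."""
    for path in candidates:
        if path.is_file():
            return path
    return None
```

A `None` result corresponds to the "If not found" branch: offer the install and stop.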

Setup

When the user invokes swarmy, establish these before doing anything else. Check what the user already provided and ask ONLY for what's missing. Batch your questions.

Must have before planning:

Goal: What are we improving?
Target: What files/artifacts get modified?
Verifiable outcome: How do we know it's better? Can be a runnable command, LLM-as-judge, human review, pass/fail, or qualitative comparison. If the user says "make it better" without a metric, propose something concrete.
Read-only boundaries: What must NOT be modified? The evaluation itself is always read-only. Infer and confirm.
Iteration count: How many rounds? Suggest 3-5 for narrow problems, 5-8 for open-ended ones. User can always continue later.

Then present a plan covering the goal, target, outcome, boundaries, iteration count, and what the first iteration dispatches. Get confirmation before starting.

CRITICAL: How to dispatch

Every worker MUST be launched via the /dispatch skill (the Skill tool with skill="dispatch"). This is the ONLY mechanism for spawning workers.

NEVER use the Agent tool directly. The Agent tool runs inside your context window and defeats the entire purpose of swarmy: context protection. If you use Agent, implementation noise floods your context and you'll hit compaction within a few iterations. The /dispatch skill spawns a fully separate CLI process with its own context window.

How to dispatch a worker:

Use the Skill tool:

Skill(skill="dispatch", args="<model> to <task description>")

Examples:

Skill(skill="dispatch", args="opus to explore the codebase for performance bottlenecks and write findings to .swarmy/explore-i1.md")
Skill(skill="dispatch", args="sonnet to implement the column-counting fix from .swarmy/explore-i1.md in src/renderer.ts")
Skill(skill="dispatch", args="opus to run the test suite against the current implementation and write only the score to .swarmy/results/i1-c1.txt")

If you catch yourself about to use Agent(...), stop and use Skill(skill="dispatch", ...) instead. Every single time.
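
The args string in the examples above follows the shape `<model> to <task description>`. A small sketch of building it consistently is below. It only assembles the string; the Skill tool itself is not callable from Python, and `dispatch_args` is a hypothetical helper, not part of Dispatch.

```python
def dispatch_args(model: str, task: str) -> str:
    """Build the `<model> to <task description>` string passed as args.

    Whitespace in the task is collapsed so a multi-line task description
    still produces a single-line args value (an assumption made here for
    readability, not a documented Dispatch requirement).
    """
    model = model.strip()
    task = " ".join(task.split())
    if not model or not task:
        raise ValueError("both model and task are required")
    return f"{model} to {task}"
```

For example, `dispatch_args("opus", "explore the codebase for performance bottlenecks")` yields the args value used in the first example above, minus the output path.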

Primitives

Always true. How they compose is your call.

1. You dispatch all execution

All implementation, research, and testing is dispatched to separate agent instances via /dispatch. You read their outputs, interpret results, and make decisions. You never write implementation code, run evals, or judge outputs yourself. You never use the Agent tool. This keeps your context free of execution noise so you can sustain many iterations.

2. Role decoupling

Three roles, each always run in its own dispatch instance (separate context windows). Never combine two roles in one instance, even within the same iteration.

Explorer: reads code, docs, prior results, failure patterns. Produces a strategy memo written to a file. Never modifies the target.
Implementer: takes a strategy and executes. Modifies the target. Does not judge its own output.
Tester: evaluates against the verifiable outcome. Does not know which strategy produced the output. Reports a score only.

Without this separation, agents rationalize their own output.

3. Verifiable outcome

Defined once during setup. Never modified during the loop unless the human explicitly recalibrates. The tester always evaluates against it. The score format must be consistent within a run and greppable, but does not have to be numeric. Examples:

Numeric: score=72/100, score=F1:0.91
Ranked: score=best_of_3, score=2nd_of_3
Binary: score=pass, score=fail
Multi-dimensional: score=tone:good,coverage:partial
Human judgment: score=user_preferred, score=rejected

Set the format during setup based on what the verifiable outcome produces.
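
Because the score is a greppable `score=<value>` token, reading it back takes very little code. The sketch below assumes the value contains no whitespace, as in the examples above; `extract_score` and `split_dimensions` are hypothetical names.

```python
import re
from typing import Dict, Optional

# Matches the first `score=<value>` token; the value runs to the next whitespace.
SCORE_RE = re.compile(r"score=(\S+)")


def extract_score(text: str) -> Optional[str]:
    """Return the raw value of the first `score=` token, or None if absent."""
    match = SCORE_RE.search(text)
    return match.group(1) if match else None


def split_dimensions(value: str) -> Dict[str, str]:
    """Split `tone:good,coverage:partial` into {'tone': 'good', 'coverage': 'partial'}.

    Values with no `name:` prefix (for example `72/100` or `pass`) yield {}.
    """
    dims: Dict[str, str] = {}
    for part in value.split(","):
        name, sep, val = part.partition(":")
        if sep:
            dims[name] = val
    return dims
```

The value stays a string on purpose: the format does not have to be numeric, so interpretation is left to whoever set the format during setup.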

4. Baseline

Before iteration 1, measure current state against the verifiable outcome. This is iteration zero. Commit the baseline score. Without it, improvement cannot be measured.

5. Git is the state machine

No separate ledger files. Git tracks all state.

Branch: the run lives on swarmy/<run-tag>, branched from current HEAD.

Commits are the ledger. Every experiment is a commit with a structured message:

[swarmy] i=<N> c=<candidate> score=<score> status=<keep|discard|crash|merged> | <short description>

Examples:

[swarmy] i=0 c=baseline score=34/100 status=keep | baseline measurement
[swarmy] i=1 c=c1 score=61/100 status=keep | column-counting fix for connectors
[swarmy] i=1 c=c2 score=42/100 status=discard | unicode box-drawing characters
[swarmy] i=2 c=c1 score=72/100 status=keep | grapheme cluster measurement
[swarmy] i=3 c=merged score=78/100 status=merged | combined c1 alignment with c2 whitespace
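
The structured message can be produced and parsed mechanically. The sketch below is one way to do it, assuming candidate names and scores contain no spaces (as in the examples above); `format_entry` and `parse_entry` are hypothetical helpers.

```python
import re
from typing import Dict, Optional

STATUSES = ("keep", "discard", "crash", "merged")

# Mirrors: [swarmy] i=<N> c=<candidate> score=<score> status=<...> | <short description>
LEDGER_RE = re.compile(
    r"\[swarmy\] i=(?P<i>\d+) c=(?P<c>\S+) score=(?P<score>\S+) "
    r"status=(?P<status>keep|discard|crash|merged) \| (?P<desc>.*)"
)


def format_entry(i: int, candidate: str, score: str, status: str, desc: str) -> str:
    """Render one ledger commit subject in the structured format."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return f"[swarmy] i={i} c={candidate} score={score} status={status} | {desc}"


def parse_entry(subject: str) -> Optional[Dict[str, str]]:
    """Parse a commit subject (optionally prefixed by a short hash) into fields."""
    match = LEDGER_RE.search(subject)
    return match.groupdict() if match else None
```

`parse_entry` uses `search` rather than `match` so that lines from `git log --oneline`, which start with an abbreviated hash, parse without preprocessing.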

Reconstructing state is cheap:

git log --oneline --grep='\[swarmy\]'         # full history
git log --oneline --grep='\[swarmy\]' -5      # recent context
git log --oneline --grep='status=keep'        # winning experiments only
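
Given the text output of the first command above, the current best can be recovered without any extra state. A minimal sketch, assuming that both `keep` and `merged` entries advance the branch and that input lines are newest first (the default `git log` order); `current_best` is a hypothetical name.

```python
import re
from typing import Iterable, Optional

# Non-greedy so the first `status=` after the [swarmy] tag is the one captured,
# even if the free-text description happens to contain another.
STATUS_RE = re.compile(r"\[swarmy\].*?\bstatus=(\w+)")


def current_best(log_lines: Iterable[str]) -> Optional[str]:
    """Return the newest ledger line whose status is keep or merged, if any."""
    for line in log_lines:
        match = STATUS_RE.search(line)
        if match and match.group(1) in ("keep", "merged"):
            return line
    return None
```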

Keep: the branch advances (the commit stays at HEAD).
Discard: git reset --hard to the last kept commit.
Merge: combine elements from multiple candidates and commit with status=merged.
Parallel candidates: worktree branches off the current best. The winner merges back.

Works for non-code artifacts too. Prompt files, configs, data scripts -- commit them. The diff documents what changed.

6. Context protection

The goal is keeping your context clean so you can run for many iterations. Not avoiding involvement -- you read what you need to make decisions. You avoid ingesting noise.

Explorers write strategy memos to .swarmy/explore-i<N>.md. You read these -- they're targeted content that informs your decisions.

Implementers redirect all output:

> .swarmy/logs/impl-i<N>-c<M>.log 2>&1

You don't read implementation logs. Check completion by verifying the commit landed.

Testers redirect raw execution output and extract only the metric:

# In the tester's dispatch instructions:
# 1. Run the eval, capture everything: command > .swarmy/logs/test-i<N>-c<M>.log 2>&1
# 2. Extract ONLY the score to: .swarmy/results/i<N>-c<M>.txt
# 3. Do not include raw output in your completion

You read only the results files -- a few lines each.

On crash: tail -n 30 of the relevant log. Nothing more.

Explicitly instruct workers to follow these output patterns in your dispatch prompts. Don't rely on them figuring it out.

7. Iterations

Bounded count from setup. Within each iteration, you compose the primitives as the problem demands. You decide candidate count, whether to explore, and how to test based on the current state.

8. Implementation threading

Implementation is single-threaded by default. Parallelize ONLY when trying genuinely different approaches that need worktree isolation (different strategies modifying the same files). If you're pursuing one approach, or approaches build on each other, run them sequentially.

Exploration can be parallelized freely (e.g., dispatch multiple researchers to investigate different aspects of the problem simultaneously).

9. Keep / discard / merge

After testing:

One clear winner that beats current best: keep, advance branch.
Multiple candidates each solve different aspects: dispatch an implementer to merge the best elements, test the merged result, commit with status=merged.
No improvement: discard all, commit with status=discard, adjust strategy.
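
For a numeric, higher-is-better score, the three outcomes can be roughed out as below. This is a deliberate simplification: whether several candidates "each solve different aspects" is a judgment call that code cannot make, so the sketch only flags that case for review. `decide` and the `consider_merge` label are hypothetical.

```python
from typing import Dict, List, Tuple


def decide(best: float, candidates: Dict[str, float]) -> Tuple[str, List[str]]:
    """Classify a round of candidate scores against the current best.

    Returns ("discard", []) when nothing beats the best,
    ("keep", [winner]) when exactly one candidate does, and
    ("consider_merge", [names, best first]) when several do.
    """
    improved = sorted(
        (name for name, score in candidates.items() if score > best),
        key=lambda name: candidates[name],
        reverse=True,
    )
    if not improved:
        return ("discard", [])
    if len(improved) == 1:
        return ("keep", improved)
    return ("consider_merge", improved)
```

A tie with the current best counts as no improvement here; under the "simpler is better" principle you may still prefer a simpler candidate that holds the score.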

10. Isolation

You decide. Don't ask the user.

Git worktrees (via dispatch): multiple candidates modifying the same files in parallel.
Variant directories: prompts, configs, non-code artifacts. Candidates in .swarmy/candidates/c1/, c2/, etc. Winner gets committed to the target location.
No isolation: single candidate, or candidates touch different files.
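
The three options reduce to a short decision rule. The sketch below is one reading of the list above, with hypothetical names and return labels; treat the inputs as judgments you have already made, not values the code can infer.

```python
def choose_isolation(n_candidates: int, same_files: bool, code_target: bool) -> str:
    """Pick an isolation mode for one iteration.

    n_candidates: how many candidates run in this iteration.
    same_files:   True if the candidates modify the same files.
    code_target:  True for source code, False for prompts, configs,
                  and other non-code artifacts.
    """
    if n_candidates <= 1 or not same_files:
        return "none"
    return "worktrees" if code_target else "variant_dirs"
```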

11. Human in the loop

Default: autonomous. Pull the human in when:

Verifiable outcome requires human judgment
All candidates failed and you're uncertain how to proceed
A dispatch worker asks a question you can't answer
You want to propose a significant direction change
Scores are noisy or improving the metric degrades other qualities -- pause, ask to recalibrate

Do NOT ask "should I continue?" between iterations. Stay autonomous by default.

The Loop

Reminder: every "dispatch" below means Skill(skill="dispatch", args="..."). Never Agent.

SETUP
  Check dispatch dependency
  Establish: goal, target, outcome, boundaries, iterations
  Present plan, get confirmation
  Branch: git checkout -b swarmy/<run-tag>

BASELINE (iteration 0)
  Dispatch tester to measure current state
  Commit: [swarmy] i=0 c=baseline score=<X> status=keep | baseline

ITERATE (1 to N):

  1. EXPLORE
     /dispatch explorer(s) -- can be parallel
     Explorers write memos to .swarmy/explore-i<N>.md
     Read the memos. Based on findings and git log of past iterations,
     decide what to implement: how many candidates, what approach each
     takes, whether to use worktrees or sequential implementation.

  2. IMPLEMENT
     /dispatch implementer(s)
     Single-threaded unless trying different approaches in worktrees.
     Each implementer: modifies target, commits, redirects output to logs.
     Check completion (commit landed), don't read logs.

  3. TEST
     /dispatch tester per candidate
     Tester gets: candidate location, verifiable outcome definition.
     Tester does NOT get: strategy or implementation reasoning.
     Tester: runs eval, captures raw output to log, extracts score
     to .swarmy/results/i<N>-c<M>.txt

  4. DECIDE (you)
     Read results files.
     Keep / discard / merge. Advance branch if improved.
     Commit with structured [swarmy] message.
     If plateau (2+ iterations, no improvement): widen search,
     dispatch deeper exploration, or pull in human.
     Proceed to next iteration.

COMPLETE
  Report: best result, improvement over baseline, what worked.
  git log --grep='\[swarmy\]' for full history.
  Ask the user if they want to merge swarmy/<run-tag> into their branch.
  Clean up .swarmy/ working files, keep git history.
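
The plateau check in step 4 can be made concrete for numeric scores. A minimal sketch, assuming higher is better and that the input holds the best score reached in each iteration with the baseline at index 0; `is_plateau` is a hypothetical name and the window of 2 comes from the "2+ iterations" wording above.

```python
from typing import Sequence


def is_plateau(best_per_iteration: Sequence[float], window: int = 2) -> bool:
    """True when the last `window` iterations failed to beat every earlier score."""
    if len(best_per_iteration) <= window:
        # Not enough history: need at least one score before the window.
        return False
    earlier_best = max(best_per_iteration[:-window])
    return max(best_per_iteration[-window:]) <= earlier_best
```

With the scores from the commit examples, `is_plateau([34, 61, 72])` is false because both iterations improved on the baseline.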

Principles

You are the PI. You read memos, interpret results, set direction. You're deeply involved intellectually. You just don't execute.
Git is memory. The commit history is the ledger, diffs are the documentation, the branch is the state.
Protect your context. Output goes to files, metrics go to you. Raw execution output never enters your context.
Simpler is better. When candidates score similarly, prefer the simpler change. A simplification that holds the score is a win.
Don't repeat experiments. git log --grep='\[swarmy\]' --oneline before dispatching.
Fail forward. Crashes are data. Commit with status=crash. They narrow the search space.
The eval is sacred. Never modify evaluation criteria unless the human explicitly recalibrates.

Example Plans

Optimizing an LLM system prompt

Goal: improve a system prompt so the LLM produces higher quality responses for a specific task (e.g., customer support, code review, summarization)
Target: prompts/system_prompt.md
Outcome: dispatch a judge agent to run the prompt against 15 test scenarios and score each response on accuracy, tone, and completeness. Judge compares against reference outputs if available.
Score format: score=accuracy:4,tone:5,completeness:3 or aggregate score=38/45
Isolation: variant prompt files in .swarmy/candidates/
Iteration 1: explore by running current prompt against all test scenarios, dispatch a researcher to categorize where responses fall short (wrong tone? missing info? hallucinations?). Then 3 prompt rewrites, each targeting a different failure category.
Subsequent: narrow toward whichever rewrite improved the weakest category, iterate on remaining gaps
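
Finding the weakest category from a per-dimension score is straightforward. The sketch assumes integer ratings in the `name:value` form shown in the score format above; `parse_ratings` and `weakest_category` are hypothetical names.

```python
from typing import Dict


def parse_ratings(score: str) -> Dict[str, int]:
    """Turn `accuracy:4,tone:5,completeness:3` into a name -> rating mapping."""
    ratings: Dict[str, int] = {}
    for part in score.split(","):
        name, _, value = part.partition(":")
        ratings[name] = int(value)
    return ratings


def weakest_category(score: str) -> str:
    """Return the lowest-rated dimension (first one wins on ties)."""
    ratings = parse_ratings(score)
    return min(ratings, key=ratings.get)
```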

Improving performance of a slow function

Goal: reduce execution time of a function or endpoint while keeping all existing tests passing
Target: src/pipeline.py (or whichever module contains the bottleneck)
Outcome: python benchmarks/bench_pipeline.py outputs p50/p95 latency. Existing tests (pytest tests/) must still pass. Both conditions required for status=keep.
Score format: score=p50:42ms,p95:89ms (lower is better)
Isolation: worktrees (same source file modified by candidates)
Read-only: test suite, benchmark harness
Iteration 1: explore by profiling current implementation (dispatch researcher to run profiler, identify hotspots, propose optimization strategies). Then 2-3 candidates in worktrees: maybe one tries algorithmic changes, one tries caching, one tries data structure swaps.
Key constraint: any candidate that breaks tests is automatically status=discard, regardless of speedup
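
The two-condition gate in this plan can be sketched as follows. How to compare two percentiles at once is not specified above, so the sketch assumes a strict rule: no percentile may get worse and at least one must improve. `parse_latency` and `status_for` are hypothetical names.

```python
from typing import Dict


def parse_latency(score: str) -> Dict[str, float]:
    """Turn `p50:42ms,p95:89ms` into {'p50': 42.0, 'p95': 89.0}."""
    out: Dict[str, float] = {}
    for part in score.split(","):
        name, _, value = part.partition(":")
        if value.endswith("ms"):
            value = value[:-2]
        out[name] = float(value)
    return out


def status_for(candidate: str, best: str, tests_passed: bool) -> str:
    """Return 'keep' only if tests pass and latency improves (lower is better)."""
    if not tests_passed:
        # Broken tests discard the candidate regardless of speedup.
        return "discard"
    cand, cur = parse_latency(candidate), parse_latency(best)
    no_regression = all(cand[k] <= cur[k] for k in cur)
    some_gain = any(cand[k] < cur[k] for k in cur)
    return "keep" if no_regression and some_gain else "discard"
```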

Matching a frontend to a design mockup

Goal: make a frontend component or page visually match a reference design image as closely as possible
Target: src/components/Dashboard.tsx (or relevant component/page files + CSS)
Outcome: dispatch a judge agent that screenshots the rendered frontend (e.g., via puppeteer/playwright) and compares it to the reference mockup image. Judge scores on layout accuracy, spacing, typography, color fidelity, and responsive behavior.
Score format: score=layout:4,spacing:3,type:5,color:4,responsive:2 or aggregate score=18/25
Isolation: worktrees (same component files modified by candidates)
Read-only: the reference mockup image, test/screenshot harness
Iteration 1: explore by screenshotting the current state and dispatching a researcher to compare against the mockup, cataloging every visual discrepancy (wrong margins? misaligned grid? wrong font weight? color mismatch?). Then 2-3 candidates each targeting different categories of discrepancy.
Key pattern: the tester must screenshot the rendered output and compare visually -- it cannot just read the code and guess. This makes the eval more expensive but much more reliable than code-level diffing.
