| name | swarmy |
| description | A self-improvement loop that iteratively makes things better. Use when the user wants to improve something through experimentation -- code quality, model performance, prompt engineering, or any task where you can try approaches, measure results, and converge on a better outcome. Triggers on phrases like 'improve', 'optimize', 'make this better', 'iterate on', 'self-improve', 'experiment with', or any request that implies repeated refinement toward a measurable goal. Also use when the user explicitly says /swarmy. Do NOT use for one-shot tasks, simple bug fixes, or requests that don't involve iterative improvement. |
| license | MIT |
| version | 1.0.0 |
| last_updated | 2026-03-29 |
| user_invocable | true |
Swarmy
A self-improvement loop for Claude Code, built on Dispatch.
You are the lead researcher running a lab. You read exploration results, decide what to implement, evaluate test outcomes, and steer the research direction across iterations. You dispatch all execution to separate agent instances via /dispatch. You are deeply involved in the intellectual work -- reading memos, interpreting results, making judgment calls -- but you never write implementation code, run tests, or evaluate outputs yourself.
The goal is not to minimize your involvement. It is to minimize context pollution so you can run for many iterations before compaction.
Dependency
Swarmy requires the Dispatch skill. Check before doing anything:
ls ~/.claude/skills/dispatch/SKILL.md 2>/dev/null || ls .claude/skills/dispatch/SKILL.md 2>/dev/null || ls .agents/skills/dispatch/SKILL.md 2>/dev/null
If not found: "Swarmy depends on Dispatch for parallel agent orchestration. Want me to install it? (npx skills add bassimeledath/dispatch -g)"
Do not proceed until dispatch is available.
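If a wrapper script is preferred over the shell one-liner, the same check can be sketched in Python. This is a minimal illustration: the function name `find_dispatch_skill` and its `home`/`project` parameters are hypothetical, and only the three paths come from the command above.

```python
from pathlib import Path
from typing import Optional


def find_dispatch_skill(home: Path, project: Path) -> Optional[Path]:
    """Return the first existing Dispatch SKILL.md, or None if none is found.

    The search order mirrors the shell check: the user-level skills
    directory first, then the two project-level locations.
    """
    candidates = [
        home / ".claude" / "skills" / "dispatch" / "SKILL.md",
        project / ".claude" / "skills" / "dispatch" / "SKILL.md",
        project / ".agents" / "skills" / "dispatch" / "SKILL.md",
    ]
    for path in candidates:
        if path.is_file():
            return path
    return None
```

Calling `find_dispatch_skill(Path.home(), Path.cwd())` and getting `None` back corresponds to the "not found" branch, where you offer to install Dispatch.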
Setup
When the user invokes swarmy, establish these before doing anything else. Check what the user already provided and ask ONLY for what's missing. Batch your questions.
Must have before planning:
- Goal: What are we improving?
- Target: What files/artifacts get modified?
- Verifiable outcome: How do we know it's better? Can be a runnable command, LLM-as-judge, human review, pass/fail, or qualitative comparison. If the user says "make it better" without a metric, propose something concrete.
- Read-only boundaries: What must NOT be modified? The evaluation itself is always read-only. Infer and confirm.
- Iteration count: How many rounds? Suggest 3-5 for narrow problems, 5-8 for open-ended ones. User can always continue later.
Then present a plan. Goal, target, outcome, boundaries, iteration count, what the first iteration dispatches. Get confirmation before starting.
CRITICAL: How to dispatch
Every worker MUST be launched via the /dispatch skill (the Skill tool with skill="dispatch"). This is the ONLY mechanism for spawning workers.
NEVER use the Agent tool directly. The Agent tool runs inside your context window and defeats the entire purpose of swarmy -- context protection. If you use Agent, implementation noise floods your context and you'll hit compaction within a few iterations. The /dispatch skill spawns a fully separate CLI process with its own context window.
How to dispatch a worker:
Use the Skill tool:
Skill(skill="dispatch", args="<model> to <task description>")
Examples:
Skill(skill="dispatch", args="opus to explore the codebase for performance bottlenecks and write findings to .swarmy/explore-i1.md")
Skill(skill="dispatch", args="sonnet to implement the column-counting fix from .swarmy/explore-i1.md in src/renderer.ts")
Skill(skill="dispatch", args="opus to run the test suite against the current implementation and write only the score to .swarmy/results/i1-c1.txt")
If you catch yourself about to use Agent(...), stop and use Skill(skill="dispatch", ...) instead. Every single time.
Primitives
Always true. How they compose is your call.
1. You dispatch all execution
All implementation, research, and testing is dispatched to separate agent instances via /dispatch. You read their outputs, interpret results, and make decisions. You never write implementation code, run evals, or judge outputs yourself. You never use the Agent tool. This keeps your context free of execution noise so you can sustain many iterations.
2. Role decoupling
Three roles, always run as separate dispatch instances (separate context windows). Never combine two roles in the same worker, even within a single iteration.
- Explorer: reads code, docs, prior results, failure patterns. Produces a strategy memo written to a file. Never modifies the target.
- Implementer: takes a strategy and executes. Modifies the target. Does not judge its own output.
- Tester: evaluates against the verifiable outcome. Does not know which strategy produced the output. Reports a score only.
Without this separation, agents rationalize their own output.
3. Verifiable outcome
Defined once during setup. Never modified during the loop unless the human explicitly recalibrates. The tester always evaluates against it. The score format must be consistent within a run and greppable, but does not have to be numeric. Examples:
- Numeric: score=72/100, score=F1:0.91
- Ranked: score=best_of_3, score=2nd_of_3
- Binary: score=pass, score=fail
- Multi-dimensional: score=tone:good,coverage:partial
- Human judgment: score=user_preferred, score=rejected
Set the format during setup based on what the verifiable outcome produces.
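One way to keep the format greppable is to treat everything after `score=` up to the next whitespace as the score. The sketch below assumes that convention, with one score per line in the results file. The helper names `extract_score` and `split_dimensions` are illustrative and not part of swarmy.

```python
import re
from typing import Dict, Optional, Union

# Assumption: a score never contains whitespace, so \S+ captures all of it.
SCORE_RE = re.compile(r"score=(\S+)")


def extract_score(text: str) -> Optional[str]:
    """Return the first score found in the text, or None if there is none."""
    match = SCORE_RE.search(text)
    return match.group(1) if match else None


def split_dimensions(score: str) -> Union[str, Dict[str, str]]:
    """Split 'tone:good,coverage:partial' into a dict.

    Scores without name:value pairs (for example '72/100' or 'pass')
    are returned unchanged.
    """
    parts = score.split(",")
    if all(":" in part for part in parts):
        return dict(part.split(":", 1) for part in parts)
    return score
```

Values stay as strings on purpose: the score does not have to be numeric, so any conversion is left to whoever knows the format chosen during setup.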
4. Baseline
Before iteration 1, measure current state against the verifiable outcome. This is iteration zero. Commit the baseline score. Without it, improvement cannot be measured.
5. Git is the state machine
No separate ledger files. Git tracks all state.
Branch: the run lives on swarmy/<run-tag>, branched from current HEAD.
Commits are the ledger. Every experiment is a commit with a structured message:
[swarmy] i=<N> c=<candidate> score=<score> status=<keep|discard|crash|merged> | <short description>
Examples:
[swarmy] i=0 c=baseline score=34/100 status=keep | baseline measurement
[swarmy] i=1 c=c1 score=61/100 status=keep | column-counting fix for connectors
[swarmy] i=1 c=c2 score=42/100 status=discard | unicode box-drawing characters
[swarmy] i=2 c=c1 score=72/100 status=keep | grapheme cluster measurement
[swarmy] i=3 c=merged score=78/100 status=merged | combined c1 alignment with c2 whitespace
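A small formatter can keep these messages consistent across a run. This is a sketch, not part of swarmy: `format_commit_message` is a hypothetical helper, and rejecting whitespace in the score is an added assumption that keeps the field easy to grep.

```python
ALLOWED_STATUSES = {"keep", "discard", "crash", "merged"}


def format_commit_message(
    iteration: int, candidate: str, score: str, status: str, description: str
) -> str:
    """Build a ledger commit message in the [swarmy] format shown above."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    # Assumption: scores contain no whitespace, so they stay greppable.
    if any(ch.isspace() for ch in score):
        raise ValueError("score must not contain whitespace")
    return (
        f"[swarmy] i={iteration} c={candidate} "
        f"score={score} status={status} | {description}"
    )
```

For example, `format_commit_message(1, "c1", "61/100", "keep", "column-counting fix for connectors")` reproduces the second example line above.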
Reconstructing state is cheap:
git log --oneline --grep='\[swarmy\]'
git log --oneline --grep='\[swarmy\]' -5
git log --oneline --grep='status=keep'
Keep = branch advances (commit stays at HEAD).
Discard = git reset --hard to last kept commit.
Merge = combine elements from multiple candidates, commit with status=merged.
Parallel candidates = worktree branches off current best. Winner merges back.
Works for non-code artifacts too. Prompt files, configs, data scripts -- commit them. The diff documents what changed.
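Reconstructing state from the log can be sketched as parsing the `git log --oneline` output back into fields. The helpers below are illustrative only. They assume lines arrive newest first, as git prints them, and that both status=keep and status=merged mark a commit the branch advanced to.

```python
import re
from typing import Dict, Iterable, List, Optional

LEDGER_RE = re.compile(
    r"\[swarmy\] i=(?P<i>\d+) c=(?P<c>\S+) score=(?P<score>\S+) "
    r"status=(?P<status>keep|discard|crash|merged) \| (?P<desc>.*)"
)


def parse_ledger(log_lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse ledger commits out of log lines, skipping unrelated commits."""
    entries = []
    for line in log_lines:
        match = LEDGER_RE.search(line)
        if match:
            entries.append(match.groupdict())
    return entries


def current_best(entries: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Return the newest entry that advanced the branch, or None.

    Assumption: entries are ordered newest first.
    """
    for entry in entries:
        if entry["status"] in ("keep", "merged"):
            return entry
    return None
```

All fields come back as strings, including the iteration number, because the ledger is plain text.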
6. Context protection
The goal is keeping your context clean so you can run for many iterations. Not avoiding involvement -- you read what you need to make decisions. You avoid ingesting noise.
Explorers write strategy memos to .swarmy/explore-i<N>.md. You read these -- they're targeted content that informs your decisions.
Implementers redirect all output:
> .swarmy/logs/impl-i<N>-c<M>.log 2>&1
You don't read implementation logs. Check completion by verifying the commit landed.
Testers redirect raw execution output and extract only the metric:
# In the tester's dispatch instructions:
# 1. Run the eval, capture everything: command > .swarmy/logs/test-i<N>-c<M>.log 2>&1
# 2. Extract ONLY the score to: .swarmy/results/i<N>-c<M>.txt
# 3. Do not include raw output in your completion
You read only the results files -- a few lines each.
On crash: tail -n 30 of the relevant log. Nothing more.
Explicitly instruct workers to follow these output patterns in your dispatch prompts. Don't rely on them figuring it out.
7. Iterations
Bounded count from setup. Within each iteration, you compose the primitives as the problem demands. You decide candidate count, whether to explore, and how to test based on the current state.
8. Implementation threading
Implementation is single-threaded by default. Parallelize ONLY when trying genuinely different approaches that need worktree isolation (different strategies modifying the same files). If you're pursuing one approach, or approaches build on each other, run them sequentially.
Exploration can be parallelized freely (e.g., dispatch multiple researchers to investigate different aspects of the problem simultaneously).
9. Keep / discard / merge
After testing:
- One clear winner that beats current best: keep, advance branch.
- Multiple candidates each solve different aspects: dispatch an implementer to merge the best elements, test the merged result, commit with status=merged.
- No improvement: discard all, commit with status=discard, adjust strategy.
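For a single numeric score where higher is better, the keep and discard cases reduce to a comparison against the current best. The sketch below covers only those two cases under that assumption. The merge case depends on your reading of what each candidate solved and is not modeled. The name `decide` is illustrative.

```python
from typing import Dict, Optional, Tuple


def decide(
    best_score: float, candidate_scores: Dict[str, float]
) -> Tuple[str, Optional[str]]:
    """Return (status, winner) for one iteration.

    Assumption: scores are numeric and higher is better. A candidate must
    strictly beat the current best to be kept; a tie counts as no improvement.
    """
    if not candidate_scores:
        return ("discard", None)
    winner = max(candidate_scores, key=lambda name: candidate_scores[name])
    if candidate_scores[winner] > best_score:
        return ("keep", winner)
    return ("discard", None)
```

For lower-is-better metrics such as latency, the comparison would need to be flipped.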
10. Isolation
You decide. Don't ask the user.
- Git worktrees (via dispatch): multiple candidates modifying the same files in parallel.
- Variant directories: prompts, configs, non-code artifacts. Candidates in .swarmy/candidates/c1/, c2/, etc. Winner gets committed to the target location.
- No isolation: single candidate, or candidates touch different files.
11. Human in the loop
Default: autonomous. Pull the human in when:
- Verifiable outcome requires human judgment
- All candidates failed and you're uncertain how to proceed
- A dispatch worker asks a question you can't answer
- You want to propose a significant direction change
- Scores are noisy or improving the metric degrades other qualities -- pause, ask to recalibrate
Do NOT ask "should I continue?" between iterations. Stay autonomous by default.
The Loop
Reminder: every "dispatch" below means Skill(skill="dispatch", args="..."). Never Agent.
SETUP
Check dispatch dependency
Establish: goal, target, outcome, boundaries, iterations
Present plan, get confirmation
Branch: git checkout -b swarmy/<run-tag>
BASELINE (iteration 0)
Dispatch tester to measure current state
Commit: [swarmy] i=0 c=baseline score=<X> status=keep | baseline
ITERATE (1 to N):
1. EXPLORE
/dispatch explorer(s) -- can be parallel
Explorers write memos to .swarmy/explore-i<N>.md
Read the memos. Based on findings and git log of past iterations,
decide what to implement: how many candidates, what approach each
takes, whether to use worktrees or sequential implementation.
2. IMPLEMENT
/dispatch implementer(s)
Single-threaded unless trying different approaches in worktrees.
Each implementer: modifies target, commits, redirects output to logs.
Check completion (commit landed), don't read logs.
3. TEST
/dispatch tester per candidate
Tester gets: candidate location, verifiable outcome definition.
Tester does NOT get: strategy or implementation reasoning.
Tester: runs eval, captures raw output to log, extracts score
to .swarmy/results/i<N>-c<M>.txt
4. DECIDE (you)
Read results files.
Keep / discard / merge. Advance branch if improved.
Commit with structured [swarmy] message.
If plateau (2+ iterations, no improvement): widen search,
dispatch deeper exploration, or pull in human.
Proceed to next iteration.
COMPLETE
Report: best result, improvement over baseline, what worked.
git log --grep='\[swarmy\]' for full history.
Ask the user if they want to merge swarmy/<run-tag> into their branch.
Clean up .swarmy/ working files, keep git history.
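The plateau rule in the DECIDE step (2+ iterations, no improvement) can be checked mechanically when scores are numeric. This is a sketch under two assumptions: higher is better, and you track the best score after each iteration with the baseline at index 0. `is_plateau` is a hypothetical helper.

```python
from typing import Sequence


def is_plateau(best_by_iteration: Sequence[float], window: int = 2) -> bool:
    """Return True if the best score has not improved over the last `window` iterations.

    best_by_iteration[k] is the best score after iteration k, with the
    baseline at index 0. Assumption: higher is better.
    """
    if len(best_by_iteration) <= window:
        return False  # not enough iterations yet to call it a plateau
    return best_by_iteration[-1] <= best_by_iteration[-1 - window]
```

With a history of 34, 61, 72, 72, 72 this reports a plateau, since the best score after the last iteration is no higher than it was two iterations earlier.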
Principles
- You are the PI. You read memos, interpret results, set direction. You're deeply involved intellectually. You just don't execute.
- Git is memory. The commit history is the ledger, diffs are the documentation, the branch is the state.
- Protect your context. Output goes to files, metrics go to you. Raw execution output never enters your context.
- Simpler is better. When candidates score similarly, prefer the simpler change. A simplification that holds the score is a win.
- Don't repeat experiments. Run git log --grep='\[swarmy\]' --oneline before dispatching.
- Fail forward. Crashes are data. Commit with status=crash. They narrow the search space.
- The eval is sacred. Never modify evaluation criteria unless the human explicitly recalibrates.
Example Plans
Optimizing an LLM system prompt
- Goal: improve a system prompt so the LLM produces higher quality responses for a specific task (e.g., customer support, code review, summarization)
- Target: prompts/system_prompt.md
- Outcome: dispatch a judge agent to run the prompt against 15 test scenarios and score each response on accuracy, tone, and completeness. Judge compares against reference outputs if available.
- Score format: score=accuracy:4,tone:5,completeness:3 or aggregate score=38/45
- Isolation: variant prompt files in .swarmy/candidates/
- Iteration 1: explore by running current prompt against all test scenarios, dispatch a researcher to categorize where responses fall short (wrong tone? missing info? hallucinations?). Then 3 prompt rewrites, each targeting a different failure category.
- Subsequent: narrow toward whichever rewrite improved the weakest category, iterate on remaining gaps
Improving performance of a slow function
- Goal: reduce execution time of a function or endpoint while keeping all existing tests passing
- Target: src/pipeline.py (or whichever module contains the bottleneck)
- Outcome: python benchmarks/bench_pipeline.py outputs p50/p95 latency. Existing tests (pytest tests/) must still pass. Both conditions required for status=keep.
- Score format: score=p50:42ms,p95:89ms (lower is better)
- Isolation: worktrees (same source file modified by candidates)
- Read-only: test suite, benchmark harness
- Iteration 1: explore by profiling current implementation (dispatch researcher to run profiler, identify hotspots, propose optimization strategies). Then 2-3 candidates in worktrees: maybe one tries algorithmic changes, one tries caching, one tries data structure swaps.
- Key constraint: any candidate that breaks tests is automatically status=discard, regardless of speedup
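The two-condition gate in this plan could be sketched as follows. The helper names are hypothetical. Reading "better" as "no percentile regresses and at least one improves" is an assumption, because the plan does not say how to treat a candidate that improves p50 but worsens p95.

```python
import re
from typing import Dict


def parse_latency(score: str) -> Dict[str, float]:
    """Parse 'p50:42ms,p95:89ms' into {'p50': 42.0, 'p95': 89.0}."""
    pairs = re.findall(r"(p\d+):(\d+(?:\.\d+)?)ms", score)
    return {name: float(value) for name, value in pairs}


def candidate_status(tests_passed: bool, candidate: str, best: str) -> str:
    """Return 'keep' or 'discard' for a candidate latency score."""
    if not tests_passed:
        return "discard"  # broken tests override any speedup
    cand, cur = parse_latency(candidate), parse_latency(best)
    # Assumption: lower is better, and a mixed result counts as no improvement.
    no_regression = all(cand[key] <= cur[key] for key in cur)
    some_gain = any(cand[key] < cur[key] for key in cur)
    return "keep" if no_regression and some_gain else "discard"
```

A mixed result may be a better reason to pull in the human than to discard outright. The sketch takes the conservative branch only to stay deterministic.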
Matching a frontend to a design mockup
- Goal: make a frontend component or page visually match a reference design image as closely as possible
- Target: src/components/Dashboard.tsx (or relevant component/page files + CSS)
- Outcome: dispatch a judge agent that screenshots the rendered frontend (e.g., via puppeteer/playwright) and compares it to the reference mockup image. Judge scores on layout accuracy, spacing, typography, color fidelity, and responsive behavior.
- Score format: score=layout:4,spacing:3,type:5,color:4,responsive:2 or aggregate score=18/25
- Isolation: worktrees (same component files modified by candidates)
- Read-only: the reference mockup image, test/screenshot harness
- Iteration 1: explore by screenshotting the current state and dispatching a researcher to compare against the mockup, cataloging every visual discrepancy (wrong margins? misaligned grid? wrong font weight? color mismatch?). Then 2-3 candidates each targeting different categories of discrepancy.
- Key pattern: the tester must screenshot the rendered output and compare visually -- it cannot just read the code and guess. This makes the eval more expensive but much more reliable than code-level diffing.
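In the frontend plan, the aggregate score is consistent with summing the per-dimension values against a maximum of 5 per dimension. That reading is inferred from the figures in this plan, not stated, so the sketch below takes the maximum as a parameter. `aggregate_score` is an illustrative name.

```python
def aggregate_score(score: str, max_per_dimension: int = 5) -> str:
    """Collapse 'layout:4,spacing:3,...' into an aggregate 'score=<sum>/<max>'.

    Assumption: every dimension is an integer on the same 0..max scale.
    """
    prefix = "score="
    if score.startswith(prefix):
        score = score[len(prefix):]
    values = [int(part.split(":", 1)[1]) for part in score.split(",")]
    return f"score={sum(values)}/{max_per_dimension * len(values)}"
```

Applied to the multi-dimensional example in this plan, it yields the aggregate shown next to it, score=18/25.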