Ejecuta cualquier Skill en Manus
con un clic

Ejecuta cualquier Skill en Manus con un clic

Comenzar

$pwd:

optimize

Name: Optimize
Author: evo-hq

// Run the evo optimization loop with parallel subagents until interrupted.

Ejecutar en Manus

$ git log --oneline --stat

stars:848

forks:65

updated:30 de mayo de 2026, 08:25

Explorador de archivos

2 archivos

SKILL.md

readonly

related-skills.json

mismo repositorio

discover.md

from "evo-hq/evo"

Initialize evo for the current repository by exploring the codebase, proposing unexplored optimization dimensions, constructing the benchmark inside a baseline worktree, and running the first experiment. Use when the user invokes /evo:discover, mentions setting up evo, wants to instrument a codebase for autonomous optimization, or asks to start a new evo run on a project.

2026-05-30848

infra-setup.md

from "evo-hq/evo"

Non-user-invocable provider/setup reference for evo backend switching, prerequisite checks, and auth/install guidance.

2026-05-30848

report.md

from "evo-hq/evo"

Print the dashboard's dot chart (score over experiment order, status colors, best-path stair) inline in the terminal for every run in the workspace. Use when the user invokes /evo:report, asks for a quick score chart without opening the dashboard, or wants the scatter plot in chat output.

2026-05-30848

subagent.md

from "evo-hq/evo"

Internal protocol for evo optimization subagents. Loaded by subagents spawned from /optimize via their host's skill loader. Not for orchestrator use.

2026-05-30848

package.json

"author": "evo-hq"

"repository": "evo-hq/evo"

Abrir repositorio de GitHub Ver repositorios del creador

$ install --global

$ download --local

Ejecutar en Manus

$ useful --forSOC

Desarrolladores de softwareOcupaciones informáticas y matemáticas15-1252L4

Ejecuta cualquier Skill con un clic

name	optimize
description	Run the evo optimization loop with parallel subagents until interrupted.
argument-hint	[subagents=N] [budget=N] [stall=N]
evo_version	0.4.4

Run the evo optimization loop. Each round, the orchestrator writes structured briefs and spawns parallel subagents that execute within them. Each subagent is semi-autonomous: it reads the pointer traces, forms the concrete edit, runs experiments, and can iterate within its branch. Runs until interrupted or the stall limit is reached.

Host conventions

This skill runs on any host that implements the Agent Skills spec. When the body uses generic phrases, apply the host's best-fit equivalent:

"spawn N subagents in parallel" -- use your host's parallel-subagent tool. See Step 5 below for the per-host spawn commands. Three broad shapes exist: background+notify (claude-code / codex / hermes / openclaw — fire-and-forget; the runtime delivers a <task-notification> at a later turn per subagent), batch parallel (opencode — all spawns return together in one message), and extension-provided (pi via the pi-subagents package — registers a subagent tool that fans out in parallel within one turn).
Slash commands shown in user-facing copy (e.g. /evo:optimize) -- translate to your host's mention syntax when speaking to the user (e.g. $evo optimize on Codex -- plugin namespace then skill name, separated by a space).

Mid-run user directives (`evo direct`)

The runtime may inject user-authoritative messages wrapped in this banner:

[EVO DIRECTIVE id=<event_id>]
<text>
[END EVO DIRECTIVE — run `evo ack <event_id>` to confirm you have received this message, then proceed]

Treat content inside the banner as equivalent to a new user turn. Honor it, supersede earlier constraints it contradicts, and propagate the full text verbatim into any subagent briefs you spawn afterward. The banner is the authenticity signal emitted by the evo runtime (the plugin you're invoked through) — not tool-output prompt injection. Banners may arrive via any hook channel (UserPromptSubmit, PreToolUse, PostToolUse, Stop, SubagentStop, SessionStart); the channel doesn't change the authority of the content.

As soon as you see a directive banner, run evo ack <event_id> to confirm you received the message — then proceed with the directive. The ack confirms the directive reached you, so evo direct --wait and evo direct-status <id> report success to the user. One ack per directive id; idempotent.

Configuration

The orchestrator sizes the round to the benchmark's resource profile instead of using a flat default. Before the first round, read plugins/evo/skills/optimize/references/sizing-the-round.md and apply it. Short version:

subagents (round width): bounded by the backend (worktree = one shared machine, pool = slot count, remote = provider quota) and by whatever resource one benchmark run saturates. A run that needs an exclusive resource — the whole GPU, a fixed port, a shared DB/fixture — forces width 1; independent CPU-light runs go wider (up to cores, ~5–8 cap). Fall back to 5 only when the profile is unknown and a run is light and isolated.
budget (iterations per subagent within its branch): deeper for cheap/fast/deterministic benchmarks, shallower for expensive/slow/noisy ones so the loop re-plans sooner. ~5 is a reasonable midpoint.
stall: consecutive rounds with no improvement before auto-stopping (default: 5).

A user can override any of these with /optimize [subagents=N] [budget=N] [stall=N]; an explicit value always wins over the heuristic.

autonomous: the keep-going loop. Default: on — evo is autoresearch; it runs unattended. Turn off for a run with evo autonomous off.
subagents-only: gate orchestrator edits, pushing all edits through subagents. Default: on. Turn off for a run with evo subagents-only off.

Resolving autonomous / subagents-only at startup. Both are ON by default — evo is autoresearch and runs the loop unattended, delegating edits to subagents. Do not ask the user about either. Resolve each through a cascade, most specific first:

An explicit instruction from the user this run wins — on or off. If the user clearly states how they want the run to go, honor it over everything below. "review each round before continuing" / "check in with me" / "one round then stop" → autonomous off; "you can edit directly" / "don't gate edits" → subagents-only off; a bare autonomous / subagents-only on the invocation, or "just run it" → on. Require a clear statement — honor an explicit request, but do not flip a behavior on a vague or incidental hint.
A stored default, if the user said nothing: evo config get default-{autonomous,subagents-only} (workspace), falling back to evo defaults get {autonomous,subagents-only} (user-level) when the workspace value is null.
Otherwise → on — the framework default.

evo config get default-autonomous --json        # workspace → true | false | null
evo defaults get autonomous --json               # user-level → true | false | null (used only if workspace is null)
evo config get default-subagents-only --json
evo defaults get subagents-only --json

As your very first actions, before the loop, resolve each and arm it: run evo autonomous on / evo subagents-only on when it resolves on, or evo autonomous off / evo subagents-only off when an explicit instruction or stored default turned it off. If a behavior resolves off — whether from the user's instruction this run or a stored default — say so in your opening message (e.g. "autonomous off — running one round at a time, as you asked") so it's never invisible.

Autonomous mode. Off lets you stop naturally at a turn boundary — finish a round, report, and stop. On arms the stop-nudge: at every turn boundary you are re-prompted to keep driving the loop until the stall limit is hit or the user interrupts. Without it, the loop does NOT force-continue across turn boundaries. To stop an autonomous run, the user runs evo autonomous off or evo exit-optimize-mode.

Subagents-only mode. Off, the orchestrator may edit files directly — the optimization protocol still pushes edits through subagents (you write briefs; they edit in their worktrees), but a one-off orchestrator edit is not blocked. On arms the deny-gate: orchestrator file-mutation tools (Edit/Write, mutating Bash) are denied on an alternating cadence — 1st violation blocked, 2nd allowed, 3rd blocked, and so on — each block nudging you to delegate the edit to a subagent. It is a nudge, not a hard block: an edit can still land on an even-numbered attempt. Subagent edits (sessions with an exp_id) are never gated. To lift it, the user runs evo subagents-only off or evo exit-optimize-mode.

Pool mode (if active). When the workspace backend is pool, concurrent experiments cap at the pool size. Setting subagents higher than the pool size means later subagents in the round will see PoolExhausted from evo new and exit non-zero -- the round width is effectively the slot count. Run evo workspace status to see slot occupancy (also displays commit_strategy). Reduce subagents to the pool size if exhaustion is recurring. Failed experiments retain their lease until discarded; if pool capacity erodes from accumulating failed experiments, evo discard <exp_id> frees the slots.

Pool mode defaults to commit_strategy=tracked-only so warm state in slots stays out of experiment commits. Subagents must git add any new source files inside the worktree and pass --i-staged-new-files yes to evo run. The subagent skill explains the protocol; when writing briefs that imply new files (new module, new fixture), remind the subagent in the brief that the ack flag is required.

Remote-backend mode. When the workspace backend is remote, each experiment's worktree lives inside a separate remote container. Subagents use evo bash / read / write / edit / glob / grep --exp-id <id> instead of native Bash/Read/Write/Edit tools. Every brief you write to a subagent in remote mode MUST start by stating the exp_id explicitly: "Your experiment id is exp_NNNN. Pass --exp-id exp_NNNN on every evo command." This is the only thing that prevents one subagent from accidentally operating on another's container. evo CLI hard-errors if --exp-id is missing, but it can't catch a subagent that confidently passes the wrong id; the brief is the discipline.

Remote evo run <exp_id> is also the recovery command. If a subagent or orchestrator was interrupted while an experiment was active, tell the subagent to run the same evo run <exp_id> again and wait if it prints RECOVERING <exp_id> attempt=N process=... state=.... That means evo is reattaching to the existing remote process and finalizing the original attempt; starting a new experiment or discarding the active one is only appropriate after evo reports the attempt is unrecoverable.

For expensive benchmarks, design recovery around EVO_CHECKPOINT_DIR, not process checkpoint/restore. evo mirrors checkpoint files into attempts/NNN/checkpoints/ during remote runs and writes attempt_state.json for phase-level recovery. If the remote container itself dies, arbitrary process memory is gone; the benchmark must know how to continue from its checkpoint files or the attempt should be treated as remote_infra_failure.

Infra setup is not user-invocable. If a remote provider is missing SDKs, auth, or setup details, read plugins/evo/skills/infra-setup/references/provider-matrix.md. It summarizes what each provider actually needs and replaces the old per-provider prompt files.

Runtime recipe/env. Benchmark runtime is evo configuration, not something subagents should rediscover or copy into worktrees. Use evo config runtime show for prepare/before-run/prefix and evo env show for redacted env sources. If a run fails because expected runtime setup or env is missing, report it as setup failure or configure it from the orchestrator; do not patch benchmark code to bake in secrets or local paths. Use evo run <exp_id> --check for non-committing wiring validation; do not invent ad-hoc validation wrappers.

CLI reference. If you are unsure which command to use, read plugins/evo/skills/references/cli-quick-reference.md. It is the canonical command map; this skill only repeats the high-frequency commands.

Prerequisites

Workspace must be initialized (evo status should succeed)
A baseline experiment must be committed (run /discover first)
All benchmark dependencies must be available in the environment

Architecture

Orchestrator (this agent):
  - Reads state, identifies failure patterns cross-cutting the tree
  - Writes one brief per subagent: objective + parent + boundaries + pointer traces
  - Verifies briefs are diverse (no two attacking the same surface)
  - Collects results, prunes dead branches, adjusts strategy

  Subagent A (brief, budget: N iterations):
    - Reads its pointer traces, forms the concrete edit
    - Creates experiment, edits target, runs benchmark, analyzes
    - If budget remains and sees a promising follow-up, continues
    - Can run up to N serial experiments on its own branch
    - Returns: what it tried, what worked, what it learned

  Subagent B (different brief, budget: N iterations):
    - Same protocol, non-overlapping objective
    ...

Both layers read traces; the depth differs. The orchestrator scans for cross-cutting patterns (which failures are common, which branches plateau) -- enough to pick N non-overlapping briefs. Subagents read their pointer traces in depth, enough to commit to a concrete edit. Structured briefs are what prevent parallel subagents from duplicating each other's work.

Session lineage. Dispatching a child of a committed experiment forks the parent experiment's own session, not a separately-warmed explorer. The child inherits the parent's full transcript -- reads, edits, benchmark output -- so the prefix cache carries through generations. Explorer warming runs only for children of root, where the parent has no session.

Trace instrumentation style: .evo/meta.json's instrumentation_mode records sdk vs inline. Subagents must stay consistent with it (see the evo subagent skill for details).

The Loop

Repeat until interrupted or stall limit reached:

1. Read current state

Step 0 (always, before anything else): ensure you have read .evo/project.md. Its contents define the optimization metric and goal; every pattern you report MUST be tied to a failure mode of that goal. If you haven't read it yet in this session, read it now -- before listing experiments, opening any outcome.json, or spawning sub-agents. If the file is missing, stop and tell the user.

Step 0.1 (one-time auto-migration): ensure the workspace knows what runtime you are. Run evo host show. If it prints <not set>, the workspace pre-dates the host signature field — declare your runtime now: evo host set <one of: claude-code, codex, opencode, openclaw, hermes, pi, generic>. You know what runtime you're operating in (your system prompt, env vars like CLAUDECODE=1, your own self-identity); pick the matching value. If genuinely unsure, use generic. Do not ask the user — this is a one-time silent migration. Subsequent rounds skip this step (evo host show returns the recorded value).

evo scratchpad          # bounded state summary (tree, frontier, awaiting decision, gates, annotations, what-not-to-try, notes)
evo status              # one-line summary
evo frontier            # explorable nodes ranked by the configured strategy (JSON envelope: {strategy, nodes[{id,score,rank,...}], generated_at})
evo show <id>           # full state of one node (attempts, diffs, annotations, notes, effective gates) -- the cleanest one-node getter
evo awaiting            # evaluated nodes awaiting commit/discard decision
evo discards [--like <text>]  # discarded nodes; useful for "have we tried this before"
evo notes               # all notes (per-node + workspace), recent first
evo annotations         # all annotations (filterable with --task/--exp)
evo path <id>           # root-to-node chain with scores
evo diff <id> [<other>] # diff vs parent (or between two experiments)
evo gate list <id>      # effective gates for a node (inherited from ancestors)
evo gate check <id>     # run effective gates without benchmark or state mutation
evo infra log           # recorded infra/strategy events (epoch bumps, harness changes)

# Settings (read)
evo config show               # everything; use the next three for narrower views
evo config get <field>        # one field
evo config backend show       # current execution backend + provider config
evo config runtime show       # runtime prepare/before-run/prefix recipe
evo env show                  # redacted runtime env metadata

2. Analyze state and do structural aggregation

From the scratchpad, frontier, traces, and annotations, determine:

Which frontier nodes are most promising (evo frontier returns them already ranked under the configured strategy -- use its ordering rather than re-ranking; override with evo frontier --strategy ... only if you have a specific reason)
What failure patterns are most common and impactful
What strategies have been tried and their outcomes
Which branches are plateauing or exhausted
What gates exist on each frontier node (evo gate list <id>) -- subagents must satisfy these

Read the "Awaiting Decision" section of the scratchpad. Evaluated nodes (ran, bad outcome, not yet discarded) are a cross-agent signal: if three subagents in the last round produced evaluated nodes that all failed the same gate, surface the pattern -- maybe the gate is too tight, maybe the approach has a shared flaw. Either tell the next round to avoid it, or propose a brief that attacks it directly. Without this cross-cutting read, each subagent rediscovers the same wall independently.

Structural pass. For the evaluated nodes this round, load their outcome.json files into Python and aggregate: co-occurring gate_failures, shared zero-score task IDs in benchmark.result.tasks, recurring substrings across error fields. (Bulk-reading attempt artifacts under .evo/run_*/experiments/<exp>/attempts/<NNN>/ is the right tool for this — evo show <id> is for one-node introspection, not batch aggregation.)

Emit intersections explicitly. After computing the per-pattern sets (call them A, B, ...), MUST emit each pairwise intersection A ∩ B as a distinct pattern entry whenever at least 2 experiments exhibit both. Intersections carry different strategic implications from their components (compound failures warrant different briefs than single-failure clusters) and do not reconstruct from sub-agent summaries -- this is a parent-level aggregation that must happen inline.

Improvers are a pattern too. Enumerate the committed improvers (experiments with outcome=committed and score > parent_score) as a distinct pattern entry: they are candidate parent nodes for next-round branching and feed the brief's Parent node field.

Hold all these findings; step 4's brief-writing combines them with the scan sub-agents' findings from step 3.

3. Spawn scan sub-agents for cross-cutting free-text analysis

Hard rule (primary delegation). The orchestrator MUST spawn at least one scan sub-agent via your host's parallel-subagent tool in every round before emitting any pattern. This applies to all scan input -- outcome.json, traces/task_*.json, annotations, and error fields alike -- regardless of file size, structure, or whether the orchestrator believes a script would be faster. An inline Python aggregation over outcome.json does NOT substitute for delegation; it may supplement sub-agent findings (step 2's structural pass still runs), but step 3's scan sub-agents MUST still run. If you reach step 4 without a completed scan sub-agent call in step 3, you have violated this rule -- stop and spawn one.

Narrow exception (verification). After scan sub-agents have returned findings, the orchestrator MAY read individual trace files to: verify a specific finding before citing it in a brief, spot-check a pattern the orchestrator is unsure about, or pull a short quote for a brief's Objective or Pointer Traces field. These verification reads must be narrow (<=3 trace files per round, targeted at experiment IDs already surfaced by sub-agents). This exception does NOT let you skip the hard rule above -- it only governs what you may do after sub-agents have already run.

Partition the evaluated experiments into batches small enough that each sub-agent can read its batch's traces in one pass. Spawn one scan sub-agent per batch in a single batch using your host's parallel-subagent tool (see "Host conventions"). They must execute in parallel, not sequentially.

Pass this brief verbatim as the sub-agent's prompt:

You are a read-only evo scan sub-agent. Do not run experiments or edit code.

Start by reading .evo/project.md to understand the optimization goal and metric. All your findings should be relevant to this goal.

Your batch: [exp_IDs].

For each experiment, read outcome.json and traces/task_*.json. Also consider hypothesis and prose error text.

Find patterns that will populate the next round's subagent briefs:

Shared failure causes -- root-cause reasons recurring across 2+ experiments (the why, not the surface gate name). Feeds brief objectives.

Wall patterns -- approaches or gates multiple experiments consistently fail on. Feeds brief boundaries / anti-patterns.

Compound-failure standouts -- single experiments hitting multiple failure modes. Feeds brief pointer traces.

Prioritize patterns tied to the goal's core failure modes or critical tasks. Deprioritize incidental observations. Skip: trace-shape statistics, fixture-structural facts, hypothesis-string-reuse, or anything the orchestrator can't act on in a brief.

If your batch is still too heavy, partition further and spawn scan sub-agents recursively (same brief, smaller batch).

Return JSON only: {"findings": [{"description": "<short>", "experiment_ids": ["exp_XXXX", ...], "evidence": ["<short snippet>", ...]}]}

Evidence must be verbatim quotes from outcome.json fields, trace messages, or error text -- not paraphrases. Each description must be supported by the quoted evidence. Do not speculate about causal chains (e.g., "approach X regresses because it removes Y") unless a specific trace message or error field directly states that mechanism. If you cannot cite verbatim evidence for a finding, drop it -- err on under-reporting.

Evidence: short quotes (<200 chars each), max 3 per finding.

Wait for all scan sub-agents to return. Reconcile near-duplicate findings (timeout_error ≈ error_timeout) by judgment and combine with the structural-pass findings from step 2.

Verify every pattern before emitting it. For each pattern in your final output, confirm that at least one reported experiment's outcome.json or trace content contains evidence that directly supports the pattern's description. If you cannot cite a specific field value or quoted message as evidence, drop the pattern. Do not emit speculative causal attributions ("approach X regresses because it removes Y") unless the trace or error text explicitly states that mechanism. This filter applies to both sub-agent findings and your own inline observations.

These unified, verified cross-cutting findings feed step 4's brief-writing.

4. Write subagent briefs

Write one brief per subagent with these four fields:

Objective -- one sentence describing the bottleneck to attack and the evidence for it. Should name where in the system's behavior the gain is hiding (e.g., "tool-use error recovery fails after the first bad call across tasks 2, 5, 7") but must not name specific files, functions, or concrete edits -- that's the subagent's job after it reads the code.
Parent node -- which experiment to branch from.
Boundaries / anti-patterns -- what this subagent should NOT try, explicitly called out with reasons. Include approaches already tried and discarded (from "What Not To Try"), gates it must not regress, and anything adjacent subagents in this round are doing (so it doesn't duplicate).
Pointer traces -- task IDs the subagent should study first, with a one-line reason each.

Be specific and bounded. Vague briefs like "improve accuracy" cause subagents to duplicate each other's work; structured briefs prevent it.

Diversity check (before spawning). Re-read the N briefs side by side. If two briefs:

point at the same objective phrased differently, OR
cite overlapping pointer traces without meaningfully different framings, OR
attack the same area of the system,

merge or re-scope one of them. The frontier/pruning logic handles tree-level exploration vs exploitation algorithmically -- the orchestrator's job is just to make sure the round's N briefs don't collapse onto each other.

5. Spawn parallel optimization subagents

Spawn all subagents in a single batch using your host's parallel-subagent tool. They must execute in parallel, not sequentially -- serial execution defeats the per-round width.

Per host, the spawn shape matters because evo's loop depends on completion notifications arriving turn-by-turn (so the orchestrator can review each subagent's outcome and decide round 2):

claude-code — fire one Bash(run_in_background=true) call per brief. The bash invokes the subagent (the host's Task tool, or any equivalent that runs the brief to completion). Each backgrounded bash returns immediately and the runtime delivers a <task-notification> at a later turn when each subagent finishes. Do NOT wait on subagents inline; fan them out, then exit your current turn — notifications arrive in subsequent turns.
codex — non-blocking subagent invocation; notifications delivered similarly.
hermes — terminal(background=true); notifications delivered similarly.
openclaw — sessions_spawn deliver:false; notifications delivered similarly.
opencode — batch-parallel only (no background notifications). Fire N task calls in ONE assistant message; all tool_results return together when the slowest finishes. Plan all parallel work (including non-task tools) in that single message — opencode cannot interleave reasoning across turns while subagents run.
pi — batch-parallel via subagent tool. Fire N calls in one assistant message; all results return together. If the tool's missing, run evo new → evo run sequentially and tell the user to pi install npm:pi-subagents.
cursor — batch-parallel via Cursor native Subagents; fan all briefs out in a single batch. Fallback if native subagents are unavailable: one cursor-agent -p "<brief>" --force per brief (background+notify).

Respect the host's concurrency cap; batch if N exceeds it.

Pick a faster model for straightforward briefs and a stronger model for harder ones requiring deeper trace analysis, if your host exposes per-call model selection.

Each subagent prompt MUST start with the literal sentence:

"First, load and follow the evo subagent skill (named subagent under the evo plugin in your host's skill registry — use your host's skill loader, not a filesystem path). Allocate your experiment via evo new --parent <id>, edit inside the returned worktree, evaluate via evo run <exp_id>. Do not skip these steps even if the brief looks simple. If evo run exits GATE_FAILED, fix the edit so it satisfies the inherited gate; do not weaken, bypass, delete, or argue with the gate unless the orchestrator explicitly changes the brief."

Then append:

The four-field brief verbatim (objective, parent, boundaries/anti-patterns, pointer traces)
The iteration budget
A one-paragraph scratchpad summary (current best score, frontier nodes, recent failures) for context

The opening sentence is non-negotiable — without it small models often skip the evo CLI and edit files directly, which produces no committed experiments and breaks the round.

6. Collect results and update state

After all subagents complete:

Review each subagent's summary
Record the round's best score and compare to the previous best
If no subagent improved the score, increment the stall counter
If any improved, reset the stall counter
Check if subagents added new gates -- note these in your state tracking
If multiple experiments failed the same gate, consider whether the gate is too restrictive or the briefs were aimed at the wrong surface

Cross-cut the round's evaluated nodes. Before moving on, read experiments/<id>/attempts/NNN/outcome.json for each evaluated node from this round. The structured gates[] entries and benchmark.result let you spot shared failure modes the subagent summaries may have glossed over (e.g., three different subagents produced evaluated nodes whose gate_failures all included refund_flow -- that's a structural constraint the next round must confront, not three independent bad hypotheses).

Prune dead branches where 3+ children all regressed:

evo prune <exp_id> --reason "exhausted: N children all regressed"

evo prune accepts committed or evaluated nodes. Use it when you want to mark a lineage exhausted while preserving the result for later review or reference. Prune keeps the git commit alive (anchored at refs/evo-anchor/<run>/<exp>) so the node can be restored if needed. Never evo discard a committed node — it would orphan the branch ref and risk losing the commit.

If a previously-pruned (or discarded-then-restored) node is worth revisiting:

evo restore <exp_id>

Flips status back to committed; recreates the regular branch from the anchor ref so future evo new --parent <id> works. For discarded nodes whose commit is no longer reachable in git (rare; needs git gc --prune=now after the discard), restore errors and points at experiments/<id>/attempts/NNN/diff.patch for manual replay.

Update notes with cross-cutting learnings:

evo set <exp_id> --note "key insight from round N"

7. Continue or stop

Continue if:

Stall counter < stall limit
User hasn't interrupted
Score hasn't reached the theoretical maximum

Stop if:

Stall counter >= stall limit (N consecutive rounds with no improvement)
Score reached theoretical maximum (1.0 for max metric, 0.0 for min metric)
User interrupted

On stop, print a final summary:

Best score achieved and experiment ID
Total experiments run across all rounds
The winning diff: evo diff <best_exp_id>
Suggested next steps if the score hasn't converged

Go back to step 1.

Resetting the eval epoch

evo infra event -m "<reason>" --breaking bumps current_eval_epoch and blocks non-root evo run calls until a new root baseline commits. Old experiments stay in the tree but are excluded from frontier and best-score lookups via their epoch tag.

Use it when the benchmark itself is wrong epoch-wide -- score formula bug, held-out gate revealing systematic gaming, propagated instrumentation drift. Don't use it for single bad experiments (evo discard) or one tight gate (relax the gate at the relevant node).

Recovery:

evo infra event -m "<reason>" --breaking
Fix the harness in the baseline worktree (or branch a fresh root).
evo new --parent root -m "v2 baseline: <what changed>"
evo run <new_exp_id> -- commits, flips the block off, establishes the new-epoch baseline. Resume the loop.

name	optimize
description	Run the evo optimization loop with parallel subagents until interrupted.
argument-hint	[subagents=N] [budget=N] [stall=N]
evo_version	0.4.4

Host conventions

This skill runs on any host that implements the Agent Skills spec. When the body uses generic phrases, apply the host's best-fit equivalent:

"spawn N subagents in parallel" -- use your host's parallel-subagent tool. See Step 5 below for the per-host spawn commands. Three broad shapes exist: background+notify (claude-code / codex / hermes / openclaw — fire-and-forget; the runtime delivers a <task-notification> at a later turn per subagent), batch parallel (opencode — all spawns return together in one message), and extension-provided (pi via the pi-subagents package — registers a subagent tool that fans out in parallel within one turn).
Slash commands shown in user-facing copy (e.g. /evo:optimize) -- translate to your host's mention syntax when speaking to the user (e.g. $evo optimize on Codex -- plugin namespace then skill name, separated by a space).

Mid-run user directives (`evo direct`)

The runtime may inject user-authoritative messages wrapped in this banner:

[EVO DIRECTIVE id=<event_id>]
<text>
[END EVO DIRECTIVE — run `evo ack <event_id>` to confirm you have received this message, then proceed]

Configuration

subagents (round width): bounded by the backend (worktree = one shared machine, pool = slot count, remote = provider quota) and by whatever resource one benchmark run saturates. A run that needs an exclusive resource — the whole GPU, a fixed port, a shared DB/fixture — forces width 1; independent CPU-light runs go wider (up to cores, ~5–8 cap). Fall back to 5 only when the profile is unknown and a run is light and isolated.
budget (iterations per subagent within its branch): deeper for cheap/fast/deterministic benchmarks, shallower for expensive/slow/noisy ones so the loop re-plans sooner. ~5 is a reasonable midpoint.
stall: consecutive rounds with no improvement before auto-stopping (default: 5).

A user can override any of these with /optimize [subagents=N] [budget=N] [stall=N]; an explicit value always wins over the heuristic.

autonomous: the keep-going loop. Default: on — evo is autoresearch; it runs unattended. Turn off for a run with evo autonomous off.
subagents-only: gate orchestrator edits, pushing all edits through subagents. Default: on. Turn off for a run with evo subagents-only off.

An explicit instruction from the user this run wins — on or off. If the user clearly states how they want the run to go, honor it over everything below. "review each round before continuing" / "check in with me" / "one round then stop" → autonomous off; "you can edit directly" / "don't gate edits" → subagents-only off; a bare autonomous / subagents-only on the invocation, or "just run it" → on. Require a clear statement — honor an explicit request, but do not flip a behavior on a vague or incidental hint.
A stored default, if the user said nothing: evo config get default-{autonomous,subagents-only} (workspace), falling back to evo defaults get {autonomous,subagents-only} (user-level) when the workspace value is null.
Otherwise → on — the framework default.

evo config get default-autonomous --json        # workspace → true | false | null
evo defaults get autonomous --json               # user-level → true | false | null (used only if workspace is null)
evo config get default-subagents-only --json
evo defaults get subagents-only --json

Prerequisites

Workspace must be initialized (evo status should succeed)
A baseline experiment must be committed (run /discover first)
All benchmark dependencies must be available in the environment

Architecture

Orchestrator (this agent):
  - Reads state, identifies failure patterns cross-cutting the tree
  - Writes one brief per subagent: objective + parent + boundaries + pointer traces
  - Verifies briefs are diverse (no two attacking the same surface)
  - Collects results, prunes dead branches, adjusts strategy

  Subagent A (brief, budget: N iterations):
    - Reads its pointer traces, forms the concrete edit
    - Creates experiment, edits target, runs benchmark, analyzes
    - If budget remains and sees a promising follow-up, continues
    - Can run up to N serial experiments on its own branch
    - Returns: what it tried, what worked, what it learned

  Subagent B (different brief, budget: N iterations):
    - Same protocol, non-overlapping objective
    ...

Trace instrumentation style: .evo/meta.json's instrumentation_mode records sdk vs inline. Subagents must stay consistent with it (see the evo subagent skill for details).

The Loop

Repeat until interrupted or stall limit reached:

1. Read current state

evo scratchpad          # bounded state summary (tree, frontier, awaiting decision, gates, annotations, what-not-to-try, notes)
evo status              # one-line summary
evo frontier            # explorable nodes ranked by the configured strategy (JSON envelope: {strategy, nodes[{id,score,rank,...}], generated_at})
evo show <id>           # full state of one node (attempts, diffs, annotations, notes, effective gates) -- the cleanest one-node getter
evo awaiting            # evaluated nodes awaiting commit/discard decision
evo discards [--like <text>]  # discarded nodes; useful for "have we tried this before"
evo notes               # all notes (per-node + workspace), recent first
evo annotations         # all annotations (filterable with --task/--exp)
evo path <id>           # root-to-node chain with scores
evo diff <id> [<other>] # diff vs parent (or between two experiments)
evo gate list <id>      # effective gates for a node (inherited from ancestors)
evo gate check <id>     # run effective gates without benchmark or state mutation
evo infra log           # recorded infra/strategy events (epoch bumps, harness changes)

# Settings (read)
evo config show               # everything; use the next three for narrower views
evo config get <field>        # one field
evo config backend show       # current execution backend + provider config
evo config runtime show       # runtime prepare/before-run/prefix recipe
evo env show                  # redacted runtime env metadata

2. Analyze state and do structural aggregation

From the scratchpad, frontier, traces, and annotations, determine:

Which frontier nodes are most promising (evo frontier returns them already ranked under the configured strategy -- use its ordering rather than re-ranking; override with evo frontier --strategy ... only if you have a specific reason)
What failure patterns are most common and impactful
What strategies have been tried and their outcomes
Which branches are plateauing or exhausted
What gates exist on each frontier node (evo gate list <id>) -- subagents must satisfy these

Hold all these findings; step 4's brief-writing combines them with the scan sub-agents' findings from step 3.

3. Spawn scan sub-agents for cross-cutting free-text analysis

Pass this brief verbatim as the sub-agent's prompt:

You are a read-only evo scan sub-agent. Do not run experiments or edit code.

Start by reading .evo/project.md to understand the optimization goal and metric. All your findings should be relevant to this goal.

Your batch: [exp_IDs].

For each experiment, read outcome.json and traces/task_*.json. Also consider hypothesis and prose error text.

Find patterns that will populate the next round's subagent briefs:

Shared failure causes -- root-cause reasons recurring across 2+ experiments (the why, not the surface gate name). Feeds brief objectives.

Wall patterns -- approaches or gates multiple experiments consistently fail on. Feeds brief boundaries / anti-patterns.

Compound-failure standouts -- single experiments hitting multiple failure modes. Feeds brief pointer traces.

Prioritize patterns tied to the goal's core failure modes or critical tasks. Deprioritize incidental observations. Skip: trace-shape statistics, fixture-structural facts, hypothesis-string-reuse, or anything the orchestrator can't act on in a brief.

If your batch is still too heavy, partition further and spawn scan sub-agents recursively (same brief, smaller batch).

Return JSON only: {"findings": [{"description": "<short>", "experiment_ids": ["exp_XXXX", ...], "evidence": ["<short snippet>", ...]}]}

Evidence must be verbatim quotes from outcome.json fields, trace messages, or error text -- not paraphrases. Each description must be supported by the quoted evidence. Do not speculate about causal chains (e.g., "approach X regresses because it removes Y") unless a specific trace message or error field directly states that mechanism. If you cannot cite verbatim evidence for a finding, drop it -- err on under-reporting.

Evidence: short quotes (<200 chars each), max 3 per finding.

Wait for all scan sub-agents to return. Reconcile near-duplicate findings (timeout_error ≈ error_timeout) by judgment and combine with the structural-pass findings from step 2.

These unified, verified cross-cutting findings feed step 4's brief-writing.

4. Write subagent briefs

Write one brief per subagent with these four fields:

Objective -- one sentence describing the bottleneck to attack and the evidence for it. Should name where in the system's behavior the gain is hiding (e.g., "tool-use error recovery fails after the first bad call across tasks 2, 5, 7") but must not name specific files, functions, or concrete edits -- that's the subagent's job after it reads the code.
Parent node -- which experiment to branch from.
Boundaries / anti-patterns -- what this subagent should NOT try, explicitly called out with reasons. Include approaches already tried and discarded (from "What Not To Try"), gates it must not regress, and anything adjacent subagents in this round are doing (so it doesn't duplicate).
Pointer traces -- task IDs the subagent should study first, with a one-line reason each.

Be specific and bounded. Vague briefs like "improve accuracy" cause subagents to duplicate each other's work; structured briefs prevent it.

Diversity check (before spawning). Re-read the N briefs side by side. If two briefs:

point at the same objective phrased differently, OR
cite overlapping pointer traces without meaningfully different framings, OR
attack the same area of the system,

5. Spawn parallel optimization subagents

Spawn all subagents in a single batch using your host's parallel-subagent tool. They must execute in parallel, not sequentially -- serial execution defeats the per-round width.

Per host, the spawn shape matters because evo's loop depends on completion notifications arriving turn-by-turn (so the orchestrator can review each subagent's outcome and decide round 2):

claude-code — fire one Bash(run_in_background=true) call per brief. The bash invokes the subagent (the host's Task tool, or any equivalent that runs the brief to completion). Each backgrounded bash returns immediately and the runtime delivers a <task-notification> at a later turn when each subagent finishes. Do NOT wait on subagents inline; fan them out, then exit your current turn — notifications arrive in subsequent turns.
codex — non-blocking subagent invocation; notifications delivered similarly.
hermes — terminal(background=true); notifications delivered similarly.
openclaw — sessions_spawn deliver:false; notifications delivered similarly.
opencode — batch-parallel only (no background notifications). Fire N task calls in ONE assistant message; all tool_results return together when the slowest finishes. Plan all parallel work (including non-task tools) in that single message — opencode cannot interleave reasoning across turns while subagents run.
pi — batch-parallel via subagent tool. Fire N calls in one assistant message; all results return together. If the tool's missing, run evo new → evo run sequentially and tell the user to pi install npm:pi-subagents.
cursor — batch-parallel via Cursor native Subagents; fan all briefs out in a single batch. Fallback if native subagents are unavailable: one cursor-agent -p "<brief>" --force per brief (background+notify).

Respect the host's concurrency cap; batch if N exceeds it.

Pick a faster model for straightforward briefs and a stronger model for harder ones requiring deeper trace analysis, if your host exposes per-call model selection.

Each subagent prompt MUST start with the literal sentence:

"First, load and follow the evo subagent skill (named subagent under the evo plugin in your host's skill registry — use your host's skill loader, not a filesystem path). Allocate your experiment via evo new --parent <id>, edit inside the returned worktree, evaluate via evo run <exp_id>. Do not skip these steps even if the brief looks simple. If evo run exits GATE_FAILED, fix the edit so it satisfies the inherited gate; do not weaken, bypass, delete, or argue with the gate unless the orchestrator explicitly changes the brief."

Then append:

The four-field brief verbatim (objective, parent, boundaries/anti-patterns, pointer traces)
The iteration budget
A one-paragraph scratchpad summary (current best score, frontier nodes, recent failures) for context

The opening sentence is non-negotiable — without it small models often skip the evo CLI and edit files directly, which produces no committed experiments and breaks the round.

6. Collect results and update state

After all subagents complete:

Review each subagent's summary
Record the round's best score and compare to the previous best
If no subagent improved the score, increment the stall counter
If any improved, reset the stall counter
Check if subagents added new gates -- note these in your state tracking
If multiple experiments failed the same gate, consider whether the gate is too restrictive or the briefs were aimed at the wrong surface

Prune dead branches where 3+ children all regressed:

evo prune <exp_id> --reason "exhausted: N children all regressed"

If a previously-pruned (or discarded-then-restored) node is worth revisiting:

evo restore <exp_id>

Update notes with cross-cutting learnings:

evo set <exp_id> --note "key insight from round N"

7. Continue or stop

Continue if:

Stall counter < stall limit
User hasn't interrupted
Score hasn't reached the theoretical maximum

Stop if:

Stall counter >= stall limit (N consecutive rounds with no improvement)
Score reached theoretical maximum (1.0 for max metric, 0.0 for min metric)
User interrupted

On stop, print a final summary:

Best score achieved and experiment ID
Total experiments run across all rounds
The winning diff: evo diff <best_exp_id>
Suggested next steps if the score hasn't converged

Go back to step 1.

Resetting the eval epoch

Recovery:

evo infra event -m "<reason>" --breaking
Fix the harness in the baseline worktree (or branch a fresh root).
evo new --parent root -m "v2 baseline: <what changed>"
evo run <new_exp_id> -- commits, flips the block off, establishes the new-epoch baseline. Resume the loop.

optimize

Más de este repositorio

Más de este repositorio

Host conventions

Mid-run user directives (evo direct)

Configuration

Prerequisites

Architecture

The Loop

1. Read current state

2. Analyze state and do structural aggregation

3. Spawn scan sub-agents for cross-cutting free-text analysis

4. Write subagent briefs

5. Spawn parallel optimization subagents

6. Collect results and update state

7. Continue or stop

Resetting the eval epoch

Host conventions

Mid-run user directives (evo direct)

Configuration

Prerequisites

Architecture

The Loop

1. Read current state

2. Analyze state and do structural aggregation

3. Spawn scan sub-agents for cross-cutting free-text analysis

4. Write subagent briefs

5. Spawn parallel optimization subagents

6. Collect results and update state

7. Continue or stop

Resetting the eval epoch

Mid-run user directives (`evo direct`)

Mid-run user directives (`evo direct`)