一键在 Manus 中运行任何 Skill

$pwd:

rp-optimize-v2

Name: Rp Optimize V2
Author: novotnyllc

// Iterative performance optimization loop using RepoPrompt MCP tools: instrument with debug-only metrics, establish a baseline, then plan → delegate one optimize+harden cycle → re-measure → ask oracle for next plan, looping until the oracle is satisfied or the target metric is met

在 Manus 中运行

$ git log --oneline --stat

stars:0

forks:0

updated:2026年5月10日 18:06

文件资源管理器

2 个文件

SKILL.md

readonly

related-skills.json

同仓库

subagent-orchestration.md

from "novotnyllc/subagent-orchestration"

Use for Codex multi_agent_v2 subagent orchestration: delegation decisions, spawn_agent prompts, custom agent_type roles, reasoning_effort choices, safe parallelism, wait/list/send/followup/close handling, commit discipline, receipts, and root-thread coordination.

2026-05-150

rp-build-v2.md

from "novotnyllc/subagent-orchestration"

Build with RepoPrompt MCP tools context builder plan → implement

2026-05-100

rp-deep-plan-v2.md

from "novotnyllc/subagent-orchestration"

Deep planning workflow using RepoPrompt MCP tools: map seams, draft, critique, polish — produces a ready-to-execute plan document

2026-05-100

rp-investigate-v2.md

from "novotnyllc/subagent-orchestration"

Deep investigation with RepoPrompt MCP tools: tools gather evidence, follow-up reasoning synthesizes selected context

2026-05-100

rp-oracle-export-v2.md

from "novotnyllc/subagent-orchestration"

Export a ChatGPT-ready Question / Plan / Review prompt using RepoPrompt MCP tools

2026-05-100

rp-orchestrate-v2.md

from "novotnyllc/subagent-orchestration"

Plan, decompose, and delegate complex tasks across multiple agents using RepoPrompt MCP tools

2026-05-100

package.json

"author": "novotnyllc"

"repository": "novotnyllc/subagent-orchestration"

打开 GitHub 仓库查看创作者相关仓库

$ install --global

$ download --local

在 Manus 中运行

$ useful --forSOC

软件开发工程师计算机与数学类职业15-1252L4

一键运行任何 Skill

name	rp-optimize-v2
description	Iterative performance optimization loop using RepoPrompt MCP tools: instrument with debug-only metrics, establish a baseline, then plan → delegate one optimize+harden cycle → re-measure → ask oracle for next plan, looping until the oracle is satisfied or the target metric is met
repoprompt_managed	true
repoprompt_skills_version	61
repoprompt_variant	mcp

MCP Optimizer

Raw request: $ARGUMENTS

You are an optimization orchestrator. Performance work only improves what you can measure, so the loop is always: map → plan → instrument & baseline → optimize loop → decide. Keep looping until the oracle signals the gains have plateaued, the target metric is met, or the iteration cap is reached.

This workflow is delegation-heavy by design. Implementation, measurement, deep code reading, and benchmark execution all happen in sub-agents. You own coordination, planning, and the stop decision. Your direct tool calls are reserved for: triaging the user's prompt, reading the scoreboard, spot-checking sub-agent claims, and curating the file selection that the oracle and context_builder reason over.

How delegation flows

You (the agent): Orchestrate. Translate the prompt, fan out code_mapper agents to map surface area, run context_builder to plan setup, dispatch implementation_lead to land instrumentation + baseline, then loop implementation_lead dispatches for each optimization. Your context stays lean.
Mapping agents (spawn_agent with agent_type:"code_mapper", reasoning_effort:"low", fork_turns:"none"): Read-only sub-agents that map the surface — find AGENTS.md, locate hot paths, discover existing benchmarks, define scope boundaries. Cheap and parallel; spawn liberally during Phase 1.
context_builder: Used in plan mode during Phase 2 to design the metric, instrumentation strategy, and first-pass optimization candidates in one shot. Reused in the loop to plan each individual optimization.
Implementation agents (spawn_agent with agent_type:"implementation_lead", reasoning_effort:"high", fork_turns:"none"): Carry out implementation, measurement, and hardening. Phase 3 lands instrumentation and the baseline. Each loop iteration lands one attributed change, runs tests, re-measures, and appends to the scoreboard.
Oracle (oracle_send): Reasons over the scoreboard and changed files at decision points — "did this iteration earn its keep?" and "should we keep going?". Selection-aware; you curate before each call.

Core principles

Don't read what an agent can read for you. Reserve direct read_file / file_search / git for verifying agent claims and reading the scoreboard. Mapping the codebase, running benchmarks, reading AGENTS.md — those go to sub-agents.
One attributed change per loop iteration. Causality is cheap to preserve and expensive to recover.
The scoreboard is the shared truth. Every iteration appends; nothing gets overwritten. Sub-agents and the oracle both read from it.
The oracle is the stop signal. You don't decide when to stop on gut feel — you ask, with the scoreboard in selection, and respect the answer.

Phase 0: Workspace Verification (REQUIRED)

Before any optimization work, bind to the target codebase using its working directory:

{"tool":"bind_context","args":{"op":"bind","working_dirs":["/absolute/path/to/project"]}}

This auto-resolves to the window containing your project. No need to list windows first.

If binding succeeds → proceed to Phase 1 If no match → the codebase isn't loaded. Find and open the workspace:

{"tool":"manage_workspaces","args":{"action":"list"}}
{"tool":"manage_workspaces","args":{"action":"switch","workspace":"<workspace_name>","open_in_new_window":true}}

Then retry the working_dirs bind.

Phase 1: Surface Mapping & Bottleneck Scouting (delegate to `code_mapper` agents)

Your job here is prompt translation + orchestrated scouting, not codebase exploration. Spend at most 1–2 navigation calls turning the user's request into the codebase's actual nouns, then fan out code_mapper agents in parallel to scout for bottleneck candidates around the named target.

The user names what to optimize, but the actual bottleneck is rarely just inside that function. It can sit in the callers (called 10k times in a tight loop, where the loop itself is the cost), in the inputs (caller wastefully constructs the data the target consumes), in adjacent operations that run together in the same code path, or in shared infrastructure the target touches (locks, caches, allocators). Scouting radiates outward from the named target so Phase 2's plan is grounded in evidence about where time is actually going.

1a. Translate the prompt

Rewrite the user's request in the repo's likely terminology — don't dive deeper yet.

Example:

Raw: "Speed up search"
Translated: "Reduce p95 latency of path-matching under the test fixtures — likely PathMatcher and friends. Need to confirm exact module and existing benchmarks."

Default: run the full fan-out. Even when the user names the function, the cost is rarely all inside that function — callers, inputs, and adjacent operations often dominate. Bottleneck scouting is what surfaces that.

Two narrow exceptions:

Profile data already exists (user attached a sample report or pointed at a recent profiler trace in the repo) → dispatch one focused code_mapper to read the trace + summarize bottleneck candidates, then go to Phase 2. Skip the rest of the fan-out.
User gave a feature/feeling ("feels slow during X") → full fan-out, plus add <X>-entry-point discovery to the "Target & call graph" brief.

1b. Dispatch `code_mapper` agents in parallel

Spawn code_mapper agents — each with one narrow question — for the facts you need before context_builder can plan. The bottleneck-candidates scout is the heart of this phase. Typical fan-out:

Agent	Question
Bottleneck candidates	"Scout for performance bottlenecks around `<translated target>`. Look at the target itself AND its surrounding context: callers (especially loops or hot code paths that invoke the target frequently), data dependencies (how inputs to the target are constructed upstream), adjacent operations that run together in the same code path, shared infrastructure the target touches (locks, caches, allocators). Hunt for: tight loops with per-iteration allocations, redundant computation across iterations, locking/serialization, expensive data transformations (JSON/XML/string), sync I/O on hot paths, O(n²) patterns, repeated work that could be cached, unbatched UI/IO updates. Report 2–3 ranked candidates with `file:line` refs and a one-sentence rationale per candidate ('suspicious because…'). Don't propose fixes yet — just identify what looks expensive."
Target & call graph	"Locate the implementation of `<translated target>`. Then map its call graph: who calls it, how often, and in what context (tight loop? cold init? user-driven? background job?). Report the implementation `file:line` and the 3–5 most relevant call sites with the surrounding code context that explains how the target is invoked."
Prior perf work	"Find prior performance work related to `<area>`. Look for: (a) existing benchmarks, perf tests, or instrumentation in code (search `Tests`, `Bench`, `Perf*`, `benchmarks/`, `bench/`); (b) profiler traces, sample reports, or perf logs in the repo (look under `error-triage/`, `reports/`, `docs/investigations/`, `perf/`, or similar); (c) TODOs/FIXMEs/comments mentioning 'slow', 'perf', 'O(n', 'hot path', 'bottleneck' in or near `<area>`. Report what exists, where it lives, and (for benchmarks/instrumentation) how to invoke it."
Conventions	"Read AGENTS.md (or the project's testing/benchmarking doc). Report: how to run unit tests, how to run benchmarks if any, how to launch a debug harness, and any sanctioned measurement commands. Quote the exact commands."
Scope	"Identify the file/module boundary that defines `<area>` — which files are in scope for changes, which are clearly out of scope. List both."

Each spawn_agent call starts an independent agent; no detach flag is needed:

// The Bottleneck candidates scout — full prose, since this is the heart of Phase 1
{"tool":"spawn_agent","args":{
	"task_name":"bottleneck_candidates_<area>",
	"agent_type":"code_mapper",
	"reasoning_effort":"low",
	"fork_turns":"none",
	"message":"Scout for performance bottlenecks around <translated target>. Look at the target AND its surrounding context: callers (especially loops or hot paths invoking it frequently), data dependencies (how inputs are constructed upstream), adjacent operations in the same code path, shared infrastructure the target touches. Hunt for: tight loops with per-iteration allocations, redundant computation, locking/serialization, expensive transformations, sync I/O on hot paths, O(n^2) patterns, unbatched updates. Report 2-3 ranked candidates with file:line refs and a one-sentence rationale per candidate. Don't propose fixes yet."
}}

// One more representative one — same shape, different question
{"tool":"spawn_agent","args":{
	"task_name":"conventions_agents_md",
	"agent_type":"code_mapper",
	"reasoning_effort":"low",
	"fork_turns":"none",
	"message":"Read AGENTS.md (or the project's equivalent). Report how to run unit tests, benchmarks, debug harness, sanctioned measurement commands. Quote exact commands."
}}

// Repeat the same shape for the remaining 3 scouts in the table above
// (Target & call graph, Prior perf work, Scope).

// Then wait on the batch
{"tool":"wait_agent","args":{"timeout_ms":180000}}
{"tool":"list_agents","args":{}}

⚠️ Agents may block on permission approvals. Use wait_agent / list_agents periodically so you can approve and keep them unblocked.

If the bottleneck-candidates scout returns thin or generic results ("nothing obviously expensive"), that's a signal — either the area is genuinely well-tuned and the user's complaint is elsewhere, or the scout needed broader radius. Either way, re-dispatch one targeted code_mapper with a wider radius (e.g., "look two call levels up") rather than reading the code yourself.

1c. Synthesize the target

When the explores return, write down (in your head or as scratch — no need for a file yet):

Metric: a single, nameable number (latency, peak memory, allocations, frame time, etc.) plus its unit.
Stop criterion: hard threshold ("p95 < 50 ms"), relative target ("30% faster"), or "until oracle says diminishing returns".
Scope: the file/module boundary, citing what's in and what's out (from the Scope scout).
Measurement command: the exact command(s) the explores reported (from Conventions + Prior perf work).
Ranked bottleneck candidates: 2–3 candidates with file:line and rationale, augmented by call-site context from the call-graph scout.

If the bottleneck candidates and the user's named target diverge significantly — e.g., user said "speed up PathMatcher.match" but the scout reports the real cost is in the caller's per-iteration allocation of MatchOptions — pause before dispatching Phase 2 and ask the user explicitly: "Scouting suggests the bigger lever is X (file:line). Pursue X, stay on the original target, or both?" Wait for their answer; don't reframe the scope unilaterally.

You'll feed all five to context_builder in Phase 2.

Phase 2: Plan Setup with `context_builder` (plan mode)

Now that the surface is mapped and bottleneck candidates are in hand, route the setup design through context_builder in plan mode. One call produces:

Instrumentation strategy — where the metric will be measured, what build gate to use, which secondary test/support file holds the collection logic. Where the instrumentation lives is informed by the bottleneck candidates from Phase 1 — measure where the cost actually is.
Baseline procedure — how many samples, how to discard outliers, what variance band to expect.
First-pass optimization candidates — 2–3 concrete optimizations ranked by expected delta vs. risk, grounded in the bottleneck candidates from Phase 1 (each of those is now a candidate to address; context_builder translates them into actionable changes with risk assessment).
Scoreboard scaffold — the markdown shape for prompt-exports/optimize-<slug>-runs.md.

{"tool":"context_builder","args":{
	"instructions":"<task>Design the setup for an iterative optimization loop targeting <metric> on <scope>.\n\nReturn an actionable plan with:\n1. Instrumentation strategy: which file to add/extend (must be a test/support file, not production code), the debug-build gate to use (e.g. #if DEBUG / cfg(debug_assertions) / NODE_ENV / etc. — pick what matches this repo's convention), and the smallest hook the production code needs to expose.\n2. Baseline procedure: how many samples, how to discard outliers, expected variance band, and the exact command to run (per AGENTS.md).\n3. First-pass optimization candidates: 2–3 concrete optimizations ranked by (expected delta / risk). For each: the change, why it should help the metric, and what could regress.\n4. Scoreboard scaffold: the initial contents for prompt-exports/optimize-<slug>-runs.md.</task>\n\n<context>\nUser request: <raw request>\nMetric + units: <name + units>\nStop criterion: <threshold or 'oracle-satisfied'>\nScope: <files/modules in play>\nMeasurement command (from Conventions scout): <exact command>\nTarget & call graph (from Target scout): <implementation file:line + 3–5 callers with context>\nBottleneck candidates (from Bottleneck scout — the heart of Phase 1): <ranked list with file:line and one-sentence rationale per candidate>\nPrior perf work (from Prior perf work scout): <existing benchmarks, profiler traces, perf TODOs and what they say>\nProject conventions doc: AGENTS.md (already summarized by the scout — the agent doesn't need to re-read it)\n\nWhen ranking first-pass optimization candidates, treat the Bottleneck candidates list as the seed. Each item there is a hypothesis about where time is going; the plan should propose how to address each, with risk and verification.\n</context>",
	"response_type":"plan",
	"export_response":true
}}

The tool returns oracle_export_path and oracle_export_instruction. Save the export path — Phase 3 and every loop iteration reference it.

If the plan looks thin (no concrete instrumentation site, vague candidates), refine the instructions and re-run rather than trying to fill gaps yourself.

Phase 3: Land Instrumentation + Baseline (delegate to `implementation_lead`)

You don't run measurements. Dispatch a single implementation_lead agent to execute the setup plan:

{"tool":"spawn_agent","args":{
	"task_name":"optimize_setup",
	"agent_type":"implementation_lead",
	"reasoning_effort":"high",
	"fork_turns":"none",
	"message":"Read the setup plan at <plan path> with read_file first. Execute the setup phase only -- do not pursue any optimization yet:\n\n1. Land the instrumentation per the plan: keep it in a secondary test/support file, gate it behind the repo's debug build flag, and expose only the minimum hook in production code.\n2. Verify a release build with instrumentation stripped still compiles cleanly.\n3. Capture the baseline: run the measurement command 3-5 times. Discard obvious outliers. Record the median and p95 (or whichever is appropriate for this metric). Note the variance band -- if optimizations smaller than that band would be invisible, say so explicitly.\n4. Create prompt-exports/optimize-<slug>-runs.md from the scoreboard scaffold in the plan, fill in the baseline row with median, p95, environment notes, and current commit.\n5. Report back: instrumentation files touched, the exact command used, baseline numbers, variance, and any concerns about measurement reliability.\n\nDo not attempt any optimization yet -- that's the next iteration. Skip oracle review; the orchestrator handles that."
}}

When the implementation agent returns:

Read the scoreboard. That's your verification surface — not the diff. If the baseline row looks reasonable, move on.
Sanity-check variance. If the variance band is large enough to swallow a 10–20% optimization, that's a problem. Steer the implementation agent to take more samples or narrow the workload before continuing.
Don't read the instrumentation diff yet. If concerns surface in later phases, dispatch a narrow code_mapper to summarize the diff for you.

Housekeeping

Sessions persist after agents finish — useful when you might revisit output, but they pile up over a multi-agent workflow. Once you've recorded what an agent produced, you can dismiss its session:

{"tool":"close_agent","args":{"target":"<task_name>"}}

Mapping-agent sessions are good to dismiss right away — narrow reconnaissance, no follow-up value. Keep heavier agent sessions if you might revisit them.

Phase 4: The Optimization Loop

One iteration = one attributed change + one re-measurement. Running multiple optimizations in parallel destroys causality — keep the loop serial by default.

Loop until one of the termination criteria in 4d fires.

4a. Plan the next optimization

The Phase 2 plan listed first-pass candidates. For iteration 1, pick the top-ranked one and skip straight to 4b. From iteration 2 onward, use context_builder to plan the next single change with the scoreboard in selection. The Phase 2 plan's candidates are still the seed; context_builder refines whichever you select next. Reach for the oracle in this slot only when you already have a fully-formed candidate and just need a sanity check before dispatch.

{"tool":"manage_selection","args":{
	"op":"set",
	"paths":["<target source files>","<benchmark or test>","prompt-exports/optimize-<slug>-runs.md","<setup plan path>"],
	"mode":"full"
}}
{"tool":"context_builder","args":{
	"instructions":"<task>Propose the single next optimization to pursue for <metric>. One change, not a list. Include: the specific change, why you expect it to move the metric, any risks to behavior or correctness, how to verify it didn't regress other tests.</task>\n\n<context>Current baseline and prior runs are in prompt-exports/optimize-<slug>-runs.md. The setup plan at <setup plan path> already listed first-pass candidates — prefer one of those if it still looks promising and hasn't been tried. Target: <threshold or directional goal>. Scope: <modules>.</context>",
	"response_type":"plan",
	"export_response":true
}}

If the plan proposes more than one change, pick the one with the best expected delta per unit of risk.

4b. Dispatch one full optimize-and-harden loop

Dispatch one implementation_lead agent for the selected change. The brief covers landing the optimization and hardening it in one shot — implementation, tests, re-measurement, scoreboard append:

{"tool":"spawn_agent","args":{
	"task_name":"optimize_<n>",
	"agent_type":"implementation_lead",
	"reasoning_effort":"high",
	"fork_turns":"none",
	"message":"Read the plan at <plan path> with read_file first. Run one full optimize-and-harden loop:\n\n1. Implement the change in <files>.\n2. Run the project's standard test command (see AGENTS.md) for the touched modules -- fix anything that breaks.\n3. Re-run the same measurement command used for the baseline (in prompt-exports/optimize-<slug>-runs.md). Take the same number of samples as the baseline so deltas are comparable. Append a new row -- don't overwrite.\n4. If the change regressed the metric or broke correctness, either revert it or iterate once to fix, then report back.\n5. Summarize: what you changed, what the metric moved to, tests touched, concerns worth flagging.\n\nStay inside <scope>. Don't pursue tangential optimizations -- one attributed change per loop. Skip oracle review; the orchestrator handles that."
}}

Always use implementation_lead with reasoning_effort:"high". Optimizations involve trade-offs (correctness, locality, complexity), and re-measurement requires interpreting noisy results. Use test_automator with reasoning_effort:"medium" only for isolated measurement/test work.

4c. Verify (without re-reading the codebase)

Verify the agent's output against the plan's done-when criteria — for optimize, that's the scoreboard, not the diff. Optimization-specific checks, all designed for minimal direct reads:

Read the scoreboard, not the diff. A new row should be there with sample-matched numbers. If the agent forgot to append, steer it to fix.
Is the delta real? Compare to the variance band you recorded in Phase 3. Single-digit shifts inside the noise band are inconclusive — note that in the scoreboard before consulting the oracle.
Tests actually ran? "Ran the tests" in a summary isn't the same as tests passing. If you're suspicious, ask the agent for the exact command and exit code, or dispatch a narrow test_automator to re-run them.
Behavior spot-check, when needed. If the change touches a correctness-sensitive surface, dispatch a narrow code_mapper agent to read the diffed files and report whether semantics shifted (early returns, cache invalidation, ordering). Don't pull the diff into your own context unless the scout flags something concrete.

If the change regressed the metric or broke correctness that the sub-agent didn't catch, steer the same agent to fix it before opening a new loop. Rolling back counts as progress — record the attempt in the scoreboard so the next plan knows that path was tried.

4d. Ask the oracle for the next plan (and the stop decision)

After a successful iteration, refresh the selection and ask the oracle both questions in one call:

{"tool":"manage_selection","args":{
	"op":"set",
	"paths":["<files changed this iteration>","<benchmark or test>","prompt-exports/optimize-<slug>-runs.md"],
	"mode":"full"
}}
{"tool":"oracle_send","args":{
	"message":"Plan: We just landed <change summary>. Metric moved from <baseline> to <new>. Scoreboard is in the selection. Given the stop criterion (<criterion>), should we run another iteration? If yes, what's the single best next optimization to pursue. If no, explain why you think we've hit diminishing returns or the target.",
	"mode":"plan"
}}

The oracle's answer determines the loop's next step:

"Keep going, try X next" → return to 4a with X as the seed for the next plan. Reuse the same scoreboard and instrumentation; only re-instrument if the hot path moved.
"We're done" / "Diminishing returns" / target met → exit the loop, go to Phase 5.
"Can't tell — measurement is too noisy" or "behavior may have regressed" → stop optimizing. The next dispatch must be an implementation_lead to harden the instrumentation or fix the regression. Do not plan a new optimization on top of an unreliable measurement — every later result becomes uninterpretable.

Termination criteria (stop conditions)

Exit the loop when any of these fire:

Oracle says so. "Good enough" or "diminishing returns".
Target metric met. The stop criterion from Phase 1 is satisfied in the latest measurement.
Iteration cap. 5 loops, hard. Before dispatching loop 6, surface the scoreboard to the user and ask explicitly to continue. Don't extend the cap on your own judgment.
Oracle can't propose a plausible next move. Two consecutive "I'm not sure what to try" responses means the search has stalled.
Regression budget exhausted. If correctness keeps breaking faster than performance improves, stop and escalate to the user.

Parallelism note

The loop is serial by design — attribution collapses when you run multiple changes at once. The one legitimate use of parallelism is evaluating alternatives for the same slot: dispatch two implementation_lead agents to try two different candidate optimizations on branches or temporary copies, pick the winner, discard the loser. This is advanced and rarely worth the coordination cost; don't reach for it unless the oracle explicitly suggests it.

Housekeeping (loop)

Same session-cleanup hygiene as Phase 3. Also delete superseded plan exports each iteration so prompt-exports/ reflects only live work:

Plan and review exports generated during orchestration (via export_response:true on context_builder or oracle_send) accumulate under prompt-exports/ as files like oracle-plan-<date>-<slug>.md or oracle-review-<date>-<slug>.md. Once an export has been superseded by a newer plan, consumed by the sub-agent it was meant for, or otherwise made irrelevant by completed work, delete it so the folder reflects only live, in-progress plans. file_actions.delete requires a true absolute filesystem path, not the relative display path shown under prompt-exports/; use get_file_tree with type:"roots" if you need the loaded root's absolute path. When unsure, leave it.

{"tool":"file_actions","args":{"action":"delete","path":"/absolute/path/to/repo/prompt-exports/<stale-export>.md"}}

Phase 5: Final Rollup

After all iterations complete, give the user a final rollup:

What was accomplished per iteration
Any failures or partial completions
Any conflicts or coordination issues that surfaced
Suggested follow-ups if anything was deferred

Specifically for optimize:

Starting metric → final metric, with iteration count. A one-line summary: "PathMatcher.match p95: 124ms → 38ms over 4 iterations (-69%)."
Which changes landed, in order, with their individual deltas.
Which changes were tried and reverted, with the reason — useful so the next person doesn't repeat dead ends.
State of the instrumentation. If the debug-only metrics are worth keeping, say so; otherwise suggest removal. The scoreboard file under prompt-exports/ can stay as historical record or be deleted — default to keeping it and let the user decide.
Known follow-ups. Anything the oracle flagged but wasn't pursued this session.

Role Summary

You (the agent) own triage, prompt translation, scoreboard reads, sub-agent verification, and the stop decision. Everything else is delegated:

Capability	Mapping Agents	`context_builder`	Implementation Lead (setup)	Implementation Lead (each loop)	Oracle (`oracle_send`)
Map surface (AGENTS.md, target & call graph, prior perf work, scope)	✅ Primary	—	—	—	—
Scout bottleneck candidates around target	✅ Primary	—	—	—	—
Plan setup (metric, instrumentation, candidates from scouting)	—	✅ Primary	—	—	—
Land instrumentation + capture baseline	—	—	✅ Primary	—	—
Plan one optimization	—	✅ Primary	—	—	⚠️ sanity-check only
Implement + test + re-measure + append scoreboard	—	—	—	✅ Primary	—
Delegated spot-check / diff summary	✅ on demand	—	—	—	—
Decide continue vs stop	—	—	—	—	✅ Primary

Cheat sheet for the four operations you'll repeat:

spawn_agent agent_type=<code_mapper|implementation_lead> reasoning_effort=<effort> fork_turns=none  # dispatch
wait_agent / list_agents                                                                  # block/check
followup_task target="<task_name>"                                                        # correct
`context_builder` response_type=plan export_response=true                                  # plan

Anti-patterns

🚫 Skipping the bottleneck-candidates scout because the user named a specific function — even with a named target, callers and inputs often dominate the cost
🚫 Skipping context_builder in Phase 2 and dispatching the setup implementation agent from your own sketch — you'll lose the candidate queue and the instrumentation gating discipline
🚫 Letting context_builder re-derive optimization candidates from scratch instead of seeding it with the bottleneck candidates from Phase 1 — the scout already paid for that scouting; pass it forward
🚫 Starting to optimize before defining the metric and stop criterion — you won't know when you're done
🚫 Shipping measurement overhead to production — always gate metrics behind a debug/test build flag
🚫 Putting instrumentation in the same file as the code being measured — it belongs in a secondary test/support file
🚫 Skipping Phase 0 (Workspace Verification) — you must confirm the target codebase is loaded first
🚫 Taking a single sample as a baseline — one number isn't a measurement, it's a guess
🚫 Running multiple optimizations in one loop iteration — you'll never know which change produced which delta
🚫 Forgetting to re-run tests after the optimization — speed without correctness isn't a win
🚫 Skipping the oracle check and looping on your own judgment — the oracle sees the whole scoreboard; use it
🚫 Overwriting scoreboard rows instead of appending — historical data is how you spot regressions and dead ends

Now begin with Phase 0.

name	rp-optimize-v2
description	Iterative performance optimization loop using RepoPrompt MCP tools: instrument with debug-only metrics, establish a baseline, then plan → delegate one optimize+harden cycle → re-measure → ask oracle for next plan, looping until the oracle is satisfied or the target metric is met
repoprompt_managed	true
repoprompt_skills_version	61
repoprompt_variant	mcp

MCP Optimizer

Raw request: $ARGUMENTS

How delegation flows

You (the agent): Orchestrate. Translate the prompt, fan out code_mapper agents to map surface area, run context_builder to plan setup, dispatch implementation_lead to land instrumentation + baseline, then loop implementation_lead dispatches for each optimization. Your context stays lean.
Mapping agents (spawn_agent with agent_type:"code_mapper", reasoning_effort:"low", fork_turns:"none"): Read-only sub-agents that map the surface — find AGENTS.md, locate hot paths, discover existing benchmarks, define scope boundaries. Cheap and parallel; spawn liberally during Phase 1.
context_builder: Used in plan mode during Phase 2 to design the metric, instrumentation strategy, and first-pass optimization candidates in one shot. Reused in the loop to plan each individual optimization.
Implementation agents (spawn_agent with agent_type:"implementation_lead", reasoning_effort:"high", fork_turns:"none"): Carry out implementation, measurement, and hardening. Phase 3 lands instrumentation and the baseline. Each loop iteration lands one attributed change, runs tests, re-measures, and appends to the scoreboard.
Oracle (oracle_send): Reasons over the scoreboard and changed files at decision points — "did this iteration earn its keep?" and "should we keep going?". Selection-aware; you curate before each call.

Core principles

Don't read what an agent can read for you. Reserve direct read_file / file_search / git for verifying agent claims and reading the scoreboard. Mapping the codebase, running benchmarks, reading AGENTS.md — those go to sub-agents.
One attributed change per loop iteration. Causality is cheap to preserve and expensive to recover.
The scoreboard is the shared truth. Every iteration appends; nothing gets overwritten. Sub-agents and the oracle both read from it.
The oracle is the stop signal. You don't decide when to stop on gut feel — you ask, with the scoreboard in selection, and respect the answer.

Phase 0: Workspace Verification (REQUIRED)

Before any optimization work, bind to the target codebase using its working directory:

{"tool":"bind_context","args":{"op":"bind","working_dirs":["/absolute/path/to/project"]}}

This auto-resolves to the window containing your project. No need to list windows first.

If binding succeeds → proceed to Phase 1 If no match → the codebase isn't loaded. Find and open the workspace:

{"tool":"manage_workspaces","args":{"action":"list"}}
{"tool":"manage_workspaces","args":{"action":"switch","workspace":"<workspace_name>","open_in_new_window":true}}

Then retry the working_dirs bind.

Phase 1: Surface Mapping & Bottleneck Scouting (delegate to `code_mapper` agents)

1a. Translate the prompt

Rewrite the user's request in the repo's likely terminology — don't dive deeper yet.

Example:

Raw: "Speed up search"
Translated: "Reduce p95 latency of path-matching under the test fixtures — likely PathMatcher and friends. Need to confirm exact module and existing benchmarks."

Two narrow exceptions:

Profile data already exists (user attached a sample report or pointed at a recent profiler trace in the repo) → dispatch one focused code_mapper to read the trace + summarize bottleneck candidates, then go to Phase 2. Skip the rest of the fan-out.
User gave a feature/feeling ("feels slow during X") → full fan-out, plus add <X>-entry-point discovery to the "Target & call graph" brief.

1b. Dispatch `code_mapper` agents in parallel

Agent	Question
Bottleneck candidates	"Scout for performance bottlenecks around `<translated target>`. Look at the target itself AND its surrounding context: callers (especially loops or hot code paths that invoke the target frequently), data dependencies (how inputs to the target are constructed upstream), adjacent operations that run together in the same code path, shared infrastructure the target touches (locks, caches, allocators). Hunt for: tight loops with per-iteration allocations, redundant computation across iterations, locking/serialization, expensive data transformations (JSON/XML/string), sync I/O on hot paths, O(n²) patterns, repeated work that could be cached, unbatched UI/IO updates. Report 2–3 ranked candidates with `file:line` refs and a one-sentence rationale per candidate ('suspicious because…'). Don't propose fixes yet — just identify what looks expensive."
Target & call graph	"Locate the implementation of `<translated target>`. Then map its call graph: who calls it, how often, and in what context (tight loop? cold init? user-driven? background job?). Report the implementation `file:line` and the 3–5 most relevant call sites with the surrounding code context that explains how the target is invoked."
Prior perf work	"Find prior performance work related to `<area>`. Look for: (a) existing benchmarks, perf tests, or instrumentation in code (search `Tests`, `Bench`, `Perf*`, `benchmarks/`, `bench/`); (b) profiler traces, sample reports, or perf logs in the repo (look under `error-triage/`, `reports/`, `docs/investigations/`, `perf/`, or similar); (c) TODOs/FIXMEs/comments mentioning 'slow', 'perf', 'O(n', 'hot path', 'bottleneck' in or near `<area>`. Report what exists, where it lives, and (for benchmarks/instrumentation) how to invoke it."
Conventions	"Read AGENTS.md (or the project's testing/benchmarking doc). Report: how to run unit tests, how to run benchmarks if any, how to launch a debug harness, and any sanctioned measurement commands. Quote the exact commands."
Scope	"Identify the file/module boundary that defines `<area>` — which files are in scope for changes, which are clearly out of scope. List both."

Each spawn_agent call starts an independent agent; no detach flag is needed:

// The Bottleneck candidates scout — full prose, since this is the heart of Phase 1
{"tool":"spawn_agent","args":{
	"task_name":"bottleneck_candidates_<area>",
	"agent_type":"code_mapper",
	"reasoning_effort":"low",
	"fork_turns":"none",
	"message":"Scout for performance bottlenecks around <translated target>. Look at the target AND its surrounding context: callers (especially loops or hot paths invoking it frequently), data dependencies (how inputs are constructed upstream), adjacent operations in the same code path, shared infrastructure the target touches. Hunt for: tight loops with per-iteration allocations, redundant computation, locking/serialization, expensive transformations, sync I/O on hot paths, O(n^2) patterns, unbatched updates. Report 2-3 ranked candidates with file:line refs and a one-sentence rationale per candidate. Don't propose fixes yet."
}}

// One more representative one — same shape, different question
{"tool":"spawn_agent","args":{
	"task_name":"conventions_agents_md",
	"agent_type":"code_mapper",
	"reasoning_effort":"low",
	"fork_turns":"none",
	"message":"Read AGENTS.md (or the project's equivalent). Report how to run unit tests, benchmarks, debug harness, sanctioned measurement commands. Quote exact commands."
}}

// Repeat the same shape for the remaining 3 scouts in the table above
// (Target & call graph, Prior perf work, Scope).

// Then wait on the batch
{"tool":"wait_agent","args":{"timeout_ms":180000}}
{"tool":"list_agents","args":{}}

⚠️ Agents may block on permission approvals. Use wait_agent / list_agents periodically so you can approve and keep them unblocked.

1c. Synthesize the target

When the explores return, write down (in your head or as scratch — no need for a file yet):

Metric: a single, nameable number (latency, peak memory, allocations, frame time, etc.) plus its unit.
Stop criterion: hard threshold ("p95 < 50 ms"), relative target ("30% faster"), or "until oracle says diminishing returns".
Scope: the file/module boundary, citing what's in and what's out (from the Scope scout).
Measurement command: the exact command(s) the explores reported (from Conventions + Prior perf work).
Ranked bottleneck candidates: 2–3 candidates with file:line and rationale, augmented by call-site context from the call-graph scout.

You'll feed all five to context_builder in Phase 2.

Phase 2: Plan Setup with `context_builder` (plan mode)

Now that the surface is mapped and bottleneck candidates are in hand, route the setup design through context_builder in plan mode. One call produces:

Instrumentation strategy — where the metric will be measured, what build gate to use, which secondary test/support file holds the collection logic. Where the instrumentation lives is informed by the bottleneck candidates from Phase 1 — measure where the cost actually is.
Baseline procedure — how many samples, how to discard outliers, what variance band to expect.
First-pass optimization candidates — 2–3 concrete optimizations ranked by expected delta vs. risk, grounded in the bottleneck candidates from Phase 1 (each of those is now a candidate to address; context_builder translates them into actionable changes with risk assessment).
Scoreboard scaffold — the markdown shape for prompt-exports/optimize-<slug>-runs.md.

{"tool":"context_builder","args":{
	"instructions":"<task>Design the setup for an iterative optimization loop targeting <metric> on <scope>.\n\nReturn an actionable plan with:\n1. Instrumentation strategy: which file to add/extend (must be a test/support file, not production code), the debug-build gate to use (e.g. #if DEBUG / cfg(debug_assertions) / NODE_ENV / etc. — pick what matches this repo's convention), and the smallest hook the production code needs to expose.\n2. Baseline procedure: how many samples, how to discard outliers, expected variance band, and the exact command to run (per AGENTS.md).\n3. First-pass optimization candidates: 2–3 concrete optimizations ranked by (expected delta / risk). For each: the change, why it should help the metric, and what could regress.\n4. Scoreboard scaffold: the initial contents for prompt-exports/optimize-<slug>-runs.md.</task>\n\n<context>\nUser request: <raw request>\nMetric + units: <name + units>\nStop criterion: <threshold or 'oracle-satisfied'>\nScope: <files/modules in play>\nMeasurement command (from Conventions scout): <exact command>\nTarget & call graph (from Target scout): <implementation file:line + 3–5 callers with context>\nBottleneck candidates (from Bottleneck scout — the heart of Phase 1): <ranked list with file:line and one-sentence rationale per candidate>\nPrior perf work (from Prior perf work scout): <existing benchmarks, profiler traces, perf TODOs and what they say>\nProject conventions doc: AGENTS.md (already summarized by the scout — the agent doesn't need to re-read it)\n\nWhen ranking first-pass optimization candidates, treat the Bottleneck candidates list as the seed. Each item there is a hypothesis about where time is going; the plan should propose how to address each, with risk and verification.\n</context>",
	"response_type":"plan",
	"export_response":true
}}

The tool returns oracle_export_path and oracle_export_instruction. Save the export path — Phase 3 and every loop iteration reference it.

If the plan looks thin (no concrete instrumentation site, vague candidates), refine the instructions and re-run rather than trying to fill gaps yourself.

Phase 3: Land Instrumentation + Baseline (delegate to `implementation_lead`)

You don't run measurements. Dispatch a single implementation_lead agent to execute the setup plan:

{"tool":"spawn_agent","args":{
	"task_name":"optimize_setup",
	"agent_type":"implementation_lead",
	"reasoning_effort":"high",
	"fork_turns":"none",
	"message":"Read the setup plan at <plan path> with read_file first. Execute the setup phase only -- do not pursue any optimization yet:\n\n1. Land the instrumentation per the plan: keep it in a secondary test/support file, gate it behind the repo's debug build flag, and expose only the minimum hook in production code.\n2. Verify a release build with instrumentation stripped still compiles cleanly.\n3. Capture the baseline: run the measurement command 3-5 times. Discard obvious outliers. Record the median and p95 (or whichever is appropriate for this metric). Note the variance band -- if optimizations smaller than that band would be invisible, say so explicitly.\n4. Create prompt-exports/optimize-<slug>-runs.md from the scoreboard scaffold in the plan, fill in the baseline row with median, p95, environment notes, and current commit.\n5. Report back: instrumentation files touched, the exact command used, baseline numbers, variance, and any concerns about measurement reliability.\n\nDo not attempt any optimization yet -- that's the next iteration. Skip oracle review; the orchestrator handles that."
}}

When the implementation agent returns:

Read the scoreboard. That's your verification surface — not the diff. If the baseline row looks reasonable, move on.
Sanity-check variance. If the variance band is large enough to swallow a 10–20% optimization, that's a problem. Steer the implementation agent to take more samples or narrow the workload before continuing.
Don't read the instrumentation diff yet. If concerns surface in later phases, dispatch a narrow code_mapper to summarize the diff for you.

Housekeeping

Sessions persist after agents finish — useful when you might revisit output, but they pile up over a multi-agent workflow. Once you've recorded what an agent produced, you can dismiss its session:

{"tool":"close_agent","args":{"target":"<task_name>"}}

Mapping-agent sessions are good to dismiss right away — narrow reconnaissance, no follow-up value. Keep heavier agent sessions if you might revisit them.

Phase 4: The Optimization Loop

One iteration = one attributed change + one re-measurement. Running multiple optimizations in parallel destroys causality — keep the loop serial by default.

Loop until one of the termination criteria in 4d fires.

4a. Plan the next optimization

{"tool":"manage_selection","args":{
	"op":"set",
	"paths":["<target source files>","<benchmark or test>","prompt-exports/optimize-<slug>-runs.md","<setup plan path>"],
	"mode":"full"
}}
{"tool":"context_builder","args":{
	"instructions":"<task>Propose the single next optimization to pursue for <metric>. One change, not a list. Include: the specific change, why you expect it to move the metric, any risks to behavior or correctness, how to verify it didn't regress other tests.</task>\n\n<context>Current baseline and prior runs are in prompt-exports/optimize-<slug>-runs.md. The setup plan at <setup plan path> already listed first-pass candidates — prefer one of those if it still looks promising and hasn't been tried. Target: <threshold or directional goal>. Scope: <modules>.</context>",
	"response_type":"plan",
	"export_response":true
}}

If the plan proposes more than one change, pick the one with the best expected delta per unit of risk.

4b. Dispatch one full optimize-and-harden loop

{"tool":"spawn_agent","args":{
	"task_name":"optimize_<n>",
	"agent_type":"implementation_lead",
	"reasoning_effort":"high",
	"fork_turns":"none",
	"message":"Read the plan at <plan path> with read_file first. Run one full optimize-and-harden loop:\n\n1. Implement the change in <files>.\n2. Run the project's standard test command (see AGENTS.md) for the touched modules -- fix anything that breaks.\n3. Re-run the same measurement command used for the baseline (in prompt-exports/optimize-<slug>-runs.md). Take the same number of samples as the baseline so deltas are comparable. Append a new row -- don't overwrite.\n4. If the change regressed the metric or broke correctness, either revert it or iterate once to fix, then report back.\n5. Summarize: what you changed, what the metric moved to, tests touched, concerns worth flagging.\n\nStay inside <scope>. Don't pursue tangential optimizations -- one attributed change per loop. Skip oracle review; the orchestrator handles that."
}}

4c. Verify (without re-reading the codebase)

Verify the agent's output against the plan's done-when criteria — for optimize, that's the scoreboard, not the diff. Optimization-specific checks, all designed for minimal direct reads:

Read the scoreboard, not the diff. A new row should be there with sample-matched numbers. If the agent forgot to append, steer it to fix.
Is the delta real? Compare to the variance band you recorded in Phase 3. Single-digit shifts inside the noise band are inconclusive — note that in the scoreboard before consulting the oracle.
Tests actually ran? "Ran the tests" in a summary isn't the same as tests passing. If you're suspicious, ask the agent for the exact command and exit code, or dispatch a narrow test_automator to re-run them.
Behavior spot-check, when needed. If the change touches a correctness-sensitive surface, dispatch a narrow code_mapper agent to read the diffed files and report whether semantics shifted (early returns, cache invalidation, ordering). Don't pull the diff into your own context unless the scout flags something concrete.

4d. Ask the oracle for the next plan (and the stop decision)

After a successful iteration, refresh the selection and ask the oracle both questions in one call:

{"tool":"manage_selection","args":{
	"op":"set",
	"paths":["<files changed this iteration>","<benchmark or test>","prompt-exports/optimize-<slug>-runs.md"],
	"mode":"full"
}}
{"tool":"oracle_send","args":{
	"message":"Plan: We just landed <change summary>. Metric moved from <baseline> to <new>. Scoreboard is in the selection. Given the stop criterion (<criterion>), should we run another iteration? If yes, what's the single best next optimization to pursue. If no, explain why you think we've hit diminishing returns or the target.",
	"mode":"plan"
}}

The oracle's answer determines the loop's next step:

"Keep going, try X next" → return to 4a with X as the seed for the next plan. Reuse the same scoreboard and instrumentation; only re-instrument if the hot path moved.
"We're done" / "Diminishing returns" / target met → exit the loop, go to Phase 5.
"Can't tell — measurement is too noisy" or "behavior may have regressed" → stop optimizing. The next dispatch must be an implementation_lead to harden the instrumentation or fix the regression. Do not plan a new optimization on top of an unreliable measurement — every later result becomes uninterpretable.

Termination criteria (stop conditions)

Exit the loop when any of these fire:

Oracle says so. "Good enough" or "diminishing returns".
Target metric met. The stop criterion from Phase 1 is satisfied in the latest measurement.
Iteration cap. 5 loops, hard. Before dispatching loop 6, surface the scoreboard to the user and ask explicitly to continue. Don't extend the cap on your own judgment.
Oracle can't propose a plausible next move. Two consecutive "I'm not sure what to try" responses means the search has stalled.
Regression budget exhausted. If correctness keeps breaking faster than performance improves, stop and escalate to the user.

Parallelism note

Housekeeping (loop)

Same session-cleanup hygiene as Phase 3. Also delete superseded plan exports each iteration so prompt-exports/ reflects only live work:

{"tool":"file_actions","args":{"action":"delete","path":"/absolute/path/to/repo/prompt-exports/<stale-export>.md"}}

Phase 5: Final Rollup

After all iterations complete, give the user a final rollup:

What was accomplished per iteration
Any failures or partial completions
Any conflicts or coordination issues that surfaced
Suggested follow-ups if anything was deferred

Specifically for optimize:

Starting metric → final metric, with iteration count. A one-line summary: "PathMatcher.match p95: 124ms → 38ms over 4 iterations (-69%)."
Which changes landed, in order, with their individual deltas.
Which changes were tried and reverted, with the reason — useful so the next person doesn't repeat dead ends.
State of the instrumentation. If the debug-only metrics are worth keeping, say so; otherwise suggest removal. The scoreboard file under prompt-exports/ can stay as historical record or be deleted — default to keeping it and let the user decide.
Known follow-ups. Anything the oracle flagged but wasn't pursued this session.

Role Summary

You (the agent) own triage, prompt translation, scoreboard reads, sub-agent verification, and the stop decision. Everything else is delegated:

Capability	Mapping Agents	`context_builder`	Implementation Lead (setup)	Implementation Lead (each loop)	Oracle (`oracle_send`)
Map surface (AGENTS.md, target & call graph, prior perf work, scope)	✅ Primary	—	—	—	—
Scout bottleneck candidates around target	✅ Primary	—	—	—	—
Plan setup (metric, instrumentation, candidates from scouting)	—	✅ Primary	—	—	—
Land instrumentation + capture baseline	—	—	✅ Primary	—	—
Plan one optimization	—	✅ Primary	—	—	⚠️ sanity-check only
Implement + test + re-measure + append scoreboard	—	—	—	✅ Primary	—
Delegated spot-check / diff summary	✅ on demand	—	—	—	—
Decide continue vs stop	—	—	—	—	✅ Primary

Cheat sheet for the four operations you'll repeat:

spawn_agent agent_type=<code_mapper|implementation_lead> reasoning_effort=<effort> fork_turns=none  # dispatch
wait_agent / list_agents                                                                  # block/check
followup_task target="<task_name>"                                                        # correct
`context_builder` response_type=plan export_response=true                                  # plan

Anti-patterns

🚫 Skipping the bottleneck-candidates scout because the user named a specific function — even with a named target, callers and inputs often dominate the cost
🚫 Skipping context_builder in Phase 2 and dispatching the setup implementation agent from your own sketch — you'll lose the candidate queue and the instrumentation gating discipline
🚫 Letting context_builder re-derive optimization candidates from scratch instead of seeding it with the bottleneck candidates from Phase 1 — the scout already paid for that scouting; pass it forward
🚫 Starting to optimize before defining the metric and stop criterion — you won't know when you're done
🚫 Shipping measurement overhead to production — always gate metrics behind a debug/test build flag
🚫 Putting instrumentation in the same file as the code being measured — it belongs in a secondary test/support file
🚫 Skipping Phase 0 (Workspace Verification) — you must confirm the target codebase is loaded first
🚫 Taking a single sample as a baseline — one number isn't a measurement, it's a guess
🚫 Running multiple optimizations in one loop iteration — you'll never know which change produced which delta
🚫 Forgetting to re-run tests after the optimization — speed without correctness isn't a win
🚫 Skipping the oracle check and looping on your own judgment — the oracle sees the whole scoreboard; use it
🚫 Overwriting scoreboard rows instead of appending — historical data is how you spot regressions and dead ends

Now begin with Phase 0.

rp-optimize-v2

同仓库更多 Skills

同仓库更多 Skills

MCP Optimizer

How delegation flows

Core principles

Phase 0: Workspace Verification (REQUIRED)

Phase 1: Surface Mapping & Bottleneck Scouting (delegate to code_mapper agents)

1a. Translate the prompt

1b. Dispatch code_mapper agents in parallel

1c. Synthesize the target

Phase 2: Plan Setup with context_builder (plan mode)

Phase 3: Land Instrumentation + Baseline (delegate to implementation_lead)

Housekeeping

Phase 4: The Optimization Loop

4a. Plan the next optimization

4b. Dispatch one full optimize-and-harden loop

4c. Verify (without re-reading the codebase)

4d. Ask the oracle for the next plan (and the stop decision)

Termination criteria (stop conditions)

Parallelism note

Housekeeping (loop)

Phase 5: Final Rollup

Role Summary

Anti-patterns

MCP Optimizer

How delegation flows

Core principles

Phase 0: Workspace Verification (REQUIRED)

Phase 1: Surface Mapping & Bottleneck Scouting (delegate to code_mapper agents)

1a. Translate the prompt

1b. Dispatch code_mapper agents in parallel

1c. Synthesize the target

Phase 2: Plan Setup with context_builder (plan mode)

Phase 3: Land Instrumentation + Baseline (delegate to implementation_lead)

Housekeeping

Phase 4: The Optimization Loop

4a. Plan the next optimization

4b. Dispatch one full optimize-and-harden loop

4c. Verify (without re-reading the codebase)

4d. Ask the oracle for the next plan (and the stop decision)

Termination criteria (stop conditions)

Parallelism note

Housekeeping (loop)

Phase 5: Final Rollup

Role Summary

Anti-patterns

Phase 1: Surface Mapping & Bottleneck Scouting (delegate to `code_mapper` agents)

1b. Dispatch `code_mapper` agents in parallel

Phase 2: Plan Setup with `context_builder` (plan mode)

Phase 3: Land Instrumentation + Baseline (delegate to `implementation_lead`)

Phase 1: Surface Mapping & Bottleneck Scouting (delegate to `code_mapper` agents)

1b. Dispatch `code_mapper` agents in parallel

Phase 2: Plan Setup with `context_builder` (plan mode)

Phase 3: Land Instrumentation + Baseline (delegate to `implementation_lead`)