| name | refine |
| description | Run a MassGen-style quality refinement loop. Generates eval criteria, produces/improves answers with subagent help, regression-guards before submission, then evaluates with round-evaluator and trace-analyzer in parallel. Invoke with /refine. |
| user-invocable | true |
| argument-hint | <query or path-to-deliverable> [--criteria 'custom criteria'] [--max-rounds N] [--prior-answer path] [--eval-backends codex/gpt-5.4 gemini/gemini-3-flash] [--massgen-orchestrated] [--cost-budget N] |
| allowed-tools | Read, Write, Edit, Bash, Glob, Grep, WebFetch, WebSearch, Agent, Skill, EnterPlanMode, ExitPlanMode |
Quality Refinement Loop
You are orchestrating a MassGen-style quality refinement workflow. This is not
a one-shot task — it is an iterative loop that produces genuinely high-quality
output through structured evaluation and improvement.
Autonomy
Default: fully autonomous. Do not ask the user for confirmation at any
step — proceed through the entire workflow without stopping. Generate criteria,
produce the answer, evaluate, iterate, all without pausing for approval.
Do NOT use AskUserQuestion during the refinement loop — it blocks execution
in non-interactive environments (e.g. Docker sandbox). Make your best judgment
and proceed.
The user can override this at invocation time:
- --interactive or --confirm-criteria: pause for criteria confirmation
- --confirm-each-round: pause before each iteration
- If the user says "stop at checkpoints" or "ask me before iterating", honor that
Otherwise: execute the full loop autonomously until converged or max rounds hit.
Workflow
Phase 1: Setup
- Accept the user's query and any prior answer or deliverable path.
- Git repo check — the launch script handles git init, but verify:
git rev-parse --is-inside-work-tree
If this fails, warn the user to relaunch via launch-claude-tmux.sh.
- If you plan to use read_media for verification (visual/audio deliverables),
  create a CONTEXT.md in the workspace root with a brief task description.
  Required by the vision/audio API. Not needed for generate_media.
- Call init_session (quality-tools MCP) with an optional label describing the task.
  This creates a new timestamped session directory for isolated state.
- Generate 3-7 evaluation criteria appropriate for the task.
- If the user provided custom criteria, use those instead.
- Call generate_eval_criteria (quality-tools MCP) to store them.
- Do NOT ask for confirmation unless the user requested --interactive.
Phase 2: Answer Production
Produce or improve the deliverable. Core principles during production:
Output-first verification: Do not just write — run, view, interact:
- Run/view the output
- Interact as a user would (click buttons, navigate, test inputs)
- Identify gaps from the interaction
- Fix and enhance
- Re-run and re-interact
- Repeat until excellent
Classify the deliverable by behavior to choose verification:
- Static (document, image): Render and view with read_media
- Animated (motion graphics): Record video, review full sequence
- Interactive (app, website): Use it — click all buttons, navigate, test states
- Produces output (script, tool): Test with varied and edge-case inputs
- Makes sound (audio): Listen to actual content
Question the choices: Don't just improve execution — question fundamental
choices. Is this the right approach, or just the first? Would a different
direction produce a higher quality ceiling?
Material improvement gate: Classify planned changes as TRANSFORMATIVE,
STRUCTURAL, or INCREMENTAL. If only INCREMENTAL changes remain, the approach
may need to change — not the polish level.
Builder agent — scope and parallelism:
Before spawning a builder, define:
- ONE cohesive deliverable (one file, one feature, one surface)
- ONE defect family or architectural problem
- Will the builder need runtime feedback? (if yes, work inline instead)
CRITICAL: commit before spawning builders. Builders run in git worktrees
cloned from the current commit. If your files aren't committed, the worktree
starts empty and builders can't see the existing code. Stage and commit all
current work before spawning any builder agents.
If the work has independent pieces, spawn MULTIPLE builders in parallel
(invoke them as separate Agent tools in a single turn). Each runs in its
own worktree. You merge outputs after all complete.
Examples of good parallel splits:
- Builder 1: Rewrite the hero section (layout spec)
- Builder 2: Implement the contact form (API spec)
- Builder 3: Build the responsive nav (mobile spec)
Do NOT bundle unrelated fixes into one builder. The builder will report
split required if over-scoped — that's feedback, not failure.
Work INLINE if: changes are incremental, well-scoped within current
context, or coupled to other in-progress work.
Background media operations: For slow media operations (video generation,
batch image analysis), spawn the media-worker agent in background. It runs
the MCP tool and writes results to the session's media_results/ directory.
Continue working while it runs — check results when needed by reading the
output file.
Keep iterating within Phase 2 until the deliverable feels genuinely ready.
This is a tight inner loop: build → verify (read_media, test, interact) →
identify gaps → fix → re-verify. Do not rush to Phase 3. The regression guard
is the final gate, not the improvement mechanism. If you're still finding
issues through verification, stay in Phase 2.
Phase 3: Pre-Submission Gate
When you believe the deliverable is ready — not just "done" but "a demanding
user would be satisfied":
- Spawn the regression-guard agent with:
  - Previous version (Answer A), if this is a revision
  - Current version (Answer B / candidate)
  - Evaluation criteria
- Read verdict.json from the regression-guard's worktree:
  - pass: Proceed to Phase 4 (submit)
  - fail: Go back to Phase 2, address the regressions, then re-check
  - mixed: Present tradeoffs to user, let them decide
- If this is the first iteration (no prior answer), skip regression guard.
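The routing described above can be sketched as a small lookup. This is an illustrative sketch only: the function name and the returned step labels are hypothetical, and treating an unrecognized verdict like a failure is an assumption, not something the workflow specifies.

```python
def next_step_after_guard(verdict, has_prior_answer=True):
    """Map a regression-guard verdict to the next workflow step (labels are hypothetical)."""
    if not has_prior_answer:
        # First iteration: there is nothing to regress against, so the guard is skipped.
        return "submit"
    routes = {
        "pass": "submit",     # proceed to Phase 4
        "fail": "rework",     # return to Phase 2
        "mixed": "ask_user",  # present tradeoffs and let the user decide
    }
    # Assumption: anything unrecognized is handled like a failure.
    return routes.get(verdict, "rework")
```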
Phase 4: Submit Answer
Call new_answer (workflow-tools MCP) to submit the current deliverable.
Phase 5: Evaluation (Parallel)
Choose the evaluation method:
Path A: Default single-model evaluation
If no --eval-backends flag and no .massgen-quality/eval_config.json:
- round-evaluator agent (foreground):
  - Task: evaluate the deliverable against criteria
  - Input: deliverable path, criteria, any prior evaluation history
  - Output: verdict.json, critique_packet.md, next_tasks.json
- execution-trace-analyzer agent (foreground):
  - Task: analyze how the answer production phase went
  - Input: session transcript path from .massgen-quality/environment.json
  - Output: process_report.md, process_verdict.json
Wait for both to complete.
Path B: Multi-model evaluation via team step mode (default multi-model)
If --eval-backends is specified OR .massgen-quality/eval_config.json exists
with "eval_mode": "massgen" — AND --massgen-orchestrated flag is NOT set
and orchestrator is not "massgen" in eval_config.json:
Claude orchestrates N massgen step processes directly, one per backend model.
Each backend independently evaluates the deliverable. Claude detects consensus
and extracts the winning evaluation. Total parallel tasks: N + 1 (N
round-evaluator step processes + 1 execution-trace-analyzer Agent).
Pre-checks
- Read .massgen-quality/environment.json: is massgen.available true?
- Check api_keys in environment.json: are the requested backends authenticated?
  - Map: openai/codex → openai, gemini → google, claude/claude_code → anthropic, grok → xai
- Verify --step support (read plugin_dir from environment.json):
bash "<plugin_dir>/scripts/run-massgen.sh" --help 2>&1 | grep -q "\-\-step"
- If massgen is unavailable, --step is unsupported, or ALL backends lack auth:
  fall back to Path A (warn user)
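The authentication pre-check could look roughly like the sketch below. It assumes that api_keys in environment.json is an object keyed by provider name with a truthy value when a key is present; that shape is not confirmed by this document, and the function name is hypothetical.

```python
# Backend type -> provider whose API key it needs (taken from the mapping above).
PROVIDER_FOR_BACKEND = {
    "openai": "openai",
    "codex": "openai",
    "gemini": "google",
    "claude": "anthropic",
    "claude_code": "anthropic",
    "grok": "xai",
}


def authenticated_backends(specs, api_keys):
    """Return the backend specs whose provider has a usable key.

    Assumption: api_keys looks like {"openai": True, "google": False, ...}.
    """
    usable = []
    for spec in specs:
        backend_type = spec.split("/", 1)[0]
        provider = PROVIDER_FOR_BACKEND.get(backend_type)
        if provider and api_keys.get(provider):
            usable.append(spec)
    return usable
```

If the returned list is empty, the workflow falls back to Path A.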
Short name expansion
| Short name | Expands to |
|---|---|
| codex | codex/gpt-5.4 |
| gemini | gemini/gemini-3.1-pro-preview |
| claude | claude/claude-opus-4-6 |
| grok | grok/grok-4.20-0309-reasoning |
Full specs like codex/gpt-5.4 are also accepted.
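A minimal sketch of the expansion, using the values from the table above. The function name is hypothetical, and raising on an unknown short name is an assumption about the desired behavior.

```python
SHORT_NAMES = {
    "codex": "codex/gpt-5.4",
    "gemini": "gemini/gemini-3.1-pro-preview",
    "claude": "claude/claude-opus-4-6",
    "grok": "grok/grok-4.20-0309-reasoning",
}


def expand_backend(name):
    """Expand a short backend name; full specs (containing '/') pass through unchanged."""
    if "/" in name:
        return name
    if name not in SHORT_NAMES:
        # Assumption: unknown short names are rejected rather than guessed.
        raise ValueError(f"unknown backend short name: {name}")
    return SHORT_NAMES[name]
```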
Compose the round-evaluator prompt
Build a self-contained text prompt following
eval-prompt-template.md. The prompt
includes the evaluation identity, criteria, deliverable file paths (never
inline content), output format instructions, and prior evaluation history
(if round > 1). Write the composed prompt to
sessions/<id>/eval_team/round_NNN/eval_prompt.md.
IMPORTANT: Never embed deliverable content in the prompt. Reference
deliverables by absolute file path and ensure context_paths in the massgen
config includes the deliverable's parent directory. If a deliverable is not
already on disk (e.g. composed in memory), write it to a file first, then
reference the path.
Initialize eval session
Create the eval session directory:
sessions/<id>/eval_team/round_NNN/
config.json # session_id, models, agent_mapping, max_rounds, timeout
eval_prompt.md # composed round-evaluator prompt
eval_criteria.json # massgen-format criteria
configs/ # per-backend YAML configs
agents/ # massgen writes step results here
Write config.json:
{
"session_id": "eval_<timestamp>",
"query": "evaluation",
"models": ["codex/gpt-5.4", "gemini/gemini-3.1-pro-preview"],
"agent_mapping": {
"agent_a": "codex/gpt-5.4",
"agent_b": "gemini/gemini-3.1-pro-preview"
},
"max_rounds": 3,
"timeout_seconds": 600
}
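One way to produce this structure from a list of backend specs is sketched below. The helper name is hypothetical, the timestamp is passed in as a string because its format is not specified here, and continuing the agent_a, agent_b naming alphabetically for more backends is an assumption.

```python
import string


def build_eval_config(timestamp, models, max_rounds=3, timeout_seconds=600):
    """Build the config.json payload for an eval session (sketch)."""
    letters = string.ascii_lowercase
    if len(models) > len(letters):
        raise ValueError("too many backends for single-letter agent ids")
    # Assumption: agent ids continue alphabetically (agent_a, agent_b, agent_c, ...).
    agent_mapping = {f"agent_{letters[i]}": model for i, model in enumerate(models)}
    return {
        "session_id": f"eval_{timestamp}",
        "query": "evaluation",
        "models": list(models),
        "agent_mapping": agent_mapping,
        "max_rounds": max_rounds,
        "timeout_seconds": timeout_seconds,
    }
```

Serialize the result with json.dump when writing config.json into the round directory.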
Generate per-backend configs
Read plugin_dir from .massgen-quality/environment.json. For each backend,
generate a step mode config using the wrapper:
bash "<plugin_dir>/scripts/run-massgen.sh" \
--quickstart --headless \
--config sessions/<id>/eval_team/round_NNN/configs/<agent_id>_step.yaml \
--config-backend <type> \
--config-model <model> \
--config-agent-id <agent_id>
Add --config-docker if Docker is available. Split the backend spec on /
(e.g., codex/gpt-5.4 → --config-backend codex --config-model gpt-5.4).
Use absolute paths for --config.
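The command above can be assembled programmatically, for example as an argument list suitable for subprocess.run. This sketch only uses the flags shown in this section; the function name is hypothetical.

```python
import os


def quickstart_command(plugin_dir, config_path, backend_spec, agent_id, docker=False):
    """Build the config-generation command for one backend as an argv list."""
    if not os.path.isabs(config_path):
        raise ValueError("--config must be an absolute path")
    # "codex/gpt-5.4" -> backend type "codex", model "gpt-5.4"
    backend_type, model = backend_spec.split("/", 1)
    cmd = [
        "bash", f"{plugin_dir}/scripts/run-massgen.sh",
        "--quickstart", "--headless",
        "--config", config_path,
        "--config-backend", backend_type,
        "--config-model", model,
        "--config-agent-id", agent_id,
    ]
    if docker:
        cmd.append("--config-docker")
    return cmd
```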
Convert evaluation criteria to massgen format
The MCP generate_eval_criteria tool outputs {id, name, description, weight}.
MassGen expects {text, category}. Convert each criterion:
{"text": "<description>", "category": "must"}
Write the converted array to
sessions/<id>/eval_team/round_NNN/eval_criteria.json.
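The conversion described above is a one-line mapping. A minimal sketch, with a hypothetical function name, that keeps only the description and marks every criterion as "must":

```python
def to_massgen_criteria(criteria):
    """Convert {id, name, description, weight} criteria to MassGen's {text, category} form."""
    return [{"text": c["description"], "category": "must"} for c in criteria]
```

For example, a criterion with description "Layout works on mobile" becomes {"text": "Layout works on mobile", "category": "must"}; the id, name, and weight fields are dropped.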
Launch N + 1 parallel tasks
Launch all tasks in a single message with multiple tool calls:
N background Bash tasks (one per backend):
bash "<plugin_dir>/scripts/run-massgen.sh" \
--step \
--session-dir "<abs_eval_session_dir>" \
--config "<abs_config_path>" \
--eval-criteria "<abs_criteria_path>" \
--automation \
"$(cat <abs_eval_prompt_path>)"
Each with run_in_background: true. Record the background task ID for each.
The wrapper handles API key sourcing and massgen resolution automatically.
1 execution-trace-analyzer agent (foreground, unchanged):
- Task: analyze how the answer production phase went
- Input: session transcript path from .massgen-quality/environment.json
- Output: process_report.md, process_verdict.json
Collect results and detect consensus
Follow the team step mode protocol (see team-protocol.md):
- As each background task completes (auto-notification), read
  <eval_session_dir>/agents/<agent_id>/last_action.json
- Wait until ALL N agents have completed their step
- Check consensus:
  - If all agents answered (round 1): no consensus yet, re-launch all agents
  - If votes exist: check seen_steps for stale votes, count valid votes; majority wins
  - If consensus reached: proceed to parsing
  - If no consensus and rounds remain: re-launch every agent that does not
    have a valid (non-stale) vote for a current answer
- Max 3 consensus rounds. Force consensus at the 600s timeout or if --cost-budget
  is exceeded: use the answer with the most votes; if there are no votes, use the
  first answer submitted.
- If only 1 backend is available, use its answer directly (no consensus needed).
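The vote-counting rules above can be sketched as follows. The record shape is a simplification invented for illustration: each action is a dict with hypothetical keys agent_id, action ("answer" or "vote"), target (the agent whose answer is voted for), and stale (already resolved from seen_steps). The real last_action.json layout is defined in team-protocol.md, and reading "majority" as more than half of all agents is an assumption.

```python
from collections import Counter


def _valid_votes(actions):
    """Targets of votes that are not stale."""
    return [a["target"] for a in actions
            if a["action"] == "vote" and not a.get("stale", False)]


def detect_consensus(actions):
    """Return the winning agent id, or None if another round is needed."""
    if len(actions) == 1:
        # Single backend: use its answer directly.
        return actions[0]["agent_id"]
    votes = _valid_votes(actions)
    if not votes:
        return None  # e.g. round 1, where every agent answered
    target, count = Counter(votes).most_common(1)[0]
    # Assumption: "majority" means more than half of all agents.
    return target if count > len(actions) / 2 else None


def forced_winner(actions, submission_order):
    """Forced consensus: most valid votes, else the first answer submitted."""
    votes = _valid_votes(actions)
    if votes:
        return Counter(votes).most_common(1)[0][0]
    return submission_order[0] if submission_order else None
```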
Parse winning evaluation
- Read the consensus winner's answer from last_action.json → answer_text
  (or from agents/<winner>/<step>/answer.json → answer)
- Parse the answer text for structured evaluation content (see
  eval-prompt-template.md parsing section):
  - Extract verdict.json → write to sessions/<id>/evaluations/round_NNN/verdict.json
  - Extract critique_packet.md → write to sessions/<id>/evaluations/round_NNN/critique_packet.md
  - Extract next_tasks.json → write to sessions/<id>/evaluations/round_NNN/next_tasks.json
- If parsing fails: use the raw answer as critique_packet.md with a default
  {"schema_version": "1", "verdict": "iterate", "scores": {}} verdict
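The parse-or-fallback behavior might be structured like the sketch below. The parse argument is a hypothetical callable standing in for the extraction rules in eval-prompt-template.md, which are not reproduced here; it is assumed to return a mapping of file name to content and to raise ValueError when the answer cannot be parsed.

```python
import json

DEFAULT_VERDICT = {"schema_version": "1", "verdict": "iterate", "scores": {}}


def evaluation_files(raw_answer, parse):
    """Return {file name: content} for the round's evaluation directory.

    Assumption: `parse` raises ValueError when the answer has no usable structure.
    """
    try:
        return parse(raw_answer)
    except ValueError:
        # Fallback: keep the raw answer as the critique and default to "iterate".
        return {
            "verdict.json": json.dumps(DEFAULT_VERDICT, indent=2),
            "critique_packet.md": raw_answer,
        }
```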
Wait for both the step mode consensus AND the execution-trace-analyzer to
complete before proceeding to Phase 6.
Path C: Multi-model evaluation via MassGen orchestrator (legacy)
If --massgen-orchestrated flag is set OR .massgen-quality/eval_config.json
has "orchestrator": "massgen":
MassGen handles all orchestration internally as a single batch process. Simpler
but less controllable — subject to monitor timeouts, no error recovery, no
progress visibility.
Pre-checks: Same as Path B (massgen available, API keys, backend mapping).
Fall back to Path A if unavailable.
Launch:
- Write the round-evaluator prompt (same identity, criteria, and deliverable
  file paths as Path B, never inline content) to
  sessions/<id>/eval_team/round_NNN/eval_prompt.md
- Generate a massgen multi-agent config with the specified backends
  (see /massgen-run skill's references/config-templates.md).
  Include context_paths with the project root so agents can read deliverables.
- Check Docker availability: use Docker by default, --no-docker to skip
- Run in parallel:
- When massgen completes:
  - Read status.json → extract winning agent's answer (the consensus evaluation)
  - Parse the answer for verdict.json, critique_packet.md, next_tasks.json
    content and write to sessions/<id>/evaluations/round_NNN/
  - If parsing fails: fall back to using the raw answer as critique_packet.md
    with a default iterate verdict
Wait for both evaluation tasks to complete before proceeding.
eval_config.json
Optional persistent config at .massgen-quality/eval_config.json:
{
"eval_mode": "massgen",
"orchestrator": "claude",
"backends": ["codex/gpt-5.4", "gemini/gemini-3.1-pro-preview"],
"docker": true
}
- orchestrator: "claude" (default) for Path B team step mode, "massgen"
  for Path C legacy automation mode
- docker: true (default) to use Docker isolation for massgen agents
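Taken together, the Phase 5 rules for choosing between Path A, B, and C can be sketched as a single function. The function and parameter names are hypothetical, and the case where eval_config.json exists without "eval_mode": "massgen" is not covered by the rules above, so returning Path A there is an assumption.

```python
def choose_eval_path(eval_backends=None, massgen_orchestrated=False, eval_config=None):
    """Pick the evaluation path: "A", "B", or "C".

    eval_config is the parsed .massgen-quality/eval_config.json, or None if absent.
    """
    cfg = eval_config or {}
    if massgen_orchestrated or cfg.get("orchestrator") == "massgen":
        return "C"  # legacy MassGen-orchestrated batch run
    if eval_backends or cfg.get("eval_mode") == "massgen":
        return "B"  # team step mode, orchestrated by Claude
    return "A"      # default single-model evaluation
```

The pre-checks for Path B and Path C can still downgrade the result to Path A when massgen or the required API keys are unavailable.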
Phase 6: Synthesize and Persist
Read the evaluation outputs while you still have full context:
- Read verdict.json:
  - converged: Report success. Present the evaluation summary. Done.
  - iterate: Continue below.
- Read next_tasks.json:
  - Check approach_assessment.ceiling_status
  - If ceiling_reached: warn user that a different approach may be needed
  - Read fix_tasks and evolution_tasks
- Read process_report.md:
  - Extract the top process learnings
  - Save PROCESS learnings to memory (see Memory section below)
- Read critique_packet.md:
  - Note the improvement_spec and preserve list for the next round
Phase 7: Context Boundary (if iterating)
MANDATORY: When verdict is iterate, you MUST call EnterPlanMode before
doing any more work. Do NOT start the next round inline. Do NOT skip this step.
The context boundary is critical — it forces fresh thinking and prevents
approach anchoring from accumulated context.
- Call EnterPlanMode immediately after reading the evaluation outputs.
- Write a next-round handoff plan containing:
  - Verdict and round number
  - Session directory path
  - Evaluation output file paths (point the next agent to read these):
    sessions/<id>/evaluations/round_NNN/verdict.json
    sessions/<id>/evaluations/round_NNN/critique_packet.md
    sessions/<id>/evaluations/round_NNN/next_tasks.json
    sessions/<id>/evaluations/round_NNN/process_report.md
  - Deliverable path(s)
  - Explicit instruction: "Read the above evaluation files for full context
    before starting work"
  - Brief process learnings (tool/environment notes only, not approach details)
- Call ExitPlanMode. The plan approval dialog appears. In the recommended
  Docker tmux launch flow, the watcher auto-selects "clear context"; otherwise
  the user selects option 1 manually.
- Next iteration starts fresh from the plan with a clean context window.
Key principle: The plan should point to files, not summarize them. The next
agent reads the full evaluation outputs and synthesizes for itself. Summaries
lose nuance — file paths preserve it.
Why this matters: Without the context boundary, later rounds carry forward
stale reasoning, anchor to the previous approach, and accumulate token bloat.
The fresh start is what makes iteration actually improve quality rather than
just polish the same approach.
Loop Control
- Default max rounds: 3 (user can override with --max-rounds)
- Exit when: verdict is converged, max rounds reached, or user intervenes
- Each round saves deliverable snapshots to .massgen-quality/sessions/{id}/round_NNN/
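The exit conditions and snapshot path above can be expressed compactly. Both helper names are hypothetical, and reading NNN as a zero-padded three-digit round number is an assumption.

```python
def should_stop(verdict, round_number, max_rounds=3, user_stopped=False):
    """True when the loop should exit: converged, out of rounds, or user intervened."""
    return verdict == "converged" or round_number >= max_rounds or user_stopped


def snapshot_dir(session_id, round_number):
    """Snapshot directory for a round (assumes NNN is zero-padded to three digits)."""
    return f".massgen-quality/sessions/{session_id}/round_{round_number:03d}/"
```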
Memory Persistence
Memory should capture how to work better (process), NOT what to work on
(content/approach). The fresh-context restart is deliberate — anchoring the
next round to the previous approach defeats the purpose.
SAVE to memory (process learnings):
- Tool strategy: "use read_media before submitting visual deliverables"
- Error avoidance: "this project requires CONTEXT.md for media tools"
- Effort allocation: "don't spend 40% of time on CSS before core logic works"
- Verification gaps: "always test mobile viewport, not just desktop"
- Environment notes: "uses pnpm not npm", "API needs auth header"
DO NOT save to memory (anchors next round):
- What approach was tried or what ceiling was reached
- What the evaluator said was good or bad about the deliverable
- What breakthroughs were found or what should be preserved
- Scores, verdicts, or evaluation details
The next round gets its direction from next_tasks.json and critique_packet.md
(read from the session directory via the plan handoff). Memory is for execution
strategy only — the same distinction the execution-trace-analyzer makes:
"how the agent worked" not "what the agent built."
When to save:
After the execution-trace-analyzer completes, save its top process learnings
to memory. Format as "Remember: ..." statements.
Media Tools
Check .massgen-quality/environment.json → capabilities (written by session
hook) to see which media capabilities are available before using them.
- has_vision: read_media works, so use it for output-first verification
- has_image_gen: generate_media with mode=image works
- has_video_gen: generate_media with mode=video works
- has_audio_gen: generate_media with mode=audio works
If the task involves visual deliverables and has_vision is true, use
read_media for verification. Read the /read-media skill for the
critical-first approach and before/after comparison technique.
If the task involves creating images/video/audio and the corresponding
capability is available, read the /image-generation, /video-generation,
or /audio-generation skills for backend details and parameters.
Key Principles
- Default to iterate: The bar is high. First drafts are 6s, not 8s.
- Rebuild, don't patch: Each iteration rebuilds from the strongest elements.
- Evidence over feelings: Ground every evaluation in observable evidence.
- Breakthrough amplification: When one component is great, spread its technique.
- No feature accumulation: Features on a mediocre foundation = mediocre result.
Gotchas
Common failure points — check here first when something goes wrong.
- read_media fails with "CONTEXT.md not found": Create a CONTEXT.md in
  the workspace root with a brief task description. Only needed for read_media,
  not generate_media.
- Agent fails with "not in a git repository": The launch script should
  have initialized git. Relaunch via launch-claude-tmux.sh, or manually:
echo '.env' >> .gitignore && git init && git add -A && git commit -m "init"
- generate_media mode error: Use mode="image", "video", or "audio",
  NOT "text_to_image" or "text_to_video".
- Builder does a full rebuild when it shouldn't: Scope was too broad. Split
  into narrower briefs: one surface, one defect, one architectural move per builder.
- read_media can't compare before/after: All files must be in a SINGLE call
  (file_paths=["v1.png", "v2.png"]). Separate calls can't compare.
- API keys not working in sandbox: Keys must be in your shell profile
  (~/.zshrc) before creating the sandbox. Inline export doesn't persist.
- MCP servers show "failed": Run uv sync --reinstall inside the sandbox
  if dependencies changed, then reconnect.
- Builders can't see existing files / write to wrong place: Files weren't
  committed before spawning. Worktrees clone from the latest commit, so uncommitted
  files don't exist in the worktree. Commit first.
- "Cannot create agent worktree" in Docker sandbox: WorktreeCreate hooks must
  be in project .claude/settings.json, not plugin hooks. The setup script
  (scripts/setup-sandbox.sh) configures this automatically.
This section will grow as common patterns emerge from usage.
References