Name: Refine
Author: massgen

Run a MassGen-style quality refinement loop. Generates eval criteria, produces/improves answers with subagent help, regression-guards before submission, then evaluates with round-evaluator and trace-analyzer in parallel. Invoke with /refine.

Updated: March 23, 2026 at 17:37


Related skills (same repository)

evaluate.md

from "massgen/massgen-refinery"

Quick one-shot evaluation of work against quality criteria. Generates criteria, runs the round-evaluator, and presents the verdict. Lighter than /refine — does not auto-improve, just evaluates. Use for pre-PR quality checks or getting a critical assessment.

2026-03-23

massgen-run.md

from "massgen/massgen-refinery"

Launch a MassGen multi-agent run. Multiple LLM backends (codex, gemini, claude, grok) collaborate on a task via voting and consensus. Runs in Docker containers by default.

2026-03-23

team-massgen-run.md

from "massgen/massgen-refinery"

Launch parallel MassGen step mode processes — the lead manages all background tasks directly, tracks answers/votes, detects consensus, and synthesizes the result.

2026-03-23

read-media.md

from "massgen/massgen-refinery"

Analyze media files (images, video, audio) using AI vision and audio models with a critical-first lens. Use for output-first verification — render your deliverable, then read_media to see what it actually looks like. Distinguishes fundamental issues from surface-level fixes.

2026-03-22

package.json

"author": "massgen"

"repository": "massgen/massgen-refinery"


name	refine
description	Run a MassGen-style quality refinement loop. Generates eval criteria, produces/improves answers with subagent help, regression-guards before submission, then evaluates with round-evaluator and trace-analyzer in parallel. Invoke with /refine.
user-invocable	true
argument-hint	<query or path-to-deliverable> [--criteria 'custom criteria'] [--max-rounds N] [--prior-answer path] [--eval-backends codex/gpt-5.4 gemini/gemini-3-flash] [--massgen-orchestrated] [--cost-budget N]
allowed-tools	Read, Write, Edit, Bash, Glob, Grep, WebFetch, WebSearch, Agent, Skill, EnterPlanMode, ExitPlanMode

Quality Refinement Loop

You are orchestrating a MassGen-style quality refinement workflow. This is not a one-shot task — it is an iterative loop that produces genuinely high-quality output through structured evaluation and improvement.

Autonomy

Default: fully autonomous. Do not ask the user for confirmation at any step — proceed through the entire workflow without stopping. Generate criteria, produce the answer, evaluate, iterate, all without pausing for approval. Do NOT use AskUserQuestion during the refinement loop — it blocks execution in non-interactive environments (e.g. Docker sandbox). Make your best judgment and proceed.

The user can override this at invocation time:

--interactive or --confirm-criteria: pause for criteria confirmation
--confirm-each-round: pause before each iteration
If the user says "stop at checkpoints" or "ask me before iterating", honor that

Otherwise: execute the full loop autonomously until converged or max rounds hit.

Workflow

Phase 1: Setup

Accept the user's query and any prior answer or deliverable path.
Git repo check — the launch script handles git init, but verify:
```
git rev-parse --is-inside-work-tree
```
If this fails, warn the user to relaunch via launch-claude-tmux.sh.
If you plan to use read_media for verification (visual/audio deliverables), create a CONTEXT.md in the workspace root with a brief task description. The vision/audio API requires it; generate_media does not.
Call init_session (quality-tools MCP) with an optional label describing the task. This creates a new timestamped session directory for isolated state.
Generate 3-7 evaluation criteria appropriate for the task.
- If the user provided custom criteria, use those instead.
- Call generate_eval_criteria (quality-tools MCP) to store them.
- Do NOT ask for confirmation unless the user requested --interactive.

Phase 2: Answer Production

Produce or improve the deliverable. Core principles during production:

Output-first verification: Do not just write — run, view, interact:

Run/view the output
Interact as a user would (click buttons, navigate, test inputs)
Identify gaps from the interaction
Fix and enhance
Re-run and re-interact
Repeat until excellent

Classify the deliverable by behavior to choose verification:

Static (document, image): Render and view with read_media
Animated (motion graphics): Record video, review full sequence
Interactive (app, website): Use it — click all buttons, navigate, test states
Produces output (script, tool): Test with varied and edge-case inputs
Makes sound (audio): Listen to actual content

Question the choices: Don't just improve execution — question fundamental choices. Is this the right approach, or just the first? Would a different direction produce a higher quality ceiling?

Material improvement gate: Classify planned changes as TRANSFORMATIVE, STRUCTURAL, or INCREMENTAL. If only INCREMENTAL changes remain, the approach may need to change — not the polish level.

Builder agent — scope and parallelism:

Before spawning a builder, define:

ONE cohesive deliverable (one file, one feature, one surface)
ONE defect family or architectural problem
Will the builder need runtime feedback? (if yes, work inline instead)

CRITICAL: commit before spawning builders. Builders run in git worktrees cloned from the current commit. If your files aren't committed, the worktree starts empty and builders can't see the existing code. Stage and commit all current work before spawning any builder agents.

If the work has independent pieces, spawn MULTIPLE builders in parallel (invoke them as separate Agent tool calls in a single turn). Each runs in its own worktree. You merge the outputs after all of them complete.

Examples of good parallel splits:

Builder 1: Rewrite the hero section (layout spec)
Builder 2: Implement the contact form (API spec)
Builder 3: Build the responsive nav (mobile spec)

Do NOT bundle unrelated fixes into one builder. An over-scoped builder will report "split required"; treat that as feedback, not failure.

Work INLINE if: changes are incremental, well-scoped within current context, or coupled to other in-progress work.

Background media operations: For slow media operations (video generation, batch image analysis), spawn the media-worker agent in background. It runs the MCP tool and writes results to the session's media_results/ directory. Continue working while it runs — check results when needed by reading the output file.

Keep iterating within Phase 2 until the deliverable feels genuinely ready. This is a tight inner loop: build → verify (read_media, test, interact) → identify gaps → fix → re-verify. Do not rush to Phase 3. The regression guard is the final gate, not the improvement mechanism. If you're still finding issues through verification, stay in Phase 2.

Phase 3: Pre-Submission Gate

When you believe the deliverable is ready — not just "done" but "a demanding user would be satisfied":

Spawn the regression-guard agent with:
- Previous version (Answer A) — if this is a revision
- Current version (Answer B / candidate)
- Evaluation criteria
Read verdict.json from the regression-guard's worktree:
- pass: Proceed to Phase 4 (submit)
- fail: Go back to Phase 2 — address the regressions, then re-check
- mixed: Present tradeoffs to user, let them decide
If this is the first iteration (no prior answer), skip regression guard.
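
The gate above can be sketched as a small dispatcher. This is a minimal illustration only: it assumes the regression guard's verdict.json carries a top-level "verdict" field holding pass, fail, or mixed, and the function name next_step and the returned action labels are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def next_step(verdict_path: Optional[Path]) -> str:
    """Map a regression-guard verdict to the next workflow action.

    Assumption: verdict.json looks like {"verdict": "pass" | "fail" | "mixed"}.
    Pass None on the first iteration, when there is no prior answer to compare.
    """
    if verdict_path is None:
        return "submit"  # first iteration: skip the regression guard
    verdict = json.loads(verdict_path.read_text())["verdict"]
    actions = {
        "pass": "submit",              # proceed to Phase 4
        "fail": "return_to_phase_2",   # address regressions, then re-check
        "mixed": "ask_user",           # present tradeoffs, let the user decide
    }
    if verdict not in actions:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return actions[verdict]
```

Raising on an unknown verdict, rather than defaulting to submit, keeps a malformed file from silently bypassing the gate.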

Phase 4: Submit Answer

Call new_answer (workflow-tools MCP) to submit the current deliverable.

Phase 5: Evaluation (Parallel)

Choose the evaluation method:

Path A: Default single-model evaluation

If no --eval-backends flag and no .massgen-quality/eval_config.json:

round-evaluator agent (foreground):
- Task: evaluate the deliverable against criteria
- Input: deliverable path, criteria, any prior evaluation history
- Output: verdict.json, critique_packet.md, next_tasks.json
execution-trace-analyzer agent (foreground):
- Task: analyze how the answer production phase went
- Input: session transcript path from .massgen-quality/environment.json
- Output: process_report.md, process_verdict.json

Wait for both to complete.

Path B: Multi-model evaluation via team step mode (default multi-model)

If --eval-backends is specified, or .massgen-quality/eval_config.json exists with "eval_mode": "massgen", and in either case the --massgen-orchestrated flag is not set and "orchestrator" in eval_config.json is not "massgen":

Claude orchestrates N massgen step processes directly, one per backend model. Each backend independently evaluates the deliverable. Claude detects consensus and extracts the winning evaluation. Total parallel tasks: N + 1 (N round-evaluator step processes + 1 execution-trace-analyzer Agent).

Pre-checks

Read .massgen-quality/environment.json — is massgen.available true?
Check api_keys in environment.json — are the requested backends authenticated?
- Map: openai/codex → openai, gemini → google, claude/claude_code → anthropic, grok → xai

Verify --step support (read plugin_dir from environment.json):

bash "<plugin_dir>/scripts/run-massgen.sh" --help 2>&1 | grep -q -e "--step"

If massgen unavailable, --step unsupported, or ALL backends lack auth: fall back to Path A (warn user)
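
A rough sketch of the authentication pre-check follows. The backend-to-provider mapping is taken from the list above; the shape of environment.json (a nested massgen.available boolean and an api_keys dict keyed by provider name) is an assumption, as is the function name authenticated_backends.

```python
# Backend family -> API key provider, as listed in the pre-checks above.
PROVIDER_FOR_BACKEND = {
    "openai": "openai",
    "codex": "openai",
    "gemini": "google",
    "claude": "anthropic",
    "claude_code": "anthropic",
    "grok": "xai",
}


def authenticated_backends(specs, environment):
    """Return the backend specs whose provider has an API key available.

    Assumption: environment["api_keys"] maps provider name -> truthy value
    when a key is present, and environment["massgen"]["available"] is a bool.
    """
    if not environment.get("massgen", {}).get("available", False):
        return []  # massgen missing: the caller falls back to Path A
    keys = environment.get("api_keys", {})
    usable = []
    for spec in specs:
        backend = spec.split("/", 1)[0]  # "codex/gpt-5.4" -> "codex"
        provider = PROVIDER_FOR_BACKEND.get(backend)
        if provider and keys.get(provider):
            usable.append(spec)
    return usable
```

An empty result corresponds to the "ALL backends lack auth" case, where the workflow falls back to Path A.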

Short name expansion

Short name	Expands to
`codex`	`codex/gpt-5.4`
`gemini`	`gemini/gemini-3.1-pro-preview`
`claude`	`claude/claude-opus-4-6`
`grok`	`grok/grok-4.20-0309-reasoning`

Full specs like codex/gpt-5.4 are also accepted.
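
The expansion table can be expressed as a lookup. The mapping values come from the table above; the function name expand_backend and the choice to reject unknown short names are illustrative assumptions.

```python
SHORT_NAMES = {
    "codex": "codex/gpt-5.4",
    "gemini": "gemini/gemini-3.1-pro-preview",
    "claude": "claude/claude-opus-4-6",
    "grok": "grok/grok-4.20-0309-reasoning",
}


def expand_backend(name: str) -> str:
    """Expand a short backend name; pass full specs through unchanged."""
    if "/" in name:
        return name  # already a full backend/model spec
    try:
        return SHORT_NAMES[name]
    except KeyError:
        raise ValueError(f"unknown backend short name: {name!r}") from None
```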

Compose the round-evaluator prompt

Build a self-contained text prompt following eval-prompt-template.md. The prompt includes the evaluation identity, criteria, deliverable file paths (never inline content), output format instructions, and prior evaluation history (if round > 1). Write the composed prompt to sessions/<id>/eval_team/round_NNN/eval_prompt.md.

IMPORTANT: Never embed deliverable content in the prompt. Reference deliverables by absolute file path and ensure context_paths in the massgen config includes the deliverable's parent directory. If a deliverable is not already on disk (e.g. composed in memory), write it to a file first, then reference the path.

Initialize eval session

Create the eval session directory:

sessions/<id>/eval_team/round_NNN/
  config.json              # session_id, models, agent_mapping, max_rounds, timeout
  eval_prompt.md           # composed round-evaluator prompt
  eval_criteria.json       # massgen-format criteria
  configs/                 # per-backend YAML configs
  agents/                  # massgen writes step results here

Write config.json:

{
  "session_id": "eval_<timestamp>",
  "query": "evaluation",
  "models": ["codex/gpt-5.4", "gemini/gemini-3.1-pro-preview"],
  "agent_mapping": {
    "agent_a": "codex/gpt-5.4",
    "agent_b": "gemini/gemini-3.1-pro-preview"
  },
  "max_rounds": 3,
  "timeout_seconds": 600
}
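
One possible way to produce this file from a list of models is sketched below. The agent_a, agent_b naming follows the example above; the integer Unix timestamp and the helper name build_eval_config are assumptions.

```python
import string
import time


def build_eval_config(models, max_rounds=3, timeout_seconds=600):
    """Build the config.json payload for one evaluation round.

    Agent IDs follow the agent_a, agent_b, ... pattern from the example;
    the integer timestamp format in session_id is an assumption.
    """
    if len(models) > len(string.ascii_lowercase):
        raise ValueError("too many models for single-letter agent IDs")
    agent_mapping = {
        f"agent_{letter}": model
        for letter, model in zip(string.ascii_lowercase, models)
    }
    return {
        "session_id": f"eval_{int(time.time())}",
        "query": "evaluation",
        "models": list(models),
        "agent_mapping": agent_mapping,
        "max_rounds": max_rounds,
        "timeout_seconds": timeout_seconds,
    }
```

Serializing the returned dict with json.dumps(config, indent=2) gives the layout shown above.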

Generate per-backend configs

Read plugin_dir from .massgen-quality/environment.json. For each backend, generate a step mode config using the wrapper:

bash "<plugin_dir>/scripts/run-massgen.sh" \
  --quickstart --headless \
  --config sessions/<id>/eval_team/round_NNN/configs/<agent_id>_step.yaml \
  --config-backend <type> \
  --config-model <model> \
  --config-agent-id <agent_id>

Add --config-docker if Docker is available. Split the backend spec on / (e.g., codex/gpt-5.4 → --config-backend codex --config-model gpt-5.4). Use absolute paths for --config.
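
The same command can be assembled programmatically, which avoids shell-quoting mistakes when looping over backends. This sketch uses only the flags named above; the function name quickstart_command and its parameters are hypothetical.

```python
from pathlib import Path


def quickstart_command(plugin_dir, round_dir, agent_id, spec, docker=False):
    """Build the argv list for generating one step mode config.

    `round_dir` should be the absolute path of
    sessions/<id>/eval_team/round_NNN.
    """
    backend, model = spec.split("/", 1)  # codex/gpt-5.4 -> codex, gpt-5.4
    config_path = Path(round_dir) / "configs" / f"{agent_id}_step.yaml"
    argv = [
        "bash", str(Path(plugin_dir) / "scripts" / "run-massgen.sh"),
        "--quickstart", "--headless",
        "--config", str(config_path),
        "--config-backend", backend,
        "--config-model", model,
        "--config-agent-id", agent_id,
    ]
    if docker:
        argv.append("--config-docker")
    return argv
```

The resulting list can be passed to subprocess.run without a shell.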

Convert evaluation criteria to massgen format

The MCP generate_eval_criteria tool outputs {id, name, description, weight}. MassGen expects {text, category}. Convert each criterion:

{"text": "<description>", "category": "must"}

Write the converted array to sessions/<id>/eval_team/round_NNN/eval_criteria.json.
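
As a minimal sketch, the conversion is a one-line mapping. The input and output field names come from the text above; the function name to_massgen_criteria is an assumption.

```python
def to_massgen_criteria(criteria):
    """Convert {id, name, description, weight} records to {text, category}.

    Every criterion is mapped to category "must", as in the example above;
    id, name, and weight are dropped.
    """
    return [{"text": c["description"], "category": "must"} for c in criteria]
```

Note that the weight is lost in this conversion, so every criterion is treated as equally mandatory on the MassGen side.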

Launch N + 1 parallel tasks

Launch all tasks in a single message with multiple tool calls:

N background Bash tasks (one per backend):

bash "<plugin_dir>/scripts/run-massgen.sh" \
  --step \
  --session-dir "<abs_eval_session_dir>" \
  --config "<abs_config_path>" \
  --eval-criteria "<abs_criteria_path>" \
  --automation \
  "$(cat <abs_eval_prompt_path>)"

Each with run_in_background: true. Record the background task ID for each. The wrapper handles API key sourcing and massgen resolution automatically.

1 execution-trace-analyzer agent (foreground, unchanged):

Task: analyze how the answer production phase went
Input: session transcript path from .massgen-quality/environment.json
Output: process_report.md, process_verdict.json

Collect results and detect consensus

Follow the team step mode protocol (see team-protocol.md):

As each background task completes (auto-notification), read <eval_session_dir>/agents/<agent_id>/last_action.json
Wait until ALL N agents have completed their step
Check consensus:
- If all agents answered (round 1): no consensus yet, re-launch all agents
- If votes exist: check seen_steps for stale votes, count valid votes, majority wins
- If consensus reached: proceed to parsing
- If no consensus and rounds remain: re-launch every agent that does not have a valid (non-stale) vote for a current answer
Max 3 consensus rounds. Force consensus at 600s timeout or if --cost-budget exceeded — use the answer with the most votes; if no votes, use the first answer submitted.
If only 1 backend available, use its answer directly (no consensus needed).
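
The consensus check can be sketched as follows. The real last_action.json schema is defined in team-protocol.md and is not reproduced here, so the dictionary shapes, the staleness rule, and the strict-majority threshold below are all assumptions made for illustration.

```python
from collections import Counter


def detect_consensus(actions, current_steps):
    """Return the winning agent ID, or None if there is no consensus yet.

    Assumed shapes (hypothetical, not the real last_action.json schema):
      actions: {agent_id: {"action": "answer" | "vote",
                           "target": agent_id,              # votes only
                           "seen_steps": {agent_id: int}}}  # votes only
      current_steps: {agent_id: int}, the step of each agent's latest answer.
    A vote counts as stale when the voter saw a different step of the
    target's answer than the one currently on record.
    """
    if len(actions) == 1:
        return next(iter(actions))  # single backend: no consensus needed

    valid_votes = Counter()
    for action in actions.values():
        if action.get("action") != "vote":
            continue
        target = action["target"]
        seen = action.get("seen_steps", {}).get(target)
        if seen is not None and seen == current_steps.get(target):
            valid_votes[target] += 1

    if not valid_votes:
        return None  # e.g. round 1, where every agent answered
    winner, count = valid_votes.most_common(1)[0]
    # Assumption: "majority" means more than half of all agents.
    return winner if count > len(actions) / 2 else None
```

A None result means the caller re-launches the agents without a valid vote, subject to the round and timeout limits above.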

Parse winning evaluation

Read the consensus winner's answer from last_action.json → answer_text (or from agents/<winner>/<step>/answer.json → answer)
Parse the answer text for structured evaluation content (see eval-prompt-template.md parsing section):
- Extract verdict.json → write to sessions/<id>/evaluations/round_NNN/verdict.json
- Extract critique_packet.md → write to sessions/<id>/evaluations/round_NNN/critique_packet.md
- Extract next_tasks.json → write to sessions/<id>/evaluations/round_NNN/next_tasks.json
If parsing fails: use the raw answer as critique_packet.md with a default {"schema_version": "1", "verdict": "iterate", "scores": {}} verdict
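
The fallback behaviour can be sketched like this. The actual parsing rules live in eval-prompt-template.md, so the parser is passed in as a hypothetical callable; only the default verdict and the output file names come from the text above.

```python
import json
from pathlib import Path

DEFAULT_VERDICT = {"schema_version": "1", "verdict": "iterate", "scores": {}}


def persist_evaluation(answer_text, out_dir, parse=None):
    """Write verdict.json and critique_packet.md, plus next_tasks.json if parsed.

    `parse` is a hypothetical callable implementing the parsing rules from
    eval-prompt-template.md; it returns a dict with "verdict", "critique",
    and "next_tasks" keys, or raises ValueError. If parsing fails, the raw
    answer becomes the critique and the default iterate verdict is used.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    try:
        if parse is None:
            raise ValueError("no parser supplied")
        parsed = parse(answer_text)
        verdict, critique = parsed["verdict"], parsed["critique"]
        next_tasks = json.dumps(parsed["next_tasks"], indent=2)
        (out / "next_tasks.json").write_text(next_tasks)
    except (ValueError, KeyError):
        verdict, critique = DEFAULT_VERDICT, answer_text
    (out / "verdict.json").write_text(json.dumps(verdict, indent=2))
    (out / "critique_packet.md").write_text(critique)
    return verdict
```

Defaulting to iterate is the conservative choice: an unparseable evaluation never ends the loop as converged.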

Wait for both the step mode consensus AND the execution-trace-analyzer to complete before proceeding to Phase 6.

Path C: Multi-model evaluation via MassGen orchestrator (legacy)

If --massgen-orchestrated flag is set OR .massgen-quality/eval_config.json has "orchestrator": "massgen":

MassGen handles all orchestration internally as a single batch process. Simpler but less controllable — subject to monitor timeouts, no error recovery, no progress visibility.

Pre-checks: Same as Path B (massgen available, API keys, backend mapping). Fall back to Path A if unavailable.

Launch:

Write the round-evaluator prompt (same identity, criteria, and deliverable file paths as Path B — never inline content) to sessions/<id>/eval_team/round_NNN/eval_prompt.md
Generate a massgen multi-agent config with the specified backends (see /massgen-run skill's references/config-templates.md). Include context_paths with the project root so agents can read deliverables.
Check Docker availability — use Docker by default, --no-docker to skip
Run in parallel:
- MassGen batch evaluation (via Bash, run_in_background: true):
```
bash "<plugin_dir>/scripts/run-massgen.sh" \
  --automation \
  --config <eval_config.yaml> \
  "$(cat <abs_eval_prompt_path>)"
```
- execution-trace-analyzer agent (foreground, unchanged)
- Spawn massgen-monitor agent (background) with the LOG_DIR
When massgen completes:
- Read status.json → extract winning agent's answer (the consensus evaluation)
- Parse the answer for verdict.json, critique_packet.md, next_tasks.json content and write to sessions/<id>/evaluations/round_NNN/
- If parsing fails: fall back to using the raw answer as critique_packet.md with a default iterate verdict

Wait for both evaluation tasks to complete before proceeding.

eval_config.json

Optional persistent config at .massgen-quality/eval_config.json:

{
  "eval_mode": "massgen",
  "orchestrator": "claude",
  "backends": ["codex/gpt-5.4", "gemini/gemini-3.1-pro-preview"],
  "docker": true
}

orchestrator: "claude" (default) for Path B team step mode, "massgen" for Path C legacy automation mode
docker: true (default) to use Docker isolation for massgen agents
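
Taken together, the flag and config rules for Paths A, B, and C can be sketched as one selection function. This is an illustration under assumptions: the function name choose_eval_path is hypothetical, the pre-check fallbacks to Path A are not modelled, and a config file that sets neither "eval_mode": "massgen" nor "orchestrator": "massgen" is treated as Path A.

```python
def choose_eval_path(eval_backends=None, massgen_orchestrated=False, eval_config=None):
    """Return "A", "B", or "C" for the evaluation paths described above.

    eval_config is the parsed eval_config.json, or None if the file is absent.
    """
    config = eval_config or {}
    if massgen_orchestrated or config.get("orchestrator") == "massgen":
        return "C"  # legacy MassGen-orchestrated batch run
    if eval_backends or config.get("eval_mode") == "massgen":
        return "B"  # team step mode, orchestrated by Claude
    return "A"      # default single-model evaluation
```

Checking the Path C conditions first mirrors the exclusion clause in the Path B rule.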

Phase 6: Synthesize and Persist

Read the evaluation outputs while you still have full context:

Read verdict.json:
- converged: Report success. Present the evaluation summary. Done.
- iterate: Continue below.
Read next_tasks.json:
- Check approach_assessment.ceiling_status
- If ceiling_reached: warn user that a different approach may be needed
- Read fix_tasks and evolution_tasks
Read process_report.md:
- Extract the top process learnings
- Save PROCESS learnings to memory (see Memory section below)
Read critique_packet.md:
- Note the improvement_spec and preserve list for the next round
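
The verdict and next_tasks handling above might look like the following. Field names (verdict, approach_assessment.ceiling_status, fix_tasks, evolution_tasks) come from the text; the function name summarize_round, the returned dict layout, and treating missing fields as empty are assumptions.

```python
import json
from pathlib import Path


def summarize_round(round_dir):
    """Read verdict.json and next_tasks.json and decide what happens next.

    Assumption: missing optional fields in next_tasks.json are treated
    as empty rather than as errors.
    """
    round_dir = Path(round_dir)
    verdict = json.loads((round_dir / "verdict.json").read_text())["verdict"]
    if verdict == "converged":
        return {"done": True, "warn_ceiling": False, "tasks": []}

    next_tasks = json.loads((round_dir / "next_tasks.json").read_text())
    ceiling = next_tasks.get("approach_assessment", {}).get("ceiling_status")
    tasks = next_tasks.get("fix_tasks", []) + next_tasks.get("evolution_tasks", [])
    return {
        "done": False,
        "warn_ceiling": ceiling == "ceiling_reached",
        "tasks": tasks,
    }
```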

Phase 7: Context Boundary (if iterating)

MANDATORY: When verdict is iterate, you MUST call EnterPlanMode before doing any more work. Do NOT start the next round inline. Do NOT skip this step. The context boundary is critical — it forces fresh thinking and prevents approach anchoring from accumulated context.

Call EnterPlanMode immediately after reading the evaluation outputs.
Write a next-round handoff plan containing:
- Verdict and round number
- Session directory path
- Evaluation output file paths — point the next agent to read these:
  - sessions/<id>/evaluations/round_NNN/verdict.json
  - sessions/<id>/evaluations/round_NNN/critique_packet.md
  - sessions/<id>/evaluations/round_NNN/next_tasks.json
  - sessions/<id>/evaluations/round_NNN/process_report.md
- Deliverable path(s)
- Explicit instruction: "Read the above evaluation files for full context before starting work"
- Brief process learnings (tool/environment notes only, not approach details)
Call ExitPlanMode — the plan approval dialog appears. In the recommended Docker tmux launch flow, the watcher auto-selects "clear context"; otherwise the user selects option 1 manually.
Next iteration starts fresh from the plan with a clean context window.
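
A handoff plan of this shape can be generated mechanically, which helps keep it to file pointers. In this sketch the function name handoff_plan and the plan wording are hypothetical, and round_NNN is assumed to be the round number zero-padded to three digits.

```python
EVAL_FILES = (
    "verdict.json",
    "critique_packet.md",
    "next_tasks.json",
    "process_report.md",
)


def handoff_plan(session_id, round_number, verdict, deliverables, learnings=()):
    """Render a next-round plan that points to files rather than summarizing them.

    Assumption: round_NNN is the round number zero-padded to three digits.
    """
    round_dir = f"sessions/{session_id}/evaluations/round_{round_number:03d}"
    lines = [
        f"Verdict: {verdict} (round {round_number})",
        f"Session directory: sessions/{session_id}",
        "Evaluation outputs:",
        *[f"- {round_dir}/{name}" for name in EVAL_FILES],
        "Deliverables:",
        *[f"- {path}" for path in deliverables],
        "Read the above evaluation files for full context before starting work.",
    ]
    if learnings:
        lines.append("Process learnings (tool/environment notes only):")
        lines.extend(f"- {note}" for note in learnings)
    return "\n".join(lines)
```

The function deliberately has no parameter for critique text or scores, so the plan cannot carry evaluation summaries across the context boundary.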

Key principle: The plan should point to files, not summarize them. The next agent reads the full evaluation outputs and synthesizes for itself. Summaries lose nuance — file paths preserve it.

Why this matters: Without the context boundary, later rounds carry forward stale reasoning, anchor to the previous approach, and accumulate token bloat. The fresh start is what makes iteration actually improve quality rather than just polish the same approach.

Loop Control

Default max rounds: 3 (user can override with --max-rounds)
Exit when: verdict is converged, max rounds reached, or user intervenes
Each round saves deliverable snapshots to .massgen-quality/sessions/<id>/round_NNN/
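
The exit conditions can be sketched as a bounded loop. Here run_round is a hypothetical callable standing in for Phases 2 to 5 of a single round; user intervention is not modelled.

```python
def refine_loop(run_round, max_rounds=3):
    """Run rounds until the verdict is converged or max_rounds is reached.

    `run_round` is a hypothetical callable: it takes the 1-based round
    number, performs one full round, and returns the verdict string.
    Returns (rounds_used, final_verdict).
    """
    verdict = "iterate"
    for round_number in range(1, max_rounds + 1):
        verdict = run_round(round_number)
        if verdict == "converged":
            return round_number, verdict
    return max_rounds, verdict
```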

Memory Persistence

Memory should capture how to work better (process), NOT what to work on (content/approach). The fresh-context restart is deliberate — anchoring the next round to the previous approach defeats the purpose.

SAVE to memory (process learnings):

Tool strategy: "use read_media before submitting visual deliverables"
Error avoidance: "this project requires CONTEXT.md for media tools"
Effort allocation: "don't spend 40% of time on CSS before core logic works"
Verification gaps: "always test mobile viewport, not just desktop"
Environment notes: "uses pnpm not npm", "API needs auth header"

DO NOT save to memory (anchors next round):

What approach was tried or what ceiling was reached
What the evaluator said was good or bad about the deliverable
What breakthroughs were found or what should be preserved
Scores, verdicts, or evaluation details

The next round gets its direction from next_tasks.json and critique_packet.md (read from the session directory via the plan handoff). Memory is for execution strategy only — the same distinction the execution-trace-analyzer makes: "how the agent worked" not "what the agent built."

When to save: After the execution-trace-analyzer completes, save its top process learnings to memory. Format as "Remember: ..." statements.

Media Tools

Check .massgen-quality/environment.json → capabilities (written by session hook) to see which media capabilities are available before using them.

has_vision: read_media works — use it for output-first verification
has_image_gen: generate_media with mode=image works
has_video_gen: generate_media with mode=video works
has_audio_gen: generate_media with mode=audio works
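
A capability check along these lines could guard each media call. The flag names and modes are those listed above; the flat boolean layout of capabilities in environment.json and the function name media_tool_available are assumptions.

```python
# (tool, mode) -> capability flag, as listed above.
REQUIRED_CAPABILITY = {
    ("read_media", None): "has_vision",
    ("generate_media", "image"): "has_image_gen",
    ("generate_media", "video"): "has_video_gen",
    ("generate_media", "audio"): "has_audio_gen",
}


def media_tool_available(environment, tool, mode=None):
    """Check environment["capabilities"] before calling a media tool.

    Assumption: capabilities is a flat dict of boolean flags; a missing
    flag is treated as unavailable.
    """
    flag = REQUIRED_CAPABILITY.get((tool, mode))
    if flag is None:
        raise ValueError(f"unsupported tool/mode: {tool!r}, {mode!r}")
    return bool(environment.get("capabilities", {}).get(flag, False))
```

Rejecting unknown modes up front also catches values such as "text_to_image", which generate_media does not accept.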

If the task involves visual deliverables and has_vision is true, use read_media for verification. Read the /read-media skill for the critical-first approach and before/after comparison technique.

If the task involves creating images/video/audio and the corresponding capability is available, read the /image-generation, /video-generation, or /audio-generation skills for backend details and parameters.

Key Principles

Default to iterate: The bar is high. First drafts are 6s, not 8s.
Rebuild, don't patch: Each iteration rebuilds from the strongest elements.
Evidence over feelings: Ground every evaluation in observable evidence.
Breakthrough amplification: When one component is great, spread its technique.
No feature accumulation: Features on a mediocre foundation = mediocre result.

Gotchas

Common failure points — check here first when something goes wrong.

read_media fails with "CONTEXT.md not found": Create a CONTEXT.md in the workspace root with a brief task description. Only needed for read_media, not generate_media.
Agent fails with "not in a git repository": The launch script should have initialized git. Relaunch via launch-claude-tmux.sh, or manually: echo '.env' >> .gitignore && git init && git add -A && git commit -m "init"
generate_media mode error: Use mode="image", "video", or "audio" — NOT "text_to_image" or "text_to_video".
Builder does a full rebuild when it shouldn't: Scope was too broad. Split into narrower briefs — one surface, one defect, one architectural move per builder.
read_media can't compare before/after: All files must be in a SINGLE call (file_paths=["v1.png", "v2.png"]). Separate calls can't compare.
API keys not working in sandbox: Keys must be in your shell profile (~/.zshrc) before creating the sandbox. Inline export doesn't persist.
MCP servers show "failed": Run uv sync --reinstall inside the sandbox if dependencies changed, then reconnect.
Builders can't see existing files / write to wrong place: Files weren't committed before spawning. Worktrees clone from the latest commit — uncommitted files don't exist in the worktree. Commit first.
"Cannot create agent worktree" in Docker sandbox: WorktreeCreate hooks must be in project .claude/settings.json, not plugin hooks. The setup script (scripts/setup-sandbox.sh) configures this automatically.

This section will grow as common patterns emerge from usage.

References

workflow.md — Detailed workflow reference
output-formats.md — verdict.json, next_tasks.json schemas
eval-prompt-template.md — Round-evaluator prompt composition for step mode
step-mode.md — MassGen step mode CLI interface and session directory structure
team-protocol.md — Consensus detection, stale vote invalidation, round flow

name	refine
description	Run a MassGen-style quality refinement loop. Generates eval criteria, produces/improves answers with subagent help, regression-guards before submission, then evaluates with round-evaluator and trace-analyzer in parallel. Invoke with /refine.
user-invocable	true
argument-hint	<query or path-to-deliverable> [--criteria 'custom criteria'] [--max-rounds N] [--prior-answer path] [--eval-backends codex/gpt-5.4 gemini/gemini-3-flash] [--massgen-orchestrated] [--cost-budget N]
allowed-tools	Read, Write, Edit, Bash, Glob, Grep, WebFetch, WebSearch, Agent, Skill, EnterPlanMode, ExitPlanMode

Quality Refinement Loop

Autonomy

The user can override this at invocation time:

--interactive or --confirm-criteria: pause for criteria confirmation
--confirm-each-round: pause before each iteration
If the user says "stop at checkpoints" or "ask me before iterating", honor that

Otherwise: execute the full loop autonomously until converged or max rounds hit.

Workflow

Phase 1: Setup

Accept the user's query and any prior answer or deliverable path.
Git repo check — the launch script handles git init, but verify:
```
git rev-parse --is-inside-work-tree
```
If this fails, warn the user to relaunch via launch-claude-tmux.sh.
If you plan to use read_media for verification (visual/audio deliverables), create a CONTEXT.md in the workspace root with a brief task description. Required by the vision/audio API. Not needed for generate_media.
Call init_session (quality-tools MCP) with an optional label describing the task. This creates a new timestamped session directory for isolated state.
Generate 3-7 evaluation criteria appropriate for the task.
- If the user provided custom criteria, use those instead.
- Call generate_eval_criteria (quality-tools MCP) to store them.
- Do NOT ask for confirmation unless the user requested --interactive.

Phase 2: Answer Production

Produce or improve the deliverable. Core principles during production:

Output-first verification: Do not just write — run, view, interact:

Run/view the output
Interact as a user would (click buttons, navigate, test inputs)
Identify gaps from the interaction
Fix and enhance
Re-run and re-interact
Repeat until excellent

Classify the deliverable by behavior to choose verification:

Static (document, image): Render and view with read_media
Animated (motion graphics): Record video, review full sequence
Interactive (app, website): Use it — click all buttons, navigate, test states
Produces output (script, tool): Test with varied and edge-case inputs
Makes sound (audio): Listen to actual content

Question the choices: Don't just improve execution — question fundamental choices. Is this the right approach, or just the first? Would a different direction produce a higher quality ceiling?

Material improvement gate: Classify planned changes as TRANSFORMATIVE, STRUCTURAL, or INCREMENTAL. If only INCREMENTAL changes remain, the approach may need to change — not the polish level.

Builder agent — scope and parallelism:

Before spawning a builder, define:

ONE cohesive deliverable (one file, one feature, one surface)
ONE defect family or architectural problem
Will the builder need runtime feedback? (if yes, work inline instead)

If the work has independent pieces, spawn MULTIPLE builders in parallel (invoke them as separate Agent tools in a single turn). Each runs in its own worktree. You merge outputs after all complete.

Examples of good parallel splits:

Builder 1: Rewrite the hero section (layout spec)
Builder 2: Implement the contact form (API spec)
Builder 3: Build the responsive nav (mobile spec)

Do NOT bundle unrelated fixes into one builder. The builder will report split required if over-scoped — that's feedback, not failure.

Work INLINE if: changes are incremental, well-scoped within current context, or coupled to other in-progress work.

Phase 3: Pre-Submission Gate

When you believe the deliverable is ready — not just "done" but "a demanding user would be satisfied":

Spawn the regression-guard agent with:
- Previous version (Answer A) — if this is a revision
- Current version (Answer B / candidate)
- Evaluation criteria
Read verdict.json from the regression-guard's worktree:
- pass: Proceed to Phase 4 (submit)
- fail: Go back to Phase 2 — address the regressions, then re-check
- mixed: Present tradeoffs to user, let them decide
If this is the first iteration (no prior answer), skip regression guard.

Phase 4: Submit Answer

Call new_answer (workflow-tools MCP) to submit the current deliverable.

Phase 5: Evaluation (Parallel)

Choose the evaluation method:

Path A: Default single-model evaluation

If no --eval-backends flag and no .massgen-quality/eval_config.json:

round-evaluator agent (foreground):
- Task: evaluate the deliverable against criteria
- Input: deliverable path, criteria, any prior evaluation history
- Output: verdict.json, critique_packet.md, next_tasks.json
execution-trace-analyzer agent (foreground):
- Task: analyze how the answer production phase went
- Input: session transcript path from .massgen-quality/environment.json
- Output: process_report.md, process_verdict.json

Wait for both to complete.

Path B: Multi-model evaluation via team step mode (default multi-model)

Pre-checks

Read .massgen-quality/environment.json — is massgen.available true?
Check api_keys in environment.json — are the requested backends authenticated?
- Map: openai/codex → openai, gemini → google, claude/claude_code → anthropic, grok → xai

Verify --step support (read plugin_dir from environment.json):

bash "<plugin_dir>/scripts/run-massgen.sh" --help 2>&1 | grep -q "\-\-step"

If massgen unavailable, --step unsupported, or ALL backends lack auth: fall back to Path A (warn user)

Short name expansion

Short name	Expands to
`codex`	`codex/gpt-5.4`
`gemini`	`gemini/gemini-3.1-pro-preview`
`claude`	`claude/claude-opus-4-6`
`grok`	`grok/grok-4.20-0309-reasoning`

Full specs like codex/gpt-5.4 are also accepted.

Compose the round-evaluator prompt

Initialize eval session

Create the eval session directory:

sessions/<id>/eval_team/round_NNN/
  config.json              # session_id, models, agent_mapping, max_rounds, timeout
  eval_prompt.md           # composed round-evaluator prompt
  eval_criteria.json       # massgen-format criteria
  configs/                 # per-backend YAML configs
  agents/                  # massgen writes step results here

Write config.json:

{
  "session_id": "eval_<timestamp>",
  "query": "evaluation",
  "models": ["codex/gpt-5.4", "gemini/gemini-3.1-pro-preview"],
  "agent_mapping": {
    "agent_a": "codex/gpt-5.4",
    "agent_b": "gemini/gemini-3.1-pro-preview"
  },
  "max_rounds": 3,
  "timeout_seconds": 600
}

Generate per-backend configs

Read plugin_dir from .massgen-quality/environment.json. For each backend, generate a step mode config using the wrapper:

bash "<plugin_dir>/scripts/run-massgen.sh" \
  --quickstart --headless \
  --config sessions/<id>/eval_team/round_NNN/configs/<agent_id>_step.yaml \
  --config-backend <type> \
  --config-model <model> \
  --config-agent-id <agent_id>

Add --config-docker if Docker is available. Split the backend spec on / (e.g., codex/gpt-5.4 → --config-backend codex --config-model gpt-5.4). Use absolute paths for --config.

Convert evaluation criteria to massgen format

The MCP generate_eval_criteria tool outputs {id, name, description, weight}. MassGen expects {text, category}. Convert each criterion:

{"text": "<description>", "category": "must"}

Write the converted array to sessions/<id>/eval_team/round_NNN/eval_criteria.json.

Launch N + 1 parallel tasks

Launch all tasks in a single message with multiple tool calls:

N background Bash tasks (one per backend):

bash "<plugin_dir>/scripts/run-massgen.sh" \
  --step \
  --session-dir "<abs_eval_session_dir>" \
  --config "<abs_config_path>" \
  --eval-criteria "<abs_criteria_path>" \
  --automation \
  "$(cat <abs_eval_prompt_path>)"

Each with run_in_background: true. Record the background task ID for each. The wrapper handles API key sourcing and massgen resolution automatically.

1 execution-trace-analyzer agent (foreground, unchanged):

Task: analyze how the answer production phase went
Input: session transcript path from .massgen-quality/environment.json
Output: process_report.md, process_verdict.json

Collect results and detect consensus

Follow the team step mode protocol (see team-protocol.md):

As each background task completes (auto-notification), read <eval_session_dir>/agents/<agent_id>/last_action.json
Wait until ALL N agents have completed their step
Check consensus:
- If all agents answered (round 1): no consensus yet, re-launch all agents
- If votes exist: check seen_steps for stale votes, count valid votes, majority wins
- If consensus reached: proceed to parsing
- If no consensus and rounds remain: re-launch every agent that does not have a valid (non-stale) vote for a current answer
Max 3 consensus rounds. Force consensus at the 600s timeout or if --cost-budget is exceeded: use the answer with the most votes; if there are no votes, use the first answer submitted.
If only 1 backend available, use its answer directly (no consensus needed).
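
The consensus rules above can be sketched as follows. The real `last_action.json` schema is defined in team-protocol.md and is not reproduced here, so the shape used below is an assumption made for illustration: each action is a dict with `action` (`"answer"` or `"vote"`), `target`, and `seen_steps` (agent id to the answer step the voter saw). Treating "majority" as a strict majority of all agents, and counting only non-stale votes when forcing consensus, are also assumptions.

```python
from collections import Counter


def valid_votes(actions: dict, latest_step: dict) -> dict:
    """Return {voter: target} for votes that saw the target's latest answer."""
    votes = {}
    for agent_id, act in actions.items():
        if act.get("action") != "vote":
            continue
        target = act.get("target")
        seen = act.get("seen_steps", {}).get(target)
        # A vote is stale if the target has answered again since the voter looked.
        if seen is not None and seen == latest_step.get(target):
            votes[agent_id] = target
    return votes


def check_consensus(actions: dict, latest_step: dict):
    """Return the winning agent id, or None if no strict majority exists."""
    votes = valid_votes(actions, latest_step)
    if not votes:
        return None  # e.g. round 1, where every agent answered
    winner, count = Counter(votes.values()).most_common(1)[0]
    return winner if count > len(actions) / 2 else None


def agents_to_relaunch(actions: dict, latest_step: dict) -> list:
    """Agents without a valid vote are re-launched in the next round."""
    votes = valid_votes(actions, latest_step)
    return [a for a in actions if a not in votes]


def force_consensus(actions: dict, latest_step: dict, answer_order: list):
    """On timeout or budget exhaustion: most votes, else first answer submitted."""
    votes = valid_votes(actions, latest_step)
    if votes:
        return Counter(votes.values()).most_common(1)[0][0]
    return answer_order[0] if answer_order else None
```

The source does not say how vote ties are broken; this sketch leaves that to `Counter.most_common`, which keeps the first-counted target among equals.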

Parse winning evaluation

Read the consensus winner's answer from last_action.json → answer_text (or from agents/<winner>/<step>/answer.json → answer)
Parse the answer text for structured evaluation content (see eval-prompt-template.md parsing section):
- Extract verdict.json → write to sessions/<id>/evaluations/round_NNN/verdict.json
- Extract critique_packet.md → write to sessions/<id>/evaluations/round_NNN/critique_packet.md
- Extract next_tasks.json → write to sessions/<id>/evaluations/round_NNN/next_tasks.json
If parsing fails: use the raw answer as critique_packet.md with a default {"schema_version": "1", "verdict": "iterate", "scores": {}} verdict
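
The parse-failure fallback can be sketched as below. The extraction rules themselves live in eval-prompt-template.md and are not reproduced here; this hypothetical helper covers only the fallback branch, writing the raw answer and the default verdict quoted above.

```python
import json
from pathlib import Path

# Default verdict taken verbatim from the fallback rule above.
DEFAULT_VERDICT = {"schema_version": "1", "verdict": "iterate", "scores": {}}


def write_fallback_evaluation(raw_answer: str, out_dir: Path) -> None:
    """Persist the raw answer as the critique and a default iterate verdict."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "critique_packet.md").write_text(raw_answer, encoding="utf-8")
    (out_dir / "verdict.json").write_text(
        json.dumps(DEFAULT_VERDICT, indent=2), encoding="utf-8"
    )
```

No `next_tasks.json` is written in this branch, because the fallback rule does not define a default for it.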

Wait for both the step mode consensus AND the execution-trace-analyzer to complete before proceeding to Phase 6.

Path C: Multi-model evaluation via MassGen orchestrator (legacy)

If --massgen-orchestrated flag is set OR .massgen-quality/eval_config.json has "orchestrator": "massgen":

MassGen handles all orchestration internally as a single batch process. This is simpler but less controllable: it is subject to monitor timeouts and offers no error recovery or progress visibility.

Pre-checks: Same as Path B (massgen available, API keys, backend mapping). Fall back to Path A if unavailable.

Launch:

Write the round-evaluator prompt (same identity, criteria, and deliverable file paths as Path B — never inline content) to sessions/<id>/eval_team/round_NNN/eval_prompt.md
Generate a massgen multi-agent config with the specified backends (see /massgen-run skill's references/config-templates.md). Include context_paths with the project root so agents can read deliverables.
Check Docker availability: use Docker by default; pass --no-docker to skip it.
Run in parallel:
- MassGen batch evaluation (via Bash, run_in_background: true):
```
bash "<plugin_dir>/scripts/run-massgen.sh" \
  --automation \
  --config <eval_config.yaml> \
  "$(cat <abs_eval_prompt_path>)"
```
- execution-trace-analyzer agent (foreground, unchanged)
- Spawn massgen-monitor agent (background) with the LOG_DIR
When massgen completes:
- Read status.json → extract winning agent's answer (the consensus evaluation)
- Parse the answer for verdict.json, critique_packet.md, next_tasks.json content and write to sessions/<id>/evaluations/round_NNN/
- If parsing fails: fall back to using the raw answer as critique_packet.md with a default iterate verdict

Wait for both evaluation tasks to complete before proceeding.

eval_config.json

Optional persistent config at .massgen-quality/eval_config.json:

{
  "eval_mode": "massgen",
  "orchestrator": "claude",
  "backends": ["codex/gpt-5.4", "gemini/gemini-3.1-pro-preview"],
  "docker": true
}

orchestrator: "claude" (default) for Path B team step mode, "massgen" for Path C legacy automation mode
docker: true (default) to use Docker isolation for massgen agents
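
A small sketch of how the documented defaults might be applied when reading this file. The helper names are hypothetical, and the sketch only distinguishes Path B from Path C; whether multi-model evaluation is used at all is decided elsewhere and is not modeled.

```python
import json
from pathlib import Path

# Defaults as documented above; other keys are passed through unchanged.
DEFAULTS = {"orchestrator": "claude", "docker": True}


def load_eval_config(path: str = ".massgen-quality/eval_config.json") -> dict:
    """Load the optional config, falling back to defaults if it is absent."""
    p = Path(path)
    user = json.loads(p.read_text(encoding="utf-8")) if p.exists() else {}
    return {**DEFAULTS, **user}


def multi_model_path(config: dict, massgen_orchestrated_flag: bool = False) -> str:
    """Return 'C' for the legacy orchestrator, otherwise 'B' (team step mode)."""
    if massgen_orchestrated_flag or config.get("orchestrator") == "massgen":
        return "C"
    return "B"
```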

Phase 6: Synthesize and Persist

Read the evaluation outputs while you still have full context:

Read verdict.json:
- converged: Report success. Present the evaluation summary. Done.
- iterate: Continue below.
Read next_tasks.json:
- Check approach_assessment.ceiling_status
- If ceiling_reached: warn user that a different approach may be needed
- Read fix_tasks and evolution_tasks
Read process_report.md:
- Extract the top process learnings
- Save PROCESS learnings to memory (see Memory section below)
Read critique_packet.md:
- Note the improvement_spec and preserve list for the next round

Phase 7: Context Boundary (if iterating)

Call EnterPlanMode immediately after reading the evaluation outputs.
Write a next-round handoff plan containing:
- Verdict and round number
- Session directory path
- Evaluation output file paths — point the next agent to read these:
  - sessions/<id>/evaluations/round_NNN/verdict.json
  - sessions/<id>/evaluations/round_NNN/critique_packet.md
  - sessions/<id>/evaluations/round_NNN/next_tasks.json
  - sessions/<id>/evaluations/round_NNN/process_report.md
- Deliverable path(s)
- Explicit instruction: "Read the above evaluation files for full context before starting work"
- Brief process learnings (tool/environment notes only, not approach details)
Call ExitPlanMode — the plan approval dialog appears. In the recommended Docker tmux launch flow, the watcher auto-selects "clear context"; otherwise the user selects option 1 manually.
Next iteration starts fresh from the plan with a clean context window.

Key principle: The plan should point to files, not summarize them. The next agent reads the full evaluation outputs and synthesizes for itself. Summaries lose nuance — file paths preserve it.
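
To illustrate "point to files, not summaries", here is a hypothetical helper that assembles the handoff plan text from paths only. Rendering `round_NNN` as a zero-padded three-digit number is an assumption, as are the function name and the exact wording of each line.

```python
def handoff_plan(
    session_id: str,
    round_no: int,
    verdict: str,
    deliverables: list[str],
    learnings: list[str],
) -> str:
    """Build a next-round plan that references evaluation files by path."""
    # Assumption: round_NNN means a zero-padded three-digit round number.
    eval_dir = f"sessions/{session_id}/evaluations/round_{round_no:03d}"
    files = ["verdict.json", "critique_packet.md", "next_tasks.json", "process_report.md"]
    lines = [
        f"Verdict: {verdict} (round {round_no})",
        f"Session directory: sessions/{session_id}",
        "Evaluation outputs:",
        *[f"- {eval_dir}/{name}" for name in files],
        "Deliverables:",
        *[f"- {path}" for path in deliverables],
        "Read the above evaluation files for full context before starting work.",
        "Process learnings (tool/environment notes only):",
        *[f"- {note}" for note in learnings],
    ]
    return "\n".join(lines)
```

The plan deliberately carries no scores or critique text, so the next agent has to open the files themselves.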

Loop Control

Default max rounds: 3 (user can override with --max-rounds)
Exit when: verdict is converged, max rounds reached, or user intervenes
Each round saves deliverable snapshots to .massgen-quality/sessions/<id>/round_NNN/
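
The exit conditions can be sketched as a simple loop. `run_round` is a hypothetical callable standing in for one full produce-and-evaluate round that returns the verdict string; user intervention is not modeled.

```python
from typing import Callable


def refine_loop(run_round: Callable[[int], str], max_rounds: int = 3) -> tuple[str, int]:
    """Run rounds until the verdict is 'converged' or max_rounds is reached."""
    verdict = "iterate"
    rounds_run = 0
    for round_no in range(1, max_rounds + 1):
        verdict = run_round(round_no)
        rounds_run = round_no
        if verdict == "converged":
            break
    return verdict, rounds_run
```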

Memory Persistence

SAVE to memory (process learnings):

Tool strategy: "use read_media before submitting visual deliverables"
Error avoidance: "this project requires CONTEXT.md for media tools"
Effort allocation: "don't spend 40% of time on CSS before core logic works"
Verification gaps: "always test mobile viewport, not just desktop"
Environment notes: "uses pnpm not npm", "API needs auth header"

DO NOT save to memory (anchors next round):

What approach was tried or what ceiling was reached
What the evaluator said was good or bad about the deliverable
What breakthroughs were found or what should be preserved
Scores, verdicts, or evaluation details

When to save: After the execution-trace-analyzer completes, save its top process learnings to memory. Format as "Remember: ..." statements.
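
One way to enforce the save/do-not-save split is to filter learnings by category before formatting them. The category keys below are hypothetical labels derived from the SAVE list above; the trace analyzer's actual output format is not specified here.

```python
# Hypothetical category keys mirroring the SAVE list above.
PROCESS_CATEGORIES = {
    "tool_strategy",
    "error_avoidance",
    "effort_allocation",
    "verification_gaps",
    "environment_notes",
}


def to_memory_statements(learnings: list[dict]) -> list[str]:
    """Keep only process learnings and format them as 'Remember: ...'."""
    return [
        f"Remember: {item['text']}"
        for item in learnings
        if item.get("category") in PROCESS_CATEGORIES
    ]
```

Anything tagged with another category, such as scores or approach details, is dropped so that it cannot anchor the next round.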

Media Tools

Check .massgen-quality/environment.json → capabilities (written by session hook) to see which media capabilities are available before using them.

has_vision: read_media works — use it for output-first verification
has_image_gen: generate_media with mode=image works
has_video_gen: generate_media with mode=video works
has_audio_gen: generate_media with mode=audio works
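
A sketch of the capability check, assuming `capabilities` is a JSON object of boolean flags under that key in environment.json (the source names the flags but not their types). The helper names are hypothetical.

```python
import json
from pathlib import Path

# Maps (tool, mode) to the capability flag listed above.
REQUIRED_FLAG = {
    ("read_media", None): "has_vision",
    ("generate_media", "image"): "has_image_gen",
    ("generate_media", "video"): "has_video_gen",
    ("generate_media", "audio"): "has_audio_gen",
}


def load_capabilities(env_path: str = ".massgen-quality/environment.json") -> dict:
    """Read the capabilities object, or return {} if the file is missing."""
    p = Path(env_path)
    if not p.exists():
        return {}
    return json.loads(p.read_text(encoding="utf-8")).get("capabilities", {})


def can_use(capabilities: dict, tool: str, mode=None) -> bool:
    """True only for a known (tool, mode) pair whose flag is set."""
    flag = REQUIRED_FLAG.get((tool, mode))
    return flag is not None and bool(capabilities.get(flag))
```

Because unknown pairs return `False`, a call such as `can_use(caps, "generate_media", "text_to_image")` is rejected even when image generation is available.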

Key Principles

Default to iterate: The bar is high. First drafts are 6s, not 8s.
Rebuild, don't patch: Each iteration rebuilds from the strongest elements.
Evidence over feelings: Ground every evaluation in observable evidence.
Breakthrough amplification: When one component is great, spread its technique.
No feature accumulation: Features on a mediocre foundation = mediocre result.

Gotchas

Common failure points — check here first when something goes wrong.

read_media fails with "CONTEXT.md not found": Create a CONTEXT.md in the workspace root with a brief task description. Only needed for read_media, not generate_media.
Agent fails with "not in a git repository": The launch script should have initialized git. Relaunch via launch-claude-tmux.sh, or manually: echo '.env' >> .gitignore && git init && git add -A && git commit -m "init"
generate_media mode error: Use mode="image", "video", or "audio" — NOT "text_to_image" or "text_to_video".
Builder does a full rebuild when it shouldn't: Scope was too broad. Split into narrower briefs — one surface, one defect, one architectural move per builder.
read_media can't compare before/after: All files must be in a SINGLE call (file_paths=["v1.png", "v2.png"]). Separate calls can't compare.
API keys not working in sandbox: Keys must be in your shell profile (~/.zshrc) before creating the sandbox. Inline export doesn't persist.
MCP servers show "failed": Run uv sync --reinstall inside the sandbox if dependencies changed, then reconnect.
Builders can't see existing files / write to wrong place: Files weren't committed before spawning. Worktrees clone from the latest commit — uncommitted files don't exist in the worktree. Commit first.
"Cannot create agent worktree" in Docker sandbox: WorktreeCreate hooks must be in project .claude/settings.json, not plugin hooks. The setup script (scripts/setup-sandbox.sh) configures this automatically.

This section will grow as common patterns emerge from usage.

References

workflow.md — Detailed workflow reference
output-formats.md — verdict.json, next_tasks.json schemas
eval-prompt-template.md — Round-evaluator prompt composition for step mode
step-mode.md — MassGen step mode CLI interface and session directory structure
team-protocol.md — Consensus detection, stale vote invalidation, round flow
