| name | refine |
| description | Iterative prompt and schema refinement using TDD methodology for LLM workloads. Triggers: "refine the prompt", "improve extraction", "iterate on schema", "prompt TDD", "tune the model". |
| version | 2 |
| tier | protocol |
Iterative prompt and schema refinement using TDD methodology for LLM workloads.
[!!!] CRITICAL BOOT SEQUENCE:
- LOAD STANDARDS: IF NOT LOADED, Read ~/.claude/directives/COMMANDS.md, ~/.claude/directives/INVARIANTS.md, and ~/.claude/directives/TAGS.md.
- GUARD: "Quick task"? NO SHORTCUTS. See ¶INV_SKILL_PROTOCOL_MANDATORY.
- EXECUTE: FOLLOW THE PROTOCOL BELOW EXACTLY.
⛔ GATE CHECK — Do NOT proceed to Phase 0 until ALL are filled in:
Output this block in chat with every blank filled:
Boot proof:
- COMMANDS.md — §CMD spotted: ________
- INVARIANTS.md — ¶INV spotted: ________
- TAGS.md — §FEED spotted: ________
[!!!] If ANY blank above is empty: STOP. Go back to step 1 and load the missing file. Do NOT read Phase 0 until every blank is filled.
Refinement Protocol (The Iteration Engine)
[!!!] DO NOT USE THE BUILT-IN PLAN MODE (EnterPlanMode tool). This protocol has its own planning system — Phase 1 (Interrogation / Manifest Creation) and Phase 2 (Experiment Design). The engine's plan lives in the session directory as a reviewable artifact, not in a transient tool state. Use THIS protocol's phases, not the IDE's.
ARGUMENTS: Accepts optional flags:
--auto: Run N iterations automatically (default: suggestion mode)
--dry-run: Show what would happen without executing
--iterations N: Set max iterations for auto mode (default: 5)
--manifest <path>: Use existing manifest instead of interrogation
--plan <path>: Skip planning, use existing REFINE_PLAN.md
--case <path>: Focus on a single case instead of running all cases
--continue: Resume from last iteration in current session directory
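The flags above could be parsed roughly as follows. This is a minimal Python sketch: the parser, its `prog` name, and the attribute names (for example `resume` for `--continue`) are illustrative assumptions, not part of the protocol.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the flags listed above.
    p = argparse.ArgumentParser(prog="refine")
    p.add_argument("--auto", action="store_true", help="run N iterations automatically")
    p.add_argument("--dry-run", action="store_true", help="show what would happen")
    p.add_argument("--iterations", type=int, default=5, metavar="N")
    p.add_argument("--manifest", metavar="PATH")
    p.add_argument("--plan", metavar="PATH")
    p.add_argument("--case", metavar="PATH")
    # "continue" is a Python keyword, so store the flag under another name.
    p.add_argument("--continue", dest="resume", action="store_true")
    return p

args = build_parser().parse_args(["--auto", "--iterations", "3"])
mode = "auto" if args.auto else "suggestion"  # suggestion mode is the default
print(mode, args.iterations)
```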
Session Parameters (for §CMD_PARSE_PARAMETERS)
Merge into the JSON passed to session.sh activate:
{
"taskType": "CHANGESET",
"phases": [
{"major": 0, "minor": 0, "name": "Setup"},
{"major": 1, "minor": 0, "name": "Interrogation"},
{"major": 2, "minor": 0, "name": "Planning"},
{"major": 3, "minor": 0, "name": "Validation"},
{"major": 4, "minor": 0, "name": "Baseline"},
{"major": 5, "minor": 0, "name": "Iteration Loop"},
{"major": 6, "minor": 0, "name": "Synthesis"}
],
"nextSkills": ["/refine", "/test", "/implement", "/analyze", "/chores"],
"provableDebriefItems": ["§CMD_MANAGE_DIRECTIVES", "§CMD_PROCESS_DELEGATIONS", "§CMD_DISPATCH_APPROVAL", "§CMD_CAPTURE_SIDE_DISCOVERIES", "§CMD_MANAGE_ALERTS", "§CMD_REPORT_LEFTOVER_WORK"],
"directives": ["TESTING.md", "PITFALLS.md", "CONTRIBUTING.md"],
"planTemplate": "~/.claude/skills/refine/assets/TEMPLATE_REFINE_PLAN.md",
"logTemplate": "~/.claude/skills/refine/assets/TEMPLATE_REFINE_LOG.md",
"debriefTemplate": "~/.claude/skills/refine/assets/TEMPLATE_REFINE.md",
"modes": {
"accuracy": {"label": "Accuracy", "description": "Precision and correctness", "file": "~/.claude/skills/refine/modes/accuracy.md"},
"speed": {"label": "Speed", "description": "Latency and token efficiency", "file": "~/.claude/skills/refine/modes/speed.md"},
"robustness": {"label": "Robustness", "description": "Edge case handling", "file": "~/.claude/skills/refine/modes/robustness.md"},
"custom": {"label": "Custom", "description": "User-defined", "file": "~/.claude/skills/refine/modes/custom.md"}
}
}
0. Setup Phase
- Intent: Execute §CMD_REPORT_INTENT_TO_USER.
- I am starting Phase 0: Setup phase.
- I will §CMD_USE_ONLY_GIVEN_CONTEXT for Phase 0 only (Strict Bootloader — expires at Phase 1).
- My focus is REFINEMENT (§CMD_REFUSE_OFF_COURSE applies).
- I will §CMD_LOAD_AUTHORITY_FILES to ensure all templates and standards are loaded.
- I will §CMD_FIND_TAGGED_FILES to identify active alerts (#active-alert).
- I will §CMD_PARSE_PARAMETERS to define the flight plan.
- I will §CMD_MAINTAIN_SESSION_DIR to establish working space.
- I will select the Refinement Mode (Accuracy / Speed / Robustness / Custom).
- I will §CMD_ASSUME_ROLE using the selected mode's preset.
- I will obey §CMD_NO_MICRO_NARRATION and ¶INV_CONCISE_CHAT (Silence Protocol).
Constraint: Do NOT read any project files (source code, docs) in Phase 0. Only load the required system templates/standards.
- Required Context: Execute §CMD_LOAD_AUTHORITY_FILES (multi-read) for the following files:
~/.claude/skills/refine/assets/TEMPLATE_REFINE_PLAN.md (Template for experiment planning)
~/.claude/skills/refine/assets/TEMPLATE_REFINE_LOG.md (Template for experiment logging)
~/.claude/skills/refine/assets/TEMPLATE_REFINE.md (Template for session debrief)
~/.claude/skills/refine/assets/MANIFEST_SCHEMA.json (Schema for workload manifest)
.claude/directives/PITFALLS.md (Known pitfalls and gotchas — project-level, load if exists)
- Parse Arguments: Check for flags in the user's command:
--manifest <path>: Skip interrogation, use existing manifest
--plan <path>: Skip planning, use existing REFINE_PLAN.md
--auto: Run automated iteration loop (default: suggestion mode)
--dry-run: Show what would happen without executing
--iterations N: Set max iterations for auto mode (default: 5)
--case <path>: Focus on a single case instead of running all cases
--continue: Resume from last iteration in current session
- Parse Parameters: Execute §CMD_PARSE_PARAMETERS.
- CRITICAL: Output the JSON BEFORE proceeding.
- Session Location: Execute §CMD_MAINTAIN_SESSION_DIR.
- Identify Recent Truth: Execute §CMD_FIND_TAGGED_FILES for #active-alert.
- If any files are found, add them to contextPaths for ingestion.
6.1. Refinement Mode Selection: Execute AskUserQuestion (multiSelect: false):
> "What refinement objective should I optimize for?"
> - "Accuracy" (Recommended) — Precision-focused: maximize extraction correctness
> - "Speed" — Efficiency-focused: minimize latency and token usage
> - "Robustness" — Resilience-focused: handle edge cases and diverse inputs
> - "Custom" — Define your own optimization objective
**On selection**: Read the corresponding `modes/{mode}.md` file. It defines Role, Goal, Mindset, and Configuration (iteration focus, hypothesis style, success metric).
**On "Custom"**: Read ALL 3 named mode files first (`modes/accuracy.md`, `modes/speed.md`, `modes/robustness.md`), then accept user's framing. Parse into role/goal/mindset.
**Record**: Store the selected mode. It configures:
* Phase 0 Step 6.2 role (from mode file)
* Phase 5 iteration focus, hypothesis style, and success metric (from mode file)
6.2. Assume Role: Execute §CMD_ASSUME_ROLE using the selected mode's Role, Goal, and Mindset from the loaded mode file.
- Resume Check: Was the --continue flag passed?
- If Yes:
- Read REFINE_LOG.md from session directory.
- Parse last 🏁 Iteration Complete or 📈 Metrics entry to find iteration number.
- Read manifest path from log or ask user.
- Skip to Phase 5 (Iteration Loop) starting at iteration N+1.
- If No: Continue to manifest check.
- Manifest Check: Was --manifest <path> provided?
- If Yes: Read the manifest, validate against schema, proceed to plan check.
- If No: Proceed to Phase 1 (Interrogation).
- Plan Check: Was --plan <path> provided?
- If Yes: Read the plan, skip to Phase 3 (Validation).
- If No: Proceed to Phase 2 (Planning).
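The three checks above amount to a small routing decision. A minimal Python sketch, assuming the flags have already been parsed into plain values (the function name and signature are hypothetical):

```python
from typing import Optional

def route_after_setup(resume: bool, manifest: Optional[str], plan: Optional[str]) -> int:
    """Return the phase number to enter after Phase 0."""
    if resume:            # --continue: jump back into the Iteration Loop
        return 5
    if manifest is None:  # no --manifest: build one via Interrogation
        return 1
    if plan is None:      # manifest given, no --plan: design experiments
        return 2
    return 3              # manifest and plan given: go to Validation

print(route_after_setup(False, "refine.manifest.json", None))
```

Note that the plan check is only reached when a manifest was supplied, which matches the order of the checks above.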
§CMD_VERIFY_PHASE_EXIT — Phase 0
Output this block in chat with every blank filled:
Phase 0 proof:
- Mode: ________ (accuracy / speed / robustness / custom)
- Role: ________ (quote the role name from the mode preset)
- Session dir: ________
- Templates loaded: ________
- Parameters parsed: ________
- Flags parsed: ________
- Routing: ________
Phase Transition
Phase 0 proceeds according to the routing checks above (Phase 1 by default, unless --continue, --manifest, or --plan redirect it). No transition question is needed.
1. Interrogation Phase (Manifest Creation)
Build the workload manifest through structured questioning.
Intent: Execute §CMD_REPORT_INTENT_TO_USER.
- I am moving to Phase 1: Interrogation (Manifest Creation).
- I will ask questions to understand the workload configuration.
- I will build a refine.manifest.json from your answers.
- I will §CMD_LOG_TO_DETAILS to capture the Q&A.
Interrogation Depth Selection
Before asking any questions, present this choice via AskUserQuestion (multiSelect: false):
"How deep should manifest interrogation go?"
| Depth | Minimum Rounds | When to Use |
|---|---|---|
| Short | 3+ | Simple workload, clear paths, few cases |
| Medium | 6+ | Moderate complexity, custom validators, overlays |
| Long | 9+ | Complex multi-stage pipeline, many edge cases |
| Absolute | Until ALL questions resolved | Novel workload type, zero ambiguity tolerance |
Record the user's choice. This sets the minimum — the agent can always ask more, and the user can always say "proceed" after the minimum is met.
Interrogation Protocol (Rounds)
[!!!] CRITICAL: You MUST complete at least the minimum rounds for the chosen depth. Track your round count visibly.
Round counter: Output it on every round: "Round N / {depth_minimum}+"
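The depth table and round counter can be expressed as a small lookup. This is a sketch only; the function names and the "until resolved" label for the Absolute depth are assumptions made for illustration.

```python
MIN_ROUNDS = {"Short": 3, "Medium": 6, "Long": 9}

def round_label(depth: str, n: int) -> str:
    # "Absolute" has no numeric minimum; it runs until all questions are resolved.
    minimum = f"{MIN_ROUNDS[depth]}+" if depth in MIN_ROUNDS else "until resolved"
    return f"Round {n} / {minimum}"

def minimum_met(depth: str, rounds_done: int, open_questions: int) -> bool:
    """True once the exit gate may be offered for the chosen depth."""
    if depth == "Absolute":
        return open_questions == 0
    return rounds_done >= MIN_ROUNDS[depth]

print(round_label("Medium", 2))
```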
Interrogation Topics (Refinement)
Examples of themes to explore. Adapt to the workload — skip irrelevant ones, invent new ones as needed.
Standard topics (typically covered once):
- Iteration goals — What specific improvements are you targeting? What's "good enough"?
- Baseline metrics — What's the current pass rate? Where are the worst failures?
- Evaluation criteria — How do you measure success? Automated diffs, visual review, both?
- Failure modes — What kinds of errors are most common? Structural, value, missing fields?
- Resource constraints — Cost per iteration? API rate limits? Time budget?
- Stopping conditions — When should we stop iterating? Pass rate? Plateau? Budget?
- Prompt engineering specifics — Which prompt sections are most suspicious? Any known weak spots?
- Data characteristics — How varied are the cases? Format consistency? Outliers?
- Edge cases — Known tricky inputs? Cases that always fail?
- Comparison strategy — Diff-based, overlay-based, or manual inspection?
Repeatable topics (can be selected any number of times):
- Followup — Clarify or revisit answers from previous rounds
- Devil's advocate — Challenge assumptions and decisions made so far
- What-if scenarios — Explore hypotheticals, edge cases, and alternative futures
- Deep dive — Drill into a specific topic from a previous round in much more detail
Each round:
- Pick an uncovered topic (or a repeatable topic).
- Execute §CMD_ASK_ROUND_OF_QUESTIONS via AskUserQuestion (3-5 targeted questions on that topic).
- On response: Execute §CMD_LOG_TO_DETAILS immediately.
- If the user asks a counter-question: ANSWER it, verify understanding, then resume.
Structured Manifest Questions
Within the interrogation rounds, cover these manifest-specific fields:
Core Configuration (Round A):
- "What is this workload called?" → workloadId
- "Where are the prompt files that control extraction?" → promptPaths
- "Where are the schema files (if any)?" → schemaPaths
- "Where are the test input files (cases)?" → casePaths (accept glob patterns)
- "Do you have expected output files for comparison?" → expectedPaths (optional)
Execution Configuration (Round B):
- "What command runs extraction on a single case?" → runCommand
- "Where should extraction output be written?" → outputPath
- "Do you have a command to generate visual overlays?" → overlayCommand (optional)
- "Any custom validation scripts to run?" → validationScripts (optional)
Advanced Configuration (Round C — Optional):
- "Custom critique prompt for visual analysis?" → critiquePrompt (optional)
- "Custom critique script instead of Claude?" → critiqueScript (optional)
- "Max iterations for auto mode?" → maxIterations (default: 5)
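For orientation, a manifest assembled from the questions above might look like the following Python sketch. Only the key names come from this document; every value is a placeholder, and the required-field list is an assumption (fields not marked optional, excluding schemaPaths, which is asked "if any"). MANIFEST_SCHEMA.json remains the authority.

```python
import json

# Placeholder values; only the key names are taken from the questions above.
manifest = {
    "workloadId": "example-workload",
    "promptPaths": ["prompts/extract.md"],
    "schemaPaths": ["schemas/output.json"],
    "casePaths": ["cases/*.pdf"],
    "runCommand": "run-extraction {case}",
    "outputPath": "out/result.json",
    "maxIterations": 5,
}

# Assumed required set; the real schema may differ.
REQUIRED = ["workloadId", "promptPaths", "casePaths", "runCommand", "outputPath"]

def missing_fields(m: dict) -> list:
    """List assumed-required keys that are absent from the manifest."""
    return [k for k in REQUIRED if k not in m]

print(json.dumps(manifest, indent=2))
print(missing_fields(manifest))
```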
Interrogation Exit Gate
After reaching minimum rounds, present this choice via AskUserQuestion (multiSelect: true):
"Round N complete (minimum met). What next?"
- "Proceed to assemble manifest" — (terminal: if selected, skip all others and move on)
- "More interrogation (3 more rounds)" — Standard topic rounds, then this gate re-appears
- "Devil's advocate round" — 1 round challenging assumptions, then this gate re-appears
- "What-if scenarios round" — 1 round exploring hypotheticals, then this gate re-appears
- "Deep dive round" — 1 round drilling into a prior topic, then this gate re-appears
Execution order (when multiple selected): Standard rounds first → Devil's advocate → What-ifs → Deep dive → re-present exit gate.
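The multi-select handling can be sketched as below. The option labels are copied from the gate above; the function name and the (label, rounds) return shape are hypothetical.

```python
PROCEED = "Proceed to assemble manifest"

# (option label, number of rounds), in the execution order stated above.
ORDER = [
    ("More interrogation (3 more rounds)", 3),
    ("Devil's advocate round", 1),
    ("What-if scenarios round", 1),
    ("Deep dive round", 1),
]

def plan_rounds(selected: set) -> list:
    """Return the rounds to run before the exit gate is presented again."""
    if PROCEED in selected:  # terminal option: ignore everything else
        return []
    return [(label, n) for label, n in ORDER if label in selected]

print(plan_rounds({"Deep dive round", "More interrogation (3 more rounds)"}))
```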
Assemble Manifest
- Construct: Build the manifest JSON from collected answers.
- Validate: Check against MANIFEST_SCHEMA.json.
- Present: Show the manifest to the user. Execute AskUserQuestion (multiSelect: false):
"Manifest ready. Confirm?"
- "Confirmed" — Manifest is correct, proceed
- "I have changes" — Let me adjust before proceeding
§CMD_VERIFY_PHASE_EXIT — Phase 1
Output this block in chat with every blank filled:
Phase 1 proof:
- Depth chosen: ________
- Rounds completed: ________ / ________+
- DETAILS.md entries: ________
- Manifest validated: ________
- User confirmed: ________
Phase Transition
Execute §CMD_TRANSITION_PHASE_WITH_OPTIONAL_WALKTHROUGH:
completedPhase: "1: Interrogation"
nextPhase: "2: Planning"
prevPhase: "0: Setup"
custom: "Skip to Phase 3: Validation | Jump straight to single-fixture test"
2. Planning Phase (Experiment Design)
Before iterating, design the experiment. Measure twice, cut once.
Intent: Execute §CMD_REPORT_INTENT_TO_USER.
- I am moving to Phase 2: Planning (Experiment Design).
- I will analyze current failures and form ranked hypotheses.
- I will §CMD_POPULATE_LOADED_TEMPLATE using REFINE_PLAN.md template.
- I will §CMD_WAIT_FOR_USER_CONFIRMATION before proceeding.
Step A: Gather Failure Context
- If continuing from prior session: Read prior REFINE.md or REFINE_LOG.md for context.
- If fresh: Run a quick baseline scan (dry-run) to understand current failure patterns.
- Categorize: Group failures by symptom type (bounding box, missing fields, wrong values, structural).
Step B: Form Hypotheses
- Analyze Patterns: What do failing cases have in common?
- Hypothesize: For each failure pattern, propose a root cause.
- Rank: Order hypotheses by:
- Likelihood: How confident are we this is the cause?
- Testability: Can we isolate and test this cheaply?
- Impact: How many cases would this fix?
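One way to make the ranking concrete is to score each hypothesis on the three criteria. The protocol does not define a formula, so the product used here is purely an assumption, as are the class and field names.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    cause: str
    likelihood: float   # 0..1, confidence that this is the root cause
    testability: float  # 0..1, how cheaply it can be isolated and tested
    impact: int         # number of failing cases it would fix

def rank(hypotheses: list) -> list:
    # Assumed scoring: product of the three criteria, highest first.
    return sorted(
        hypotheses,
        key=lambda h: h.likelihood * h.testability * h.impact,
        reverse=True,
    )

ranked = rank([
    Hypothesis("prompt lacks table-footer guidance", 0.9, 0.5, 2),
    Hypothesis("schema allows null totals", 0.6, 1.0, 4),
])
print([h.cause for h in ranked])
```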
Step C: Design Experiments
- Map: Assign each hypothesis to a specific experiment.
- Sequence: Order experiments by priority (high-impact, high-confidence first).
- Define Changes: For each experiment, specify:
- The exact file and line to modify
- The current text and proposed text
- Which cases will test this change
Step D: Select Cases
- Focus Cases: Pick 3-5 cases that best test the hypotheses.
- Regression Guards: Identify 2-3 passing cases that must stay passing.
- Exclusions: Note any cases to ignore (and why).
Step E: Define Success Criteria
- Quantitative: What pass rate are we targeting?
- Qualitative: What visual/structural improvements do we expect?
- Exit Conditions: When do we stop iterating?
Step F: Create Plan
- Generate: Execute §CMD_POPULATE_LOADED_TEMPLATE (Schema: REFINE_PLAN.md).
- Present: Show the plan to the user. Execute AskUserQuestion (multiSelect: false):
"Refinement plan ready. Proceed?"
- "Approved" — Plan is good, begin execution
- "Needs revision" — Adjust the plan first
§CMD_VERIFY_PHASE_EXIT — Phase 2
Output this block in chat with every blank filled:
Phase 2 proof:
- Failure context: ________
- Hypotheses ranked: ________
- Experiments designed: ________
- Cases selected: ________
- Success criteria: ________
- REFINE_PLAN.md written: ________
- User approved: ________
Phase Transition
Execute §CMD_TRANSITION_PHASE_WITH_OPTIONAL_WALKTHROUGH:
completedPhase: "2: Planning"
nextPhase: "3: Validation"
prevPhase: "1: Interrogation"
custom: "Skip to Phase 4: Baseline | Manifest already validated, go straight to baseline"
3. Validation Phase (Single-Fixture Test)
Prove the manifest works before committing to the full loop.
Intent: Execute §CMD_REPORT_INTENT_TO_USER.
- I am moving to Phase 3: Validation (Single-Fixture Test).
- I will run ONE case through the pipeline to verify the manifest.
- If validation fails, I will help fix the manifest interactively.
Step A: Select Test Fixture
- Expand: Resolve casePaths globs to get actual file list.
- Select: Pick the FIRST case for validation.
- Announce: "Running validation with case: [path]"
Step B: Execute Pipeline (Single Fixture)
- Run Extraction: Execute runCommand with {case} substituted.
- If Error: Log 🛑 Validation Failure, ask user to fix runCommand.
- Check Output: Verify outputPath file was created.
- If Missing: Log 🛑 Validation Failure, ask user to fix outputPath.
- Run Overlay (if configured): Execute overlayCommand.
- If Error: Log 🛑 Validation Failure, ask user to fix overlayCommand.
- Run Validators (if configured): Execute each validationScripts entry.
- If Error: Log 🛑 Validation Failure, show which script failed.
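The `{case}` substitution in runCommand can be sketched as follows. Quoting the path with `shlex.quote` is an assumption added here so that paths containing spaces survive a shell invocation; the function name is hypothetical.

```python
import shlex

def build_command(run_command: str, case_path: str) -> str:
    """Substitute the {case} placeholder with a shell-safe case path."""
    return run_command.replace("{case}", shlex.quote(case_path))

print(build_command("run-extraction {case}", "cases/site plan.pdf"))
```

The resulting string could then be handed to `subprocess.run(..., shell=True)`, with a non-zero exit code treated as a validation failure.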
Step C: Validation Result
- If All Passed:
- Log ✅ Validation Success to REFINE_LOG.md.
- Ask: "Validation passed. Where should I save the manifest?"
- Write manifest to specified path (default: alongside workload code).
- Proceed to Phase 4.
- If Any Failed:
- Log 🛑 Validation Failure with details.
- Ask: "Validation failed. What would you like to fix?"
- Update manifest based on user input.
- Loop: Return to Step B and retry (max 3 attempts).
- If 3 failures: Abort with "Please fix the manifest manually and re-run with --manifest <path>."
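The retry behaviour in Step C can be sketched as a bounded loop. Here `run_pipeline` and `fix_manifest` are hypothetical callables standing in for Step B and for the interactive fix with the user.

```python
def validate_with_retries(run_pipeline, fix_manifest, manifest, max_attempts=3):
    """Run the single-fixture pipeline, letting the user fix the manifest between attempts."""
    for attempt in range(1, max_attempts + 1):
        ok, detail = run_pipeline(manifest)  # returns (passed, failure detail)
        if ok:
            return manifest
        if attempt < max_attempts:
            manifest = fix_manifest(manifest, detail)
    # All attempts failed: abort with the message from the protocol.
    raise RuntimeError(
        "Please fix the manifest manually and re-run with --manifest <path>."
    )
```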
§CMD_VERIFY_PHASE_EXIT — Phase 3
Output this block in chat with every blank filled:
Phase 3 proof:
- Test fixture: ________
- Pipeline result: ________
- Manifest saved: ________
- Validation logged: ________
Phase Transition
Execute §CMD_TRANSITION_PHASE_WITH_OPTIONAL_WALKTHROUGH:
completedPhase: "3: Validation"
nextPhase: "4: Baseline"
prevPhase: "2: Planning"
4. Baseline Phase (Initial Metrics)
Establish the starting point before any refinement.
Intent: Execute §CMD_REPORT_INTENT_TO_USER.
- I am moving to Phase 4: Baseline (Initial Metrics).
- I will run ALL cases to establish baseline metrics.
- This is iteration 0 — no changes have been made yet.
If --dry-run: Skip actual execution, show what WOULD happen, then STOP.
Step A: Run Cases
- Expand: Resolve casePaths globs to get full case list.
- Filter (if --case <path> specified): Reduce to just the specified case.
- Execute: For each case:
- Run runCommand
- Run overlayCommand (if configured)
- Run validationScripts (if configured)
- Compare output to expectedPaths (if configured)
- Log: Append 🎯 Iteration Start entry with baseline metrics.
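Expanding casePaths and applying the --case filter could look like this sketch. The function name is hypothetical, and comparing normalized paths is an assumed matching rule; the temporary directory only exists to make the example self-contained.

```python
import glob
import os
import tempfile

def resolve_cases(case_paths, only_case=None):
    """Expand glob patterns into a sorted, de-duplicated file list."""
    files = sorted({f for pattern in case_paths for f in glob.glob(pattern, recursive=True)})
    if only_case is not None:  # --case <path>: keep just that case
        target = os.path.normpath(only_case)
        files = [f for f in files if os.path.normpath(f) == target]
    return files

with tempfile.TemporaryDirectory() as d:
    for name in ("a.json", "b.json", "notes.txt"):
        open(os.path.join(d, name), "w").close()
    pattern = os.path.join(d, "*.json")
    all_cases = [os.path.basename(f) for f in resolve_cases([pattern])]
    one_case = [os.path.basename(f) for f in resolve_cases([pattern], os.path.join(d, "b.json"))]

print(all_cases, one_case)
```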
Step B: Collect Baseline Metrics
- Quantitative (if expectedPaths configured):
- Passing cases: count where output matches expected
- Failing cases: list with diff summary
- Qualitative (always):
- Generate overlay images for visual inspection
- Note any obvious errors visible in overlays
Step C: Present Baseline
- Report: "Baseline: X/Y cases passing (Z%)"
- List Failures: Show which cases failed and why (if known).
- Execute AskUserQuestion (multiSelect: false):
"Baseline: X/Y passing. Ready to begin refinement iteration?"
- "Begin" — Start iterating on refinements
- "Let me review" — I want to inspect the baseline first
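The baseline report line can be derived from per-case results as in this sketch. The input shape (case name mapped to pass/fail) and the rounding to a whole percent are assumptions.

```python
def baseline_summary(results: dict) -> str:
    """Format 'Baseline: X/Y cases passing (Z%)' from {case: passed} results."""
    total = len(results)
    passing = sum(1 for passed in results.values() if passed)
    pct = round(100 * passing / total) if total else 0
    return f"Baseline: {passing}/{total} cases passing ({pct}%)"

print(baseline_summary({"a": True, "b": False, "c": True, "d": True}))
```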
§CMD_VERIFY_PHASE_EXIT — Phase 4
Output this block in chat with every blank filled:
Phase 4 proof:
- Cases executed: ________
- Baseline metrics: ________
- Baseline presented: ________
- User confirmed: ________
Phase Transition
Execute §CMD_TRANSITION_PHASE_WITH_OPTIONAL_WALKTHROUGH:
completedPhase: "4: Baseline"
nextPhase: "5: Iteration Loop"
prevPhase: "3: Validation"
5. Iteration Loop (The Core Cycle)
Analyze → Critique → Suggest → Apply → Measure → Repeat
Intent: Execute §CMD_REPORT_INTENT_TO_USER.
- I am moving to Phase 5: Iteration Loop.
- I will analyze failures, critique visually, suggest edits, and measure impact.
- Mode: [Suggestion / Auto], Max iterations: N
⏱️ Logging Heartbeat (CHECK BEFORE EVERY TOOL CALL)
Before calling any tool, ask yourself:
Have I made 2+ tool calls since my last log entry?
→ YES: Log NOW before doing anything else. This is not optional.
→ NO: Proceed with the tool call.
[!!!] If you make 3 tool calls without logging, you are FAILING the protocol. The log is your brain — unlogged work is invisible work.
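The heartbeat rule is effectively a counter that resets on every log entry. A minimal sketch, with a hypothetical class name:

```python
class Heartbeat:
    """Tracks tool calls since the last log entry."""

    def __init__(self, limit: int = 2):
        self.limit = limit          # log once this many calls are unlogged
        self.calls_since_log = 0

    def must_log_first(self) -> bool:
        # Check this BEFORE every tool call.
        return self.calls_since_log >= self.limit

    def tool_called(self) -> None:
        self.calls_since_log += 1

    def logged(self) -> None:
        self.calls_since_log = 0

hb = Heartbeat()
hb.tool_called()
hb.tool_called()
print(hb.must_log_first())
```

With the default limit of 2, a third unlogged tool call is never reached as long as the check is honoured.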
🔄 For Each Iteration (1 to maxIterations):
Step A: Analyze Failures
- JSON Diff (if expectedPaths configured):
- For each failing case, compute diff between output and expected.
- Categorize errors: missing fields, wrong values, extra fields, structural mismatch.
- Validation Errors (if validationScripts configured):
- Collect error messages from failed validators.
- Log: Append findings to REFINE_LOG.md using appropriate thought triggers.
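The four error categories from the JSON diff can be illustrated with a small recursive comparison. This is a sketch rather than the protocol's diff tool: the category names mirror the list above, while treating a type difference as a structural mismatch and comparing lists by plain equality are simplifying assumptions.

```python
def categorize(expected, actual, path=""):
    """Return (category, path) pairs describing how actual differs from expected."""
    errors = []
    if isinstance(expected, dict) and isinstance(actual, dict):
        for key in expected:
            child = f"{path}.{key}" if path else key
            if key not in actual:
                errors.append(("missing_field", child))
            else:
                errors.extend(categorize(expected[key], actual[key], child))
        for key in actual:
            if key not in expected:
                errors.append(("extra_field", f"{path}.{key}" if path else key))
    elif type(expected) is not type(actual):
        errors.append(("structural_mismatch", path))
    elif expected != actual:
        errors.append(("wrong_value", path))
    return errors

expected = {"total": 10, "items": [1], "name": "x"}
actual = {"total": 12, "items": {"0": 1}, "extra": 1}
print(categorize(expected, actual))
```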
Step B: Visual Critique (Reviewer Agent)
- Select Pages: Choose pages for review:
- All failing pages (from Step A analysis)
- Random sample of N pages if many failures (default: 5)
- Or specific pages flagged by user
- Prepare Images: Download/copy overlay images to tmp/:
- Full-page overlays: tmp/layout-overlay-page-{N}.png
- Layout JSON: tmp/layout.json
- (Optional) Quadrant tiles if precision needed
- Launch Reviewer Agent:
Task(subagent_type="reviewer", prompt=`
Review extraction results for case ${caseId}.
**Images to analyze** (use Read tool):
${overlayPaths.map(p => `- ${p}`).join('\n')}
**Layout JSON**:
- ${layoutJsonPath}
**Pages**: ${selectedPages.join(', ')}
Analyze each overlay image, cross-reference with layout JSON, and return a CritiqueReport JSON.
Run ALL checks from your §CRITIQUE_CHECKLIST.
Include actionable recommendations for each issue found.
`)
- Process Results:
- Parse CritiqueReport JSON from task result
- Log 👁️ Critique entry with:
- Overall score
- Issue count by type
- Top 3 recommendations
- Feed recommendations into Step C (Hypothesis)
- Manual Override (if --manual-critique flag):
- If overallScore < 70: Present images to user for confirmation
- User can add issues the agent missed
- User can reject false positives
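Processing the reviewer's report might look like the sketch below. Only `overallScore` is named in this document; the `issues`, `type`, and `recommendation` fields are hypothetical stand-ins, and the real shape is defined by SCHEMA_CRITIQUE_REPORT.json.

```python
from collections import Counter

def summarize_critique(report: dict, manual_critique: bool = False) -> dict:
    """Build the fields logged in the Critique entry (assumed report shape)."""
    issues = report.get("issues", [])  # hypothetical field
    score = report["overallScore"]
    return {
        "overallScore": score,
        "issueCounts": dict(Counter(i["type"] for i in issues)),
        "topRecommendations": [i["recommendation"] for i in issues][:3],
        # Manual override applies only with --manual-critique and a score below 70.
        "needsUserConfirmation": manual_critique and score < 70,
    }

report = {
    "overallScore": 62,
    "issues": [
        {"type": "table_bounds", "recommendation": "r1"},
        {"type": "table_bounds", "recommendation": "r2"},
        {"type": "scope", "recommendation": "r3"},
        {"type": "structural", "recommendation": "r4"},
    ],
}
summary = summarize_critique(report, manual_critique=True)
print(summary["issueCounts"], summary["topRecommendations"])
```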
Step C: Form Hypothesis
- Synthesize: Combine JSON diff + validation errors + visual critique.
- Hypothesize: "The prompt lacks guidance on [X], causing [Y] errors."
- Log: Append 🔬 Hypothesis entry.
Step D: Suggest Edit
- Read Prompts: Load files from promptPaths.
- Generate Suggestion: Based on hypothesis, propose specific edit.
- Include: file, line range, current text, proposed text.
- Log: Append 🔧 Suggestion entry.
Step E: Apply Edit
- If Suggestion Mode:
- Present the suggested edit to the user.
- Ask: "Apply this edit? [Yes / Modify / Skip]"
- If Yes: Apply via Edit tool, log ✏️ Edit Applied.
- If Modify: Get user's version, apply, log.
- If Skip: Log 🅿️ Parking Lot, continue to next iteration.
- If Auto Mode:
- Apply the edit directly via Edit tool.
- Log ✏️ Edit Applied (Auto).
Step F: Measure Impact
- Re-run: Execute all cases with updated prompts.
- Compare: Calculate new metrics vs previous iteration.
- Log: Append 📊 Result entry.
Step G: Regression Check
- If metrics improved or neutral: Continue.
- If metrics degraded:
- Log ⚠️ Regression Detected with details (which cases regressed, by how much).
- Important: Do NOT revert the edit via git. The log is append-only — the failed experiment is valuable data.
- If Auto Mode: Log the regression, formulate alternative hypothesis, continue to next iteration with a different approach.
- If Suggestion Mode: Ask user: "This edit caused regression. Options: (A) Accept tradeoff and continue, (B) Try different hypothesis next iteration, (C) Stop and analyze."
- The next iteration's hypothesis should account for why this one failed.
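Identifying which cases regressed can be sketched by comparing per-case results between iterations. The input shape (case name mapped to pass/fail) and the function name are assumptions.

```python
def regressed_cases(previous: dict, current: dict) -> list:
    """Cases that passed in the previous iteration but fail (or are missing) now."""
    return sorted(
        case for case, passed in previous.items()
        if passed and not current.get(case, False)
    )

prev = {"a": True, "b": True, "c": False}
curr = {"a": True, "b": False, "c": True}
print(regressed_cases(prev, curr))
```

A non-empty result would be logged as a regression even when the overall pass count is unchanged, as in this example where one case broke and another was fixed.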
Step H: Convergence Check
- If all cases passing: Log 🏁 Iteration Complete (Converged), exit loop.
- If max iterations reached: Log 🏁 Iteration Complete (Max Reached), exit loop.
- If Auto Mode and no improvement for 2 iterations: Log 🏁 Iteration Complete (Plateau), exit loop.
- Otherwise: Continue to next iteration.
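The convergence rules can be written as one function over the history of passing counts. This sketch assumes index 0 of the history is the baseline and interprets "no improvement for 2 iterations" as neither of the last two iterations beating the count before them; both are assumptions.

```python
from typing import Optional

def check_convergence(passing_history: list, total_cases: int,
                      max_iterations: int, auto_mode: bool) -> Optional[str]:
    """Return the exit reason, or None to continue iterating."""
    iteration = len(passing_history) - 1  # index 0 is the baseline
    if passing_history[-1] == total_cases:
        return "Converged"
    if iteration >= max_iterations:
        return "Max Reached"
    if auto_mode and iteration >= 2 and max(passing_history[-2:]) <= passing_history[-3]:
        return "Plateau"
    return None

print(check_convergence([2, 3, 3, 3], total_cases=5, max_iterations=5, auto_mode=True))
```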
§CMD_VERIFY_PHASE_EXIT — Phase 5
Output this block in chat with every blank filled:
Phase 5 proof:
- Iterations completed: ________
- Each iteration logged: ________
- Exit condition: ________
- REFINE_LOG.md entries: ________
Phase Transition
Execute §CMD_TRANSITION_PHASE_WITH_OPTIONAL_WALKTHROUGH:
completedPhase: "5: Iteration Loop"
nextPhase: "6: Synthesis"
prevPhase: "4: Baseline"
custom: "Re-run baseline comparison | Compare current state to original baseline"
6. Synthesis Phase (Debrief)
Summarize the refinement session.
1. Announce Intent
Execute §CMD_REPORT_INTENT_TO_USER.
- I am moving to Phase 6: Synthesis.
- I will §CMD_PROCESS_CHECKLISTS (if any discovered checklists exist).
- I will §CMD_GENERATE_DEBRIEF_USING_TEMPLATE following assets/TEMPLATE_REFINE.md EXACTLY.
- I will §CMD_REPORT_RESULTING_ARTIFACTS to list outputs.
- I will §CMD_REPORT_SESSION_SUMMARY.
STOP: Output the block above first.
2. Execution — SEQUENTIAL, NO SKIPPING
[!!!] CRITICAL: Execute these steps IN ORDER. Do NOT skip to step 3 or 4 without completing step 1. The debrief FILE is the primary deliverable — chat output alone is not sufficient.
Step 0 (CHECKLISTS): Execute §CMD_PROCESS_CHECKLISTS — process any discovered CHECKLIST.md files. Read ~/.claude/directives/commands/CMD_PROCESS_CHECKLISTS.md for the algorithm. Skips silently if no checklists were discovered. This MUST run before the debrief to satisfy ¶INV_CHECKLIST_BEFORE_CLOSE.
Step 1 (THE DELIVERABLE): Execute §CMD_GENERATE_DEBRIEF_USING_TEMPLATE (Dest: REFINE.md).
- Write the file using the Write tool. This MUST produce a real file in the session directory.
- Populate iteration history table.
- List all edits made with impact.
- Document remaining failures.
- Capture insights and recommendations.
Step 2: Execute §CMD_REPORT_RESULTING_ARTIFACTS — list all created files in chat.
- REFINE_PLAN.md — Experiment design and hypotheses
- REFINE_LOG.md — Full experiment journal
- REFINE.md — Session debrief
- refine.manifest.json — Workload configuration (if created)
- Modified prompt/schema files
Step 3: Execute §CMD_REPORT_SESSION_SUMMARY — 2-paragraph summary in chat.
Step 4: Execute §CMD_WALK_THROUGH_RESULTS with this configuration:
§CMD_WALK_THROUGH_RESULTS Configuration:
mode: "results"
gateQuestion: "Refinement complete. Walk through remaining issues and recommendations?"
debriefFile: "REFINE.md"
templateFile: "~/.claude/skills/refine/assets/TEMPLATE_REFINE.md"
§CMD_VERIFY_PHASE_EXIT — Phase 6 (PROOF OF WORK)
Output this block in chat with every blank filled:
Phase 6 proof:
- REFINE.md written: ________ (real file path)
- Tags line: ________
- Artifacts listed: ________
- Session summary: ________
If ANY blank above is empty: GO BACK and complete it before proceeding.
Step 5: Execute §CMD_DEACTIVATE_AND_PROMPT_NEXT_SKILL — deactivate session with description, present skill progression menu.
Post-Synthesis: If the user continues talking (without choosing a skill), obey §CMD_CONTINUE_OR_CLOSE_SESSION.
Appendix: SDK CLI (Required Tooling)
[!!!] CRITICAL: Use the @finch/sdk CLI for all extraction operations. Do NOT write custom scripts.
The SDK CLI reads configuration from environment variables and provides a complete interface for the refinement workflow.
Setup
The CLI uses dotenv and reads from .env:
S3_ENDPOINT=http://localhost:9000
S3_BUCKET=finch-uploads
TEMPORAL_ADDRESS=localhost:7233
TEMPORAL_NAMESPACE=finch
Pre-flight: Verify Services
Before running any extraction, verify the required services are running:
curl -s http://localhost:9000/minio/health/live && echo "MinIO: OK" || echo "MinIO: NOT RUNNING"
curl -s http://localhost:8080/api/v1/namespaces | grep -q finch && echo "Temporal: OK" || echo "Temporal: NOT RUNNING"
ps aux | grep -q "[t]s-node.*worker.ts" && echo "Worker: OK" || echo "Worker: NOT RUNNING"
If worker is not running, start it:
# Run in the background:
yarn workspace @finch/temporal dev &
# Or run in the foreground (in a separate terminal):
yarn workspace @finch/temporal dev
If Docker services are down, start them:
yarn dev:deps
Timeout Handling
Large PDFs (50+ pages) may take 15-30 minutes for full extraction. If the CLI times out:
- Check if workflow is still running:
curl -s "http://localhost:8080/api/v1/namespaces/finch/workflows" | \
jq '.executions[:3] | .[] | {workflowId: .execution.workflowId, status: .status}'
- Wait for existing workflow with longer timeout:
npx tsx packages/sdk/src/cli.ts estimate wait <workflowId> --timeout 1800000
- Download overlays after completion:
npx tsx packages/sdk/src/cli.ts estimate layout overlays <caseId> -o tmp/overlays/<caseId>
Common Commands
npx tsx packages/sdk/src/cli.ts estimate run <caseId> --debug-overlay -o tmp/overlays/<caseId>
npx tsx packages/sdk/src/cli.ts estimate layout run <caseId> --debug-overlay --wait
npx tsx packages/sdk/src/cli.ts estimate layout overlays <caseId> -o tmp/overlays
npx tsx packages/sdk/src/cli.ts estimate layout get <caseId> -o tmp/layout.json
npx tsx packages/sdk/src/cli.ts estimate status <workflowId>
npx tsx packages/sdk/src/cli.ts estimate review run <caseId> --wait -o tmp/review.json
Why CLI Over Scripts
| Aspect | CLI | Custom Script |
|---|---|---|
| Config | Env vars (.env) | Hardcoded values |
| Namespace | Auto-detected | Easy to get wrong |
| Error handling | Structured JSON | Ad-hoc |
| Maintenance | One place | N scripts to update |
Appendix: Reviewer Agent
Visual critique is handled by the reviewer subagent (~/.claude/agents/reviewer.md).
Inputs:
- Local paths to overlay images (prepared in tmp/)
- Path to layout JSON file
- List of page numbers to review
Output: Structured CritiqueReport JSON (schema: ~/.claude/skills/refine/assets/SCHEMA_CRITIQUE_REPORT.json)
Checklist: The agent runs ALL checks from §CRITIQUE_CHECKLIST:
- Table bounds (top edge, bottom edge, group headers, comments, totals)
- Scope detection (headers, totals, overlaps, types, continuations)
- Structural (diagrams, metrics, breadcrumbs)
- JSON-visual consistency (box matches, counts, phantoms)
Legacy: If critiqueScript is specified in the manifest, that script is used instead of the reviewer agent.
Appendix: Invariants
The protocol respects these invariants:
- §INV_MANIFEST_COLOCATED: Manifests live with workload code, not in a central registry.
- §INV_SURGICAL_SUGGESTIONS: The suggestion LLM sees actual prompt content.
- §INV_NO_SILENT_REGRESSION: Auto-mode flags metric degradation.
- §INV_VALIDATE_BEFORE_ITERATE: Single-case validation before the loop.
- §INV_VISUAL_ONLY_VALID: Workloads without expectedPaths are valid.
- §INV_SDK_CLI_OVER_SCRIPTS: Use @finch/sdk CLI for extraction operations. Do NOT write custom tmp/ scripts for upload, extract, wait, or download — the CLI already does this with proper config handling.