The 5-step build pipeline (PRE-FLIGHT, DATA, STRATEGY, OPTIMIZE, file structure). Use when creating a new experiment — pre-flight documentation, building datasets, implementing strategies.
Emergency stop — kill all agents, clean APC, restore references, prune worktrees, verify DB clean. Use when stopping a cycle or recovering from a crash.
| name | instruction-tuning |
| description | Self-Improving Instruction Tuning |
Sub-agents are clones of .claude/. Every sub-agent branches from HEAD and inherits all rules, examples, agent definitions, skills, and reference files. When the orchestrator updates .claude/ and commits, the next sub-agent born from HEAD gets every change automatically. There is no separate "orchestrator instructions" vs "sub-agent instructions" — they are the same repo. Updating yourself IS updating the clones. This is why commit-before-launch discipline matters: uncommitted changes don't propagate.
The core question: "How would I build this correctly?"
The orchestrator watches each sub-agent attempt, identifies where it diverges from correct behavior, and asks: "What generalized instruction would have prevented this?" Then patches .claude/ and reruns. The loop converges when the sub-agent produces correct output without any instruction-specific help.
The agent output is throwaway. The instruction improvements are the product.
There is NO backwards compatibility. The experiments are the test harness for .claude/. Every iteration can and should change anything — instructions, utilities, infrastructure, shared code, agent definitions, hooks, skills. Nothing is sacred except the principle that the instructions must improve. If an example doesn't teach the right pattern, rewrite it. If a shared utility makes the wrong thing easy, restructure it. If a rule is ignored because it's a paragraph buried in 10KB of text, promote it to a hard stop. The orchestrator must not hesitate to make breaking changes — the next sub-agent starts fresh from HEAD.
Before running this skill, confirm with the user:
/tmp/. These are cleaned up after completion.
Proceed only with explicit user confirmation.
max_concurrent_agents: 4 # shared DB constraint
max_cycles: 5 # stop after 5 instruction-improvement cycles
convergence_target: 24 # out of 28 reviewer points
agent_budget: 200 # tool calls per experiment agent
reviewer_budget: 40 # tool calls per reviewer
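The limits above can be carried as one typed object so the orchestrator fails fast on a nonsensical value. This is a minimal sketch: the class name `TuningConfig` and the validation rules are illustrative assumptions, not part of the skill.

```python
from dataclasses import dataclass

REVIEWER_MAX_POINTS = 28  # scale that convergence_target is measured against


@dataclass(frozen=True)
class TuningConfig:
    """Hypothetical container for the limits listed above."""

    max_concurrent_agents: int = 4  # shared DB constraint
    max_cycles: int = 5             # instruction-improvement cycles
    convergence_target: int = 24    # out of 28 reviewer points
    agent_budget: int = 200         # tool calls per experiment agent
    reviewer_budget: int = 40       # tool calls per reviewer

    def __post_init__(self) -> None:
        # Assumed sanity checks: a target above the scale can never be reached.
        if not 0 < self.convergence_target <= REVIEWER_MAX_POINTS:
            raise ValueError("convergence_target must be in 1..28")
        for name in ("max_concurrent_agents", "max_cycles",
                     "agent_budget", "reviewer_budget"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")
```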
The user provides a list of hypothesis prompts. Each prompt describes a causal relationship to test.
## Hypothesis 1: [Signal Type] — [Instrument/Universe]
When [condition based on available data], take [position].
The [causal mechanism] creates [expected return pattern].
SYMBOLS: [instrument(s)]
DATE_RANGE: [start] to [end]
## Hypothesis 2: [Different Signal Type] — [Different Universe]
[Condition] predicts [outcome]. [Causal mechanism].
SYMBOLS: [instrument(s)]
DATE_RANGE: [start] to [end]
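A prompt list in this template can be split mechanically before the first launch. The sketch below assumes comma-separated symbols and a `start to end` date range with no embedded spaces; `Hypothesis` and `parse_hypotheses` are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass
class Hypothesis:
    title: str
    body: str
    symbols: list
    date_range: tuple


def parse_hypotheses(text: str) -> list:
    """Split on '## Hypothesis N:' headings and extract the two tagged fields."""
    blocks = re.split(r"(?m)^## Hypothesis \d+:\s*", text)[1:]
    parsed = []
    for block in blocks:
        title, _, rest = block.partition("\n")
        symbols = re.search(r"(?m)^SYMBOLS:\s*(.+)$", rest)
        dates = re.search(r"(?m)^DATE_RANGE:\s*(\S+)\s+to\s+(\S+)\s*$", rest)
        if symbols is None or dates is None:
            raise ValueError(f"hypothesis {title!r} lacks SYMBOLS or DATE_RANGE")
        # Everything that is not a tagged field is the causal statement.
        body = "\n".join(
            line for line in rest.splitlines()
            if line.strip() and not line.startswith(("SYMBOLS:", "DATE_RANGE:"))
        )
        parsed.append(Hypothesis(
            title=title.strip(),
            body=body,
            symbols=[s.strip() for s in symbols.group(1).split(",")],
            date_range=(dates.group(1), dates.group(2)),
        ))
    return parsed
```

Rejecting a block that lacks either field keeps a malformed prompt from reaching an agent, where the failure would cost a full run.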
One hypothesis at a time. Iterate until the agent produces clean output, then move to the next.
Before launching, ensure HEAD is clean and instructions are pruned:
# 1. Delete previous experiment output (worktrees branch from HEAD)
rm -rf experiments/{experiment_name}/
# 2. Prune .claude/ — check for contradictions, duplication, stale references
# Measure: wc -c .claude/rules/*.md .claude/CLAUDE.md
# If total > 50KB, consolidate before launching (agents pay per-turn)
wc -c .claude/rules/*.md .claude/CLAUDE.md
# 3. Commit clean HEAD
git add -A && git commit -m "Clean for {hypothesis} iter {N}"
# 4. Verify clean state
python -m shared.agent_protocol clean
python -m shared.db_monitor status # verify 2 connections, no queries
WorktreeCreate hook creates worktrees at /tmp/claudodidact-worktrees/ (outside repo). Agents can't see reference_experiments/ or main-repo files. No mv needed.
You are an experiment agent. Follow .claude/agents/experiment-agent.md exactly.
HYPOTHESIS: {hypothesis}
SYMBOLS: {symbols}
DATE_RANGE: {date_range}
EXPERIMENT_DIR: experiments/{NN}_{experiment_name}
CYCLE: {hypothesis}_iter{N}
APC_CHANNEL: {hypothesis}_iter{N}_experiment
Read all mandatory files in .claude/ before writing code. Follow the 5-step pipeline.
Fill the commit gate matrix with specific evidence.
IMPORTANT: Before writing ANY code, run `ls shared/` and check what functions already exist.
Run `python -m shared.system_monitor` before starting.
Check symbol density with `shared/db_monitor.get_density(symbol)`.
Test on 5 trading days first before expanding to the full date range.
All generated files (parquet, etc.) go in output/ subdirectory.
After STEP 3 passes, ALWAYS run STEP 5b (Optuna parameter optimization).
A null at one parameterization says nothing — search the space.
Use shared/optuna_utils.py. Train-only objective. Seed with baseline. Re-verify after.
See shared/optuna_utils.py for the API.
Launch with model: opus, isolation: worktree, run_in_background: true.
See docs/orchestrator-process.md § Rule 3 for the full protocol.
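The three naming fields in the prompt share one stem, so deriving them together avoids a channel that does not match its cycle. A small sketch, assuming `{hypothesis}` is a short ID such as `H1` and `{NN}` is a zero-padded two-digit number; `launch_fields` is a hypothetical helper.

```python
def launch_fields(hypothesis_id: str, iteration: int,
                  experiment_number: int, experiment_name: str) -> dict:
    """Build EXPERIMENT_DIR, CYCLE and APC_CHANNEL from one set of inputs."""
    cycle = f"{hypothesis_id}_iter{iteration}"
    return {
        # Two-digit zero padding for {NN} is an assumption.
        "EXPERIMENT_DIR": f"experiments/{experiment_number:02d}_{experiment_name}",
        "CYCLE": cycle,
        "APC_CHANNEL": f"{cycle}_experiment",
    }
```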
Do ALL productive work BEFORE entering the monitor loop (prepare reviewer prompt, read reference, prepare adversary prompt). Then:
sleep 180 && python -m shared.apc read <channel> --new
Do NOT use apc monitor or apc wait with run_in_background — their output goes to a file the user never sees. The sleep-then-read loop is the ONLY way to surface progress in the chat.
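The sleep-then-read loop can be sketched in Python as below. Only the `python -m shared.apc read <channel> --new` command comes from this document; `poll_channel`, the caller-supplied `is_done` predicate, and the `max_polls` cap are hypothetical, since no completion marker is specified here.

```python
import subprocess
import sys
import time
from typing import Callable, Iterator


def read_new_messages(channel: str) -> str:
    """Run the read command from the text and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-m", "shared.apc", "read", channel, "--new"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def poll_channel(
    channel: str,
    is_done: Callable[[str], bool],
    max_polls: int,
    interval_s: float = 180.0,
    read: Callable[[str], str] = read_new_messages,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Sleep, read, yield any new output; stop on is_done() or after max_polls."""
    for _ in range(max_polls):
        sleep(interval_s)
        output = read(channel)
        if output.strip():
            yield output  # the caller prints this so progress shows in the chat
        if is_done(output):
            return
```

Each poll runs in the foreground and hands its output back to the caller, which is the property the rule above is protecting. Passing `read` and `sleep` as parameters also lets the loop be exercised without a live channel.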
prev_decision, raw SQL) for the instruction improvement step.
When the agent completes, scan the output for bugs. The orchestrator does this FIRST, before launching reviewer/adversary. Check:
CorporateActionLedger / build_default_split_ledger() as PRIMARY? Magnitude-only = will miss forward splits on leveraged ETFs.
If the orchestrator finds a bug directly, skip reviewer/adversary: diagnose the instruction gap, patch it, delete the experiment, and rerun immediately. This saves ~160K tokens per iteration.
Only launch reviewer + adversary when the orchestrator cannot find bugs in a quick scan (~5 min). That's when the experiment needs deeper forensic analysis.
For each bug found:
.claude/ for contradictions with all patches
rm -rf experiments/{experiment_name}/
git add -A && git commit -m "H{X} iter {N}: found {bug}, patched {file}"
# Now HEAD is clean — relaunch from Step 2
Repeat until the agent produces clean output (no bugs found in Step 4).
Once the agent produces a clean experiment (STEP 3 passes, no bugs in diagnose), relaunch with Optuna enabled. The agent runs STEPS 0-3 with hardcoded params, then STEP 5b (200 trials max). This is slower (~200 extra tool calls) so only do it after the instruction set is producing correct output.
The Optuna pass answers: "Does any parameterization of this hypothesis produce signal?" A null at hardcoded params is not a null result — a null across 200 Optuna trials IS.
If Optuna finds better params, the agent re-verifies (STEP 3 again with optimized params) and reports baseline vs optimized side-by-side.
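The protocol (train-only objective, baseline seeded as the first trial, re-verification afterwards) can be illustrated without the project's code. The sketch below is not the `shared/optuna_utils.py` API, which is not shown here, and it deliberately swaps Optuna for a seeded random search from the standard library so that it stays self-contained. All names are hypothetical.

```python
import random
from typing import Callable, Dict, Tuple

Params = Dict[str, float]


def search_train_only(
    train_score: Callable[[Params], float],
    baseline: Params,
    bounds: Dict[str, Tuple[float, float]],
    n_trials: int = 200,
    seed: int = 0,
) -> Tuple[Params, float]:
    """Maximize train_score; trial 0 is the baseline, so the winner never scores below it on train."""
    rng = random.Random(seed)
    best_params, best_score = dict(baseline), train_score(baseline)
    for _ in range(n_trials - 1):
        candidate = {k: rng.uniform(lo, hi) for k, (lo, hi) in bounds.items()}
        score = train_score(candidate)  # the objective only ever sees train data
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score


def side_by_side(verify: Callable[[Params], float],
                 baseline: Params, optimized: Params) -> Dict[str, float]:
    """Re-run verification for both parameter sets so the report shows them together."""
    return {"baseline": verify(baseline), "optimized": verify(optimized)}
```

Seeding the search with the baseline means a run that finds nothing better returns the hardcoded parameters unchanged, which makes a null across all trials easy to read off the report.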
Only run after the agent passes the orchestrator's quick scan AND Optuna has run.
# Restore references BEFORE launching reviewer/adversary (they need them)
mv /tmp/claudodidact_references reference_experiments
Launch in parallel:
If they find new bugs, patch and rerun. If they can't break it, the hypothesis is done.
Before committing instruction changes, the orchestrator MUST run this checklist EVERY cycle:
8a. Measure token cost:
wc -c .claude/rules/*.md .claude/CLAUDE.md # Target: under 50KB total
This is the highest per-turn cost — every byte loads on every turn for every agent. Sub-agents inherit ALL rules regardless of paths: frontmatter (#8395), so content reduction is the only real optimization for sub-agent token cost.
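A per-file breakdown makes it easier to see which rule file to consolidate first. This sketch mirrors the `wc -c` command in Python; the function names are hypothetical, and reading "50KB" as 50 × 1024 bytes is an assumption.

```python
from pathlib import Path

BUDGET_BYTES = 50 * 1024  # "under 50KB total"; 1 KB = 1024 bytes is assumed


def always_loaded_bytes(claude_dir: Path) -> dict:
    """Byte size of each always-loaded file: rules/*.md plus CLAUDE.md."""
    files = sorted((claude_dir / "rules").glob("*.md"))
    main = claude_dir / "CLAUDE.md"
    if main.exists():
        files.append(main)
    return {f.relative_to(claude_dir).as_posix(): f.stat().st_size for f in files}


def over_budget(sizes: dict, budget: int = BUDGET_BYTES) -> bool:
    """True when the always-loaded files exceed the per-turn budget."""
    return sum(sizes.values()) > budget
```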
8b. Duplication audit:
# Search for concepts that appear in multiple files
grep -rl "key_term_from_patch" .claude/rules/ .claude/reference/ .claude/agents/
Decision tree:
reference/ (loaded on-demand, not auto-loaded per turn)
8c. Content classification (what to keep vs. move):
8d. Cross-reference validation:
# Verify all cross-references point to actual files
grep -r "reference/" .claude/rules/ | grep -o 'reference/[^ )]*' | sort -u
# Check each file exists
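The "check each file exists" step can be automated. This sketch assumes every `reference/...` mention resolves relative to `.claude/`; `missing_references` is a hypothetical helper, and stripping trailing punctuation goes slightly beyond what the grep pattern does.

```python
import re
from pathlib import Path

# Same idea as the grep pattern: read until a space, newline, or ')'.
REFERENCE_PATTERN = re.compile(r"reference/[^\s)]*")


def missing_references(claude_dir: Path) -> list:
    """Return reference/ paths mentioned under rules/ that do not exist on disk."""
    mentioned = set()
    for rule_file in (claude_dir / "rules").rglob("*"):
        if rule_file.is_file():
            text = rule_file.read_text(encoding="utf-8", errors="ignore")
            for ref in REFERENCE_PATTERN.findall(text):
                # Drop sentence punctuation and backticks glued to the path.
                mentioned.add(ref.rstrip(".,;:`'\""))
    return sorted(ref for ref in mentioned if not (claude_dir / ref).exists())
```

An empty result means every cross-reference in the rules resolves.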
8e. Generalization check: Re-read each patch. Does it state a principle or a recipe? Can a coding agent extrapolate to a novel case? (See Step 5 self-check.)
8f. Scoring alignment: Confirm reviewer scoring criteria still match updated rules. A new rule without a corresponding scoring check is unenforceable.
8g. End-to-end read: Read all affected files after patching — a patch at line 50 may contradict something at line 200.
Cycle {N} Results:
| Hypothesis | A-class | C-class | Verification | Total | Pass? |
|-----------|---------|---------|-------------|-------|-------|
| H1 | /14 | /8 | /6 | /28 | Y/N |
| H2 | /14 | /8 | /6 | /28 | Y/N |
| ... | ... | ... | ... | ... | ... |
Convergence: {count passing} / {total} >= {convergence_target}/28
Convergence reached when: Fresh agents (not the same ones from earlier cycles) score 24+/28 on the reviewer without any hints or extra guidance beyond the instruction files.
If not converged: Go to Cycle N+1 with updated instructions.
If converged OR max_cycles reached: Stop and produce the final report.
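The stop rule can be written as one function. The sketch assumes "converged" means every hypothesis in the cycle meets the target, which is one reading of the convergence line above. `next_action` is a hypothetical name.

```python
from typing import Mapping


def next_action(scores: Mapping[str, int], cycle: int,
                convergence_target: int = 24, max_cycles: int = 5) -> str:
    """Decide what follows a cycle, given reviewer scores (out of 28) per hypothesis."""
    # Assumption: all hypotheses must reach the target, not just a majority.
    converged = bool(scores) and all(s >= convergence_target for s in scores.values())
    if converged:
        return "stop: converged"
    if cycle >= max_cycles:
        return "stop: max_cycles reached"
    return f"continue: cycle {cycle + 1}"
```

The scores passed in should come from fresh agents, as the convergence rule requires; the function itself cannot verify that.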
Before updating any instructions, utilities, or code, the orchestrator asks: "How would I do everything better?"
After each run (experiment + reviewer + adversary), fill this grid. Each row is a question the orchestrator asks itself. The answers drive what gets patched.
| Area | Question | This Run | Action |
|---|---|---|---|
| Sub-agent instructions | Did the agent follow the instructions correctly? Where did it diverge? | ||
| Sub-agent instructions | What instruction was missing that would have prevented the divergence? | ||
| Sub-agent instructions | What instruction exists but was too vague, too long, or contradicted by another? | ||
| Orchestrator process | Did monitoring catch problems early enough? What signal was missed? | ||
| Orchestrator process | Was the diagnose step thorough enough? What did the reviewer/adversary find that I missed? | ||
| Orchestrator process | Did I waste tokens? (unnecessary polls, duplicate work, over-monitoring, under-monitoring) | ||
| Paradigm example | Does the example show the pattern the agent needed? What's missing? | ||
| Paradigm example | Did the agent invent something that should be in the example for the next agent? | ||
| Shared infrastructure | Did the agent work around a gap in shared/? Should that be a utility? | ||
| Verification | Did Check 6/7 catch what they should? What slipped through? | ||
| Verification | Are the reviewer scoring criteria aligned with the current rules? | ||
| Optimization | Did the agent run Optuna (MANDATORY)? If not, that's a failure. If so: train-only objective, baseline seeded, re-verified, gap analyzed? Does the report show baseline vs optimized? | ||
| Optimization | Did the agent have sufficient Optuna instructions? What pattern was missing? | ||
| Token efficiency | Can any always-loaded rule be moved behind a paths filter or into reference/? | ||
| Tool call efficiency | How many tool calls did the agent use vs budget? Where did it spend the most? Were there wasted retries, unnecessary reads, or repeated queries? | ||
| Tool call patterns | Did the agent read files it didn't need? Did it restart scripts unnecessarily? Were there patterns the instructions could prevent (e.g., always reading X before Y)? | ||
| Generalization | Is every proposed patch a general principle, not a specific fix? |
Only after filling this grid should the orchestrator update instructions, utilities, example code, or infrastructure. The grid is the input; the patches are the output.
After each cycle:
Commit: (pending) in the cycle log means the orchestrator forgot this step.
docs/curriculum-state.md with the cycle log entry and commit hash.
.claude/hooks/cleanup-research-agents.sh
python -m shared.agent_protocol clean
After the final cycle:
# Instruction Tuning Report
## Summary
- Cycles run: {N}
- Converged: YES/NO
- Final scores: [table]
## Instruction Changes by Cycle
[Log of all changes]
## Remaining Gaps
[Any known issues the instructions don't yet cover]
## Recommendation
[Whether the instruction set is ready for production use]
.claude/rules/
The full scorecard combines reviewer checks with process checks:
| # | Check | Source | Max |
|---|---|---|---|
| 1-7 | A-class temporal checks | Reviewer Audit 1 | 14 |
| 8-11 | C-class accounting checks | Reviewer Audit 2 | 8 |
| 12-14 | Verification completeness | Reviewer Audit 3 | 6 |
| 15 | Pre-flight document exists | Process | 1 |
| 16 | Reference experiments read before coding | Process | 1 |
| 17 | No hard-stop violations during execution | Process | 2 |
| 18 | Commit gate matrix filled with specific evidence | Process | 1 |
| 19 | No retry-on-failure behavior | Process | 1 |
| 20 | Clean worktree state on completion | Process | 1 |
| Total | | | 35 |
Process checks (15-20) are scored by the orchestrator, not the reviewer.
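The table's arithmetic can be checked in code: the three reviewer audits sum to the 28 points that `convergence_target` is measured against, and the process checks add 7 more for 35. The row maxima below are copied from the table; the `totals` helper and the scorer labels are illustrative.

```python
from typing import Dict, Mapping, Tuple

# (max points, scored by) for each scorecard row.
SCORECARD: Dict[str, Tuple[int, str]] = {
    "1-7 A-class temporal checks": (14, "reviewer"),
    "8-11 C-class accounting checks": (8, "reviewer"),
    "12-14 Verification completeness": (6, "reviewer"),
    "15 Pre-flight document exists": (1, "orchestrator"),
    "16 Reference experiments read before coding": (1, "orchestrator"),
    "17 No hard-stop violations during execution": (2, "orchestrator"),
    "18 Commit gate matrix filled with specific evidence": (1, "orchestrator"),
    "19 No retry-on-failure behavior": (1, "orchestrator"),
    "20 Clean worktree state on completion": (1, "orchestrator"),
}


def totals(awarded: Mapping[str, int]) -> Dict[str, int]:
    """Sum awarded points per scorer, rejecting values outside a row's range."""
    out = {"reviewer": 0, "orchestrator": 0}
    for check, points in awarded.items():
        max_points, scorer = SCORECARD[check]
        if not 0 <= points <= max_points:
            raise ValueError(f"{check}: {points} is outside 0..{max_points}")
        out[scorer] += points
    out["total"] = out["reviewer"] + out["orchestrator"]
    return out
```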