agent-validation-v420
Agent validation overhaul: reward weight overrides, fitness decline gate, pinned data, staged experiments
| Field | Value |
|---|---|
| name | agent-validation-v420 |
| description | Agent validation overhaul: reward weight overrides, fitness decline gate, pinned data, staged experiments |
| author | Claude Code |
| date | "2026-02-22T00:00:00.000Z" |
| version | v4.2.0 |

| Item | Details |
|---|---|
| Date | 2026-02-22 |
| Goal | Stop agents from burning money ($150+ wasted) — give them effective levers, pin data for reproducibility, validate cheaply before scaling |
| Environment | Google Colab A100, Alpaca Trading v4.2.0, Python 3.10 |
| Status | Implementation complete, awaiting Colab validation |
Three agent validation experiments (Jan 31, Feb 18, and Feb 19, 2026) burned $150+ in GPU time and produced zero useful results. Root cause: agents had only ONE lever (entropy), LR adjustment was a confirmed no-op (the cosine scheduler overwrites it), and the Reward Engineer was explicitly read-only. Agents could diagnose overtrading, DSR dominance, and direction collapse, but had no way to fix any of them. In addition, training data was not pinned between experiments, which let the baseline drift by 76% between runs.
Files: vectorized_env.py, multi_agent.py
Agents can now adjust reward component weights mid-training via set_reward_weight_override():
{direction, magnitude, pnl, stop_tp, exploration, slippage, drawdown, dsr}

Reward Engineer prompt rewritten from read-only diagnostic to actionable:
# When to adjust (ONLY when clear evidence exists):
# - Overtrading (HOLD < 20%): increase slippage +0.02 to +0.05
# - DSR dominating P&L: decrease dsr -0.02 to -0.04
# - Direction collapse (< 40% accuracy): increase direction +0.02 to +0.05
# - P&L stagnant while other metrics OK: increase pnl +0.02 to +0.05
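A minimal sketch of how bounded weight overrides could work, following the behaviors named in the test list (per-call clamp, cumulative clamp, floor, sum-to-one normalization). The `RewardWeights` class, the additive-delta signature, and the numeric bounds `MAX_DELTA_PER_CALL`, `MAX_CUMULATIVE_DELTA`, and `WEIGHT_FLOOR` are assumptions made for illustration; the real `set_reward_weight_override()` in vectorized_env.py may differ.

```python
from typing import Dict

COMPONENTS = ("direction", "magnitude", "pnl", "stop_tp",
              "exploration", "slippage", "drawdown", "dsr")

# Hypothetical bounds: the source does not state the real values.
MAX_DELTA_PER_CALL = 0.05    # largest change a single call may apply
MAX_CUMULATIVE_DELTA = 0.10  # largest total drift from the base weight
WEIGHT_FLOOR = 0.01          # keeps any component from reaching zero


def _clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


class RewardWeights:
    """Base reward weights plus bounded, cumulative per-component overrides."""

    def __init__(self, base: Dict[str, float]) -> None:
        unknown = set(base) - set(COMPONENTS)
        if unknown:
            raise ValueError(f"unknown reward components: {sorted(unknown)}")
        self._base = dict(base)
        self._cumulative = {name: 0.0 for name in base}

    def set_reward_weight_override(self, component: str, delta: float) -> float:
        """Apply a clamped delta and return the new cumulative override."""
        if component not in self._base:
            raise ValueError(f"unknown reward component: {component!r}")
        # Per-call clamp: one call cannot move a weight too far.
        step = _clamp(delta, -MAX_DELTA_PER_CALL, MAX_DELTA_PER_CALL)
        # Cumulative clamp: repeated calls cannot drift without limit.
        total = _clamp(self._cumulative[component] + step,
                       -MAX_CUMULATIVE_DELTA, MAX_CUMULATIVE_DELTA)
        self._cumulative[component] = total
        return total

    def effective(self) -> Dict[str, float]:
        """Floor each adjusted weight, then renormalize so they sum to 1."""
        raw = {name: max(WEIGHT_FLOOR, self._base[name] + self._cumulative[name])
               for name in self._base}
        scale = sum(raw.values())
        return {name: weight / scale for name, weight in raw.items()}


if __name__ == "__main__":
    weights = RewardWeights({name: 0.125 for name in COMPONENTS})
    weights.set_reward_weight_override("slippage", 0.03)  # overtrading case
    weights.set_reward_weight_override("dsr", -0.02)      # DSR dominating
    print(weights.effective())
```

Because the weights are renormalized, raising one component implicitly lowers the share of all the others, so small deltas such as the +0.02 to +0.05 ranges in the prompt have a wider effect than the number alone suggests.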
File: multi_agent.py
Requires three consecutive, strictly declining fitness snapshots (that is, two successive drops) before ANY intervention (entropy or weight changes). Prevents agents from intervening when training is going well.
def _is_fitness_declining(self) -> bool:
    if len(self._validation_history) < 3:
        return False
    recent = [v.get('fitness_score', 0) for v in self._validation_history[-3:]]
    return recent[-1] < recent[-2] and recent[-2] < recent[-3]
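To make the gate's edge cases concrete, here is a standalone version of the same check, written as a free function so it can be run without the agent class. The names `is_fitness_declining` and `snapshots` are hypothetical helpers for this example only.

```python
from typing import Dict, List


def is_fitness_declining(history: List[Dict[str, float]]) -> bool:
    """True only if the last three fitness scores are strictly decreasing."""
    if len(history) < 3:
        return False
    oldest, middle, newest = (s.get("fitness_score", 0) for s in history[-3:])
    return newest < middle < oldest


def snapshots(*scores: float) -> List[Dict[str, float]]:
    """Hypothetical helper: wrap raw scores as validation snapshots."""
    return [{"fitness_score": score} for score in scores]


if __name__ == "__main__":
    print(is_fitness_declining(snapshots(0.9, 0.8)))             # False: too few
    print(is_fitness_declining(snapshots(0.9, 0.8, 0.7)))        # True
    print(is_fitness_declining(snapshots(0.9, 0.8, 0.8)))        # False: flat
    print(is_fitness_declining(snapshots(0.5, 0.9, 0.8, 0.7)))   # True: last 3 only
    print(is_fitness_declining(snapshots(0.9, 0.8, 0.7, 0.75)))  # False: recovered
```

The comparison is strict, so a flat reading resets the gate, and only the most recent three snapshots matter regardless of earlier history.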
_agent_rejections increment)
File: agent_validation_analysis.ipynb
DATA_START = datetime(2025, 2, 1, tzinfo=tz.UTC)
DATA_END = datetime(2026, 2, 14, tzinfo=tz.UTC) # Fixed cutoff
Historical data is immutable — same dates = same bars = same results for same seed. Data is sliced after prefetch_all_data() (function doesn't support start/end params directly).
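The slicing step might look roughly like the sketch below. It assumes bars arrive as `(timestamp, close)` tuples with timezone-aware timestamps and treats the window as half-open (start inclusive, end exclusive); both are assumptions, as is the helper name `pin_bars`. It uses the standard library's `timezone.utc` so the example runs without extra dependencies.

```python
from datetime import datetime, timezone
from typing import List, Tuple

DATA_START = datetime(2025, 2, 1, tzinfo=timezone.utc)
DATA_END = datetime(2026, 2, 14, tzinfo=timezone.utc)  # fixed cutoff

Bar = Tuple[datetime, float]  # (timestamp, close): hypothetical bar shape


def pin_bars(bars: List[Bar],
             start: datetime = DATA_START,
             end: datetime = DATA_END) -> List[Bar]:
    """Keep only bars inside the pinned window [start, end)."""
    return [bar for bar in bars if start <= bar[0] < end]


if __name__ == "__main__":
    fetched = [
        (datetime(2025, 1, 31, tzinfo=timezone.utc), 100.0),  # before window
        (datetime(2025, 2, 1, tzinfo=timezone.utc), 101.0),   # kept
        (datetime(2026, 2, 13, tzinfo=timezone.utc), 102.0),  # kept
        (datetime(2026, 2, 20, tzinfo=timezone.utc), 103.0),  # after cutoff
    ]
    pinned = pin_bars(fetched)
    print(len(pinned))  # 2
```

Applying the filter after the fetch means a later run that pulls newer bars still trains on exactly the same window.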
Files: agent_memory.py, multi_agent.py
- MultiAgentConfig.seed field: passed through to save_run_summary() (was always 0)
- _compute_patterns() aggregates weight adjustment success rate across runs

| Stage | Cost | Time | What it proves |
|---|---|---|---|
| 0. Local tests | $0 | 5 min | Code works, bounds enforced, plumbing connected |
| 1. Smoke (1 sym, 1 seed, 2M) | ~$0.50 | ~5 min | Agents can adjust weights, pinned data works |
| 2. Quick A/B (2 sym, 2 seeds, 10M) | ~$5 | ~30 min | Agents help or don't hurt |
| 3. Full (2 sym, 5 seeds, 50M) | ~$40 | ~20 hrs | Statistical significance |
Stop after any stage that fails. Fix and re-run that stage; do not escalate to the next one.
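The stop-on-failure rule can be expressed as a small driver. This is a sketch only: `Stage`, `run_stages`, and the placeholder checks are hypothetical and stand in for whatever actually launches the local tests or Colab runs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]  # returns True when the stage passes


def run_stages(stages: List[Stage]) -> Optional[str]:
    """Run stages cheapest-first; return the first failing stage's name.

    Returns None when every stage passes. Later (more expensive) stages
    are never started once an earlier one fails.
    """
    for stage in stages:
        if not stage.run():
            return stage.name
    return None


if __name__ == "__main__":
    # Placeholder checks; real stages would launch tests or training runs.
    pipeline = [
        Stage("0. Local tests", lambda: True),
        Stage("1. Smoke", lambda: False),  # simulated failure
        Stage("2. Quick A/B", lambda: True),
        Stage("3. Full", lambda: True),
    ]
    failed = run_stages(pipeline)
    print(f"fix and re-run: {failed}" if failed else "all stages passed")
```

With the simulated failure above, the ~$5 and ~$40 stages are never started, which is the point of ordering the stages by cost.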
| Experiment | What Happened | Why | Root Cause |
|---|---|---|---|
| Jan 31 2026 | Treatment = baseline metrics | Agents weren't connected to training loop | Plumbing error |
| Feb 18 2026 | Agents took 5-9 actions, fitness -38.2% | Only lever was entropy. Agents made aggressive changes that hurt | Single lever (entropy), no bounds, no fitness gate |
| Feb 19 2026 | Agents took 0 actions | Phase gates blocked everything, but baseline degraded 76% from Feb 18 | Training data not pinned, agents too conservative |
| All runs | Reward Engineer diagnosed problems correctly but recommended "continue" | Reward Engineer was explicitly read-only | Design flaw — see problem, can't fix it |
| All runs | LR adjustments attempted but no effect | Cosine scheduler overwrites LR every step | LR adjustment is a no-op |
| All runs | Agent memory records seed=0 for all runs | Seed not passed through MultiAgentConfig | Bug in plumbing |
# MultiAgentConfig (v4.2.0)
MultiAgentConfig(
    symbol=symbol,
    seed=seed,                                # NEW: seed tracking
    enable_hyperparameter_tuner=False,        # Disabled: consolidated into Reward Engineer
    enable_reward_engineer=True,              # NOW ACTIONABLE (was read-only)
    reward_interval=8,                        # Increased frequency
    max_cumulative_entropy_multiplier=1.25,   # Tightened from 1.5
    min_cumulative_entropy_multiplier=0.75,   # Tightened from 0.5
    no_intervention_before_pct=50.0,          # Raised from 30.0
)
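One way to read the entropy bounds and the phase gate in this config is sketched below. It assumes that entropy adjustments compound multiplicatively into a running cumulative multiplier and that the phase gate opens once training progress reaches `no_intervention_before_pct`; neither is confirmed by the source, and both function names are hypothetical.

```python
MAX_CUMULATIVE_ENTROPY_MULTIPLIER = 1.25
MIN_CUMULATIVE_ENTROPY_MULTIPLIER = 0.75
NO_INTERVENTION_BEFORE_PCT = 50.0


def apply_entropy_adjustment(cumulative: float, requested: float) -> float:
    """Compound a requested multiplier, clamped to the cumulative bounds.

    Assumption: adjustments multiply into a running total that starts at 1.0.
    """
    proposed = cumulative * requested
    return max(MIN_CUMULATIVE_ENTROPY_MULTIPLIER,
               min(MAX_CUMULATIVE_ENTROPY_MULTIPLIER, proposed))


def intervention_allowed(progress_pct: float) -> bool:
    """Phase gate: block all interventions in the early part of training."""
    return progress_pct >= NO_INTERVENTION_BEFORE_PCT


if __name__ == "__main__":
    cumulative = 1.0
    for requested in (1.1, 1.1, 1.1):  # three successive +10% requests
        cumulative = apply_entropy_adjustment(cumulative, requested)
        print(round(cumulative, 4))
    print(intervention_allowed(30.0), intervention_allowed(50.0))
```

Under these assumptions the third +10% request is cut off at 1.25, so repeated aggressive entropy changes of the kind seen on Feb 18 cannot accumulate past the bound.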
- TestRewardWeightOverrideBounds (5): per-call clamp, cumulative clamp, negative clamp, invalid component, valid components
- TestRewardWeightOverrideNormalization (4): sum-to-one, floor prevents zero, weights shift
- TestAdjustRewardWeightsAction (6): action in valid set, aliases, apply action, phase gate, fitness gate, empty changes
- TestFitnessDecliningGate (6): requires 3 snapshots, declining/improving/flat, only last 3, entropy blocked
- TestLRAdjustmentNoop (2): skipped not rejected, doesn't count
- TestAgentMemoryWeightTracking (2): patterns with weight data, seed in config

- ~/.claude/plans/fancy-giggling-puffin.md
- notebooks/agent_validation_analysis.ipynb (v2.0.0)
- tests/test_multi_agent.py (87 passed, 5 skipped)
- vectorized_env.py, multi_agent.py, agent_memory.py