Agent validation v4.3.0 — Make agents act effectively by disabling harmful actions, lowering gates, and injecting cross-run learning
| Key | Value |
|---|---|
| name | agent-validation-v430 |
| description | Agent validation v4.3.0 — Make agents act effectively by disabling harmful actions, lowering gates, and injecting cross-run learning |
| author | Claude Code |
| date | "2025-02-22T00:00:00.000Z" |

| Field | Value |
|---|---|
| Date | Feb 22, 2025 |
| Goal | Make agents act on the RIGHT things (reward weights, checkpoints) while preventing harmful actions (entropy) |
| Environment | Google Colab A100 GPU, Python 3.10, PPO reinforcement learning |
| Status | Implemented, tests passing (91/91), awaiting experiment validation |
| Supersedes | agent-validation-v420 (v4.2.0) |
Four agent validation experiments (Jan 31 - Feb 19) revealed a fundamental problem:
| Run | Agents Acted? | PF Change | Fitness Change | Root Cause |
|---|---|---|---|---|
| Run 1-2 | NO (all "continue") | ~0% | ~0% | Fitness gate (3 consecutive) unreachable with PPO oscillation |
| Run 3 | YES (entropy 1.3x) | -6.3% | -38.2% | Entropy increase harmful (confirmed across 2 experiments) |
| Run 4 | NO (all "continue") | +7.0% | +144% | Fitness gate still blocking; natural PPO improvement |
Core insight: Agents are excellent diagnosticians (identify overtrading, direction collapse, reward imbalance) but the system either blocks them from acting or gives them harmful actions. The solution: remove harmful actions, lower gates for safe actions, and inject institutional memory.
File: multi_agent.py — validation_callback() start
Before ANY agent consultation, save checkpoint if current_fitness > self._best_fitness. Prevents losing peak model state between agent consultation windows.
```python
import os

# At the start of validation_callback, before any agent logic:
if current_fitness > self._best_fitness:
    checkpoint_path = f"checkpoints/auto_best_{trainer.global_step}.pt"
    os.makedirs(os.path.dirname(checkpoint_path), exist_ok=True)
    # Clean up the previous auto_best (only auto_best, never agent checkpoints).
    # The exists() check avoids a FileNotFoundError if the file was already removed.
    prev = self._best_checkpoint_path
    if prev and prev.startswith("checkpoints/auto_best_") and os.path.exists(prev):
        os.remove(prev)
    trainer.save(checkpoint_path)
    self._best_fitness = current_fitness
    self._best_fitness_step = trainer.global_step
    self._best_checkpoint_path = checkpoint_path
```
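The auto-checkpoint rule can be tried in isolation with a stub trainer. This is a minimal sketch, not the project's implementation: `StubTrainer`, `BestCheckpointTracker`, and `run_demo` are hypothetical names, and writing the new file before deleting the old one is a choice made here for illustration.

```python
import os
import tempfile


class StubTrainer:
    """Hypothetical stand-in for the real trainer: only global_step and save()."""

    def __init__(self):
        self.global_step = 0

    def save(self, path):
        with open(path, "w") as fh:
            fh.write(f"step={self.global_step}\n")


class BestCheckpointTracker:
    """Keeps a single auto_best checkpoint on disk for the best fitness seen."""

    def __init__(self, root):
        self.prefix = os.path.join(root, "auto_best_")
        self.best_fitness = float("-inf")
        self.best_path = None

    def maybe_save(self, trainer, fitness):
        """Save a checkpoint if fitness is a new peak; return True if saved."""
        if fitness <= self.best_fitness:
            return False
        path = f"{self.prefix}{trainer.global_step}.pt"
        os.makedirs(os.path.dirname(path), exist_ok=True)
        trainer.save(path)
        # Remove the superseded auto_best only after the new one is written.
        old = self.best_path
        if old and old != path and os.path.exists(old):
            os.remove(old)
        self.best_fitness, self.best_path = fitness, path
        return True


def run_demo():
    """Feed an oscillating fitness series and report which files survive."""
    with tempfile.TemporaryDirectory() as root:
        trainer, tracker = StubTrainer(), BestCheckpointTracker(root)
        saved = []
        for step, fitness in [(100, 0.2), (200, 0.5), (300, 0.4), (400, 0.7)]:
            trainer.global_step = step
            saved.append(tracker.maybe_save(trainer, fitness))
        return saved, sorted(os.listdir(root))
```

Calling `run_demo()` shows that the dip at step 300 does not trigger a save and that only the checkpoint from the latest peak remains on disk.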
Disabled action: adjust_entropy

File: multi_agent.py — _apply_action(), agent prompts
Two independent experiments confirm entropy adjustments hurt: v2.4 (-38.2%) and Run 3 (-38.2%). PPO's cosine schedule manages entropy optimally.
- _apply_action() returns early with a "Disabled" message
- adjust_entropy stays in VALID_ACTION_TYPES for alias resolution (backward compatibility)

File: multi_agent.py — _is_fitness_declining(strict=True)
```python
def _is_fitness_declining(self, strict: bool = True) -> bool:
    # `recent` holds the latest validation fitness values, oldest first.
    # Its definition is not shown in this excerpt.
    if not strict:  # Moderate gate for low-risk actions (reward weights)
        # Latest value below the previous one, OR >30% drop from peak
        two_declining = recent[-1] < recent[-2]
        peak_drop = self._best_fitness > 0.01 and recent[-1] < self._best_fitness * 0.7
        return two_declining or peak_drop
    # Strict gate (3 consecutive declining values) for high-risk actions (rollback, halt)
    return recent[-1] < recent[-2] and recent[-2] < recent[-3]
```
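To see how the two gates differ on an oscillating PPO fitness curve, the check can be written as a free function. This is a sketch under assumptions: the function name, the explicit `recent` and `best_fitness` parameters, and the short-history guards are added here and are not taken from `multi_agent.py`.

```python
from typing import Sequence


def is_fitness_declining(
    recent: Sequence[float],
    best_fitness: float,
    strict: bool = True,
) -> bool:
    """Tiered decline check over validation fitness values (oldest first)."""
    if strict:
        # High-risk actions: need three readings, each lower than the one before.
        if len(recent) < 3:
            return False
        return recent[-1] < recent[-2] < recent[-3]
    # Low-risk actions: latest reading below the previous one ...
    two_declining = len(recent) >= 2 and recent[-1] < recent[-2]
    # ... or more than 30% below the best fitness seen so far.
    peak_drop = (
        len(recent) >= 1
        and best_fitness > 0.01
        and recent[-1] < best_fitness * 0.7
    )
    return two_declining or peak_drop


# An oscillating series: the strict gate stays closed, the moderate gate opens.
oscillating = [0.50, 0.62, 0.55, 0.60, 0.58]
strict_result = is_fitness_declining(oscillating, best_fitness=0.62, strict=True)
moderate_result = is_fitness_declining(oscillating, best_fitness=0.62, strict=False)
```

On this series the strict gate returns `False` because 0.60 is not below 0.55, while the moderate gate returns `True` because 0.58 is below 0.60. This matches the Run 1-2 observation that the strict condition is rarely met under oscillation.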
- strict=True (default): 3 consecutive declines — for rollback/halt
- strict=False: 2 consecutive OR >30% from peak — for reward weights

File: multi_agent.py — _apply_action() reward weights section
With the phase gate at 30% and the first consultation at ~31% of training, agents got barely one chance to act. At 15%, the Reward Engineer gets 2-3 more consultation windows. This is safe because reward-weight changes are bounded (±0.05 per call, ±0.15 cumulative).
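The bounds are what make the lower gate safe, so it helps to see them applied. The sketch below assumes the cumulative limit is measured against each weight's starting value, which the text does not specify. The function and constant names are hypothetical.

```python
PER_CALL_LIMIT = 0.05    # max change per agent call (from the text)
CUMULATIVE_LIMIT = 0.15  # max total drift; assumed relative to the starting weight
PHASE_GATE_PCT = 15.0    # no reward-weight changes before this training progress


def apply_weight_delta(initial, current, requested_delta, progress_pct):
    """Return the new weight after clamping; unchanged if the phase gate blocks it."""
    if progress_pct < PHASE_GATE_PCT:
        return current
    # Clamp the single step to the per-call limit.
    step = max(-PER_CALL_LIMIT, min(PER_CALL_LIMIT, requested_delta))
    # Clamp the total drift from the starting weight to the cumulative limit.
    drift = (current + step) - initial
    drift = max(-CUMULATIVE_LIMIT, min(CUMULATIVE_LIMIT, drift))
    return round(initial + drift, 6)


def simulate(calls, requested_delta, progress_pct, initial=1.0):
    """Apply the same requested delta repeatedly and record each resulting weight."""
    weight, history = initial, []
    for _ in range(calls):
        weight = apply_weight_delta(initial, weight, requested_delta, progress_pct)
        history.append(weight)
    return history
```

Even if an agent asks for +0.2 on every call, `simulate(5, 0.2, 20.0)` moves the weight in 0.05 steps and stops at 1.15. The same request at 10% progress leaves the weight untouched.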
File: multi_agent.py — _get_current_metrics(), _consult_agent_simple()
Agent memory was accumulated but never injected into prompts. Now adds previous_runs to metrics:
```python
metrics['previous_runs'] = {
    'total_runs': patterns.get('total_runs', 0),
    'avg_fitness_with_agents': patterns.get('avg_fitness_with_agents', 0),
    'best_fitness_ever': patterns.get('best_fitness_ever', 0),
    'weight_adjustment_success_rate': patterns.get('weight_adjustment_success_rate', None),
    'weight_adjustments_total': patterns.get('weight_adjustments_total', 0),
    'recent_actions': [...]  # Last 3 runs' actions and fitness
}
```
Rendered as **CROSS-RUN LEARNING** section in consultation prompt.
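One way such a section could be rendered from `previous_runs` is sketched below. The function name, the wording of each line, and the shape of the `recent_actions` entries (`seed`, `actions`, `fitness`) are assumptions for illustration; the source only names the top-level keys.

```python
def render_cross_run_learning(previous_runs: dict) -> str:
    """Format a previous_runs dict as a prompt section; '' if there is no history."""
    if not previous_runs or previous_runs.get("total_runs", 0) == 0:
        return ""
    lines = [
        "**CROSS-RUN LEARNING**",
        f"- Total runs: {previous_runs['total_runs']}",
        f"- Avg fitness with agents: {previous_runs.get('avg_fitness_with_agents', 0):.3f}",
        f"- Best fitness ever: {previous_runs.get('best_fitness_ever', 0):.3f}",
    ]
    rate = previous_runs.get("weight_adjustment_success_rate")
    total = previous_runs.get("weight_adjustments_total", 0)
    if rate is None:
        # No adjustments evaluated yet, so avoid printing a misleading 0%.
        lines.append(f"- Weight adjustments: {total} (success rate not yet known)")
    else:
        lines.append(f"- Weight adjustments: {total} ({rate:.0%} succeeded)")
    for run in previous_runs.get("recent_actions", []):
        # Entry shape (seed / actions / fitness) is hypothetical.
        actions = ", ".join(run.get("actions", [])) or "none"
        lines.append(
            f"- Seed {run.get('seed')}: {actions} -> fitness {run.get('fitness', 0):.3f}"
        )
    return "\n".join(lines)


example = render_cross_run_learning({
    "total_runs": 4,
    "avg_fitness_with_agents": 0.412,
    "best_fitness_ever": 0.688,
    "weight_adjustment_success_rate": 0.5,
    "weight_adjustments_total": 6,
    "recent_actions": [{"seed": 2, "actions": [], "fitness": 0.5}],
})
```

Returning an empty string when there is no history lets the caller append the section unconditionally without adding an empty header to the first run's prompt.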
File: notebooks/agent_validation_runpod.ipynb
- N_SEEDS=5 (from 3) — needed for p<0.05 with Cohen's d ~1.0
- TIMESTEPS=200_000_000 — production length, ~20+ validation windows
- MultiAgentConfig(seed=seed, no_intervention_before_pct=15.0, reward_interval=8)

| Attempt | Why Failed | Lesson Learned |
|---|---|---|
| Run 1-2: Standard fitness gate (3 consecutive) | PPO fitness oscillates — 3 consecutive declines almost never occurs | Need tiered gates: strict for risky actions, moderate for bounded actions |
| Run 3: Agent increases entropy 1.3x | Entropy increase destroys PPO learning (-38.2% fitness) | NEVER adjust entropy during training — cosine schedule is optimal |
| v2.4: Entropy increase experiment | Independent confirmation: also -38.2% fitness decline | Two independent failures = permanent disable, not parameter tuning |
| Agent memory without prompt injection | Data accumulated but agents never saw it — repeated same mistakes | Memory is useless unless injected into the prompt context |
| Phase gate at 30% for reward weights | First consultation at ~31% = barely 1 chance before midpoint | Bounded actions (±0.05) deserve lower gates than unbounded ones |
| n=3 seeds for A/B experiments | p>0.12 even with Cohen's d ~1.0 | Need n=5 minimum for statistical power with training variance |
```python
# MultiAgentConfig
MultiAgentConfig(
    symbol=symbol,
    seed=seed,                        # Cross-run tracking
    no_intervention_before_pct=15.0,  # Lowered from 50% (entropy gone)
    reward_interval=8,                # Reward Engineer primary lever
    risk_interval=5,                  # Risk Analyst frequent checks
    log_agent_responses=True,
)

# Experiment config
N_SEEDS = 5                    # Statistical power
TIMESTEPS = 200_000_000        # Production length
TRAINING_MODE = 'production'   # 2048,1024,512,256 network

# Fitness decline gate
_is_fitness_declining(strict=True)   # 3 consecutive — rollback/halt
_is_fitness_declining(strict=False)  # 2 consecutive OR >30% from peak — reward weights

# Phase gates
# entropy: DISABLED (no phase gate needed)
# reward weights: 15% progress
# rollback/halt: no phase gate (always available when fitness declining)
```
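Taken together, the phase gates and fitness gates amount to a small decision table per action. The sketch below is one reading of that table. Apart from `adjust_entropy` and `continue`, the action names (`adjust_reward_weights`, `rollback`, `halt`) and all reason strings are hypothetical, and the real `_apply_action()` may order its checks differently.

```python
DISABLED_ACTIONS = {"adjust_entropy"}   # permanently disabled
HIGH_RISK_ACTIONS = {"rollback", "halt"}
REWARD_WEIGHT_GATE_PCT = 15.0           # phase gate for reward weights


def action_allowed(action, progress_pct, declining_strict, declining_moderate):
    """Return (allowed, reason) for a proposed agent action."""
    if action in DISABLED_ACTIONS:
        return False, "Disabled"
    if action == "continue":
        return True, "OK"
    if action == "adjust_reward_weights":
        # Low-risk: phase gate at 15% plus the moderate fitness gate.
        if progress_pct < REWARD_WEIGHT_GATE_PCT:
            return False, "Phase gate: too early"
        if not declining_moderate:
            return False, "Moderate fitness gate not met"
        return True, "OK"
    if action in HIGH_RISK_ACTIONS:
        # High-risk: no phase gate, but the strict fitness gate must be met.
        if not declining_strict:
            return False, "Strict fitness gate not met"
        return True, "OK"
    return False, "Unknown action"
```

For example, `action_allowed("adjust_reward_weights", 20.0, False, True)` is permitted, while the same call at 10% progress is refused by the phase gate and `adjust_entropy` is refused at any point in training.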
- .claude/plans/piped-cooking-kahan.md
- notebooks/agent_validation_runpod.ipynb
- tests/test_multi_agent.py (91 tests, 5 GPU-only skips)
- alpaca_trading/training/multi_agent.py
- alpaca_trading/training/agent_memory.py
- .skills/plugins/trading/agent-validation-v420/