reward-function-v410
v4.1.0 reward function redesign to fix overtrading and DSR dominance
| Field | Value |
|---|---|
| name | reward-function-v410 |
| description | v4.1.0 reward function redesign to fix overtrading and DSR dominance |
| author | Claude Code |
| date | 2026-02-21T00:00:00.000Z |

| Item | Details |
|---|---|
| Date | 2026-02-21 |
| Goal | Fix overtrading (HOLD 5-12%), DSR dominance, and direction collapse identified by v3.0 agent validation experiment |
| Environment | Google Colab A100, v4.0.0 baseline, 50M timesteps standard mode |
| Status | Implemented, pending A/B validation |
The v3.0 agent validation experiment (50M timesteps, 2 symbols x 3 seeds) produced best PF=1.205 and fitness=0.153, far below the APPROVED thresholds. The Reward Engineer flagged CRITICAL issues in all runs: the overtrading (HOLD 5-12%), DSR dominance, and direction collapse listed under Goal above.

Additionally, `_calculate_slippage()` used a Python for-loop with two `.item()` calls per env. With 1024 envs that is 2048 CUDA syncs per step, which halved FPS.
```python
def _get_curriculum_weights(self, progress: float) -> dict:
    """Return normalized reward weights for training progress in [0, 1]."""
    # Dict keys mirror the original local variable names.
    raw = {
        # Curriculum weights (value at progress=0 -> value at progress=1)
        "direction_w": max(0.10, 0.35 * (1.0 - progress)),    # 0.35 -> 0.10 (floor raised)
        "pnl_w": min(0.60, 0.15 + 0.55 * progress),           # 0.15 -> 0.60
        "drawdown_w": min(0.15, 0.03 + 0.12 * progress),      # 0.03 -> 0.15
        "exploration_w": max(0.02, 0.12 * (1.0 - progress)),  # 0.12 -> 0.02
        "dsr_w": max(0.03, 0.10 * (1.0 - progress * 0.7)),    # 0.10 -> 0.03
        # Fixed weights
        "magnitude_w": 0.05,
        "stop_tp_w": 0.10,
        "slippage_w": 0.06,
    }
    # Normalize to sum=1.0
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}
```
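To see what normalization does to the schedule, here is a minimal standalone sketch of the same formulas. The function names `raw_curriculum_weights` and `curriculum_weights` and the dictionary keys are illustrative assumptions, not names confirmed by the codebase.

```python
def raw_curriculum_weights(progress: float) -> dict:
    """Raw (pre-normalization) v4.1.0 weights; key names are illustrative."""
    return {
        "direction": max(0.10, 0.35 * (1.0 - progress)),
        "pnl": min(0.60, 0.15 + 0.55 * progress),
        "drawdown": min(0.15, 0.03 + 0.12 * progress),
        "exploration": max(0.02, 0.12 * (1.0 - progress)),
        "dsr": max(0.03, 0.10 * (1.0 - progress * 0.7)),
        "magnitude": 0.05,
        "stop_tp": 0.10,
        "slippage": 0.06,
    }


def curriculum_weights(progress: float) -> dict:
    """Normalize the raw weights so they sum to 1.0."""
    raw = raw_curriculum_weights(progress)
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


for progress in (0.0, 0.5, 1.0):
    raw_sum = sum(raw_curriculum_weights(progress).values())
    weights = curriculum_weights(progress)
    summary = ", ".join(f"{name}={value:.3f}" for name, value in weights.items())
    print(f"progress={progress:.1f} raw_sum={raw_sum:.3f} -> {summary}")
```

The raw values sum to roughly 0.96 at progress 0 and roughly 1.11 at progress 1, so the normalized weights differ slightly from the raw start and end values quoted in the comments.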
```python
# OLD: torch.clamp(dsr * 10, -2.0, 2.0)
# NEW:
dsr_scaled = torch.clamp(dsr * 5, -1.0, 1.0)
```
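Because the clamp bounds the scaled DSR signal, the largest weighted DSR term is simply the weight times the clamp limit. The sketch below works this out in plain Python. The helper names and the sample raw DSR values are made up for illustration, and the weights are the raw, pre-normalization ones.

```python
def clamp(value: float, low: float, high: float) -> float:
    """Scalar equivalent of torch.clamp for a single value."""
    return max(low, min(high, value))


def dsr_term_old(dsr: float, weight: float = 0.10) -> float:
    """Pre-v4.1.0: scale 10, clamp [-2, 2], fixed weight 0.10."""
    return weight * clamp(dsr * 10, -2.0, 2.0)


def dsr_term_new(dsr: float, weight: float) -> float:
    """v4.1.0: scale 5, clamp [-1, 1], weight decays from 0.10 to 0.03."""
    return weight * clamp(dsr * 5, -1.0, 1.0)


# Illustrative raw DSR values: one in the linear region, one that saturates.
for dsr in (0.05, 0.30):
    print(
        f"dsr={dsr:.2f} "
        f"old={dsr_term_old(dsr):.3f} "
        f"new_start={dsr_term_new(dsr, 0.10):.3f} "
        f"new_end={dsr_term_new(dsr, 0.03):.3f}"
    )
```

Under these assumptions the saturated DSR term is capped at 0.20 in the old scheme, at 0.10 at the start of v4.1.0 training, and at 0.03 at the end.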
```python
# OLD: return total_cost * 10.0  (slippage_weight=0.04)
# NEW: return total_cost * 20.0  (slippage_weight=0.06)
# Net effect: 3x stronger penalty, since (0.06 * 20) / (0.04 * 10) = 3
# on the raw, pre-normalization weights
```
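The 3x figure follows from multiplying weight by scale. The sketch below also backs out the raw per-trade cost range implied by the 0.0004-0.0012 weighted penalty reported in the table below, and shows what that range becomes under the new settings. The function name is hypothetical, and the raw cost range is derived rather than measured.

```python
def weighted_slippage(total_cost: float, scale: float, weight: float) -> float:
    """End-to-end slippage penalty: weight * scale * raw cost."""
    return weight * scale * total_cost


OLD = {"scale": 10.0, "weight": 0.04}
NEW = {"scale": 20.0, "weight": 0.06}

# Ratio of new to old penalty for the same raw cost.
ratio = (NEW["weight"] * NEW["scale"]) / (OLD["weight"] * OLD["scale"])
print(f"penalty ratio new/old = {ratio:.1f}")

# Raw costs implied by the old weighted range 0.0004-0.0012 (derived).
for old_weighted in (0.0004, 0.0012):
    raw_cost = old_weighted / (OLD["weight"] * OLD["scale"])
    new_weighted = weighted_slippage(raw_cost, **NEW)
    print(f"raw={raw_cost:.4f} old={old_weighted:.4f} new={new_weighted:.4f}")
```

Because the weights are normalized in v4.1.0, the realized ratio will differ slightly from 3 depending on training progress.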
| Attempt | Why it Failed | Lesson Learned |
|---|---|---|
| Fixed DSR weight at 10% | DSR raw magnitude >> P&L raw magnitude → DSR dominates despite lower weight | Weight alone doesn't control contribution; must also control raw signal scaling |
| Direction floor at 0.05 | Direction signal vanishes in late training → negative direction reward | Floor must be ≥0.10 for meaningful gradient signal |
| Slippage weight 0.04 + 10x scale | Net slippage penalty per trade is only 0.0004-0.0012 weighted → negligible vs DSR reward from trading | Must consider end-to-end magnitude: weight * scale * typical_raw_signal |
| Non-normalized curriculum weights | Weights sum to 0.96 at start, 1.06 at end → implicit weight changes confuse analysis | Always normalize to 1.0 |
| Python for-loop in _calculate_slippage | 2 .item() calls per env × 1024 envs = 2048 CUDA syncs/step → halves FPS | The same anti-pattern as _get_observations(): always vectorize, never .item() in hot paths |
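The last row can be illustrated with a small sketch contrasting the per-env loop with a batched expression. The cost formula, the constants `BASE_COST` and `IMPACT`, and both function names are hypothetical, since the source does not show the body of `_calculate_slippage()`. Only the loop-versus-vectorized pattern is the point.

```python
import torch

# Hypothetical cost model, used only to illustrate the pattern.
BASE_COST = 0.0005  # assumed fixed cost per unit traded
IMPACT = 0.1        # assumed volatility multiplier


def slippage_loop(trade_size: torch.Tensor, volatility: torch.Tensor) -> torch.Tensor:
    """Anti-pattern: two .item() calls per env, each a host sync on CUDA."""
    costs = torch.zeros_like(trade_size)
    for i in range(trade_size.shape[0]):
        size = trade_size[i].item()  # sync 1
        vol = volatility[i].item()   # sync 2
        costs[i] = abs(size) * (BASE_COST + IMPACT * vol)
    return costs


def slippage_vectorized(trade_size: torch.Tensor, volatility: torch.Tensor) -> torch.Tensor:
    """Same arithmetic as one batched expression, with no per-env round trips."""
    return trade_size.abs() * (BASE_COST + IMPACT * volatility)


torch.manual_seed(0)
num_envs = 1024
trade_size = torch.rand(num_envs) * 2.0 - 1.0  # values in [-1, 1)
volatility = torch.rand(num_envs) * 0.05       # values in [0, 0.05)

loop_costs = slippage_loop(trade_size, volatility)
vec_costs = slippage_vectorized(trade_size, volatility)
print(torch.allclose(loop_costs, vec_costs, rtol=1e-4, atol=1e-8))
```

The sketch runs on CPU, where `.item()` is cheap. The synchronization cost described in the table only appears when the tensors live on a CUDA device.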
```yaml
# Curriculum (start -> end, raw before normalization)
direction_w: 0.35 -> 0.10    # floor=0.10, was 0.40->0.05
pnl_w: 0.15 -> 0.60          # cap=0.60, was 0.10->0.55
drawdown_w: 0.03 -> 0.15
exploration_w: 0.12 -> 0.02
dsr_w: 0.10 -> 0.03          # NEW: decays (was fixed 0.10)
magnitude_w: 0.05            # fixed
stop_tp_w: 0.10              # fixed
slippage_w: 0.06             # was 0.04

# DSR scaling
dsr_scale: 5                 # was 10
dsr_clamp: [-1.0, 1.0]       # was [-2.0, 2.0]

# Slippage
slippage_scale: 20           # was 10

direction_threshold: 0.002   # was 0.003
```
Key lesson: a reward component's effective contribution is `weight * scale * typical_raw_magnitude`, not just the weight.