reward-function-v410
v4.1.0 reward function redesign to fix overtrading and DSR dominance
| Field | Value |
|---|---|
| name | reward-function-v410 |
| description | v4.1.0 reward function redesign to fix overtrading and DSR dominance |
| author | Claude Code |
| date | 2026-02-21 |

| Item | Details |
|---|---|
| Date | 2026-02-21 |
| Goal | Fix overtrading (HOLD 5-12%), DSR dominance, and direction collapse identified by v3.0 agent validation experiment |
| Environment | Google Colab A100, v4.0.0 baseline, 50M timesteps standard mode |
| Status | Implemented, pending A/B validation |
v3.0 agent validation (50M timesteps, 2 symbols x 3 seeds) produced a best PF of 1.205 and fitness of 0.153, far below the APPROVED thresholds. The Reward Engineer flagged CRITICAL issues in all runs: overtrading (HOLD 5-12%), DSR dominance, and direction collapse.
Additionally, _calculate_slippage() had a Python for-loop with 2 .item() calls per env (2048 CUDA syncs/step), halving FPS.
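def _get_curriculum_weights(self, progress: float) -> dict:
    # progress: fraction of training completed, in [0, 1]
    direction_w = max(0.10, 0.35 * (1.0 - progress))    # 0.35 -> 0.10 (floor raised)
    pnl_w = min(0.60, 0.15 + 0.55 * progress)           # 0.15 -> 0.60
    drawdown_w = min(0.15, 0.03 + 0.12 * progress)      # 0.03 -> 0.15
    exploration_w = max(0.02, 0.12 * (1.0 - progress))  # 0.12 -> 0.02
    dsr_w = max(0.03, 0.10 * (1.0 - progress * 0.7))    # 0.10 -> 0.03
    magnitude_w = 0.05  # fixed
    stop_tp_w = 0.10    # fixed
    slippage_w = 0.06   # fixed (was 0.04)
    # Normalize to sum=1.0 (dict key names here are illustrative)
    raw = {
        "direction": direction_w, "pnl": pnl_w, "drawdown": drawdown_w,
        "exploration": exploration_w, "dsr": dsr_w, "magnitude": magnitude_w,
        "stop_tp": stop_tp_w, "slippage": slippage_w,
    }
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}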
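To see what normalization does to each component's share, the schedule above can be evaluated on its own. This is a minimal sketch: `curriculum_weights` is a hypothetical free-function copy of `_get_curriculum_weights`, the dictionary keys are illustrative, and only the constants come from the schedule above.

```python
def curriculum_weights(progress: float) -> dict:
    """Hypothetical standalone copy of the v4.1.0 schedule (progress in [0, 1])."""
    raw = {
        "direction": max(0.10, 0.35 * (1.0 - progress)),
        "pnl": min(0.60, 0.15 + 0.55 * progress),
        "drawdown": min(0.15, 0.03 + 0.12 * progress),
        "exploration": max(0.02, 0.12 * (1.0 - progress)),
        "dsr": max(0.03, 0.10 * (1.0 - progress * 0.7)),
        "magnitude": 0.05,
        "stop_tp": 0.10,
        "slippage": 0.06,
    }
    # Divide by the raw total so the returned weights always sum to 1.0.
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}


if __name__ == "__main__":
    for progress in (0.0, 0.5, 1.0):
        weights = curriculum_weights(progress)
        shares = ", ".join(f"{k}={v:.3f}" for k, v in weights.items())
        print(f"progress={progress:.1f}: {shares}")
```

With these constants the raw weights sum to 0.96 at progress 0 and 1.11 at progress 1, so the normalized DSR share goes from roughly 0.104 to roughly 0.027 rather than exactly 0.10 to 0.03.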
# DSR scaling
# OLD: torch.clamp(dsr * 10, -2.0, 2.0)
# NEW:
dsr_scaled = torch.clamp(dsr * 5, -1.0, 1.0)
# Slippage penalty
# OLD: return total_cost * 10.0 (slippage_weight=0.04)
# NEW: return total_cost * 20.0 (slippage_weight=0.06)
# Net effect: (0.06 * 20) / (0.04 * 10) = 3x stronger penalty
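The end-to-end magnitudes behind these two changes can be checked with plain arithmetic. The sketch below multiplies weight by scale by a raw value. `effective_contribution` is a hypothetical helper, and the raw per-trade cost of 0.001 is an assumed illustrative value, not a measurement. The weights are the raw ones, before normalization.

```python
def effective_contribution(weight: float, scale: float, raw: float) -> float:
    """Weighted reward contribution: weight * scale * raw signal (hypothetical helper)."""
    return weight * scale * raw


# Assumed illustrative raw slippage cost per trade (not a measured value).
raw_cost = 0.001

old_slippage = effective_contribution(weight=0.04, scale=10.0, raw=raw_cost)
new_slippage = effective_contribution(weight=0.06, scale=20.0, raw=raw_cost)

# The largest possible DSR contribution is weight * clamp bound.
old_dsr_max = 0.10 * 2.0        # fixed weight 0.10, clamp [-2.0, 2.0]
new_dsr_max_start = 0.10 * 1.0  # start of training, clamp [-1.0, 1.0]
new_dsr_max_end = 0.03 * 1.0    # end of training

if __name__ == "__main__":
    print(f"slippage per trade: old={old_slippage:.4f}, new={new_slippage:.4f}, "
          f"ratio={new_slippage / old_slippage:.1f}x")
    print(f"max |DSR| contribution: old={old_dsr_max:.2f}, "
          f"new={new_dsr_max_start:.2f} -> {new_dsr_max_end:.2f}")
```

The slippage ratio is independent of the assumed raw cost, since the raw value cancels out.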
| Attempt | Why it Failed | Lesson Learned |
|---|---|---|
| Fixed DSR weight at 10% | DSR raw magnitude >> P&L raw magnitude → DSR dominates despite lower weight | Weight alone doesn't control contribution; must also control raw signal scaling |
| Direction floor at 0.05 | Direction signal vanishes in late training → negative direction reward | Floor must be ≥0.10 for meaningful gradient signal |
| Slippage weight 0.04 + 10x scale | Net slippage penalty per trade is only 0.0004-0.0012 weighted → negligible vs DSR reward from trading | Must consider end-to-end magnitude: weight * scale * typical_raw_signal |
| Non-normalized curriculum weights | Weights sum to 0.96 at start, 1.06 at end → implicit weight changes confuse analysis | Always normalize to 1.0 |
| Python for-loop in _calculate_slippage | 2 .item() calls per env × 1024 envs = 2048 CUDA syncs/step → halves FPS | The same anti-pattern as _get_observations(): always vectorize, never .item() in hot paths |
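The fix in the last row can be sketched as follows. The actual cost formula inside `_calculate_slippage()` is not shown in this document, so the formula below (a base rate plus a volatility term, applied to the absolute trade size) and every function, tensor, and parameter name are assumptions for illustration. Only the pattern is the point: one batched tensor expression instead of a per-env loop with `.item()`.

```python
import torch


def slippage_loop(trade_size, volatility, base_bps=1.0, vol_coef=0.5):
    """Anti-pattern: per-env Python loop with two .item() calls per env.

    On CUDA tensors each .item() forces a device-to-host sync.
    """
    costs = torch.zeros_like(trade_size)
    for i in range(trade_size.shape[0]):
        size = trade_size[i].item()
        vol = volatility[i].item()
        costs[i] = abs(size) * (base_bps * 1e-4 + vol_coef * vol)
    return costs


def slippage_vectorized(trade_size, volatility, base_bps=1.0, vol_coef=0.5):
    """Same arithmetic as one batched expression: no .item(), no Python loop."""
    return trade_size.abs() * (base_bps * 1e-4 + vol_coef * volatility)


if __name__ == "__main__":
    num_envs = 1024
    trade_size = torch.randn(num_envs)
    volatility = torch.rand(num_envs) * 0.01
    same = torch.allclose(
        slippage_loop(trade_size, volatility),
        slippage_vectorized(trade_size, volatility),
        rtol=1e-4,
        atol=1e-7,
    )
    print("loop and vectorized results match:", same)
```

The vectorized version stays on the device for the whole batch, so the number of syncs no longer grows with the number of envs.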
# Curriculum (start -> end, raw before normalization)
direction_w: 0.35 -> 0.10 # floor=0.10, was 0.40->0.05
pnl_w: 0.15 -> 0.60 # cap=0.60, was 0.10->0.55
drawdown_w: 0.03 -> 0.15
exploration_w: 0.12 -> 0.02
dsr_w: 0.10 -> 0.03 # NEW: decays (was fixed 0.10)
magnitude_w: 0.05 # fixed
stop_tp_w: 0.10 # fixed
slippage_w: 0.06 # was 0.04
# DSR scaling
dsr_scale: 5 # was 10
dsr_clamp: [-1.0, 1.0] # was [-2.0, 2.0]
# Slippage
slippage_scale: 20 # was 10
# Direction
direction_threshold: 0.002 # was 0.003
Key insight: a reward component's effective contribution is weight * scale * typical_raw_magnitude, not just the weight.