reward-function-v410

Updated: March 26, 2026, 12:54

v4.1.0 reward function redesign to fix overtrading and DSR dominance

Source: smith6jt-cop/Skills_Registry

reward-function-v410 - Research Notes

Experiment Overview

| Item | Details |
| --- | --- |
| Date | 2026-02-21 |
| Goal | Fix overtrading (HOLD 5-12%), DSR dominance, and direction collapse identified by v3.0 agent validation experiment |
| Environment | Google Colab A100, v4.0.0 baseline, 50M timesteps standard mode |
| Status | Implemented, pending A/B validation |

Context

The v3.0 agent validation (50M timesteps, 2 symbols x 3 seeds) produced a best result of PF=1.205 and fitness=0.153, far below the APPROVED thresholds. The Reward Engineer flagged CRITICAL issues in ALL runs:

- DSR contributes ~35% of the effective reward despite its 10% weight, because its raw signal is systematically larger than the position-size-attenuated P&L.
- The HOLD rate is 5-12% (pathological; normal is 30-70%) because frequent trading accumulates DSR reward.
- The direction component turns negative in late training because the direction weight decays to a negligible 0.05.

Additionally, _calculate_slippage() contained a Python for-loop with 2 .item() calls per env. With 1024 envs that is 2048 CUDA syncs per step, which halved FPS.

Verified Workflow

Curriculum Weight Formula (vectorized_env.py)

def _get_curriculum_weights(self, progress: float) -> dict:
    """Return reward weights for training progress in [0, 1], normalized to sum to 1.0 (key names assumed)."""
    raw = {
        "direction_w": max(0.10, 0.35 * (1.0 - progress)),    # 0.35 -> 0.10 (floor raised)
        "pnl_w": min(0.60, 0.15 + 0.55 * progress),           # 0.15 -> 0.60
        "drawdown_w": min(0.15, 0.03 + 0.12 * progress),      # 0.03 -> 0.15
        "exploration_w": max(0.02, 0.12 * (1.0 - progress)),  # 0.12 -> 0.02
        "dsr_w": max(0.03, 0.10 * (1.0 - progress * 0.7)),    # 0.10 -> 0.03
        "magnitude_w": 0.05,                                  # fixed
        "stop_tp_w": 0.10,                                    # fixed
        "slippage_w": 0.06,                                   # fixed
    }
    # Normalize so the weights sum to 1.0 at every progress value
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}
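As a quick check of the schedule above, the following standalone sketch re-implements the same formulas as free functions and prints the raw sum and two normalized weights at the start, midpoint, and end of training. The function names and the short dict keys are illustrative assumptions, not names taken from vectorized_env.py.

```python
def raw_curriculum_weights(progress: float) -> dict:
    """Raw (pre-normalization) weights from the v4.1.0 formulas.

    The dict keys are illustrative; the original does not show them.
    """
    return {
        "direction": max(0.10, 0.35 * (1.0 - progress)),
        "pnl": min(0.60, 0.15 + 0.55 * progress),
        "drawdown": min(0.15, 0.03 + 0.12 * progress),
        "exploration": max(0.02, 0.12 * (1.0 - progress)),
        "dsr": max(0.03, 0.10 * (1.0 - progress * 0.7)),
        "magnitude": 0.05,
        "stop_tp": 0.10,
        "slippage": 0.06,
    }


def normalized_curriculum_weights(progress: float) -> dict:
    """Divide every raw weight by the raw total so the result sums to 1.0."""
    raw = raw_curriculum_weights(progress)
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


for progress in (0.0, 0.5, 1.0):
    raw = raw_curriculum_weights(progress)
    norm = normalized_curriculum_weights(progress)
    print(
        f"progress={progress:.1f} raw_sum={sum(raw.values()):.3f} "
        f"dsr={norm['dsr']:.3f} pnl={norm['pnl']:.3f}"
    )
```

With these v4.1.0 formulas the raw weights sum to about 0.96 at progress 0 and about 1.11 at progress 1, so without the final division every component's effective share would shift as training advances.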

DSR Scaling

# OLD: torch.clamp(dsr * 10, -2.0, 2.0)
# NEW:
dsr_scaled = torch.clamp(dsr * 5, -1.0, 1.0)
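To see what this change does to the signal, here is a small pure-Python sketch that uses plain floats instead of tensors. The helpers `clamp`, `dsr_scaled_old`, and `dsr_scaled_new` are illustrative names, not functions from the codebase. Because the multiplier and the clamp bounds are both halved, the new output is half the old output for every input, and saturation still begins at |dsr| = 0.2.

```python
def clamp(value: float, low: float, high: float) -> float:
    """Scalar stand-in for torch.clamp."""
    return max(low, min(high, value))


def dsr_scaled_old(dsr: float) -> float:
    # Previous scaling: multiply by 10, clamp to [-2, 2]
    return clamp(dsr * 10, -2.0, 2.0)


def dsr_scaled_new(dsr: float) -> float:
    # v4.1.0 scaling: multiply by 5, clamp to [-1, 1]
    return clamp(dsr * 5, -1.0, 1.0)


# Arbitrary sample inputs chosen to cover both the linear and saturated regions
for dsr in (-0.5, -0.1, 0.0, 0.05, 0.2, 0.4):
    old = dsr_scaled_old(dsr)
    new = dsr_scaled_new(dsr)
    print(f"dsr={dsr:+.2f} old={old:+.2f} new={new:+.2f}")
```

In other words, the rescaling halves the DSR signal uniformly rather than changing where it saturates.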

Slippage Scaling

# OLD: return total_cost * 10.0  (slippage_weight=0.04)
# NEW: return total_cost * 20.0  (slippage_weight=0.06)
# Net effect: 3x stronger penalty
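The 3x figure follows from multiplying the curriculum weight by the scale factor. The sketch below reproduces that arithmetic before weight normalization. The raw per-trade costs of 0.001 and 0.003 are back-calculated assumptions chosen to match the 0.0004-0.0012 weighted range quoted for the old settings in the Failed Attempts table; they are not measured values, and `weighted_slippage_penalty` is a hypothetical helper.

```python
def weighted_slippage_penalty(total_cost: float, scale: float, weight: float) -> float:
    """Pre-normalization penalty: weight * scale * raw cost."""
    return weight * scale * total_cost


OLD = {"scale": 10.0, "weight": 0.04}
NEW = {"scale": 20.0, "weight": 0.06}

# Assumed raw costs, back-calculated from the quoted 0.0004-0.0012 range
for cost in (0.001, 0.003):
    old = weighted_slippage_penalty(cost, **OLD)
    new = weighted_slippage_penalty(cost, **NEW)
    print(f"raw_cost={cost}: old={old:.4f} new={new:.4f} ratio={new / old:.1f}x")
```

Under these assumptions the weighted penalty per trade moves from roughly 0.0004-0.0012 to roughly 0.0012-0.0036.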

Failed Attempts (Critical)

| Attempt | Why it Failed | Lesson Learned |
| --- | --- | --- |
| Fixed DSR weight at 10% | DSR raw magnitude >> P&L raw magnitude → DSR dominates despite lower weight | Weight alone doesn't control contribution; must also control raw signal scaling |
| Direction floor at 0.05 | Direction signal vanishes in late training → negative direction reward | Floor must be ≥0.10 for meaningful gradient signal |
| Slippage weight 0.04 + 10x scale | Net slippage penalty per trade is only 0.0004-0.0012 weighted → negligible vs DSR reward from trading | Must consider end-to-end magnitude: weight * scale * typical_raw_signal |
| Non-normalized curriculum weights | Weights sum to 0.96 at start, 1.06 at end → implicit weight changes confuse analysis | Always normalize to 1.0 |
| Python for-loop in _calculate_slippage | 2 `.item()` calls per env × 1024 envs = 2048 CUDA syncs/step → halves FPS | The same anti-pattern as `_get_observations()`: always vectorize, never `.item()` in hot paths |

Final Parameters

# Curriculum (start -> end, raw before normalization)
direction_w: 0.35 -> 0.10  # floor=0.10, was 0.40->0.05
pnl_w: 0.15 -> 0.60         # cap=0.60, was 0.10->0.55
drawdown_w: 0.03 -> 0.15
exploration_w: 0.12 -> 0.02
dsr_w: 0.10 -> 0.03         # NEW: decays (was fixed 0.10)
magnitude_w: 0.05           # fixed
stop_tp_w: 0.10             # fixed
slippage_w: 0.06            # was 0.04

# DSR scaling
dsr_scale: 5                # was 10
dsr_clamp: [-1.0, 1.0]      # was [-2.0, 2.0]

# Slippage
slippage_scale: 20           # was 10
direction_threshold: 0.002   # was 0.003

Key Insights

- The effective contribution of a reward component is weight * scale * typical_raw_magnitude, not just the weight.
- DSR responds to every step's return, producing consistently large signals, whereas P&L is attenuated by position_size_pct (2.6-7.9%). This asymmetry let DSR dominate despite its lower weight.
- Overtrading is rational when the DSR reward from trading exceeds the slippage penalty from trading, so the slippage penalty must exceed the DSR reward for unnecessary trades.
- Normalization prevents implicit weight drift as training progresses and component weights change.
- A direction floor of 0.10 (not 0.05) ensures the model always gets a meaningful gradient signal about directional accuracy.
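The first insight can be made concrete with a small helper that turns weight, scale, and typical raw magnitude into effective shares. This is only a sketch of the mechanics: the function names are hypothetical, and the two-component example uses made-up magnitudes rather than values measured in the validation runs.

```python
def effective_contributions(components: dict) -> dict:
    """Map each component name to weight * scale * typical_raw magnitude."""
    return {
        name: c["weight"] * c["scale"] * c["typical_raw"]
        for name, c in components.items()
    }


def contribution_shares(components: dict) -> dict:
    """Each component's fraction of the total absolute effective contribution."""
    effective = effective_contributions(components)
    total = sum(abs(value) for value in effective.values())
    return {name: abs(value) / total for name, value in effective.items()}


# Hypothetical numbers for illustration only (not from the experiment):
# P&L has the larger weight but a small, position-size-attenuated raw signal;
# DSR has the smaller weight but a larger scaled raw signal.
example = {
    "pnl": {"weight": 0.55, "scale": 1.0, "typical_raw": 0.05},
    "dsr": {"weight": 0.10, "scale": 10.0, "typical_raw": 0.10},
}

for name, share in contribution_shares(example).items():
    print(f"{name}: {share:.1%} of effective reward")
```

Even though the DSR weight in this toy setup is far smaller than the P&L weight, DSR ends up with the larger effective share, which is the same kind of asymmetry the notes describe.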

References

- Bengio et al. (2009), "Curriculum Learning", ICML.
- Corwin & Schultz (2012), "A Simple Way to Estimate Bid-Ask Spreads from Daily High and Low Prices", The Journal of Finance.
- Moody & Saffell (1998), "Reinforcement Learning for Trading Systems and Portfolios", KDD.
- v3.0 agent validation experiment results (agent_validation_results.zip).

name	reward-function-v410
description	v4.1.0 reward function redesign to fix overtrading and DSR dominance
author	Claude Code
date	"2026-02-21T00:00:00.000Z"
