calibration-analysis

Updated: February 26, 2026, 19:53

Systematic analysis of model predictions vs realized outcomes. Use when computing Brier Score, Information Ratio, calibration curves, PnL attribution, or conditional calibration by regime/volatility/funding. Identifies which models are adding noise vs value.

Source: trudumb/hyper_make

Calibration Analysis Skill

Purpose

Systematic analysis of model predictions vs realized outcomes. This tells you exactly where your models are wrong and by how much.

When to Use

Daily model health checks
Debugging unexpected PnL patterns
Deciding which model component to improve next
Validating changes before/after deployment
Any time you want to know "is this model actually working?"

Prerequisites

measurement-infrastructure must be implemented and logging data
At least 24 hours of prediction data (preferably 1 week+)

Key Metrics

1. Brier Score

The mean squared error of probability predictions:

BS = (1/N) Σ (pᵢ - oᵢ)²

Where:

pᵢ = predicted probability
oᵢ ∈ {0, 1} = actual outcome

Interpretation:

BS = 0: Perfect predictions
BS = 0.25: Predicting 50% for everything (random)
BS > 0.25: Worse than random
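
As a minimal sketch of the formula above (not the code from implementation.md), the Brier score can be computed in a few lines of Rust. The function name `brier_score` and the choice to represent outcomes as `bool` are assumptions made for this illustration.

```rust
/// Mean squared error between predicted probabilities and binary outcomes.
/// Returns None when the inputs are empty or their lengths differ.
/// (Hypothetical helper; the name and signature are illustrative.)
pub fn brier_score(predictions: &[f64], outcomes: &[bool]) -> Option<f64> {
    if predictions.is_empty() || predictions.len() != outcomes.len() {
        return None;
    }
    let sum_sq: f64 = predictions
        .iter()
        .zip(outcomes)
        .map(|(&p, &o)| {
            // Map the realized outcome to 1.0 (happened) or 0.0 (did not).
            let target = if o { 1.0 } else { 0.0 };
            (p - target).powi(2)
        })
        .sum();
    Some(sum_sq / predictions.len() as f64)
}
```

Predicting 0.5 for every event yields 0.25 regardless of the outcomes, which is where the "random" reference level comes from.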

2. Brier Score Decomposition

Split Brier score into interpretable components:

BS = Reliability - Resolution + Uncertainty

Reliability = (1/N) Σₖ nₖ(p̄ₖ - ōₖ)²
  - Measures calibration quality
  - Lower is better
  - If your 70% predictions hit 70%, this is 0

Resolution = (1/N) Σₖ nₖ(ōₖ - ō)²
  - Measures discrimination ability
  - Higher is better
  - Do your high predictions differ from low predictions?

Uncertainty = ō(1 - ō)
  - Base rate variance
  - Not controllable, just inherent difficulty

3. Information Ratio

IR = Resolution / Uncertainty

Interpretation:

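IR > 1.0: Model predictions carry useful information
IR ≈ 1.0: Model is roughly as good as predicting base rate
IR < 1.0: Model is adding noise (REMOVE IT)

Caveat: when Resolution and Uncertainty are computed on the same sample as defined above, Resolution ≤ Uncertainty (law of total variance), so this ratio cannot exceed 1.0. The 1.0 cutoff matches the ratio Resolution / Reliability instead: that ratio exceeds 1.0 exactly when BS < Uncertainty, i.e. when the model beats a constant base-rate forecast. Check which ratio implementation.md computes before relying on the threshold.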

Implementation

Brier Score Decomposition

compute_brier_decomposition(predictions, outcomes, num_bins) bins predictions and computes reliability, resolution, uncertainty, and IR. See implementation.md for full code.
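
The following is a rough sketch of how such a decomposition could look, assuming equal-width bins on [0, 1]. It is not the code from implementation.md: the struct `BrierDecomposition`, the function name `brier_decomposition`, and the binning rule are assumptions. The IR method follows the definition given under Key Metrics.

```rust
#[derive(Debug, Clone, Copy)]
pub struct BrierDecomposition {
    pub reliability: f64,
    pub resolution: f64,
    pub uncertainty: f64,
}

impl BrierDecomposition {
    /// IR as defined in this document: Resolution / Uncertainty.
    /// None when every outcome is identical (Uncertainty = 0).
    pub fn information_ratio(&self) -> Option<f64> {
        (self.uncertainty > 0.0).then(|| self.resolution / self.uncertainty)
    }
}

/// Hypothetical sketch: bin predictions into `num_bins` equal-width bins
/// and compute the three terms of the decomposition.
pub fn brier_decomposition(
    predictions: &[f64],
    outcomes: &[bool],
    num_bins: usize,
) -> Option<BrierDecomposition> {
    let n = predictions.len();
    if n == 0 || n != outcomes.len() || num_bins == 0 {
        return None;
    }
    // Per-bin accumulators: (count, sum of predictions, sum of outcomes).
    let mut bins = vec![(0usize, 0.0f64, 0.0f64); num_bins];
    for (&p, &o) in predictions.iter().zip(outcomes) {
        // p = 1.0 would index one past the end, so clamp to the last bin.
        let k = ((p.clamp(0.0, 1.0) * num_bins as f64) as usize).min(num_bins - 1);
        bins[k].0 += 1;
        bins[k].1 += p;
        bins[k].2 += if o { 1.0 } else { 0.0 };
    }
    let total = n as f64;
    let base_rate = bins.iter().map(|b| b.2).sum::<f64>() / total;
    let (mut reliability, mut resolution) = (0.0, 0.0);
    for &(count, sum_p, sum_o) in &bins {
        if count == 0 {
            continue; // empty bins contribute nothing
        }
        let nk = count as f64;
        let mean_p = sum_p / nk;
        let mean_o = sum_o / nk;
        reliability += nk * (mean_p - mean_o).powi(2);
        resolution += nk * (mean_o - base_rate).powi(2);
    }
    Some(BrierDecomposition {
        reliability: reliability / total,
        resolution: resolution / total,
        uncertainty: base_rate * (1.0 - base_rate),
    })
}
```

Note that Reliability - Resolution + Uncertainty reproduces the raw Brier score exactly only when all predictions within a bin are equal. With spread-out predictions inside a bin, the binned decomposition is an approximation.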

Calibration Curve

build_calibration_curve(predictions, outcomes, num_bins) sorts predictions into equal-sized bins, computes realized rates, and adds Wilson score 95% CIs. See implementation.md for full code.
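
A minimal sketch of this idea, assuming "equal-sized" means equal-count bins taken from the sorted predictions. The names `CalibrationPoint`, `wilson_interval`, and `calibration_curve` are hypothetical and do not come from implementation.md. The interval uses the standard Wilson score formula with z = 1.96.

```rust
#[derive(Debug, Clone)]
pub struct CalibrationPoint {
    pub mean_prediction: f64,
    pub realized_rate: f64,
    pub ci_low: f64,
    pub ci_high: f64,
    pub count: usize,
}

/// Wilson score interval for `successes` out of `n` trials, z = 1.96 (about 95%).
pub fn wilson_interval(successes: usize, n: usize) -> (f64, f64) {
    if n == 0 {
        return (0.0, 1.0); // no data: the interval is uninformative
    }
    let z = 1.96_f64;
    let n_f = n as f64;
    let p_hat = successes as f64 / n_f;
    let denom = 1.0 + z * z / n_f;
    let center = (p_hat + z * z / (2.0 * n_f)) / denom;
    let half = z * (p_hat * (1.0 - p_hat) / n_f + z * z / (4.0 * n_f * n_f)).sqrt() / denom;
    ((center - half).max(0.0), (center + half).min(1.0))
}

/// Sort (prediction, outcome) pairs by prediction and split them into
/// chunks of roughly equal count. May return fewer than `num_bins` points.
pub fn calibration_curve(
    predictions: &[f64],
    outcomes: &[bool],
    num_bins: usize,
) -> Vec<CalibrationPoint> {
    if predictions.is_empty() || predictions.len() != outcomes.len() || num_bins == 0 {
        return Vec::new();
    }
    let mut pairs: Vec<(f64, bool)> = predictions
        .iter()
        .copied()
        .zip(outcomes.iter().copied())
        .collect();
    pairs.sort_by(|a, b| a.0.total_cmp(&b.0));
    // Ceiling division so that every pair lands in some chunk.
    let chunk = (pairs.len() + num_bins - 1) / num_bins;
    pairs
        .chunks(chunk)
        .map(|bin| {
            let count = bin.len();
            let hits = bin.iter().filter(|pair| pair.1).count();
            let mean_prediction = bin.iter().map(|pair| pair.0).sum::<f64>() / count as f64;
            let (ci_low, ci_high) = wilson_interval(hits, count);
            CalibrationPoint {
                mean_prediction,
                realized_rate: hits as f64 / count as f64,
                ci_low,
                ci_high,
                count,
            }
        })
        .collect()
}
```

A bin is a calibration concern when its mean prediction falls outside the interval around the realized rate. Wide intervals usually mean the bin has too few samples to draw a conclusion.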

Conditional Calibration

Slice calibration by conditioning variables (VolatilityQuartile, FundingRegime, TimeOfDay, InventoryState, RecentFillRate, BookImbalance, Regime). Groups records by condition, requires min 100 samples per slice, computes Brier decomposition per slice. Regime slicing uses cascade/volatile/trending/quiet thresholds at 0.5 probability. See implementation.md for full code.
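
The grouping and minimum-sample filter described above can be sketched as follows. This is a simplified illustration, not the implementation.md code: it slices only by regime, reports a per-slice Brier score and calibration gap rather than the full decomposition, and the types `Regime`, `Record`, and `SliceStats` are hypothetical.

```rust
use std::collections::BTreeMap;

// Hypothetical regime label; variant names mirror the regimes named in this document.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Regime {
    Quiet,
    Trending,
    Volatile,
    Cascade,
}

// Hypothetical prediction record carrying its conditioning variable.
#[derive(Debug, Clone, Copy)]
pub struct Record {
    pub prediction: f64,
    pub outcome: bool,
    pub regime: Regime,
}

#[derive(Debug, Clone, Copy)]
pub struct SliceStats {
    pub count: usize,
    pub brier: f64,
    pub mean_prediction: f64,
    pub realized_rate: f64,
}

/// Group records by regime and keep only slices with at least `min_samples` records.
pub fn calibration_by_regime(
    records: &[Record],
    min_samples: usize,
) -> BTreeMap<Regime, SliceStats> {
    // Accumulators per regime: (count, sum squared error, sum prediction, sum outcome).
    let mut acc: BTreeMap<Regime, (usize, f64, f64, f64)> = BTreeMap::new();
    for r in records {
        let o: f64 = if r.outcome { 1.0 } else { 0.0 };
        let e = acc.entry(r.regime).or_insert((0, 0.0, 0.0, 0.0));
        e.0 += 1;
        e.1 += (r.prediction - o).powi(2);
        e.2 += r.prediction;
        e.3 += o;
    }
    acc.into_iter()
        // Small slices give noisy estimates, so drop them instead of reporting them.
        .filter(|(_, stats)| stats.0 >= min_samples)
        .map(|(regime, (count, sq, sp, so))| {
            let n = count as f64;
            let stats = SliceStats {
                count,
                brier: sq / n,
                mean_prediction: sp / n,
                realized_rate: so / n,
            };
            (regime, stats)
        })
        .collect()
}
```

Calling `calibration_by_regime(&records, 100)` applies the 100-sample minimum mentioned above. A slice whose `mean_prediction` sits far from its `realized_rate` is a candidate for the conditional calibration issues section of the report.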

PnL Attribution

Decompose daily PnL into four components: spread capture (revenue from bid-ask), adverse selection (loss from fills before adverse moves), inventory cost (mark-to-market), and fees (1.5 bps maker). Also breaks down PnL by regime (quiet/trending/volatile/cascade) with time fractions. See implementation.md for full PnLAttribution struct and compute_pnl_attribution() code.
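
To make the components concrete, here is a hedged per-fill sketch. It is not the `PnLAttribution` struct or `compute_pnl_attribution()` from implementation.md. The `Fill` fields, the markout-based definition of adverse selection, and the sign conventions are all assumptions chosen for illustration. Inventory cost and the per-regime breakdown are left out.

```rust
#[derive(Debug, Clone, Copy)]
pub enum Side {
    Buy,
    Sell,
}

// Hypothetical fill record; field names are illustrative.
#[derive(Debug, Clone, Copy)]
pub struct Fill {
    pub side: Side,
    pub price: f64,
    pub size: f64,
    pub mid_at_fill: f64,
    /// Mid price at the end of an assumed markout horizon.
    pub mid_after: f64,
}

#[derive(Debug, Default, Clone, Copy)]
pub struct PnlBreakdown {
    pub spread_capture: f64,
    /// Positive values are losses from the mid moving against the fill.
    pub adverse_selection: f64,
    pub fees: f64,
}

impl PnlBreakdown {
    /// Net of the three components tracked here (inventory cost not included).
    pub fn net(&self) -> f64 {
        self.spread_capture - self.adverse_selection - self.fees
    }

    /// AS / spread_capture; None when no spread was captured.
    pub fn adverse_selection_ratio(&self) -> Option<f64> {
        (self.spread_capture > 0.0).then(|| self.adverse_selection / self.spread_capture)
    }
}

/// Maker fee from this document, in basis points.
pub const MAKER_FEE_BPS: f64 = 1.5;

pub fn attribute_fills(fills: &[Fill]) -> PnlBreakdown {
    let mut out = PnlBreakdown::default();
    for f in fills {
        let sign = match f.side {
            Side::Buy => 1.0,
            Side::Sell => -1.0,
        };
        // Edge earned versus the mid at the moment of the fill.
        out.spread_capture += sign * (f.mid_at_fill - f.price) * f.size;
        // Loss when the mid later moves against the position just taken.
        out.adverse_selection += sign * (f.mid_at_fill - f.mid_after) * f.size;
        // Fee charged on traded notional.
        out.fees += f.price * f.size * MAKER_FEE_BPS / 10_000.0;
    }
    out
}
```

Under these assumptions, a buy of 2 units at 99.95 against a mid of 100.00 that later drifts to 99.98 captures about 0.10 of spread, gives back about 0.04 to adverse selection, and pays roughly 0.03 in fees.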

Daily Report Template

=== Calibration Report: {date} ===

PnL Attribution
───────────────────────────────────────────
Gross PnL:              ${gross_pnl:>10.2}
├── Spread Capture:     ${spread_capture:>10.2}  {spread_status}
├── Adverse Selection:  ${adverse_selection:>10.2}  {as_status}
├── Inventory Cost:     ${inventory_cost:>10.2}
└── Fees:               ${fees:>10.2}

Model Calibration
───────────────────────────────────────────
                        Brier   IR      Status
Fill Prediction (1s):   {fp_brier:.3}  {fp_ir:.2}   {fp_status}
Fill Prediction (10s):  {fp10_brier:.3} {fp10_ir:.2}  {fp10_status}
Adverse Selection:      {as_brier:.3}  {as_ir:.2}   {as_status_model}
Volatility (RMSE):      {vol_rmse:.6}          {vol_status}

Regime Distribution
───────────────────────────────────────────
            Time    PnL         PnL/Hour
Quiet:      {quiet_time:>4.0%}   ${quiet_pnl:>8.2}   ${quiet_rate:>6.2}/hr
Trending:   {trend_time:>4.0%}   ${trend_pnl:>8.2}   ${trend_rate:>6.2}/hr
Volatile:   {vol_time:>4.0%}   ${vol_pnl:>8.2}   ${vol_rate:>6.2}/hr
Cascade:    {casc_time:>4.0%}   ${casc_pnl:>8.2}   ${casc_rate:>6.2}/hr

Conditional Calibration Issues
───────────────────────────────────────────
{conditional_issues}

Actionable Items
───────────────────────────────────────────
{action_items}

Report Generation

generate_daily_report(date) loads prediction records, computes PnL attribution, runs Brier decomposition on fill (1s/10s) and adverse selection models, checks conditional calibration across regime/volatility/funding slices, and generates prioritized action items. Action item priorities: HIGH for AS > 50% of spread capture or IR < 1.0; MEDIUM for cascade losses or conditional issues. See implementation.md for full code.

Alert Thresholds

Default alert thresholds:

Threshold	Default	Meaning
`min_information_ratio`	1.0	Below this, model is useless
`max_brier_score`	0.25	Above this, worse than random
`max_daily_loss`	$500	Maximum tolerated loss per day, in dollars
`max_adverse_selection_ratio`	0.7	AS / spread_capture
`max_cascade_loss`	$100	Maximum tolerated loss in the cascade regime, in dollars

check_alerts(report, thresholds) emits Alert::Critical for IR below threshold or daily loss exceeding limit. See implementation.md for full AlertThresholds struct and check_alerts() code.
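
The defaults in the table and the two critical conditions can be sketched as below. This is an illustration, not the implementation.md code: the `ReportSummary` struct and the `Alert::Warning` variant are hypothetical, and treating the remaining three thresholds as warnings is an assumption.

```rust
#[derive(Debug, Clone)]
pub struct AlertThresholds {
    pub min_information_ratio: f64,
    pub max_brier_score: f64,
    pub max_daily_loss: f64,
    pub max_adverse_selection_ratio: f64,
    pub max_cascade_loss: f64,
}

impl Default for AlertThresholds {
    // Values taken from the default threshold table.
    fn default() -> Self {
        Self {
            min_information_ratio: 1.0,
            max_brier_score: 0.25,
            max_daily_loss: 500.0,
            max_adverse_selection_ratio: 0.7,
            max_cascade_loss: 100.0,
        }
    }
}

#[derive(Debug, Clone, PartialEq)]
pub enum Alert {
    Critical(String),
    Warning(String), // hypothetical severity level for this sketch
}

// Hypothetical condensed report for a single model.
pub struct ReportSummary {
    pub model_name: String,
    pub brier: f64,
    pub information_ratio: f64,
    pub daily_pnl: f64,
    pub adverse_selection_ratio: f64,
    pub cascade_pnl: f64,
}

pub fn check_alerts(report: &ReportSummary, t: &AlertThresholds) -> Vec<Alert> {
    let mut alerts = Vec::new();
    // Critical conditions named in this document.
    if report.information_ratio < t.min_information_ratio {
        alerts.push(Alert::Critical(format!(
            "{}: IR {:.2} below {:.2}",
            report.model_name, report.information_ratio, t.min_information_ratio
        )));
    }
    if -report.daily_pnl > t.max_daily_loss {
        alerts.push(Alert::Critical(format!("daily loss ${:.2}", -report.daily_pnl)));
    }
    // Remaining thresholds: assumed here to be warnings.
    if report.brier > t.max_brier_score {
        alerts.push(Alert::Warning(format!("{}: Brier {:.3}", report.model_name, report.brier)));
    }
    if report.adverse_selection_ratio > t.max_adverse_selection_ratio {
        alerts.push(Alert::Warning(format!("AS ratio {:.2}", report.adverse_selection_ratio)));
    }
    if -report.cascade_pnl > t.max_cascade_loss {
        alerts.push(Alert::Warning(format!("cascade loss ${:.2}", -report.cascade_pnl)));
    }
    alerts
}
```

Losses are represented as negative PnL in this sketch, so the loss limits are compared against the negated PnL values.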

Common Patterns

"Model is well-calibrated overall but fails in regime X"

This is the most common pattern. Solution:

Identify which regime has IR < 1.0
Either: train regime-specific model, or
Fall back to wider spreads / simpler model in that regime

"Calibration looks good but still losing money"

Possible causes:

Good calibration on the wrong metric (e.g., calibrated fill prediction but adverse selection is the real problem)
Execution slippage not captured in calibration
Latency effects (predictions are stale by the time orders are placed)

"IR > 1 but Brier score is high"

Model has good discrimination (can tell high from low) but poor calibration (predictions don't match frequencies). Fix with isotonic regression or Platt scaling.
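
For the isotonic regression option, a minimal sketch using the pool-adjacent-violators algorithm is shown below. The function name `isotonic_fit` is hypothetical, and the input is assumed to be the realized outcomes (as 0.0 or 1.0) already sorted by ascending raw prediction. The output is the recalibrated probability for each position.

```rust
/// Pool-adjacent-violators: returns the non-decreasing sequence closest
/// (in squared error) to `outcomes_sorted_by_prediction`.
pub fn isotonic_fit(outcomes_sorted_by_prediction: &[f64]) -> Vec<f64> {
    // Each block stores (sum of values, number of values) for a pooled run.
    let mut blocks: Vec<(f64, usize)> = Vec::new();
    for &y in outcomes_sorted_by_prediction {
        blocks.push((y, 1));
        // Merge backwards while the previous block's mean exceeds the last one's.
        while blocks.len() >= 2 {
            let (s2, c2) = blocks[blocks.len() - 1];
            let (s1, c1) = blocks[blocks.len() - 2];
            if s1 / c1 as f64 > s2 / c2 as f64 {
                blocks.pop();
                blocks.pop();
                blocks.push((s1 + s2, c1 + c2));
            } else {
                break;
            }
        }
    }
    // Expand each block back to one fitted value per input position.
    blocks
        .iter()
        .flat_map(|&(s, c)| std::iter::repeat(s / c as f64).take(c))
        .collect()
}
```

Because the mapping is monotone, it never reverses the ranking of predictions, so discrimination is not made worse while the predicted frequencies are pulled toward the realized ones. The mapping should be fit on held-out data to avoid overfitting the recalibration.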

Dependencies

Requires: measurement-infrastructure (prediction logs with outcomes)
Enables: All model improvement work, daily-calibration-report

Next Steps

After analyzing calibration:

Identify weakest component (lowest IR or biggest PnL drag)
Read that component's skill file
Use signal-audit to identify better features
Implement improvement
Re-run calibration to validate

Supporting Files

implementation.md -- All Rust code: Brier decomposition, calibration curve, conditional calibration, PnL attribution, report generation, action items, alert thresholds

name	calibration-analysis
description	Systematic analysis of model predictions vs realized outcomes. Use when computing Brier Score, Information Ratio, calibration curves, PnL attribution, or conditional calibration by regime/volatility/funding. Identifies which models are adding noise vs value.
requires	["measurement-infrastructure"]
user-invocable	false