| name | risk-management |
| description | Layered risk system with monitors, circuit breakers, kill switch, and position guards. Use when working on risk/, safety/, or monitoring/ modules, debugging position limits, emergency shutdowns, spread widening, or adding new risk monitors. Covers RiskMonitor trait, severity escalation, and defense-first architecture. |
| user-invocable | false |
Risk Management Skill
Purpose
Understand and modify the layered risk management system that protects the market maker from catastrophic loss. This system gates every quote cycle — if risk says "no", no quotes go out.
Defense first: when uncertain, widen spreads. Missing a trade is cheap; getting run over in a cascade is not.
When to Use
- Working on risk/, safety/, or monitoring/ modules
- Debugging why quotes are being gated or spreads are widening
- Adding new risk monitors or circuit breakers
- Investigating position limit breaches or kill switch triggers
- Understanding why the system stopped quoting
Module Map
All paths relative to src/market_maker/:
risk/
  mod.rs              # Re-exports, RiskState, RiskAggregator, RiskMonitor trait
  state.rs            # RiskState: unified snapshot of all risk-relevant data
  monitor.rs          # RiskMonitor trait, RiskSeverity, RiskAction, RiskAssessment
  aggregator.rs       # AggregatedRisk: collects assessments, computes max severity
  limits.rs           # RiskLimits, RiskChecker: soft/hard position & order limits
  circuit_breaker.rs  # CircuitBreakerMonitor: market condition triggers
  drawdown.rs         # DrawdownTracker: equity drawdown with daily/lifetime thresholds
  kill_switch.rs      # KillSwitch: immediate trading halt
  position_guard.rs   # PositionGuard: inventory limits & soft threshold scaling
  reentry.rs          # ReentryController: controlled re-engagement after drawdown
  monitors/
    cascade.rs        # CascadeMonitor: OI-based liquidation cascade detection
    data_staleness.rs # DataStalenessMonitor: feed staleness triggers cautious mode
    drawdown.rs       # DrawdownMonitor: position-specific drawdown
    loss.rs           # LossMonitor: cumulative loss monitoring
    position.rs       # PositionMonitor: concentration & leverage limits
    rate_limit.rs     # RateLimitMonitor: order rate limit tracking
safety/
  auditor.rs          # SafetyAuditor: periodic state reconciliation
monitoring/
  alerter.rs          # Alerter: thread-safe alerting with deduplication
  dashboard.rs        # DashboardState: real-time terminal display
  postmortem.rs       # Post-trade analysis
Architecture
Data Flow
All monitors evaluate the same RiskState snapshot in a single pass. This prevents race conditions, stale data, and partial updates.
Market Data + Position State + PnL
-> RiskState (unified snapshot)
-> [LossMonitor, PositionMonitor, CascadeMonitor, ...]
-> RiskAggregator (max severity, spread factor, kill reasons)
-> Quote Engine (gates quotes, widens spreads, or kills trading)
RiskMonitor Trait
Every risk monitor implements:
trait RiskMonitor {
    fn name(&self) -> &str;
    fn evaluate(&self, state: &RiskState) -> RiskAssessment;
}

struct RiskAssessment {
    severity: RiskSeverity,
    action: RiskAction,
    reason: String,
    spread_multiplier: f64,
}
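To make the trait concrete, here is a minimal sketch of one monitor implementing it. The enum variants, the two `RiskState` fields, and the `InventoryUsageMonitor` type are all assumptions made for illustration; the real definitions live in monitor.rs and state.rs and may differ.

```rust
// Stand-in types mirroring the names in this document. Variant and field
// sets are assumptions for illustration, not the real definitions.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum RiskSeverity { Normal, Caution, Warning, Critical, Emergency }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RiskAction { NoAction, LogOnly, WidenSpreads, PullQuotes, KillSwitch }

pub struct RiskState {
    pub inventory: f64,     // hypothetical field: signed position
    pub max_inventory: f64, // hypothetical field: assumed to be > 0
}

pub struct RiskAssessment {
    pub severity: RiskSeverity,
    pub action: RiskAction,
    pub reason: String,
    pub spread_multiplier: f64,
}

pub trait RiskMonitor {
    fn name(&self) -> &str;
    fn evaluate(&self, state: &RiskState) -> RiskAssessment;
}

/// Hypothetical monitor: raises Warning once inventory usage reaches a
/// configured fraction of the limit.
pub struct InventoryUsageMonitor {
    pub warn_fraction: f64,
}

impl RiskMonitor for InventoryUsageMonitor {
    fn name(&self) -> &str {
        "inventory_usage"
    }

    fn evaluate(&self, state: &RiskState) -> RiskAssessment {
        let usage = state.inventory.abs() / state.max_inventory;
        if usage >= self.warn_fraction {
            RiskAssessment {
                severity: RiskSeverity::Warning,
                action: RiskAction::WidenSpreads,
                reason: format!("inventory at {:.0}% of limit", usage * 100.0),
                spread_multiplier: 1.5,
            }
        } else {
            RiskAssessment {
                severity: RiskSeverity::Normal,
                action: RiskAction::NoAction,
                reason: String::from("inventory within soft threshold"),
                spread_multiplier: 1.0,
            }
        }
    }
}
```

The monitor reads only from the snapshot it is handed and returns a value, which is what lets every monitor be evaluated against the same RiskState in one pass.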
Severity Escalation
Normal -> Caution -> Warning -> Critical -> Emergency
- Normal: no-op
- Caution: log only
- Warning: widen spreads 1.5x
- Critical: pull quotes
- Emergency: kill switch
The RiskAggregator takes the maximum severity across all monitors.
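A minimal sketch of that aggregation step, assuming severities are ordered by declaration and that the aggregate spread factor is the largest multiplier any monitor asked for. The `aggregate` function and the field names on `AggregatedRisk` are hypothetical; only the "maximum severity" rule comes from the text above.

```rust
// Deriving Ord on the enum orders variants by declaration, so
// Normal < Caution < Warning < Critical < Emergency.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum RiskSeverity { Normal, Caution, Warning, Critical, Emergency }

pub struct RiskAssessment {
    pub severity: RiskSeverity,
    pub reason: String,
    pub spread_multiplier: f64,
}

// Hypothetical shape of the aggregate result.
pub struct AggregatedRisk {
    pub max_severity: RiskSeverity,
    pub spread_multiplier: f64,
    pub kill_reasons: Vec<String>,
}

/// Fold all assessments into one result: worst severity wins, the widest
/// requested spread multiplier wins (assumption), and Emergency reasons
/// are collected for the kill switch log.
pub fn aggregate(assessments: &[RiskAssessment]) -> AggregatedRisk {
    let mut out = AggregatedRisk {
        max_severity: RiskSeverity::Normal,
        spread_multiplier: 1.0,
        kill_reasons: Vec::new(),
    };
    for a in assessments {
        out.max_severity = out.max_severity.max(a.severity);
        out.spread_multiplier = out.spread_multiplier.max(a.spread_multiplier);
        if a.severity == RiskSeverity::Emergency {
            out.kill_reasons.push(a.reason.clone());
        }
    }
    out
}
```

With no assessments at all, this sketch reports Normal with a multiplier of 1.0.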
Monitor Catalog
1. LossMonitor (monitors/loss.rs)
- Tracks cumulative realized + unrealized PnL
- Thresholds: warning at 50% of daily limit, critical at 80%, emergency at 100%
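The threshold ladder can be sketched as a pure function. The name `loss_severity` and the convention that `loss` is a positive number in the same currency as `daily_limit` are assumptions for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum RiskSeverity { Normal, Caution, Warning, Critical, Emergency }

/// Map the current loss (positive = money lost) to a severity using the
/// 50% / 80% / 100% ladder described above. `daily_limit` is assumed > 0.
pub fn loss_severity(loss: f64, daily_limit: f64) -> RiskSeverity {
    let used = loss / daily_limit;
    if used >= 1.0 {
        RiskSeverity::Emergency
    } else if used >= 0.8 {
        RiskSeverity::Critical
    } else if used >= 0.5 {
        RiskSeverity::Warning
    } else {
        RiskSeverity::Normal
    }
}
```

Checking the highest threshold first keeps each branch a single comparison and makes the escalation order easy to test.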
2. PositionMonitor (monitors/position.rs)
- Hard invariant: inventory.abs() <= max_inventory
- Soft threshold at ~70% — begins reducing quote sizes
- Monitors concentration across assets in multi-asset mode
3. CascadeMonitor (monitors/cascade.rs)
- Detects liquidation cascades via OI drops > 2% in 1 minute
- Immediately widens spreads and may pull quotes
- Critical for crypto: cascades can move price 5-10% in seconds
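As a rough sketch of the detection rule, assuming the monitor has open interest readings from now and from one minute earlier. The function names and the way the two readings are obtained are hypothetical.

```rust
/// Fractional drop in open interest over the window; 0.0 if OI rose or
/// the earlier reading is unusable.
pub fn oi_drop_fraction(oi_one_minute_ago: f64, oi_now: f64) -> f64 {
    if oi_one_minute_ago <= 0.0 {
        return 0.0;
    }
    ((oi_one_minute_ago - oi_now) / oi_one_minute_ago).max(0.0)
}

/// True when OI fell by more than 2% over the one-minute window.
pub fn is_cascade(oi_one_minute_ago: f64, oi_now: f64) -> bool {
    oi_drop_fraction(oi_one_minute_ago, oi_now) > 0.02
}
```

For example, a move from 1000 to 970 contracts is a 3% drop and would trip this check, while 1000 to 990 would not.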
4. DataStalenessMonitor (monitors/data_staleness.rs)
- Monitors feed freshness (L2 book, trades, Binance)
- Triggers cautious mode if data older than 5s
- Prevents quoting on stale information
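A minimal staleness check might look like the following, assuming each feed records the `Instant` of its last update. The `stale_feeds` helper, the feed names, and the constant name are illustrative, not the real API.

```rust
use std::time::{Duration, Instant};

/// Maximum tolerated feed age, per the 5s rule above.
pub const MAX_FEED_AGE: Duration = Duration::from_secs(5);

/// Return the names of feeds whose last update is older than MAX_FEED_AGE.
/// `saturating_duration_since` yields zero if a timestamp is in the future,
/// so clock ordering quirks never panic here.
pub fn stale_feeds<'a>(feeds: &[(&'a str, Instant)], now: Instant) -> Vec<&'a str> {
    feeds
        .iter()
        .filter(|(_, last)| now.saturating_duration_since(*last) > MAX_FEED_AGE)
        .map(|(name, _)| *name)
        .collect()
}
```

Passing `now` in as a parameter, rather than calling `Instant::now()` inside, keeps the check deterministic in tests.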
5. DrawdownMonitor (monitors/drawdown.rs)
- Peak-to-trough drawdown per position and aggregate
- Daily and lifetime drawdown limits via DrawdownTracker with high-water mark
6. RateLimitMonitor (monitors/rate_limit.rs)
- Proactive: slows down before hitting exchange rate limits
- Reactive: backs off after receiving rate limit rejections
Circuit Breaker System
CircuitBreakerMonitor in circuit_breaker.rs handles market-condition triggers:
| Trigger | Detection | Response |
|---|---|---|
| OI drop > threshold | open_interest_delta_1m | Widen spreads, may pull quotes |
| Funding extreme | abs(funding_rate) > threshold | Widen spreads |
| Spread blowout | Market spread > 5x normal | Reduce size or pause |
| Fill collapse | Fill rate drops to 0 | Check connectivity |
| Model degradation | IR < critical threshold | Reduce model weight |
Kill Switch
KillSwitch in kill_switch.rs — last line of defense:
- Immediate: cancels all orders, closes positions at market
- Triggered by: Emergency-level assessments from any monitor
- Requires manual reset: system won't restart automatically
- Logs everything: full state dump for post-mortem
Position Guard
PositionGuard enforces inventory limits with soft/hard thresholds:
- 0-70% of limit: full quote sizes on both sides
- 70-100% of limit: linearly reduce quote size on the expanding side
- At 100%: only quotes that reduce position (reduce-only mode)
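The three bands can be expressed as one scaling function. This is a sketch under assumptions: the name `size_scale`, the boolean flag for "this side would grow the position", and the exact linear form between 70% and 100% are illustrative rather than taken from position_guard.rs.

```rust
/// Multiplier in [0, 1] applied to the quote size on one side of the book.
/// `side_increases_position` is true for the side whose fills would grow
/// the absolute inventory (the "expanding" side).
pub fn size_scale(inventory: f64, max_inventory: f64, side_increases_position: bool) -> f64 {
    const SOFT_THRESHOLD: f64 = 0.7;
    if !side_increases_position {
        return 1.0; // the reducing side always quotes full size
    }
    if max_inventory <= 0.0 {
        return 0.0; // no usable limit configured: do not add to the position
    }
    let usage = (inventory.abs() / max_inventory).clamp(0.0, 1.0);
    if usage <= SOFT_THRESHOLD {
        1.0
    } else {
        // Linear ramp from 1.0 at the soft threshold down to 0.0 at the limit.
        (1.0 - usage) / (1.0 - SOFT_THRESHOLD)
    }
}
```

At the hard limit the expanding side scales to zero while the reducing side stays at full size, which is the reduce-only behaviour described above.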
Re-entry After Drawdown
ReentryController in reentry.rs:
- After significant drawdown, don't jump back to full size
- Gradually increase position limits over time
- Reset if another drawdown occurs during recovery
- Configurable recovery period and scaling curve
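One way to picture the behaviour is a linear ramp that restarts whenever a new drawdown occurs. The `ReentryRamp` type, its method names, and the linear curve are assumptions for illustration; the real controller has a configurable scaling curve.

```rust
use std::time::Duration;

/// Hypothetical ramp: scales position limits from `min_fraction` back to
/// 1.0 over `recovery`, restarting on each new drawdown.
pub struct ReentryRamp {
    recovery: Duration,
    min_fraction: f64,
    elapsed: Duration,
}

impl ReentryRamp {
    /// Create a ramp that starts at the bottom, as if a drawdown just ended.
    pub fn new(recovery: Duration, min_fraction: f64) -> Self {
        Self { recovery, min_fraction, elapsed: Duration::ZERO }
    }

    /// Advance recovery time, capped at the full recovery period.
    pub fn tick(&mut self, dt: Duration) {
        self.elapsed = self.elapsed.saturating_add(dt).min(self.recovery);
    }

    /// A new drawdown during recovery resets the ramp to the start.
    pub fn on_drawdown(&mut self) {
        self.elapsed = Duration::ZERO;
    }

    /// Fraction of the normal position limit currently allowed.
    pub fn limit_fraction(&self) -> f64 {
        if self.recovery.is_zero() {
            return 1.0;
        }
        let progress = self.elapsed.as_secs_f64() / self.recovery.as_secs_f64();
        self.min_fraction + (1.0 - self.min_fraction) * progress
    }
}
```

Multiplying the configured position limit by `limit_fraction()` gives the limit in force during recovery.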
Safety Auditor
SafetyAuditor in safety/auditor.rs — periodic reconciliation:
- Order cleanup (expired fill windows)
- Stale pending detection (orders not on exchange)
- Stuck cancel detection (cancel requests that didn't execute)
- Orphan reconciliation (exchange orders not in local tracking)
- Reduce-only reporting
Quote Engine Integration
Risk assessments flow into the quote engine via AggregatedRisk:
spread_factor = max(1.0, risk.spread_multiplier)
effective_spread = base_spread * spread_factor
if risk.action == PullQuotes { return no_quotes; }
if risk.action == KillSwitch { cancel_all_and_halt(); }
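The pseudocode above could be written in Rust roughly as follows. The `QuoteDecision` enum and `apply_risk` function are hypothetical names, and the stand-in types carry only the fields this step needs.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RiskAction { NoAction, WidenSpreads, PullQuotes, KillSwitch }

pub struct AggregatedRisk {
    pub action: RiskAction,
    pub spread_multiplier: f64,
}

/// Hypothetical result handed to the rest of the quote engine.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum QuoteDecision {
    Quote { spread: f64 },
    NoQuotes,
    Halt,
}

/// Gate one quote cycle on the aggregated risk.
pub fn apply_risk(base_spread: f64, risk: &AggregatedRisk) -> QuoteDecision {
    match risk.action {
        RiskAction::KillSwitch => QuoteDecision::Halt,
        RiskAction::PullQuotes => QuoteDecision::NoQuotes,
        RiskAction::NoAction | RiskAction::WidenSpreads => {
            // Risk may only widen: the factor is floored at 1.0. f64::max
            // also returns 1.0 if the multiplier is NaN.
            let spread_factor = risk.spread_multiplier.max(1.0);
            QuoteDecision::Quote { spread: base_spread * spread_factor }
        }
    }
}
```

Returning a decision value instead of cancelling orders inside the function keeps the gate free of side effects; the caller is then responsible for acting on `Halt`.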
Key Invariants
These must ALWAYS hold — violations are bugs:
- inventory.abs() <= max_inventory
- ask_price > bid_price
- spread >= min_spread_bps (never tighter than fee + minimum edge)
- Kill switch state persists across restarts
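The first three invariants lend themselves to a cheap runtime check. The sketch below assumes the spread is measured in basis points relative to the mid price; that convention and the `check_invariants` name are assumptions, and kill switch persistence is not covered here.

```rust
/// Return Ok(()) if the numeric invariants hold, or a message naming the
/// first violated one. Comparisons are negated so that NaN inputs count
/// as violations instead of slipping through.
pub fn check_invariants(
    inventory: f64,
    max_inventory: f64,
    bid_price: f64,
    ask_price: f64,
    min_spread_bps: f64,
) -> Result<(), String> {
    if !(inventory.abs() <= max_inventory) {
        return Err(format!("inventory {} exceeds max {}", inventory, max_inventory));
    }
    if !(ask_price > bid_price) {
        return Err(format!("crossed quotes: bid {} ask {}", bid_price, ask_price));
    }
    let mid = 0.5 * (bid_price + ask_price);
    let spread_bps = (ask_price - bid_price) / mid * 10_000.0;
    if !(spread_bps >= min_spread_bps) {
        return Err(format!("spread {:.2} bps below minimum {}", spread_bps, min_spread_bps));
    }
    Ok(())
}
```

A check like this could run in debug builds or tests just before quotes are sent, so a violation surfaces as a failure with a clear message.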
Common Debugging
"Why did it stop quoting?"
- Check DashboardState for active risk triggers
- Look at RiskAggregator: which monitor returned the highest severity?
- Check kill switch (requires manual reset)
- Verify data freshness — stale data gates quotes
"Why are spreads so wide?"
- Check spread_multiplier in AggregatedRisk
- Cascade monitor — OI drop detected?
- Drawdown state — in recovery mode?
- Regime detection — high-vol regime naturally widens
"Position limit breach"
- Check PositionGuard: hard limit hit?
- Verify reduce-only mode active
- Check for orphan fills
- Run SafetyAuditor for full reconciliation
Adding a New Risk Monitor
- Implement RiskMonitor trait in risk/monitors/
- Add to monitor list in RiskAggregator
- Use RiskState fields only; don't add new data sources to the hot path
- Default to RiskSeverity::Normal when uncertain
- Include clear reason strings for debugging
- Add tests covering all severity transitions
Supporting Files
| File | Description |
|---|---|
| references/monitor-template.md | Complete template for implementing a new RiskMonitor: trait skeleton, registration in RiskAggregator, testing requirements for all severity transitions, common patterns (threshold escalation, hysteresis, cooldown), and implementation checklist |
Known Issues from Production
- Feb 9: Drawdown denominator bug. summary() divided by peak_pnl, which is tiny after one fill (e.g. 0.002 USD), so it reported 1000%+ drawdown. Fixed to use account_value as the denominator with a min_peak_for_drawdown guard. See risk/state.rs:253-262.
- Feb 12: Kill switch checkpoint persistence. Kill switch triggered=true persists across restarts. After a kill switch event, you must manually set kill_switch.triggered=false in the checkpoint or use the --max-position-usd override. Without this, the system won't restart.
- Feb 12: Emergency clearing death spiral — Emergency position clearing at 32% of max (threshold was 0.01 = 1%) combined with binary side-clearing created a feedback loop: clear → no bids → position can't reduce → re-trigger → permanent paralysis. Fixed with graduated widening (PositionZone: Green/Yellow/Red/Kill) instead of binary clear.
- Feb 18: Emergency threshold regime-wiring. max_position_fraction now varies by regime (Calm=0.90, Normal=0.80, Volatile=0.70, Extreme=0.50). Moderate positions route through sigma spike (2x widening) rather than emergency clearing.
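To illustrate the Feb 9 drawdown fix, here is a sketch of a guarded drawdown calculation. The function name, parameter list, and the exact semantics of the guard (report zero until the peak is meaningful) are assumptions; the actual code is in risk/state.rs.

```rust
/// Drawdown as a fraction of account value. Returns 0.0 while the peak PnL
/// is below `min_peak_for_drawdown`, so a tiny peak after a single fill
/// cannot produce an absurd percentage.
pub fn drawdown_fraction(
    peak_pnl: f64,
    current_pnl: f64,
    account_value: f64,
    min_peak_for_drawdown: f64,
) -> f64 {
    if peak_pnl < min_peak_for_drawdown || account_value <= 0.0 {
        return 0.0;
    }
    ((peak_pnl - current_pnl) / account_value).max(0.0)
}
```

With a peak of 0.002 USD and a guard of 1 USD the function reports no drawdown, whereas dividing by the peak, as the buggy version did, would report roughly 1100% for a 0.022 USD swing.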