| name | ssmd-health-metrics |
| description | Catalog of all ssmd health metrics: sources, query methods, labels, healthy/unhealthy thresholds. Covers data pipeline Prometheus metrics, HTTP endpoints, infrastructure components, DQ scores, and composite health scoring. Use when checking environment health, interpreting metrics, setting alert thresholds, or building health dashboards. |
ssmd-health-metrics
Data Pipeline Metrics (Prometheus)
Connectors and archivers expose Prometheus metrics on :8080/metrics, scraped by Cloud Monitoring.
Connector Metrics
| Metric | Type | Labels | Healthy | Unhealthy |
|---|---|---|---|---|
| ssmd_connector_websocket_connected | Gauge 0/1 | feed, shard | 1 (all shards) | 0 for >2min |
| ssmd_connector_messages_total | Counter | feed, type | rate > 0 | rate = 0 for >5min |
| ssmd_connector_idle_seconds | Gauge | feed | < 300s | > 300s |
| ssmd_connector_markets_subscribed | Gauge | feed, shard | > 0 | 0 |
| ssmd_connector_shards_total | Gauge | feed | matches config | mismatch |
Archiver Metrics
| Metric | Type | Labels | Healthy | Unhealthy |
|---|---|---|---|---|
| ssmd_archiver_messages_total | Counter | feed, stream | rate > 0 | rate = 0 for >5min |
| ssmd_archiver_gaps_total | Counter | feed, stream | rate = 0 | rate > 0 |
| ssmd_archiver_validation_failures_total | Counter | feed, stream | rate = 0 | rate > 0 |
| ssmd_archiver_parse_failures_total | Counter | feed, stream | rate = 0 | rate > 0 |
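The counter thresholds above are stated as rates, so two samples of the same counter are needed to evaluate them by hand. A minimal sketch, assuming two readings taken a known number of seconds apart; the function name counter_rate is hypothetical, and a decrease is treated as a counter reset (the process restarted and began counting from zero):

```shell
# Per-second rate between two counter samples.
# $1 = earlier value, $2 = later value, $3 = seconds between the samples.
counter_rate() {
  awk -v prev="$1" -v cur="$2" -v secs="$3" 'BEGIN {
    if (secs <= 0) { exit 2 }          # refuse a zero or negative window
    delta = cur - prev
    if (delta < 0) delta = cur         # counter reset: count from zero
    printf "%.3f\n", delta / secs
  }'
}

# Example: two readings of ssmd_archiver_gaps_total taken 300 s apart.
# A non-zero result would be unhealthy for that metric.
# counter_rate 17 17 300
```

For messages_total a result of 0.000 over a five-minute window is the unhealthy case; for the gap and failure counters any non-zero result is.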
Querying via Cloud Monitoring (gcloud CLI)
gcloud monitoring time-series list \
--project=massive-acrobat-227416 \
--filter='metric.type="prometheus.googleapis.com/ssmd_connector_websocket_connected/gauge"' \
--interval-start-time=$(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ)
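The -v-1H flag in the command above is BSD date syntax (macOS); GNU date on Linux rejects it and expects -d instead. A minimal sketch of a helper that tries the GNU form first and falls back to the BSD form; the function name one_hour_ago_utc is hypothetical:

```shell
# Print the UTC timestamp for one hour ago in the format the
# --interval-start-time flag above is given.
one_hour_ago_utc() {
  # GNU date (Linux) understands -d; BSD date (macOS) understands -v.
  date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null ||
    date -u -v-1H +%Y-%m-%dT%H:%M:%SZ
}

# Usage inside the gcloud command:
#   --interval-start-time="$(one_hour_ago_utc)"
```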
Querying via port-forward
Rust containers have no wget/curl. Use port-forward + local curl:
kubectl port-forward -n ssmd deploy/kalshi-crypto-connector 8080:8080 &
sleep 2 && curl -s http://localhost:8080/metrics | grep -E 'websocket_connected|idle_seconds|messages_total' && kill %1
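Instead of reading the grep output by eye, the scrape can be piped through a small filter that applies the connector thresholds from the table above. This is a sketch under assumptions: the function name is hypothetical, label values are assumed to contain no spaces, and a single scrape cannot see the ">2min" or ">5min" duration windows, so it flags any snapshot that is currently outside the healthy range.

```shell
# Reads Prometheus text-format metrics on stdin, prints one line per
# violated connector threshold, and exits 1 if any threshold is violated.
check_connector_metrics() {
  awk '
    /^#/ { next }                                   # skip HELP/TYPE comments
    $1 ~ /^ssmd_connector_websocket_connected/ && $NF + 0 != 1 {
      print "UNHEALTHY websocket disconnected: " $1; bad = 1
    }
    $1 ~ /^ssmd_connector_idle_seconds/ && $NF + 0 > 300 {
      print "UNHEALTHY idle for " $NF "s: " $1; bad = 1
    }
    $1 ~ /^ssmd_connector_markets_subscribed/ && $NF + 0 == 0 {
      print "UNHEALTHY no markets subscribed: " $1; bad = 1
    }
    END { exit bad }
  '
}

# Usage, with the port-forward from above still running:
#   curl -s http://localhost:8080/metrics | check_connector_metrics
```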
HTTP Health Endpoints
| Component | Deployment | Port | Health Path | Healthy | Unhealthy |
|---|---|---|---|---|---|
| Connector | kalshi-crypto-connector | 8080 | /health | 200 + connected:true | 503 or stale |
| Connector | kraken-futures-connector | 8080 | /health | 200 + connected:true | 503 or stale |
| Connector | polymarket-connector | 8080 | /health | 200 + connected:true | 503 or stale |
| Archiver | kalshi-crypto-archiver | 8080 | /health | 200 | 503 |
| Archiver | kraken-futures-archiver | 8080 | /health | 200 | 503 |
| Archiver | polymarket-archiver | 8080 | /health | 200 | 503 |
| data-ts | ssmd-data-ts | 8080 | /health | {"status":"ok"} | error |
| data-ts (API) | ssmd-data-ts | 8080 | /v1/markets/lookup | 200 + JSON | 401/500 |
Connectors and archivers require kubectl port-forward. data-ts is also accessible via LoadBalancer from allowed CIDRs (home IPs), so no port-forward is needed: curl -s http://<LB-IP>:8080/health. The /v1/markets/lookup endpoint requires Authorization: Bearer <api-key> with the datasets:read scope and serves as an end-to-end health probe (API + DB).
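The checks in the table can be scripted roughly as follows. This is a sketch, not the project's tooling: both function names are hypothetical, the connector body is assumed to be JSON containing a "connected": true field, the "stale" case is not detected, and any query parameters that /v1/markets/lookup may need are not shown in this document and are omitted here.

```shell
# Classify a connector /health response from its status code and body.
# $1 = HTTP status code, $2 = response body.
connector_health() {
  status=$1
  body=$2
  if [ "$status" = "200" ] &&
     printf '%s' "$body" | grep -Eq '"connected"[[:space:]]*:[[:space:]]*true'; then
    echo healthy
  else
    echo unhealthy
  fi
}

# End-to-end probe of data-ts (API + DB).
# $1 = host:port of the LoadBalancer, $2 = API key with datasets:read scope.
probe_markets_lookup() {
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer $2" "http://$1/v1/markets/lookup")
  [ "$code" = "200" ]
}

# Usage:
#   connector_health 200 "$(curl -s http://localhost:8080/health)"
#   probe_markets_lookup "<LB-IP>:8080" "<api-key>" && echo "data-ts ok"
```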
Infrastructure Components
| Component | Resource | Health Check | Healthy | Unhealthy |
|---|---|---|---|---|
| NATS | StatefulSet nats-0 (ns: nats) | kubectl exec -n nats deploy/nats-box -- nats server check jetstream | OK | error/timeout |
| NATS streams | - | kubectl exec -n nats deploy/nats-box -- nats stream ls | msgs increasing, last msg recent | stale |
| Postgres | StatefulSet ssmd-postgres-0 | data-ts /health endpoint | {"status":"ok"} | error |
| Redis | Deployment ssmd-redis | kubectl exec -n ssmd deploy/ssmd-redis -- redis-cli ping | PONG | error/timeout |
| Operator | Deployment ssmd-operator | kubectl get deploy ssmd-operator -n ssmd | available = desired | 0 available |
| CDC | Deployment ssmd-cdc | kubectl get deploy ssmd-cdc -n ssmd | 1/1 Running | restart/crash |
DQ Scores
Source: dq_daily_scores table. See ssmd-dq-checks skill for full 13-check catalog.
| Grade | Score | Meaning |
|---|---|---|
| GREEN | >= 98 | Pipeline healthy |
| YELLOW | >= 85 and < 98 | Minor issues |
| RED | < 85 | Investigate promptly |
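The grade boundaries in the table translate directly into a small helper. A minimal sketch, assuming the score is a number on a 0 to 100 scale; the function name dq_grade is hypothetical and this is not the code that populates dq_daily_scores:

```shell
# Map a DQ score to its grade using the thresholds from the table.
dq_grade() {
  awk -v s="$1" 'BEGIN {
    if (s >= 98)      print "GREEN"
    else if (s >= 85) print "YELLOW"
    else              print "RED"
  }'
}

# Usage:
#   dq_grade 97.5
```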
Composite Health Score (CLI)
Source: ssmd health daily command.
Weights: 20% Kalshi + 15% Kraken + 10% Polymarket + 15% Funding + 5% Archive + 15% Completeness + 10% Parquet + 10% SLA
Hard RED overrides: stream no data, connector score 0, funding >1h old, no archiver sync in 24h.
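The weighting can be illustrated as a plain weighted sum. This is a sketch under stated assumptions, not the implementation behind ssmd health daily: each component is assumed to be scored 0 to 100, the hard RED override is modelled as a single flag that short-circuits the calculation, and the function name composite_score is hypothetical.

```shell
# $1 = 1 if any hard RED override applies, otherwise 0.
# $2..$9 = component scores in the order:
#   kalshi kraken polymarket funding archive completeness parquet sla
composite_score() {
  hard_red=$1
  shift
  if [ "$#" -ne 8 ]; then
    echo "need 8 component scores" >&2
    return 2
  fi
  if [ "$hard_red" = "1" ]; then
    echo "RED"                       # override wins regardless of the scores
    return 0
  fi
  printf '%s\n' "$@" | awk '
    BEGIN { split("20 15 10 15 5 15 10 10", w, " ") }   # weights in percent
    { total += $1 * w[NR] / 100 }
    END { printf "%.1f\n", total }
  '
}

# Usage:
#   composite_score 0 90 80 70 60 50 40 30 20
```

The weights sum to 100%, so a run in which every component scores 100 yields 100.0.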
Cloud Monitoring Alerts
Defined in terraform/gke-prod/monitoring.tf:
| Alert | Condition | Severity |
|---|---|---|
| WebSocket disconnected | websocket_connected < 1 for > 2min | Critical |
| Message rate zero | messages_total rate = 0 for > 5min | Warning |
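Both conditions only fire when the bad state is sustained for longer than the stated window. The sketch below illustrates that semantics on a list of samples; it is not how Cloud Monitoring evaluates the policy, and the function name and the "epoch value" input format are assumptions made for the example.

```shell
# Reads "epoch_seconds value" lines (oldest first) on stdin.
# $1 = threshold, $2 = minimum duration in seconds.
# Prints FIRING if the most recent run of values below the threshold
# spans more than the minimum duration, otherwise OK.
sustained_below() {
  awk -v thr="$1" -v dur="$2" '
    $2 + 0 < thr { if (start == "") start = $1; last = $1; next }
    { start = "" }                   # a healthy sample ends the run
    END {
      if (start != "" && last - start > dur) print "FIRING"
      else print "OK"
    }
  '
}

# Usage for the WebSocket alert (value < 1 for more than 120 s):
#   sustained_below 1 120 < samples.txt
```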
Known Metric Quirks
- Ghost shards: After connector restarts, old shard metrics persist. DQ connection_uptime takes min across shards, so ghost shards drag scores down.
- Polymarket coverage gaps: Activity concentrates in US hours. Coverage < 100% on 15-min slot checks may be normal.
- u64 underflow in parquet-gen: parse_batch_dropped for fanout types can show values > 2^63. DQ treats as 0. Upstream fix pending.
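When reading parse_batch_dropped by hand, the same "treat as 0" rule can be applied before doing any arithmetic on the value. A minimal sketch, assuming the value is a non-negative decimal integer without leading zeros; the function name is hypothetical. The comparison is done on strings because values of this size overflow signed 64-bit shell arithmetic and lose precision as awk numbers.

```shell
# Print 0 for values above 2^63 (underflow artifacts), else the value itself.
sanitize_dropped() {
  v=$1
  limit=9223372036854775808          # 2^63
  if [ "${#v}" -gt "${#limit}" ] ||
     { [ "${#v}" -eq "${#limit}" ] &&
       awk -v v="$v" -v l="$limit" 'BEGIN { exit !((v "") > (l "")) }'; }; then
    echo 0                           # longer, or same length and larger
  else
    echo "$v"
  fi
}

# Usage:
#   sanitize_dropped 18446744073709551615
```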