ssmd-health-metrics

Updated February 19, 2026 at 17:17

Catalog of all ssmd health metrics — sources, query methods, labels, healthy/unhealthy thresholds. Covers data pipeline Prometheus metrics, HTTP endpoints, infrastructure components, DQ scores, and composite health scoring. Use when checking environment health, interpreting metrics, setting alert thresholds, or building health dashboards.

Source

aaronwald

aaronwald/dlawskillz

SKILL.md

name	ssmd-health-metrics
description	Catalog of all ssmd health metrics — sources, query methods, labels, healthy/unhealthy thresholds. Covers data pipeline Prometheus metrics, HTTP endpoints, infrastructure components, DQ scores, and composite health scoring. Use when checking environment health, interpreting metrics, setting alert thresholds, or building health dashboards.

ssmd-health-metrics

Data Pipeline Metrics (Prometheus)

Connectors and archivers expose Prometheus metrics on :8080/metrics, scraped by Cloud Monitoring.

Connector Metrics

Metric	Type	Labels	Healthy	Unhealthy
`ssmd_connector_websocket_connected`	Gauge 0/1	`feed`, `shard`	1 (all shards)	0 for >2min
`ssmd_connector_messages_total`	Counter	`feed`, `type`	rate > 0	rate = 0 for >5min
`ssmd_connector_idle_seconds`	Gauge	`feed`	< 300s	> 300s
`ssmd_connector_markets_subscribed`	Gauge	`feed`, `shard`	> 0	0
`ssmd_connector_shards_total`	Gauge	`feed`	matches config	mismatch

Archiver Metrics

Metric	Type	Labels	Healthy	Unhealthy
`ssmd_archiver_messages_total`	Counter	`feed`, `stream`	rate > 0	rate = 0 for >5min
`ssmd_archiver_gaps_total`	Counter	`feed`, `stream`	rate = 0	rate > 0
`ssmd_archiver_validation_failures_total`	Counter	`feed`, `stream`	rate = 0	rate > 0
`ssmd_archiver_parse_failures_total`	Counter	`feed`, `stream`	rate = 0	rate > 0
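
The counter thresholds above are stated as rates, so judging them by hand takes two scrapes of the same counter. A minimal sketch of that arithmetic follows; `counter_rate` is a hypothetical helper name (not part of ssmd), and treating a decrease as a counter reset is an assumption about how restarts show up.

```shell
# counter_rate PREV CURR SECONDS
# Prints the per-second rate between two scrapes of one counter.
# Hypothetical helper for illustration; not part of the ssmd tooling.
counter_rate() {
  awk -v prev="$1" -v curr="$2" -v secs="$3" 'BEGIN {
    if (secs + 0 <= 0) { print "interval must be positive" > "/dev/stderr"; exit 2 }
    delta = curr - prev
    if (delta < 0) delta = curr + 0   # counter reset (for example a pod restart)
    printf "%.3f\n", delta / secs
  }'
}

counter_rate 1000 1600 300   # prints 2.000 (messages are flowing)
counter_rate 500 500 300     # prints 0.000 (no growth over 5 minutes)
```

For `*_messages_total` a result of 0.000 over a 300 second window matches the unhealthy column; for the gap and failure counters any non-zero result does.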

Querying via Cloud Monitoring (gcloud CLI)

# The -v-1H flag is BSD/macOS date syntax; on GNU/Linux use: date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ
gcloud monitoring time-series list \
  --project=massive-acrobat-227416 \
  --filter='metric.type="prometheus.googleapis.com/ssmd_connector_websocket_connected/gauge"' \
  --interval-start-time=$(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ)

Querying via port-forward

The Rust containers ship without wget or curl, so forward the port and run curl locally:

kubectl port-forward -n ssmd deploy/kalshi-crypto-connector 8080:8080 &
# ';' (not '&&') before kill, so the port-forward is stopped even if grep finds no match
sleep 2 && curl -s http://localhost:8080/metrics | grep -E 'websocket_connected|idle_seconds|messages_total'; kill %1
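
Once the `/metrics` text is available locally, the per-shard connection check can be scripted. The sketch below is one way to do it, not the project's own tooling: `disconnected_shards` is a hypothetical name, the sample label values are made up, and it assumes the text exposition lines carry no trailing timestamp and no spaces inside label values.

```shell
# Reads Prometheus text-format metrics on stdin and prints every
# ssmd_connector_websocket_connected series whose value is 0.
# Exit status: 0 if every listed shard is connected, 1 otherwise.
# Hypothetical helper; assumes "name{labels} value" lines without timestamps.
disconnected_shards() {
  awk '/^ssmd_connector_websocket_connected[{ ]/ && $NF + 0 == 0 { print $1; bad = 1 }
       END { exit bad + 0 }'
}

# Demo on made-up sample output (feed and shard label values are illustrative).
disconnected_shards <<'EOF' || echo "at least one shard is disconnected"
# TYPE ssmd_connector_websocket_connected gauge
ssmd_connector_websocket_connected{feed="kalshi",shard="0"} 1
ssmd_connector_websocket_connected{feed="kalshi",shard="1"} 0
EOF
```

With a port-forward running, the same function can be fed from `curl -s http://localhost:8080/metrics | disconnected_shards`.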

HTTP Health Endpoints

Component	Deployment	Port	Health Path	Healthy	Unhealthy
Connector	`kalshi-crypto-connector`	8080	`/health`	200 + `connected:true`	503 or `stale`
Connector	`kraken-futures-connector`	8080	`/health`	200 + `connected:true`	503 or `stale`
Connector	`polymarket-connector`	8080	`/health`	200 + `connected:true`	503 or `stale`
Archiver	`kalshi-crypto-archiver`	8080	`/health`	200	503
Archiver	`kraken-futures-archiver`	8080	`/health`	200	503
Archiver	`polymarket-archiver`	8080	`/health`	200	503
data-ts	`ssmd-data-ts`	8080	`/health`	`{"status":"ok"}`	error
data-ts (API)	`ssmd-data-ts`	8080	`/v1/markets/lookup`	200 + JSON	401/500

Connectors and archivers require kubectl port-forward. data-ts is also exposed via a LoadBalancer to allowed CIDRs (home IPs), so no port-forward is needed there: curl -s http://<LB-IP>:8080/health. The /v1/markets/lookup endpoint requires an Authorization: Bearer <api-key> header with the datasets:read scope. Because it exercises both the API and the DB, it serves as an end-to-end health probe.
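
The end-to-end probe can be wrapped so that the HTTP status is turned into a verdict. This is a sketch under assumptions: `classify_lookup_status` is a hypothetical helper, `LB_IP` and `API_KEY` are placeholders, and the endpoint may need query parameters that are not documented here. It checks the status code only, not the JSON body.

```shell
# Map the HTTP status of the /v1/markets/lookup probe to a verdict.
# Hypothetical helper; the 200 / 401 / 500 mapping follows the endpoint table.
classify_lookup_status() {
  case "$1" in
    200) echo "healthy" ;;
    401) echo "unhealthy-auth" ;;      # key missing, invalid, or lacking datasets:read
    500) echo "unhealthy-server" ;;    # API or DB failure
    000) echo "unreachable" ;;         # curl reports 000 when it got no HTTP response
    *)   echo "unexpected-$1" ;;
  esac
}

# Usage (not run here; LB_IP and API_KEY are placeholders you must supply):
# code=$(curl -s -o /dev/null -w '%{http_code}' \
#   -H "Authorization: Bearer $API_KEY" "http://$LB_IP:8080/v1/markets/lookup")
# classify_lookup_status "$code"
```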

Infrastructure Components

Component	Resource	Health Check	Healthy	Unhealthy
NATS	StatefulSet `nats-0` (ns: nats)	`kubectl exec -n nats deploy/nats-box -- nats server check jetstream`	OK	error/timeout
NATS streams	—	`kubectl exec -n nats deploy/nats-box -- nats stream ls`	msgs increasing, last msg recent	stale
Postgres	StatefulSet `ssmd-postgres-0`	data-ts `/health` endpoint	`{"status":"ok"}`	error
Redis	Deployment `ssmd-redis`	`kubectl exec -n ssmd deploy/ssmd-redis -- redis-cli ping`	PONG	error/timeout
Operator	Deployment `ssmd-operator`	`kubectl get deploy ssmd-operator -n ssmd`	available = desired	0 available
CDC	Deployment `ssmd-cdc`	`kubectl get deploy ssmd-cdc -n ssmd`	1/1 Running	restart/crash

DQ Scores

Source: dq_daily_scores table. See the ssmd-dq-checks skill for the full 13-check catalog.

Grade	Score	Meaning
GREEN	>= 98	Pipeline healthy
YELLOW	>= 85 and < 98	Minor issues
RED	< 85	Investigate promptly
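
The grade bands translate directly into a small lookup. In this sketch `dq_grade` is a hypothetical helper name, and the score is assumed to be a number on a 0 to 100 scale.

```shell
# dq_grade SCORE
# Maps a DQ score (decimals allowed) to the grade bands in the table above.
# Hypothetical helper; not part of the ssmd CLI.
dq_grade() {
  awk -v s="$1" 'BEGIN {
    if (s + 0 >= 98)      print "GREEN"
    else if (s + 0 >= 85) print "YELLOW"
    else                  print "RED"
  }'
}

dq_grade 99.2   # prints GREEN
dq_grade 85     # prints YELLOW
dq_grade 84.9   # prints RED
```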

Composite Health Score (CLI)

Source: ssmd health daily command.

Weights: 20% Kalshi + 15% Kraken + 10% Polymarket + 15% Funding + 5% Archive + 15% Completeness + 10% Parquet + 10% SLA

Hard RED overrides: stream no data, connector score 0, funding >1h old, no archiver sync in 24h.
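
As a worked example of the weighting, the sketch below computes the weighted sum by hand. It is not the `ssmd health daily` implementation: `composite_score` is a hypothetical helper, each component is assumed to be scored on a 0 to 100 scale, the argument order simply follows the weights line, and the hard RED overrides are not modeled.

```shell
# composite_score KALSHI KRAKEN POLYMARKET FUNDING ARCHIVE COMPLETENESS PARQUET SLA
# Prints the weighted sum of eight component scores (assumed 0-100 each).
# Hypothetical helper; hard RED overrides are deliberately left out.
composite_score() {
  awk -v kalshi="$1" -v kraken="$2" -v poly="$3" -v funding="$4" \
      -v archive="$5" -v complete="$6" -v parquet="$7" -v sla="$8" 'BEGIN {
    total  = 0.20 * kalshi  + 0.15 * kraken   + 0.10 * poly    + 0.15 * funding
    total += 0.05 * archive + 0.15 * complete + 0.10 * parquet + 0.10 * sla
    printf "%.1f\n", total
  }'
}

# 20 + 13.5 + 8 + 15 + 5 + 13.5 + 10 + 10
composite_score 100 90 80 100 100 90 100 100   # prints 95.0
```

The weights sum to 100%, so eight perfect components give 100.0, and a single component at 0 lowers the total by exactly its weight.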

Cloud Monitoring Alerts

Defined in terraform/gke-prod/monitoring.tf:

Alert	Condition	Severity
WebSocket disconnected	`websocket_connected < 1` for > 2min	Critical
Message rate zero	`messages_total` rate = 0 for > 5min	Warning

Known Metric Quirks

Ghost shards: After connector restarts, old shard metrics persist. DQ connection_uptime takes min across shards, so ghost shards drag scores down.
Polymarket coverage gaps: Activity concentrates in US hours. Coverage < 100% on 15-min slot checks may be normal.
u64 underflow in parquet-gen: parse_batch_dropped for fanout types can show values > 2^63. DQ treats these values as 0. An upstream fix is pending.
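
To see why a ghost shard hurts, consider how a minimum behaves. The sketch below is illustrative only: `min_uptime` is a hypothetical helper, the per-shard uptime percentages are made up, and the real connection_uptime check may compute its inputs differently.

```shell
# Reads "shard value" pairs on stdin and prints the smallest value.
# Hypothetical helper mirroring the min-across-shards aggregation described above.
min_uptime() {
  awk 'NF >= 2 { v = $2 + 0; if (!seen || v < min) min = v; seen = 1 }
       END { if (seen) print min }'
}

# Two live shards only (made-up values).
printf '0 99.9\n1 99.8\n' | min_uptime        # prints 99.8

# Same shards plus a stale series left behind by a restarted connector.
printf '0 99.9\n1 99.8\n2 0\n' | min_uptime   # prints 0
```

One stale series is enough to pull the minimum to its value, regardless of how healthy the live shards are.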
