---
name: observability-patterns
description: "Observability: structured logs, metrics (RED/USE), tracing, SLO/SLI. Triggers: logging, metrics, Prometheus, Grafana, OpenTelemetry, trace, monitoring."
effort: medium
user-invocable: false
allowed-tools: Read
---
# Observability Patterns

## Structured Logging

### Python (structlog)
```python
import structlog

logger = structlog.get_logger()

# Structured events: a snake_case event name plus key-value fields
logger.info("user_created", user_id=user.id, source="api")
logger.error("payment_failed", order_id=order.id, error=str(e), amount=amount)
```
### Node.js (Pino)

```javascript
import pino from "pino";

// pino-pretty is for local development; ship raw JSON in production
const logger = pino({ level: "info", transport: { target: "pino-pretty" } });

logger.info({ userId: user.id, action: "login" }, "User logged in");
logger.error({ err, orderId: order.id }, "Payment processing failed");
```
### Log Levels

| Level | Use For |
|---|---|
| error | Failures requiring attention |
| warn | Unexpected but handled situations |
| info | Business events, state transitions |
| debug | Development diagnostics |
### Rules
- Always use structured key-value pairs, not string interpolation
- Include correlation IDs for request tracing (see the sketch below)
- Never log sensitive data (passwords, tokens, PII)
- Log at boundaries: API entry/exit, external calls, state changes
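
One way to satisfy the correlation-ID rule: bind the ID once per request with structlog's contextvars support, so every subsequent log line carries it automatically. A minimal sketch, assuming an ASGI app and an `X-Request-ID` header (both assumptions, not from the original):

```python
import uuid

import structlog

# merge_contextvars must be in the processor chain for bound context to appear
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.JSONRenderer(),
    ]
)

async def correlation_middleware(request, call_next):
    # Reuse the caller's ID when present; mint one otherwise
    request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
    structlog.contextvars.clear_contextvars()
    structlog.contextvars.bind_contextvars(request_id=request_id)
    response = await call_next(request)
    response.headers["X-Request-ID"] = request_id
    return response
```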
## OpenTelemetry

### Python Setup
```python
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
# Metrics need a provider too (add a PeriodicExportingMetricReader to export them)
metrics.set_meter_provider(MeterProvider())

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

request_counter = meter.create_counter("http_requests_total", description="Total HTTP requests")
request_duration = meter.create_histogram("http_request_duration_seconds")

@tracer.start_as_current_span("process_order")
def process_order(order_id: str):
    request_counter.add(1, {"method": "POST", "endpoint": "/orders"})
    with tracer.start_as_current_span("validate_order"):
        validate(order_id)
    with tracer.start_as_current_span("charge_payment"):
        charge(order_id)
```
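
`request_duration` above is declared but never recorded. One way to close that loop, timing the call and marking the span on failure; the wrapper function and attribute values are assumptions, not part of the original:

```python
import time

from opentelemetry.trace import Status, StatusCode

def handle_order_request(order_id: str):
    start = time.perf_counter()
    with tracer.start_as_current_span("handle_order_request") as span:
        try:
            process_order(order_id)
        except Exception as exc:
            # Surface the failure on the trace, not just in logs
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR))
            raise
        finally:
            elapsed = time.perf_counter() - start
            request_duration.record(elapsed, {"method": "POST", "endpoint": "/orders"})
```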
### Node.js Setup

```javascript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: "http://otel-collector:4317" }),
  instrumentations: [getNodeAutoInstrumentations()],
});

// Start before application modules load, or auto-instrumentation misses them
sdk.start();
```
## Prometheus Metrics

### Metric Types

```python
from prometheus_client import Counter, Gauge, Histogram

http_requests_total = Counter("http_requests_total", "Total requests", ["method", "status", "path"])
request_duration = Histogram(
    "request_duration_seconds", "Request latency",
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],  # explicit buckets tuned to expected range
)
active_connections = Gauge("active_connections", "Current active connections")
```
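
A sketch of recording and exposing these with prometheus_client; `handle` is a hypothetical application function:

```python
import time

from prometheus_client import start_http_server

def instrumented(method: str, path: str):
    start = time.perf_counter()
    status = "200"
    try:
        handle(method, path)  # hypothetical handler
    except Exception:
        status = "500"
        raise
    finally:
        http_requests_total.labels(method=method, status=status, path=path).inc()
        request_duration.observe(time.perf_counter() - start)

start_http_server(8000)  # exposes /metrics for Prometheus to scrape
```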
### Key Metrics (RED Method)

| Metric | Type | Purpose |
|---|---|---|
| Request Rate | Counter | Traffic volume |
| Request Errors | Counter | Error rate |
| Request Duration | Histogram | Latency distribution |
### Key Metrics (USE Method - Infrastructure)

| Metric | Type | Purpose |
|---|---|---|
| Utilization | Gauge | % resource used |
| Saturation | Gauge | Queue depth |
| Errors | Counter | Error count |
## Health Check Endpoints

```python
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health")
async def health():
    checks = {
        "database": await check_db(),
        "redis": await check_redis(),
        "disk": check_disk_space(),
    }
    status = "healthy" if all(checks.values()) else "degraded"
    code = 200 if status == "healthy" else 503
    return JSONResponse({"status": status, "checks": checks}, status_code=code)

@app.get("/ready")
async def readiness():
    """Kubernetes readiness probe - can this instance serve traffic?"""
    return {"ready": True}

@app.get("/live")
async def liveness():
    """Kubernetes liveness probe - is the process alive?"""
    return {"alive": True}
```
## Error Tracking (Sentry)

```python
import sentry_sdk

sentry_sdk.init(
    dsn="https://key@sentry.io/project",
    traces_sample_rate=0.1,
    profiles_sample_rate=0.1,
    environment="production",
)

with sentry_sdk.push_scope() as scope:
    scope.set_tag("order_id", order.id)
    scope.set_context("payment", {"amount": amount, "currency": "USD"})
    sentry_sdk.capture_exception(e)
```
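
The no-PII rule applies to error trackers too. sentry_sdk accepts a `before_send` hook for scrubbing events before they leave the process; a minimal sketch (the key list is an assumption):

```python
SENSITIVE_KEYS = {"password", "token", "authorization", "email"}

def scrub_event(event, hint):
    # Redact sensitive keys from the extra context; extend to other
    # event sections (request, user) as needed.
    for key in list(event.get("extra", {})):
        if key.lower() in SENSITIVE_KEYS:
            event["extra"][key] = "[redacted]"
    return event

sentry_sdk.init(
    dsn="https://key@sentry.io/project",
    before_send=scrub_event,
)
```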
## SLI/SLO Definition

### Example SLOs

| Service | SLI | SLO | Window |
|---|---|---|---|
| API | Availability (2xx / total) | 99.9% | 30 days |
| API | Latency p99 | < 500ms | 30 days |
| Search | Result relevance | > 80% | 7 days |
| Ingest | Processing success rate | 99.5% | 30 days |
### Error Budget

```
Error Budget = 1 - SLO = 1 - 0.999 = 0.1%
Monthly budget = 30 days * 24h * 60min * 0.001 = 43.2 minutes
```
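
The same arithmetic as a worked sketch, extended to burn rate, which is how the budget becomes an alert signal (the observed error ratio is a made-up input):

```python
SLO = 0.999
WINDOW_MINUTES = 30 * 24 * 60  # 30-day window = 43,200 minutes

budget_ratio = 1 - SLO                          # 0.001
budget_minutes = WINDOW_MINUTES * budget_ratio  # 43.2 minutes of allowed failure

# Burn rate = observed error ratio / budget ratio.
# 1.0 means the budget runs out exactly at the end of the window;
# a sustained burn rate of 14.4 exhausts a 30-day budget in about 2 days.
observed_error_ratio = 0.005  # hypothetical measurement over the alert window
burn_rate = observed_error_ratio / budget_ratio  # 5.0
```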
## Alerting Rules

### Prometheus Alert Examples

```yaml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for 5 minutes"
      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m])) > 1
        for: 10m
        labels:
          severity: warning
```
### Alert Best Practices

- Alert on symptoms (high latency), not causes (high CPU)
- Include runbook links in annotations
- Set appropriate severity: page only for user-impacting issues
- Use a `for` duration to avoid flapping
## Rules

- MUST emit logs as structured JSON with a `trace_id` (or correlation ID) propagated across service boundaries — unstructured logs are unsearchable at scale
- MUST alert on symptoms (user-visible latency/errors), not causes (high CPU, queue depth) — causes change, symptoms are stable
- NEVER log PII, PHI, or credentials — even at debug level. Logs leak to aggregation systems, backups, and disk snapshots.
- NEVER use histograms without explicit buckets tuned to the expected range — default buckets either miss the p99 or waste cardinality
- CRITICAL: SLOs define alerts, not metrics. An SLO of "99.9% of requests under 200ms" produces one alert ("burn rate exceeded") — not six alerts on each constituent metric.
- MANDATORY: every alert includes a runbook link in its annotations. An alert without a runbook pages someone who then has to invent a response at 3am.
## Gotchas

- `trace_id` must propagate via headers (`traceparent` / `X-Correlation-ID`) AND be added to every log line AND span. Partial propagation produces broken traces — one missing middleware and the span disappears from the UI. See the propagation sketch after this list.
- The OpenTelemetry SDK defaults to `otlp/grpc` on port 4317. Many corporate networks block it; `otlp/http` on port 4318 (the `/v1/traces` path) works through proxies. Check connectivity before debugging app-level issues.
- High-cardinality labels (user IDs, request IDs) on Prometheus metrics explode the time-series database. Prefer exemplars or traces for per-request data; keep metric labels to bounded categorical values.
- `for: 5m` in Prometheus delays alerts by 5 minutes AND requires the condition to hold throughout — a 4m59s spike does not alert, even if it recurs. Use a shorter `for` on critical alerts with a higher severity threshold.
- Log-based metrics (counting log lines matching a regex) are expensive and brittle. A log format change silently breaks the metric; prefer direct instrumentation from the code.
- Sampled traces (probability = 0.1) miss rare errors by design. For error-path visibility, use "always-sample on error" (tail sampling or head-based with error flag).
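
A sketch of explicit context propagation on an outgoing HTTP call, using OpenTelemetry's propagate API with the requests library (the downstream URL is a placeholder):

```python
import requests
from opentelemetry.propagate import inject

def call_downstream(payload: dict):
    headers = {}
    inject(headers)  # writes the active span context as a traceparent header
    return requests.post("http://downstream-service/orders", json=payload, headers=headers)
```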
## When NOT to Load

- For implementing an alert rule or log aggregator — use the tool's docs (Prometheus, Grafana, Datadog); this skill is pattern-level
- For performance tuning (profiling, flame graphs) — use `/performance-profiling`
- For incident response mechanics during an outage — use `/workflow incident-response`
- For generic logging library choice in a new project — use `/app-builder`, which handles scaffolding
- For security auditing of log content (PII leaks) — use `/security-patterns` and `/hipaa-validate`