---
name: observability-patterns
description: "Observability: structured logs, metrics (RED/USE), tracing, SLO/SLI. Triggers: logging, metrics, Prometheus, Grafana, OpenTelemetry, trace, monitoring."
effort: medium
user-invocable: false
allowed-tools: Read
---
# Observability Patterns

## Structured Logging

### Python (structlog)
```python
import structlog

logger = structlog.get_logger()

# Structured events: a snake_case event name plus key-value fields
logger.info("user_created", user_id=user.id, source="api")
logger.error("payment_failed", order_id=order.id, error=str(e), amount=amount)
```
### Node.js (Pino)

```javascript
import pino from "pino";

// pino-pretty is for local development; ship raw JSON in production
const logger = pino({ level: "info", transport: { target: "pino-pretty" } });

logger.info({ userId: user.id, action: "login" }, "User logged in");
logger.error({ err, orderId: order.id }, "Payment processing failed");
```
### Log Levels

| Level | Use For |
|---|---|
| error | Failures requiring attention |
| warn | Unexpected but handled situations |
| info | Business events, state transitions |
| debug | Development diagnostics |
### Rules
- Always use structured key-value pairs, not string interpolation
- Include correlation IDs for request tracing (see the sketch below)
- Never log sensitive data (passwords, tokens, PII)
- Log at boundaries: API entry/exit, external calls, state changes
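
One way to satisfy the correlation-ID rule: bind the ID once per request with structlog's contextvars support, so every subsequent log line carries it automatically. A minimal sketch, assuming an ASGI app and an `X-Request-ID` header (both assumptions, not from the original):

```python
import uuid

import structlog

# merge_contextvars must be in the processor chain for bound context to appear
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.JSONRenderer(),
    ]
)

async def correlation_middleware(request, call_next):
    # Reuse the caller's ID when present; mint one otherwise
    request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
    structlog.contextvars.clear_contextvars()
    structlog.contextvars.bind_contextvars(request_id=request_id)
    response = await call_next(request)
    response.headers["X-Request-ID"] = request_id
    return response
```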
## OpenTelemetry

### Python Setup
```python
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
# Metrics need a provider too (add a PeriodicExportingMetricReader to export them)
metrics.set_meter_provider(MeterProvider())

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

request_counter = meter.create_counter("http_requests_total", description="Total HTTP requests")
request_duration = meter.create_histogram("http_request_duration_seconds")

@tracer.start_as_current_span("process_order")
def process_order(order_id: str):
    request_counter.add(1, {"method": "POST", "endpoint": "/orders"})
    with tracer.start_as_current_span("validate_order"):
        validate(order_id)
    with tracer.start_as_current_span("charge_payment"):
        charge(order_id)
```
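
`request_duration` above is declared but never recorded. One way to close that loop, timing the call and marking the span on failure; the wrapper function and attribute values are assumptions, not part of the original:

```python
import time

from opentelemetry.trace import Status, StatusCode

def handle_order_request(order_id: str):
    start = time.perf_counter()
    with tracer.start_as_current_span("handle_order_request") as span:
        try:
            process_order(order_id)
        except Exception as exc:
            # Surface the failure on the trace, not just in logs
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR))
            raise
        finally:
            elapsed = time.perf_counter() - start
            request_duration.record(elapsed, {"method": "POST", "endpoint": "/orders"})
```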
### Node.js Setup

```javascript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: "http://otel-collector:4317" }),
  instrumentations: [getNodeAutoInstrumentations()],
});

// Start before application modules load, or auto-instrumentation misses them
sdk.start();
```
## Prometheus Metrics

### Metric Types

```python
from prometheus_client import Counter, Gauge, Histogram

http_requests_total = Counter("http_requests_total", "Total requests", ["method", "status", "path"])
request_duration = Histogram(
    "request_duration_seconds", "Request latency",
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],  # explicit buckets tuned to expected range
)
active_connections = Gauge("active_connections", "Current active connections")
```
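
A sketch of recording and exposing these with prometheus_client; `handle` is a hypothetical application function:

```python
import time

from prometheus_client import start_http_server

def instrumented(method: str, path: str):
    start = time.perf_counter()
    status = "200"
    try:
        handle(method, path)  # hypothetical handler
    except Exception:
        status = "500"
        raise
    finally:
        http_requests_total.labels(method=method, status=status, path=path).inc()
        request_duration.observe(time.perf_counter() - start)

start_http_server(8000)  # exposes /metrics for Prometheus to scrape
```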
### Key Metrics (RED Method)

| Metric | Type | Purpose |
|---|---|---|
| Request Rate | Counter | Traffic volume |
| Request Errors | Counter | Error rate |
| Request Duration | Histogram | Latency distribution |
### Key Metrics (USE Method - Infrastructure)

| Metric | Type | Purpose |
|---|---|---|
| Utilization | Gauge | % resource used |
| Saturation | Gauge | Queue depth |
| Errors | Counter | Error count |
## Health Check Endpoints

```python
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health")
async def health():
    checks = {
        "database": await check_db(),
        "redis": await check_redis(),
        "disk": check_disk_space(),
    }
    status = "healthy" if all(checks.values()) else "degraded"
    code = 200 if status == "healthy" else 503
    return JSONResponse({"status": status, "checks": checks}, status_code=code)

@app.get("/ready")
async def readiness():
    """Kubernetes readiness probe - can this instance serve traffic?"""
    return {"ready": True}

@app.get("/live")
async def liveness():
    """Kubernetes liveness probe - is the process alive?"""
    return {"alive": True}
```
## Error Tracking (Sentry)

```python
import sentry_sdk

sentry_sdk.init(
    dsn="https://key@sentry.io/project",
    traces_sample_rate=0.1,
    profiles_sample_rate=0.1,
    environment="production",
)

with sentry_sdk.push_scope() as scope:
    scope.set_tag("order_id", order.id)
    scope.set_context("payment", {"amount": amount, "currency": "USD"})
    sentry_sdk.capture_exception(e)
```
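
The no-PII rule applies to error trackers too. sentry_sdk accepts a `before_send` hook for scrubbing events before they leave the process; a minimal sketch (the key list is an assumption):

```python
SENSITIVE_KEYS = {"password", "token", "authorization", "email"}

def scrub_event(event, hint):
    # Redact sensitive keys from the extra context; extend to other
    # event sections (request, user) as needed.
    for key in list(event.get("extra", {})):
        if key.lower() in SENSITIVE_KEYS:
            event["extra"][key] = "[redacted]"
    return event

sentry_sdk.init(
    dsn="https://key@sentry.io/project",
    before_send=scrub_event,
)
```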
## SLI/SLO Definition

### Example SLOs

| Service | SLI | SLO | Window |
|---|---|---|---|
| API | Availability (2xx / total) | 99.9% | 30 days |
| API | Latency p99 | < 500ms | 30 days |
| Search | Result relevance | > 80% | 7 days |
| Ingest | Processing success rate | 99.5% | 30 days |
### Error Budget

```
Error Budget = 1 - SLO = 1 - 0.999 = 0.1%
Monthly budget = 30 days * 24h * 60min * 0.001 = 43.2 minutes
```
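
The same arithmetic as a worked sketch, extended to burn rate, which is how the budget becomes an alert signal (the observed error ratio is a made-up input):

```python
SLO = 0.999
WINDOW_MINUTES = 30 * 24 * 60  # 30-day window = 43,200 minutes

budget_ratio = 1 - SLO                          # 0.001
budget_minutes = WINDOW_MINUTES * budget_ratio  # 43.2 minutes of allowed failure

# Burn rate = observed error ratio / budget ratio.
# 1.0 means the budget runs out exactly at the end of the window;
# a sustained burn rate of 14.4 exhausts a 30-day budget in about 2 days.
observed_error_ratio = 0.005  # hypothetical measurement over the alert window
burn_rate = observed_error_ratio / budget_ratio  # 5.0
```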
## Alerting Rules

### Prometheus Alert Examples

```yaml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for 5 minutes"
      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m])) > 1
        for: 10m
        labels:
          severity: warning
```
### Alert Best Practices

- Alert on symptoms (high latency), not causes (high CPU)
- Include runbook links in annotations
- Set appropriate severity: page only for user-impacting issues
- Use a `for` duration to avoid flapping
## Rules

- MUST emit logs as structured JSON with a `trace_id` (or correlation ID) propagated across service boundaries — unstructured logs are unsearchable at scale
- MUST alert on symptoms (user-visible latency/errors), not causes (high CPU, queue depth) — causes change, symptoms are stable
- NEVER log PII, PHI, or credentials — even at debug level. Logs leak to aggregation systems, backups, and disk snapshots.
- NEVER use histograms without explicit buckets tuned to the expected range — default buckets either miss the p99 or waste cardinality
- CRITICAL: SLOs define alerts, not metrics. An SLO of "99.9% of requests under 200ms" produces one alert ("burn rate exceeded") — not six alerts on each constituent metric.
- MANDATORY: every alert includes a runbook link in its annotations. An alert without a runbook pages someone who then has to invent a response at 3am.
## Gotchas

- `trace_id` must propagate via headers (`traceparent` / `X-Correlation-ID`) AND be added to every log line AND span. Partial propagation produces broken traces — one missing middleware and the span disappears from the UI. See the propagation sketch after this list.
- The OpenTelemetry SDK defaults to `otlp/grpc` on port 4317. Many corporate networks block it; `otlp/http` on port 4318 (the `/v1/traces` path) works through proxies. Check connectivity before debugging app-level issues.
- High-cardinality labels (user IDs, request IDs) on Prometheus metrics explode the time-series database. Prefer exemplars or traces for per-request data; keep metric labels to bounded categorical values.
- `for: 5m` in Prometheus delays alerts by 5 minutes AND requires the condition to hold throughout — a 4m59s spike does not alert, even if it recurs. Use a shorter `for` on critical alerts with a higher severity threshold.
- Log-based metrics (counting log lines matching a regex) are expensive and brittle. A log format change silently breaks the metric; prefer direct instrumentation from the code.
- Sampled traces (probability = 0.1) miss rare errors by design. For error-path visibility, use "always-sample on error" (tail sampling or head-based with error flag).
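
A sketch of explicit context propagation on an outgoing HTTP call, using OpenTelemetry's propagate API with the requests library (the downstream URL is a placeholder):

```python
import requests
from opentelemetry.propagate import inject

def call_downstream(payload: dict):
    headers = {}
    inject(headers)  # writes the active span context as a traceparent header
    return requests.post("http://downstream-service/orders", json=payload, headers=headers)
```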
## When NOT to Load

- For implementing an alert rule or log aggregator — use the tool's docs (Prometheus, Grafana, Datadog); this skill is pattern-level
- For performance tuning (profiling, flame graphs) — use `/performance-profiling`
- For incident response mechanics during an outage — use `/workflow incident-response`
- For generic logging library choice in a new project — use `/app-builder`, which handles scaffolding
- For security auditing of log content (PII leaks) — use `/security-patterns` and `/hipaa-validate`