원클릭으로 Manus에서 모든 스킬 실행

시작하기

observability-monitoring

스타2

포크1

업데이트2026년 4월 5일 22:44

Patterns for metrics, tracing, logging, alerting, dashboards, and SLO definition

설치

Codex 또는 Claude로 설치 이 Prompt를 복사해 Codex, Claude 또는 다른 어시스턴트에 붙여 넣으면 Skill 페이지를 검토하고 설치를 진행할 수 있습니다.

Manus에서 실행

출처

navraj007in

navraj007in/architecture-cowork-plugin

GitHub 저장소 열기 Creator 저장소 보기

다운로드

Manus에서 실행

Observability & Monitoring Skill

Comprehensive observability strategy: metrics collection, distributed tracing, structured logging, alerting rules, dashboard design, and SLO/SLA definitions.

Overview

Observability answers "what is happening in production?" across three pillars:

Metrics — numeric signals over time (CPU %, error rate, latency p99)
Logs — structured text events from code (request IDs, decisions, state changes)
Traces — request flows across services (entry → service A → service B → exit)

This skill provides framework-agnostic patterns and thresholds for building each pillar.

Pillar 1: Metrics

RED Method (Request-driven Services)

For services handling requests, collect three metrics per endpoint:

Metric	Prometheus	Purpose	Query
Rate	`http_requests_total`	requests per second	`rate(http_requests_total[1m])`
Errors	`http_requests_total{status=~"5.."}`	failed %	`rate(http_requests_total{status=~"5.."}[1m]) / rate(http_requests_total[1m])`
Duration	`http_request_duration_seconds`	latency histogram	`histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[1m]))`

Implementation pattern (Node.js + Prometheus client):

import promClient from 'prom-client';

const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});

const httpRequestTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

// Middleware (Express)
app.use((req, res, next) => {
  const startTime = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - startTime) / 1000;
    httpRequestDuration
      .labels(req.method, req.route?.path || req.url, res.statusCode)
      .observe(duration);
    httpRequestTotal
      .labels(req.method, req.route?.path || req.url, res.statusCode)
      .inc();
  });
  next();
});

USE Method (Resource-driven Services)

For background workers or batch services, collect three metrics per resource:

Metric	Prometheus	Purpose	Alert threshold
Utilization	`process_cpu_seconds_total`, `process_resident_memory_bytes`	% of capacity used	CPU > 70%, Memory > 80%
Saturation	`job_queue_length`, `db_connection_pool_active`	items waiting	queue > 100 items
Errors	`task_failures_total`, `db_connection_errors_total`	failed operations	> 1% of operations

Example (background worker):

const jobQueueLength = new promClient.Gauge({
  name: 'job_queue_length',
  help: 'Number of jobs waiting in queue'
});

const taskFailures = new promClient.Counter({
  name: 'task_failures_total',
  help: 'Total failed tasks',
  labelNames: ['task_type', 'error_code']
});

// Worker loop
async function processJob(job: Job) {
  try {
    jobQueueLength.set(await queue.length());
    await job.execute();
  } catch (error) {
    taskFailures.labels(job.type, error.code).inc();
    throw error;
  }
}

Golden Signals (All Services)

Monitor these four signals universally:

Signal	Metric	MVP	Growth	Enterprise
Latency	p50, p95, p99	p99 < 500ms	p99 < 200ms	p99 < 100ms
Traffic	requests/sec	> 0	track trends	auto-scale rule
Errors	error rate %	< 1%	< 0.5%	< 0.1%
Saturation	queue depth, CPU %	manual review	alert @ 70% CPU	auto-scale @ 60%

Prometheus dashboard queries:

# Latency (p99)
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Request rate
sum(rate(http_requests_total[1m]))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m]))

# CPU utilization
rate(process_cpu_seconds_total[1m]) * 100

Pillar 2: Distributed Tracing

OpenTelemetry Semantic Conventions

Standard attribute names ensure interoperability across tools (Datadog, Jaeger, New Relic, etc.).

HTTP attributes:

http.request.method          "GET" | "POST" | ...
http.url                     "https://example.com/users?id=123"
http.target                  "/users?id=123" (path + query)
http.host                    "example.com"
http.status_code             200, 404, 500, ...
http.request.body.size       bytes
http.response.body.size      bytes

Database attributes:

db.system                    "postgresql" | "mysql" | "mongodb" | ...
db.name                      "users_db"
db.statement                 "SELECT * FROM users WHERE id = ?"
db.connection_string         "postgresql://..." (scrubbed of credentials)
db.operation                 "SELECT" | "INSERT" | "UPDATE" | "DELETE"
db.rows_affected             N

Service/span attributes:

service.name                 "api-server"
service.version              "0.1.0"
trace.id                     unique identifier
span.id                      unique per span
span.parent_id               parent span ID (or root span has no parent)
span.kind                    "SERVER" | "CLIENT" | "INTERNAL" | "PRODUCER" | "CONSUMER"
span.status                  "OK" | "ERROR" | "UNSET"

Trace Context Propagation

Pass trace IDs across service boundaries for end-to-end visibility.

HTTP headers (W3C Trace Context standard):

traceparent: version-traceId-spanId-traceFlags
Example: traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

Implementation (Node.js + OpenTelemetry):

import { getTracer } from '@opentelemetry/api';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';

const tracer = getTracer('api-server');

// Outbound HTTP client call
const span = tracer.startSpan('external-api-call', {
  attributes: {
    'http.method': 'GET',
    'http.url': 'https://dependency-api.example.com/data',
    'span.kind': 'CLIENT'
  }
});

// Automatic propagation via fetch or axios with OpenTelemetry instrumentation
const response = await fetch('https://dependency-api.example.com/data');
span.end();

Trace Sampling Rules

Control what fraction of requests are traced (100% tracing is expensive; smart sampling targets slow/errored requests).

Sampling strategies by stage:

Stage	Sampling rule	Example
MVP	100% (trace all)	`ratio=1.0` — smaller user base, capture everything
Growth	Error + tail latency	Trace all errors + slowest 5% (`ratio=0.05` for success paths)
Enterprise	Intelligent sampling	Datadog intelligent sampling, New Relic tail-based sampling

Implementation (Node.js):

import { NodeTracerProvider, BatchSpanProcessor } from '@opentelemetry/node';
import { ProbabilitySampler } from '@opentelemetry/core';

const provider = new NodeTracerProvider({
  sampler: new ProbabilitySampler(0.1) // 10% sampling for growth stage
});

Tail-based sampling rules (Enterprise):

# Example: sample all errors + slowest 1%
tail_sampling:
  policies:
    - name: error-policy
      type: status_code
      status_code:
        status_codes: [ERROR]
    
    - name: latency-policy
      type: latency
      latency:
        threshold_ms: 1000
        upper_threshold_ms: 5000

Pillar 3: Structured Logging

Log Levels and Guidelines

Use standardized levels consistently:

Level	When to use	Example
DEBUG	Development only; very detailed state	`db_query_params: {...}`, `cache_hit: true`
INFO	Notable events in request flow	`"user created"`, `"payment processed"`, `"email sent"`
WARN	Recoverable issue needing attention	`"retry after 3 failures"`, `"fallback to default value"`
ERROR	Request/operation failed, needs investigation	`"database connection timeout"`, `"payment gateway returned 500"`
FATAL	Service cannot continue, immediate human action needed	`"out of disk space"`, `"database unreachable"`

Guidelines:

Never log at DEBUG in production (disable via config)
Use ERROR only for actual errors (not every validation failure)
WARN for things that worked but are sub-optimal
Include request context (trace ID, user ID) in every log

Structured Log Fields

Use consistent field names across all services for dashboard aggregation:

Request context (always include in every log):

{
  "trace_id": "0af7651916cd43dd",
  "span_id": "b7ad6b716920",
  "request_id": "req-xyz",
  "user_id": "user-123",
  "tenant_id": "org-456",
  "session_id": "sess-789"
}

Business context:

{
  "entity_type": "order",
  "entity_id": "order-001",
  "action": "create",
  "status": "pending",
  "amount": 123.45,
  "currency": "USD"
}

Execution context:

{
  "service": "api-server",
  "version": "0.1.0",
  "environment": "production",
  "function": "processPayment",
  "duration_ms": 234
}

Error context (only when level is ERROR or FATAL):

{
  "error_code": "PAYMENT_TIMEOUT",
  "error_message": "Payment gateway did not respond within 30s",
  "error_stack": "...stack trace...",
  "retry_count": 2,
  "retriable": true
}

Structured Logging Implementation

Node.js (Winston):

import winston from 'winston';

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.json(),
  defaultMeta: {
    service: 'api-server',
    version: process.env.APP_VERSION,
    environment: process.env.NODE_ENV
  },
  transports: [
    new winston.transports.Console({
      format: winston.format.simple()
    }),
    new winston.transports.File({
      filename: 'logs/error.log',
      level: 'error',
      format: winston.format.json()
    })
  ]
});

// Log with context
logger.info('order created', {
  trace_id: req.traceId,
  user_id: req.userId,
  entity_type: 'order',
  entity_id: orderId,
  amount: 123.45,
  duration_ms: Date.now() - startTime
});

logger.error('payment failed', {
  trace_id: req.traceId,
  error_code: error.code,
  error_message: error.message,
  retry_count: retries,
  retriable: error.retriable
});

Python (structlog):

import structlog

logger = structlog.get_logger()

logger.info(
    "order_created",
    trace_id=trace_id,
    user_id=user_id,
    entity_type="order",
    entity_id=order_id,
    amount=123.45,
    duration_ms=elapsed_time
)

logger.error(
    "payment_failed",
    trace_id=trace_id,
    error_code=error.code,
    error_message=str(error),
    retry_count=retries
)

Log aggregation (Loki/ELK/Datadog):

All logs are JSON-serialized. Log aggregation systems parse and index these fields automatically.

Example query (Grafana Loki):

{service="api-server", level="error"} | json | error_code != ""

Alerting Rules

Alert Thresholds by Stage

Use these thresholds as baselines; adjust per service's SLA.

MVP stage:

Alert	Threshold	Duration	Action
Service Down	Status code 5xx > 10%	1 min	Page on-call
High Latency	p95 > 500ms	5 min	Monitor; page if sustained
Error Spike	5x baseline error rate	2 min	Page on-call

Growth stage:

Alert	Threshold	Duration	Action
Service Down	Status code 5xx > 5%	2 min	Page on-call
High Latency	p95 > 200ms	10 min	Page on-call
High Latency	p99 > 500ms	5 min	Page on-call
Error Spike	Error rate > 1%	5 min	Create incident
CPU Saturation	> 70% for 10 min	10 min	Auto-scale or page
Memory Leak	Memory usage rising > 50%/hour	sustained	Page on-call

Enterprise stage:

Alert	Threshold	Duration	Action
Service Degradation	Status code 5xx > 0.5%	1 min	Create incident
Latency SLO Miss	p99 > SLO target	5 min	Create incident
Error Budget Burn	Consumed > 10%/day	realtime	Page on-call
CPU Saturation	> 60% for 5 min	sustained	Auto-scale
Database Connection Pool	Active > 80%	5 min	Page DBA
Disk Space	Free < 10%	realtime	Critical alert

Prometheus Alert Rules (YAML)

File location: monitoring/alerts/rules.yaml or integrated into Prometheus config

groups:
  - name: api-server
    interval: 30s
    rules:
      # MVP threshold
      - alert: HighErrorRate
        expr: |
          (sum(rate(http_requests_total{status=~"5.."}[1m])) 
           / sum(rate(http_requests_total[1m]))) > 0.1
        for: 1m
        annotations:
          summary: "High error rate (>10%) in api-server"
          description: "Error rate: {{ $value | humanizePercentage }}"

      # Growth threshold
      - alert: HighLatencyP95
        expr: |
          histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.2
        for: 10m
        annotations:
          summary: "High latency (p95 > 200ms)"

      # CPU saturation (growth stage)
      - alert: HighCPUUsage
        expr: |
          rate(process_cpu_seconds_total[1m]) > 0.7
        for: 10m
        annotations:
          summary: "CPU utilization > 70%"

      # Memory leak detection
      - alert: MemoryLeak
        expr: |
          rate(process_resident_memory_bytes[1h]) > 52428800  # 50 MiB/hour
        for: 30m
        annotations:
          summary: "Possible memory leak (memory growing > 50 MiB/hour)"

SLO and SLA Definition

SLO vs SLA

Term	Means	Owner	Consequence
SLO (Service Level Objective)	Internal goal (e.g., "99% uptime")	Engineering team	Guides investment + on-call
SLA (Service Level Agreement)	Promise to customers; has penalties	Product/Legal	Financial

SLOs are typically more aggressive than SLAs (SLO: 99.9%, SLA: 99% — gives 0.9% buffer).

SLO Template (per service)

## [Service Name] SLO

### Availability
- **Target:** 99.5% uptime
- **Budget:** 21.6 minutes downtime/month
- **Measurement:** HTTP status 2xx or 3xx / total requests

### Latency
- **Target:** 95th percentile < 200ms
- **Measurement:** p95(request duration) measured over 1-minute windows

### Error Rate
- **Target:** Error rate < 0.5%
- **Measurement:** 5xx responses / total requests

### Duration & Review
- **Quarter:** Q2 2026 (Apr-Jun)
- **Review:** Monthly; escalate if budget consumed > 33%/month

Error Budget Tracking

Once SLO is defined, calculate "error budget" — how much failure is acceptable:

Availability SLO: 99.5%
Allowed downtime per month: (1 - 0.995) × 30 days × 24 hours = ~21.6 minutes

If 5 minutes of unexpected downtime occurs on April 3:
  Remaining budget: 21.6 - 5 = 16.6 minutes for the rest of April
  % consumed: (5 / 21.6) × 100 = 23% of monthly budget

Decision rule:

If < 33% budget consumed in first week → proceed with risky changes/deployments
If > 66% budget consumed → freeze risky changes, focus on stability

Dashboard Design

Essential Dashboards (by service type)

REST API service:

RED metrics (rate, errors, duration p50/p95/p99)
Instance health (CPU, memory, disk)
Database connection pool (active, idle, wait queue)
Cache hit rate (if applicable)

Background worker:

USE metrics (utilization, saturation, errors)
Job queue length and processing time
Failed job count and retry rate
Resource trends (CPU/memory over 24h)

Web frontend:

Core Web Vitals (LCP, FID, CLS)
JavaScript errors (tracked via Sentry/Rollbar)
User session count and session duration
Page load performance by route

Database:

Query latency (p95, p99)
Connection pool (active, idle, max)
Replication lag (if applicable)
Disk usage trend
Slow query count

Prometheus + Grafana Setup

Prometheus configuration (prometheus.yml):

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'api-server'
    static_configs:
      - targets: ['localhost:3000']
    metrics_path: '/metrics'
    scrape_interval: 10s

  - job_name: 'database'
    static_configs:
      - targets: ['localhost:9187']  # postgres_exporter

Grafana dashboard (JSON):

{
  "dashboard": {
    "title": "API Server — RED Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{
          "expr": "sum(rate(http_requests_total[1m]))"
        }]
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[1m])) / sum(rate(http_requests_total[1m]))"
        }]
      },
      {
        "title": "Latency P99",
        "targets": [{
          "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))"
        }]
      }
    ]
  }
}

Multi-Cloud Observability

Provider-Specific SDKs

Integrate with cloud-native observability platforms for rich integrations.

Datadog:

import { datadogRum } from '@datadog/browser-rum';
import tracer from 'dd-trace';

// Backend
tracer.init(); // Auto-instruments HTTP, DB, cache
tracer.trace('custom-operation', () => {
  // custom logic
});

// Frontend
datadogRum.init({
  applicationId: 'app-id',
  clientToken: 'token',
  site: 'datadoghq.com',
  service: 'web-app',
  env: 'production',
  sessionSampleRate: 100,
  sessionReplaySampleRate: 20,
  trackUserInteractions: true
});

New Relic:

const newrelic = require('newrelic');

newrelic.instrumentLoadedModule('pg', new newrelic.API.QuerySpec({
  operation: 'query'
}));

newrelic.startSegment('custom-segment', false, () => {
  // custom logic
});

AWS CloudWatch:

import { CloudWatchClient, PutMetricDataCommand } from "@aws-sdk/client-cloudwatch";

const client = new CloudWatchClient({ region: 'us-east-1' });
await client.send(new PutMetricDataCommand({
  Namespace: 'api-server',
  MetricData: [{
    MetricName: 'RequestCount',
    Value: 1,
    Unit: 'Count'
  }]
}));

Observability Checklist

Metrics: RED (rate, errors, duration) for all APIs; USE (utilization, saturation, errors) for workers
Tracing: OpenTelemetry initialized; trace IDs propagated across service boundaries
Logging: Structured JSON logs with trace_id, user_id, entity_id, service, environment
Alerting: Alert rules defined per stage (MVP/Growth/Enterprise); on-call escalation path documented
SLO: Availability, latency, and error rate targets documented; error budget calculated
Dashboards: RED dashboard, resource dashboard, SLO dashboard created in Grafana/Datadog
Log retention: Retention policy documented (e.g., 30 days for INFO, 90 days for ERROR)
Privacy: No PII, API keys, or credentials in logs; sensitive fields masked or hashed

name	Observability & Monitoring
description	Patterns for metrics, tracing, logging, alerting, dashboards, and SLO definition

observability-monitoring

이 저장소의 다른 Skills

이 저장소의 다른 Skills

Observability & Monitoring Skill

Overview

Pillar 1: Metrics

RED Method (Request-driven Services)

USE Method (Resource-driven Services)

Golden Signals (All Services)

Pillar 2: Distributed Tracing

OpenTelemetry Semantic Conventions

Trace Context Propagation

Trace Sampling Rules

Pillar 3: Structured Logging

Log Levels and Guidelines

Structured Log Fields

Structured Logging Implementation

Alerting Rules

Alert Thresholds by Stage

Prometheus Alert Rules (YAML)

SLO and SLA Definition

SLO vs SLA

SLO Template (per service)

Error Budget Tracking

Dashboard Design

Essential Dashboards (by service type)

Prometheus + Grafana Setup

Multi-Cloud Observability

Provider-Specific SDKs

Observability Checklist

Observability & Monitoring Skill

Overview

Pillar 1: Metrics

RED Method (Request-driven Services)

USE Method (Resource-driven Services)

Golden Signals (All Services)

Pillar 2: Distributed Tracing

OpenTelemetry Semantic Conventions

Trace Context Propagation

Trace Sampling Rules

Pillar 3: Structured Logging

Log Levels and Guidelines

Structured Log Fields

Structured Logging Implementation

Alerting Rules

Alert Thresholds by Stage

Prometheus Alert Rules (YAML)

SLO and SLA Definition

SLO vs SLA

SLO Template (per service)

Error Budget Tracking

Dashboard Design

Essential Dashboards (by service type)

Prometheus + Grafana Setup

Multi-Cloud Observability

Provider-Specific SDKs

Observability Checklist