observability-monitoring
Patterns for metrics, tracing, logging, alerting, dashboards, and SLO definition
| name | Observability & Monitoring |
| description | Patterns for metrics, tracing, logging, alerting, dashboards, and SLO definition |
Comprehensive observability strategy: metrics collection, distributed tracing, structured logging, alerting rules, dashboard design, and SLO/SLA definitions.
Observability answers "what is happening in production?" across three pillars: metrics, traces, and logs.
This skill provides framework-agnostic patterns and thresholds for building each pillar.
For services handling requests, collect the three RED metrics (Rate, Errors, Duration) per endpoint:
| Metric | Prometheus metric | Purpose | Query |
|---|---|---|---|
| Rate | http_requests_total | requests per second | rate(http_requests_total[1m]) |
| Errors | http_requests_total{status_code=~"5.."} | failed % | sum(rate(http_requests_total{status_code=~"5.."}[1m])) / sum(rate(http_requests_total[1m])) |
| Duration | http_request_duration_seconds | latency histogram | histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[1m]))) |
Implementation pattern (Node.js + Prometheus client):
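import express from 'express';
import promClient from 'prom-client';

const app = express();

const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});
const httpRequestTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});
// Middleware (Express)
app.use((req, res, next) => {
  const startTime = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - startTime) / 1000;
    // Label with the matched route pattern rather than the raw URL,
    // so each distinct URL does not create its own time series.
    const route = req.route?.path ?? 'unmatched';
    const statusCode = String(res.statusCode);
    httpRequestDuration.labels(req.method, route, statusCode).observe(duration);
    httpRequestTotal.labels(req.method, route, statusCode).inc();
  });
  next();
});
// Expose the collected metrics for Prometheus to scrape
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});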
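To make the `rate(...)` queries in the table above less opaque, here is a rough TypeScript sketch of what a per-second rate over counter samples amounts to. `CounterSample`, `perSecondRate`, and `errorRatio` are hypothetical names used only for illustration. Prometheus itself also extrapolates to the edges of the window, which this sketch leaves out.

```typescript
// One scraped sample of a monotonically increasing counter.
interface CounterSample {
  timestampSec: number; // scrape time, in seconds
  value: number;        // cumulative counter value at that time
}

// Approximate per-second rate over a window of samples.
// A drop in value is treated as a counter reset (for example a process
// restart), so the post-reset value counts as the increase since the reset.
function perSecondRate(samples: CounterSample[]): number {
  if (samples.length < 2) return 0;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].value - samples[i - 1].value;
    increase += delta >= 0 ? delta : samples[i].value;
  }
  const elapsedSec =
    samples[samples.length - 1].timestampSec - samples[0].timestampSec;
  return elapsedSec > 0 ? increase / elapsedSec : 0;
}

// Error ratio = error rate / total rate, as in the Errors query.
function errorRatio(errors: CounterSample[], total: CounterSample[]): number {
  const totalRate = perSecondRate(total);
  return totalRate > 0 ? perSecondRate(errors) / totalRate : 0;
}

const requestSamples: CounterSample[] = [
  { timestampSec: 0, value: 1000 },
  { timestampSec: 30, value: 1240 },
  { timestampSec: 60, value: 1480 },
];
const errorSamples: CounterSample[] = [
  { timestampSec: 0, value: 10 },
  { timestampSec: 30, value: 25 },
  { timestampSec: 60, value: 40 },
];

console.log(perSecondRate(requestSamples));            // 8 requests per second
console.log(errorRatio(errorSamples, requestSamples)); // 0.0625
```

Dividing two rates rather than two raw counters is what keeps the ratio meaningful across restarts, since each rate absorbs its own counter resets.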
For background workers or batch services, collect the three USE metrics (Utilization, Saturation, Errors) per resource:
| Metric | Prometheus metrics | Purpose | Alert threshold |
|---|---|---|---|
| Utilization | process_cpu_seconds_total, process_resident_memory_bytes | % of capacity used | CPU > 70%, Memory > 80% |
| Saturation | job_queue_length, db_connection_pool_active | items waiting | queue > 100 items |
| Errors | task_failures_total, db_connection_errors_total | failed operations | > 1% of operations |
Example (background worker):
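import promClient from 'prom-client';

const jobQueueLength = new promClient.Gauge({
  name: 'job_queue_length',
  help: 'Number of jobs waiting in queue'
});
const taskFailures = new promClient.Counter({
  name: 'task_failures_total',
  help: 'Total failed tasks',
  labelNames: ['task_type', 'error_code']
});
// Worker loop. `Job` and `queue` stand in for the application's own
// job type and queue client.
async function processJob(job: Job) {
  try {
    jobQueueLength.set(await queue.length());
    await job.execute();
  } catch (error) {
    // Caught errors are untyped, so fall back to a fixed label value
    const errorCode = (error as { code?: string }).code ?? 'unknown';
    taskFailures.labels(job.type, errorCode).inc();
    throw error;
  }
}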
Monitor the four golden signals (latency, traffic, errors, saturation) for every service:
| Signal | Metric | MVP | Growth | Enterprise |
|---|---|---|---|---|
| Latency | p50, p95, p99 | p99 < 500ms | p99 < 200ms | p99 < 100ms |
| Traffic | requests/sec | > 0 | track trends | auto-scale rule |
| Errors | error rate % | < 1% | < 0.5% | < 0.1% |
| Saturation | queue depth, CPU % | manual review | alert @ 70% CPU | auto-scale @ 60% |
Prometheus dashboard queries:
# Latency (p99), aggregated across instances
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Request rate
sum(rate(http_requests_total[1m]))
# Error rate (label name matches the status_code label set by the middleware)
sum(rate(http_requests_total{status_code=~"5.."}[1m])) / sum(rate(http_requests_total[1m]))
# CPU utilization (percent of one core)
rate(process_cpu_seconds_total[1m]) * 100
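The `histogram_quantile` queries above estimate a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The following is a simplified sketch of that estimate, not Prometheus's actual implementation; `Bucket` and `histogramQuantile` are illustrative names, and the sample counts are made up.

```typescript
interface Bucket {
  le: number;    // inclusive upper bound in seconds (Infinity for +Inf)
  count: number; // cumulative number of observations <= le
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1 || buckets.length === 0) return NaN;
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  // First bucket whose cumulative count reaches the rank
  const idx = sorted.findIndex((b) => b.count >= rank);
  if (idx < 0) return NaN;
  const bucket = sorted[idx];

  // Rank falls in the +Inf bucket: report the highest finite bound
  if (bucket.le === Infinity) {
    return idx > 0 ? sorted[idx - 1].le : NaN;
  }

  const lowerBound = idx > 0 ? sorted[idx - 1].le : 0;
  const lowerCount = idx > 0 ? sorted[idx - 1].count : 0;
  const inBucket = bucket.count - lowerCount;
  if (inBucket === 0) return lowerBound;

  // Linear interpolation between the bucket's lower and upper bound
  return lowerBound + (bucket.le - lowerBound) * ((rank - lowerCount) / inBucket);
}

const latencyBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 98 },
  { le: Infinity, count: 100 },
];

console.log(histogramQuantile(0.95, latencyBuckets)); // 0.8125
console.log(histogramQuantile(0.99, latencyBuckets)); // 1
```

Because the result is interpolated, it is only as precise as the bucket boundaries, so it helps to place a bucket bound at or near each latency target that an alert or SLO refers to.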
Standard attribute names ensure interoperability across tools (Datadog, Jaeger, New Relic, etc.).
HTTP attributes:
http.request.method "GET" | "POST" | ...
url.full "https://example.com/users?id=123"
url.path "/users"
url.query "id=123"
server.address "example.com"
http.response.status_code 200, 404, 500, ...
http.request.body.size bytes
http.response.body.size bytes
Database attributes:
db.system "postgresql" | "mysql" | "mongodb" | ...
db.name "users_db"
db.statement "SELECT * FROM users WHERE id = ?"
db.connection_string "postgresql://..." (scrubbed of credentials)
db.operation "SELECT" | "INSERT" | "UPDATE" | "DELETE"
db.rows_affected N
Service/span attributes:
service.name "api-server"
service.version "0.1.0"
trace.id unique identifier
span.id unique per span
span.parent_id parent span ID (absent on the root span)
span.kind "SERVER" | "CLIENT" | "INTERNAL" | "PRODUCER" | "CONSUMER"
span.status "OK" | "ERROR" | "UNSET"
Pass trace IDs across service boundaries for end-to-end visibility.
HTTP headers (W3C Trace Context standard):
traceparent: version-traceId-spanId-traceFlags
Example: traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
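As an illustration of the header layout above, a minimal parser for version `00` of the `traceparent` header could look like the sketch below. `TraceParent` and `parseTraceparent` are hypothetical names; in practice an OpenTelemetry propagator does this parsing, so the code is only meant to show what the four fields contain.

```typescript
interface TraceParent {
  version: string;
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters (the caller's span ID)
  sampled: boolean; // lowest bit of the trace-flags byte
}

const TRACEPARENT_PATTERN =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (match === null) return null;
  const [, version, traceId, parentId, flags] = match;

  // Version ff and all-zero IDs are defined as invalid
  if (version === 'ff') return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;

  return {
    version,
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 0x01) === 0x01,
  };
}

const parsedHeader = parseTraceparent(
  '00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01',
);
console.log(parsedHeader?.traceId);  // 0af7651916cd43dd8448eb211c80319c
console.log(parsedHeader?.parentId); // b7ad6b7169203331
console.log(parsedHeader?.sampled);  // true
```

Returning `null` for a malformed header lets the receiving service start a fresh trace instead of propagating a broken one.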
Implementation (Node.js + OpenTelemetry):
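import { trace, SpanKind, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('api-server');

// Outbound HTTP client call. startActiveSpan makes the span current, so an
// instrumented HTTP client can inject its context into the traceparent header.
const data = await tracer.startActiveSpan(
  'external-api-call',
  {
    kind: SpanKind.CLIENT,
    attributes: {
      'http.request.method': 'GET',
      'url.full': 'https://dependency-api.example.com/data'
    }
  },
  async (span) => {
    try {
      // Propagation is automatic only when OpenTelemetry instrumentation
      // for the HTTP client in use has been registered
      const response = await fetch('https://dependency-api.example.com/data');
      span.setAttribute('http.response.status_code', response.status);
      return await response.json();
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end(); // always end the span, including on failure
    }
  }
);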
Control what fraction of requests are traced (100% tracing is expensive; smart sampling targets slow/errored requests).
Sampling strategies by stage:
| Stage | Sampling rule | Example |
|---|---|---|
| MVP | 100% (trace all) | ratio=1.0 — smaller user base, capture everything |
| Growth | Error + tail latency | Trace all errors + slowest 5% (ratio=0.05 for success paths) |
| Enterprise | Intelligent sampling | Datadog intelligent sampling, New Relic tail-based sampling |
Implementation (Node.js):
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const provider = new NodeTracerProvider({
  // Sample 10% of new traces; child spans follow the parent's decision
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) })
});
provider.register();
Tail-based sampling rules (Enterprise):
# Example: keep all error traces + traces slower than 1 s
tail_sampling:
  policies:
    - name: error-policy
      type: status_code
      status_code:
        status_codes: [ERROR]
    - name: latency-policy
      type: latency
      latency:
        threshold_ms: 1000
        upper_threshold_ms: 5000
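The decision those policies encode can be sketched in a few lines. This is an illustration of the idea only, not how any particular collector implements it; `CompletedTrace`, `TailPolicy`, `traceIdFraction`, and `shouldKeepTrace` are hypothetical names, and the baseline ratio for ordinary traces is an added assumption that the YAML above does not contain.

```typescript
interface CompletedTrace {
  traceId: string;    // 32 hex characters
  hasError: boolean;  // any span ended with status ERROR
  durationMs: number; // end-to-end duration of the trace
}

interface TailPolicy {
  latencyThresholdMs: number; // keep traces slower than this
  baselineRatio: number;      // fraction of ordinary traces to keep, 0..1
}

// Map a trace ID to a stable number in [0, 1) so every service that sees
// the same trace reaches the same decision.
function traceIdFraction(traceId: string): number {
  return parseInt(traceId.slice(0, 8), 16) / 0x100000000;
}

function shouldKeepTrace(trace: CompletedTrace, policy: TailPolicy): boolean {
  if (trace.hasError) return true;                             // error policy
  if (trace.durationMs > policy.latencyThresholdMs) return true; // latency policy
  return traceIdFraction(trace.traceId) < policy.baselineRatio;  // baseline sample
}

const tailPolicy: TailPolicy = { latencyThresholdMs: 1000, baselineRatio: 0.05 };

const fastOkTrace: CompletedTrace = {
  traceId: 'ffffffff16cd43dd8448eb211c80319c',
  hasError: false,
  durationMs: 120,
};

console.log(shouldKeepTrace({ ...fastOkTrace, hasError: true }, tailPolicy));   // true
console.log(shouldKeepTrace({ ...fastOkTrace, durationMs: 2500 }, tailPolicy)); // true
console.log(shouldKeepTrace(fastOkTrace, tailPolicy));                          // false
```

The decision needs the finished trace (its status and total duration), which is why tail-based sampling runs in a collector that buffers spans rather than in the application's head sampler.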
Use standardized levels consistently:
| Level | When to use | Example |
|---|---|---|
| DEBUG | Development only; very detailed state | db_query_params: {...}, cache_hit: true |
| INFO | Notable events in request flow | "user created", "payment processed", "email sent" |
| WARN | Recoverable issue needing attention | "retry after 3 failures", "fallback to default value" |
| ERROR | Request/operation failed, needs investigation | "database connection timeout", "payment gateway returned 500" |
| FATAL | Service cannot continue, immediate human action needed | "out of disk space", "database unreachable" |
Guidelines:
Use consistent field names across all services for dashboard aggregation:
Request context (always include in every log):
{
"trace_id": "0af7651916cd43dd8448eb211c80319c",
"span_id": "b7ad6b7169203331",
"request_id": "req-xyz",
"user_id": "user-123",
"tenant_id": "org-456",
"session_id": "sess-789"
}
Business context:
{
"entity_type": "order",
"entity_id": "order-001",
"action": "create",
"status": "pending",
"amount": 123.45,
"currency": "USD"
}
Execution context:
{
"service": "api-server",
"version": "0.1.0",
"environment": "production",
"function": "processPayment",
"duration_ms": 234
}
Error context (only when level is ERROR or FATAL):
{
"error_code": "PAYMENT_TIMEOUT",
"error_message": "Payment gateway did not respond within 30s",
"error_stack": "...stack trace...",
"retry_count": 2,
"retriable": true
}
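One way to keep these field groups consistent is to assemble every log entry through a single helper. The sketch below is an assumption about how that might be done, not part of any logging library; `buildLogEntry` and the interface names are hypothetical, while the field names come from the examples above.

```typescript
type LogLevel = 'debug' | 'info' | 'warn' | 'error' | 'fatal';

interface RequestContext {
  trace_id: string;
  span_id: string;
  request_id: string;
  user_id?: string;
  tenant_id?: string;
  session_id?: string;
}

interface ErrorContext {
  error_code: string;
  error_message: string;
  error_stack?: string;
  retry_count?: number;
  retriable?: boolean;
}

function buildLogEntry(
  level: LogLevel,
  message: string,
  request: RequestContext,
  fields: Record<string, unknown> = {},
  error?: ErrorContext,
): Record<string, unknown> {
  // Request context is spread last so ad-hoc fields cannot overwrite trace IDs
  const entry: Record<string, unknown> = { level, message, ...fields, ...request };
  // Error context is attached only at ERROR or FATAL level
  if (error !== undefined && (level === 'error' || level === 'fatal')) {
    Object.assign(entry, error);
  }
  return entry;
}

const requestContext: RequestContext = {
  trace_id: '0af7651916cd43dd8448eb211c80319c',
  span_id: 'b7ad6b7169203331',
  request_id: 'req-xyz',
};
const paymentError: ErrorContext = {
  error_code: 'PAYMENT_TIMEOUT',
  error_message: 'Payment gateway did not respond within 30s',
  retry_count: 2,
  retriable: true,
};

const errorEntry = buildLogEntry(
  'error', 'payment failed', requestContext,
  { entity_type: 'order', entity_id: 'order-001' }, paymentError,
);
const infoEntry = buildLogEntry('info', 'order created', requestContext, {}, paymentError);

console.log(JSON.stringify(errorEntry));
console.log('error_code' in infoEntry); // false
```

The returned object can be passed as the metadata argument of whichever logger the service already uses.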
Node.js (Winston):
import winston from 'winston';

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  // Timestamped JSON on every transport so aggregators can parse all fields
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'api-server',
    version: process.env.APP_VERSION,
    environment: process.env.NODE_ENV
  },
  transports: [
    // Console inherits the JSON format defined above
    new winston.transports.Console(),
    new winston.transports.File({
      filename: 'logs/error.log',
      level: 'error'
    })
  ]
});
// Log with context
logger.info('order created', {
trace_id: req.traceId,
user_id: req.userId,
entity_type: 'order',
entity_id: orderId,
amount: 123.45,
duration_ms: Date.now() - startTime
});
logger.error('payment failed', {
trace_id: req.traceId,
error_code: error.code,
error_message: error.message,
retry_count: retries,
retriable: error.retriable
});
Python (structlog):
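import structlog

# Render each event as one JSON object per line so aggregators can index the fields
structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)
logger = structlog.get_logger()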
logger.info(
"order_created",
trace_id=trace_id,
user_id=user_id,
entity_type="order",
entity_id=order_id,
amount=123.45,
duration_ms=elapsed_time
)
logger.error(
"payment_failed",
trace_id=trace_id,
error_code=error.code,
error_message=str(error),
retry_count=retries
)
Log aggregation (Loki/ELK/Datadog):
All logs are JSON-serialized. Log aggregation systems parse and index these fields automatically.
Example query (Grafana Loki):
{service="api-server", level="error"} | json | error_code != ""
Use these thresholds as baselines and adjust them to each service's SLA.
MVP stage:
| Alert | Threshold | Duration | Action |
|---|---|---|---|
| Service Down | Status code 5xx > 10% | 1 min | Page on-call |
| High Latency | p95 > 500ms | 5 min | Monitor; page if sustained |
| Error Spike | 5x baseline error rate | 2 min | Page on-call |
Growth stage:
| Alert | Threshold | Duration | Action |
|---|---|---|---|
| Service Down | Status code 5xx > 5% | 2 min | Page on-call |
| High Latency | p95 > 200ms | 10 min | Page on-call |
| High Latency | p99 > 500ms | 5 min | Page on-call |
| Error Spike | Error rate > 1% | 5 min | Create incident |
| CPU Saturation | > 70% | 10 min | Auto-scale or page |
| Memory Leak | Memory usage rising > 50%/hour | sustained | Page on-call |
Enterprise stage:
| Alert | Threshold | Duration | Action |
|---|---|---|---|
| Service Degradation | Status code 5xx > 0.5% | 1 min | Create incident |
| Latency SLO Miss | p99 > SLO target | 5 min | Create incident |
| Error Budget Burn | Consumed > 10%/day | realtime | Page on-call |
| CPU Saturation | > 60% | 5 min | Auto-scale |
| Database Connection Pool | Active > 80% | 5 min | Page DBA |
| Disk Space | Free < 10% | realtime | Critical alert |
File location: monitoring/alerts/rules.yaml or integrated into Prometheus config
groups:
  - name: api-server
    interval: 30s
    rules:
      # MVP threshold
      - alert: HighErrorRate
        expr: |
          (sum(rate(http_requests_total{status_code=~"5.."}[1m]))
            / sum(rate(http_requests_total[1m]))) > 0.1
        for: 1m
        annotations:
          summary: "High error rate (>10%) in api-server"
          description: "Error rate: {{ $value | humanizePercentage }}"
      # Growth threshold
      - alert: HighLatencyP95
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.2
        for: 10m
        annotations:
          summary: "High latency (p95 > 200ms)"
      # CPU saturation (growth stage); 0.7 = 70% of one core
      - alert: HighCPUUsage
        expr: |
          rate(process_cpu_seconds_total[1m]) > 0.7
        for: 10m
        annotations:
          summary: "CPU utilization > 70%"
      # Memory leak detection; delta() is used because resident memory is a gauge
      - alert: MemoryLeak
        expr: |
          delta(process_resident_memory_bytes[1h]) > 52428800  # 50 MiB/hour
        for: 30m
        annotations:
          summary: "Possible memory leak (memory growing > 50 MiB/hour)"
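The `for:` field in these rules means the expression must stay true for that long before the alert fires; until then the alert is only pending. A small sketch of that behaviour follows, under the assumption of one evaluation per call. `AlertRule`, `AlertTracker`, and `evaluateAlert` are hypothetical names, not Prometheus APIs.

```typescript
type AlertState = 'inactive' | 'pending' | 'firing';

interface AlertRule {
  threshold: number;  // fire when the value is above this
  forSeconds: number; // how long the condition must hold
}

interface AlertTracker {
  pendingSinceSec: number | null; // when the condition first became true
}

function evaluateAlert(
  rule: AlertRule,
  tracker: AlertTracker,
  value: number,
  nowSec: number,
): AlertState {
  if (!(value > rule.threshold)) {
    tracker.pendingSinceSec = null; // condition cleared: reset the timer
    return 'inactive';
  }
  const since = tracker.pendingSinceSec ?? nowSec;
  tracker.pendingSinceSec = since;
  return nowSec - since >= rule.forSeconds ? 'firing' : 'pending';
}

// HighErrorRate from the rules above: > 0.1 for 1m
const highErrorRate: AlertRule = { threshold: 0.1, forSeconds: 60 };
const tracker: AlertTracker = { pendingSinceSec: null };

const observedStates = [
  evaluateAlert(highErrorRate, tracker, 0.2, 0),   // pending
  evaluateAlert(highErrorRate, tracker, 0.15, 30), // pending
  evaluateAlert(highErrorRate, tracker, 0.12, 60), // firing
  evaluateAlert(highErrorRate, tracker, 0.05, 90), // inactive
];
console.log(observedStates.join(', ')); // pending, pending, firing, inactive
```

A single evaluation below the threshold resets the timer, so a longer `for:` value trades detection speed for fewer pages caused by brief spikes.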
| Term | Means | Owner | Consequence |
|---|---|---|---|
| SLO (Service Level Objective) | Internal goal (e.g., "99% uptime") | Engineering team | Guides investment + on-call |
| SLA (Service Level Agreement) | Promise to customers; has penalties | Product/Legal | Financial |
SLOs are typically stricter than SLAs (SLO: 99.9%, SLA: 99%, which leaves a buffer of 0.9 percentage points).
## [Service Name] SLO
### Availability
- **Target:** 99.5% uptime
- **Budget:** 216 minutes (3.6 hours) downtime/month
- **Measurement:** HTTP status 2xx or 3xx / total requests
### Latency
- **Target:** 95th percentile < 200ms
- **Measurement:** p95(request duration) measured over 1-minute windows
### Error Rate
- **Target:** Error rate < 0.5%
- **Measurement:** 5xx responses / total requests
### Duration & Review
- **Quarter:** Q2 2026 (Apr-Jun)
- **Review:** Monthly; escalate if budget consumed > 33%/month
Once an SLO is defined, calculate the "error budget", that is, how much failure is acceptable:
Availability SLO: 99.5%
Allowed downtime per month: (1 - 0.995) × 30 days × 24 hours = 3.6 hours = 216 minutes
If 5 minutes of unexpected downtime occurs on April 3:
Remaining budget: 216 - 5 = 211 minutes for the rest of April
% consumed: (5 / 216) × 100 ≈ 2.3% of monthly budget
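The same arithmetic can be written as two small functions, which makes it easy to check a budget for any target. This is a sketch with hypothetical function names; the 33% escalation threshold is taken from the review rule in the SLO template, and the window is assumed to be a 30-day month.

```typescript
// Minutes of allowed downtime for an availability target over a window.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  return (1 - sloTarget) * windowDays * 24 * 60;
}

// Share of the budget used so far, as a percentage.
function budgetConsumedPercent(downtimeMinutes: number, budgetMinutes: number): number {
  return (downtimeMinutes / budgetMinutes) * 100;
}

const monthlyBudget = errorBudgetMinutes(0.995, 30);
const consumedPercent = budgetConsumedPercent(5, monthlyBudget);
const shouldEscalate = consumedPercent > 33;

console.log(monthlyBudget.toFixed(1));                  // 216.0 (99.5% target)
console.log(errorBudgetMinutes(0.9995, 30).toFixed(1)); // 21.6  (99.95% target)
console.log(consumedPercent.toFixed(1));                // 2.3
console.log(shouldEscalate);                            // false
```

Each extra nine in the target divides the budget by ten, so a 99.5% target allows 216 minutes a month while 99.95% allows only 21.6.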
Decision rule:
REST API service:
Background worker:
Web frontend:
Database:
Prometheus configuration (prometheus.yml):
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'api-server'
    static_configs:
      - targets: ['localhost:3000']
    metrics_path: '/metrics'
    scrape_interval: 10s
  - job_name: 'database'
    static_configs:
      - targets: ['localhost:9187'] # postgres_exporter
Grafana dashboard (JSON):
{
"dashboard": {
"title": "API Server — RED Metrics",
"panels": [
{
"title": "Request Rate",
"targets": [{
"expr": "sum(rate(http_requests_total[1m]))"
}]
},
{
"title": "Error Rate",
"targets": [{
"expr": "sum(rate(http_requests_total{status_code=~\"5..\"}[1m])) / sum(rate(http_requests_total[1m]))"
}]
},
{
"title": "Latency P99",
"targets": [{
"expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))"
}]
}
]
}
}
Managed observability platforms add auto-instrumentation and hosted dashboards on top of the patterns above. Minimal setup for each:
Datadog:
// Backend (Node.js): initialize the tracer before loading other modules
// so that auto-instrumentation can patch them
import tracer from 'dd-trace';
tracer.init(); // Auto-instruments HTTP, DB, cache
tracer.trace('custom-operation', () => {
  // custom logic
});

// Frontend (browser bundle, kept separate from the backend code)
import { datadogRum } from '@datadog/browser-rum';
datadogRum.init({
  applicationId: 'app-id',
  clientToken: 'token',
  site: 'datadoghq.com',
  service: 'web-app',
  env: 'production',
  sessionSampleRate: 100,
  sessionReplaySampleRate: 20,
  trackUserInteractions: true
});
New Relic:
// Require the agent first so it can auto-instrument supported modules such as pg
const newrelic = require('newrelic');
// Time custom work as a segment inside the active transaction
newrelic.startSegment('custom-segment', false, () => {
  // custom logic
});
AWS CloudWatch:
import { CloudWatchClient, PutMetricDataCommand } from "@aws-sdk/client-cloudwatch";
const client = new CloudWatchClient({ region: 'us-east-1' });
await client.send(new PutMetricDataCommand({
Namespace: 'api-server',
MetricData: [{
MetricName: 'RequestCount',
Value: 1,
Unit: 'Count'
}]
}));