devops-monitoring-observability

Name: Devops Monitoring Observability
Author: justanesta

Monitoring, observability, and alerting for production systems. Use this skill when implementing structured logging, Prometheus metrics, OpenTelemetry tracing, alerting strategies, SLOs/SLIs, or dashboard design. Covers the three pillars of observability, PromQL, distributed tracing, error budgets, and alert fatigue prevention for microservices and data pipelines.

stars:2

forks:0

updated: February 15, 2026 04:18

package.json

"author": "justanesta"

"repository": "justanesta/claude-code-resources"

name	devops-monitoring-observability
description	Monitoring, observability, and alerting for production systems. Use this skill when implementing structured logging, Prometheus metrics, OpenTelemetry tracing, alerting strategies, SLOs/SLIs, or dashboard design. Covers the three pillars of observability, PromQL, distributed tracing, error budgets, and alert fatigue prevention for microservices and data pipelines.

DevOps: Monitoring and Observability

Production-grade monitoring, observability, and alerting patterns for microservices, data pipelines, and distributed systems.

Core Principles

Three pillars of observability -- Logs, metrics, and traces are complementary signals. Logs tell you what happened, metrics tell you how the system is performing, and traces tell you how a request flowed through services.
SLIs drive SLOs drive alerts -- Define Service Level Indicators (measurable signals), set Service Level Objectives (targets), and derive alerts from error budget burn rates. Never alert on raw thresholds disconnected from user impact.
Structured logging everywhere -- Emit JSON-formatted logs with consistent fields (timestamp, service, trace_id, level). Unstructured text logs are unsearchable at scale.
Instrument at boundaries -- Focus on service entry points, external calls (databases, APIs, queues), and critical business transactions. Over-instrumenting internal functions creates noise.
Alerts must be actionable -- Every alert should have a clear owner, a runbook, and a defined severity. If nobody needs to act, it should not be an alert.

Structured Logging Patterns

Emit structured JSON logs with correlation IDs for cross-service tracing and consistent fields for aggregation.

import structlog
import uuid

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer(),
    ],
)
logger = structlog.get_logger()

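def handle_request(request):
    """Bind correlation ID at request entry, propagate through all log calls."""
    # Drop context left over from a previous request handled by this worker.
    structlog.contextvars.clear_contextvars()
    # Reuse the caller's correlation ID when present, otherwise mint a new one.
    correlation_id = request.headers.get("X-Correlation-ID", str(uuid.uuid4()))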
    structlog.contextvars.bind_contextvars(
        correlation_id=correlation_id,
        service="order-service",
        endpoint=request.path,
    )
    logger.info("request_started", method=request.method, user_id=request.user_id)
    try:
        result = process_order(request)
        logger.info("request_completed", status="success", order_id=result.id)
        return result
    except Exception as e:
        logger.error("request_failed", error=str(e), exc_info=True)
        raise

See structured-logging.md for:

Log level guidelines and when to use each
Correlation ID propagation across async boundaries
Python structlog and Node.js pino configuration
Sensitive data redaction patterns
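
Correlation ID propagation across async boundaries can be sketched with the standard library alone. The example below is a minimal illustration, not the structlog configuration shown above: it assumes a hypothetical `correlation_id_var` context variable and a hand-rolled `JsonFormatter`, and relies on the fact that an asyncio task copies the caller's context when it is created.

```python
import asyncio
import contextvars
import io
import json
import logging
import uuid

# Hypothetical context variable holding the current request's correlation ID.
correlation_id_var = contextvars.ContextVar("correlation_id", default=None)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with consistent fields."""

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
            "correlation_id": correlation_id_var.get(),
        })


def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("async-demo")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


async def reserve_stock(logger):
    # Runs in a child task, which received a copy of the caller's context.
    logger.info("stock_reserved")


async def handle_request(logger, incoming_id=None):
    # Bind the ID once at the entry point; later log calls pick it up implicitly.
    correlation_id_var.set(incoming_id or str(uuid.uuid4()))
    logger.info("request_started")
    await asyncio.create_task(reserve_stock(logger))
    logger.info("request_completed")


def run_demo():
    """Handle one request and return the parsed log records."""
    stream = io.StringIO()
    logger = build_logger(stream)
    asyncio.run(handle_request(logger, incoming_id="req-123"))
    return [json.loads(line) for line in stream.getvalue().splitlines()]
```

Calling `run_demo()` yields three records that all carry `"correlation_id": "req-123"`, including the one emitted from the child task. Work handed to a thread pool or a message queue does not inherit the context this way, so the ID has to be passed explicitly in those cases.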

Prometheus Metrics

Use counters, gauges, and histograms to capture system and business behavior. Query with PromQL for dashboards and alerts.

from prometheus_client import Counter, Histogram, Gauge
import time

REQUEST_COUNT = Counter("http_requests_total", "Total HTTP requests",
    ["method", "endpoint", "status"])
REQUEST_DURATION = Histogram("http_request_duration_seconds", "Latency in seconds",
    ["method", "endpoint"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0])
ACTIVE_CONNECTIONS = Gauge("active_connections", "Active connections", ["service"])

def handle_request(method, endpoint):
    ACTIVE_CONNECTIONS.labels(service="api").inc()
    start = time.monotonic()
    try:
        result = process(method, endpoint)
        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status="200").inc()
        return result
    except Exception:
        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status="500").inc()
        raise
    finally:
        REQUEST_DURATION.labels(method=method, endpoint=endpoint).observe(
            time.monotonic() - start)
        ACTIVE_CONNECTIONS.labels(service="api").dec()

See prometheus-patterns.md for:

PromQL query patterns for RED and USE methods
Recording rules for pre-computed aggregations
Alerting rules with multi-window burn rates
Histogram bucket selection strategies
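
Bucket selection matters because quantiles are estimated from cumulative bucket counts, not from raw observations. The sketch below approximates the linear interpolation that PromQL's `histogram_quantile` performs; it is a simplified illustration that skips edge cases the real function handles, and the bucket data is made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with math.inf,
    mirroring the `le` buckets a Prometheus histogram exposes.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                # Past the last finite bucket: the best available answer
                # is the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Same 100 hypothetical requests, recorded with two bucket layouts.
coarse = [(0.1, 50), (1.0, 99), (math.inf, 100)]
fine = [(0.1, 50), (0.25, 90), (0.5, 99), (1.0, 99), (math.inf, 100)]
p95_coarse = histogram_quantile(0.95, coarse)
p95_fine = histogram_quantile(0.95, fine)
```

With the coarse layout the p95 estimate lands near 0.93 s, while the finer layout puts it near 0.39 s for the same traffic. Placing bucket boundaries close to the latency targets you care about keeps the interpolation error small where it matters.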

OpenTelemetry Instrumentation

Use OpenTelemetry for distributed tracing with automatic context propagation across service boundaries.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

resource = Resource.create({"service.name": "order-service", "service.version": "1.2.0"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317")))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def process_order(order_id: str, items: list[dict]) -> dict:
    with tracer.start_as_current_span(
        "process_order",
        attributes={"order.id": order_id, "order.item_count": len(items)},
    ) as span:
        with tracer.start_as_current_span("validate_inventory"):
            available = check_inventory(items)
            span.set_attribute("order.items_available", available)
        with tracer.start_as_current_span("charge_payment"):
            payment = charge(order_id, items)
            span.add_event("payment_processed", {"payment.id": payment.id})
        return {"order_id": order_id, "status": "completed"}

See opentelemetry-patterns.md for:

Auto-instrumentation for Flask, FastAPI, Django, SQLAlchemy
Baggage and context propagation across async tasks
Custom span attributes and events best practices
Exporter configuration for Jaeger, Zipkin, and Grafana Tempo

Alerting Design

Design alerts around user impact with clear severity levels and multi-window burn rate thresholds to prevent alert fatigue.

groups:
  - name: slo_alerts
    rules:
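      # Page: 2% budget consumed in 1 hour (fast burn)
      # 14.4x burn rate against a 99.9% SLO (error budget 0.001).
      # Aggregating by service keeps the label used in the summary annotation;
      # this assumes a `service` label is present on the metric.
      - alert: HighErrorBudgetBurn
        expr: |
          (sum(rate(http_requests_total{status=~"5.."}[1h])) by (service)
           / sum(rate(http_requests_total[1h])) by (service)) > (14.4 * 0.001)
          and
          (sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
           / sum(rate(http_requests_total[5m])) by (service)) > (14.4 * 0.001)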
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error budget burn on {{ $labels.service }}"
          runbook: "https://runbooks.internal/slo-budget-burn"

See alerting-patterns.md for:

Severity level definitions (P1-P4) and routing rules
PagerDuty and Opsgenie integration patterns
Runbook templates and incident response workflows
Alert grouping, inhibition, and silencing strategies
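
The 14.4 multiplier in the rule above follows from simple arithmetic: it is the burn rate at which 2% of a 30-day error budget is spent within 1 hour. The sketch below derives that number and the resulting error-ratio threshold; the function names are hypothetical and the 30-day window is an assumption taken from the SLO example in this document.

```python
def burn_rate(budget_fraction, alert_window_hours, slo_window_days=30):
    """Burn rate that spends `budget_fraction` of the budget in the alert window.

    A burn rate of 1.0 spends exactly the whole budget over the SLO window.
    """
    return budget_fraction * (slo_window_days * 24) / alert_window_hours


def error_ratio_threshold(slo_target, rate):
    """Error ratio above which the given burn rate is being exceeded."""
    return rate * (1.0 - slo_target)


def should_page(long_window_ratio, short_window_ratio, threshold):
    """Multi-window check: fire only if both windows exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert resets soon after recovery.
    """
    return long_window_ratio > threshold and short_window_ratio > threshold


fast_burn = burn_rate(0.02, alert_window_hours=1)   # about 14.4
slow_burn = burn_rate(0.05, alert_window_hours=6)   # about 6.0
page_threshold = error_ratio_threshold(0.999, fast_burn)  # about 0.0144
```

With these numbers, a service with a 99.9% target pages when more than about 1.44% of requests fail over both the 1-hour and the 5-minute window. A brief spike that has already ended trips the long window but not the short one, so it does not page.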

SLIs, SLOs, and Error Budgets

Define SLIs from real user signals, set SLO targets, and use error budgets to balance reliability with velocity.

from dataclasses import dataclass

@dataclass
class SLODefinition:
    name: str
    sli_query: str        # PromQL query returning ratio 0-1
    target: float         # e.g. 0.999 for 99.9%
    window_days: int      # Rolling window (typically 30)
    burn_rate_pages: dict

    @property
    def error_budget(self) -> float:
        return 1.0 - self.target

    def budget_remaining(self, current_error_rate: float, elapsed_days: int) -> dict:
        budget_total = self.error_budget * self.window_days * 24 * 60
        budget_consumed = current_error_rate * elapsed_days * 24 * 60
        remaining_pct = max(0, (budget_total - budget_consumed) / budget_total)
        return {
            "remaining_pct": round(remaining_pct * 100, 2),
            "on_track": remaining_pct > (1 - elapsed_days / self.window_days),
        }

availability_slo = SLODefinition(
    name="api-availability",
    sli_query='sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))',
    target=0.999, window_days=30,
    burn_rate_pages={"critical": 14.4, "warning": 6.0, "ticket": 3.0},
)

See slo-patterns.md for:

SLI definitions for availability, latency, throughput, and correctness
Error budget policies and decision frameworks
Multi-window, multi-burn-rate alerting math
SLO documentation templates and stakeholder reporting
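
It often helps to translate an SLO target into concrete time. The sketch below converts a target into minutes of allowed full outage and estimates how long the budget lasts at a constant burn rate; both helpers are hypothetical and assume a time-based budget over a rolling window.

```python
import math


def error_budget_minutes(target, window_days=30):
    """Minutes of full outage the SLO tolerates over the window."""
    return (1.0 - target) * window_days * 24 * 60


def hours_to_exhaustion(burn_rate, window_days=30, budget_remaining=1.0):
    """Hours until the remaining budget is gone at a constant burn rate.

    `budget_remaining` is a fraction between 0 and 1 of the total budget.
    """
    if burn_rate <= 0:
        return math.inf
    return budget_remaining * window_days * 24 / burn_rate


budget_999 = error_budget_minutes(0.999)     # 99.9% over 30 days
budget_9999 = error_budget_minutes(0.9999)   # 99.99% over 30 days
fast_burn_hours = hours_to_exhaustion(14.4)
```

A 99.9% target over 30 days allows roughly 43.2 minutes of downtime, and a 99.99% target roughly 4.3 minutes. At a sustained burn rate of 14.4 the full budget is gone in about 50 hours, which is why that rate is treated as page-worthy.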

Dashboard Design Patterns

Structure dashboards around the RED method (Rate, Errors, Duration) for services and the USE method (Utilization, Saturation, Errors) for resources.

# Grafana dashboard -- layered approach (Level 1: RED overview)
panels:
  - title: "Request Rate"
    expr: 'sum(rate(http_requests_total[5m])) by (service)'
  - title: "Error Rate (%)"
    expr: '100 * sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
           / sum(rate(http_requests_total[5m])) by (service)'
    thresholds: [{value: 1, color: "yellow"}, {value: 5, color: "red"}]
  - title: "p50 / p95 / p99 Latency"
    expr:
      - 'histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'
      - 'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'
      - 'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'
  - title: "Error Budget Remaining"
    expr: 'slo:budget_remaining:ratio * 100'
    type: gauge

Anti-Patterns

Avoid	Use Instead
Logging unstructured text strings	JSON-formatted structured logs with consistent fields
Alerting on raw CPU/memory thresholds	Alert on SLO burn rates that reflect user impact
Creating one dashboard per incident	Layered dashboards: overview, service detail, debug
Using `rate()` over short windows on slow metrics	Use a rate window of at least 4x the scrape interval
Logging everything at DEBUG in production	Log at INFO by default, enable DEBUG per-service dynamically
Separate, uncorrelated logs, metrics, and traces	Correlate with shared trace_id across all three signals
Alerting on every single error occurrence	Alert on error rates exceeding SLO-derived thresholds
Cardinality explosion from unbounded label values	Bound labels to known enums; never use user IDs as labels
Ignoring alert noise until it becomes unbearable	Review alert quality monthly; delete alerts with < 10% action rate
Building custom monitoring from scratch	Use OpenTelemetry standards with vendor-neutral exporters
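
Bounding label values usually means normalizing them before they reach the metrics client. The sketch below is one possible approach under stated assumptions: the `:id` placeholder, the `OTHER` method bucket, and the helper names are all hypothetical, and a web framework's own route template is a better source for the endpoint label when one is available.

```python
import re

KNOWN_METHODS = {"GET", "POST", "PUT", "PATCH", "DELETE", "HEAD", "OPTIONS"}

# Path segments that look like numeric IDs or UUIDs.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$",
    re.IGNORECASE,
)


def normalize_endpoint(path):
    """Collapse ID-like path segments so the label set stays small."""
    segments = path.split("?", 1)[0].strip("/").split("/")
    cleaned = [":id" if _ID_SEGMENT.match(s) else s for s in segments]
    return "/" + "/".join(cleaned)


def bounded_labels(method, path, status_code):
    """Label values that are safe to attach to a counter or histogram."""
    method = method.upper()
    return {
        "method": method if method in KNOWN_METHODS else "OTHER",
        "endpoint": normalize_endpoint(path),
        "status": str(status_code),
    }


labels = bounded_labels(
    "get", "/users/42/orders/550e8400-e29b-41d4-a716-446655440000?page=2", 200
)
```

Here `labels["endpoint"]` becomes `/users/:id/orders/:id`, so every user and order shares one time series per method and status. Arbitrary unknown paths, such as those from scanners hitting 404s, can still inflate the label set, so an allow-list of known routes is a stricter option.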

Performance

Control metric cardinality -- Every unique label combination creates a new time series. Never use user IDs or unbounded values as labels
Use recording rules -- Pre-compute frequently used PromQL aggregations to avoid expensive queries on every dashboard refresh
Batch span exports -- Use BatchSpanProcessor to avoid blocking application threads
Sample traces at scale -- Use head-based sampling (1-10%) for normal traffic, tail-based (100%) for errors
Buffer logs asynchronously -- Use a local buffer (Fluent Bit, Vector) that ships in batches
Right-size histogram buckets -- Align to SLO thresholds (e.g., 100ms, 250ms, 500ms, 1s, 2.5s)
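
The sampling split above can be sketched as two separate decisions. The example below is a simplified illustration loosely modeled on trace-ID ratio sampling, not the OpenTelemetry SDK's implementation: the head decision is a deterministic function of the trace ID, and the tail decision is made after the trace completes, which in practice usually happens in a collector tier that can see all spans. The `status` key on each span dict is a hypothetical field.

```python
def head_sample(trace_id_hex, ratio):
    """Deterministic head-based decision: keep roughly `ratio` of trace IDs.

    Because the decision depends only on the trace ID, every service that
    sees the same trace reaches the same answer without coordination.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    low_64_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low_64_bits < int(ratio * (1 << 64))


def tail_keep(spans, head_decision):
    """Tail-based decision: always keep a finished trace that has an error."""
    has_error = any(span.get("status") == "error" for span in spans)
    return head_decision or has_error


# A trace dropped by the head sampler is still kept if any span failed.
failed_trace = [{"name": "charge_payment", "status": "error"}]
kept = tail_keep(failed_trace, head_decision=False)
```

Keeping every errored trace while sampling the rest means the traces most useful for debugging are always available, at a fraction of the storage cost of keeping everything.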

source: Google SRE Book, OpenTelemetry documentation, Prometheus best practices, Grafana Labs documentation
