원클릭으로 Manus에서 모든 스킬 실행

monitoring

스타1

포크0

업데이트2026년 4월 2일 18:59

Implement observability with metrics, logs, and traces. Set up alerting, dashboards, and SLIs/SLOs for system reliability.

설치

Codex 또는 Claude로 설치 이 Prompt를 복사해 Codex, Claude 또는 다른 어시스턴트에 붙여 넣으면 Skill 페이지를 검토하고 설치를 진행할 수 있습니다.

Manus에서 실행

출처

HelixDevelopment

HelixDevelopment/HelixAgent

GitHub 저장소 열기 Creator 저장소 보기

다운로드

Manus에서 실행

관련 직업SOC

SOC 직업 분류 기준

네트워크·컴퓨터 시스템 관리자컴퓨터 및 수학직·SOC 15-1244

SKILL.md

readonly

이 저장소의 다른 Skills

같은 저장소

onboarding-agent

HelixDevelopment/HelixAgent

Interactive onboarding workflow that interviews users to understand their coding goals and generates PR-ready implementation plans. Use when starting a new development task to ensure clear requirements and structured execution.

2026-05-161

gamma-security-basics

HelixDevelopment/HelixAgent

Implement security best practices for Gamma integration. Use when securing API keys, implementing access controls, or auditing Gamma security configuration. Trigger with phrases like "gamma security", "gamma API key security", "gamma secure", "gamma credentials", "gamma access control".

2026-05-161

documentation

HelixDevelopment/HelixAgent

Write effective technical documentation including READMEs, API docs, architecture decisions, and inline code documentation.

2026-05-061

azure-devops

HelixDevelopment/HelixAgent

Build and manage CI/CD pipelines with Azure DevOps. Configure builds, releases, and automate software delivery workflows.

2026-04-021

azure-functions

HelixDevelopment/HelixAgent

Develop, deploy, and manage Azure Functions for serverless computing. Supports HTTP triggers, timers, queues, and event-driven architectures.

2026-04-021

azure-resource-management

HelixDevelopment/HelixAgent

Manage Azure resources effectively using CLI, Portal, Bicep, and ARM templates. Use for provisioning, organizing, and maintaining cloud infrastructure.

2026-04-021

name	monitoring
description	Implement observability with metrics, logs, and traces. Set up alerting, dashboards, and SLIs/SLOs for system reliability.
triggers	["/monitoring","/observability","/alerting"]

Observability and Monitoring

This skill covers implementing comprehensive observability through metrics, logs, traces, and alerts to ensure system reliability and performance.

When to use this skill

Use this skill when you need to:

Set up monitoring for applications and infrastructure
Define and track SLIs and SLOs
Create meaningful dashboards
Configure intelligent alerting
Implement distributed tracing

Prerequisites

Monitoring stack (Prometheus, Grafana, Datadog, New Relic)
Log aggregation (ELK, Loki, Splunk)
Application instrumentation libraries
Alerting channels (PagerDuty, Slack, email)

Guidelines

Observability Pillars

Metrics

Numeric measurements over time
Used for: trends, capacity planning, alerting
Tools: Prometheus, InfluxDB, CloudWatch

Logs

Discrete event records
Used for: debugging, auditing, forensics
Tools: ELK Stack, Loki, Splunk

Traces

Request flow across services
Used for: latency analysis, dependency mapping
Tools: Jaeger, Zipkin, OpenTelemetry

Metrics Strategy

Metric Types

Counters: Always increasing (requests, errors)
Gauges: Current value (temperature, queue size)
Histograms: Distribution of values (response times)
Summaries: Percentile calculations (p95, p99)

Prometheus Example

from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Define metrics
request_count = Counter('http_requests_total', 'Total requests', ['method', 'endpoint'])
request_duration = Histogram('http_request_duration_seconds', 'Request duration')
active_connections = Gauge('active_connections', 'Current active connections')

# Instrument code
@request_duration.time()
def handle_request(request):
    request_count.labels(method=request.method, endpoint=request.path).inc()
    active_connections.inc()
    try:
        return process_request(request)
    finally:
        active_connections.dec()

Key Metrics to Track

RED metrics (Rate, Errors, Duration) for services
USE metrics (Utilization, Saturation, Errors) for resources
Business metrics (transactions, revenue)
Infrastructure metrics (CPU, memory, disk, network)

SLIs and SLOs

Service Level Indicators (SLIs) Quantifiable measures of service quality:

Availability: percentage of successful requests
Latency: response time percentiles
Throughput: requests per second
Error rate: failed requests percentage

Service Level Objectives (SLOs) Target reliability levels:

# Example SLOs
availability_slo: 99.9%  # Three nines
latency_p95_slo: 200ms   # 95% of requests under 200ms
error_rate_slo: 0.1%     # Less than 0.1% errors

Error Budgets

SLO of 99.9% = 0.1% error budget
43.8 minutes of downtime per month
Use for: feature launch decisions, maintenance windows

Alerting

Alerting Principles

Alert on symptoms, not causes
Use multiple severity levels
Include runbook links
Prevent alert fatigue

Prometheus Alerting Rules

# alerts.yml
groups:
  - name: service_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"

Dashboards

Grafana Dashboard Guidelines

Use template variables for reusability
Group related panels
Include RED metrics for each service
Add documentation links
Use consistent color schemes

Dashboard Structure

Overview Dashboard
├── Service Health (availability, latency, errors)
├── Infrastructure (CPU, memory, disk)
├── Business Metrics (transactions, revenue)
└── Alerts Status

Service Detail Dashboard
├── Request Rate
├── Error Rate
├── Latency Distribution
├── Top Endpoints
└── Dependencies

Distributed Tracing

OpenTelemetry Setup

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Add exporter
otlp_exporter = OTLPSpanExporter(endpoint="otel-collector:4317")
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Instrument code
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    span.set_attribute("customer.id", customer_id)
    
    with tracer.start_as_current_span("validate_payment"):
        validate_payment(payment_info)
    
    with tracer.start_as_current_span("update_inventory"):
        update_inventory(items)

Log Management

Structured Logging

import structlog

logger = structlog.get_logger()

logger.info(
    "order_processed",
    order_id=order_id,
    customer_id=customer_id,
    amount=amount,
    duration_ms=processing_time,
)

Log Levels

ERROR: Action required immediately
WARN: Attention needed soon
INFO: Normal operation visibility
DEBUG: Detailed troubleshooting

Examples

See the examples/ directory for:

prometheus-configs/ - Prometheus configuration examples
grafana-dashboards/ - Dashboard JSON models
alert-rules/ - Alerting rule definitions
opentelemetry-setup/ - Tracing configuration

monitoring

Observability and Monitoring

When to use this skill

Prerequisites

Guidelines

Observability Pillars

Metrics Strategy

SLIs and SLOs

Alerting

Dashboards

Distributed Tracing

Log Management

Examples

References

Observability and Monitoring

When to use this skill

Prerequisites

Guidelines

Observability Pillars

Metrics Strategy

SLIs and SLOs

Alerting

Dashboards

Distributed Tracing

Log Management

Examples

References