monitoring
Implement observability with metrics, logs, and traces. Set up alerting, dashboards, and SLIs/SLOs for system reliability.
| name | monitoring |
| description | Implement observability with metrics, logs, and traces. Set up alerting, dashboards, and SLIs/SLOs for system reliability. |
| triggers | ["/monitoring","/observability","/alerting"] |
This skill covers implementing comprehensive observability through metrics, logs, traces, and alerts to ensure system reliability and performance.
Use this skill when you need to:
Metrics
Logs
Traces
Metric Types
Prometheus Example
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Define metrics
request_count = Counter('http_requests_total', 'Total requests', ['method', 'endpoint'])
request_duration = Histogram('http_request_duration_seconds', 'Request duration')
active_connections = Gauge('active_connections', 'Current active connections')

# Instrument code
@request_duration.time()
def handle_request(request):
    request_count.labels(method=request.method, endpoint=request.path).inc()
    active_connections.inc()
    try:
        return process_request(request)
    finally:
        active_connections.dec()
Key Metrics to Track
Service Level Indicators (SLIs)
Quantifiable measures of service quality, such as availability, latency, and error rate.
Service Level Objectives (SLOs)
Target reliability levels:
# Example SLOs
availability_slo: 99.9% # Three nines
latency_p95_slo: 200ms # 95% of requests under 200ms
error_rate_slo: 0.1% # Less than 0.1% errors
Error Budgets
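An error budget is the amount of unreliability an SLO permits: 100% minus the target. The sketch below works through that arithmetic for the 99.9% availability SLO above. The 30-day window, the function names, and the request counts are assumptions chosen for illustration.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime the SLO tolerates over the window (window length is an assumption)."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


def budget_remaining(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent (negative once overspent)."""
    allowed_failures = total_requests * (1 - slo_percent / 100)
    if allowed_failures <= 0:
        raise ValueError("a 100% SLO (or zero traffic) leaves no error budget to measure")
    return 1 - failed_requests / allowed_failures


# 99.9% over 30 days leaves roughly 43.2 minutes of downtime.
downtime_budget = error_budget_minutes(99.9)

# Hypothetical traffic: 1,000,000 requests with 250 failures spends a quarter of the budget.
remaining = budget_remaining(99.9, total_requests=1_000_000, failed_requests=250)
```

A remaining fraction near zero is a common signal to slow down risky releases, while a healthy budget leaves room for change.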
Alerting Principles
Prometheus Alerting Rules
# alerts.yml
groups:
  - name: service_alerts
    rules:
      - alert: HighErrorRate
        # Requires a `status` label on http_requests_total; the Python
        # example above defines only `method` and `endpoint`.
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
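The HighLatency rule relies on `histogram_quantile`, which estimates a percentile from cumulative bucket values rather than from raw observations. The following is a simplified sketch of the linear interpolation used for classic cumulative buckets. It is not Prometheus's own code, the function name is hypothetical, and the bucket values are made up for illustration.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with a +Inf bucket,
    mirroring the `le` label on *_bucket series.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least two buckets, the last being +Inf")

    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    for i, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break

    if upper == math.inf:
        # The quantile falls beyond the last finite bound; report that bound.
        return buckets[-2][0]

    lower = buckets[i - 1][0] if i > 0 else 0.0
    previous = buckets[i - 1][1] if i > 0 else 0
    in_bucket = cumulative - previous
    if in_bucket == 0:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - previous) / in_bucket


# Made-up cumulative counts for bounds of 0.1s, 0.25s, 0.5s, 1s, and +Inf.
sample = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
p50 = estimate_quantile(0.5, sample)
p90 = estimate_quantile(0.9, sample)
```

Because the estimate is interpolated within a bucket, its accuracy depends on bucket boundaries sitting close to the thresholds being alerted on, such as the 0.5 second threshold in the rule.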
Grafana Dashboard Guidelines
Dashboard Structure
Overview Dashboard
├── Service Health (availability, latency, errors)
├── Infrastructure (CPU, memory, disk)
├── Business Metrics (transactions, revenue)
└── Alerts Status
Service Detail Dashboard
├── Request Rate
├── Error Rate
├── Latency Distribution
├── Top Endpoints
└── Dependencies
OpenTelemetry Setup
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Add exporter
otlp_exporter = OTLPSpanExporter(endpoint="otel-collector:4317")
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Instrument code
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    span.set_attribute("customer.id", customer_id)

    with tracer.start_as_current_span("validate_payment"):
        validate_payment(payment_info)

    with tracer.start_as_current_span("update_inventory"):
        update_inventory(items)
Structured Logging
import structlog

logger = structlog.get_logger()

logger.info(
    "order_processed",
    order_id=order_id,
    customer_id=customer_id,
    amount=amount,
    duration_ms=processing_time,
)
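Where structlog is not available, a similar event-plus-fields record can be approximated with the standard `logging` module. This is a minimal sketch under assumptions: the `JsonFormatter` class, the `fields` key passed through `extra`, the `orders` logger name, and the sample values are all invented for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {"level": record.levelname, "event": record.getMessage()}
        # Structured fields arrive via `extra={"fields": {...}}` (a convention of this sketch).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream):
    logger = logging.getLogger("orders")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated setup
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()  # in-memory sink for the demo; use sys.stdout in a service
log = build_logger(stream)
log.info("order_processed", extra={"fields": {"order_id": "o-123", "amount": 42.5}})
```

Emitting one JSON object per line keeps the fields queryable by a log aggregator without regex parsing.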
Log Levels
See the examples/ directory for:
prometheus-configs/ - Prometheus configuration examples
grafana-dashboards/ - Dashboard JSON models
alert-rules/ - Alerting rule definitions
opentelemetry-setup/ - Tracing configuration