| name | senior-observability |
| title | Senior Observability Skill Package |
| description | Comprehensive observability skill for monitoring, logging, distributed tracing, alerting, and SLI/SLO implementation across distributed systems. Includes dashboard generation, alert rule creation, error budget calculation, and metrics analysis. Use when implementing monitoring stacks, designing alerting strategies, setting up distributed tracing, or defining SLO frameworks. |
| domain | engineering |
| subdomain | observability-operations |
| difficulty | advanced |
| time-saved | 4-8 hours per observability implementation |
| frequency | Weekly |
| use-cases | ["Implementing comprehensive monitoring with Prometheus and Grafana","Setting up distributed tracing with OpenTelemetry and Jaeger","Designing alerting strategies with multi-burn-rate SLO alerting","Creating dashboards for service health visibility using RED/USE methods","Calculating SLI/SLO targets and error budgets"] |
| related-agents | ["cs-observability-engineer"] |
| related-skills | ["senior-devops","senior-backend","senior-secops"] |
| related-commands | [] |
| orchestrated-by | ["cs-observability-engineer"] |
| dependencies | {"scripts":["dashboard_generator.py","alert_rule_generator.py","slo_calculator.py","metrics_analyzer.py"],"references":["monitoring_patterns.md","logging_architecture.md","distributed_tracing.md","alerting_runbooks.md"],"assets":["dashboard_templates/","alert_templates/","runbook_template.md"]} |
| compatibility | {"python-version":"3.8+","platforms":["macos","linux","windows"]} |
| tech-stack | ["Python 3.8+","Prometheus","Grafana","OpenTelemetry","Jaeger","ELK Stack","DataDog","CloudWatch","NewRelic"] |
| examples | [{"title":"Generate Grafana Dashboard","input":"python3 scripts/dashboard_generator.py --service payment-api --type api --platform grafana --output json","output":"Complete Grafana dashboard JSON with RED method panels, resource metrics, and variable templating"},{"title":"Generate SLO-Based Alerts","input":"python3 scripts/alert_rule_generator.py --service payment-api --slo-target 99.9 --platform prometheus --output yaml","output":"Prometheus AlertManager rules with multi-burn-rate alerting and runbook links"},{"title":"Calculate Error Budget","input":"python3 scripts/slo_calculator.py --input metrics.csv --slo-type availability --target 99.9 --window 30d --output json","output":"SLO status report with error budget remaining, burn rate, and recommendations"},{"title":"Generate NewRelic Dashboard","input":"python3 scripts/dashboard_generator.py --service payment-api --type api --platform newrelic --output json","output":"NewRelic dashboard JSON with NRQL queries for RED method panels and SLO tracking"},{"title":"Generate NewRelic Alerts","input":"python3 scripts/alert_rule_generator.py --service payment-api --slo-target 99.9 --platform newrelic --output json","output":"NewRelic alert policy with multi-burn-rate NRQL conditions and notification channels"}] |
| stats | {"downloads":0,"stars":0,"rating":0,"reviews":0} |
| version | v1.0.0 |
| author | Claude Skills Team |
| contributors | [] |
| created | "2025-12-16T00:00:00.000Z" |
| updated | "2025-12-16T00:00:00.000Z" |
| license | MIT |
| tags | ["observability","monitoring","logging","tracing","alerting","slo","sli","prometheus","grafana","opentelemetry","jaeger","datadog","cloudwatch","newrelic","nrql","engineering","senior"] |
| featured | false |
| verified | true |
Complete toolkit for senior observability engineering with modern monitoring, logging, tracing, and alerting best practices.
This skill provides comprehensive observability capabilities through four core Python automation tools and extensive reference documentation. Whether implementing monitoring stacks, designing alerting strategies, setting up distributed tracing, or defining SLO frameworks, this skill delivers production-ready observability solutions.
Senior observability engineers use this skill for metrics collection (Prometheus, DataDog, CloudWatch, NewRelic), visualization (Grafana dashboards, NewRelic Dashboards), distributed tracing (OpenTelemetry, Jaeger), centralized logging (ELK Stack, Loki, NewRelic Logs), and alerting (AlertManager, PagerDuty, NewRelic Alerts). The skill covers the Four Golden Signals, RED/USE methods, SLI/SLO frameworks, and incident response patterns.
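As a quick orientation on the RED method referenced above, the sketch below lists the kind of PromQL queries a RED dashboard is typically built from. It is illustrative only: the metric and label names (http_requests_total, http_request_duration_seconds, status) are assumptions about your instrumentation, not something this skill mandates.

```python
# Illustrative RED-method PromQL queries, keyed by signal.
# Metric and label names are assumptions; adjust to your instrumentation.
RED_QUERIES = {
    # Rate: requests per second over the last 5 minutes
    "rate": 'sum(rate(http_requests_total{service="payment-api"}[5m]))',
    # Errors: fraction of requests returning 5xx
    "errors": (
        'sum(rate(http_requests_total{service="payment-api",status=~"5.."}[5m]))'
        ' / sum(rate(http_requests_total{service="payment-api"}[5m]))'
    ),
    # Duration: p95 latency derived from a histogram
    "duration_p95": (
        'histogram_quantile(0.95, '
        'sum(rate(http_request_duration_seconds_bucket{service="payment-api"}[5m])) by (le))'
    ),
}

for signal, query in RED_QUERIES.items():
    print(f"{signal}: {query}")
```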
Core Value: Reduce mean-time-to-detection (MTTD) by 60%+ and mean-time-to-resolution (MTTR) by 40%+ while improving system reliability through comprehensive observability practices and automated tooling.
This skill provides four core capabilities through automated scripts:
# Script 1: Dashboard Generator - Create Grafana/DataDog dashboards
python3 scripts/dashboard_generator.py --service my-api --type api --platform grafana --output json
# Script 2: Alert Rule Generator - Create Prometheus/DataDog alert rules
python3 scripts/alert_rule_generator.py --service my-api --slo-target 99.9 --platform prometheus --output yaml
# Script 3: SLO Calculator - Calculate error budgets and burn rates
python3 scripts/slo_calculator.py --input metrics.csv --slo-type availability --target 99.9 --output json
# Script 4: Metrics Analyzer - Analyze patterns, anomalies, and trends
python3 scripts/metrics_analyzer.py --input metrics.csv --analysis-type anomaly --output json
Generate production-ready dashboard configurations for Grafana, DataDog, CloudWatch, or NewRelic.
Usage:
python3 scripts/dashboard_generator.py \
--service "payment-api" \
--type api \
--platform grafana \
--output json \
--file dashboards/payment-api.json
Arguments:
- --service / -s: Service name (required)
- --type / -t: Service type - api, database, queue, cache, web (default: api)
- --platform / -p: Target platform - grafana, datadog, cloudwatch, newrelic (default: grafana)
- --output / -o: Output format - json, yaml, text (default: text)
- --file / -f: Write output to file
- --verbose / -v: Enable verbose output
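For context on what the generator emits, Grafana dashboards are plain JSON documents. The snippet below sketches the general shape of such a document as a Python structure; the panel, variable, and query names are illustrative and not necessarily the exact output of dashboard_generator.py.

```python
import json

# Minimal Grafana dashboard skeleton (illustrative only; the real script's
# output may differ in panel layout and naming).
dashboard = {
    "title": "payment-api - Service Overview",
    "templating": {  # dashboard variables, e.g. an environment selector
        "list": [{"name": "env", "type": "custom", "query": "prod,staging"}]
    },
    "panels": [
        {
            "title": "Request Rate",
            "type": "timeseries",
            "targets": [
                {"expr": 'sum(rate(http_requests_total{service="payment-api", env="$env"}[5m]))'}
            ],
        },
    ],
}

print(json.dumps(dashboard, indent=2))
```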
Generate alerting rules for Prometheus AlertManager, DataDog, CloudWatch, NewRelic, or PagerDuty.
Usage:
python3 scripts/alert_rule_generator.py \
--service "payment-api" \
--slo-target 99.9 \
--platform prometheus \
--severity critical,warning \
--output yaml \
--file alerts/payment-api.yaml
Arguments:
- --service / -s: Service name (required)
- --slo-target: SLO availability target percentage (default: 99.9)
- --platform / -p: Target platform - prometheus, datadog, cloudwatch, newrelic, pagerduty (default: prometheus)
- --severity: Severity levels to generate - critical, warning, info (default: critical,warning)
- --output / -o: Output format - yaml, json, text (default: yaml)
- --file / -f: Write output to file
- --runbook-url: Base URL for runbook links
- --verbose / -v: Enable verbose output
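The multi-burn-rate approach behind these rules comes from the Google SRE Workbook: page when the error budget is burning fast over both a long and a short window. A minimal sketch of how those thresholds are derived from an SLO target follows; the window/factor table and the PromQL metric names are illustrative, and the script's exact values may differ.

```python
# Sketch of multi-window, multi-burn-rate alert thresholds (SRE Workbook style).
SLO_TARGET = 0.999             # e.g. --slo-target 99.9
ERROR_BUDGET = 1 - SLO_TARGET  # 0.1% of requests may fail over the SLO window

# (long window, short window, burn-rate factor, severity)
BURN_RATE_WINDOWS = [
    ("1h", "5m",  14.4, "critical"),  # ~2% of a 30d budget burned in 1h
    ("6h", "30m", 6.0,  "critical"),  # ~5% of a 30d budget burned in 6h
    ("3d", "6h",  1.0,  "warning"),   # ~10% of a 30d budget burned in 3d
]

def error_ratio(window: str) -> str:
    """Illustrative PromQL: fraction of failed requests over `window`."""
    return (
        f'sum(rate(http_requests_total{{status=~"5.."}}[{window}]))'
        f' / sum(rate(http_requests_total[{window}]))'
    )

for long_w, short_w, factor, severity in BURN_RATE_WINDOWS:
    threshold = factor * ERROR_BUDGET
    expr = f"{error_ratio(long_w)} > {threshold:.6f} and {error_ratio(short_w)} > {threshold:.6f}"
    print(f"[{severity}] {expr}")
```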
Calculate SLI/SLO targets, error budgets, and burn rates from metrics data.
Usage:
python3 scripts/slo_calculator.py \
--input metrics.csv \
--slo-type availability \
--target 99.9 \
--window 30d \
--output json \
--file slo-report.json
Arguments:
- --input / -i: Input metrics file (CSV or JSON) (required)
- --slo-type: Type of SLO - availability, latency, throughput (default: availability)
- --target: SLO target percentage (default: 99.9)
- --window: Time window - 7d, 30d, 90d (default: 30d)
- --output / -o: Output format - json, text, markdown, csv (default: text)
- --file / -f: Write output to file
- --verbose / -v: Enable verbose output
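For reference, the error-budget arithmetic behind this report is small enough to show inline. The sketch below walks through it for a 99.9% availability SLO over a 30-day window; the measured availability figure is a made-up example.

```python
# Worked error-budget math for an availability SLO over a 30-day window
# (slo_calculator.py derives the same quantities from real metrics data).
slo_target = 99.9                        # percent, as passed via --target
window_minutes = 30 * 24 * 60            # 43,200 minutes in 30 days

error_budget_fraction = 1 - slo_target / 100               # 0.001
budget_minutes = window_minutes * error_budget_fraction    # 43.2 minutes of allowed unavailability

# Example: measured availability so far this window is 99.95%
measured = 99.95
burn_rate = (100 - measured) / (100 - slo_target)   # 0.5 => burning at half the allowed rate
budget_remaining = 1 - burn_rate                    # 50% of budget left if this rate holds all window

print(f"Budget: {budget_minutes:.1f} min | burn rate: {burn_rate:.2f} | remaining: {budget_remaining:.0%}")
```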
Analyze metrics patterns to detect anomalies, trends, and optimization opportunities.
Usage:
python3 scripts/metrics_analyzer.py \
--input metrics.csv \
--analysis-type anomaly \
--metrics http_requests_total,http_request_duration_seconds \
--threshold 3.0 \
--output json \
--file analysis-report.json
Arguments:
- --input / -i: Input metrics file (CSV or JSON) (required)
- --analysis-type: Analysis type - anomaly, trend, correlation, baseline, cardinality (default: anomaly)
- --metrics: Comma-separated metric names to analyze (optional; analyzes all if not specified)
- --threshold: Anomaly detection threshold in standard deviations (default: 3.0)
- --output / -o: Output format - json, text, markdown, csv (default: text)
- --file / -f: Write output to file
- --verbose / -v: Enable verbose output
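The anomaly analysis is threshold-based: a point is flagged when it sits more than the given number of standard deviations from the mean. A minimal sketch of that idea is shown below; metrics_analyzer.py's actual implementation may use more sophisticated baselining.

```python
# Z-score anomaly detection sketch, matching the spirit of
# --analysis-type anomaly --threshold 3.0 (not the script's exact code).
import statistics

def find_anomalies(values, threshold=3.0):
    """Return (index, value) pairs more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [(i, v) for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

# Hypothetical latency samples in milliseconds
latencies_ms = [120, 118, 125, 122, 119, 121, 123, 120, 117, 124, 122, 118, 121, 119, 123, 480]
print(find_anomalies(latencies_ms))  # flags the 480 ms spike
```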
Monitoring Patterns (references/monitoring_patterns.md): Comprehensive guide to metrics collection and visualization patterns.
NewRelic Patterns (references/newrelic_patterns.md): Complete NewRelic observability guide.
Logging Architecture (references/logging_architecture.md): Complete logging strategy and implementation guide.
Distributed Tracing (references/distributed_tracing.md): End-to-end distributed tracing implementation guide.
Alerting Runbooks (references/alerting_runbooks.md): Alerting strategy and incident response patterns.
Dashboard Templates (assets/dashboard_templates/): Pre-built Grafana dashboard JSON templates:
- api_service_dashboard.json - RED method dashboard for API services
- database_dashboard.json - USE method dashboard for databases
- kubernetes_dashboard.json - Cluster and workload metrics
- slo_overview_dashboard.json - Error budget and SLO tracking

NewRelic Dashboard Templates:
- newrelic_service_overview.json - RED method dashboard with NRQL queries
- newrelic_slo_dashboard.json - SLO tracking with burn rate visualization

Alert Templates (assets/alert_templates/): Production-ready alert rule templates:
- availability_alerts.yaml - Service availability alerts
- latency_alerts.yaml - Latency percentile alerts
- resource_alerts.yaml - CPU, memory, disk alerts
- slo_burn_rate_alerts.yaml - Multi-window burn rate alerts

NewRelic Alert Templates:
- newrelic_slo_alerts.json - Multi-burn-rate NRQL alert conditions
- newrelic_infrastructure_alerts.json - CPU, memory, disk, container alerts

Runbook Template (assets/runbook_template.md): Standardized incident response runbook format with sections for alert context, diagnostic steps, remediation actions, and escalation criteria.
Goal: Deploy comprehensive observability infrastructure for microservices.
Duration: 4-6 hours
Steps:
- dashboard_generator.py
- alert_rule_generator.py

Goal: Define Service Level Indicators and Objectives with error budget policies.
Duration: 2-3 hours
Steps:
- slo_calculator.py

Goal: Design symptom-based alerting with comprehensive runbooks.
Duration: 3-4 hours
Steps:
- alert_rule_generator.py

Goal: Create comprehensive dashboards using RED/USE methodologies.
Duration: 2-3 hours
Steps:
- dashboard_generator.py

# Generate API service dashboard (Grafana)
python3 scripts/dashboard_generator.py -s my-api -t api -p grafana -o json
# Generate API service dashboard (NewRelic)
python3 scripts/dashboard_generator.py -s my-api -t api -p newrelic -o json
# Generate database dashboard
python3 scripts/dashboard_generator.py -s my-db -t database -p grafana -o json
# Create SLO-based alerts for 99.9% availability (Prometheus)
python3 scripts/alert_rule_generator.py -s my-api --slo-target 99.9 -p prometheus -o yaml
# Create SLO-based alerts for 99.9% availability (NewRelic)
python3 scripts/alert_rule_generator.py -s my-api --slo-target 99.9 -p newrelic -o json
# Calculate error budget from metrics export
python3 scripts/slo_calculator.py -i prometheus_export.csv --target 99.9 --window 30d -o markdown
# Detect anomalies in latency metrics
python3 scripts/metrics_analyzer.py -i metrics.csv --analysis-type anomaly --threshold 3.0 -o json
# Analyze metric cardinality
python3 scripts/metrics_analyzer.py -i metrics.csv --analysis-type cardinality -o text
Problem: Prometheus memory usage growing, queries timing out
Solution: Use metrics_analyzer.py --analysis-type cardinality to identify high-cardinality labels, then aggregate or drop unnecessary labels
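If you want to see the idea behind the cardinality analysis, the sketch below counts distinct values per label column in a metrics export. The CSV layout (one column per label) is a hypothetical simplification, not the format metrics_analyzer.py requires.

```python
# Rough sketch of label-cardinality analysis over a CSV export
# (hypothetical layout: one column per label).
import csv
from collections import defaultdict

def label_cardinality(path: str) -> dict:
    """Count distinct values per label column in a metrics CSV export."""
    values = defaultdict(set)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            for label, value in row.items():
                values[label].add(value)
    return {label: len(vals) for label, vals in values.items()}

# Labels with thousands of distinct values (user_id, request_id, pod IP, ...)
# are the usual culprits behind Prometheus memory growth.
print(sorted(label_cardinality("metrics.csv").items(), key=lambda kv: -kv[1])[:10])
```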
Problem: Too many alerts, on-call burnout
Solution: Implement multi-burn-rate alerting using alert_rule_generator.py, add inhibition rules, increase alert thresholds for non-critical services
Problem: Traces not connecting across services
Solution: Verify trace context propagation headers (traceparent, W3C format), check sampling configuration, and ensure the OpenTelemetry Collector is receiving spans
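A quick way to confirm context propagation at a service boundary is to inspect the traceparent header, which per the W3C Trace Context spec has the form version-traceid-parentid-flags. The helper below is a hypothetical sketch for spotting missing or malformed headers:

```python
# Sanity-check W3C trace context propagation (hypothetical helper, not part of this skill).
import re

TRACEPARENT_RE = re.compile(r"^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")

def check_traceparent(headers: dict) -> str:
    value = headers.get("traceparent")
    if value is None:
        return "missing traceparent header - upstream service is not propagating context"
    if not TRACEPARENT_RE.match(value):
        return f"malformed traceparent: {value!r}"
    return "ok"

print(check_traceparent({"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}))
```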
Problem: Grafana dashboards loading slowly
Solution: Use recording rules for expensive queries, reduce default time ranges, enable query caching, and limit the panel count per dashboard
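As an illustration of the recording-rule remedy, the sketch below shows a Prometheus recording rule that pre-computes an expensive error-ratio query, expressed as the Python structure one might serialize to YAML. The rule and metric names are illustrative; this is not a template shipped with the skill.

```python
import yaml  # PyYAML, assumed available

# Recording rule that pre-computes an expensive dashboard query; dashboards
# then read the cheap series job:http_request_error_ratio:rate5m instead.
recording_rules = {
    "groups": [
        {
            "name": "payment-api.dashboard",
            "interval": "1m",
            "rules": [
                {
                    "record": "job:http_request_error_ratio:rate5m",
                    "expr": (
                        'sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)'
                        " / sum(rate(http_requests_total[5m])) by (job)"
                    ),
                }
            ],
        }
    ]
}

print(yaml.safe_dump(recording_rules, sort_keys=False))
```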