| name | observability |
| version | 2.1.0 |
| description | Observability specialists hub for monitoring, logging, tracing, and alerting. Routes to specialists for metrics collection, log aggregation, distributed tracing, and incident response. Use for system observability, debugging production issues, and performance monitoring. |
Central hub for monitoring, logging, tracing, and system observability.
expertise_check:
domain: observability
file: .claude/expertise/observability.yaml
if_exists:
- Load monitoring patterns
- Load alerting rules
- Apply SLO definitions
if_not_exists:
- Flag discovery mode
- Document patterns learned
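A minimal sketch of this check in Python, assuming PyYAML is available; `run_expertise_check` and the returned mode strings are hypothetical illustrations of the branches above, not part of the skill contract.

```python
# Hypothetical sketch of the expertise check described above (assumes PyYAML).
from pathlib import Path
import yaml

EXPERTISE_FILE = Path(".claude/expertise/observability.yaml")

def run_expertise_check() -> dict:
    if EXPERTISE_FILE.exists():
        # Documented expertise found: load monitoring patterns, alerting
        # rules, and SLO definitions from the file.
        expertise = yaml.safe_load(EXPERTISE_FILE.read_text())
        return {"mode": "expert", "expertise": expertise}
    # No documented expertise: operate in discovery mode and record the
    # patterns learned so the file can be written afterwards.
    return {"mode": "discovery", "expertise": None}
```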
Use this skill when working with any of these observability pillars:
| Pillar | Purpose |
|---|---|
| Metrics | Quantitative measurements |
| Logs | Event records |
| Traces | Request flow tracking |
| Alerts | Incident notification |
metrics_tools:
- Prometheus
- Grafana
- Datadog
- CloudWatch
metrics_types:
- Counters
- Gauges
- Histograms
- Summaries
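The metric types above map directly onto the Python `prometheus_client` library. A minimal sketch follows; the metric names, labels, and bucket boundaries are illustrative, not prescribed by this skill.

```python
# Illustrative use of the four Prometheus metric types (prometheus_client).
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server
import random
import time

REQUEST_COUNT = Counter("http_requests_total", "Total HTTP requests", ["method", "path"])
IN_FLIGHT = Gauge("http_requests_in_flight", "Requests currently being served")
LATENCY = Histogram("http_request_duration_seconds", "Request latency",
                    buckets=(0.05, 0.1, 0.2, 0.5, 1.0))
PAYLOAD_SIZE = Summary("http_request_size_bytes", "Request payload size")

def handle_request() -> None:
    REQUEST_COUNT.labels(method="GET", path="/api").inc()  # counter: only increases
    IN_FLIGHT.inc()                                        # gauge: can go up and down
    with LATENCY.time():                                   # histogram: bucketed timings
        time.sleep(random.uniform(0.01, 0.2))
    PAYLOAD_SIZE.observe(random.randint(200, 2000))        # summary: streaming quantiles
    IN_FLIGHT.dec()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
```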
logging_tools:
- ELK Stack (Elasticsearch, Logstash, Kibana)
- Loki
- Splunk
- CloudWatch Logs
logging_patterns:
- Structured logging (JSON)
- Log levels
- Correlation IDs
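The patterns above can be combined in a few lines using only the Python standard library. This is a sketch; the `JsonFormatter` class and the field names are illustrative choices, not a required schema.

```python
# Sketch of structured JSON logging with a shared correlation ID.
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object so log pipelines can parse it."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Attach the same correlation ID to every line for one request so the lines
# can be joined across services during an investigation.
correlation_id = str(uuid.uuid4())
log.info("order received", extra={"correlation_id": correlation_id})
log.warning("payment retry scheduled", extra={"correlation_id": correlation_id})
```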
tracing_tools:
- Jaeger
- Zipkin
- OpenTelemetry
- X-Ray
tracing_patterns:
- Span context propagation
- Baggage items
- Sampling strategies
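A sketch of span context propagation and head-based sampling with the OpenTelemetry Python SDK, assuming the `opentelemetry-sdk` package is installed; the service name, span names, and the 10% sampling ratio are illustrative.

```python
# Illustrative OpenTelemetry setup: sampled tracer provider plus nested spans.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Head-based sampling: keep roughly 10% of traces to limit overhead and cost.
provider = TracerProvider(sampler=TraceIdRatioBased(0.1))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def place_order(order_id: str) -> None:
    # Spans started inside this block inherit the current span context, which
    # is how a request's flow is stitched together across operations.
    with tracer.start_as_current_span("place_order") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("charge_card"):
            pass  # call the payment service here

place_order("o-123")
```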
definitions:
SLI: "Service Level Indicator - measurable metric"
SLO: "Service Level Objective - target value"
SLA: "Service Level Agreement - contractual commitment"
example:
SLI: "Request latency p99"
SLO: "99% of requests < 200ms"
SLA: "99.9% availability per month"
benchmark: observability-benchmark-v1
tests:
- obs-001: Monitoring coverage
- obs-002: Alert quality
minimum_scores:
monitoring_coverage: 0.85
alert_quality: 0.80
namespaces:
- observability/configs/{id}: Monitoring configs
- observability/dashboards: Dashboard templates
- improvement/audits/observability: Skill audits
confidence_check:
if confidence >= 0.8:
- Proceed with implementation
if 0.5 <= confidence < 0.8:
- Confirm tool stack
if confidence < 0.5:
- Ask for infrastructure details
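A tiny sketch of this routing; the function name and action strings simply mirror the branches above.

```python
# Hypothetical helper mirroring the confidence_check branches.
def route_by_confidence(confidence: float) -> str:
    if confidence >= 0.8:
        return "proceed with implementation"
    if confidence >= 0.5:
        return "confirm tool stack"
    return "ask for infrastructure details"
```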
Works with: infrastructure, deployment-readiness, performance-analysis
Remember: Skill() -> Task() -> TodoWrite() - ALWAYS
Comprehensive observability requires unified collection and correlation of metrics, logs, and traces - no single pillar provides complete system visibility.
In practice: correlate the three pillars with shared identifiers, so a metric anomaly can be traced to the specific requests, spans, and log events behind it.
Alerting must be driven by Service Level Objectives that reflect actual user impact, not arbitrary metric thresholds that generate noise.
In practice: alert on SLO burn rate, so notifications fire only when the error budget is being consumed faster than the objective allows.
Effective monitoring adapts to changing system behavior through machine learning baselines, not static thresholds that break during normal traffic variations.
In practice: derive thresholds from baselines that adapt to traffic patterns, rather than fixed values that must be retuned whenever load changes.
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Metric-Only Monitoring | Collecting metrics without logs or traces misses critical context for debugging failures | Implement all three pillars with correlation, use traces to investigate metric anomalies |
| Alert Fatigue from Static Thresholds | Setting fixed thresholds generates false alarms during traffic variations, causing alert fatigue | Use SLO-based alerting with burn rate calculations and dynamic baselines that adapt to traffic patterns |
| Unstructured Logging | Free-form log messages prevent automated analysis and correlation across services | Adopt structured logging with JSON format, include correlation IDs, define standard log levels |
| Missing Sampling Strategies | Tracing 100% of requests creates performance overhead and storage costs | Implement adaptive sampling: high rates for errors/slow requests, low rates for successful fast requests |
| Dashboard Proliferation | Creating dozens of uncategorized dashboards makes critical information undiscoverable | Organize dashboards by audience (SRE, developers, business), implement role-based access, standardize layouts |
The Observability skill establishes comprehensive system visibility through unified metrics, logs, and traces coordinated with intelligent alerting and dynamic monitoring. By implementing all three pillars with proper correlation, organizations gain the ability to debug complex distributed systems, proactively detect degradation, and understand user impact. The integration with tools like Prometheus for metrics, ELK/Loki for logs, and Jaeger/OpenTelemetry for traces provides production-grade observability infrastructure.
The SLO-based alerting framework transforms monitoring from reactive firefighting into proactive quality management. By defining Service Level Objectives that reflect actual business requirements and configuring alerts based on SLO burn rates, teams receive actionable notifications about genuine user impact rather than noisy metric threshold violations. The recursive improvement integration through benchmark evaluation ensures observability implementations meet quality standards for monitoring coverage and alert quality.
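A hedged sketch of the multi-window burn-rate logic this implies; the window sizes and the 14.4 threshold are common starting points (roughly 2% of a 30-day error budget consumed per hour), not values mandated by this skill.

```python
# Illustrative multi-window burn-rate paging decision.
def should_page(error_rate_1h: float, error_rate_5m: float, slo_target: float = 0.999) -> bool:
    """Page only when the budget burns fast in both a long and a short window,
    filtering out brief blips while still catching real incidents quickly."""
    allowed = 1 - slo_target
    burn_1h = error_rate_1h / allowed
    burn_5m = error_rate_5m / allowed
    return burn_1h > 14.4 and burn_5m > 14.4

# 1.5% errors over the last hour and 2% over the last 5 minutes against a
# 99.9% SLO give burn rates of 15 and 20, so this would page.
print(should_page(error_rate_1h=0.015, error_rate_5m=0.02))
```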
Organizations implementing this skill benefit from faster incident detection and resolution, reduced mean time to recovery (MTTR), and deeper understanding of system behavior under load. The expertise-aware workflow enables teams to leverage documented monitoring patterns and alerting rules specific to their infrastructure, preventing common pitfalls and accelerating observability maturity. When coordinated with infrastructure, deployment-readiness, and performance-analysis skills, observability creates a complete operational excellence framework.