---
name: tsh-implementing-observability
description: Observability patterns for logging, monitoring, alerting, and distributed tracing. Use when implementing metrics collection, log aggregation, alerting rules, or distributed tracing across services.
user-invocable: false
---
# Observability Patterns

## When to Use
- Setting up monitoring and alerting for applications
- Implementing centralized logging
- Adding distributed tracing to microservices
- Designing SLOs/SLIs and error budgets
- Creating dashboards and runbooks
## Three Pillars of Observability

| Pillar | Purpose | Tools |
|---|---|---|
| Metrics | Quantitative measurements over time | Prometheus, CloudWatch, Datadog, Grafana |
| Logs | Discrete events with context | ELK, Loki, CloudWatch Logs, Splunk |
| Traces | Request flow across services | Jaeger, Zipkin, X-Ray, Tempo |
## Stack Detection

Check which observability stack the project uses:

- `prometheus.yml` or `ServiceMonitor` → Prometheus
- `fluent-bit.conf` or `fluentd.conf` → Fluent Bit/Fluentd
- `otel-collector-config.yaml` → OpenTelemetry
- AWS with `aws_cloudwatch_*` resources → CloudWatch
- `datadog-agent` or `DD_*` env vars → Datadog

Use context7 to look up stack-specific configuration syntax.
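The checks above can be sketched as a small shell helper (illustrative only; the function name and the exact marker-file list are ours, not part of any tool):

```shell
#!/bin/sh
# Illustrative helper: print which observability stacks a project directory
# appears to use, based on the marker files listed above.
detect_stack() {
  dir="$1"
  [ -f "$dir/prometheus.yml" ]             && echo "prometheus"
  [ -f "$dir/fluent-bit.conf" ]            && echo "fluent-bit"
  [ -f "$dir/fluentd.conf" ]               && echo "fluentd"
  [ -f "$dir/otel-collector-config.yaml" ] && echo "opentelemetry"
  return 0
}

detect_stack .
```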
## Solution Decision Matrix

### Metrics Stack

| Scenario | Recommended Solution |
|---|---|
| Kubernetes-native, cost-sensitive | Prometheus + Grafana |
| AWS-native, simple setup | CloudWatch Metrics |
| Multi-cloud, enterprise | Datadog or New Relic |
| OpenTelemetry-first | Prometheus with OTLP receiver |
### Logging Stack

| Scenario | Recommended Solution |
|---|---|
| Kubernetes, cost-sensitive | Loki + Grafana |
| AWS-native | CloudWatch Logs |
| High volume, complex queries | Elasticsearch (ELK) |
| Multi-cloud, managed | Datadog Logs or Splunk |
### Tracing Stack

| Scenario | Recommended Solution |
|---|---|
| Kubernetes, open-source | Jaeger or Tempo |
| AWS-native | X-Ray |
| Multi-cloud, correlated | Datadog APM |
| Vendor-agnostic | OpenTelemetry → any backend |
## Kubernetes Observability Pattern

```
┌─────────────────────────────────────────────────────┐
│                    Applications                     │
│ (instrumented with OpenTelemetry SDK or auto-inst)  │
└──────────────────────────┬──────────────────────────┘
                           │ OTLP
                           ▼
┌─────────────────────────────────────────────────────┐
│               OpenTelemetry Collector               │
│      (receives, processes, exports telemetry)       │
└────────┬─────────────────┬─────────────────┬────────┘
         │                 │                 │
         ▼                 ▼                 ▼
    Prometheus           Loki          Tempo/Jaeger
    (metrics)           (logs)          (traces)
         │                 │                 │
         └─────────────────┼─────────────────┘
                           ▼
                        Grafana
                    (visualization)
```
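The pipeline above maps to a collector configuration along these lines (a sketch, not a production config: endpoints are placeholders, and exporter availability depends on your collector distribution, e.g. the `loki` exporter ships in opentelemetry-collector-contrib):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch: {}

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
```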
## SLO/SLI Framework

### Key Metrics (RED Method for Services)

| Metric | Description | Example SLI |
|---|---|---|
| Rate | Requests per second | `rate(http_requests_total[5m])` |
| Errors | Failed requests | `rate(http_requests_total{status=~"5.."}[5m])` |
| Duration | Latency distribution | `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))` |
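The RED queries above can be precomputed as Prometheus recording rules so dashboards and alerts share one definition (a sketch; the rule names follow the common `level:metric:operation` convention but are our own):

```yaml
groups:
  - name: red-metrics
    rules:
      - record: service:http_request_rate:5m
        expr: sum(rate(http_requests_total[5m]))
      - record: service:http_error_ratio:5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      - record: service:http_request_duration:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```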
### Key Metrics (USE Method for Resources)

| Metric | Description | Example |
|---|---|---|
| Utilization | % time resource is busy | CPU usage, memory usage |
| Saturation | Queue depth, waiting | Pod pending, connection pool |
| Errors | Error count | OOM kills, disk errors |
### SLO Definition Template

```yaml
slo:
  name: api-availability
  description: "API returns successful responses"
  sli:
    metric: |
      sum(rate(http_requests_total{status!~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
  target: 99.9%
  window: 30d
  error_budget: 0.1%
```
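The target and window above imply a concrete downtime allowance; a quick sketch of the arithmetic (`error_budget_minutes` is our own helper name, not from any SLO tooling):

```python
# Translate an availability target into minutes of allowed full downtime.
def error_budget_minutes(target: float, window_days: int) -> float:
    """Minutes of full downtime the error budget allows in the window."""
    return (1.0 - target) * window_days * 24 * 60

# 99.9% over 30 days leaves 0.1% of 43,200 minutes:
print(f"{error_budget_minutes(0.999, 30):.1f} minutes")  # → 43.2 minutes
```

This is why tightening a target from 99.9% to 99.99% is a big operational step: the budget shrinks tenfold, to about 4.3 minutes per 30 days.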
## Alerting Strategy

### Alert Severity Levels

| Severity | Response | Example |
|---|---|---|
| Critical | Page on-call immediately | Service down, data loss risk |
| Warning | Investigate within hours | Error rate elevated, disk 80% |
| Info | Review during business hours | Deployment completed, scaling event |
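Severity labels like these are typically turned into routing decisions in Alertmanager; a minimal sketch (receiver names are placeholders and would need real integration configs):

```yaml
route:
  receiver: default
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty-oncall
    - matchers: ['severity="warning"']
      receiver: slack-team
receivers:
  - name: default
  - name: pagerduty-oncall
  - name: slack-team
```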
### Alert Quality Rules

- **Actionable**: Every alert must have a clear response action
- **Relevant**: Alert on symptoms (user impact), not causes
- **Unique**: Avoid duplicate alerts for the same incident
- **Timely**: Alert early enough to prevent impact
### Alert Template (Prometheus)

```yaml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
          runbook_url: "https://runbooks.example.com/high-error-rate"
```
## Structured Logging

### Log Format (JSON)

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "error",
  "message": "Payment processing failed",
  "service": "payment-api",
  "trace_id": "abc123",
  "span_id": "def456",
  "user_id": "user-789",
  "error": {
    "type": "PaymentGatewayError",
    "message": "Connection timeout"
  },
  "context": {
    "payment_id": "pay-123",
    "amount": 99.99
  }
}
```
### Required Log Fields

| Field | Purpose | Correlation |
|---|---|---|
| `timestamp` | When event occurred | Time-based queries |
| `level` | Severity (debug/info/warn/error) | Filtering |
| `service` | Source service name | Service filtering |
| `trace_id` | Distributed trace identifier | Cross-service correlation |
| `message` | Human-readable description | Search |
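A formatter that emits these fields can be sketched with the standard library alone (a minimal sketch; in practice `trace_id`/`span_id` would be pulled from the active OpenTelemetry span context rather than passed by hand):

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON with the required fields."""
    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": self.service,
            # In real services, read this from the current trace context.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter(service="payment-api"))
logger = logging.getLogger("payment-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment processing failed", extra={"trace_id": "abc123"})
```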
## Process

1. **Discover context** → Check existing observability setup (Prometheus, CloudWatch, etc.)
2. **Choose stack** → Use decision matrix based on environment and requirements
3. **Instrument apps** → Add OpenTelemetry SDK or auto-instrumentation
4. **Configure collection** → Set up collectors, exporters, and storage
5. **Define SLOs** → Establish SLIs, targets, and error budgets
6. **Create alerts** → Implement actionable alerts with runbooks
7. **Build dashboards** → Create service and infrastructure dashboards
8. **Document runbooks** → Write response procedures for each alert
## Anti-Patterns

| Don't | Do |
|---|---|
| Alert on every metric threshold | Alert on user-impacting symptoms |
| Log everything at DEBUG in production | Use appropriate log levels |
| Unstructured log messages | Structured JSON logging |
| Missing trace context | Propagate trace IDs across services |
| Dashboards with 50+ panels | Focused dashboards per service/domain |
| Alerts without runbooks | Every alert links to response procedure |
| Store logs indefinitely | Define retention based on compliance needs |
## Related Skills

- `tsh-implementing-kubernetes` - For K8s-native observability setup
- `tsh-implementing-ci-cd` - For pipeline observability integration
- `tsh-managing-secrets` - For secure credential storage for observability tools