| name | service-mesh-observability |
| description | Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication. |
Service Mesh Observability
A guide to observability patterns for Istio, Linkerd, and other service mesh deployments.
When to Use This Skill
- Setting up distributed tracing across services
- Implementing service mesh metrics and dashboards
- Debugging latency and error issues
- Defining SLOs for service communication
- Visualizing service dependencies
- Troubleshooting mesh connectivity
Core Concepts
1. Three Pillars of Observability
┌─────────────────────────────────────────────────────┐
│                    Observability                    │
├─────────────────┬─────────────────┬─────────────────┤
│     Metrics     │     Traces      │      Logs       │
│                 │                 │                 │
│ • Request rate  │ • Span context  │ • Access logs   │
│ • Error rate    │ • Latency       │ • Error details │
│ • Latency P50   │ • Dependencies  │ • Debug info    │
│ • Saturation    │ • Bottlenecks   │ • Audit trail   │
└─────────────────┴─────────────────┴─────────────────┘
2. Golden Signals for Mesh
| Signal | Description | Alert Threshold |
|---|---|---|
| Latency | Request duration P50, P99 | P99 > 500ms |
| Traffic | Requests per second | Deviation from baseline (anomaly detection) |
| Errors | 5xx error rate | > 1% |
| Saturation | Resource utilization | > 80% |
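The static thresholds in the table can be checked mechanically. The following is a minimal sketch, not a production alerting rule: the `WindowStats` shape, the function names, and the use of CPU utilization as the saturation measure are all assumptions made for illustration. Traffic is left out because the table calls for anomaly detection rather than a fixed threshold.

```python
import math
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Raw observations for one service over one evaluation window (hypothetical shape)."""
    latencies_ms: list[float]  # one entry per request
    status_codes: list[int]    # one entry per request
    cpu_utilization: float     # 0.0 to 1.0, assumed stand-in for resource utilization


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def breached_signals(stats: WindowStats) -> list[str]:
    """Return the golden signals whose static threshold from the table is exceeded."""
    if not stats.status_codes:
        raise ValueError("no requests in window")
    breached = []
    # Latency: P99 > 500ms
    if percentile(stats.latencies_ms, 99) > 500:
        breached.append("latency")
    # Errors: 5xx rate > 1%
    errors = sum(1 for code in stats.status_codes if 500 <= code <= 599)
    if errors / len(stats.status_codes) > 0.01:
        breached.append("errors")
    # Saturation: utilization > 80%
    if stats.cpu_utilization > 0.80:
        breached.append("saturation")
    return breached
```

In a real mesh these values would normally come from the metrics backend rather than raw request lists; the sketch only makes the threshold logic explicit.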
Templates and detailed worked examples
Full template library and detailed worked examples live in references/details.md. Read that file when you need the concrete templates.
Best Practices
Do's
- Sample appropriately - 100% in dev, 1-10% in prod
- Use trace context - Forward trace headers on every outbound call so spans join a single trace
- Set up alerts - Alert on the golden signals
- Correlate metrics and traces - Use exemplars to link metric samples to trace IDs
- Retain strategically - Hot/cold storage tiers
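The sampling and propagation items above can be illustrated together. This is a sketch under stated assumptions: it uses the W3C Trace Context `traceparent` header format (a mesh may be configured for B3 headers instead), the 5% rate is just one value inside the 1-10% range, and the function names are hypothetical. The sampling decision is derived from the trace ID so that every service reaches the same decision for the same trace.

```python
import secrets

SAMPLE_RATE = 0.05  # assumption: 5% in prod, within the 1-10% range above


def should_sample(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministic head sampling keyed on the trace ID."""
    # Treat the low 64 bits of the trace ID as a uniform value in [0, 2**64).
    bucket = int(trace_id[-16:], 16)
    return bucket < rate * 2**64


def new_traceparent(rate: float = SAMPLE_RATE) -> str:
    """Start a trace: version-traceid-spanid-flags, with flags 01 meaning sampled."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if should_sample(trace_id, rate) else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Build the header for an outbound call from the inbound request's header.

    The trace ID and sampled flag are kept; only the span ID changes.
    """
    parts = incoming.split("-")
    if len(parts) != 4:
        raise ValueError(f"malformed traceparent: {incoming!r}")
    version, trace_id, _parent_id, flags = parts
    if version != "00" or len(trace_id) != 32 or len(flags) != 2:
        raise ValueError(f"unsupported traceparent: {incoming!r}")
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

A service would call `child_traceparent` with the header from the inbound request and attach the result to each outbound request. Sidecar proxies generally cannot link inbound and outbound requests on their own, so the application has to forward the header itself.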
Don'ts
- Don't over-sample - Storage costs add up
- Don't ignore cardinality - Keep label values bounded; avoid raw IDs and unnormalized paths
- Don't skip dashboards - Visualize dependencies
- Don't forget costs - Monitor observability costs
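One way to keep label cardinality bounded, as the list above advises, is to normalize values before they become labels and to cap the number of distinct values per label. The sketch below assumes request paths are the high-cardinality label; `normalize_path`, `LabelLimiter`, the `:id` placeholder, and the `other` overflow value are hypothetical names chosen for illustration.

```python
import re

# Matches a path segment that is all digits or looks like a UUID.
_ID_SEGMENT = re.compile(r"/(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F-]{27})(?=/|$)")


def normalize_path(path: str) -> str:
    """Collapse numeric and UUID path segments so /users/1 and /users/2 share a label."""
    return _ID_SEGMENT.sub("/:id", path)


class LabelLimiter:
    """Allow at most max_values distinct label values; map the rest to 'other'."""

    def __init__(self, max_values: int) -> None:
        self.max_values = max_values
        self._seen: set[str] = set()

    def limit(self, value: str) -> str:
        if value in self._seen:
            return value
        if len(self._seen) < self.max_values:
            self._seen.add(value)
            return value
        return "other"
```

A caller would apply `normalize_path` first and pass the result through a shared `LabelLimiter` before recording the metric, so that an unexpected burst of new paths cannot create an unbounded number of time series.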