| name | moai-observability-advanced |
| version | 3.0.0 |
| updated | 2025-11-19 |
| status | stable |
| description | Advanced observability patterns with OpenTelemetry, distributed tracing, eBPF monitoring, SLO/SLI implementation, and production strategies. Use when implementing observability, monitoring, or distributed tracing. |
| allowed-tools | ["Read","Bash","WebSearch","WebFetch"] |
Production-grade observability with OpenTelemetry, distributed tracing, and SLO/SLI implementation.
OpenTelemetry in 5 Minutes:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
# Set up tracing with a Jaeger exporter
trace.set_tracer_provider(TracerProvider())
jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)
# Use the tracer
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process_request"):
    # Your code here
    result = do_work()
Auto-triggers: observability, monitoring, tracing, OpenTelemetry, metrics, logs, SLO, SLI
Automatic (recommended):
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
# Auto-instrument Flask
FlaskInstrumentor().instrument_app(app)
# Auto-instrument HTTP requests
RequestsInstrumentor().instrument()
Manual (fine-grained control):
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
def process_order(order_id):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.value", calculate_value())
        # Nested span
        with tracer.start_as_current_span("validate_payment"):
            validate(order_id)
from opentelemetry import trace
from opentelemetry.propagate import inject, extract
import requests
tracer = trace.get_tracer(__name__)
# Inject the current trace context into outgoing request headers
headers = {}
inject(headers)
requests.post(url, headers=headers)
# Extract the context from incoming headers and continue the trace under it
ctx = extract(request.headers)
with tracer.start_as_current_span("handle_request", context=ctx):
    process_request()
Common SLIs: availability (fraction of successful requests), latency (fraction of requests served under a threshold), error rate, and throughput/saturation.
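A request-based SLI is typically computed as good events divided by total events. The sketch below is illustrative only; good_requests and total_requests are hypothetical counts you would pull from your metrics backend.
def request_based_sli(good_requests: int, total_requests: int) -> float:
    """SLI as the fraction of good events (e.g. non-5xx responses, or requests under 200ms)."""
    if total_requests == 0:
        return 1.0  # no traffic: treat the objective as met
    return good_requests / total_requests

# Example: 999,500 successful requests out of 1,000,000
sli = request_based_sli(999_500, 1_000_000)
print(f"Availability SLI: {sli:.4%}")  # 99.9500%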
Example SLOs:
api_service:
  availability:
    target: 99.9%   # 3 nines
    window: 30 days
  latency:
    target: 95% of requests < 200ms
    window: 30 days
  error_budget:
    allowed_failures: 0.1%   # 43.2 minutes/month
def calculate_error_budget(slo_target, time_window_seconds):
    """Total error budget (allowed downtime, in seconds) for the window."""
    allowed_downtime = time_window_seconds * (1 - slo_target)
    return allowed_downtime

# Example: 99.9% SLO for 30 days
budget = calculate_error_budget(0.999, 30 * 24 * 60 * 60)
print(f"Allowed downtime: {budget / 60:.2f} minutes")  # 43.2 minutes
Trace ID: abc123
├─ Span: API Gateway (200ms)
│ ├─ Span: Auth Service (50ms)
│ └─ Span: Order Service (150ms)
│ ├─ Span: Database Query (80ms)
│ └─ Span: Payment Gateway (70ms)
Critical Path: the longest span chain determines total latency
Parallel Work: concurrent spans reduce total time
Bottlenecks: spans with high duration or error rates
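To make the critical-path idea concrete, the sketch below walks a toy span tree mirroring the trace above (hypothetical data, not a real tracing API) and reports the longest chain under sequential versus parallel children:
# Toy span tree: (name, own_duration_ms, children)
span_tree = (
    "API Gateway", 0, [
        ("Auth Service", 50, []),
        ("Order Service", 0, [
            ("Database Query", 80, []),
            ("Payment Gateway", 70, []),
        ]),
    ],
)

def critical_path(span, parallel_children=False):
    """Longest chain if children run sequentially; longest single branch if they run in parallel."""
    name, own_ms, children = span
    if not children:
        return own_ms
    child_costs = [critical_path(c, parallel_children) for c in children]
    return own_ms + (max(child_costs) if parallel_children else sum(child_costs))

print(critical_path(span_tree))                          # 200 ms with sequential children
print(critical_path(span_tree, parallel_children=True))  # 80 ms if all children ran concurrently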
from jaeger_client import Config
config = Config(
    config={
        'sampler': {'type': 'const', 'param': 1},
        'local_agent': {'reporting_host': 'localhost'},
    },
    service_name='my-service',
)
tracer = config.initialize_tracer()
from prometheus_client import Counter, Histogram, Gauge, start_http_server
# Define metrics
request_count = Counter('http_requests_total', 'Total requests', ['method', 'endpoint'])
request_duration = Histogram('http_request_duration_seconds', 'Request duration')
active_users = Gauge('active_users', 'Currently active users')
# Use metrics
@request_duration.time()
def handle_request():
    request_count.labels(method='GET', endpoint='/api/users').inc()
    # Process request
    active_users.set(get_active_user_count())
# Expose metrics
start_http_server(8000)  # Metrics at :8000/metrics
Key Panels: request rate, error rate, latency percentiles (p50/p95/p99), and resource saturation.
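Those panels are usually backed by PromQL over the metrics defined earlier. A minimal sketch that runs the same queries from Python via the prometheus-api-client package; the Prometheus URL and the status label on the error-rate query are assumptions:
from prometheus_api_client import PrometheusConnect  # pip install prometheus-api-client

prom = PrometheusConnect(url="http://localhost:9090", disable_ssl=True)
panels = {
    "request_rate": 'sum(rate(http_requests_total[5m]))',
    "error_rate": 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))',
    "latency_p95": 'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
}
for name, query in panels.items():
    print(name, prom.custom_query(query=query))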
Use cases: zero-instrumentation visibility into network traffic, syscalls, and kernel-level performance without code changes.
Tools: Pixie, Cilium/Hubble, bpftrace, BCC, Falco.
Example (network monitoring):
# Install Pixie
px deploy
# Monitor HTTP traffic
px live px/http_data
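Pixie works cluster-wide; for ad-hoc host-level probes, the BCC Python bindings can attach eBPF programs directly. A minimal sketch (assumes the bcc package and root privileges; it traces execve calls rather than HTTP):
from bcc import BPF  # requires the bcc package and root privileges

# Tiny eBPF program: log a line every time a process calls execve
prog = r"""
int trace_execve(void *ctx) {
    bpf_trace_printk("execve called\n");
    return 0;
}
"""
b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="trace_execve")
b.trace_print()  # stream the kernel trace pipe until interrupted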
import structlog
logger = structlog.get_logger()
logger.info(
    "order_processed",
    order_id="12345",
    user_id="user_789",
    amount=99.99,
    duration_ms=150,
)
# Output: {"event": "order_processed", "order_id": "12345", ...}
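To get the JSON output shown above, structlog needs a JSON renderer in its processor chain; a minimal configuration sketch:
import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),  # add an ISO-8601 timestamp
        structlog.processors.add_log_level,           # include the log level
        structlog.processors.JSONRenderer(),          # emit one JSON object per event
    ]
)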
import logging
from python_logging_elk import ElkHandler
handler = ElkHandler(
    host='elasticsearch.example.com',
    port=9200,
    index='application-logs'
)
logging.getLogger().addHandler(handler)  # attach to the stdlib root logger
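Logs become far easier to correlate with traces when each entry carries the active trace ID. A small sketch using the OpenTelemetry API to pull it from the current span; the helper name and log field names are illustrative:
from opentelemetry import trace
import structlog

logger = structlog.get_logger()

def log_with_trace_context(event: str, **fields):
    """Attach the current trace and span IDs (hex) to a structured log entry."""
    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid:
        fields["trace_id"] = format(ctx.trace_id, "032x")
        fields["span_id"] = format(ctx.span_id, "016x")
    logger.info(event, **fields)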
Good Alert: symptom-based (tied to user impact), actionable, specific about severity, and quiet until the condition persists.
Example:
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 5% for 5 minutes"
    description: "{{ $value | humanizePercentage }} of requests are failing"
✅ DO: instrument service boundaries, use structured logs with trace IDs, define SLOs before writing alerts, alert on symptoms such as error rate and latency
❌ DON'T: alert on every metric, log sensitive payloads, rely on averages instead of percentiles, treat dashboards as a substitute for SLOs
For detailed implementation:
Related Skills:
moai-essentials-perf: Performance optimization
moai-cloud-aws-advanced: AWS CloudWatch integration
moai-cloud-gcp-advanced: Google Cloud Monitoring
Tools: OpenTelemetry, Prometheus, Grafana, Jaeger, ELK, Pixie
Version: 3.0.0
Last Updated: 2025-11-19
Status: Production Ready