observability-unified-dashboard
// Design unified dashboards with golden signals, OpenTelemetry correlation, SLO tracking, and Grafana 11 auto-correlations for metrics/logs/traces.
| name | Observability Unified Dashboard |
| slug | observability-unified-dashboard |
| description | Design unified dashboards with golden signals, OpenTelemetry correlation, SLO tracking, and Grafana 11 auto-correlations for metrics/logs/traces. |
| capabilities | ["Unified dashboard design integrating metrics, logs, and traces","Golden Signals implementation (latency, traffic, errors, saturation)","OpenTelemetry pipeline configuration with OTLP protocol","Grafana 11 correlation setup (metrics → logs → traces)","SLO/SLI tracking and alerting","Multi-data source query orchestration","Real-time compliance scoring and trend analysis","Automated alert routing and escalation"] |
| inputs | {"services":{"type":"array","description":"List of services/systems to monitor","required":true},"deployment_env":{"type":"string","description":"Environment context (production, staging, dev)","required":true},"slo_targets":{"type":"object","description":"SLO definitions per service (e.g., 99.9% availability, p95 latency <200ms)","required":false},"telemetry_sources":{"type":"object","description":"Data sources (Prometheus, Loki, Tempo, Jaeger, cloud providers)","required":true},"alert_destinations":{"type":"array","description":"Alert routing (PagerDuty, Slack, email, webhooks)","required":false},"custom_metrics":{"type":"array","description":"Business-specific metrics beyond golden signals","required":false}} |
| outputs | {"dashboard_config":{"type":"object","description":"Grafana dashboard JSON with panels for golden signals and SLOs"},"correlation_rules":{"type":"array","description":"Automatic correlation rules linking metrics, logs, and traces"},"alert_rules":{"type":"array","description":"Alert definitions with thresholds, severity, and routing"},"otel_pipeline":{"type":"object","description":"OpenTelemetry Collector configuration (receivers, processors, exporters)"},"slo_summary":{"type":"object","description":"SLO compliance report with error budget burn rate"}} |
| keywords | ["observability","golden signals","OpenTelemetry","Grafana","SLO","SLI","unified dashboard","metrics","logs","traces","correlation","alerting","SRE"] |
| version | 1.0.0 |
| owner | cognitive-toolworks |
| license | Apache-2.0 |
| security | {"secrets":"Avoid logging sensitive data; redact PII/credentials in traces and logs","compliance":"GDPR/CCPA-compliant log retention; encrypt telemetry data in transit (TLS)"} |
| links | [{"title":"Google SRE Book: The Four Golden Signals","url":"https://sre.google/sre-book/monitoring-distributed-systems/","accessed":"2025-10-26"},{"title":"OpenTelemetry Official Documentation","url":"https://opentelemetry.io/docs/","accessed":"2025-10-26"},{"title":"Grafana 11 Correlations Documentation","url":"https://grafana.com/docs/grafana/latest/administration/correlations/","accessed":"2025-10-26"},{"title":"Prometheus Best Practices","url":"https://prometheus.io/docs/practices/","accessed":"2025-10-26"},{"title":"NIST SP 800-92: Guide to Computer Security Log Management","url":"https://csrc.nist.gov/publications/detail/sp/800-92/final","accessed":"2025-10-26"}] |
Purpose: Design comprehensive observability dashboards that unify metrics, logs, and traces using the Four Golden Signals framework (Latency, Traffic, Errors, Saturation), automate correlations across telemetry signals via OpenTelemetry, track SLO compliance, and configure Grafana 11 with automatic drill-down from alerts to root cause.
When to Use:
Orchestrates:
- observability-slo-calculator: Computes SLO compliance, error budgets, and burn rates.
- observability-stack-configurator: Deploys Prometheus, Loki, Tempo, Grafana stack.
Complements:
- database-postgres-architect, database-mongodb-architect, database-redis-architect: Adds database-specific metrics (query latency, cache hit rate).
- cloud-kubernetes-architect: Integrates k8s metrics (pod restarts, resource saturation).
Mandatory Inputs:
- services: List of services/systems to monitor (≥1). Each service should have a unique service.name (OpenTelemetry resource attribute).
- telemetry_sources: At least one data source for metrics (e.g., Prometheus), logs (e.g., Loki), or traces (e.g., Tempo).
- deployment_env: Environment context to filter data (production, staging, dev).
Validation Steps:
- /api/v1/status/config, Loki /ready, Tempo /ready).
- trace_id and span_id propagation.
- <SLI> <comparator> <target> format (e.g., availability >= 99.9%, p95_latency < 200ms).
- service.name attributes.
Goal: Generate a minimal unified dashboard with the Four Golden Signals for a single service using existing Prometheus/Loki/Tempo data sources.
Steps:
- services list (or use first entry if priority not specified).
- Latency: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="<service>"}[5m]))
- Traffic: sum(rate(http_requests_total{job="<service>"}[5m]))
- Errors: sum(rate(http_requests_total{job="<service>", status=~"5.."}[5m])) / sum(rate(http_requests_total{job="<service>"}[5m])) * 100
- Saturation: avg(rate(process_cpu_seconds_total{job="<service>"}[5m])) for CPU (the metric is a cumulative counter, so it needs rate()) and avg(process_resident_memory_bytes{job="<service>"}) for memory
Token Budget: ≤2k tokens (no deep SLO calculation, no complex correlation setup).
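The four queries above differ only in the service selector, so they can be generated per service. The following is a minimal sketch in Python, assuming the metric names from the steps and a `job` label equal to the service name; the function name and dictionary keys are hypothetical. CPU is wrapped in `rate()` here because `process_cpu_seconds_total` is a cumulative counter.

```python
def golden_signal_queries(service: str, window: str = "5m") -> dict[str, str]:
    """Return PromQL strings for the Four Golden Signals of one service.

    Assumption: the `job` label carries the service name, as in the steps above.
    """
    sel = f'job="{service}"'
    return {
        # Latency: 95th percentile of request duration.
        "latency_p95": (
            "histogram_quantile(0.95, "
            f"rate(http_request_duration_seconds_bucket{{{sel}}}[{window}]))"
        ),
        # Traffic: requests per second.
        "traffic": f"sum(rate(http_requests_total{{{sel}}}[{window}]))",
        # Errors: share of 5xx responses, in percent.
        "errors_pct": (
            f'sum(rate(http_requests_total{{{sel}, status=~"5.."}}[{window}])) / '
            f"sum(rate(http_requests_total{{{sel}}}[{window}])) * 100"
        ),
        # Saturation: CPU cores in use and resident memory.
        "cpu_cores": f"avg(rate(process_cpu_seconds_total{{{sel}}}[{window}]))",
        "memory_bytes": f"avg(process_resident_memory_bytes{{{sel}}})",
    }


queries = golden_signal_queries("api-gateway")
print(queries["traffic"])
# sum(rate(http_requests_total{job="api-gateway"}[5m]))
```

Each value can be used as the `expr` of one panel target.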
Goal: Create a comprehensive dashboard with Golden Signals, SLO compliance tracking, automated correlations, and multi-service support.
Steps:
observability-slo-calculator (T2):
- slo_targets, services, and telemetry_sources (Prometheus for metrics, Loki for logs).
- service.name and trace_id.
- trace_id.
- POST /api/datasources/uid/<uid>/correlations).
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  resource:
    attributes:
      - key: deployment.environment
        value: ${DEPLOYMENT_ENV}
        action: insert
exporters:
  prometheus:
    # Address the Collector listens on; Prometheus scrapes this endpoint.
    # The port is an assumption, adjust to your scrape config.
    endpoint: "0.0.0.0:8889"
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
  # The Collector has no "tempo" exporter type; Tempo ingests OTLP,
  # so traces are sent through the generic otlp exporter.
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true  # assumes plaintext gRPC inside the cluster
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [otlp/tempo]
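A Collector rejects a configuration in which a pipeline names a receiver, processor, or exporter that is not defined in the matching top-level section. A minimal sketch of that cross-check in Python, operating on the configuration as an already-parsed dictionary (YAML parsing is left out); the function name and the `missing_exporter` entry are hypothetical and only illustrate a failing case.

```python
def undefined_components(config: dict) -> list[str]:
    """List pipeline references that have no matching top-level definition."""
    problems = []
    pipelines = config.get("service", {}).get("pipelines", {})
    for pipeline_name, pipeline in pipelines.items():
        for kind in ("receivers", "processors", "exporters"):
            defined = config.get(kind) or {}
            for component in pipeline.get(kind, []):
                if component not in defined:
                    problems.append(f"{pipeline_name}.{kind}: {component}")
    return problems


config = {
    "receivers": {"otlp": {}},
    "processors": {"batch": {}, "resource": {}},
    "exporters": {"prometheus": {}, "loki": {}},
    "service": {
        "pipelines": {
            "metrics": {"receivers": ["otlp"], "processors": ["batch", "resource"],
                        "exporters": ["prometheus"]},
            "logs": {"receivers": ["otlp"], "processors": ["batch", "resource"],
                     "exporters": ["loki"]},
            # Deliberately references an exporter that is not defined above.
            "traces": {"receivers": ["otlp"], "processors": ["batch", "resource"],
                       "exporters": ["missing_exporter"]},
        }
    },
}

print(undefined_components(config))
# ['traces.exporters: missing_exporter']
```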
Token Budget: ≤6k tokens (includes SLO calculation delegation, correlation setup, multi-service config).
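The correlation steps in this tier key on `service.name` and `trace_id`. As an illustration of what a trace-to-logs link ultimately resolves to, here is a minimal Python sketch that builds a LogQL query for one trace. It assumes logs carry a `service_name` stream label and contain the trace ID as plain text in the log line; both are assumptions about the log pipeline, and the function name is hypothetical.

```python
import re

# W3C Trace Context trace IDs are 32 lowercase hexadecimal characters.
_TRACE_ID = re.compile(r"[0-9a-f]{32}")


def logs_query_for_trace(service: str, trace_id: str) -> str:
    """Build a LogQL line-filter query that finds the logs of one trace."""
    if not _TRACE_ID.fullmatch(trace_id):
        raise ValueError(f"not a valid trace ID: {trace_id!r}")
    # Assumption: `service_name` is the Loki label holding service.name.
    return f'{{service_name="{service}"}} |= "{trace_id}"'


print(logs_query_for_trace("order-service", "4bf92f3577b34da6a3ce929d0e0e4736"))
# {service_name="order-service"} |= "4bf92f3577b34da6a3ce929d0e0e4736"
```

Validating the ID before interpolation keeps malformed values out of the query string.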
Goal: Deploy a production-grade unified observability platform with custom metrics, anomaly detection, capacity planning, and compliance reporting.
Steps:
observability-stack-configurator (T3):
- abs(current_latency - baseline_latency) / baseline_latency > 0.3 (30% deviation).
- trace_id propagation, check OTLP endpoints).
Token Budget: ≤12k tokens (includes stack deployment, advanced analytics, compliance checks, documentation).
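The anomaly rule in this tier is a relative-deviation test, so it fires when latency moves more than 30% away from the baseline in either direction. A minimal Python sketch of the same arithmetic; the function name is hypothetical, and the guard against a non-positive baseline is an added assumption.

```python
def is_latency_anomalous(current: float, baseline: float, threshold: float = 0.3) -> bool:
    """Apply abs(current - baseline) / baseline > threshold."""
    if baseline <= 0:
        raise ValueError("baseline latency must be positive")
    return abs(current - baseline) / baseline > threshold


print(is_latency_anomalous(0.280, 0.200))  # True: 40% above baseline
print(is_latency_anomalous(0.230, 0.200))  # False: 15% above baseline
print(is_latency_anomalous(0.130, 0.200))  # True: 35% below baseline
```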
Ambiguity Resolution:
slo_targets not provided:
service.name:
- service.name attributes per service."
Stop Conditions:
Thresholds:
Required Fields:
{
  dashboard_config: {
    uid: string; // Grafana dashboard UID (unique identifier)
    title: string; // Dashboard title (e.g., "Unified Observability: Production Services")
    panels: Array<{ // Array of dashboard panels
      id: number; // Panel ID (unique within dashboard)
      title: string; // Panel title (e.g., "Latency (p95)")
      type: string; // Panel type (graph, stat, table, heatmap)
      targets: Array<{ // Data source queries
        datasource: string; // Data source UID (Prometheus, Loki, Tempo)
        expr: string; // Query expression (PromQL, LogQL, TraceQL)
      }>;
      alert?: { // Optional alert configuration
        name: string; // Alert rule name
        condition: string; // Alert condition (PromQL expression)
        for: string; // Duration threshold (e.g., "5m")
        annotations: {
          summary: string; // Alert summary (e.g., "High error rate detected")
        };
      };
    }>;
    templating: { // Dashboard variables (environment, service selector)
      list: Array<{
        name: string; // Variable name (e.g., "environment")
        type: string; // Variable type (query, custom, interval)
        query: string; // Query to populate variable options
      }>;
    };
  };
  correlation_rules: Array<{ // Grafana correlations
    source_datasource: string; // Source data source UID (e.g., Prometheus)
    target_datasource: string; // Target data source UID (e.g., Loki)
    label: string; // Correlation link label (e.g., "View Logs")
    field: string; // Field to use for correlation (e.g., "trace_id")
    transformation: string; // Optional transformation (e.g., regex extraction)
  }>;
  alert_rules: Array<{ // Alert definitions
    name: string; // Alert rule name
    expr: string; // PromQL expression for alert condition
    for: string; // Duration threshold (e.g., "5m")
    severity: "critical" | "warning" | "info";
    annotations: {
      summary: string; // Human-readable alert summary
      runbook_url?: string; // Link to runbook for resolution steps
    };
    route?: { // Alert routing (optional)
      receiver: string; // Receiver name (PagerDuty, Slack, email)
    };
  }>;
  otel_pipeline: { // OpenTelemetry Collector configuration
    receivers: object; // OTLP receivers (grpc, http)
    processors: object; // Batch, resource, attributes processors
    exporters: object; // Prometheus, Loki, Tempo exporters
    service: {
      pipelines: {
        metrics: { receivers: string[]; processors: string[]; exporters: string[] };
        logs: { receivers: string[]; processors: string[]; exporters: string[] };
        traces: { receivers: string[]; processors: string[]; exporters: string[] };
      };
    };
  };
  slo_summary: { // SLO compliance report (from observability-slo-calculator)
    services: Array<{
      name: string; // Service name
      slo_compliance: number; // Current SLO compliance (0-100%)
      error_budget_remaining: number; // Error budget remaining (%)
      burn_rate: number; // Current error budget burn rate (multiplier)
      time_to_exhaustion_hours: number | null; // Hours until error budget exhausted (null if budget positive)
      status: "compliant" | "warning" | "breached";
    }>;
    overall_compliance: number; // Average compliance across all services
  };
}
Optional Fields:
- custom_panels: Array of custom dashboard panels for business-specific metrics.
- capacity_forecast: Object with resource usage predictions (disk full in X days, etc.).
- compliance_report: Object with NIST SP 800-92, GDPR/CCPA compliance status.
- runbook_links: Object mapping alert names to runbook URLs.
Format: JSON for dashboard_config, correlation_rules, otel_pipeline, slo_summary. YAML for alert_rules (Grafana-managed alerting format).
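A consumer of this output can check the contract cheaply before using it. The sketch below, in Python, tests only for the five required top-level fields and for the three severity values allowed in the schema; the function names are hypothetical and no deeper validation is attempted.

```python
REQUIRED_FIELDS = (
    "dashboard_config",
    "correlation_rules",
    "alert_rules",
    "otel_pipeline",
    "slo_summary",
)
ALLOWED_SEVERITIES = {"critical", "warning", "info"}


def missing_required_fields(output: dict) -> list[str]:
    """Return the required top-level fields absent from the output."""
    return [name for name in REQUIRED_FIELDS if name not in output]


def invalid_severities(alert_rules: list[dict]) -> list[str]:
    """Return names of alert rules whose severity is not an allowed value."""
    return [
        rule.get("name", "<unnamed>")
        for rule in alert_rules
        if rule.get("severity") not in ALLOWED_SEVERITIES
    ]


partial = {"dashboard_config": {}, "alert_rules": [{"name": "HighErrors", "severity": "page"}]}
print(missing_required_fields(partial))
# ['correlation_rules', 'otel_pipeline', 'slo_summary']
print(invalid_severities(partial["alert_rules"]))
# ['HighErrors']
```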
Input:
services:
  - name: "frontend-web"
    type: "web"
  - name: "api-gateway"
    type: "api"
  - name: "order-service"
    type: "backend"
deployment_env: "production"
slo_targets:
  - service: "api-gateway"
    availability: 99.95%
    p95_latency_ms: 200
  - service: "order-service"
    availability: 99.9%
    p95_latency_ms: 500
telemetry_sources:
  metrics: "prometheus-prod"
  logs: "loki-prod"
  traces: "tempo-prod"
alert_destinations:
  - type: "pagerduty"
    integration_key: "REDACTED"
  - type: "slack"
    webhook_url: "REDACTED"
Output (T2 Summary):
Dashboard: "Unified Observability: E-commerce Production"
  - Overview Panel: 3 services, 2/3 SLO compliant (order-service at 99.85%, warning)
  - Golden Signals (per service):
      api-gateway: Latency p95=150ms ✅, Traffic=1200 req/s, Errors=0.5% ✅, CPU=45%
      order-service: Latency p95=480ms ✅, Traffic=300 req/s, Errors=2.1% ⚠️, CPU=78%
  - Correlations: 3 rules (Prometheus alert → Loki logs → Tempo traces)
  - Alerts:
      - CRITICAL: order-service error rate >5% for 5m → PagerDuty
      - WARNING: order-service SLO burn rate 7.2× (budget exhausted in 4 days) → Slack
OpenTelemetry Pipeline: OTLP receivers → batch processor → Prometheus/Loki/Tempo exporters
SLO Summary:
  - api-gateway: 99.98% compliant (6% error budget remaining, burn rate 0.8×)
  - order-service: 99.85% compliant (85% error budget consumed, burn rate 7.2× ⚠️)
  - Overall: 99.92% compliance
Link to Full Example: See skills/observability-unified-dashboard/examples/ecommerce-unified-dashboard.txt
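The "budget exhausted in 4 days" figure in the summary follows from the burn-rate definition: a burn rate of 1× uses up the whole error budget in exactly one SLO window. A minimal Python sketch of that arithmetic, assuming a 30-day window and a full budget at the start (the example does not state either, so both are assumptions); the function name is hypothetical.

```python
from typing import Optional


def hours_to_exhaustion(
    burn_rate: float,
    window_days: float = 30.0,      # assumed SLO window
    budget_remaining: float = 1.0,  # fraction of the budget still available
) -> Optional[float]:
    """Hours until the error budget is gone at the current burn rate.

    Returns None when the budget is not being consumed (burn rate <= 0).
    """
    if burn_rate <= 0:
        return None
    remaining = max(budget_remaining, 0.0)
    return remaining * window_days * 24 / burn_rate


hours = hours_to_exhaustion(7.2)
print(round(hours / 24, 1))  # 4.2 (days), matching the "4 days" in the alert
```

At the 0.8× burn rate reported for api-gateway, the same function gives 900 hours, which is longer than the 30-day window, so the budget is not exhausted within the period.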
Custom Metrics Added:
- sum(rate(payment_transactions_total{status="success"}[5m])) / sum(rate(payment_transactions_total[5m])) * 100
- histogram_quantile(0.95, rate(db_query_duration_seconds_bucket{db="orders"}[5m]))
- sum(rate(redis_hits_total[5m])) / (sum(rate(redis_hits_total[5m])) + sum(rate(redis_misses_total[5m]))) * 100
Anomaly Detection Alert:
- alert: LatencyAnomalyDetected
  expr: |
    abs(
      histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
      -
      histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[7d] offset 7d))
    ) / histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[7d] offset 7d)) > 0.3
  for: 10m
  severity: warning
  annotations:
    summary: "p95 latency deviates more than 30% from 7-day baseline"
    runbook_url: "https://runbooks.example.com/latency-anomaly"
Token Budget Compliance:
Validation Checklist:
Safety & Auditability:
Determinism:
Official Documentation:
Complementary Skills:
- observability-slo-calculator: Computes SLO compliance, error budgets, burn rates (invoke for T2/T3).
- observability-stack-configurator: Deploys Prometheus, Loki, Tempo, Grafana stack (invoke for T3).
OpenTelemetry Collector Reference Config:
Grafana Dashboard Examples:
SLO/Error Budget Methodology:
Security & Compliance: