observability-rca
// Use this skill when performing root cause analysis on incidents detected by Elastic Observability. Activate when the user reports a production issue, outage, degraded performance, or asks to investigate alerts.
| name | observability-rca |
| description | Use this skill when performing root cause analysis on incidents detected by Elastic Observability. Activate when the user reports a production issue, outage, degraded performance, or asks to investigate alerts. |
| metadata | {"version":"0.1.0","visibility":"public"} |
Start with high-level health checks:
elastic es cluster health
elastic slos list
Check for widespread vs isolated issues by querying error rates across services:
elastic es query 'FROM logs-* | WHERE @timestamp > NOW() - 1 HOUR AND log.level == "error" | STATS errors = COUNT(*) BY service.name | SORT errors DESC | LIMIT 20'
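The per-service counts from the query above can be turned into a rough "widespread vs isolated" call. The sketch below is one possible heuristic, not part of the skill: the function name `classify_scope` and the 80% dominance cutoff are assumptions chosen for illustration.

```python
from typing import Dict


def classify_scope(errors_by_service: Dict[str, int], dominance: float = 0.8) -> str:
    """Classify per-service error counts as 'none', 'isolated', or 'widespread'.

    Hypothetical helper: the 0.8 dominance cutoff is an assumed default,
    not a value taken from Elastic documentation.
    """
    total = sum(errors_by_service.values())
    if total == 0:
        return "none"
    top = max(errors_by_service.values())
    # One service producing most of the errors suggests an isolated issue.
    return "isolated" if top / total >= dominance else "widespread"


# Example counts, as they might come back from the STATS ... BY service.name query.
counts = {"checkout": 950, "cart": 30, "search": 20}
print(classify_scope(counts))  # isolated
```

A lower cutoff flags more incidents as isolated; tune it to how noisy the error logs normally are.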
Establish when the issue started:
elastic es query 'FROM logs-* | WHERE @timestamp > NOW() - 2 HOURS AND service.name == "<affected-service>" AND log.level == "error" | STATS errors = COUNT(*) BY bucket = BUCKET(@timestamp, 1 minute) | SORT bucket | LIMIT 120'
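Given the per-minute buckets from the query above, the onset can be estimated as the first bucket that rises clearly above a baseline. This is a minimal sketch under stated assumptions: `find_onset` is a hypothetical helper, the first 10 buckets are assumed to predate the incident, and the 3-sigma threshold is an arbitrary starting point.

```python
from statistics import mean, pstdev
from typing import List, Optional, Tuple


def find_onset(
    buckets: List[Tuple[str, int]],
    baseline_size: int = 10,
    sigmas: float = 3.0,
) -> Optional[str]:
    """Return the timestamp of the first bucket above the baseline threshold.

    Hypothetical helper: assumes `buckets` is sorted by time and that the
    first `baseline_size` buckets represent normal behaviour.
    """
    if len(buckets) <= baseline_size:
        return None
    baseline = [count for _, count in buckets[:baseline_size]]
    # Floor the spread at 1.0 so a perfectly flat baseline does not flag tiny changes.
    threshold = mean(baseline) + sigmas * max(pstdev(baseline), 1.0)
    for timestamp, count in buckets[baseline_size:]:
        if count > threshold:
            return timestamp
    return None


# Ten quiet minutes followed by a spike; timestamps are illustrative only.
series = [(f"12:{m:02d}", 2) for m in range(10)] + [("12:10", 3), ("12:11", 40), ("12:12", 55)]
print(find_onset(series))  # 12:11
```

If the incident started before the query window, the baseline is already contaminated; widen the window until the early buckets look normal.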
Look for deployment or config changes around that time:
elastic es query 'FROM logs-* | WHERE @timestamp > NOW() - 4 HOURS AND (message LIKE "*deploy*" OR message LIKE "*restart*" OR message LIKE "*config*") | SORT @timestamp DESC | LIMIT 20'
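Once an onset time is known, the change events returned above can be narrowed to those shortly before it. The sketch below assumes events have been parsed into `(datetime, message)` pairs; `changes_before_onset` and the 30-minute lookback are hypothetical choices, not part of the CLI.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Event = Tuple[datetime, str]


def changes_before_onset(
    events: List[Event],
    onset: datetime,
    lookback: timedelta = timedelta(minutes=30),
) -> List[Event]:
    """Return events within [onset - lookback, onset], most recent first.

    Hypothetical helper: the 30-minute lookback is an assumed default.
    """
    window_start = onset - lookback
    hits = [(ts, msg) for ts, msg in events if window_start <= ts <= onset]
    return sorted(hits, key=lambda event: event[0], reverse=True)


# Illustrative data, not real log output.
onset = datetime(2024, 1, 1, 12, 0)
events = [
    (datetime(2024, 1, 1, 11, 50), "deploy v2 started"),
    (datetime(2024, 1, 1, 10, 0), "config reload"),
    (datetime(2024, 1, 1, 12, 5), "service restart"),
]
for ts, msg in changes_before_onset(events, onset):
    print(ts.isoformat(), msg)
```

Events after the onset (such as the restart here) are excluded on purpose: they are more likely a reaction to the incident than its cause.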
Cross-service dependencies:
elastic es query 'FROM traces-* | WHERE @timestamp > NOW() - 1 HOUR AND service.name == "<service>" | STATS avg_duration = AVG(transaction.duration.us), failed_traces = COUNT_DISTINCT(CASE(event.outcome == "failure", trace.id)) BY service.target.name | SORT avg_duration DESC'
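The aggregation above counts distinct failed traces per dependency; it does not divide by a total, so it is not yet a rate. A minimal sketch of the missing step, assuming a total trace count per dependency has been fetched separately (the function name and input shape are hypothetical):

```python
from typing import Dict, List, Tuple


def rank_by_failure_rate(stats: Dict[str, Tuple[int, int]]) -> List[Tuple[str, float]]:
    """Rank dependencies by failed/total traces, highest rate first.

    Hypothetical helper: `stats` maps dependency name -> (failed, total).
    """
    rates = [
        (name, failed / total if total else 0.0)
        for name, (failed, total) in stats.items()
    ]
    return sorted(rates, key=lambda item: item[1], reverse=True)


# Illustrative numbers only.
print(rank_by_failure_rate({"postgres": (5, 1000), "redis": (40, 200)}))
```

Ranking by rate rather than raw count keeps a busy but healthy dependency from hiding a quiet but failing one.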
Infrastructure metrics:
elastic es query 'FROM metrics-system.cpu-* | WHERE @timestamp > NOW() - 2 HOURS | STATS cpu = AVG(system.cpu.total.norm.pct) BY host.name, bucket = BUCKET(@timestamp, 5 minute) | WHERE cpu > 0.8 | SORT bucket'
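The query above returns every 5-minute bucket where a host averaged over 80% CPU, so a single brief spike looks the same as sustained saturation. One way to separate the two is to require several consecutive hot buckets. The sketch below assumes each row has been reduced to `(host, bucket_index)`, where the index counts 5-minute buckets; the helper name and the default of 3 consecutive buckets are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def sustained_hot_hosts(rows: List[Tuple[str, int]], min_consecutive: int = 3) -> List[str]:
    """Return hosts with at least `min_consecutive` adjacent hot buckets.

    Hypothetical helper: `rows` holds (host, bucket_index) for buckets
    that already passed the cpu > 0.8 filter.
    """
    by_host: Dict[str, Set[int]] = defaultdict(set)
    for host, index in rows:
        by_host[host].add(index)

    hot = []
    for host, indexes in by_host.items():
        run = best = 0
        previous = None
        for index in sorted(indexes):
            # Extend the run only when this bucket directly follows the last one.
            run = run + 1 if previous is not None and index == previous + 1 else 1
            best = max(best, run)
            previous = index
        if best >= min_consecutive:
            hot.append(host)
    return sorted(hot)


# Illustrative rows: web-1 is hot for 15 minutes straight, web-2 only spikes twice.
rows = [("web-1", 0), ("web-1", 1), ("web-1", 2), ("web-2", 0), ("web-2", 4)]
print(sustained_hot_hosts(rows))  # ['web-1']
```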
Network connectivity:
elastic es query 'FROM metrics-system.network-* | WHERE @timestamp > NOW() - 1 HOUR | STATS dropped = SUM(system.network.in.dropped), errors = SUM(system.network.in.errors) BY host.name | WHERE dropped > 0 OR errors > 0'
| Symptom | Check | Likely Cause |
|---|---|---|
| High latency across services | CPU/memory metrics | Resource exhaustion |
| Intermittent 5xx errors | Dependency health | Downstream service failure |
| Connection timeouts | Network metrics | Network partition or DNS issue |
| Gradual degradation | Disk/memory trends | Resource leak |
| Sudden spike then recovery | Deploy logs | Bad deployment (auto-rolled-back) |
After identifying root cause, document: