| name | debug-with-grafana |
| description | Structured diagnostic workflow for debugging application issues using Grafana observability data. Use when the user reports errors, latency spikes, service degradation, HTTP 500s, or wants to investigate why a service is behaving unexpectedly. Triggers for: "my API is returning 500 errors", "latency is spiking", "service seems down", "help me debug using Grafana", "investigate why requests are failing", "something is wrong with my service".
Debug with Grafana
A structured 7-step diagnostic workflow for debugging application issues using
Prometheus metrics, Loki logs, and Grafana resources. Follow the steps in order —
each step informs the next.
Prerequisites
gcx must be installed and configured with a valid context before running
any commands. If not configured, use the setup-gcx skill first:
gcx config view
gcx config use-context <context-name>
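To guard the whole workflow, a minimal pre-flight check helps. This sketch
assumes gcx config view exits non-zero when no context is configured; verify
that behavior against your gcx version:
if ! gcx config view >/dev/null 2>&1; then
  echo "gcx is not configured; run the setup-gcx skill first" >&2
  exit 1
fi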
Diagnostic Workflow
Step 1: Discover Datasources
List all available datasources to identify Prometheus and Loki UIDs. All
subsequent query commands require a datasource UID via -d <uid>.
gcx datasources list -o json
gcx datasources list -t prometheus -o json
gcx datasources list -t loki -o json
PROM_UID=$(gcx datasources list -t prometheus -o json 2>/dev/null | \
python3 -c "import json,sys; print(json.load(sys.stdin)['datasources'][0]['uid'])")
LOKI_UID=$(gcx datasources list -t loki -o json 2>/dev/null | \
python3 -c "import json,sys; print(json.load(sys.stdin)['datasources'][0]['uid'])")
Expected output shape:
{
"datasources": [
{"uid": "<uid>", "name": "<display-name>", "type": "prometheus", ...},
{"uid": "<uid>", "name": "<display-name>", "type": "loki", ...}
]
}
If no datasources appear, confirm the context is pointing at the correct
Grafana instance. See references/error-recovery.md for auth and
datasource-not-found recovery patterns.
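The UID extraction above blindly takes the first datasource of each type. If
several exist, a sketch that selects by display name is safer; <display-name>
must match the name field shown in the output shape above:
PROM_UID=$(gcx datasources list -t prometheus -o json 2>/dev/null | \
python3 -c "import json,sys; print(next(d['uid'] for d in json.load(sys.stdin)['datasources'] if d['name']=='<display-name>'))")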
JSON output piping: When piping gcx output through external tools, never
use 2>&1 — gcx writes hints to stderr that break JSON parsers. Use
2>/dev/null to suppress stderr, or use --json field1,field2 to select
fields directly without piping:
gcx datasources list -t prometheus --json uid
gcx metrics query -d <prom-uid> 'up' --json metric,value
Use --json list to discover available fields for any command.
Step 2: Confirm Data Availability
Before querying specific metrics, confirm the target service is instrumented
and data is flowing. This avoids wasting time on empty results.
gcx metrics query -d <prom-uid> 'up' -o json
gcx metrics labels -d <prom-uid> -l job -o json
gcx logs labels -d <loki-uid> -l job -o json
gcx logs series -d <loki-uid> -M '{job="<service-name>"}' -o json
gcx metrics query -d <prom-uid> 'up{job="<service-name>"}' -o json
Expected output shape:
{
"status": "success",
"data": {
"resultType": "vector",
"result": [
{"metric": {"__name__": "up", "job": "<service-name>", "instance": "<host:port>"}, "value": [<timestamp>, "<0-or-1>"]}
]
}
}
A value of "0" means the service is down or not being scraped. Empty
result array means the metric is absent — see Failure Mode 3 in
references/error-recovery.md.
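To list only the instances that are down, a small jq filter over the vector
shape above works (a sketch; adapt the selector to your labels):
gcx metrics query -d <prom-uid> 'up{job="<service-name>"}' -o json 2>/dev/null | \
jq -r '.data.result[] | select(.value[1] == "0") | .metric.instance'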
Step 3: Query Error Rates
Query the HTTP 5xx error rate over the relevant time window to establish
whether an error spike exists and when it began.
gcx metrics query -d <prom-uid> \
'rate(http_requests_total{job="<service-name>",status=~"5.."}[5m])' \
--from now-1h --to now --step 1m -o json
gcx metrics query -d <prom-uid> \
'rate(http_requests_total{job="<service-name>",status=~"5.."}[5m])' \
--from now-1h --to now --step 1m -o graph
gcx metrics query -d <prom-uid> \
'rate(http_requests_total{job="<service-name>",status=~"5.."}[5m]) / rate(http_requests_total{job="<service-name>"}[5m])' \
--from now-1h --to now --step 1m -o json
gcx metrics query -d <prom-uid> \
'sum by(status) (rate(http_requests_total{job="<service-name>"}[5m]))' \
--from now-1h --to now --step 1m -o json
Expected output shape (matrix for range queries):
{
"status": "success",
"data": {
"resultType": "matrix",
"result": [
{
"metric": {"job": "<service-name>", "status": "<code>"},
"values": [[<timestamp>, "<rate>"], ...]
}
]
}
}
Note the timestamp where the rate increases — this is the incident start time.
Use this window in subsequent steps.
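To pin down the onset programmatically, scan the matrix for the first sample
above a threshold. A sketch against the output shape above; the 0.1 threshold
is an arbitrary placeholder, so pick a value above your normal baseline:
gcx metrics query -d <prom-uid> \
  'rate(http_requests_total{job="<service-name>",status=~"5.."}[5m])' \
  --from now-1h --to now --step 1m -o json 2>/dev/null | \
python3 -c "
import json, sys
THRESHOLD = 0.1  # placeholder; set above your baseline error rate
for series in json.load(sys.stdin)['data']['result']:
    for ts, val in series['values']:
        if float(val) > THRESHOLD:
            print(series['metric'], ts)  # first sample above threshold
            break
"
The same sketch applies to the Step 4 latency matrix; comparing the two onset
timestamps shows which signal moved first.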
Step 4: Query Latency
Query request latency to determine whether the service is slow (latency issue)
or failing fast (error issue). High latency often precedes error spikes.
gcx metrics query -d <prom-uid> \
'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="<service-name>"}[5m]))' \
--from now-1h --to now --step 1m -o json
gcx metrics query -d <prom-uid> \
'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="<service-name>"}[5m]))' \
--from now-1h --to now --step 1m -o graph
gcx metrics query -d <prom-uid> \
'rate(http_request_duration_seconds_sum{job="<service-name>"}[5m]) / rate(http_request_duration_seconds_count{job="<service-name>"}[5m])' \
--from now-1h --to now --step 1m -o json
gcx metrics query -d <prom-uid> \
'histogram_quantile(0.95, sum by(le, handler) (rate(http_request_duration_seconds_bucket{job="<service-name>"}[5m])))' \
--from now-1h --to now --step 1m -o json
Expected output shape:
{
"status": "success",
"data": {
"resultType": "matrix",
"result": [
{
"metric": {"job": "<service-name>"},
"values": [[<timestamp>, "<seconds>"], ...]
}
]
}
}
Compare the latency onset time with the error onset time from Step 3. If
latency rose before errors, a dependency or resource constraint is likely.
Step 5: Correlate Logs
Query Loki for error logs in the time window identified in Steps 3 and 4.
Logs provide the specific error messages, stack traces, and context that
metrics cannot.
gcx logs query -d <loki-uid> \
'{job="<service-name>"} |= "error"' \
--from now-1h --to now -o json
gcx logs query -d <loki-uid> \
'{job="<service-name>"} | json | level="error"' \
--from now-1h --to now -o json
gcx logs query -d <loki-uid> \
'count_over_time({job="<service-name>"} |= "error" [5m])' \
--from now-1h --to now --step 1m -o json
gcx logs query -d <loki-uid> \
'{job="<service-name>"} |~ "timeout|connection refused|OOM|panic"' \
--from now-1h --to now -o json
Expected output shape (streams):
{
"status": "success",
"data": {
"resultType": "streams",
"result": [
{
"stream": {"job": "<service-name>", "level": "<level>"},
"values": [["<ns-timestamp>", "<log-line>"], ...]
}
]
}
}
LogQL pitfall: Loki requires at least one non-empty label matcher in the
stream selector. {} and {} |~ "pattern" will be rejected. Always include
at least one label, e.g., {job=~".+"} as a catch-all.
Look for:
- Repeated error messages pointing to a specific code path or dependency
  (see the sketch after this list)
- Timestamps of first error matching the metric spike time from Step 3
- Stack traces or panic messages that identify the root cause
- Upstream service names in error messages (database, external APIs)
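For the repeated-messages check, one approach is to count identical lines in
the streams result, as sketched here; truncating each line first groups
near-duplicates that differ only in trailing detail:
gcx logs query -d <loki-uid> '{job="<service-name>"} |= "error"' \
  --from now-1h --to now -o json 2>/dev/null | \
jq -r '.data.result[].values[][1]' | cut -c1-120 | sort | uniq -c | sort -rn | head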
Step 5b: Correlate Traces (if Tempo is available)
If a Tempo datasource exists, search for traces matching the incident window.
Traces show individual request paths and identify slow or failing spans.
gcx datasources list -t tempo -o json
gcx traces query -d <tempo-uid> '{ status = error }' --from now-1h --to now
gcx traces query -d <tempo-uid> '{ resource.service.name = "<service-name>" }' --from now-1h --to now
gcx traces query -d <tempo-uid> \
'{ resource.service.name = "<service-name>" && duration > 1s }' \
--from now-1h --to now
gcx traces get -d <tempo-uid> <trace-id>
gcx traces get -d <tempo-uid> <trace-id> --llm
TraceQL attribute scoping: Tempo requires scoped attribute names. Use
resource. for resource-level attributes and span. for span-level:
resource.service.name (not service.name)
span.http.status_code (not http.status_code)
Use name (unscoped) for the span name, duration for span duration,
and status for span status. Use trace:rootService and trace:rootName
for root span attributes (not rootServiceName or rootTraceName).
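A query that combines these scoping rules, as a sketch (the service name and
status threshold are placeholders):
gcx traces query -d <tempo-uid> \
  '{ trace:rootService = "<service-name>" && span.http.status_code >= 500 }' \
  --from now-1h --to now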
Discover available labels:
gcx traces labels -d <tempo-uid>
gcx traces labels -d <tempo-uid> -l resource.service.name
Common mistake: gcx traces labels -l service.name will fail — Tempo
parses the dot as an identifier boundary. Always fully qualify:
-l resource.service.name, not -l service.name.
See references/traceql-patterns.md for the full TraceQL syntax reference.
Step 6: Check Related Dashboards and Resources
Check whether relevant dashboards exist that give broader context, and inspect
related Grafana resources that may explain the issue (e.g., alert rules that
are firing).
gcx alert rules list -o json | jq '.[] | .rules[]? | select(.labels.job == "<service-name>")'
gcx resources pull dashboards -o json
gcx resources get dashboards -o json | jq '.items[] | select(.metadata.name | test("<service-name>"; "i"))'
gcx resources get dashboards/<dashboard-uid> -o json
Capture a visual snapshot of a relevant dashboard
If a relevant dashboard UID is known, capture a PNG snapshot to visually
inspect panel layout and current state. This is especially useful when
diagnosing layout regressions, missing data, or anomalous panel values.
gcx resources get dashboards/<dashboard-uid> -o json | \
jq '.spec.templating.list[] | {name, type, current: .current.value}'
gcx dashboards snapshot <dashboard-uid> --output-dir ./debug-snapshots \
--var cluster=<cluster> --var job=<service-name> --since 1h
gcx dashboards snapshot <dashboard-uid> --from now-1h --to now \
--var cluster=<cluster> --var job=<service-name> --output-dir ./debug-snapshots
gcx dashboards snapshot <dashboard-uid> --panel <panel-id> \
--output-dir ./debug-snapshots
Cross-reference with metrics and logs:
- Are there alert rules in a firing or pending state for this service?
- Do existing dashboards show additional signals (queue depth, DB connections,
memory pressure)?
- Do dashboard panel queries reveal which metrics are being monitored?
- Does the dashboard snapshot show unexpected panel states or missing data?
Step 7: Summarize Findings
After completing Steps 1-6, synthesize the findings into a clear diagnostic
summary for the user.
Structure the summary as:
Service: <service-name>
Time window: <from> to <to>
Incident start: <timestamp from error rate onset>
Error signal:
- Error rate: <trend description, not fabricated value>
- Status codes: <which codes are elevated>
Latency signal:
- P95 latency: <trend description>
- Latency onset: <before/after/same time as errors>
Log evidence:
- Error pattern: <recurring message or exception>
- First occurrence: <timestamp>
- Frequency: <how often in the window>
Related resources:
- Firing alerts: <names or "none found">
- Relevant dashboards: <names or UIDs>
Likely root cause:
- <Primary hypothesis based on all signals>
Recommended next actions:
1. <Specific action — check dependency, review deploy, inspect resource usage>
2. <Additional action>
Use -o graph for any visualizations shared with the user. Use -o json for
data retrieved for your own analysis.
Example Scenarios
Scenario 1: HTTP 500 Error Spike
Trigger: User reports "my API started returning 500 errors 30 minutes ago".
Command sequence:
gcx datasources list -t prometheus -o json
gcx datasources list -t loki -o json
gcx metrics query -d <prom-uid> 'up{job="api"}' -o json
gcx metrics query -d <prom-uid> \
'rate(http_requests_total{job="api",status=~"5.."}[5m])' \
--from now-2h --to now --step 1m -o graph
gcx metrics query -d <prom-uid> \
'sum by(status) (rate(http_requests_total{job="api"}[5m]))' \
--from now-2h --to now --step 1m -o json
gcx metrics query -d <prom-uid> \
'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="api"}[5m]))' \
--from now-2h --to now --step 1m -o graph
gcx logs query -d <loki-uid> \
'{job="api"} |= "error"' \
--from now-2h --to now -o json
gcx alert rules list -o json | jq '.[] | .rules[]? | select(.state == "firing")'
Expected output shape at Step 3 (matrix):
{
"status": "success",
"data": {
"resultType": "matrix",
"result": [
{
"metric": {"job": "api", "status": "500"},
"values": [[<timestamp>, "<rate>"], ...]
}
]
}
}
Interpretation: Look for the timestamp where the values array shows the rate
increasing from baseline. Match this to log timestamps in Step 5.
Scenario 2: Latency Degradation
Trigger: User reports "requests are taking much longer than usual, no errors yet".
Command sequence:
gcx datasources list -t prometheus -o json
gcx metrics query -d <prom-uid> 'up{job="api"}' -o json
gcx metrics query -d <prom-uid> \
'rate(http_requests_total{job="api",status=~"5.."}[5m])' \
--from now-1h --to now --step 1m -o json
gcx metrics query -d <prom-uid> \
'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="api"}[5m]))' \
--from now-2h --to now --step 1m -o graph
gcx metrics query -d <prom-uid> \
'histogram_quantile(0.95, sum by(le, handler) (rate(http_request_duration_seconds_bucket{job="api"}[5m])))' \
--from now-1h --to now --step 1m -o json
gcx logs query -d <loki-uid> \
'{job="api"} |~ "timeout|slow|waiting"' \
--from now-2h --to now -o json
gcx metrics query -d <prom-uid> \
'rate(db_query_duration_seconds_sum{job="api"}[5m]) / rate(db_query_duration_seconds_count{job="api"}[5m])' \
--from now-2h --to now --step 1m -o json
Expected output shape at Step 4 (matrix of quantile values):
{
"status": "success",
"data": {
"resultType": "matrix",
"result": [
{
"metric": {"job": "api"},
"values": [[<timestamp>, "<seconds>"], ...]
}
]
}
}
Interpretation: Rising latency across all endpoints suggests a shared
resource or dependency. Rising latency for one endpoint only suggests a
handler-specific issue. Compare the latency onset time with log timestamps.
Scenario 3: Service Down / No Data
Trigger: User reports "service seems completely down" or dashboard shows no data.
Command sequence:
gcx datasources list -o json
gcx metrics query -d <prom-uid> 'up{job="api"}' -o json
gcx metrics labels -d <prom-uid> -l job -o json
gcx metrics query -d <prom-uid> \
'absent(up{job="api"})' \
--from now-1h --to now --step 1m -o json
gcx metrics query -d <prom-uid> \
'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="api"}[5m]))' \
--from now-3h --to now --step 5m -o graph
gcx logs query -d <loki-uid> \
'{job="api"}' \
--from now-3h --to now -o json
gcx logs query -d <loki-uid> \
'{job="api"} |~ "panic|OOM|killed|crashed|SIGTERM"' \
--from now-3h --to now -o json
gcx alert rules list -o json | jq '.[] | .rules[]? | select(.state == "firing")'
Expected output shape when service is down (up=0):
{
"status": "success",
"data": {
"resultType": "vector",
"result": [
{
"metric": {"__name__": "up", "job": "api", "instance": "<host:port>"},
"value": [<timestamp>, "0"]
}
]
}
}
Expected output shape when service was never scraped (absent):
{
"status": "success",
"data": {
"resultType": "vector",
"result": []
}
}
Interpretation:
- up=0: Service is registered but failing health checks — check pod/process status
- Empty result for up{job="api"}: Job never existed or was removed from scrape config
- Data present up to a specific timestamp then absent: Service crashed at that
  time — correlate with crash logs (see the sketch below)
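For the third case, one way to recover the crash time is to take the last
timestamp that still has data in a range query (a sketch over the matrix
shape used in earlier steps):
gcx metrics query -d <prom-uid> 'up{job="api"}' \
  --from now-3h --to now --step 1m -o json 2>/dev/null | \
jq -r '.data.result[0].values | last | .[0]'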
References
- references/error-recovery.md — Recovery patterns for auth errors (401/403),
  datasource not found, empty results, query timeouts, and malformed
  PromQL/LogQL syntax.
- references/query-patterns.md — Advanced query patterns for Prometheus and
  Loki datasources, including time range formats, aggregation patterns, Loki
  stream operators, and output format reference.
- references/traceql-patterns.md — TraceQL query patterns for Tempo trace
  search, attribute scoping rules, and the distinction between traces query
  and traces get.