# dba-diagnose-postgres
// Use when diagnosing PostgreSQL DBA incidents in KubeBlocks, including replication, locks, data, connections, performance, and KubeBlocks integration symptoms.
| Field | Value |
|---|---|
| name | dba-diagnose-postgres |
| description | Use when diagnosing PostgreSQL DBA incidents in KubeBlocks, including replication, locks, data, connections, performance, and KubeBlocks integration symptoms. |
Reference resolution: when this source-derived skill mentions docs/..., resolve it from the shared support package beside the installed user skills: ~/.codex/skills/kubeblocks-addon-source-docs/docs/... for Codex or ~/.claude/skills/kubeblocks-addon-source-docs/docs/... for Claude Code. In the shared kubeblocks-addon-docs checkout, the same files live under skills/kubeblocks-addon-source-docs/docs/.... When it mentions scripts/..., resolve it from the same support package under scripts/.... If you are working inside a checkout of the original apecloud/kubeblocks-addon-skills, repo-relative paths are also valid.
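A minimal shell sketch of that resolution order (paths are the ones listed above; `DOCS_BASE` is a hypothetical variable name introduced here for illustration):

```shell
# Hypothetical resolver: pick the first existing docs directory from the
# candidate locations above (Codex, Claude Code, shared checkout, repo-relative).
DOCS_BASE=""
for base in \
  "$HOME/.codex/skills/kubeblocks-addon-source-docs/docs" \
  "$HOME/.claude/skills/kubeblocks-addon-source-docs/docs" \
  "skills/kubeblocks-addon-source-docs/docs" \
  "docs"; do
  if [ -d "$base" ]; then
    DOCS_BASE="$base"
    break
  fi
done
echo "Docs base: ${DOCS_BASE:-not found}"
```

Any `docs/dba-postgres/...` path mentioned below would then resolve as `$DOCS_BASE/dba-postgres/...`.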
Diagnose a PostgreSQL cluster running on KubeBlocks. Collect live diagnostics, match against the fault diagnosis card library, and recommend next steps.
Target: $ARGUMENTS
(Format: <cluster-name> [namespace] [symptom-description]
Example: my-pg default "replication lag increasing")
CLUSTER=<cluster-name>
KB_NS=<namespace, default: default>
SYMPTOM=<symptom description if provided>
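One possible way to split `$ARGUMENTS` into these three variables (a sketch only; it assumes the format shown above, with the symptom in double quotes):

```shell
# Sketch: parse "<cluster-name> [namespace] [symptom]" from a single string.
# ARGUMENTS is set inline here only for illustration.
ARGUMENTS='my-pg default "replication lag increasing"'
set -- $ARGUMENTS                          # naive whitespace split
CLUSTER=$1
KB_NS=${2:-default}                        # namespace defaults to "default"
[ $# -ge 2 ] && shift 2 || shift $#
SYMPTOM=$(printf '%s' "$*" | tr -d '"')    # remaining words, quotes stripped
echo "cluster=$CLUSTER ns=$KB_NS symptom=$SYMPTOM"
```

A real skill runtime may hand the arguments over pre-split; this is just the shape of the mapping.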
Read the following files to load all known fault patterns:
- docs/dba-postgres/dba-pg-diag-replication.md — PG-REP-### cards
- docs/dba-postgres/dba-pg-diag-locks.md — PG-LOCK-### cards
- docs/dba-postgres/dba-pg-diag-data.md — PG-DATA-### cards
- docs/dba-postgres/dba-pg-diag-connections.md — PG-CONN-### cards
- docs/dba-postgres/dba-pg-diag-performance.md — PG-PERF-### cards
- docs/dba-postgres/dba-pg-diag-kubeblocks.md — PG-KB-### cards

KB_NS=<namespace>
CLUSTER=<cluster-name>
# 1. Get all PG pods
kubectl get pod -n $KB_NS -l app.kubernetes.io/instance=$CLUSTER -o wide
# 2. Find primary pod
LEADER=$(kubectl get pod -n $KB_NS \
-l app.kubernetes.io/instance=$CLUSTER,kubeblocks.io/role=primary \
-o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
echo "Primary pod: $LEADER"
# 3. PostgreSQL version
kubectl exec -n $KB_NS $LEADER -c postgresql -- \
psql -U postgres -Atc "SELECT version();" 2>/dev/null
# 4. Cluster phase
kubectl get cluster -n $KB_NS $CLUSTER -o jsonpath='{.status.phase}' 2>/dev/null
# 5. Patroni status
kubectl exec -n $KB_NS $LEADER -c postgresql -- \
patronictl -c /etc/patroni/postgres.yaml list 2>/dev/null
# 6. Recent PG logs (last 200 lines, errors/warnings only)
kubectl logs -n $KB_NS $LEADER -c postgresql --tail=200 2>/dev/null \
| grep -E "ERROR|FATAL|PANIC|WARNING|HINT" | tail -30
# 7. Active connections summary
kubectl exec -n $KB_NS $LEADER -c postgresql -- \
psql -U postgres -c "
SELECT count(*) AS total,
sum(CASE WHEN state='active' THEN 1 ELSE 0 END) AS active,
sum(CASE WHEN state='idle in transaction' THEN 1 ELSE 0 END) AS idle_in_txn,
sum(CASE WHEN wait_event IS NOT NULL THEN 1 ELSE 0 END) AS waiting
FROM pg_stat_activity WHERE pid != pg_backend_pid();
" 2>/dev/null
# 8. Replication status
kubectl exec -n $KB_NS $LEADER -c postgresql -- \
psql -U postgres -c "
SELECT application_name, state, sync_state,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)) AS lag
FROM pg_stat_replication;
" 2>/dev/null
# 9. Logical replication slots WAL retention
kubectl exec -n $KB_NS $LEADER -c postgresql -- \
psql -U postgres -c "
SELECT slot_name, active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS wal_behind
FROM pg_replication_slots;
" 2>/dev/null
# 10. Lock waits
kubectl exec -n $KB_NS $LEADER -c postgresql -- \
psql -U postgres -c "
SELECT pid, wait_event_type, wait_event, state, left(query,80) AS query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL AND pid != pg_backend_pid()
LIMIT 10;
" 2>/dev/null
# 11. XID age (wraparound risk)
kubectl exec -n $KB_NS $LEADER -c postgresql -- \
psql -U postgres -c "
SELECT datname, age(datfrozenxid) AS xid_age,
2000000000 - age(datfrozenxid) AS xid_remaining
FROM pg_database ORDER BY age(datfrozenxid) DESC;
" 2>/dev/null
# 12. Archive status
kubectl exec -n $KB_NS $LEADER -c postgresql -- \
psql -U postgres -c "
SELECT archived_count, failed_count, last_failed_wal, last_failed_time,
last_archived_wal, last_archived_time
FROM pg_stat_archiver;
" 2>/dev/null
# 13. Disk usage (pg_wal size)
kubectl exec -n $KB_NS $LEADER -c postgresql -- \
du -sh /var/lib/postgresql/data/pgdata/pg_wal/ 2>/dev/null
# 14. Recent OpsRequests
kubectl get opsrequest -n $KB_NS \
-l app.kubernetes.io/instance=$CLUSTER \
--sort-by=.metadata.creationTimestamp 2>/dev/null | tail -10
Based on the collected diagnostics, analyze the following signals:
| Signal | Check | Potential Cards |
|---|---|---|
| Patroni list shows missing Leader | patronictl list | PG-REP-003, PG-REP-004, PG-KB-002 |
| Replication lag > 1MB | pg_stat_replication.lag | PG-REP-001, PG-REP-002 |
| Logical slot wal_behind > 1GB | pg_replication_slots | PG-REP-002 |
| ERROR/PANIC in logs | PG log | Match log message to card symptoms |
| idle in transaction > 10 | pg_stat_activity | PG-CONN-003, PG-LOCK-005 |
| wait_event = SyncRep | pg_stat_activity | PG-REP-004 |
| wait_event = Lock* | pg_stat_activity | PG-LOCK-001, PG-LOCK-005 |
| xid_age > 150,000,000 | pg_database | PG-PERF-004 |
| failed_count > 0 in archiver | pg_stat_archiver | PG-KB-001 |
| archive_command = /bin/true | SHOW archive_command | PG-KB-001 |
| autovacuum running > 2h | pg_stat_activity | PG-LOCK-003 |
| OpsRequest stuck in Running | kubectl get opsrequest | PG-KB-004 |
| temp_files increasing | pg_stat_database | PG-PERF-005 |
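As an illustration, the XID-age signal could be checked mechanically with a small helper (a sketch; `check_xid_age` is a hypothetical function, and the 150,000,000 cutoff is taken from the signal table):

```shell
# Sketch: map one signal (max XID age) to its fault card. Anything over the
# table's 150,000,000 threshold flags PG-PERF-004; empty/non-numeric input
# is treated as "ok" so a failed query does not raise a false alarm.
check_xid_age() {
  threshold=150000000
  if [ -n "$1" ] && [ "$1" -gt "$threshold" ] 2>/dev/null; then
    echo "PG-PERF-004"
  else
    echo "ok"
  fi
}

# Wiring it to the live query from step 11 (KB_NS/LEADER assumed set earlier):
# check_xid_age "$(kubectl exec -n $KB_NS $LEADER -c postgresql -- \
#   psql -U postgres -Atc 'SELECT max(age(datfrozenxid)) FROM pg_database;')"
```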
For each matched card, record the card ID (for example, PG-REP-002), the matching evidence, and a confidence level. For each high-confidence match, run the Diagnostic Steps from that card to confirm the diagnosis.
If no card matches, run general deep-dive:
# Full pg_stat_activity
kubectl exec -n $KB_NS $LEADER -c postgresql -- \
psql -U postgres -c "
SELECT pid, usename, datname, state, wait_event_type, wait_event,
now() - query_start AS duration, left(query, 120) AS query
FROM pg_stat_activity
WHERE pid != pg_backend_pid()
ORDER BY duration DESC NULLS LAST
LIMIT 30;
"
# Tables with most dead tuples
kubectl exec -n $KB_NS $LEADER -c postgresql -- \
psql -U postgres -c "
SELECT schemaname, tablename, n_dead_tup, n_live_tup,
last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 10;
"
## PostgreSQL Diagnosis Report
Cluster: <name>
Namespace: <ns>
PG Version: <version>
Patroni Status: <Leader/Replica/count>
Reported Symptom: <symptom>
### Matched Fault Cards
#### [HIGH] PG-XXX-NNN — <card title>
**Evidence:** <specific output that matches>
**Confidence:** High
**Recommended Action:** <from card's fix section>
#### [MEDIUM] PG-XXX-NNN — <card title>
**Evidence:** <specific output>
**Confidence:** Medium
**Recommended Action:** <action>
### Signal Summary
| Signal | Value | Status |
|---|---|---|
| Patroni Leader | <name or NONE> | ✓ / ⚠ |
| Replication lag | <value> | ✓ / ⚠ |
| Idle in transaction | <count> | ✓ / ⚠ |
| Lock waits | <count> | ✓ / ⚠ |
| XID age (max) | <value> | ✓ / ⚠ |
| WAL archiver | <ok/failing> | ✓ / ⚠ |
| Logical slot WAL behind | <size> | ✓ / ⚠ |
### Immediate Actions Required
1. <most urgent action>
2. <second action>
### No Issues Found
(If all signals are green and no cards match)
All monitored signals are within normal range.
Suggest: check application-level metrics or provide more specific symptom description.
If the diagnosis fails or remains unclear: ask the user for more specific symptoms, or suggest running /diagnose for broader cluster diagnostics.