# dba-diagnose-postgres
// Use when diagnosing PostgreSQL DBA incidents in KubeBlocks, including replication, locks, data, connections, performance, and KubeBlocks integration symptoms.
| Field | Value |
|---|---|
| name | dba-diagnose-postgres |
| description | Use when diagnosing PostgreSQL DBA incidents in KubeBlocks, including replication, locks, data, connections, performance, and KubeBlocks integration symptoms. |
Reference resolution: when this source-derived skill mentions docs/..., resolve it from the shared support package beside the installed user skills: ~/.codex/skills/kubeblocks-addon-source-docs/docs/... for Codex or ~/.claude/skills/kubeblocks-addon-source-docs/docs/... for Claude Code. In the shared kubeblocks-addon-docs checkout, the same files live under skills/kubeblocks-addon-source-docs/docs/.... When it mentions scripts/..., resolve it from the same support package under scripts/.... If you are working inside a checkout of the original apecloud/kubeblocks-addon-skills, repo-relative paths are also valid.
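A minimal shell sketch of that resolution order (paths are the ones listed above; `DOCS_BASE` is a hypothetical variable name introduced here for illustration):

```shell
# Hypothetical resolver: pick the first existing docs directory from the
# candidate locations above (Codex, Claude Code, shared checkout, repo-relative).
DOCS_BASE=""
for base in \
  "$HOME/.codex/skills/kubeblocks-addon-source-docs/docs" \
  "$HOME/.claude/skills/kubeblocks-addon-source-docs/docs" \
  "skills/kubeblocks-addon-source-docs/docs" \
  "docs"; do
  if [ -d "$base" ]; then
    DOCS_BASE="$base"
    break
  fi
done
echo "Docs base: ${DOCS_BASE:-not found}"
```

Any `docs/dba-postgres/...` path mentioned below would then resolve as `$DOCS_BASE/dba-postgres/...`.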
Diagnose a PostgreSQL cluster running on KubeBlocks. Collect live diagnostics, match against the fault diagnosis card library, and recommend next steps.
Target: $ARGUMENTS
(Format: <cluster-name> [namespace] [symptom-description]
Example: my-pg default "replication lag increasing")
CLUSTER=<cluster-name>
KB_NS=<namespace, default: default>
SYMPTOM=<symptom description if provided>
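One possible way to split `$ARGUMENTS` into these three variables (a sketch only; it assumes the format shown above, with the symptom in double quotes):

```shell
# Sketch: parse "<cluster-name> [namespace] [symptom]" from a single string.
# ARGUMENTS is set inline here only for illustration.
ARGUMENTS='my-pg default "replication lag increasing"'
set -- $ARGUMENTS                          # naive whitespace split
CLUSTER=$1
KB_NS=${2:-default}                        # namespace defaults to "default"
[ $# -ge 2 ] && shift 2 || shift $#
SYMPTOM=$(printf '%s' "$*" | tr -d '"')    # remaining words, quotes stripped
echo "cluster=$CLUSTER ns=$KB_NS symptom=$SYMPTOM"
```

A real skill runtime may hand the arguments over pre-split; this is just the shape of the mapping.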
Read the following files to load all known fault patterns:
- docs/dba-postgres/dba-pg-diag-replication.md — PG-REP-### cards
- docs/dba-postgres/dba-pg-diag-locks.md — PG-LOCK-### cards
- docs/dba-postgres/dba-pg-diag-data.md — PG-DATA-### cards
- docs/dba-postgres/dba-pg-diag-connections.md — PG-CONN-### cards
- docs/dba-postgres/dba-pg-diag-performance.md — PG-PERF-### cards
- docs/dba-postgres/dba-pg-diag-kubeblocks.md — PG-KB-### cards

KB_NS=<namespace>
CLUSTER=<cluster-name>
# 1. Get all PG pods
kubectl get pod -n $KB_NS -l app.kubernetes.io/instance=$CLUSTER -o wide
# 2. Find primary pod
LEADER=$(kubectl get pod -n $KB_NS \
-l app.kubernetes.io/instance=$CLUSTER,kubeblocks.io/role=primary \
-o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
echo "Primary pod: $LEADER"
# 3. PostgreSQL version
kubectl exec -n $KB_NS $LEADER -c postgresql -- \
psql -U postgres -Atc "SELECT version();" 2>/dev/null
# 4. Cluster phase
kubectl get cluster -n $KB_NS $CLUSTER -o jsonpath='{.status.phase}' 2>/dev/null
# 5. Patroni status
kubectl exec -n $KB_NS $LEADER -c postgresql -- \
patronictl -c /etc/patroni/postgres.yaml list 2>/dev/null
# 6. Recent PG logs (last 200 lines, errors/warnings only)
kubectl logs -n $KB_NS $LEADER -c postgresql --tail=200 2>/dev/null \
| grep -E "ERROR|FATAL|PANIC|WARNING|HINT" | tail -30
# 7. Active connections summary
kubectl exec -n $KB_NS $LEADER -c postgresql -- \
psql -U postgres -c "
SELECT count(*) AS total,
sum(CASE WHEN state='active' THEN 1 ELSE 0 END) AS active,
sum(CASE WHEN state='idle in transaction' THEN 1 ELSE 0 END) AS idle_in_txn,
sum(CASE WHEN wait_event IS NOT NULL THEN 1 ELSE 0 END) AS waiting
FROM pg_stat_activity WHERE pid != pg_backend_pid();
" 2>/dev/null
# 8. Replication status
kubectl exec -n $KB_NS $LEADER -c postgresql -- \
psql -U postgres -c "
SELECT application_name, state, sync_state,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)) AS lag
FROM pg_stat_replication;
" 2>/dev/null
# 9. Logical replication slots WAL retention
kubectl exec -n $KB_NS $LEADER -c postgresql -- \
psql -U postgres -c "
SELECT slot_name, active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS wal_behind
FROM pg_replication_slots;
" 2>/dev/null
# 10. Lock waits
kubectl exec -n $KB_NS $LEADER -c postgresql -- \
psql -U postgres -c "
SELECT pid, wait_event_type, wait_event, state, left(query,80) AS query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL AND pid != pg_backend_pid()
LIMIT 10;
" 2>/dev/null
# 11. XID age (wraparound risk)
kubectl exec -n $KB_NS $LEADER -c postgresql -- \
psql -U postgres -c "
SELECT datname, age(datfrozenxid) AS xid_age,
2000000000 - age(datfrozenxid) AS xid_remaining
FROM pg_database ORDER BY age(datfrozenxid) DESC;
" 2>/dev/null
# 12. Archive status
kubectl exec -n $KB_NS $LEADER -c postgresql -- \
psql -U postgres -c "
SELECT archived_count, failed_count, last_failed_wal, last_failed_time,
last_archived_wal, last_archived_time
FROM pg_stat_archiver;
" 2>/dev/null
# 13. Disk usage (pg_wal size)
kubectl exec -n $KB_NS $LEADER -c postgresql -- \
du -sh /var/lib/postgresql/data/pgdata/pg_wal/ 2>/dev/null
# 14. Recent OpsRequests
kubectl get opsrequest -n $KB_NS \
-l app.kubernetes.io/instance=$CLUSTER \
--sort-by=.metadata.creationTimestamp 2>/dev/null | tail -10
Based on the collected diagnostics, analyze the following signals:
| Signal | Check | Potential Cards |
|---|---|---|
| Patroni list shows missing Leader | patronictl list | PG-REP-003, PG-REP-004, PG-KB-002 |
| Replication lag > 1MB | pg_stat_replication.lag | PG-REP-001, PG-REP-002 |
| Logical slot wal_behind > 1GB | pg_replication_slots | PG-REP-002 |
| ERROR/PANIC in logs | PG log | Match log message to card symptoms |
| idle in transaction > 10 | pg_stat_activity | PG-CONN-003, PG-LOCK-005 |
| wait_event = SyncRep | pg_stat_activity | PG-REP-004 |
| wait_event = Lock* | pg_stat_activity | PG-LOCK-001, PG-LOCK-005 |
| xid_age > 150,000,000 | pg_database | PG-PERF-004 |
| failed_count > 0 in archiver | pg_stat_archiver | PG-KB-001 |
| archive_command = /bin/true | SHOW archive_command | PG-KB-001 |
| autovacuum running > 2h | pg_stat_activity | PG-LOCK-003 |
| OpsRequest stuck in Running | kubectl get opsrequest | PG-KB-004 |
| temp_files increasing | pg_stat_database | PG-PERF-005 |
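As an illustration, the XID-age signal could be checked mechanically with a small helper (a sketch; `check_xid_age` is a hypothetical function, and the 150,000,000 cutoff is taken from the signal table):

```shell
# Sketch: map one signal (max XID age) to its fault card. Anything over the
# table's 150,000,000 threshold flags PG-PERF-004; empty/non-numeric input
# is treated as "ok" so a failed query does not raise a false alarm.
check_xid_age() {
  threshold=150000000
  if [ -n "$1" ] && [ "$1" -gt "$threshold" ] 2>/dev/null; then
    echo "PG-PERF-004"
  else
    echo "ok"
  fi
}

# Wiring it to the live query from step 11 (KB_NS/LEADER assumed set earlier):
# check_xid_age "$(kubectl exec -n $KB_NS $LEADER -c postgresql -- \
#   psql -U postgres -Atc 'SELECT max(age(datfrozenxid)) FROM pg_database;')"
```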
For each matched card, record the card ID (for example, PG-REP-002), the matching evidence, and a confidence level. For each high-confidence match, run the Diagnostic Steps from that card to confirm the diagnosis.
If no card matches, run general deep-dive:
# Full pg_stat_activity
kubectl exec -n $KB_NS $LEADER -c postgresql -- \
psql -U postgres -c "
SELECT pid, usename, datname, state, wait_event_type, wait_event,
now() - query_start AS duration, left(query, 120) AS query
FROM pg_stat_activity
WHERE pid != pg_backend_pid()
ORDER BY duration DESC NULLS LAST
LIMIT 30;
"
# Tables with most dead tuples
kubectl exec -n $KB_NS $LEADER -c postgresql -- \
psql -U postgres -c "
SELECT schemaname, tablename, n_dead_tup, n_live_tup,
last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 10;
"
## PostgreSQL Diagnosis Report
Cluster: <name>
Namespace: <ns>
PG Version: <version>
Patroni Status: <Leader/Replica/count>
Reported Symptom: <symptom>
### Matched Fault Cards
#### [HIGH] PG-XXX-NNN — <card title>
**Evidence:** <specific output that matches>
**Confidence:** High
**Recommended Action:** <from card's fix section>
#### [MEDIUM] PG-XXX-NNN — <card title>
**Evidence:** <specific output>
**Confidence:** Medium
**Recommended Action:** <action>
### Signal Summary
| Signal | Value | Status |
|---|---|---|
| Patroni Leader | <name or NONE> | ✓ / ⚠ |
| Replication lag | <value> | ✓ / ⚠ |
| Idle in transaction | <count> | ✓ / ⚠ |
| Lock waits | <count> | ✓ / ⚠ |
| XID age (max) | <value> | ✓ / ⚠ |
| WAL archiver | <ok/failing> | ✓ / ⚠ |
| Logical slot WAL behind | <size> | ✓ / ⚠ |
### Immediate Actions Required
1. <most urgent action>
2. <second action>
### No Issues Found
(If all signals are green and no cards match)
All monitored signals are within normal range.
Suggest: check application-level metrics or provide more specific symptom description.
If the diagnosis fails or remains unclear: ask the user for more specific symptoms, or suggest running /diagnose for broader cluster diagnostics.