Name: Diagnose With Observer
Author: multigres

Use the multigres observer to diagnose cluster health issues. Fetch structured diagnostics from the /api/status endpoint, triage findings by severity, correlate root causes, and produce actionable bug reports. Use this whenever the user reports cluster problems, wants to investigate observer findings, needs to debug multigres issues, asks about cluster health, or sees errors in operator or data plane logs.

Stars: 246

Forks: 26

Updated: 20 March 2026, 18:51

Related skills from the same repository:

pin-upstream-images.md

from "multigres/multigres-operator"

Pin multigres container image tags in image_defaults.go for operator releases. Compares upstream multigres code changes between the current and new SHA, highlights breaking changes and new features, then updates the tags. Triggered by user requests like "prepare images for release", "pin image tags", "pin upstream images", or "upgrade multigres images".

2026-05-15 (246 stars)

exercise-cluster.md

from "multigres/multigres-operator"

Deploy MultigresCluster fixtures, run mutation scenarios, and validate health using the observer. Finds bugs in the operator and upstream multigres by exercising real cluster operations and verifying end-to-end health beyond CRD phase status. Use this skill whenever the user wants to test the operator, exercise the cluster, run exerciser scenarios, validate cluster health after changes, find bugs through mutation testing, or deploy and mutate fixtures.

2026-05-13 (246 stars)

generate-commit-message.md

from "multigres/multigres-operator"

generate semantic git commit messages

2026-03-17 (246 stars)

prepare-release.md

from "multigres/multigres-operator"

Prepare a release by analyzing all changes since the last git tag, updating CHANGELOG.md with categorized entries, inferring the next semantic version, and auditing all documentation for staleness or missing content. Triggered by requests like "prepare release", "bump version", "update changelog", "release prep", "version bump", or "prepare changelog".

2026-03-17 (246 stars)

package.json

"author": "multigres"

"repository": "multigres/multigres-operator"

name: diagnose_with_observer

Diagnose with Observer Skill

Goal: Fetch a complete diagnostic snapshot from the observer's API, triage findings, investigate root causes in both operator and upstream multigres code, and produce an actionable bug report.

For systematic bug hunting (deploying fixtures, running mutations, validating health), use the exercise_cluster skill instead. This skill is for reactive investigation — when something is already broken and you need to find out why.

1. Ensure the Observer is Running

KUBECONFIG=kubeconfig.yaml kubectl get pods -l app.kubernetes.io/name=multigres-observer -n multigres-operator

If it is not running, deploy it with make kind-deploy-observer. The observer is also deployed automatically by make kind-deploy.
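When this step is scripted, it can be more convenient to block until the pod reports Ready than to read the `kubectl get pods` output by eye. The sketch below uses `kubectl wait` with the same label and namespace as above. The function name `ensure_observer` and the 120s default timeout are assumptions, not part of the skill.

```shell
# Hypothetical helper: wait until the observer pod is Ready.
# Returns non-zero if no pod matches the label or the timeout expires.
ensure_observer() {
  local timeout="${1:-120s}"   # assumed default, adjust as needed
  KUBECONFIG="${KUBECONFIG:-$(pwd)/kubeconfig.yaml}" kubectl wait pod \
    -l app.kubernetes.io/name=multigres-observer \
    -n multigres-operator \
    --for=condition=Ready \
    --timeout="$timeout"
}
```

A possible use is `ensure_observer 60s || make kind-deploy-observer`.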

2. Fetch the Diagnostic Snapshot

Define the observer helper (stateless kubectl exec — no stale port-forwards to manage):

observer() {
  KUBECONFIG=$(pwd)/kubeconfig.yaml kubectl exec -n multigres-operator deploy/multigres-observer -- curl -sf "http://localhost:9090$1"
}

Fetch the full snapshot:

observer /api/status | jq .

The response is a single JSON object:

| Key | Contents |
| --- | --- |
| `summary` | Cycle timing, finding counts by severity |
| `healthy` | Per-check health status (`true`/`false`) |
| `findings` | Array of all findings from the latest cycle |
| `probes` | Raw probe data per check (pods, connectivity, replication, drain, topology, crdStatus) |
| `coverage` | What was checked: SQL probes enabled, checks run, namespace |
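To see at a glance which checks are failing, the `healthy` entry can be filtered on its own. This sketch assumes `healthy` is a JSON object keyed by check name with boolean values, which fits the description above but is not spelled out there. `unhealthy_checks` is a hypothetical name.

```shell
# Hypothetical helper: read an /api/status document on stdin and print
# the names of checks whose health flag is false, one per line.
# Assumes .healthy looks like {"connectivity": true, "replication": false}.
unhealthy_checks() {
  jq -r '.healthy // {} | to_entries[] | select(.value == false) | .key'
}
```

It is meant to be used as `observer /api/status | unhealthy_checks`.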

Quick Triage

# Just errors and fatals
observer /api/status | jq '[.findings[] | select(.level == "fatal" or .level == "error")]'

# Finding history with classification
observer /api/history | jq '{
  persistent: [.persistent[]? | {check, component, message, severity, count}],
  flapping: [.flapping[]? | {check, component, message, severity, count}],
  transientCount: (.transient | length)
}'

# Run specific checks on demand (without waiting for next cycle)
observer '/api/check?categories=replication,connectivity' | jq .

Valid categories: pod-health, resource-validation, crd-status, drain-state, connectivity, logs, events, topology, replication
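A typo in a category name is easy to miss, so it may help to validate names locally before calling the endpoint. The wrapper below is a sketch: `run_checks` and `VALID_CATEGORIES` are hypothetical names, the list is copied from the line above, and the `observer` helper is the one defined earlier in this section.

```shell
# Hypothetical wrapper around /api/check that rejects unknown categories
# before calling the observer() helper.
VALID_CATEGORIES="pod-health resource-validation crd-status drain-state connectivity logs events topology replication"

is_valid_category() {
  local c
  for c in $VALID_CATEGORIES; do
    [ "$c" = "$1" ] && return 0
  done
  return 1
}

run_checks() {
  local cat joined=""
  for cat in "$@"; do
    if ! is_valid_category "$cat"; then
      echo "unknown category: $cat" >&2
      return 2
    fi
    joined="${joined:+$joined,}$cat"   # build a comma-separated list
  done
  if [ -z "$joined" ]; then
    echo "usage: run_checks CATEGORY..." >&2
    return 2
  fi
  observer "/api/check?categories=$joined"
}
```

For example, `run_checks replication connectivity | jq .` issues the same request as the command above.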

For the full query catalog, see references/observer-queries.md.

Finding History Classification

| Category | Meaning | Action |
| --- | --- | --- |
| `persistent` | Present in 75%+ of cycles (or <3 cycles) | Real issue: investigate |
| `flapping` | Active with gaps (3+ appearances, <75%) | Intermittent: may be a race condition |
| `transient` | Appeared then resolved | Expected during operations: report but don't block |
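The thresholds in the table can be written out as a small decision function. This is one reading of the table, not the observer's actual code: it assumes a resolved finding is always `transient`, and it returns `unclassified` for the case the table leaves open (active, under 75%, fewer than 3 appearances). The function name and argument order are hypothetical.

```shell
# Sketch of the classification rule as read from the table above.
# Args: appearances, total cycles observed, active now ("true"/"false").
classify_finding() {
  local appearances="$1" cycles="$2" active="$3"
  if [ "$active" != "true" ]; then
    echo transient       # appeared earlier, no longer present
  elif [ "$cycles" -lt 3 ] || [ $((appearances * 100)) -ge $((cycles * 75)) ]; then
    echo persistent      # 75%+ of cycles, or too little history to tell
  elif [ "$appearances" -ge 3 ]; then
    echo flapping        # recurring, but with gaps
  else
    echo unclassified    # case the table does not cover
  fi
}
```

For instance, `classify_finding 4 10 true` prints `flapping`, because 4 of 10 cycles is below 75% but at least 3 appearances.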

Findings Structure

Each finding has:

level: info, warn, error, fatal
check: Category — pod-health, resource-validation, crd-status, drain-state, connectivity, operator-logs, dataplane-logs, events, topology, replication
component: Affected resource, e.g., shard/default/my-shard
message: Human-readable description
details: Structured data for deeper investigation
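Because every finding carries a `check` field, counting findings per check is a quick way to see where problems cluster. The helper below is a sketch that uses only the field names listed above. `findings_per_check` is a hypothetical name.

```shell
# Hypothetical helper: read an /api/status document on stdin and print
# "check<TAB>count" for each check that has at least one finding.
findings_per_check() {
  jq -r '[.findings[]?] | group_by(.check)[] | "\(.[0].check)\t\(length)"'
}
```

It is meant to be used as `observer /api/status | findings_per_check`.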

Probes

Raw data collected during the cycle, independent of findings:

probes.pods: All managed pods with phase, readiness, component labels
probes.connectivity: Every TCP/HTTP/gRPC/SQL probe result with latency and error
probes.replication: Per-shard primary/replica counts and podRoles
probes.drain: All pods currently in a drain state
probes.topology: Per-cluster etcd reachability and rootPath
probes.crdStatus: All CRDs with phase and readiness

CRITICAL: Investigation Rules

NEVER blame infrastructure (kind, Docker, CPU, memory, network) without concrete proof. If a probe times out, investigate the actual call chain — check logs of every component in the path (gateway → multiorch → pooler → postgres). A 5-second timeout on SELECT 1 is not "expected in kind" — it means something is broken.

Trace the full call chain. If the gateway SQL probe fails, check gateway logs → multiorch logs → pooler logs. Find where the request gets stuck.
Check upstream component logs. The observer only sees symptoms. Use kubectl logs on the actual components to find root causes.
Never speculate about performance. If something is slow, find the bottleneck with evidence.
Report what you actually found. State the specific failure chain with log evidence, not theories.
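Tracing the chain usually means running `kubectl logs` once per hop. The loop below is a sketch of that. The label selectors are deliberately left to the caller because this skill does not list them, and `trace_chain`, the 10-minute window, and the 200-line tail are assumptions. It expects `KUBECONFIG` to be exported already.

```shell
# Hypothetical helper: dump recent logs for each hop in the call chain.
# Usage: trace_chain NAMESPACE SELECTOR...
# Selectors are supplied by the caller, in call-chain order.
trace_chain() {
  local ns="$1" selector
  shift
  for selector in "$@"; do
    echo "=== $selector ==="
    kubectl logs -n "$ns" -l "$selector" \
      --all-containers --prefix --since=10m --tail=200 \
      || echo "(no logs for $selector)" >&2
  done
}
```

Passing the selectors in gateway, multiorch, pooler, postgres order keeps the output aligned with the direction of the request.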

3. Severity Triage

Process findings in this order:

Fatal (Critical — address immediately)

SPLIT-BRAIN: Multiple pods report as primary. Data integrity emergency.
Writes blocked: Write probe timed out.
Backward drain transition: Drain state went backwards — controller bug.
Invalid phase transition: Phase went Healthy → Initializing — should never happen.
Readiness cross-check: Pod reports Ready but gRPC/readiness probe failed — silent data plane failure.
All poolers unreachable: multiorch-pooler-health shows all poolers down — total outage.
Missing PVC: Running pool pod references a PVC that doesn't exist.

Error (Investigate)

Missing replicas: Primary sees 0 replication connections.
Stuck drain: Drain state hasn't progressed past its timeout.
Generation divergence: Controller isn't reconciling (observed generation stale >60s).
WAL receiver down: Replica disconnected from primary.
Connectivity failure: Can't reach a multigres service.
PVC not Bound: Pool pod's PVC exists but is not in Bound phase.
Backup very stale: Last backup >49h old.
Cell ready > total: GatewayReadyReplicas exceeds GatewayReplicas — impossible state.

Warn (Monitor)

Replication lag >10s: Standby falling behind.
WAL replay paused: Someone paused replay on a replica.
Backup stale: Last backup >25h old.
Backup never completed: Backup configured but LastBackupTime is nil.
Stuck Progressing: Phase stuck in Progressing >10 minutes.
Empty status message: Degraded/Unknown phase with no status message.
Service missing endpoints: Managed Service has no ready addresses.
Cell deployment mismatch: Gateway readyReplicas doesn't match Cell status.
Operator metrics unreachable: Can't probe operator /metrics on port 8443.
Topology unreachable: etcd checks skipped (may be expected during startup).
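The backup thresholds span two severity levels (warn above 25h, error above 49h, warn when no backup has ever completed), which is easy to mix up. The function below is a sketch of that mapping. The `ok` result and the name `backup_level` are assumptions, and the values the observer actually uses are in tools/observer/docs/configuration.md.

```shell
# Sketch: map a backup's age to the severity levels listed above.
# Args: last backup time as epoch seconds (empty string if never completed),
#       and optionally "now" as epoch seconds (defaults to the current time).
backup_level() {
  local last="$1" now="${2:-$(date +%s)}"
  if [ -z "$last" ]; then
    echo warn            # backup never completed
    return
  fi
  local age=$(( now - last ))
  if [ "$age" -gt $((49 * 3600)) ]; then
    echo error           # backup very stale
  elif [ "$age" -gt $((25 * 3600)) ]; then
    echo warn            # backup stale
  else
    echo ok
  fi
}
```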

Info (Normal)

Synced events, successful probes, orphaned PVCs — no action needed.
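To process findings in the order above without scanning the whole array by hand, they can be sorted by level first. This is a sketch built on the `level` values listed under Findings Structure. `findings_by_severity` is a hypothetical name, and unknown levels are placed last.

```shell
# Hypothetical helper: print findings from an /api/status document on stdin,
# most severe first (fatal, error, warn, info; anything else last).
findings_by_severity() {
  jq -r '
    {"fatal": 0, "error": 1, "warn": 2, "info": 3} as $rank
    | [.findings[]?]
    | sort_by($rank[.level // ""] // 4)
    | .[]
    | "\(.level)\t\(.check)\t\(.message)"
  '
}
```

It is meant to be used as `observer /api/status | findings_by_severity`.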

4. Common Diagnostic Patterns

Multiple fatals from the same check

They often share a single root cause. Read the details fields — look for a common pod, application_name, or component. Fix the root cause rather than addressing each fatal individually.

Silent data plane failure (all components "healthy" but queries fail)

The highest-signal finding is a readiness cross-check fatal: "Pod X reports Ready but multipooler-grpc-health probe failed". The pod passes Kubernetes readiness but the observer's gRPC probe detects the component is broken.

Check probes in order:

multipooler-grpc-health: If failing on all pool pods, the gRPC server (port 15270) is hanging
multiorch-pooler-health: If showing "all poolers unreachable", multiorch knows the data plane is down
sql-probe: If timing out while the above are failing, confirms total data plane outage
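Since the readiness cross-check fatal is the highest-signal finding for this pattern, it can help to pull it out first. The filter below is a sketch that matches on the message wording quoted above. That wording is taken from this document and may differ between observer versions, and `readiness_crosscheck_fatals` is a hypothetical name.

```shell
# Hypothetical helper: extract readiness cross-check fatals from an
# /api/status document on stdin, one compact JSON object per line.
# Matching on the message text is an assumption about its wording.
readiness_crosscheck_fatals() {
  jq -c '.findings[]?
         | select(.level == "fatal")
         | select((.message // "") | contains("reports Ready but"))
         | {component, message}'
}
```

It is meant to be used as `observer /api/status | readiness_crosscheck_fatals`.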

Connectivity errors — always investigate

context deadline exceeded on SQL probes does NOT mean "kind is slow". Trace the call chain: gateway logs → multiorch logs → pooler logs. If multiorch can't reach poolers via gRPC, that's the root cause. TCP/HTTP probes passing while SQL fails means the problem is in the application layer, not the transport layer.

System pod FailedScheduling in kind

FailedScheduling events on coredns, local-path-provisioner, or kube-scheduler are kind artifacts during node startup. They resolve automatically. Ignore them unless they persist for more than 5 minutes.
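To check whether such events are still recurring, they can be listed across namespaces with their timestamps. This is a sketch using standard `kubectl get events` options. `failed_scheduling_events` is a hypothetical name, and it expects `KUBECONFIG` to be exported already.

```shell
# Hypothetical helper: list FailedScheduling events in all namespaces,
# sorted by last occurrence so lingering ones appear at the bottom.
failed_scheduling_events() {
  kubectl get events --all-namespaces \
    --field-selector reason=FailedScheduling \
    --sort-by=.lastTimestamp
}
```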

Topology validation skipped

The message "topology validation skipped: etcd unreachable" means the observer can't connect to the TopoServer etcd. The usual cause is a label mismatch, or TopoServer pods that haven't started yet.

Replication errors correlate

Replication findings often chain — broken sync config causes async standbys, which causes blocked writes. Look at details to find the upstream cause.

5. Investigation and Bug Report

After triaging findings, investigate root causes using:

Component log tracing: references/log-tracing.md — trace failures through gateway → multiorch → pooler → postgres logs, plus kubectl cross-reference commands
Code investigation: references/code-investigation.md — determine if the bug is in the operator or upstream multigres, with directory maps and version checking

Then produce a bug report documenting:

Finding summary: What the observer detected (severity, check, message)
Affected component: Operator, upstream multigres, or both
Root cause: The specific code path causing the issue
Version info: Which multigres SHA is running vs latest
Fix location: Whether it needs a fix in the operator, upstream, or both
Suggested fix: Concrete code change recommendation

If the bug exists on an older upstream SHA but is already fixed on main, report this — do not update image tags yourself.
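A fixed skeleton makes it harder to leave one of the six sections out. The helper below is a minimal sketch that prints one: the section labels come from the list above, while the name `bug_report` and the one-line-per-section layout are assumptions.

```shell
# Hypothetical helper: print a bug report skeleton with the six sections above.
# Args: summary, component, root cause, version info, fix location, suggested fix.
bug_report() {
  cat <<EOF
Finding summary: ${1:-}
Affected component: ${2:-}
Root cause: ${3:-}
Version info: ${4:-}
Fix location: ${5:-}
Suggested fix: ${6:-}
EOF
}
```

Empty arguments leave the corresponding section blank, so gaps in the investigation stay visible in the output.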

6. Reference Documents

| Reference | When to Read |
| --- | --- |
| `references/observer-queries.md` | Full catalog of jq queries for probing specific aspects of observer output |
| `references/log-tracing.md` | When tracing failures through component logs and cross-referencing with kubectl |
| `references/code-investigation.md` | When investigating root causes in operator or upstream multigres code |
| `tools/observer/docs/observer.md` | Full check reference with all sub-checks and SQL queries |
| `tools/observer/docs/configuration.md` | Threshold values and constants |
| `tools/observer/docs/architecture.md` | How the observer works internally |
