name	probe-soft-failure
description	Use when writing or reviewing addon livenessProbe/readinessProbe scripts, or diagnosing unexpected restarts/readiness flaps. Forces channel errors such as client timeout or connection refusal to be treated as transient unless the client succeeds and output proves a bad database state.
allowed-tools	Bash(kubectl ) Bash(rg ) Read

Probe Soft Failure

Hard Rules

Probe exit code must represent product state, not client transport state.
Every external client call must be wrapped in a timeout shorter than the probe's timeoutSeconds.
A non-zero client exit code means the state is transient or unknown, so the probe should usually exit 0.
Exit 1 only when the client call succeeded and its output proves a bad state.
Known legal slow windows must short-circuit before the client call. Use a narrow process guard when startup, DBCA, backup, restore, switchover, or role setup is known to make the control plane slow.
Probe timing must fit the slowest legal operation. Size initialDelaySeconds, periodSeconds, timeoutSeconds, and failureThreshold from real slow windows.
Readiness can be conservative; liveness must be careful. A false liveness failure kills the container and can create a restart cascade.
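
The timing rules can be sanity-checked with simple arithmetic. The sketch below is an approximation, not kubelet's exact behaviour: it assumes a container is restarted after failureThreshold consecutive failures spaced periodSeconds apart, and it ignores per-attempt timeouts and scheduling jitter. The function names and the example numbers are hypothetical.

```shell
#!/bin/sh
# failure_budget PERIOD FAILURE_THRESHOLD
# Approximate seconds of continuous probe failure tolerated before restart
# (conservative lower bound: time between the first and the last failure).
failure_budget() {
  echo $(( ($2 - 1) * $1 ))
}

# startup_budget INITIAL_DELAY PERIOD FAILURE_THRESHOLD
# Same estimate measured from container start, so initialDelaySeconds counts.
startup_budget() {
  echo $(( $1 + $(failure_budget "$2" "$3") ))
}

# client_timeout_ok CLIENT_TIMEOUT PROBE_TIMEOUT
# Succeeds only when the client timeout is strictly below timeoutSeconds.
client_timeout_ok() {
  [ "$1" -lt "$2" ]
}
```

For example, `failure_budget 10 3` prints 20 and `startup_budget 60 10 3` prints 80. If the slowest legal operation takes longer than that, either the timing values or the slow-window guard need to change.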

When To Invoke

Use this skill when:

Adding or reviewing livenessProbe / readinessProbe in ComponentDefinition.
Writing shell scripts called by K8s probes.
Seeing restarts increase while engine logs do not show a real crash.
Seeing readiness flap after switchover, reconfigure, backup/restore, role rebuild, or startup.
Reviewing PRs that use mysql, redis-cli, sqlplus, dgmgrl, psql, or similar clients in probes.

Probe Script Shape

Use this shape unless the engine has a stronger local health API:

# 1. Known legal slow window guard
if pgrep -f '<init-or-role-change-marker>' >/dev/null 2>&1; then
  echo "probe: legal slow window, soft pass"
  exit 0
fi

# 2. Client call with explicit timeout (keep it below the probe's timeoutSeconds)
out=$(timeout 25 <engine-client-command> 2>&1)
rc=$?

# 3. Transport/client failure is unknown, not product failure
if [ "$rc" -ne 0 ]; then
  echo "probe: client unavailable rc=$rc, soft pass"
  exit 0
fi

# 4. Only parsed bad state fails
case "$out" in
  *'<known-good-state>'*) exit 0 ;;
  *'<known-bad-state>'*)  echo "probe: bad state: $out"; exit 1 ;;
  *)                     echo "probe: unknown output, soft pass"; exit 0 ;;
esac
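
To exercise the shape above outside a pod, the decision logic can be pulled into a function that receives the client command as arguments. This is a sketch assuming GNU coreutils `timeout`, which exits with status 124 when the limit is hit. `probe_decide`, the `CLIENT_TIMEOUT` variable, and the `OPEN` / `MOUNTED` strings are hypothetical placeholders, not any engine's real output.

```shell
#!/bin/sh
# probe_decide GOOD_PATTERN BAD_PATTERN CLIENT_CMD [ARGS...]
# Returns 1 only when the client succeeded AND its output contains BAD_PATTERN.
# CLIENT_CMD must be an external executable, because `timeout` cannot run
# shell functions.
probe_decide() {
  good=$1; bad=$2; shift 2

  # Capture rc without tripping `set -e` if the caller has it enabled.
  rc=0
  out=$(timeout "${CLIENT_TIMEOUT:-25}" "$@" 2>&1) || rc=$?

  if [ "$rc" -eq 124 ]; then
    echo "probe: client timed out, soft pass"
    return 0
  fi
  if [ "$rc" -ne 0 ]; then
    echo "probe: client unavailable rc=$rc, soft pass"
    return 0
  fi

  # Quoted variables inside the pattern are matched literally.
  case "$out" in
    *"$good"*) return 0 ;;
    *"$bad"*)  echo "probe: bad state: $out"; return 1 ;;
    *)         echo "probe: unknown output, soft pass"; return 0 ;;
  esac
}
```

With stand-in clients, `probe_decide OPEN MOUNTED echo MOUNTED` is the only call below that fails; `probe_decide OPEN MOUNTED false` and a timed-out `sleep` both soft-pass.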

Review Checklist

Before approving a probe:

Find every command substitution and pipeline that calls an engine client.
Confirm each client call has an explicit timeout.
Confirm rc != 0 does not directly become exit 1.
Confirm exit 1 paths are based on parsed output from a successful client call.
Confirm legal slow windows are guarded before the client call.
Confirm probe timing leaves enough time for the slowest legal operation.
Confirm liveness and readiness semantics are not copied blindly from each other.
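
The first two checklist items can be partly automated. The sketch below is a line-based heuristic, assuming the client and its `timeout` wrapper sit on the same line. `find_untimed_clients` is a hypothetical helper, and the client list simply mirrors the names mentioned earlier in this document. It uses `grep` for portability.

```shell
#!/bin/sh
# find_untimed_clients FILE...
# Prints "file:line:text" for non-comment lines that invoke a known engine
# client without a `timeout` wrapper on the same line.
# Returns non-zero when nothing suspicious is found (grep semantics).
find_untimed_clients() {
  grep -nHE '(^|[^[:alnum:]_-])(mysql|redis-cli|sqlplus|dgmgrl|psql)([^[:alnum:]_-]|$)' "$@" \
    | grep -vE '(^|[^[:alnum:]_-])timeout[[:space:]]' \
    | grep -vE '^[^:]*:[0-9]+:[[:space:]]*#'
}
```

A hit is only a prompt for manual review: multi-line commands, wrappers, and clients outside the list are not detected.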

Incident Triage

If a pod is repeatedly killed by liveness:

Compare restart timestamps with DB startup / role-change / backup / restore activity.
Read previous container logs and probe script output.
If the probe failed on a timeout, connection refusal, or transient client error, classify the incident as a channel error promoted to product failure.
Patch the probe script semantics first.
Re-run the same lifecycle path and verify restarts stop.
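
The first three triage steps can be sketched with standard `kubectl` subcommands. The helper names are hypothetical, and the message patterns in `classify_failure` are assumed examples of channel errors rather than an exhaustive list. Failed exec probe output typically shows up in the message of the pod's `Unhealthy` events, which is why the sketch reads events as well as logs.

```shell
#!/bin/sh
# last_terminations NAMESPACE POD
# Per container: name, restart count, last termination time and reason.
last_terminations() {
  kubectl get pod "$2" -n "$1" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.finishedAt}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# previous_logs NAMESPACE POD CONTAINER
# Logs of the previous (killed) container instance.
previous_logs() {
  kubectl logs "$2" -n "$1" -c "$3" --previous
}

# probe_events NAMESPACE POD
# Probe failure events recorded for the pod.
probe_events() {
  kubectl get events -n "$1" \
    --field-selector "involvedObject.name=$2,reason=Unhealthy"
}

# classify_failure MESSAGE
# Prints "channel" for messages that look like transport or client failures,
# otherwise "needs-review". Pure string matching, no cluster access.
classify_failure() {
  case "$1" in
    *[Tt]imed\ out*|*[Tt]imeout*|*"onnection refused"*|*"deadline exceeded"*)
      echo channel ;;
    *)
      echo needs-review ;;
  esac
}
```

A `channel` result supports patching the probe script first. A `needs-review` result means the message has to be read against engine logs before drawing a conclusion.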

Bad Patterns

Pattern	Risk	Fix
`client || exit 1`	Transport failure becomes product failure	Treat rc != 0 as unknown and soft-pass
`stderr contains ORA/ERR => exit 1`	Transient startup errors become product failure	Require successful readback + parsed state
No explicit `timeout`	Probe SIGKILLs client mid-call	Wrap client with `timeout N`
Broad `pgrep` guard	Probe always soft-passes	Use narrow marker strings
Liveness checks role readiness	Role changes kill otherwise healthy DB	Keep liveness minimal

Related Docs

docs/addon-probe-timeout-and-soft-failure-guide.md
docs/addon-test-probe-classification-guide.md
docs/cases/oracle/oracle-12c-post-switchover-probe-cascade-kill.md
