| name | probe-soft-failure |
| description | Use when writing or reviewing addon livenessProbe/readinessProbe scripts, or diagnosing unexpected restarts/readiness flaps. Forces channel errors such as client timeout or connection refusal to be treated as transient unless the client succeeds and output proves a bad database state. |
| allowed-tools | Bash(kubectl *) Bash(rg *) Read |
Probe Soft Failure
Hard Rules
- Probe exit code must represent product state, not client transport state.
- Every external client call must be wrapped in a shorter timeout than the probe timeout.
- Client `rc != 0` is transient / unknown, so the probe should usually `exit 0`.
- Only `client succeeded + output proves bad state` may `exit 1`.
- Known legal slow windows must short-circuit before the client call. Use a narrow process guard when startup, DBCA, backup, restore, switchover, or role setup is known to make the control plane slow.
- Probe timing must fit the slowest legal operation. Size `initialDelaySeconds`, `periodSeconds`, `timeoutSeconds`, and `failureThreshold` from real slow windows.
- Readiness can be conservative; liveness must be careful. A false liveness failure kills the container and can create a restart cascade.
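
As a rough worked example of the timing rule, the sketch below estimates how many seconds of continuous probe failure are tolerated before the kubelet acts. It uses `periodSeconds × failureThreshold` as an approximation and ignores `timeoutSeconds` and `initialDelaySeconds`. The function names and the 180-second slow window are hypothetical, not values taken from any addon.

```sh
# Hypothetical helpers; the formula is an approximation, not kubelet's exact timing.

# failure_budget_seconds PERIOD_SECONDS FAILURE_THRESHOLD
# Prints the approximate number of seconds of consecutive failures tolerated.
failure_budget_seconds() {
  echo $(( $1 * $2 ))
}

# fits_slow_window PERIOD_SECONDS FAILURE_THRESHOLD SLOW_WINDOW_SECONDS
# Returns 0 when the failure budget covers the slowest legal operation.
fits_slow_window() {
  [ "$(failure_budget_seconds "$1" "$2")" -ge "$3" ]
}
```

For an assumed 180-second switchover, `fits_slow_window 10 3 180` fails (30 s budget) while `fits_slow_window 30 6 180` passes (180 s budget). A soft-passing probe script should still be the first line of defense; the budget only covers what the script cannot classify.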
When To Invoke
Use this skill when:
- Adding or reviewing `livenessProbe` / `readinessProbe` in ComponentDefinition.
- Writing shell scripts called by K8s probes.
- Seeing restarts increase while engine logs do not show a real crash.
- Seeing readiness flap after switchover, reconfigure, backup/restore, role rebuild, or startup.
- Reviewing PRs that use `mysql`, `redis-cli`, `sqlplus`, `dgmgrl`, `psql`, or similar clients in probes.
Probe Script Shape
Use this shape unless the engine has a stronger local health API:
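```sh
# Short-circuit known legal slow windows before touching the client.
if pgrep -f '<init-or-role-change-marker>' >/dev/null 2>&1; then
  echo "probe: legal slow window, soft pass"
  exit 0
fi

# Keep the client timeout (25s here) shorter than the probe's timeoutSeconds.
out=$(timeout 25 <engine-client-command> 2>&1)
rc=$?
# Non-zero rc (including a timeout) is a channel error, not product state.
if [ "$rc" -ne 0 ]; then
  echo "probe: client unavailable rc=$rc, soft pass"
  exit 0
fi

# Only a successful call whose output proves bad state may fail the probe.
case "$out" in
  *'<known-good-state>'*) exit 0 ;;
  *'<known-bad-state>'*) echo "probe: bad state: $out"; exit 1 ;;
  *) echo "probe: unknown output, soft pass"; exit 0 ;;
esac
```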
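
The decision part of this shape can be pulled into a small function so it can be exercised without a running engine. The sketch below is one possible way to do that; `decide_probe_exit` and the `STATE=OK` / `STATE=BAD` strings are hypothetical placeholders, not output of any real client.

```sh
# Hypothetical helper: map (client rc, client output) to a probe exit code.
# GOOD and BAD are caller-supplied substrings, not engine defaults.
# Usage: decide_probe_exit RC OUTPUT GOOD BAD   (prints 0 or 1)
decide_probe_exit() {
  rc=$1
  out=$2
  good=$3
  bad=$4

  # Any non-zero client rc is a channel error. This includes 124, which
  # GNU timeout returns when the wrapped command times out.
  if [ "$rc" -ne 0 ]; then
    echo 0
    return 0
  fi

  case "$out" in
    *"$good"*) echo 0 ;;   # proven good state
    *"$bad"*)  echo 1 ;;   # client succeeded and output proves bad state
    *)         echo 0 ;;   # unknown output, soft pass
  esac
}
```

A probe script could then end with `exit "$(decide_probe_exit "$rc" "$out" 'STATE=OK' 'STATE=BAD')"`, which keeps every `exit 1` path behind a successful client call.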
Review Checklist
Before approving a probe:
- Find every command substitution and pipeline that calls an engine client.
- Confirm each client call has an explicit timeout.
- Confirm
rc != 0 does not directly become exit 1.
- Confirm
exit 1 paths are based on parsed output from a successful client call.
- Confirm legal slow windows are guarded before the client call.
- Confirm probe timing leaves enough time for the slowest legal operation.
- Confirm liveness and readiness semantics are not copied blindly from each other.
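
The first two checklist items can be partly mechanized. The sketch below is a heuristic only: it lists lines that mention a known client name but not the word `timeout`, so multi-line wrappers or clients invoked through variables will be missed or misreported. The function name and client list are assumptions. It uses `grep` for portability; `rg -n` with the same pattern should behave similarly.

```sh
# Hypothetical helper: list probe-script lines that call a known engine
# client without `timeout` on the same line. Heuristic, line-based only.
# Usage: find_unwrapped_clients DIR
find_unwrapped_clients() {
  grep -rnE '(^|[^[:alnum:]_-])(mysql|redis-cli|sqlplus|dgmgrl|psql)([^[:alnum:]_-]|$)' "$1" \
    | grep -v 'timeout' \
    || true   # no matches is not an error for the caller
}
```

Each reported line still needs a manual read to confirm how `rc` and the output are handled.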
Incident Triage
If a pod is repeatedly killed by liveness:
- Compare restart timestamps with DB startup / role-change / backup / restore activity.
- Read previous container logs and probe script output.
- If probe failed on timeout, connection refusal, or transient client error, classify as channel error promoted to product failure.
- Patch the probe script semantics first.
- Re-run the same lifecycle path and verify restarts stop.
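
The first two triage steps might look like the sketch below. The function name and its arguments are hypothetical; the `kubectl` subcommands and flags are standard, but the exact selectors worth using depend on the cluster. Failed exec probes typically surface their output in `Unhealthy` events rather than in container logs, so both are collected.

```sh
# Hypothetical helper: gather restart evidence for one container.
# Usage: triage_restarts POD NAMESPACE CONTAINER
triage_restarts() {
  pod=$1
  ns=$2
  container=$3

  # Restart counts and the time the previous container instance terminated,
  # to compare against startup / role-change / backup / restore activity.
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[*].restartCount}{"\n"}{.status.containerStatuses[*].lastState.terminated.finishedAt}{"\n"}'

  # Probe failure events; the message usually carries the probe script output.
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod,reason=Unhealthy"

  # Logs from the container instance that was killed.
  kubectl logs "$pod" -n "$ns" -c "$container" --previous --tail=200
}
```

If the events show timeouts or connection refusals while the previous logs show no engine crash, that matches the channel-error classification above.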
Bad Patterns
| Pattern | Risk | Fix |
|---|---|---|
| `client \|\| exit 1` | Channel error promoted to product failure | Capture `rc`; soft pass on `rc != 0` |
| `stderr contains ORA/ERR => exit 1` | Transient startup errors become product failure | Require successful readback + parsed state |
| No explicit timeout | Probe SIGKILLs client mid-call | Wrap client with `timeout N` |
| Broad `pgrep` guard | Probe always soft-passes | Use narrow marker strings |
| Liveness checks role readiness | Role changes kill otherwise healthy DB | Keep liveness minimal |
Related Docs
- `docs/addon-probe-timeout-and-soft-failure-guide.md`
- `docs/addon-test-probe-classification-guide.md`
- `docs/cases/oracle/oracle-12c-post-switchover-probe-cascade-kill.md`