monitor-tick

| name | monitor-tick |
| description | One tick of the henyey mainnet monitor — checks, metrics scan, deploy, status report |
One invocation of the monitor loop. Intended to be called on a ~20-minute
cadence by an external orchestrator (crontab, systemd timer, CI) via
claude -p '/monitor-tick', or by a Claude-internal /loop 20m /monitor-tick.
The 60m deploy cool-down (check 10) is sized so each deployed binary gets
~3 ticks of observation before the next deploy is eligible.
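For concreteness, a minimal crontab sketch of that cadence (illustrative only — the repo working directory and the cron log path are assumptions; the claude -p invocation is the one named above):

# Illustrative crontab entry; paths are assumptions, not prescribed
*/20 * * * * cd /home/tomer/henyey && claude -p '/monitor-tick' >> /home/tomer/data/monitor-tick-cron.log 2>&1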
This skill expects /home/tomer/data/monitor-loop.env to exist — it is
written by /monitor-loop at startup and contains the runtime config.
If the file is missing, bail out immediately with:
ERROR: /home/tomer/data/monitor-loop.env not found — run /monitor-loop first.
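A minimal sketch of that guard (the error message is the one above, verbatim):

env_file=/home/tomer/data/monitor-loop.env
if [ ! -f "$env_file" ]; then
  echo "ERROR: /home/tomer/data/monitor-loop.env not found — run /monitor-loop first." >&2
  exit 1
fi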
Load the env file at the start of the tick:
set -a
source /home/tomer/data/monitor-loop.env
set +a
Variables provided by the env file (all required):
- MONITOR_SESSION_ID — 8-char session id, e.g. 74535976
- MONITOR_MODE — validator or watcher
- MONITOR_CONFIG — path to the TOML config, e.g. configs/validator-mainnet-rpc.toml
- MONITOR_ADMIN_PORT — 11627 (validator) or 11727 (watcher)
- MONITOR_RPC_PORT — 8000 (validator) or empty (watcher)
- MONITOR_RUN_FLAGS — --validator (validator) or empty (watcher)

Throughout this skill, substitute those values for the $MONITOR_* references below.
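For concreteness, a validator-mode monitor-loop.env would look like this (values are the examples listed above, not prescriptive):

MONITOR_SESSION_ID=74535976
MONITOR_MODE=validator
MONITOR_CONFIG=configs/validator-mainnet-rpc.toml
MONITOR_ADMIN_PORT=11627
MONITOR_RPC_PORT=8000
MONITOR_RUN_FLAGS=--validator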
All files below live in /home/tomer/data/$MONITOR_SESSION_ID/:
| File | Purpose | Writer |
|---|---|---|
| build_sha | Cache of deployed binary's source commit (authoritative source: runtime /info.commit_hash; seeded by build step) | check 10 (after successful build or runtime self-heal) and monitor-loop step 5 |
| last_ledger | STUCK detection state (check 4) | check 4 |
| last_ledger_count | STUCK confirmation counter (check 4) | check 4 |
| wipe_attempted_at | Epoch of most recent (3a) wipe; read by (3d) post-wipe recurrence guard | check 3a (write), check 3d marker-clear rule (remove) |
| tick-history.jsonl | daily-summary aggregation | tick epilogue |
| metrics/current.prom | latest Prometheus scrape | check 8 |
| metrics/prev.prom | previous Prometheus scrape | check 8 |
| metrics/ratio_snapshot | counter-ratio history (check 12) | check 12 |
| metrics/counter_streak_snapshot | counter-streak state (check 12b) | check 12b (constants) |
| metrics/anomaly_cooldown.json | alert dedup state | check 9 |
| logs/monitor.log | node stdout/stderr (rotated on restart) | node process |
| cargo-target/ | cached build tree | cargo |
| .alive | session liveness marker | session startup |
Cross-session state (lives at $HOME/data/, outside any session dir):
| File | Purpose | Writer | Remover |
|---|---|---|---|
| deploy_quarantine.txt | Commit SHAs blocked from rebuild/restart deploy; survives session wipes | Deploy regression policy (skill, during rollback) | Operator only (manual) |
Immediately after loading monitor-loop.env, check whether the session
directory still exists. If not, this is a distinct (and severe) failure class
— the entire session state (binary, logs, metrics, tick history) was deleted
out-of-band.
source "$(git rev-parse --show-toplevel)/scripts/lib/monitor-decisions.sh"
check_session_wiped "$HOME/data" "/proc" "$MONITOR_SESSION_ID" \
"$HOME/data/monitor-loop.env" || exit 1
The check_session_wiped function (defined in scripts/lib/monitor-decisions.sh):
SESSION_WIPED ("yes"/"no") and SESSION_WIPED_PROCESS_ALIVE ("yes"/"no")_find_session_process to scan /proc/[0-9]*/exe for a process matching the expected binary path (including (deleted) suffix){logs,cache,cargo-target,metrics} subdirsWhen the session dir exists (SESSION_WIPED=no), check whether the session
is long-abandoned — no process alive AND no recent tick/orchestrator activity.
This prevents auto-relaunching a node that was intentionally stopped hours or
days ago.
if [[ "$SESSION_WIPED" == "no" ]]; then
check_long_stale_session "$HOME/data" "/proc" "$MONITOR_SESSION_ID" \
"$HOME/data/monitor-loop.env" || exit 1
fi
The check_long_stale_session function (defined in scripts/lib/monitor-decisions.sh):
- Sets LONG_STALE_SESSION ("yes"/"no") ("no" when .alive/env are recent enough)
- Staleness: .alive age > 6h (or missing), env age > 24h
- Liveness signal: .alive mtime (touched every tick at start). Fallback: env file mtime.
- A live process (found via _find_session_process) overrides staleness markers.

When SESSION_WIPED=yes:

Case A: SESSION_WIPED_PROCESS_ALIVE=yes
- Do not relaunch; touch .alive.
- Emit actions, watch, and wipe: line per wipe-state composition table (row #3 or #6 depending on MAINNET_WIPED).

Case B: SESSION_WIPED_PROCESS_ALIVE=no
- Touch .alive.
- FRESH_START determined normally (mainnet.db likely absent → yes).
- Rebuild:

CARGO_TARGET_DIR=/home/tomer/data/$MONITOR_SESSION_ID/cargo-target \
cargo build --release -p henyey
If build succeeds, persist the deployed sha (atomic write):
new_sha=$(git rev-parse HEAD)
printf '%s\n' "$new_sha" > "/home/tomer/data/$MONITOR_SESSION_ID/build_sha.tmp"
mv "/home/tomer/data/$MONITOR_SESSION_ID/build_sha.tmp" "/home/tomer/data/$MONITOR_SESSION_ID/build_sha"
If build fails: report OFFLINE; emit actions/watch/wipe: per wipe-state composition table (row #5 or #8). File/comment issue before exit using wipe-state issue-filing policy. Then exit.

If build succeeds: Relaunch, then emit actions, watch, and wipe: line per wipe-state composition table (row #4 or #7 depending on MAINNET_WIPED).

After the session-dir check, independently verify that mainnet data exists (regardless of session-dir state). Note: The stale-env early-exit terminates the tick before reaching this point — no wipe signal is emitted in that case because no tick report is generated.
check_mainnet_wiped "$HOME/data"
The check_mainnet_wiped function (from scripts/lib/monitor-decisions.sh):
- Sets MAINNET_WIPED to "yes" if the $HOME/data/mainnet directory is missing, "no" otherwise

Both SESSION_WIPED and MAINNET_WIPED are fully determined at this point.
All downstream reporting derives wipe-related outputs from these two flags
using the truth table below. This is the single source of truth — Case A/B
and the status formatter reference this table; they do not independently define
wipe-related outputs.
| # | SESSION_WIPED | MAINNET_WIPED | Case | actions | watch | Status level | wipe: line | Issue title pattern |
|---|---|---|---|---|---|---|---|---|
| 1 | no | no | — | — | — | (normal) | (omitted) | — |
| 2 | no | yes | — | "mainnet-data-wiped" | "wipe=mainnet-data" | ACTION | MAINNET DATA WIPED — catchup recovery | OFFLINE: mainnet data wiped out-of-band |
| 3 | yes | no | A | "session-wiped-process-alive" | "wipe=session-dir" | ACTION | SESSION DIR WIPED — process alive (PID N), operator intervention needed | OFFLINE: validator session wiped out-of-band |
| 4 | yes | no | B-ok | "session-wiped-recovery" | "wipe=session-dir" | ACTION | SESSION DIR WIPED — rebuilt + relaunched (new PID N) | OFFLINE: validator session wiped out-of-band |
| 5 | yes | no | B-fail | "session-wiped-rebuild-failed" | "wipe=session-dir" | OFFLINE | SESSION DIR WIPED — rebuild failed | OFFLINE: validator session wiped out-of-band |
| 6 | yes | yes | A | "session-wiped-process-alive", "mainnet-data-wiped" | "wipe=session-dir", "wipe=mainnet-data" | ACTION | SESSION DIR WIPED — process alive (PID N) + MAINNET DATA WIPED | OFFLINE: session + mainnet data wiped out-of-band |
| 7 | yes | yes | B-ok | "session-wiped-recovery", "mainnet-data-wiped" | "wipe=session-dir", "wipe=mainnet-data" | ACTION | SESSION DIR WIPED — rebuilt + relaunched (new PID N) + MAINNET DATA WIPED | OFFLINE: session + mainnet data wiped out-of-band |
| 8 | yes | yes | B-fail | "session-wiped-rebuild-failed", "mainnet-data-wiped" | "wipe=session-dir", "wipe=mainnet-data" | OFFLINE | SESSION DIR WIPED — rebuild failed + MAINNET DATA WIPED | OFFLINE: session + mainnet data wiped out-of-band |
Composition rule: The wipe: line is built from up to two fragments
joined by +:
- SESSION DIR WIPED fragment: included when SESSION_WIPED=yes
- MAINNET DATA WIPED: appended when MAINNET_WIPED=yes

When only mainnet is wiped (#2), append — catchup recovery to the mainnet fragment. In combined cases (#6–8), omit this suffix because the recovery mechanism is the session-wipe Case B relaunch (which handles FRESH_START implicitly).
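A sketch of that composition (the case-specific session fragment is filled in from rows #3–#8; variable names are illustrative):

wipe_line=""
if [ "$SESSION_WIPED" = "yes" ]; then
  wipe_line="SESSION DIR WIPED — <case-specific detail>"   # per rows #3-#8
fi
if [ "$MAINNET_WIPED" = "yes" ]; then
  frag="MAINNET DATA WIPED"
  [ "$SESSION_WIPED" = "no" ] && frag="$frag — catchup recovery"   # row #2 only
  wipe_line="${wipe_line:+$wipe_line + }$frag"
fi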
Issue-filing policy:
- Combined wipes (#6–8) file ONE issue (title: OFFLINE: session + mainnet data wiped). Do NOT also file a separate mainnet-only issue.
- Severity: OFFLINE → urgent; ACTION with process alive → per policy (typically urgent since node state is uncertain).
- Board-route per scripts/lib/monitor-label-policy.md routing rules (Backlog for actionable issues, Blocked for not-ready).

Tick-history watch array: Both "wipe=session-dir" and "wipe=mainnet-data" may coexist in a single tick's watch array. The daily-summary aggregator handles this — each entry is independent.
After the session-dir-vanished detection (which ensures the dir exists), touch a sentinel file at the start of every tick:
touch /home/tomer/data/$MONITOR_SESSION_ID/.alive
The .alive file's mtime = timestamp of last successful tick start. Cleanup
tooling uses this to determine session liveness (see monitor-loop cleanup
guards).
Determine once at the top of the tick: if /home/tomer/data/mainnet/mainnet.db
does NOT exist, set FRESH_START=yes (sync deadline = 20m). Otherwise
FRESH_START=no (sync deadline = 15m). Use this when evaluating check (2).
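A minimal sketch of that determination (the deadline variable name is illustrative):

if [ ! -f /home/tomer/data/mainnet/mainnet.db ]; then
  FRESH_START=yes; sync_deadline_min=20
else
  FRESH_START=no; sync_deadline_min=15
fi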
If the node's previous run ended with a crash or wedge (SIGKILL), restart begins from a stale lcl that can be hours behind mainnet tip; replay will legitimately exceed the 15m clean-restart deadline.
Detect crash-recovery once at the top of the tick. The rule fires on
either of these signals (both gated by uptime < 2h, after which
recovery is considered complete regardless):
Rotation signal: the most recent log rotation in the session's
logs/ dir is a .crashed-*, .stuck-*, or .frozen-* (not a
planned .preredeploy-*). Use find (not shell globs) to avoid
zsh NO_NOMATCH failing the pipeline.
Active-catchup signal (fires even when the rotation was
.preredeploy-*): the node is in Catching Up state AND uptime
exceeds 5 minutes. Handles the case where a manual planned restart
(which rotates as preredeploy-*) happens AFTER a wedge, leaving
the persisted lcl hours stale.
CRASH_RECOVERY=no
PID=$(_find_session_process "$HOME/data" "/proc" "$MONITOR_SESSION_ID")
if [ -n "$PID" ]; then
uptime_sec=$(ps -o etimes= -p "$PID" 2>/dev/null | tr -d ' ')
uptime_sec=${uptime_sec:-0}
if [ "$uptime_sec" -lt 7200 ]; then
# Signal 1: newest rotation type
logs_dir="/home/tomer/data/$MONITOR_SESSION_ID/logs"
newest_rotation=$(find "$logs_dir" -maxdepth 1 -type f \
\( -name 'monitor.log.crashed-*' \
-o -name 'monitor.log.stuck-*' \
-o -name 'monitor.log.frozen-*' \
-o -name 'monitor.log.preredeploy-*' \) \
-printf '%T@ %p\n' 2>/dev/null | sort -rn | head -1 | cut -d' ' -f2-)
case "$newest_rotation" in
*.crashed-*|*.stuck-*|*.frozen-*) CRASH_RECOVERY=yes ;;
esac
# Signal 2: active catchup past the clean-restart window
if [ "$CRASH_RECOVERY" = "no" ] && [ "$uptime_sec" -gt 300 ]; then
node_state=$(curl -s -m 3 "http://localhost:$MONITOR_ADMIN_PORT/info" 2>/dev/null \
| python3 -c 'import sys,json; print(json.load(sys.stdin).get("state",""))' 2>/dev/null)
if [ "$node_state" = "Catching Up" ]; then
CRASH_RECOVERY=yes
fi
fi
fi
fi
When CRASH_RECOVERY=yes and FRESH_START=no, the active sync deadline
extends to 60m and a progress carveout applies (see check (2)).
Several checks share these procedures — define once, reference by name:
Stop-PID (graceful kill, fall through to SIGKILL):
kill "$PID"; for i in $(seq 1 10); do sleep 1; kill -0 "$PID" 2>/dev/null || break; done
kill -0 "$PID" 2>/dev/null && kill -9 "$PID" && sleep 2
Relaunch (preserves log via append redirection, clears any stale lockfile):
rm -f /home/tomer/data/mainnet/mainnet.lock
RUST_LOG=info nohup /home/tomer/data/$MONITOR_SESSION_ID/cargo-target/release/henyey \
--mainnet run $MONITOR_RUN_FLAGS -c $MONITOR_CONFIG \
>> /home/tomer/data/$MONITOR_SESSION_ID/logs/monitor.log 2>&1 &
Rotate-log (preserve prior session's log under the given suffix):
mv /home/tomer/data/$MONITOR_SESSION_ID/logs/monitor.log \
/home/tomer/data/$MONITOR_SESSION_ID/logs/monitor.log.<suffix>-$(date -u +%Y%m%dT%H%M%SZ) 2>/dev/null || true
Suffix per origin: crashed (process found dead), frozen (wedge per 3b),
preredeploy (planned restart for deploy).
(1) Log scan — tail -n 500 /home/tomer/data/$MONITOR_SESSION_ID/logs/monitor.log.
Scan for hash mismatches ("hash mismatch", "HashMismatch", differing expected/actual
hashes), panics/crashes ("panic", "thread.*panicked", "SIGABRT", "SIGSEGV"),
ERROR-level log lines, assertion failures ("assertion failed").
Match ERROR-level lines by log-level prefix only, e.g. anchor with
grep -E '^[^ ]+Z\s+ERROR\s' (timestamp followed by the ERROR level token).
A naive case-insensitive error match falsely fires on INFO-level lines that
contain the word "error" as part of a message body — e.g. peer disconnect
lines like INFO ... recv error: IO error: Connection reset by peer. Treat
those as the normal overlay-churn INFO they are; do not flag.
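A sketch of the scan (patterns are the ones listed above; the temp-file name is illustrative):

log=/home/tomer/data/$MONITOR_SESSION_ID/logs/monitor.log
tail -n 500 "$log" > /tmp/tick-tail.$$
grep -E 'hash mismatch|HashMismatch' /tmp/tick-tail.$$                       # hash mismatches
grep -E 'panic|thread.*panicked|SIGABRT|SIGSEGV|assertion failed' /tmp/tick-tail.$$
grep -E '^[^ ]+Z\s+ERROR\s' /tmp/tick-tail.$$   # level-anchored; skips INFO bodies containing "error"
rm -f /tmp/tick-tail.$$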
(2) Ledger progression & sync deadline — persist ledger progression across ticks so STUCK can be detected by a single invocation:
- Read /home/tomer/data/$MONITOR_SESSION_ID/last_ledger (if it exists) — format is "<ledger>|<unix-timestamp>".
- Extract the current ledger from the latest heartbeat event (heartbeat=true) in the log tail.
- Update /home/tomer/data/$MONITOR_SESSION_ID/last_ledger with "<current-ledger>|<now>".
- Uptime via ps -o etime= -p $(_find_session_process "$HOME/data" "/proc" "$MONITOR_SESSION_ID").
- Active deadline: 15m when FRESH_START=no and CRASH_RECOVERY=no, 60m when CRASH_RECOVERY=yes, 20m when FRESH_START=yes.
- Synced means RPC age < 30s — NOT just heartbeat event gap=0. Gap is the node's local view (latest_ext - ledger) and can stay at 0 even when the node is minutes behind the network. The authoritative wall-clock signal is RPC age.
- Past the active deadline, flag SYNC FAILURE on gap > 5, or age > 30s, or heard_from_quorum=false in the latest heartbeat event. Investigate the catchup path (checkpoint-boundary stalls, hash mismatches, event-loop freezes); do not just wait.
- Progress carveout (CRASH_RECOVERY=yes): if lcl has advanced by ≥ 500 ledgers since the previous tick's last_ledger, the node is actively replaying — report CATCHING UP regardless of uptime. Flag SYNC FAILURE only when lcl stops advancing AND uptime exceeds 60m.
- Fresh-start carveout (FRESH_START=yes, uptime < 20m): a large gap is expected during initial bucket apply — report CATCHING UP, not SYNC FAILURE.
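A sketch of the persistence step (extraction of LEDGER from the heartbeat event is elided):

state=/home/tomer/data/$MONITOR_SESSION_ID/last_ledger
prev_entry=$(cat "$state" 2>/dev/null)             # "<ledger>|<unix-timestamp>"
PREV_LEDGER=${prev_entry%%|*}                      # stash before overwrite (used by 3c/3d)
printf '%s|%s\n' "$LEDGER" "$(date -u +%s)" > "$state"   # LEDGER from the heartbeat event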
(3) Process alive — find by /proc/exe path matching via _find_session_process "$HOME/data" "/proc" "$MONITOR_SESSION_ID" (from scripts/lib/monitor-decisions.sh, sourced at skill init). This validates that the process binary is exactly $HOME/data/$MONITOR_SESSION_ID/cargo-target/release/henyey (including the (deleted) suffix after rebuilds), scoping detection to the current session.

Historical note: the original pgrep -f 'henyey.*run' form was abandoned because it false-matched parallel claude --print agent processes; the intermediate comm-only replacement was abandoned because it false-matched any henyey process regardless of session (cross-session false positive, see #2467).

If not running: Rotate-log with suffix crashed, then before Relaunch evaluate the (3a) Repeated-FATAL state-wipe trigger below.
(3a) Repeated-FATAL state-wipe trigger — a kill-loop on the same persisted state means the local lcl is corrupt and forward replay can never reconcile. When this happens, restart-without-wipe just accumulates crashed logs and stays offline. Detect and self-heal once per kill-loop:
logs_dir=/home/tomer/data/$MONITOR_SESSION_ID/logs
# Uses the shared detect_crash_state function from scripts/lib/monitor-decisions.sh
# (already sourced at skill init). Uses numeric mtime comparison (stat -c %Y),
# not -newermt, for portability and testability.
detect_crash_state "$logs_dir"
recent_count="$CRASH_RECENT_COUNT"
latest_crashed="$CRASH_LATEST_FILE"
hash_mismatch_signal="$CRASH_HASH_MISMATCH"
Trigger the wipe when ALL hold:
- recent_count >= 3 (3+ crashed rotations in the last 30 min — proves restart-without-fix isn't recovering)
- hash_mismatch_signal == "yes" (the most recent crash logged the structured field fatal_wipe_required=true or the legacy prose "State wipe required before restart" — emitted by trigger_fatal_shutdown() for any unrecoverable local state corruption)
- FRESH_START=no (don't fire on a fresh sync that hasn't completed yet)

When triggered:
# Stop any partially-running process first (defensive — should already be dead)
PID=$(_find_session_process "$HOME/data" "/proc" "$MONITOR_SESSION_ID")
[ -n "$PID" ] && kill "$PID" && sleep 5 && kill -0 "$PID" 2>/dev/null && kill -9 "$PID"
# Wipe the corrupt persisted state (recoverable from public network archive)
rm -f /home/tomer/data/mainnet/mainnet.db \
/home/tomer/data/mainnet/mainnet.db-shm \
/home/tomer/data/mainnet/mainnet.db-wal \
/home/tomer/data/mainnet/mainnet.lock
rm -rf /home/tomer/data/mainnet/buckets
# Reset progression tracker so the next tick treats this as a fresh start
rm -f /home/tomer/data/$MONITOR_SESSION_ID/last_ledger
# Mark the wipe action — read by (3d) to detect post-wipe recurrence.
# Stores the wipe epoch; cleared by (3d) when stable-recovery condition holds.
date -u +%s > /home/tomer/data/$MONITOR_SESSION_ID/wipe_attempted_at
Then Relaunch. The next tick will see FRESH_START=yes (mainnet.db absent), apply the 20m sync deadline, and let the node fresh-catchup from network archive (typically ~10–15 min to validating, observed ~9 min on 2026-05-09).
File a new urgent GH issue documenting the wipe with the count of crashed rotations, the hash-mismatch evidence from the latest crashed log, and the cumulative downtime — this is a data point for whether the underlying recovery code path needs further hardening even though the immediate cause was already fixed. Board-route to Backlog: bash .github/skills/plan-do-review/scripts/move-issue-status.sh "$N" Backlog
The 3-in-30-min trigger is self-rate-limiting only when the underlying fault is local state corruption — after a successful wipe + fresh-catchup, new .crashed-* rotations stop. (3d) Post-wipe recurrence guard below catches the alternate failure mode where the freshly-rebuilt state trips the same fatal_wipe_required signal — proof that the bug is in the apply path itself, not on disk — and prevents an infinite wipe→catchup→crash loop.
(3d) Post-wipe recurrence guard — when 3a's wipe doesn't fix the symptom (a new crashed-* rotation with fatal_wipe_required=true appears AFTER the wipe), the binary is the suspect, not local state. Looping 3a indefinitely wastes downtime and archive bandwidth. Stop relaunching, auto-quarantine the build SHA, and wait for operator-applied rollback or fix. Evaluate (3d) BEFORE (3a) in the dead-process path; if (3d) fires, skip (3a) and skip Relaunch.
marker=/home/tomer/data/$MONITOR_SESSION_ID/wipe_attempted_at
post_wipe_recurrence=no
if [ -s "$marker" ]; then
wipe_epoch=$(cat "$marker" 2>/dev/null | tr -dc '0-9')
# Use the same detect_crash_state result already populated for (3a):
# CRASH_LATEST_FILE + CRASH_HASH_MISMATCH are valid in this scope.
if [ -n "${CRASH_LATEST_FILE:-}" ] && [ -n "${wipe_epoch:-}" ]; then
latest_mtime=$(stat -c %Y "$CRASH_LATEST_FILE" 2>/dev/null || echo 0)
if [ "$latest_mtime" -gt "$wipe_epoch" ] && [ "${CRASH_HASH_MISMATCH:-no}" = "yes" ]; then
post_wipe_recurrence=yes
fi
fi
fi
When post_wipe_recurrence=yes:
# 1. Auto-quarantine the build SHA. Idempotent — already-present is OK.
source "$(git rev-parse --show-toplevel)/scripts/lib/deploy-quarantine.sh"
build_sha=$(cat "/home/tomer/data/$MONITOR_SESSION_ID/build_sha" 2>/dev/null || true)
if [ -n "$build_sha" ]; then
quarantine_append "$HOME/data/deploy_quarantine.txt" "$build_sha" \
"post-wipe recurrence (3d) — see urgent issue"
fi
# 2. Do NOT relaunch. Do NOT fire (3a) again — wipe already proved insufficient.
# 3. Report OFFLINE with `actions: ["post-wipe-recurrence", "auto-quarantined"]`
# and `watch: ["determinism-suspect-binary"]`.
File or comment on an urgent GH issue marking the binary as a determinism-suspect: include the build SHA, the wipe epoch from the marker, the latest crashed-log hash-mismatch evidence, and the cumulative downtime since wipe. Title pattern: "OFFLINE: post-wipe recurrence on <sha:8> — binary quarantined". If an open issue already names the same build_sha as determinism-suspect, comment on it instead of filing a duplicate. Board-route to Backlog: bash .github/skills/plan-do-review/scripts/move-issue-status.sh "$N" Backlog
Recovery from (3d) is operator-driven. The deploy gate at section 10 step 3 (Quarantine gate) will continue to BLOCK redeploy of the quarantined SHA as long as it is reachable from origin/main. Operator action: revert/rollback the bad commit so it is no longer in origin/main's ancestry; the next tick's deploy gate will then let the rebuild + relaunch proceed normally. The marker is cleared automatically by the recovery rule below — operators do not need to remove it manually.
Marker-clear rule — runs each tick when the process is alive. Clear wipe_attempted_at only when the binary has demonstrably stabilized:
marker=/home/tomer/data/$MONITOR_SESSION_ID/wipe_attempted_at
if [ -s "$marker" ] && [ -n "${PID:-}" ]; then
PROC_START_EPOCH=$(stat -c %Y /proc/$PID/stat 2>/dev/null || echo 0)
uptime_sec=$(( $(date -u +%s) - PROC_START_EPOCH ))
has_fatal_wipe_evidence "$logs_dir" "$logs_dir/monitor.log" # sets FATAL_WIPE_EVIDENCE
ledger_advanced=no
# last_ledger holds the previous tick's value, stashed before check (2) overwrites
if [ -n "${PREV_LEDGER:-}" ] && [ -n "${LEDGER:-}" ] && [ "$LEDGER" -gt "$PREV_LEDGER" ]; then
ledger_advanced=yes
fi
if [ "$uptime_sec" -gt 1800 ] \
&& [ "${FATAL_WIPE_EVIDENCE:-no}" = "no" ] \
&& [ "$ledger_advanced" = "yes" ]; then
rm -f "$marker"
fi
fi
All four conditions (alive, uptime > 30m, no fatal_wipe_required since proc start, last_ledger advanced) must hold simultaneously to clear the marker. Quarantine entries are NOT auto-removed — operators clear them deliberately per the Quarantine Clearance procedure (section 10).
(3c) Soft-fail state-wipe trigger — defense-in-depth for the case where
trigger_fatal_shutdown() signals exit but the process fails to terminate,
leaving it alive with fatal_state_failure=true, blocking all recovery, and
making no ledger progress. This complements (3a) which only fires post-mortem.
Evaluate (3c) BEFORE (3b) when the process IS alive. If (3c) fires, skip (3b)
— a wipe supersedes a plain restart.
logs_dir=/home/tomer/data/$MONITOR_SESSION_ID/logs
log_file="$logs_dir/monitor.log"
# PID (session-scoped via _find_session_process; skip (3c) if empty)
PID=$(_find_session_process "$HOME/data" "/proc" "$MONITOR_SESSION_ID")
[ -z "$PID" ] && : # skip — dead-process path (3a) handles this
# Process start time from /proc/$PID/stat mtime
PROC_START_EPOCH=$(stat -c %Y /proc/$PID/stat 2>/dev/null || echo 0)
# Uses shared functions from scripts/lib/monitor-decisions.sh
has_fatal_wipe_evidence "$logs_dir" "$log_file"
detect_soft_fail_blocked "$log_file" "$PROC_START_EPOCH"
Trigger the wipe when ALL hold:
- SOFT_FAIL_BLOCKED == "yes" (WARN-level "Recovery escalation blocked" messages sustained for >= 5 min within current PID lifetime, most recent within 90s of now)
- FATAL_WIPE_EVIDENCE == "yes" (fatal_wipe_required=true signal found in any crashed rotation OR the active log — no time window, confirms persistent state corruption)
- FRESH_START == "no" (not a fresh sync)
- last_ledger unchanged (stash previous tick's value before check (2) overwrites; compare against current ledger)

When triggered:
# 1. Stop the alive-but-stuck process
kill "$PID" && sleep 5
kill -0 "$PID" 2>/dev/null && kill -9 "$PID" && sleep 2
# 2. Rotate log (preserve evidence, consistent suffix with 3a)
mv "$log_file" "${log_file}.crashed-$(date -u +%Y%m%dT%H%M%SZ)" 2>/dev/null || true
# 3. Wipe corrupt persisted state (same artifacts as 3a)
rm -f /home/tomer/data/mainnet/mainnet.db \
/home/tomer/data/mainnet/mainnet.db-shm \
/home/tomer/data/mainnet/mainnet.db-wal \
/home/tomer/data/mainnet/mainnet.lock
rm -rf /home/tomer/data/mainnet/buckets
# 4. Reset progression tracker
rm -f /home/tomer/data/$MONITOR_SESSION_ID/last_ledger
Then Relaunch. The next tick will see FRESH_START=yes (mainnet.db absent).
File a new urgent GH issue documenting the soft-fail wipe with: blocked
duration (SOFT_FAIL_BLOCKED_DURATION_SEC), evidence source
(FATAL_WIPE_SOURCE), and cumulative downtime. Use title pattern:
"Soft-fail state wipe: fatal_state_failure stuck for {N}m". Always a new
issue (no dedup — each wipe is a distinct incident). Known prior incidents: #2363.
Board-route to Backlog: bash .github/skills/plan-do-review/scripts/move-issue-status.sh "$N" Backlog
Self-limiting: after wipe, FRESH_START=yes blocks condition (3); new process
has no fatal_state_failure so condition (1) fails; log rotation removes old
blocked messages from active log.
(3b) Wedge detection — a process can be alive but have a frozen event
loop (watchdog fires, HTTP hangs, ledger progression stops). Check 3 alone
misses this because pgrep still finds the PID.
Flag WEDGE when BOTH:
- grep -E 'watchdog_freeze"?\s*[=:]\s*true|WATCHDOG: Event loop appears frozen' $LOG | tail -1 is present with a timestamp within the last 120s. (The structured field watchdog_freeze=true is the primary signal; the prose string is a legacy fallback. The "? accounts for JSON key quoting.)
- curl -s -m 3 http://localhost:$MONITOR_ADMIN_PORT/info returns empty body or times out.

On WEDGE: Stop-PID, Rotate-log with suffix frozen, then Relaunch.
Always file a new urgent-labeled issue (wedge blocks validator operation).
Board-route to Backlog: bash .github/skills/plan-do-review/scripts/move-issue-status.sh "$N" Backlog
Recurrence-after-fix → NEW issue, not a comment on a closed one. Known prior
incidents: #1904, #1873, #1921, #1949.
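A sketch of the two-signal check (the 120s timestamp-recency parse is elided for brevity):

wedge=no
freeze_line=$(grep -E 'watchdog_freeze"?\s*[=:]\s*true|WATCHDOG: Event loop appears frozen' "$LOG" | tail -1)
if [ -n "$freeze_line" ]; then                 # also verify its timestamp is within 120s
  if ! curl -s -m 3 "http://localhost:$MONITOR_ADMIN_PORT/info" | grep -q .; then
    wedge=yes                                  # frozen log signal AND dead admin endpoint
  fi
fi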
(4) Memory — ps -o rss= -p $(_find_session_process "$HOME/data" "/proc" "$MONITOR_SESSION_ID"), convert to MB.
- If RSS > 12 GB, flag HIGH MEMORY (report-only; no restart).
- Restart only when ALL hold: RSS > 16 GB, AND available memory from free -m (NOT free — that excludes reclaimable kernel cache) < 8 GB, AND the two most recent memory_report=true (or Memory report summary) entries both show heap_components_mb growing by > 500 MB vs the earlier snapshot.
- On restart: Stop-PID, Rotate-log with suffix crashed, Relaunch.

(5) Disk — df -h /home/tomer/data | tail -1. If usage > 85%, flag LOW DISK.
Then keep the 3 most recent rotated archives per category (the ISO 8601
timestamp suffix sorts lexicographically, so sort -r gives newest-first;
use find -printf not shell glob to survive zsh NO_NOMATCH):
logs_dir=/home/tomer/data/$MONITOR_SESSION_ID/logs
[ -d "$logs_dir" ] && for pat in preredeploy crashed stuck frozen; do
find "$logs_dir" -maxdepth 1 -type f -name "monitor.log.$pat-*" \
-printf '%f\n' 2>/dev/null | sort -r | tail -n +4 \
| while read -r f; do rm -f "$logs_dir/$f"; done
done
Report how many files were removed if any.
(6) Session disk — du -sh /home/tomer/data/$MONITOR_SESSION_ID/
and du -sh /home/tomer/data/mainnet/. If combined > 200 GB, flag SESSION DISK HIGH.
(7) Memory report — grep -E 'memory_report=true|Memory report summary' /home/tomer/data/$MONITOR_SESSION_ID/logs/monitor.log | tail -1.
If grep returns no output AND process uptime > 400s, flag WARNING
memory-report-missing (log format may have changed). Memory reports emit
every ~6 minutes, so an uptime below 400s legitimately has no entry yet on
a post-restart tick — report memreport: N/A (warmup) and move on.
Otherwise extract jemalloc_allocated_mb, jemalloc_resident_mb,
fragmentation_pct, heap_components_mb, mmap_mb, unaccounted_mb,
unaccounted_sign. If fragmentation_pct > 50, flag HIGH FRAGMENTATION.
If unaccounted_mb > 1000 with sign +, note it (known jemalloc overhead,
not a bug — but verify heap_components is stable; if it is growing, investigate).
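A sketch of the field extraction, assuming key=value formatting in the structured log line (adjust the patterns if the actual format differs):

line=$(grep -E 'memory_report=true|Memory report summary' \
  /home/tomer/data/$MONITOR_SESSION_ID/logs/monitor.log | tail -1)
frag=$(printf '%s\n' "$line" | grep -oE 'fragmentation_pct=[0-9.]+' | cut -d= -f2)
heap=$(printf '%s\n' "$line" | grep -oE 'heap_components_mb=[0-9.]+' | cut -d= -f2)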
(8) RPC health (validator mode only — skip in watcher mode) —
curl -s -X POST http://localhost:$MONITOR_RPC_PORT -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","id":1,"method":"getHealth"}'.
Verify response is non-empty and status is healthy. Check pruning:
latestLedger - oldestLedger should be bounded.
- Steady-state gap: ~retention_window ledgers, plus a maintenance-cycle buffer (maintenance every 900s ≈ 180 ledgers at 5s/ledger).
- Since 57821bcf/25797e2e (Apr 27), ledger headers and tx history are also held back from pruning to satisfy publishing. The new equilibrium is ~3× retention_window (~1050 for retention=360).
- If gap > 3 × retention_window + 500 (~1580 for retention=360), flag PRUNING STALLED — that's "headers/tx-history protection plus an extra maintenance cycle's slack" exceeded, indicating real pruning failure.

(9) OBSRVR Radar (validator mode only — skip in watcher mode) — get the public key from curl -s http://localhost:$MONITOR_ADMIN_PORT/info (extract public_key). Then curl -s https://radar.withobsrvr.com/api/v1/nodes/<PUBLIC_KEY>. Check:
- isValidating — if false and node running > 30 min, flag NOT VALIDATING.
- validating24HoursPercentage — if < 50 and running > 6 hours, flag LOW VALIDATION RATE.
- lag — if > 500, flag HIGH LAG.

If the API errors out, emit obsrvr: N/A (api-error) instead of omitting the field.
If the API returns a response but latestLedger or updatedAt is missing/null
(partial response), treat the entire radar result as stale / incomplete and
emit obsrvr: N/A (api-incomplete). Do NOT evaluate lag in this case — a
lag value returned alongside a null latestLedger/updatedAt is a cached
aggregate from a prior observation window (observed post-restart: lag=8754
against a node that is in real-time sync with age=2s).
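A sketch of the fetch (field names are the ones listed above; the response schema beyond them is assumed):

pk=$(curl -s "http://localhost:$MONITOR_ADMIN_PORT/info" \
  | python3 -c 'import sys,json; print(json.load(sys.stdin).get("public_key",""))')
radar=$(curl -s "https://radar.withobsrvr.com/api/v1/nodes/$pk")
[ -z "$radar" ] && echo "obsrvr: N/A (api-error)"
is_validating=$(printf '%s' "$radar" \
  | python3 -c 'import sys,json; print(json.load(sys.stdin).get("isValidating",""))' 2>/dev/null)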
(12) Metrics scan — scrape /metrics and evaluate the alert catalog.
1. mkdir -p /home/tomer/data/$MONITOR_SESSION_ID/metrics.
2. mv /home/tomer/data/$MONITOR_SESSION_ID/metrics/current.prom /home/tomer/data/$MONITOR_SESSION_ID/metrics/prev.prom 2>/dev/null || true.
3. curl -s http://localhost:$MONITOR_ADMIN_PORT/metrics > /home/tomer/data/$MONITOR_SESSION_ID/metrics/current.prom.
4. For counter deltas, if current < prev, treat delta = current (process restarted).
delta = current (process restarted).Post-restart warmup exemptions — for the first 2 ticks after a detected
restart (check 3, check 10, or CRASH_RECOVERY=yes), skip these alerts.
Overlay handshake + jemalloc arena stabilization + from-zero counters take
~10 minutes to settle.
- henyey_jemalloc_fragmentation_pct > 50 (post-restart frag ramps ~35-45%, settles to ~18%)
- stellar_peer_count < 8 (peer count ramps 0 → 10+ over the first 10 min)
- stellar_overlay_inbound_authenticated < 3
- stellar_scp_timing_externalized_seconds > 3
- stellar_scp_timing_nominated_seconds > 2
- any counter-delta alert on the first post-restart scrape (where prev is 0)

Catalog-wide notes:
- Synced-only gating means uptime > 15m AND CRASH_RECOVERY=no AND FRESH_START=no. These gauges fire only when the node should be in real-time sync; the authoritative sync check is (2). The catchup phases are legitimate non-synced states.

COUNTERS (fire on delta ≥ threshold per tick):
- stellar_herder_lost_sync_total ≥1 → SYNC
- henyey_post_catchup_hard_reset_total ≥1 → ACTION
- (stellar_overlay_timeout_idle_total + stellar_overlay_timeout_straggler_total) ≥5× prior-tick-sum → WARN
- (stellar_overlay_error_read_total + stellar_overlay_error_write_total) ≥50 → WARN
- henyey_archive_cache_refresh_error_total ≥1 → NONC
- henyey_archive_cache_refresh_timeout_total ≥3 → NONC

stellar_ledger_apply_failure_total, henyey_scp_post_verify_drops_total, and stellar_herder_pending_too_old_total are covered by the ratio checks below (traffic-proportional and self-calibrating); do NOT check absolute deltas.
GAUGES (fire on absolute threshold against current snapshot):
- stellar_peer_count <8 → WARN
- stellar_overlay_inbound_authenticated <3 → WARN (synced-only gating; healthy fleet has 50+ inbound. Aggregate peer_count can be ≥8 from outbound while inbound starves consensus; this catches that case)
- stellar_ledger_age_current_seconds >30 → SYNC (synced-only gating)
- stellar_herder_state !=2 → SYNC (synced-only gating)
- henyey_jemalloc_fragmentation_pct >50 on two consecutive ticks → WARN
- henyey_scp_verify_input_backlog >100 on two consecutive ticks → WARN (single snapshots routinely spike to 100+ during slot externalize bursts and drain to 0 within seconds — gate on persistence. Sample /metrics 5x at 2s intervals to verify before filing)
- henyey_scp_verifier_thread_state !=0 → WARN (0=Running, 1=Stopping, 2=Dead)
- stellar_herder_pending_envelopes >2000 → WARN
- henyey_overlay_fetch_channel_depth >500 on two consecutive ticks → WARN (the _max variant is a monotonic high-water mark that never resets after catchup-tail spikes — use the live gauge with persistence guard, mirroring the scp_verify_input_backlog pattern. Sample /metrics 5x at 2s intervals to verify before filing)
- (henyey_process_open_fds / henyey_process_max_fds) >0.85 → WARN
- henyey_herder_drift_max_seconds >10 → NONC
- stellar_scp_timing_externalized_seconds >10 → WARN (this is a SLOT-CYCLE metric: first-envelope-received → self-externalize, naturally ~4-6s on mainnet's 5s slots. The earlier >3 threshold was a misread of #1934 — it compared against stellar-core's mFirstToSelfExternalizeLag, which measures a different, much narrower window (any-node externalize → self-externalize, ~0.3-0.5s healthy). Until we expose a matching metric, alert only on >2x normal slot-cycle = real degradation)
- stellar_scp_timing_nominated_seconds >7 → WARN (also a SLOT-CYCLE metric measured from first local nomination vote to self-externalize; healthy floor ~3-5s on mainnet)

quorum_agree / quorum_missing / quorum_fail_at are intentionally NOT monitored — they snapshot the tracking slot's QuorumInfo and return false-positive noise between externalizations; see monitor-loop/SKILL.md Metrics Scan section for the code-path explanation.
HISTOGRAMS (fire on p99 bucket of per-tick delta): for each histogram H,
bucket_delta[le] = H_bucket_current[le] - H_bucket_prev[le]
count_delta = H_count_current - H_count_prev
Skip if count_delta < 20. Compute cumulative bucket delta at upper edge L:
sum(bucket_delta[le] for le ≤ L). The smallest L where cumulative ≥
0.99 * count_delta is the p99 upper bound.
Thresholds (all WARN):
- henyey_ledger_close_handle_complete_seconds p99 >0.5s
- henyey_ledger_close_dispatch_to_join_seconds p99 >5s
- henyey_ledger_close_post_complete_seconds p99 >0.5s
- henyey_ledger_close_tx_exec_seconds p99 >1s
- henyey_ledger_close_soroban_exec_seconds p99 >1s
- henyey_ledger_close_commit_seconds p99 >0.5s
- henyey_ledger_close_soroban_state_seconds p99 >0.5s
- henyey_ledger_close_complete_tx_queue_seconds p99 >0.5s

Mean check (sum_delta / count_delta) is a cheaper fallback — fire on whichever breaches.
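A sketch of the p99 computation for one histogram, exploiting the fact that Prometheus buckets are cumulative (so cur[le] - prev[le] is already the cumulative count delta at edge le); sort -g is used because it parses the +Inf edge:

H=henyey_ledger_close_commit_seconds   # example metric from the list above
cc=$(awk -v h="$H" '$1 == (h "_count") {print int($2)}' current.prom)
pc=$(awk -v h="$H" '$1 == (h "_count") {print int($2)}' prev.prom)
n=$(( ${cc:-0} - ${pc:-0} ))
if [ "$n" -ge 20 ]; then
  p99=$(awk -v h="$H" '
    index($1, h "_bucket{le=\"") == 1 {
      le = $1; sub(/^.*le="/, "", le); sub(/".*$/, "", le)
      if (NR == FNR) prev[le] = $2                 # first pass: prev.prom
      else           print le, $2 - prev[le]       # second pass: cumulative delta
    }' prev.prom current.prom \
    | sort -g | awk -v n="$n" '$2 >= 0.99 * n { print $1; exit }')
  echo "$H p99 <= ${p99}s"
fi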
Counter catalog entries use one of three extraction forms. When adding a new
counter, check the emission site in crates/app/src/metrics.rs to determine
whether the metric is scalar or labeled, and use the appropriate form.
Form 1 — Scalar counter (most counters):
cur=$(grep -E '^<metric_name> ' current.prom | awk '{printf "%d", $2}')
prev=$(grep -E '^<metric_name> ' prev.prom | awk '{printf "%d", $2}')
Form 2 — Single labeled series (extract one specific label value):
cur=$(grep -E '^<metric_name>\{<label>="<value>"\} ' current.prom | awk '{printf "%d", $2}')
prev=$(grep -E '^<metric_name>\{<label>="<value>"\} ' prev.prom | awk '{printf "%d", $2}')
Form 3 — Sum of explicit labeled series (sum specific label values):
cur=$(awk '/^<metric_name>\{<label>="<value1>"\}|^<metric_name>\{<label>="<value2>"\}/ {sum+=$NF} END{printf "%d", sum+0}' current.prom)
prev=$(awk '/^<metric_name>\{<label>="<value1>"\}|^<metric_name>\{<label>="<value2>"\}/ {sum+=$NF} END{printf "%d", sum+0}' prev.prom)
For all forms, delta handling is identical:
- Extract prev using the same pattern against prev.prom (default 0 if absent).
- If current < prev: counter reset (node restarted) → delta = current.
- Otherwise delta = current - prev.

Label-presence validation: For labeled counters, validate that the expected label set is present before extracting. This catches silent renames or removals that would weaken an alert. If the expected label set is missing or mutated, skip the alert entirely rather than treating missing series as zero. Follow the existing pattern used for henyey_scp_post_verify_total (lines 406-413):
labels=$(grep -oP '^<metric_name>\{<label>="\K[^"]+' current.prom | sort)
expected=$(printf '%s\n' <label1> <label2> ... | sort)
if [ "$labels" != "$expected" ]; then
# report: <metric>: skipped (label set mismatch)
fi
Three ratio-based checks replace the former absolute-delta thresholds for
stellar_ledger_apply_failure_total, henyey_scp_post_verify_drops_total,
and stellar_herder_pending_too_old_total.
These fire only after 3 consecutive ticks of threshold breach, avoiding
transient spikes.
Required inputs — global (all must be present, non-empty, numeric; if any missing or invalid, skip ratio checks entirely, empty the ratio snapshot, reset streaks):
- stellar_ledger_age_current_seconds — sync gating
- stellar_ledger_apply_success_total
- stellar_ledger_apply_failure_total
- henyey_scp_post_verify_total{reason="accepted"}
- henyey_scp_post_verify_total{reason="processed_directly"}
- sum of all henyey_scp_post_verify_total{reason="..."} lines (pv_total_sum)

Per-check required inputs — these are NOT part of the global required set. If missing, only the pending check (Check 3) skips; SCP and apply proceed normally:
- stellar_herder_pending_too_old_total
- stellar_herder_pending_received_total

Post-verify label-set validation: After extracting the 14
henyey_scp_post_verify_total{reason="..."} lines, validate that the exact set
of reason labels is: invalid_sig, panic, drift_range, drift_close_time,
drift_cannot_receive, self_message, non_quorum, buffered, duplicate,
too_far, buffer_full, processed_directly, per_slot_full, accepted. If any label is
missing or unexpected labels appear, treat as missing counters (partial scrape
or label change).
Skip conditions (skip all ratio checks when any is true):
- FRESH_START=yes
- gap > 5
- stellar_ledger_age_current_seconds > 30
- /metrics fetch fails (curl error or empty response)
- /metrics returns "recorder not installed"

On any skip: empty /home/tomer/data/$MONITOR_SESSION_ID/metrics/ratio_snapshot, reset all streak counters to 0, report metrics_ratio: skipped (<reason>).
Snapshot file: /home/tomer/data/$MONITOR_SESSION_ID/metrics/ratio_snapshot
Format:
version=1
pid=<PID>
start_ticks=<field 22 from /proc/$PID/stat>
timestamp=<ISO8601>
apply_success=<value>
apply_failure=<value>
pv_accepted=<value>
pv_processed_directly=<value>
pv_total_sum=<value>
apply_breach_streak=<N>
scp_breach_streak=<N>
pending_too_old=<value>
pending_received=<value>
pending_breach_streak=<N>
Process identity: start_ticks = field 22 from /proc/$PID/stat (starttime
in clock ticks since boot). Extract with awk '{print $22}' /proc/$PID/stat.
Locale-independent, catches PID reuse.
Invalidation:
- start_ticks changed → discard, write new baseline, reset streaks
- version ≠ 1 → discard

Metric extraction — illustrative pseudocode (not literal shell; each bail point must stop ratio-check processing for this tick):
metrics_body=$(curl -s http://localhost:$MONITOR_ADMIN_PORT/metrics)
# Bail on fetch failure — stop ratio checks for this tick
if [ -z "$metrics_body" ]; then
> /home/tomer/data/$MONITOR_SESSION_ID/metrics/ratio_snapshot
# report: metrics_ratio: skipped (fetch failed)
# STOP — do not proceed to extraction or ratio evaluation
fi
# Bail if recorder not installed — stop ratio checks
if echo "$metrics_body" | grep -q 'metrics recorder not installed'; then
> /home/tomer/data/$MONITOR_SESSION_ID/metrics/ratio_snapshot
# report: metrics_ratio: skipped (recorder not installed)
# STOP
fi
ledger_age=$(echo "$metrics_body" | grep -E '^stellar_ledger_age_current_seconds ' | awk '{printf "%d", $2}')
apply_success=$(echo "$metrics_body" | grep -E '^stellar_ledger_apply_success_total ' | awk '{printf "%d", $2}')
apply_failure=$(echo "$metrics_body" | grep -E '^stellar_ledger_apply_failure_total ' | awk '{printf "%d", $2}')
pv_accepted=$(echo "$metrics_body" | grep -E '^henyey_scp_post_verify_total\{reason="accepted"\} ' | awk '{printf "%d", $2}')
pv_processed=$(echo "$metrics_body" | grep -E '^henyey_scp_post_verify_total\{reason="processed_directly"\} ' | awk '{printf "%d", $2}')
# Validate exact 14-label set — STOP if mismatch
pv_labels=$(echo "$metrics_body" | grep -oP '^henyey_scp_post_verify_total\{reason="\K[^"]+' | sort)
expected_labels=$(printf '%s\n' accepted buffer_full buffered drift_cannot_receive drift_close_time drift_range duplicate invalid_sig non_quorum panic per_slot_full processed_directly self_message too_far | sort)
if [ "$pv_labels" != "$expected_labels" ]; then
> /home/tomer/data/$MONITOR_SESSION_ID/metrics/ratio_snapshot
# report: metrics_ratio: skipped (label set mismatch)
# STOP
fi
pv_total_sum=$(echo "$metrics_body" | grep -E '^henyey_scp_post_verify_total\{reason="[^"]+"\} ' | awk '{sum+=$2} END {printf "%d", sum}')
# Validate all 6 global values present and numeric — STOP if any invalid
for v in "$ledger_age" "$apply_success" "$apply_failure" "$pv_accepted" "$pv_processed" "$pv_total_sum"; do
if [ -z "$v" ] || ! echo "$v" | grep -qE '^[0-9]+$'; then
> /home/tomer/data/$MONITOR_SESSION_ID/metrics/ratio_snapshot
# report: metrics_ratio: skipped (missing counters)
# STOP
fi
done
# Per-check inputs for pending (Check 3) — if missing, only Check 3 skips
pending_too_old=$(echo "$metrics_body" | grep -E '^stellar_herder_pending_too_old_total ' | awk '{printf "%d", $2}')
pending_received=$(echo "$metrics_body" | grep -E '^stellar_herder_pending_received_total ' | awk '{printf "%d", $2}')
pending_counters_valid=true
for v in "$pending_too_old" "$pending_received"; do
if [ -z "$v" ] || ! echo "$v" | grep -qE '^[0-9]+$'; then
pending_counters_valid=false
break
fi
done
# henyey_recovery_stalled_tick_total: extract via Form 2 (see §Metric extraction
# forms). Relaxed label validation: require only the forcing_catchup_behind
# label to be present (minimum for Check 12b alerting). Other labels
# (backoff_active, forcing_catchup_not_behind, archive_behind_peer_ahead_hard_reset,
# at_tip_no_scp_hard_reset) may not yet exist on a fresh node — only 3 of the 5
# are pre-registered in the metrics macro (crates/app/src/metrics.rs:613); the
# remaining 2 are created dynamically on first use of each recovery path.
# If forcing_catchup_behind is missing, skip Check 12b for this tick.
# Alerting is streak-gated — see Check 12b section for logic.
Check 1: SCP post-verify acceptance rate
- Numerator: delta(pv_accepted) + delta(pv_processed)
- Denominator: delta(pv_total_sum)
- Breach: (numerator / denominator) < 0.05 (less than 5% accepted) for 3 consecutive ticks
- Low volume: skip (report scp: skipped (low volume), reset scp streak)
- On breach: increment scp_breach_streak. If streak ≥ 3, route through Bug Filing Workflow (investigate verifier thread, stale envelopes, sync state).
- On no breach: reset scp_breach_streak to 0.

Check 2: Transaction apply failure rate
- Numerator: delta(apply_failure)
- Denominator: delta(apply_failure) + delta(apply_success)
- Breach: (numerator / denominator) > 0.50 (over 50% fail) for 3 consecutive ticks
- Low volume: skip (report apply: skipped (low volume), reset apply streak)
- On breach: increment apply_breach_streak. If streak ≥ 3, investigate in same tick. If evidence points to a henyey apply-engine bug, file/comment via Bug Filing Workflow. If expected bad-tx traffic (spam wave, known rejections), report as WARNING without filing.
- On no breach: reset apply_breach_streak to 0.

Check 3: Pending too-old rate
- Numerator: delta(pending_too_old)
- Denominator: delta(pending_received)
- Breach: (numerator / denominator) > 0.50 (over 50% too old) for 3 consecutive ticks
- Low volume: skip (report pending: skipped (low volume), reset pending streak)
- If pending_counters_valid is false (missing counters), skip this check only — report pending: skipped (missing counters), reset pending_breach_streak to 0. SCP and apply checks proceed normally.
- On breach: increment pending_breach_streak. If streak ≥ 3, report as WARNING (overlay lag or stale-envelope flood). Investigate overlay peer health and slot progression.
- On no breach: reset pending_breach_streak to 0.

Per-check state machine (each check independently):
Per-check low-volume: When only one check's denominator delta is below its minimum, that check skips (streak resets to 0) and the others proceed normally.
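Shell arithmetic is integer-only, so the ratio comparison needs awk (or similar). A sketch for the SCP check, with the other two following the same shape (the prev_* values come from ratio_snapshot; names are illustrative):

num=$(( (pv_accepted - prev_pv_accepted) + (pv_processed - prev_pv_processed) ))
den=$(( pv_total_sum - prev_pv_total_sum ))
scp_breach=$(awk -v num="$num" -v den="$den" \
  'BEGIN { print (den > 0 && num / den < 0.05) ? "yes" : "no" }')
# increment scp_breach_streak on "yes", reset to 0 on "no"; fire at streak >= 3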
Thresholds are provisional — tune after 1-2 weeks of production data.
Status report: Each check independently reports one of: ok (value),
skipped (reason), WARNING value (N ticks), or collecting baseline.
Canonical constants: Threshold values, snapshot path, and applicability are defined in shared/check-12b-constants.toml. This section is authoritative for the state-machine logic; inline literals are cross-validated against the TOML by scripts/test-monitor-skill-snippets.sh.
This check tracks henyey_recovery_stalled_tick_total{reason="forcing_catchup_behind"}
using a streak-gated alert, independent of Check 12's ratio checks. It runs on
its own state machine because ratio checks are globally skipped during unsync
states (ledger age > 30s, gap > 5, etc.), but the recovery-stalled counter fires
precisely during recovery transitions when the node is briefly unsynced.
Data source: Reuses the same /metrics scrape result ($metrics_body /
metrics/current.prom) already fetched by check-8/check-12. Does NOT perform a
second /metrics fetch.
Applicability: Validator mode only. In watcher mode, skip Check 12b entirely
and omit the recovery_stalled: line from the status report.
Snapshot file: /home/tomer/data/$MONITOR_SESSION_ID/metrics/counter_streak_snapshot
Format:
version=1
pid=<PID>
start_ticks=<field 22 from /proc/$PID/stat>
timestamp=<ISO8601>
recovery_stalled_behind=<value>
recovery_stalled_breach_streak=<N>
PID/start_ticks check (always, even on skip): Before evaluating skip conditions, check PID/start_ticks against the stored snapshot. If the process restarted (PID or start_ticks changed), invalidate the snapshot immediately. This catches restarts during skip ticks and prevents comparing post-restart counters to a pre-restart baseline.
Skip conditions (skip Check 12b only when any is true):
- /metrics fetch failed this tick (same condition check-8 detects)
- /metrics returns "recorder not installed"
- forcing_catchup_behind label missing from the scrape

On skip: write snapshot preserving the existing recovery_stalled_behind value (or 0 if no prior snapshot) with recovery_stalled_breach_streak=0. The next healthy tick compares against the preserved value — it does NOT enter "collecting baseline" after a skip. Report recovery_stalled: skipped (<reason>).
Invalidation (reset streak AND enter "collecting baseline"):
- start_ticks changed (process restart) — checked before skip conditions
- version ≠ 1
- recovery_stalled_behind value < previous (counter reset)

On invalidation: write new snapshot with current counter value and
recovery_stalled_breach_streak=0. Report recovery_stalled: collecting baseline.
Do NOT evaluate burst or streak logic on invalidation ticks — this prevents
false-firing the burst override on the first tick after a restart where the
counter jumps from 0 to the current absolute value.
Per-tick logic (not skipped AND not invalidated):
delta = current(recovery_stalled_behind) - prev(recovery_stalled_behind)
if delta >= 10:
# Immediate-fire override: large burst indicates sustained stalling.
# Do NOT reset streak — keep incrementing; cooldown (7200s) handles dedup.
recovery_stalled_breach_streak += 1
→ fire WARN, route through Bug Filing Workflow
elif delta >= 1:
recovery_stalled_breach_streak += 1
if recovery_stalled_breach_streak >= 3:
→ fire WARN, route through Bug Filing Workflow
else: # delta == 0
recovery_stalled_breach_streak = 0
Note: after skipped ticks where the counter value was preserved, the first healthy tick may see a large accumulated delta spanning multiple monitor intervals. This is acceptable — the burst threshold (≥10) catches sustained trouble regardless of tick granularity.
Post-restart warmup: First tick writes baseline ("collecting baseline"). Second tick has a valid prev for delta comparison and begins normal evaluation.
Alert identity and cooldown:
- Alert identity: henyey_recovery_stalled_tick_total{reason="forcing_catchup_behind"} (full selector-qualified)
- Dedup search: gh issue list --search 'metrics: henyey_recovery_stalled_tick_total{reason="forcing_catchup_behind"}' --state open
- Issue title: metrics: henyey_recovery_stalled_tick_total{reason="forcing_catchup_behind"} — sustained breach

Integration with metrics aggregate: Check 12b alerts are NOT counted in the
metrics: line (which tracks immediate-fire counter/gauge alerts). They appear
on their own recovery_stalled: line. A fired Check 12b alert does contribute
to overall tick severity — the tick is considered unhealthy when any alert fires.
Status line: recovery_stalled: (reported after metrics_ratio:):
- recovery_stalled: ok (delta=0) — no increment, streak reset
- recovery_stalled: breach (delta=N, streak M/3) — incrementing, below threshold
- recovery_stalled: WARNING delta=N (M ticks) — investigating — streak ≥ 3
- recovery_stalled: WARNING delta=N (burst) — investigating — immediate fire (delta ≥ 10)
- recovery_stalled: skipped (<reason>) — metric missing or fetch failed
- recovery_stalled: collecting baseline — first tick after restart/invalidation

Rendering precedence (determines the metrics_ratio: line format):
1. metrics_ratio: skipped (<reason>)
2. metrics_ratio: collecting baseline
3. metrics_ratio: scp <scp_status>, apply <apply_status>, pending <pending_status>

Examples:
- metrics_ratio: scp ok (accept=15%), apply ok (fail=8%), pending ok (too_old=3%)
- metrics_ratio: scp ok (accept=12%), apply WARNING fail=55%>50% (3 ticks) — investigating, pending ok (too_old=5%)
- metrics_ratio: scp skipped (low volume), apply ok (fail=5%), pending ok (too_old=2%)
- metrics_ratio: scp ok (accept=15%), apply ok (fail=8%), pending skipped (missing counters)
- metrics_ratio: skipped (not in sync)
- metrics_ratio: collecting baseline

recovery_stalled: examples (after metrics_ratio: in output):
- recovery_stalled: ok (delta=0)
- recovery_stalled: breach (delta=2, streak 1/3)
- recovery_stalled: WARNING delta=1 (3 ticks) — investigating
- recovery_stalled: WARNING delta=15 (burst) — investigating
- recovery_stalled: skipped (metric missing)
- recovery_stalled: collecting baseline

For each firing alert:
- Load /home/tomer/data/$MONITOR_SESSION_ID/metrics/anomaly_cooldown.json (create empty {} if missing).
- If now - last_filed[<metric>] < 7200s (2h), include the alert in the status report but SKIP file/comment.
- Search for an existing issue: gh issue list --search "metrics: <metric-name>" --state open.
- If found: gh issue comment <N> with the new evidence (current/prev values, delta, threshold, ledger, binary sha, sibling metrics). Apply the urgent label only if the metric breach blocks validator operation (per the Label policy in the Bug filing workflow); otherwise leave unlabeled. Board-route: if NOT on project board, add to Backlog.
- If not found: gh issue create (append --label urgent only when the metric breach is operation-blocking; most metric alerts are non-urgent and should be filed without a label) with:
  - Title: Non-critical: metrics: <metric> (NONC tier) or metrics: <metric> — <symptom> (WARN/SYNC tier).
  - Body: the evidence above, the emission site via grep -n "<metric_name_without_prefix>" crates/ -r, and a suggested fix.
  - Board-route: bash .github/skills/plan-do-review/scripts/move-issue-status.sh "$N" Backlog
- Update anomaly_cooldown.json with {"<metric>": <now>}.
- SYNC-tier alerts also set the sync: line in the status report (not just metrics:).

If $MONITOR_MODE = watcher, run check (12) with a reduced catalog: only
process (open_fds, max_fds), jemalloc, and overlay counters/gauges. Skip
SCP, quorum, herder_state, histogram p99 alerts, ratio checks, and
Check 12b (recovery-stalled streak). Omit the recovery_stalled: line
from watcher output entirely.
(10) Remote sync — first sanity-check the working tree:
- If git status --porcelain reports any output, ABORT the deploy path for this tick. Report: DEPLOY SKIPPED (dirty tree) with the list of dirty paths. Do not run git pull against a dirty tree; do not kill the node. Investigate the dirty tree before the next tick.
- Otherwise git fetch origin main. If in detached HEAD state (git symbolic-ref HEAD fails), git checkout main first.

Determine deployed_sha — the sha of the source tree used to build the
currently-running binary. This decouples the deploy gate from git HEAD, which
can diverge from the running binary if an operator or CI pushes commits
without going through the skill's build+restart path.
BUILD_SHA_FILE="/home/tomer/data/$MONITOR_SESSION_ID/build_sha"
SESSION_DIR="/home/tomer/data/$MONITOR_SESSION_ID"
deployed_sha=""
deployed_sha_status=""
if [ -s "$BUILD_SHA_FILE" ]; then
candidate=$(tr -d '[:space:]' < "$BUILD_SHA_FILE")
if printf '%s' "$candidate" | grep -qE '^[0-9a-f]{40}$' \
&& git cat-file -e "${candidate}^{commit}" 2>/dev/null; then
deployed_sha="$candidate"
deployed_sha_status="ok"
else
deployed_sha_status="invalid"
fi
elif [ -e "$SESSION_DIR/last_ledger" ] \
|| [ -e "$SESSION_DIR/metrics/current.prom" ] \
|| [ -e "$SESSION_DIR/tick-history.jsonl" ]; then
# Other per-session state exists — monitor-loop should have written
# build_sha but didn't (e.g. rollout of this change to an existing
# session). Force a rebuild to restore the invariant.
deployed_sha_status="invalid"
else
# Truly fresh session dir — safe to fall back to today's behavior.
deployed_sha=$(git rev-parse HEAD)
deployed_sha_status="missing-fresh"
fi
Runtime cross-validation — use the running binary's embedded commit hash
as the authoritative source of truth. If /info reports a valid, locally-
reachable commit hash, it overrides BUILD_SHA_FILE (which becomes a cache).
# Cross-validate: running binary's commit_hash is authoritative.
running_hash=""
if info_json=$(curl -sf "http://localhost:$MONITOR_ADMIN_PORT/info" 2>/dev/null); then
running_hash=$(printf '%s' "$info_json" \
| python3 -c 'import sys,json; print(json.load(sys.stdin).get("commit_hash",""))' 2>/dev/null)
fi
if [ -n "$running_hash" ] \
&& printf '%s' "$running_hash" | grep -qE '^[0-9a-f]{40}$' \
&& git cat-file -e "${running_hash}^{commit}" 2>/dev/null; then
# Binary reports a valid, locally-reachable commit hash — authoritative.
case "$deployed_sha_status" in
ok)
if [ "$deployed_sha" != "$running_hash" ]; then
# BUILD_SHA_FILE disagrees with binary. Trust binary, repair file.
printf '%s\n' "$running_hash" > "${BUILD_SHA_FILE}.tmp"
mv "${BUILD_SHA_FILE}.tmp" "$BUILD_SHA_FILE"
deployed_sha="$running_hash"
# deployed_sha_status stays "ok"
fi
;;
invalid|missing-fresh)
# File bad/missing but binary knows its commit — repair and use.
printf '%s\n' "$running_hash" > "${BUILD_SHA_FILE}.tmp"
mv "${BUILD_SHA_FILE}.tmp" "$BUILD_SHA_FILE"
deployed_sha="$running_hash"
deployed_sha_status="ok"
;;
esac
fi
# If /info unreachable, commit_hash absent/empty/malformed, or SHA not in
# local git: no-op — fall through to existing logic unchanged.
| /info state | commit_hash field | git cat-file -e | Action |
|---|---|---|---|
| Unreachable | — | — | No-op |
| Reachable | Missing/empty/malformed | — | No-op |
| Reachable | Valid 40-char hex | Fails (not in local git) | No-op (prevents cache poisoning) |
| Reachable | Valid 40-char hex | Succeeds, matches deployed_sha | No-op |
| Reachable | Valid 40-char hex | Succeeds, differs from deployed_sha | Repair file, use runtime hash |
| State | Meaning | Gate behavior |
|---|---|---|
| ok | Valid commit known (from file or repaired from runtime) | Compare deployed_sha vs origin/main |
| missing-fresh | No file, no runtime hash, no other session state | Falls back to HEAD vs origin/main |
| invalid | File corrupt + runtime unavailable | Force rebuild unconditionally |
Report deployed_sha_status in the tick's status line.
Gate comparison:
origin_sha=$(git rev-parse origin/main)
case "$deployed_sha_status" in
ok|missing-fresh)
[ "$deployed_sha" = "$origin_sha" ] && gate_action="up-to-date" || gate_action="proceed"
;;
invalid)
gate_action="proceed-force-rebuild"
;;
esac
If gate_action="up-to-date": no action (already up to date).
Otherwise enter the deploy path:
Cool-down guard: if node uptime is < 60m AND CRASH_RECOVERY=no (i.e.
the previous tick deployed cleanly), SKIP DEPLOY this tick. Report:
DEPLOY DEFERRED (cool-down: uptime=<X>m < 60m). If deployed_sha_status
is invalid, append , build_sha: invalid to the report so the operator
knows a follow-up rebuild is pending once cool-down expires. Rationale:
every restart costs ~1-2m of validation downtime + brief peer reconnect.
Deploying multiple commits within minutes thrashes the validator without
giving each version time to surface issues. The 60m floor was raised from
the original 30m (#1944) on 2026-04-30 after observing too many quick-cycle
deploys per day. Crash-recovery restarts are exempt because they're not
voluntary deploys. Urgent fixes can override by killing the node manually
before the next tick — the normal startup path will pick up the latest
origin/main.
Binary-relevance check:
if [ "$gate_action" = "proceed-force-rebuild" ]; then
# invalid build_sha — skip allowlist, force rebuild
needs_rebuild="yes"
else
# ok or missing-fresh: diff from deployed_sha
if changed_paths=$(git diff --name-only "$deployed_sha" origin/main 2>/dev/null); then
# Evaluate allowlist on $changed_paths (see below).
# If every path is allowlisted: needs_rebuild="no"
# Else: needs_rebuild="yes"
...
else
# git diff failed — fail closed, force rebuild.
needs_rebuild="yes"
fi
fi
If `needs_rebuild="no"` (all paths allowlisted), skip the rebuild + restart:
run `git pull --rebase` only, then report
`DEPLOY SYNCED (no-binary-impact: docs/scripts only — pulled <N> commits, no restart)`.
Do NOT update `BUILD_SHA_FILE` (the binary hasn't changed).
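One way to produce the `pulled <N> commits` count — a sketch, not a prescribed command sequence:

```bash
# Sketch: count incoming commits before the docs-only fast-forward.
git fetch origin
n=$(git rev-list --count "${deployed_sha}..origin/main")
git pull --rebase
deploy_report="DEPLOY SYNCED (no-binary-impact: docs/scripts only — pulled ${n} commits, no restart)"
```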
The non-binary-impact allowlist is:

- `.github/`
- `.claude/`
- `scripts/`
- `docs/`
- `*.md` files, e.g. `README.md` or `CLAUDE.md`
- `stellar-specs` / `stellar-specs/` submodule pointer changes only

Deny by default: if any path is outside this allowlist, continue to the CI,
build, and restart path. In particular, changes to `Cargo.toml`,
`Cargo.lock`, any `build.rs`, `crates/`, `configs/`, or mixed docs+code
commits require a rebuild + restart.
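A minimal sketch of the deny-by-default evaluation over `$changed_paths` (the `case` patterns transcribe the list above):

```bash
# Sketch: deny-by-default allowlist check on git-diff paths.
needs_rebuild="no"
while IFS= read -r p; do
    [ -z "$p" ] && continue
    case "$p" in
        .github/*|.claude/*|scripts/*|docs/*|*.md|stellar-specs) ;;  # allowlisted
        *) needs_rebuild="yes"; break ;;                              # anything else forces rebuild
    esac
done <<< "$changed_paths"
```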
Quarantine gate (only reached when `needs_rebuild=yes`):

```bash
# --- Quarantine gate ---
source "$(git rev-parse --show-toplevel)/scripts/lib/deploy-quarantine.sh"
check_quarantine_ancestry "$HOME/data/deploy_quarantine.txt"
if [ $? -eq 0 ]; then
    case "$QUARANTINE_STATUS" in
        blocked_unreadable)
            deploy_report="BLOCKED (quarantine file unreadable — fail-closed)" ;;
        blocked_git_error)
            deploy_report="BLOCKED (quarantine ancestry check failed for ${QUARANTINED_MATCH:0:8} — fail-closed)" ;;
        blocked_ancestor)
            deploy_report="DEFERRED (quarantined: ${QUARANTINED_MATCH:0:8} reachable from origin/main — see ~/data/deploy_quarantine.txt)" ;;
    esac
    [ -n "$QUARANTINE_WARNINGS" ] && deploy_report="$deploy_report [WARN: $QUARANTINE_WARNINGS]"
    # Skip to status report with this deploy_report. Do not continue.
fi
```
If quarantined, report the appropriate status and exit the deploy path.
Do NOT proceed to CI check, build, or restart. Parse/check warnings are
appended to the deploy: status line regardless of outcome.
Check CI status on origin/main:
`gh run list --branch main --limit 3 --json conclusion --jq '.[].conclusion'`.
If any recent run has conclusion `failure`, do NOT deploy — route the failure
through check (11) and wait.
If all conclusions are `success` (ignore `""` conclusions from in-progress
runs, and `cancelled` runs): `git pull --rebase`, then
`CARGO_TARGET_DIR=/home/tomer/data/$MONITOR_SESSION_ID/cargo-target cargo build --release -p henyey`.
If build succeeds:
```bash
# Persist the just-built sha BEFORE Stop-PID. Atomic via tmp + mv.
new_sha=$(git rev-parse HEAD)
printf '%s\n' "$new_sha" > "${BUILD_SHA_FILE}.tmp"
mv "${BUILD_SHA_FILE}.tmp" "$BUILD_SHA_FILE"
```
Then: Stop-PID, Rotate-log with suffix `preredeploy`, Relaunch.
Report: `DEPLOY — pulled <N> commits (<old-sha>..<new-sha>), rebuilt, restarted at L<ledger>`.
If build fails: report `BUILD FAILED`, do NOT restart — the old binary is
still running. Do NOT update `BUILD_SHA_FILE`. Route the build error through
check (11).
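For reference, a minimal sketch of the run-level green/red gate from the "Check CI status" step above (`ci_red` is a hypothetical local flag):

```bash
# Sketch: any "failure" conclusion in the last 3 main-branch runs blocks deploy.
if gh run list --branch main --limit 3 --json conclusion --jq '.[].conclusion' \
   | grep -qx 'failure'; then
    ci_red="yes"    # route through check (11); do not deploy this tick
else
    ci_red="no"     # "" (in-progress) and "cancelled" do not block
fi
```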
(11) CI check — scope and levels of detection.
(11a) Scope: only inspect workflows that run on branch main.
`gh run list --branch main --limit 10 --json databaseId,name,status,conclusion,headSha,createdAt --jq '.[] | "\(.name)|\(.status)|\(.conclusion)|\(.headSha[:8])|\(.databaseId)|\(.createdAt)"'`.
Ignore runs triggered by PRs on other branches. Scan for completed runs with
conclusion `failure`.
(11b) Job-level (CRITICAL — catches continue-on-error failures): For the
latest completed run of EACH distinct workflow name (enumerate dynamically
from 11a, do NOT hard-code names), check individual jobs:
`gh run view <ID> --json jobs --jq '.jobs[] | select(.conclusion == "failure") | "\(.name)|\(.conclusion)"'`.
Workflows with continue-on-error jobs report run-level conclusion success
even when jobs fail — you MUST check job-level conclusions. If any jobs have
conclusion failure, treat it the same as a run-level failure.
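A sketch of that dynamic enumeration (the jq grouping is my construction; `group_by` keeps each group's newest-first input order, so `.[0]` is the latest completed run per workflow):

```bash
# Sketch: job-level check of the latest completed run of each workflow.
gh run list --branch main --limit 10 --json databaseId,name,status \
  --jq '[.[] | select(.status=="completed")] | group_by(.name)
        | .[] | .[0] | "\(.databaseId)\t\(.name)"' \
| while IFS=$'\t' read -r run_id wf_name; do
    failed=$(gh run view "$run_id" --json jobs \
      --jq '.jobs[] | select(.conclusion=="failure") | .name')
    [ -n "$failed" ] && echo "JOB FAILURE in ${wf_name}: ${failed}"
  done
```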
REPORTING RULE — NEVER report ci: all green if ANY job has conclusion
failure, even if the run-level conclusion is success. The ci: line in
the status report MUST reflect the WORST job-level result across all
workflows. A continue-on-error job failure is NOT green — it is RED. Do not
qualify failures as "known", "pre-existing", or "cosmetic".
Only act on failures from the last 2 hours (compare `createdAt` with
`date -u +%Y-%m-%dT%H:%M:%SZ`).
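ISO-8601 UTC timestamps sort lexicographically, so the recency cut can be a plain string comparison — a sketch, assuming GNU `date -d`:

```bash
# Sketch: 2-hour recency filter on a run's createdAt (ISO-8601 UTC).
cutoff=$(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%SZ)   # GNU date
is_recent() { [ "$1" \> "$cutoff" ]; }                   # usage: is_recent "$created_at"
```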
For each failure:

1. Pull the failing logs: `gh run view <ID> --log-failed 2>&1 | tail -80`.
2. If the failure looks like flaky automation rather than a real regression
   (e.g. `test_core3_restart_rejoin_*` is a recurrer; see #1838 / #1939) —
   rerun the failing jobs first:

   ```bash
   attempts=$(gh run view <ID> --json attempt --jq .attempt)
   if [ "$attempts" -lt 2 ]; then
       gh run rerun <ID> --failed
   fi
   ```

   Report `CI RERUN ATTEMPTED — <workflow> jobs <names> on <sha>, attempt 1→2`
   and stop CI processing for this tick. The next tick will see the
   re-run conclusion. Don't rerun more than once — if the re-run also
   fails, the failure is real (or an unusually persistent flake) and
   warrants filing.
3. If the failure is clearly NOT automation (build error from real code,
   test assertion mismatch, hash mismatch, panic from production code),
   skip the rerun and proceed to step 4.
4. Check for an existing open issue:
   `gh issue list --search "<workflow name + signature>" --state open`.
   If one matches, `gh issue comment <N>` with the new evidence (sha, log
   snippet, timestamp) and ensure it has the `urgent` label
   (`gh issue edit <N> --add-label urgent`) — failing CI on origin/main
   blocks deploy and meets the urgent criteria.
   Board-route: if NOT on project board, add to Backlog:
   `bash .github/skills/plan-do-review/scripts/move-issue-status.sh "$N" Backlog`
5. If no open match, file a new issue:
   `gh issue create --label urgent --title "<workflow>: <short signature>" --body "..."`
   with investigation findings.
   Board-route: `bash .github/skills/plan-do-review/scripts/move-issue-status.sh "$N" Backlog`

Report `CI ISSUE FILED — <workflow> failed on <sha>, filed/commented #<N>`.

## Bug filing workflow

Applies to node bugs, metric alerts, and CI failures.
### `--body-file`, not heredocs

When the issue body contains code blocks, always write the body to a temp
file and pass `--body-file <path>` rather than `--body "$(cat <<'EOF' ... EOF)"`.
Reason: heredocs nested inside double-quoted shell expressions tempt the agent
to backslash-escape backticks (`` \` ``), which GitHub renders literally — every
code fence breaks. There is no GitHub-side workaround once filed; the issue
must be re-edited via `gh issue edit <N> --body-file <path>`.
Pattern:

````bash
cat > /tmp/issue-body.md <<'EOF'
## Symptom
...
```rust
some code
```
...
EOF
gh issue create --title "..." --body-file /tmp/issue-body.md
# or, to repair an already-filed issue:
gh issue edit <N> --body-file /tmp/issue-body.md
````
The single-quoted `'EOF'` makes the heredoc literal — no escaping needed for
backticks, dollar signs, etc. inside the body.
### Label policy
> **Canonical reference:** `scripts/lib/monitor-label-policy.md` is the
> single source of truth for labeling and board-routing rules. This section
> is a normative excerpt.
When creating or commenting on issues:
- **`urgent`** — file with `--label urgent`, OR `gh issue edit <N> --add-label urgent`
on an existing issue, ONLY when the symptom blocks validator operation or
consensus participation. Specifically:
- hash mismatch (any kind)
- wedged node (frozen event loop, watchdog auto-abort)
- failing CI on origin/main blocking deploy
- panic or crash from production code
- SYNC FAILURE past the active deadline
- OOM-driven restart
- and similar runtime-blocking conditions
- **(no label)** — non-urgent issues: calibration, threshold tuning, NONC
alerts, cosmetic noise, follow-up improvements. Downstream picks these up
at lower priority.
- **`not-ready`** — reserved for tier-3 self-reflection issues that need
operator decision before any code change. Do not use for regular bug
filings.
**Label lifecycle:**
- `urgent` and `not-ready` are mutually exclusive in steady state.
- Monitors MAY add `urgent` to an existing unlabeled issue (escalation).
- Monitors do NOT remove labels (de-escalation is operator-only).
- If a `not-ready` issue becomes blocking: add `urgent`, comment with
evidence. Both labels coexist temporarily until operator resolves.
### Filing flow
1. Identify the failing signature (ledger + error type for node bugs; metric
+ threshold for alerts; workflow + job + error type for CI).
2. Investigate to root cause — read source code, trace code paths.
3. Check for an existing open issue: `gh issue list --search "<keywords>" --state open`.
If a match exists, verify its state is OPEN
(`gh issue view <N> --json state -q .state`), `gh issue comment <N>` with
new evidence, and apply the `urgent` label only if the recurrence meets
the urgent criteria above.
**Board-route:** if the issue is NOT on the project board, add it:
`bash .github/skills/plan-do-review/scripts/move-issue-status.sh "$N" Backlog`
(or `Blocked` if labeled `not-ready`). If already on board, preserve
current status. STOP.
4. If no OPEN match, file a new issue with `gh issue create` — append
`--label urgent` only if the symptom meets the urgent criteria above;
otherwise omit the label. Body is a self-contained proposal (symptom,
evidence, suspected root cause, fix sketch with file:line references).
**Board-route:** add to project board:
`bash .github/skills/plan-do-review/scripts/move-issue-status.sh "$N" Backlog`
(or `Blocked` if labeled `not-ready`).
5. Do NOT spawn agents. Do NOT edit the main checkout. The next redeploy tick
(check 10) will pick up whatever lands on main.
**Board routing failure:** If `move-issue-status.sh` fails, log as ACTION
in the status report. Retry on next tick. Escalate after 3 consecutive
failures (see `scripts/lib/monitor-label-policy.md`).
**Recurrence policy**: If a previously-filed bug recurs with material new
evidence, prefer commenting on the existing issue when it is the same bug at
the same site AND the issue is OPEN. If the prior issue is CLOSED, file a new
issue (label per the policy above) with `Related to #<prior> (closed)` in the
body and a note on why the prior fix did not cover this case; do NOT comment
on the closed issue. File a new issue (still referencing `Related to #<prior>`
with one-line scope-diff) when new evidence points at a different named
subsystem, phase/mark, root-cause hypothesis, or candidate site set.
**Commit policy**: the monitor does NOT commit code. All fixes are delegated
via `gh issue` + project board routing.
**Deploy regression policy**: If the node fails after a deploy
(see also `scripts/lib/monitor-label-policy.md`):
(a) Record the bad SHA BEFORE any rollback rebuild (`build_sha` will be
overwritten during rebuild):
```bash
bad_sha=$(cat "$BUILD_SHA_FILE")
```

(b) Append to quarantine file (idempotent):

```bash
source "$(git rev-parse --show-toplevel)/scripts/lib/deploy-quarantine.sh"
rc=0
quarantine_append "$HOME/data/deploy_quarantine.txt" "$bad_sha" "regression #<issue>" || rc=$?
if [ $rc -ne 0 ]; then
    echo "WARNING: quarantine_append failed (rc=$rc) — deploy gate may not block next tick" >&2
fi
```
(c) File or comment on a GitHub issue (label `urgent` since validator
operation is impacted) with the regression details (commit range, symptoms,
watchdog data). Board-route to Backlog:
`bash .github/skills/plan-do-review/scripts/move-issue-status.sh "$N" Backlog`
(d) Restart the node on the last known-good binary (rebuild from the previous commit) while waiting for the fix. Do NOT revert commits inline.
The quarantine gate (section 10, step 3) will now block re-deployment as long as the quarantined SHA is reachable from origin/main. The quarantine does NOT auto-lift — see "Quarantine Clearance" below.
### Quarantine Clearance

The deploy quarantine is a safety lock. The monitor skill NEVER removes
entries. Clearance is an explicit operator decision.
When to clear: After ALL of:
How to clear (using the deploy-quarantine helper):

```bash
source "$(git rev-parse --show-toplevel)/scripts/lib/deploy-quarantine.sh"
rc=0
quarantine_remove "$HOME/data/deploy_quarantine.txt" "$bad_sha" || rc=$?
if [ $rc -ne 0 ]; then
    echo "ERROR: quarantine_remove failed (rc=$rc) — entry may still be active" >&2
fi
```
Operator reminders: The `deploy:` status line reports
`DEFERRED (quarantined: ...)` every tick (~20 minutes). This is a
persistent, automatic reminder that requires no separate notification.
Emergency override: To force deploy despite quarantine:

```bash
rm "$HOME/data/deploy_quarantine.txt"   # clears ALL quarantines
# or: remove a specific entry using the helper:
# source "$(git rev-parse --show-toplevel)/scripts/lib/deploy-quarantine.sh"
# quarantine_remove "$HOME/data/deploy_quarantine.txt" "<sha>"
```
For ANY anomaly, investigate to root cause — read source code, check logs,
trace code paths. Never dismiss as "expected". Produce a GitHub issue
(label per the Bug filing workflow's Label policy — urgent if it blocks
operation, otherwise no label) or comment on an existing one for every anomaly that isn't
immediately explained. Only exception: anomalies whose root cause turns
out to be literal expected-correct behavior per the code — document the code
path in the status report and skip the filing.
Print a multiline status report:
```
MONITOR <OK|WARNING|ACTION|OFFLINE> — L<ledger> — <timestamp>
node: mode=<MODE> session=<session-id> pid=<PID> fresh_start=<yes|no>
wipe: <per wipe-state composition table — omitted when no wipe>
sync: <synced | CATCHING UP (gap=N, uptime=Xm, deadline=<15m|20m|60m>) | SYNC FAILURE (gap=N, uptime=Xm — filed/commented #<N>)>
mem: <RSS_MB>MB rss | alloc=<alloc>MB resident=<resident>MB frag=<pct>%
     heap=<heap>MB mmap=<mmap>MB unaccounted=<sign><unaccounted>MB
disk: <used>/<total> (<pct>%) | session+data=<size>
rpc: <healthy|unhealthy|N/A> oldestL=<X> latestL=<Y> window=<Z>
obsrvr: <validating=<Y/N> val24h=<pct>% lag=<N> | N/A (watcher) | N/A (api-error)>
metrics: <clean | N alerts (<metric1>,<metric2>,...) — filed/commented #<N>,#<M> | N alerts, K suppressed by cooldown>
metrics_ratio: scp <ok (accept=X%) | skipped (reason) | WARNING accept=X%<5% (N ticks)>, apply <ok (fail=Y%) | skipped (reason) | WARNING fail=Y%>50% (N ticks) — investigating>, pending <ok (too_old=Z%) | skipped (reason) | WARNING too_old=Z%>50% (N ticks)> | collecting baseline
recovery_stalled: <ok (delta=0) | breach (delta=N, streak M/3) | WARNING delta=N (M ticks) — investigating | WARNING delta=N (burst) — investigating | skipped (<reason>) | collecting baseline>
deploy: <up-to-date | DEFERRED (quarantined: <sha8> reachable from origin/main — see ~/data/deploy_quarantine.txt) | BLOCKED (quarantine file unreadable — fail-closed) | BLOCKED (quarantine ancestry check failed for <sha8> — fail-closed) | DEFERRED (cool-down: ...) | SYNCED (no-binary-impact: ...) | pulled N commits (old..new) | SKIPPED (dirty-tree|ci-red|build-failed, filed/commented #<N>)>
ci: <all green (run+job level) | WORKFLOW failed — filed/commented #<N> | WORKFLOW jobs FAILED (continue-on-error) — NAME|conclusion listed, filed/commented #<N>>
self_reflect: <clean | fixed inline (<sha>: <short-desc>) | filed #<N> (urgent: <short-desc>) | filed #<N> (no-label: <short-desc>) | filed #<N> (not-ready: <short-desc>)>
```
The `wipe:` line is present when `SESSION_WIPED=yes` or `MAINNET_WIPED=yes`
(or both). Format per the wipe-state composition table above. When neither
fires, omit the line entirely.
Use WARNING for threshold breaches. Use ACTION when a corrective action was
taken (restart, deploy, filed a new issue, commented on an existing issue,
session-wipe recovery). Use OFFLINE when the node cannot be recovered in
this tick (e.g., rebuild failed after session wipe).
Use SYNC FAILURE (not WARNING) when the node has exceeded the active sync
deadline (15m populated / 20m fresh-start / 60m crash-recovery) but is not closing ledgers in
real-time — this is a bug that requires immediate investigation AND
filing/commenting on a GitHub issue (label urgent since SYNC FAILURE blocks
consensus). Board-route to Backlog per scripts/lib/monitor-label-policy.md.
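A sketch of the deadline selection; `$START_MODE` is a hypothetical variable standing in for however the tick classifies the start:

```bash
# Sketch: pick the active sync deadline (minutes) by start class.
case "$START_MODE" in
    crash-recovery) sync_deadline_m=60 ;;
    fresh-start)    sync_deadline_m=20 ;;
    *)              sync_deadline_m=15 ;;   # populated store
esac
```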
After emitting the status report (and before self-reflection), append a
single JSON line to /home/tomer/data/$MONITOR_SESSION_ID/tick-history.jsonl
so /daily-summary can aggregate the last 24h of ticks:
```bash
# ts is computed inside the Python block — do NOT use shell expansion for timestamps.
HIST=/home/tomer/data/$MONITOR_SESSION_ID/tick-history.jsonl
python3 - <<'PY' >> "$HIST"
import json
from datetime import datetime, timezone
print(json.dumps({
    "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    "status": "<OK|WARNING|ACTION|OFFLINE>",
    "ledger": <current-ledger-int>,
    "build": "<short-sha>",
    "deploys": <0 or 1>,
    "warnings": [<list of metric names that breached>],
    "actions": [<list of action keywords: restart, deploy, filed-#N, session-wiped-recovery, session-wiped-process-alive, session-wiped-rebuild-failed, mainnet-data-wiped>],
    "self_reflect": "<clean | fixed-inline | filed-#N>",
    "watch": ["<key>=<value>", ...]
}))
PY
```
**`ts` contract:** The `ts` field records the wall-clock UTC time captured immediately before emitting the JSON line. Format: `YYYY-MM-DDTHH:MM:SSZ` (ISO 8601, always UTC). Acceptable skew: ≤ 60 seconds from real wall-clock time. Any drift > 60 s indicates a bug in the capture mechanism.
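A sketch of a drift self-check against this contract, assuming GNU `date -d`:

```bash
# Sketch: flag >60s drift between the last recorded ts and wall clock.
last_ts=$(tail -1 "$HIST" | jq -r .ts)
skew=$(( $(date -u +%s) - $(date -u -d "$last_ts" +%s) ))   # GNU date
[ "${skew#-}" -le 60 ] || echo "WARN: tick-history ts skew ${skew}s > 60s" >&2
```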
`watch` carries multi-tick non-incident concerns the daily summary should
surface — examples: `pruning_gap=2451`, `frag_pct=18`, `disk_pct=72`,
`wipe=session-dir`, `wipe=mainnet-data`. Add a key whenever the tick body
called out a value worth tracking but did not file an issue. Each entry MUST
be one line — the daily-summary aggregator parses with
`for ln in f: json.loads(ln)`.
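That one-object-per-line contract can be spot-checked with `jq` — a sketch:

```bash
# Sketch: every tick-history line must parse as standalone JSON
# (mirrors the aggregator's `for ln in f: json.loads(ln)`).
awk 'NF' "$HIST" | while IFS= read -r ln; do
    printf '%s\n' "$ln" | jq -e . >/dev/null || echo "bad tick-history line: $ln" >&2
done
```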
After the status report is emitted, look back at THIS tick's output and
check for problems in the monitor itself — not in the node. The node
is covered by the ## Investigation policy above. This section is about
bugs / miscalibrations in the tick logic, thresholds, catalog, or
detection code that fired false positives, failed silently, or rendered
contradictory output.
Examples of what to look for:

- Miscalibrated or inverted checks that fire false positives (e.g.
  `scp_verifier_thread_state != 1` was inverted; `quorum_agree < 4` fires
  between externalizations by design).
- Detection code that fails silently (e.g. `NO_NOMATCH` killed a pipeline; a
  histogram metric name had a typo and silently resolved to `count_delta=0`,
  which skipped evaluation).
- Contradictory output (e.g. `sync: synced` alongside
  `metrics: peer_count<8 WARN`), or similar internal inconsistency that
  suggests a missing warmup exemption or cross-check rule.

If none of the above: report `self_reflect: clean` and stop.
When an issue is found, choose a tier and act on it in the same tick.
**Tier 1 — Fix inline (trivial edit).**
Criteria (ALL must hold):

- the edit touches only `.claude/skills/monitor-tick/SKILL.md`
  (no other files touched)

Action sequence (same pattern used for every skill edit in this repo):
```bash
# 1. Edit .claude/skills/monitor-tick/SKILL.md
# 2. git add .claude/skills/monitor-tick/SKILL.md
# 3. git commit -m "Monitor-tick: <what + why>" with Co-authored-by: Claude Code trailer
# 4. git push origin main (on reject: git pull --rebase && git push)
```
Report: `self_reflect: fixed inline (<short-sha>: <short-desc>)`.
**Tier 2 — File GH issue (non-trivial but codeable).**
Apply the Bug filing workflow's Label policy: most monitor-tick self-reflection
issues are non-urgent (calibration / threshold / catalog tuning) and should
be filed without a label. Only use urgent if the skill's miscalibration is
silently masking a real validator-blocking signal.
The issue is real and actionable but any of:
Before filing, search for an existing open issue:
gh issue list --search "monitor-tick: <keywords>" --state open.
Comment with new evidence if a match exists; otherwise file.
Board-route per scripts/lib/monitor-label-policy.md: Backlog for
actionable issues, Blocked for not-ready issues.
Issue body MUST include:
Title format:
- `Non-critical: monitor-tick: <description>` — for observability /
  calibration issues (noise reduction, threshold tuning) — file with no label
- `monitor-tick: <description>` — for correctness bugs (silent failures,
  contradictory output) — file with no label unless the bug is silently
  masking a real validator-blocking signal (then `urgent`)

Report: `self_reflect: filed #<N> (urgent: <short-desc>)` or
`self_reflect: filed #<N> (no-label: <short-desc>)`.
**Tier 3 — File `not-ready` GH issue (human input required).**
Use this tier when any of:
Issue body includes everything from Tier 2 plus an explicit "Human input required" section listing the specific decisions / options that need an operator answer.
Label: `not-ready`. Title format:
`monitor-tick: [needs-decision] <description>`.
Report: `self_reflect: filed #<N> (not-ready: <short-desc>)`.