monitor-tick

| name | monitor-tick |
| description | One tick of the henyey mainnet monitor — checks, metrics scan, deploy, status report |
One invocation of the monitor loop. Intended to be called on a ~20-minute
cadence by an external orchestrator (crontab, systemd timer, CI) via
claude -p '/monitor-tick', or by a Claude-internal /loop 20m /monitor-tick.
The 60m deploy cool-down (check 10) is sized so each deployed binary gets
~3 ticks of observation before the next deploy is eligible.
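For concreteness, a minimal crontab sketch of that cadence (illustrative only — the repo working directory and the cron log path are assumptions; the claude -p invocation is the one named above):

# Illustrative crontab entry; paths are assumptions, not prescribed
*/20 * * * * cd /home/tomer/henyey && claude -p '/monitor-tick' >> /home/tomer/data/monitor-tick-cron.log 2>&1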
This skill expects /home/tomer/data/monitor-loop.env to exist — it is
written by /monitor-loop at startup and contains the runtime config.
If the file is missing, bail out immediately with:
ERROR: /home/tomer/data/monitor-loop.env not found — run /monitor-loop first.
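A minimal sketch of that guard (the error message is the one above, verbatim):

env_file=/home/tomer/data/monitor-loop.env
if [ ! -f "$env_file" ]; then
  echo "ERROR: /home/tomer/data/monitor-loop.env not found — run /monitor-loop first." >&2
  exit 1
fi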
Load the env file at the start of the tick:
set -a
source /home/tomer/data/monitor-loop.env
set +a
Variables provided by the env file (all required):
- MONITOR_SESSION_ID — 8-char session id, e.g. 74535976
- MONITOR_MODE — validator or watcher
- MONITOR_CONFIG — path to the TOML config, e.g. configs/validator-mainnet-rpc.toml
- MONITOR_ADMIN_PORT — 11627 (validator) or 11727 (watcher)
- MONITOR_RPC_PORT — 8000 (validator) or empty (watcher)
- MONITOR_RUN_FLAGS — --validator (validator) or empty (watcher)

Throughout this skill, substitute those values for the $MONITOR_* references below.
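For concreteness, a validator-mode monitor-loop.env would look like this (values are the examples listed above, not prescriptive):

MONITOR_SESSION_ID=74535976
MONITOR_MODE=validator
MONITOR_CONFIG=configs/validator-mainnet-rpc.toml
MONITOR_ADMIN_PORT=11627
MONITOR_RPC_PORT=8000
MONITOR_RUN_FLAGS=--validator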
All files below live in /home/tomer/data/$MONITOR_SESSION_ID/:
| File | Purpose | Writer |
|---|---|---|
| build_sha | Cache of deployed binary's source commit (authoritative source: runtime /info.commit_hash; seeded by build step) | check 10 (after successful build or runtime self-heal) and monitor-loop step 5 |
| last_ledger | STUCK detection state (check 4) | check 4 |
| last_ledger_count | STUCK confirmation counter (check 4) | check 4 |
| wipe_attempted_at | Epoch of most recent (3a) wipe; read by (3d) post-wipe recurrence guard | check 3a (write), check 3d marker-clear rule (remove) |
| tick-history.jsonl | daily-summary aggregation | tick epilogue |
| metrics/current.prom | latest Prometheus scrape | check 8 |
| metrics/prev.prom | previous Prometheus scrape | check 8 |
| metrics/ratio_snapshot | counter-ratio history (check 12) | check 12 |
| metrics/counter_streak_snapshot | counter-streak state (check 12b) | check 12b (constants) |
| metrics/anomaly_cooldown.json | alert dedup state | check 9 |
| logs/monitor.log | node stdout/stderr (rotated on restart) | node process |
| cargo-target/ | cached build tree | cargo |
| .alive | session liveness marker | session startup |
Cross-session state (lives at $HOME/data/, outside any session dir):
| File | Purpose | Writer | Remover |
|---|---|---|---|
| deploy_quarantine.txt | Commit SHAs blocked from rebuild/restart deploy; survives session wipes | Deploy regression policy (skill, during rollback) | Operator only (manual) |
Immediately after loading monitor-loop.env, check whether the session
directory still exists. If not, this is a distinct (and severe) failure class
— the entire session state (binary, logs, metrics, tick history) was deleted
out-of-band.
source "$(git rev-parse --show-toplevel)/scripts/lib/monitor-decisions.sh"
check_session_wiped "$HOME/data" "/proc" "$MONITOR_SESSION_ID" \
"$HOME/data/monitor-loop.env" || exit 1
The check_session_wiped function (defined in scripts/lib/monitor-decisions.sh):
SESSION_WIPED ("yes"/"no") and SESSION_WIPED_PROCESS_ALIVE ("yes"/"no")_find_session_process to scan /proc/[0-9]*/exe for a process matching the expected binary path (including (deleted) suffix){logs,cache,cargo-target,metrics} subdirsWhen the session dir exists (SESSION_WIPED=no), check whether the session
is long-abandoned — no process alive AND no recent tick/orchestrator activity.
This prevents auto-relaunching a node that was intentionally stopped hours or
days ago.
if [[ "$SESSION_WIPED" == "no" ]]; then
check_long_stale_session "$HOME/data" "/proc" "$MONITOR_SESSION_ID" \
"$HOME/data/monitor-loop.env" || exit 1
fi
The check_long_stale_session function (defined in scripts/lib/monitor-decisions.sh):
- Sets LONG_STALE_SESSION ("yes"/"no") ("no" when .alive/env are recent enough)
- Staleness: .alive age > 6h (or missing), env age > 24h
- Liveness signal: .alive mtime (touched every tick at start). Fallback: env file mtime.
- A live process (found via _find_session_process) overrides staleness markers.

When SESSION_WIPED=yes:

Case A: SESSION_WIPED_PROCESS_ALIVE=yes
- Do not relaunch; touch .alive.
- Emit actions, watch, and wipe: line per wipe-state composition table (row #3 or #6 depending on MAINNET_WIPED).

Case B: SESSION_WIPED_PROCESS_ALIVE=no
- Touch .alive.
- FRESH_START determined normally (mainnet.db likely absent → yes).
- Rebuild:

CARGO_TARGET_DIR=/home/tomer/data/$MONITOR_SESSION_ID/cargo-target \
cargo build --release -p henyey
If build succeeds, persist the deployed sha (atomic write):
new_sha=$(git rev-parse HEAD)
printf '%s\n' "$new_sha" > "/home/tomer/data/$MONITOR_SESSION_ID/build_sha.tmp"
mv "/home/tomer/data/$MONITOR_SESSION_ID/build_sha.tmp" "/home/tomer/data/$MONITOR_SESSION_ID/build_sha"
If build fails: report OFFLINE; emit actions/watch/wipe: per wipe-state composition table (row #5 or #8). File/comment issue before exit using wipe-state issue-filing policy. Then exit.

If build succeeds: Relaunch, then emit actions, watch, and wipe: line per wipe-state composition table (row #4 or #7 depending on MAINNET_WIPED).

After the session-dir check, independently verify that mainnet data exists (regardless of session-dir state). Note: The stale-env early-exit terminates the tick before reaching this point — no wipe signal is emitted in that case because no tick report is generated.
check_mainnet_wiped "$HOME/data"
The check_mainnet_wiped function (from scripts/lib/monitor-decisions.sh):
- Sets MAINNET_WIPED to "yes" if the $HOME/data/mainnet directory is missing, "no" otherwise

Both SESSION_WIPED and MAINNET_WIPED are fully determined at this point.
All downstream reporting derives wipe-related outputs from these two flags
using the truth table below. This is the single source of truth — Case A/B
and the status formatter reference this table; they do not independently define
wipe-related outputs.
| # | SESSION_WIPED | MAINNET_WIPED | Case | actions | watch | Status level | wipe: line | Issue title pattern |
|---|---|---|---|---|---|---|---|---|
| 1 | no | no | — | — | — | (normal) | (omitted) | — |
| 2 | no | yes | — | "mainnet-data-wiped" | "wipe=mainnet-data" | ACTION | MAINNET DATA WIPED — catchup recovery | OFFLINE: mainnet data wiped out-of-band |
| 3 | yes | no | A | "session-wiped-process-alive" | "wipe=session-dir" | ACTION | SESSION DIR WIPED — process alive (PID N), operator intervention needed | OFFLINE: validator session wiped out-of-band |
| 4 | yes | no | B-ok | "session-wiped-recovery" | "wipe=session-dir" | ACTION | SESSION DIR WIPED — rebuilt + relaunched (new PID N) | OFFLINE: validator session wiped out-of-band |
| 5 | yes | no | B-fail | "session-wiped-rebuild-failed" | "wipe=session-dir" | OFFLINE | SESSION DIR WIPED — rebuild failed | OFFLINE: validator session wiped out-of-band |
| 6 | yes | yes | A | "session-wiped-process-alive", "mainnet-data-wiped" | "wipe=session-dir", "wipe=mainnet-data" | ACTION | SESSION DIR WIPED — process alive (PID N) + MAINNET DATA WIPED | OFFLINE: session + mainnet data wiped out-of-band |
| 7 | yes | yes | B-ok | "session-wiped-recovery", "mainnet-data-wiped" | "wipe=session-dir", "wipe=mainnet-data" | ACTION | SESSION DIR WIPED — rebuilt + relaunched (new PID N) + MAINNET DATA WIPED | OFFLINE: session + mainnet data wiped out-of-band |
| 8 | yes | yes | B-fail | "session-wiped-rebuild-failed", "mainnet-data-wiped" | "wipe=session-dir", "wipe=mainnet-data" | OFFLINE | SESSION DIR WIPED — rebuild failed + MAINNET DATA WIPED | OFFLINE: session + mainnet data wiped out-of-band |
Composition rule: The wipe: line is built from up to two fragments
joined by +:
- SESSION DIR WIPED fragment: included when SESSION_WIPED=yes
- MAINNET DATA WIPED: appended when MAINNET_WIPED=yes

When only mainnet is wiped (#2), append — catchup recovery to the mainnet fragment. In combined cases (#6–8), omit this suffix because the recovery mechanism is the session-wipe Case B relaunch (which handles FRESH_START implicitly).
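A sketch of that composition (the case-specific session fragment is filled in from rows #3–#8; variable names are illustrative):

wipe_line=""
if [ "$SESSION_WIPED" = "yes" ]; then
  wipe_line="SESSION DIR WIPED — <case-specific detail>"   # per rows #3-#8
fi
if [ "$MAINNET_WIPED" = "yes" ]; then
  frag="MAINNET DATA WIPED"
  [ "$SESSION_WIPED" = "no" ] && frag="$frag — catchup recovery"   # row #2 only
  wipe_line="${wipe_line:+$wipe_line + }$frag"
fi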
Issue-filing policy:
- Combined wipes (#6–8) file ONE issue (title: OFFLINE: session + mainnet data wiped). Do NOT also file a separate mainnet-only issue.
- Severity: OFFLINE → urgent; ACTION with process alive → per policy (typically urgent since node state is uncertain).
- Board-route per scripts/lib/monitor-label-policy.md routing rules (Backlog for actionable issues, Blocked for not-ready).

Tick-history watch array: Both "wipe=session-dir" and "wipe=mainnet-data" may coexist in a single tick's watch array. The daily-summary aggregator handles this — each entry is independent.
After the session-dir-vanished detection (which ensures the dir exists), touch a sentinel file at the start of every tick:
touch /home/tomer/data/$MONITOR_SESSION_ID/.alive
The .alive file's mtime = timestamp of last successful tick start. Cleanup
tooling uses this to determine session liveness (see monitor-loop cleanup
guards).
Determine once at the top of the tick: if /home/tomer/data/mainnet/mainnet.db
does NOT exist, set FRESH_START=yes (sync deadline = 20m). Otherwise
FRESH_START=no (sync deadline = 15m). Use this when evaluating check (2).
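A minimal sketch of that determination (the deadline variable name is illustrative):

if [ ! -f /home/tomer/data/mainnet/mainnet.db ]; then
  FRESH_START=yes; sync_deadline_min=20
else
  FRESH_START=no; sync_deadline_min=15
fi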
If the node's previous run ended with a crash or wedge (SIGKILL), restart begins from a stale lcl that can be hours behind mainnet tip; replay will legitimately exceed the 15m clean-restart deadline.
Detect crash-recovery once at the top of the tick. The rule fires on
either of these signals (both gated by uptime < 2h, after which
recovery is considered complete regardless):
Rotation signal: the most recent log rotation in the session's
logs/ dir is a .crashed-*, .stuck-*, or .frozen-* (not a
planned .preredeploy-*). Use find (not shell globs) to avoid
zsh NO_NOMATCH failing the pipeline.
Active-catchup signal (fires even when the rotation was
.preredeploy-*): the node is in Catching Up state AND uptime
exceeds 5 minutes. Handles the case where a manual planned restart
(which rotates as preredeploy-*) happens AFTER a wedge, leaving
the persisted lcl hours stale.
CRASH_RECOVERY=no
PID=$(_find_session_process "$HOME/data" "/proc" "$MONITOR_SESSION_ID")
if [ -n "$PID" ]; then
uptime_sec=$(ps -o etimes= -p "$PID" 2>/dev/null | tr -d ' ')
uptime_sec=${uptime_sec:-0}
if [ "$uptime_sec" -lt 7200 ]; then
# Signal 1: newest rotation type
logs_dir="/home/tomer/data/$MONITOR_SESSION_ID/logs"
newest_rotation=$(find "$logs_dir" -maxdepth 1 -type f \
\( -name 'monitor.log.crashed-*' \
-o -name 'monitor.log.stuck-*' \
-o -name 'monitor.log.frozen-*' \
-o -name 'monitor.log.preredeploy-*' \) \
-printf '%T@ %p\n' 2>/dev/null | sort -rn | head -1 | cut -d' ' -f2-)
case "$newest_rotation" in
*.crashed-*|*.stuck-*|*.frozen-*) CRASH_RECOVERY=yes ;;
esac
# Signal 2: active catchup past the clean-restart window
if [ "$CRASH_RECOVERY" = "no" ] && [ "$uptime_sec" -gt 300 ]; then
node_state=$(curl -s -m 3 "http://localhost:$MONITOR_ADMIN_PORT/info" 2>/dev/null \
| python3 -c 'import sys,json; print(json.load(sys.stdin).get("state",""))' 2>/dev/null)
if [ "$node_state" = "Catching Up" ]; then
CRASH_RECOVERY=yes
fi
fi
fi
fi
When CRASH_RECOVERY=yes and FRESH_START=no, the active sync deadline
extends to 60m and a progress carveout applies (see check (2)).
Several checks share these procedures — define once, reference by name:
Stop-PID (graceful kill, fall through to SIGKILL):
kill "$PID"; for i in $(seq 1 10); do sleep 1; kill -0 "$PID" 2>/dev/null || break; done
kill -0 "$PID" 2>/dev/null && kill -9 "$PID" && sleep 2
Relaunch (preserves log via append redirection, clears any stale lockfile):
rm -f /home/tomer/data/mainnet/mainnet.lock
RUST_LOG=info nohup /home/tomer/data/$MONITOR_SESSION_ID/cargo-target/release/henyey \
--mainnet run $MONITOR_RUN_FLAGS -c $MONITOR_CONFIG \
>> /home/tomer/data/$MONITOR_SESSION_ID/logs/monitor.log 2>&1 &
Rotate-log (preserve prior session's log under the given suffix):
mv /home/tomer/data/$MONITOR_SESSION_ID/logs/monitor.log \
/home/tomer/data/$MONITOR_SESSION_ID/logs/monitor.log.<suffix>-$(date -u +%Y%m%dT%H%M%SZ) 2>/dev/null || true
Suffix per origin: crashed (process found dead), frozen (wedge per 3b),
preredeploy (planned restart for deploy).
(1) Log scan — tail -n 500 /home/tomer/data/$MONITOR_SESSION_ID/logs/monitor.log.
Scan for hash mismatches ("hash mismatch", "HashMismatch", differing expected/actual
hashes), panics/crashes ("panic", "thread.*panicked", "SIGABRT", "SIGSEGV"),
ERROR-level log lines, assertion failures ("assertion failed").
Match ERROR-level lines by log-level prefix only, e.g. anchor with
grep -E '^[^ ]+Z\s+ERROR\s' (timestamp followed by the ERROR level token).
A naive case-insensitive error match falsely fires on INFO-level lines that
contain the word "error" as part of a message body — e.g. peer disconnect
lines like INFO ... recv error: IO error: Connection reset by peer. Treat
those as the normal overlay-churn INFO they are; do not flag.
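A sketch of the scan (patterns are the ones listed above; the temp-file name is illustrative):

log=/home/tomer/data/$MONITOR_SESSION_ID/logs/monitor.log
tail -n 500 "$log" > /tmp/tick-tail.$$
grep -E 'hash mismatch|HashMismatch' /tmp/tick-tail.$$                       # hash mismatches
grep -E 'panic|thread.*panicked|SIGABRT|SIGSEGV|assertion failed' /tmp/tick-tail.$$
grep -E '^[^ ]+Z\s+ERROR\s' /tmp/tick-tail.$$   # level-anchored; skips INFO bodies containing "error"
rm -f /tmp/tick-tail.$$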
(2) Ledger progression & sync deadline — persist ledger progression across ticks so STUCK can be detected by a single invocation:
- Read /home/tomer/data/$MONITOR_SESSION_ID/last_ledger (if it exists) — format is "<ledger>|<unix-timestamp>".
- Extract the current ledger from the latest heartbeat event (heartbeat=true) in the log tail.
- Update /home/tomer/data/$MONITOR_SESSION_ID/last_ledger with "<current-ledger>|<now>".
- Uptime via ps -o etime= -p $(_find_session_process "$HOME/data" "/proc" "$MONITOR_SESSION_ID").
- Active deadline: 15m when FRESH_START=no and CRASH_RECOVERY=no, 60m when CRASH_RECOVERY=yes, 20m when FRESH_START=yes.
- Synced means RPC age < 30s — NOT just heartbeat event gap=0. Gap is the node's local view (latest_ext - ledger) and can stay at 0 even when the node is minutes behind the network. The authoritative wall-clock signal is RPC age.
- Past the active deadline, flag SYNC FAILURE on gap > 5, or age > 30s, or heard_from_quorum=false in the latest heartbeat event. Investigate the catchup path (checkpoint-boundary stalls, hash mismatches, event-loop freezes); do not just wait.
- Progress carveout (CRASH_RECOVERY=yes): if lcl has advanced by ≥ 500 ledgers since the previous tick's last_ledger, the node is actively replaying — report CATCHING UP regardless of uptime. Flag SYNC FAILURE only when lcl stops advancing AND uptime exceeds 60m.
- Fresh-start carveout (FRESH_START=yes, uptime < 20m): a large gap is expected during initial bucket apply — report CATCHING UP, not SYNC FAILURE.
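A sketch of the persistence step (extraction of LEDGER from the heartbeat event is elided):

state=/home/tomer/data/$MONITOR_SESSION_ID/last_ledger
prev_entry=$(cat "$state" 2>/dev/null)             # "<ledger>|<unix-timestamp>"
PREV_LEDGER=${prev_entry%%|*}                      # stash before overwrite (used by 3c/3d)
printf '%s|%s\n' "$LEDGER" "$(date -u +%s)" > "$state"   # LEDGER from the heartbeat event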
(3) Process alive — find by /proc/exe path matching via _find_session_process "$HOME/data" "/proc" "$MONITOR_SESSION_ID" (from scripts/lib/monitor-decisions.sh, sourced at skill init). This validates that the process binary is exactly $HOME/data/$MONITOR_SESSION_ID/cargo-target/release/henyey (including the (deleted) suffix after rebuilds), scoping detection to the current session.

Historical note: the original pgrep -f 'henyey.*run' form was abandoned because it false-matched parallel claude --print agent processes; the intermediate comm-only replacement was abandoned because it false-matched any henyey process regardless of session (cross-session false positive, see #2467).

If not running: Rotate-log with suffix crashed, then before Relaunch evaluate the (3a) Repeated-FATAL state-wipe trigger below.
(3a) Repeated-FATAL state-wipe trigger — a kill-loop on the same persisted state means the local lcl is corrupt and forward replay can never reconcile. When this happens, restart-without-wipe just accumulates crashed logs and stays offline. Detect and self-heal once per kill-loop:
logs_dir=/home/tomer/data/$MONITOR_SESSION_ID/logs
# Uses the shared detect_crash_state function from scripts/lib/monitor-decisions.sh
# (already sourced at skill init). Uses numeric mtime comparison (stat -c %Y),
# not -newermt, for portability and testability.
detect_crash_state "$logs_dir"
recent_count="$CRASH_RECENT_COUNT"
latest_crashed="$CRASH_LATEST_FILE"
hash_mismatch_signal="$CRASH_HASH_MISMATCH"
Trigger the wipe when ALL hold:
- recent_count >= 3 (3+ crashed rotations in the last 30 min — proves restart-without-fix isn't recovering)
- hash_mismatch_signal == "yes" (the most recent crash logged the structured field fatal_wipe_required=true or the legacy prose "State wipe required before restart" — emitted by trigger_fatal_shutdown() for any unrecoverable local state corruption)
- FRESH_START=no (don't fire on a fresh sync that hasn't completed yet)

When triggered:
# Stop any partially-running process first (defensive — should already be dead)
PID=$(_find_session_process "$HOME/data" "/proc" "$MONITOR_SESSION_ID")
[ -n "$PID" ] && kill "$PID" && sleep 5 && kill -0 "$PID" 2>/dev/null && kill -9 "$PID"
# Wipe the corrupt persisted state (recoverable from public network archive)
rm -f /home/tomer/data/mainnet/mainnet.db \
/home/tomer/data/mainnet/mainnet.db-shm \
/home/tomer/data/mainnet/mainnet.db-wal \
/home/tomer/data/mainnet/mainnet.lock
rm -rf /home/tomer/data/mainnet/buckets
# Reset progression tracker so the next tick treats this as a fresh start
rm -f /home/tomer/data/$MONITOR_SESSION_ID/last_ledger
# Mark the wipe action — read by (3d) to detect post-wipe recurrence.
# Stores the wipe epoch; cleared by (3d) when stable-recovery condition holds.
date -u +%s > /home/tomer/data/$MONITOR_SESSION_ID/wipe_attempted_at
Then Relaunch. The next tick will see FRESH_START=yes (mainnet.db absent), apply the 20m sync deadline, and let the node fresh-catchup from network archive (typically ~10–15 min to validating, observed ~9 min on 2026-05-09).
File a new urgent GH issue documenting the wipe with the count of crashed rotations, the hash-mismatch evidence from the latest crashed log, and the cumulative downtime — this is a data point for whether the underlying recovery code path needs further hardening even though the immediate cause was already fixed. Board-route to Backlog: bash .github/skills/plan-do-review/scripts/move-issue-status.sh "$N" Backlog
The 3-in-30-min trigger is self-rate-limiting only when the underlying fault is local state corruption — after a successful wipe + fresh-catchup, new .crashed-* rotations stop. (3d) Post-wipe recurrence guard below catches the alternate failure mode where the freshly-rebuilt state trips the same fatal_wipe_required signal — proof that the bug is in the apply path itself, not on disk — and prevents an infinite wipe→catchup→crash loop.
(3d) Post-wipe recurrence guard — when 3a's wipe doesn't fix the symptom (a new crashed-* rotation with fatal_wipe_required=true appears AFTER the wipe), the binary is the suspect, not local state. Looping 3a indefinitely wastes downtime and archive bandwidth. Stop relaunching, auto-quarantine the build SHA, and wait for operator-applied rollback or fix. Evaluate (3d) BEFORE (3a) in the dead-process path; if (3d) fires, skip (3a) and skip Relaunch.
marker=/home/tomer/data/$MONITOR_SESSION_ID/wipe_attempted_at
post_wipe_recurrence=no
if [ -s "$marker" ]; then
wipe_epoch=$(cat "$marker" 2>/dev/null | tr -dc '0-9')
# Use the same detect_crash_state result already populated for (3a):
# CRASH_LATEST_FILE + CRASH_HASH_MISMATCH are valid in this scope.
if [ -n "${CRASH_LATEST_FILE:-}" ] && [ -n "${wipe_epoch:-}" ]; then
latest_mtime=$(stat -c %Y "$CRASH_LATEST_FILE" 2>/dev/null || echo 0)
if [ "$latest_mtime" -gt "$wipe_epoch" ] && [ "${CRASH_HASH_MISMATCH:-no}" = "yes" ]; then
post_wipe_recurrence=yes
fi
fi
fi
When post_wipe_recurrence=yes:
# 1. Auto-quarantine the build SHA. Idempotent — already-present is OK.
source "$(git rev-parse --show-toplevel)/scripts/lib/deploy-quarantine.sh"
build_sha=$(cat "/home/tomer/data/$MONITOR_SESSION_ID/build_sha" 2>/dev/null || true)
if [ -n "$build_sha" ]; then
quarantine_append "$HOME/data/deploy_quarantine.txt" "$build_sha" \
"post-wipe recurrence (3d) — see urgent issue"
fi
# 2. Do NOT relaunch. Do NOT fire (3a) again — wipe already proved insufficient.
# 3. Report OFFLINE with `actions: ["post-wipe-recurrence", "auto-quarantined"]`
# and `watch: ["determinism-suspect-binary"]`.
File or comment on an urgent GH issue marking the binary as a determinism-suspect: include the build SHA, the wipe epoch from the marker, the latest crashed-log hash-mismatch evidence, and the cumulative downtime since wipe. Title pattern: "OFFLINE: post-wipe recurrence on <sha:8> — binary quarantined". If an open issue already names the same build_sha as determinism-suspect, comment on it instead of filing a duplicate. Board-route to Backlog: bash .github/skills/plan-do-review/scripts/move-issue-status.sh "$N" Backlog
Recovery from (3d) is operator-driven. The deploy gate at section 10 step 3 (Quarantine gate) will continue to BLOCK redeploy of the quarantined SHA as long as it is reachable from origin/main. Operator action: revert/rollback the bad commit so it is no longer in origin/main's ancestry; the next tick's deploy gate will then let the rebuild + relaunch proceed normally. The marker is cleared automatically by the recovery rule below — operators do not need to remove it manually.
Marker-clear rule — runs each tick when the process is alive. Clear wipe_attempted_at only when the binary has demonstrably stabilized:
marker=/home/tomer/data/$MONITOR_SESSION_ID/wipe_attempted_at
if [ -s "$marker" ] && [ -n "${PID:-}" ]; then
PROC_START_EPOCH=$(stat -c %Y /proc/$PID/stat 2>/dev/null || echo 0)
uptime_sec=$(( $(date -u +%s) - PROC_START_EPOCH ))
has_fatal_wipe_evidence "$logs_dir" "$logs_dir/monitor.log" # sets FATAL_WIPE_EVIDENCE
ledger_advanced=no
# last_ledger holds the previous tick's value, stashed before check (2) overwrites
if [ -n "${PREV_LEDGER:-}" ] && [ -n "${LEDGER:-}" ] && [ "$LEDGER" -gt "$PREV_LEDGER" ]; then
ledger_advanced=yes
fi
if [ "$uptime_sec" -gt 1800 ] \
&& [ "${FATAL_WIPE_EVIDENCE:-no}" = "no" ] \
&& [ "$ledger_advanced" = "yes" ]; then
rm -f "$marker"
fi
fi
All four conditions (alive, uptime > 30m, no fatal_wipe_required since proc start, last_ledger advanced) must hold simultaneously to clear the marker. Quarantine entries are NOT auto-removed — operators clear them deliberately per the Quarantine Clearance procedure (section 10).
(3c) Soft-fail state-wipe trigger — defense-in-depth for the case where
trigger_fatal_shutdown() signals exit but the process fails to terminate,
leaving it alive with fatal_state_failure=true, blocking all recovery, and
making no ledger progress. This complements (3a) which only fires post-mortem.
Evaluate (3c) BEFORE (3b) when the process IS alive. If (3c) fires, skip (3b)
— a wipe supersedes a plain restart.
logs_dir=/home/tomer/data/$MONITOR_SESSION_ID/logs
log_file="$logs_dir/monitor.log"
# PID (session-scoped via _find_session_process; skip (3c) if empty)
PID=$(_find_session_process "$HOME/data" "/proc" "$MONITOR_SESSION_ID")
[ -z "$PID" ] && : # skip — dead-process path (3a) handles this
# Process start time from /proc/$PID/stat mtime
PROC_START_EPOCH=$(stat -c %Y /proc/$PID/stat 2>/dev/null || echo 0)
# Uses shared functions from scripts/lib/monitor-decisions.sh
has_fatal_wipe_evidence "$logs_dir" "$log_file"
detect_soft_fail_blocked "$log_file" "$PROC_START_EPOCH"
Trigger the wipe when ALL hold:
- SOFT_FAIL_BLOCKED == "yes" (WARN-level "Recovery escalation blocked" messages sustained for >= 5 min within current PID lifetime, most recent within 90s of now)
- FATAL_WIPE_EVIDENCE == "yes" (fatal_wipe_required=true signal found in any crashed rotation OR the active log — no time window, confirms persistent state corruption)
- FRESH_START == "no" (not a fresh sync)
- last_ledger unchanged (stash previous tick's value before check (2) overwrites; compare against current ledger)

When triggered:
# 1. Stop the alive-but-stuck process
kill "$PID" && sleep 5
kill -0 "$PID" 2>/dev/null && kill -9 "$PID" && sleep 2
# 2. Rotate log (preserve evidence, consistent suffix with 3a)
mv "$log_file" "${log_file}.crashed-$(date -u +%Y%m%dT%H%M%SZ)" 2>/dev/null || true
# 3. Wipe corrupt persisted state (same artifacts as 3a)
rm -f /home/tomer/data/mainnet/mainnet.db \
/home/tomer/data/mainnet/mainnet.db-shm \
/home/tomer/data/mainnet/mainnet.db-wal \
/home/tomer/data/mainnet/mainnet.lock
rm -rf /home/tomer/data/mainnet/buckets
# 4. Reset progression tracker
rm -f /home/tomer/data/$MONITOR_SESSION_ID/last_ledger
Then Relaunch. The next tick will see FRESH_START=yes (mainnet.db absent).
File a new urgent GH issue documenting the soft-fail wipe with: blocked
duration (SOFT_FAIL_BLOCKED_DURATION_SEC), evidence source
(FATAL_WIPE_SOURCE), and cumulative downtime. Use title pattern:
"Soft-fail state wipe: fatal_state_failure stuck for {N}m". Always a new
issue (no dedup — each wipe is a distinct incident). Known prior incidents: #2363.
Board-route to Backlog: bash .github/skills/plan-do-review/scripts/move-issue-status.sh "$N" Backlog
Self-limiting: after wipe, FRESH_START=yes blocks condition (3); new process
has no fatal_state_failure so condition (1) fails; log rotation removes old
blocked messages from active log.
(3b) Wedge detection — a process can be alive but have a frozen event
loop (watchdog fires, HTTP hangs, ledger progression stops). Check 3 alone
misses this because pgrep still finds the PID.
Flag WEDGE when BOTH:
- grep -E 'watchdog_freeze"?\s*[=:]\s*true|WATCHDOG: Event loop appears frozen' $LOG | tail -1 is present with a timestamp within the last 120s. (The structured field watchdog_freeze=true is the primary signal; the prose string is a legacy fallback. The "? accounts for JSON key quoting.)
- curl -s -m 3 http://localhost:$MONITOR_ADMIN_PORT/info returns empty body or times out.

On WEDGE: Stop-PID, Rotate-log with suffix frozen, then Relaunch.
Always file a new urgent-labeled issue (wedge blocks validator operation).
Board-route to Backlog: bash .github/skills/plan-do-review/scripts/move-issue-status.sh "$N" Backlog
Recurrence-after-fix → NEW issue, not a comment on a closed one. Known prior
incidents: #1904, #1873, #1921, #1949.
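A sketch of the two-signal check (the 120s timestamp-recency parse is elided for brevity):

wedge=no
freeze_line=$(grep -E 'watchdog_freeze"?\s*[=:]\s*true|WATCHDOG: Event loop appears frozen' "$LOG" | tail -1)
if [ -n "$freeze_line" ]; then                 # also verify its timestamp is within 120s
  if ! curl -s -m 3 "http://localhost:$MONITOR_ADMIN_PORT/info" | grep -q .; then
    wedge=yes                                  # frozen log signal AND dead admin endpoint
  fi
fi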
(4) Memory — ps -o rss= -p $(_find_session_process "$HOME/data" "/proc" "$MONITOR_SESSION_ID"), convert to MB.
- If RSS > 12 GB, flag HIGH MEMORY (report-only; no restart).
- Restart only when ALL hold: RSS > 16 GB, AND available memory from free -m (NOT free — that excludes reclaimable kernel cache) < 8 GB, AND the two most recent memory_report=true (or Memory report summary) entries both show heap_components_mb growing by > 500 MB vs the earlier snapshot.
- On restart: Stop-PID, Rotate-log with suffix crashed, Relaunch.

(5) Disk — df -h /home/tomer/data | tail -1. If usage > 85%, flag LOW DISK.
Then keep the 3 most recent rotated archives per category (the ISO 8601
timestamp suffix sorts lexicographically, so sort -r gives newest-first;
use find -printf not shell glob to survive zsh NO_NOMATCH):
logs_dir=/home/tomer/data/$MONITOR_SESSION_ID/logs
[ -d "$logs_dir" ] && for pat in preredeploy crashed stuck frozen; do
find "$logs_dir" -maxdepth 1 -type f -name "monitor.log.$pat-*" \
-printf '%f\n' 2>/dev/null | sort -r | tail -n +4 \
| while read -r f; do rm -f "$logs_dir/$f"; done
done
Report how many files were removed if any.
(6) Session disk — du -sh /home/tomer/data/$MONITOR_SESSION_ID/
and du -sh /home/tomer/data/mainnet/. If combined > 200 GB, flag SESSION DISK HIGH.
(7) Memory report — grep -E 'memory_report=true|Memory report summary' /home/tomer/data/$MONITOR_SESSION_ID/logs/monitor.log | tail -1.
If grep returns no output AND process uptime > 400s, flag WARNING
memory-report-missing (log format may have changed). Memory reports emit
every ~6 minutes, so an uptime below 400s legitimately has no entry yet on
a post-restart tick — report memreport: N/A (warmup) and move on.
Otherwise extract jemalloc_allocated_mb, jemalloc_resident_mb,
fragmentation_pct, heap_components_mb, mmap_mb, unaccounted_mb,
unaccounted_sign. If fragmentation_pct > 50, flag HIGH FRAGMENTATION.
If unaccounted_mb > 1000 with sign +, note it (known jemalloc overhead,
not a bug — but verify heap_components is stable; if it is growing, investigate).
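A sketch of the field extraction, assuming key=value formatting in the structured log line (adjust the patterns if the actual format differs):

line=$(grep -E 'memory_report=true|Memory report summary' \
  /home/tomer/data/$MONITOR_SESSION_ID/logs/monitor.log | tail -1)
frag=$(printf '%s\n' "$line" | grep -oE 'fragmentation_pct=[0-9.]+' | cut -d= -f2)
heap=$(printf '%s\n' "$line" | grep -oE 'heap_components_mb=[0-9.]+' | cut -d= -f2)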
(8) RPC health (validator mode only — skip in watcher mode) —
curl -s -X POST http://localhost:$MONITOR_RPC_PORT -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","id":1,"method":"getHealth"}'.
Verify response is non-empty and status is healthy. Check pruning:
latestLedger - oldestLedger should be bounded.
- Steady-state gap: ~retention_window ledgers, plus a maintenance-cycle buffer (maintenance every 900s ≈ 180 ledgers at 5s/ledger).
- Since 57821bcf/25797e2e (Apr 27), ledger headers and tx history are also held back from pruning to satisfy publishing. The new equilibrium is ~3× retention_window (~1050 for retention=360).
- If gap > 3 × retention_window + 500 (~1580 for retention=360), flag PRUNING STALLED — that's "headers/tx-history protection plus an extra maintenance cycle's slack" exceeded, indicating real pruning failure.

(9) OBSRVR Radar (validator mode only — skip in watcher mode) — get the public key from curl -s http://localhost:$MONITOR_ADMIN_PORT/info (extract public_key). Then curl -s https://radar.withobsrvr.com/api/v1/nodes/<PUBLIC_KEY>. Check:
- isValidating — if false and node running > 30 min, flag NOT VALIDATING.
- validating24HoursPercentage — if < 50 and running > 6 hours, flag LOW VALIDATION RATE.
- lag — if > 500, flag HIGH LAG.

If the API errors out, emit obsrvr: N/A (api-error) instead of omitting the field.
If the API returns a response but latestLedger or updatedAt is missing/null
(partial response), treat the entire radar result as stale / incomplete and
emit obsrvr: N/A (api-incomplete). Do NOT evaluate lag in this case — a
lag value returned alongside a null latestLedger/updatedAt is a cached
aggregate from a prior observation window (observed post-restart: lag=8754
against a node that is in real-time sync with age=2s).
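A sketch of the fetch (field names are the ones listed above; the response schema beyond them is assumed):

pk=$(curl -s "http://localhost:$MONITOR_ADMIN_PORT/info" \
  | python3 -c 'import sys,json; print(json.load(sys.stdin).get("public_key",""))')
radar=$(curl -s "https://radar.withobsrvr.com/api/v1/nodes/$pk")
[ -z "$radar" ] && echo "obsrvr: N/A (api-error)"
is_validating=$(printf '%s' "$radar" \
  | python3 -c 'import sys,json; print(json.load(sys.stdin).get("isValidating",""))' 2>/dev/null)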
(12) Metrics scan — scrape /metrics and evaluate the alert catalog.
1. mkdir -p /home/tomer/data/$MONITOR_SESSION_ID/metrics.
2. mv /home/tomer/data/$MONITOR_SESSION_ID/metrics/current.prom /home/tomer/data/$MONITOR_SESSION_ID/metrics/prev.prom 2>/dev/null || true.
3. curl -s http://localhost:$MONITOR_ADMIN_PORT/metrics > /home/tomer/data/$MONITOR_SESSION_ID/metrics/current.prom.
4. For counter deltas, if current < prev, treat delta = current (process restarted).
delta = current (process restarted).Post-restart warmup exemptions — for the first 2 ticks after a detected
restart (check 3, check 10, or CRASH_RECOVERY=yes), skip these alerts.
Overlay handshake + jemalloc arena stabilization + from-zero counters take
~10 minutes to settle.
- henyey_jemalloc_fragmentation_pct > 50 (post-restart frag ramps ~35-45%, settles to ~18%)
- stellar_peer_count < 8 (peer count ramps 0 → 10+ over the first 10 min)
- stellar_overlay_inbound_authenticated < 3
- stellar_scp_timing_externalized_seconds > 3
- stellar_scp_timing_nominated_seconds > 2
- any counter-delta alert on the first post-restart scrape (where prev is 0)

Catalog-wide notes:
- Synced-only gating means uptime > 15m AND CRASH_RECOVERY=no AND FRESH_START=no. These gauges fire only when the node should be in real-time sync; the authoritative sync check is (2). The catchup phases are legitimate non-synced states.

COUNTERS (fire on delta ≥ threshold per tick):
- stellar_herder_lost_sync_total ≥1 → SYNC
- henyey_post_catchup_hard_reset_total ≥1 → ACTION
- (stellar_overlay_timeout_idle_total + stellar_overlay_timeout_straggler_total) ≥5× prior-tick-sum → WARN
- (stellar_overlay_error_read_total + stellar_overlay_error_write_total) ≥50 → WARN
- henyey_archive_cache_refresh_error_total ≥1 → NONC
- henyey_archive_cache_refresh_timeout_total ≥3 → NONC

stellar_ledger_apply_failure_total, henyey_scp_post_verify_drops_total, and stellar_herder_pending_too_old_total are covered by the ratio checks below (traffic-proportional and self-calibrating); do NOT check absolute deltas.
GAUGES (fire on absolute threshold against current snapshot):
- stellar_peer_count <8 → WARN
- stellar_overlay_inbound_authenticated <3 → WARN (synced-only gating; healthy fleet has 50+ inbound. Aggregate peer_count can be ≥8 from outbound while inbound starves consensus; this catches that case)
- stellar_ledger_age_current_seconds >30 → SYNC (synced-only gating)
- stellar_herder_state !=2 → SYNC (synced-only gating)
- henyey_jemalloc_fragmentation_pct >50 on two consecutive ticks → WARN
- henyey_scp_verify_input_backlog >100 on two consecutive ticks → WARN (single snapshots routinely spike to 100+ during slot externalize bursts and drain to 0 within seconds — gate on persistence. Sample /metrics 5x at 2s intervals to verify before filing)
- henyey_scp_verifier_thread_state !=0 → WARN (0=Running, 1=Stopping, 2=Dead)
- stellar_herder_pending_envelopes >2000 → WARN
- henyey_overlay_fetch_channel_depth >500 on two consecutive ticks → WARN (the _max variant is a monotonic high-water mark that never resets after catchup-tail spikes — use the live gauge with persistence guard, mirroring the scp_verify_input_backlog pattern. Sample /metrics 5x at 2s intervals to verify before filing)
- (henyey_process_open_fds / henyey_process_max_fds) >0.85 → WARN
- henyey_herder_drift_max_seconds >10 → NONC
- stellar_scp_timing_externalized_seconds >10 → WARN (this is a SLOT-CYCLE metric: first-envelope-received → self-externalize, naturally ~4-6s on mainnet's 5s slots. The earlier >3 threshold was a misread of #1934 — it compared against stellar-core's mFirstToSelfExternalizeLag, which measures a different, much narrower window (any-node externalize → self-externalize, ~0.3-0.5s healthy). Until we expose a matching metric, alert only on >2x normal slot-cycle = real degradation)
- stellar_scp_timing_nominated_seconds >7 → WARN (also a SLOT-CYCLE metric measured from first local nomination vote to self-externalize; healthy floor ~3-5s on mainnet)

quorum_agree / quorum_missing / quorum_fail_at are intentionally NOT monitored — they snapshot the tracking slot's QuorumInfo and return false-positive noise between externalizations; see monitor-loop/SKILL.md Metrics Scan section for the code-path explanation.
HISTOGRAMS (fire on p99 bucket of per-tick delta): for each histogram H,
bucket_delta[le] = H_bucket_current[le] - H_bucket_prev[le]
count_delta = H_count_current - H_count_prev
Skip if count_delta < 20. Compute cumulative bucket delta at upper edge L:
sum(bucket_delta[le] for le ≤ L). The smallest L where cumulative ≥
0.99 * count_delta is the p99 upper bound.
Thresholds (all WARN):
- henyey_ledger_close_handle_complete_seconds p99 >0.5s
- henyey_ledger_close_dispatch_to_join_seconds p99 >5s
- henyey_ledger_close_post_complete_seconds p99 >0.5s
- henyey_ledger_close_tx_exec_seconds p99 >1s
- henyey_ledger_close_soroban_exec_seconds p99 >1s
- henyey_ledger_close_commit_seconds p99 >0.5s
- henyey_ledger_close_soroban_state_seconds p99 >0.5s
- henyey_ledger_close_complete_tx_queue_seconds p99 >0.5s

Mean check (sum_delta / count_delta) is a cheaper fallback — fire on whichever breaches.
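A sketch of the p99 computation for one histogram, exploiting the fact that Prometheus buckets are cumulative (so cur[le] - prev[le] is already the cumulative count delta at edge le); sort -g is used because it parses the +Inf edge:

H=henyey_ledger_close_commit_seconds   # example metric from the list above
cc=$(awk -v h="$H" '$1 == (h "_count") {print int($2)}' current.prom)
pc=$(awk -v h="$H" '$1 == (h "_count") {print int($2)}' prev.prom)
n=$(( ${cc:-0} - ${pc:-0} ))
if [ "$n" -ge 20 ]; then
  p99=$(awk -v h="$H" '
    index($1, h "_bucket{le=\"") == 1 {
      le = $1; sub(/^.*le="/, "", le); sub(/".*$/, "", le)
      if (NR == FNR) prev[le] = $2                 # first pass: prev.prom
      else           print le, $2 - prev[le]       # second pass: cumulative delta
    }' prev.prom current.prom \
    | sort -g | awk -v n="$n" '$2 >= 0.99 * n { print $1; exit }')
  echo "$H p99 <= ${p99}s"
fi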
Counter catalog entries use one of three extraction forms. When adding a new
counter, check the emission site in crates/app/src/metrics.rs to determine
whether the metric is scalar or labeled, and use the appropriate form.
Form 1 — Scalar counter (most counters):
cur=$(grep -E '^<metric_name> ' current.prom | awk '{printf "%d", $2}')
prev=$(grep -E '^<metric_name> ' prev.prom | awk '{printf "%d", $2}')
Form 2 — Single labeled series (extract one specific label value):
cur=$(grep -E '^<metric_name>\{<label>="<value>"\} ' current.prom | awk '{printf "%d", $2}')
prev=$(grep -E '^<metric_name>\{<label>="<value>"\} ' prev.prom | awk '{printf "%d", $2}')
Form 3 — Sum of explicit labeled series (sum specific label values):
cur=$(awk '/^<metric_name>\{<label>="<value1>"\}|^<metric_name>\{<label>="<value2>"\}/ {sum+=$NF} END{printf "%d", sum+0}' current.prom)
prev=$(awk '/^<metric_name>\{<label>="<value1>"\}|^<metric_name>\{<label>="<value2>"\}/ {sum+=$NF} END{printf "%d", sum+0}' prev.prom)
For all forms, delta handling is identical:
- Extract prev using the same pattern against prev.prom (default 0 if absent).
- If current < prev: counter reset (node restarted) → delta = current.
- Otherwise delta = current - prev.

Label-presence validation: For labeled counters, validate that the expected label set is present before extracting. This catches silent renames or removals that would weaken an alert. If the expected label set is missing or mutated, skip the alert entirely rather than treating missing series as zero. Follow the existing pattern used for henyey_scp_post_verify_total (lines 406-413):
labels=$(grep -oP '^<metric_name>\{<label>="\K[^"]+' current.prom | sort)
expected=$(printf '%s\n' <label1> <label2> ... | sort)
if [ "$labels" != "$expected" ]; then
# report: <metric>: skipped (label set mismatch)
fi
Three ratio-based checks replace the former absolute-delta thresholds for
stellar_ledger_apply_failure_total, henyey_scp_post_verify_drops_total,
and stellar_herder_pending_too_old_total.
These fire only after 3 consecutive ticks of threshold breach, avoiding
transient spikes.
Required inputs — global (all must be present, non-empty, numeric; if any missing or invalid, skip ratio checks entirely, empty the ratio snapshot, reset streaks):
- stellar_ledger_age_current_seconds — sync gating
- stellar_ledger_apply_success_total
- stellar_ledger_apply_failure_total
- henyey_scp_post_verify_total{reason="accepted"}
- henyey_scp_post_verify_total{reason="processed_directly"}
- sum of all henyey_scp_post_verify_total{reason="..."} lines (pv_total_sum)

Per-check required inputs — these are NOT part of the global required set. If missing, only the pending check (Check 3) skips; SCP and apply proceed normally:
- stellar_herder_pending_too_old_total
- stellar_herder_pending_received_total

Post-verify label-set validation: After extracting the 14
henyey_scp_post_verify_total{reason="..."} lines, validate that the exact set
of reason labels is: invalid_sig, panic, drift_range, drift_close_time,
drift_cannot_receive, self_message, non_quorum, buffered, duplicate,
too_far, buffer_full, processed_directly, per_slot_full, accepted. If any label is
missing or unexpected labels appear, treat as missing counters (partial scrape
or label change).
Skip conditions (skip all ratio checks when any is true):
- FRESH_START=yes
- gap > 5
- stellar_ledger_age_current_seconds > 30
- /metrics fetch fails (curl error or empty response)
- /metrics returns "recorder not installed"

On any skip: empty /home/tomer/data/$MONITOR_SESSION_ID/metrics/ratio_snapshot, reset all streak counters to 0, report metrics_ratio: skipped (<reason>).
Snapshot file: /home/tomer/data/$MONITOR_SESSION_ID/metrics/ratio_snapshot
Format:
version=1
pid=<PID>
start_ticks=<field 22 from /proc/$PID/stat>
timestamp=<ISO8601>
apply_success=<value>
apply_failure=<value>
pv_accepted=<value>
pv_processed_directly=<value>
pv_total_sum=<value>
apply_breach_streak=<N>
scp_breach_streak=<N>
pending_too_old=<value>
pending_received=<value>
pending_breach_streak=<N>
Process identity: start_ticks = field 22 from /proc/$PID/stat (starttime
in clock ticks since boot). Extract with awk '{print $22}' /proc/$PID/stat.
Locale-independent, catches PID reuse.
Invalidation:
- start_ticks changed → discard, write new baseline, reset streaks
- version ≠ 1 → discard

Metric extraction — illustrative pseudocode (not literal shell; each bail point must stop ratio-check processing for this tick):
metrics_body=$(curl -s http://localhost:$MONITOR_ADMIN_PORT/metrics)
# Bail on fetch failure — stop ratio checks for this tick
if [ -z "$metrics_body" ]; then
> /home/tomer/data/$MONITOR_SESSION_ID/metrics/ratio_snapshot
# report: metrics_ratio: skipped (fetch failed)
# STOP — do not proceed to extraction or ratio evaluation
fi
# Bail if recorder not installed — stop ratio checks
if echo "$metrics_body" | grep -q 'metrics recorder not installed'; then
> /home/tomer/data/$MONITOR_SESSION_ID/metrics/ratio_snapshot
# report: metrics_ratio: skipped (recorder not installed)
# STOP
fi
ledger_age=$(echo "$metrics_body" | grep -E '^stellar_ledger_age_current_seconds ' | awk '{printf "%d", $2}')
apply_success=$(echo "$metrics_body" | grep -E '^stellar_ledger_apply_success_total ' | awk '{printf "%d", $2}')
apply_failure=$(echo "$metrics_body" | grep -E '^stellar_ledger_apply_failure_total ' | awk '{printf "%d", $2}')
pv_accepted=$(echo "$metrics_body" | grep -E '^henyey_scp_post_verify_total\{reason="accepted"\} ' | awk '{printf "%d", $2}')
pv_processed=$(echo "$metrics_body" | grep -E '^henyey_scp_post_verify_total\{reason="processed_directly"\} ' | awk '{printf "%d", $2}')
# Validate exact 14-label set — STOP if mismatch
pv_labels=$(echo "$metrics_body" | grep -oP '^henyey_scp_post_verify_total\{reason="\K[^"]+' | sort)
expected_labels=$(printf '%s\n' accepted buffer_full buffered drift_cannot_receive drift_close_time drift_range duplicate invalid_sig non_quorum panic per_slot_full processed_directly self_message too_far | sort)
if [ "$pv_labels" != "$expected_labels" ]; then
> /home/tomer/data/$MONITOR_SESSION_ID/metrics/ratio_snapshot
# report: metrics_ratio: skipped (label set mismatch)
# STOP
fi
pv_total_sum=$(echo "$metrics_body" | grep -E '^henyey_scp_post_verify_total\{reason="[^"]+"\} ' | awk '{sum+=$2} END {printf "%d", sum}')
# Validate all 6 global values present and numeric — STOP if any invalid
for v in "$ledger_age" "$apply_success" "$apply_failure" "$pv_accepted" "$pv_processed" "$pv_total_sum"; do
if [ -z "$v" ] || ! echo "$v" | grep -qE '^[0-9]+$'; then
> /home/tomer/data/$MONITOR_SESSION_ID/metrics/ratio_snapshot
# report: metrics_ratio: skipped (missing counters)
# STOP
fi
done
# Per-check inputs for pending (Check 3) — if missing, only Check 3 skips
pending_too_old=$(echo "$metrics_body" | grep -E '^stellar_herder_pending_too_old_total ' | awk '{printf "%d", $2}')
pending_received=$(echo "$metrics_body" | grep -E '^stellar_herder_pending_received_total ' | awk '{printf "%d", $2}')
pending_counters_valid=true
for v in "$pending_too_old" "$pending_received"; do
if [ -z "$v" ] || ! echo "$v" | grep -qE '^[0-9]+$'; then
pending_counters_valid=false
break
fi
done
# henyey_recovery_stalled_tick_total: extract via Form 2 (see §Metric extraction
# forms). Relaxed label validation: require only the forcing_catchup_behind
# label to be present (minimum for Check 12b alerting). Other labels
# (backoff_active, forcing_catchup_not_behind, archive_behind_peer_ahead_hard_reset,
# at_tip_no_scp_hard_reset) may not yet exist on a fresh node — only 3 of the 5
# are pre-registered in the metrics macro (crates/app/src/metrics.rs:613); the
# remaining 2 are created dynamically on first use of each recovery path.
# If forcing_catchup_behind is missing, skip Check 12b for this tick.
# Alerting is streak-gated — see Check 12b section for logic.
Check 1: SCP post-verify acceptance rate
- Numerator: delta(pv_accepted) + delta(pv_processed)
- Denominator: delta(pv_total_sum)
- Breach: (numerator / denominator) < 0.05 (less than 5% accepted) for 3 consecutive ticks
- Low volume: skip (report scp: skipped (low volume), reset scp streak)
- On breach: increment scp_breach_streak. If streak ≥ 3, route through Bug Filing Workflow (investigate verifier thread, stale envelopes, sync state).
- On no breach: reset scp_breach_streak to 0.

Check 2: Transaction apply failure rate
- Numerator: delta(apply_failure)
- Denominator: delta(apply_failure) + delta(apply_success)
- Breach: (numerator / denominator) > 0.50 (over 50% fail) for 3 consecutive ticks
- Low volume: skip (report apply: skipped (low volume), reset apply streak)
- On breach: increment apply_breach_streak. If streak ≥ 3, investigate in same tick. If evidence points to a henyey apply-engine bug, file/comment via Bug Filing Workflow. If expected bad-tx traffic (spam wave, known rejections), report as WARNING without filing.
- On no breach: reset apply_breach_streak to 0.

Check 3: Pending too-old rate
- Numerator: delta(pending_too_old)
- Denominator: delta(pending_received)
- Breach: (numerator / denominator) > 0.50 (over 50% too old) for 3 consecutive ticks
- Low volume: skip (report pending: skipped (low volume), reset pending streak)
- If pending_counters_valid is false (missing counters), skip this check only — report pending: skipped (missing counters), reset pending_breach_streak to 0. SCP and apply checks proceed normally.
- On breach: increment pending_breach_streak. If streak ≥ 3, report as WARNING (overlay lag or stale-envelope flood). Investigate overlay peer health and slot progression.
- On no breach: reset pending_breach_streak to 0.

Per-check state machine (each check independently):
Per-check low-volume: When only one check's denominator delta is below its minimum, that check skips (streak resets to 0) and the others proceed normally.
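Shell arithmetic is integer-only, so the ratio comparison needs awk (or similar). A sketch for the SCP check, with the other two following the same shape (the prev_* values come from ratio_snapshot; names are illustrative):

num=$(( (pv_accepted - prev_pv_accepted) + (pv_processed - prev_pv_processed) ))
den=$(( pv_total_sum - prev_pv_total_sum ))
scp_breach=$(awk -v num="$num" -v den="$den" \
  'BEGIN { print (den > 0 && num / den < 0.05) ? "yes" : "no" }')
# increment scp_breach_streak on "yes", reset to 0 on "no"; fire at streak >= 3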
Thresholds are provisional — tune after 1-2 weeks of production data.
Status report: Each check independently reports one of: ok (value),
skipped (reason), WARNING value (N ticks), or collecting baseline.
Canonical constants: Threshold values, snapshot path, and applicability are defined in shared/check-12b-constants.toml. This section is authoritative for the state-machine logic; inline literals are cross-validated against the TOML by scripts/test-monitor-skill-snippets.sh.
This check tracks henyey_recovery_stalled_tick_total{reason="forcing_catchup_behind"}
using a streak-gated alert, independent of Check 12's ratio checks. It runs on
its own state machine because ratio checks are globally skipped during unsync
states (ledger age > 30s, gap > 5, etc.), but the recovery-stalled counter fires
precisely during recovery transitions when the node is briefly unsynced.
Data source: Reuses the same /metrics scrape result ($metrics_body /
metrics/current.prom) already fetched by check-8/check-12. Does NOT perform a
second /metrics fetch.
Applicability: Validator mode only. In watcher mode, skip Check 12b entirely
and omit the recovery_stalled: line from the status report.
Snapshot file: /home/tomer/data/$MONITOR_SESSION_ID/metrics/counter_streak_snapshot
Format:
version=1
pid=<PID>
start_ticks=<field 22 from /proc/$PID/stat>
timestamp=<ISO8601>
recovery_stalled_behind=<value>
recovery_stalled_breach_streak=<N>
PID/start_ticks check (always, even on skip): Before evaluating skip conditions, check PID/start_ticks against the stored snapshot. If the process restarted (PID or start_ticks changed), invalidate the snapshot immediately. This catches restarts during skip ticks and prevents comparing post-restart counters to a pre-restart baseline.
Skip conditions (skip Check 12b only when any is true):
- /metrics fetch failed this tick (same condition check-8 detects)
- /metrics returns "recorder not installed"
- forcing_catchup_behind label missing from the scrape

On skip: write snapshot preserving the existing recovery_stalled_behind value (or 0 if no prior snapshot) with recovery_stalled_breach_streak=0. The next healthy tick compares against the preserved value — it does NOT enter "collecting baseline" after a skip. Report recovery_stalled: skipped (<reason>).
Invalidation (reset streak AND enter "collecting baseline"):
- start_ticks changed (process restart) — checked before skip conditions
- version ≠ 1
- recovery_stalled_behind value < previous (counter reset)

On invalidation: write new snapshot with current counter value and
recovery_stalled_breach_streak=0. Report recovery_stalled: collecting baseline.
Do NOT evaluate burst or streak logic on invalidation ticks — this prevents
false-firing the burst override on the first tick after a restart where the
counter jumps from 0 to the current absolute value.
Per-tick logic (not skipped AND not invalidated):
delta = current(recovery_stalled_behind) - prev(recovery_stalled_behind)
if delta >= 10:
# Immediate-fire override: large burst indicates sustained stalling.
# Do NOT reset streak — keep incrementing; cooldown (7200s) handles dedup.
recovery_stalled_breach_streak += 1
→ fire WARN, route through Bug Filing Workflow
elif delta >= 1:
recovery_stalled_breach_streak += 1
if recovery_stalled_breach_streak >= 3:
→ fire WARN, route through Bug Filing Workflow
else: # delta == 0
recovery_stalled_breach_streak = 0
Note: after skipped ticks where the counter value was preserved, the first healthy tick may see a large accumulated delta spanning multiple monitor intervals. This is acceptable — the burst threshold (≥10) catches sustained trouble regardless of tick granularity.
Post-restart warmup: First tick writes baseline ("collecting baseline"). Second tick has a valid prev for delta comparison and begins normal evaluation.
Alert identity and cooldown:
- Alert identity: henyey_recovery_stalled_tick_total{reason="forcing_catchup_behind"} (full selector-qualified)
- Dedup search: gh issue list --search 'metrics: henyey_recovery_stalled_tick_total{reason="forcing_catchup_behind"}' --state open
- Issue title: metrics: henyey_recovery_stalled_tick_total{reason="forcing_catchup_behind"} — sustained breach

Integration with metrics aggregate: Check 12b alerts are NOT counted in the
metrics: line (which tracks immediate-fire counter/gauge alerts). They appear
on their own recovery_stalled: line. A fired Check 12b alert does contribute
to overall tick severity — the tick is considered unhealthy when any alert fires.
Status line: recovery_stalled: (reported after metrics_ratio:):
- recovery_stalled: ok (delta=0) — no increment, streak reset
- recovery_stalled: breach (delta=N, streak M/3) — incrementing, below threshold
- recovery_stalled: WARNING delta=N (M ticks) — investigating — streak ≥ 3
- recovery_stalled: WARNING delta=N (burst) — investigating — immediate fire (delta ≥ 10)
- recovery_stalled: skipped (<reason>) — metric missing or fetch failed
- recovery_stalled: collecting baseline — first tick after restart/invalidation

Rendering precedence (determines the metrics_ratio: line format):
1. metrics_ratio: skipped (<reason>)
2. metrics_ratio: collecting baseline
3. metrics_ratio: scp <scp_status>, apply <apply_status>, pending <pending_status>

Examples:
- metrics_ratio: scp ok (accept=15%), apply ok (fail=8%), pending ok (too_old=3%)
- metrics_ratio: scp ok (accept=12%), apply WARNING fail=55%>50% (3 ticks) — investigating, pending ok (too_old=5%)
- metrics_ratio: scp skipped (low volume), apply ok (fail=5%), pending ok (too_old=2%)
- metrics_ratio: scp ok (accept=15%), apply ok (fail=8%), pending skipped (missing counters)
- metrics_ratio: skipped (not in sync)
- metrics_ratio: collecting baseline

recovery_stalled: examples (after metrics_ratio: in output):
- recovery_stalled: ok (delta=0)
- recovery_stalled: breach (delta=2, streak 1/3)
- recovery_stalled: WARNING delta=1 (3 ticks) — investigating
- recovery_stalled: WARNING delta=15 (burst) — investigating
- recovery_stalled: skipped (metric missing)
- recovery_stalled: collecting baseline

For each firing alert:
- Load /home/tomer/data/$MONITOR_SESSION_ID/metrics/anomaly_cooldown.json (create empty {} if missing).
- If now - last_filed[<metric>] < 7200s (2h), include the alert in the status report but SKIP file/comment.
- Search for an existing issue: gh issue list --search "metrics: <metric-name>" --state open.
- If found: gh issue comment <N> with the new evidence (current/prev values, delta, threshold, ledger, binary sha, sibling metrics). Apply the urgent label only if the metric breach blocks validator operation (per the Label policy in the Bug filing workflow); otherwise leave unlabeled. Board-route: if NOT on project board, add to Backlog.
- If not found: gh issue create (append --label urgent only when the metric breach is operation-blocking; most metric alerts are non-urgent and should be filed without a label) with:
  - Title: Non-critical: metrics: <metric> (NONC tier) or metrics: <metric> — <symptom> (WARN/SYNC tier).
  - Body: the evidence above, the emission site via grep -n "<metric_name_without_prefix>" crates/ -r, and a suggested fix.
  - Board-route: bash .github/skills/plan-do-review/scripts/move-issue-status.sh "$N" Backlog
- Update anomaly_cooldown.json with {"<metric>": <now>}.
- SYNC-tier alerts also set the sync: line in the status report (not just metrics:).

If $MONITOR_MODE = watcher, run check (12) with a reduced catalog: only
process (open_fds, max_fds), jemalloc, and overlay counters/gauges. Skip
SCP, quorum, herder_state, histogram p99 alerts, ratio checks, and
Check 12b (recovery-stalled streak). Omit the recovery_stalled: line
from watcher output entirely.
(10) Remote sync — first sanity-check the working tree:
- If git status --porcelain reports any output, ABORT the deploy path for this tick. Report: DEPLOY SKIPPED (dirty tree) with the list of dirty paths. Do not run git pull against a dirty tree; do not kill the node. Investigate the dirty tree before the next tick.
- Otherwise git fetch origin main. If in detached HEAD state (git symbolic-ref HEAD fails), git checkout main first.

Determine deployed_sha — the sha of the source tree used to build the
currently-running binary. This decouples the deploy gate from git HEAD, which
can diverge from the running binary if an operator or CI pushes commits
without going through the skill's build+restart path.
BUILD_SHA_FILE="/home/tomer/data/$MONITOR_SESSION_ID/build_sha"
SESSION_DIR="/home/tomer/data/$MONITOR_SESSION_ID"
deployed_sha=""
deployed_sha_status=""
if [ -s "$BUILD_SHA_FILE" ]; then
candidate=$(tr -d '[:space:]' < "$BUILD_SHA_FILE")
if printf '%s' "$candidate" | grep -qE '^[0-9a-f]{40}$' \
&& git cat-file -e "${candidate}^{commit}" 2>/dev/null; then
deployed_sha="$candidate"
deployed_sha_status="ok"
else
deployed_sha_status="invalid"
fi
elif [ -e "$SESSION_DIR/last_ledger" ] \
|| [ -e "$SESSION_DIR/metrics/current.prom" ] \
|| [ -e "$SESSION_DIR/tick-history.jsonl" ]; then
# Other per-session state exists — monitor-loop should have written
# build_sha but didn't (e.g. rollout of this change to an existing
# session). Force a rebuild to restore the invariant.
deployed_sha_status="invalid"
else
# Truly fresh session dir — safe to fall back to today's behavior.
deployed_sha=$(git rev-parse HEAD)
deployed_sha_status="missing-fresh"
fi
Runtime cross-validation — use the running binary's embedded commit hash
as the authoritative source of truth. If /info reports a valid, locally-
reachable commit hash, it overrides BUILD_SHA_FILE (which becomes a cache).
# Cross-validate: running binary's commit_hash is authoritative.
running_hash=""
if info_json=$(curl -sf "http://localhost:$MONITOR_ADMIN_PORT/info" 2>/dev/null); then
running_hash=$(printf '%s' "$info_json" \
| python3 -c 'import sys,json; print(json.load(sys.stdin).get("commit_hash",""))' 2>/dev/null)
fi
if [ -n "$running_hash" ] \
&& printf '%s' "$running_hash" | grep -qE '^[0-9a-f]{40}$' \
&& git cat-file -e "${running_hash}^{commit}" 2>/dev/null; then
# Binary reports a valid, locally-reachable commit hash — authoritative.
case "$deployed_sha_status" in
ok)
if [ "$deployed_sha" != "$running_hash" ]; then
# BUILD_SHA_FILE disagrees with binary. Trust binary, repair file.
printf '%s\n' "$running_hash" > "${BUILD_SHA_FILE}.tmp"
mv "${BUILD_SHA_FILE}.tmp" "$BUILD_SHA_FILE"
deployed_sha="$running_hash"
# deployed_sha_status stays "ok"
fi
;;
invalid|missing-fresh)
# File bad/missing but binary knows its commit — repair and use.
printf '%s\n' "$running_hash" > "${BUILD_SHA_FILE}.tmp"
mv "${BUILD_SHA_FILE}.tmp" "$BUILD_SHA_FILE"
deployed_sha="$running_hash"
deployed_sha_status="ok"
;;
esac
fi
# If /info unreachable, commit_hash absent/empty/malformed, or SHA not in
# local git: no-op — fall through to existing logic unchanged.
| /info state | commit_hash field | git cat-file -e | Action |
|---|---|---|---|
| Unreachable | — | — | No-op |
| Reachable | Missing/empty/malformed | — | No-op |
| Reachable | Valid 40-char hex | Fails (not in local git) | No-op (prevents cache poisoning) |
| Reachable | Valid 40-char hex | Succeeds, matches deployed_sha | No-op |
| Reachable | Valid 40-char hex | Succeeds, differs from deployed_sha | Repair file, use runtime hash |
| State | Meaning | Gate behavior |
|---|---|---|
| ok | Valid commit known (from file or repaired from runtime) | Compare deployed_sha vs origin/main |
| missing-fresh | No file, no runtime hash, no other session state | Falls back to HEAD vs origin/main |
| invalid | File corrupt + runtime unavailable | Force rebuild unconditionally |
Report deployed_sha_status in the tick's status line.
Gate comparison:
origin_sha=$(git rev-parse origin/main)
case "$deployed_sha_status" in
ok|missing-fresh)
[ "$deployed_sha" = "$origin_sha" ] && gate_action="up-to-date" || gate_action="proceed"
;;
invalid)
gate_action="proceed-force-rebuild"
;;
esac
If gate_action="up-to-date": no action (already up to date).
Otherwise enter the deploy path:
Cool-down guard: if node uptime is < 60m AND CRASH_RECOVERY=no (i.e.
the previous tick deployed cleanly), SKIP DEPLOY this tick. Report:
DEPLOY DEFERRED (cool-down: uptime=<X>m < 60m). If deployed_sha_status
is invalid, append , build_sha: invalid to the report so the operator
knows a follow-up rebuild is pending once cool-down expires. Rationale:
every restart costs ~1-2m of validation downtime + brief peer reconnect.
Deploying multiple commits within minutes thrashes the validator without
giving each version time to surface issues. The 60m floor was raised from
the original 30m (#1944) on 2026-04-30 after observing too many quick-cycle
deploys per day. Crash-recovery restarts are exempt because they're not
voluntary deploys. Urgent fixes can override by killing the node manually
before the next tick — the normal startup path will pick up the latest
origin/main.
Binary-relevance check:
if [ "$gate_action" = "proceed-force-rebuild" ]; then
# invalid build_sha — skip allowlist, force rebuild
needs_rebuild="yes"
else
# ok or missing-fresh: diff from deployed_sha
if changed_paths=$(git diff --name-only "$deployed_sha" origin/main 2>/dev/null); then
# Evaluate allowlist on $changed_paths (see below).
# If every path is allowlisted: needs_rebuild="no"
# Else: needs_rebuild="yes"
...
else
# git diff failed — fail closed, force rebuild.
needs_rebuild="yes"
fi
fi
If `needs_rebuild="no"` (all paths allowlisted), skip the rebuild + restart:
run `git pull --rebase` only, then report
`DEPLOY SYNCED (no-binary-impact: docs/scripts only — pulled <N> commits, no restart)`.
Do NOT update `BUILD_SHA_FILE` (the binary hasn't changed).
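One way to produce the `pulled <N> commits` count — a sketch, not a prescribed command sequence:

```bash
# Sketch: count incoming commits before the docs-only fast-forward.
git fetch origin
n=$(git rev-list --count "${deployed_sha}..origin/main")
git pull --rebase
deploy_report="DEPLOY SYNCED (no-binary-impact: docs/scripts only — pulled ${n} commits, no restart)"
```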
The non-binary-impact allowlist is:

- `.github/`
- `.claude/`
- `scripts/`
- `docs/`
- `*.md` files, e.g. `README.md` or `CLAUDE.md`
- `stellar-specs` / `stellar-specs/` submodule pointer changes only

Deny by default: if any path is outside this allowlist, continue to the CI,
build, and restart path. In particular, changes to `Cargo.toml`,
`Cargo.lock`, any `build.rs`, `crates/`, `configs/`, or mixed docs+code
commits require a rebuild + restart.
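A minimal sketch of the deny-by-default evaluation over `$changed_paths` (the `case` patterns transcribe the list above):

```bash
# Sketch: deny-by-default allowlist check on git-diff paths.
needs_rebuild="no"
while IFS= read -r p; do
    [ -z "$p" ] && continue
    case "$p" in
        .github/*|.claude/*|scripts/*|docs/*|*.md|stellar-specs) ;;  # allowlisted
        *) needs_rebuild="yes"; break ;;                              # anything else forces rebuild
    esac
done <<< "$changed_paths"
```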
Quarantine gate (only reached when `needs_rebuild=yes`):

```bash
# --- Quarantine gate ---
source "$(git rev-parse --show-toplevel)/scripts/lib/deploy-quarantine.sh"
check_quarantine_ancestry "$HOME/data/deploy_quarantine.txt"
if [ $? -eq 0 ]; then
    case "$QUARANTINE_STATUS" in
        blocked_unreadable)
            deploy_report="BLOCKED (quarantine file unreadable — fail-closed)" ;;
        blocked_git_error)
            deploy_report="BLOCKED (quarantine ancestry check failed for ${QUARANTINED_MATCH:0:8} — fail-closed)" ;;
        blocked_ancestor)
            deploy_report="DEFERRED (quarantined: ${QUARANTINED_MATCH:0:8} reachable from origin/main — see ~/data/deploy_quarantine.txt)" ;;
    esac
    [ -n "$QUARANTINE_WARNINGS" ] && deploy_report="$deploy_report [WARN: $QUARANTINE_WARNINGS]"
    # Skip to status report with this deploy_report. Do not continue.
fi
```
If quarantined, report the appropriate status and exit the deploy path.
Do NOT proceed to CI check, build, or restart. Parse/check warnings are
appended to the deploy: status line regardless of outcome.
Check CI status on origin/main:
`gh run list --branch main --limit 3 --json conclusion --jq '.[].conclusion'`.
If any recent run has conclusion `failure`, do NOT deploy — route the failure
through check (11) and wait.
If all conclusions are `success` (ignore `""` conclusions from in-progress
runs, and `cancelled` runs): `git pull --rebase`, then
`CARGO_TARGET_DIR=/home/tomer/data/$MONITOR_SESSION_ID/cargo-target cargo build --release -p henyey`.
If build succeeds:
```bash
# Persist the just-built sha BEFORE Stop-PID. Atomic via tmp + mv.
new_sha=$(git rev-parse HEAD)
printf '%s\n' "$new_sha" > "${BUILD_SHA_FILE}.tmp"
mv "${BUILD_SHA_FILE}.tmp" "$BUILD_SHA_FILE"
```
Then: Stop-PID, Rotate-log with suffix `preredeploy`, Relaunch.
Report: `DEPLOY — pulled <N> commits (<old-sha>..<new-sha>), rebuilt, restarted at L<ledger>`.
If build fails: report `BUILD FAILED`, do NOT restart — the old binary is
still running. Do NOT update `BUILD_SHA_FILE`. Route the build error through
check (11).
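For reference, a minimal sketch of the run-level green/red gate from the "Check CI status" step above (`ci_red` is a hypothetical local flag):

```bash
# Sketch: any "failure" conclusion in the last 3 main-branch runs blocks deploy.
if gh run list --branch main --limit 3 --json conclusion --jq '.[].conclusion' \
   | grep -qx 'failure'; then
    ci_red="yes"    # route through check (11); do not deploy this tick
else
    ci_red="no"     # "" (in-progress) and "cancelled" do not block
fi
```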
(11) CI check — scope and levels of detection.
(11a) Scope: only inspect workflows that run on branch main.
`gh run list --branch main --limit 10 --json databaseId,name,status,conclusion,headSha,createdAt --jq '.[] | "\(.name)|\(.status)|\(.conclusion)|\(.headSha[:8])|\(.databaseId)|\(.createdAt)"'`.
Ignore runs triggered by PRs on other branches. Scan for completed runs with
conclusion `failure`.
(11b) Job-level (CRITICAL — catches continue-on-error failures): For the
latest completed run of EACH distinct workflow name (enumerate dynamically
from 11a, do NOT hard-code names), check individual jobs:
`gh run view <ID> --json jobs --jq '.jobs[] | select(.conclusion == "failure") | "\(.name)|\(.conclusion)"'`.
Workflows with continue-on-error jobs report run-level conclusion success
even when jobs fail — you MUST check job-level conclusions. If any jobs have
conclusion failure, treat it the same as a run-level failure.
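A sketch of that dynamic enumeration (the jq grouping is my construction; `group_by` keeps each group's newest-first input order, so `.[0]` is the latest completed run per workflow):

```bash
# Sketch: job-level check of the latest completed run of each workflow.
gh run list --branch main --limit 10 --json databaseId,name,status \
  --jq '[.[] | select(.status=="completed")] | group_by(.name)
        | .[] | .[0] | "\(.databaseId)\t\(.name)"' \
| while IFS=$'\t' read -r run_id wf_name; do
    failed=$(gh run view "$run_id" --json jobs \
      --jq '.jobs[] | select(.conclusion=="failure") | .name')
    [ -n "$failed" ] && echo "JOB FAILURE in ${wf_name}: ${failed}"
  done
```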
REPORTING RULE — NEVER report ci: all green if ANY job has conclusion
failure, even if the run-level conclusion is success. The ci: line in
the status report MUST reflect the WORST job-level result across all
workflows. A continue-on-error job failure is NOT green — it is RED. Do not
qualify failures as "known", "pre-existing", or "cosmetic".
Only act on failures from the last 2 hours (compare `createdAt` with
`date -u +%Y-%m-%dT%H:%M:%SZ`).
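ISO-8601 UTC timestamps sort lexicographically, so the recency cut can be a plain string comparison — a sketch, assuming GNU `date -d`:

```bash
# Sketch: 2-hour recency filter on a run's createdAt (ISO-8601 UTC).
cutoff=$(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%SZ)   # GNU date
is_recent() { [ "$1" \> "$cutoff" ]; }                   # usage: is_recent "$created_at"
```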
For each failure:

1. Pull the failing logs: `gh run view <ID> --log-failed 2>&1 | tail -80`.
2. If the failure looks like flaky automation rather than a real regression
   (e.g. `test_core3_restart_rejoin_*` is a recurrer; see #1838 / #1939) —
   rerun the failing jobs first:

   ```bash
   attempts=$(gh run view <ID> --json attempt --jq .attempt)
   if [ "$attempts" -lt 2 ]; then
       gh run rerun <ID> --failed
   fi
   ```

   Report `CI RERUN ATTEMPTED — <workflow> jobs <names> on <sha>, attempt 1→2`
   and stop CI processing for this tick. The next tick will see the
   re-run conclusion. Don't rerun more than once — if the re-run also
   fails, the failure is real (or an unusually persistent flake) and
   warrants filing.
3. If the failure is clearly NOT automation (build error from real code,
   test assertion mismatch, hash mismatch, panic from production code),
   skip the rerun and proceed to step 4.
4. Check for an existing open issue:
   `gh issue list --search "<workflow name + signature>" --state open`.
   If one matches, `gh issue comment <N>` with the new evidence (sha, log
   snippet, timestamp) and ensure it has the `urgent` label
   (`gh issue edit <N> --add-label urgent`) — failing CI on origin/main
   blocks deploy and meets the urgent criteria.
   Board-route: if NOT on project board, add to Backlog:
   `bash .github/skills/plan-do-review/scripts/move-issue-status.sh "$N" Backlog`
5. If no open match, file a new issue:
   `gh issue create --label urgent --title "<workflow>: <short signature>" --body "..."`
   with investigation findings.
   Board-route: `bash .github/skills/plan-do-review/scripts/move-issue-status.sh "$N" Backlog`

Report `CI ISSUE FILED — <workflow> failed on <sha>, filed/commented #<N>`.

## Bug filing workflow

Applies to node bugs, metric alerts, and CI failures.
### `--body-file`, not heredocs

When the issue body contains code blocks, always write the body to a temp
file and pass `--body-file <path>` rather than `--body "$(cat <<'EOF' ... EOF)"`.
Reason: heredocs nested inside double-quoted shell expressions tempt the agent
to backslash-escape backticks (`` \` ``), which GitHub renders literally — every
code fence breaks. There is no GitHub-side workaround once filed; the issue
must be re-edited via `gh issue edit <N> --body-file <path>`.
Pattern:

````bash
cat > /tmp/issue-body.md <<'EOF'
## Symptom
...
```rust
some code
```
...
EOF
gh issue create --title "..." --body-file /tmp/issue-body.md
# or, to repair an already-filed issue:
gh issue edit <N> --body-file /tmp/issue-body.md
````
The single-quoted `'EOF'` makes the heredoc literal — no escaping needed for
backticks, dollar signs, etc. inside the body.
### Label policy
> **Canonical reference:** `scripts/lib/monitor-label-policy.md` is the
> single source of truth for labeling and board-routing rules. This section
> is a normative excerpt.
When creating or commenting on issues:
- **`urgent`** — file with `--label urgent`, OR `gh issue edit <N> --add-label urgent`
on an existing issue, ONLY when the symptom blocks validator operation or
consensus participation. Specifically:
- hash mismatch (any kind)
- wedged node (frozen event loop, watchdog auto-abort)
- failing CI on origin/main blocking deploy
- panic or crash from production code
- SYNC FAILURE past the active deadline
- OOM-driven restart
- and similar runtime-blocking conditions
- **(no label)** — non-urgent issues: calibration, threshold tuning, NONC
alerts, cosmetic noise, follow-up improvements. Downstream picks these up
at lower priority.
- **`not-ready`** — reserved for tier-3 self-reflection issues that need
operator decision before any code change. Do not use for regular bug
filings.
**Label lifecycle:**
- `urgent` and `not-ready` are mutually exclusive in steady state.
- Monitors MAY add `urgent` to an existing unlabeled issue (escalation).
- Monitors do NOT remove labels (de-escalation is operator-only).
- If a `not-ready` issue becomes blocking: add `urgent`, comment with
evidence. Both labels coexist temporarily until operator resolves.
### Filing flow
1. Identify the failing signature (ledger + error type for node bugs; metric
+ threshold for alerts; workflow + job + error type for CI).
2. Investigate to root cause — read source code, trace code paths.
3. Check for an existing open issue: `gh issue list --search "<keywords>" --state open`.
If a match exists, verify its state is OPEN
(`gh issue view <N> --json state -q .state`), `gh issue comment <N>` with
new evidence, and apply the `urgent` label only if the recurrence meets
the urgent criteria above.
**Board-route:** if the issue is NOT on the project board, add it:
`bash .github/skills/plan-do-review/scripts/move-issue-status.sh "$N" Backlog`
(or `Blocked` if labeled `not-ready`). If already on board, preserve
current status. STOP.
4. If no OPEN match, file a new issue with `gh issue create` — append
`--label urgent` only if the symptom meets the urgent criteria above;
otherwise omit the label. Body is a self-contained proposal (symptom,
evidence, suspected root cause, fix sketch with file:line references).
**Board-route:** add to project board:
`bash .github/skills/plan-do-review/scripts/move-issue-status.sh "$N" Backlog`
(or `Blocked` if labeled `not-ready`).
5. Do NOT spawn agents. Do NOT edit the main checkout. The next redeploy tick
(check 10) will pick up whatever lands on main.
**Board routing failure:** If `move-issue-status.sh` fails, log as ACTION
in the status report. Retry on next tick. Escalate after 3 consecutive
failures (see `scripts/lib/monitor-label-policy.md`).
**Recurrence policy**: If a previously-filed bug recurs with material new
evidence, prefer commenting on the existing issue when it is the same bug at
the same site AND the issue is OPEN. If the prior issue is CLOSED, file a new
issue (label per the policy above) with `Related to #<prior> (closed)` in the
body and a note on why the prior fix did not cover this case; do NOT comment
on the closed issue. File a new issue (still referencing `Related to #<prior>`
with one-line scope-diff) when new evidence points at a different named
subsystem, phase/mark, root-cause hypothesis, or candidate site set.
**Commit policy**: the monitor does NOT commit code. All fixes are delegated
via `gh issue` + project board routing.
**Deploy regression policy**: If the node fails after a deploy
(see also `scripts/lib/monitor-label-policy.md`):
(a) Record the bad SHA BEFORE any rollback rebuild (`build_sha` will be
overwritten during rebuild):
```bash
bad_sha=$(cat "$BUILD_SHA_FILE")
```

(b) Append to quarantine file (idempotent):

```bash
source "$(git rev-parse --show-toplevel)/scripts/lib/deploy-quarantine.sh"
rc=0
quarantine_append "$HOME/data/deploy_quarantine.txt" "$bad_sha" "regression #<issue>" || rc=$?
if [ $rc -ne 0 ]; then
    echo "WARNING: quarantine_append failed (rc=$rc) — deploy gate may not block next tick" >&2
fi
```
(c) File or comment on a GitHub issue (label `urgent` since validator
operation is impacted) with the regression details (commit range, symptoms,
watchdog data). Board-route to Backlog:
`bash .github/skills/plan-do-review/scripts/move-issue-status.sh "$N" Backlog`
(d) Restart the node on the last known-good binary (rebuild from the previous commit) while waiting for the fix. Do NOT revert commits inline.
The quarantine gate (section 10, step 3) will now block re-deployment as long as the quarantined SHA is reachable from origin/main. The quarantine does NOT auto-lift — see "Quarantine Clearance" below.
### Quarantine Clearance

The deploy quarantine is a safety lock. The monitor skill NEVER removes
entries. Clearance is an explicit operator decision.
When to clear: After ALL of:
How to clear (using the deploy-quarantine helper):

```bash
source "$(git rev-parse --show-toplevel)/scripts/lib/deploy-quarantine.sh"
rc=0
quarantine_remove "$HOME/data/deploy_quarantine.txt" "$bad_sha" || rc=$?
if [ $rc -ne 0 ]; then
    echo "ERROR: quarantine_remove failed (rc=$rc) — entry may still be active" >&2
fi
```
Operator reminders: The `deploy:` status line reports
`DEFERRED (quarantined: ...)` every tick (~20 minutes). This is a
persistent, automatic reminder that requires no separate notification.
Emergency override: To force deploy despite quarantine:

```bash
rm "$HOME/data/deploy_quarantine.txt"   # clears ALL quarantines
# or: remove a specific entry using the helper:
# source "$(git rev-parse --show-toplevel)/scripts/lib/deploy-quarantine.sh"
# quarantine_remove "$HOME/data/deploy_quarantine.txt" "<sha>"
```
For ANY anomaly, investigate to root cause — read source code, check logs,
trace code paths. Never dismiss as "expected". Produce a GitHub issue
(label per the Bug filing workflow's Label policy — urgent if it blocks
operation, otherwise no label) or comment on an existing one for every anomaly that isn't
immediately explained. Only exception: anomalies whose root cause turns
out to be literal expected-correct behavior per the code — document the code
path in the status report and skip the filing.
Print a multiline status report:
```
MONITOR <OK|WARNING|ACTION|OFFLINE> — L<ledger> — <timestamp>
node: mode=<MODE> session=<session-id> pid=<PID> fresh_start=<yes|no>
wipe: <per wipe-state composition table — omitted when no wipe>
sync: <synced | CATCHING UP (gap=N, uptime=Xm, deadline=<15m|20m|60m>) | SYNC FAILURE (gap=N, uptime=Xm — filed/commented #<N>)>
mem: <RSS_MB>MB rss | alloc=<alloc>MB resident=<resident>MB frag=<pct>%
     heap=<heap>MB mmap=<mmap>MB unaccounted=<sign><unaccounted>MB
disk: <used>/<total> (<pct>%) | session+data=<size>
rpc: <healthy|unhealthy|N/A> oldestL=<X> latestL=<Y> window=<Z>
obsrvr: <validating=<Y/N> val24h=<pct>% lag=<N> | N/A (watcher) | N/A (api-error)>
metrics: <clean | N alerts (<metric1>,<metric2>,...) — filed/commented #<N>,#<M> | N alerts, K suppressed by cooldown>
metrics_ratio: scp <ok (accept=X%) | skipped (reason) | WARNING accept=X%<5% (N ticks)>, apply <ok (fail=Y%) | skipped (reason) | WARNING fail=Y%>50% (N ticks) — investigating>, pending <ok (too_old=Z%) | skipped (reason) | WARNING too_old=Z%>50% (N ticks)> | collecting baseline
recovery_stalled: <ok (delta=0) | breach (delta=N, streak M/3) | WARNING delta=N (M ticks) — investigating | WARNING delta=N (burst) — investigating | skipped (<reason>) | collecting baseline>
deploy: <up-to-date | DEFERRED (quarantined: <sha8> reachable from origin/main — see ~/data/deploy_quarantine.txt) | BLOCKED (quarantine file unreadable — fail-closed) | BLOCKED (quarantine ancestry check failed for <sha8> — fail-closed) | DEFERRED (cool-down: ...) | SYNCED (no-binary-impact: ...) | pulled N commits (old..new) | SKIPPED (dirty-tree|ci-red|build-failed, filed/commented #<N>)>
ci: <all green (run+job level) | WORKFLOW failed — filed/commented #<N> | WORKFLOW jobs FAILED (continue-on-error) — NAME|conclusion listed, filed/commented #<N>>
self_reflect: <clean | fixed inline (<sha>: <short-desc>) | filed #<N> (urgent: <short-desc>) | filed #<N> (no-label: <short-desc>) | filed #<N> (not-ready: <short-desc>)>
```
The `wipe:` line is present when `SESSION_WIPED=yes` or `MAINNET_WIPED=yes`
(or both). Format per the wipe-state composition table above. When neither
fires, omit the line entirely.
Use WARNING for threshold breaches. Use ACTION when a corrective action was
taken (restart, deploy, filed a new issue, commented on an existing issue,
session-wipe recovery). Use OFFLINE when the node cannot be recovered in
this tick (e.g., rebuild failed after session wipe).
Use SYNC FAILURE (not WARNING) when the node has exceeded the active sync
deadline (15m populated / 20m fresh-start / 60m crash-recovery) but is not closing ledgers in
real-time — this is a bug that requires immediate investigation AND
filing/commenting on a GitHub issue (label urgent since SYNC FAILURE blocks
consensus). Board-route to Backlog per scripts/lib/monitor-label-policy.md.
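A sketch of the deadline selection; `$START_MODE` is a hypothetical variable standing in for however the tick classifies the start:

```bash
# Sketch: pick the active sync deadline (minutes) by start class.
case "$START_MODE" in
    crash-recovery) sync_deadline_m=60 ;;
    fresh-start)    sync_deadline_m=20 ;;
    *)              sync_deadline_m=15 ;;   # populated store
esac
```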
After emitting the status report (and before self-reflection), append a
single JSON line to /home/tomer/data/$MONITOR_SESSION_ID/tick-history.jsonl
so /daily-summary can aggregate the last 24h of ticks:
```bash
# ts is computed inside the Python block — do NOT use shell expansion for timestamps.
HIST=/home/tomer/data/$MONITOR_SESSION_ID/tick-history.jsonl
python3 - <<'PY' >> "$HIST"
import json
from datetime import datetime, timezone
print(json.dumps({
    "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    "status": "<OK|WARNING|ACTION|OFFLINE>",
    "ledger": <current-ledger-int>,
    "build": "<short-sha>",
    "deploys": <0 or 1>,
    "warnings": [<list of metric names that breached>],
    "actions": [<list of action keywords: restart, deploy, filed-#N, session-wiped-recovery, session-wiped-process-alive, session-wiped-rebuild-failed, mainnet-data-wiped>],
    "self_reflect": "<clean | fixed-inline | filed-#N>",
    "watch": ["<key>=<value>", ...]
}))
PY
```
**`ts` contract:** The `ts` field records the wall-clock UTC time captured immediately before emitting the JSON line. Format: `YYYY-MM-DDTHH:MM:SSZ` (ISO 8601, always UTC). Acceptable skew: ≤ 60 seconds from real wall-clock time. Any drift > 60 s indicates a bug in the capture mechanism.
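A sketch of a drift self-check against this contract, assuming GNU `date -d`:

```bash
# Sketch: flag >60s drift between the last recorded ts and wall clock.
last_ts=$(tail -1 "$HIST" | jq -r .ts)
skew=$(( $(date -u +%s) - $(date -u -d "$last_ts" +%s) ))   # GNU date
[ "${skew#-}" -le 60 ] || echo "WARN: tick-history ts skew ${skew}s > 60s" >&2
```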
`watch` carries multi-tick non-incident concerns the daily summary should
surface — examples: `pruning_gap=2451`, `frag_pct=18`, `disk_pct=72`,
`wipe=session-dir`, `wipe=mainnet-data`. Add a key whenever the tick body
called out a value worth tracking but did not file an issue. Each entry MUST
be one line — the daily-summary aggregator parses with
`for ln in f: json.loads(ln)`.
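That one-object-per-line contract can be spot-checked with `jq` — a sketch:

```bash
# Sketch: every tick-history line must parse as standalone JSON
# (mirrors the aggregator's `for ln in f: json.loads(ln)`).
awk 'NF' "$HIST" | while IFS= read -r ln; do
    printf '%s\n' "$ln" | jq -e . >/dev/null || echo "bad tick-history line: $ln" >&2
done
```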
After the status report is emitted, look back at THIS tick's output and
check for problems in the monitor itself — not in the node. The node
is covered by the ## Investigation policy above. This section is about
bugs / miscalibrations in the tick logic, thresholds, catalog, or
detection code that fired false positives, failed silently, or rendered
contradictory output.
Examples of what to look for:

- Miscalibrated or inverted checks that fire false positives (e.g.
  `scp_verifier_thread_state != 1` was inverted; `quorum_agree < 4` fires
  between externalizations by design).
- Detection code that fails silently (e.g. `NO_NOMATCH` killed a pipeline; a
  histogram metric name had a typo and silently resolved to `count_delta=0`,
  which skipped evaluation).
- Contradictory output (e.g. `sync: synced` alongside
  `metrics: peer_count<8 WARN`), or similar internal inconsistency that
  suggests a missing warmup exemption or cross-check rule.

If none of the above: report `self_reflect: clean` and stop.
When an issue is found, choose a tier and act on it in the same tick.
**Tier 1 — Fix inline (trivial edit).**
Criteria (ALL must hold):

- the edit touches only `.claude/skills/monitor-tick/SKILL.md`
  (no other files touched)

Action sequence (same pattern used for every skill edit in this repo):
```bash
# 1. Edit .claude/skills/monitor-tick/SKILL.md
# 2. git add .claude/skills/monitor-tick/SKILL.md
# 3. git commit -m "Monitor-tick: <what + why>" with Co-authored-by: Claude Code trailer
# 4. git push origin main (on reject: git pull --rebase && git push)
```
Report: `self_reflect: fixed inline (<short-sha>: <short-desc>)`.
**Tier 2 — File GH issue (non-trivial but codeable).**
Apply the Bug filing workflow's Label policy: most monitor-tick self-reflection
issues are non-urgent (calibration / threshold / catalog tuning) and should
be filed without a label. Only use urgent if the skill's miscalibration is
silently masking a real validator-blocking signal.
The issue is real and actionable but any of:
Before filing, search for an existing open issue:
gh issue list --search "monitor-tick: <keywords>" --state open.
Comment with new evidence if a match exists; otherwise file.
Board-route per scripts/lib/monitor-label-policy.md: Backlog for
actionable issues, Blocked for not-ready issues.
Issue body MUST include:
Title format:
- `Non-critical: monitor-tick: <description>` — for observability /
  calibration issues (noise reduction, threshold tuning) — file with no label
- `monitor-tick: <description>` — for correctness bugs (silent failures,
  contradictory output) — file with no label unless the bug is silently
  masking a real validator-blocking signal (then `urgent`)

Report: `self_reflect: filed #<N> (urgent: <short-desc>)` or
`self_reflect: filed #<N> (no-label: <short-desc>)`.
**Tier 3 — File `not-ready` GH issue (human input required).**
Use this tier when any of:
Issue body includes everything from Tier 2 plus an explicit "Human input required" section listing the specific decisions / options that need an operator answer.
Label: `not-ready`. Title format:
`monitor-tick: [needs-decision] <description>`.
Report: `self_reflect: filed #<N> (not-ready: <short-desc>)`.