| name | daily-health-check |
| description | Run the daily pipeline health check. Use when the user says "daily check", "run the health check", "pipeline check", "what happened since last time", "daily routine", or any variation asking about overall pipeline status and health. Walks every stage from image import through publishing, reports what changed since last check, and flags issues. |
Daily Pipeline Health Check
A structured audit of the entire plant pipeline, from image collection through publishing. Designed to be run once a day (or on demand) and report what changed since the last check.
When to Use
User says any of: "daily check", "health check", "pipeline check", "daily routine", "what happened", "run the check", "check the pipeline", or similar.
Core Principle: 50/50 Fix + Prevent
Every issue found must be addressed in two halves:
- Fix what is wrong now: repair the immediate data/state problem (DB updates, stage resets, etc.)
- Fix the root cause so it can't happen again: find the code path that created the bad state and patch it systemically. Not a band-aid, not a workaround, but a deep fix in the code that created the problem.
If you only do #1, the same issue will reappear on the next check. If you only do #2, the existing bad data stays broken. Always do both.
Examples of systemic fixes:
- If plants pile up in a stage with no exit path, add the missing fallback to the stage advancement logic
- If image counts drift from actual DB counts, add auto-sync to the pipeline advancement cycle
- If plants get flagged for something the pipeline can handle automatically, remove the flag from the code path
- If extraction stamps are missing, make the tier computation self-healing
Examples of band-aid fixes (DO NOT do these alone):
- Manually resetting 400 plants every week without fixing why they got stuck
- Unflagging plants without removing the code that re-flags them
- Running one-off SQL to fix counts without adding auto-sync
Admin UI Obligation
After every fix (both the immediate data repair AND the systemic code fix), verify the admin dashboard reflects the current state:
- Does the pipeline stage breakdown show accurate counts?
- Are stuck/flagged sections updated?
- Are any progress bars or funnels stale?
- If you added new pipeline behavior, does the UI expose it?
The admin dashboard is the user's window into the pipeline. If the data is fixed but the UI shows stale numbers, the fix is incomplete.
Protocol
Step 0: Load Last Check State
Read .local/last_health_check.json. If it exists:
- Use the timestamp field as the "since" boundary.
- Capture the metrics map (counter values from the previous run) so that this run can compute deltas without manually scrolling old reports.

If the file is missing, default the timestamp to 24 hours ago and treat all metrics as null (delta unknown; render as (no prior value)).

Backward compatibility: legacy checkpoints may contain only {timestamp, summary} (no metrics map, and summary instead of notes). Treat both shapes as valid input: fall back to summary when notes is absent, and treat a missing or empty metrics map exactly like a fresh run (every counter renders (no prior value)). The next checkpoint write will normalize the file to the new shape.
The metrics map has this shape (all integers; add new entries as the skill grows, but never remove existing ones, because older runs need them):
```json
{
  "timestamp": "2026-03-28T08:00:00Z",
  "notes": "brief one-liner",
  "metrics": {
    "total_plants": 4321,
    "total_published": 312,
    "accepted_as_lost": 7,
    "stuck_local_acer_only_24h": 0
  }
}
```
Carry the loaded metrics through the run so Step 2 can render
"accepted_as_lost: N (+X since last check)" automatically and Step 3
can auto-flag PROBLEM when accepted_as_lost increased.
If a key is present in this run's results but missing from the previous
file (e.g. first run after the skill grew a new counter), render the
delta as (no prior value) rather than guessing zero. Use the same
label everywhere a baseline is missing; never invent variants like
"(new metric)" or "(unknown)".
Step 1: Run the Diagnostic Queries
CRITICAL: ALL queries MUST target the production database. Always pass environment: "production" to every executeSql() call. The development database is stale and does not reflect the live pipeline state. Running health checks against the dev database produces meaningless results.
```js
// Correct: targets the production database explicitly.
const result = await executeSql({
  sqlQuery: "SELECT ...",
  environment: "production"
});
```

```js
// WRONG: omits environment, so the query silently hits the stale dev database.
const result = await executeSql({
  sqlQuery: "SELECT ..."
});
```
Execute ALL queries from .agents/skills/daily-health-check/queries.sql using the executeSql callback in code_execution. Run them in logical batches to minimize round-trips.
The queries are organized into sections matching the pipeline stages. Each returns structured data that feeds into the report.
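A rough sketch of the batched execution, assuming queries.sql holds semicolon-terminated statements and that executeSql runs one statement per call (both assumptions; adapt to the real file layout):

```js
const fs = require("fs");

// Naive split on statement boundaries; adjust if queries.sql uses section markers.
const sql = fs.readFileSync(".agents/skills/daily-health-check/queries.sql", "utf8");
const statements = sql.split(/;\s*\n/).map((s) => s.trim()).filter(Boolean);

const results = [];
for (const sqlQuery of statements) {
  // Every call targets production, per the rule above.
  results.push(await executeSql({ sqlQuery, environment: "production" }));
}
```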
Step 2: Build the Report
Present findings as a single table-based report to the user. The format is defined in .agents/skills/daily-health-check/report-template.md.
Key principles:
- "Since last check" column shows deltas, not absolute totals (though totals are shown too)
- Traffic-light flags: use the words "OK", "WATCH", "PROBLEM", not emoji
- Issue detection is the primary value: don't just report numbers, interpret them
- Keep it scannable. The user is not technical: plain language, no jargon.
Audit-write failures (Task #112): Always include §1c in the report.
Run Q21 with the previous check's timestamp as $1::timestamptz.
Q21 reads audit_write_failures, the alerting sink for every caught
failure of a pipeline_state_changes insert. If Q21 returns any rows,
also run Q21a to fetch the raw error messages for those rows. Any
non-zero count means at least one pipeline mutation completed but its
audit row was lost; flag it as PROBLEM with the call_site (file:line)
so the regression can be traced via git log -p on that file. The
swallowed-then-alerted pattern is intentional: it prevents a broken
audit insert from blocking a real pipeline mutation while still
guaranteeing the failure surfaces within hours instead of waiting for
the next manual code review.
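A sketch of the §1c check, assuming the two statements have been loaded into strings (q21Sql and q21aSql are illustrative names) and that the timestamp is substituted textually, since executeSql, as shown above, takes only sqlQuery and environment:

```js
const since = lastCheck.timestamp; // previous check boundary from Step 0
const withSince = (sql) => sql.replaceAll("$1::timestamptz", `'${since}'::timestamptz`);

// The result shape (rows) is an assumption; adapt to the callback's actual return value.
const failures = await executeSql({ sqlQuery: withSince(q21Sql), environment: "production" });

if (failures.rows.length > 0) {
  // Only fetch the raw error messages when at least one audit write actually failed.
  const details = await executeSql({ sqlQuery: withSince(q21aSql), environment: "production" });
  // Report each row as PROBLEM, listing call_site (file:line), label, and error_message.
}
```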
State-change audit trail (Task #110): Always include §1b in the report. Run Q20 (per-row transitions, joined to plants) and Q20a (aggregated by entity_type, field, source) with the previous check's timestamp as the $1::timestamptz parameter. Both queries cover every flip of plants.published_to_library, plants.pipeline_stage, and lora_training_jobs.lora_url since last check, i.e. who, when, and why every tracked field moved. Render §1b.i as up to ~30 plain prose lines (with the "… +N more" overflow note) and §1b.ii as the aggregate table (see report-template.md). Use the GET /api/admin/pipeline/state-history endpoint to drill into a single plant or source when something looks unfamiliar; flag any unrecognised source that undoes prior publishes as PROBLEM.
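A sketch of the §1b.i cap-and-overflow rendering, assuming each Q20 row has already been turned into a one-line prose description (formatTransition is an illustrative helper, not defined by this skill):

```js
// Render at most `limit` prose lines, then one "… +N more" overflow note.
function renderTransitions(rows, limit = 30) {
  const lines = rows.slice(0, limit).map(formatTransition); // formatTransition: assumed helper
  if (rows.length > limit) {
    lines.push(`… +${rows.length - limit} more`);
  }
  return lines.join("\n");
}
```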
Step 3: Flag Issues
After the table, list specific issues found, ordered by severity:
PROBLEM (needs action):
- Pipeline stages with zero movement since last check
- LoRA validation failure rate above 30%
- Plants stuck in any stage for >48 hours
- Morphology enrichment failures
- LoRA worker downtime (no new trainings when there should be)
- Plants at renderer_ready not advancing to publish_ready
- pending-local-worker LoRA entries (locally trained but not uploaded)
- Image collection stalled (imaging_complete with <20 images)
- Pipeline errors on any plant
- Image source producing 0 images when it was active last check (API down?)
- Image count mismatch (lora_training_image_count != actual reference_images count)
- LoRA SDXL backup stuck on /needs-archive >24h (Q18 stuck_local_acer_only_24h > 0): the Acer archive sweep is failing for those rows; the worker is logging "No SDXL backup found for archival" every 5 min
- LoRA SDXL backup accepted_as_lost count increased since last check: compare Q18's accepted_as_lost against metrics.accepted_as_lost from .local/last_health_check.json; any positive delta means a new file loss happened; record the affected species from Q19 and decide retrain vs re-import
- Audit-write failures since last check (Q21 returned any rows): at least one pipeline_state_changes insert was swallowed, so the audit trail (Q20 / Q20a) is now incomplete for that call site. Always list the call_site (file:line) and label, and pull the raw error_message from Q21a so you can distinguish a transient DB blip from a wiring regression. A call_site inside server/pipelineStateAudit.ts itself usually means a missing table/column on the target DB (check whether migration 0003 / 0004 ran in that environment)
WATCH (monitor):
- Morphology enrichment pace dropping
- Image collection pace below 500/day when plants are queued
- LoRA training pace below 5/day
- Large backlog in any stage relative to throughput
- Plants with morphology but missing critical fields
- Any image source with <80% approval rate
- Plants at 30-34 images (near threshold; quick wins for collection)
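A sketch of how a few of these thresholds could be turned into flag words (the numbers come straight from the lists above; the function names are illustrative):

```js
// >30% validation failures in the check period is a PROBLEM.
function flagValidationFailureRate(rate) {
  return rate > 0.30 ? "PROBLEM" : "OK";
}

// Collection pace under 500/day while plants are queued is a WATCH.
function flagImageCollectionPace(imagesPerDay, plantsQueued) {
  return plantsQueued > 0 && imagesPerDay < 500 ? "WATCH" : "OK";
}

// Any increase in accepted_as_lost is a PROBLEM; no baseline means skip the auto-flag.
function flagAcceptedAsLost(current, previous) {
  if (previous === undefined || previous === null) return "OK";
  return current > previous ? "PROBLEM" : "OK";
}
```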
Step 4: Actionable Recommendations
For each PROBLEM:
- Immediate fix: what DB/state repair to do right now
- Systemic fix: what code change prevents this from recurring
- UI check: does the admin dashboard reflect the fix?
For WATCH items, note what to look for next time.
Step 5: Execute Fixes
Do not just report problems; fix them. For each PROBLEM:
- Apply the immediate data fix (SQL, stage resets, etc.)
- Implement the systemic code fix
- Verify the admin UI shows accurate state after fixes
Step 6: Save Checkpoint
Write .local/last_health_check.json with the current timestamp, a
brief one-line notes field, and the metrics map of counter values
captured this run. The next run reads metrics to compute deltas
automatically (Step 0).
At minimum, persist these keys (extend over time as needed; never drop
keys older runs already wrote):
```json
{
  "timestamp": "<ISO 8601 UTC>",
  "notes": "<one-liner>",
  "metrics": {
    "total_plants": <Q2 total_plants>,
    "total_published": <Q2 total_published>,
    "accepted_as_lost": <Q18 accepted_as_lost>,
    "stuck_local_acer_only_24h": <Q18 stuck_local_acer_only_24h>
  }
}
```
Always write all four keys, even when the value is 0; a missing key makes the next run render (no prior value) instead of a real delta.
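A minimal sketch of the write, assuming the counters were gathered into q2 and q18 result objects during Step 1 (those names are illustrative):

```js
const fs = require("fs");

const checkpoint = {
  timestamp: new Date().toISOString(),
  notes: "routine check, 1 PROBLEM fixed", // illustrative one-liner
  metrics: {
    // Defaulting to 0 keeps every key present even when a counter is zero.
    total_plants: q2.total_plants ?? 0,
    total_published: q2.total_published ?? 0,
    accepted_as_lost: q18.accepted_as_lost ?? 0,
    stuck_local_acer_only_24h: q18.stuck_local_acer_only_24h ?? 0,
  },
};

fs.mkdirSync(".local", { recursive: true }); // tolerate a missing .local directory
fs.writeFileSync(".local/last_health_check.json", JSON.stringify(checkpoint, null, 2));
```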
Issue Detection Heuristics
LoRA Worker Downtime Detection
Compare the gap between the last LoRA training timestamp and now. If >12 hours during a period when plants were queued, flag as potential worker downtime. Common causes: Windows updates, machine restart, network issues.
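A sketch of the gap measurement, assuming lora_training_jobs carries a completion timestamp (completed_at is an assumed column name; substitute the real one):

```js
// Hours since the last completed LoRA training; >12 with plants queued suggests downtime.
const gap = await executeSql({
  sqlQuery: `
    SELECT EXTRACT(EPOCH FROM (NOW() - MAX(completed_at))) / 3600 AS hours_since_last
    FROM lora_training_jobs;  -- completed_at is an assumed column name
  `,
  environment: "production"
});
```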
Stuck Plant Detection
Any plant in morphology_processing with pipeline_stage_updated_at older than 48h is stuck. Same for imaging_queued with no image count increase.
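A sketch of the staleness query (plants, pipeline_stage, and pipeline_stage_updated_at are all named in this document; the interval matches the 48h rule):

```js
const stuck = await executeSql({
  sqlQuery: `
    SELECT id, pipeline_stage, pipeline_stage_updated_at
    FROM plants
    WHERE pipeline_stage = 'morphology_processing'
      AND pipeline_stage_updated_at < NOW() - INTERVAL '48 hours';
  `,
  environment: "production"
});
```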
Validation Failure Spike
If LoRA validation failure rate in the check period is >30%, something is wrong with either the training images or the validation prompt. List the top failure patterns (which species are being misidentified as what).
Image Collection Health
Track images by source (iNaturalist, PlantNet, Pixabay, Brave, GBIF, Wikimedia, Wikipedia, Flickr, web_scrape). Key metrics per source: total collected since last check, validated count, quality score distribution (high >=0.85, medium 0.70-0.84, low <0.70). Flag sources that went silent.
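One way the per-source rollup could look; note that source, quality_score, and created_at are assumed column names on reference_images (only verification_status, image_purpose, and plant_id are confirmed by the sync query below):

```js
const since = lastCheck.timestamp; // Step 0 boundary
const bySource = await executeSql({
  sqlQuery: `
    -- source, quality_score, created_at are assumed column names
    SELECT source,
           COUNT(*) AS collected,
           COUNT(*) FILTER (WHERE verification_status IN ('approved', 'auto_approved')) AS validated,
           COUNT(*) FILTER (WHERE quality_score >= 0.85) AS high,
           COUNT(*) FILTER (WHERE quality_score >= 0.70 AND quality_score < 0.85) AS medium,
           COUNT(*) FILTER (WHERE quality_score < 0.70) AS low
    FROM reference_images
    WHERE created_at >= '${since}'::timestamptz
    GROUP BY source
    ORDER BY collected DESC;
  `,
  environment: "production"
});
```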
Also run an image count sync: compare lora_training_image_count on plants table against actual validated reference_images. If they diverge, sync them with:
```sql
UPDATE plants p SET lora_training_image_count = sub.actual_count
FROM (SELECT ri.plant_id, COUNT(*) AS actual_count FROM reference_images ri
      WHERE ri.verification_status IN ('approved', 'auto_approved') AND ri.image_purpose = 'lora_training'
      GROUP BY ri.plant_id) sub
WHERE sub.plant_id = p.id AND (p.lora_training_image_count IS NULL OR p.lora_training_image_count != sub.actual_count)
  AND p.pipeline_stage IN ('imaging_complete', 'imaging_queued', 'lora_queued', 'enrichment_queued');
```
LoRA SDXL Backup Health (Q18 / Q19)
Two metrics surface the "missing-backup" failure mode that previously sat
silent in the Acer worker console for ~10 days at a time
(incident: Nelumbo nucifera, task #99 / #101):
- Stuck on /needs-archive >24h (stuck_local_acer_only_24h): a completed lora_training_jobs row with lora_url='local-acer-only' that the Acer archive sweep should have backed up to the Replit bucket but hasn't, for more than 24 hours. While that row exists, the SDXL worker logs "WARNING No SDXL backup found for archival of job <id> (species=...)" every 5 min on the Acer console. Treat any row here as a candidate for that warning. Investigate before it turns into an "accepted as lost" entry: the .safetensors might be (a) recoverable on the Acer (re-run archiver), or (b) genuinely missing (operator decides: retrain / re-import / mark local-acer-lost).
- Accepted as lost (accepted_as_lost): cumulative count of jobs flipped to lora_url='local-acer-lost' by an operator. This is a permanent record of file loss; the affected plants most likely cannot render their LoRA until retrained or re-imported. Compare against the previous check's value; only the delta is news. If it increased, list the new species from Q19 in the report so the operator can decide retrain vs re-import without scrolling the Acer console.
Compute the "since last check" delta for accepted_as_lost
automatically: the value is persisted in metrics.accepted_as_lost
inside .local/last_health_check.json from the previous run.
delta = current - previous. If previous is missing (first run with
this metric), render (no prior value) and skip the auto-flag.
Same treatment for stuck_local_acer_only_24h: persist it as
metrics.stuck_local_acer_only_24h so the next run can show whether
the same backups are still stuck or whether the queue has churned.
Publish Gate Analysis
Plants at renderer_ready with lora_validated=true should be candidates for publishing. If the count is high and publish_ready count is low, the publish gate may be too strict or not running.
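A sketch of the gate comparison (pipeline_stage values and lora_validated come from this document; verify the exact names against the schema):

```js
const gate = await executeSql({
  sqlQuery: `
    SELECT COUNT(*) FILTER (WHERE pipeline_stage = 'renderer_ready' AND lora_validated = true) AS gate_candidates,
           COUNT(*) FILTER (WHERE pipeline_stage = 'publish_ready') AS publish_ready
    FROM plants;
  `,
  environment: "production"
});
// A large gate_candidates count beside a small publish_ready count suggests the
// publish gate is too strict or not running.
```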
File References
- Queries: .agents/skills/daily-health-check/queries.sql
- Report template: .agents/skills/daily-health-check/report-template.md
- Last check state: .local/last_health_check.json