| name | daily-health-check |
| description | Run the daily pipeline health check. Use when the user says "daily check", "run the health check", "pipeline check", "what happened since last time", "daily routine", or any variation asking about overall pipeline status and health. Walks every stage from image import through publishing, reports what changed since last check, and flags issues. |
Daily Pipeline Health Check
A structured audit of the entire plant pipeline, from image collection through publishing. Designed to be run once a day (or on demand) and report what changed since the last check.
When to Use
User says any of: "daily check", "health check", "pipeline check", "daily routine", "what happened", "run the check", "check the pipeline", or similar.
Core Principle: 50/50 Fix + Prevent
Every issue found must be addressed in two halves:
- Fix what is wrong now: repair the immediate data/state problem (DB updates, stage resets, etc.)
- Fix the root cause so it can't happen again: find the code path that created the bad state and patch it systemically. Not a band-aid, not a workaround, but a deep fix in the code that created the problem.
If you only do #1, the same issue will reappear on the next check. If you only do #2, the existing bad data stays broken. Always do both.
Examples of systemic fixes:
- If plants pile up in a stage with no exit path, add the missing fallback to the stage advancement logic
- If image counts drift from actual DB counts, add auto-sync to the pipeline advancement cycle
- If plants get flagged for something the pipeline can handle automatically, remove the flag from the code path
- If extraction stamps are missing, make the tier computation self-healing
Examples of band-aid fixes (DO NOT do these alone):
- Manually resetting 400 plants every week without fixing why they got stuck
- Unflagging plants without removing the code that re-flags them
- Running one-off SQL to fix counts without adding auto-sync
Admin UI Obligation
After every fix (both the immediate data repair AND the systemic code fix), verify the admin dashboard reflects the current state:
- Does the pipeline stage breakdown show accurate counts?
- Are stuck/flagged sections updated?
- Are any progress bars or funnels stale?
- If you added new pipeline behavior, does the UI expose it?
The admin dashboard is the user's window into the pipeline. If the data is fixed but the UI shows stale numbers, the fix is incomplete.
Protocol
Step 0: Load Last Check State
Read .local/last_health_check.json. If it exists:
- Use the timestamp field as the "since" boundary.
- Capture the metrics map (counter values from the previous run) so that this run can compute deltas without manually scrolling old reports.

If the file is missing, default the timestamp to 24 hours ago and treat all metrics as null (delta unknown; render as (no prior value)).

Backward compatibility: legacy checkpoints may contain only {timestamp, summary} (no metrics map, and summary instead of notes). Treat both shapes as valid input: fall back to summary when notes is absent, and treat a missing or empty metrics map exactly like a fresh run (every counter renders (no prior value)). The next checkpoint write will normalize the file to the new shape.
The metrics map has this shape (all integers; add new entries as the skill grows, but never remove existing ones, because older runs need them):
```json
{
  "timestamp": "2026-03-28T08:00:00Z",
  "notes": "brief one-liner",
  "metrics": {
    "total_plants": 4321,
    "total_published": 312,
    "accepted_as_lost": 7,
    "stuck_local_acer_only_24h": 0
  }
}
```
Carry the loaded metrics through the run so Step 2 can render
"accepted_as_lost: N (+X since last check)" automatically and Step 3
can auto-flag PROBLEM when accepted_as_lost increased.
If a key is present in this run's results but missing from the previous
file (e.g. first run after the skill grew a new counter), render the
delta as (no prior value) rather than guessing zero. Use the same
label everywhere a baseline is missing; never invent variants like
"(new metric)" or "(unknown)".
Step 1: Run the Diagnostic Queries
CRITICAL: ALL queries MUST target the production database. Always pass environment: "production" to every executeSql() call. The development database is stale and does not reflect the live pipeline state. Running health checks against the dev database produces meaningless results.
```js
// Correct: targets the production database explicitly.
const result = await executeSql({
  sqlQuery: "SELECT ...",
  environment: "production"
});
```

```js
// WRONG: omits environment, so the query silently hits the stale dev database.
const result = await executeSql({
  sqlQuery: "SELECT ..."
});
```
Execute ALL queries from .agents/skills/daily-health-check/queries.sql using the executeSql callback in code_execution. Run them in logical batches to minimize round-trips.
The queries are organized into sections matching the pipeline stages. Each returns structured data that feeds into the report.
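A rough sketch of the batched execution, assuming queries.sql holds semicolon-terminated statements and that executeSql runs one statement per call (both assumptions; adapt to the real file layout):

```js
const fs = require("fs");

// Naive split on statement boundaries; adjust if queries.sql uses section markers.
const sql = fs.readFileSync(".agents/skills/daily-health-check/queries.sql", "utf8");
const statements = sql.split(/;\s*\n/).map((s) => s.trim()).filter(Boolean);

const results = [];
for (const sqlQuery of statements) {
  // Every call targets production, per the rule above.
  results.push(await executeSql({ sqlQuery, environment: "production" }));
}
```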
Step 2: Build the Report
Present findings as a single table-based report to the user. The format is defined in .agents/skills/daily-health-check/report-template.md.
Key principles:
- "Since last check" column shows deltas, not absolute totals (though totals are shown too)
- Traffic-light flags: use the words "OK", "WATCH", "PROBLEM", not emoji
- Issue detection is the primary value: don't just report numbers, interpret them
- Keep it scannable. The user is not technical: plain language, no jargon.
Audit-write failures (Task #112): Always include §1c in the report.
Run Q21 with the previous check's timestamp as $1::timestamptz.
Q21 reads audit_write_failures, the alerting sink for every caught
failure of a pipeline_state_changes insert. If Q21 returns any rows,
also run Q21a to fetch the raw error messages for those rows. Any
non-zero count means at least one pipeline mutation completed but its
audit row was lost; flag it as PROBLEM with the call_site (file:line)
so the regression can be traced via git log -p on that file. The
swallowed-then-alerted pattern is intentional: it prevents a broken
audit insert from blocking a real pipeline mutation while still
guaranteeing the failure surfaces within hours instead of waiting for
the next manual code review.
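A sketch of the §1c check, assuming the two statements have been loaded into strings (q21Sql and q21aSql are illustrative names) and that the timestamp is substituted textually, since executeSql, as shown above, takes only sqlQuery and environment:

```js
const since = lastCheck.timestamp; // previous check boundary from Step 0
const withSince = (sql) => sql.replaceAll("$1::timestamptz", `'${since}'::timestamptz`);

// The result shape (rows) is an assumption; adapt to the callback's actual return value.
const failures = await executeSql({ sqlQuery: withSince(q21Sql), environment: "production" });

if (failures.rows.length > 0) {
  // Only fetch the raw error messages when at least one audit write actually failed.
  const details = await executeSql({ sqlQuery: withSince(q21aSql), environment: "production" });
  // Report each row as PROBLEM, listing call_site (file:line), label, and error_message.
}
```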
State-change audit trail (Task #110): Always include §1b in the report. Run Q20 (per-row transitions, joined to plants) and Q20a (aggregated by entity_type, field, source) with the previous check's timestamp as the $1::timestamptz parameter. Both queries cover every flip of plants.published_to_library, plants.pipeline_stage, and lora_training_jobs.lora_url since last check, i.e. who, when, and why every tracked field moved. Render §1b.i as up to ~30 plain prose lines (with the "… +N more" overflow note) and §1b.ii as the aggregate table (see report-template.md). Use the GET /api/admin/pipeline/state-history endpoint to drill into a single plant or source when something looks unfamiliar; flag any unrecognised source that undoes prior publishes as PROBLEM.
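A sketch of the §1b.i cap-and-overflow rendering, assuming each Q20 row has already been turned into a one-line prose description (formatTransition is an illustrative helper, not defined by this skill):

```js
// Render at most `limit` prose lines, then one "… +N more" overflow note.
function renderTransitions(rows, limit = 30) {
  const lines = rows.slice(0, limit).map(formatTransition); // formatTransition: assumed helper
  if (rows.length > limit) {
    lines.push(`… +${rows.length - limit} more`);
  }
  return lines.join("\n");
}
```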
Step 3: Flag Issues
After the table, list specific issues found, ordered by severity:
PROBLEM (needs action):
- Pipeline stages with zero movement since last check
- LoRA validation failure rate above 30%
- Plants stuck in any stage for >48 hours
- Morphology enrichment failures
- LoRA worker downtime (no new trainings when there should be)
- Plants at renderer_ready not advancing to publish_ready
- pending-local-worker LoRA entries (locally trained but not uploaded)
- Image collection stalled (imaging_complete with <20 images)
- Pipeline errors on any plant
- Image source producing 0 images when it was active last check (API down?)
- Image count mismatch (lora_training_image_count != actual reference_images count)
- LoRA SDXL backup stuck on /needs-archive >24h (Q18 stuck_local_acer_only_24h > 0): the Acer archive sweep is failing for those rows; the worker is logging "No SDXL backup found for archival" every 5 min
- LoRA SDXL backup accepted_as_lost count increased since last check: compare Q18's accepted_as_lost against metrics.accepted_as_lost from .local/last_health_check.json; any positive delta means a new file loss happened; record the affected species from Q19 and decide retrain vs re-import
- Audit-write failures since last check (Q21 returned any rows): at least one pipeline_state_changes insert was swallowed, so the audit trail (Q20 / Q20a) is now incomplete for that call site. Always list the call_site (file:line) and label, and pull the raw error_message from Q21a so you can distinguish a transient DB blip from a wiring regression. A call_site inside server/pipelineStateAudit.ts itself usually means a missing table/column on the target DB (check whether migration 0003 / 0004 ran in that environment)
WATCH (monitor):
- Morphology enrichment pace dropping
- Image collection pace below 500/day when plants are queued
- LoRA training pace below 5/day
- Large backlog in any stage relative to throughput
- Plants with morphology but missing critical fields
- Any image source with <80% approval rate
- Plants at 30-34 images (near threshold; quick wins for collection)
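A sketch of how a few of these thresholds could be turned into flag words (the numbers come straight from the lists above; the function names are illustrative):

```js
// >30% validation failures in the check period is a PROBLEM.
function flagValidationFailureRate(rate) {
  return rate > 0.30 ? "PROBLEM" : "OK";
}

// Collection pace under 500/day while plants are queued is a WATCH.
function flagImageCollectionPace(imagesPerDay, plantsQueued) {
  return plantsQueued > 0 && imagesPerDay < 500 ? "WATCH" : "OK";
}

// Any increase in accepted_as_lost is a PROBLEM; no baseline means skip the auto-flag.
function flagAcceptedAsLost(current, previous) {
  if (previous === undefined || previous === null) return "OK";
  return current > previous ? "PROBLEM" : "OK";
}
```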
Step 4: Actionable Recommendations
For each PROBLEM:
- Immediate fix: what DB/state repair to do right now
- Systemic fix: what code change prevents this from recurring
- UI check: does the admin dashboard reflect the fix?
For WATCH items, note what to look for next time.
Step 5: Execute Fixes
Do not just report problems; fix them. For each PROBLEM:
- Apply the immediate data fix (SQL, stage resets, etc.)
- Implement the systemic code fix
- Verify the admin UI shows accurate state after fixes
Step 6: Save Checkpoint
Write .local/last_health_check.json with the current timestamp, a
brief one-line notes field, and the metrics map of counter values
captured this run. The next run reads metrics to compute deltas
automatically (Step 0).
At minimum, persist these keys (extend over time as needed; never drop
keys older runs already wrote):
```json
{
  "timestamp": "<ISO 8601 UTC>",
  "notes": "<one-liner>",
  "metrics": {
    "total_plants": <Q2 total_plants>,
    "total_published": <Q2 total_published>,
    "accepted_as_lost": <Q18 accepted_as_lost>,
    "stuck_local_acer_only_24h": <Q18 stuck_local_acer_only_24h>
  }
}
```
Always write all four keys, even when the value is 0; a missing key makes the next run render (no prior value) instead of a real delta.
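A minimal sketch of the write, assuming the counters were gathered into q2 and q18 result objects during Step 1 (those names are illustrative):

```js
const fs = require("fs");

const checkpoint = {
  timestamp: new Date().toISOString(),
  notes: "routine check, 1 PROBLEM fixed", // illustrative one-liner
  metrics: {
    // Defaulting to 0 keeps every key present even when a counter is zero.
    total_plants: q2.total_plants ?? 0,
    total_published: q2.total_published ?? 0,
    accepted_as_lost: q18.accepted_as_lost ?? 0,
    stuck_local_acer_only_24h: q18.stuck_local_acer_only_24h ?? 0,
  },
};

fs.mkdirSync(".local", { recursive: true }); // tolerate a missing .local directory
fs.writeFileSync(".local/last_health_check.json", JSON.stringify(checkpoint, null, 2));
```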
Issue Detection Heuristics
LoRA Worker Downtime Detection
Compare the gap between the last LoRA training timestamp and now. If >12 hours during a period when plants were queued, flag as potential worker downtime. Common causes: Windows updates, machine restart, network issues.
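A sketch of the gap measurement, assuming lora_training_jobs carries a completion timestamp (completed_at is an assumed column name; substitute the real one):

```js
// Hours since the last completed LoRA training; >12 with plants queued suggests downtime.
const gap = await executeSql({
  sqlQuery: `
    SELECT EXTRACT(EPOCH FROM (NOW() - MAX(completed_at))) / 3600 AS hours_since_last
    FROM lora_training_jobs;  -- completed_at is an assumed column name
  `,
  environment: "production"
});
```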
Stuck Plant Detection
Any plant in morphology_processing with pipeline_stage_updated_at older than 48h is stuck. Same for imaging_queued with no image count increase.
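A sketch of the staleness query (plants, pipeline_stage, and pipeline_stage_updated_at are all named in this document; the interval matches the 48h rule):

```js
const stuck = await executeSql({
  sqlQuery: `
    SELECT id, pipeline_stage, pipeline_stage_updated_at
    FROM plants
    WHERE pipeline_stage = 'morphology_processing'
      AND pipeline_stage_updated_at < NOW() - INTERVAL '48 hours';
  `,
  environment: "production"
});
```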
Validation Failure Spike
If LoRA validation failure rate in the check period is >30%, something is wrong with either the training images or the validation prompt. List the top failure patterns (which species are being misidentified as what).
Image Collection Health
Track images by source (iNaturalist, PlantNet, Pixabay, Brave, GBIF, Wikimedia, Wikipedia, Flickr, web_scrape). Key metrics per source: total collected since last check, validated count, quality score distribution (high >=0.85, medium 0.70-0.84, low <0.70). Flag sources that went silent.
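One way the per-source rollup could look; note that source, quality_score, and created_at are assumed column names on reference_images (only verification_status, image_purpose, and plant_id are confirmed by the sync query below):

```js
const since = lastCheck.timestamp; // Step 0 boundary
const bySource = await executeSql({
  sqlQuery: `
    -- source, quality_score, created_at are assumed column names
    SELECT source,
           COUNT(*) AS collected,
           COUNT(*) FILTER (WHERE verification_status IN ('approved', 'auto_approved')) AS validated,
           COUNT(*) FILTER (WHERE quality_score >= 0.85) AS high,
           COUNT(*) FILTER (WHERE quality_score >= 0.70 AND quality_score < 0.85) AS medium,
           COUNT(*) FILTER (WHERE quality_score < 0.70) AS low
    FROM reference_images
    WHERE created_at >= '${since}'::timestamptz
    GROUP BY source
    ORDER BY collected DESC;
  `,
  environment: "production"
});
```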
Also run an image count sync: compare lora_training_image_count on plants table against actual validated reference_images. If they diverge, sync them with:
```sql
UPDATE plants p SET lora_training_image_count = sub.actual_count
FROM (SELECT ri.plant_id, COUNT(*) AS actual_count FROM reference_images ri
      WHERE ri.verification_status IN ('approved', 'auto_approved') AND ri.image_purpose = 'lora_training'
      GROUP BY ri.plant_id) sub
WHERE sub.plant_id = p.id AND (p.lora_training_image_count IS NULL OR p.lora_training_image_count != sub.actual_count)
  AND p.pipeline_stage IN ('imaging_complete', 'imaging_queued', 'lora_queued', 'enrichment_queued');
```
LoRA SDXL Backup Health (Q18 / Q19)
Two metrics surface the "missing-backup" failure mode that previously sat
silent in the Acer worker console for ~10 days at a time
(incident: Nelumbo nucifera, task #99 / #101):
- Stuck on /needs-archive >24h (stuck_local_acer_only_24h): a completed lora_training_jobs row with lora_url='local-acer-only' that the Acer archive sweep should have backed up to the Replit bucket but hasn't, for more than 24 hours. While that row exists, the SDXL worker logs "WARNING No SDXL backup found for archival of job <id> (species=...)" every 5 min on the Acer console. Treat any row here as a candidate for that warning. Investigate before it turns into an "accepted as lost" entry: the .safetensors might be (a) recoverable on the Acer (re-run archiver), or (b) genuinely missing (operator decides: retrain / re-import / mark local-acer-lost).
- Accepted as lost (accepted_as_lost): cumulative count of jobs flipped to lora_url='local-acer-lost' by an operator. This is a permanent record of file loss; the affected plants most likely cannot render their LoRA until retrained or re-imported. Compare against the previous check's value; only the delta is news. If it increased, list the new species from Q19 in the report so the operator can decide retrain vs re-import without scrolling the Acer console.
Compute the "since last check" delta for accepted_as_lost
automatically: the value is persisted in metrics.accepted_as_lost
inside .local/last_health_check.json from the previous run.
delta = current - previous. If previous is missing (first run with
this metric), render (no prior value) and skip the auto-flag.
Same treatment for stuck_local_acer_only_24h: persist it as
metrics.stuck_local_acer_only_24h so the next run can show whether
the same backups are still stuck or whether the queue has churned.
Publish Gate Analysis
Plants at renderer_ready with lora_validated=true should be candidates for publishing. If the count is high and publish_ready count is low, the publish gate may be too strict or not running.
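A sketch of the gate comparison (pipeline_stage values and lora_validated come from this document; verify the exact names against the schema):

```js
const gate = await executeSql({
  sqlQuery: `
    SELECT COUNT(*) FILTER (WHERE pipeline_stage = 'renderer_ready' AND lora_validated = true) AS gate_candidates,
           COUNT(*) FILTER (WHERE pipeline_stage = 'publish_ready') AS publish_ready
    FROM plants;
  `,
  environment: "production"
});
// A large gate_candidates count beside a small publish_ready count suggests the
// publish gate is too strict or not running.
```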
File References
- Queries: .agents/skills/daily-health-check/queries.sql
- Report template: .agents/skills/daily-health-check/report-template.md
- Last check state: .local/last_health_check.json