Name: Ntfy Alert Triage
Author: wcygan

// Triage ntfy.sh-routed alerts in Anton — identify which alert fired, why it fired (or why it didn't deliver), and propose a fix. Use when "got an ntfy alert", "alert just fired", "ntfy not delivering", "AlertmanagerClusterFailedToSendAlerts", "AlertmanagerFailedToSendAlerts", "code 40014", "attachments not allowed", "iOS push missing", "test the ntfy receiver", "send a test alert", "what just paged me", "is ntfy working". Combines kube-prometheus-stack Alertmanager API, the self-hosted ntfy server (ADR 0026), and the ntfy CLI for poll/publish probes. Read-only by default; proposes edits the operator applies.

stars:2

forks:0

updated: May 23, 2026, 22:42

Same repository

adr.md

from "wcygan/anton"

Anton ADR lifecycle — author new architectural decision records, list existing ones by status or affects-category, and mark old decisions superseded. Use when capturing a decision (especially after `cluster-intake-gatekeeper` returns an ADD/DEFER/REJECT verdict), when reviewing prior decisions before changing direction, when checking if a candidate component has been removed before, or when promoting a decision out of memory into a durable record. ADRs live in `context/adrs/` and are immutable — supersession is the only way to change a decision. The ADR index is built by scanning ADR files directly and injected into every Codex session by `.Codex/hooks/inject_adr_index.py`. Keywords — ADR, architecture decision record, decision log, supersede, decision history, why did we, prior decision, recorded decision, MADR, immutable, intake handoff, cluster-intake-gatekeeper handoff, removal graveyard, reverted decision.

2026-05-23 · 2 stars

cluster-intake.md

from "wcygan/anton"

Intake gate for adding new system or infrastructure components to Anton. Asks the user to declare intent (concrete need, honest learning, or both), then applies the matching rubric — full production rubric for concrete need, contained-learning rubric for learning intake — and returns add / defer / reject with an ADR-ready summary. Welcomes honest learning intake (anton is partly a learning cluster; "things that don't scale" are okay when declared) but rejects completionism dressed as need. Read-only — never scaffolds manifests, never applies to the cluster. Use when asking "should I add X", "can I run X on the cluster", "is X worth adopting", "I want to try X", "I want to learn X", "evaluate new component", "vet this helm chart", "cluster intake", "new app decision", before scaffolding a new Flux app, or when tempted by a shiny project on HN. Hands passing candidates off to add-flux-app. Keywords — intake, adopt, install, new component, new app, evaluate, should I run, worth it, learning, experiment, try out,

2026-05-23 · 2 stars

expose-service.md

from "wcygan/anton"

Expose a workload for access. Four paths: envoy-internal (LAN via split-horizon DNS), Tailscale Ingress (internal remote HTTP with browser-trusted TLS), Tailscale Service annotation (raw TCP / non-HTTP), envoy-external + Cloudflare tunnel (genuinely public, requires explicit approval). Handles HTTPRoute authoring, DNSEndpoint for secondary domains, and per-domain cert wiring.

2026-05-23 · 2 stars

planner.md

from "wcygan/anton"

Anton planner skill — author, update, and close multi-session initiatives (migrations, rollouts, long-running refactors) in `context/plans/`. Use when starting a multi-session initiative, tracking next steps on in-flight work, migrating a memory entry to a durable plan, closing a completed initiative, or reviewing what's open. Plans live at `context/plans/NNNN-kebab-slug.md` and are mutable — they capture execution state (what's next, what's blocked, log of decisions made during work) while ADRs capture immutable decisions (why). The active-plan index is built by scanning plan files directly and injected into every Codex session by `.Codex/hooks/inject_plans_index.py`. Keywords — plan, planner, initiative, track work, multi-session, next steps, checklist, migration plan, rollout plan, roadmap, in-flight, blocker, close plan, review-by, exit plan, timebox, memory-to-plan handoff.

2026-05-23 · 2 stars

anton-temporal-cli.md

from "wcygan/anton"

Use when Codex needs to inspect or troubleshoot the Anton cluster's self-hosted Temporal deployment with the local Temporal CLI, including checking cluster health, namespaces, workflow visibility, schedules, search attributes, Web UI reachability, or Kubernetes readiness for the `temporal` namespace. Prefer this for Temporal CLI tasks in Anton rather than generic Temporal SDK guidance.

2026-05-22 · 2 stars

anton-cluster-health.md

from "wcygan/anton"

Kubernetes-layer health triage for Anton. Use when Codex checks whether the cluster is OK, Flux is healthy, platform controllers are ready, CNI or DNS is failing, cert-manager or ESO is stuck, gateways are broken, cloudflared is disconnected, or apps are unhealthy.

2026-05-22 · 2 stars


name	ntfy-alert-triage
description	Triage ntfy.sh-routed alerts in Anton — identify which alert fired, why it fired (or why it didn't deliver), and propose a fix. Use when "got an ntfy alert", "alert just fired", "ntfy not delivering", "AlertmanagerClusterFailedToSendAlerts", "AlertmanagerFailedToSendAlerts", "code 40014", "attachments not allowed", "iOS push missing", "test the ntfy receiver", "send a test alert", "what just paged me", "is ntfy working". Combines kube-prometheus-stack Alertmanager API, the self-hosted ntfy server (ADR 0026), and the ntfy CLI for poll/publish probes. Read-only by default; proposes edits the operator applies.
allowed-tools	Read, Bash, Grep, Edit

ntfy alert triage

Ordered triage for the anton alert pipeline: PrometheusRule → Alertmanager → AlertmanagerConfig route → webhook → self-hosted ntfy at ntfy.<tailnet>.ts.net → (optionally) ntfy.sh upstream relay → device. The 2026-05-05 cascade where a 40014 from ntfy turned into a 5-minute AlertmanagerClusterFailedToSendAlerts loop is the canonical failure mode this skill is built around.

Pipeline map (memorise this)

PrometheusRule                                Helm-managed via kube-prometheus-stack;
   │                                          Anton-authored rules live in
   │                                          kubernetes/apps/observability/kube-prometheus-stack/app/
   ▼
Alertmanager (alertmanager-kube-prometheus-stack-alertmanager-0)
   │  route: severity=critical → observability/ntfy/ntfy receiver
   │  everything else → "null" receiver
   │  (see AlertmanagerConfig at kubernetes/apps/observability/ntfy/app/alertmanagerconfig.yaml)
   ▼
webhook POST to URL from Secret ntfy-topic in observability ns
   │                                          (templated by ESO from 1Password ntfy/topic)
   ▼
self-hosted ntfy (Deployment ntfy in observability)
   │  base-url: https://ntfy.<tailnet>.ts.net
   │  upstream-base-url: https://ntfy.sh   ← iOS push relay (ADR 0026)
   │  attachment-cache-dir set since 2026-05-05 (configmap.yaml)
   ▼
device (browser at https://ntfy.<tailnet>.ts.net/<topic>, or iOS via ntfy.sh poll trigger)

Decision tree — which question are we answering?

Symptom	Start at
Got an ntfy notification, want to know what it was	§ Step 1
Expected an alert, didn't get one	§ Step 2
`AlertmanagerClusterFailedToSendAlerts` is firing	§ Step 3
Want to verify the path end-to-end	§ Step 4
Need to write/tighten a rule or receiver	hand off to `observability-integrate`

Step 1 — What just fired?

kubectl exec -n observability alertmanager-kube-prometheus-stack-alertmanager-0 \
  -c alertmanager -- wget -qO- \
  'http://localhost:9093/api/v2/alerts?active=true&silenced=false&inhibited=false' \
  | python3 -c "import json,sys; [print(a['labels'].get('alertname'),'|',a['labels'].get('severity'),'|',a['labels'].get('instance',''),'|','start:',a['startsAt'],'|','rcv:',[r['name'] for r in a['receivers']]) for a in json.load(sys.stdin)]"

Sort the active alerts by startsAt and find the one that matches the time the user was paged. Only alerts with rcv: containing observability/ntfy/ntfy actually went to ntfy; everything else routes to null.
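The one-liner above prints alerts in API order, so the sort and the receiver filter are left to the reader. A minimal Python sketch of that step follows, assuming the JSON from `/api/v2/alerts` has already been loaded into a list. The helper names `parse_starts_at` and `ntfy_alerts_by_start` are hypothetical and not part of the skill.

```python
from datetime import datetime

# Receiver name taken from the pipeline map; everything else routes to "null".
NTFY_RECEIVER = "observability/ntfy/ntfy"


def parse_starts_at(ts: str) -> datetime:
    """Parse an RFC 3339 timestamp, tolerating 'Z' and nanosecond fractions."""
    ts = ts.replace("Z", "+00:00")
    head, dot, rest = ts.partition(".")
    if dot:
        # Split the fractional digits from the trailing UTC offset.
        n = 0
        while n < len(rest) and rest[n].isdigit():
            n += 1
        frac, offset = rest[:n], rest[n:]
        # datetime.fromisoformat accepts at most microsecond precision.
        ts = f"{head}.{frac[:6].ljust(6, '0')}{offset}"
    return datetime.fromisoformat(ts)


def ntfy_alerts_by_start(alerts: list) -> list:
    """Keep only alerts routed to the ntfy receiver, oldest first."""
    routed = [
        a for a in alerts
        if NTFY_RECEIVER in [r["name"] for r in a.get("receivers", [])]
    ]
    return sorted(routed, key=lambda a: parse_starts_at(a["startsAt"]))
```

Feed it `json.load(sys.stdin)` from the same `wget` pipeline; the last entry is the most recently started alert that actually went to ntfy.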

Ground-truth check before reacting: a fresh-looking alert can be a metric/series artifact, not a real event. For NodeUnexpectedReboot specifically, check /proc/uptime directly via talosctl; cross-reference with the cluster-triage agent memory at .Codex/agent-memory/cluster-triage/reference_reboot_alert_disambiguation.md.

Step 2 — Why didn't it deliver?

Two failure classes; distinguish them by Alertmanager's behaviour.

Class A: never reached Alertmanager. Prometheus didn't fire the alert, the rule expression returns no samples, or the rule isn't loaded.

# Did Prometheus load the rule?
kubectl exec -n observability prometheus-kube-prometheus-stack-prometheus-0 \
  -c prometheus -- wget -qO- http://localhost:9090/api/v1/rules \
  | python3 -c "import json,sys; [print(g['name'],'/',r['name']) for g in json.load(sys.stdin)['data']['groups'] for r in g['rules']]" \
  | grep -i <alertname>

# Is the expression returning samples right now?
# (Pull the expr from the PrometheusRule yaml and test it via /api/v1/query.)
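The second check above is described only in a comment. As a rough sketch, the helpers below build the `/api/v1/query` URL for an expression and decide from the response whether it currently returns samples. The function names and the `PROM_BASE` constant are illustrative assumptions; the port matches the `localhost:9090` used in the command above.

```python
import json
from urllib.parse import urlencode

# Assumption: queried from inside the Prometheus container, as in the command above.
PROM_BASE = "http://localhost:9090"


def query_url(expr: str) -> str:
    """Build an instant-query URL with the PromQL expression URL-encoded."""
    return f"{PROM_BASE}/api/v1/query?{urlencode({'query': expr})}"


def expr_has_samples(response_text: str) -> bool:
    """Return True when the instant query returned at least one sample."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    # For a vector result, an empty list means the alert condition is not met.
    return len(body["data"]["result"]) > 0
```

An empty result for a loaded rule points at the expression or its labels rather than at Alertmanager or ntfy.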

Class B: reached Alertmanager but didn't deliver — receiver matchers excluded it, or webhook delivery failed.

# Receiver matchers — only severity=critical reaches ntfy in anton.
kubectl exec -n observability alertmanager-kube-prometheus-stack-alertmanager-0 \
  -c alertmanager -- wget -qO- http://localhost:9093/api/v2/status \
  | python3 -c "import json,sys,re; print(re.search(r'route:.*?inhibit_rules', json.load(sys.stdin)['config']['original'], re.S).group(0))"

If your alert has severity warning/info, it routes to null by design. To reach ntfy, either bump severity in the rule or broaden the AlertmanagerConfig matcher (see ADR 0026 — "easier to broaden than to silence"; reconsider widening to "everything except info").

Step 3 — Decode the delivery failure

This is the 2026-05-05 cascade flow. When AlertmanagerClusterFailedToSendAlerts is firing, the actual error is in the alertmanager pod logs, not the alert payload.

kubectl logs -n observability alertmanager-kube-prometheus-stack-alertmanager-0 \
  -c alertmanager --tail=200 | grep -iE 'error|fail|notify' | tail -20

Match the error to the table:

Log fragment	Root cause	Fix
`code":40014,"http":400,"error":"invalid request: attachments not allowed"`	ntfy server has `attachment-cache-dir` unset; AM payload exceeds the ~5 KiB inline cap and ntfy refuses to spill	Add `attachment-cache-dir: /var/cache/ntfy/attachments` + size limits to `kubernetes/apps/observability/ntfy/app/configmap.yaml` (already fixed in main as of 2026-05-05)
`unexpected status code 401`	ntfy ACL added without updating the webhook URL secret	Refresh the `ntfy-topic` secret (1Password `ntfy/topic`) — see `rotate-credential`
`dial tcp ... no route to host` / `connection refused`	ntfy pod down or service IP changed	`kubectl get pod,svc -n observability -l app.kubernetes.io/name=ntfy`
`x509: certificate signed by unknown authority`	TLS chain regression on the cluster_gateway / cert-manager	Hand off to `anton-cluster-health` layer-5
`context deadline exceeded`	ntfy.sh upstream slow; usually transient	Wait one repeat-interval; if persistent, check status.ntfy.sh
repeated `notify retry canceled due to unrecoverable error` for the same alert group	ntfy returning 4xx — the alert payload itself is malformed for the receiver	Decode the request body shape (see § Step 4)
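The table lookup can be sketched as a small classifier over log lines. This is an illustration only: the fragments and causes are transcribed from the table above, while `KNOWN_FAILURES` and `classify` are hypothetical names and real Alertmanager log lines may be worded differently.

```python
# (log fragment, root cause) pairs transcribed from the table above.
# Order matters: specific fragments are checked before the generic retry message.
KNOWN_FAILURES = [
    ("attachments not allowed",
     "ntfy attachment-cache-dir unset; payload exceeds the inline cap"),
    ("unexpected status code 401",
     "ntfy ACL added without updating the webhook URL secret"),
    ("no route to host", "ntfy pod down or service IP changed"),
    ("connection refused", "ntfy pod down or service IP changed"),
    ("certificate signed by unknown authority",
     "TLS chain regression on the gateway / cert-manager"),
    ("context deadline exceeded", "ntfy.sh upstream slow; usually transient"),
    ("notify retry canceled due to unrecoverable error",
     "ntfy returning 4xx; payload malformed for the receiver"),
]


def classify(line: str) -> str:
    """Map one log line to the first matching root cause, else 'unknown'."""
    lowered = line.lower()
    for fragment, cause in KNOWN_FAILURES:
        if fragment in lowered:
            return cause
    return "unknown"
```

A line that mentions both the retry cancellation and `attachments not allowed` is reported as the attachment problem, because the more specific fragment is checked first.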

Silence while fixing: AlertmanagerClusterFailedToSendAlerts will keep paging while you work. Silence it via amtool silence add alertname=AlertmanagerClusterFailedToSendAlerts --duration=30m -c <comment>, or accept ~5-15 min of stale alerts after the fix lands.

Step 4 — Probe the pipeline with the ntfy CLI

ntfy is preinstalled on the operator workstation. Reference: https://docs.ntfy.sh/subscribe/cli/.

The webhook URL — including the secret topic — lives in the ntfy-topic Secret in observability. Never echo the URL to stdout or write it to a file (AGENTS.md hard rule). Pipe it directly:

# Read the URL into a shell variable WITHOUT printing it.
# (Single command; topic stays out of shell history if HISTCONTROL=ignorespace and you prefix with a space.)
 NTFY_URL=$(kubectl get secret -n observability ntfy-topic -o jsonpath='{.data.url}' | base64 -d)

Then split server + topic if you need them separately:

 NTFY_SERVER="${NTFY_URL%/*}"
 NTFY_TOPIC="${NTFY_URL##*/}"
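For readers less familiar with shell parameter expansion, the two expansions above split the URL at its last `/`. A Python equivalent is sketched here with a made-up placeholder URL; never run it against the real secret value in a way that prints the result.

```python
def split_ntfy_url(url: str):
    """Split at the last '/', mirroring ${NTFY_URL%/*} and ${NTFY_URL##*/}."""
    # Caveat: for a URL with no '/', the shell expansions both return the
    # whole string, while rpartition returns an empty server part.
    server, _, topic = url.rpartition("/")
    return server, topic


# Placeholder values only; the real topic must stay secret.
server, topic = split_ntfy_url("https://ntfy.example.test/hypothetical-topic")
```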

Read recent deliveries (no long-running connection)

ntfy subscribe --poll --since 30m "${NTFY_URL}"

Use --since 1h / --since 12h for wider windows (ntfy cache-duration: 12h in anton's ConfigMap; older messages are gone). The --poll flag fetches and exits — never start a backgrounded ntfy subscribe from this skill.

Send a test publish

ntfy publish --title "anton triage probe" --priority default --tags test \
  "${NTFY_URL}" "$(date -u +%FT%TZ) probe from ntfy-alert-triage skill"

A successful publish prints the message ID and returns 0. A 40014 here means attachments are still disabled (see Step 3). A 401 means ACLs were added without updating the secret. A 404 on the topic means the topic name doesn't match the URL path.

Reproduce the 40014 (size-spill check)

The original cascade was triggered by Alertmanager grouping 2+ alerts into one webhook body that exceeded ~5 KiB. To verify that the attachment cache works end-to-end:

ntfy publish "${NTFY_URL}" "$(python3 -c 'print("x"*8000)')"

Pre-fix this returned 40014; post-fix it succeeds and the message is stored as an attachment (visible in the ntfy web UI, not inline on iOS).
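To reason about which payloads take the attachment path, a byte-length check is enough. The sketch below assumes the approximate 5 KiB cap quoted in this section; the exact limit depends on the ntfy server configuration, and `INLINE_CAP_BYTES` and `spills_to_attachment` are illustrative names.

```python
# Assumption: the "~5 KiB" inline cap quoted in the text, not a verified server setting.
INLINE_CAP_BYTES = 5 * 1024


def spills_to_attachment(body: str, cap: int = INLINE_CAP_BYTES) -> bool:
    """Return True when the UTF-8 encoded body is larger than the inline cap."""
    # Size is measured in bytes, so multi-byte characters count more than once.
    return len(body.encode("utf-8")) > cap
```

The 8000-character probe above is well over this cap, which is why it exercises the attachment path.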

Verify the iOS upstream relay

iOS push goes via upstream-base-url: https://ntfy.sh (ADR 0026 — only message IDs transit; bodies stay on cluster). To check the upstream poll registration:

kubectl logs -n observability deploy/ntfy --tail=50 | grep -i upstream

A healthy line looks like Successfully forwarded message to upstream. Errors mean iOS won't get push notifications even when desktop browser delivery works.

Anton-specific reference

Where it lives	What it is
`kubernetes/apps/observability/ntfy/app/alertmanagerconfig.yaml`	The route — currently `severity=critical` only (per ADR 0026 "easier to broaden than silence")
`kubernetes/apps/observability/ntfy/app/configmap.yaml`	ntfy server.yml — base-url, upstream-base-url, attachment limits
`kubernetes/apps/observability/ntfy/app/externalsecret.yaml`	ESO mapping that fills the `ntfy-topic` Secret from 1Password
`kubernetes/apps/observability/ntfy/app/deployment.yaml`	Single-replica ntfy v2.x, RWO cache PVC, Recreate strategy
ADR 0026 (`context/adrs/0026-self-hosted-ntfy-as-alertmanager-destination.md`)	Why ntfy, why upstream relay, broadening policy
Postmortem 2026-05-05 (alert cascade)	Captured as commit `084babaa` body — three coordinated fixes

Hard rules (carry-overs from AGENTS.md)

Never echo, log, or write the topic URL or secret value. Always pipe kubectl get secret ... -o jsonpath directly into the consuming command.
Never edit *.sops.* files in plaintext; use sops <file> for round-trip.
Never restart the alertmanager StatefulSet to "fix" a delivery problem — the failure is upstream of AM in 99% of cases. Restart ntfy if the ntfy pod is the suspect; restart AM only if its own logs say so.
The Vector kernel sink (talos-log-sink) and its 30Gi PVC are unrelated to ntfy alerting; if both are firing alerts, treat them independently.

Hand-offs

Rule expression is wrong / missing → observability-integrate
ntfy pod itself is broken (CrashLoop, OOM, image pull) → anton-cluster-health layer 5, then debug-flux-reconciliation if it's a Flux apply problem
Topic / token compromised or suspected leaked → rotate-credential
Need to add a new alert that should reach ntfy → observability-integrate for the rule, then verify with § Step 4 here
