
monitoring-authoring


Updated March 23, 2026 at 02:56

Author monitoring resources: PrometheusRules, ServiceMonitors, PodMonitors, AlertmanagerConfig, Silence CRs, and canary-checker health checks. Use when: (1) Creating or modifying alert rules (PrometheusRule), (2) Adding scrape targets (ServiceMonitor/PodMonitor), (3) Configuring Alertmanager routing or silences, (4) Writing canary-checker health checks, (5) Creating recording rules, (6) Adding monitoring for a new application or platform component. Triggers: "create alert", "add alerting", "PrometheusRule", "ServiceMonitor", "PodMonitor", "AlertmanagerConfig", "silence alert", "canary check", "recording rule", "add monitoring", "scrape target", "alert rule", "prometheus rule", "health check canary"


Source

ionfury

ionfury/homelab

name	monitoring-authoring
description	Author monitoring resources: PrometheusRules, ServiceMonitors, PodMonitors, AlertmanagerConfig, Silence CRs, and canary-checker health checks. Use when: (1) Creating or modifying alert rules (PrometheusRule), (2) Adding scrape targets (ServiceMonitor/PodMonitor), (3) Configuring Alertmanager routing or silences, (4) Writing canary-checker health checks, (5) Creating recording rules, (6) Adding monitoring for a new application or platform component. Triggers: "create alert", "add alerting", "PrometheusRule", "ServiceMonitor", "PodMonitor", "AlertmanagerConfig", "silence alert", "canary check", "recording rule", "add monitoring", "scrape target", "alert rule", "prometheus rule", "health check canary"
user-invocable	false

Monitoring Resource Authoring

This skill covers creating and modifying monitoring resources. For querying Prometheus or investigating alerts, see the prometheus skill and sre skill.

Resource Types

Resource	API Group	Purpose
`PrometheusRule`	`monitoring.coreos.com/v1`	Alert rules and recording rules
`ServiceMonitor`	`monitoring.coreos.com/v1`	Scrape metrics from Services
`PodMonitor`	`monitoring.coreos.com/v1`	Scrape metrics from Pods directly
`ScrapeConfig`	`monitoring.coreos.com/v1alpha1`	Advanced scrape configuration
`AlertmanagerConfig`	`monitoring.coreos.com/v1alpha1`	Routing, receivers, silencing
`Silence`	`observability.giantswarm.io/v1alpha2`	Declarative Alertmanager silences
`Canary`	`canaries.flanksource.com/v1`	Synthetic health checks (HTTP, TCP, K8s)

See [references/file-placement.md] for where to put each resource type and naming conventions.

PrometheusRule Authoring

Every PrometheusRule must carry the `release: kube-prometheus-stack` label; without it, Prometheus does not discover the rule.

PrometheusRule template: see references/alert-patterns.md
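As a minimal sketch of that shape (not the template from references/alert-patterns.md), a PrometheusRule carrying the required label might look like the following. The resource name, namespace, alert name, and the `job` label value are assumptions for illustration.

```yaml
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: <component>-alerts            # hypothetical name
  namespace: monitoring               # assumption; see references/file-placement.md
  labels:
    release: kube-prometheus-stack    # REQUIRED for discovery
spec:
  groups:
    - name: <component>
      rules:
        - alert: <Component>Down      # hypothetical alert name
          expr: up{job="<component>"} == 0   # assumes the scrape job is named after the component
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "<component> target is down"
            description: "{{ $labels.instance }} has been unreachable for 5 minutes."
```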

Severity and `for` Duration

Severity	`for` Duration	Use Case	Routing
`critical`	2m-5m	Service down, data loss risk	Discord
`warning`	5m-15m	Degraded performance, limits	Discord
`info`	10m-30m	Informational, non-urgent	Silenced by InfoInhibitor

Guidelines: use `for: 0m` only for instant failures (e.g., a SMART failure). Default to 5m for most alerts. Use 10m-15m for flap-prone metrics (error rates, latency). Use 5m for absence detection.
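The three cases in these guidelines could be sketched as rule entries like the ones below. Alert names, metric names (`smartctl_device_smart_status`, `http_requests_total`), the 5% threshold, and the `job` label are illustrative assumptions, not rules taken from this repository.

```yaml
# Fragment of spec.groups[].rules; all names below are hypothetical.
- alert: DiskSmartFailure                   # instant failure: fire immediately
  expr: smartctl_device_smart_status == 0   # assumes smartctl_exporter metrics are scraped
  for: 0m
  labels:
    severity: critical
- alert: <Component>HighErrorRate           # flap-prone ratio: wait longer before firing
  expr: |
    sum(rate(http_requests_total{job="<component>", code=~"5.."}[5m]))
      /
    sum(rate(http_requests_total{job="<component>"}[5m])) > 0.05
  for: 10m
  labels:
    severity: warning
- alert: <Component>MetricsAbsent           # absence detection
  expr: absent(up{job="<component>"})
  for: 5m
  labels:
    severity: warning
```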

Alert Grouping

Group related alerts in named rule groups; the group name determines how rules are organized in the Prometheus UI:

spec:
  groups:
    - name: cilium-agent       # Agent availability and health
      rules: [...]
    - name: cilium-bpf         # BPF subsystem alerts
      rules: [...]

See [references/alert-patterns.md] for common alert patterns (down, error rate, latency, capacity, PVC), annotation template functions, and recording rule examples.
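For orientation before opening that reference, a recording rule uses `record` in place of `alert` inside the same group structure. This is a sketch only: the group name, the recorded series name, and the `http_requests_total` metric are assumptions.

```yaml
spec:
  groups:
    - name: <component>-recording             # hypothetical group name
      rules:
        - record: job:http_requests:rate5m    # level:metric:operations naming
          expr: sum by (job) (rate(http_requests_total{job="<component>"}[5m]))
```

Alert expressions can then reference the precomputed `job:http_requests:rate5m` series instead of repeating the `rate()` query.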

ServiceMonitor and PodMonitor

Via Helm Values (Preferred)

serviceMonitor:
  enabled: true
  interval: 30s
  scrapeTimeout: 10s

Manual ServiceMonitor

Place the ServiceMonitor in the `monitoring` namespace and use `namespaceSelector` to reach the target namespace. Required label: `release: kube-prometheus-stack`.

---
# yaml-language-server: $schema=https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/monitoring.coreos.com/servicemonitor_v1.json
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: <component>
  namespace: monitoring
  labels:
    release: kube-prometheus-stack    # REQUIRED
spec:
  namespaceSelector:
    matchNames: [<target-namespace>]
  selector:
    matchLabels:
      app.kubernetes.io/name: <component>
  endpoints:
    - port: http-monitoring
      path: /metrics
      interval: 30s

Manual PodMonitor

Use a PodMonitor when pods expose metrics but have no Service (DaemonSets, sidecars). It follows the same pattern as a ServiceMonitor, with `podMetricsEndpoints` instead of `endpoints` and numeric ports quoted: `port: "15020"`. For `matchExpressions` selecting multiple values, see any existing Flux PodMonitor in config/monitoring/.
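A sketch of that pattern, combining the quoted port with a `matchExpressions` selector, might look like this. The label key and the `<component-a>`/`<component-b>` values are hypothetical; the existing PodMonitors in config/monitoring/ remain the authoritative examples.

```yaml
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: <component>
  namespace: monitoring
  labels:
    release: kube-prometheus-stack    # REQUIRED
spec:
  namespaceSelector:
    matchNames: [<target-namespace>]
  selector:
    matchExpressions:                 # select several workloads with one monitor
      - key: app.kubernetes.io/name
        operator: In
        values: [<component-a>, <component-b>]   # hypothetical label values
  podMetricsEndpoints:
    - port: "15020"                   # numeric port quoted, as noted above
      path: /metrics
      interval: 30s
```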

See [references/alertmanagerconfig-reference.md] for AlertmanagerConfig routing, Silence CR templates, and matcher reference.
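As a rough sketch of what such routing can look like, the AlertmanagerConfig below sends critical and warning alerts to a Discord receiver. The resource name, Secret name, and key are hypothetical, and field names should be checked against the installed CRD version and the reference file before use.

```yaml
---
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: <component>-routing           # hypothetical name
  namespace: <target-namespace>
spec:
  route:
    receiver: discord
    groupBy: [alertname, namespace]
    matchers:
      - name: severity
        matchType: "=~"
        value: critical|warning
  receivers:
    - name: discord
      discordConfigs:
        - apiURL:                     # Secret key holding the webhook URL
            name: <discord-webhook-secret>   # hypothetical Secret name
            key: webhook-url                 # hypothetical key
```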

Canary Health Checks

Canary resources live in config/canary-checker/ (platform) or alongside app config.

HTTP health check:

---
# yaml-language-server: $schema=https://kubernetes-schemas.pages.dev/canaries.flanksource.com/canary_v1.json
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: http-check-<component>
spec:
  schedule: "@every 1m"
  http:
    - name: <component>-health
      url: https://<component>.${internal_domain}/health
      responseCodes: [200]
      maxSSLExpiry: 7
      thresholdMillis: 5000

Kubernetes resource check with CEL (preferred over `ready: true`, which penalizes pods with restart history):

spec:
  interval: 60
  kubernetes:
    - name: <component>-pods-healthy
      kind: Pod
      namespaceSelector:
        name: <namespace>
      resource:
        labelSelector: app.kubernetes.io/name=<component>
      test:
        expr: >
          dyn(results).all(pod,
            pod.Object.status.phase == "Running" &&
            pod.Object.status.conditions.exists(c, c.type == "Ready" && c.status == "True")
          )

A failing canary reports `canary_check == 1`, which triggers the shared CanaryCheckFailure alert (critical, 2m), so no per-canary alert is needed.
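For context, a shared rule of this kind could be sketched as follows. This is a guess at its shape based on the description above, not the repository's actual definition; the annotation and the `name` label on the metric are assumptions.

```yaml
# Fragment of spec.groups[].rules; illustrative only.
- alert: CanaryCheckFailure
  expr: canary_check == 1     # 1 means the most recent check failed
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Canary {{ $labels.name }} is failing"   # assumes a `name` label on the metric
```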

Workflow: Adding Monitoring for a New Component

Check if the chart provides monitoring via Helm values first (kubesearch <chart-name> serviceMonitor) → enable via values if available → else create ServiceMonitor/PodMonitor + PrometheusRule + Canary manually → place in correct directory → register in kustomization → task k8s:validate → verify after deployment:

# Check ServiceMonitor is discovered
kubectl --context <cluster> exec -n monitoring prometheus-kube-prometheus-stack-0 -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/targets' | \
  jq '.data.activeTargets[] | select(.labels.job | contains("<component>"))'

# Check alert rules are loaded
kubectl --context <cluster> exec -n monitoring prometheus-kube-prometheus-stack-0 -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/rules' | \
  jq '.data.groups[] | select(.name | contains("<component>"))'

For PrometheusRule validation before committing, see [scripts/validate-rules.sh].
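The "register in kustomization" step of the workflow amounts to listing the new manifests under `resources`. A minimal sketch, with hypothetical file names (the real layout is described in references/file-placement.md):

```yaml
---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - <component>-servicemonitor.yaml   # hypothetical file names
  - <component>-prometheusrule.yaml
  - <component>-canary.yaml
```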

Common Mistakes

Mistake	Impact	Fix
Missing `release: kube-prometheus-stack` label	Prometheus ignores the resource	Add to metadata.labels
ServiceMonitor selector does not match any service	No metrics scraped, no error	Verify labels with `kubectl get svc -n <ns> --show-labels`
Using `ready: true` in canary Kubernetes checks	False failures for healthy pods with restart history	Use CEL `test.expr`
Hardcoding domains in canary URLs	Breaks across clusters	Use `${internal_domain}`
Very short `for` on flappy metrics	Alert noise	Use 10m+ for error rates and latencies
Creating alerts for non-existent metrics	Alert never fires (expression returns no data)	Verify metrics exist in Prometheus first
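The selector mismatch in the second row is easiest to see next to the Service itself. Here is a sketch of a Service that the manual ServiceMonitor shown earlier would match; the port number is a hypothetical value.

```yaml
---
apiVersion: v1
kind: Service
metadata:
  name: <component>
  namespace: <target-namespace>           # must appear in namespaceSelector.matchNames
  labels:
    app.kubernetes.io/name: <component>   # must match spec.selector.matchLabels
spec:
  selector:
    app.kubernetes.io/name: <component>
  ports:
    - name: http-monitoring               # must match endpoints[].port (a port name, not a number)
      port: 8080                          # hypothetical port number
      targetPort: 8080
```

Note that the ServiceMonitor selects on the Service's own labels, not on the pod labels in `spec.selector`.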

Keywords

PrometheusRule, ServiceMonitor, PodMonitor, ScrapeConfig, AlertmanagerConfig, Silence, silence-operator, canary-checker, Canary, recording rules, alert rules, monitoring, observability, scrape targets, prometheus, alertmanager, discord, heartbeat
