cluster-health-scan

Updated: February 12, 2026 at 15:14

Full multi-cluster health assessment across all 3 Kubernetes clusters. Use when: Running periodic health checks, investigating cross-cluster issues, or someone asks "how are the clusters doing?" Covers nodes, pods, Ceph, Flux, certs, alerts, and resource utilization. Don't use when: Debugging a specific pod failure (use pod-troubleshooting). Don't use for Flux-specific reconciliation errors (use flux-ops). Don't use for Ceph-specific deep dives (use storage-ops). Don't use for a single cluster's issue — this skill scans ALL clusters. Outputs: Structured health report covering all 3 clusters with issues flagged and severity noted.


Source

rajsinghtech

rajsinghtech/openclaw-workspace


SKILL.md


name	Cluster Health Scan
description	Full multi-cluster health assessment across all 3 Kubernetes clusters. Use when: Running periodic health checks, investigating cross-cluster issues, or someone asks "how are the clusters doing?" Covers nodes, pods, Ceph, Flux, certs, alerts, and resource utilization. Don't use when: Debugging a specific pod failure (use pod-troubleshooting). Don't use for Flux-specific reconciliation errors (use flux-ops). Don't use for Ceph-specific deep dives (use storage-ops). Don't use for a single cluster's issue — this skill scans ALL clusters. Outputs: Structured health report covering all 3 clusters with issues flagged and severity noted.
requires	[]

Cluster Health Scan

Routing

Use This Skill When

Running a scheduled or on-demand health check across all clusters
Someone asks "what's the cluster status?" or "any issues?"
Starting a monitoring session and need a baseline
After a maintenance window to verify everything recovered
Generating a health report for the heartbeat

Don't Use This Skill When

Debugging a specific pod (CrashLoopBackOff, etc.) → use pod-troubleshooting
Flux reconciliation failure on a specific kustomization → use flux-ops
Deep-diving into Ceph health (OSD failures, PG repair) → use storage-ops
Only one cluster needs attention → run targeted commands instead of a full scan
Making changes to fix an issue → use pr-workflow for the fix

Comprehensive health check across talos-ottawa, talos-robbinsdale, and talos-stpetersburg.

Cluster Contexts

⚠️ Always pass --context <ctx> explicitly. Never rely on current-context; it may not point at the cluster you expect.

Cluster	Context
Ottawa	`talos-ottawa`
Robbinsdale	`talos-robbinsdale`
StPetersburg	`talos-stpetersburg`
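One way to apply the explicit-context rule to every cluster is a small loop helper. This is a minimal sketch: `for_each_context` is a hypothetical name, not part of the skill, and it assumes the wrapped CLI accepts a trailing `--context` flag (kubectl and flux do; helm uses `--kube-context` instead).

```shell
# Context names are taken from the table above.
CONTEXTS="talos-ottawa talos-robbinsdale talos-stpetersburg"

# for_each_context (hypothetical helper): appends --context <ctx> to the
# given command and keeps going when one cluster fails, so a single outage
# does not hide the state of the other clusters.
for_each_context() {
  local ctx
  for ctx in $CONTEXTS; do
    echo "=== ${ctx} ==="
    "$@" --context "$ctx" || echo "WARN: command failed on ${ctx}" >&2
  done
}
```

Usage: `for_each_context kubectl get nodes -o wide`. Because the flag is appended at the end, this sketch does not suit commands that contain `--` (such as `kubectl exec ... -- ceph status`); those need the context placed before the `--`.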

Procedure

Run for each context in talos-ottawa talos-robbinsdale talos-stpetersburg:

1. Node Health

kubectl --context <ctx> get nodes -o wide
kubectl --context <ctx> top nodes

Verify all nodes are Ready
Check for memory/disk/PID pressure
Flag nodes with high resource utilization (>85%)
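The 85% utilization check can be applied mechanically to the `kubectl top nodes` output. The sketch below assumes the default column order (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%); `flag_hot_nodes` is a hypothetical helper name.

```shell
# flag_hot_nodes (hypothetical helper): reads `kubectl top nodes` output on
# stdin and prints every node whose CPU% or MEMORY% exceeds the threshold.
# The threshold defaults to 85 and can be overridden with the first argument.
flag_hot_nodes() {
  awk -v threshold="${1:-85}" '
    $1 == "NAME" { next }                 # skip the header row
    {
      cpu = $3 + 0; mem = $5 + 0          # "97%" is read as the number 97
      if (cpu > threshold || mem > threshold)
        printf "%s cpu=%s mem=%s\n", $1, $3, $5
    }'
}
```

Usage: `kubectl --context <ctx> top nodes | flag_hot_nodes`. Pressure conditions (memory, disk, PID) are not visible in this output and still need to be read from the node conditions.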

2. Pod Status

kubectl --context <ctx> get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
kubectl --context <ctx> get pods -A | grep -E 'CrashLoop|ImagePull|Error|Pending|Init:'

List all non-healthy pods by namespace
For CrashLoopBackOff: pull last 20 lines of logs
For Pending: check events for scheduling failures
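The triage steps above can be sketched as two small filters. Both function names are hypothetical, and the parsing assumes the default `kubectl get pods -A` columns (NAMESPACE, NAME, READY, STATUS, ...). `followup_for` only prints the suggested command; it does not run it.

```shell
# unhealthy_pods (hypothetical helper): reads `kubectl get pods -A` output on
# stdin and prints "<namespace>/<pod> <status>" for every pod whose STATUS is
# neither Running nor Completed.
unhealthy_pods() {
  awk '$1 != "NAMESPACE" && $4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# followup_for (hypothetical helper): prints the follow-up command for one
# finding. Arguments: <context> <namespace>/<pod> <status>.
followup_for() {
  local ctx="$1" ns="${2%%/*}" pod="${2#*/}"
  case "$3" in
    CrashLoopBackOff)
      echo "kubectl --context $ctx logs -n $ns $pod --tail=20" ;;
    Pending)
      echo "kubectl --context $ctx get events -n $ns --field-selector involvedObject.name=$pod" ;;
  esac
}
```

Usage: `kubectl --context <ctx> get pods -A | unhealthy_pods`. Note that a pod in Running state with unready containers (for example 0/1) passes this filter, so the READY column is still worth a glance.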

3. Resource Utilization

kubectl --context <ctx> top pods -A --sort-by=memory | head -20
kubectl --context <ctx> top pods -A --sort-by=cpu | head -20

Flag pods using >80% of their memory limit
Flag namespaces with no resource limits set
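`kubectl top pods` reports usage only, so the 80% check needs the usage compared against the pod's memory limit. A minimal sketch of that arithmetic follows; both helper names are hypothetical, and only plain bytes and the binary suffixes Ki, Mi, and Gi are handled (any other suffix is treated as 0, which means "never flagged").

```shell
# to_mib (hypothetical helper): converts a Kubernetes memory quantity to
# whole MiB. Supports plain bytes and the Ki/Mi/Gi suffixes only.
to_mib() {
  awk -v q="$1" 'BEGIN {
    n = q + 0                                   # numeric prefix of the quantity
    if (q ~ /Gi$/)            n *= 1024
    else if (q ~ /Ki$/)       n /= 1024
    else if (q ~ /^[0-9]+$/)  n /= 1048576      # plain bytes
    else if (q !~ /Mi$/)      n = 0             # unsupported suffix
    printf "%d\n", n
  }'
}

# mem_over_threshold (hypothetical helper): succeeds when <used> is more than
# <pct> percent of <limit>. Arguments: <used> <limit> [pct, default 80].
mem_over_threshold() {
  local used limit pct="${3:-80}"
  used="$(to_mib "$1")"
  limit="$(to_mib "$2")"
  [ "$limit" -gt 0 ] && [ $(( used * 100 / limit )) -gt "$pct" ]
}
```

Usage: `mem_over_threshold 900Mi 1Gi && echo "flag this pod"`, with the first value taken from `kubectl top pods` and the second from the container's `resources.limits.memory`.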

4. Flux GitOps

flux --context <ctx> get kustomizations -A
flux --context <ctx> get helmreleases -A
flux --context <ctx> get sources git -A
flux --context <ctx> get sources helm -A

Verify all kustomizations and HelmReleases are Ready
Check source freshness — stale fetches indicate connectivity issues
Report any suspended resources

5. Helm Releases

helm --kube-context <ctx> list -A --failed --pending

List any releases in failed or pending (pending-install, pending-upgrade, pending-rollback) state
For failed releases: helm --kube-context <ctx> status <release> -n <ns> for details

6. PVC Health

kubectl --context <ctx> get pvc -A

Flag any unbound or lost PVCs
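Check for PVCs near capacity (kubectl get pvc shows only the provisioned size; actual usage comes from kubelet volume metrics such as kubelet_volume_stats_used_bytes)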
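Unbound or lost claims can be picked out of the listing with a one-line filter. This is a sketch under the assumption that the default columns are in use (NAMESPACE, NAME, STATUS, ...); `unbound_pvcs` is a hypothetical name.

```shell
# unbound_pvcs (hypothetical helper): reads `kubectl get pvc -A` output on
# stdin and prints "<namespace>/<claim> <status>" for every claim whose
# STATUS is anything other than Bound (typically Pending or Lost).
unbound_pvcs() {
  awk '$1 != "NAMESPACE" && $3 != "Bound" { print $1 "/" $2, $3 }'
}
```

Usage: `kubectl --context <ctx> get pvc -A | unbound_pvcs`. Empty output means every claim is bound.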

7. Storage (cluster-specific)

Ottawa + Robbinsdale (Rook-Ceph):

kubectl --context <ctx> exec -n rook-ceph deploy/rook-ceph-tools -- ceph status
kubectl --context <ctx> exec -n rook-ceph deploy/rook-ceph-tools -- ceph osd status
kubectl --context <ctx> exec -n rook-ceph deploy/rook-ceph-tools -- ceph df

Verify HEALTH_OK
Check OSD status (all up+in)
Check pool usage (<80%)
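For the report, the Ceph health token can be reduced to a severity label. The mapping below (ok / warning / critical) is an assumption made for illustration, not something the skill defines, and `ceph_health_severity` is a hypothetical name.

```shell
# ceph_health_severity (hypothetical helper): reads `ceph status` output on
# stdin, takes the first HEALTH_* token, and prints an assumed severity label.
ceph_health_severity() {
  case "$(grep -o 'HEALTH_[A-Z]*' | head -n 1)" in
    HEALTH_OK)   echo "ok" ;;
    HEALTH_WARN) echo "warning" ;;
    HEALTH_ERR)  echo "critical" ;;
    *)           echo "unknown" ;;   # no token found, e.g. tools pod missing
  esac
}
```

Usage: `kubectl --context <ctx> exec -n rook-ceph deploy/rook-ceph-tools -- ceph status | ceph_health_severity`.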

StPetersburg (GPU):

kubectl --context talos-stpetersburg get pods -n gpu-operator
kubectl --context talos-stpetersburg exec -n gpu-operator <device-plugin-pod> -- nvidia-smi

Verify GPU operator pods are running
Check GPU utilization and memory

8. Certificates

kubectl --context <ctx> get certificates -A
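kubectl --context <ctx> get certificaterequests -A -o json | jq -r '.items[] | select(any(.status.conditions[]?; .type=="Ready" and .status=="True") | not) | "\(.metadata.namespace)/\(.metadata.name)"'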

Flag certificates expiring within 7 days
Report any failed certificate requests
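The 7-day expiry check comes down to comparing each certificate's `status.notAfter` timestamp with the current time. A minimal sketch, assuming GNU `date` (for `date -d`); `expires_within` is a hypothetical helper name.

```shell
# expires_within (hypothetical helper): succeeds when the RFC 3339 timestamp
# in $1 lies less than $2 days after "now". $3 optionally overrides "now"
# (epoch seconds), which is mainly useful for testing.
# Assumes GNU date; BSD/macOS date does not support -d in this form.
expires_within() {
  local not_after now
  not_after="$(date -u -d "$1" +%s)" || return 2
  now="${3:-$(date -u +%s)}"
  [ $(( not_after - now )) -lt $(( $2 * 86400 )) ]
}
```

The timestamps can be listed with `kubectl --context <ctx> get certificates -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name} {.status.notAfter}{"\n"}{end}'` and each one passed to `expires_within <timestamp> 7`. Already-expired certificates also match, since their remaining time is negative.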

9. Firing Alerts

kubectl --context <ctx> exec -n monitoring deploy/kube-prometheus-stack-prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/alerts' | jq '.data.alerts[] | select(.state=="firing") | {alert: .labels.alertname, ns: .labels.namespace, severity: .labels.severity}'

Report all firing alerts (skip Watchdog)
Group by severity
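Grouping can be done on a flattened "alertname severity" list. The sketch below assumes one alert per line in that two-column form; `count_by_severity` is a hypothetical name.

```shell
# count_by_severity (hypothetical helper): reads "<alertname> <severity>"
# lines on stdin, drops the always-firing Watchdog alert, and prints one
# "<severity> <count>" line per severity, sorted by severity name.
count_by_severity() {
  awk '$1 != "Watchdog" { n[$2]++ } END { for (s in n) print s, n[s] }' | sort
}
```

To produce the input, the jq filter from the command above can end in `"\(.labels.alertname) \(.labels.severity // "none")"` with `jq -r` instead of building an object.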

Output Template

=== CLUSTER HEALTH REPORT ===
Timestamp: <ISO timestamp>

[ottawa] Nodes: 3/3 Ready | Pods: 2 unhealthy | Ceph: HEALTH_OK | Flux: OK | Alerts: 0 firing
  - [ottawa] pod kube-system/coredns-abc123: CrashLoopBackOff (OOMKilled)
  - [ottawa] pod media/sonarr-xyz: Pending (Insufficient memory)

[robbinsdale] Nodes: 5/5 Ready | Pods: 0 unhealthy | Ceph: HEALTH_OK | Flux: OK | Alerts: 0 firing

[stpetersburg] Nodes: 1/1 Ready | Pods: 0 unhealthy | GPU: OK | Flux: OK | Alerts: 1 firing
  - [stpetersburg] alert: KubeMemoryOvercommit (warning)

Overall: 3 issues found across 3 clusters

Compaction Notes

Health scans produce large output. For long monitoring sessions:

Run mkdir -p /tmp/outputs before writing any artifacts
Write each cluster's results to /tmp/outputs/health-<cluster>-<date>.md
Summarize findings in the report template above
Only keep actionable items in context — healthy clusters need one line, not full output
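The file-naming steps above can be sketched as a single helper. The date format (`YYYY-MM-DD`) is an assumption, since the skill only says `<date>`; `report_path` and the `OUTPUT_DIR` override are hypothetical.

```shell
# Output directory from the notes above; overridable for testing.
OUTPUT_DIR="${OUTPUT_DIR:-/tmp/outputs}"

# report_path (hypothetical helper): ensures the output directory exists and
# prints the per-cluster report path, e.g. /tmp/outputs/health-ottawa-<date>.md.
report_path() {
  mkdir -p "$OUTPUT_DIR" || return 1
  echo "${OUTPUT_DIR}/health-${1}-$(date +%F).md"
}
```

Usage: `kubectl --context talos-ottawa get nodes -o wide > "$(report_path ottawa)"`, then carry only the one-line summary forward in context.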

Edge Cases

Metrics server down: kubectl top will fail — note this as an issue, don't block the scan
Ceph tools pod missing: Can't run ceph commands — report this as a finding
Prometheus not reachable: Alert check will fail — note and continue with other checks
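All three edge cases follow the same pattern: a failed check becomes a finding and the scan continues. A minimal sketch of that pattern, with `run_check` as a hypothetical wrapper name and the "FINDING:" prefix as an assumed convention:

```shell
# run_check (hypothetical helper): runs one check command. On failure it
# prints a finding line with the exit code instead of aborting, and always
# returns 0 so the remaining checks still run.
# Arguments: <label> <command> [args...]
run_check() {
  local label="$1" rc=0
  shift
  "$@" || rc=$?
  if [ "$rc" -ne 0 ]; then
    echo "FINDING: ${label} check failed (exit ${rc}); continuing"
  fi
  return 0
}
```

Usage: `run_check "metrics-server" kubectl --context <ctx> top nodes`. If the metrics server is down, the scan records the finding and moves on to the next step.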

More from this repository

code-review

rajsinghtech/openclaw-workspace

Structured PR review — security scan, correctness, consistency, style. Covers diff analysis, comment posting via gh, and priority-based finding reports. Use when: A PR needs review, someone asks for code feedback, or changes need security/correctness validation before merge. Also use for pre-commit review of your own changes. Don't use when: The issue is a runtime pod failure (use pod-troubleshooting), a Flux reconciliation error (use flux-debugging), or a CI build failure (use ci-diagnosis). Don't use for architecture-level design discussions (use architecture-design instead). Outputs: Review comment posted on the PR via `gh pr review`, or a structured findings report grouped by severity (Critical/High/Medium/Low).

2026-02-20

openspec-workflow

rajsinghtech/openclaw-workspace

Spec-driven development workflow — proposals, requirements, design docs, task breakdowns, and implementation using the OpenSpec framework. Use when: Starting a new feature or change that needs planning, someone says "I want to build X", creating proposals or specs, breaking down requirements into tasks, or transitioning from planning to implementation. Don't use when: Debugging or troubleshooting (use appropriate troubleshooting skill). Don't use for Kubernetes manifest changes (use pr-workflow). Don't use for reviewing existing code (use code-review). Outputs: OpenSpec change folder with proposal.md, specs/, design.md, and tasks.md. Implementation follows directly from tasks.md.

2026-02-20

session-review

rajsinghtech/openclaw-workspace

Analyze agent sessions for tool failures, retry patterns, knowledge gaps, context limits, and config drift. Use when: Running periodic session reviews (cron), investigating agent reliability issues, looking for recurring failure patterns, or identifying workspace improvements from real usage. This is the primary skill for Robert's review cron job. Don't use when: You're making changes to fix issues (use workspace-improvement for that). Don't use for live debugging of a current issue (use the appropriate troubleshooting skill). Don't use for code review of PRs (use code-review). Outputs: Session analysis report with categorized findings (tool failures, retries, knowledge gaps, config drift), severity ratings, and proposed fixes. Written to /tmp/outputs/session-review.md for handoff.

2026-02-20

cluster-context

rajsinghtech/openclaw-workspace

OpenClaw pod architecture, volumes, networking, secrets, and provider configuration reference. Use when: Debugging container, mount, networking, or credential issues. Also use when you need to understand pod structure, check which providers are configured, verify volume mounts, or inspect secrets configuration. Don't use when: Debugging pod crashes (use pod-troubleshooting). Don't use for Flux issues (use flux-debugging). Don't use for deploying changes (use gitops-deploy). This is a reference skill, not a diagnostic workflow. Outputs: Architecture reference information. No artifacts — this skill provides context for other skills to use.

2026-02-20

gitops-deploy

rajsinghtech/openclaw-workspace

End-to-end deployment workflow — commit, CI, Flux reconcile, pod restart, verify. Includes ConfigMap changes, Flux postBuild escaping, and SOPS secret management. Use when: You need to deploy changes to the OpenClaw pod — config updates, workspace changes, image rebuilds, or secret rotations. Also use when someone asks "how do I deploy this?" or "push this change live." Don't use when: You're debugging why a deployment failed (use flux-debugging or pod-troubleshooting). Don't use for changes to kubernetes-manifests repo (Dyson's pr-workflow handles that). Don't use for registry/image inspection (use zot-registry). Outputs: Deployed changes verified in the running pod. Confirmation includes CI status, Flux reconciliation state, pod status, and startup logs.

2026-02-20

openclaw-docs-lookup-morty

rajsinghtech/openclaw-workspace

Look up OpenClaw documentation via web_fetch for config validation and verification. Use when: You need to verify a config key, understand OpenClaw configuration options, or check documentation for Kubernetes-specific settings before making changes. Don't use when: The answer is already in CONFIG.md, AGENTS.md, TOOLS.md in your workspace.

2026-02-20
