storage-operations

Last updated: 12 February 2026 at 15:14

Rook-Ceph diagnostics (Ottawa + Robbinsdale) and storage troubleshooting for all clusters including local-path on StPetersburg. Use when: Ceph health is not OK, PVCs are stuck in Pending, OSD is down, pools are near full, or volume attachment errors occur. Also use for routine storage capacity monitoring. Don't use when: The issue is a pod crash unrelated to storage (use pod-troubleshooting). Don't use for Flux reconciliation errors (use flux-ops). Don't use for image pull failures (use zot-registry). Don't use for general cluster health (use cluster-health — it includes a storage summary). Outputs: Storage health diagnosis with specific Ceph status, OSD state, pool capacity, and PVC status. Remediation steps for identified issues.

Source

rajsinghtech

rajsinghtech/openclaw-workspace

SKILL.md

name	Storage Operations
description	Rook-Ceph diagnostics (Ottawa + Robbinsdale) and storage troubleshooting for all clusters including local-path on StPetersburg. Use when: Ceph health is not OK, PVCs are stuck in Pending, OSD is down, pools are near full, or volume attachment errors occur. Also use for routine storage capacity monitoring. Don't use when: The issue is a pod crash unrelated to storage (use pod-troubleshooting). Don't use for Flux reconciliation errors (use flux-ops). Don't use for image pull failures (use zot-registry). Don't use for general cluster health (use cluster-health — it includes a storage summary). Outputs: Storage health diagnosis with specific Ceph status, OSD state, pool capacity, and PVC status. Remediation steps for identified issues.
requires	[]

Storage Operations

Routing

Use This Skill When

Ceph status shows HEALTH_WARN or HEALTH_ERR
PVCs stuck in Pending or Lost state
OSD is down or has been marked out
Pool usage is approaching capacity (>80%)
Volume attachment errors (FailedAttachVolume, FailedMount, multi-attach)
Pods stuck because of storage issues
Routine storage capacity check

Don't Use This Skill When

Pod is crashing for non-storage reasons → use pod-troubleshooting
Flux can't reconcile → use flux-ops
Image pull issues → use zot-registry
You want a full health scan including storage → use cluster-health (it covers storage at a high level)
The issue is with the registry, not cluster storage → use zot-registry

Storage diagnostics for all 3 clusters. Ottawa and Robbinsdale run Rook-Ceph; StPetersburg uses local-path-provisioner.

Cluster Contexts

⚠️ Always use --context <ctx> — never rely on current-context.

Cluster	Context	Storage
Ottawa	`talos-ottawa`	Rook-Ceph
Robbinsdale	`talos-robbinsdale`	Rook-Ceph
StPetersburg	`talos-stpetersburg`	local-path-provisioner
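
The context rule above can be enforced with a small wrapper. This is a minimal sketch: `ctx_for` and `kc` are hypothetical helper names that are not part of the skill, and the cluster-to-context mapping is copied from the table.

```shell
# Hypothetical helpers (not part of the original skill).

# Map a cluster name from the table above to its kubectl context.
ctx_for() {
  case "$1" in
    Ottawa)       echo "talos-ottawa" ;;
    Robbinsdale)  echo "talos-robbinsdale" ;;
    StPetersburg) echo "talos-stpetersburg" ;;
    *) echo "unknown cluster: $1" >&2; return 1 ;;
  esac
}

# Run kubectl with an explicit --context so current-context is never used.
# Example: kc Ottawa get pvc -A
kc() {
  local ctx
  ctx="$(ctx_for "$1")" || return 1
  shift
  kubectl --context "$ctx" "$@"
}
```

An unknown cluster name makes `kc` fail before kubectl is invoked, so a typo cannot fall through to whatever context happens to be current.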

Rook-Ceph (Ottawa + Robbinsdale)

Cluster Health

kubectl --context <ctx> exec -n rook-ceph deploy/rook-ceph-tools -- ceph status

HEALTH_OK — all good
HEALTH_WARN — degraded but functional, investigate
HEALTH_ERR — data at risk, report immediately
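
The three health states above map naturally onto a small triage function. The sketch below is illustrative only; `health_action` and its return strings are hypothetical names, and it assumes the caller passes in text containing one of the HEALTH_* tokens (for example the output of `ceph health`).

```shell
# Hypothetical helper: map a Ceph health string to the action listed above.
# Matches on the HEALTH_* token contained in its first argument.
health_action() {
  case "$1" in
    *HEALTH_OK*)   echo "ok" ;;
    *HEALTH_WARN*) echo "investigate" ;;
    *HEALTH_ERR*)  echo "report-immediately" ;;
    *)             echo "unknown"; return 1 ;;
  esac
}
```

It could be fed with something like `health_action "$(kubectl --context <ctx> exec -n rook-ceph deploy/rook-ceph-tools -- ceph health)"`.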

OSD Health

kubectl --context <ctx> exec -n rook-ceph deploy/rook-ceph-tools -- ceph osd status
kubectl --context <ctx> exec -n rook-ceph deploy/rook-ceph-tools -- ceph osd tree
kubectl --context <ctx> exec -n rook-ceph deploy/rook-ceph-tools -- ceph osd df

Verify all OSDs are up and in
Check for uneven data distribution (variance >10%)
Flag OSDs >85% full
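
The 85% check can be scripted once the utilisation figures have been extracted. The column layout of `ceph osd df` differs between Ceph releases, so this sketch deliberately does not parse the raw output; it assumes a pre-extracted list of `<osd-name> <percent-used>` pairs, and `flag_full_osds` is a hypothetical name.

```shell
# Hypothetical helper: print OSDs whose utilisation exceeds a threshold.
# stdin: one "<osd-name> <percent-used>" pair per line (an assumed,
#        pre-extracted format, not raw `ceph osd df` output).
# $1:    threshold in percent; defaults to 85 as in the checklist above.
flag_full_osds() {
  awk -v limit="${1:-85}" '$2 + 0 > limit + 0 { print $1 }'
}
```

Passing a different threshold reuses the same filter for other limits, for example `flag_full_osds 80`.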

Placement Group Status

kubectl --context <ctx> exec -n rook-ceph deploy/rook-ceph-tools -- ceph pg stat
kubectl --context <ctx> exec -n rook-ceph deploy/rook-ceph-tools -- ceph pg dump_stuck

All PGs should be active+clean
degraded, undersized, stale, incomplete PGs need investigation
Stuck PGs: check if an OSD is down or a node is unreachable
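
A PG state is a set of flags joined by `+`, so checking for the problem states listed above is a simple pattern match. A minimal sketch, assuming the state string has already been extracted; `pg_needs_investigation` is a hypothetical name and it covers only the four states named in this section.

```shell
# Hypothetical helper: decide whether a PG state string needs attention.
# A state such as "active+undersized+degraded" is a "+"-joined set of flags.
# Returns 0 if any of the four flags listed above is present, 1 otherwise.
pg_needs_investigation() {
  case "+$1+" in
    *+degraded+*|*+undersized+*|*+stale+*|*+incomplete+*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Wrapping the state in `+` on both sides lets the same pattern match a flag at the start, middle, or end of the string.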

Pool Usage

kubectl --context <ctx> exec -n rook-ceph deploy/rook-ceph-tools -- ceph df
kubectl --context <ctx> exec -n rook-ceph deploy/rook-ceph-tools -- ceph osd pool ls detail
kubectl --context <ctx> exec -n rook-ceph deploy/rook-ceph-tools -- rados df

Check per-pool usage
Flag pools >80% capacity
Note replication factor (should be 3 for data pools)

Ceph Operator

kubectl --context <ctx> get pods -n rook-ceph -l app=rook-ceph-operator
kubectl --context <ctx> logs -n rook-ceph -l app=rook-ceph-operator --tail=30

Verify operator is running
Check for reconciliation errors

PVC Troubleshooting (All Clusters)

Unbound PVCs

kubectl --context <ctx> get pvc -A | grep -v Bound

Pending PVC: check events with kubectl --context <ctx> describe pvc <name> -n <ns>
Common causes: no available PV, storageClass misconfigured, Ceph pool full
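
Filtering on the STATUS column rather than on the whole line avoids dropping a PVC whose namespace or name happens to contain the word "Bound". A minimal sketch, assuming input from `kubectl get pvc -A --no-headers`; `unbound_pvcs` is a hypothetical name.

```shell
# Hypothetical helper: keep only PVC rows whose STATUS column is not "Bound".
# stdin: output of `kubectl get pvc -A --no-headers`, where the first three
#        columns are NAMESPACE NAME STATUS.
# Prints "<namespace>/<name> <status>" for each match.
unbound_pvcs() {
  awk '$3 != "Bound" { print $1 "/" $2 " " $3 }'
}
```

A possible invocation is `kubectl --context <ctx> get pvc -A --no-headers | unbound_pvcs`.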

PVC Capacity

kubectl --context <ctx> get pvc -A -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,STATUS:.status.phase,CAPACITY:.status.capacity.storage,CLASS:.spec.storageClassName'

Volume Attachment Issues

kubectl --context <ctx> get volumeattachments
kubectl --context <ctx> get events -A --field-selector reason=FailedAttachVolume
kubectl --context <ctx> get events -A --field-selector reason=FailedMount

Multi-attach errors: RWO volume still attached to old node after reschedule
Common fix: delete the stale VolumeAttachment (but verify pod is actually gone first)
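
Before deleting anything, it helps to see every attachment that references the affected PV. The sketch below assumes the default column order of `kubectl get volumeattachments --no-headers` (NAME ATTACHER PV NODE ATTACHED AGE); `attachments_for_pv` is a hypothetical name.

```shell
# Hypothetical helper: list VolumeAttachments that reference one PV.
# stdin: output of `kubectl get volumeattachments --no-headers`
#        (assumed columns: NAME ATTACHER PV NODE ATTACHED AGE).
# $1:    PV name.
# Prints "<attachment-name> <node>" per match.
attachments_for_pv() {
  awk -v pv="$1" '$3 == pv { print $1 " " $4 }'
}
```

More than one line of output for an RWO volume points at a candidate stale attachment; confirming that the old pod is gone remains a manual step.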

Local-Path (StPetersburg)

# Check provisioner
kubectl --context talos-stpetersburg get pods -n local-path-storage

# List PVCs
kubectl --context talos-stpetersburg get pvc -A

# Check local-path config
kubectl --context talos-stpetersburg get configmap -n local-path-storage local-path-config -o yaml

local-path provisions on the node where the pod runs
No replication — if the node dies, data is lost
Mostly used for AI model caches and ephemeral workloads

Common Issues

Symptom	Likely Cause	Action
HEALTH_WARN: 1 OSD down	Node offline or OSD crashed	Check node status, OSD pod logs
PG degraded	OSD down, rebalancing	Wait if OSD is recovering; escalate if OSD stays down
Pool nearfull	Storage capacity	Report — needs OSD expansion or data cleanup
PVC Pending	StorageClass mismatch or pool full	Check storageClass exists and pool has capacity
FailedMount	Stale VolumeAttachment	Verify old pod is gone, then report
local-path Pending	Node selector or path issue	Check provisioner logs

Edge Cases

HEALTH_WARN after node restart: Usually transient (PG rebalancing). Wait 5 minutes before escalating.
OSD marked out but node is fine: OSD process crashed — check OSD pod logs, may need restart
PVC bound but pod can't mount: The pod was scheduled on a different node from the one where the PV lives (RWO constraint). Check node affinity.

Artifact Handoff

For complex storage investigations:

Run mkdir -p /tmp/outputs before writing any artifacts.
Write findings to /tmp/outputs/storage-diagnosis.md including Ceph status, OSD state, and pool usage snapshots.
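
The two handoff steps can be combined into one function. This is a sketch under stated assumptions: `write_storage_report` is a hypothetical name, and the section headings are an assumed layout based on the three snapshots named above, not a format the skill mandates.

```shell
# Hypothetical helper: create the output directory and write a report
# skeleton for the findings.
# $1: output directory; defaults to /tmp/outputs as in the steps above.
# Prints the path of the file it wrote.
write_storage_report() {
  local dir
  dir="${1:-/tmp/outputs}"
  mkdir -p "$dir" || return 1
  {
    echo "# Storage diagnosis"
    echo
    echo "## Ceph status"
    echo
    echo "## OSD state"
    echo
    echo "## Pool usage"
  } > "$dir/storage-diagnosis.md"
  echo "$dir/storage-diagnosis.md"
}
```

Command output can then be appended under the matching heading as the investigation proceeds.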
