check-ceph-health
| name | check-ceph-health |
| description | Check Ceph storage health on OpenShift OCS/ODF clusters. Use when PVCs are stuck in Pending, storage provisioning fails, Ceph is degraded, OSDs are full, or cluster storage needs diagnosis. |
Use this guide to diagnose and remediate Ceph storage issues on OpenShift clusters running OCS/ODF (OpenShift Data Foundation).
This skill requires:
oc metrics (kubectl-metrics) -- for Ceph metrics (health, capacity, OSD, PG)
oc debug-queries (kubectl-debug-queries) -- for listing resources, logs, events
If any tool is missing, install with:
curl -sSL https://raw.githubusercontent.com/yaacov/kubectl-metrics/main/install.sh | bash
curl -sSL https://raw.githubusercontent.com/yaacov/kubectl-debug-queries/main/install.sh | bash
--query for Filtering
Use --query to filter, sort, and project results server-side. Pipe output to jq, grep, or other post-processing tools only when --query cannot express what you need.
The --query flag accepts TSL (Tree Search Language) with four optional clauses:
[select <field>, ...] [where <condition>] [order by <field> [asc|desc]] [limit N]
Note: select only affects table output (the default). With --output json, all fields are always returned regardless of select.
Operators: =, !=, <, >, <=, >=, like (% wildcard), ilike (case-insensitive), ~= (regex), and, or, not, in [...], between X and Y.
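The clause order above can be sketched with a small helper that assembles a query string from whichever clauses are supplied. `build_tsl_query` is a hypothetical function written for illustration, not part of `oc debug-queries`, and the field names `name` and `status.phase` are assumptions for pods that should be confirmed with `--output json` first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compose a TSL query from the four optional clauses,
# always emitting them in grammar order (select, where, order by, limit).
build_tsl_query() {
  local select="$1" where="$2" order="$3" limit="$4"
  local q=""
  [ -n "$select" ] && q="select $select"
  [ -n "$where" ]  && q="${q:+$q }where $where"
  [ -n "$order" ]  && q="${q:+$q }order by $order"
  [ -n "$limit" ]  && q="${q:+$q }limit $limit"
  printf '%s\n' "$q"
}

# Assumed field names; verify them against real JSON output before use.
QUERY=$(build_tsl_query "name, status.phase" "status.phase != 'Running'" "name asc" "10")
echo "$QUERY"

# The resulting string is passed to the CLI like this:
#   oc debug-queries list --resource pods --namespace openshift-storage --query "$QUERY"
```

Empty arguments skip a clause, so `build_tsl_query "" "status.phase = 'Pending'" "" ""` yields only the `where` clause.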
Before writing queries, discover actual field names with --output json:
oc debug-queries list --resource pods --namespace openshift-storage --limit 2 --output json
oc metrics query --query "ceph_health_status"
Health values: 0=OK, 1=WARN, 2=ERR.
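For scripted checks, the numeric sample can be mapped to a state name. The sketch below assumes you have already extracted the bare number from the `oc metrics query` output, whose exact format is not covered here. `ceph_health_label` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a ceph_health_status sample (0, 1, or 2)
# to the matching Ceph health state name.
ceph_health_label() {
  case "$1" in
    0) echo "HEALTH_OK" ;;
    1) echo "HEALTH_WARN" ;;
    2) echo "HEALTH_ERR" ;;
    *) echo "UNKNOWN"; return 1 ;;   # anything else is not a valid sample
  esac
}

ceph_health_label 1   # prints HEALTH_WARN
```

The non-zero return for unexpected input lets a calling script fail fast instead of treating a parse error as a healthy cluster.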
oc metrics query --query "ceph_cluster_total_bytes"
oc metrics query --query "ceph_cluster_total_used_bytes"
oc metrics query --query "ceph_cluster_total_used_bytes / ceph_cluster_total_bytes * 100"
oc debug-queries get --resource cephcluster --namespace openshift-storage --output json
Health states:
HEALTH_OK -- cluster is healthy
HEALTH_WARN -- degraded but functional (backfillfull, nearfull, degraded PGs)
HEALTH_ERR -- critical, writes may be blocked (full OSDs, too few OSDs, down PGs)
oc metrics query --query "ceph_osd_stat_bytes"
oc metrics query --query "ceph_osd_stat_bytes_used"
oc metrics query --query "rate(ceph_osd_op_latency_sum[5m]) / rate(ceph_osd_op_latency_count[5m])"
oc debug-queries list --resource pods --namespace openshift-storage --selector "app=rook-ceph-osd"
oc debug-queries list --resource pods --namespace openshift-storage --query "where name ~= '.*osd-prepare.*'"
oc debug-queries list --resource pvc --namespace openshift-storage --selector "app=rook-ceph-osd"
oc metrics query --query "ceph_pg_total"
oc metrics query --query "ceph_pg_active"
oc metrics query --query "ceph_pg_degraded"
oc metrics query --query "ceph_pool_percent_used * 100"
oc metrics query --query "rate(ceph_pool_rd[5m])"
oc metrics query --query "rate(ceph_pool_wr[5m])"
oc metrics query --query "ceph_pool_stored"
oc metrics query --query "ceph_pool_max_avail"
PVC provisioning is handled by CSI driver pods. If these are unhealthy, no volumes can be created.
oc debug-queries list --resource pods --namespace openshift-storage --query "where name ~= '.*rbd.*ctrlplugin.*'"
oc debug-queries list --resource pods --namespace openshift-storage --query "where name ~= '.*cephfs.*ctrlplugin.*'"
oc debug-queries list --resource pods --namespace openshift-storage --query "where name ~= '.*rbd.*nodeplugin.*'"
Check CSI provisioner logs:
oc debug-queries logs --name <rbd-ctrlplugin-pod> --namespace openshift-storage --container csi-rbdplugin --tail 50
oc debug-queries list --resource pvc --all-namespaces --query "where status.phase = 'Pending'"
oc debug-queries get --resource pvc --name <pvc-name> --namespace <namespace>
oc debug-queries list --resource pv --all-namespaces --query "where status.phase = 'Released'"
oc debug-queries list --resource storageclass --all-namespaces
Symptoms: PVCs stuck in Pending, provisioning errors with DeadlineExceeded or operation already exists.
Diagnosis:
oc metrics query --query "ceph_health_status"
oc metrics query --query "ceph_osd_stat_bytes_used / ceph_osd_stat_bytes * 100"
oc debug-queries get --resource cephcluster --namespace openshift-storage --output json
Look for OSD_FULL and POOL_FULL messages in the CephCluster status.
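One way to scan the JSON for those two codes without depending on its exact structure is a plain text match. This is a minimal sketch: `find_full_codes` is a hypothetical helper, and the `sample` string is an invented fragment for illustration, not real cluster output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the distinct full-capacity health codes
# (OSD_FULL, POOL_FULL) found anywhere in the text on stdin.
find_full_codes() {
  # "|| true" keeps the pipeline successful when nothing matches.
  { grep -oE 'OSD_FULL|POOL_FULL' || true; } | sort -u
}

# Invented sample fragment, for illustration only.
sample='{"status":{"ceph":{"health":"HEALTH_ERR","details":{"OSD_FULL":{"message":"1 full osd(s)","severity":"HEALTH_ERR"}}}}}'
printf '%s\n' "$sample" | find_full_codes   # prints OSD_FULL
```

Against a live cluster the same function can read the command output directly: `oc debug-queries get --resource cephcluster --namespace openshift-storage --output json | find_full_codes`. Empty output means neither code is present.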
Remediation: See the shell-access remediation commands near the end of this guide (oc delete pv and ceph osd set-full-ratio).
Symptoms: Cluster functional but approaching full. Warnings about nearfull or backfillfull OSDs.
Remediation:
Symptoms: HEALTH_WARN with messages about degraded or undersized placement groups.
Diagnosis:
oc metrics query --query "ceph_pg_degraded"
oc metrics query --query "ceph_pg_total - ceph_pg_active"
oc debug-queries events --namespace openshift-storage --query "where type = 'Warning'"
Remediation:
Symptoms: PVC events say "waiting for external provisioner" but no ProvisioningFailed errors.
Diagnosis:
oc debug-queries list --resource pods --namespace openshift-storage --query "where name ~= '.*ctrlplugin.*'"
oc debug-queries logs --name <rbd-ctrlplugin-pod> --namespace openshift-storage --container csi-rbdplugin --tail 100
Remediation:
Symptoms: POOL_FULL warning but individual OSDs have space.
Diagnosis:
oc metrics query --query "ceph_pool_percent_used * 100"
oc metrics query --query "ceph_osd_stat_bytes_used / ceph_osd_stat_bytes * 100"
Remediation:
oc debug-queries list --resource pods --namespace openshift-storage --query "where name ~= '.*ocs-operator.*|.*odf-operator.*|.*rook-ceph-operator.*'"
oc debug-queries logs --name deployment/rook-ceph-operator --namespace openshift-storage --tail 50
Check for high restart counts:
oc metrics query --query "topk(10, sort_desc(kube_pod_container_status_restarts_total))"
Run these periodically to avoid surprise outages:
oc metrics query --query "ceph_cluster_total_used_bytes / ceph_cluster_total_bytes * 100"
oc debug-queries list --resource pv --all-namespaces --query "where status.phase = 'Released'"
oc debug-queries list --resource pvc --all-namespaces --query "where status.phase = 'Pending'"
Act when usage exceeds 70% -- start cleaning up or expanding capacity before hitting the 85% full threshold.
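The two thresholds can be turned into a simple check for a periodic job. The sketch below assumes the used and total byte counts have already been read from `ceph_cluster_total_used_bytes` and `ceph_cluster_total_bytes`. The function names and the `ok`/`act`/`full` labels are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage of capacity used, to one decimal place.
# Fails (non-zero exit) when the total is missing or not positive.
usage_percent() {
  awk -v used="$1" -v total="$2" 'BEGIN {
    if (total <= 0) exit 1
    printf "%.1f\n", used / total * 100
  }'
}

# Hypothetical helper: classify a percentage against this guide's thresholds:
# above 70% means act now, 85% or more means the full threshold is reached.
classify_usage() {
  awk -v pct="$1" 'BEGIN {
    if (pct >= 85)     print "full"
    else if (pct > 70) print "act"
    else               print "ok"
  }'
}

pct=$(usage_percent 750 1000)
echo "$pct% used: $(classify_usage "$pct")"   # prints 75.0% used: act
```

`awk` handles the arithmetic because byte counts on a multi-terabyte cluster can exceed what shell integer division expresses comfortably, and it gives fractional percentages.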
These remediation operations require shell access:
oc get pv --field-selector status.phase=Released
oc delete pv <released-pv-names>
# Pick the first mon pod and read its monitor address
MON_POD=$(oc -n openshift-storage get pods -l app=rook-ceph-mon -o jsonpath='{.items[0].metadata.name}')
MON_ADDR=$(oc -n openshift-storage get pod "$MON_POD" -o jsonpath='{.spec.containers[0].env[?(@.name=="ROOK_CEPH_MON_HOST")].value}' | sed 's/\[//;s/\]//')
# Raise to 0.92 to unblock writes temporarily
oc -n openshift-storage exec "$MON_POD" -c mon -- \
ceph -m "$MON_ADDR" --keyring /etc/ceph/keyring-store/keyring \
osd set-full-ratio 0.92
# After space is freed, reset to default
oc -n openshift-storage exec "$MON_POD" -c mon -- \
ceph -m "$MON_ADDR" --keyring /etc/ceph/keyring-store/keyring \
osd set-full-ratio 0.85
When you need to discover available flags or verify syntax:
oc debug-queries list --help
oc debug-queries logs --help
oc metrics query --help