check-ocp-health
General OpenShift (OCP) cluster health check. Use when the cluster is unhealthy, nodes are NotReady, operators are degraded, pods are crashing, etcd is slow, networking issues occur, or a general cluster diagnosis is needed.
| name | check-ocp-health |
| description | General OpenShift (OCP) cluster health check. Use when the cluster is unhealthy, nodes are NotReady, operators are degraded, pods are crashing, etcd is slow, networking issues occur, or a general cluster diagnosis is needed. |
Use this guide for general OCP cluster health diagnosis and remediation.
This skill requires:
- oc debug-queries (kubectl-debug-queries) -- for listing resources, logs, events
- oc metrics (kubectl-metrics) -- for CPU/memory/node metrics

If any tool is missing, install with:
curl -sSL https://raw.githubusercontent.com/yaacov/kubectl-debug-queries/main/install.sh | bash
curl -sSL https://raw.githubusercontent.com/yaacov/kubectl-metrics/main/install.sh | bash
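Before running the installers, it can help to confirm which tools are actually missing. The sketch below is one possible way to do that; `check_tool` is a hypothetical helper (not part of either plugin) that simply runs a command with `--help` and reports whether it succeeded.

```shell
# Sketch: report whether a command responds to --help.
# check_tool is a hypothetical helper, not part of oc or the plugins.
check_tool() {
  if "$@" --help >/dev/null 2>&1; then
    echo "ok: $*"
  else
    echo "missing: $*"
    return 1
  fi
}

# Usage (assumes oc is on PATH):
#   check_tool oc debug-queries list
#   check_tool oc metrics query
```

A non-zero return from `check_tool` is the signal to run the matching install command above.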
--query for Filtering

Use --query to filter, sort, and project results server-side. Pipe output to jq, grep, or other post-processing tools only when --query cannot express what you need.
The --query flag accepts TSL (Tree Search Language) with four optional clauses:
[select <field>, ...] [where <condition>] [order by <field> [asc|desc]] [limit N]
Note: select only affects table output (the default). With --output json, all fields are always returned regardless of select.
Operators: =, !=, <, >, <=, >=, like (% wildcard), ilike (case-insensitive), ~= (regex), and, or, not, in [...], between X and Y.
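As an illustration of the clause order, the following sketch assembles a query string from the four optional clauses. `tsl_query` is a hypothetical helper written for this guide, not something the plugin provides; the field names are the ones used elsewhere in this guide and should still be confirmed with --output json.

```shell
# Sketch: build a TSL query from its four optional clauses, in order.
# tsl_query is a hypothetical helper; pass "" to skip a clause.
tsl_query() {
  local select="${1:-}" where="${2:-}" order="${3:-}" limit="${4:-}"
  local q=""
  if [ -n "$select" ]; then q="select $select"; fi
  if [ -n "$where" ];  then q="${q:+$q }where $where"; fi
  if [ -n "$order" ];  then q="${q:+$q }order by $order"; fi
  if [ -n "$limit" ];  then q="${q:+$q }limit $limit"; fi
  printf '%s\n' "$q"
}

# Usage:
#   q="$(tsl_query "spec.nodeName, status.phase" "status.phase = 'Pending'" "spec.nodeName asc" 10)"
#   oc debug-queries list --resource pods --all-namespaces --query "$q"
```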
Before writing queries, discover actual field names with --output json:
oc debug-queries list --resource pods --namespace <namespace> --limit 2 --output json
Check these in order for a fast overview:
oc debug-queries list --resource nodes --all-namespaces
oc debug-queries list --resource clusteroperators --all-namespaces
oc debug-queries list --resource pods --all-namespaces --query "where status.phase != 'Running' and status.phase != 'Succeeded'" --limit 30
oc debug-queries events --all-namespaces --query "where type = 'Warning'" --sort-by time_desc --limit 20
oc debug-queries list --resource nodes --all-namespaces
oc debug-queries get --resource node --name <node-name> --namespace default
Resource usage:
oc metrics query --query "avg(instance:node_cpu:ratio) * 100"
oc metrics query --query "(1 - sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes)) * 100"
oc metrics query --query "sum(kube_node_status_condition{condition='Ready',status='true'})"
Pods on a specific node:
oc debug-queries list --resource pods --all-namespaces --query "where spec.nodeName = '<node-name>'"
Diagnosis:
oc debug-queries get --resource node --name <node-name> --namespace default
oc debug-queries events --all-namespaces --name <node-name> --sort-by time_desc
Common causes:
Remediation:
oc debug-queries list --resource clusteroperators --all-namespaces
oc debug-queries get --resource clusteroperator --name <operator-name> --namespace default
Key operators to watch: etcd, kube-apiserver, openshift-controller-manager, ingress, monitoring, storage, machine-config.
Diagnosis: Check the operator's namespace for unhealthy pods:
oc debug-queries list --resource pods --namespace openshift-<operator-name> --query "where status.phase != 'Running'"
oc debug-queries logs --name <pod-name> --namespace openshift-<operator-name> --tail 50
Before writing log queries, discover the actual field names and level values:
oc debug-queries logs --name <pod-name> --namespace openshift-<operator-name> --tail 5 --output json
Level strings vary by workload. Controller-runtime logs normalize to ERROR, WARN, INFO, DEBUG. klog-format logs (used by etcd and other Kubernetes components) may normalize to E, W, I, F. Always check with --output json first to see actual level values.
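Once a sample level value has been read from the JSON output, the matching filter can be picked mechanically. The sketch below assumes only the two level families described above; `level_clause` is a hypothetical helper, and including F (fatal) in the klog filter is a choice made here, not something the tool requires.

```shell
# Sketch: choose an error/warning filter that matches the level style
# seen in a sample log entry. level_clause is a hypothetical helper.
level_clause() {
  local sample="${1:-}"
  case "$sample" in
    E|W|I|F)
      # klog-style single-letter levels
      echo "where level = 'E' or level = 'W' or level = 'F'" ;;
    ERROR|WARN|INFO|DEBUG)
      # controller-runtime style levels
      echo "where level = 'ERROR' or level = 'WARN'" ;;
    *)
      echo "unrecognized level: $sample" >&2
      return 1 ;;
  esac
}

# Usage:
#   oc debug-queries logs --name <pod-name> --namespace <namespace> --tail 50 --query "$(level_clause E)"
```

Any level value outside these two families returns an error rather than a guess, so the query is never built from an unverified level string.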
Full-text search when you don't know which field contains the value:
oc debug-queries logs --name <pod-name> --namespace openshift-<operator-name> --tail 200 --query "where raw_line ~= '.*<search-term>.*'"
Remediation:
machine-config operator is degraded
oc debug-queries get --resource clusteroperator --name etcd --namespace default
oc debug-queries list --resource pods --namespace openshift-etcd --selector "app=etcd"
oc debug-queries logs --name deployment/etcd-operator --namespace openshift-etcd-operator --tail 50 --query "where level = 'ERROR' or level = 'WARN'"
Common causes:
Remediation:
oc debug-queries list --resource pods --namespace openshift-kube-apiserver --selector "app=openshift-kube-apiserver"
oc debug-queries events --namespace openshift-kube-apiserver --sort-by time_desc --limit 10
oc debug-queries list --resource pods --all-namespaces --query "where status.phase = 'Failed'" --limit 20
oc debug-queries list --resource pods --all-namespaces --query "where status.phase = 'Pending'"
Pods with high restart counts:
oc metrics query --query "topk(10, sort_desc(kube_pod_container_status_restarts_total))"
Diagnosis:
oc debug-queries get --resource pod --name <pod-name> --namespace <namespace>
oc debug-queries logs --name <pod-name> --namespace <namespace> --previous
Common causes: missing config/secrets, OOM, application errors, image issues.
Diagnosis:
oc debug-queries events --namespace <namespace> --name <pod-name> --resource Pod
Common causes: wrong image name, registry auth missing, network issues to registry.
oc debug-queries list --resource pods --namespace openshift-ingress
oc debug-queries get --resource clusteroperator --name network --namespace default
oc debug-queries list --resource pods --namespace openshift-network-operator
Diagnosis:
oc debug-queries list --resource endpoints --namespace <namespace>
oc debug-queries list --resource pods --namespace openshift-ingress
oc debug-queries logs --name <router-pod> --namespace openshift-ingress --tail 20
oc debug-queries list --resource certificates --all-namespaces
oc debug-queries get --resource clusteroperator --name kube-apiserver --namespace default
oc debug-queries list --resource machineconfigpool --all-namespaces
oc debug-queries get --resource machineconfigpool --name worker --namespace default
oc debug-queries get --resource machineconfigpool --name master --namespace default
Diagnosis:
oc debug-queries list --resource machineconfigpool --all-namespaces --output json
Remediation:
oc debug-queries list --resource clusterversion --all-namespaces
oc debug-queries get --resource clusterversion --name version --namespace default
oc debug-queries list --resource resourcequota --all-namespaces
oc debug-queries list --resource limitrange --all-namespaces
oc debug-queries get --resource resourcequota --name <quota-name> --namespace <namespace>
When the user asks for a cluster health report, run these commands in parallel and present the results as a formatted summary with tables:
Cluster & nodes:
oc debug-queries list --resource clusterversion --all-namespaces
oc debug-queries list --resource nodes --all-namespaces
oc debug-queries list --resource clusteroperators --all-namespaces
oc debug-queries list --resource pods --all-namespaces --query "where status.phase != 'Running' and status.phase != 'Succeeded'" --limit 15
Resource usage:
oc metrics query --query "avg(instance:node_cpu:ratio) * 100"
oc metrics query --query "(1 - sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes)) * 100"
oc metrics query --query "sum(kube_node_status_condition{condition='Ready',status='true'})"
Storage health:
oc metrics query --query "ceph_health_status"
oc metrics query --query "ceph_cluster_total_used_bytes / ceph_cluster_total_bytes * 100"
Format the results as a concise summary with:
Flag any issues found with brief remediation hints. If everything is healthy, say so clearly.
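When the report is produced from a shell rather than by an agent, the "run in parallel" step could look roughly like the sketch below. `run_parallel` is a hypothetical helper: it starts each command string in its own background shell, waits for all of them, then prints the captured output in the order the commands were given.

```shell
# Sketch: run several commands concurrently and print each result under
# a heading, in the order given. run_parallel is a hypothetical helper.
run_parallel() {
  local dir i=0 cmd
  dir="$(mktemp -d)" || return 1
  for cmd in "$@"; do
    # Each command string runs in its own background shell.
    bash -c "$cmd" >"$dir/$i.out" 2>&1 &
    i=$((i + 1))
  done
  wait  # block until every background job has finished
  i=0
  for cmd in "$@"; do
    echo "== $cmd =="
    cat "$dir/$i.out"
    i=$((i + 1))
  done
  rm -rf "$dir"
}

# Usage:
#   run_parallel \
#     "oc debug-queries list --resource clusterversion --all-namespaces" \
#     "oc debug-queries list --resource nodes --all-namespaces" \
#     "oc metrics query --query ceph_health_status"
```

Capturing output per command keeps the sections from interleaving, at the cost of not streaming results as they arrive.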
These operations require shell access:
oc get --raw /healthz
oc -n openshift-etcd exec $(oc -n openshift-etcd get pods -l app=etcd -o jsonpath='{.items[0].metadata.name}') -c etcd -- \
etcdctl member list -w table
oc -n openshift-etcd exec $(oc -n openshift-etcd get pods -l app=etcd -o jsonpath='{.items[0].metadata.name}') -c etcd -- \
etcdctl endpoint health --cluster -w table
oc -n openshift-etcd exec $(oc -n openshift-etcd get pods -l app=etcd -o jsonpath='{.items[0].metadata.name}') -c etcd -- \
etcdctl endpoint status --cluster -w table
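The three etcdctl commands above repeat the same pod lookup. One way to avoid that, sketched here with two hypothetical helpers (`etcd_pod` and `etcdctl_in_pod`), is to resolve the pod name once and pass it to a small wrapper. The `oc` arguments are the same ones used above.

```shell
# Sketch: resolve an etcd pod once, then reuse it for each etcdctl call.
# etcd_pod and etcdctl_in_pod are hypothetical helpers.

# Print the name of the first pod labelled app=etcd.
etcd_pod() {
  oc -n openshift-etcd get pods -l app=etcd \
    -o jsonpath='{.items[0].metadata.name}'
}

# Run etcdctl inside the etcd container of the given pod.
etcdctl_in_pod() {
  local pod="${1:-}"
  if [ -z "$pod" ]; then
    echo "usage: etcdctl_in_pod <pod-name> <etcdctl args...>" >&2
    return 1
  fi
  shift
  oc -n openshift-etcd exec "$pod" -c etcd -- etcdctl "$@"
}

# Usage:
#   pod="$(etcd_pod)"
#   etcdctl_in_pod "$pod" member list -w table
#   etcdctl_in_pod "$pod" endpoint health --cluster -w table
#   etcdctl_in_pod "$pod" endpoint status --cluster -w table
```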
oc run dns-test --rm -i --restart=Never --image=busybox -- nslookup kubernetes.default.svc.cluster.local
oc get secret -n openshift-kube-apiserver -o json | \
python3 -c "import json,sys; [print(i['metadata']['name']) for i in json.load(sys.stdin)['items'] if 'cert' in i['metadata']['name'].lower()]" 2>/dev/null
When you need to discover available flags or verify syntax:
oc debug-queries list --help
oc debug-queries logs --help
oc metrics query --help