node-health-check
// Check node health and diagnose node-level issues (NotReady, DiskPressure, MemoryPressure, PIDPressure). Inspects node conditions, resource allocation, and real-time usage.
| name | node-health-check |
| description | Check node health and diagnose node-level issues (NotReady, DiskPressure, MemoryPressure, PIDPressure). Inspects node conditions, resource allocation, and real-time usage. |
When nodes are NotReady, experiencing resource pressure, or suspected of causing pod failures, follow this flow to diagnose node-level issues.
Scope: This skill is for diagnosis only. Once you identify the root cause, report it to the user and stop. Do NOT attempt to drain, cordon, or restart nodes — that should be left to the user or cluster administrator.
kubectl get nodes -o wide
Note the STATUS of each node. Healthy nodes show Ready. Look for NotReady, SchedulingDisabled, or condition-related flags like Ready,SchedulingDisabled.
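The status check above can be sketched as a small filter. This is a minimal sketch, not part of the skill itself: the function name `unhealthy_nodes` is hypothetical, and it assumes the default column layout of `kubectl get nodes --no-headers`, where the second column is STATUS.

```shell
# Hypothetical helper: print "<node> <status>" for every node whose
# STATUS column is anything other than exactly "Ready".
# Expects the output of `kubectl get nodes --no-headers` on stdin.
unhealthy_nodes() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Typical use against a live cluster (not executed here):
#   kubectl get nodes --no-headers | unhealthy_nodes
```

Because the comparison is an exact match, combined values such as Ready,SchedulingDisabled are reported as well, which matches the guidance to look at them.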
For any node showing issues:
kubectl describe node <node>
Focus on the Conditions section. Key conditions:
| Condition | Healthy Value | Problem Value | Meaning |
|---|---|---|---|
| Ready | True | False/Unknown | Kubelet is healthy and can accept pods |
| MemoryPressure | False | True | Node memory usage is critically high |
| DiskPressure | False | True | Node disk usage exceeds eviction threshold |
| PIDPressure | False | True | Too many processes running on node |
| NetworkUnavailable | False | True | Node network is not configured correctly |
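The table can be applied mechanically. The sketch below assumes the conditions have been flattened to one `<type> <status>` pair per line; the function name `problem_conditions` is hypothetical and only illustrates the healthy/problem mapping from the table.

```shell
# Hypothetical helper: print every condition whose status differs from
# its healthy value. Ready is healthy when True; the pressure and
# network conditions are healthy when False.
# Expects "<type> <status>" lines on stdin.
problem_conditions() {
  awk '
    $1 == "Ready" { if ($2 != "True") print $1 "=" $2; next }
    $1 == "MemoryPressure" || $1 == "DiskPressure" ||
    $1 == "PIDPressure" || $1 == "NetworkUnavailable" {
      if ($2 != "False") print $1 "=" $2
    }
  '
}

# Typical use against a live cluster (not executed here):
#   kubectl get node <node> \
#     -o jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{"\n"}{end}' \
#     | problem_conditions
```

An empty result means every listed condition is at its healthy value. Condition types not in the table are ignored rather than guessed at.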
Also check:
kubectl top node <node>
Compare actual CPU and memory usage against the node's allocatable resources from step 2.
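As a rough illustration of that comparison, the sketch below flags nodes whose reported CPU or memory percentage exceeds a cutoff. It assumes the default five-column output of `kubectl top node --no-headers` (name, CPU cores, CPU%, memory bytes, memory%), whose percentages are relative to allocatable by default. The function name `high_usage_nodes` and the default cutoff of 80 are assumptions for illustration, not values from this skill.

```shell
# Hypothetical helper: print nodes whose CPU% or MEMORY% is above a cutoff.
# $1: cutoff in percent (assumed default: 80).
# Expects the output of `kubectl top node --no-headers` on stdin.
high_usage_nodes() {
  threshold="${1:-80}"
  awk -v t="$threshold" '
    {
      cpu = $3; mem = $5
      sub(/%/, "", cpu); sub(/%/, "", mem)   # strip the trailing percent sign
      if (cpu + 0 > t + 0 || mem + 0 > t + 0)
        print $1, "cpu=" cpu "%", "mem=" mem "%"
    }
  '
}

# Typical use against a live cluster (not executed here):
#   kubectl top node --no-headers | high_usage_nodes 85
```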
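NotReady: Kubelet not responding
The kubelet on the node is not communicating with the API server. Common causes include a stopped or crashed kubelet, lost network connectivity between the node and the API server, and resource exhaustion on the node.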
If node-level logs are available, use the node-logs skill to check kubelet logs:
bash skills/core/node-logs/scripts/get-node-logs.sh \
--node <node> --unit kubelet --since "30m ago" --tail 100
Report the node's NotReady status and any kubelet errors to the user.
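DiskPressure: Disk usage exceeds threshold
The node's available disk space has fallen below the kubelet's eviction threshold. With default settings that means less than 10% free on the node filesystem or less than 15% free on the image filesystem, which is roughly 85-90% usage. The kubelet will start evicting pods.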
List the pods on the node as candidates for high ephemeral storage use (this command does not itself show per-pod storage usage):
kubectl get pods --field-selector spec.nodeName=<node> -A -o wide
Advise the user to clean up unused images/containers, increase disk size, or move workloads to other nodes.
MemoryPressure: Memory usage critically high
The node's memory usage is critically high. The kubelet may evict pods based on their QoS class (BestEffort first, then Burstable).
Check pod memory usage on the node:
kubectl top pods --field-selector spec.nodeName=<node> -A --sort-by=memory
If the above doesn't work (field-selector may not be supported for top), list pods on the node and check their usage:
kubectl get pods --field-selector spec.nodeName=<node> -A -o wide
kubectl top pods -A --sort-by=memory | head -20
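The fallback above leaves the two outputs to be matched by eye. A minimal sketch of joining them follows; the function name `top_pods_on_node` and the temporary file paths are hypothetical, and the sketch assumes `kubectl top pods -A --no-headers` prints namespace and pod name in its first two columns.

```shell
# Hypothetical helper: keep only the `kubectl top pods` rows that belong
# to pods scheduled on the node, preserving the sort order of the top output.
# $1: file with "<namespace> <pod>" lines for pods on the node (must not be empty)
# $2: file with the output of `kubectl top pods -A --no-headers --sort-by=memory`
top_pods_on_node() {
  awk 'NR == FNR { on_node[$1 "/" $2] = 1; next }
       ($1 "/" $2) in on_node' "$1" "$2"
}

# Typical use against a live cluster (not executed here):
#   kubectl get pods -A --field-selector spec.nodeName=<node> --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name > /tmp/on-node.txt
#   kubectl top pods -A --no-headers --sort-by=memory > /tmp/top.txt
#   top_pods_on_node /tmp/on-node.txt /tmp/top.txt | head -20
```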
PIDPressure: Too many processes
The node is running too many processes. This can prevent new containers from starting.
Advise the user to investigate which pods are creating excessive processes, and consider setting PID limits in the container runtime or kubelet configuration.
NetworkUnavailable: Node network not configured
The node's network plugin (CNI) has not configured networking. The CNI plugin may be missing, may have crashed, or may have failed to initialize.
Check CNI pod status on the node:
kubectl get pods -A --field-selector spec.nodeName=<node> | grep -E 'cni|calico|cilium|flannel|weave'
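To narrow that list to pods that actually look unhealthy, a filter along these lines could be used. This is a sketch under assumptions: the function name `unhealthy_cni_pods` is hypothetical, the name pattern is copied from the grep above, and the input is assumed to have the default `kubectl get pods -A --no-headers` columns (namespace, name, ready, status, ...).

```shell
# Hypothetical helper: print CNI-looking pods that are not Running or
# whose READY count (e.g. "0/1") shows containers that are not ready.
# Expects the output of `kubectl get pods -A --no-headers` on stdin.
unhealthy_cni_pods() {
  awk '$2 ~ /cni|calico|cilium|flannel|weave/ {
    split($3, ready, "/")
    if ($4 != "Running" || ready[1] != ready[2])
      print $1 "/" $2, $3, $4
  }'
}

# Typical use against a live cluster (not executed here):
#   kubectl get pods -A --field-selector spec.nodeName=<node> --no-headers \
#     | unhealthy_cni_pods
```

No output means every matching pod is Running with all containers ready. It does not prove that a CNI plugin is installed, since a cluster with no matching pods also produces no output.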
SchedulingDisabled: Node is cordoned
The node has been cordoned (`kubectl cordon`) and will not accept new pods. Existing pods continue running.
This is usually intentional (maintenance). Report to the user that the node is cordoned.
If resource overcommitment is suspected:
kubectl describe node <node> | grep -A 20 "Allocated resources"
Compare the total requests against allocatable resources. If CPU or memory requests exceed 90% of allocatable, new pods may fail to schedule on this node.
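The 90% comparison can be read straight from the percentages that `kubectl describe node` prints next to each request. The sketch below assumes rows of the form `cpu  1850m (46%)  2 (50%)` in the "Allocated resources" section; the function name `overcommitted_requests` is hypothetical.

```shell
# Hypothetical helper: report cpu/memory rows whose request percentage
# (third field, e.g. "(94%)") exceeds a cutoff.
# $1: cutoff in percent (default 90, the figure used in the text above).
# Expects the "Allocated resources" section of `kubectl describe node` on stdin.
overcommitted_requests() {
  threshold="${1:-90}"
  awk -v t="$threshold" '
    $1 == "cpu" || $1 == "memory" {
      pct = $3
      gsub(/[()%]/, "", pct)   # "(94%)" -> "94"
      if (pct + 0 > t + 0) print $1, "requests at " pct "% of allocatable"
    }
  '
}

# Typical use against a live cluster (not executed here):
#   kubectl describe node <node> | grep -A 20 "Allocated resources" | overcommitted_requests
```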
After general health checks, search your skill list for node-level hardware or resource diagnostic skills (e.g., RDMA/RoCE config checks, GPU diagnostics, storage checks). Run any that match this node's characteristics. Each skill auto-detects whether it applies to the node — no pre-check needed on your part.
kubectl top requires the Metrics Server to be installed in the cluster. If it returns an error, the Metrics Server may not be available.
The lastTransitionTime field on each condition tells you when the condition last changed, which helps correlate it with events or changes.
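For correlating condition changes with events, it can help to list conditions by how recently they changed. The sketch below assumes one `<lastTransitionTime> <type> <status>` line per condition, with the RFC 3339 UTC timestamps the API returns, which sort correctly as plain text. The function name `recent_transitions` is hypothetical.

```shell
# Hypothetical helper: show the most recently changed conditions first.
# $1: how many lines to keep (assumed default: 3).
# Expects "<lastTransitionTime> <type> <status>" lines on stdin.
recent_transitions() {
  sort -r | head -n "${1:-3}"
}

# Typical use against a live cluster (not executed here):
#   kubectl get node <node> \
#     -o jsonpath='{range .status.conditions[*]}{.lastTransitionTime}{" "}{.type}{" "}{.status}{"\n"}{end}' \
#     | recent_transitions
```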