
Name: Degraded Operator Recovery
Author: openshift

Troubleshoot ClusterOperator in Degraded, Unavailable, or not Progressing state. Use when operator status shows error conditions, reconciliation failures, or degraded health checks.


stars:68

forks:82

updated: 2026-03-25 09:00

SKILL.md


name	degraded-operator-recovery
description	Troubleshoot ClusterOperator in Degraded, Unavailable, or not Progressing state. Use when operator status shows error conditions, reconciliation failures, or degraded health checks.

Degraded Operator Recovery

When a user reports unhealthy cluster operators or a stuck upgrade, follow this structured approach to identify the blocking condition and provide recovery steps.

1. Assess Cluster Operator Health

Start by listing all cluster operators and their status conditions:

- Identify operators with Degraded=True or Available=False.
- Note operators with Progressing=True; these may be mid-reconciliation and need time, not intervention.
- If multiple operators are degraded, identify dependencies. For example, if kube-apiserver is degraded, other operators that depend on the API server will also report issues.

Focus on the root cause operator — the one whose degradation is not explained by another operator's failure.
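As a minimal sketch of this first pass, the filter below assumes the output of `oc get clusteroperators -o json` has already been loaded into a dictionary. The function name `unhealthy_operators` and the `SAMPLE` data are hypothetical and exist only for illustration; the field names follow the usual ClusterOperator status layout.

```python
def unhealthy_operators(cluster_operators: dict) -> list:
    """Return operators reporting Degraded=True or Available=False."""
    findings = []
    for item in cluster_operators.get("items", []):
        # Map condition type -> status string ("True", "False", "Unknown").
        conditions = {
            c["type"]: c["status"]
            for c in item.get("status", {}).get("conditions", [])
        }
        degraded = conditions.get("Degraded") == "True"
        unavailable = conditions.get("Available") == "False"
        if degraded or unavailable:
            findings.append({
                "name": item["metadata"]["name"],
                "degraded": degraded,
                "unavailable": unavailable,
                # Progressing operators may only need time, so flag them.
                "progressing": conditions.get("Progressing") == "True",
            })
    return findings


# Hypothetical sample shaped like `oc get clusteroperators -o json` output.
SAMPLE = {
    "items": [
        {"metadata": {"name": "kube-apiserver"},
         "status": {"conditions": [
             {"type": "Available", "status": "True"},
             {"type": "Progressing", "status": "False"},
             {"type": "Degraded", "status": "True"},
         ]}},
        {"metadata": {"name": "authentication"},
         "status": {"conditions": [
             {"type": "Available", "status": "False"},
             {"type": "Progressing", "status": "True"},
             {"type": "Degraded", "status": "False"},
         ]}},
        {"metadata": {"name": "console"},
         "status": {"conditions": [
             {"type": "Available", "status": "True"},
             {"type": "Progressing", "status": "False"},
             {"type": "Degraded", "status": "False"},
         ]}},
    ]
}
```

The filter only narrows the list. Deciding which of the flagged operators is the root cause still depends on knowing which operators rely on which.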

2. Read the Blocking Condition

For each degraded operator:

- Read the operator's status conditions; the message field on the Degraded condition usually contains the specific error.
- Check the lastTransitionTime to understand how long the operator has been in this state.
- Look for common patterns in the condition message:
  - Certificate expiry or rotation failures
  - Webhook configuration errors
  - Failed rollout of an operand deployment
  - Resource contention (CPU/memory on control plane nodes)
  - Quorum loss (etcd-specific)

Do not skip this step. The condition message is the single most informative piece of data.
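The pattern matching and the time-in-state check described above can be sketched as follows. The keyword lists in `PATTERNS` are illustrative assumptions, not an official taxonomy of operator messages, and both function names are hypothetical. The timestamp format assumed here is the RFC 3339 UTC form that Kubernetes writes for lastTransitionTime.

```python
from datetime import datetime, timezone

# Illustrative keyword hints only; real condition messages vary by operator.
PATTERNS = {
    "certificate": ("x509", "certificate", "expired"),
    "webhook": ("webhook",),
    "rollout": ("rollout", "deployment"),
    "resources": ("insufficient cpu", "insufficient memory", "oomkilled"),
    "etcd quorum": ("quorum",),
}


def classify_message(message: str) -> list:
    """Return every pattern label whose keywords appear in the message."""
    text = message.lower()
    return [
        label
        for label, keywords in PATTERNS.items()
        if any(keyword in text for keyword in keywords)
    ]


def minutes_in_state(last_transition_time: str, now: datetime) -> float:
    """Minutes elapsed since the condition last changed."""
    started = datetime.strptime(
        last_transition_time, "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return (now - started).total_seconds() / 60
```

A message can match more than one label. For example, a webhook call that fails on certificate validation matches both "certificate" and "webhook", which is a useful hint that the CA bundle is worth checking.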

3. Inspect Managed Operands

If the condition message is not sufficient:

- Identify the operator's managed deployments, daemonsets, or statefulsets (usually in openshift-* namespaces).
- Check if any operand pods are in CrashLoopBackOff, Pending, or Error state.
- If operand pods are failing, triage them using the same approach as pod failure diagnosis: check logs and events.
- For control plane operators (etcd, kube-apiserver, kube-controller-manager, kube-scheduler): check static pod status on control plane nodes.
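One way to spot failing operand pods is to filter the JSON form of a pod listing. This is a sketch that assumes the input is the output of `oc get pods -n <namespace> -o json` loaded into a dictionary. The function name `failing_pods` and the sample pod names are hypothetical.

```python
def failing_pods(pod_list: dict) -> list:
    """Return (pod name, reason) for pods that look unhealthy."""
    failing = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        reason = None
        # A pod that was never scheduled or started reports phase Pending.
        if status.get("phase") == "Pending":
            reason = "Pending"
        for container in status.get("containerStatuses", []):
            state = container.get("state", {})
            waiting = state.get("waiting", {}).get("reason")
            terminated = state.get("terminated", {}).get("reason")
            if waiting == "CrashLoopBackOff":
                reason = waiting
            elif terminated == "Error":
                reason = terminated
        if reason:
            failing.append((pod["metadata"]["name"], reason))
    return failing


# Hypothetical sample shaped like `oc get pods -o json` output.
SAMPLE_PODS = {
    "items": [
        {"metadata": {"name": "operand-a"},
         "status": {"phase": "Running", "containerStatuses": [
             {"state": {"running": {}}}]}},
        {"metadata": {"name": "operand-b"},
         "status": {"phase": "Running", "containerStatuses": [
             {"state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
        {"metadata": {"name": "operand-c"},
         "status": {"phase": "Pending"}},
        {"metadata": {"name": "operand-d"},
         "status": {"phase": "Failed", "containerStatuses": [
             {"state": {"terminated": {"reason": "Error"}}}]}},
    ]
}
```

Pods flagged here are the ones to follow up with logs and events.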

4. Check for Upgrade-Related Issues

If this is happening during or after a cluster upgrade:

- Run oc get clusterversion to check the upgrade status and any reported failures.
- Determine if the operator is stuck waiting for a node reboot (MachineConfigPool not updated).
- Check if pending CSRs need approval; node certificate renewal during upgrade can block operators.
- For transient Progressing=True during upgrade: advise waiting (up to the operator's expected rollout window) before intervening.

Distinguish between "upgrade in progress" (normal) and "upgrade stuck" (needs intervention).
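To illustrate the MachineConfigPool check, the sketch below summarizes each pool from the JSON form of `oc get machineconfigpool -o json`. It assumes the usual status fields (the Updated and Degraded conditions, machineCount, updatedMachineCount). The function name `pool_states` and the sample data are hypothetical.

```python
def pool_states(mcp_list: dict) -> dict:
    """Summarize each MachineConfigPool as updated, degraded, or updating."""
    result = {}
    for pool in mcp_list.get("items", []):
        status = pool.get("status", {})
        conditions = {
            c["type"]: c["status"] for c in status.get("conditions", [])
        }
        if conditions.get("Degraded") == "True":
            state = "degraded"
        elif conditions.get("Updated") == "True":
            state = "updated"
        else:
            total = status.get("machineCount", 0)
            done = status.get("updatedMachineCount", 0)
            state = f"updating ({done}/{total} machines updated)"
        result[pool["metadata"]["name"]] = state
    return result


# Hypothetical sample shaped like `oc get machineconfigpool -o json` output.
SAMPLE_POOLS = {
    "items": [
        {"metadata": {"name": "master"},
         "status": {"machineCount": 3, "updatedMachineCount": 3,
                    "conditions": [{"type": "Updated", "status": "True"},
                                   {"type": "Degraded", "status": "False"}]}},
        {"metadata": {"name": "worker"},
         "status": {"machineCount": 3, "updatedMachineCount": 1,
                    "conditions": [{"type": "Updated", "status": "False"},
                                   {"type": "Degraded", "status": "False"}]}},
    ]
}
```

A pool whose updated count keeps rising between checks suggests an upgrade in progress. A count that stays flat for a long time, or a degraded pool, points toward a stuck upgrade.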

5. Provide Recovery Steps

Once the blocking condition is identified:

- State which operator is degraded and what the blocking condition is.
- Provide specific recovery actions:
  - Pending CSRs: approve them with oc adm certificate approve.
  - Failed operand pod: follow pod triage to fix the underlying issue.
  - Certificate issues: check if cert rotation can be triggered or if manual renewal is needed.
  - Resource contention: identify the pressure source on control plane nodes.
  - Webhook errors: check if the webhook service is available and the CA bundle is correct.
- If the issue is internal to the operator and cannot be resolved by the user, recommend opening a support case with the specific condition message.
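For the pending-CSR case, a small helper can list the requests that have been neither approved nor denied and build the matching approval commands. This sketch assumes the input is the output of `oc get csr -o json` loaded into a dictionary. The function names and the sample CSR names are hypothetical.

```python
def pending_csr_names(csr_list: dict) -> list:
    """Return names of CSRs with no Approved or Denied condition."""
    pending = []
    for csr in csr_list.get("items", []):
        conditions = csr.get("status", {}).get("conditions", [])
        types = {c.get("type") for c in conditions}
        if not types & {"Approved", "Denied"}:
            pending.append(csr["metadata"]["name"])
    return pending


def approval_commands(names: list) -> list:
    """Build one `oc adm certificate approve` command per CSR name."""
    return [f"oc adm certificate approve {name}" for name in names]


# Hypothetical sample shaped like `oc get csr -o json` output.
SAMPLE_CSRS = {
    "items": [
        {"metadata": {"name": "csr-aaaaa"}, "status": {}},
        {"metadata": {"name": "csr-bbbbb"},
         "status": {"conditions": [{"type": "Approved", "status": "True"}]}},
        {"metadata": {"name": "csr-ccccc"}},
    ]
}
```

The helper returns commands as strings and does not run them, so each request can be reviewed before it is approved.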

Quality Standards

- Always start with oc get clusteroperators before diving deeper; this prevents chasing symptoms of a different root cause.
- For etcd, kube-apiserver, and kube-controller-manager: warn about control plane impact before suggesting any restart or remediation.
- Never suggest force-deleting operator-managed resources without explicit warning; operators reconcile state and force-deletion can cause split-brain.
- If the operator is Progressing=True, advise waiting before intervening; premature action can make the situation worse.
- Distinguish upgrade-related transient degradation from persistent failures that need action.

Other skills in the same repository

resolve-cve.md

from "openshift/lightspeed-service"

Resolve a CVE vulnerability issue from Jira. Reads the CVE details, assesses impact, and either marks "not affected" with a Jira comment and transition, bumps the affected dependency, or implements a code fix. Use when the user says "cve", "resolve CVE", or provides a CVE Jira issue.

2026-03-31, 68 stars

migration-readiness.md

from "openshift/lightspeed-service"

Assess whether an application is ready to migrate to OpenShift. Use when the user describes an application and asks about migration, containerization, or moving to Kubernetes/OpenShift.

2026-03-31, 68 stars

triage.md

from "openshift/lightspeed-service"

Triage production incidents involving data corruption, data loss, slow performance, or outages. Classify severity and recommend immediate actions.

2026-03-31, 68 stars

deps-update.md

from "openshift/lightspeed-service"

Update Python dependencies to latest versions using uv, regenerate lock and requirements.txt, then verify linting and tests pass. Fix breakage from API changes in bumped packages. Use when the user says "deps update", "bump dependencies", or "update deps".

2026-03-30, 68 stars

review-pr.md

from "openshift/lightspeed-service"

Review PR with structured approach covering architecture, naming, patterns, and critical questions

2026-03-27, 68 stars

namespace-troubleshooting.md

from "openshift/lightspeed-service"

Troubleshoot namespace stuck in Terminating state, ResourceQuota exhaustion, or RBAC permission denied errors. Use when resources cannot be created or forbidden errors occur.

2026-03-25, 68 stars

package.json

"author": "openshift"

"repository": "openshift/lightspeed-service"


