test-instances
// Use when creating and validating KubeBlocks database Cluster test instances across standalone, replication, cluster, parameter, dynamic/static, recovery, and engine-specific topologies.
Reference resolution: when this source-derived skill mentions docs/..., resolve it from the shared support package beside the installed user skills: ~/.codex/skills/kubeblocks-addon-source-docs/docs/... for Codex or ~/.claude/skills/kubeblocks-addon-source-docs/docs/... for Claude Code. In the shared kubeblocks-addon-docs checkout, the same files live under skills/kubeblocks-addon-source-docs/docs/.... When it mentions scripts/..., resolve it from the same support package under scripts/.... If you are working inside a checkout of the original apecloud/kubeblocks-addon-skills, repo-relative paths are also valid.
Create database cluster instances to validate the deployed KubeBlocks addon, testing every topology.
Target: $ARGUMENTS
(Engine name and optional version — e.g., redis or redis 7.2.4. Version omitted → tests all available versions.)
SCRIPT_DIR="$(git rev-parse --show-toplevel 2>/dev/null || pwd)"
[ -f "$SCRIPT_DIR/.env" ] && source "$SCRIPT_DIR/.env"
[ -n "$KUBECONFIG" ] && export KUBECONFIG
# Parse arguments: ENGINE [VERSION]
ARGS=($ARGUMENTS)
ENGINE="${ARGS[0]:-}"
VERSION="${ARGS[1]:-}"
kubectl cluster-info --request-timeout=5s \
|| { echo "ERROR: kubectl cannot reach the cluster. Check KUBECONFIG in .env"; exit 1; }
echo "KUBECONFIG=${KUBECONFIG:-~/.kube/config}"
# Registry decision: probe docker.io; fall back to ALIYUN_IMAGE_REGISTRY if unreachable
if curl -s --connect-timeout 15 "https://registry-1.docker.io/v2/" -o /dev/null 2>/dev/null; then
IMAGE_REGISTRY="docker.io"
SKOPEO_CREDS="--no-creds"
elif [ -n "$ALIYUN_IMAGE_REGISTRY" ]; then
IMAGE_REGISTRY="${ALIYUN_IMAGE_REGISTRY}"
SKOPEO_CREDS="--creds ${ALIYUN_DOCKER_USERNAME}:${ALIYUN_DOCKER_PASSWORD}"
else
IMAGE_REGISTRY="docker.io"
SKOPEO_CREDS="--no-creds"
fi
echo "Image registry: ${IMAGE_REGISTRY}"
# Node arch/OS — used by skopeo to probe the correct manifest variant
NODE_ARCH=$(kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.architecture}' 2>/dev/null || echo "amd64")
NODE_OS=$(kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.operatingSystem}' 2>/dev/null || echo "linux")
echo "Node arch: ${NODE_ARCH} os: ${NODE_OS}"
Check for engine-specific overrides and constraints before proceeding:
HINTS_FILE="docs/engine-hints/${ENGINE}.md"
[ -f "$HINTS_FILE" ] && echo "=== Engine hints found: $HINTS_FILE ===" || echo "(no engine-specific hints for $ENGINE)"
If the hints file exists, read it now before proceeding — it may override resource limits, list unsupported operations, or describe engine-specific behaviors for this run.
For each engine, tests are organized by Feature → Operation, matching the official KubeBlocks v1.0 regression report format:
| Feature | Operations |
|---|---|
| Lifecycle | Create, Start, Stop, Restart (cluster + per-component), Update (TerminationPolicy WipeOut) |
| Scale | VerticalScaling, VolumeExpansion, HorizontalScaling In/Out, HscaleOfflineInstances, HscaleOnlineInstances, RebuildInstance |
| Upgrade | Service version upgrade (forward + backward) |
| SwitchOver | Promote, SwitchOver (per component) |
| Failover | ChaosMesh fault injection with expected HA recovery: Full CPU, Network Corrupt, OOM, Pod Kill, Kill 1, Network Loss, Network Delay, Pod Failure, Network Bandwidth, Network Partition, Delete Pod All |
| NoFailover | ChaosMesh fault injection without failover expected: DNS Error, Network Duplicate, DNS Random, Connection Stress, Time Offset |
| Backup Restore | Backup (xtrabackup / xtrabackup-inc / pbm-physical / wal-g / pg-basebackup / datafile / dump / full / volume-snapshot / topics), Schedule Backup/Restore, Restore, Restore Increment, Delete Restore Cluster |
| Parameter | Reconfiguring Dynamic (randomly select up to 10 dynamic params, verify no pod restart), Reconfiguring Static (randomly select up to 2 static params, verify pod restart triggered) |
| Accessibility | Expose Enable/Disable (internet/intranet), Connect |
| Stress | Bench (service + LB service), Tpch |
ChaosMesh is the chaos engineering tool used for Failover and NoFailover tests. Step 1 checks for `chaos-controller-manager` in `chaos-mesh` and `chaos-testing`, and installs ChaosMesh automatically if absent. The resolved namespace is stored in `$CHAOS_NS`.
Single-node topology limitations: Some operations are architecturally incompatible with single-node topologies and must be marked `N/A` without attempting:
- HorizontalScaling Out/In, HscaleOfflineInstances, HscaleOnlineInstances — single-node topologies cannot add or remove nodes; architecturally unsupported.
- SwitchOver — no secondary exists to promote.
- Failover (all 11 cases) — single-node has no HA election. Recovery is Kubernetes pod restart (restartPolicy), not application-level failover. Results should be labeled as "K8s pod restart recovery" rather than "HA failover". Test in multi-node topology for true failover validation.
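A quick guard can route these cases automatically. A minimal sketch, assuming pods carry the `app.kubernetes.io/instance` label used throughout this skill (`is_single_node` is a hypothetical helper name):

```bash
# Hypothetical helper: treat clusters with <= 1 pod as single-node and
# mark HA-dependent cases N/A instead of attempting them.
function is_single_node() {
  local count
  count=$(kubectl get pods -l "app.kubernetes.io/instance=$CLUSTER_NAME" \
    --no-headers 2>/dev/null | wc -l)
  (( count <= 1 ))
}

if is_single_node; then
  echo "Single-node topology: mark HScale/SwitchOver/Failover as N/A"
fi
```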
ENGINE=<engine>
# 1. ChaosMesh availability — verify controller is running; install if absent
CHAOS_NS=""
for NS in chaos-mesh chaos-testing; do
if kubectl get pods -n $NS --no-headers 2>/dev/null | grep -q "chaos-controller-manager"; then
CHAOS_NS=$NS
break
fi
done
if [[ -z "$CHAOS_NS" ]]; then
echo "ChaosMesh controller not found — installing via Helm..."
helm repo add chaos-mesh https://charts.chaos-mesh.org 2>/dev/null || true
helm repo update chaos-mesh
helm upgrade --install chaos-mesh chaos-mesh/chaos-mesh \
--namespace chaos-testing --create-namespace \
--set chaosDaemon.runtime=containerd \
--set chaosDaemon.socketPath=/run/containerd/containerd.sock \
--wait --timeout=120s \
&& CHAOS_NS=chaos-testing \
|| echo "ERROR: ChaosMesh install failed — chaos tests will be SKIPPED"
fi
if [[ -n "$CHAOS_NS" ]]; then
# Verify controller pod is Running
CM_POD=$(kubectl get pods -n $CHAOS_NS --no-headers 2>/dev/null \
| grep "chaos-controller-manager" | awk '{print $1}' | head -1)
CM_STATUS=$(kubectl get pod $CM_POD -n $CHAOS_NS \
-o jsonpath='{.status.phase}' 2>/dev/null)
echo "ChaosMesh controller: $CM_POD phase=$CM_STATUS namespace=$CHAOS_NS"
[[ "$CM_STATUS" != "Running" ]] && echo "WARNING: ChaosMesh controller not Running — chaos tests may fail"
fi
# 2. ClusterDefinition must be Available
PHASE=$(kubectl get clusterdefinition $ENGINE -o jsonpath='{.status.phase}' 2>/dev/null)
echo "ClusterDefinition phase: $PHASE"
# If not Available → run /deploy-addon first
# 3. addons-cluster chart must exist
ls addons-cluster/$ENGINE/Chart.yaml 2>/dev/null || echo "addons-cluster not found — cannot test"
Query GitHub for open issues labelled skip-in-test in apecloud/kubeblocks-addons.
Each matching issue represents a known failing test case — skip it this run; re-run automatically when the issue is closed.
Convention for issue titles: [<engine>] <Operation>: <short description>
Example: [elasticsearch] full-backup: JavaClassNotFoundException in es-agent 0.1.0
ENGINE=<engine>
# Fetch all open issues with skip-in-test label for this engine
echo "=== Known Issues (skip-in-test) ==="
gh issue list \
--repo apecloud/kubeblocks-addons \
--label "skip-in-test" \
--state open \
--json number,title \
--jq ".[] | select(.title | test(\"\\\\[${ENGINE}\"; \"i\")) | \" SKIP #\\(.number) \\(.title)\""
# Save skip list for reference during the run
SKIP_ISSUES=$(gh issue list \
--repo apecloud/kubeblocks-addons \
--label "skip-in-test" \
--state open \
--json number,title \
--jq "[.[] | select(.title | test(\"\\\\[${ENGINE}\"; \"i\")) | {number: .number, title: .title}]")
echo "$SKIP_ISSUES"
Before each test case, check whether its operation appears in $SKIP_ISSUES:
# Helper: returns issue number if this operation is a known skip, empty otherwise
function known_skip() {
local operation="$1"
echo "$SKIP_ISSUES" | python3 -c "
import sys, json
issues = json.load(sys.stdin)
op = '''$operation'''.lower()
for i in issues:
if op in i['title'].lower():
print(i['number'])
break
"
}
# Usage before any test case:
ISSUE=$(known_skip "full-backup")
if [[ -n "$ISSUE" ]]; then
echo "SKIPPED — known issue #${ISSUE} (skip-in-test)"
else
# run the test
fi
Rules:
- Issue OPEN + label `skip-in-test` → mark `SKIPPED (known #XXXX)`, do not attempt.
- Issue CLOSED → label is gone → test runs automatically next time.
- To file a new known issue: `gh issue create --label "skip-in-test,bug" --title "[<engine>] <Operation>: ..."`.
- No extra files needed anywhere — GitHub Issues is the single source of truth.
ENGINE=<engine>
helm template test-addon addons/$ENGINE | python3 -c "
import sys, yaml
for doc in yaml.safe_load_all(sys.stdin):
if doc and doc.get('kind') == 'ClusterDefinition':
for t in doc.get('spec', {}).get('topologies', []):
marker = ' [default]' if t.get('default') else ''
print(f'Topology: {t[\"name\"]}{marker} replicas suggestion: {t.get(\"replicas\", \"?\")}')
for c in t.get('components', []):
print(f' component: {c[\"name\"]} compDef: {c[\"compDef\"]}')
for s in t.get('shardings', []):
print(f' sharding: {s[\"name\"]}')
"
Before generating cluster YAMLs, read the replicasLimit from every deployed ComponentDefinition for this engine. Use minReplicas as the lower bound when setting replicas for each component.
ENGINE=<engine>
kubectl get componentdefinition -o json | python3 -c "
import sys, json
data = json.load(sys.stdin)
engine = sys.argv[1]
for item in data.get('items', []):
name = item['metadata']['name']
if not name.startswith(engine):
continue
rl = item.get('spec', {}).get('replicasLimit') or {}
min_r = rl.get('minReplicas', 1)
max_r = rl.get('maxReplicas', 16384)
print(f'{name} minReplicas={min_r} maxReplicas={max_r}')
" "$ENGINE"
Record the per-component minimums. For each component in every cluster YAML: set replicas = max(1, minReplicas). If minReplicas > 1, a topology using fewer replicas will hit PreCheckFailed before any pod is created.
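For the arithmetic itself, a one-liner suffices. A sketch, where `MIN_REPLICAS` stands in for the value read in the query above:

```bash
MIN_REPLICAS=2                                        # example value from Step 2b
REPLICAS=$(( MIN_REPLICAS > 1 ? MIN_REPLICAS : 1 ))   # replicas = max(1, minReplicas)
echo "replicas: $REPLICAS"                            # use this value in the cluster YAML
```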
ENGINE=<engine>
kubectl get componentversion $ENGINE -o json | python3 -c "
import sys, json
cv = json.load(sys.stdin)
releases = cv.get('spec', {}).get('releases', [])
for r in sorted(releases, key=lambda r: r['serviceVersion']):
print(r['serviceVersion'])
" 2>/dev/null || echo "ComponentVersion $ENGINE not found"
If a specific version was requested in arguments, test only that version.
Before deploying test clusters, check whether the container images actually exist in the configured registry ($IMAGE_REGISTRY, set in Step 0).
ENGINE=<engine>
for VERSION in <versions-to-test>; do
echo -n "${IMAGE_REGISTRY}/apecloud/${ENGINE}:${VERSION} ... "
if skopeo inspect "docker://${IMAGE_REGISTRY}/apecloud/${ENGINE}:${VERSION}" \
--override-arch "${NODE_ARCH}" --override-os "${NODE_OS}" \
$SKOPEO_CREDS 2>/dev/null 1>/dev/null; then
echo "EXISTS"
else
echo "MISSING — will skip"
fi
done
For MISSING images: Record as "Image not in registry — test skipped". Do NOT change version tags. Proceed with remaining versions.
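To carry that decision forward, one option is to accumulate only the testable versions. A sketch, assuming `VERSIONS` holds the Step 3 output:

```bash
TESTABLE=()
for VERSION in $VERSIONS; do
  if skopeo inspect "docker://${IMAGE_REGISTRY}/apecloud/${ENGINE}:${VERSION}" \
       --override-arch "${NODE_ARCH}" --override-os "${NODE_OS}" \
       $SKOPEO_CREDS >/dev/null 2>&1; then
    TESTABLE+=("$VERSION")
  else
    echo "SKIP ${VERSION}: image not in registry"
  fi
done
echo "Versions to test: ${TESTABLE[*]:-none}"
```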
Some engines require more memory than the generic template defaults (memory: 512Mi).
Per-component overrides are documented in docs/engine-hints/<engine>.md.
Read that file in Step 0b and apply any listed minimums when generating cluster YAMLs.
Run the following for each (topology, version) combination where the image exists. Track every result as PASSED / FAILED / SKIPPED / Not implemented.
Set these variables once at the start of each test session:
KB_POD=$(kubectl get pods -n kb-system --no-headers 2>/dev/null \
| grep "^kubeblocks-[^d]" | awk '{print $1}' | head -1)
echo "KB operator pod: $KB_POD"
Timeout reference by operation type:
| Operation | Expected | Timeout | Check KB logs if stuck > |
|---|---|---|---|
| Create cluster (multi-node) | 120-180s | 240s | 90s |
| Stop | 30-60s | 90s | 60s |
| Start | 60-120s | 180s | 90s |
| Restart (OpsRequest) | 60-90s | 120s | 60s |
| VerticalScaling | 60-120s | 180s | 90s |
| VolumeExpansion | 10-30s | 60s | 30s |
| HScaling Out/In | 60-120s | 180s | 60s |
| Upgrade | 120-240s | 300s | 120s |
| SwitchOver | 30-60s | 90s | 45s |
| Failover chaos (60s fault) | 60+60s | 180s | 150s |
| NoFailover chaos (60s fault) | 60+30s | 120s | 100s |
| Backup | 30-60s | 120s | 60s |
| Restore cluster | 60-120s | 180s | 90s |
OpsRequest wait helper — use this instead of bare kubectl wait or blind loops:
# wait_ops <opsrequest-name> <timeout-seconds>
function wait_ops() {
local NAME=$1 TIMEOUT=${2:-120}
for ((i=0; i<TIMEOUT; i+=10)); do
PHASE=$(kubectl get opsrequest $NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Succeed" ]] && echo "✓ $NAME Succeed in ${i}s" && return 0
[[ "$PHASE" == "Failed" ]] && echo "✗ $NAME Failed:" \
&& kubectl get opsrequest $NAME -o jsonpath='{.status.conditions[-1].message}' 2>/dev/null && echo "" && return 1
# Check KB logs at halfway point if still Running
if (( i == TIMEOUT/2 )); then
echo " [${i}s] ops=$PHASE — checking KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null \
| grep -E "ERROR|build error|$CLUSTER_NAME" | tail -8
else
(( i % 30 == 0 && i > 0 )) && echo " [${i}s] ops=$PHASE"
fi
sleep 10
done
echo "✗ $NAME timeout after ${TIMEOUT}s — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=30 2>/dev/null \
| grep -E "ERROR|build error|$CLUSTER_NAME" | tail -10
return 1
}
mkdir -p workspace/tests
ENGINE=<engine> TOPOLOGY=<topology> VERSION=<version>
CLUSTER_NAME="${ENGINE:0:10}-${TOPOLOGY:0:8}"
cat > workspace/tests/${ENGINE}-${TOPOLOGY}-test.yaml << EOF
apiVersion: apps.kubeblocks.io/v1
kind: Cluster
metadata:
name: ${CLUSTER_NAME}
namespace: default
spec:
terminationPolicy: Delete
clusterDef: ${ENGINE}
topology: ${TOPOLOGY}
componentSpecs:
# One entry per component in this topology (names from Step 2).
# Set replicas = max(1, minReplicas) from Step 2b — using fewer than minReplicas
# causes PreCheckFailed before any pod is created.
- name: <component-name>
serviceVersion: "${VERSION}"
replicas: <replicas> # from Step 2b: max(1, minReplicas for this component)
resources:
limits: { cpu: "0.5", memory: "512Mi" }
requests: { cpu: "0.1", memory: "256Mi" }
volumeClaimTemplates:
- name: data
spec:
accessModes: [ReadWriteOnce]
storageClassName: ""
resources:
requests:
storage: 20Gi
EOF
kubectl apply -f workspace/tests/${ENGINE}-${TOPOLOGY}-test.yaml
Wait for Running (timeout 240s — check KB operator logs if not Running by 90s):
CLUSTER_NAME=<cluster-name>
KB_POD=$(kubectl get pods -n kb-system --no-headers 2>/dev/null | grep "^kubeblocks-[^d]" | awk '{print $1}' | head -1)
TIMEOUT=240; INTERVAL=10; PHASE=""
for ((i=0; i<TIMEOUT; i+=INTERVAL)); do
PHASE=$(kubectl get cluster "$CLUSTER_NAME" -o jsonpath='{.status.phase}' 2>/dev/null)
echo " [${i}s] phase=${PHASE:-unknown}"
[[ "$PHASE" == "Running" ]] && echo "✓ Running" && break
POD_TABLE=$(kubectl get pods -l "app.kubernetes.io/instance=$CLUSTER_NAME" --no-headers 2>/dev/null)
if echo "$POD_TABLE" | grep -qE 'CrashLoopBackOff|ImagePullBackOff|ErrImagePull|CreateContainerConfigError'; then
echo "✗ Unrecoverable pod failure:"; echo "$POD_TABLE"; PHASE="PodFailed"; break
fi
if (( i == 90 )); then
echo "⚠ Still Creating at 90s — checking KB operator logs:"
kubectl logs $KB_POD -n kb-system --tail=30 2>/dev/null \
| grep -E "ERROR|build error|$CLUSTER_NAME" | tail -10
echo "$POD_TABLE"
fi
sleep $INTERVAL
done
[[ "$PHASE" != "Running" ]] && echo "✗ Did not reach Running (last: $PHASE)"
Note: KubeBlocks v1 clusters have NO "Failed" or "Error" phase. Detect failures at the pod level.
# Stop — MUST use --type=json to patch only the stop field.
# --type=merge replaces the entire componentSpecs array, wiping serviceVersion/replicas/resources.
# Find the array index for each component first (0-based).
kubectl patch cluster $CLUSTER_NAME --type=json \
-p='[{"op":"add","path":"/spec/componentSpecs/0/stop","value":true}]'
# For multiple components add one op per index:
# -p='[{"op":"add","path":"/spec/componentSpecs/0/stop","value":true},{"op":"add","path":"/spec/componentSpecs/1/stop","value":true}]'
# Wait for Stopped — typical: 30-60s, timeout 90s
for i in {1..18}; do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Stopped" ]] && echo "✓ Stopped in $((i*5))s" && break
(( i == 12 )) && echo "⚠ Still not Stopped at 60s — check KB logs:" \
&& kubectl logs $KB_POD -n kb-system --tail=10 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
sleep 5
done
# Start (reverse)
kubectl patch cluster $CLUSTER_NAME --type=json \
-p='[{"op":"replace","path":"/spec/componentSpecs/0/stop","value":false}]'
# Wait for Running — typical: 60-120s, timeout 180s
for i in {1..36}; do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Running in $((i*5))s" && break
(( i == 18 )) && echo "⚠ Still not Running at 90s — check KB logs:" \
&& kubectl logs $KB_POD -n kb-system --tail=10 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
sleep 5
done
# Restart entire cluster
kubectl annotate cluster $CLUSTER_NAME \
kubeblocks.io/restart="$(date +%s)" --overwrite
kubectl wait cluster $CLUSTER_NAME --for=jsonpath='{.status.phase}'=Running --timeout=120s \
|| { echo "⚠ Restart timeout — KB logs:"; kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -10; }
# Restart specific component
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: restart-<component>-$(date +%s)
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: Restart
restart:
- componentName: <component>
EOF
# Update TerminationPolicy — scalar field at spec level, --type=merge is safe here
kubectl patch cluster $CLUSTER_NAME --type=merge \
-p '{"spec":{"terminationPolicy":"WipeOut"}}'
# Revert TerminationPolicy
kubectl patch cluster $CLUSTER_NAME --type=merge \
-p '{"spec":{"terminationPolicy":"Delete"}}'
TS=$(date +%s)
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: vscale-${TS}
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: VerticalScaling
verticalScaling:
- componentName: <component>
requests: { cpu: "0.2", memory: "256Mi" }
limits: { cpu: "1", memory: "1Gi" }
EOF
wait_ops vscale-${TS} 180
TS=$(date +%s)
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: volexp-${TS}
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: VolumeExpansion
volumeExpansion:
- componentName: <component>
volumeClaimTemplates:
- name: data
storage: "21Gi"
EOF
wait_ops volexp-${TS} 60
Note: The `replicas` field does NOT exist in `horizontalScaling` (KB v1 API). Use `scaleOut.replicaChanges` to add replicas and `scaleIn.replicaChanges` to remove them. Single-node topologies: mark HScale Out/In as N/A — adding or removing nodes is architecturally unsupported.

Scale-In on distributed data nodes: The controller will log `wait to delete "Pod/<name>" in Component: <comp>` and appear stuck. This is correct — KubeBlocks is waiting for the member-leave lifecycle action to finish relocating data before deleting the pod. This can take several minutes even on a fresh cluster. Do not interpret this as a hang. Check engine hints for the expected timeout.
# Scale Out (increase replicas by N)
TS=$(date +%s)
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: hscale-out-${TS}
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: HorizontalScaling
horizontalScaling:
- componentName: <component>
scaleOut:
replicaChanges: <N>
EOF
wait_ops hscale-out-${TS} 180
# Scale In (decrease replicas by N)
TS=$(date +%s)
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: hscale-in-${TS}
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: HorizontalScaling
horizontalScaling:
- componentName: <component>
scaleIn:
replicaChanges: <N>
EOF
wait_ops hscale-in-${TS} 600 # distributed data nodes may need data relocation before pod deletion — allow up to 600s
# Take specific instance offline (by pod name suffix)
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: hscale-offline-$(date +%s)
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: HorizontalScaling
horizontalScaling:
- componentName: <component>
offlineInstancesToOnline: []
onlineInstancesToOffline: ["<cluster-name>-<component>-N"]
EOF
# Bring it back online
# Swap offlineInstancesToOnline / onlineInstancesToOffline values
TS=$(date +%s)
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: rebuild-${TS}
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: RebuildInstance
rebuildFrom:
- componentName: <component>
instances:
- name: <cluster-name>-<component>-N
EOF
wait_ops rebuild-${TS} 300
# Service version upgrade (forward)
TS=$(date +%s)
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: upgrade-${TS}
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: Upgrade
upgrade:
components:
- componentName: <component>
serviceVersion: "<target-version>"
EOF
wait_ops upgrade-${TS} 300
# Downgrade (same OpsRequest type, earlier version)
Test both upgrade paths (forward to latest, backward to original) per the report pattern.
Engine constraint (not a bug): Some engines prohibit in-place downgrades at the data layer — the node will refuse to start if on-disk data is from a newer version. Mark such downgrades as `N/A (Engine Constraint)`, not FAILED. Check engine hints for this engine's specific upgrade constraints.
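Where the engine permits it, the backward path is the same OpsRequest with the original version. A sketch, where `ORIG_VERSION` is an assumed variable recorded before the forward upgrade:

```bash
TS=$(date +%s)
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
  name: downgrade-${TS}
  namespace: default
spec:
  clusterName: $CLUSTER_NAME
  type: Upgrade
  upgrade:
    components:
    - componentName: <component>
      serviceVersion: "${ORIG_VERSION}"   # assumed: captured at cluster creation
EOF
wait_ops downgrade-${TS} 300
```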
Switchover field semantics:
| Field | Meaning |
|---|---|
| `instanceName` | The current primary pod to be demoted. Must be the actual pod name — `"*"` does NOT work (the controller treats it as a literal pod name and fails with "Pod not found"). |
| `candidateName` | (Optional) The target pod to be promoted. Omit to let Syncer choose automatically. |
Important: instanceName must be the explicit pod name of the current primary. If it points to a secondary,
the switchover action executes on all pods but every pod's script sees KB_SWITCHOVER_ROLE=secondary
and exits with 0 (no-op). The OpsRequest will report Succeed even though no switchover occurred.
This is expected KubeBlocks behavior (not a bug) — the script's role check is a safety guard.
# Step 1: Find the current primary pod
PRIMARY_POD=$(kubectl get pods -l app.kubernetes.io/instance=$CLUSTER_NAME \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.kubeblocks\.io/role}{"\n"}{end}' \
| grep primary | awk '{print $1}')
# Step 2: Issue switchover with explicit primary pod name
TS=$(date +%s)
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: switchover-${TS}
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: Switchover
switchover:
- componentName: <component>
instanceName: "$PRIMARY_POD" # must be the actual primary pod name
EOF
wait_ops switchover-${TS} 90
# With explicit candidate (optional — omit candidateName to let Syncer choose)
# switchover:
# - componentName: <component>
# instanceName: "$PRIMARY_POD" # current primary to demote
# candidateName: "<cluster>-<component>-1" # target to promote
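Because a role-guard no-op still reports Succeed, re-read the role labels after the OpsRequest. A sketch reusing the lookup above, assuming Syncer keeps the `kubeblocks.io/role` label current:

```bash
NEW_PRIMARY=$(kubectl get pods -l app.kubernetes.io/instance=$CLUSTER_NAME \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.kubeblocks\.io/role}{"\n"}{end}' \
  | grep primary | awk '{print $1}')
if [[ -n "$NEW_PRIMARY" && "$NEW_PRIMARY" != "$PRIMARY_POD" ]]; then
  echo "✓ Primary moved: $PRIMARY_POD -> $NEW_PRIMARY"
else
  echo "✗ Primary unchanged ($PRIMARY_POD): switchover may have been a role-guard no-op"
fi
```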
Chaos faults that trigger a failover — the HA mechanism must re-elect a new primary and the cluster should return to Running with data intact.
Single-node topology: These tests verify Kubernetes pod restart recovery only — there is no application-level HA election. Mark results as "K8s pod restart recovery (not HA failover)" and note the topology. For true failover validation, use multi-node topology with 3+ replicas.
Requires ChaosMesh. $CHAOS_NS was resolved in Step 1 (install performed automatically if absent). If still empty, mark all Failover and NoFailover tests as SKIPPED.
CLUSTER_NAME=<cluster-name>
COMPONENT=<component>
NAMESPACE=default
# Identify the current primary pod
PRIMARY_POD=$(kubectl get pods -l "app.kubernetes.io/instance=$CLUSTER_NAME" \
-o jsonpath='{.items[0].metadata.name}') # adjust selector for primary role if applicable
# Reuse the namespace resolved in Step 1; fall back only if it is unset
CHAOS_NS=${CHAOS_NS:-chaos-testing}   # or chaos-mesh
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: chaos-pod-kill-$(date +%s)
namespace: $CHAOS_NS
spec:
action: pod-kill
mode: one
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
app.kubernetes.io/component: "$COMPONENT"
gracePeriod: 0
EOF
# Wait for cluster to recover to Running
for ((i=0; i<=120; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 60 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 120 )) && echo "✗ No recovery after 120s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: chaos-kill1-$(date +%s)
namespace: $CHAOS_NS
spec:
action: container-kill
mode: one
containerNames: ["<main-container-name>"]
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
EOF
for ((i=0; i<=120; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 60 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 120 )) && echo "✗ No recovery after 120s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: chaos-pod-failure-$(date +%s)
namespace: $CHAOS_NS
spec:
action: pod-failure
mode: one
duration: "60s"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
EOF
for ((i=0; i<=120; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 60 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 120 )) && echo "✗ No recovery after 120s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
name: chaos-oom-$(date +%s)
namespace: $CHAOS_NS
spec:
mode: one
duration: "30s"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
stressors:
memory:
workers: 1
size: "512MB"
EOF
for ((i=0; i<=120; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 60 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 120 )) && echo "✗ No recovery after 120s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
name: chaos-cpu-$(date +%s)
namespace: $CHAOS_NS
spec:
mode: one
duration: "60s"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
stressors:
cpu:
workers: 2
load: 100
EOF
for ((i=0; i<=120; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 60 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 120 )) && echo "✗ No recovery after 120s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: chaos-netloss-$(date +%s)
namespace: $CHAOS_NS
spec:
action: loss
mode: one
duration: "60s"
loss:
loss: "100"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
EOF
for ((i=0; i<=120; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 60 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 120 )) && echo "✗ No recovery after 120s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: chaos-netcorrupt-$(date +%s)
namespace: $CHAOS_NS
spec:
action: corrupt
mode: one
duration: "60s"
corrupt:
corrupt: "60"
correlation: "25"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
EOF
for ((i=0; i<=120; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 60 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 120 )) && echo "✗ No recovery after 120s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: chaos-netbw-$(date +%s)
namespace: $CHAOS_NS
spec:
action: bandwidth
mode: one
duration: "60s"
bandwidth:
rate: "1mbps"
limit: 100
buffer: 10000
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
EOF
for ((i=0; i<=120; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 60 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 120 )) && echo "✗ No recovery after 120s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: chaos-netdelay-$(date +%s)
namespace: $CHAOS_NS
spec:
action: delay
mode: one
duration: "60s"
delay:
latency: "500ms"
correlation: "25"
jitter: "50ms"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
EOF
for ((i=0; i<=120; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 60 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 120 )) && echo "✗ No recovery after 120s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: chaos-partition-$(date +%s)
namespace: $CHAOS_NS
spec:
action: partition
mode: one
duration: "60s"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
EOF
for ((i=0; i<=120; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 60 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 120 )) && echo "✗ No recovery after 120s — last phase: ${PHASE}" && break
sleep 10
done
kubectl delete pods -l "app.kubernetes.io/instance=$CLUSTER_NAME" --force --grace-period=0
for ((i=0; i<=150; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 75 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 150 )) && echo "✗ No recovery after 150s — last phase: ${PHASE}" && break
sleep 10
done
These faults should not trigger failover. The cluster may be temporarily degraded but should recover without leader election.
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: chaos-netdup-$(date +%s)
namespace: $CHAOS_NS
spec:
action: duplicate
mode: one
duration: "60s"
duplicate:
duplicate: "60"
correlation: "25"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
EOF
for ((i=0; i<=90; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 40 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 90 )) && echo "✗ No recovery after 90s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
name: chaos-dnserr-$(date +%s)
namespace: $CHAOS_NS
spec:
action: error
mode: one
duration: "60s"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
EOF
for ((i=0; i<=90; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 40 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 90 )) && echo "✗ No recovery after 90s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
name: chaos-dnsrandom-$(date +%s)
namespace: $CHAOS_NS
spec:
action: random
mode: one
duration: "60s"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
EOF
for ((i=0; i<=90; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 40 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 90 )) && echo "✗ No recovery after 90s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
name: chaos-timeoffset-$(date +%s)
namespace: $CHAOS_NS
spec:
mode: one
duration: "60s"
timeOffset: "-1h"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
EOF
for ((i=0; i<=90; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 40 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 90 )) && echo "✗ No recovery after 90s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
name: chaos-connstress-$(date +%s)
namespace: $CHAOS_NS
spec:
mode: one
duration: "60s"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
stressors:
cpu:
workers: 1
load: 50
EOF
for ((i=0; i<=90; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 40 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 90 )) && echo "✗ No recovery after 90s — last phase: ${PHASE}" && break
sleep 10
done
Cleanup chaos objects after each test:
# The chaos manifests above carry no instance label in metadata (only in their
# selectors), so delete by namespace rather than by label:
kubectl delete networkchaos,podchaos,stresschaos,dnschaos,timechaos \
--all -n $CHAOS_NS 2>/dev/null || true
The backup method depends on the engine. Common methods from the regression report:
| Engine | Backup Methods |
|---|---|
| MySQL (8.0 / 5.7) | xtrabackup, xtrabackup-inc (incremental), volume-snapshot, Schedule xtrabackup |
| PostgreSQL | wal-g, pg-basebackup, volume-snapshot, Schedule pg-basebackup |
| MongoDB | pbm-physical, dump, datafile, volume-snapshot, Schedule pbm-physical |
| Redis | datafile, volume-snapshot, Schedule datafile, aof |
| Redis Cluster | datafile, Schedule datafile |
| Kafka | topics |
| Qdrant | datafile, Schedule datafile |
| Etcd | datafile |
| Elasticsearch | es-dump, full-backup (⚠ full-backup fails with JavaClassNotFoundException in ES 7.x — use es-dump) |
| Milvus | full, volume-snapshot |
| Clickhouse | full |
| TDengine | dump |
| Kingbase | full |
| GaussDB | gaussdb-roach |
| Oracle | oracle-rman |
| OceanBase Ent | full, full-for-rebuild |
| MSSQL | full |
| Doris | full |
# Create backup
TS=$(date +%s)
cat <<EOF | kubectl apply -f -
apiVersion: dataprotection.kubeblocks.io/v1alpha1
kind: Backup
metadata:
name: backup-test-${TS}
namespace: default
spec:
backupMethod: <method> # e.g. xtrabackup, wal-g, datafile
backupPolicyName: <cluster-name>-<component>-backup-policy
EOF
# Wait for backup Completed
kubectl wait backup backup-test-${TS} --for=jsonpath='{.status.phase}'=Completed --timeout=300s
# Restore from backup
cat <<EOF | kubectl apply -f -
apiVersion: apps.kubeblocks.io/v1
kind: Cluster
metadata:
name: <cluster-name>-restore
namespace: default
annotations:
kubeblocks.io/restore-from-backup: '{"<component>":{"name":"backup-test-<ts>","namespace":"default","volumeRestorePolicy":"Parallel"}}'
spec:
terminationPolicy: Delete
clusterDef: <engine>
topology: <topology>
componentSpecs:
- name: <component>
serviceVersion: "<version>"
replicas: 1
resources:
limits: { cpu: "0.5", memory: "512Mi" }
requests: { cpu: "0.1", memory: "256Mi" }
volumeClaimTemplates:
- name: data
spec:
accessModes: [ReadWriteOnce]
storageClassName: ""
resources:
requests:
storage: 20Gi
EOF
kubectl wait cluster <cluster-name>-restore --for=jsonpath='{.status.phase}'=Running --timeout=300s
# Cleanup restore cluster
kubectl delete cluster <cluster-name>-restore
MSSQL restore override: MSSQL requires extra volumes (certificate Secret) that raw YAML cannot provide.
Use helm install instead — see engine-hints:mssql "Restore from Backup" section for the exact command.
The restoreFrom Helm value renders the kubeblocks.io/restore-from-backup annotation automatically.
# MSSQL restore example (use this instead of raw YAML above)
helm install <restore-name> addons-cluster/mssql \
--set version=<version> \
--set replicas=<N> \
--set cpu=<cpu> --set memory=<memory> --set storage=<storage> \
--set extra.terminationPolicy=Delete \
--set-json 'restoreFrom="{\"mssql\":{\"name\":\"<backup-name>\",\"namespace\":\"default\",\"volumeRestorePolicy\":\"Parallel\"}}"'
# Cleanup: helm uninstall <restore-name>
cat <<EOF | kubectl apply -f -
apiVersion: dataprotection.kubeblocks.io/v1alpha1
kind: BackupSchedule
metadata:
name: sched-backup-$(date +%s)
namespace: default
spec:
backupPolicyName: <cluster-name>-<component>-backup-policy
schedules:
- backupMethod: <method>
cronExpression: "*/5 * * * *" # every 5 min for testing
enabled: true
retentionPeriod: 1h
EOF
# Wait for at least one backup to complete, then delete schedule
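# Sketch of that wait (assumption: scheduled Backup names embed the schedule
# name; adjust the grep to the names your DataProtection version generates):
for ((i=0; i<360; i+=30)); do
  kubectl get backup --no-headers 2>/dev/null | grep "sched-backup" | grep -q Completed \
    && echo "✓ scheduled backup Completed" && break
  sleep 30
done
kubectl delete backupschedule sched-backup-<ts>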
# First ensure a full xtrabackup exists, then apply incremental backup,
# then restore incrementally. Only applicable to engines with xtrabackup-inc method.
Pre-check — ParametersDef is optional. Not every addon implements live reconfiguration. Check before attempting the test:
kubectl get parametersdef --no-headers 2>/dev/null | grep <engine>
| Result | Action | Report state |
|---|---|---|
| ParametersDef found | Run the test | PASSED/FAILED |
| ParametersDef missing, team plans to add it | Skip, file a Feature issue | N/A (ParametersDef not yet implemented) |
| ParametersDef missing, intentionally not supported | Skip, no issue needed | N/A (not applicable for this engine) |

Never mark `FAILED` just because ParametersDef is absent — that is a feature gap, not a broken feature. Only mark `FAILED` if ParametersDef exists but the OpsRequest errors out or times out.
# Find all ParametersDef resources matching this engine's component
PARAMS_DEF_NAMES=$(kubectl get parametersdef --no-headers 2>/dev/null | grep <engine> | awk '{print $1}')
if [[ -z "$PARAMS_DEF_NAMES" ]]; then
echo "No ParametersDef found for <engine> — marking Parameter tests as N/A"
# Mark both Reconfiguring Dynamic and Reconfiguring Static as N/A
# Skip to the next feature
fi
# For each ParametersDef, extract dynamic/static/immutable parameter lists
for PD_NAME in $PARAMS_DEF_NAMES; do
echo "=== ParametersDef: $PD_NAME ==="
DYNAMIC_PARAMS=$(kubectl get parametersdef $PD_NAME \
-o jsonpath='{.spec.dynamicParameters[*]}' 2>/dev/null)
STATIC_PARAMS=$(kubectl get parametersdef $PD_NAME \
-o jsonpath='{.spec.staticParameters[*]}' 2>/dev/null)
IMMUTABLE_PARAMS=$(kubectl get parametersdef $PD_NAME \
-o jsonpath='{.spec.immutableParameters[*]}' 2>/dev/null)
echo "Dynamic parameters ($(echo $DYNAMIC_PARAMS | wc -w | tr -d ' ')): $DYNAMIC_PARAMS"
echo "Static parameters ($(echo $STATIC_PARAMS | wc -w | tr -d ' ')): $STATIC_PARAMS"
echo "Immutable parameters ($(echo $IMMUTABLE_PARAMS | wc -w | tr -d ' ')): $IMMUTABLE_PARAMS"
done
Immutable parameters must NEVER be modified. They are read-only after initialization. Skip them entirely.
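A defensive filter keeps them out of the candidate pools entirely. A minimal sketch reusing the variables extracted above:

```bash
for P in $IMMUTABLE_PARAMS; do
  DYNAMIC_PARAMS=$(printf '%s\n' $DYNAMIC_PARAMS | grep -vx "$P" | tr '\n' ' ')
  STATIC_PARAMS=$(printf '%s\n' $STATIC_PARAMS | grep -vx "$P" | tr '\n' ' ')
done
echo "Candidates after removing immutable params: dynamic=($DYNAMIC_PARAMS) static=($STATIC_PARAMS)"
```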
For each parameter selected for testing, read the current value from the running configuration and generate a safe new value.
# Get the config template name and file name from the ParametersDef
TEMPLATE_NAME=$(kubectl get parametersdef $PD_NAME \
-o jsonpath='{.spec.templateName}' 2>/dev/null)
FILE_NAME=$(kubectl get parametersdef $PD_NAME \
-o jsonpath='{.spec.fileName}' 2>/dev/null)
# Read current config from the running pod's ConfigMap
# The ConfigMap is typically named: <cluster>-<component>-<template>
CONFIG_CM=$(kubectl get configmap -l app.kubernetes.io/instance=$CLUSTER_NAME \
--no-headers 2>/dev/null | grep "$TEMPLATE_NAME" | awk '{print $1}' | head -1)
if [[ -n "$CONFIG_CM" && -n "$FILE_NAME" ]]; then
# NOTE: if FILE_NAME contains dots (e.g. my.cnf), escape them for jsonpath: {.data.my\.cnf}
CURRENT_CONFIG=$(kubectl get configmap $CONFIG_CM \
-o jsonpath="{.data.${FILE_NAME}}" 2>/dev/null)
fi
Value generation strategy — for each parameter, determine a safe new value:
| Current Value Pattern | Strategy | Example |
|---|---|---|
| Integer (e.g. `100`, `4096`) | Increment by 1, or double if small (< 10) | `100 → 101`, `4 → 8` |
| Boolean (`on`/`off`, `yes`/`no`, `true`/`false`, `1`/`0`) | Toggle to the opposite | `yes → no` |
| Size with unit (`512MB`, `1G`, `2G`) | Double or halve (stay within reasonable bounds) | `512MB → 1024MB` |
| Float (e.g. `0.5`, `1.5`) | Increment by 0.1 | `0.5 → 0.6` |
| Unrecognized or unreadable | Skip this parameter, pick another | — |
Important: If the parameter's current value cannot be read or the generated value would be invalid, skip that parameter and randomly select another one from the remaining pool. Never send a value that could crash the database.
# Helper: generate a safe test value for a parameter
# Usage: generate_test_value <current_value>
# Outputs the new value to stdout. Exits 1 if cannot determine a safe value.
function generate_test_value() {
local CURRENT="$1"
if [[ -z "$CURRENT" ]]; then
return 1
fi
# Boolean toggle
case "$CURRENT" in
on) echo "off"; return 0 ;;
off) echo "on"; return 0 ;;
yes) echo "no"; return 0 ;;
no) echo "yes"; return 0 ;;
true) echo "false"; return 0 ;;
false) echo "true"; return 0 ;;
ON) echo "OFF"; return 0 ;;
OFF) echo "ON"; return 0 ;;
1) echo "0"; return 0 ;;
0) echo "1"; return 0 ;;
esac
# Size with unit (e.g. 512MB, 1G, 2G, 256K)
if [[ "$CURRENT" =~ ^([0-9]+)([KMGkmg][Bb]?)$ ]]; then
local NUM=${BASH_REMATCH[1]}
local UNIT=${BASH_REMATCH[2]}
local NEW_NUM=$((NUM + NUM / 2)) # increase by 50%
[[ $NEW_NUM -eq $NUM ]] && NEW_NUM=$((NUM + 1))
echo "${NEW_NUM}${UNIT}"; return 0
fi
# Integer
if [[ "$CURRENT" =~ ^[0-9]+$ ]]; then
local NEW_VAL=$((CURRENT + 1))
echo "$NEW_VAL"; return 0
fi
# Float
if [[ "$CURRENT" =~ ^[0-9]+\.[0-9]+$ ]]; then
local NEW_VAL=$(echo "$CURRENT + 0.1" | bc)
echo "$NEW_VAL"; return 0
fi
# Cannot determine a safe value
return 1
}
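A couple of spot-checks show the intended behavior:

```bash
generate_test_value "512MB"        # prints 768MB (50% increase)
generate_test_value "on"           # prints off
generate_test_value "some-string" || echo "no safe value: skip this parameter"
```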
Before any parameter modifications, record the pod restart counts and UIDs to detect restarts later.
# Record restart counts and pod UIDs for all pods of this component
kubectl get pods -l "app.kubernetes.io/instance=$CLUSTER_NAME,app.kubernetes.io/component=<component>" \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.uid}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
| tee /tmp/pod-state-before.txt
echo "--- Pre-test pod state recorded ---"
Randomly select up to 10 parameters from `dynamicParameters[]` (use all if fewer than 10 exist), then submit a single `Reconfiguring` OpsRequest with all selected parameters.
# Convert space-separated list to array and shuffle
DYNAMIC_ARRAY=($DYNAMIC_PARAMS)
DYNAMIC_COUNT=${#DYNAMIC_ARRAY[@]}
if [[ $DYNAMIC_COUNT -eq 0 ]]; then
echo "No dynamic parameters defined — marking Reconfiguring Dynamic as N/A"
# Report: N/A (no dynamic parameters defined in ParametersDef)
else
# Shuffle and pick up to 10
SHUFFLED=($(printf '%s\n' "${DYNAMIC_ARRAY[@]}" | shuf))
PICK_COUNT=10
(( DYNAMIC_COUNT < PICK_COUNT )) && PICK_COUNT=$DYNAMIC_COUNT
SELECTED_DYNAMIC=("${SHUFFLED[@]:0:$PICK_COUNT}")
echo "Selected $PICK_COUNT dynamic parameters for testing: ${SELECTED_DYNAMIC[*]}"
# Build parameters YAML block, skipping any where we can't generate a safe value
PARAMS_YAML=""
TESTED_DYNAMIC=()
TESTED_DYNAMIC_DESC=""
for PARAM in "${SELECTED_DYNAMIC[@]}"; do
# Read current value — try from ConfigMap or from the running pod
CURRENT_VAL=""
if [[ -n "$CURRENT_CONFIG" ]]; then
# Attempt to extract value from config file content (ini/conf/properties format)
CURRENT_VAL=$(echo "$CURRENT_CONFIG" | grep -E "^\s*${PARAM}\s*[=: ]" \
| head -1 | sed -E 's/^[^=:]*[=: ]\s*//' | tr -d '"' | tr -d "'" | xargs)
fi
NEW_VAL=$(generate_test_value "$CURRENT_VAL")
if [[ $? -ne 0 || -z "$NEW_VAL" ]]; then
echo " Skipping $PARAM — cannot determine safe test value (current=$CURRENT_VAL)"
continue
fi
echo " $PARAM: $CURRENT_VAL → $NEW_VAL"
PARAMS_YAML="${PARAMS_YAML} - key: ${PARAM}
value: \"${NEW_VAL}\"
"
TESTED_DYNAMIC+=("$PARAM=$NEW_VAL")
TESTED_DYNAMIC_DESC="${TESTED_DYNAMIC_DESC}, ${PARAM}=${NEW_VAL}"
done
TESTED_DYNAMIC_DESC="${TESTED_DYNAMIC_DESC#, }" # trim leading comma
if [[ ${#TESTED_DYNAMIC[@]} -eq 0 ]]; then
echo "Could not generate safe values for any dynamic parameter — marking as N/A"
else
# Snapshot pod state before dynamic test
kubectl get pods -l "app.kubernetes.io/instance=$CLUSTER_NAME,app.kubernetes.io/component=<component>" \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.uid}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
| tee /tmp/pod-state-before-dynamic.txt
TS=$(date +%s)
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: reconfig-dynamic-${TS}
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: Reconfiguring
reconfigures:
- componentName: <component>
parameters:
${PARAMS_YAML}
EOF
wait_ops reconfig-dynamic-${TS} 180
# --- Verify NO pod restart after dynamic parameter change ---
sleep 10 # brief stabilization
kubectl get pods -l "app.kubernetes.io/instance=$CLUSTER_NAME,app.kubernetes.io/component=<component>" \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.uid}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
| tee /tmp/pod-state-after-dynamic.txt
RESTART_DETECTED=false
while IFS=$'\t' read -r POD_NAME POD_UID RESTART_BEFORE; do
RESTART_AFTER=$(grep "^${POD_NAME}" /tmp/pod-state-after-dynamic.txt | cut -f3)
UID_AFTER=$(grep "^${POD_NAME}" /tmp/pod-state-after-dynamic.txt | cut -f2)
if [[ "$UID_AFTER" != "$POD_UID" ]]; then
echo " ✗ Pod $POD_NAME was recreated (UID changed: $POD_UID → $UID_AFTER)"
RESTART_DETECTED=true
elif [[ -n "$RESTART_AFTER" && "$RESTART_AFTER" -gt "$RESTART_BEFORE" ]]; then
echo " ✗ Pod $POD_NAME restart count increased ($RESTART_BEFORE → $RESTART_AFTER)"
RESTART_DETECTED=true
else
echo " ✓ Pod $POD_NAME — no restart (count=$RESTART_BEFORE, UID unchanged)"
fi
done < /tmp/pod-state-before-dynamic.txt
if [[ "$RESTART_DETECTED" == "true" ]]; then
echo "✗ UNEXPECTED: Pod restart detected after dynamic parameter change"
echo " Dynamic parameters should be hot-reloaded without restart."
echo " This indicates a possible misclassification in ParametersDef."
# Report: FAILED — pod restarted unexpectedly after dynamic param change
else
echo "✓ Reconfiguring Dynamic Succeed — no pod restart (${#TESTED_DYNAMIC[@]} params)"
# Report: PASSED
fi
fi
fi
Randomly select up to 2 parameters from `staticParameters[]` (use all if fewer than 2 exist), then submit a single `Reconfiguring` OpsRequest.
STATIC_ARRAY=($STATIC_PARAMS)
STATIC_COUNT=${#STATIC_ARRAY[@]}
if [[ $STATIC_COUNT -eq 0 ]]; then
echo "No static parameters defined — marking Reconfiguring Static as N/A"
# Report: N/A (no static parameters defined in ParametersDef)
else
# Shuffle and pick up to 2
SHUFFLED_STATIC=($(printf '%s\n' "${STATIC_ARRAY[@]}" | shuf))
PICK_STATIC=2
(( STATIC_COUNT < PICK_STATIC )) && PICK_STATIC=$STATIC_COUNT
SELECTED_STATIC=("${SHUFFLED_STATIC[@]:0:$PICK_STATIC}")
echo "Selected $PICK_STATIC static parameters for testing: ${SELECTED_STATIC[*]}"
# Build parameters YAML block
STATIC_PARAMS_YAML=""
TESTED_STATIC=()
TESTED_STATIC_DESC=""
for PARAM in "${SELECTED_STATIC[@]}"; do
CURRENT_VAL=""
if [[ -n "$CURRENT_CONFIG" ]]; then
CURRENT_VAL=$(echo "$CURRENT_CONFIG" | grep -E "^\s*${PARAM}\s*[=: ]" \
| head -1 | sed -E 's/^[^=:]*[=: ]\s*//' | tr -d '"' | tr -d "'" | xargs)
fi
NEW_VAL=$(generate_test_value "$CURRENT_VAL")
if [[ $? -ne 0 || -z "$NEW_VAL" ]]; then
echo " Skipping $PARAM — cannot determine safe test value (current=$CURRENT_VAL)"
continue
fi
echo " $PARAM: $CURRENT_VAL → $NEW_VAL"
STATIC_PARAMS_YAML="${STATIC_PARAMS_YAML} - key: ${PARAM}
value: \"${NEW_VAL}\"
"
TESTED_STATIC+=("$PARAM=$NEW_VAL")
TESTED_STATIC_DESC="${TESTED_STATIC_DESC}, ${PARAM}=${NEW_VAL}"
done
TESTED_STATIC_DESC="${TESTED_STATIC_DESC#, }"
if [[ ${#TESTED_STATIC[@]} -eq 0 ]]; then
echo "Could not generate safe values for any static parameter — marking as N/A"
else
# Snapshot pod state before static test
kubectl get pods -l "app.kubernetes.io/instance=$CLUSTER_NAME,app.kubernetes.io/component=<component>" \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.uid}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
| tee /tmp/pod-state-before-static.txt
TS=$(date +%s)
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: reconfig-static-${TS}
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: Reconfiguring
reconfigures:
- componentName: <component>
parameters:
${STATIC_PARAMS_YAML}
EOF
wait_ops reconfig-static-${TS} 300 # static params trigger rolling restart — allow more time
# --- Verify pods WERE restarted after static parameter change ---
sleep 15 # allow restart to complete
kubectl get pods -l "app.kubernetes.io/instance=$CLUSTER_NAME,app.kubernetes.io/component=<component>" \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.uid}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
| tee /tmp/pod-state-after-static.txt
RESTART_CONFIRMED=false
while IFS=$'\t' read -r POD_NAME POD_UID RESTART_BEFORE; do
RESTART_AFTER=$(grep "^${POD_NAME}" /tmp/pod-state-after-static.txt | cut -f3)
UID_AFTER=$(grep "^${POD_NAME}" /tmp/pod-state-after-static.txt | cut -f2)
if [[ "$UID_AFTER" != "$POD_UID" ]]; then
echo " ✓ Pod $POD_NAME was recreated (UID changed: $POD_UID → $UID_AFTER)"
RESTART_CONFIRMED=true
elif [[ -n "$RESTART_AFTER" && "$RESTART_AFTER" -gt "$RESTART_BEFORE" ]]; then
echo " ✓ Pod $POD_NAME restart count increased ($RESTART_BEFORE → $RESTART_AFTER)"
RESTART_CONFIRMED=true
else
echo " ? Pod $POD_NAME — no restart detected (count=$RESTART_BEFORE, UID unchanged)"
fi
done < /tmp/pod-state-before-static.txt
if [[ "$RESTART_CONFIRMED" == "true" ]]; then
echo "✓ Reconfiguring Static Succeed — pod restart confirmed (${#TESTED_STATIC[@]} params)"
# Report: PASSED
else
echo "✗ UNEXPECTED: No pod restart detected after static parameter change"
echo " Static parameters require a restart to take effect."
echo " This indicates a possible misclassification in ParametersDef."
# Report: FAILED — no restart after static param change
fi
fi
fi
| Scenario | Action |
|---|---|
| Fewer than 10 dynamic params available | Test all available, note the count in the report |
| Fewer than 2 static params available | Test all available, note the count in the report |
| Zero dynamic params in ParametersDef | Mark Reconfiguring Dynamic as N/A (no dynamic parameters defined) |
| Zero static params in ParametersDef | Mark Reconfiguring Static as N/A (no static parameters defined) |
| Cannot read current value for a param | Skip that parameter, randomly pick another from the pool |
| Cannot generate a safe value | Skip that parameter, randomly pick another from the pool |
| OpsRequest stays Running beyond timeout | wait_ops handles this — marks FAILED with KB log output |
| Dynamic param change triggers restart | Mark FAILED — likely a misclassification in ParametersDef |
| Static param change does NOT trigger restart | Mark FAILED — likely a misclassification in ParametersDef |
If `ParametersDef` is absent and you attempt the OpsRequest anyway, it will stay in `Running` indefinitely — cancel it with `kubectl delete opsrequest reconfig-<ts>` and mark the result `N/A`.
Expose strategy by component type:
- Components with roles (e.g. MySQL primary/secondary): use the OpsRequest approach with `roleSelector` to expose only the primary.
- Components without roles (check engine hints for which components this applies to): use the direct Service approach with the `apps.kubeblocks.io/component-name` label selector. The OpsRequest approach will hang if the component has no roles and a `roleSelector` is injected.

Check whether a component has roles before choosing the approach:
kubectl get componentdefinition <compdef-name> -o jsonpath='{.spec.roles[*].name}'
# Empty output → use direct Service approach
Approach A — OpsRequest (for components WITH roles):
# NOTE: the "switch" field is required but missing from the CRD schema validation.
# Use --validate=false to bypass client-side validation.
# Do NOT include roleSelector for components that have no roles defined.
TS=$(date +%s)
cat <<EOF | kubectl apply --validate=false -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: expose-enable-${TS}
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: Expose
expose:
- componentName: <component>
switch: Enable
services:
- name: internet
serviceType: LoadBalancer
annotations: {}
EOF
for i in {1..24}; do
PHASE=$(kubectl get opsrequest expose-enable-${TS} -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Succeed" ]] && echo "✓ Expose Enable Succeed" && break
[[ "$PHASE" == "Failed" ]] && echo "✗ Expose Enable Failed" && break
sleep 5
done
# Disable
TS=$(date +%s)
cat <<EOF | kubectl apply --validate=false -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: expose-disable-${TS}
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: Expose
expose:
- componentName: <component>
switch: Disable
services:
- name: internet
serviceType: LoadBalancer
EOF
for i in {1..24}; do
PHASE=$(kubectl get opsrequest expose-disable-${TS} -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Succeed" ]] && echo "✓ Expose Disable Succeed" && break
sleep 5
done
Approach B — Direct Service (for components WITHOUT roles):
# Enable: create LB service using apps.kubeblocks.io/component-name label
COMPONENT=<component> # e.g. master, data, mdit
SVC_NAME="${CLUSTER_NAME}-${COMPONENT}-internet"
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
name: ${SVC_NAME}
namespace: default
labels:
app.kubernetes.io/instance: ${CLUSTER_NAME}
apps.kubeblocks.io/component-name: ${COMPONENT}
spec:
type: LoadBalancer
selector:
app.kubernetes.io/instance: ${CLUSTER_NAME}
apps.kubeblocks.io/component-name: ${COMPONENT}
ports:
- name: <port-name> # e.g. http
port: <port> # e.g. 9200
targetPort: <port-name>
protocol: TCP
EOF
# Wait for external IP
for ((i=0; i<120; i+=5)); do
IP=$(kubectl get svc ${SVC_NAME} -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null)
[[ -n "$IP" ]] && echo "✓ LB ready: $IP" && break
sleep 5
done
# Disable: delete the service
kubectl delete svc ${SVC_NAME}
echo "✓ LB removed"
# Get connection credential secret
kubectl get secret -l app.kubernetes.io/instance=$CLUSTER_NAME -o name
kubectl get secret <cluster-name>-<component>-account-root -o jsonpath='{.data.password}' \
| base64 -d
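# (sketch) Some engines also store a username in the same secret; the
# "username" field is an assumption and varies by engine.
SECRET=$(kubectl get secret -l app.kubernetes.io/instance=$CLUSTER_NAME -o name | head -1)
DB_USER=$(kubectl get "$SECRET" -o jsonpath='{.data.username}' 2>/dev/null | base64 -d)
DB_PASS=$(kubectl get "$SECRET" -o jsonpath='{.data.password}' | base64 -d)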
# Engine-specific connection test (adapt per engine)
# MySQL example:
kubectl exec -it <pod-name> -- mysql -u root -p<password> -e "SELECT 1"
# PostgreSQL:
kubectl exec -it <pod-name> -- psql -U postgres -c "SELECT 1"
# Redis:
kubectl exec -it <pod-name> -- redis-cli PING
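# (sketch) When these run from a script instead of an interactive shell, -it
# fails with a TTY error; drop it for one-shot commands:
kubectl exec <pod-name> -- redis-cli PING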
# Use kbcli bench or engine-specific tool
# MySQL / PostgreSQL example with sysbench via kbcli:
kbcli bench sysbench $CLUSTER_NAME --component <component> \
--driver mysql --database mydb --tables 5 --table-size 10000 \
--duration 30 --threads 8
# Test against LB service (if Expose was enabled):
kbcli bench sysbench $CLUSTER_NAME --component <component> \
--driver mysql --host <lb-ip> --port 3306 \
--duration 30 --threads 8
kubectl delete cluster $CLUSTER_NAME
# terminationPolicy=Delete cleans up PVCs automatically
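# (sketch) Verify teardown: cluster object gone and PVC count drained to zero
kubectl get cluster $CLUSTER_NAME 2>/dev/null || echo "✓ cluster object deleted"
kubectl get pvc -l app.kubernetes.io/instance=$CLUSTER_NAME --no-headers 2>/dev/null | wc -l  # expect 0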
When an OpsRequest stays in Running indefinitely, or the cluster enters Abnormal phase with no obvious pod-level error, check the KubeBlocks operator logs — they surface controller-level errors that are invisible in pod events:
# Find the KB operator pod
kubectl get pods -n kb-system --no-headers | grep kubeblocks
# Search for errors related to your cluster
kubectl logs <kubeblocks-pod> -n kb-system --tail=500 \
| grep -E "ERROR|build error|$CLUSTER_NAME" \
| grep -v "replicas.*out-of-limit" \
| tail -30
# Common patterns and their meanings:
# "replicas 0 out-of-limit [1, 16384]" → a component's replicas was zeroed out
# → likely caused by --type=merge patch on componentSpecs array
# → fix: kubectl patch --type=json to restore correct replicas
#
# "not all component sub-resources deleted" → component is stuck deleting
# → check for finalizers: kubectl get component <name> -o jsonpath='{.metadata.finalizers}'
#
# "OpsRequest is forbidden when Cluster.status.phase=Updating"
# → wait for cluster to return to Running before submitting next OpsRequest
Rule: never use `--type=merge` on `spec.componentSpecs`. It replaces the entire array, zeroing out replicas/resources/volumeClaimTemplates for every component not included in the patch body. Always use `--type=json` for any field inside `componentSpecs`.
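For example, a sketch of a safe patch that changes one component's replicas without clobbering its siblings (the array index 0 is illustrative; match it to the component's position in `componentSpecs`):
# JSON patch: surgical edit of a single array element
kubectl patch cluster $CLUSTER_NAME --type=json \
  -p '[{"op": "replace", "path": "/spec/componentSpecs/0/replicas", "value": 2}]'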
After all tests finish, output the results directly as markdown (do NOT wrap in a code block). The rendered markdown will display as formatted headers and tables.
Use exactly this structure, filling in actual values for every placeholder:
Engine: <engine> (Topology = <topology>; Replicas = <N>)
Component Definition: <cmpd-name>
Component Version: <cv-name>
Service Version: <version>
| Feature | Operation | State | Description |
|---|---|---|---|
| Lifecycle | Create | PASSED | Create a cluster with topology <topology>, component definition <cmpd-name>, service version <version> |
| Lifecycle | Start | PASSED | Start the cluster |
| Lifecycle | Stop | PASSED | Stop the cluster |
| Lifecycle | Restart | PASSED | Restart the cluster |
| Lifecycle | Update | PASSED | Update TerminationPolicy to WipeOut |
| Scale | VerticalScaling | PASSED | VerticalScaling component <component> |
| Scale | VolumeExpansion | PASSED | VolumeExpansion component <component> |
| Scale | HorizontalScaling In | PASSED | HorizontalScaling In component <component> |
| Scale | HorizontalScaling Out | PASSED | HorizontalScaling Out component <component> |
| Scale | RebuildInstance | - | Not implemented or unsupported |
| Upgrade | Upgrade | PASSED | Upgrade component <component> from <v1> to <v2> |
| SwitchOver | SwitchOver | PASSED | SwitchOver component <component> |
| Failover | Kill 1 | PASSED | Simulates conditions where process 1 is killed |
| Failover | Pod Kill | PASSED | Simulates conditions where pods are killed |
| NoFailover | Connection Stress | PASSED | Simulates conditions where pods experience connection stress |
| Backup Restore | Backup | PASSED | <method> Backup |
| Backup Restore | Restore | PASSED | <method> Restore |
| Backup Restore | Delete Restore Cluster | PASSED | Delete the <method> restore cluster |
| Parameter | Reconfiguring Dynamic | PASSED | Reconfiguring dynamic parameters (N params): <param1>=<val1>, <param2>=<val2>, ... — no pod restart confirmed |
| Parameter | Reconfiguring Static | PASSED | Reconfiguring static parameters (N params): <param1>=<val1>, <param2>=<val2>, ... — pod restart confirmed |
| Accessibility | Expose | PASSED | Expose Enable internet service on component <component> |
| Accessibility | Connect | PASSED | Connect to the cluster |
| Stress | Bench | PASSED | Bench the cluster via component <component> |
All implemented operations: PASSED
Not implemented or unsupported: <list operations marked "-">
Images not yet in registry: <list any skipped versions, or "none">
State values — use exactly these tokens in the State column:
- `PASSED` — operation completed successfully
- `FAILED` — operation attempted but did not succeed
- `SKIPPED (known #N)` — skipped due to open GitHub issue #N with label skip-in-test
- `SKIPPED` — precondition not met (e.g., image missing, ChaosMesh not installed)
- `N/A` — architecturally not applicable for this topology or engine (e.g., HScale on single-node, downgrade on engine that forbids it)
- `-` — feature not yet implemented in this addon

After outputting the report, save it as a markdown file:
workspace/tests/<engine>-<topology>-report.md
The file must contain the complete report (everything from ## Instance Test Results through the ### Conclusion section). Create the workspace/tests/ directory if it does not already exist.
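A minimal sketch of the save step (the heredoc body stands in for the full rendered report):
mkdir -p workspace/tests
cat > "workspace/tests/${ENGINE}-<topology>-report.md" <<'EOF'
## Instance Test Results
...rendered report...
EOF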
When a test case FAILs due to a bug in the addon or its dependencies, file a GitHub issue at https://github.com/apecloud/kubeblocks-addons/issues.
Apply exactly one primary label:
| Label | When to use |
|---|---|
| Bug | Something is broken — wrong behavior, crash, error |
| Feature | New capability that does not exist yet |
| Improvement | Existing feature works but could be better (performance, UX, coverage) |
| Chore | Maintenance, dependency update, CI, cleanup |
| Document | Documentation missing or incorrect |
# Example: file a Bug and assign to a maintainer
gh issue create \
--repo apecloud/kubeblocks-addons \
--title "bug: <engine> <operation> fails with <error>" \
--label "Bug" \
--assignee <github-username> \
--body "$(cat <<'EOF'
## Summary
<one-line description>
## Trigger Path
<exact call chain that reproduces the error>
## Root Cause
<what is actually wrong>
## Fix
<suggested fix>
## Workaround
<how to avoid the issue until it is fixed, if any>
EOF
)"