test-instances
// Use when creating and validating KubeBlocks database Cluster test instances across standalone, replication, cluster, parameter, dynamic/static, recovery, and engine-specific topologies.
Reference resolution: when this source-derived skill mentions docs/..., resolve it from the shared support package beside the installed user skills: ~/.codex/skills/kubeblocks-addon-source-docs/docs/... for Codex or ~/.claude/skills/kubeblocks-addon-source-docs/docs/... for Claude Code. In the shared kubeblocks-addon-docs checkout, the same files live under skills/kubeblocks-addon-source-docs/docs/.... When it mentions scripts/..., resolve it from the same support package under scripts/.... If you are working inside a checkout of the original apecloud/kubeblocks-addon-skills, repo-relative paths are also valid.
Create database cluster instances to validate the deployed KubeBlocks addon, testing every topology.
Target: $ARGUMENTS
(Engine name and optional version — e.g., redis or redis 7.2.4. Version omitted → tests all available versions.)
SCRIPT_DIR="$(git rev-parse --show-toplevel 2>/dev/null || pwd)"
[ -f "$SCRIPT_DIR/.env" ] && source "$SCRIPT_DIR/.env"
[ -n "$KUBECONFIG" ] && export KUBECONFIG
# Parse arguments: ENGINE [VERSION]
ARGS=($ARGUMENTS)
ENGINE="${ARGS[0]:-}"
VERSION="${ARGS[1]:-}"
kubectl cluster-info --request-timeout=5s \
|| { echo "ERROR: kubectl cannot reach the cluster. Check KUBECONFIG in .env"; exit 1; }
echo "KUBECONFIG=${KUBECONFIG:-~/.kube/config}"
# Registry decision: probe docker.io; fall back to ALIYUN_IMAGE_REGISTRY if unreachable
if curl -s --connect-timeout 15 "https://registry-1.docker.io/v2/" -o /dev/null 2>/dev/null; then
IMAGE_REGISTRY="docker.io"
SKOPEO_CREDS="--no-creds"
elif [ -n "$ALIYUN_IMAGE_REGISTRY" ]; then
IMAGE_REGISTRY="${ALIYUN_IMAGE_REGISTRY}"
SKOPEO_CREDS="--creds ${ALIYUN_DOCKER_USERNAME}:${ALIYUN_DOCKER_PASSWORD}"
else
IMAGE_REGISTRY="docker.io"
SKOPEO_CREDS="--no-creds"
fi
echo "Image registry: ${IMAGE_REGISTRY}"
# Node arch/OS — used by skopeo to probe the correct manifest variant
NODE_ARCH=$(kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.architecture}' 2>/dev/null || echo "amd64")
NODE_OS=$(kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.operatingSystem}' 2>/dev/null || echo "linux")
echo "Node arch: ${NODE_ARCH} os: ${NODE_OS}"
Check for engine-specific overrides and constraints before proceeding:
HINTS_FILE="docs/engine-hints/${ENGINE}.md"
[ -f "$HINTS_FILE" ] && echo "=== Engine hints found: $HINTS_FILE ===" || echo "(no engine-specific hints for $ENGINE)"
If the hints file exists, read it now before proceeding — it may override resource limits, list unsupported operations, or describe engine-specific behaviors for this run.
For each engine, tests are organized by Feature → Operation, matching the official KubeBlocks v1.0 regression report format:
| Feature | Operations |
|---|---|
| Lifecycle | Create, Start, Stop, Restart (cluster + per-component), Update (TerminationPolicy WipeOut) |
| Scale | VerticalScaling, VolumeExpansion, HorizontalScaling In/Out, HscaleOfflineInstances, HscaleOnlineInstances, RebuildInstance |
| Upgrade | Service version upgrade (forward + backward) |
| SwitchOver | Promote, SwitchOver (per component) |
| Failover | ChaosMesh fault injection with expected HA recovery: Full CPU, Network Corrupt, OOM, Pod Kill, Kill 1, Network Loss, Network Delay, Pod Failure, Network Bandwidth, Network Partition, Delete Pod All |
| NoFailover | ChaosMesh fault injection without failover expected: DNS Error, Network Duplicate, DNS Random, Connection Stress, Time Offset |
| Backup Restore | Backup (xtrabackup / xtrabackup-inc / pbm-physical / wal-g / pg-basebackup / datafile / dump / full / volume-snapshot / topics), Schedule Backup/Restore, Restore, Restore Increment, Delete Restore Cluster |
| Parameter | Reconfiguring Dynamic (randomly select up to 10 dynamic params, verify no pod restart), Reconfiguring Static (randomly select up to 2 static params, verify pod restart triggered) |
| Accessibility | Expose Enable/Disable (internet/intranet), Connect |
| Stress | Bench (service + LB service), Tpch |
ChaosMesh is the chaos engineering tool used for Failover and NoFailover tests. Step 1 checks for `chaos-controller-manager` in `chaos-mesh` and `chaos-testing`, and installs ChaosMesh automatically if absent. The resolved namespace is stored in `$CHAOS_NS`.
Single-node topology limitations: Some operations are architecturally incompatible with single-node topologies and must be marked `N/A` without attempting:
- HorizontalScaling Out/In, HscaleOfflineInstances, HscaleOnlineInstances — single-node topologies cannot add or remove nodes; architecturally unsupported.
- SwitchOver — no secondary exists to promote.
- Failover (all 11 cases) — single-node has no HA election. Recovery is Kubernetes pod restart (restartPolicy), not application-level failover. Results should be labeled as "K8s pod restart recovery" rather than "HA failover". Test in multi-node topology for true failover validation.
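A quick guard can route these cases automatically. A minimal sketch, assuming pods carry the `app.kubernetes.io/instance` label used throughout this skill (`is_single_node` is a hypothetical helper name):

```bash
# Hypothetical helper: treat clusters with <= 1 pod as single-node and
# mark HA-dependent cases N/A instead of attempting them.
function is_single_node() {
  local count
  count=$(kubectl get pods -l "app.kubernetes.io/instance=$CLUSTER_NAME" \
    --no-headers 2>/dev/null | wc -l)
  (( count <= 1 ))
}

if is_single_node; then
  echo "Single-node topology: mark HScale/SwitchOver/Failover as N/A"
fi
```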
ENGINE=<engine>
# 1. ChaosMesh availability — verify controller is running; install if absent
CHAOS_NS=""
for NS in chaos-mesh chaos-testing; do
if kubectl get pods -n $NS --no-headers 2>/dev/null | grep -q "chaos-controller-manager"; then
CHAOS_NS=$NS
break
fi
done
if [[ -z "$CHAOS_NS" ]]; then
echo "ChaosMesh controller not found — installing via Helm..."
helm repo add chaos-mesh https://charts.chaos-mesh.org 2>/dev/null || true
helm repo update chaos-mesh
helm upgrade --install chaos-mesh chaos-mesh/chaos-mesh \
--namespace chaos-testing --create-namespace \
--set chaosDaemon.runtime=containerd \
--set chaosDaemon.socketPath=/run/containerd/containerd.sock \
--wait --timeout=120s \
&& CHAOS_NS=chaos-testing \
|| echo "ERROR: ChaosMesh install failed — chaos tests will be SKIPPED"
fi
if [[ -n "$CHAOS_NS" ]]; then
# Verify controller pod is Running
CM_POD=$(kubectl get pods -n $CHAOS_NS --no-headers 2>/dev/null \
| grep "chaos-controller-manager" | awk '{print $1}' | head -1)
CM_STATUS=$(kubectl get pod $CM_POD -n $CHAOS_NS \
-o jsonpath='{.status.phase}' 2>/dev/null)
echo "ChaosMesh controller: $CM_POD phase=$CM_STATUS namespace=$CHAOS_NS"
[[ "$CM_STATUS" != "Running" ]] && echo "WARNING: ChaosMesh controller not Running — chaos tests may fail"
fi
# 2. ClusterDefinition must be Available
PHASE=$(kubectl get clusterdefinition $ENGINE -o jsonpath='{.status.phase}' 2>/dev/null)
echo "ClusterDefinition phase: $PHASE"
# If not Available → run /deploy-addon first
# 3. addons-cluster chart must exist
ls addons-cluster/$ENGINE/Chart.yaml 2>/dev/null || echo "addons-cluster not found — cannot test"
Query GitHub for open issues labelled skip-in-test in apecloud/kubeblocks-addons.
Each matching issue represents a known failing test case — skip it this run; re-run automatically when the issue is closed.
Convention for issue titles: [<engine>] <Operation>: <short description>
Example: [elasticsearch] full-backup: JavaClassNotFoundException in es-agent 0.1.0
ENGINE=<engine>
# Fetch all open issues with skip-in-test label for this engine
echo "=== Known Issues (skip-in-test) ==="
gh issue list \
--repo apecloud/kubeblocks-addons \
--label "skip-in-test" \
--state open \
--json number,title \
--jq ".[] | select(.title | test(\"\\\\[${ENGINE}\"; \"i\")) | \" SKIP #\\(.number) \\(.title)\""
# Save skip list for reference during the run
SKIP_ISSUES=$(gh issue list \
--repo apecloud/kubeblocks-addons \
--label "skip-in-test" \
--state open \
--json number,title \
--jq "[.[] | select(.title | test(\"\\\\[${ENGINE}\"; \"i\")) | {number: .number, title: .title}]")
echo "$SKIP_ISSUES"
Before each test case, check whether its operation appears in $SKIP_ISSUES:
# Helper: returns issue number if this operation is a known skip, empty otherwise
function known_skip() {
local operation="$1"
echo "$SKIP_ISSUES" | python3 -c "
import sys, json
issues = json.load(sys.stdin)
op = '''$operation'''.lower()
for i in issues:
if op in i['title'].lower():
print(i['number'])
break
"
}
# Usage before any test case:
ISSUE=$(known_skip "full-backup")
if [[ -n "$ISSUE" ]]; then
echo "SKIPPED — known issue #${ISSUE} (skip-in-test)"
else
# run the test
fi
Rules:
- Issue OPEN + label `skip-in-test` → mark `SKIPPED (known #XXXX)`, do not attempt.
- Issue CLOSED → label is gone → test runs automatically next time.
- To file a new known issue: `gh issue create --label "skip-in-test,bug" --title "[<engine>] <Operation>: ..."`.
- No extra files needed anywhere — GitHub Issues is the single source of truth.
ENGINE=<engine>
helm template test-addon addons/$ENGINE | python3 -c "
import sys, yaml
for doc in yaml.safe_load_all(sys.stdin):
if doc and doc.get('kind') == 'ClusterDefinition':
for t in doc.get('spec', {}).get('topologies', []):
marker = ' [default]' if t.get('default') else ''
print(f'Topology: {t[\"name\"]}{marker} replicas suggestion: {t.get(\"replicas\", \"?\")}')
for c in t.get('components', []):
print(f' component: {c[\"name\"]} compDef: {c[\"compDef\"]}')
for s in t.get('shardings', []):
print(f' sharding: {s[\"name\"]}')
"
Before generating cluster YAMLs, read the replicasLimit from every deployed ComponentDefinition for this engine. Use minReplicas as the lower bound when setting replicas for each component.
ENGINE=<engine>
kubectl get componentdefinition -o json | python3 -c "
import sys, json
data = json.load(sys.stdin)
engine = sys.argv[1]
for item in data.get('items', []):
name = item['metadata']['name']
if not name.startswith(engine):
continue
rl = item.get('spec', {}).get('replicasLimit') or {}
min_r = rl.get('minReplicas', 1)
max_r = rl.get('maxReplicas', 16384)
print(f'{name} minReplicas={min_r} maxReplicas={max_r}')
" "$ENGINE"
Record the per-component minimums. For each component in every cluster YAML: set replicas = max(1, minReplicas). If minReplicas > 1, a topology using fewer replicas will hit PreCheckFailed before any pod is created.
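For the arithmetic itself, a one-liner suffices. A sketch, where `MIN_REPLICAS` stands in for the value read in the query above:

```bash
MIN_REPLICAS=2                                        # example value from Step 2b
REPLICAS=$(( MIN_REPLICAS > 1 ? MIN_REPLICAS : 1 ))   # replicas = max(1, minReplicas)
echo "replicas: $REPLICAS"                            # use this value in the cluster YAML
```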
ENGINE=<engine>
kubectl get componentversion $ENGINE -o json | python3 -c "
import sys, json
cv = json.load(sys.stdin)
releases = cv.get('spec', {}).get('releases', [])
for r in sorted(releases, key=lambda r: r['serviceVersion']):
print(r['serviceVersion'])
" 2>/dev/null || echo "ComponentVersion $ENGINE not found"
If a specific version was requested in arguments, test only that version.
Before deploying test clusters, check whether the container images actually exist in the configured registry ($IMAGE_REGISTRY, set in Step 0).
ENGINE=<engine>
for VERSION in <versions-to-test>; do
echo -n "${IMAGE_REGISTRY}/apecloud/${ENGINE}:${VERSION} ... "
if skopeo inspect "docker://${IMAGE_REGISTRY}/apecloud/${ENGINE}:${VERSION}" \
--override-arch "${NODE_ARCH}" --override-os "${NODE_OS}" \
$SKOPEO_CREDS 2>/dev/null 1>/dev/null; then
echo "EXISTS"
else
echo "MISSING — will skip"
fi
done
For MISSING images: Record as "Image not in registry — test skipped". Do NOT change version tags. Proceed with remaining versions.
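To carry that decision forward, one option is to accumulate only the testable versions. A sketch, assuming `VERSIONS` holds the Step 3 output:

```bash
TESTABLE=()
for VERSION in $VERSIONS; do
  if skopeo inspect "docker://${IMAGE_REGISTRY}/apecloud/${ENGINE}:${VERSION}" \
       --override-arch "${NODE_ARCH}" --override-os "${NODE_OS}" \
       $SKOPEO_CREDS >/dev/null 2>&1; then
    TESTABLE+=("$VERSION")
  else
    echo "SKIP ${VERSION}: image not in registry"
  fi
done
echo "Versions to test: ${TESTABLE[*]:-none}"
```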
Some engines require more memory than the generic template defaults (memory: 512Mi).
Per-component overrides are documented in docs/engine-hints/<engine>.md.
Read that file in Step 0b and apply any listed minimums when generating cluster YAMLs.
Run the following for each (topology, version) combination where the image exists. Track every result as PASSED / FAILED / SKIPPED / Not implemented.
Set these variables once at the start of each test session:
KB_POD=$(kubectl get pods -n kb-system --no-headers 2>/dev/null \
| grep "^kubeblocks-[^d]" | awk '{print $1}' | head -1)
echo "KB operator pod: $KB_POD"
Timeout reference by operation type:
| Operation | Expected | Timeout | Check KB logs if stuck > |
|---|---|---|---|
| Create cluster (multi-node) | 120-180s | 240s | 90s |
| Stop | 30-60s | 90s | 60s |
| Start | 60-120s | 180s | 90s |
| Restart (OpsRequest) | 60-90s | 120s | 60s |
| VerticalScaling | 60-120s | 180s | 90s |
| VolumeExpansion | 10-30s | 60s | 30s |
| HScaling Out/In | 60-120s | 180s | 60s |
| Upgrade | 120-240s | 300s | 120s |
| SwitchOver | 30-60s | 90s | 45s |
| Failover chaos (60s fault) | 60+60s | 180s | 150s |
| NoFailover chaos (60s fault) | 60+30s | 120s | 100s |
| Backup | 30-60s | 120s | 60s |
| Restore cluster | 60-120s | 180s | 90s |
OpsRequest wait helper — use this instead of bare kubectl wait or blind loops:
# wait_ops <opsrequest-name> <timeout-seconds>
function wait_ops() {
local NAME=$1 TIMEOUT=${2:-120}
for ((i=0; i<TIMEOUT; i+=10)); do
PHASE=$(kubectl get opsrequest $NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Succeed" ]] && echo "✓ $NAME Succeed in ${i}s" && return 0
[[ "$PHASE" == "Failed" ]] && echo "✗ $NAME Failed:" \
&& kubectl get opsrequest $NAME -o jsonpath='{.status.conditions[-1].message}' 2>/dev/null && echo "" && return 1
# Check KB logs at halfway point if still Running
if (( i == TIMEOUT/2 )); then
echo " [${i}s] ops=$PHASE — checking KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null \
| grep -E "ERROR|build error|$CLUSTER_NAME" | tail -8
else
(( i % 30 == 0 && i > 0 )) && echo " [${i}s] ops=$PHASE"
fi
sleep 10
done
echo "✗ $NAME timeout after ${TIMEOUT}s — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=30 2>/dev/null \
| grep -E "ERROR|build error|$CLUSTER_NAME" | tail -10
return 1
}
mkdir -p workspace/tests
ENGINE=<engine> TOPOLOGY=<topology> VERSION=<version>
CLUSTER_NAME="${ENGINE:0:10}-${TOPOLOGY:0:8}"
cat > workspace/tests/${ENGINE}-${TOPOLOGY}-test.yaml << EOF
apiVersion: apps.kubeblocks.io/v1
kind: Cluster
metadata:
name: ${CLUSTER_NAME}
namespace: default
spec:
terminationPolicy: Delete
clusterDef: ${ENGINE}
topology: ${TOPOLOGY}
componentSpecs:
# One entry per component in this topology (names from Step 2).
# Set replicas = max(1, minReplicas) from Step 2b — using fewer than minReplicas
# causes PreCheckFailed before any pod is created.
- name: <component-name>
serviceVersion: "${VERSION}"
replicas: <replicas> # from Step 2b: max(1, minReplicas for this component)
resources:
limits: { cpu: "0.5", memory: "512Mi" }
requests: { cpu: "0.1", memory: "256Mi" }
volumeClaimTemplates:
- name: data
spec:
accessModes: [ReadWriteOnce]
storageClassName: ""
resources:
requests:
storage: 20Gi
EOF
kubectl apply -f workspace/tests/${ENGINE}-${TOPOLOGY}-test.yaml
Wait for Running (timeout 240s — check KB operator logs if not Running by 90s):
CLUSTER_NAME=<cluster-name>
KB_POD=$(kubectl get pods -n kb-system --no-headers 2>/dev/null | grep "^kubeblocks-[^d]" | awk '{print $1}' | head -1)
TIMEOUT=240; INTERVAL=10; PHASE=""
for ((i=0; i<TIMEOUT; i+=INTERVAL)); do
PHASE=$(kubectl get cluster "$CLUSTER_NAME" -o jsonpath='{.status.phase}' 2>/dev/null)
echo " [${i}s] phase=${PHASE:-unknown}"
[[ "$PHASE" == "Running" ]] && echo "✓ Running" && break
POD_TABLE=$(kubectl get pods -l "app.kubernetes.io/instance=$CLUSTER_NAME" --no-headers 2>/dev/null)
if echo "$POD_TABLE" | grep -qE 'CrashLoopBackOff|ImagePullBackOff|ErrImagePull|CreateContainerConfigError'; then
echo "✗ Unrecoverable pod failure:"; echo "$POD_TABLE"; PHASE="PodFailed"; break
fi
if (( i == 90 )); then
echo "⚠ Still Creating at 90s — checking KB operator logs:"
kubectl logs $KB_POD -n kb-system --tail=30 2>/dev/null \
| grep -E "ERROR|build error|$CLUSTER_NAME" | tail -10
echo "$POD_TABLE"
fi
sleep $INTERVAL
done
[[ "$PHASE" != "Running" ]] && echo "✗ Did not reach Running (last: $PHASE)"
Note: KubeBlocks v1 clusters have NO "Failed" or "Error" phase. Detect failures at the pod level.
# Stop — MUST use --type=json to patch only the stop field.
# --type=merge replaces the entire componentSpecs array, wiping serviceVersion/replicas/resources.
# Find the array index for each component first (0-based).
kubectl patch cluster $CLUSTER_NAME --type=json \
-p='[{"op":"add","path":"/spec/componentSpecs/0/stop","value":true}]'
# For multiple components add one op per index:
# -p='[{"op":"add","path":"/spec/componentSpecs/0/stop","value":true},{"op":"add","path":"/spec/componentSpecs/1/stop","value":true}]'
# Wait for Stopped — typical: 30-60s, timeout 90s
for i in {1..18}; do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Stopped" ]] && echo "✓ Stopped in $((i*5))s" && break
(( i == 12 )) && echo "⚠ Still not Stopped at 60s — check KB logs:" \
&& kubectl logs $KB_POD -n kb-system --tail=10 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
sleep 5
done
# Start (reverse)
kubectl patch cluster $CLUSTER_NAME --type=json \
-p='[{"op":"replace","path":"/spec/componentSpecs/0/stop","value":false}]'
# Wait for Running — typical: 60-120s, timeout 180s
for i in {1..36}; do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Running in $((i*5))s" && break
(( i == 18 )) && echo "⚠ Still not Running at 90s — check KB logs:" \
&& kubectl logs $KB_POD -n kb-system --tail=10 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
sleep 5
done
# Restart entire cluster
kubectl annotate cluster $CLUSTER_NAME \
kubeblocks.io/restart="$(date +%s)" --overwrite
kubectl wait cluster $CLUSTER_NAME --for=jsonpath='{.status.phase}'=Running --timeout=120s \
|| { echo "⚠ Restart timeout — KB logs:"; kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -10; }
# Restart specific component
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: restart-<component>-$(date +%s)
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: Restart
restart:
- componentName: <component>
EOF
# Update TerminationPolicy — scalar field at spec level, --type=merge is safe here
kubectl patch cluster $CLUSTER_NAME --type=merge \
-p '{"spec":{"terminationPolicy":"WipeOut"}}'
# Revert TerminationPolicy
kubectl patch cluster $CLUSTER_NAME --type=merge \
-p '{"spec":{"terminationPolicy":"Delete"}}'
TS=$(date +%s)
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: vscale-${TS}
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: VerticalScaling
verticalScaling:
- componentName: <component>
requests: { cpu: "0.2", memory: "256Mi" }
limits: { cpu: "1", memory: "1Gi" }
EOF
wait_ops vscale-${TS} 180
TS=$(date +%s)
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: volexp-${TS}
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: VolumeExpansion
volumeExpansion:
- componentName: <component>
volumeClaimTemplates:
- name: data
storage: "21Gi"
EOF
wait_ops volexp-${TS} 60
Note: The `replicas` field does NOT exist in `horizontalScaling` (KB v1 API). Use `scaleOut.replicaChanges` to add replicas and `scaleIn.replicaChanges` to remove them. Single-node topologies: mark HScale Out/In as N/A — adding or removing nodes is architecturally unsupported.

Scale-In on distributed data nodes: The controller will log `wait to delete "Pod/<name>" in Component: <comp>` and appear stuck. This is correct — KubeBlocks is waiting for the member-leave lifecycle action to finish relocating data before deleting the pod. This can take several minutes even on a fresh cluster. Do not interpret this as a hang. Check engine hints for the expected timeout.
# Scale Out (increase replicas by N)
TS=$(date +%s)
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: hscale-out-${TS}
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: HorizontalScaling
horizontalScaling:
- componentName: <component>
scaleOut:
replicaChanges: <N>
EOF
wait_ops hscale-out-${TS} 180
# Scale In (decrease replicas by N)
TS=$(date +%s)
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: hscale-in-${TS}
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: HorizontalScaling
horizontalScaling:
- componentName: <component>
scaleIn:
replicaChanges: <N>
EOF
wait_ops hscale-in-${TS} 600 # distributed data nodes may need data relocation before pod deletion — allow up to 600s
# Take specific instance offline (by pod name suffix)
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: hscale-offline-$(date +%s)
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: HorizontalScaling
horizontalScaling:
- componentName: <component>
offlineInstancesToOnline: []
onlineInstancesToOffline: ["<cluster-name>-<component>-N"]
EOF
# Bring it back online
# Swap offlineInstancesToOnline / onlineInstancesToOffline values
TS=$(date +%s)
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: rebuild-${TS}
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: RebuildInstance
rebuildFrom:
- componentName: <component>
instances:
- name: <cluster-name>-<component>-N
EOF
wait_ops rebuild-${TS} 300
# Service version upgrade (forward)
TS=$(date +%s)
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: upgrade-${TS}
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: Upgrade
upgrade:
components:
- componentName: <component>
serviceVersion: "<target-version>"
EOF
wait_ops upgrade-${TS} 300
# Downgrade (same OpsRequest type, earlier version)
Test both upgrade paths (forward to latest, backward to original) per the report pattern.
Engine constraint (not a bug): Some engines prohibit in-place downgrades at the data layer — the node will refuse to start if on-disk data is from a newer version. Mark such downgrades as `N/A (Engine Constraint)`, not FAILED. Check engine hints for this engine's specific upgrade constraints.
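Where the engine permits it, the backward path is the same OpsRequest with the original version. A sketch, where `ORIG_VERSION` is an assumed variable recorded before the forward upgrade:

```bash
TS=$(date +%s)
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
  name: downgrade-${TS}
  namespace: default
spec:
  clusterName: $CLUSTER_NAME
  type: Upgrade
  upgrade:
    components:
    - componentName: <component>
      serviceVersion: "${ORIG_VERSION}"   # assumed: captured at cluster creation
EOF
wait_ops downgrade-${TS} 300
```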
Switchover field semantics:
| Field | Meaning |
|---|---|
| `instanceName` | The current primary pod to be demoted. Must be the actual pod name — `"*"` does NOT work (the controller treats it as a literal pod name and fails with "Pod not found"). |
| `candidateName` | (Optional) The target pod to be promoted. Omit to let Syncer choose automatically. |
Important: instanceName must be the explicit pod name of the current primary. If it points to a secondary,
the switchover action executes on all pods but every pod's script sees KB_SWITCHOVER_ROLE=secondary
and exits with 0 (no-op). The OpsRequest will report Succeed even though no switchover occurred.
This is expected KubeBlocks behavior (not a bug) — the script's role check is a safety guard.
# Step 1: Find the current primary pod
PRIMARY_POD=$(kubectl get pods -l app.kubernetes.io/instance=$CLUSTER_NAME \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.kubeblocks\.io/role}{"\n"}{end}' \
| grep primary | awk '{print $1}')
# Step 2: Issue switchover with explicit primary pod name
TS=$(date +%s)
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: switchover-${TS}
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: Switchover
switchover:
- componentName: <component>
instanceName: "$PRIMARY_POD" # must be the actual primary pod name
EOF
wait_ops switchover-${TS} 90
# With explicit candidate (optional — omit candidateName to let Syncer choose)
# switchover:
# - componentName: <component>
# instanceName: "$PRIMARY_POD" # current primary to demote
# candidateName: "<cluster>-<component>-1" # target to promote
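Because a role-guard no-op still reports Succeed, re-read the role labels after the OpsRequest. A sketch reusing the lookup above, assuming Syncer keeps the `kubeblocks.io/role` label current:

```bash
NEW_PRIMARY=$(kubectl get pods -l app.kubernetes.io/instance=$CLUSTER_NAME \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.kubeblocks\.io/role}{"\n"}{end}' \
  | grep primary | awk '{print $1}')
if [[ -n "$NEW_PRIMARY" && "$NEW_PRIMARY" != "$PRIMARY_POD" ]]; then
  echo "✓ Primary moved: $PRIMARY_POD -> $NEW_PRIMARY"
else
  echo "✗ Primary unchanged ($PRIMARY_POD): switchover may have been a role-guard no-op"
fi
```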
Chaos faults that trigger a failover — the HA mechanism must re-elect a new primary and the cluster should return to Running with data intact.
Single-node topology: These tests verify Kubernetes pod restart recovery only — there is no application-level HA election. Mark results as "K8s pod restart recovery (not HA failover)" and note the topology. For true failover validation, use multi-node topology with 3+ replicas.
Requires ChaosMesh. $CHAOS_NS was resolved in Step 1 (install performed automatically if absent). If still empty, mark all Failover and NoFailover tests as SKIPPED.
CLUSTER_NAME=<cluster-name>
COMPONENT=<component>
NAMESPACE=default
# Identify the current primary pod
PRIMARY_POD=$(kubectl get pods -l "app.kubernetes.io/instance=$CLUSTER_NAME" \
-o jsonpath='{.items[0].metadata.name}') # adjust selector for primary role if applicable
# Reuse the namespace resolved in Step 1; fall back only if it is unset
CHAOS_NS=${CHAOS_NS:-chaos-testing}   # or chaos-mesh
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: chaos-pod-kill-$(date +%s)
namespace: $CHAOS_NS
spec:
action: pod-kill
mode: one
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
app.kubernetes.io/component: "$COMPONENT"
gracePeriod: 0
EOF
# Wait for cluster to recover to Running
for ((i=0; i<=120; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 60 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 120 )) && echo "✗ No recovery after 120s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: chaos-kill1-$(date +%s)
namespace: $CHAOS_NS
spec:
action: container-kill
mode: one
containerNames: ["<main-container-name>"]
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
EOF
for ((i=0; i<=120; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 60 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 120 )) && echo "✗ No recovery after 120s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: chaos-pod-failure-$(date +%s)
namespace: $CHAOS_NS
spec:
action: pod-failure
mode: one
duration: "60s"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
EOF
for ((i=0; i<=120; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 60 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 120 )) && echo "✗ No recovery after 120s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
name: chaos-oom-$(date +%s)
namespace: $CHAOS_NS
spec:
mode: one
duration: "30s"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
stressors:
memory:
workers: 1
size: "512MB"
EOF
for ((i=0; i<=120; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 60 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 120 )) && echo "✗ No recovery after 120s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
name: chaos-cpu-$(date +%s)
namespace: $CHAOS_NS
spec:
mode: one
duration: "60s"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
stressors:
cpu:
workers: 2
load: 100
EOF
for ((i=0; i<=120; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 60 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 120 )) && echo "✗ No recovery after 120s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: chaos-netloss-$(date +%s)
namespace: $CHAOS_NS
spec:
action: loss
mode: one
duration: "60s"
loss:
loss: "100"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
EOF
for ((i=0; i<=120; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 60 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 120 )) && echo "✗ No recovery after 120s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: chaos-netcorrupt-$(date +%s)
namespace: $CHAOS_NS
spec:
action: corrupt
mode: one
duration: "60s"
corrupt:
corrupt: "60"
correlation: "25"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
EOF
for ((i=0; i<=120; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 60 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 120 )) && echo "✗ No recovery after 120s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: chaos-netbw-$(date +%s)
namespace: $CHAOS_NS
spec:
action: bandwidth
mode: one
duration: "60s"
bandwidth:
rate: "1mbps"
limit: 100
buffer: 10000
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
EOF
for ((i=0; i<=120; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 60 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 120 )) && echo "✗ No recovery after 120s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: chaos-netdelay-$(date +%s)
namespace: $CHAOS_NS
spec:
action: delay
mode: one
duration: "60s"
delay:
latency: "500ms"
correlation: "25"
jitter: "50ms"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
EOF
for ((i=0; i<=120; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 60 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 120 )) && echo "✗ No recovery after 120s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: chaos-partition-$(date +%s)
namespace: $CHAOS_NS
spec:
action: partition
mode: one
duration: "60s"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
EOF
for ((i=0; i<=120; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 60 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 120 )) && echo "✗ No recovery after 120s — last phase: ${PHASE}" && break
sleep 10
done
kubectl delete pods -l "app.kubernetes.io/instance=$CLUSTER_NAME" --force --grace-period=0
for ((i=0; i<=150; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 75 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 150 )) && echo "✗ No recovery after 150s — last phase: ${PHASE}" && break
sleep 10
done
These faults should not trigger failover. The cluster may be temporarily degraded but should recover without leader election.
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: chaos-netdup-$(date +%s)
namespace: $CHAOS_NS
spec:
action: duplicate
mode: one
duration: "60s"
duplicate:
duplicate: "60"
correlation: "25"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
EOF
for ((i=0; i<=90; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 40 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 90 )) && echo "✗ No recovery after 90s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
name: chaos-dnserr-$(date +%s)
namespace: $CHAOS_NS
spec:
action: error
mode: one
duration: "60s"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
EOF
for ((i=0; i<=90; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 40 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 90 )) && echo "✗ No recovery after 90s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
name: chaos-dnsrandom-$(date +%s)
namespace: $CHAOS_NS
spec:
action: random
mode: one
duration: "60s"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
EOF
for ((i=0; i<=90; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 40 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 90 )) && echo "✗ No recovery after 90s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
name: chaos-timeoffset-$(date +%s)
namespace: $CHAOS_NS
spec:
mode: one
duration: "60s"
timeOffset: "-1h"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
EOF
for ((i=0; i<=90; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 40 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 90 )) && echo "✗ No recovery after 90s — last phase: ${PHASE}" && break
sleep 10
done
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
name: chaos-connstress-$(date +%s)
namespace: $CHAOS_NS
spec:
mode: one
duration: "60s"
selector:
namespaces: [$NAMESPACE]
labelSelectors:
app.kubernetes.io/instance: "$CLUSTER_NAME"
stressors:
cpu:
workers: 1
load: 50
EOF
for ((i=0; i<=90; i+=10)); do
PHASE=$(kubectl get cluster $CLUSTER_NAME -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Running" ]] && echo "✓ Recovered in ${i}s" && break
if (( i == 40 )); then
echo " [${i}s] still ${PHASE} — KB logs:"
kubectl logs $KB_POD -n kb-system --tail=20 2>/dev/null | grep -E "ERROR|$CLUSTER_NAME" | tail -5
fi
(( i == 90 )) && echo "✗ No recovery after 90s — last phase: ${PHASE}" && break
sleep 10
done
Cleanup chaos objects after each test:
# The chaos manifests above carry no instance label in metadata (only in their
# selectors), so delete by namespace rather than by label:
kubectl delete networkchaos,podchaos,stresschaos,dnschaos,timechaos \
--all -n $CHAOS_NS 2>/dev/null || true
The backup method depends on the engine. Common methods from the regression report:
| Engine | Backup Methods |
|---|---|
| MySQL (8.0 / 5.7) | xtrabackup, xtrabackup-inc (incremental), volume-snapshot, Schedule xtrabackup |
| PostgreSQL | wal-g, pg-basebackup, volume-snapshot, Schedule pg-basebackup |
| MongoDB | pbm-physical, dump, datafile, volume-snapshot, Schedule pbm-physical |
| Redis | datafile, volume-snapshot, Schedule datafile, aof |
| Redis Cluster | datafile, Schedule datafile |
| Kafka | topics |
| Qdrant | datafile, Schedule datafile |
| Etcd | datafile |
| Elasticsearch | es-dump, full-backup (⚠ full-backup fails with JavaClassNotFoundException in ES 7.x — use es-dump) |
| Milvus | full, volume-snapshot |
| Clickhouse | full |
| TDengine | dump |
| Kingbase | full |
| GaussDB | gaussdb-roach |
| Oracle | oracle-rman |
| OceanBase Ent | full, full-for-rebuild |
| MSSQL | full |
| Doris | full |
# Create backup
TS=$(date +%s)
cat <<EOF | kubectl apply -f -
apiVersion: dataprotection.kubeblocks.io/v1alpha1
kind: Backup
metadata:
name: backup-test-${TS}
namespace: default
spec:
backupMethod: <method> # e.g. xtrabackup, wal-g, datafile
backupPolicyName: <cluster-name>-<component>-backup-policy
EOF
# Wait for backup Completed
kubectl wait backup backup-test-${TS} --for=jsonpath='{.status.phase}'=Completed --timeout=300s
# Restore from backup
cat <<EOF | kubectl apply -f -
apiVersion: apps.kubeblocks.io/v1
kind: Cluster
metadata:
name: <cluster-name>-restore
namespace: default
annotations:
kubeblocks.io/restore-from-backup: '{"<component>":{"name":"backup-test-<ts>","namespace":"default","volumeRestorePolicy":"Parallel"}}'
spec:
terminationPolicy: Delete
clusterDef: <engine>
topology: <topology>
componentSpecs:
- name: <component>
serviceVersion: "<version>"
replicas: 1
resources:
limits: { cpu: "0.5", memory: "512Mi" }
requests: { cpu: "0.1", memory: "256Mi" }
volumeClaimTemplates:
- name: data
spec:
accessModes: [ReadWriteOnce]
storageClassName: ""
resources:
requests:
storage: 20Gi
EOF
kubectl wait cluster <cluster-name>-restore --for=jsonpath='{.status.phase}'=Running --timeout=300s
# Cleanup restore cluster
kubectl delete cluster <cluster-name>-restore
MSSQL restore override: MSSQL requires extra volumes (certificate Secret) that raw YAML cannot provide.
Use helm install instead — see engine-hints:mssql "Restore from Backup" section for the exact command.
The restoreFrom Helm value renders the kubeblocks.io/restore-from-backup annotation automatically.
# MSSQL restore example (use this instead of raw YAML above)
helm install <restore-name> addons-cluster/mssql \
--set version=<version> \
--set replicas=<N> \
--set cpu=<cpu> --set memory=<memory> --set storage=<storage> \
--set extra.terminationPolicy=Delete \
--set-json 'restoreFrom="{\"mssql\":{\"name\":\"<backup-name>\",\"namespace\":\"default\",\"volumeRestorePolicy\":\"Parallel\"}}"'
# Cleanup: helm uninstall <restore-name>
cat <<EOF | kubectl apply -f -
apiVersion: dataprotection.kubeblocks.io/v1alpha1
kind: BackupSchedule
metadata:
name: sched-backup-$(date +%s)
namespace: default
spec:
backupPolicyName: <cluster-name>-<component>-backup-policy
schedules:
- backupMethod: <method>
cronExpression: "*/5 * * * *" # every 5 min for testing
enabled: true
retentionPeriod: 1h
EOF
# Wait for at least one backup to complete, then delete schedule
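# Sketch of that wait (assumption: scheduled Backup names embed the schedule
# name; adjust the grep to the names your DataProtection version generates):
for ((i=0; i<360; i+=30)); do
  kubectl get backup --no-headers 2>/dev/null | grep "sched-backup" | grep -q Completed \
    && echo "✓ scheduled backup Completed" && break
  sleep 30
done
kubectl delete backupschedule sched-backup-<ts>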
# First ensure a full xtrabackup exists, then apply incremental backup,
# then restore incrementally. Only applicable to engines with xtrabackup-inc method.
Pre-check — ParametersDef is optional. Not every addon implements live reconfiguration. Check before attempting the test:
kubectl get parametersdef --no-headers 2>/dev/null | grep <engine>
| Result | Action | Report state |
|---|---|---|
| ParametersDef found | Run the test | PASSED/FAILED |
| ParametersDef missing, team plans to add it | Skip, file a Feature issue | N/A (ParametersDef not yet implemented) |
| ParametersDef missing, intentionally not supported | Skip, no issue needed | N/A (not applicable for this engine) |

Never mark `FAILED` just because ParametersDef is absent — that is a feature gap, not a broken feature. Only mark `FAILED` if ParametersDef exists but the OpsRequest errors out or times out.
# Find all ParametersDef resources matching this engine's component
PARAMS_DEF_NAMES=$(kubectl get parametersdef --no-headers 2>/dev/null | grep <engine> | awk '{print $1}')
if [[ -z "$PARAMS_DEF_NAMES" ]]; then
echo "No ParametersDef found for <engine> — marking Parameter tests as N/A"
# Mark both Reconfiguring Dynamic and Reconfiguring Static as N/A
# Skip to the next feature
fi
# For each ParametersDef, extract dynamic/static/immutable parameter lists
for PD_NAME in $PARAMS_DEF_NAMES; do
echo "=== ParametersDef: $PD_NAME ==="
DYNAMIC_PARAMS=$(kubectl get parametersdef $PD_NAME \
-o jsonpath='{.spec.dynamicParameters[*]}' 2>/dev/null)
STATIC_PARAMS=$(kubectl get parametersdef $PD_NAME \
-o jsonpath='{.spec.staticParameters[*]}' 2>/dev/null)
IMMUTABLE_PARAMS=$(kubectl get parametersdef $PD_NAME \
-o jsonpath='{.spec.immutableParameters[*]}' 2>/dev/null)
echo "Dynamic parameters ($(echo $DYNAMIC_PARAMS | wc -w | tr -d ' ')): $DYNAMIC_PARAMS"
echo "Static parameters ($(echo $STATIC_PARAMS | wc -w | tr -d ' ')): $STATIC_PARAMS"
echo "Immutable parameters ($(echo $IMMUTABLE_PARAMS | wc -w | tr -d ' ')): $IMMUTABLE_PARAMS"
done
Immutable parameters must NEVER be modified. They are read-only after initialization. Skip them entirely.
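A defensive filter keeps them out of the candidate pools entirely. A minimal sketch reusing the variables extracted above:

```bash
for P in $IMMUTABLE_PARAMS; do
  DYNAMIC_PARAMS=$(printf '%s\n' $DYNAMIC_PARAMS | grep -vx "$P" | tr '\n' ' ')
  STATIC_PARAMS=$(printf '%s\n' $STATIC_PARAMS | grep -vx "$P" | tr '\n' ' ')
done
echo "Candidates after removing immutable params: dynamic=($DYNAMIC_PARAMS) static=($STATIC_PARAMS)"
```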
For each parameter selected for testing, read the current value from the running configuration and generate a safe new value.
# Get the config template name and file name from the ParametersDef
TEMPLATE_NAME=$(kubectl get parametersdef $PD_NAME \
-o jsonpath='{.spec.templateName}' 2>/dev/null)
FILE_NAME=$(kubectl get parametersdef $PD_NAME \
-o jsonpath='{.spec.fileName}' 2>/dev/null)
# Read current config from the running pod's ConfigMap
# The ConfigMap is typically named: <cluster>-<component>-<template>
CONFIG_CM=$(kubectl get configmap -l app.kubernetes.io/instance=$CLUSTER_NAME \
--no-headers 2>/dev/null | grep "$TEMPLATE_NAME" | awk '{print $1}' | head -1)
if [[ -n "$CONFIG_CM" && -n "$FILE_NAME" ]]; then
# NOTE: if FILE_NAME contains dots (e.g. my.cnf), escape them for jsonpath: {.data.my\.cnf}
CURRENT_CONFIG=$(kubectl get configmap $CONFIG_CM \
-o jsonpath="{.data.${FILE_NAME}}" 2>/dev/null)
fi
Value generation strategy — for each parameter, determine a safe new value:
| Current Value Pattern | Strategy | Example |
|---|---|---|
| Integer (e.g. `100`, `4096`) | Increment by 1, or double if small (< 10) | `100 → 101`, `4 → 8` |
| Boolean (`on`/`off`, `yes`/`no`, `true`/`false`, `1`/`0`) | Toggle to the opposite | `yes → no` |
| Size with unit (`512MB`, `1G`, `2G`) | Double or halve (stay within reasonable bounds) | `512MB → 1024MB` |
| Float (e.g. `0.5`, `1.5`) | Increment by 0.1 | `0.5 → 0.6` |
| Unrecognized or unreadable | Skip this parameter, pick another | — |
Important: If the parameter's current value cannot be read or the generated value would be invalid, skip that parameter and randomly select another one from the remaining pool. Never send a value that could crash the database.
# Helper: generate a safe test value for a parameter
# Usage: generate_test_value <current_value>
# Outputs the new value to stdout. Exits 1 if cannot determine a safe value.
function generate_test_value() {
local CURRENT="$1"
if [[ -z "$CURRENT" ]]; then
return 1
fi
# Boolean toggle
case "$CURRENT" in
on) echo "off"; return 0 ;;
off) echo "on"; return 0 ;;
yes) echo "no"; return 0 ;;
no) echo "yes"; return 0 ;;
true) echo "false"; return 0 ;;
false) echo "true"; return 0 ;;
ON) echo "OFF"; return 0 ;;
OFF) echo "ON"; return 0 ;;
1) echo "0"; return 0 ;;
0) echo "1"; return 0 ;;
esac
# Size with unit (e.g. 512MB, 1G, 2G, 256K)
if [[ "$CURRENT" =~ ^([0-9]+)([KMGkmg][Bb]?)$ ]]; then
local NUM=${BASH_REMATCH[1]}
local UNIT=${BASH_REMATCH[2]}
local NEW_NUM=$((NUM + NUM / 2)) # increase by 50%
[[ $NEW_NUM -eq $NUM ]] && NEW_NUM=$((NUM + 1))
echo "${NEW_NUM}${UNIT}"; return 0
fi
# Integer
if [[ "$CURRENT" =~ ^[0-9]+$ ]]; then
local NEW_VAL=$((CURRENT + 1))
echo "$NEW_VAL"; return 0
fi
# Float
if [[ "$CURRENT" =~ ^[0-9]+\.[0-9]+$ ]]; then
local NEW_VAL=$(echo "$CURRENT + 0.1" | bc)
echo "$NEW_VAL"; return 0
fi
# Cannot determine a safe value
return 1
}
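A couple of spot-checks show the intended behavior:

```bash
generate_test_value "512MB"        # prints 768MB (50% increase)
generate_test_value "on"           # prints off
generate_test_value "some-string" || echo "no safe value: skip this parameter"
```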
Before any parameter modifications, record the pod restart counts and UIDs to detect restarts later.
# Record restart counts and pod UIDs for all pods of this component
kubectl get pods -l "app.kubernetes.io/instance=$CLUSTER_NAME,app.kubernetes.io/component=<component>" \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.uid}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
| tee /tmp/pod-state-before.txt
echo "--- Pre-test pod state recorded ---"
Randomly select up to 10 parameters from `dynamicParameters[]` (use all if fewer than 10 exist), then submit a single `Reconfiguring` OpsRequest with all selected parameters.
# Convert space-separated list to array and shuffle
DYNAMIC_ARRAY=($DYNAMIC_PARAMS)
DYNAMIC_COUNT=${#DYNAMIC_ARRAY[@]}
if [[ $DYNAMIC_COUNT -eq 0 ]]; then
echo "No dynamic parameters defined — marking Reconfiguring Dynamic as N/A"
# Report: N/A (no dynamic parameters defined in ParametersDef)
else
# Shuffle and pick up to 10
SHUFFLED=($(printf '%s\n' "${DYNAMIC_ARRAY[@]}" | shuf))
PICK_COUNT=10
(( DYNAMIC_COUNT < PICK_COUNT )) && PICK_COUNT=$DYNAMIC_COUNT
SELECTED_DYNAMIC=("${SHUFFLED[@]:0:$PICK_COUNT}")
echo "Selected $PICK_COUNT dynamic parameters for testing: ${SELECTED_DYNAMIC[*]}"
# Build parameters YAML block, skipping any where we can't generate a safe value
PARAMS_YAML=""
TESTED_DYNAMIC=()
TESTED_DYNAMIC_DESC=""
for PARAM in "${SELECTED_DYNAMIC[@]}"; do
# Read current value — try from ConfigMap or from the running pod
CURRENT_VAL=""
if [[ -n "$CURRENT_CONFIG" ]]; then
# Attempt to extract value from config file content (ini/conf/properties format)
CURRENT_VAL=$(echo "$CURRENT_CONFIG" | grep -E "^\s*${PARAM}\s*[=: ]" \
| head -1 | sed -E 's/^[^=:]*[=: ]\s*//' | tr -d '"' | tr -d "'" | xargs)
fi
NEW_VAL=$(generate_test_value "$CURRENT_VAL")
if [[ $? -ne 0 || -z "$NEW_VAL" ]]; then
echo " Skipping $PARAM — cannot determine safe test value (current=$CURRENT_VAL)"
continue
fi
echo " $PARAM: $CURRENT_VAL → $NEW_VAL"
PARAMS_YAML="${PARAMS_YAML} - key: ${PARAM}
value: \"${NEW_VAL}\"
"
TESTED_DYNAMIC+=("$PARAM=$NEW_VAL")
TESTED_DYNAMIC_DESC="${TESTED_DYNAMIC_DESC}, ${PARAM}=${NEW_VAL}"
done
TESTED_DYNAMIC_DESC="${TESTED_DYNAMIC_DESC#, }" # trim leading comma
if [[ ${#TESTED_DYNAMIC[@]} -eq 0 ]]; then
echo "Could not generate safe values for any dynamic parameter — marking as N/A"
else
# Snapshot pod state before dynamic test
kubectl get pods -l "app.kubernetes.io/instance=$CLUSTER_NAME,app.kubernetes.io/component=<component>" \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.uid}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
| tee /tmp/pod-state-before-dynamic.txt
TS=$(date +%s)
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: reconfig-dynamic-${TS}
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: Reconfiguring
reconfigures:
- componentName: <component>
parameters:
${PARAMS_YAML}
EOF
wait_ops reconfig-dynamic-${TS} 180
# --- Verify NO pod restart after dynamic parameter change ---
sleep 10 # brief stabilization
kubectl get pods -l "app.kubernetes.io/instance=$CLUSTER_NAME,app.kubernetes.io/component=<component>" \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.uid}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
| tee /tmp/pod-state-after-dynamic.txt
RESTART_DETECTED=false
while IFS=$'\t' read -r POD_NAME POD_UID RESTART_BEFORE; do
RESTART_AFTER=$(grep "^${POD_NAME}" /tmp/pod-state-after-dynamic.txt | cut -f3)
UID_AFTER=$(grep "^${POD_NAME}" /tmp/pod-state-after-dynamic.txt | cut -f2)
if [[ "$UID_AFTER" != "$POD_UID" ]]; then
echo " ✗ Pod $POD_NAME was recreated (UID changed: $POD_UID → $UID_AFTER)"
RESTART_DETECTED=true
elif [[ -n "$RESTART_AFTER" && "$RESTART_AFTER" -gt "$RESTART_BEFORE" ]]; then
echo " ✗ Pod $POD_NAME restart count increased ($RESTART_BEFORE → $RESTART_AFTER)"
RESTART_DETECTED=true
else
echo " ✓ Pod $POD_NAME — no restart (count=$RESTART_BEFORE, UID unchanged)"
fi
done < /tmp/pod-state-before-dynamic.txt
if [[ "$RESTART_DETECTED" == "true" ]]; then
echo "✗ UNEXPECTED: Pod restart detected after dynamic parameter change"
echo " Dynamic parameters should be hot-reloaded without restart."
echo " This indicates a possible misclassification in ParametersDef."
# Report: FAILED — pod restarted unexpectedly after dynamic param change
else
echo "✓ Reconfiguring Dynamic Succeed — no pod restart (${#TESTED_DYNAMIC[@]} params)"
# Report: PASSED
fi
fi
fi
Randomly select up to 2 parameters from `staticParameters[]` (use all if fewer than 2 exist), then submit a single `Reconfiguring` OpsRequest.
STATIC_ARRAY=($STATIC_PARAMS)
STATIC_COUNT=${#STATIC_ARRAY[@]}
if [[ $STATIC_COUNT -eq 0 ]]; then
echo "No static parameters defined — marking Reconfiguring Static as N/A"
# Report: N/A (no static parameters defined in ParametersDef)
else
# Shuffle and pick up to 2
SHUFFLED_STATIC=($(printf '%s\n' "${STATIC_ARRAY[@]}" | shuf))
PICK_STATIC=2
(( STATIC_COUNT < PICK_STATIC )) && PICK_STATIC=$STATIC_COUNT
SELECTED_STATIC=("${SHUFFLED_STATIC[@]:0:$PICK_STATIC}")
echo "Selected $PICK_STATIC static parameters for testing: ${SELECTED_STATIC[*]}"
# Build parameters YAML block
STATIC_PARAMS_YAML=""
TESTED_STATIC=()
TESTED_STATIC_DESC=""
for PARAM in "${SELECTED_STATIC[@]}"; do
CURRENT_VAL=""
if [[ -n "$CURRENT_CONFIG" ]]; then
CURRENT_VAL=$(echo "$CURRENT_CONFIG" | grep -E "^\s*${PARAM}\s*[=: ]" \
| head -1 | sed -E 's/^[^=:]*[=: ]\s*//' | tr -d '"' | tr -d "'" | xargs)
fi
NEW_VAL=$(generate_test_value "$CURRENT_VAL")
if [[ $? -ne 0 || -z "$NEW_VAL" ]]; then
echo " Skipping $PARAM — cannot determine safe test value (current=$CURRENT_VAL)"
continue
fi
echo " $PARAM: $CURRENT_VAL → $NEW_VAL"
STATIC_PARAMS_YAML="${STATIC_PARAMS_YAML} - key: ${PARAM}
value: \"${NEW_VAL}\"
"
TESTED_STATIC+=("$PARAM=$NEW_VAL")
TESTED_STATIC_DESC="${TESTED_STATIC_DESC}, ${PARAM}=${NEW_VAL}"
done
TESTED_STATIC_DESC="${TESTED_STATIC_DESC#, }"
if [[ ${#TESTED_STATIC[@]} -eq 0 ]]; then
echo "Could not generate safe values for any static parameter — marking as N/A"
else
# Snapshot pod state before static test
kubectl get pods -l "app.kubernetes.io/instance=$CLUSTER_NAME,app.kubernetes.io/component=<component>" \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.uid}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
| tee /tmp/pod-state-before-static.txt
TS=$(date +%s)
cat <<EOF | kubectl apply -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: reconfig-static-${TS}
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: Reconfiguring
reconfigures:
- componentName: <component>
parameters:
${STATIC_PARAMS_YAML}
EOF
wait_ops reconfig-static-${TS} 300 # static params trigger rolling restart — allow more time
# --- Verify pods WERE restarted after static parameter change ---
sleep 15 # allow restart to complete
kubectl get pods -l "app.kubernetes.io/instance=$CLUSTER_NAME,app.kubernetes.io/component=<component>" \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.uid}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
| tee /tmp/pod-state-after-static.txt
RESTART_CONFIRMED=false
while IFS=$'\t' read -r POD_NAME POD_UID RESTART_BEFORE; do
RESTART_AFTER=$(grep "^${POD_NAME}" /tmp/pod-state-after-static.txt | cut -f3)
UID_AFTER=$(grep "^${POD_NAME}" /tmp/pod-state-after-static.txt | cut -f2)
if [[ "$UID_AFTER" != "$POD_UID" ]]; then
echo " ✓ Pod $POD_NAME was recreated (UID changed: $POD_UID → $UID_AFTER)"
RESTART_CONFIRMED=true
elif [[ -n "$RESTART_AFTER" && "$RESTART_AFTER" -gt "$RESTART_BEFORE" ]]; then
echo " ✓ Pod $POD_NAME restart count increased ($RESTART_BEFORE → $RESTART_AFTER)"
RESTART_CONFIRMED=true
else
echo " ? Pod $POD_NAME — no restart detected (count=$RESTART_BEFORE, UID unchanged)"
fi
done < /tmp/pod-state-before-static.txt
if [[ "$RESTART_CONFIRMED" == "true" ]]; then
echo "✓ Reconfiguring Static Succeed — pod restart confirmed (${#TESTED_STATIC[@]} params)"
# Report: PASSED
else
echo "✗ UNEXPECTED: No pod restart detected after static parameter change"
echo " Static parameters require a restart to take effect."
echo " This indicates a possible misclassification in ParametersDef."
# Report: FAILED — no restart after static param change
fi
fi
fi
| Scenario | Action |
|---|---|
| Fewer than 10 dynamic params available | Test all available, note the count in the report |
| Fewer than 2 static params available | Test all available, note the count in the report |
| Zero dynamic params in ParametersDef | Mark Reconfiguring Dynamic as N/A (no dynamic parameters defined) |
| Zero static params in ParametersDef | Mark Reconfiguring Static as N/A (no static parameters defined) |
| Cannot read current value for a param | Skip that parameter, randomly pick another from the pool |
| Cannot generate a safe value | Skip that parameter, randomly pick another from the pool |
| OpsRequest stays Running beyond timeout | wait_ops handles this — marks FAILED with KB log output |
| Dynamic param change triggers restart | Mark FAILED — likely a misclassification in ParametersDef |
| Static param change does NOT trigger restart | Mark FAILED — likely a misclassification in ParametersDef |
If `ParametersDef` is absent and you attempt the OpsRequest anyway, it will stay in `Running` indefinitely — cancel it with `kubectl delete opsrequest reconfig-<ts>` and mark the result `N/A`.
Expose strategy by component type:
- Components with roles (e.g. MySQL primary/secondary): use the OpsRequest approach with `roleSelector` to expose only the primary.
- Components without roles (check engine hints for which components this applies to): use the direct Service approach with the `apps.kubeblocks.io/component-name` label selector. The OpsRequest approach will hang if the component has no roles and a `roleSelector` is injected.

Check whether a component has roles before choosing the approach:
kubectl get componentdefinition <compdef-name> -o jsonpath='{.spec.roles[*].name}'
# Empty output → use direct Service approach
Approach A — OpsRequest (for components WITH roles):
# NOTE: the "switch" field is required but missing from the CRD schema validation.
# Use --validate=false to bypass client-side validation.
# Do NOT include roleSelector for components that have no roles defined.
TS=$(date +%s)
cat <<EOF | kubectl apply --validate=false -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: expose-enable-${TS}
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: Expose
expose:
- componentName: <component>
switch: Enable
services:
- name: internet
serviceType: LoadBalancer
annotations: {}
EOF
for i in {1..24}; do
PHASE=$(kubectl get opsrequest expose-enable-${TS} -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Succeed" ]] && echo "✓ Expose Enable Succeed" && break
[[ "$PHASE" == "Failed" ]] && echo "✗ Expose Enable Failed" && break
sleep 5
done
# Disable
TS=$(date +%s)
cat <<EOF | kubectl apply --validate=false -f -
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
name: expose-disable-${TS}
namespace: default
spec:
clusterName: $CLUSTER_NAME
type: Expose
expose:
- componentName: <component>
switch: Disable
services:
- name: internet
serviceType: LoadBalancer
EOF
for i in {1..24}; do
PHASE=$(kubectl get opsrequest expose-disable-${TS} -o jsonpath='{.status.phase}' 2>/dev/null)
[[ "$PHASE" == "Succeed" ]] && echo "✓ Expose Disable Succeed" && break
sleep 5
done
Approach B — Direct Service (for components WITHOUT roles):
# Enable: create LB service using apps.kubeblocks.io/component-name label
COMPONENT=<component> # e.g. master, data, mdit
SVC_NAME="${CLUSTER_NAME}-${COMPONENT}-internet"
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
name: ${SVC_NAME}
namespace: default
labels:
app.kubernetes.io/instance: ${CLUSTER_NAME}
apps.kubeblocks.io/component-name: ${COMPONENT}
spec:
type: LoadBalancer
selector:
app.kubernetes.io/instance: ${CLUSTER_NAME}
apps.kubeblocks.io/component-name: ${COMPONENT}
ports:
- name: <port-name> # e.g. http
port: <port> # e.g. 9200
targetPort: <port-name>
protocol: TCP
EOF
# Wait for external IP
for ((i=0; i<120; i+=5)); do
IP=$(kubectl get svc ${SVC_NAME} -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null)
[[ -n "$IP" ]] && echo "✓ LB ready: $IP" && break
sleep 5
done
# Disable: delete the service
kubectl delete svc ${SVC_NAME}
echo "✓ LB removed"
# Get connection credential secret
kubectl get secret -l app.kubernetes.io/instance=$CLUSTER_NAME -o name
kubectl get secret <cluster-name>-<component>-account-root -o jsonpath='{.data.password}' \
| base64 -d
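# (sketch) Some engines also store a username in the same secret; the
# "username" field is an assumption and varies by engine.
SECRET=$(kubectl get secret -l app.kubernetes.io/instance=$CLUSTER_NAME -o name | head -1)
DB_USER=$(kubectl get "$SECRET" -o jsonpath='{.data.username}' 2>/dev/null | base64 -d)
DB_PASS=$(kubectl get "$SECRET" -o jsonpath='{.data.password}' | base64 -d)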
# Engine-specific connection test (adapt per engine)
# MySQL example:
kubectl exec -it <pod-name> -- mysql -u root -p<password> -e "SELECT 1"
# PostgreSQL:
kubectl exec -it <pod-name> -- psql -U postgres -c "SELECT 1"
# Redis:
kubectl exec -it <pod-name> -- redis-cli PING
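# (sketch) When these run from a script instead of an interactive shell, -it
# fails with a TTY error; drop it for one-shot commands:
kubectl exec <pod-name> -- redis-cli PING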
# Use kbcli bench or engine-specific tool
# MySQL / PostgreSQL example with sysbench via kbcli:
kbcli bench sysbench $CLUSTER_NAME --component <component> \
--driver mysql --database mydb --tables 5 --table-size 10000 \
--duration 30 --threads 8
# Test against LB service (if Expose was enabled):
kbcli bench sysbench $CLUSTER_NAME --component <component> \
--driver mysql --host <lb-ip> --port 3306 \
--duration 30 --threads 8
kubectl delete cluster $CLUSTER_NAME
# terminationPolicy=Delete cleans up PVCs automatically
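# (sketch) Verify teardown: cluster object gone and PVC count drained to zero
kubectl get cluster $CLUSTER_NAME 2>/dev/null || echo "✓ cluster object deleted"
kubectl get pvc -l app.kubernetes.io/instance=$CLUSTER_NAME --no-headers 2>/dev/null | wc -l  # expect 0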
When an OpsRequest stays in Running indefinitely, or the cluster enters Abnormal phase with no obvious pod-level error, check the KubeBlocks operator logs — they surface controller-level errors that are invisible in pod events:
# Find the KB operator pod
kubectl get pods -n kb-system --no-headers | grep kubeblocks
# Search for errors related to your cluster
kubectl logs <kubeblocks-pod> -n kb-system --tail=500 \
| grep -E "ERROR|build error|$CLUSTER_NAME" \
| grep -v "replicas.*out-of-limit" \
| tail -30
# Common patterns and their meanings:
# "replicas 0 out-of-limit [1, 16384]" → a component's replicas was zeroed out
# → likely caused by --type=merge patch on componentSpecs array
# → fix: kubectl patch --type=json to restore correct replicas
#
# "not all component sub-resources deleted" → component is stuck deleting
# → check for finalizers: kubectl get component <name> -o jsonpath='{.metadata.finalizers}'
#
# "OpsRequest is forbidden when Cluster.status.phase=Updating"
# → wait for cluster to return to Running before submitting next OpsRequest
Rule: never use `--type=merge` on `spec.componentSpecs`. It replaces the entire array, zeroing out replicas/resources/volumeClaimTemplates for every component not included in the patch body. Always use `--type=json` for any field inside `componentSpecs`.
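For example, a sketch of a safe patch that changes one component's replicas without clobbering its siblings (the array index 0 is illustrative; match it to the component's position in `componentSpecs`):
# JSON patch: surgical edit of a single array element
kubectl patch cluster $CLUSTER_NAME --type=json \
  -p '[{"op": "replace", "path": "/spec/componentSpecs/0/replicas", "value": 2}]'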
After all tests finish, output the results directly as markdown (do NOT wrap in a code block). The rendered markdown will display as formatted headers and tables.
Use exactly this structure, filling in actual values for every placeholder:
Engine: <engine> (Topology = <topology>; Replicas = <N>)
Component Definition: <cmpd-name>
Component Version: <cv-name>
Service Version: <version>
| Feature | Operation | State | Description |
|---|---|---|---|
| Lifecycle | Create | PASSED | Create a cluster with topology <topology>, component definition <cmpd-name>, service version <version> |
| Lifecycle | Start | PASSED | Start the cluster |
| Lifecycle | Stop | PASSED | Stop the cluster |
| Lifecycle | Restart | PASSED | Restart the cluster |
| Lifecycle | Update | PASSED | Update TerminationPolicy to WipeOut |
| Scale | VerticalScaling | PASSED | VerticalScaling component <component> |
| Scale | VolumeExpansion | PASSED | VolumeExpansion component <component> |
| Scale | HorizontalScaling In | PASSED | HorizontalScaling In component <component> |
| Scale | HorizontalScaling Out | PASSED | HorizontalScaling Out component <component> |
| Scale | RebuildInstance | - | Not implemented or unsupported |
| Upgrade | Upgrade | PASSED | Upgrade component <component> from <v1> to <v2> |
| SwitchOver | SwitchOver | PASSED | SwitchOver component <component> |
| Failover | Kill 1 | PASSED | Simulates conditions where process 1 is killed |
| Failover | Pod Kill | PASSED | Simulates conditions where pods are killed |
| NoFailover | Connection Stress | PASSED | Simulates conditions where pods experience connection stress |
| Backup Restore | Backup | PASSED | <method> Backup |
| Backup Restore | Restore | PASSED | <method> Restore |
| Backup Restore | Delete Restore Cluster | PASSED | Delete the <method> restore cluster |
| Parameter | Reconfiguring Dynamic | PASSED | Reconfiguring dynamic parameters (N params): <param1>=<val1>, <param2>=<val2>, ... — no pod restart confirmed |
| Parameter | Reconfiguring Static | PASSED | Reconfiguring static parameters (N params): <param1>=<val1>, <param2>=<val2>, ... — pod restart confirmed |
| Accessibility | Expose | PASSED | Expose Enable internet service on component <component> |
| Accessibility | Connect | PASSED | Connect to the cluster |
| Stress | Bench | PASSED | Bench the cluster via component <component> |
All implemented operations: PASSED
Not implemented or unsupported: <list operations marked "-">
Images not yet in registry: <list any skipped versions, or "none">
State values — use exactly these tokens in the State column:
- `PASSED` — operation completed successfully
- `FAILED` — operation attempted but did not succeed
- `SKIPPED (known #N)` — skipped due to open GitHub issue #N with label skip-in-test
- `SKIPPED` — precondition not met (e.g., image missing, ChaosMesh not installed)
- `N/A` — architecturally not applicable for this topology or engine (e.g., HScale on single-node, downgrade on engine that forbids it)
- `-` — feature not yet implemented in this addon

After outputting the report, save it as a markdown file:
workspace/tests/<engine>-<topology>-report.md
The file must contain the complete report (everything from ## Instance Test Results through the ### Conclusion section). Create the workspace/tests/ directory if it does not already exist.
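A minimal sketch of the save step (the heredoc body stands in for the full rendered report):
mkdir -p workspace/tests
cat > "workspace/tests/${ENGINE}-<topology>-report.md" <<'EOF'
## Instance Test Results
...rendered report...
EOF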
When a test case FAILs due to a bug in the addon or its dependencies, file a GitHub issue at https://github.com/apecloud/kubeblocks-addons/issues.
Apply exactly one primary label:
| Label | When to use |
|---|---|
| Bug | Something is broken — wrong behavior, crash, error |
| Feature | New capability that does not exist yet |
| Improvement | Existing feature works but could be better (performance, UX, coverage) |
| Chore | Maintenance, dependency update, CI, cleanup |
| Document | Documentation missing or incorrect |
# Example: file a Bug and assign to a maintainer
gh issue create \
--repo apecloud/kubeblocks-addons \
--title "bug: <engine> <operation> fails with <error>" \
--label "Bug" \
--assignee <github-username> \
--body "$(cat <<'EOF'
## Summary
<one-line description>
## Trigger Path
<exact call chain that reproduces the error>
## Root Cause
<what is actually wrong>
## Fix
<suggested fix>
## Workaround
<how to avoid the issue until it is fixed, if any>
EOF
)"