Name: Robustmq Chaos Test
Author: robustmq

7×24 chaos testing for RobustMQ. Injects broker-kill and network-delay faults, validates SDK client resilience across Python/Go/Rust/Java, and publishes a Markdown + JSON report to GitHub after each run.

stars: 1,594

forks: 221

updated: 2026-05-28 07:49


SKILL.md


name	robustmq-chaos-test
description	7×24 chaos testing for RobustMQ. Injects broker-kill and network-delay faults, validates SDK client resilience across Python/Go/Rust/Java, and publishes a Markdown + JSON report to GitHub after each run.
requires_tools	["cluster_manage","observability","client","chaos","report"]
cron	0 */4 * * *

RobustMQ Chaos Test Skill

When to Use

Cron trigger (every 4 hours): the system message will say

"按 P0 跑一轮 RobustMQ 故障场景" ("Run one round of RobustMQ fault scenarios at P0")

Manual CLI: the user says something like

"帮我跑一轮 RobustMQ chaos 测试" ("Run a round of RobustMQ chaos tests for me") / "run a chaos test round"

In both cases, execute the Full Run flow below.

If the user says "按 P1 跑一轮" ("run one round at P1") or names a specific scenario, execute the Single Scenario flow for that scenario only.

Pre-check

Before starting any run:

1. Call cluster_manage(action=status).
   - If status is NOT stopped, call cluster_manage(action=stop) to clear any leftover state from a previous run.
2. Verify that chaos-test/config.yml has cluster.binary and cluster.project_root filled in correctly. The cluster tool fails fast if binary is missing; surface that error immediately and stop.
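
The pre-check can be sketched as a small function. This is only an illustration: the skill names the cluster_manage tool and its actions, but the callable signature, the `{"status": ...}` return shape, and the already-parsed `config` dict are assumptions made here.

```python
from typing import Any, Callable, Dict

# Hypothetical tool signature: a callable taking action=... and returning a dict.
ClusterManage = Callable[..., Dict[str, Any]]


def pre_check(cluster_manage: ClusterManage, config: Dict[str, Any]) -> None:
    """Stop any leftover cluster, then confirm the required config keys are set."""
    state = cluster_manage(action="status")
    if state.get("status") != "stopped":
        # Clear leftover state from a previous run.
        cluster_manage(action="stop")

    cluster_cfg = config.get("cluster", {})
    missing = [key for key in ("binary", "project_root") if not cluster_cfg.get(key)]
    if missing:
        # Surface the problem immediately instead of retrying.
        raise ValueError(f"chaos-test/config.yml is missing cluster keys: {missing}")
```

Raising on a missing key mirrors the fail-fast behaviour described above: the run stops before any cluster is started.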

Scenario Catalogue

| Scenario name | fault_type | target | params | Core? |
|---|---|---|---|---|
| broker-kill-single | broker-kill | robustmq-server | - | ✅ |
| network-delay-100ms | network-delay | eth0 | delay_ms=100, jitter_ms=10 | - |
| leader-transfer | broker-kill | robustmq-server | - | ✅ |

Note: Update this table when new scenarios are added. Target names and interface names depend on the deployment environment — verify before running.

Core scenarios: broker-kill-single, leader-transfer. Run passed = all core pass AND non-core pass rate ≥ 75%.
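
The run-level rule can be written out as follows. This is a sketch under two assumptions not stated in the skill: results arrive as a name-to-bool mapping, and a run with no non-core scenarios is treated as satisfying the 75% condition.

```python
from typing import Dict, Iterable

CORE_SCENARIOS = frozenset({"broker-kill-single", "leader-transfer"})
NON_CORE_THRESHOLD = 0.75


def run_passed(results: Dict[str, bool], core: Iterable[str] = CORE_SCENARIOS) -> bool:
    """Return True when every core scenario passed and non-core pass rate >= 75%."""
    core = set(core)
    # A core scenario that is missing from the results counts as not passed.
    if not all(results.get(name, False) for name in core):
        return False
    non_core = [ok for name, ok in results.items() if name not in core]
    if not non_core:
        return True  # Assumption: nothing to measure, so the rate condition holds.
    return sum(non_core) / len(non_core) >= NON_CORE_THRESHOLD
```

With the catalogue above there is a single non-core scenario, so its pass rate is either 0% or 100% and a network-delay-100ms failure alone fails the run.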

Single Scenario — 5-Step Flow

Execute these steps sequentially. Do NOT skip steps.

Step 1 — Baseline Snapshot

observability(action=snapshot, data_dirs=<from cluster start>)

Record the snapshot as baseline. Proceed even if some metrics are unavailable; log a warning but do not abort.

Step 2 — Inject Fault

chaos(action=inject, fault_type=<type>, target=<target>, params=<params>)

Save the returned fault_id. If inject returns an error, mark the scenario passed=False with status=inject_error, skip Steps 3 and 4 (there is nothing to recover), and go straight to Step 5.

Step 3 — Fault-Period SDK Observation (record only)

client(action=run, scenario=<scenario>, cluster_endpoint=<endpoint>)

Record all results. Do NOT use these results to determine pass/fail. Their only purpose is observability — they show what clients experienced during the fault. A high loss rate here is expected and normal.

Step 4 — Recover

chaos(action=recover, fault_id=<fault_id>)

If recover returns an error, log it and continue — attempt self-healing validation anyway.

Step 5 — Self-Healing Validation (sole pass/fail basis)

Wait 60 seconds after recovery, then run:

client(action=run, scenario=<scenario>, cluster_endpoint=<endpoint>)

Pass criteria (ALL must hold):

exit_code == 0
lost == 0
p99_ms < 500

If any criterion fails → scenario passed=False. If status=script_format_error → scenario passed=False, note the format error separately (this is a test-infrastructure issue, not a RobustMQ bug).
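
The five steps can be sketched end to end as below. Several things are assumptions for illustration only: the tools are passed in as plain callables that return dicts, an error is signalled by an `"error"` key, and the record field names (`baseline`, `fault_period`, `validation`) are made up here. The `wait` parameter exists so the 60-second pause can be replaced in tests.

```python
import time
from typing import Any, Callable, Dict

Tool = Callable[..., Dict[str, Any]]  # assumed shape of every tool


def evaluate(result: Dict[str, Any]) -> bool:
    """Step 5 criteria: all three must hold; a format error always fails."""
    if result.get("status") == "script_format_error":
        return False
    p99 = result.get("p99_ms")
    return (
        result.get("exit_code") == 0
        and result.get("lost") == 0
        and isinstance(p99, (int, float))
        and p99 < 500
    )


def run_scenario(name: str, fault: Dict[str, Any], tools: Dict[str, Tool],
                 endpoint: str, data_dirs: Any,
                 wait: Callable[[float], Any] = time.sleep) -> Dict[str, Any]:
    record: Dict[str, Any] = {"scenario": name, "passed": False, "status": "ok", "errors": []}

    # Step 1: baseline snapshot.
    record["baseline"] = tools["observability"](action="snapshot", data_dirs=data_dirs)

    # Step 2: inject the fault and keep the fault_id for recovery.
    injected = tools["chaos"](action="inject", **fault)
    inject_failed = "error" in injected
    if inject_failed:
        record["status"] = "inject_error"
        record["errors"].append(injected["error"])
    else:
        # Step 3: record-only observation during the fault.
        record["fault_period"] = tools["client"](
            action="run", scenario=name, cluster_endpoint=endpoint)
        # Step 4: recover; an error is logged but does not stop the flow.
        recovered = tools["chaos"](action="recover", fault_id=injected["fault_id"])
        if "error" in recovered:
            record["errors"].append(recovered["error"])

    # Step 5: mandatory wait, then the only run that decides pass/fail.
    wait(60)
    final = tools["client"](action="run", scenario=name, cluster_endpoint=endpoint)
    record["validation"] = final
    if record["status"] == "ok" and final.get("status") == "script_format_error":
        record["status"] = "script_format_error"
    record["passed"] = (not inject_failed) and evaluate(final)
    return record
```

Note that the fault-period result is stored but never read when computing `passed`, which matches the record-only role of Step 3.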

Full Run Flow

1. Pre-check (see above).
2. Start cluster: cluster_manage(action=start) → save endpoint and data_dirs.
3. Run each scenario using the Single Scenario flow.
   - Run scenarios sequentially (not in parallel) to avoid interference.
   - If a scenario crashes the cluster (all brokers dead), restart it before continuing: cluster_manage(action=stop) → cluster_manage(action=start).
4. Stop cluster: cluster_manage(action=stop).
5. Generate report: report(action=generate_and_push, run_data={...}).
   - run_data must include: run_id, started_at, finished_at, scenarios.
   - Each scenario entry: scenario name, sdk, passed, sent, received, lost, p99_ms, duration_seconds, errors, status.
6. Send Feishu notification:
   - If run_passed=True: send a brief pass message with github_url.
   - If run_passed=False: send a failure alert listing the failed scenarios and github_url.
   - If consecutive_failures >= 3: prepend "🚨 连续 {n} 轮失败，请人工介入" ("{n} consecutive failed runs, manual intervention required").
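
Before calling the report tool it may help to check that run_data carries every required field. The sketch below is one way to do that; the key name `scenario` for the "scenario name" field is an assumption, since the skill does not spell out the exact key.

```python
from typing import Any, Dict, List

RUN_FIELDS = ("run_id", "started_at", "finished_at", "scenarios")
# "scenario" as the key for the scenario name is assumed, not confirmed by the skill.
SCENARIO_FIELDS = ("scenario", "sdk", "passed", "sent", "received", "lost",
                   "p99_ms", "duration_seconds", "errors", "status")


def missing_fields(run_data: Dict[str, Any]) -> List[str]:
    """Return dotted paths of required fields that are absent from run_data."""
    problems = [field for field in RUN_FIELDS if field not in run_data]
    for index, entry in enumerate(run_data.get("scenarios", [])):
        problems += [f"scenarios[{index}].{field}"
                     for field in SCENARIO_FIELDS if field not in entry]
    return problems
```

An empty list means the payload is structurally complete; anything else can be logged before the report call so a malformed run is not pushed silently.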

Circuit Breaker

Track consecutive_failures across runs (persist in your memory or state):

Increment on run_passed=False.
Reset to 0 on run_passed=True.
If consecutive_failures >= 3: send an urgent Feishu alert and pause the cron schedule. Do NOT continue running automatically until a human acknowledges and resets the counter.
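
The counter rules above amount to a tiny state machine. A minimal sketch, assuming the state is held in an immutable value that the caller persists however it likes (the class and function names are illustrative):

```python
from dataclasses import dataclass

THRESHOLD = 3  # consecutive failed runs before the breaker trips


@dataclass(frozen=True)
class BreakerState:
    consecutive_failures: int = 0

    @property
    def tripped(self) -> bool:
        # When tripped: send the urgent alert and pause the cron schedule.
        return self.consecutive_failures >= THRESHOLD


def record_run(state: BreakerState, run_passed: bool) -> BreakerState:
    """Reset on a passed run, otherwise count one failure for the whole run."""
    if run_passed:
        return BreakerState(0)
    return BreakerState(state.consecutive_failures + 1)
```

Because `record_run` takes a single boolean per run, a run with several failed scenarios still adds exactly one to the counter.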

Feishu Message Templates

Pass:

✅ RobustMQ 故障测试通过
Run ID: {run_id}  时间: {finished_at}
核心场景: 全部通过  总通过率: {pass_rate}%
报告: {github_url}

Fail:

❌ RobustMQ 故障测试失败
Run ID: {run_id}  时间: {finished_at}
失败场景: {failed_scenario_list}
报告: {github_url}

Circuit breaker:

🚨 连续 {consecutive_failures} 轮失败，请人工介入
最后失败: {run_id}  {finished_at}
报告: {github_url}
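
The three templates can be filled with plain `str.format`. The sketch below copies the template text verbatim; the dictionary keys (`pass`, `fail`, `circuit_breaker`) and the `render` helper are names chosen for this example, and how the message is actually delivered to Feishu is out of scope.

```python
TEMPLATES = {
    "pass": (
        "✅ RobustMQ 故障测试通过\n"
        "Run ID: {run_id}  时间: {finished_at}\n"
        "核心场景: 全部通过  总通过率: {pass_rate}%\n"
        "报告: {github_url}"
    ),
    "fail": (
        "❌ RobustMQ 故障测试失败\n"
        "Run ID: {run_id}  时间: {finished_at}\n"
        "失败场景: {failed_scenario_list}\n"
        "报告: {github_url}"
    ),
    "circuit_breaker": (
        "🚨 连续 {consecutive_failures} 轮失败，请人工介入\n"
        "最后失败: {run_id}  {finished_at}\n"
        "报告: {github_url}"
    ),
}


def render(kind: str, **fields: object) -> str:
    """Fill one template; a missing placeholder raises KeyError."""
    return TEMPLATES[kind].format(**fields)
```

Letting a missing placeholder raise is deliberate here: a half-filled alert is harder to notice than a failed send.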

Pitfalls

Never judge pass/fail on fault-period results (Step 3). Only Step 5 post-recovery validation counts.
script_format_error ≠ RobustMQ bug. Report it separately; do not inflate the failure count. Fix the script first.
ROBUSTMQ_HOME must be set before any run. The cluster tool returns an error immediately if it is not — surface it and stop rather than retrying.
Consecutive failures count whole runs, not individual scenarios. One run with two failed scenarios = 1 failure, not 2.
eth0 is not universal. The network-delay target interface name varies by host. Verify it before running in a new environment.
Deploy Key permissions. If report returns push_error, the reports are still written locally at json_path / markdown_path. Investigate the key before declaring the run lost.
60-second wait is mandatory. Do not skip or shorten it — RobustMQ leader election and connection re-establishment take time.
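
For the interface-name pitfall, one possible check is to list the host's interfaces before injecting a network-delay fault. This sketch uses Python's standard `socket.if_nameindex()`; the helper name and the optional `available` parameter (used to make the check testable) are illustrative, and it assumes the check runs on the same host where the fault is injected.

```python
import socket
from typing import Iterable, Optional


def interface_exists(name: str, available: Optional[Iterable[str]] = None) -> bool:
    """Return True when `name` is one of the host's network interfaces."""
    if available is None:
        # socket.if_nameindex() yields (index, name) pairs for local interfaces.
        available = [if_name for _, if_name in socket.if_nameindex()]
    return name in set(available)
```

If the check fails for eth0, update the target column of the scenario table for that environment rather than letting the inject call fail mid-run.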
