robustmq-chaos-test
7×24 chaos testing for RobustMQ. Injects broker-kill and network-delay faults, validates SDK client resilience across Python/Go/Rust/Java, and publishes a Markdown + JSON report to GitHub after each run.
| name | robustmq-chaos-test |
| description | 7×24 chaos testing for RobustMQ. Injects broker-kill and network-delay faults, validates SDK client resilience across Python/Go/Rust/Java, and publishes a Markdown + JSON report to GitHub after each run. |
| requires_tools | ["cluster_manage","observability","client","chaos","report"] |
| cron | 0 */4 * * * |
- Cron trigger (every 4 hours): the system message will say "按 P0 跑一轮 RobustMQ 故障场景" ("run one round of RobustMQ fault scenarios at P0").
- Manual CLI: the user says something like "帮我跑一轮 RobustMQ chaos 测试" ("run a round of RobustMQ chaos tests for me") or "run a chaos test round".

In both cases, execute the Full Run below.
If the user says "按 P1 跑一轮" ("run one round at P1") or names a specific scenario, execute the Single Scenario flow for that scenario only.
Before starting any run:

1. Check the cluster status with cluster_manage(action=status).
2. If status is NOT stopped, call cluster_manage(action=stop) to clear any leftover state from a previous run.
3. Verify that chaos-test/config.yml has cluster.binary and cluster.project_root filled in correctly (the cluster tool fails fast if binary is missing; surface that error immediately and stop).

| Scenario name | fault_type | target | params | Core? |
|---|---|---|---|---|
| broker-kill-single | broker-kill | robustmq-server | — | ✅ |
| network-delay-100ms | network-delay | eth0 | delay_ms=100, jitter_ms=10 | — |
| leader-transfer | broker-kill | robustmq-server | — | ✅ |
Note: Update this table when new scenarios are added. Target names and interface names depend on the deployment environment — verify before running.
Core scenarios: broker-kill-single, leader-transfer. Run passed = all core pass AND non-core pass rate ≥ 75%.
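
The run-level rule above can be sketched as a small function. This is only an illustration: the `ScenarioResult` type and `run_passed` helper are hypothetical names, and two behaviours are assumptions not stated in the text (a core scenario missing from the results counts as failed, and a run with no non-core scenarios counts as a 100% non-core pass rate).

```python
from dataclasses import dataclass

# Core scenarios and threshold as listed in the scenario table and rule above.
CORE_SCENARIOS = ("broker-kill-single", "leader-transfer")
NON_CORE_PASS_THRESHOLD = 0.75


@dataclass
class ScenarioResult:
    """Hypothetical per-scenario record: scenario name plus its passed flag."""
    name: str
    passed: bool


def run_passed(results):
    """Return True if all core scenarios passed and non-core pass rate >= 75%."""
    by_name = {r.name: r.passed for r in results}
    # Assumption: a core scenario that did not run is treated as failed.
    core_ok = all(by_name.get(name, False) for name in CORE_SCENARIOS)
    non_core = [r.passed for r in results if r.name not in CORE_SCENARIOS]
    # Assumption: with no non-core scenarios the rate is treated as 100%.
    rate = sum(non_core) / len(non_core) if non_core else 1.0
    return core_ok and rate >= NON_CORE_PASS_THRESHOLD
```

With the three scenarios in the table there is a single non-core scenario, so under this reading a failure of network-delay-100ms alone drops the non-core rate to 0% and fails the run.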
Execute these steps sequentially. Do NOT skip steps.

Step 1: Baseline snapshot
observability(action=snapshot, data_dirs=<from cluster start>)
Record the snapshot as baseline. Proceed even if some metrics are unavailable; log a warning but do not abort.

Step 2: Inject the fault
chaos(action=inject, fault_type=<type>, target=<target>, params=<params>)
Save the returned fault_id. If inject returns an error, mark the scenario passed=False with status=inject_error and skip to Step 5 (skip recover).

Step 3: Run clients during the fault
client(action=run, scenario=<scenario>, cluster_endpoint=<endpoint>)
Record all results. Do NOT use these results to determine pass/fail. Their only purpose is observability: they show what clients experienced during the fault. A high loss rate here is expected and normal.

Step 4: Recover
chaos(action=recover, fault_id=<fault_id>)
If recover returns an error, log it and continue; attempt self-healing validation anyway.

Step 5: Self-healing validation
Wait 60 seconds after recovery, then run:
client(action=run, scenario=<scenario>, cluster_endpoint=<endpoint>)
Pass criteria (ALL must hold):

- exit_code == 0
- lost == 0
- p99_ms < 500

If any criterion fails → scenario passed=False.
If status=script_format_error → scenario passed=False; note the format error separately (this is a test-infrastructure issue, not a RobustMQ bug).
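
The per-scenario flow and pass criteria above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual tool interface: the tools are passed in as plain callables in a `tools` dict, each is assumed to return a dict, an inject or recover failure is assumed to appear as an `error` key, and "skip to Step 5" is read here as skipping the during-fault client run, the recover call, and the 60-second wait.

```python
import time

P99_LIMIT_MS = 500     # pass criterion: p99_ms < 500
RECOVERY_WAIT_S = 60   # wait 60 seconds after recovery


def meets_pass_criteria(result):
    """True only if exit_code == 0, lost == 0 and p99_ms < 500 all hold."""
    p99 = result.get("p99_ms")
    return (
        result.get("exit_code") == 0
        and result.get("lost") == 0
        and p99 is not None
        and p99 < P99_LIMIT_MS
    )


def run_scenario(scenario, tools, endpoint, data_dirs, sleep=time.sleep):
    """Run one scenario. `tools` maps tool name -> callable (hypothetical shape)."""
    outcome = {"scenario": scenario["name"], "passed": False,
               "status": "ok", "warnings": []}

    # Baseline snapshot; its content never aborts the run.
    outcome["baseline"] = tools["observability"](
        action="snapshot", data_dirs=data_dirs)

    # Inject the fault and keep the returned fault_id.
    injected = tools["chaos"](
        action="inject", fault_type=scenario["fault_type"],
        target=scenario["target"], params=scenario.get("params", {}))
    inject_failed = "error" in injected  # assumed error shape
    if inject_failed:
        outcome["status"] = "inject_error"
    else:
        # Client run during the fault: recorded for observability only.
        outcome["during_fault"] = tools["client"](
            action="run", scenario=scenario["name"], cluster_endpoint=endpoint)
        recovered = tools["chaos"](action="recover", fault_id=injected["fault_id"])
        if "error" in recovered:
            # Log and continue; validation is still attempted.
            outcome["warnings"].append(f"recover failed: {recovered['error']}")
        sleep(RECOVERY_WAIT_S)

    # Self-healing validation: the only client run that decides pass/fail.
    validation = tools["client"](
        action="run", scenario=scenario["name"], cluster_endpoint=endpoint)
    outcome["validation"] = validation
    if inject_failed:
        pass  # stays passed=False with status=inject_error
    elif validation.get("status") == "script_format_error":
        outcome["status"] = "script_format_error"  # test-infrastructure issue
    else:
        outcome["passed"] = meets_pass_criteria(validation)
    return outcome
```

Passing `sleep` as a parameter keeps the sketch testable without a real 60-second wait. Note that the during-fault result is stored but never consulted when computing `passed`.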
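1. Start the cluster: cluster_manage(action=start) → save endpoint and data_dirs.
2. Run each scenario in the table with the steps above, restarting the cluster between scenarios: cluster_manage(action=stop) → cluster_manage(action=start).
3. Stop the cluster: cluster_manage(action=stop).
4. Generate and push the report: report(action=generate_and_push, run_data={...}). run_data must include: run_id, started_at, finished_at, scenarios.
5. Send the notification:
   - run_passed=True: send a brief pass message with github_url.
   - run_passed=False: send a failure alert listing the failed scenarios and github_url.
   - consecutive_failures >= 3: prepend 🚨 连续 {n} 轮失败,请人工介入 ("{n} consecutive failed rounds, manual intervention required").

Track consecutive_failures across runs (persist in your memory or state):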
- Increment it when run_passed=False.
- Reset it to 0 when run_passed=True.
- If consecutive_failures >= 3: send an urgent Feishu alert and pause the cron schedule. Do NOT continue running automatically until a human acknowledges and resets the counter.

Pass:
✅ RobustMQ 故障测试通过
Run ID: {run_id} 时间: {finished_at}
核心场景: 全部通过 总通过率: {pass_rate}%
报告: {github_url}
Fail:
❌ RobustMQ 故障测试失败
Run ID: {run_id} 时间: {finished_at}
失败场景: {failed_scenario_list}
报告: {github_url}
Circuit breaker:
🚨 连续 {consecutive_failures} 轮失败,请人工介入
最后失败: {run_id} {finished_at}
报告: {github_url}
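
The counter rule and the pass/fail templates above can be sketched together. This is an illustration only: it assumes the counter is persisted in a small JSON file (the text only says "memory or state"), that it increments on a failed run and resets to 0 on a passed run, and that the `run` dict keys such as `failed_scenarios` and `pass_rate` are hypothetical names. The message strings are copied from the templates above.

```python
import json
from pathlib import Path

BREAKER_THRESHOLD = 3  # consecutive_failures >= 3 trips the circuit breaker


def record_run(state_path, run_passed):
    """Update the persisted counter; return (consecutive_failures, pause_cron)."""
    path = Path(state_path)
    state = {"consecutive_failures": 0}
    if path.exists():
        state = json.loads(path.read_text(encoding="utf-8"))
    # Assumed rule: reset on a passed run, increment on a failed run.
    state["consecutive_failures"] = (
        0 if run_passed else state["consecutive_failures"] + 1)
    path.write_text(json.dumps(state), encoding="utf-8")
    count = state["consecutive_failures"]
    return count, count >= BREAKER_THRESHOLD


def render_message(run, consecutive_failures):
    """Build the notification text from the pass or fail template."""
    if run["run_passed"]:
        lines = [
            "✅ RobustMQ 故障测试通过",
            f"Run ID: {run['run_id']} 时间: {run['finished_at']}",
            f"核心场景: 全部通过 总通过率: {run['pass_rate']}%",
            f"报告: {run['github_url']}",
        ]
    else:
        lines = [
            "❌ RobustMQ 故障测试失败",
            f"Run ID: {run['run_id']} 时间: {run['finished_at']}",
            f"失败场景: {', '.join(run['failed_scenarios'])}",
            f"报告: {run['github_url']}",
        ]
        if consecutive_failures >= BREAKER_THRESHOLD:
            # Prepend the escalation line once the breaker threshold is reached.
            lines.insert(0, f"🚨 连续 {consecutive_failures} 轮失败,请人工介入")
    return "\n".join(lines)
```

When `record_run` returns `pause_cron=True`, the caller would stop scheduling further runs until a human resets the state file. How the cron schedule is actually paused is outside this sketch.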
- script_format_error ≠ RobustMQ bug. Report it separately; do not inflate the failure count. Fix the script first.
- If report returns push_error, the reports are still written locally at json_path / markdown_path. Investigate the key before declaring the run lost.
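
As a rough sketch of the report handling described above, the helpers below check the required run_data fields before calling the report tool and interpret its result afterwards. The function names and the assumption that the report tool returns a dict carrying `push_error`, `json_path`, `markdown_path`, and `github_url` keys are hypothetical; only the field names themselves come from the text.

```python
# Fields that run_data must include, as listed for the report step.
REQUIRED_RUN_DATA_KEYS = ("run_id", "started_at", "finished_at", "scenarios")


def missing_run_data_keys(run_data):
    """Return the required keys absent from run_data, in declaration order."""
    return [key for key in REQUIRED_RUN_DATA_KEYS if key not in run_data]


def summarize_report_result(result):
    """Classify a report result (assumed dict shape) as pushed or local-only."""
    if result.get("push_error"):
        # The push failed, but the reports still exist on local disk.
        return {
            "pushed": False,
            "reason": result["push_error"],
            "local_reports": [result.get("json_path"), result.get("markdown_path")],
        }
    return {"pushed": True, "github_url": result.get("github_url")}
```

A caller would refuse to invoke the report tool while `missing_run_data_keys` returns a non-empty list, and would treat a `pushed=False` summary as a run whose results are still recoverable from the local paths.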