perf-doctor

Name: Perf Doctor
Author: redai-infra

// Diagnose Relax training launch scripts for misconfigured flags that hurt performance (time/MFU) or waste GPU memory (cards needed). Use when user asks to review/audit/check a training script, mentions "perf doctor", suspects a config is slow or OOM-prone, or wants a sanity check before launching. Produces a two-section markdown report (Performance + Memory) with cited flags, severity, and concrete fixes.


stars:402

forks:40

updated: 2026-05-29 09:18

SKILL.md


name	perf-doctor
description	Diagnose Relax training launch scripts for misconfigured flags that hurt performance (time/MFU) or waste GPU memory (cards needed). Use when user asks to review/audit/check a training script, mentions "perf doctor", suspects a config is slow or OOM-prone, or wants a sanity check before launching. Produces a two-section markdown report (Performance + Memory) with cited flags, severity, and concrete fixes.
argument-hint	<path-to-launch-script>

perf-doctor

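Diagnoses Relax training launch scripts (scripts/training/**/*.sh) and flags unreasonable settings that hurt execution performance (wall-clock time / MFU) or GPU memory usage (number of cards required).

Usage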

/perf-doctor scripts/training/text/run-qwen36-35B-A3B-8xgpu.sh

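Argument: the absolute or relative path to a single launch script.

Steps

1. Read the script: collect every --flag value from the *_ARGS=( ... ) arrays and from the ray job submit ... train line. Also follow source ${MODEL_CONFIG_DIR}/... to determine the model architecture (dense / MoE, multimodal or not).
2. Extract context:
   - Filename parsing: run-<model>-<size>-<NxgpuY>(-async|-image|-video)?.sh → total GPU count, node count, mode, modality
   - Flag parsing: TP/PP/CP/EP/ETP, --colocate vs --fully-async, --rollout-max-response-len, --max-tokens-per-gpu, --resource, --num-iters-per-train-update, --max-staleness, --num-data-storage-units
   - Default GPU assumption: H20 96GB, unless the user's prompt specifies otherwise (A100 80G / H100 80G, etc.)
3. Load rules: read references/rules.md and judge each rule as applies / borderline / not-applicable.
4. Compare against baselines: read references/baselines.md. If the user's script matches a baseline on model and on GPU-count order of magnitude, use that baseline's parallelism / batch / mem settings as the anchor for the reasonable range; when a setting deviates by ≥ 2 tiers, cite the baseline value in the Cost field of the corresponding finding as supporting evidence.
5. Render the report, strictly following the output template below.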
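
Steps 1 and 2 can be sketched in Python as follows. This is a minimal illustration, not the skill's actual implementation: the names `parse_script_name` and `collect_flags` are hypothetical, the regular expression is only one reading of the filename pattern (it assumes the model name contains no hyphen and leaves N and Y uninterpreted), and the flag collector assumes each `*_ARGS=( ... )` array closes with a `)` on its own line. It does not cover the `ray job submit` line or sourced model configs.

```python
import re
import shlex

# One reading of run-<model>-<size>-<NxgpuY>(-async|-image|-video)?.sh.
# What N and Y mean is not spelled out, so both come back as raw strings.
NAME_RE = re.compile(
    r"^run-(?P<model>[^-]+)-(?P<size>.+)-(?P<n>\d+)xgpu(?P<y>\d*)"
    r"(?:-(?P<variant>async|image|video))?\.sh$"
)

# Assumes the closing parenthesis of each array sits on its own line.
ARRAY_RE = re.compile(r"^\s*(\w+_ARGS)=\((.*?)^\s*\)", re.MULTILINE | re.DOTALL)


def parse_script_name(filename):
    """Split a launch-script filename into named parts, or return None."""
    m = NAME_RE.match(filename)
    return m.groupdict() if m else None


def collect_flags(script_text):
    """Return {array_name: {flag: value-or-True}} for every *_ARGS array."""
    result = {}
    for name, body in ARRAY_RE.findall(script_text):
        flags = {}
        current = None
        # comments=True drops trailing "# ..." shell comments.
        for tok in shlex.split(body, comments=True):
            if tok.startswith("--"):
                key, sep, val = tok.partition("=")
                current = key
                flags[key] = val if sep else True  # bare switch until a value follows
            elif current is not None:
                prev = flags[current]
                flags[current] = tok if prev is True else f"{prev} {tok}"
        result[name] = flags
    return result


print(parse_script_name("run-qwen36-35B-A3B-8xgpu.sh"))
```

Returning raw strings keeps the parsing step separate from interpretation, so the later rule evaluation can decide what the topology token means for a given script.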

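Trigger-judgment principles

- Don't trigger mechanically: the CPU offload trio is necessary in boundary cases such as 35B-A3B on 4×H20. When a rule's Skip when section says a setting is correct under certain conditions, respect it.
- Avoid false positives: if the script has a # NOTE(...) comment explaining why a flag is enabled or disabled, treat it as a valid justification and downgrade the finding to info or skip it.
- Reason rather than enumerate: references/rules.md is a knowledge base, not a decision table. Estimating the memory budget from model size × dtype, judging whether TP×CP×PP is reasonable, and checking whether the async resource ratio is balanced all have to be worked out case by case.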
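
As an illustration of the case-by-case arithmetic meant here, the sketch below estimates per-GPU weight memory and checks that the parallel sizes fit the GPU count. The helper names are hypothetical, and several simplifications are assumptions: 2 bytes per parameter (bf16), weights sharded evenly over TP × PP, and the common convention that world size = TP × PP × CP × DP. Gradients, optimizer state, activations, rollout-engine memory, and MoE expert sharding via EP are all ignored, so the result is a lower bound rather than a budget.

```python
def weight_gib_per_gpu(params_billion, bytes_per_param=2, tp=1, pp=1):
    """Rough per-GPU weight memory in GiB, assuming even sharding over TP x PP."""
    total_bytes = params_billion * 1e9 * bytes_per_param
    return total_bytes / (tp * pp) / 2**30


def model_parallel_fits(total_gpus, tp=1, pp=1, cp=1):
    """True when TP x PP x CP divides the GPU count, leaving a whole DP size."""
    return total_gpus % (tp * pp * cp) == 0


# A 35B-parameter model in bf16, unsharded: weights alone, before anything else.
print(round(weight_gib_per_gpu(35), 1))           # 65.2
# The same model with TP=2 halves the per-GPU weight footprint.
print(round(weight_gib_per_gpu(35, tp=2), 1))     # 32.6
print(model_parallel_fits(8, tp=2, pp=1, cp=2))   # True
```

Against a 96GB card, roughly 65 GiB of unsharded weights leaves little headroom, which is the kind of boundary case where offload settings can be justified rather than flagged.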

Output template

# perf-doctor: <script-name>

**Context:** model=<X> (<dense|MoE>, <text|mm|video>) · GPUs=<N> (<nodes>×<g/n>) · mode=<colocate|fully-async> · TP<x>/PP<x>/CP<x>/EP<x>/ETP<x> · max-resp-len=<X> · GPU=H20 96GB (assumed)

---

## 🚀 Performance findings

### [WARN] R-P0X — <short title>
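- **Setting:** `--flag value` (script line number or enclosing ARGS group)
- **Cost:** <estimated MFU / wall-clock impact>
- **Fix:** <flag change that can be copied verbatim>
- **Skip if:** <when the current setting is actually correct>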

(more findings...)

## 💾 Memory findings

### [WARN] R-M0X — <short title>
- **Setting:** `--flag value`
- **Cost:** <memory impact / card-count impact>
- **Fix:** <suggested change>
- **Skip if:** <justified condition>

(more findings...)

---

## Summary
- Critical: N · Warn: N · Info: N
- **Top action:** <the single most important change, in one sentence>
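
One way to keep rendered reports faithful to this template is to build each finding from a small record and format it in one place. The sketch below is illustrative only: the `Finding` dataclass, the function names, and the upper-case severity tags `CRITICAL` / `WARN` / `INFO` are assumptions, since the template itself only shows `[WARN]`.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Finding:
    severity: str   # assumed to be "CRITICAL", "WARN" or "INFO"
    rule_id: str    # rule identifier in the template's R-P0X / R-M0X style
    title: str
    setting: str
    cost: str
    fix: str
    skip_if: str


def render_finding(f):
    """Format one finding as the heading plus four bullets from the template."""
    return "\n".join([
        f"### [{f.severity}] {f.rule_id} \u2014 {f.title}",
        f"- **Setting:** {f.setting}",
        f"- **Cost:** {f.cost}",
        f"- **Fix:** {f.fix}",
        f"- **Skip if:** {f.skip_if}",
    ])


def render_summary(findings):
    """Count findings per severity for the Summary line."""
    counts = Counter(f.severity for f in findings)
    return (f"- Critical: {counts['CRITICAL']} · Warn: {counts['WARN']} "
            f"· Info: {counts['INFO']}")


# Placeholder values only; no real rule is being quoted here.
example = Finding("WARN", "R-P0X", "<short title>", "`--flag value`",
                  "<cost>", "<fix>", "<skip condition>")
print(render_finding(example))
print(render_summary([example]))
```

Keeping the summary counts derived from the same list of findings avoids the totals drifting out of sync with the sections above them.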

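Strictly forbidden

❌ Do not modify the script. Diagnose only and give written advice.
❌ Do not run training / dry-run / benchmark.
❌ Do not analyze logs (that is debug-hang's job).
❌ Do not suggest "generic ML optimization tips" that are not in references/rules.md and for which you cannot give a Relax-specific justification.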

Rule catalog

The full rules are in references/rules.md. Each rule has the fields Category / Severity / Trigger / Why / Fix / Skip when. To add a rule, append it to that file; SKILL.md does not need to change.
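
To make the six fields concrete, a rule entry and the three-way verdict used when evaluating rules could be modelled as below. This is a hypothetical sketch: the `Rule` dataclass, the `Verdict` enum, and `missing_fields` are illustrative names, and the actual layout of references/rules.md is not shown here.

```python
from dataclasses import dataclass, fields
from enum import Enum


class Verdict(Enum):
    """The three outcomes named for each rule during evaluation."""
    APPLIES = "applies"
    BORDERLINE = "borderline"
    NOT_APPLICABLE = "not-applicable"


@dataclass(frozen=True)
class Rule:
    category: str
    severity: str
    trigger: str
    why: str
    fix: str
    skip_when: str


def missing_fields(rule):
    """Names of fields left blank, so incomplete entries are caught early."""
    return [f.name for f in fields(rule) if not getattr(rule, f.name).strip()]


# Placeholder entry; the values are not taken from the real rule catalog.
draft = Rule("Performance", "warn", "<trigger>", "<why>", "<fix>", "")
print(missing_fields(draft))   # ['skip_when']
```

A check like this matters most for Skip when, because the trigger principles rely on that field to avoid false positives.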

Baselines

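Validated reference configurations are in references/baselines.md and serve as anchors for the reasonable range. To add a baseline, append it following the template at the end of that file.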

related-skills.json

Same repository

relax-dev-debug.md

from "redai-infra/Relax"

Develop and debug the Relax reinforcement learning project. Use this skill whenever modifying code in the relax/ directory, or running remote training jobs on a Ray cluster for validation. Also use it when the user mentions training, debugging training runs, submitting Ray jobs, or fixing training errors.

2026-05-14 · 402 stars

ssh-ray-cluster.md

from "redai-infra/Relax"

Connect to a remote Ray cluster head node via SSH (paramiko) to execute commands, check cluster status, inspect logs, and debug training jobs. Use this skill when the user asks to SSH into a remote machine, check Ray cluster status, or run remote commands on the Ray head node.

2026-05-14 · 402 stars

code-review.md

from "redai-infra/Relax"

Expert code review of current git changes with a senior engineer lens. Detects SOLID violations, security risks, Python anti-patterns, and ML/distributed training issues. Tailored for the Relax reinforcement learning framework.

2026-04-14 · 402 stars

creating-skills.md

from "redai-infra/Relax"

Guide for creating Claude Code skills following Anthropic's official best practices. Use when user wants to create a new skill, build a skill, write SKILL.md, update an existing skill, or needs skill creation guidelines. Provides structure, frontmatter fields, naming conventions, and new features like dynamic context injection and subagent execution.

2026-04-14 · 402 stars

debug-hang.md

from "redai-infra/Relax"

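Automatically troubleshoots hangs in Ray-scheduled distributed training jobs. Use when a training job is unresponsive, resource utilization is abnormal, or a job makes no progress for a long time. Automatically collects cluster state, task call stacks, and Actor state, then analyzes the blocking chain and locates the root cause.

2026-04-14 · 402 stars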

doc-writer.md

from "redai-infra/Relax"

Write and maintain bilingual (English + Chinese) documentation for the Relax project. Use when user asks to create, update, or translate documentation pages. Ensures format correctness (VitePress, sidebar config, bilingual parity) and content correctness (matches actual codebase, no fabricated features).

2026-04-14 · 402 stars

package.json

"author": "redai-infra"

"repository": "redai-infra/Relax"


SOC occupation: Software Developers (15-1252), Computer and Mathematical Occupations, L4
