debug-hang
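// Automatically diagnose hangs in Ray-scheduled distributed training jobs. Use when a training job is unresponsive, resource utilization looks abnormal, or the job shows no progress for a long time. Automatically collects cluster state, task call stacks, and Actor state, then analyzes the blocking chain and pinpoints the root cause.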
| name | debug-hang |
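| description | Automatically diagnose hangs in Ray-scheduled distributed training jobs. Use when a training job is unresponsive, resource utilization looks abnormal, or the job shows no progress for a long time. Automatically collects cluster state, task call stacks, and Actor state, then analyzes the blocking chain and pinpoints the root cause. |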
| allowed-tools | ["bash","read","grep","glob"] |
Goal: confirm cluster health and resource usage.
ray status --address <address>
ray job list --address="<address>" | grep RUNNING
Metrics to watch: node liveness, CPU/GPU utilization (abnormally low → likely hang), pending resource demands, and object store memory.
ray list tasks --address="<address>" --filter "JOB_ID=<job_id>" --filter "state=RUNNING" --format yaml
Key fields: name (the business logic being run), actor_id, worker_pid (needed for the stack dump), and node_id (py-spy must run on the correct node).
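To make the link between these fields and the stack-dump step concrete, here is a minimal sketch that pulls (worker_pid, node_id) pairs out of the YAML listing. The function name extract_pid_node_pairs and the sample text are hypothetical. The sketch assumes one key per line and that node_id precedes worker_pid within each task entry, which may not hold for every Ray version.

```python
from __future__ import annotations


def extract_pid_node_pairs(yaml_text: str) -> list[tuple[int, str]]:
    """Return (worker_pid, node_id) pairs from a YAML task listing.

    Line-based on purpose so no YAML library is needed. Assumes node_id
    appears before worker_pid inside each task entry.
    """
    pairs = []
    node_id = None
    for raw in yaml_text.splitlines():
        # Drop indentation and the leading "- " that starts a list item.
        line = raw.strip().lstrip("- ").strip()
        key, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip().strip("'\"")
        if key == "node_id":
            node_id = value
        elif key == "worker_pid" and value.isdigit() and node_id:
            pairs.append((int(value), node_id))
    return pairs


# Hypothetical sample, shaped like the fields named above.
SAMPLE = """\
- task_id: aaaa
  name: train_step
  actor_id: a1
  node_id: node-1
  worker_pid: 4242
- task_id: bbbb
  name: generate
  actor_id: a2
  node_id: node-2
  worker_pid: 5151
"""

pairs = extract_pid_node_pairs(SAMPLE)
for pid, node in pairs:
    print(f"py-spy dump --pid {pid}  (run on node {node})")
```

Keeping the node next to each PID matters because a PID is only meaningful on the machine that owns the process.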
Important:
py-spy dump --pid <pid> must be run on the node that hosts the target process.
# List all nodes
ray job submit --working-dir "./" --address="<address>" -- \
python scripts/tools/run_on_each_ray_node.py --list
# Run py-spy on a specific node (recommended)
ray job submit --working-dir "./" --address="<address>" -- \
python scripts/tools/run_on_each_ray_node.py -n <node_id_or_ip> "py-spy dump --pid <pid>"
# Run on all GPU nodes (suitable for single-node clusters)
ray job submit --working-dir "./" --address="<address>" -- \
python scripts/tools/run_on_each_ray_node.py "py-spy dump --pid <pid>"
Focus on: where the main thread is blocked, the state of background threads, and the [Has the GIL] marker.
ray list actors --address="<address>" --filter "JOB_ID=<job_id>" --filter "STATE=ALIVE" --format yaml
Analysis dimensions: data-flow direction (producer → consumer), call relationships (parent_task_id → task_id), and resource contention.
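Once the stacks have been read, the producer → consumer relationships can be written down as "waits on" edges and followed to the end of the chain. The sketch below is one possible way to do that; trace_blocking_chain and the actor names are hypothetical, and the edges are assumed to be filled in by hand from the collected stack traces.

```python
from __future__ import annotations

from typing import Optional


def trace_blocking_chain(
    waits_on: dict[str, Optional[str]], start: str
) -> tuple[list[str], str]:
    """Follow 'waits on' edges from start.

    Returns the visited chain and a verdict:
      "root"  -> the last element waits on nobody (candidate root cause)
      "cycle" -> the chain loops back on itself (candidate deadlock)
    """
    chain = [start]
    seen = {start}
    current = start
    while True:
        nxt = waits_on.get(current)
        if nxt is None:
            return chain, "root"
        chain.append(nxt)
        if nxt in seen:
            return chain, "cycle"
        seen.add(nxt)
        current = nxt


# Hypothetical edges: A waits on B, B waits on C, C waits on nobody.
waits_on = {"Actor A": "Actor B", "Actor B": "Actor C", "Actor C": None}
chain, verdict = trace_blocking_chain(waits_on, "Actor A")
print(" -> ".join(chain), f"[{verdict}]")
```

A "root" verdict points at the actor to inspect first; a "cycle" verdict suggests looking for a deadlock among the actors in the loop.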
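| Pattern | Stack trace signature | What to check |
|---|---|---|
| Data wait | time.sleep inside an iterator/queue | Whether the upstream data producer is working |
| Distributed sync | dist.broadcast, dist.all_reduce, dist.barrier | Whether every rank has reached the sync point |
| Condition wait | while True: if condition: break; sleep | Whether the condition can ever be satisfied |
| Resource contention | Waiting on a lock/semaphore | Whether a deadlock exists |
| Remote call blocked | Waiting in ray.get | Whether the callee is responding |
| Network I/O | socket read/write | Whether the peer is alive |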
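The table above can be approximated with a keyword heuristic for a first pass over many stack dumps. This is a rough sketch only: classify_stack is a hypothetical helper, the keyword lists are assumptions rather than anything py-spy guarantees, and the sample stacks are made up. Anything it labels should still be confirmed by reading the stack.

```python
import re

# Keyword lists are assumptions; adjust them to the frames you actually see.
SYNC = re.compile(r"\b(all_reduce|broadcast|barrier)\b")
REMOTE = re.compile(r"\bray\.get\b|\bget_objects\b")
LOCK = re.compile(r"\b(acquire|semaphore)\b", re.IGNORECASE)
NETWORK = re.compile(r"\b(socket|recv|sendall)\b")
SLEEP = re.compile(r"\bsleep\b")
DATA_CONTEXT = re.compile(r"\b(queue|__next__|iterator)\b")


def classify_stack(stack: str) -> str:
    """Map one stack dump to a blocking pattern from the table."""
    if SYNC.search(stack):
        return "distributed sync"
    if REMOTE.search(stack):
        return "remote call blocked"
    if LOCK.search(stack):
        return "resource contention"
    if NETWORK.search(stack):
        return "network I/O"
    if SLEEP.search(stack):
        # sleep inside an iterator/queue suggests waiting for upstream data
        if DATA_CONTEXT.search(stack):
            return "data wait"
        return "condition wait"
    return "unclassified"


# Hypothetical stack excerpts, for illustration only.
samples = {
    "trainer": "all_reduce (torch/distributed/distributed_c10d.py)\ntrain_step (train.py)",
    "loader": "sleep (time)\n__next__ (data/iterator.py)",
    "controller": "sleep (time)\nwait_until_ready (controller.py)",
}
for name, stack in samples.items():
    print(name, "->", classify_stack(stack))
```

The checks run from most to least specific, so a stack that contains both a collective call and a sleep is reported as distributed sync.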
#!/bin/bash
# scripts/tools/diagnose_ray_hang.sh
set -e
RAY_ADDRESS="${1:-$RAY_ADDRESS}"
OUTPUT_DIR="${2:-/tmp/ray_hang_diag_$(date +%Y%m%d_%H%M%S)}"
mkdir -p "$OUTPUT_DIR"
echo "=== Phase 1: Cluster Status ===" | tee "$OUTPUT_DIR/01_cluster.txt"
ray status --address "$RAY_ADDRESS" 2>&1 | tee -a "$OUTPUT_DIR/01_cluster.txt"
echo -e "\n=== Phase 2: Running Jobs ===" | tee "$OUTPUT_DIR/02_jobs.txt"
# tee -a so the phase header written above is not overwritten
ray job list --address="$RAY_ADDRESS" 2>&1 | tee -a "$OUTPUT_DIR/02_jobs.txt"
# Pick the first RUNNING job only; finished or failed jobs are skipped
JOB_ID=$(grep RUNNING "$OUTPUT_DIR/02_jobs.txt" | grep -oP "job_id='\K[^']+" | head -1)
[ -z "$JOB_ID" ] && echo "No running job found" && exit 1
echo "Target Job ID: $JOB_ID" | tee -a "$OUTPUT_DIR/02_jobs.txt"
echo -e "\n=== Phase 3: Running Tasks ===" | tee "$OUTPUT_DIR/03_tasks.txt"
ray list tasks --address="$RAY_ADDRESS" --filter "JOB_ID=$JOB_ID" --filter "state=RUNNING" --format yaml 2>&1 | tee -a "$OUTPUT_DIR/03_tasks.txt"
echo -e "\n=== Phase 4: Active Actors ===" | tee "$OUTPUT_DIR/04_actors.txt"
ray list actors --address="$RAY_ADDRESS" --filter "JOB_ID=$JOB_ID" --filter "STATE=ALIVE" --format yaml 2>&1 | tee -a "$OUTPUT_DIR/04_actors.txt"
echo -e "\n=== Phase 5: Stack Traces ===" | tee "$OUTPUT_DIR/05_stacks.txt"
# Pair each worker_pid with the most recent node_id seen in the task listing
awk '
  /node_id:/ { node=$2 }
  /worker_pid:/ { pid=$2; print pid, node }
' "$OUTPUT_DIR/03_tasks.txt" | while read -r PID NODE_ID; do
  echo -e "\n--- PID $PID (node: $NODE_ID) ---" | tee -a "$OUTPUT_DIR/05_stacks.txt"
  # </dev/null keeps the submitted command from consuming the loop's stdin
  ray job submit --working-dir "./" --address="$RAY_ADDRESS" -- \
    python scripts/tools/run_on_each_ray_node.py -n "$NODE_ID" "py-spy dump --pid $PID" </dev/null 2>&1 | tee -a "$OUTPUT_DIR/05_stacks.txt"
done
echo -e "\n=== Diagnosis Complete ==="
echo "Output saved to: $OUTPUT_DIR"
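## Cluster Status Summary
- Active nodes: X / Y
- GPU utilization: Z%
- Anomaly signals: ...
## Blocked Task Analysis
| Task Name | PID | Blocking location | Pattern |
|-----------|-----|----------|----------|
## Blocking Chain
Actor A (blocked on condition X)
↑ waiting
Actor B (blocked on data Y)
↑ waiting
Actor C (idle, not producing data Y) ← root cause
## Root Cause Diagnosis
- Primary cause / trigger condition / scope of impact
## Fix Recommendations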
If RAY_ADDRESS does not specify a port explicitly, assume 6379 (e.g. x.x.x.x → x.x.x.x:6379).

| Case | Description |
|---|---|
| case-rollout-eval-onload-hang.md | Rollout eval waits for the onload state and hangs (configuration does not match the logic) |
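The default-port rule for RAY_ADDRESS stated before the case table can be sketched as a small helper. The name with_default_port is hypothetical, and the sketch assumes the input is a bare host or a host:port pair; URL-style or IPv6 addresses already contain a colon and are returned unchanged.

```python
DEFAULT_RAY_PORT = 6379  # port assumed when none is given


def with_default_port(address: str, default_port: int = DEFAULT_RAY_PORT) -> str:
    """Append the default port when the address has no explicit one."""
    host = address.strip()
    if ":" in host:
        # A port (or a scheme) is already present; leave the value alone.
        return host
    return f"{host}:{default_port}"


print(with_default_port("10.0.0.1"))
print(with_default_port("10.0.0.1:6380"))
```

Normalizing the address once, before any command is issued, keeps every later step pointed at the same endpoint.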