ssh-ray-cluster

Name: Ssh Ray Cluster
Author: redai-infra

Connect to a remote Ray cluster head node via SSH (paramiko) to execute commands, check cluster status, inspect logs, and debug training jobs. Use this skill when the user asks to SSH into a remote machine, check Ray cluster status, or run remote commands on the Ray head node.

updated: 2026-05-14 04:09

package.json

"author": "redai-infra"

"repository": "redai-infra/Relax"


SSH to Ray Cluster

This skill provides a standardized way to connect to a remote Ray cluster head node via SSH using paramiko, execute commands, and retrieve results. It is used for cluster inspection, log retrieval, and remote debugging.

Prerequisites

The user must provide the following details (ask if missing — do not invent values, and do not write them into this skill file):

Parameter	Purpose
`host`	Remote machine IP
`port`	SSH port
`username`	SSH username
`password`	SSH password
`RELAX_PROJECT_ROOT`	Absolute path to the Relax project root

Connection details and the project root are typically recorded in the session's auto-memory (see reference_ray_cluster_ssh.md). Read them from memory or ask the user — do not hard-code them in this skill or in scripts checked into the repo.
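As a rough illustration of "read them from memory or ask the user", the sketch below collects the five parameters from a mapping and reports which ones are missing so they can be requested rather than invented. The key names (RAY_SSH_HOST and so on) and the ConnectionSettings type are hypothetical, chosen only for this example; the skill itself does not define them.

```python
from dataclasses import dataclass
from typing import Mapping

# Hypothetical key names for this sketch; the skill does not prescribe them.
REQUIRED_KEYS = (
    "RAY_SSH_HOST",
    "RAY_SSH_PORT",
    "RAY_SSH_USERNAME",
    "RAY_SSH_PASSWORD",
    "RELAX_PROJECT_ROOT",
)


@dataclass(frozen=True)
class ConnectionSettings:
    host: str
    port: int
    username: str
    password: str
    project_root: str


def load_settings(source: Mapping[str, str]) -> ConnectionSettings:
    """Build settings from a mapping (e.g. os.environ); never fall back to defaults."""
    missing = [key for key in REQUIRED_KEYS if not source.get(key)]
    if missing:
        # Surface the gap so the caller can ask the user instead of guessing.
        raise ValueError("missing connection details: " + ", ".join(missing))
    return ConnectionSettings(
        host=source["RAY_SSH_HOST"],
        port=int(source["RAY_SSH_PORT"]),
        username=source["RAY_SSH_USERNAME"],
        password=source["RAY_SSH_PASSWORD"],
        project_root=source["RELAX_PROJECT_ROOT"],
    )
```

Passing os.environ as the source keeps the values out of the skill file and out of checked-in scripts; any other session-level store with the same keys would work equally well.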

HARD REQUIREMENT — always run project commands from the Relax project root

A one-shot exec_command call on a paramiko SSHClient starts the remote shell in the user's home directory (typically /root or another non-project dir), not in the Relax project root. Any command that touches a project-relative path (scripts/..., log/..., relax/..., tests/..., pyproject.toml, etc.) MUST be prefixed with cd "$RELAX_PROJECT_ROOT" inside the same command string, where $RELAX_PROJECT_ROOT stands for the resolved path substituted at command-build time; the variable is not assumed to be set in the remote shell. Splitting cd into a separate exec_command call does NOT work, because each call opens a fresh shell back at the home directory.

RELAX_PROJECT_ROOT is a session-level value supplied by the user / read from auto-memory (see Prerequisites). Do not hard-code its value in this skill, in checked-in scripts, or in any reusable artifact — resolve it at command-build time from memory or by asking the user.

Required pattern for any project-relative command

import shlex

# RELAX_PROJECT_ROOT must be resolved from memory or user input first.
# shlex.quote keeps paths with spaces or shell metacharacters intact.
cmd = (
    f'cd {shlex.quote(RELAX_PROJECT_ROOT)} && '
    '<your command here>'
)
ssh.exec_command(cmd, timeout=...)

Examples that REQUIRE the cd prefix:

bash scripts/entrypoint/ray-job.sh ...
bash scripts/training/text/run-<model>-<size>-<gpus>.sh
python scripts/tools/run_on_each_ray_node.py ...
bash scripts/tools/kill_for_ray.sh
tail -n 100 log/<run-name>-*.log
pre-commit run --all-files
pytest tests/test_foo.py

Examples that do NOT need the cd (they take absolute paths or are host-global tools that touch no project files):

ray status, ray job list, ray job logs <ID>, ray job status <ID>
nvidia-smi ...
ls /tmp/ray/session_latest/logs/
ps -ef | grep ...

When in doubt: add the cd. It is harmless on host-global commands and mandatory on project-relative ones.
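The "when in doubt, add the cd" rule can be wrapped in a small helper so the prefix is never forgotten. This is a minimal sketch: the function name in_project_root and the example path are made up for illustration.

```python
import shlex


def in_project_root(project_root: str, command: str) -> str:
    """Return one command string that cds into the project root, then runs command."""
    if not project_root:
        # Refuse to guess: the root must come from memory or from the user.
        raise ValueError("RELAX_PROJECT_ROOT is not resolved")
    return f"cd {shlex.quote(project_root)} && {command}"


# Hypothetical root containing a space, to show why quoting matters.
print(in_project_root("/data/my relax", "pytest tests/test_foo.py"))
# prints: cd '/data/my relax' && pytest tests/test_foo.py
```

Because the result is a single string, it can be handed to exec_command as is, which keeps the cd and the command in the same remote shell.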

Symptom that the `cd` was lost

bash: scripts/...: No such file or directory
python: can't open file '<home-dir>/scripts/...'
ls: cannot access 'log/': No such file or directory

Fix: add cd "$RELAX_PROJECT_ROOT" && to the front of the command and re-run. Do NOT retry blindly — a missing cd will keep failing the same way.
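One way to avoid blind retries is to scan the captured stderr for the symptoms listed above before deciding what to do next. The sketch below is a heuristic only: the same messages also appear when a file is genuinely missing, so a match is a hint to check the cd prefix, not proof. The function name is hypothetical.

```python
import re

# Message fragments taken from the symptom list above.
_LOST_CD_PATTERNS = (
    re.compile(r"No such file or directory"),
    re.compile(r"can't open file"),
    re.compile(r"cannot access"),
)


def looks_like_lost_cd(stderr_text: str) -> bool:
    """Return True if stderr contains one of the typical missing-cd messages."""
    return any(pattern.search(stderr_text) for pattern in _LOST_CD_PATTERNS)
```

A caller could check this once after a failed command and, on a match, rebuild the command with the cd prefix instead of re-sending the same string.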

Connection Pattern

Use Python's paramiko library to establish SSH connections. Always use a one-shot pattern: connect, execute, close. Do not try to maintain persistent connections across tool calls.

Basic connection template

python3 -c "
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
try:
    ssh.connect('<HOST>', port=<PORT>, username='<USERNAME>', password='<PASSWORD>', timeout=10)
    # One-shot: run a single command, read both streams, then close.
    stdin, stdout, stderr = ssh.exec_command('<COMMAND>', timeout=30)
    output = stdout.read().decode()
    errors = stderr.read().decode()
    print(output)
    if errors:
        print('STDERR:', errors)
finally:
    # Release the connection even if connect or exec_command raised.
    ssh.close()
"

Multi-command template

When you need to run multiple commands in sequence:

python3 -c "
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())

# Each exec_command starts a fresh remote shell, so a cd in one entry
# does not carry over to the next; chain with && inside a single entry.
commands = [
    ('Description 1', 'command1'),
    ('Description 2', 'command2'),
]

try:
    ssh.connect('<HOST>', port=<PORT>, username='<USERNAME>', password='<PASSWORD>', timeout=10)
    for desc, cmd in commands:
        print(f'=== {desc} ===')
        stdin, stdout, stderr = ssh.exec_command(cmd, timeout=30)
        print(stdout.read().decode())
        err = stderr.read().decode()
        if err:
            print('STDERR:', err)
        print()
finally:
    # Release the connection even if a command raised.
    ssh.close()
"

Common Operations

1. Check Ray cluster status

ray status 2>&1 | head -30

Shows active/idle nodes, GPU/CPU usage, pending demands.

2. List Ray jobs

ray job list 2>&1 | head -50

Shows all submitted jobs with their status (RUNNING, FAILED, SUCCEEDED).

3. Get running job logs

ray job logs <JOB_ID> 2>&1 | tail -100

4. Check GPU usage across nodes

nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv,noheader
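The CSV output of this query is easy to turn into structured records once it has been fetched. The sketch below assumes the usual csv,noheader layout in which memory fields carry a MiB suffix and utilization carries a % suffix; the sample text is illustrative, not captured from a real cluster, and fields reported as unavailable would need extra handling.

```python
from typing import Dict, List


def parse_gpu_csv(text: str) -> List[Dict[str, int]]:
    """Parse 'index, memory.used, memory.total, utilization.gpu' CSV rows."""
    rows = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total, util = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            # Drop the unit suffix ("MiB" or "%") and keep the number.
            "memory_used_mib": int(used.split()[0]),
            "memory_total_mib": int(total.split()[0]),
            "utilization_pct": int(util.split()[0]),
        })
    return rows


# Illustrative sample in the assumed format.
sample = "0, 1024 MiB, 81920 MiB, 45 %\n1, 0 MiB, 81920 MiB, 0 %\n"
for gpu in parse_gpu_csv(sample):
    print(gpu["index"], gpu["memory_used_mib"], gpu["utilization_pct"])
```

Run directly over SSH, the command only reports the GPUs of the node it runs on, so per-node output would be parsed one node at a time.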

5. Check specific worker node logs

SGLang engine logs are typically found in Ray's log directory:

ls -lt /tmp/ray/session_latest/logs/ | head -20

6. Kill residual processes

cd "$RELAX_PROJECT_ROOT" && bash scripts/tools/kill_for_ray.sh

7. Run command on all nodes

cd "$RELAX_PROJECT_ROOT" && python scripts/tools/run_on_each_ray_node.py command "<COMMAND>"

8. Launch / relaunch a training run

Per the HARD REQUIREMENT above, the cd into the project root and the launch must be in the same command string.

cd "$RELAX_PROJECT_ROOT" && \
  nohup bash scripts/entrypoint/ray-job.sh <RUN_SCRIPT> > <OUT_FILE> 2>&1 &

Verify CWD before launch by chaining pwd && ls scripts/entrypoint/ray-job.sh in the same command — if pwd doesn't report the project root, the cd was dropped and the launch will fail.
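The pre-launch check and the launch can be assembled into one command string as sketched below. The helper name and the example arguments are hypothetical. One design choice differs from the command shown above and is an assumption of this sketch: the backgrounded step is wrapped in { ... } so that only the nohup command runs in the background, while cd, pwd and ls run in the foreground and print before the launch.

```python
import shlex

ENTRYPOINT = "scripts/entrypoint/ray-job.sh"


def build_launch_command(project_root: str, run_script: str, out_file: str) -> str:
    """Chain cd, a CWD check, and the backgrounded launch in a single string."""
    root = shlex.quote(project_root)
    launch = (
        f"nohup bash {ENTRYPOINT} {shlex.quote(run_script)} "
        f"> {shlex.quote(out_file)} 2>&1 &"
    )
    # pwd and ls fail fast (and stop the chain) if the cd did not land in the root.
    return f"cd {root} && pwd && ls {ENTRYPOINT} && {{ {launch} }}"


# Hypothetical paths, for illustration only.
print(build_launch_command("/data/relax", "scripts/training/run.sh", "/tmp/launch.out"))
```

If ls cannot find the entrypoint, the && chain stops before the launch, and the error message points at the working directory rather than at the training script.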

Working directory pitfall (`cd` over SSH) — supplementary patterns

See the HARD REQUIREMENT section near the top for the rule. Two equivalent patterns satisfy it; mixing them does not:

Single-line, single-shell (preferred for paramiko exec_command): chain cd with && inside the same quoted command string, e.g. ssh ... 'cd "$RELAX_PROJECT_ROOT" && bash scripts/...'. If you split the cd into a separate ssh / exec_command invocation, the next call starts back in the home directory.

Heredoc to remote bash (useful for multi-step launches):

ssh ... bash <<EOF
cd "$RELAX_PROJECT_ROOT"
nohup bash scripts/entrypoint/ray-job.sh <RUN_SCRIPT> > <OUT_FILE> 2>&1 &
echo "PID=\$!"
EOF

Symptom that the cd was lost: bash: <script>: No such file or directory, python: can't open file '<home>/scripts/...', or ls: cannot access 'log/'. Fix by adding the cd "$RELAX_PROJECT_ROOT" && prefix and re-launching — do not retry blindly.

Verifying long-running launches over paramiko

With a timeout set on exec_command, stdout.read() raises socket.timeout (paramiko's internal PipeTimeout surfaces as this exception) when the channel delivers no data within timeout seconds, which is what happens while the remote command is still running quietly. The command itself keeps running on the remote end. This is expected for chained commands that include long-running steps like ray serve shutdown -y (~30s) or backgrounded nohup bash ray-job.sh ....

Don't retry blindly on PipeTimeout. Instead, treat it as "submission is in flight, verify separately":

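import socket

# 1. Fire the launch chain (launch_cmd is the chained command string); read() may time out
stdin, stdout, stderr = ssh.exec_command(launch_cmd, timeout=30)
try:
    stdout.read()
except socket.timeout:
    pass  # the remote command is still running; verify separately

# 2. Open a fresh connection (ssh2) and verify.
# 'ray-job[.]sh' keeps pgrep -f from matching the remote shell that runs this line.
stdin, stdout, stderr = ssh2.exec_command(
    "pgrep -af 'ray-job[.]sh' | head; ray job list 2>&1 | grep RUNNING | head",
    timeout=30,
)
print(stdout.read().decode())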

A successful launch is confirmed by either: a ray-job.sh PID still alive, or a fresh RUNNING entry in ray job list with a recent start_time.

Forbidden Operations

When SSH'd into a Ray cluster, never execute the following destructive commands without an explicit, in-conversation user request — the user is typically running a job that must not be killed:

ray stop (kills the entire Ray runtime on the node)
pkill -9 python / pkill -9 -f ... / any wide kill -9 against training processes
bash scripts/tools/kill_for_ray.sh (despite being listed under Common Operations — treat it the same way)
ray job stop <id>
rm -rf on /tmp/ray/ or any session/log directory while a job is running

These bypass graceful shutdown, lose in-flight state, and can crash other tenants on shared clusters. If you think one is needed (e.g. cleanup before a fresh run), ask first and quote the exact command.

Read-only inspection (ray status, ray job list, ray job logs, nvidia-smi, ls, cat, tail, grep, ps, py-spy dump) is always fine.
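A lightweight guard can flag the commands listed above before they are sent. The sketch below is a heuristic with hypothetical names, not a complete safety mechanism: it covers only the patterns named in this section (and only /tmp/ray for rm -rf), and it does not replace asking the user.

```python
import re
from typing import Optional

# (label, pattern) pairs derived from the forbidden list above.
_FORBIDDEN = (
    ("ray stop", re.compile(r"\bray\s+stop\b")),
    ("ray job stop", re.compile(r"\bray\s+job\s+stop\b")),
    ("kill -9 / pkill -9", re.compile(r"\b(?:pkill|kill)\b.*\s-9\b")),
    ("kill_for_ray.sh", re.compile(r"kill_for_ray\.sh")),
    ("rm -rf under /tmp/ray", re.compile(r"\brm\s+-[A-Za-z]*r[A-Za-z]*\s+.*/tmp/ray\b")),
)


def forbidden_reason(command: str) -> Optional[str]:
    """Return the label of the first forbidden pattern found, or None."""
    for label, pattern in _FORBIDDEN:
        if pattern.search(command):
            return label
    return None


for candidate in ("ray status 2>&1 | head -30", "ray job stop <id>"):
    print(candidate, "->", forbidden_reason(candidate))
```

A non-None result is a cue to stop, quote the exact command to the user, and wait for an explicit request before running it.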

Important Notes

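Timeout: Always set a timeout on both calls to avoid hanging; the templates use timeout=10 on ssh.connect() and timeout=30 on exec_command(). Raise the exec_command() value for commands known to be slow.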
Always close: Call ssh.close() in a finally block or at the end to release the connection.
No interactive commands: Never run interactive commands (vim, top without -b, etc.) via paramiko.
Head node has no GPU: The Ray head node typically has no GPUs. GPU commands should be run on worker nodes via run_on_each_ray_node.py.
One-shot pattern: Each tool call opens its own SSH connection (connect, execute, close). Do not attempt to reuse connections across tool calls.
Large output: For commands that produce large output, always pipe through head or tail to limit the response size.
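The notes above on timeouts, closing in a finally block, and the one-shot pattern can be combined into a single helper, sketched below. The function name run_remote and the client_factory parameter are hypothetical; the factory exists only so the helper can be exercised with a stand-in client, and paramiko is imported lazily when no factory is given.

```python
from typing import Callable, Optional, Tuple


def run_remote(
    command: str,
    *,
    host: str,
    port: int,
    username: str,
    password: str,
    client_factory: Optional[Callable[[], object]] = None,
    connect_timeout: float = 10,
    exec_timeout: float = 30,
) -> Tuple[int, str, str]:
    """Connect, run one command, and always close. Returns (exit status, stdout, stderr)."""
    if client_factory is None:
        import paramiko  # imported here so the helper can be tested without it

        def client_factory():
            client = paramiko.SSHClient()
            client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
            return client

    ssh = client_factory()
    try:
        ssh.connect(host, port=port, username=username,
                    password=password, timeout=connect_timeout)
        _stdin, stdout, stderr = ssh.exec_command(command, timeout=exec_timeout)
        out = stdout.read().decode(errors="replace")
        err = stderr.read().decode(errors="replace")
        status = stdout.channel.recv_exit_status()
        return status, out, err
    finally:
        # Runs on success, on timeout, and on connection errors alike.
        ssh.close()
```

Returning the exit status alongside both streams lets the caller tell a failed command from one that merely wrote warnings to stderr.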

Troubleshooting

Connection refused

Verify the SSH tunnel is active on the user's side
Check that the port is correct (non-standard ports like 2360 are common with tunneling)

Command timeout

Some commands (e.g., ray job logs for large jobs) may take a long time
Increase timeout parameter on exec_command() or use tail/head to limit output

Permission denied

Verify username and password
Some clusters require key-based authentication: load the key with paramiko.RSAKey.from_private_key_file() and pass it to ssh.connect() as pkey instead of password
