run-atom-workload

Name: Run Atom Workload
Author: ROCm

Run any ATOM workload — accuracy eval (GSM8K via lm_eval), performance benchmark, concurrency sweep, offline simple_inference, or fault repro under rocm-debug-agent. Use when the user asks to "test accuracy", "测精度", "跑 GSM8K", "跑 benchmark", "test performance", "run sweep", "repro the fault", "测一下 MTP1 精度", "跑 simple_inference" — anything that drives an ATOM workload. Encodes the canonical flow (stop → start → workload-in-shell-bg → wait_infer_drain → stop) and the model-family env vars. Same pattern works for both server-based workloads (lm_eval / benchmark client) and offline simple_inference. Do NOT use for profiling traces (use capture-trace).

Stars: 99

Forks: 64

Updated: 2026-05-23 16:05

Related skills in the same repository

atom-patterns.md

from "ROCm/ATOM"

Coding patterns and architecture index for the ATOM LLM inference engine

2026-05-23, 99 stars

capture-trace.md

from "ROCm/ATOM"

Capture a PyTorch profiler / kineto trace from a running ATOM server for a short benchmark window. Use when the user asks for "a trace", "profiler trace", "GPU trace", or "抓 trace" for performance investigation — what kernels ran, what's on the critical path, what's slow. Do NOT use for crashes (use debug-agent-locate-kernel) or numerical bugs (use dump-bisect-debug).

2026-05-23, 99 stars

debug-agent-locate-kernel.md

from "ROCm/ATOM"

Identify which GPU kernel is faulting/hanging in ATOM via rocm-debug-agent (for faults/asserts) or rocgdb (for silent livelocks). debug-agent dumps wave registers + faulting PC + (with --save-code-objects) disassembled code object on memory faults / ASSERT_TRAP. rocgdb attaches to a live process and lists in-flight `info dispatches` + HSA `info queues` — works when the kernel isn't faulting but just stuck (e.g. atomic-counter deadlock). Use when: server crashes with "Memory access fault by GPU node-N", server hangs with GPU at 100% but no token output, kernel asserting `s_trap`, or `HIP_LAUNCH_BLOCKING=1` makes a hang vanish. Do NOT use for: numerical bugs (use dump-bisect-debug), compile errors, OOM.

2026-05-23, 99 stars

dump-bisect-debug.md

from "ROCm/ATOM"

Locate forward numerical bugs by dumping intermediate tensors from a target implementation and a known-good reference, then bisecting layer by layer. Also covers batch-invariance bisect (the same token at any batch position should produce a bitwise-identical output, per DeepSeek V4 paper §3.3). Use when "the output is wrong but I don't know where" — model produces gibberish, degenerates, or picks the wrong token, but code review reveals nothing.

2026-05-23, 99 stars

package.json

"author": "ROCm"

"repository": "ROCm/ATOM"

    # Step 3 — single Bash tool call, command ends with `&`
    bash scripts/run_gsm8k_eval.sh /data/MODEL 30000 3 &

    # Step 4 — single Bash tool call, blocks
    bash scripts/wait_infer_drain.sh 30000 30 10
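The two calls above rely on ordinary shell job control: a trailing `&` hands control back immediately, and a later foreground command does the blocking. The sketch below shows only that mechanism. `fake_workload` is a hypothetical stand-in for `run_gsm8k_eval.sh`, and the shell builtin `wait` stands in for `wait_infer_drain.sh`, which per its description also watches for hangs and faults.

```shell
#!/usr/bin/env bash
# Stand-in for a long-running workload script (hypothetical, for illustration).
fake_workload() {
  sleep 1
  return 0
}

# "Step 3": the trailing & puts the job in the shell background and returns at once.
fake_workload &
workload_pid=$!

# "Step 4": one foreground command that blocks until the job has finished.
# The real flow blocks on wait_infer_drain.sh here instead of `wait`.
wait "$workload_pid"
workload_rc=$?

echo "workload pid $workload_pid finished with exit code $workload_rc"
```

Because the job is backgrounded by the shell itself, its lifetime is tied to that shell session, which is why the launch and the blocking wait are issued as two separate calls.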

| Model | Required env vars | Required CLI args |
| --- | --- | --- |
| DeepSeek-V4-Pro | `AITER_BF16_FP8_MOE_BOUND=0 ATOM_MOE_GU_ITLV=1 AITER_LOG_LEVEL=WARNING` | `--kv_cache_dtype fp8 --level 0` |
| DeepSeek-R1-0528 (default) | `AITER_LOG_LEVEL=WARNING` | `--kv_cache_dtype fp8` |
| Kimi-K2.5-MXFP4 | `HSA_NO_SCRATCH_RECLAIM=1 AITER_LOG_LEVEL=WARNING` | `--kv_cache_dtype fp8 --trust-remote-code` (tp=4) |
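One way to avoid retyping the per-model settings is a small lookup helper. The sketch below copies the values from the table above. The function names `atom_model_env` and `atom_model_args` are hypothetical and not part of the skill's scripts, and treating the row marked "(default)" as the fallback for unknown models is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: look up the table's env vars and CLI args by model path.
atom_model_env() {
  case "$1" in
    *DeepSeek-V4-Pro*) echo "AITER_BF16_FP8_MOE_BOUND=0 ATOM_MOE_GU_ITLV=1 AITER_LOG_LEVEL=WARNING" ;;
    *Kimi-K2.5-MXFP4*) echo "HSA_NO_SCRATCH_RECLAIM=1 AITER_LOG_LEVEL=WARNING" ;;
    *)                 echo "AITER_LOG_LEVEL=WARNING" ;;  # DeepSeek-R1-0528 row, used as fallback (assumption)
  esac
}

atom_model_args() {
  case "$1" in
    *DeepSeek-V4-Pro*) echo "--kv_cache_dtype fp8 --level 0" ;;
    *Kimi-K2.5-MXFP4*) echo "--kv_cache_dtype fp8 --trust-remote-code" ;;  # table also notes tp=4
    *)                 echo "--kv_cache_dtype fp8" ;;
  esac
}

model=/data/DeepSeek-V4-Pro
# Dry run: print the launch line instead of executing it.
echo "env $(atom_model_env "$model") bash scripts/start_atom_server.sh $model 8 30000 $(atom_model_args "$model") &"
```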

| Workload | Command (note trailing `&`) | Optional client log for drain |
| --- | --- | --- |
| GSM8K accuracy | `bash scripts/run_gsm8k_eval.sh MODEL PORT NUM_FEWSHOT &` | `/app/logs_claude/gsm8k_eval.log` (lm_eval is silent during requests; drain's auto-discovered server log carries the engine markers — passing this log only helps fault grep coverage) |
| Single benchmark | `bash scripts/run_benchmark.sh MODEL PORT ISL OSL CONC [PROMPT_MULT] [PROFILE] &` | `/app/logs_claude/benchmark.log` (has tqdm progress, useful mtime signal) |
| Concurrency sweep | `bash scripts/run_benchmark_sweep.sh MODEL PORT ISL OSL "CONC1 CONC2 ..." &` | `/app/logs_claude/benchmark.log` (overwritten per step) |
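The sweep row takes its concurrency values as one quoted, space-separated string. The sketch below is a dry run of how such a list can expand into one `run_benchmark.sh` call per value. It is not the contents of `run_benchmark_sweep.sh`, which are not shown here. The function name `sweep_dry_run` and the ISL/OSL value 1024 are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical dry run of a concurrency sweep: one run_benchmark.sh call per value.
sweep_dry_run() {
  local model=$1 port=$2 isl=$3 osl=$4 concs=$5
  local conc
  # $concs is deliberately unquoted so the space-separated list splits into words.
  for conc in $concs; do
    echo "bash scripts/run_benchmark.sh $model $port $isl $osl $conc"
  done
}

sweep_dry_run /data/MODEL 30000 1024 1024 "1 8 64"
```

Since the table notes that `benchmark.log` is overwritten per step, any per-concurrency results would need to be copied out between steps if they are wanted afterwards.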

| Script | What it does | Step | Blocks? |
| --- | --- | --- | --- |
| `stop_atom_server.sh` | Kill all atom + multiproc children, wait for VRAM=0 | 1, 5 | Yes ≤60s |
| `start_atom_server.sh MODEL TP PORT [ARGS...]` | Clean GPU, fork python in bg, best-effort 120s ready poll (must wrap with `&` — step 2.5 is the real gate) | 2 (server) | Self-blocks ≤120s but unreliable as gate |
| `start_simple_inference.sh MODEL TP [ARGS...]` | Offline inference (no server, runs prompts) — wrap with `&` for drain | 3 (offline) | Blocks unless `&` |
| `run_gsm8k_eval.sh MODEL PORT FEWSHOT` | lm_eval local-completions GSM8K — wrap with `&` for drain | 3 (server) | Blocks unless `&` |
| `run_benchmark.sh MODEL PORT ISL OSL CONC [PMULT] [PROF]` | Single perf point — wrap with `&` for drain | 3 (server) | Blocks unless `&` |
| `run_benchmark_sweep.sh MODEL PORT ISL OSL "CONCs"` | Loop run_benchmark — wrap with `&` for drain | 3 (server) | Blocks unless `&` |
| `wait_infer_drain.sh PORT MAX_MIN POLL [LOG] [STUCK]` | Monitor workload for drain / hang / fault (auto-discovers server log) | | Yes, until exit code |
| `wait_server_ready.sh PORT MAX_MIN POLL LOG` | Mandatory ready-gate after start; polls /v1/models + greps log for startup errors | 2.5 (server) | Yes, until ready or fail |
| `run_debug_agent.sh [--simple] MODEL TP [PORT] [ARGS...]` | Server (or simple_inference) under rocm-debug-agent — fault repro | 2 (replaces start) | Yes, until ready or fault |
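The ready gate is described as polling `/v1/models` until the server answers or a time budget runs out. The sketch below shows a generic bounded polling loop of that kind. It is not the contents of `wait_server_ready.sh`; the names `poll_until` and `fake_probe` are hypothetical, and the probe is a stand-in so the example runs without a server.

```shell
#!/usr/bin/env bash
# poll_until MAX_TRIES INTERVAL_SEC CMD...: rerun CMD until it succeeds.
# Returns 0 on the first success, 1 if every attempt fails.
poll_until() {
  local max_tries=$1 interval=$2
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_tries" ]; do
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  return 1
}

# Stand-in probe that succeeds on its third call (hypothetical, for illustration).
probe_calls=0
fake_probe() {
  probe_calls=$((probe_calls + 1))
  [ "$probe_calls" -ge 3 ]
}

poll_until 5 0 fake_probe && echo "ready after $probe_calls probes"
# Against a live server the probe could be, for example (host is an assumption):
#   poll_until 180 5 curl -sf -o /dev/null "http://localhost:30000/v1/models"
```

A loop like this only checks the endpoint. The real gate is also described as grepping the server log for startup errors, which lets it fail fast instead of waiting out the full budget.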

Worked example: V4-Pro MTP3 GSM8K accuracy

    # Step 1
    bash scripts/stop_atom_server.sh

    # Step 2 — note trailing `&` (REQUIRED — see Hard rule 8)
    AITER_BF16_FP8_MOE_BOUND=0 ATOM_MOE_GU_ITLV=1 AITER_LOG_LEVEL=WARNING \
      bash scripts/start_atom_server.sh /data/DeepSeek-V4-Pro 8 30000 \
      --kv_cache_dtype fp8 --method mtp --num-speculative-tokens 3 --level 0 &

    # Step 2.5 — MANDATORY foreground ready gate. Bash tool timeout ≥ 900000ms.
    # Abort to step 5 on non-zero exit.
    bash scripts/wait_server_ready.sh 30000 15 5 /app/logs_claude/atom_server.log

    # Step 3 — note trailing `&`
    bash scripts/run_gsm8k_eval.sh /data/DeepSeek-V4-Pro 30000 3 &

    # Step 4 — drain auto-discovers server log; no LOG_FILE needed
    bash scripts/wait_infer_drain.sh 30000 30 10

    # Step 5
    bash scripts/stop_atom_server.sh

    # Read result
    grep -E "flexible-extract|strict-match" /app/logs_claude/gsm8k_eval.log | head -2
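The example says to abort to step 5 when the ready gate fails, so teardown has to run on both the success and the failure path. The sketch below shows that control flow with stub functions in place of the real scripts. All function names and the `READY_RC` variable are hypothetical, and the stubs only record which steps ran.

```shell
#!/usr/bin/env bash
# Stubs standing in for the real scripts (hypothetical, so the sketch runs anywhere).
steps=""
stop_server()  { steps="$steps stop"; }
start_server() { steps="$steps start"; }
wait_ready()   { steps="$steps ready"; return "$READY_RC"; }
run_workload() { steps="$steps workload"; }
wait_drain()   { steps="$steps drain"; }

run_flow() {
  steps=""
  stop_server            # Step 1: clean GPU
  start_server           # Step 2: the real call ends with &
  if wait_ready; then    # Step 2.5: foreground ready gate
    run_workload         # Step 3: the real call ends with &
    wait_drain           # Step 4: blocks until the workload drains
  fi
  stop_server            # Step 5: teardown runs on both paths
}

READY_RC=0; run_flow; echo "gate passed:$steps"
READY_RC=1; run_flow; echo "gate failed:$steps"
```

Keeping the teardown call outside the `if` is the simplest way to make it unconditional; a `trap ... EXIT` would be an alternative when the flow lives in a single script.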

Worked example: V4-Pro offline simple_inference

    # Step 1
    bash scripts/stop_atom_server.sh

    # Step 2+3 fused (simple_inference IS the workload host) — note trailing `&`
    AITER_BF16_FP8_MOE_BOUND=0 ATOM_MOE_GU_ITLV=1 AITER_LOG_LEVEL=WARNING \
      bash scripts/start_simple_inference.sh /data/DeepSeek-V4-Pro 8 \
      --kv_cache_dtype fp8 --level 0 &

    # Step 4 — drain auto-discovers via /proc; PORT unused
    bash scripts/wait_infer_drain.sh 0 15 10

    # Step 5
    bash scripts/stop_atom_server.sh

name	run-atom-workload
description	Run any ATOM workload — accuracy eval (GSM8K via lm_eval), performance benchmark, concurrency sweep, offline simple_inference, or fault repro under rocm-debug-agent. Use when the user asks to "test accuracy", "测精度", "跑 GSM8K", "跑 benchmark", "test performance", "run sweep", "repro the fault", "测一下 MTP1 精度", "跑 simple_inference" — anything that drives an ATOM workload. Encodes the canonical flow (stop → start → workload-in-shell-bg → wait_infer_drain → stop) and the model-family env vars. Same pattern works for both server-based workloads (lm_eval / benchmark client) and offline simple_inference. Do NOT use for profiling traces (use capture-trace).
version	1.6.0
scope	ATOM on AMD ROCm; `scripts/` orchestration scripts under repo root
last_updated	"2026-05-21T00:00:00.000Z"

run-atom-workload

Path convention

Why this skill exists

Backgrounding mechanism — shell `&`, NOT claude task

Canonical 5-step flow

Step 1 — clean GPU (always)

Step 2 — start workload host (blocks until ready / completion)

Step 2.5 — verify server ready (MANDATORY for server-based workloads)

Step 3 — launch workload in shell background (`&`)

Step 4 — wait_infer_drain (blocks, with early fault/hang detection)

Step 4.5 — hang inspection (optional, only when drain exit=1)

Step 5 — teardown (always)

Reading results

Reading baselines

Hard rules (do not violate)

Reference: each script in one line

Worked example: V4-Pro MTP3 GSM8K accuracy

Worked example: V4-Pro offline simple_inference
