capture-trace

Name: Capture Trace
Author: ROCm

Capture a PyTorch profiler / kineto trace from a running ATOM server for a short benchmark window. Use when the user asks for "a trace", "profiler trace", "GPU trace", or "抓 trace" for performance investigation: what kernels ran, what's on the critical path, what's slow. Do NOT use for crashes (use debug-agent-locate-kernel) or numerical bugs (use dump-bisect-debug).


stars:99

forks:64

updated: 23 May 2026, 16:05

SKILL.md

Related skills (same repository)

atom-patterns.md

from "ROCm/ATOM"

Coding patterns and architecture index for the ATOM LLM inference engine

2026-05-23 · 99 stars

debug-agent-locate-kernel.md

from "ROCm/ATOM"

Identify which GPU kernel is faulting/hanging in ATOM via rocm-debug-agent (for faults/asserts) or rocgdb (for silent livelocks). debug-agent dumps wave registers + faulting PC + (with --save-code-objects) disassembled code object on memory faults / ASSERT_TRAP. rocgdb attaches to a live process and lists in-flight `info dispatches` + HSA `info queues` — works when the kernel isn't faulting but just stuck (e.g. atomic-counter deadlock). Use when: server crashes with "Memory access fault by GPU node-N", server hangs with GPU at 100% but no token output, kernel asserting `s_trap`, or `HIP_LAUNCH_BLOCKING=1` makes a hang vanish. Do NOT use for: numerical bugs (use dump-bisect-debug), compile errors, OOM.

2026-05-23 · 99 stars

dump-bisect-debug.md

from "ROCm/ATOM"

Locate forward numerical bugs by dumping intermediate tensors from a target implementation and a known-good reference, then bisecting layer by layer. Also covers batch-invariance bisect (the same token at any batch position should produce a bitwise-identical output, per DeepSeek V4 paper §3.3). Use when "the output is wrong but I don't know where" — model produces gibberish, degenerates, or picks the wrong token, but code review reveals nothing.

2026-05-23 · 99 stars

run-atom-workload.md

from "ROCm/ATOM"

Run any ATOM workload — accuracy eval (GSM8K via lm_eval), performance benchmark, concurrency sweep, offline simple_inference, or fault repro under rocm-debug-agent. Use when the user asks to "test accuracy", "测精度", "跑 GSM8K", "跑 benchmark", "test performance", "run sweep", "repro the fault", "测一下 MTP1 精度", "跑 simple_inference" — anything that drives an ATOM workload. Encodes the canonical flow (stop → start → workload-in-shell-bg → wait_infer_drain → stop) and the model-family env vars. Same pattern works for both server-based workloads (lm_eval / benchmark client) and offline simple_inference. Do NOT use for profiling traces (use capture-trace).

2026-05-23 · 99 stars

package.json

"author": "ROCm"

"repository": "ROCm/ATOM"


SOC occupation: Network and Computer Systems Administrators (Computer and Mathematical Occupations), 15-1244, L4

ls /app/ATOM/scripts/start_atom_server.sh   # launcher
ls /app/ATOM/scripts/run_benchmark.sh       # bench driver (passes --profile when PROFILE=1)
ls /app/ATOM/scripts/wait_server_ready.sh   # ready-poll
python3 -c "import torch.profiler"          # kineto present
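The same pre-flight check can be scripted so that a missing launcher is reported before any server is started. The following is a minimal Python sketch, assuming the script paths listed above; the helper names `missing_paths` and `has_module` are hypothetical and not part of ATOM.

```python
import importlib.util
from pathlib import Path

# Paths taken from the check above; adjust if the checkout lives elsewhere.
REQUIRED_SCRIPTS = [
    "/app/ATOM/scripts/start_atom_server.sh",
    "/app/ATOM/scripts/run_benchmark.sh",
    "/app/ATOM/scripts/wait_server_ready.sh",
]


def missing_paths(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]


def has_module(top_level_name):
    """True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(top_level_name) is not None


if __name__ == "__main__":
    for p in missing_paths(REQUIRED_SCRIPTS):
        print(f"missing: {p}")
    if not has_module("torch"):
        print("missing: torch (needed for torch.profiler / kineto)")
```

Locating `torch` with `find_spec` avoids paying the full import cost; it only shows that the package is installed, so the `python3 -c "import torch.profiler"` line remains the stronger check.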

| Param | Meaning | Typical |
| --- | --- | --- |
| `MODEL` | Model path under `/data/` | `/data/DeepSeek-V4-Pro` |
| `TP` | Tensor-parallel size | 8 (4 for Kimi, 1 for gpt-oss-120b) |
| `ISL` / `OSL` | Random input / output length | 1024 / 1024 |
| `CONC` | Concurrency the bench keeps in flight | 64 or 128 |
| `PROMPT_MULTIPLIER` | Total prompts = `CONC` * this | 1 for trace runs (override the script default of 10) |
| `ATOM_PROFILER_MORE` | 1 = shapes + stack + memory (large traces, OOM risk); 0 = kernel-name only | 0 unless asked |
| `TRACE_DIR` | Where the kineto `.pt.trace.json.gz` lands | `/app/logs_claude/traces/<run-name>` |
| `EXTRA_ARGS` | Forwarded to the openai server (MTP, kv-cache, etc.) | See [[atom-patterns]] |
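To make the `PROMPT_MULTIPLIER` row concrete, the arithmetic can be written out. This is an illustrative Python sketch only; the `TraceRun` class is hypothetical and simply mirrors the parameter names and typical values from the table.

```python
from dataclasses import dataclass


@dataclass
class TraceRun:
    conc: int = 64               # CONC
    prompt_multiplier: int = 1   # PROMPT_MULTIPLIER, 1 for trace runs
    isl: int = 1024              # ISL
    osl: int = 1024              # OSL

    def total_prompts(self) -> int:
        # Total prompts = CONC * PROMPT_MULTIPLIER (from the table)
        return self.conc * self.prompt_multiplier

    def max_output_tokens(self) -> int:
        # Upper bound on generated tokens if every prompt runs to OSL
        return self.total_prompts() * self.osl


print(TraceRun().total_prompts())                      # 64
print(TraceRun(prompt_multiplier=10).total_prompts())  # 640
```

With `CONC=64`, leaving the script default of 10 in place sends 640 prompts instead of 64, so the profiled window covers roughly ten times as much work.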

TRACE_DIR=/app/logs_claude/traces/<run-name>
mkdir -p "$TRACE_DIR"
# ATOM_PROFILER_MORE goes in the server env, not the client.
ATOM_PROFILER_MORE=0 \
bash /app/ATOM/scripts/start_atom_server.sh \
  "$MODEL" "$TP" 8000 \
  --torch-profiler-dir "$TRACE_DIR" \
  $EXTRA_ARGS
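When the launch is driven from Python rather than a shell, the same command can be assembled as an argument list. The sketch below is an assumption-laden illustration: `build_server_launch` is a hypothetical helper, and only the script path, positional arguments, `--torch-profiler-dir`, and `ATOM_PROFILER_MORE` come from the command above.

```python
import shlex


def build_server_launch(model, tp, trace_dir, extra_args="",
                        profiler_more=False, port=8000):
    """Return (env_overrides, argv) mirroring the shell launch above."""
    # ATOM_PROFILER_MORE must be set on the server process, not the client.
    env = {"ATOM_PROFILER_MORE": "1" if profiler_more else "0"}
    argv = [
        "bash", "/app/ATOM/scripts/start_atom_server.sh",
        model, str(tp), str(port),
        "--torch-profiler-dir", trace_dir,
        *shlex.split(extra_args),  # EXTRA_ARGS, split like a shell would
    ]
    return env, argv


env, argv = build_server_launch("/data/DeepSeek-V4-Pro", 8, "/tmp/traces")
print(env)
print(shlex.join(argv))
```

The result can be passed to `subprocess.Popen(argv, env={**os.environ, **env})`. Note that `shlex.split` honours quotes inside `EXTRA_ARGS`, whereas an unquoted `$EXTRA_ARGS` in the shell only splits on whitespace.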

for i in $(seq 1 60); do
  GZ=$(find "$TRACE_DIR" -name "*.pt.trace.json.gz" | wc -l)
  JSON=$(find "$TRACE_DIR" -name "*.pt.trace.json" -not -name "*.gz" | wc -l)
  echo "[t=${i}0s] gz=$GZ json=$JSON"
  # Done = expected gz count AND no orphan .json (the .json is deleted after gzip)
  [ "$GZ" -ge "$TP" ] && [ "$JSON" -eq 0 ] && break
  sleep 10
done
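The completion rule in the loop above (at least `TP` gzip files and no leftover raw `.json`) can also be expressed in Python, which makes it easier to reuse from a test harness. This is a sketch under the same assumptions as the shell loop; `export_status` and `wait_for_export` are hypothetical names.

```python
import time
from pathlib import Path


def export_status(trace_dir):
    """Count finished .gz traces and not-yet-compressed .json traces."""
    root = Path(trace_dir)
    gz = sum(1 for _ in root.rglob("*.pt.trace.json.gz"))
    raw = sum(1 for _ in root.rglob("*.pt.trace.json"))
    return gz, raw


def wait_for_export(trace_dir, tp, attempts=60, interval_s=10.0):
    """Poll until at least tp .gz files exist and no raw .json remains."""
    for _ in range(attempts):
        gz, raw = export_status(trace_dir)
        if gz >= tp and raw == 0:
            return True
        time.sleep(interval_s)
    return False
```

Unlike the shell loop, `wait_for_export` returns `False` on timeout, so a caller can distinguish a finished export from one that never completed.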

zcat "$TRACE_DIR"/rank_0/*.gz | python3 -c "
import json, sys
events = json.load(sys.stdin)['traceEvents']
names = {e.get('name','') for e in events}
for kw in ['<kernel-substring>', ...]:
    hits = sorted(n for n in names if kw in n)
    print(f'{kw}: {len(hits)} matches')
    for h in hits[:5]: print(f'  {h}')
"
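The one-liner above pipes every `.gz` in the rank directory through a single `json.load`, which parses exactly one JSON document. If a rank directory ever holds more than one trace file, the concatenated stream would fail to parse. A hedged alternative is to load each file separately; `load_events` and `matching_names` below are hypothetical helpers, and the only format assumption is the top-level `traceEvents` key used above.

```python
import gzip
import json
from pathlib import Path


def load_events(gz_path):
    """Read one gzip-compressed trace file and return its traceEvents list."""
    with gzip.open(gz_path, "rt", encoding="utf-8") as fh:
        return json.load(fh)["traceEvents"]


def matching_names(rank_dir, keywords):
    """Map each keyword to the sorted event names that contain it."""
    names = set()
    for gz_path in sorted(Path(rank_dir).glob("*.gz")):
        names.update(e.get("name", "") for e in load_events(gz_path))
    return {kw: sorted(n for n in names if kw in n) for kw in keywords}
```

A call such as `matching_names(f"{trace_dir}/rank_0", ["moe"])` returns a dictionary whose values can be printed or truncated to the first few hits, as the shell version does.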

zcat "$TRACE_DIR"/rank_0/*.gz | python3 -c "
import json, sys
events = json.load(sys.stdin)['traceEvents']
def stat(kw):
    m = [e for e in events if kw in e.get('name','') and 'dur' in e]
    if m: print(f'{kw}: count={len(m)} total_us={sum(e[\"dur\"] for e in m)}')
stat('aiter::topk_softplus')
stat('aiter::moe_forward')
# ...
"
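When the interesting kernel names are not known in advance, ranking every event name by summed duration is a reasonable first pass. The sketch below works on an already-loaded `traceEvents` list; `top_by_duration` is a hypothetical helper, and it relies only on the `name` and `dur` fields used in the snippet above.

```python
from collections import defaultdict


def top_by_duration(events, k=10):
    """Return the k event names with the largest summed 'dur' (microseconds)."""
    totals = defaultdict(lambda: [0, 0])  # name -> [count, total_us]
    for e in events:
        if "dur" in e:  # metadata events carry no duration
            entry = totals[e.get("name", "")]
            entry[0] += 1
            entry[1] += e["dur"]
    ranked = sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)
    return [(name, count, total) for name, (count, total) in ranked[:k]]


sample = [
    {"name": "a", "dur": 5},
    {"name": "b", "dur": 7},
    {"name": "a", "dur": 4},
    {"name": "meta"},
]
print(top_by_duration(sample, k=2))  # [('a', 2, 9), ('b', 1, 7)]
```

Nested or overlapping events are each counted in full, so these totals rank where time is attributed rather than measure wall-clock time.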


name	capture-trace
description	Capture a PyTorch profiler / kineto trace from a running ATOM server for a short benchmark window. Use when the user asks for "a trace", "profiler trace", "GPU trace", or "抓 trace" for performance investigation — what kernels ran, what's on the critical path, what's slow. Do NOT use for crashes (use debug-agent-locate-kernel) or numerical bugs (use dump-bisect-debug).
version	1.2.0
scope	ATOM on AMD ROCm (PyTorch kineto profiler, per-rank `*.pt.trace.json.gz`)
last_updated	"2026-05-20T00:00:00.000Z"

capture-trace

When to use

Critical pre-flight

Required tools

Parameters

Workflow

Step 1: Launch the server with the profiler bound

Step 2: Drive a SHORT bench with `PROFILE=1`

Step 3: Wait for the exporter to finish (asynchronous on the server)

Step 4: Verify the layout

Step 5: Inspect

`record_function` tag format

`ATOM_PROFILER_MORE` cost

Looking up model configs

Anti-patterns

Cross-references
