Capture a PyTorch profiler / kineto trace from a running ATOM server for a short benchmark window. Use when the user asks for "a trace", "profiler trace", "GPU trace", or "抓 trace" for performance investigation — what kernels ran, what's on the critical path, what's slow. Do NOT use for crashes (use debug-agent-locate-kernel) or numerical bugs (use dump-bisect-debug).
Identify which GPU kernel is faulting/hanging in ATOM via rocm-debug-agent (for faults/asserts) or rocgdb (for silent livelocks). debug-agent dumps wave registers + faulting PC + (with --save-code-objects) disassembled code object on memory faults / ASSERT_TRAP. rocgdb attaches to a live process and lists in-flight `info dispatches` + HSA `info queues` — works when the kernel isn't faulting but just stuck (e.g. atomic-counter deadlock). Use when: server crashes with "Memory access fault by GPU node-N", server hangs with GPU at 100% but no token output, kernel asserting `s_trap`, or `HIP_LAUNCH_BLOCKING=1` makes a hang vanish. Do NOT use for: numerical bugs (use dump-bisect-debug), compile errors, OOM.
Locate forward numerical bugs by dumping intermediate tensors from a target implementation and a known-good reference, then bisecting layer by layer. Also covers batch-invariance bisect (the same token at any batch position should produce a bitwise-identical output, per DeepSeek V4 paper §3.3). Use when "the output is wrong but I don't know where" — model produces gibberish, degenerates, or picks the wrong token, but code review reveals nothing.
Run any ATOM workload — accuracy eval (GSM8K via lm_eval), performance benchmark, concurrency sweep, offline simple_inference, or fault repro under rocm-debug-agent. Use when the user asks to "test accuracy", "测精度", "跑 GSM8K", "跑 benchmark", "test performance", "run sweep", "repro the fault", "测一下 MTP1 精度", "跑 simple_inference" — anything that drives an ATOM workload. Encodes the canonical flow (stop → start → workload-in-shell-bg → wait_infer_drain → stop) and the model-family env vars. Same pattern works for both server-based workloads (lm_eval / benchmark client) and offline simple_inference. Do NOT use for profiling traces (use capture-trace).
| name | atom-patterns |
| description | Coding patterns and architecture index for the ATOM LLM inference engine |
| version | 1.2.0 |
| scope | ATOM repo on AMD ROCm |
| last_updated | "2026-05-20T00:00:00.000Z" |
atom/
├── config.py # Config, QuantizationConfig, HF config loading
├── entrypoints/ # Server entry (openai_server.py)
├── examples/ # simple_inference.py (offline smoke test)
├── model_engine/ # Core engine pipeline
│ ├── llm_engine.py # Top-level engine
│ ├── engine_core.py # Per-DP-rank loop
│ ├── scheduler.py # Batch scheduling
│ └── model_runner.py # Forward pass, CUDAGraph, KV cache binding
├── model_loader/
│ └── loader.py # Weight loading (safetensors, FP8/FP4, WeightsMapper)
├── model_ops/ # AITER kernel wrappers
│ ├── linear.py # LinearBase, ColumnParallel, RowParallel
│ ├── moe.py # FusedMoE, Mxfp4MoEMethod, weight_loader
│ ├── fused_moe_triton.py # Triton matmul_ogs MoE path
│ ├── attention_mla.py # MLA attention (DeepSeek)
│ ├── attention_mha.py # Standard MHA attention
│ └── paged_attention.py # Paged attention backend
├── models/ # Model implementations (one file per family;
│ # `*_mtp.py` for MTP / speculative variants —
│ # run `ls atom/models/` for the current set)
├── spec_decode/
│ └── eagle.py # MTP proposer (speculative decoding)
├── plugin/ # vLLM/SGLang plugin adapters
└── utils/
    ├── envs.py # All ATOM_* env var definitions
    └── forward_context.py # Module-level forward context
Every model class follows this contract:
class NewModelForCausalLM(nn.Module):
    # Weight loading config (class-level)
    packed_modules_mapping = { ... }
    weights_mapping = { ... }

    def __init__(self, config: Config, prefix: str = ""):
        ...

    def forward(self, input_ids, positions, intermediate_tensors=None, inputs_embeds=None):
        return hidden_states  # or logits

    def compute_logits(self, hidden_states):
        return self.lm_head(hidden_states)
Registration in model_runner.py:
support_model_arch_dict = {
    "NewModelForCausalLM": ("new_model", "NewModelForCausalLM"),
}
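As a rough illustration of how a registry of this shape can be consumed, the sketch below resolves an architecture name to a class by importing the module lazily. The helper name `resolve_model_class`, the `module_prefix` parameter, and the `atom.models.` default are assumptions made for this example; the actual lookup code in model_runner.py may differ.

```python
import importlib

# Hypothetical registry with the same shape as support_model_arch_dict:
# HF architecture name -> (module name, class name).
EXAMPLE_ARCH_DICT = {
    "NewModelForCausalLM": ("new_model", "NewModelForCausalLM"),
}


def resolve_model_class(architectures, arch_dict, module_prefix="atom.models."):
    """Return the class for the first registered name in `architectures`.

    `module_prefix` is an assumption of this sketch; it is prepended to the
    module name stored in the registry before importing.
    """
    for arch in architectures:
        if arch not in arch_dict:
            continue
        module_name, class_name = arch_dict[arch]
        # Import only the module that is actually needed.
        module = importlib.import_module(module_prefix + module_name)
        return getattr(module, class_name)
    raise ValueError(
        f"Unsupported architectures {architectures}; known: {sorted(arch_dict)}"
    )
```

Keeping the registry as plain strings means adding a model does not import its module until that architecture is requested.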
One model file often serves multiple HuggingFace model_types when their architecture is identical or a strict subset. The mapping lives in atom/model_engine/model_runner.py:support_model_arch_dict — that's the authoritative source. Common patterns:
- deepseek_v2.py).
- _mtp.py companion when the spec-decode head differs from the base model.

- ColumnParallelLinear: shards output dim, no all-reduce needed
- RowParallelLinear: shards input dim, all-reduce on output (reduce_results=True)
- ReplicatedLinear: full copy on each rank (gates, small projections)

MoE pattern: FusedMoE + shared_experts both use reduce_results=False, parent does one all-reduce.
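The sharding rules above can be checked with a small pure-Python simulation. This is a minimal sketch, not ATOM code: the function names are hypothetical, the "all-reduce" is a plain elementwise sum over simulated ranks, dimensions are assumed divisible by `world_size`, and the MoE branch is reduced to two row-sharded matrices for illustration.

```python
def matvec(x, w):
    """Compute x @ w for a length-K vector x and a K x N matrix w."""
    return [sum(x[k] * w[k][n] for k in range(len(x))) for n in range(len(w[0]))]


def column_parallel(x, w, world_size):
    """Shard the output dim: each rank owns a slice of columns."""
    step = len(w[0]) // world_size
    out = []
    for rank in range(world_size):
        cols = range(rank * step, (rank + 1) * step)
        shard = [[row[c] for c in cols] for row in w]
        out.extend(matvec(x, shard))  # outputs are concatenated, no all-reduce
    return out


def row_parallel_partials(x, w, world_size):
    """Shard the input dim: each rank yields a full-width partial sum."""
    step = len(w) // world_size
    return [
        matvec(x[r * step:(r + 1) * step], w[r * step:(r + 1) * step])
        for r in range(world_size)
    ]


def all_reduce(partials):
    """Simulated all-reduce: elementwise sum across ranks."""
    return [sum(vals) for vals in zip(*partials)]


def moe_block(x, w_routed, w_shared, world_size):
    """Both branches skip their own reduction; the parent reduces once."""
    routed = row_parallel_partials(x, w_routed, world_size)  # reduce_results=False
    shared = row_parallel_partials(x, w_shared, world_size)  # reduce_results=False
    per_rank = [[a + b for a, b in zip(r, s)] for r, s in zip(routed, shared)]
    return all_reduce(per_rank)  # the single all-reduce in the parent
```

Because addition is associative, summing the two branches on each rank before reducing gives the same result as reducing each branch separately, at half the communication cost.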
- atom/models/new_model.py: Model implementation
- atom/model_engine/model_runner.py: Register in support_model_arch_dict
- atom/config.py: Add to _CONFIG_REGISTRY if config schema differs
- .github/benchmark/models_accuracy.json: CI accuracy test entry
- recipes/: Usage recipe

- grep same pattern across codebase (fix-then-sweep)
- simple_inference.py smoke test
- lm_eval for accuracy regression

- weight (FP8/FP4 packed) + weight.scale (E8M0 block scale)
- .scale → .weight_scale_inv → .weight_scale (auto-rename in loader)
- process_weights_after_loading() hook: shuffle weights for CK kernel layout
- Mxfp4MoEMethod.create_weights() + mxf4_merged_weight_loader()

- @support_torch_compile decorated models (breaks Dynamo)
- forward() (has @torch.inference_mode()), NOT in run_model()
- ATOM_V4_DIAG=1)
- --level 0 --enforce-eager to disable both torch.compile and CUDAGraph

- tests/ directory at repo root
- torch.cuda
- test_<module>.py (e.g., test_scheduler.py, test_block_manager.py)
- python -m atom.examples.simple_inference --model <path> --kv_cache_dtype fp8
- lm_eval with gsm8k (CI threshold != actual baseline)

Authoritative list: atom/utils/envs.py (all ATOM_* defined as lazy lambdas). To read what an env var does, grep the file for its name; the lambda and the call-site comment describe the behavior. Required per-model env vars are listed in .github/benchmark/models.json and .github/benchmark/models_accuracy.json.
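For readers unfamiliar with the "lazy lambda" style of env var definition, here is a minimal sketch of the general pattern. It is not copied from atom/utils/envs.py: the dict name, the `get_env` helper, and the boolean parsing and default chosen for `ATOM_V4_DIAG` are all assumptions for illustration.

```python
import os

# Each entry maps a variable name to a zero-argument lambda. The lambda is
# evaluated on every lookup, so the process environment is read at call time
# rather than at import time.
# The parsing and default below are assumptions of this sketch.
environment_variables = {
    "ATOM_V4_DIAG": lambda: os.getenv("ATOM_V4_DIAG", "0") == "1",
}


def get_env(name):
    """Evaluate the lambda registered for `name` (hypothetical helper)."""
    if name not in environment_variables:
        raise KeyError(f"Unknown env var: {name}")
    return environment_variables[name]()
```

With this layout, grepping the file for a variable name lands directly on the lambda that defines its default and parsing.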
- .github/benchmark/models_accuracy.json (model matrix, thresholds, baselines).
- .github/benchmark/models.json (server args, bench args, runner pinning).
- .github/dashboard/index.html (gh-pages).
- .github/workflows/*.yaml is the source of truth (action versions pinned per workflow).