| name | kernel-trace-analysis |
| description | Profile GPU kernels using rocprofv3 to collect ATT instruction-level traces, then analyze the trace data using hotspot_analyzer.py to identify top-K stall hotspots (VMEM-load, VMEM-wait, LDS/SMEM-wait, barrier, MFMA stalls) mapped back to source lines, and produce an actionable optimization plan. Usage: /kernel-trace-analysis <cmd> Can also analyze an existing dispatch dir directly: /kernel-trace-analysis --dir <path>
|
| tools | Read,Edit,Bash,Grep,Glob,Agent,Write |
| note | All analysis is done programmatically via hotspot_analyzer.py + code.json. Do NOT use GUI tools. |
Kernel Trace Analysis
Profile and analyze GPU kernel ATT traces to identify stall hotspots and produce
an optimization plan.
Arguments
| Argument | Description |
|---|
<CMD> | Command to profile. Example: python bench_pa.py --batch 32 |
--dir <path> | Skip collection; analyze existing ui_output_agent_*_dispatch_* directory |
--topk N | Show top-N hotspots (default: 15) |
Analyzer Scripts
scripts/hotspot_analyzer.py — reads a ui_output_agent_*_dispatch_* ATT
directory; reports top-K stall hotspots, stall-type breakdown, and occupancy
(combined-VGPR-pool model, reads accum/LDS/SGPR from out_kernel_trace.csv).
scripts/pmc_l2_analyzer.py — reads rocprofv3 PMC counter CSV(s); reports
L2 hit rate, HBM 32B-partial fraction, and over-fetch ratio. Use when a
kernel is memory-bound and you need to know why (ATT has no cache counters).
See "L2 / HBM efficiency analysis" under Step 5.
Workflow
Mode A: Analyze existing dispatch directory
If the user provides --dir <path> or already has a ui_output_agent_*_dispatch_* directory:
python /tmp/hotspot_analyzer.py <dispatch_dir> --topk 15 --mode both
python /tmp/hotspot_analyzer.py <dispatch_dir> --topk 5 --mode src --detail --context 4
Skip to Step 4: Interpret Results.
Mode B: Full collection workflow
Step 1: Kernel Discovery
touch /tmp/trace_ts
rocprofv3 --stats --kernel-trace -f csv -- <CMD> 2>&1
find . -maxdepth 3 -name "*stats*" -newer /tmp/trace_ts -type f 2>/dev/null
Parse the stats CSV and present a kernel table:
| Rank | Kernel Name | Calls | Total (us) | Avg (us) | % GPU Time |
|---|
Ask the user which kernel to trace if not obvious.
Prefer results.db if available — use sqlite3 for structured queries:
sqlite3 results.db "
SELECT ks.KernelName, COUNT(*) calls,
ROUND(AVG(kd.end-kd.start)/1000.0,1) avg_us
FROM rocpd_kernel_dispatch kd
JOIN rocpd_info_kernel_symbol ks ON kd.kernel_symbol_id=ks.id
GROUP BY ks.KernelName ORDER BY avg_us DESC LIMIT 20;"
Step 2: Configure input.yaml
cp ~/Documents/input.yaml /tmp/trace_input.yaml
Edit /tmp/trace_input.yaml:
jobs:
-
kernel_include_regex: <KERNEL_NAME_PATTERN>
kernel_iteration_range: "[1, [3-4]]"
output_file: out
output_directory: kernel_trace_output
output_format: [csv]
truncate_kernels: true
sys_trace: true
advanced_thread_trace: true
att_target_cu: 1
att_shader_engine_mask: "0xf"
att_simd_select: "0xf"
att_buffer_size: "0x6000000"
Key notes:
kernel_iteration_range: "[1, [3-4]]" skips warmup, traces dispatches 3-4
att_buffer_size: 96MB per SE; increase to "0xC000000" if truncated
att_target_cu: 1: single CU keeps output manageable
Step 3: Collect ATT Trace
FLYDSL_DEBUG_ENABLE_DEBUG_INFO=1 rocprofv3 -i /tmp/trace_input.yaml -- <CMD> 2>&1
find . -type d -name "ui_output_agent_*" -newer /tmp/trace_ts 2>/dev/null
If rocprof-trace-decoder library is missing:
wget -q https://github.com/ROCm/rocprof-trace-decoder/releases/download/0.1.6/rocprof-trace-decoder-manylinux-2.28-0.1.6-Linux.sh
chmod +x rocprof-trace-decoder-manylinux-2.28-0.1.6-Linux.sh
./rocprof-trace-decoder-manylinux-2.28-0.1.6-Linux.sh --skip-license --prefix=/tmp/rtd-install
find /tmp/rtd-install -name '*.so*' -exec cp -a {} /opt/rocm/lib/ \;
ldconfig
Output structure:
ui_output_agent_<PID>_dispatch_<N>/
├── code.json ← PRIMARY: per-instruction stall/cycle data
├── snapshots.json ← source file path mapping (virtual → local filename)
├── source_0_*.py ← embedded source files
├── filenames.json ← wave file index
├── occupancy.json ← occupancy timeline
└── se*_sm*_sl*_wv*.json ← per-wave raw traces
Step 4: Run hotspot_analyzer.py
Write the script (see above), then run:
python /tmp/hotspot_analyzer.py <dispatch_dir> --topk 15 --mode both
python /tmp/hotspot_analyzer.py <dispatch_dir> --topk 5 --mode src --detail --context 4
python /tmp/hotspot_analyzer.py <dispatch_dir> --mode asm --topk 20
Step 5: Interpret Results
code.json field reference
Each row in code["code"] is:
[asm, _, pc_index, source_loc, _, pc_addr, exec_count, total_cycles, stall_cycles, issue_cycles]
0 1 2 3 4 5 6 7 8 9
- col[8]
stall_cycles: cycles the instruction was blocked from issuing — primary hotspot metric
- col[7]
total_cycles: total cycles charged to this instruction across all waves
- col[3]
source_loc: "/path/to/file.py:LINE" — virtual path resolved via snapshots.json
- col[6]
exec_count: number of wave-threads that executed this instruction
snapshots.json: resolving source paths
snapshots.json encodes a nested dict tree mapping virtual paths to local filenames:
{"/": {"FlyDSL": {"kernels": {"pa_decode_sw_fp8_ps.py": "source_0_pa_decode_sw_fp8_ps.py"}}}}
Flatten recursively: /FlyDSL/kernels/pa_decode_sw_fp8_ps.py → source_0_pa_decode_sw_fp8_ps.py
Stall type classification
| Type | Instructions | Root Cause |
|---|
VMEM-load | buffer_load_*, global_load_* | Load itself stalled (VMEM queue full or back-pressure from no compute to hide behind) |
VMEM-wait | s_waitcnt vmcnt(N) | Waiting for outstanding VMEM loads to complete |
LDS/SMEM-wait | s_waitcnt lgkmcnt(N) | Waiting for LDS or SMEM ops |
barrier | s_barrier | Cross-wave sync — slowest wave dominates |
MFMA/FMA | v_mfma_* | MFMA dependency chain (RAW hazard) |
LDS | ds_read_*, ds_write_* | LDS access latency |
Common hotspot patterns
Pattern 1: V/K loads inside MFMA loop → very high stall rate (80–95%)
for k_step in range_constexpr(QKHELOOP * 2):
if k_step % 2 == 0:
v_data = buffer_ops.buffer_load(...)
acc = rocdl.mfma_f32_16x16x32_fp8_fp8(...)
for td in range_constexpr(TLOOP):
v_prefetch[td] = [buffer_ops.buffer_load(...) for _ in range_constexpr(QKHELOOP)]
for td in range_constexpr(TLOOP):
for k_step in range_constexpr(QKHELOOP * 2):
acc = rocdl.mfma_f32_16x16x32_fp8_fp8(...)
v_results[td] = v_prefetch[td]
Pattern 2: Sequential loads with no compute → VMEM queue saturation
for td in range_constexpr(TLOOP):
for qkhe in range_constexpr(QKHELOOP):
k4 = buffer_ops.buffer_load(k_rsrc, ka_dw, ...)
Pattern 3: LDS prob reads immediately before PV MFMA → lgkmcnt stall
for vhe in ...:
for vt in ...:
p_i64 = lds_read(...)
tmp = mfma(v_i64, p_i64, ...)
for vhe in ...:
for vt in ...:
p_i64s.append(lds_read(...))
for vhe in ...:
for vt in ...:
tmp = mfma(v_i64s[...], p_i64s[...], ...)
Pattern 4: Scale loads too close to usage
for td in range_constexpr(TLOOP):
k_scale = buffer_ops.buffer_load(ks_rsrc, ...)
result = acc * k_scale
Pattern 5: Hotspot attributed to kernel entry line
When @flyc.kernel / kernel decorator line appears as the top hotspot with a mix of
VMEM-wait + barrier stall types — this is a debug info aggregation artifact.
MLIR/compiler-generated instructions (address arithmetic, cndmask, prologue setup) map
to the outermost scope line. Ignore this line; focus on lines with explicit user ops.
Register pressure check (architecture-aware)
hotspot_analyzer.py auto-detects the GPU architecture from ISA instruction patterns
and computes occupancy (waves/SIMD) as the minimum across every resource limiter:
occupancy = min(vgpr_limit, lds_limit, sgpr_limit, hw_max=8)
vgpr_limit = 512 // (arch_vgpr_alloc + accum_vgpr_alloc) # per SIMD
lds_limit = (LDS_total // lds_per_wg) * waves_per_wg // 4_SIMDs # per SIMD
sgpr_limit = 800 // sgpr_alloc # per SIMD
VGPR is a combined 512-entry pool on BOTH gfx942 and gfx950. CDNA2 (gfx90a)
unified the arch (256) and accum (256) VGPR files into one 512 budget per SIMD,
and gfx942/gfx950 inherit that. Occupancy from VGPR is 512 / (arch + accum) on
both — NOT 256 / max(arch, accum). (The separate-pool 256/max model only
applied to gfx908 / CDNA1, where accum VGPRs were a distinct file accessible
only by MFMA.)
| Property | CDNA3 (gfx942) | CDNA4 (gfx950) |
|---|
| VGPR pool | 512 combined (256 arch + 256 accum, unified budget) | 512 combined (same) |
| Occupancy formula (VGPR) | 512 / (arch_alloc + accum_alloc) | 512 / (arch_alloc + accum_alloc) |
| Alloc granularity | 8 VGPRs | 8 VGPRs |
| LDS size | 64 KB | 160 KB |
| LDS alloc block | 256 bytes | 1280 bytes |
| VMCNT width | 6 bits (max 63 in-flight) | 6 bits (max 63 in-flight) |
| LGKMCNT width | 4 bits (max 15 in-flight) | 4 bits (max 15 in-flight) |
What actually changed in CDNA4 vs CDNA3 is the LDS size (64KB→160KB) and the LDS
alloc granularity — not the VGPR pooling model.
Reading the real counts. code.json only holds the (often single-CU,
often vgpr-form) disassembly, so it cannot reveal accum_vgpr / LDS / SGPR /
workgroup size — an AGPR-form-blind ISA scan reports accum=0 and gets
occupancy badly wrong. The analyzer reads out_kernel_trace.csv (staged next to
the dispatch dir) for the authoritative Accum_VGPR_Count / LDS_Block_Size /
SGPR_Count / Workgroup_Size_*. arch_vgpr is taken as max(ISA_scan, CSV) so
a bogus-low CSV VGPR_Count field can't under-report. If no CSV is found it
falls back to ISA-only and prints a warning.
Auto-detection: gfx950-specific instructions (v_mfma_scale_f32_*, v_mfma_f32_16x16x128_*,
v_mfma_f32_32x32x64_*) indicate CDNA4. Absence indicates CDNA3.
sqlite3 results.db "
SELECT ks.KernelName, ki.arch_vgpr_count, ki.accum_vgpr_count, ki.lds_size
FROM rocpd_kernel_dispatch kd
JOIN rocpd_info_kernel_symbol ks ON kd.kernel_symbol_id=ks.id
JOIN rocpd_info_kernel ki ON kd.kernel_id=ki.id LIMIT 5;"
Worked example (PA decode, gfx942): arch 144 + accum 136 = 280 combined → 512//280 = 1
wave/SIMD, VGPR-bound (LDS allows 5, SGPR allows 7). Reaching 2 waves needs
combined ≤ 256, e.g. freeing ~24 VGPRs.
Warning: maxnreg forcing accum_vgpr=0 doubles occupancy but causes MFMA spills through
arch_vgpr — measured 4.5x GPU slowdown. Do not use maxnreg for MFMA-heavy kernels.
L2 / HBM efficiency analysis (PMC, not ATT)
When the ATT hotspots are dominated by VMEM-load at high stall rate (e.g.
40-50% of stall, ~94% per-load), the kernel is memory-bound and the next
question is why — and ATT cannot answer it (it has no cache counters).
Capture PMC counters (see capture-kernel-trace "PMC Mode") and analyze with
scripts/pmc_l2_analyzer.py:
python scripts/pmc_l2_analyzer.py \
/tmp/pmc_out/pass_1/pmc_l2_counter_collection.csv \
/tmp/pmc_ea_out/pass_1/pmc_ea_counter_collection.csv \
--kernel <kernel> --ideal-gb <bytes_per_dispatch_GB> --ea-channels 2
Three metrics, three decisions:
| Metric | Formula | What it tells you |
|---|
| L2 hit rate | TCC_HIT/(TCC_HIT+TCC_MISS) | Is there temporal reuse to exploit? |
| 32B fraction | TCC_EA0_RDREQ_32B/TCC_EA0_RDREQ | Spatial locality / cache-line waste |
| over-fetch | est_HBM_bytes / (ideal_GB × dispatches) | Redundant fetching |
Decision tree for a memory-bound decode kernel:
- L2 hit rate < 5% → pure streaming, no reuse. This is expected and
correct for decode with independent per-sequence paged KV — each KV byte
is read once; the GQA (×heads) and MTP (×seq) reuse is captured in
registers/LDS, never re-reads L2. "Improving L2 hit rate" is a non-goal.
The only thing that raises it is real KV reuse = shared-prefix serving
(a workload/scheduling property, not a kernel change).
- 32B fraction ≈ 0% → full 64B cache lines, no spatial-locality waste.
Nothing to fix at the line level. (High 32B% would point to scattered/
misaligned access worth restructuring.)
- over-fetch ≈ 1.0x → the kernel reads exactly the data it needs. The
achieved bandwidth (compute as
ideal_bytes / kernel_time) is then the
real ceiling for this access pattern. 50-60% of theoretical HBM peak is
normal even for clean streaming; paged-gather decode living at ~54% with
0% partial + ~1.0x over-fetch is healthy, not a defect.
Worked example (PA decode, gfx942, bs=16, ctx=131072, batch=256):
L2 hit 1.7%, 32B 0%, over-fetch 1.04x, 2.85 TB/s = 54% peak. Conclusion: the
memory subsystem is clean; there is no KV-load optimization left — verified
by also testing block_size 16→64 (regressed +7.8%) and confirming dwordx8
doesn't exist on CDNA3 (dwordx4 / 16B is the max single vector load).
Counter-capture caveat: keep each PMC job to ≤ ~4 TCC counters (single
hardware pass). Multi-pass collection has triggered a GPU Hang on gfx942 — see
capture-kernel-trace.
MFMA latency reference (cycles = pipeline depth)
| Instruction | Variant | Cycles | Notes |
|---|
v_mfma_f32_*_f16 / _bf16 | 16x16x16 | 16 | |
v_mfma_f32_*_f16 / _bf16 | 32x32x8 | 32 | |
v_mfma_f32_*_fp8_fp8 | 16x16x32 | 16 | CDNA3+CDNA4 |
v_mfma_f32_*_fp8_fp8 | 32x32x16 | 32 | CDNA3+CDNA4 |
v_mfma_f32_16x16x128_f8f6f4 | 16x16x128 | 16 or 32 | CDNA4 only; 32 if either A or B is FP8 |
v_mfma_f32_32x32x64_f8f6f4 | 32x32x64 | 32 or 64 | CDNA4 only; 64 if either A or B is FP8 |
v_mfma_scale_f32_16x16x128_f8f6f4 | 16x16x128 | 16 or 32 | CDNA4 only; with block exponent scaling |
v_mfma_scale_f32_32x32x64_f8f6f4 | 32x32x64 | 32 or 64 | CDNA4 only; with block exponent scaling |
v_mfma_f32_*_f32 | 16x16x4 | 32 | |
v_mfma_f32_*_f32 | 32x32x2 | 64 | |
v_mfma_f64_16x16x4_f64 | 16x16x4 | 64 | |
MFMA dependency NOPs (CDNA4, from ISA reference Table 38)
These are the minimum independent instructions (or s_nop counts) required between
MFMA result production and consumption. The values vary by MFMA variant:
| Dependency pattern | Required waits | Comment |
|---|
| XDL write -> same XDL read SrcC (accumulate, exact same vDst) | 0-2 | Forwarding path; back-to-back accumulation OK |
| XDL write -> VALU/VM/LDS/FLAT read result (RAW) | 5, 8, 12, or 20 | No forwarding; must wait for MFMA commit to VGPR |
| XDL write -> MFMA read as SrcA or SrcB | 5, 8, 12, or 20 | No forwarding path |
| Non-DLops VALU write -> MFMA read | 2 | No 4/8 cycle forwarding path |
| VALU writes SGPR -> VMEM reads that SGPR | 5 | HW does NOT check this — user must add waits |
| V_CMPX* writes EXEC -> V_MFMA* | 4 | No EXEC forwarding with MFMA |
Wait counts for "5, 8, 12, or 20" depend on MFMA variant:
- 5 waits: 16x16 4-block variants (8 cycle MFMAs)
- 8 waits: 16x16x16 F16/BF16 etc. (16 cycle MFMAs)
- 12 waits: 32x32x8, 16x16x4 F32, etc. (32 cycle MFMAs)
- 20 waits: 32x32x4 F32, 32x32x2 F32 (64 cycle MFMAs)
Step 6: Optimization Plan
After running hotspot_analyzer.py --detail, produce a prioritized plan:
## Stall Summary
- Total stalls: X cycles (Y% of kernel)
- Top stall type: VMEM-load (Z%)
## Hotspot Analysis
### #1 :LINE stall=XK (N%) VMEM-load stall_rate=92%
Root cause: buffer_load inside QK MFMA loop — only 1 MFMA of hiding time.
Fix: Move all V loads before the QK MFMA loop.
Estimated gain: ~20% kernel cycle reduction.
### #2 :LINE stall=XK (N%) VMEM-load stall_rate=80%
Root cause: K loads sequential with no compute interleaved.
Fix: Prefetch next tile's K during current tile's MFMA (double-buffer pattern).
See /prefetch-data-load skill.
### #3 ...
## Priority Order
1. [HIGH] Fix V-load position (24% of all stalls, easy refactor)
2. [HIGH] K-load cross-tile prefetch (8% of stalls, needs _process_block restructure)
3. [MED] Move scale loads earlier (8% of stalls, trivial move)
4. [LOW] Batch LDS reads before PV MFMA (4% of stalls, loop split)
Error Handling
| Error | Fix |
|---|
rocprof-trace-decoder library path not found | Install decoder .so (see Step 3) |
| Trace output empty | Check kernel_include_regex matches exactly |
| Trace truncated | Increase att_buffer_size to "0xC000000" |
kernel_iteration_range mismatch | Adjust range; try "[0, [1-2]]" |
INVALID_SHADER_DATA | aqlprofile/decoder version mismatch — update both |
Source loc all "" | Set FLYDSL_DEBUG_ENABLE_DEBUG_INFO=1; check -g flag in compile pipeline |
| Top hotspot is kernel decorator line | Debug info artifact — skip it, focus on op lines |
--att flag error | --att is boolean, no value; use -i input.yaml for full config |
| GPU Hang / HW Exception during PMC | Too many counters → multi-pass. Split into single-pass jobs of ≤ ~4 TCC counters |
PMC accum_vgpr=0 but kernel uses MFMA | vgpr-form MFMA: accumulators are in the arch VGPR file; read total from VGPR_Count + Accum_VGPR_Count |