kernel-trace-analysis

Name: Kernel Trace Analysis
Author: ROCm

Profile GPU kernels using rocprofv3 to collect ATT instruction-level traces, then analyze the trace data using hotspot_analyzer.py to identify top-K stall hotspots (VMEM-load, VMEM-wait, LDS/SMEM-wait, barrier, MFMA stalls) mapped back to source lines, and produce an actionable optimization plan. Usage: /kernel-trace-analysis <cmd>. Can also analyze an existing dispatch dir directly: /kernel-trace-analysis --dir <path>


stars: 192

forks: 56

updated: 29 May 2026, 07:14


related-skills.json

Same repository

format-code.md

from "ROCm/FlyDSL"

Format and clean up changed files before committing, matching the project's CI style gate. Formats Python with black + ruff and C/C++ with clang-format using the repository's .clang-format. Use when the user says "format code", "clean up code", "lint", "format before commit", "/format-code", or mentions black, ruff, clang-format, or CI style failures while tidying their working tree.

2026-05-29 · 192

capture-kernel-trace.md

from "ROCm/FlyDSL"

Capture GPU kernel ATT (Advanced Thread Trace) via rocprofv3 on a remote Docker container or locally. Discovers kernel names, configures input.yaml with the target kernel_include_regex, runs rocprofv3 -i input.yaml with FLYDSL_DEBUG_ENABLE_DEBUG_INFO=1, and downloads the latest ui_output_agent_* directory for analysis. Usage: /capture-kernel-trace <test_script.py> [kernel_name_pattern]

2026-05-29 · 192

lds-optimization.md

from "ROCm/FlyDSL"

Optimize LDS (Local Data Share / shared memory) access patterns in FlyDSL GPU kernels. Diagnose bank conflicts and high lgkmcnt stalls from ATT trace data, then apply swizzle or padding layouts to eliminate conflicts. Also increase the distance between LDS write and subsequent LDS read to hide LDS latency. LDS read preceded by write always requires a sync (s_waitcnt lgkmcnt or s_barrier). Use when trace analysis shows ds_read/ds_write/lgkmcnt as a bottleneck. Usage: /lds-optimization

2026-05-29 · 192

prefetch-data-load.md

from "ROCm/FlyDSL"

Apply prefetch optimization to FlyDSL kernel loops: pre-load the first iteration's data before the loop, issue async loads for the next iteration inside the loop body, and swap buffers at the loop tail via runtime loop-carried values. This overlaps data load latency with compute instructions. Use when a kernel has a loop where buffer_load feeds into MFMA/compute and load latency is exposed. Usage: /prefetch-data-load

2026-05-29 · 192

flydsl-kernel-authoring.md

from "ROCm/FlyDSL"

Comprehensive reference for authoring FlyDSL GPU kernels on AMD GPUs. Covers the layout algebra, tiled copy/MMA, buffer ops, loop-carried range loops, SmemAllocator, autotuning, and common patterns. Use when writing, reviewing, or understanding FlyDSL kernel code.

2026-05-27 · 192

build-rocm-image.md

from "ROCm/FlyDSL"

Connect to a remote host via SSH and build a Docker image with rocprofv3, aiter, and FlyDSL. Use when user wants to build/rebuild the ROCm development image on a remote host. Usage: /build-rocm-image <hostname>

2026-05-20 · 192

package.json

"author": "ROCm"

"repository": "ROCm/FlyDSL"


name	kernel-trace-analysis
description	Profile GPU kernels using rocprofv3 to collect ATT instruction-level traces, then analyze the trace data using hotspot_analyzer.py to identify top-K stall hotspots (VMEM-load, VMEM-wait, LDS/SMEM-wait, barrier, MFMA stalls) mapped back to source lines, and produce an actionable optimization plan. Usage: /kernel-trace-analysis <cmd>. Can also analyze an existing dispatch dir directly: /kernel-trace-analysis --dir <path>
tools	Read,Edit,Bash,Grep,Glob,Agent,Write
note	All analysis is done programmatically via hotspot_analyzer.py + code.json. Do NOT use GUI tools.

Kernel Trace Analysis

Profile and analyze GPU kernel ATT traces to identify stall hotspots and produce an optimization plan.

Arguments

Argument	Description
`<CMD>`	Command to profile. Example: `python bench_pa.py --batch 32`
`--dir <path>`	Skip collection; analyze an existing `ui_output_agent_*_dispatch_*` directory
`--topk N`	Show top-N hotspots (default: 15)

Analyzer Scripts

scripts/hotspot_analyzer.py — reads a ui_output_agent_*_dispatch_* ATT directory; reports top-K stall hotspots, stall-type breakdown, and occupancy (combined-VGPR-pool model, reads accum/LDS/SGPR from out_kernel_trace.csv).
scripts/pmc_l2_analyzer.py — reads rocprofv3 PMC counter CSV(s); reports L2 hit rate, HBM 32B-partial fraction, and over-fetch ratio. Use when a kernel is memory-bound and you need to know why (ATT has no cache counters). See "L2 / HBM efficiency analysis" under Step 5.

Workflow

Mode A: Analyze existing dispatch directory

If the user provides --dir <path> or already has a ui_output_agent_*_dispatch_* directory:

# Write hotspot_analyzer.py (see above), then:
python /tmp/hotspot_analyzer.py <dispatch_dir> --topk 15 --mode both
python /tmp/hotspot_analyzer.py <dispatch_dir> --topk 5 --mode src --detail --context 4

Skip to Step 5: Interpret Results.

Mode B: Full collection workflow

Step 1: Kernel Discovery

touch /tmp/trace_ts
rocprofv3 --stats --kernel-trace -f csv -- <CMD> 2>&1
find . -maxdepth 3 -name "*stats*" -newer /tmp/trace_ts -type f 2>/dev/null

Parse the stats CSV and present a kernel table:

Rank	Kernel Name	Calls	Total (us)	Avg (us)	% GPU Time

Ask the user which kernel to trace if not obvious.
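The parsing step for the stats CSV can be sketched as below. This is a minimal sketch, not the skill's own parser: the column names `Name`, `Calls`, and `TotalDurationNs` are assumptions about the CSV header, so check them against the file rocprofv3 actually produced. The sample kernel names and numbers are made up for illustration.

```python
import csv
import io


def rank_kernels(csv_text, top=20):
    """Return kernels sorted by total GPU time, largest first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Column names below are assumed; adjust to the real CSV header.
    grand_total_ns = sum(float(r["TotalDurationNs"]) for r in rows) or 1.0
    table = []
    for r in rows:
        total_ns = float(r["TotalDurationNs"])
        calls = int(r["Calls"])
        table.append({
            "name": r["Name"],
            "calls": calls,
            "total_us": total_ns / 1000.0,
            "avg_us": total_ns / calls / 1000.0 if calls else 0.0,
            "pct": 100.0 * total_ns / grand_total_ns,
        })
    table.sort(key=lambda k: k["total_us"], reverse=True)
    return table[:top]


# Synthetic input, for illustration only.
SAMPLE = """Name,Calls,TotalDurationNs
rmsnorm,2,2000
pa_decode,4,8000
"""

ranked = rank_kernels(SAMPLE)
print("Rank  Kernel Name          Calls  Total (us)  Avg (us)  % GPU Time")
for rank, k in enumerate(ranked, 1):
    print(f"{rank:>4}  {k['name']:<20} {k['calls']:>5}  {k['total_us']:>10.1f}"
          f"  {k['avg_us']:>8.1f}  {k['pct']:>9.1f}%")
```

For a real run, read the file found by the `find` command and pass its contents to `rank_kernels`.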

Prefer results.db if available — use sqlite3 for structured queries:

sqlite3 results.db "
SELECT ks.KernelName, COUNT(*) calls,
       ROUND(AVG(kd.end-kd.start)/1000.0,1) avg_us
FROM rocpd_kernel_dispatch kd
JOIN rocpd_info_kernel_symbol ks ON kd.kernel_symbol_id=ks.id
GROUP BY ks.KernelName ORDER BY avg_us DESC LIMIT 20;"

Step 2: Configure input.yaml

cp ~/Documents/input.yaml /tmp/trace_input.yaml

Edit /tmp/trace_input.yaml:

jobs:
   -
       kernel_include_regex: <KERNEL_NAME_PATTERN>
       kernel_iteration_range: "[1, [3-4]]"
       output_file: out
       output_directory: kernel_trace_output
       output_format: [csv]
       truncate_kernels: true
       sys_trace: true
       advanced_thread_trace: true
       att_target_cu: 1
       att_shader_engine_mask: "0xf"
       att_simd_select: "0xf"
       att_buffer_size: "0x6000000"

Key notes:

kernel_iteration_range: "[1, [3-4]]" skips warmup and traces dispatches 3-4
att_buffer_size: "0x6000000" is 96 MB per SE; increase to "0xC000000" (192 MB) if the trace is truncated
att_target_cu: 1 traces a single CU, which keeps the output manageable
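The buffer-size arithmetic in the notes above can be checked with a few lines of Python. The helper names `att_buffer_mib` and `doubled` are hypothetical and exist only for this illustration.

```python
def att_buffer_mib(hex_size):
    """Convert an att_buffer_size hex string to MiB."""
    return int(hex_size, 16) / (1024 * 1024)


def doubled(hex_size):
    """Return the hex string for twice the given buffer size."""
    return hex(int(hex_size, 16) * 2)


default_mib = att_buffer_mib("0x6000000")   # default from input.yaml
bigger = doubled("0x6000000")               # value to try if truncated
bigger_mib = att_buffer_mib(bigger)

print(default_mib, bigger, bigger_mib)
```

Doubling the default gives the same value as the suggested `"0xC000000"`, apart from letter case.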

Step 3: Collect ATT Trace

FLYDSL_DEBUG_ENABLE_DEBUG_INFO=1 rocprofv3 -i /tmp/trace_input.yaml -- <CMD> 2>&1
find . -type d -name "ui_output_agent_*" -newer /tmp/trace_ts 2>/dev/null
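When the `find` command returns several directories, a small script can pick the most recent one. The following is a sketch under the assumption that "latest" means the directory with the newest modification time; `newest_dispatch_dir` is a hypothetical helper, not part of the skill's scripts, and the directory names in the demo are made up.

```python
import os
import tempfile
from pathlib import Path


def newest_dispatch_dir(root="."):
    """Return the most recently modified ui_output_agent_* directory, or None."""
    candidates = [p for p in Path(root).rglob("ui_output_agent_*") if p.is_dir()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Demo on a throwaway directory tree with controlled timestamps.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp, "run1", "ui_output_agent_111_dispatch_3")
    new = Path(tmp, "run2", "ui_output_agent_222_dispatch_4")
    old.mkdir(parents=True)
    new.mkdir(parents=True)
    os.utime(old, (1_000, 1_000))
    os.utime(new, (2_000, 2_000))
    picked_name = newest_dispatch_dir(tmp).name
    empty_result = newest_dispatch_dir(Path(tmp, "run1", "ui_output_agent_111_dispatch_3"))

print(picked_name)
```

Unlike `find -newer /tmp/trace_ts`, this does not filter by a timestamp file; it only ranks what it finds.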

If rocprof-trace-decoder library is missing:

wget -q https://github.com/ROCm/rocprof-trace-decoder/releases/download/0.1.6/rocprof-trace-decoder-manylinux-2.28-0.1.6-Linux.sh
chmod +x rocprof-trace-decoder-manylinux-2.28-0.1.6-Linux.sh
./rocprof-trace-decoder-manylinux-2.28-0.1.6-Linux.sh --skip-license --prefix=/tmp/rtd-install
find /tmp/rtd-install -name '*.so*' -exec cp -a {} /opt/rocm/lib/ \;
ldconfig

Output structure:

ui_output_agent_<PID>_dispatch_<N>/
├── code.json          ← PRIMARY: per-instruction stall/cycle data
├── snapshots.json     ← source file path mapping (virtual → local filename)
├── source_0_*.py      ← embedded source files
├── filenames.json     ← wave file index
├── occupancy.json     ← occupancy timeline
└── se*_sm*_sl*_wv*.json  ← per-wave raw traces

Step 4: Run hotspot_analyzer.py

Write the script (see above), then run:

# Full report
python /tmp/hotspot_analyzer.py <dispatch_dir> --topk 15 --mode both

# Source-level with code context (best for optimization)
python /tmp/hotspot_analyzer.py <dispatch_dir> --topk 5 --mode src --detail --context 4

# ASM-only for instruction-level detail
python /tmp/hotspot_analyzer.py <dispatch_dir> --mode asm --topk 20

Step 5: Interpret Results

code.json field reference

Each row in code["code"] is:

[asm, _, pc_index, source_loc, _, pc_addr, exec_count, total_cycles, stall_cycles, issue_cycles]
  0   1     2          3       4     5          6            7              8             9

col[8] stall_cycles: cycles the instruction was blocked from issuing — primary hotspot metric
col[7] total_cycles: total cycles charged to this instruction across all waves
col[3] source_loc: "/path/to/file.py:LINE" — virtual path resolved via snapshots.json
col[6] exec_count: number of wave-threads that executed this instruction
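Using the column indices listed above, ranking instructions by stall cycles might look like the sketch below. It is not the implementation of hotspot_analyzer.py. The stall rate is computed here as `stall_cycles / total_cycles`, which is an assumption about how the analyzer defines it, and the two rows are synthetic.

```python
import json

# Column indices taken from the field reference above.
ASM, SOURCE_LOC, EXEC_COUNT, TOTAL_CYCLES, STALL_CYCLES = 0, 3, 6, 7, 8


def top_stalls(code_rows, topk=15):
    """Return (asm, source_loc, stall_cycles, stall_rate) for the top-K rows."""
    ranked = sorted(code_rows, key=lambda r: r[STALL_CYCLES], reverse=True)
    result = []
    for row in ranked[:topk]:
        total = row[TOTAL_CYCLES]
        # Assumed definition: fraction of the instruction's cycles spent stalled.
        rate = row[STALL_CYCLES] / total if total else 0.0
        result.append((row[ASM], row[SOURCE_LOC], row[STALL_CYCLES], rate))
    return result


# Synthetic stand-in for code.json; real files come from the dispatch directory.
doc = json.loads("""{"code": [
  ["v_mfma_f32_16x16x32_fp8_fp8 ...", 0, 1, "/k.py:12", 0, 16, 4, 400, 100, 300],
  ["s_waitcnt vmcnt(0)", 0, 0, "/k.py:10", 0, 0, 4, 1000, 900, 100]
]}""")

hot = top_stalls(doc["code"], topk=1)
print(hot)
```

For a real trace, load `code.json` from the dispatch directory with `json.load` and pass `doc["code"]` in the same way.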

snapshots.json: resolving source paths

snapshots.json encodes a nested dict tree mapping virtual paths to local filenames:

{"/": {"FlyDSL": {"kernels": {"pa_decode_sw_fp8_ps.py": "source_0_pa_decode_sw_fp8_ps.py"}}}}

Flatten recursively: /FlyDSL/kernels/pa_decode_sw_fp8_ps.py → source_0_pa_decode_sw_fp8_ps.py
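The recursive flattening can be sketched as follows, using the example tree shown above. `flatten_snapshots` is a hypothetical name; the sketch assumes that leaves are strings (local filenames) and that every inner node is a dict.

```python
def flatten_snapshots(tree, prefix=""):
    """Flatten the nested snapshots.json tree into {virtual_path: local_file}."""
    flat = {}
    for name, child in tree.items():
        if name == "/":
            path = "/"                      # filesystem root entry
        elif prefix.endswith("/"):
            path = prefix + name            # avoid a double slash after root
        else:
            path = prefix + "/" + name
        if isinstance(child, dict):
            flat.update(flatten_snapshots(child, path))
        else:
            flat[path] = child              # leaf: local source_* filename
    return flat


snapshots = {"/": {"FlyDSL": {"kernels": {
    "pa_decode_sw_fp8_ps.py": "source_0_pa_decode_sw_fp8_ps.py"}}}}

flat = flatten_snapshots(snapshots)
print(flat)
```

A `source_loc` such as `"/FlyDSL/kernels/pa_decode_sw_fp8_ps.py:LINE"` can then be split at the last colon and the path looked up in `flat`.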

Stall type classification

Type	Instructions	Root Cause
`VMEM-load`	`buffer_load_*`, `global_load_*`	Load itself stalled (VMEM queue full or back-pressure from no compute to hide behind)
`VMEM-wait`	`s_waitcnt vmcnt(N)`	Waiting for outstanding VMEM loads to complete
`LDS/SMEM-wait`	`s_waitcnt lgkmcnt(N)`	Waiting for LDS or SMEM ops
`barrier`	`s_barrier`	Cross-wave sync — slowest wave dominates
`MFMA/FMA`	`v_mfma_*`	MFMA dependency chain (RAW hazard)
`LDS`	`ds_read_*`, `ds_write_*`	LDS access latency
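The table above translates into a simple mnemonic classifier. This is a sketch of one possible mapping, not the analyzer's actual code. Two choices are assumptions: an `s_waitcnt` that names both counters is classified as `VMEM-wait`, and anything not in the table falls into a catch-all `other` bucket.

```python
def classify_stall(asm):
    """Map one disassembly line to a stall type from the table above."""
    parts = asm.strip().split()
    if not parts:
        return "other"
    op = parts[0]
    if op.startswith(("buffer_load_", "global_load_")):
        return "VMEM-load"
    if op == "s_waitcnt":
        # Assumption: vmcnt takes priority when both counters are present.
        if "vmcnt" in asm:
            return "VMEM-wait"
        if "lgkmcnt" in asm:
            return "LDS/SMEM-wait"
        return "other"
    if op == "s_barrier":
        return "barrier"
    if op.startswith("v_mfma_"):
        return "MFMA/FMA"
    if op.startswith(("ds_read_", "ds_write_")):
        return "LDS"
    return "other"


for line in ["buffer_load_dwordx4 v[0:3], v4, s[0:3], 0 offen",
             "s_waitcnt lgkmcnt(0)",
             "s_barrier"]:
    print(f"{classify_stall(line):<14} {line}")
```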

Common hotspot patterns

Pattern 1: V/K loads inside MFMA loop → very high stall rate (80–95%)

# BAD: load and MFMA alternate — only 1 MFMA of hiding time
for k_step in range_constexpr(QKHELOOP * 2):
    if k_step % 2 == 0:
        v_data = buffer_ops.buffer_load(...)   # stall_rate ~92%
    acc = rocdl.mfma_f32_16x16x32_fp8_fp8(...)

# GOOD: batch all loads before the MFMA loop
for td in range_constexpr(TLOOP):
    v_prefetch[td] = [buffer_ops.buffer_load(...) for _ in range_constexpr(QKHELOOP)]

for td in range_constexpr(TLOOP):
    for k_step in range_constexpr(QKHELOOP * 2):
        acc = rocdl.mfma_f32_16x16x32_fp8_fp8(...)   # entire QK MFMA hides VMEM latency
    v_results[td] = v_prefetch[td]   # already in registers

Pattern 2: Sequential loads with no compute → VMEM queue saturation

# BAD: all loads back-to-back, no compute interleaved
for td in range_constexpr(TLOOP):
    for qkhe in range_constexpr(QKHELOOP):
        k4 = buffer_ops.buffer_load(k_rsrc, ka_dw, ...)   # queue fills up

# GOOD: prefetch next tile's K loads during current tile's MFMA computation

Pattern 3: LDS prob reads immediately before PV MFMA → lgkmcnt stall

# BAD: LDS reads and MFMA in same loop
for vhe in ...:
    for vt in ...:
        p_i64 = lds_read(...)    # issued here
        tmp = mfma(v_i64, p_i64, ...)   # immediately consumed → lgkmcnt stall

# GOOD: batch all LDS reads first, then all MFMAs
for vhe in ...:
    for vt in ...:
        p_i64s.append(lds_read(...))    # all LDS reads issued first

for vhe in ...:
    for vt in ...:
        tmp = mfma(v_i64s[...], p_i64s[...], ...)   # LDS data already ready

Pattern 4: Scale loads too close to usage

# BAD: scale load and usage separated by only TLOOP MFMAs
for td in range_constexpr(TLOOP):
    k_scale = buffer_ops.buffer_load(ks_rsrc, ...)   # issued here
# ... small compute gap ...
    result = acc * k_scale   # used too soon → stall

# GOOD: issue scale loads at the very beginning of the block,
# before K loads, to maximise latency hiding distance

Pattern 5: Hotspot attributed to kernel entry line

When the @flyc.kernel (kernel decorator) line appears as the top hotspot with a mix of VMEM-wait and barrier stall types, it is a debug-info aggregation artifact. MLIR/compiler-generated instructions (address arithmetic, cndmask, prologue setup) map to the outermost scope line. Ignore this line and focus on lines with explicit user ops.

Register pressure check (architecture-aware)

hotspot_analyzer.py auto-detects the GPU architecture from ISA instruction patterns and computes occupancy (waves/SIMD) as the minimum across every resource limiter:

occupancy = min(vgpr_limit, lds_limit, sgpr_limit, hw_max=8)
  vgpr_limit = 512 // (arch_vgpr_alloc + accum_vgpr_alloc)              # per SIMD
  lds_limit  = (LDS_total // lds_per_wg) * waves_per_wg // 4_SIMDs      # per SIMD
  sgpr_limit = 800 // sgpr_alloc                                        # per SIMD

VGPR is a combined 512-entry pool on BOTH gfx942 and gfx950. CDNA2 (gfx90a) unified the arch (256) and accum (256) VGPR files into one 512 budget per SIMD, and gfx942/gfx950 inherit that. Occupancy from VGPR is 512 / (arch + accum) on both — NOT 256 / max(arch, accum). (The separate-pool 256/max model only applied to gfx908 / CDNA1, where accum VGPRs were a distinct file accessible only by MFMA.)

Property	CDNA3 (gfx942)	CDNA4 (gfx950)
VGPR pool	512 combined (256 arch + 256 accum, unified budget)	512 combined (same)
Occupancy formula (VGPR)	`512 / (arch_alloc + accum_alloc)`	`512 / (arch_alloc + accum_alloc)`
Alloc granularity	8 VGPRs	8 VGPRs
LDS size	64 KB	160 KB
LDS alloc block	256 bytes	1280 bytes
VMCNT width	6 bits (max 63 in-flight)	6 bits (max 63 in-flight)
LGKMCNT width	4 bits (max 15 in-flight)	4 bits (max 15 in-flight)

What actually changed in CDNA4 vs CDNA3 is the LDS size (64KB→160KB) and the LDS alloc granularity — not the VGPR pooling model.

Reading the real counts. code.json only holds the (often single-CU, often vgpr-form) disassembly, so it cannot reveal accum_vgpr / LDS / SGPR / workgroup size — an AGPR-form-blind ISA scan reports accum=0 and gets occupancy badly wrong. The analyzer reads out_kernel_trace.csv (staged next to the dispatch dir) for the authoritative Accum_VGPR_Count / LDS_Block_Size / SGPR_Count / Workgroup_Size_*. arch_vgpr is taken as max(ISA_scan, CSV) so a bogus-low CSV VGPR_Count field can't under-report. If no CSV is found it falls back to ISA-only and prints a warning.

Auto-detection: gfx950-specific instructions (v_mfma_scale_f32_*, v_mfma_f32_16x16x128_*, v_mfma_f32_32x32x64_*) indicate CDNA4. Absence indicates CDNA3.

sqlite3 results.db "
SELECT ks.KernelName, ki.arch_vgpr_count, ki.accum_vgpr_count, ki.lds_size
FROM rocpd_kernel_dispatch kd
JOIN rocpd_info_kernel_symbol ks ON kd.kernel_symbol_id=ks.id
JOIN rocpd_info_kernel ki ON kd.kernel_id=ki.id LIMIT 5;"

Worked example (PA decode, gfx942): arch 144 + accum 136 = 280 combined → 512//280 = 1 wave/SIMD, VGPR-bound (LDS allows 5, SGPR allows 7). Reaching 2 waves needs combined ≤ 256, e.g. freeing ~24 VGPRs.
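The VGPR arithmetic of the worked example can be reproduced with the sketch below. It follows the combined-pool formula given earlier and takes the LDS and SGPR limits (5 and 7) as given from the example rather than recomputing them. Rounding each register count up to the 8-VGPR granule separately is an assumption, and all function names are hypothetical.

```python
def round_up(value, granule):
    """Round value up to the next multiple of granule."""
    return -(-value // granule) * granule


def vgpr_limit(arch_vgpr, accum_vgpr, pool=512, granule=8):
    """Waves per SIMD allowed by the combined VGPR pool."""
    # Assumption: arch and accum counts are each rounded to the alloc granule.
    combined = round_up(arch_vgpr, granule) + round_up(accum_vgpr, granule)
    return pool // combined if combined else pool


def occupancy(arch_vgpr, accum_vgpr, lds_limit, sgpr_limit, hw_max=8):
    """Minimum across every resource limiter, as in the formula above."""
    return min(vgpr_limit(arch_vgpr, accum_vgpr), lds_limit, sgpr_limit, hw_max)


def vgprs_to_free(arch_vgpr, accum_vgpr, target_waves, pool=512, granule=8):
    """Combined VGPRs that must be freed to fit target_waves per SIMD."""
    combined = round_up(arch_vgpr, granule) + round_up(accum_vgpr, granule)
    return max(0, combined - pool // target_waves)


waves = occupancy(144, 136, lds_limit=5, sgpr_limit=7)
to_free = vgprs_to_free(144, 136, target_waves=2)
print(waves, to_free)
```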

Warning: maxnreg forcing accum_vgpr=0 doubles occupancy but causes MFMA spills through arch_vgpr — measured 4.5x GPU slowdown. Do not use maxnreg for MFMA-heavy kernels.

L2 / HBM efficiency analysis (PMC, not ATT)

When the ATT hotspots are dominated by VMEM-load at high stall rate (e.g. 40-50% of stall, ~94% per-load), the kernel is memory-bound and the next question is why — and ATT cannot answer it (it has no cache counters). Capture PMC counters (see capture-kernel-trace "PMC Mode") and analyze with scripts/pmc_l2_analyzer.py:

python scripts/pmc_l2_analyzer.py \
    /tmp/pmc_out/pass_1/pmc_l2_counter_collection.csv \
    /tmp/pmc_ea_out/pass_1/pmc_ea_counter_collection.csv \
    --kernel <kernel> --ideal-gb <bytes_per_dispatch_GB> --ea-channels 2

Three metrics, three decisions:

Metric	Formula	What it tells you
L2 hit rate	`TCC_HIT/(TCC_HIT+TCC_MISS)`	Is there temporal reuse to exploit?
32B fraction	`TCC_EA0_RDREQ_32B/TCC_EA0_RDREQ`	Spatial locality / cache-line waste
over-fetch	`est_HBM_bytes / (ideal_GB × dispatches)`	Redundant fetching
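The three formulas in the table can be computed as in the sketch below. It is not the code of pmc_l2_analyzer.py: the estimated HBM traffic is taken as an input in GB because the source does not say how the script derives it, and the counter values are synthetic numbers chosen only to exercise the arithmetic.

```python
def l2_metrics(tcc_hit, tcc_miss, rdreq, rdreq_32b, est_hbm_gb, ideal_gb, dispatches):
    """Compute L2 hit rate, 32B fraction, and over-fetch ratio."""
    accesses = tcc_hit + tcc_miss
    ideal_total_gb = ideal_gb * dispatches
    return {
        "l2_hit_rate": tcc_hit / accesses if accesses else 0.0,
        "frac_32b": rdreq_32b / rdreq if rdreq else 0.0,
        "over_fetch": est_hbm_gb / ideal_total_gb if ideal_total_gb else 0.0,
    }


# Synthetic counter values, for illustration only.
m = l2_metrics(tcc_hit=17, tcc_miss=983,
               rdreq=1000, rdreq_32b=0,
               est_hbm_gb=10.4, ideal_gb=1.0, dispatches=10)

print(f"L2 hit rate : {m['l2_hit_rate']:.1%}")
print(f"32B fraction: {m['frac_32b']:.1%}")
print(f"over-fetch  : {m['over_fetch']:.2f}x")
```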

Decision tree for a memory-bound decode kernel:

L2 hit rate < 5% → pure streaming, no reuse. This is expected and correct for decode with independent per-sequence paged KV — each KV byte is read once; the GQA (×heads) and MTP (×seq) reuse is captured in registers/LDS, never re-reads L2. "Improving L2 hit rate" is a non-goal. The only thing that raises it is real KV reuse = shared-prefix serving (a workload/scheduling property, not a kernel change).
32B fraction ≈ 0% → full 64B cache lines, no spatial-locality waste. Nothing to fix at the line level. (A high 32B% would point to scattered or misaligned access worth restructuring.)
over-fetch ≈ 1.0x → the kernel reads exactly the data it needs. The achieved bandwidth (compute as ideal_bytes / kernel_time) is then the real ceiling for this access pattern. 50-60% of theoretical HBM peak is normal even for clean streaming; paged-gather decode living at ~54% with 0% partial + ~1.0x over-fetch is healthy, not a defect.

Worked example (PA decode, gfx942, bs=16, ctx=131072, batch=256): L2 hit 1.7%, 32B 0%, over-fetch 1.04x, 2.85 TB/s = 54% peak. Conclusion: the memory subsystem is clean; there is no KV-load optimization left — verified by also testing block_size 16→64 (regressed +7.8%) and confirming dwordx8 doesn't exist on CDNA3 (dwordx4 / 16B is the max single vector load).

Counter-capture caveat: keep each PMC job to ≤ ~4 TCC counters (single hardware pass). Multi-pass collection has triggered a GPU Hang on gfx942 — see capture-kernel-trace.

MFMA latency reference (cycles = pipeline depth)

Instruction	Variant	Cycles	Notes
`v_mfma_f32_*_f16` / `_bf16`	16x16x16	16
`v_mfma_f32_*_f16` / `_bf16`	32x32x8	32
`v_mfma_f32_*_fp8_fp8`	16x16x32	16	CDNA3+CDNA4
`v_mfma_f32_*_fp8_fp8`	32x32x16	32	CDNA3+CDNA4
`v_mfma_f32_16x16x128_f8f6f4`	16x16x128	16 or 32	CDNA4 only; 32 if either A or B is FP8
`v_mfma_f32_32x32x64_f8f6f4`	32x32x64	32 or 64	CDNA4 only; 64 if either A or B is FP8
`v_mfma_scale_f32_16x16x128_f8f6f4`	16x16x128	16 or 32	CDNA4 only; with block exponent scaling
`v_mfma_scale_f32_32x32x64_f8f6f4`	32x32x64	32 or 64	CDNA4 only; with block exponent scaling
`v_mfma_f32_*_f32`	16x16x4	32
`v_mfma_f32_*_f32`	32x32x2	64
`v_mfma_f64_16x16x4_f64`	16x16x4	64

MFMA dependency NOPs (CDNA4, from ISA reference Table 38)

These are the minimum independent instructions (or s_nop counts) required between MFMA result production and consumption. The values vary by MFMA variant:

Dependency pattern	Required waits	Comment
XDL write -> same XDL read SrcC (accumulate, exact same vDst)	0-2	Forwarding path; back-to-back accumulation OK
XDL write -> VALU/VM/LDS/FLAT read result (RAW)	5, 8, 12, or 20	No forwarding; must wait for MFMA commit to VGPR
XDL write -> MFMA read as SrcA or SrcB	5, 8, 12, or 20	No forwarding path
Non-DLops VALU write -> MFMA read	2	No 4/8 cycle forwarding path
VALU writes SGPR -> VMEM reads that SGPR	5	HW does NOT check this — user must add waits
V_CMPX* writes EXEC -> V_MFMA*	4	No EXEC forwarding with MFMA

Wait counts for "5, 8, 12, or 20" depend on MFMA variant:

5 waits: 16x16 4-block variants (8 cycle MFMAs)
8 waits: 16x16x16 F16/BF16 etc. (16 cycle MFMAs)
12 waits: 32x32x8, 16x16x4 F32, etc. (32 cycle MFMAs)
20 waits: 32x32x4 F32, 32x32x2 F32 (64 cycle MFMAs)
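The mapping from MFMA latency to required waits can be kept as a small lookup, sketched below. The table values come from the list above. The helper `nops_to_insert` is hypothetical and assumes each independent instruction already sitting between producer and consumer counts as one wait.

```python
# MFMA latency in cycles -> required waits, from the list above.
WAITS_BY_MFMA_CYCLES = {8: 5, 16: 8, 32: 12, 64: 20}


def required_waits(mfma_cycles):
    """Waits needed between an XDL write and a non-forwarded read."""
    try:
        return WAITS_BY_MFMA_CYCLES[mfma_cycles]
    except KeyError:
        raise ValueError(f"no wait count listed for {mfma_cycles}-cycle MFMA") from None


def nops_to_insert(mfma_cycles, independent_instrs):
    """Extra waits still needed after counting independent instructions."""
    # Assumption: one independent instruction covers one required wait.
    return max(0, required_waits(mfma_cycles) - independent_instrs)


print(required_waits(16))        # 16-cycle MFMA, e.g. 16x16x16 F16/BF16
print(nops_to_insert(32, 5))     # 32-cycle MFMA with 5 instructions in between
```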

Step 6: Optimization Plan

After running hotspot_analyzer.py --detail, produce a prioritized plan:

## Stall Summary
- Total stalls: X cycles (Y% of kernel)
- Top stall type: VMEM-load (Z%)

## Hotspot Analysis

### #1 :LINE  stall=XK (N%)  VMEM-load  stall_rate=92%
Root cause: buffer_load inside QK MFMA loop — only 1 MFMA of hiding time.
Fix: Move all V loads before the QK MFMA loop.
Estimated gain: ~20% kernel cycle reduction.

### #2 :LINE  stall=XK (N%)  VMEM-load  stall_rate=80%
Root cause: K loads sequential with no compute interleaved.
Fix: Prefetch next tile's K during current tile's MFMA (double-buffer pattern).
See /prefetch-data-load skill.

### #3 ...

## Priority Order
1. [HIGH]  Fix V-load position (24% of all stalls, easy refactor)
2. [HIGH]  K-load cross-tile prefetch (8% of stalls, needs _process_block restructure)
3. [MED]   Move scale loads earlier (8% of stalls, trivial move)
4. [LOW]   Batch LDS reads before PV MFMA (4% of stalls, loop split)

Error Handling

Error	Fix
`rocprof-trace-decoder library path not found`	Install decoder .so (see Step 3)
Trace output empty	Check `kernel_include_regex` matches exactly
Trace truncated	Increase `att_buffer_size` to `"0xC000000"`
`kernel_iteration_range` mismatch	Adjust range; try `"[0, [1-2]]"`
`INVALID_SHADER_DATA`	aqlprofile/decoder version mismatch — update both
Source loc all `""`	Set `FLYDSL_DEBUG_ENABLE_DEBUG_INFO=1`; check `-g` flag in compile pipeline
Top hotspot is kernel decorator line	Debug info artifact — skip it, focus on op lines
`--att` flag error	`--att` is boolean, no value; use `-i input.yaml` for full config
GPU Hang / HW Exception during PMC	Too many counters → multi-pass. Split into single-pass jobs of ≤ ~4 TCC counters
PMC `accum_vgpr=0` but kernel uses MFMA	vgpr-form MFMA: accumulators are in the arch VGPR file; read total from `VGPR_Count + Accum_VGPR_Count`
