
Name: Oob Detection
Author: ROCm



stars: 192

forks: 56

updated: May 20, 2026, 06:06

SKILL.md


name	oob-detection
description	Detect out-of-bounds memory accesses in CPU or GPU code using static interval analysis and runtime assertions/printfs. Use when investigating OOB, buffer overrun, invalid memory access, HIP/ROCm illegal address, CUDA illegal memory access, silent tensor corruption, or suspicious buffer_load/store address arithmetic.
allowed-tools	Read Edit Bash Grep Glob Agent

OOB Detection

Use this skill when a kernel, runtime, or host path may read or write outside its intended buffer or logical tile. Prefer proving the address range first instead of relying only on runtime failures.

1. Classify the OOB

Decide which boundary may be violated:

Boundary	Meaning	Typical detector
Physical allocation OOB	Address leaves the allocated tensor/buffer	HIP illegal address, runtime failure
Logical object OOB	Address stays in allocation but crosses a row/head/tile	Static interval analysis, explicit runtime check
Lane/thread ownership OOB	Thread reads another lane's slot	Static interval analysis, debug printf/assert
LDS/shared-memory OOB	Address exceeds allocated shared-memory region	Static interval analysis, LDS index guard

Physical tools usually miss logical OOB because the access can still be inside the same allocation.
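The nesting of these boundaries can be sketched as a small host-side helper. This is only an illustration, not part of the skill: `Interval` and `classify_access` are hypothetical names, all ranges are inclusive element indices, and the LDS case is left out because it follows the same containment test against the shared-memory region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: int  # first element index, inclusive
    hi: int  # last element index, inclusive

    def contains(self, other: "Interval") -> bool:
        return self.lo <= other.lo and other.hi <= self.hi


def classify_access(access, allocation, logical_object, lane_slot):
    """Return the outermost boundary that the access violates."""
    if not allocation.contains(access):
        return "physical allocation OOB"
    if not logical_object.contains(access):
        return "logical object OOB"
    if not lane_slot.contains(access):
        return "lane/thread ownership OOB"
    return "in-bounds"


# An access just past a 128-element head, but well inside a 1024-element
# allocation: a fault-based detector would stay silent here.
print(classify_access(
    access=Interval(128, 131),
    allocation=Interval(0, 1023),
    logical_object=Interval(0, 127),
    lane_slot=Interval(120, 127),
))
```

The checks run from the outermost boundary inward, so the label names the widest boundary that was crossed.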

2. Static Interval Analysis

For each memory access, write the exact element range:

start = base + offset_expression
end   = start + vec_width - 1
legal = [object_base, object_base + object_extent - 1]

Then substitute known ranges:

Thread/lane ids: threadIdx.x, lane, lane16id, warp_id, rowid
Compile-time loops: range_constexpr(N) gives i in [0, N-1]
Vector widths: buffer_load(..., vec_width=W, dtype=T) reads W elements of T
Strides and shapes: tensor .shape, .stride, layout shape/stride, tile extents
Masks/clamps: select(valid, value, safe_value) changes the range only if it dominates the load/store

If max(end) > object_base + object_extent - 1, the OOB is statically proven. If min(start) < object_base, the lower-bound OOB is statically proven.
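For address expressions that are affine in the substituted variables, the substitution step can be mechanized. The sketch below is a hypothetical helper, not a FlyDSL API: `terms` is a list of `(coefficient, lo, hi)` triples, one per variable, and all quantities are in elements.

```python
def access_bounds(base, terms, vec_width):
    """Return (min_start, max_end) for base + sum(coeff * var) with var in [lo, hi]."""
    min_start = base + sum(min(c * lo, c * hi) for c, lo, hi in terms)
    max_start = base + sum(max(c * lo, c * hi) for c, lo, hi in terms)
    return min_start, max_start + vec_width - 1


def prove_oob(base, terms, vec_width, object_base, object_extent):
    """Compare the access range against the legal range of the object."""
    min_start, max_end = access_bounds(base, terms, vec_width)
    legal_end = object_base + object_extent - 1
    return {
        "min_start": min_start,
        "max_end": max_end,
        "lower_oob": min_start < object_base,
        "upper_oob": max_end > legal_end,
    }


# A 64-element row read by 16 threads, each starting at tid * 4.
four_wide = prove_oob(0, [(4, 0, 15)], vec_width=4, object_base=0, object_extent=64)
eight_wide = prove_oob(0, [(4, 0, 15)], vec_width=8, object_base=0, object_extent=64)
print(four_wide)
print(eight_wide)
```

The bounds are exact only when the variables vary independently. If two variables are correlated (for example, a mask excludes some combinations), the result is a conservative over-approximation, and the extreme combination should be confirmed reachable before calling the OOB proven.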

FlyDSL Example

For a Q head with HEAD_SIZE = 128:

q_elem = q_base + lane16id * 8
load_start = q_elem + qwi * 4
load_end = load_start + 3
lane16id in [0, 15], qwi in [0, 3]

max(load_end) = q_base + 15*8 + 3*4 + 3 = q_base + 135
legal head end = q_base + 127

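This proves logical OOB for the head: lane16id = 15 with qwi = 2 or 3 reads up to q_base + 131 and q_base + 135, past the legal end of q_base + 127. If each lane owns only 8 elements, [q_elem, q_elem + 7], then qwi in [2, 3] also proves lane-slot OOB for every lane, because those loads start at q_elem + 8 or later.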
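The same bounds can be cross-checked by brute force on the host. This is a minimal sketch that enumerates the ranges from the example; `find_violations` and the 8-element `LANE_SLOT` are assumptions made for illustration.

```python
HEAD_SIZE = 128
LANE_SLOT = 8   # assumed per-lane ownership, in elements
VEC_WIDTH = 4


def find_violations(q_base=0):
    """Enumerate every (lane16id, qwi) pair and record boundary violations."""
    head_oob, slot_oob = [], []
    for lane16id in range(16):
        q_elem = q_base + lane16id * 8
        for qwi in range(4):
            load_start = q_elem + qwi * 4
            load_end = load_start + VEC_WIDTH - 1
            if load_end > q_base + HEAD_SIZE - 1:
                head_oob.append((lane16id, qwi, load_end - q_base))
            if load_end > q_elem + LANE_SLOT - 1:
                slot_oob.append((lane16id, qwi))
    return head_oob, slot_oob


head_oob, slot_oob = find_violations()
print(head_oob)       # [(15, 2, 131), (15, 3, 135)]
print(len(slot_oob))  # 32: qwi = 2 and 3 for each of the 16 lanes
```

Enumeration is practical here because there are only 64 combinations. For larger index spaces the closed-form interval bound is the better tool.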

3. Add Runtime Logical Checks

When static proof is not enough or the formula depends on runtime values, add a temporary guard immediately before the load/store. In FlyDSL kernels, prefer a small printf with the failing coordinates:

load_start = q_elem + arith.constant(qwi * 4, type=T.i32)
load_end = load_start + arith.constant(vec_width - 1, type=T.i32)
legal_end = q_base + arith.constant(HEAD_SIZE - 1, type=T.i32)

if load_end > legal_end:
    fx.printf(
        "OOB q load: lane=%d qwi=%d q_base=%d load=[%d,%d] legal_end=%d\n",
        lane16id,
        arith.constant(qwi, type=T.i32),
        q_base,
        load_start,
        load_end,
        legal_end,
    )

For stores, include the output tensor/tile coordinates and the flattened offset. Keep debug prints narrow; too many lanes printing can hide the useful signal.
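For stores, the message shape and the cap on output can be prototyped on the host before the guard goes into the kernel. The sketch below is plain Python, not FlyDSL. `store_violations`, its parameters, and the row-major flattening `row * row_stride + col` are assumptions chosen for illustration.

```python
def store_violations(coords, row_stride, tile_rows, tile_cols, limit=4):
    """Format at most `limit` messages for stores outside a tile_rows x tile_cols tile."""
    messages = []
    for row, col in coords:
        if not (0 <= row < tile_rows and 0 <= col < tile_cols):
            flat = row * row_stride + col  # flattened element offset
            messages.append(
                f"OOB store: row={row} col={col} flat_offset={flat} "
                f"tile={tile_rows}x{tile_cols}"
            )
            if len(messages) >= limit:
                break  # keep the output narrow
    return messages


# A 5 x 9 grid of writers targeting a 4 x 8 tile with row_stride = 8.
coords = [(r, c) for r in range(5) for c in range(9)]
for line in store_violations(coords, row_stride=8, tile_rows=4, tile_cols=8, limit=2):
    print(line)
```

In this example the first offending store is (row 0, col 8), whose flattened offset of 8 lands on (row 1, col 0). The offset alone looks legal, which is why the coordinates are printed alongside it.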

4. Fix Strategy

Prefer fixing the invariant, not masking the fault:

If the loop trip count is wrong, reduce the loop or vector width.
If the per-lane ownership changed, update the layout, LDS store, and reader together.
If a boundary tile is partial, clamp or predicate before the load/store.
If a descriptor/resource range is too large or offset overflows i32, chunk the buffer resource or widen arithmetic before truncation.
After the fix, rerun both the focused failing test and one neighboring shape that exercises the boundary.
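Two of these fixes lend themselves to quick host-side arithmetic checks. The sketch below is illustrative only. `predicated_width`, `byte_offset`, and `wrap_i32` are hypothetical helpers, and whether offsets are counted in bytes or elements depends on the buffer operation in use.

```python
I32_MAX = 2**31 - 1


def predicated_width(start, vec_width, extent):
    """Number of elements of a vec_width-wide access at `start` that stay in [0, extent)."""
    return max(0, min(vec_width, extent - start))


def byte_offset(row, row_stride, col, elem_bytes):
    """Exact byte offset; Python integers never overflow."""
    return (row * row_stride + col) * elem_bytes


def wrap_i32(value):
    """Model truncation to a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


# Partial boundary tile: only 2 of 4 elements are legal at start = 126.
print(predicated_width(126, 4, 128))

# Row 40000 of a 2-byte tensor with a row stride of 32768 elements.
exact = byte_offset(40000, 32768, 0, 2)
print(exact > I32_MAX, wrap_i32(exact))
```

When the exact offset exceeds `I32_MAX`, the truncated value goes negative, which shows up as a lower-bound OOB rather than an upper-bound one.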

5. Report Format

When reporting an OOB investigation, include:

Access expression and element units
Proven or observed failing range
Legal range and which boundary was violated
Minimal fix and validation command
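One way to keep reports uniform is a small record type that renders these four items. This is a sketch, not a format the skill prescribes: `OobReport` and its field names are hypothetical, and the fix and validation command below are placeholders.

```python
from dataclasses import dataclass


@dataclass
class OobReport:
    access_expression: str
    element_units: str
    failing_range: str
    legal_range: str
    boundary: str
    fix: str
    validation_command: str

    def render(self) -> str:
        """Return the report as four lines, one per required item."""
        return "\n".join([
            f"Access: {self.access_expression} (units: {self.element_units})",
            f"Failing range: {self.failing_range}",
            f"Legal range: {self.legal_range} (violated: {self.boundary})",
            f"Fix: {self.fix}; validate with: {self.validation_command}",
        ])


report = OobReport(
    access_expression="q_base + lane16id * 8 + qwi * 4, vec_width = 4",
    element_units="elements",
    failing_range="[q_base + 128, q_base + 135]",
    legal_range="[q_base, q_base + 127]",
    boundary="logical object (head)",
    fix="<minimal fix>",
    validation_command="<focused test command>",
)
print(report.render())
```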


package.json

"author": "ROCm"

"repository": "ROCm/FlyDSL"


