| name | triton-ascend-optimization |
| description | General performance optimization strategies for Triton Ascend: BLOCK_SIZE selection (1024-2048 for elementwise, must be <65536), grid configuration (use VEC_CORE_NUM / CUBE_CORE_NUM, 2D/3D grid for matmul / conv / reduce, 1D grid + inner loop for elementwise / pointwise), 256B alignment for memory transfers, autotune block-size patterns, fp16 / fp32 precision conversion. Bind via keywords like matmul, elementwise, reduce, block_size, grid, autotune, alignment, fp16, fp32, tile, interleaved-loop, cube-core, vec-core. |
| category | guide |
| version | 1.0.0 |
| metadata | {"backend":"ascend","dsl":"triton_ascend","hardware":"Atlas A2, Atlas A3"} |
| structure | {"child_skills":["triton-ascend-memory","triton-ascend-grid-config","triton-ascend-debugging"]} |
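Triton Ascend Performance Optimization Guide
Optimization Strategy Checklist
Reduction Optimization
Each core first accumulates its partial result in a local scalar, then issues a single atomic write at the end: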
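core_sum = 0.0  # per-core scalar accumulator
# Each core walks blocks pid, pid + CORE_NUM, pid + 2 * CORE_NUM, ...
for block_start in range(pid, total_blocks, CORE_NUM):
    data = tl.load(...)
    core_sum += tl.sum(data, axis=0)  # reduce the block to a scalar
# One atomic write per core instead of one per block
tl.atomic_add(output_ptr, core_sum)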
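The block-to-core assignment in the loop above can be checked on the host with plain Python. This is only a sketch of the scheduling pattern, not device code: `simulate_core_partial_sums`, `block_size`, and `core_num` are hypothetical names, a Python list stands in for global memory, and the final `sum` stands in for the atomic adds.

```python
def simulate_core_partial_sums(data, block_size, core_num):
    """Emulate the strided loop: core `pid` handles blocks pid, pid + core_num, ..."""
    total_blocks = (len(data) + block_size - 1) // block_size  # ceiling division
    partials = []
    for pid in range(core_num):
        core_sum = 0.0  # per-core scalar accumulator
        for block in range(pid, total_blocks, core_num):
            start = block * block_size
            # Slicing clips the last, partially filled block automatically.
            core_sum += sum(data[start:start + block_size])
        partials.append(core_sum)
    return partials


data = [float(i) for i in range(10_000)]
partials = simulate_core_partial_sums(data, block_size=1024, core_num=4)
total = sum(partials)  # stands in for one atomic add per core
```

With these assumed sizes there are 10 blocks but only 4 final writes, which is the point of the pattern: contention on the output address scales with the number of cores rather than the number of blocks.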
Numerical Stability
Preventing overflow
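max_val = tl.max(scores, axis=0)
# Subtract the maximum so the largest exponent is 0 and exp2 cannot overflow
scores = scores - max_val
# exp2 is base 2: for a natural-base softmax, scores must already be scaled by log2(e)
p = tl.math.exp2(scores)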
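The effect of the max subtraction can be reproduced in plain Python. The sketch below is a host-side illustration, not a kernel; `stable_softmax_exp2` and `LOG2_E` are hypothetical names. The source does not say whether its scores are pre-scaled for base 2, so this sketch assumes a natural-base softmax is wanted and applies the `log2(e)` factor explicitly.

```python
import math

LOG2_E = math.log2(math.e)  # exp(x) == 2 ** (x * LOG2_E)


def stable_softmax_exp2(scores):
    """Softmax via base-2 exponentials, with the maximum subtracted first."""
    max_val = max(scores)
    # After the shift every exponent is <= 0, so each power lies in (0, 1].
    shifted = [(s - max_val) * LOG2_E for s in scores]
    p = [2.0 ** x for x in shifted]
    denom = sum(p)
    return [v / denom for v in p]


# math.exp(1000.0) would overflow, but the shifted form is well behaved.
probs = stable_softmax_exp2([1000.0, 1001.0, 1002.0])
```

Because softmax is invariant to adding a constant to every score, the result matches the softmax of `[0.0, 1.0, 2.0]` up to rounding.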
Preventing square roots of negative values
- Make sure the argument is non-negative before any sqrt:
max(input, 0.) or max(input, eps)
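The two clamps serve slightly different purposes, which a small host-side sketch can show. The helper names `safe_sqrt` and `safe_rsqrt` and the default `eps` are assumptions for illustration: clamping to zero suits a plain square root, while clamping to `eps` suits cases where the result is later used as a divisor.

```python
import math


def safe_sqrt(x):
    """Clamp to zero first so small negative rounding errors cannot raise a domain error."""
    return math.sqrt(max(x, 0.0))


def safe_rsqrt(x, eps=1e-6):
    """Clamp to eps first so the reciprocal square root stays finite."""
    return 1.0 / math.sqrt(max(x, eps))


# A variance computed as E[x^2] - E[x]^2 is mathematically 0 here,
# but floating-point rounding can leave a tiny positive or negative value.
values = [0.1] * 10
mean = sum(values) / len(values)
mean_sq = sum(v * v for v in values) / len(values)
variance = mean_sq - mean * mean
std = safe_sqrt(variance)
inv_std = safe_rsqrt(variance)
```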
Improving precision
- Use an fp32 accumulator for matmul:
acc = tl.zeros([M, N], dtype=tl.float32)
- Convert back to the target precision only at the end:
result = acc.to(tl.float16)
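Why the accumulator should stay in fp32 until the end can be demonstrated without any hardware. The sketch below emulates half and single precision with the standard `struct` module (format codes `"e"` and `"f"`); `round_to` and `accumulate` are hypothetical helpers, and the scalar sum stands in for one element of a matmul accumulator.

```python
import struct


def round_to(fmt, x):
    """Round a Python float to IEEE half ('e') or single ('f') precision."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def accumulate(values, acc_fmt):
    """Sum `values`, rounding the running total to `acc_fmt` after every add."""
    acc = 0.0
    for v in values:
        acc = round_to(acc_fmt, acc + v)
    return acc


ones = [1.0] * 4096
# Accumulator kept in fp16: stalls once the spacing between values exceeds 1.
fp16_acc = accumulate(ones, "e")
# Accumulator kept in fp32, converted to fp16 only at the end.
fp32_acc = round_to("e", accumulate(ones, "f"))
```

In this emulation the fp16 accumulator stops growing at 2048, because 2049 is not representable in half precision and the tie rounds back to 2048, while the fp32 accumulator reaches 4096 and converts to fp16 exactly.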