
pto-isa-matmul-l2-schedule

Updated May 21, 2026 at 01:48

PTO-DSL matmul L2-reuse scheduler for Ascend A2/A3: persistent-block GEMM with N-group swizzle along the inner M walk and M-direction zigzag at N-group boundaries. Captures the tile-id math, the CANN platform_config-driven swizzleCountN budget (with the 32 MiB safety-ratio cliff), the DN-B layout note, the runtime wiring, and the verification path against torch_npu. Use when tuning a matmul-shaped kernel that profiles as L2-bound, porting the swizzle/zigzag schedule to a new persistent-block kernel, choosing swizzleCountN for a new SoC, or deciding between the manual SPMD-static baseline and this persistent + swizzle schedule. Scoped to one schedule recipe; add a separate skill for other PTO-ISA performance patterns (vector reduce, flash-attention scheduling, etc.).


Source

hw-native-sys

hw-native-sys/pto-isa


PTO-ISA Matmul L2-Reuse Schedule (N-Group Swizzle + M-Direction Zigzag)

A single, focused recipe: how to schedule a matmul-shaped kernel on Ascend A2/A3 so the B panel stays hot in L2 across multiple M rows, and the boundary A tile is reused at every N-group switch. The recipe is implemented in the in-tree PTO-DSL GEMM under kernels/python/gemm/; everything you need to apply or port it lives in this skill — the source tree is only there if you want to see it wired end-to-end.

Quick Start

1. Read references/matmul-n-swizzle-m-zigzag.md for the full recipe (when to use, tile-id math, L2 budget, constraints, anti-patterns).
2. Reproduce the canonical numbers on Ascend910B2:
   ```
   cd ${repo}/kernels/python/gemm
   python3 run.py --case a2a3_perf_6144 --torch-npu --benchmark
   ```
   The runner builds a shape-specialized kernel per case on demand and then runs it.
3. To target a new SoC: drop the CANN <soc>.ini into the platform_config/ directory the runner resolves (via ASCEND_HOME_PATH / ASCEND_TOOLKIT_HOME), then run run.py --soc <name>. cube_core_cnt and l2_size feed swizzleCountN and blockDim directly, so no code change is required.
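As a rough illustration of the SoC step above, the two keys named there can be pulled out of an .ini file with the Python standard library. This is a minimal sketch, not the runner's actual parser: the function name `read_soc_params`, the section name `SoCInfo`, and the sample values are assumptions made for the example, and the code deliberately searches every section rather than assuming where CANN places the keys.

```python
import configparser


def read_soc_params(ini_text: str) -> dict:
    """Return cube_core_cnt and l2_size found in any section of an .ini text.

    Hypothetical helper: the section layout of a real CANN <soc>.ini is not
    assumed here, so every section is scanned for the two keys.
    """
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(ini_text)

    wanted = ("cube_core_cnt", "l2_size")
    found = {}
    for section in parser.sections():
        for key in wanted:
            if key in parser[section] and key not in found:
                found[key] = int(parser[section][key])

    missing = sorted(set(wanted) - set(found))
    if missing:
        raise KeyError(f"missing keys in SoC config: {missing}")
    return found


# Illustrative content only; the section name and numbers are made up.
SAMPLE_INI = """
[SoCInfo]
cube_core_cnt=24
l2_size=201326592
"""

params = read_soc_params(SAMPLE_INI)
print(params["cube_core_cnt"], params["l2_size"])
```

Failing loudly on a missing key keeps a misconfigured platform_config/ directory from silently producing a wrong blockDim or swizzleCountN.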

Core Workflow

1. Diagnose: confirm the kernel is L2-bound (TLOAD dominates while B traffic is large and reusable). For the generic profiling loop see docs/coding/performance-best-practices.md.
2. Apply the recipe: persistent block + N-group swizzle + M-direction zigzag. See the reference.
3. Wire the runtime: derive blockDim and swizzleCountN from CANN platform_config/<soc>.ini. Validate against the base-tile grid before building the kernel.
4. Verify: check correctness vs torch.matmul, then benchmark the ratio vs torch_npu ABt.
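To make the "N-group swizzle + M-direction zigzag" step concrete, here is one way the linear tile id could map to (M, N) tile coordinates. It is a sketch of the idea as described in this skill, not the formula from references/matmul-n-swizzle-m-zigzag.md: the names `tile_coords` and `tiles_for_block` are hypothetical, and the strided assignment of tile ids to persistent blocks is an assumption.

```python
def tile_coords(tile_id, m_tiles, n_tiles, swizzle_count_n):
    """Map a linear tile id to (m_idx, n_idx).

    Tiles are grouped into N-groups of `swizzle_count_n` columns. Inside a
    group the walk goes across the group's N tiles, then steps along M, so the
    same B panel is revisited for every M row. Odd groups walk M in reverse
    (zigzag), so the M row at a group boundary is shared by both groups.
    """
    if swizzle_count_n < 1:
        raise ValueError("swizzle_count_n must be >= 1")
    if not 0 <= tile_id < m_tiles * n_tiles:
        raise ValueError("tile_id out of range")

    full_group = m_tiles * swizzle_count_n          # tiles in a full N-group
    group, in_group = divmod(tile_id, full_group)
    n_start = group * swizzle_count_n
    width = min(swizzle_count_n, n_tiles - n_start)  # last group may be narrower
    m_step, n_off = divmod(in_group, width)
    m_idx = m_step if group % 2 == 0 else m_tiles - 1 - m_step
    return m_idx, n_start + n_off


def tiles_for_block(block_idx, block_dim, m_tiles, n_tiles, swizzle_count_n):
    """Assumed persistent-block walk: block b takes tile ids b, b+block_dim, ..."""
    total = m_tiles * n_tiles
    return [tile_coords(t, m_tiles, n_tiles, swizzle_count_n)
            for t in range(block_idx, total, block_dim)]


# 3 x 4 tile grid, N-groups of width 2.
order = [tile_coords(t, 3, 4, 2) for t in range(12)]
print(order)
```

In the printed order the first group ends on M row 2 and the second group starts on M row 2, which is the boundary A-tile reuse the recipe is after.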

Scope

This skill is deliberately narrow — it covers only the matmul L2-reuse schedule. Anything outside that scope belongs in a different skill:

- Other PTO-ISA performance patterns (vector reduce, flash-attention scheduling, K-split GEMM, etc.) → add a sibling skill; do not extend this one.
- Generic PTO optimization concepts (instruction selection, pipeline overlap, profiling methodology) → covered by docs/coding/opt.md and docs/coding/performance-best-practices.md.
- Build / constraints / debugging / review guardrails → covered by ../pto-isa-dev/SKILL.md.

Working Rules

- Validate the schedule's constraints on the host before building the kernel; surface a clear error rather than letting it mis-schedule.
- Keep swizzleCountN derived from l2_size rather than hard-coded, because the same kernel must run on multiple Ascend SoCs.
- When porting the tile-id math to a non-GEMM persistent-block kernel, only rename the two outer axes; the formula structure is shape-agnostic.
- The recipe is meant to stand on its own: the formulas, constraints, and verification path in the reference should suffice without bouncing back to source code. Adding a helper script under this skill (e.g. a swizzleCountN calculator that parses a CANN <soc>.ini, or a small benchmark sweeper) is fine if it makes applying the recipe materially easier; do not add scripts that just duplicate what kernels/python/gemm/run.py already does.
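The first two rules can be illustrated with a small host-side check. Everything below is a stand-in written for this example: the real budget formula and constraint list live in references/matmul-n-swizzle-m-zigzag.md, and the divisibility rule, the "B panel must fit in a fraction of L2" model, the `safety_ratio` default, and the function names are all assumptions.

```python
def validate_grid(m, n, base_m, base_n, block_dim):
    """Fail early if the problem does not sit on the base-tile grid.

    Assumed constraints for illustration; check the reference for the real ones.
    """
    if m % base_m or n % base_n:
        raise ValueError(
            f"shape {m}x{n} is not a multiple of base tile {base_m}x{base_n}")
    if block_dim < 1:
        raise ValueError("block_dim must be >= 1")
    return m // base_m, n // base_n  # (m_tiles, n_tiles)


def pick_swizzle_count_n(l2_size, k, base_n, dtype_bytes, n_tiles,
                         safety_ratio=0.5):
    """Derive an N-group width from l2_size instead of hard-coding it.

    Stand-in model: one base_n-wide column panel of B costs k * base_n *
    dtype_bytes, and only `safety_ratio` of L2 is budgeted for B panels.
    """
    panel_bytes = k * base_n * dtype_bytes
    budget = int(l2_size * safety_ratio)
    count = budget // panel_bytes
    if count < 1:
        raise ValueError("a single B panel does not fit in the L2 budget")
    return min(count, n_tiles)


# Illustrative numbers only (not measured on any SoC).
m_tiles, n_tiles = validate_grid(6144, 6144, 128, 256, block_dim=24)
swizzle = pick_swizzle_count_n(l2_size=192 * 1024 * 1024, k=6144, base_n=256,
                               dtype_bytes=2, n_tiles=n_tiles,
                               safety_ratio=0.25)
print(m_tiles, n_tiles, swizzle)
```

Raising on the host keeps an invalid shape or an over-budget panel from reaching kernel build, which matches the "clear error rather than mis-schedule" rule.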

Where the Reference Implementation Lives

kernels/python/gemm/ — the PTO-DSL GEMM kernel: the schedule, the buffer pyramid, the K-panel pipeline, the CANN platform_config wiring, the per-case build flow, and the torch_npu benchmark harness, all wired together.

kernels/manual/a2a3/gemm_performance/ — the simpler manual SPMD-static baseline this recipe replaces when L2 reuse is the bottleneck.
