pto-isa-flash-atten-a3-pipeline

Stars: 58
Forks: 38
Updated: May 25, 2026 at 11:01

PTO-DSL Flash Attention four-stage cross-core software pipeline for Ascend A3: compute_qk (Cube) -> compute_p (Vec) -> compute_pv (Cube) -> compute_gu (Vec), staged through a GM software FIFO.

Captures the steady-state rhythm (cube-side per-tile emit_qk_pv interleaving, vec-side "drain GU then produce P"), the QK_PRELOAD / EXP_RING / S1_TILE knobs and their invariants, the UB 192 KiB budget with the row_slice working-tile shrink, the empirical S1 >= 16384 -> S1_TILE = 512 recommendation, and the op-pattern PIPE_V barrier removal recipe.

Use when tuning the in-tree DSL Flash Attention, porting the four-stage pipeline to a new persistent-block kernel that mixes cube + vec stages through a GM FIFO, choosing QK_PRELOAD / S1_TILE for a new shape mix, or deciding when a PIPE_V barrier in generated C++ is safe to drop.

Scoped to A3 non-causal prefill with HEAD=128, S0=128, CUBE_S1=128. Other Flash Attention flavors (causal mask, GQA/MQA, KV-cache decode, A5 NZ/NZ+1 layout) belong in sibling skills.
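The four stage names in the description map naturally onto the standard tiled (online-softmax) Flash Attention recurrence. The sketch below is a plain-Python reference for the non-causal case, not the PTO-DSL kernel. It runs the stages one after another on the host, with no Cube/Vec split, GM FIFO, QK_PRELOAD or EXP_RING. Reading compute_gu as the running rescale-and-accumulate update is an assumption. The `s1_tile` argument, the helper names and `naive_attention` are hypothetical, and `s1_tile` only mimics walking S1 in tiles.

```python
import math

def matmul(a, b):
    """Return a @ b for row-major lists of lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def compute_qk(q, k_tile, scale):
    # Stage 1 (Cube in the skill): scaled score tile S = Q K^T.
    k_t = [list(col) for col in zip(*k_tile)]
    return [[s * scale for s in row] for row in matmul(q, k_t)]

def compute_p(s_tile, m_prev):
    # Stage 2 (Vec): update the running row max, then P = exp(S - m_new).
    m_new = [max(mp, max(row)) for mp, row in zip(m_prev, s_tile)]
    p = [[math.exp(s - m) for s in row] for row, m in zip(s_tile, m_new)]
    l_tile = [sum(row) for row in p]
    return p, m_new, l_tile

def compute_pv(p, v_tile):
    # Stage 3 (Cube): partial output for this tile, P V.
    return matmul(p, v_tile)

def compute_gu(o, l, m_prev, m_new, l_tile, pv):
    # Stage 4 (Vec), assumed meaning: rescale the old accumulators by
    # exp(m_prev - m_new) and add this tile's contribution.
    alpha = [math.exp(mp - mn) for mp, mn in zip(m_prev, m_new)]
    o = [[a * x + y for x, y in zip(orow, prow)]
         for a, orow, prow in zip(alpha, o, pv)]
    l = [a * li + lt for a, li, lt in zip(alpha, l, l_tile)]
    return o, l

def flash_attention(q, k, v, s1_tile):
    scale = 1.0 / math.sqrt(len(q[0]))
    rows, head = len(q), len(v[0])
    o = [[0.0] * head for _ in range(rows)]
    l = [0.0] * rows
    m = [float("-inf")] * rows
    for start in range(0, len(k), s1_tile):
        s = compute_qk(q, k[start:start + s1_tile], scale)
        p, m_new, l_tile = compute_p(s, m)
        pv = compute_pv(p, v[start:start + s1_tile])
        o, l = compute_gu(o, l, m, m_new, l_tile, pv)
        m = m_new
    # Final normalization by the accumulated softmax denominator.
    return [[x / li for x in row] for row, li in zip(o, l)]

def naive_attention(q, k, v):
    # Untiled softmax(Q K^T / sqrt(d)) V, used only as a cross-check.
    s = compute_qk(q, k, 1.0 / math.sqrt(len(q[0])))
    weights = []
    for row in s:
        mx = max(row)
        e = [math.exp(x - mx) for x in row]
        z = sum(e)
        weights.append([x / z for x in e])
    return matmul(weights, v)
```

A host-side reference like this is mainly useful as a numerical oracle: changing the tile width should not change the result beyond floating-point rounding, which makes it a cheap check when experimenting with tile-size choices.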

Installation

Install with Codex or Claude: copy this prompt, paste it into Codex, Claude, or another assistant, and let it review the skill page and install it for you.

File explorer
4 files
SKILL.md