pytorch2flydsl-translation
// Use when translating PyTorch GPU kernels to FlyDSL. Provides API reference, translation guides, and strategy for mapping PyTorch ops to FlyDSL equivalents.
| name | pytorch2flydsl-translation |
| description | Use when translating PyTorch GPU kernels to FlyDSL. Provides API reference, translation guides, and strategy for mapping PyTorch ops to FlyDSL equivalents. |
This skill provides knowledge and strategy for translating PyTorch GPU kernels to FlyDSL, a domain-specific language for AMD GPU kernel programming.
- compile_preshuffle_gemm_a8() from kernels.preshuffle_gemm.
  - CRITICAL: The B matrix must be preshuffled with shuffle_weight(B.contiguous(), layout=(16, 16)) from tests.utils.
  - All tensor args must be .view(-1).
  - Scales: torch.empty(0, device=dev, dtype=torch.float32) for fp16.
- build_flash_attn_func_module() from kernels.flash_attn_func when head_dim>=64, head_dim%32==0, seq_len%128==0.
  - NEVER decompose attention into separate GEMM+softmax+GEMM calls when flash attention fits; the decomposed form is 5-10x slower.
  - NEVER use Python for-loops over batch*heads to call GEMM one at a time.
  - Builder: build_flash_attn_func_module(num_heads=H, head_dim=D, causal=True, dtype_str="f16").
  - Launcher: fn(Q.view(-1), K.view(-1), V.view(-1), O.view(-1), batch_size, seq_len, stream=stream).
  - Note: num_heads is baked in at build time, NOT passed at launch time.
- build_softmax_module(M, N, dtype_str): call as fn(input, output, M, stream=stream)
- build_layernorm_module(M, N, dtype_str): call as fn(input, gamma, beta, output, M, stream=stream)
- build_rmsnorm_module(M, N, dtype_str): call as fn(input, gamma, output, M, stream=stream)
- torch.nn.functional (ONLY ops with no FlyDSL equivalent)

CRITICAL: Do NOT use torch.matmul, F.linear, nn.Linear, or F.scaled_dot_product_attention. These ALL have FlyDSL pre-built replacements. PyTorch fallback is ONLY for Conv2d, MaxPool2d, BatchNorm2d.
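The selection rules above can be written down as a small planning helper. This is a minimal sketch in plain Python that does not import FlyDSL or PyTorch. The helper names (`flash_attn_fits`, `choose_backend`) and the string keys are hypothetical and chosen only for illustration. The values are the builder names quoted in this skill. Routing torch.matmul, F.linear, and nn.Linear to the preshuffle GEMM builder, and F.scaled_dot_product_attention to the flash attention builder, is an assumption based on the doc titles listed below.

```python
# Ops the skill names as the only permitted PyTorch fallbacks.
PYTORCH_FALLBACK_OPS = {"Conv2d", "MaxPool2d", "BatchNorm2d"}

# PyTorch entry points the skill forbids, mapped to the pre-built FlyDSL
# builder assumed to replace them (key names are illustrative only).
FLYDSL_REPLACEMENTS = {
    "torch.matmul": "compile_preshuffle_gemm_a8",
    "F.linear": "compile_preshuffle_gemm_a8",
    "nn.Linear": "compile_preshuffle_gemm_a8",
    "F.scaled_dot_product_attention": "build_flash_attn_func_module",
}


def flash_attn_fits(head_dim: int, seq_len: int) -> bool:
    """Return True when the shape meets the stated flash attention conditions."""
    return head_dim >= 64 and head_dim % 32 == 0 and seq_len % 128 == 0


def choose_backend(op: str) -> str:
    """Return the FlyDSL builder name for an op, or "torch" for allowed fallbacks."""
    if op in PYTORCH_FALLBACK_OPS:
        return "torch"
    try:
        return FLYDSL_REPLACEMENTS[op]
    except KeyError:
        # The skill does not say how to route other ops, so fail loudly.
        raise ValueError(f"no routing rule recorded for {op!r}") from None


if __name__ == "__main__":
    print(flash_attn_fits(head_dim=128, seq_len=4096))
    print(choose_backend("nn.Linear"))
    print(choose_backend("Conv2d"))
```

The skill does not state what to do when an attention shape fails the check, so the sketch leaves that decision to the caller instead of guessing a fallback.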
The docs/ subdirectory contains detailed API references and translation guides:
- flydsl_translation_api_reference.md: FlyDSL compiler API, expression types, kernel patterns
- flydsl_translation_guide.md: PyTorch op mapping, structural patterns, common pitfalls
- flydsl_translation_gemm.md: GEMM/Linear translation with preshuffle_gemm
- flydsl_translation_attention.md: Attention/SDPA translation with flash_attn
- flydsl_translation_reductions.md: Reduction ops (sum, mean, softmax, layernorm)