
pytorch2flydsl-translation

Name: Pytorch2flydsl Translation
Author: AMD-AGI

// Use when translating PyTorch GPU kernels to FlyDSL. Provides API reference, translation guides, and strategy for mapping PyTorch ops to FlyDSL equivalents.



stars:102

forks:26

updated:April 22, 2026 at 14:20


name	pytorch2flydsl-translation
description	Use when translating PyTorch GPU kernels to FlyDSL. Provides API reference, translation guides, and strategy for mapping PyTorch ops to FlyDSL equivalents.

PyTorch to FlyDSL Translation Skill

This skill provides knowledge and strategy for translating PyTorch GPU kernels to FlyDSL, a domain-specific language for AMD GPU kernel programming.

Translation Strategy (in order of preference)

1. GEMM / Linear: Use compile_preshuffle_gemm_a8() from kernels.preshuffle_gemm. CRITICAL: the B matrix must be preshuffled with shuffle_weight(B.contiguous(), layout=(16, 16)) from tests.utils. All tensor arguments must be flattened with .view(-1). Scales: for fp16, pass torch.empty(0, device=dev, dtype=torch.float32).
2. Attention / SDPA: ALWAYS use build_flash_attn_func_module() from kernels.flash_attn_func when head_dim >= 64, head_dim % 32 == 0, and seq_len % 128 == 0. NEVER decompose attention into separate GEMM + softmax + GEMM calls when flash attention fits; the decomposed form is 5-10x slower. NEVER use Python for-loops over batch*heads to call GEMM one at a time. Builder: build_flash_attn_func_module(num_heads=H, head_dim=D, causal=True, dtype_str="f16"). Launcher: fn(Q.view(-1), K.view(-1), V.view(-1), O.view(-1), batch_size, seq_len, stream=stream). Note: num_heads is baked in at build time and is NOT passed at launch time.
3. Softmax: build_softmax_module(M, N, dtype_str), called as fn(input, output, M, stream=stream)
4. LayerNorm: build_layernorm_module(M, N, dtype_str), called as fn(input, gamma, beta, output, M, stream=stream)
5. RMSNorm: build_rmsnorm_module(M, N, dtype_str), called as fn(input, gamma, output, M, stream=stream)
6. Element-wise ops (relu, sigmoid, tanh, clamp, etc.): Write a custom @flyc.kernel with layout algebra
7. Reductions (sum, mean): Write a manual block reduction with wave shuffle
8. Conv/Pool/BatchNorm: Use torch.nn.functional (these are the ONLY ops with no FlyDSL equivalent)
9. Complex models: Use FlyDSL for ALL ops except conv/pool/batchnorm
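
The shape conditions for the flash-attention path can be checked before building a module. The sketch below is illustrative only: flash_attn_fits and pick_attention_path are hypothetical helper names, not part of FlyDSL, and the "decomposed" label for the fallback is an assumption about what to do when the conditions fail.

```python
def flash_attn_fits(head_dim: int, seq_len: int) -> bool:
    """Return True when the shape meets the three conditions listed above
    for build_flash_attn_func_module()."""
    return head_dim >= 64 and head_dim % 32 == 0 and seq_len % 128 == 0


def pick_attention_path(head_dim: int, seq_len: int) -> str:
    """Hypothetical dispatcher: prefer flash attention whenever it fits.

    The "decomposed" fallback (separate GEMM + softmax + GEMM) is an
    assumption for shapes that fail the check; the skill only forbids
    decomposition when flash attention fits.
    """
    if flash_attn_fits(head_dim, seq_len):
        return "flash_attn"
    return "decomposed"


if __name__ == "__main__":
    # A few sample shapes as (head_dim, seq_len) pairs.
    for shape in [(128, 1024), (64, 128), (80, 256), (64, 100)]:
        print(shape, pick_attention_path(*shape))
```

Running the check once per model configuration, rather than per call, matches the note that parameters such as num_heads are fixed at build time.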

CRITICAL: Do NOT use torch.matmul, F.linear, nn.Linear, or F.scaled_dot_product_attention. These ALL have FlyDSL pre-built replacements. PyTorch fallback is ONLY for Conv2d, MaxPool2d, BatchNorm2d.
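
One way to enforce this rule on translated code is a small static check. The following is a minimal sketch using Python's standard ast module; find_forbidden_calls and dotted_name are hypothetical helpers, not part of FlyDSL or this skill. It only matches the literal spellings listed above, so aliases such as torch.nn.functional.linear or the @ operator would not be caught.

```python
import ast

# Call names the rule above forbids, spelled exactly as listed there.
FORBIDDEN = {
    "torch.matmul",
    "F.linear",
    "nn.Linear",
    "F.scaled_dot_product_attention",
}


def dotted_name(node):
    """Rebuild a dotted name such as 'torch.matmul' from an AST node,
    or return None if the node is not a plain attribute chain."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_forbidden_calls(source):
    """Return sorted (line, name) pairs for each forbidden call in source."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in FORBIDDEN:
                hits.append((node.lineno, name))
    return sorted(hits)


if __name__ == "__main__":
    sample = "y = torch.matmul(a, b)\nz = F.linear(x, w)\nok = torch.relu(x)\n"
    for lineno, name in find_forbidden_calls(sample):
        print(f"line {lineno}: replace {name} with a FlyDSL kernel")
```

The source is only parsed, never executed, so the check works on machines without PyTorch or FlyDSL installed.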

Reference Documentation

The docs/ subdirectory contains detailed API references and translation guides:

flydsl_translation_api_reference.md — FlyDSL compiler API, expression types, kernel patterns
flydsl_translation_guide.md — PyTorch op mapping, structural patterns, common pitfalls
flydsl_translation_gemm.md — GEMM/Linear translation with preshuffle_gemm
flydsl_translation_attention.md — Attention/SDPA translation with flash_attn
flydsl_translation_reductions.md — Reduction ops (sum, mean, softmax, layernorm)

related-skills.json

same repository

fp8-gemm-tuning-sglang-aiter.md

from "AMD-AGI/GEAK"

Use when trying to optimize end-to-end SGLang performance with GEMM tuning for FP8 models on AMD HIP/ROCm by replacing the default Triton GEMM backend with a tuned Composable Kernel (CK) path through aiter. This skill is the verified playbook for that entire process, using FP8 block-wise GEMM (gemm_a8w8_blockscale) as the primary worked example: GEMM shape/dispatch logging in SGLang, CK composable-kernel tuning, and AITER_CONFIG_GEMM_A8W8_BLOCKSCALE CSV integration. The same process should also apply to FP8 blockscale and bpreshuffle by switching the place where GEMM shapes are dumped and the CK tool used for tuning.

2026-05-18, stars: 102

triton.md

from "AMD-AGI/GEAK"

Use when generating a fixed test harness for a Triton (@triton.jit) GPU kernel under the v3 GEAK preprocess pipeline. Covers harness CLI contract, Triton-specific entry-point detection, three-tier shape lists, --iterations argparse rule, and the GPU-RNG-pollution pitfall that rocprofv3 punishes.

2026-05-17, stars: 102

hip.md

from "AMD-AGI/GEAK"

Use when generating a fixed test harness for a HIP / CUDA / CK / HSACO GPU kernel under the v3 GEAK preprocess pipeline. Covers harness CLI contract, the three HIP build shapes (pybind11, standalone make, raw hipcc), the COMMANDMENT wrapper-script rule, --iterations argparse, and the GPU-RNG-pollution pitfall that rocprofv3 punishes.

2026-05-17, stars: 102

flydsl.md

from "AMD-AGI/GEAK"

Use when working with FlyDSL kernels (`@flyc.kernel` / `flydsl.compiler`) on AMD GPUs. Covers three complementary workflows: writing new tile-programmed kernels, optimizing existing kernels for performance, and debugging correctness issues (NaN, wrong results, compilation errors, hangs).

2026-05-11, stars: 102

package.json

"author": "AMD-AGI"

"repository": "AMD-AGI/GEAK"


