| name | tilelang-ascend-kernel |
| description | Use when the user wants to add, modify, debug, or review an xLLM TileLang Ascend kernel or specialization, including Python kernel definitions, generated Ascend-C source, runtime wrapper dispatch, TileLang CMake wiring, and NPU tests. |
TileLang Ascend Kernel
When to use
Use this skill when the task involves any of the following in the xLLM repo:
xllm/compiler/tilelang/targets/ascend/kernels/*.py
xllm/core/kernels/npu/tilelang/*_wrapper.cpp
xllm/core/kernels/npu/tilelang/CMakeLists.txt
- generated TileLang artifacts such as
manifest.json, registry.inc, or specialization .cpp
Run build and test commands inside the NPU container.
Run TileLang commands from the xLLM repo root, not from an installed-package environment.
Entry points and TL_ROOT
- Prefer
python xllm/compiler/tilelang_launcher.py ... for end-to-end TileLang compile flows.
- From the xLLM repo root, use
export TL_ROOT=$PWD/third_party/tilelang-ascend for xLLM TileLang tooling and verify test -f "$TL_ROOT/tilelang/__init__.py".
- Before any raw script does
import tilelang, run export TL_ROOT=$PWD/third_party/tilelang-ascend && source third_party/tilelang-ascend/set_env.sh, then execute the script.
- Do not run kernel files directly with
python rope.py; use module execution because these files rely on relative imports.
- For direct kernel-script debugging, run them as modules and pass required CLI args:
cd xllm
python -m compiler.tilelang.targets.ascend.kernels.rope \
--output ../.tmp/rope.cpp
- The same module-style rule applies to other kernel files under
xllm/compiler/tilelang/targets/ascend/kernels/.
Primary Reference And Mode Preference
Primary reference:
third_party/tilelang-ascend/docs/TileLang-Ascend Programming Guide.md
Use third_party/tilelang-ascend/.agents/skills/tilelang-custom-skill/tilelang-api-best-practices/references/api-tile-ops.md when the task depends on T.tile.xxx semantics such as compare, select, cast, or other vector intrinsics.
Default to Expert mode for xLLM Ascend kernels:
- prefer
T.tile.xxx, explicit UB/shared allocation, and explicit T.copy
- prefer explicit
T.serial control for row/block traversal
- do not introduce Developer mode
T.Parallel unless the kernel is a clearly tile-local element-wise expression and the change does not reduce control over UB usage, temporary buffers, or exact runtime semantics
- when translating Triton kernels, preserve the Triton runtime semantics first, then choose the smallest Expert-mode lowering that matches them
Common Triton To TileLang-Ascend Semantics
Use this table as the quick semantic mapping when translating Triton kernels:
| Triton pattern | TileLang-Ascend pattern | Notes |
|---|
x + y, x - y, x * y, x / y | T.tile.add/sub/mul/div | Prefer tile ops in Expert-style vector code instead of hand-written scalar loops. |
tl.exp(x), tl.log(x), tl.abs(x) | T.tile.exp, T.tile.ln, T.tile.abs | TileLang uses ln, not log. |
x.to(tl.float32) or tl.cast(...) | T.tile.cast(dst, src, "CAST_NONE", count) | Pick a non-default cast mode only when rounding semantics are required. |
x <= y, x < y, x >= y, x == y | T.tile.compare(mask, x, y, "LE"/"LT"/"GE"/"EQ") | T.tile.compare produces a bit mask, not a float tensor. |
tl.where(cond, a, b) | T.tile.select(dst, selMask, a, b, selMode) | API-level match. If cond is a comparison expression such as x <= y, materialize selMask with T.tile.compare(...) first; use the matching VSEL_* mode for tensor-tensor or tensor-scalar selection. |
tl.full(shape, value, dtype) | allocate buffer + T.tile.fill(dst, value) | Separate allocation from initialization. |
tl.arange(0, N) | T.tile.createvecindex(dst, 0) or explicit loop indices | Prefer createvecindex only when the kernel truly needs a vector index tensor. |
Rules for semantic-preserving lowering:
- Preserve Triton control-flow, masking, and parameter semantics. Do not substitute a numerically similar formula unless the runtime-visible behavior is unchanged for the supported input domain.
- Keep the kernel ABI aligned with the lowering. Every runtime parameter must either participate in the TileLang implementation or be removed from the interface.
- Add targeted tests for branch, mask, and boundary behavior. Do not rely only on random inputs if some paths are hit only under specific values.
- Choose the correct
VSEL_* mode based on the source operands. VSEL_CMPMASK_SPR is the natural match for a mask produced by T.tile.compare; VSEL_TENSOR_SCALAR_MODE and VSEL_TENSOR_TENSOR_MODE are for explicit tensor/scalar or tensor/tensor selection modes.
New kernel
Follow this order:
- Implement
build_<kernel>_kernel(...)
- Implement
generate_source(...)
- Declare
DISPATCH_SCHEMA and SPECIALIZATIONS
- Run TileLang compilation once and inspect
registry.inc
- Add or update
<kernel>_wrapper.cpp
- Register the kernel in
xllm/core/kernels/npu/tilelang/CMakeLists.txt with:
tilelang_register_runtime_kernel(NAME <kernel> WRAPPER_SRCS <srcs...>)
For wrapper work:
- do kernel precision alignment on the Python side first (
build_<kernel>_kernel(...) / generate_source(...)), not in the C++ wrapper
- handwrite tensor checks, layout transforms, and
build_runtime_specialization(...)
- use generated
make_<kernel>_specialization(...) and find_<kernel>_kernel_entry(...)
- do not handwrite kernel-specific specialization structs or kernel fn typedefs
- never use
permute, contiguous, transpose, reshape (when non-trivial), clone, or any operation that triggers a device memory copy in the wrapper. The wrapper must only pass pointers to existing tensors. If the kernel needs a different layout, handle it inside the kernel itself or require the caller to provide the correct layout.
New specialization
Use this path when the kernel logic and wrapper ABI stay the same.
- Update the existing kernel's
SPECIALIZATIONS
- Confirm every runtime dispatch field still matches
DISPATCH_SCHEMA
- Re-run TileLang compilation
- Check that
registry.inc contains the new entry
- Check that the wrapper's
build_runtime_specialization(...) still constructs matching values
Debug generated Ascend-C
When the task is to inspect codegen or compare two kernel implementations, use an isolated output root:
python xllm/compiler/tilelang_launcher.py compile-kernels \
--target ascend \
--device a3 \
--output-root .tmp/tilelang_debug \
--kernels <kernel> \
--force
Then inspect:
.tmp/tilelang_debug/targets/ascend/<kernel>/<variant_key>/<kernel>_<variant_key>_kernel.cpp
.tmp/tilelang_debug/targets/ascend/<kernel>/registry.inc
.tmp/tilelang_debug/targets/ascend/<kernel>/manifest.json
Use this path before changing wrapper code when you need to understand generated symbols, field order, or ABI.
Validate
Prefer the narrowest command first:
python -m py_compile xllm/compiler/tilelang/targets/ascend/kernels/<kernel>.py
cd xllm && python -m compiler.tilelang.targets.ascend.kernels.<kernel> --output ../.tmp/<kernel>.cpp
python xllm/compiler/tilelang_launcher.py prepare-ascend
python setup.py test --test-name <wrapper_test_target> --device npu
References
Read docs/en/dev_guide/tilelang_ascend_kernel_dev.md for mechanism details.
Use rope as the concrete template:
xllm/compiler/tilelang/targets/ascend/kernels/rope.py
xllm/core/kernels/npu/tilelang/rope_wrapper.cpp
xllm/core/kernels/npu/tilelang/CMakeLists.txt