xe2-dpas-patterns

Name: Xe2 Dpas Patterns
Author: ModelTC

Use this skill when writing, loading operands for, or storing results from XMX DPAS instructions on Intel Xe2 (Lunar Lake/LNL, Battlemage/BMG) GPUs using SYCL ESIMD. Xe2 is the GPU architecture; LNL and BMG are product names. Covers all four DPAS operand load patterns (lsc_load_2d, lsc_gather, VNNI packing), scatter/store write-back, Usage 1 vs Usage 2 orientation, and the SOA property of lsc_gather. Applicable to any kernel using DPAS: GEMM, attention, convolution, etc.

updated: April 17, 2026, 07:14



Repository: ModelTC/LightX2V

Xe2 DPAS Patterns Skill

Reference for loading and storing DPAS operands on Intel Xe2 (Lunar Lake/LNL, Battlemage/BMG) via SYCL ESIMD. All patterns are validated in assets/fp16_dpas_ult.cpp (4 test cases, all PASS).

DPAS Register Contract (FP16, XE2)

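// xmx::dpas<SD, RC, Tacc, Tc, Tb, Ta>(acc, b_tile, a_tile) returns the updated accumulator
// Template order is SystolicDepth (SD) first, then RepeatCount (RC); both are 8 here.
// XE2: SD=8 fixed (each systolic step consumes 32 bits), RC=8 (8 output rows per call)
// FP16: SD=8 systolic steps × 2 fp16/step = 16 K-elements per call
acc = xmx::dpas<8, 8, sycl::half, sycl::half, sycl::half, sycl::half>(acc, b_tile, a_tile);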

| Register | Type | Size | Layout | Role |
|---|---|---|---|---|
| `a_tile` | `simd<half, 8*16>` | 128 half | `[m*16+k]` row-major | 8 M-rows × 16 K-cols |
| `b_tile` | `simd<half, 16*16>` | 256 half | `uint32[k_pair*16+n]` VNNI | 16 K-elems × 16 N-cols |
| `acc` | `simd<half, 8*16>` | 128 half | `[m*16+n]` row-major | 8 M-rows × 16 N-cols |

Result: acc[m*16+n] += sum_k a_tile[m*16+k] * b_tile_fp16(n,k)

Arg order: dpas(acc, b_tile, a_tile) — b_tile is 2nd, a_tile is 3rd. Easy to mix up.
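
To make the contract concrete, here is a minimal host-side reference sketch of the accumulation formula above. It is plain C++ rather than ESIMD: `int16_t` stands in for fp16 so the arithmetic is exact, and the names `dpas_ref`, `b_elem`, `ATile`, `BTile`, and `Acc` are hypothetical. It assumes the even-k element of each VNNI pair sits in the low 16 bits of the 32-bit word.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Tile shape of one FP16 DPAS call as described above (hypothetical names).
constexpr int kM = 8, kN = 16, kK = 16;

using ATile = std::array<int16_t, kM * kK>;         // a_tile[m*16 + k], row-major
using BTile = std::array<uint32_t, (kK / 2) * kN>;  // b_tile uint32[k_pair*16 + n], VNNI
using Acc   = std::array<int32_t, kM * kN>;         // acc[m*16 + n], row-major

// Extract element (n, k) from a VNNI-packed b_tile.
// Assumption: even k is in the low 16 bits, odd k in the high 16 bits.
inline int16_t b_elem(const BTile& b, int n, int k) {
    assert(n >= 0 && n < kN && k >= 0 && k < kK);
    const uint32_t word = b[(k / 2) * kN + n];
    return static_cast<int16_t>((k & 1) ? (word >> 16) : (word & 0xFFFFu));
}

// Reference for: acc[m*16+n] += sum_k a_tile[m*16+k] * b_tile(n, k).
// The argument order mirrors the intrinsic: (acc, b_tile, a_tile).
inline Acc dpas_ref(Acc acc, const BTile& b_tile, const ATile& a_tile) {
    for (int m = 0; m < kM; ++m)
        for (int n = 0; n < kN; ++n)
            for (int k = 0; k < kK; ++k)
                acc[m * kN + n] += a_tile[m * kK + k] * b_elem(b_tile, n, k);
    return acc;
}
```

A reference like this can be compared against GPU output on small integer-valued inputs, where fp16 rounding does not interfere.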

Two Usage Orientations

Usage 1 — Standard (M=8, N=16, K=16)

A[M×K] → a_tile     B[N×K] or B_T[K×N] → b_tile
dpas(acc, b_tile, a_tile) → acc[m*16+n] = C[m,n]   (direct store)

Usage 2 — Swapped (M=16, N=8, K=16)

Swap roles: A→b_tile (gather VNNI), B→a_tile (load_2d or gather). Useful when N < 16 at the DPAS call level.

A[M×K] → b_tile (gather VNNI, 16 M-lanes)
B[N×K] → a_tile (load_2d or gather)
dpas(acc, b_tile=A_vnni, a_tile=B) → acc[n*16+m] = C[m,n]  (TRANSPOSED — needs scatter write-back)
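
The transposed result of Usage 2 follows directly from the DPAS formula. The host-side sketch below illustrates this under simplifying assumptions: `int16_t` replaces fp16, the b operand is kept unpacked (no VNNI), and `dpas_shape`, `usage2`, and `c_from_usage2_acc` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Generic DPAS shape: 8 a-rows (r), 16 b-columns (c), 16 K-elements.
//   acc[r*16 + c] += sum_k a_rows[r*16 + k] * b_cols[c*16 + k]
// b_cols is left unpacked here to keep the sketch short.
inline std::array<int32_t, 8 * 16> dpas_shape(
    const std::array<int16_t, 8 * 16>& a_rows,
    const std::array<int16_t, 16 * 16>& b_cols) {
    std::array<int32_t, 8 * 16> acc{};
    for (int r = 0; r < 8; ++r)
        for (int c = 0; c < 16; ++c)
            for (int k = 0; k < 16; ++k)
                acc[r * 16 + c] += a_rows[r * 16 + k] * b_cols[c * 16 + k];
    return acc;
}

// Usage 2: A[16 x 16] (M=16 rows) feeds the b operand and
// B[8 x 16] (N=8 rows) feeds the a operand.
inline std::array<int32_t, 8 * 16> usage2(
    const std::array<int16_t, 16 * 16>& A,
    const std::array<int16_t, 8 * 16>& B) {
    return dpas_shape(/*a_rows=*/B, /*b_cols=*/A);
}

// Read C[m, n] out of the Usage 2 accumulator: acc[n*16 + m].
inline int32_t c_from_usage2_acc(const std::array<int32_t, 8 * 16>& acc, int m, int n) {
    assert(m >= 0 && m < 16 && n >= 0 && n < 8);
    return acc[n * 16 + m];
}
```

Because the accumulator index is `n*16 + m`, a row-major 2D store would write C transposed, which is why Usage 2 pairs with a scatter write-back.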

VNNI Layout (b_tile)

b_tile uint32[k_pair * 16 + n_col] = packed { fp16[n_col, k_pair*2], fp16[n_col, k_pair*2+1] }
  k_pair = 0..7  (outer, stride 16)
  n_col  = 0..15 (inner, stride 1)
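
As an illustration of the index formula above, the following host-side sketch packs a 16×16 tile from row-major B[N×K] into the VNNI layout. The function name `pack_vnni` and its parameters (`ldb`, `n_base`, `k_base`) are hypothetical, `uint16_t` stands in for the raw fp16 bits, and the sketch assumes the even-k element occupies the low 16 bits of each word.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack a 16 (N) x 16 (K) tile of 16-bit elements from row-major B[N x K]
// into uint32[k_pair*16 + n_col]. `ldb` is the row pitch of B in elements.
// Assumption: even k goes to the low 16 bits, odd k to the high 16 bits.
inline std::array<uint32_t, 8 * 16> pack_vnni(const std::vector<uint16_t>& B,
                                              int ldb, int n_base, int k_base) {
    assert(ldb >= k_base + 16);
    assert(B.size() >= static_cast<std::size_t>(n_base + 16) * ldb);
    std::array<uint32_t, 8 * 16> out{};
    for (int k_pair = 0; k_pair < 8; ++k_pair) {        // outer, stride 16
        for (int n_col = 0; n_col < 16; ++n_col) {      // inner, stride 1
            const std::size_t row =
                static_cast<std::size_t>(n_base + n_col) * ldb + k_base;
            const uint32_t lo = B[row + 2 * k_pair];
            const uint32_t hi = B[row + 2 * k_pair + 1];
            out[k_pair * 16 + n_col] = lo | (hi << 16);
        }
    }
    return out;
}
```

On the GPU this repack is never done explicitly; the sketch only serves as a layout oracle for checking what a load pattern produced.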

lsc_gather SOA Property (key insight)

lsc_gather<T, NElts, DS, L1H, L2H, N>(ptr, offsets) returns simd<T, N*NElts> in SOA layout:

result[elem * N + lane]  =  T at  ptr[ byte_offset[lane] + elem * sizeof(T) ]

This SOA property is what makes gather produce DPAS-ready layouts without register repack:

For b_tile (B[N,K]): use N rows as lanes → result[k_pair*16+n] = VNNI
For a_tile (B_T[K,N]): use K rows as lanes with fp16/u16 type → result[e*16+k] = N-outer a_tile
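
The SOA rule can be emulated on the host to check both claims above. This is a sketch, not the ESIMD implementation: `gather_soa`, `gather_b_tile`, and `gather_a_tile` are hypothetical names, `uint16_t` holds raw fp16 bits, and the VNNI comparison assumes a little-endian host.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Host emulation of the SOA result layout:
//   result[elem * N + lane] = T at base + byte_offset[lane] + elem * sizeof(T)
template <typename T, int NElts, int N>
std::array<T, N * NElts> gather_soa(const void* base,
                                    const std::array<uint32_t, N>& byte_offset) {
    std::array<T, N * NElts> result{};
    const auto* bytes = static_cast<const unsigned char*>(base);
    for (int lane = 0; lane < N; ++lane)
        for (int elem = 0; elem < NElts; ++elem)
            std::memcpy(&result[elem * N + lane],
                        bytes + byte_offset[lane] + elem * sizeof(T), sizeof(T));
    return result;
}

// b_tile from B[N x K]: 16 N-rows as lanes, 8 uint32 k-pairs per lane.
inline std::array<uint32_t, 8 * 16> gather_b_tile(const uint16_t* B, int K) {
    assert(K >= 16);
    std::array<uint32_t, 16> off{};
    for (int n = 0; n < 16; ++n) off[n] = static_cast<uint32_t>(n) * K * 2u;
    return gather_soa<uint32_t, 8, 16>(B, off);  // result[k_pair*16 + n]
}

// a_tile from B_T[K x N]: 16 K-rows as lanes, 8 16-bit elements per lane.
inline std::array<uint16_t, 8 * 16> gather_a_tile(const uint16_t* B_T, int N) {
    assert(N >= 8);
    std::array<uint32_t, 16> off{};
    for (int k = 0; k < 16; ++k) off[k] = static_cast<uint32_t>(k) * N * 2u;
    return gather_soa<uint16_t, 8, 16>(B_T, off);  // result[n*16 + k]
}
```

The only difference between the two helpers is the element type and which matrix axis is mapped to lanes, which is the point of the SOA property.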

Four Load Patterns

Pattern 1: a_tile from A[M×K] — lsc_load_2d, no transform

// Surface width and pitch are in bytes minus 1; surface height is in rows minus 1.
xesimd::config_2d_mem_access<sycl::half, 16/*BW=K*/, 8/*BH=M*/, 1> payA(
    A, K*2u-1u, M-1u, K*2u-1u, 0u, 0u);
simd<sycl::half, 8*16> a_tile = xesimd::lsc_load_2d<
    sycl::half, 16, 8, 1, false/*T*/, false/*VNNI*/,
    xesimd::cache_hint::cached, xesimd::cache_hint::cached>(payA);
// a_tile[m*16+k] = A[m, k]  ✓
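
The `K*2u-1u, M-1u, K*2u-1u` arguments follow the 2D surface convention of width and pitch in bytes minus one and height in rows minus one. A small host-side helper, sketched here with the hypothetical names `Surface2D` and `surface_params` and assuming a densely packed matrix (pitch equal to width), makes that arithmetic explicit.

```cpp
#include <cassert>
#include <cstdint>

// The three surface descriptor values passed to the 2D payload (hypothetical struct).
struct Surface2D {
    uint32_t width_m1;   // surface width in bytes, minus 1
    uint32_t height_m1;  // surface height in rows, minus 1
    uint32_t pitch_m1;   // surface pitch in bytes, minus 1
};

// Descriptor values for a rows x cols matrix of elem_bytes-sized elements.
// Assumption: rows are densely packed, so pitch equals width.
inline Surface2D surface_params(uint32_t rows, uint32_t cols, uint32_t elem_bytes) {
    assert(rows > 0 && cols > 0 && elem_bytes > 0);
    const uint32_t width_bytes = cols * elem_bytes;
    return Surface2D{width_bytes - 1u, rows - 1u, width_bytes - 1u};
}
```

For A[M×K] in fp16, `surface_params(M, K, 2)` reproduces the three values used in Pattern 1.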

Pattern 2: b_tile from B_T[K×N] — lsc_load_2d VNNI=true

xesimd::config_2d_mem_access<sycl::half, 16/*BW=N*/, 16/*BH=K*/, 1> payB(
    B_T, N*2u-1u, K-1u, N*2u-1u, 0u, 0u);
simd<sycl::half, 16*16> b_tile = xesimd::lsc_load_2d<
    sycl::half, 16, 16, 1, false, true/*VNNI*/,
    xesimd::cache_hint::cached, xesimd::cache_hint::cached>(payB);
// b_tile uint32[k_pair*16+n] = {B_T[k_pair*2,n], B_T[k_pair*2+1,n]}  VNNI ✓

Pattern 3: b_tile from B[N×K] — lsc_gather u32/NElts=8, N-lanes

// N=16 lanes (one per output N-column), NElts=8 k-pairs per lane (= 16 fp16 = one K-tile)
// K below is the row length of B in elements.
const uint32_t* B_u32 = reinterpret_cast<const uint32_t*>(B);
simd<uint32_t, 16> b_off;
for (int n = 0; n < 16; n++)
    b_off[n] = (uint32_t)(n_base + n) * K * 2u;  // byte offset to B[n_base+n, 0]; add k_start*2u to start at column k_start
simd<sycl::half, 16*16> b_tile;
b_tile.template bit_cast_view<uint32_t>() =
    xesimd::lsc_gather<uint32_t, 8,
        xesimd::lsc_data_size::u32,
        xesimd::cache_hint::cached, xesimd::cache_hint::cached,
        16, uint32_t>(B_u32, b_off);
// SOA: result_u32[k_pair*16+n] = {B[n_base+n, k_pair*2], B[n_base+n, k_pair*2+1]}  VNNI ✓

Pattern 4: a_tile from B_T[K×N] — lsc_gather fp16/u16, K-lanes

lsc_load_2d with transpose=true is rejected for fp16 at compile time, so use an fp16 gather instead:

// K=16 lanes (one per K-row of B_T), NElts=8 fp16 per lane (= one full row = N elements, N=8)
// SOA: result[e*16+k] = B_T[k][e] = B[e,k]  → matches a_tile[n*16+k]=B[n,k] directly ✓
// WARNING: do NOT use lsc_gather<uint32_t,4> — uint32 packs {B[n_u32*2,k], B[n_u32*2+1,k]}
//   (different N rows, same k) which does NOT match a_tile uint32 layout (same N, adjacent k).
simd<uint32_t, 16> a_off;
for (int k = 0; k < 16; k++)  // one offset per lane: bound by the 16 lanes, not by the full K
    a_off[k] = (uint32_t)k * (uint32_t)N * 2u;  // byte offset to B_T[k, 0]
simd<sycl::half, 8*16> a_tile =
    xesimd::lsc_gather<sycl::half, 8,
        xesimd::lsc_data_size::u16,
        xesimd::cache_hint::cached, xesimd::cache_hint::cached,
        16, uint32_t>(B_T, a_off);
// a_tile[n*16+k] = B[n, k]  ✓  (no bit_cast_view, no register repack)
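
The warning about the uint32 gather can be checked with a few lines of host code. The sketch below compares the 32-bit word a uint32 gather would fetch from a row of B_T with the word that a_tile needs at the same position when viewed as uint32. The helper names are hypothetical, `uint16_t` holds raw fp16 bits, and the first element of each pair is placed in the low 16 bits (little-endian assumption).

```cpp
#include <cassert>
#include <cstdint>

// Word e of lane k from a uint32 gather over row k of B_T[K x N]:
// two different N columns (2e and 2e+1) at the same k.
inline uint32_t u32_gather_word(const uint16_t* B_T, int N, int k, int e) {
    assert(N % 2 == 0 && 2 * e + 1 < N);
    return static_cast<uint32_t>(B_T[k * N + 2 * e]) |
           (static_cast<uint32_t>(B_T[k * N + 2 * e + 1]) << 16);
}

// Word q of row n of a_tile viewed as uint32 (index n*8 + q):
// the same n at two adjacent k values (2q and 2q+1).
inline uint32_t a_tile_word(const uint16_t* B_T, int N, int n, int q) {
    assert(n < N);
    return static_cast<uint32_t>(B_T[(2 * q) * N + n]) |
           (static_cast<uint32_t>(B_T[(2 * q + 1) * N + n]) << 16);
}
```

For any B_T whose entries differ, the two words disagree, so a uint32 gather would need a register shuffle that the fp16 gather avoids.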

Store / Write-back Patterns

Usage 1 write-back — lsc_store_2d (acc is direct C[m,n])

xesimd::lsc_store_2d<sycl::half, 16, 8,
    xesimd::cache_hint::write_back, xesimd::cache_hint::write_back>(
    C, N*2u-1u, M-1u, N*2u-1u, (uint32_t)n_start, (uint32_t)m_start, acc);

Usage 2 write-back — lsc_scatter fp16/u16 (acc is transposed C[n,m])

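acc[ni*16+mj] = C[mj, ni]. Scatter SOA: data[e*16+lane] → ptr[offset[lane] + e*sizeof(half)]. Set offset[mj] to the byte offset of row mj of C, that is mj * N * sizeof(half), where N is the row pitch of C in elements (N_total in the code, which also adds the tile origin m_base, n_base). Then data = acc works directly and no repack is needed.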

// SOA: data[e*16+mj] → C[mj, e]
// data[e*16+mj] = C[mj,e] = acc[e*16+mj]  → data = acc directly ✓
simd<uint32_t, 16> sc_off;
for (int mj = 0; mj < 16; mj++)
    sc_off[mj] = (uint32_t)(m_base + mj) * N_total * 2u + (uint32_t)n_base * 2u;
xesimd::lsc_scatter<sycl::half, 8,
    xesimd::lsc_data_size::u16,
    xesimd::cache_hint::write_back, xesimd::cache_hint::write_back,
    16, uint32_t>(C, sc_off, acc);
// acc passed as data directly — no register repack ✓
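
The scatter direction of the SOA rule can be emulated on the host in the same way. In this sketch `scatter_soa` and `store_usage2` are hypothetical names, `uint16_t` holds raw fp16 bits, and `ldc` plays the role of `N_total` (row pitch of C in elements).

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Host emulation of the SOA scatter rule:
//   data[elem * N + lane] is written to base + byte_offset[lane] + elem * sizeof(T)
template <typename T, int NElts, int N>
void scatter_soa(void* base, const std::array<uint32_t, N>& byte_offset,
                 const std::array<T, N * NElts>& data) {
    auto* bytes = static_cast<unsigned char*>(base);
    for (int lane = 0; lane < N; ++lane)
        for (int elem = 0; elem < NElts; ++elem)
            std::memcpy(bytes + byte_offset[lane] + elem * sizeof(T),
                        &data[elem * N + lane], sizeof(T));
}

// Usage 2 write-back: acc[ni*16 + mj] holds C[m_base + mj, n_base + ni].
// One lane per M-row, 8 consecutive N-elements written per lane.
inline void store_usage2(uint16_t* C, int ldc, int m_base, int n_base,
                         const std::array<uint16_t, 8 * 16>& acc) {
    assert(ldc >= n_base + 8);
    std::array<uint32_t, 16> off{};
    for (int mj = 0; mj < 16; ++mj)
        off[mj] = (static_cast<uint32_t>(m_base + mj) * ldc +
                   static_cast<uint32_t>(n_base)) * 2u;
    scatter_soa<uint16_t, 8, 16>(C, off, acc);
}
```

Passing `acc` unchanged as the scatter data is what lands each value at C[m, n]; no intermediate transpose buffer appears anywhere in the sketch.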

Summary: No-Shuffle Rules

| Operation | Method | Shuffle needed? |
|---|---|---|
| a_tile ← A[M,K] | `lsc_load_2d` no-transform | none |
| b_tile ← B_T[K,N] | `lsc_load_2d` VNNI=true | none (hardware) |
| b_tile ← B[N,K] | `lsc_gather<u32,8,u32,N=16>` N-lanes | none (SOA=VNNI) |
| a_tile ← B_T[K,N] | `lsc_gather<half,8,u16,N=16>` K-lanes | none (SOA=N-outer) |
| store C (Usage 1) | `lsc_store_2d` | none |
| store C (Usage 2) | `lsc_scatter<half,8,u16,N=16>` M-lanes | none (SOA matches acc) |

All six operations are shuffle-free when using the correct gather/scatter type and lane axis.

Asset

| File | Purpose |
|---|---|
| `assets/fp16_dpas_ult.cpp` | Unit test for all 4 load patterns + scatter write-back. All cases PASS. |

Compile and run:

icpx fp16_dpas_ult.cpp -o fp16_dpas_ult.exe \
  -fsycl -fsycl-targets=spir64_gen \
  -Xs "-device bmg -options -doubleGRF"
powershell.exe -Command "& './fp16_dpas_ult.exe'"
