intel-gpu-kernel-opt

Name: Intel Gpu Kernel Opt
Author: ModelTC

// General Intel GPU kernel optimization methodology. Use this skill when profiling or optimizing any ESIMD or SYCL kernel on Intel GPUs, performing roofline analysis, diagnosing bottlenecks (register spill, SLM bank conflicts, barrier overhead, memory coalescing), comparing Xe2 vs Xe3 hardware, or planning an optimization workflow. Covers VTune and GTPin profiling, key metrics (TFLOPS, GB/s, peak %), hardware comparison (Xe2: LNL/BMG vs Xe3: PTL/PTLH), and optimization patterns (prefetch, load/compute separation, loop unrolling, SIMD width selection). Xe2 is the architecture for Lunar Lake (LNL) and Battlemage (BMG); Xe3 is the architecture for Panther Lake (PTL) and Panther Lake-H (PTLH). Trigger for any Intel GPU performance question.

updated:April 17, 2026 at 07:14

package.json

"author": "ModelTC"

"repository": "ModelTC/LightX2V"

Intel GPU Kernel Optimization Methodology

Systematic approach to profiling, diagnosing, and optimizing GPU kernels on Intel Xe2 (Lunar Lake/LNL, Battlemage/BMG) and Xe3 (Panther Lake/PTL, Panther Lake-H/PTLH) architectures. Covers roofline analysis, bottleneck identification, profiling tools, hardware comparison, and proven optimization patterns.

Version: 1.0.0
Last Updated: 2026-03-12

Table of Contents

Optimization Workflow
Roofline Analysis
Key Performance Metrics
Hardware Comparison: Xe2 vs Xe3
Common Bottlenecks
Profiling Tools
Optimization Patterns
Optimization Priority Checklist
Case Study: SDP HD=256 Optimization Journey

Optimization Workflow

The fundamental cycle for GPU kernel optimization:

Profile --> Identify Bottleneck --> Optimize --> Verify --> Repeat

Step-by-Step

1. Baseline: Establish a correct, reproducible benchmark with warmup iterations
2. Profile: Capture hardware metrics (VTune GPU Compute Hotspots or GTPin)
3. Identify bottleneck: Classify as compute-bound, memory-bound, or latency-bound
4. Hypothesize: Form a specific hypothesis about what limits performance
5. Optimize: Apply a single targeted optimization
6. Verify: Measure both correctness AND performance
7. Record: Document the result (positive or negative) for future reference
8. Repeat: Return to step 2

Critical rule: Apply ONE optimization at a time. Multiple simultaneous changes make it impossible to attribute performance differences.

Roofline Analysis

Classification

The roofline model classifies a kernel as compute-bound or memory-bound by comparing its arithmetic intensity with the hardware ridge point:

Arithmetic Intensity (AI) = FLOPs / Bytes_transferred
Ridge Point = Peak_TFLOPS / Peak_BW_TB/s

If AI > Ridge Point: compute-bound (optimize ALU utilization)
If AI < Ridge Point: memory-bound (optimize memory access)
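The classification rule above can be sketched as a pair of small helpers. This is an illustration only: the function names `ridge_point` and `classify` are hypothetical, and bandwidth is taken in GB/s and converted to TB/s to match the formula.

```cpp
#include <cassert>
#include <string>

// Ridge point in FLOPs/byte: peak compute (TFLOPS) / peak bandwidth (TB/s).
inline double ridge_point(double peak_tflops, double peak_bw_gbs) {
    assert(peak_tflops > 0.0 && peak_bw_gbs > 0.0);
    return peak_tflops / (peak_bw_gbs / 1000.0);  // GB/s -> TB/s
}

// Compare a kernel's arithmetic intensity with the ridge point.
inline std::string classify(double flops, double bytes,
                            double peak_tflops, double peak_bw_gbs) {
    assert(bytes > 0.0);
    const double ai = flops / bytes;  // FLOPs per byte transferred
    return ai > ridge_point(peak_tflops, peak_bw_gbs) ? "compute-bound"
                                                       : "memory-bound";
}
```

With the BMG figures quoted in this document (135 TFLOPS, 520 GB/s), `ridge_point(135.0, 520.0)` evaluates to about 259.6 FLOPs/byte, so a kernel at 300 FLOPs/byte is classified as compute-bound and one at 100 FLOPs/byte as memory-bound.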

Xe2 (BMG) Roofline

Peak FP16 XMX:  135 TFLOPS
Peak BW:        520 GB/s = 0.52 TB/s
Ridge Point:    135 / 0.52 = ~260 FLOPs/byte

Most attention/GEMM kernels with HEAD_DIM >= 128 are compute-bound on BMG.

Xe3 (PTL) Roofline

Peak FP16 XMX:  TBD (depends on SKU, ~30-50 TFLOPS estimated)
Peak BW:        112 GB/s = 0.112 TB/s
Ridge Point:    ~270-450 FLOPs/byte (estimated: 30 / 0.112 to 50 / 0.112)

PTL's lower bandwidth makes more workloads memory-bound compared to BMG.

Practical Roofline Calculation

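// GEMM: C[M,N] = A[M,K] x B[K,N]
// Promote to double before multiplying so large M, N, K cannot overflow int.
double flops = 2.0 * M * N * K;
double elems = double(M) * K + double(K) * N + double(M) * N;
double bytes = elems * sizeof(half);  // read A, B; write C (2-byte FP16 elements)
double ai = flops / bytes;

// Flash Attention: 4*q*kv*d*heads (QK and VS GEMMs) + 2*q*kv*heads (softmax)
double attn_flops = 4.0 * q_len * kv_len * head_dim * num_heads
                  + 2.0 * q_len * kv_len * num_heads;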
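To turn an arithmetic intensity into a performance ceiling, the roofline bound is min(peak compute, AI x peak bandwidth). The sketch below assumes FP16 operands and that each matrix crosses the memory bus exactly once (no cache reuse is modeled); the helper names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Arithmetic intensity of C[M,N] = A[M,K] x B[K,N] with 2-byte elements,
// assuming A and B are read once and C is written once.
inline double gemm_ai_fp16(double M, double N, double K) {
    assert(M > 0 && N > 0 && K > 0);
    const double flops = 2.0 * M * N * K;
    const double bytes = (M * K + K * N + M * N) * 2.0;
    return flops / bytes;
}

// Roofline ceiling in TFLOPS. AI [FLOP/byte] * BW [GB/s] gives GFLOPS,
// so divide by 1000 to compare against the compute peak in TFLOPS.
inline double attainable_tflops(double ai, double peak_tflops, double peak_bw_gbs) {
    return std::min(peak_tflops, ai * peak_bw_gbs / 1000.0);
}
```

For a square 4096 x 4096 x 4096 GEMM the intensity is 4096/3, roughly 1365 FLOPs/byte, so the ceiling with the BMG figures is the 135 TFLOPS compute peak. With M = 1 (a matrix-vector product) the intensity drops to just under 1 FLOP/byte and the ceiling becomes about 0.52 TFLOPS, which is entirely bandwidth-limited.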

Key Performance Metrics

Metric	Formula	Target
TFLOPS	FLOPs / (time_ms * 1e9)	Higher is better
GB/s	Bytes / (time_ms * 1e6)	Compare to peak BW
% of Peak Compute	Measured TFLOPS / Peak TFLOPS * 100	> 60% is good
% of Peak BW	Measured GB/s / Peak GB/s * 100	> 70% is good
XMX Utilization	(from VTune) XMX busy cycles / total cycles	> 50% for GEMM
Register Spill	(from compiler output or VTune)	0 is ideal
Occupancy	Active threads / Max threads	Higher generally better
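The first four rows of the table can be computed directly from a measured time. The following is a minimal sketch; the struct and function names are hypothetical, and the peak values are passed in by the caller rather than hard-coded.

```cpp
#include <cassert>

struct KernelMetrics {
    double tflops;            // FLOPs / (time_ms * 1e9)
    double gbs;               // Bytes / (time_ms * 1e6)
    double pct_peak_compute;  // measured TFLOPS as a percentage of peak
    double pct_peak_bw;       // measured GB/s as a percentage of peak
};

inline KernelMetrics compute_metrics(double flops, double bytes, double time_ms,
                                     double peak_tflops, double peak_bw_gbs) {
    assert(time_ms > 0.0 && peak_tflops > 0.0 && peak_bw_gbs > 0.0);
    KernelMetrics m{};
    m.tflops = flops / (time_ms * 1e9);
    m.gbs = bytes / (time_ms * 1e6);
    m.pct_peak_compute = m.tflops / peak_tflops * 100.0;
    m.pct_peak_bw = m.gbs / peak_bw_gbs * 100.0;
    return m;
}
```

As a sanity check on units: a kernel that performs 8.8e10 FLOPs and moves 2.6e8 bytes in 1 ms reports 88 TFLOPS and 260 GB/s, which is about 65% of compute peak and 50% of bandwidth peak against the BMG figures used in this document.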

Benchmarking Best Practices

// Standard benchmark pattern
constexpr int WARMUP = 5;
constexpr int ITERATIONS = 100;

// Warmup (let GPU clock stabilize, populate caches)
for (int i = 0; i < WARMUP; i++) {
    kernel.execute();
    q.wait();
}

// Timed iterations
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < ITERATIONS; i++) {
    kernel.execute();
    q.wait();
}
auto end = std::chrono::high_resolution_clock::now();
double avg_ms = std::chrono::duration<double, std::milli>(end - start).count() / ITERATIONS;
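The same pattern can be wrapped in a reusable helper. This is a sketch rather than the skill's own harness: `benchmark_ms` is a hypothetical name, and the callable is assumed to block until the work has finished (for a SYCL kernel, submit and wait on the queue inside the callable).

```cpp
#include <cassert>
#include <chrono>

// Average wall-clock milliseconds per call of `run`, measured after
// `warmup` untimed calls. `run` must not return before its work is done.
template <typename F>
double benchmark_ms(F&& run, int warmup = 5, int iterations = 100) {
    assert(warmup >= 0 && iterations > 0);
    for (int i = 0; i < warmup; ++i) run();  // stabilize clocks, warm caches

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) run();
    const auto end = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli>(end - start).count() / iterations;
}
```

The sketch uses `std::chrono::steady_clock` because it is guaranteed to be monotonic, whereas `high_resolution_clock` may alias a clock that can be adjusted. Note that host-side timing with a wait per iteration also includes submission and synchronization overhead, which matters for very short kernels.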

Hardware Comparison: Xe2 vs Xe3

Parameter	Xe2 (BMG / Battlemage figures; LNL / Lunar Lake is the integrated Xe2 part)	Xe3 (PTL / Panther Lake, PTLH / Panther Lake-H)
GPU Type	Discrete (dGPU)	Integrated (iGPU)
XE Cores	32	12
Memory BW	~520 GB/s (GDDR6)	~112 GB/s (LPDDR5x shared)
Max Threads	2048 per XE core	TBD
SLM per XE Core	64 KB	64 KB
GRF Mode	doubleGRF (256 regs/thread)	TBD
FP16 XMX Peak	~135 TFLOPS	TBD (~30-50 TFLOPS est.)
DPAS Systolic Depth	8	TBD
L3 Cache	~8-16 MB	TBD
Power	~150-225W TDP	~15-30W (shared package)

Architecture Implications

BMG (Xe2):

Compute-rich: 32 XE cores, high TFLOPS ceiling
High bandwidth: 520 GB/s dedicated GDDR6
Optimization focus: maximize XMX utilization, minimize register spill
doubleGRF essential for large tile sizes (256 GRF entries per thread)

PTL (Xe3):

Bandwidth-constrained: 112 GB/s shared with CPU
Fewer cores: 12 XE cores, lower absolute throughput
Optimization focus: minimize memory traffic, maximize data reuse
Shared memory: CPU activity impacts GPU bandwidth

Common Bottlenecks

1. Register Spill

Symptom: Performance far below compute roofline despite high arithmetic intensity.

Diagnosis: Build with -save-temps, check compiler output for spill/fill counts. In VTune, look for high "Send" cycle counts from register spill loads/stores.

Solutions:

Reduce tile sizes (e.g., M=16 to M=8)
Use doubleGRF (-Xs "-options -doubleGRF")
Reduce live register ranges (reorder computations)
Split large kernels

2. SLM Bank Conflicts

Symptom: SLM access latency higher than expected, visible in VTune shared memory stalls.

Diagnosis: Check SLM access patterns for stride conflicts. 32 banks, 4 bytes/bank on Intel GPUs.

Solutions:

Pad SLM rows to avoid stride conflicts
Ensure consecutive threads access consecutive banks
Rearrange data layout in SLM
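A quick way to reason about padding is to count how many lanes map to the same bank for a given stride. The sketch below is a simplified model that assumes the 32-bank, 4-bytes-per-bank layout quoted above and one 4-byte word per lane; it ignores any hardware broadcast of identical addresses. The function name is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Worst-case number of lanes that hit the same SLM bank when lane i reads
// the 4-byte word at index i * stride_words.
inline int max_bank_conflict(int lanes, int stride_words) {
    constexpr std::size_t kBanks = 32;  // assumption taken from the text above
    assert(lanes > 0 && stride_words >= 0);
    std::array<int, kBanks> hits{};
    for (int lane = 0; lane < lanes; ++lane) {
        const std::size_t word = static_cast<std::size_t>(lane) *
                                 static_cast<std::size_t>(stride_words);
        ++hits[word % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under this model a column read from a tile with 32-word rows (`max_bank_conflict(32, 32)`) puts all 32 lanes on one bank, while padding each row by one word (`max_bank_conflict(32, 33)`) spreads them across 32 distinct banks.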

3. Barrier Overhead

Symptom: Many barriers per iteration, visible as synchronization stalls in profiler.

Diagnosis: Count barriers per kernel iteration. Each barrier has 50-100 cycle overhead.

Solutions:

Minimize barrier count (combine phases)
Use named barriers for subset synchronization
Use split arrive/wait to overlap independent work
Eliminate redundant barriers, but confirm the barrier actually stalls first (e.g., double-buffering SLM may only remove a barrier that was never blocking, which yields no gain)
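Before restructuring a kernel around barriers, it can help to estimate how much of an iteration they could account for. The sketch below applies the 50-100 cycle figure from the diagnosis above; the per-iteration cycle count is something you would take from a profile, and the function name is hypothetical.

```cpp
#include <cassert>

// Fraction of one loop iteration spent in barriers, for a given
// per-barrier cost and a measured per-iteration cycle count.
inline double barrier_fraction(int barriers_per_iter, double cycles_per_barrier,
                               double cycles_per_iter) {
    assert(barriers_per_iter >= 0 && cycles_per_iter > 0.0);
    return barriers_per_iter * cycles_per_barrier / cycles_per_iter;
}
```

For example, 4 barriers in a 2000-cycle iteration bound the overhead between 10% (50 cycles each) and 20% (100 cycles each). If the estimate is only a few percent, barrier work is unlikely to be the best next optimization.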

4. Memory Coalescing Issues

Symptom: Bandwidth utilization far below peak.

Diagnosis: Check access patterns. Consecutive threads should access consecutive addresses.

Solutions:

Ensure contiguous block_load patterns
Use lsc_load_2d for 2D tiled access (hardware handles coalescing)
Avoid scatter/gather when block operations suffice
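One way to check an access pattern on paper is to count the distinct cache lines a single vector access touches. The sketch below is a simplified model: the 64-byte line size is an assumption of this example rather than a figure from this document, addresses start at offset 0, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Number of distinct cache lines touched when lane i reads elem_bytes bytes
// at byte offset i * stride_elems * elem_bytes.
inline std::size_t lines_touched(int lanes, std::size_t elem_bytes,
                                 std::size_t stride_elems,
                                 std::size_t line_bytes = 64) {
    assert(lanes > 0 && elem_bytes > 0 && line_bytes > 0);
    std::set<std::size_t> lines;
    for (int lane = 0; lane < lanes; ++lane) {
        const std::size_t first = static_cast<std::size_t>(lane) * stride_elems * elem_bytes;
        const std::size_t last = first + elem_bytes - 1;
        for (std::size_t l = first / line_bytes; l <= last / line_bytes; ++l) {
            lines.insert(l);
        }
    }
    return lines.size();
}
```

Sixteen lanes reading contiguous 2-byte elements touch a single line, whereas the same sixteen lanes at a stride of 32 elements touch sixteen lines to fetch the same 32 useful bytes. That ratio is a rough indicator of how far effective bandwidth can fall below peak.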

5. Low DPAS/XMX Utilization

Symptom: Compute-bound kernel with low XMX busy percentage.

Diagnosis: In ISA, check for consecutive DPAS instructions vs intervening sends/movs. VTune XMX busy metric.

Solutions:

Increase DPAS density (more consecutive DPAS without interruption)
Pipeline data loads with compute (prefetch next tile, compute current)
Reduce register transpose overhead between DPAS phases
Use {Atomic} annotations where possible (compiler-dependent)

6. Instruction Overhead (mov Instructions)

Symptom: Large blocks of mov(4|M0) instructions in ISA dump, especially for data shuffling.

Diagnosis: ISA disassembly analysis. Common in register transpose operations.

Solutions:

Use SLM for data layout transformations (write one layout, read another)
Use lsc_slm_scatter to transpose via SLM (measured +1.7 TFLOPS in SDP HD=256)
Separate type conversion from data movement (FP32->FP16 before transpose)

Profiling Tools

VTune GPU Compute/Media Hotspots

Full-featured profiler for Intel GPUs. Captures hardware counters, timeline, and per-kernel metrics.

# Capture GPU Compute Hotspots
vtune -collect gpu-hotspots -knob gpu-sampling-interval=1 -- ./kernel.exe

# Key metrics to examine:
# - XMX Busy %
# - EU Active %
# - L3 Bandwidth
# - SLM Bandwidth
# - GPU Occupancy
# - Stall reasons (Scoreboard, Send, Inst Fetch, etc.)

See intel-gpu-vtune-profiling skill for detailed VTune workflow.

GTPin (ISA-Level Profiling)

Instruction-level profiling showing cycle counts per ISA instruction. Essential for micro-optimization.

# GTPin profiling
set GTPIN_KIT=C:\path\to\gtpin
%GTPIN_KIT%\Bin\gtpin.exe --installDir %GTPIN_KIT%\Bin ^
    -t LatencyProfiler -- ./kernel.exe

See xe2-gtpin-profiling skill for GTPin setup and ISA analysis.

ISA Disassembly

# Build with ISA dump
icpx kernel.cpp -o kernel.exe -fsycl -fsycl-targets=spir64_gen `
    -Xs "-device bmg -options -doubleGRF" -O2 -w -save-temps

# Disassemble
ocloc disasm -file <spirv_file> -device bmg -dump isa_dump/

# Key things to look for:
# - DPAS instruction count and density (consecutive DPAS)
# - mov instruction count (especially mov(4|M0) for register shuffles)
# - send instruction count (memory operations)
# - Spill/fill patterns (stack access)
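The first three items in that list can be tallied mechanically from a text dump. The sketch below is a rough helper, not part of any Intel tool: it assumes one instruction per line with the mnemonic as the first token that does not start with `(`, treats `//` lines as comments and tokens ending in `:` as labels, and the exact dump format may differ between compiler versions.

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>

struct IsaStats {
    int dpas = 0;
    int mov = 0;
    int send = 0;
    int longest_dpas_run = 0;  // longest run of back-to-back dpas instructions
};

inline IsaStats scan_isa(const std::string& text) {
    IsaStats s;
    int run = 0;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream tokens(line);
        std::string tok, mnemonic;
        while (tokens >> tok) {
            if (tok[0] == '(') continue;              // skip "(W)" or predicate prefixes
            mnemonic = tok.substr(0, tok.find('.'));  // "dpas.8x8" -> "dpas"
            break;
        }
        // Ignore blank lines, comments, and labels.
        if (mnemonic.empty() || mnemonic.rfind("//", 0) == 0 || mnemonic.back() == ':') continue;
        if (mnemonic == "dpas") {
            ++s.dpas;
            s.longest_dpas_run = std::max(s.longest_dpas_run, ++run);
        } else {
            run = 0;                                  // any other instruction breaks the run
            if (mnemonic == "mov") ++s.mov;
            if (mnemonic.rfind("send", 0) == 0) ++s.send;  // send, sendc, ...
        }
    }
    return s;
}
```

Comparing `dpas`, `mov`, and `longest_dpas_run` before and after a change gives a quick signal of whether DPAS density improved, without reading the whole dump by hand.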

Optimization Patterns

1. Load/Compute Separation

Issue all data loads before starting any compute. With the loads in flight back to back, the GPU can pipeline them and hide memory latency behind the compute that follows.

// BAD: interleaved load/compute
for (int i = 0; i < N; i++) {
    auto data = block_load<half, 128>(ptr + i * 128);
    result[i] = compute(data);
}

// GOOD: all loads first, then all compute (+29% measured)
simd<half, 128> data[N];
for (int i = 0; i < N; i++) data[i] = block_load<half, 128>(ptr + i * 128);
for (int i = 0; i < N; i++) result[i] = compute(data[i]);

2. Cross-Phase Prefetch

Prefetch data for the next phase while computing the current phase.

// During VS GEMM phase, prefetch K for next QK iteration
for (int k = 0; k < kv_len; k += KV_CHUNK) {
    // QK phase
    qk_compute(k);

    // Prefetch next K during VS phase
    if (k + KV_CHUNK < kv_len)
        lsc_prefetch_2d<...>(k_payload);  // Next K tile

    // VS phase
    vs_compute(k);
}

3. Pipelined K Loads

Load the next data tile during the current DPAS computation.

// Load first K tile
auto k_tile = lsc_load_2d<...>(k_payload);

for (int d = 0; d < D_BLKS; d++) {
    // Start next load
    k_payload.set_x((d + 1) * K);
    auto k_next = (d < D_BLKS - 1) ? lsc_load_2d<...>(k_payload) : k_tile;

    // Compute with current tile (overlaps with next load)
    acc = xmx::dpas<...>(acc, k_tile, q_tile[d]);

    k_tile = k_next;
}

4. Loop Unrolling

// Explicit unroll for critical inner loops
#pragma unroll
for (int i = 0; i < TILE_SIZE; i++) {
    acc += data[i] * weights[i];
}

#pragma unroll is essential for ESIMD inner loops. Without it, the compiler may generate loop overhead that dominates small tile computations.

5. SIMD Width Selection

Choose SIMD width based on the operation:

// SIMD16 for reductions (reciprocal, sum)
simd<float, 16> inv = 1.0f / sum;  // Single SIMD16 divide

// SIMD32 for mad fusion
// fp32_sum = fp32_sum * delta + local_sum generates SIMD32 mad with 2x throughput

// Match SIMD width to data tile size for zero-cost select operations
simd<half, 128> full_row;
auto slice = full_row.select<16, 1>(offset);  // Zero-cost view, no data movement

6. Deferred Compensation (Online Softmax)

In flash attention, defer the A_tile *= delta correction to before VS GEMM rather than interleaving inside the VS loop:

// BAD: interleaved compensation inside VS inner loop
for (int kv = 0; kv < KV_PER_SG; kv++) {
    a_tile[kv] *= delta;  // Breaks DPAS pipeline
    acc = dpas(acc, s_tile, v_tile);
}

// GOOD: deferred compensation before VS phase
for (int ii = 0; ii < Q_ROWS; ii++) a_tile[ii] *= delta;  // All at once
for (int kv = 0; kv < KV_PER_SG; kv++) {
    acc = dpas(acc, s_tile, v_tile);  // Clean DPAS pipeline
}
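For readers unfamiliar with where delta comes from, here is a scalar reference of online softmax for a single query row. It is a plain C++ illustration, not the ESIMD kernel: scores are assumed finite, the function name is hypothetical, and the rescale by delta = exp(m_old - m_new) is applied once per chunk before the accumulate loop, mirroring the deferred form above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Returns sum_i softmax(s)_i * v_i, processing the scores chunk by chunk.
inline double online_softmax_dot(const std::vector<double>& s,
                                 const std::vector<double>& v,
                                 std::size_t chunk) {
    assert(s.size() == v.size() && !s.empty() && chunk > 0);
    double m = -std::numeric_limits<double>::infinity();  // running max
    double l = 0.0;                                       // running denominator
    double acc = 0.0;                                     // running numerator
    for (std::size_t base = 0; base < s.size(); base += chunk) {
        const std::size_t end = std::min(base + chunk, s.size());
        double m_new = m;
        for (std::size_t i = base; i < end; ++i) m_new = std::max(m_new, s[i]);

        const double delta = std::exp(m - m_new);  // 0 on the first chunk
        l *= delta;    // compensation applied once, outside the inner loop
        acc *= delta;

        for (std::size_t i = base; i < end; ++i) {  // uninterrupted accumulate
            const double p = std::exp(s[i] - m_new);
            l += p;
            acc += p * v[i];
        }
        m = m_new;
    }
    return acc / l;
}
```

The result is independent of the chunk size up to rounding, which makes this a convenient correctness oracle when verifying a kernel after moving the compensation step.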

7. Type Conversion Before Data Movement

When both type conversion and data shuffling are needed, do conversion first to reduce the data volume for the shuffle.

// BAD: transpose fp32 data (4 bytes per element), then convert
// Moves 2x the data during transpose

// GOOD: convert fp32 -> fp16 first, then transpose fp16 data (+14% measured)
simd<half, N> fp16_data = convert<half>(fp32_data);
// Now transpose fp16_data (half the register traffic)

Optimization Priority Checklist

Apply optimizations in this priority order (highest impact first):

1. Eliminate register spill: use doubleGRF, reduce tile sizes, reorder operations
2. Maximize DPAS density: consecutive DPAS instructions, pipeline loads
3. Minimize data movement: SLM transpose instead of register transpose, type convert before shuffle
4. Optimize memory access: block_load over gather, 2D loads for tiled access, prefetch
5. Reduce barrier count: named barriers, split arrive/wait, eliminate redundant barriers
6. Fine-tune arithmetic: SIMD reciprocal, mad fusion, vectorized exp
7. Tune tile sizes: balance occupancy vs work per thread vs register pressure

Case Study: SDP HD=256 Optimization Journey

The HD=256 flash attention kernel on BMG illustrates the methodology:

Stage	TFLOPS	% Peak	Key Optimization
Baseline (VNNI)	64	47%	Initial working kernel
Deferred compensation	70	52%	Clean DPAS pipeline
Named barriers	71	53%	Small barrier overhead reduction
FP16 before transpose	82	61%	Type convert before shuffle (-50% register traffic)
SIMD reciprocal + mad fusion	83	61%	Fine arithmetic tuning
32-thread WG + pipelined K	84	62%	More DPAS parallelism
lsc_slm_scatter S transpose	88	65%	SLM scatter eliminates ~270 mov instructions

Key takeaways:

Data movement optimizations delivered the most: FP16 before transpose (+11 TFLOPS) and SLM scatter (+4 TFLOPS) together account for 15 of the 24 TFLOPS gained over the baseline
Arithmetic micro-optimizations gave small but cumulative gains
Several approaches that seemed promising (double-buffered SLM, early V loads, VNNI interleave) were neutral or regressive
Always measure; intuition about GPU performance is often wrong

Related Skills

Skill	Relevance
`intel-gpu-vtune-profiling`	Detailed VTune GPU profiling workflow
`xe2-gtpin-profiling`	GTPin ISA-level profiling on Xe2
`intel-esimd-base`	Foundational ESIMD programming
`esimd-lsc-2d-gather-scatter`	LSC 2D/1D/gather/scatter operations
`esimd-lsc-slm`	SLM operations and patterns
`xe2-sdp-hd256`	HD=256 SDP kernel optimization case study
`xe2-sdp-kernels`	HD=128 SDP kernels
`xe2-dpas-patterns`	DPAS/XMX tiling and VNNI layout
`xe2-nbarrier-pipelining`	Named barrier arrive/wait patterns
`xe3-onednn-fp16-gemm`	oneDNN FP16 GEMM on Xe3
`xe3-esimd-kernels`	ESIMD kernels on Xe3/PTL
`sycl-esimd-build`	Build flags, doubleGRF, spill detection
