一键在 Manus 中运行任何 Skill

$pwd:

xpu-kernels

Name: Xpu Kernels
Author: huggingface

// Provides guidance for writing, optimizing, and benchmarking Triton kernels for Intel XPU GPUs (Battlemage/Arc Pro B50) using the Xe-Forge optimization framework. Includes an LLM-driven trial-loop workflow (analyze, validate, benchmark, profile, finalize), XPU-specific patterns (tensor descriptors, GRF mode, tile swizzling), KernelBench fused kernels, and Flash Attention.

在 Manus 中运行

$ git log --oneline --stat

stars:654

forks:94

updated:2026年5月20日 09:00

文件资源管理器

26 个文件

SKILL.md

readonly

related-skills.json

同仓库

rocm-kernels.md

from "huggingface/kernels"

Provides guidance for writing and benchmarking optimized Triton kernels for AMD GPUs (MI355X, R9700) on ROCm, targeting HuggingFace diffusers (LTX-Video, SD3, FLUX) and transformers. Core kernels: RMSNorm, RoPE 3D, GEGLU, AdaLN. Includes XCD swizzle, autotune, diffusers integration patterns, and LTX-Video pipeline injection.

2026-04-15654

package.json

"author": "huggingface"

"repository": "huggingface/kernels"

打开 GitHub 仓库查看创作者相关仓库

$ install --global

$ download --local

在 Manus 中运行

$ useful --forSOC

软件开发工程师计算机与数学类职业15-1252L4

一键运行任何 Skill

name	xpu-kernels
description	Provides guidance for writing, optimizing, and benchmarking Triton kernels for Intel XPU GPUs (Battlemage/Arc Pro B50) using the Xe-Forge optimization framework. Includes an LLM-driven trial-loop workflow (analyze, validate, benchmark, profile, finalize), XPU-specific patterns (tensor descriptors, GRF mode, tile swizzling), KernelBench fused kernels, and Flash Attention.
disable-model-invocation	false
user-invocable	true
allowed-tools	Read, Grep, Glob, Bash
argument-hint	kernel type: gemm, reduction, flash-attention, optimize, benchmark, tensor-descriptors, xe-forge

XPU Triton Kernels for Intel GPUs

This skill provides patterns and guidance for developing optimized Triton kernels targeting Intel XPU GPUs (Battlemage/Arc Pro B50). It integrates the Xe-Forge optimization framework — an LLM-driven loop that transforms PyTorch code into fast Triton kernels.

Quick Start

Optimize a Kernel (Xe-Forge Workflow)

The full optimization workflow analyzes a PyTorch baseline, generates Triton kernel variants in a branching trial tree, benchmarks each on XPU hardware, and finalizes the best result.

# 1. Analyze the baseline
python scripts/analyze_kernel.py test_kernels/70_Gemm_Sigmoid_Scaling_ResidualAdd_pytorch.py

# 2. Initialize trial tracking
python scripts/trial_manager.py init 70_Gemm_Sigmoid test_kernels/70_Gemm_Sigmoid_Scaling_ResidualAdd_pytorch.py

# 3. Validate a generated kernel (no GPU needed)
python scripts/validate_triton.py my_kernel.py

# 4. Benchmark correctness + performance
python scripts/benchmark.py test_kernels/70_Gemm_Sigmoid_Scaling_ResidualAdd_pytorch.py my_kernel.py

# 5. Profile with VTune (optional)
python scripts/xpu_profiler.py my_kernel.py

# 6. Finalize best trial
python scripts/trial_manager.py finalize 70_Gemm_Sigmoid optimized_triton.py

Supported Hardware

GPU	Architecture	XVEs	Mem BW	Key Feature	Verified
Battlemage G21 / Arc Pro B50	Xe2	128	~500 GB/s	Tensor descriptors, GRF 256	Yes

See the Intel XPU Backend for Triton for supported hardware.

When This Skill Applies

Use this skill when:

Optimizing PyTorch operations into Triton kernels for Intel XPU
Writing GEMM, fused kernels, reductions, or Flash Attention for Intel GPUs
Running the Xe-Forge optimization loop (analyze → validate → benchmark → profile → finalize)
Benchmarking kernel performance against PyTorch baseline on XPU

Xe-Forge Optimization Workflow

Transform PyTorch code into optimized Triton kernels for Intel XPU. Kernels must be numerically equivalent and faster than baseline.

Configuration — Read `config.yaml` first

At the start of every session, read scripts/config.yaml. It controls:

max_trials — hard cap on optimization trials; always run all of them (use this instead of hardcoded "10")
vtune_enabled — if false, skip ALL VTune profiling steps (Step 3.6 and profiler-related decisions)
vtune_bin — path to the VTune binary (also settable via VTUNE_BIN env var)

Rules — Never Violate

ONLY create Triton kernel files (test_kernels/*_triton.py or trial files t<trial_id>.py).
NEVER create benchmark scripts, test scripts, helper utilities, or any other Python files.
NEVER write custom scripts to measure performance or test correctness — ONLY use scripts/benchmark.py.
If a tool fails, STOP and report the error. Do NOT work around it with custom scripts.
Generated kernels must be self-contained — all helper functions inline.
You MUST run all max_trials trials from config.yaml. Do NOT stop early due to plateau — LLM sampling can discover new ideas at any point. The only valid early stop is speedup > 5x.

Mandatory Tools

CRITICAL — Single-XPU serialization: There is only ONE XPU on this machine. You MUST NOT run multiple GPU workloads in parallel. benchmark.py and xpu_profiler.py must execute strictly one at a time — concurrent GPU jobs produce wrong results. CPU-only tools (analyze_kernel.py, validate_triton.py, trial_manager.py) are safe to parallelize with each other and with anything else.

Tool	Command	Purpose
Analyze	`python scripts/analyze_kernel.py <file>`	Static analysis: operations, shapes, fusion opportunities
Validate	`python scripts/validate_triton.py <file>`	Syntax + constraint checks before GPU time
Benchmark	`python scripts/benchmark.py <baseline> <triton> [--triton-baseline] [--baseline-us <cached>]`	Correctness + performance via ai-bench
Profile	`python scripts/xpu_profiler.py <file>`	VTune GPU hardware counters + recommendations
Init trials	`python scripts/trial_manager.py init <kernel_name> <baseline_file> [--triton-baseline]`	Initialize trial tracking
Save trial	`python scripts/trial_manager.py save <kernel_name> <file> [--parent <parent_id>] [--strategy "..."]`	Save trial to tree
Record result	`python scripts/trial_manager.py result <kernel_name> <trial_id> --validation pass --correctness <pass\|fail> --speedup <float> --baseline_us <float> --triton_us <float>`	Record benchmark result
Check status	`python scripts/trial_manager.py status <kernel_name>`	View trial tree
Best trial	`python scripts/trial_manager.py best <kernel_name>`	Get best trial
Baseline time	`python scripts/trial_manager.py baseline-us <kernel_name>`	Cached baseline time for `--baseline-us`
Finalize	`python scripts/trial_manager.py finalize <kernel_name> <name>_triton.py`	Copy best trial to output

Workflow Steps

Step 1: Analyze

Read the baseline source file. Identify shapes, dtypes, operations, fusion opportunities.
If baseline is PyTorch: run python scripts/analyze_kernel.py <pytorch_file>.
If baseline is Triton (--triton-baseline): skip analyze_kernel.py (it only supports PyTorch). Read the Triton file directly.
Read relevant knowledge base files: start with references/correctness.yaml and references/xpu_optimizations.yaml.
Read references/implementation_reference.md for templates and the Model class pattern.

Step 2: Initialize

python scripts/trial_manager.py init <kernel_name> <baseline_file> [--triton-baseline]

Step 3: Trial Loop (always run all `max_trials` from config.yaml)

For each trial:

Write kernel — start from templates or modify previous trial. See references/implementation_reference.md.
Validate — python scripts/validate_triton.py <triton_file> (fix until passing; doesn't count as a trial).
Save — python scripts/trial_manager.py save <kernel_name> <triton_file> --parent <parent_id> --strategy "description". Omit --parent for the first trial (t0).
Benchmark (MANDATORY every trial):
- Trial t0: python scripts/benchmark.py <baseline_file> <triton_file> [--triton-baseline] (measures both baseline and triton).
- Trials t1+: Get cached baseline via python scripts/trial_manager.py baseline-us <kernel_name>, then run python scripts/benchmark.py <baseline_file> <triton_file> [--triton-baseline] --baseline-us <cached_value> (skips baseline perf, saves time).
- After finalize: Re-run benchmark.py without --baseline-us for final accurate comparison.
Record — python scripts/trial_manager.py result <kernel_name> <trial_id> --validation pass --correctness <pass|fail> --speedup <float> --baseline_us <float> --triton_us <float> (runtimes from benchmark output).
Profile (MANDATORY after t1, if vtune_enabled is true in config.yaml) — Run python scripts/xpu_profiler.py <triton_file> after your first benchmarked trial. Use its output to guide subsequent trial strategies. Run again if speedup plateaus after 2+ additional trials. Skip this step entirely if vtune_enabled is false.
Decide next action (use profiler output from step 6 to inform decisions):
- Speedup > 5x → stop (excellent), finalize
- Speedup improved → continue on this branch, try next optimization level
- Speedup regressed → branch back to best trial, try different strategy
- Correctness failed → fix on same branch
- Profiler says low occupancy (if vtune_enabled) → increase tile sizes, check references/xpu_optimizations.yaml
- Profiler says overhead kernels dominate (if vtune_enabled) → pre-pack to bf16, see references/optimization_levels.yaml
- Plateau → do NOT stop. Try a fundamentally different approach (different algorithm, tiling, fusion strategy). LLM sampling can discover new ideas.
- See references/optimization_strategies.md for the full "try harder" decision tree

Step 4: Finalize

python scripts/trial_manager.py finalize <kernel_name> <name>_triton.py

Reference Docs — Read During Step 1

Doc	Contents
`references/implementation_reference.md`	Code templates, Model class pattern, GEMM example
`references/optimization_strategies.md`	Strategy reference, optimization levels, checklist
`references/workflow_details.md`	Detailed workflow, decision tree, benchmarking/validation details
`references/correctness.yaml`	Critical constraints to avoid bugs
`references/xpu_optimizations.yaml`	XPU-specific patterns (tensor descriptors, GRF, swizzling)
`references/fusion_patterns.yaml`	When to fuse vs split operations
`references/optimization_levels.yaml`	Progressive optimization with "try harder" decision tree

Existing Baselines Are Naive

The test_kernels/*.py Triton files (non-pytorch) are unoptimized baselines. They use manual pointer arithmetic, lack autotune, and miss XPU optimizations. Do NOT copy their patterns. Use references/implementation_reference.md instead.

Core XPU Kernel Patterns

Tensor Descriptors (Preferred on XPU)

Tensor descriptors produce better address generation and memory access codegen than block pointers on Intel XPU.

desc = tl.make_tensor_descriptor(
    base=ptr, shape=[M, N],
    strides=[stride_m, stride_n],
    block_shape=[BLOCK_M, BLOCK_N],
)
block = tl.load(desc, [pid_m, pid_n], boundary_check=(0, 1))

GRF Mode '256'

Use the large register file for compute-heavy kernels:

@triton.autotune(
    configs=[triton.Config({'BLOCK_M': 256, 'BLOCK_N': 256}, num_warps=32)],
    key=['M', 'N', 'K'],
)
@triton.jit(launch_metadata=lambda *args, **kwargs: {'grf_mode': '256'})
def kernel(...):
    ...

Tile Swizzling

Use 1D grid with GROUP_SIZE_M for L2 locality:

grid = lambda META: (triton.cdiv(M, META['BLOCK_M']) * triton.cdiv(N, META['BLOCK_N']),)
# Inside kernel:
pid = tl.program_id(0)
num_pid_n = tl.cdiv(N, BLOCK_N)
group_id = pid // (GROUP_SIZE_M * num_pid_n)

bf16 Inputs with fp32 Accumulation

a = tl.load(a_desc, [pid_m, k], boundary_check=(0, 1))
b = tl.load(b_desc, [k, pid_n], boundary_check=(0, 1))
acc += tl.dot(a.to(tl.bfloat16), b.to(tl.bfloat16), acc=acc)  # fp32 accumulator

Critical XPU Constraints

NO default values for @triton.autotune meta-parameters in kernel signature
1D grid when using tile swizzling (GROUP_SIZE_M)
boundary_check uses dimension indices (0, 1), not booleans
Cast batch indices to int64 before stride multiplication
Prefer tensor descriptors over block pointers for all new XPU kernels
Do NOT mix block pointer and tensor descriptor APIs on same operation
Pre-zero output buffers when using atomic accumulation
Model class must be compatible with ai-bench (nn.Module with nn.Linear)
Match get_inputs(), get_init_inputs(), and module-level constants from *_pytorch.py

Full constraint list: correctness.yaml

Performance Results

Measured on Intel Battlemage G21 / Arc Pro B50 (128 XVEs). All runtimes are median of benchmark trials.

KernelBench Level 2 — Fused Kernels (bf16)

Speedup is vs. PyTorch eager baseline. Includes GEMM+Sigmoid+Scaling, GEMM+GELU+Softmax, Conv+BatchNorm+ReLU, and other fused patterns.

Flash Attention Forward (fp16)

Baseline is the flash attention kernel from the Intel XPU Triton backend; speedup is vs. that kernel across multiple sequence lengths.

Full results: see the Xe-Forge repository.

Common Issues

Issue	Symptom	Fix
Autotune BLOCK_D	Wrong results (max_abs 4-8+)	Never autotune BLOCK_D. Use `triton.next_power_of_2(D)`
Python min/max	Runtime error	`tl.minimum()`/`tl.maximum()`

Project Structure

xpu-kernels/
├── SKILL.md                                    # This file (skill definition + workflow)
├── manifest.txt                                # Files included in this skill
│
├── scripts/                                    # Standalone CLI tools
│   ├── analyze_kernel.py                       # PyTorch → operations, shapes, fusion opportunities
│   ├── validate_triton.py                      # Syntax + constraint checks
│   ├── benchmark.py                            # Correctness + performance via ai-bench
│   ├── trial_manager.py                        # Tree-structured trial management
│   ├── xpu_profiler.py                         # VTune GPU hardware counters
│   ├── config.py                               # Shared configuration loader
│   ├── config.yaml                             # Session config (max_trials, vtune)
│   └── requirements.txt                        # Python dependencies
│
└── references/                                 # Knowledge base + integration guides
    ├── correctness.yaml                        # Hard constraints for XPU Triton
    ├── xpu_optimizations.yaml                  # Tensor descriptors, GRF, swizzling
    ├── implementation_reference.md             # Code templates, Model class pattern
    ├── implementation_reference.md             # Code templates, Model class pattern
    ├── optimization_strategies.md              # Strategy reference + "try harder" tree
    ├── optimization_levels.yaml                # Progressive L1-L5 optimization levels
    ├── workflow_details.md                     # Detailed workflow and decision tree
    ├── fusion_patterns.yaml                    # When to fuse vs split
    ├── memory_patterns.yaml                    # Access patterns and coalescing
    ├── dtype_optimizations.yaml                # Mixed precision choices
    ├── persistent_kernel_patterns.yaml         # Stream K and persistent kernels
    ├── kernel-templates.md                     # Triton kernel templates for XPU
    └── kernelbench-classification.md           # KernelBench operator taxonomy

XPU Triton Kernels for Intel GPUs

Quick Start

Optimize a Kernel (Xe-Forge Workflow)

The full optimization workflow analyzes a PyTorch baseline, generates Triton kernel variants in a branching trial tree, benchmarks each on XPU hardware, and finalizes the best result.

# 1. Analyze the baseline
python scripts/analyze_kernel.py test_kernels/70_Gemm_Sigmoid_Scaling_ResidualAdd_pytorch.py

# 2. Initialize trial tracking
python scripts/trial_manager.py init 70_Gemm_Sigmoid test_kernels/70_Gemm_Sigmoid_Scaling_ResidualAdd_pytorch.py

# 3. Validate a generated kernel (no GPU needed)
python scripts/validate_triton.py my_kernel.py

# 4. Benchmark correctness + performance
python scripts/benchmark.py test_kernels/70_Gemm_Sigmoid_Scaling_ResidualAdd_pytorch.py my_kernel.py

# 5. Profile with VTune (optional)
python scripts/xpu_profiler.py my_kernel.py

# 6. Finalize best trial
python scripts/trial_manager.py finalize 70_Gemm_Sigmoid optimized_triton.py

Supported Hardware

GPU	Architecture	XVEs	Mem BW	Key Feature	Verified
Battlemage G21 / Arc Pro B50	Xe2	128	~500 GB/s	Tensor descriptors, GRF 256	Yes

See the Intel XPU Backend for Triton for supported hardware.

When This Skill Applies

Use this skill when:

Optimizing PyTorch operations into Triton kernels for Intel XPU
Writing GEMM, fused kernels, reductions, or Flash Attention for Intel GPUs
Running the Xe-Forge optimization loop (analyze → validate → benchmark → profile → finalize)
Benchmarking kernel performance against PyTorch baseline on XPU

Xe-Forge Optimization Workflow

Transform PyTorch code into optimized Triton kernels for Intel XPU. Kernels must be numerically equivalent and faster than baseline.

Configuration — Read `config.yaml` first

At the start of every session, read scripts/config.yaml. It controls:

max_trials — hard cap on optimization trials; always run all of them (use this instead of hardcoded "10")
vtune_enabled — if false, skip ALL VTune profiling steps (Step 3.6 and profiler-related decisions)
vtune_bin — path to the VTune binary (also settable via VTUNE_BIN env var)

Rules — Never Violate

ONLY create Triton kernel files (test_kernels/*_triton.py or trial files t<trial_id>.py).
NEVER create benchmark scripts, test scripts, helper utilities, or any other Python files.
NEVER write custom scripts to measure performance or test correctness — ONLY use scripts/benchmark.py.
If a tool fails, STOP and report the error. Do NOT work around it with custom scripts.
Generated kernels must be self-contained — all helper functions inline.
You MUST run all max_trials trials from config.yaml. Do NOT stop early due to plateau — LLM sampling can discover new ideas at any point. The only valid early stop is speedup > 5x.

Mandatory Tools

Tool	Command	Purpose
Analyze	`python scripts/analyze_kernel.py <file>`	Static analysis: operations, shapes, fusion opportunities
Validate	`python scripts/validate_triton.py <file>`	Syntax + constraint checks before GPU time
Benchmark	`python scripts/benchmark.py <baseline> <triton> [--triton-baseline] [--baseline-us <cached>]`	Correctness + performance via ai-bench
Profile	`python scripts/xpu_profiler.py <file>`	VTune GPU hardware counters + recommendations
Init trials	`python scripts/trial_manager.py init <kernel_name> <baseline_file> [--triton-baseline]`	Initialize trial tracking
Save trial	`python scripts/trial_manager.py save <kernel_name> <file> [--parent <parent_id>] [--strategy "..."]`	Save trial to tree
Record result	`python scripts/trial_manager.py result <kernel_name> <trial_id> --validation pass --correctness <pass\|fail> --speedup <float> --baseline_us <float> --triton_us <float>`	Record benchmark result
Check status	`python scripts/trial_manager.py status <kernel_name>`	View trial tree
Best trial	`python scripts/trial_manager.py best <kernel_name>`	Get best trial
Baseline time	`python scripts/trial_manager.py baseline-us <kernel_name>`	Cached baseline time for `--baseline-us`
Finalize	`python scripts/trial_manager.py finalize <kernel_name> <name>_triton.py`	Copy best trial to output

Workflow Steps

Step 1: Analyze

Read the baseline source file. Identify shapes, dtypes, operations, fusion opportunities.
If baseline is PyTorch: run python scripts/analyze_kernel.py <pytorch_file>.
If baseline is Triton (--triton-baseline): skip analyze_kernel.py (it only supports PyTorch). Read the Triton file directly.
Read relevant knowledge base files: start with references/correctness.yaml and references/xpu_optimizations.yaml.
Read references/implementation_reference.md for templates and the Model class pattern.

Step 2: Initialize

python scripts/trial_manager.py init <kernel_name> <baseline_file> [--triton-baseline]

Step 3: Trial Loop (always run all `max_trials` from config.yaml)

For each trial:

Write kernel — start from templates or modify previous trial. See references/implementation_reference.md.
Validate — python scripts/validate_triton.py <triton_file> (fix until passing; doesn't count as a trial).
Save — python scripts/trial_manager.py save <kernel_name> <triton_file> --parent <parent_id> --strategy "description". Omit --parent for the first trial (t0).
Benchmark (MANDATORY every trial):
- Trial t0: python scripts/benchmark.py <baseline_file> <triton_file> [--triton-baseline] (measures both baseline and triton).
- Trials t1+: Get cached baseline via python scripts/trial_manager.py baseline-us <kernel_name>, then run python scripts/benchmark.py <baseline_file> <triton_file> [--triton-baseline] --baseline-us <cached_value> (skips baseline perf, saves time).
- After finalize: Re-run benchmark.py without --baseline-us for final accurate comparison.
Record — python scripts/trial_manager.py result <kernel_name> <trial_id> --validation pass --correctness <pass|fail> --speedup <float> --baseline_us <float> --triton_us <float> (runtimes from benchmark output).
Profile (MANDATORY after t1, if vtune_enabled is true in config.yaml) — Run python scripts/xpu_profiler.py <triton_file> after your first benchmarked trial. Use its output to guide subsequent trial strategies. Run again if speedup plateaus after 2+ additional trials. Skip this step entirely if vtune_enabled is false.
Decide next action (use profiler output from step 6 to inform decisions):
- Speedup > 5x → stop (excellent), finalize
- Speedup improved → continue on this branch, try next optimization level
- Speedup regressed → branch back to best trial, try different strategy
- Correctness failed → fix on same branch
- Profiler says low occupancy (if vtune_enabled) → increase tile sizes, check references/xpu_optimizations.yaml
- Profiler says overhead kernels dominate (if vtune_enabled) → pre-pack to bf16, see references/optimization_levels.yaml
- Plateau → do NOT stop. Try a fundamentally different approach (different algorithm, tiling, fusion strategy). LLM sampling can discover new ideas.
- See references/optimization_strategies.md for the full "try harder" decision tree

Step 4: Finalize

python scripts/trial_manager.py finalize <kernel_name> <name>_triton.py

Reference Docs — Read During Step 1

Doc	Contents
`references/implementation_reference.md`	Code templates, Model class pattern, GEMM example
`references/optimization_strategies.md`	Strategy reference, optimization levels, checklist
`references/workflow_details.md`	Detailed workflow, decision tree, benchmarking/validation details
`references/correctness.yaml`	Critical constraints to avoid bugs
`references/xpu_optimizations.yaml`	XPU-specific patterns (tensor descriptors, GRF, swizzling)
`references/fusion_patterns.yaml`	When to fuse vs split operations
`references/optimization_levels.yaml`	Progressive optimization with "try harder" decision tree

Existing Baselines Are Naive

Core XPU Kernel Patterns

Tensor Descriptors (Preferred on XPU)

Tensor descriptors produce better address generation and memory access codegen than block pointers on Intel XPU.

desc = tl.make_tensor_descriptor(
    base=ptr, shape=[M, N],
    strides=[stride_m, stride_n],
    block_shape=[BLOCK_M, BLOCK_N],
)
block = tl.load(desc, [pid_m, pid_n], boundary_check=(0, 1))

GRF Mode '256'

Use the large register file for compute-heavy kernels:

@triton.autotune(
    configs=[triton.Config({'BLOCK_M': 256, 'BLOCK_N': 256}, num_warps=32)],
    key=['M', 'N', 'K'],
)
@triton.jit(launch_metadata=lambda *args, **kwargs: {'grf_mode': '256'})
def kernel(...):
    ...

Tile Swizzling

Use 1D grid with GROUP_SIZE_M for L2 locality:

grid = lambda META: (triton.cdiv(M, META['BLOCK_M']) * triton.cdiv(N, META['BLOCK_N']),)
# Inside kernel:
pid = tl.program_id(0)
num_pid_n = tl.cdiv(N, BLOCK_N)
group_id = pid // (GROUP_SIZE_M * num_pid_n)

bf16 Inputs with fp32 Accumulation

a = tl.load(a_desc, [pid_m, k], boundary_check=(0, 1))
b = tl.load(b_desc, [k, pid_n], boundary_check=(0, 1))
acc += tl.dot(a.to(tl.bfloat16), b.to(tl.bfloat16), acc=acc)  # fp32 accumulator

Critical XPU Constraints

NO default values for @triton.autotune meta-parameters in kernel signature
1D grid when using tile swizzling (GROUP_SIZE_M)
boundary_check uses dimension indices (0, 1), not booleans
Cast batch indices to int64 before stride multiplication
Prefer tensor descriptors over block pointers for all new XPU kernels
Do NOT mix block pointer and tensor descriptor APIs on same operation
Pre-zero output buffers when using atomic accumulation
Model class must be compatible with ai-bench (nn.Module with nn.Linear)
Match get_inputs(), get_init_inputs(), and module-level constants from *_pytorch.py

Full constraint list: correctness.yaml

Performance Results

Measured on Intel Battlemage G21 / Arc Pro B50 (128 XVEs). All runtimes are median of benchmark trials.

KernelBench Level 2 — Fused Kernels (bf16)

Speedup is vs. PyTorch eager baseline. Includes GEMM+Sigmoid+Scaling, GEMM+GELU+Softmax, Conv+BatchNorm+ReLU, and other fused patterns.

Flash Attention Forward (fp16)

Baseline is the flash attention kernel from the Intel XPU Triton backend; speedup is vs. that kernel across multiple sequence lengths.

Full results: see the Xe-Forge repository.

Common Issues

Issue	Symptom	Fix
Autotune BLOCK_D	Wrong results (max_abs 4-8+)	Never autotune BLOCK_D. Use `triton.next_power_of_2(D)`
Python min/max	Runtime error	`tl.minimum()`/`tl.maximum()`

Project Structure

xpu-kernels/
├── SKILL.md                                    # This file (skill definition + workflow)
├── manifest.txt                                # Files included in this skill
│
├── scripts/                                    # Standalone CLI tools
│   ├── analyze_kernel.py                       # PyTorch → operations, shapes, fusion opportunities
│   ├── validate_triton.py                      # Syntax + constraint checks
│   ├── benchmark.py                            # Correctness + performance via ai-bench
│   ├── trial_manager.py                        # Tree-structured trial management
│   ├── xpu_profiler.py                         # VTune GPU hardware counters
│   ├── config.py                               # Shared configuration loader
│   ├── config.yaml                             # Session config (max_trials, vtune)
│   └── requirements.txt                        # Python dependencies
│
└── references/                                 # Knowledge base + integration guides
    ├── correctness.yaml                        # Hard constraints for XPU Triton
    ├── xpu_optimizations.yaml                  # Tensor descriptors, GRF, swizzling
    ├── implementation_reference.md             # Code templates, Model class pattern
    ├── implementation_reference.md             # Code templates, Model class pattern
    ├── optimization_strategies.md              # Strategy reference + "try harder" tree
    ├── optimization_levels.yaml                # Progressive L1-L5 optimization levels
    ├── workflow_details.md                     # Detailed workflow and decision tree
    ├── fusion_patterns.yaml                    # When to fuse vs split
    ├── memory_patterns.yaml                    # Access patterns and coalescing
    ├── dtype_optimizations.yaml                # Mixed precision choices
    ├── persistent_kernel_patterns.yaml         # Stream K and persistent kernels
    ├── kernel-templates.md                     # Triton kernel templates for XPU
    └── kernelbench-classification.md           # KernelBench operator taxonomy

xpu-kernels

同仓库更多 Skills

同仓库更多 Skills

XPU Triton Kernels for Intel GPUs

Quick Start

Optimize a Kernel (Xe-Forge Workflow)

Supported Hardware

When This Skill Applies

Xe-Forge Optimization Workflow

Configuration — Read config.yaml first

Rules — Never Violate

Mandatory Tools

Workflow Steps

Step 1: Analyze

Step 2: Initialize

Step 3: Trial Loop (always run all max_trials from config.yaml)

Step 4: Finalize

Reference Docs — Read During Step 1

Existing Baselines Are Naive

Core XPU Kernel Patterns

Tensor Descriptors (Preferred on XPU)

GRF Mode '256'

Tile Swizzling

bf16 Inputs with fp32 Accumulation

Critical XPU Constraints

Performance Results

KernelBench Level 2 — Fused Kernels (bf16)

Flash Attention Forward (fp16)

Common Issues

Project Structure

See Also

Xe-Forge Tools

XPU Optimization References

Other References

External Resources

XPU Triton Kernels for Intel GPUs

Quick Start

Optimize a Kernel (Xe-Forge Workflow)

Supported Hardware

When This Skill Applies

Xe-Forge Optimization Workflow

Configuration — Read config.yaml first

Rules — Never Violate

Mandatory Tools

Workflow Steps

Step 1: Analyze

Step 2: Initialize

Step 3: Trial Loop (always run all max_trials from config.yaml)

Step 4: Finalize

Reference Docs — Read During Step 1

Existing Baselines Are Naive

Core XPU Kernel Patterns

Tensor Descriptors (Preferred on XPU)

GRF Mode '256'

Tile Swizzling

bf16 Inputs with fp32 Accumulation

Critical XPU Constraints

Performance Results

KernelBench Level 2 — Fused Kernels (bf16)

Flash Attention Forward (fp16)

Common Issues

Project Structure

See Also

Xe-Forge Tools

XPU Optimization References

Other References

External Resources

Configuration — Read `config.yaml` first

Step 3: Trial Loop (always run all `max_trials` from config.yaml)

Configuration — Read `config.yaml` first

Step 3: Trial Loop (always run all `max_trials` from config.yaml)