intel

Rules for automatically determining which disable_* labels to apply to a PR based on the file paths changed. Used by the auto-label workflow.

2026-07-07

extract-asm-onednn

Extract GPU ISA from oneDNN ngen-JIT kernels. This is the ONLY codegen path that bypasses the standard SYCL/SPIR-V/IGC stack. oneDNN uses its own native code generator (ngen) that directly emits GPU ISA bytes — no SPIR-V, no IGC, no zebin ELF, no .debug_line. Use when extracting ASM from oneDNN kernels (gemm_kernel, gen_conv_kernel), matmul, linear, conv, or SDPA-graph ops dispatched via mkldnn.

extract-asm-syclkernel-aot

Extract GPU ISA from AOT-compiled SYCL binaries using NEO driver runtime dump. Use when the kernel is a SYCL symbol (_ZTS...) from an AOT binary such as libtorch_xpu.so, SYCL-TLA FMHA, or any DPC++ binary compiled with -fsycl-targets=spir64_gen -Xs "-device <gpu>".

extract-asm-syclkernel-jit

Extract GPU ISA from JIT-compiled SYCL kernels. The binary only contains SPIR-V; IGC compiles it to native ISA at first launch. The zebin only exists in memory at runtime, so IGC_ShaderDumpEnable=1 is required to capture it. Use when extracting ASM from SYCL JIT kernels, torch-xpu-ops built with TORCH_XPU_ARCH_LIST=none, or standalone DPC++ without AOT flags.

extract-asm-triton

Extract GPU ISA from Triton kernels on XPU. Triton compiles through the same IGC backend as SYCL (Triton IR → SPIR-V → IGC → zebin). Extraction is identical to sycl-jit: IGC_ShaderDumpEnable=1 captures the zebin at runtime. Use when extracting ASM from torch.compile fusions (triton_per_fused, triton_poi, triton_red), standalone @triton.jit kernels, or Inductor-generated XPU kernels.

extract-xpu-kernel-asm

Extract Intel GPU ISA (assembly) from any XPU kernel. Classifies the codegen path (SYCL AOT, SYCL JIT, Triton, or oneDNN ngen) and delegates to the matching extraction skill. Use when asked to extract ASM, disassemble XPU kernels, get GPU ISA for an aten op, dump shader for PyTorch XPU, or disassemble a standalone DPC++/Triton binary.

asm-source-mapping

Map Intel GPU ISA instruction addresses to precise SYCL/DPC++ source file:line numbers. Primary method reads the DWARF .debug_line section from the GPU zebin ELF. Fallback uses structural pattern recognition by opcode mix. Use when mapping ASM to source code, finding which source line a GPU instruction comes from, or doing DWARF line table analysis on GPU binaries.

2026-07-02

Showing top 8 of 26 collected skills in this repository.

#002

auto-round

7 skills 1.5k 156 updated 2026-06-17

16% of creator

skill

occupation

description

updated

adapt-new-diffusion-model

Adapt AutoRound to support a new diffusion model architecture (DiT, UNet, hybrid AR+DiT). Use when a new diffusion model fails quantization, needs custom output configs, requires a custom pipeline function, or is a hybrid architecture with both autoregressive and diffusion components.

2026-06-17

add-vlm-model

software-quality-assurance-analysts-and-testers

Add support for a new Vision-Language Model (VLM) to AutoRound, including multimodal block handler, calibration dataset template, and special model handling. Use when integrating a new VLM like LLaVA, Qwen2-VL, GLM-Image, Phi-Vision, or similar multi-modal models for quantization.

2026-06-17

review-pr

Review or prepare a pull request for the AutoRound repository — checks registration points for new data types/backends/VLMs, validates Chinese translation parity for modified markdown files, verifies quantization numerical stability (scale overflow, STE gradient flow, group_size padding), confirms test placement and fixture usage, and enforces Apache 2.0 headers and DCO sign-off. Use when performing a code review, running a PR checklist, preparing a merge request, or auditing a contribution before submit.

2026-06-10

adapt-new-llm

Adapt AutoRound to support a new LLM architecture that doesn't work out-of-the-box. Use when quantization fails for a new model type, block detection doesn't find layers, MoE models need unfusing, custom forward passes are needed, or non-standard linear layer types need handling.

2026-05-14

add-inference-backend

Add a new hardware inference backend to AutoRound for deploying quantized models (e.g., CUDA/Marlin, Triton, CPU, HPU, ARK). Use when implementing QuantLinear kernels, registering backend capabilities, or enabling quantized model inference on a new hardware platform.

2026-05-11

add-export-format

Add a new model export format to AutoRound (e.g., auto_round, auto_gptq, auto_awq, gguf, llm_compressor). Use when implementing a new quantized model serialization format, adding a new packing method, or extending export compatibility for deployment frameworks like vLLM, SGLang, or llama.cpp.

2026-04-17

add-quantization-datatype

Add a new quantization data type to AutoRound (e.g., INT, FP8, MXFP, NVFP, GGUF variants). Use when implementing a new weight/activation quantization scheme, registering a new quant function, or extending the data_type registry.

2026-04-17

#003

confidential-computing-zoo

3 skills 35971 updated 2026-02-11

6.8% of creator

skill

occupation

description

updated

check-td-runtime-environment

Check TD Runtime Environment

Get TDVM event log

Get TDVM Quote Information

2026-02-11

#004

intel-performance-skills

3 skills 23030 updated 2026-06-05

6.8% of creator

skill

occupation

description

updated

performance-patterns

network-and-computer-systems-administrators

Detect and fix x86/C/C++ performance patterns from source code or profiling output (perf, VTune, flamegraphs). Invoke when the user asks to optimize, review for performance, or write new SIMD/vectorized code — even without profiling data. Trigger on: serial accumulator loops, narrow SIMD (xmm/ymm that could be ymm/zmm), _mm* intrinsics, HITM/cmpxchg clusters, false sharing, missing restrict or vzeroupper, futex_wake/notify_all thundering herd, hot symbol inside a system library (.so) with a version gap, or any request to write a fast reduction, dot product, or CPU-dispatched function. Patterns: serial accumulator, TTAS spinlock, SIMD upconversion (zipper), false sharing, per-CPU stats, missing vzeroupper, missing restrict, cv-thundering-herd, mutex-to-rwlock, CPU dispatch, library version upgrade, fast CRC32C, known algorithms (Cosine Similarity, Hamming Distance, Jaccard Distance), SIMD sort (x86-simd-sort).

2026-06-05

linux-perf

Profile and fix Linux performance problems using `perf`. Workflows: (A) hardware counters -- IPC, cache-miss, branch mispredictions; (B) hotspot profiling -- which functions and source lines consume CPU, with SIMD and accumulator detection; (C) cache-line contention -- false sharing, HITM, `perf c2c`; (D) core-count scaling -- dual-profile comparison, bottleneck categorization; (E) structured hotspot report with annotated source and pattern observations. Resolution strategies: TTAS spinlock, SIMD upconversion, parallel accumulator, structured false-sharing fix, per-CPU stats. Trigger on: perf, profiling, profile, hotspot, hotspots, cache miss, IPC, false sharing, HITM, scaling, core count, thread scaling, bottleneck, slow code, CPU bound, why is this slow, where does time go, does not scale. When in doubt, invoke this skill -- better to use it unnecessarily than to miss a performance opportunity.

2026-05-22

phoronix-test-suite

Install, run, parse, and optimize benchmarks from the Phoronix Test Suite (PTS). Use this skill whenever the user mentions "phoronix", "pts/", or "phoronix-test-suite", or asks to run, measure, improve, or optimize a PTS test, e.g., "run pts/mt-dgemm", "optimize pts/compress-zstd", "what score does pts/x265 get". Trigger immediately on any `pts/<testname>` reference, even if the user doesn't explicitly say "phoronix". Also trigger when the user asks to find or edit the source code of a PTS test.

2026-05-19

#005

PerfSpect

1 skill 45058 updated 2026-04-10

software-quality-assurance-analysts-and-testers

skill

occupation

description

updated

functional-test

Use this skill when running functional tests to validate PerfSpect code changes, when the user says "run functional tests", "test my changes", "check for regressions", or when verifying a code change did not break existing functionality.

2026-04-10

#006

systemc-compiler

1 skill 31245 updated 2026-05-15

skill

occupation

description

updated

systemc-tools

Use when writing, editing, reviewing, or debugging synthesizable SystemC code for Intel SystemC Compiler (ICSC) and SingleSource library. Covers rules for module hierarchy, channels and ports, process declarations, reset behavior, sensitivity lists, SystemC and C++ data types and collections. Applies communication channels, memory modules, data types and utility functions from Single Source library.

2026-05-15

#007

intel-xpu-backend-for-triton

1 skill 261106 updated 2026-06-04

skill

occupation

description

updated

blockptr-to-tdesc

Translate a Triton kernel from the deprecated block-pointer API (tl.make_block_ptr / tl.advance / tl.load(boundary_check=...)) into an equivalent kernel using the modern device-side tensor-descriptor API (tl.make_tensor_descriptor / desc.load / desc.store) for the Intel XPU backend. Use this skill whenever the user wants to migrate, convert, translate, port, modernize, or "update" a kernel from block pointers to tensor descriptors; whenever they mention tl.make_block_ptr or tl.advance and ask for a modern/non-deprecated equivalent; whenever they ask how to use tensor descriptors in a kernel that currently uses block pointers; or when they paste a kernel using block pointers and ask how to speed it up or make it use DPAS / 2D block I/O on Intel GPU (PVC/BMG). Produce the descriptor form the XPU backend can lower efficiently, not just any form that compiles.

2026-06-04

#008

optimization-zone

1 skill 10431 updated 2026-07-10

skill

occupation

description

updated

vllm-xeon-cpu

Deploy, tune, validate, and benchmark vLLM on Intel Xeon CPUs (CPU-only inference, no GPU). USE FOR: serving and performance optimizing LLMs on Intel Xeon, vLLM CPU install, CPU inference tuning, AMX bfloat16 setup, NUMA pinning, VLLM_CPU_KVCACHE_SPACE, VLLM_CPU_OMP_THREADS_BIND, --dtype=bfloat16, vllm/vllm-openai-cpu Docker image, hardware validation for AMX (amx_tile, amx_bf16, amx_int8), KV cache sizing per NUMA node, --max-num-batched-tokens / --max-num-seqs tuning, vllm bench serve on CPU, TTFT/TPOT measurement. DO NOT USE FOR: GPU vLLM (use upstream vLLM docs), training, quantization tuning beyond INT8/AWQ pointers, model architecture selection (use Intel Xeon AI Performance Advisor), non-Xeon CPUs, vLLM source build deep-dives.

2026-07-10

#009

linux-kernel-oops

1 skill 166 updated 2026-05-01