intel

Rules for automatically determining which disable_* labels to apply to a PR based on the file paths changed. Used by the auto-label workflow.

2026-07-07

extract-asm-onednn

Extract GPU ISA from oneDNN ngen-JIT kernels. This is the ONLY codegen path that bypasses the standard SYCL/SPIR-V/IGC stack. oneDNN uses its own native code generator (ngen) that directly emits GPU ISA bytes — no SPIR-V, no IGC, no zebin ELF, no .debug_line. Use when extracting ASM from oneDNN kernels (gemm_kernel, gen_conv_kernel), matmul, linear, conv, or SDPA-graph ops dispatched via mkldnn.

extract-asm-syclkernel-aot

Extract GPU ISA from AOT-compiled SYCL binaries using NEO driver runtime dump. Use when the kernel is a SYCL symbol (_ZTS...) from an AOT binary such as libtorch_xpu.so, SYCL-TLA FMHA, or any DPC++ binary compiled with -fsycl-targets=spir64_gen -Xs "-device <gpu>".

extract-asm-syclkernel-jit

Extract GPU ISA from JIT-compiled SYCL kernels. The binary only contains SPIR-V; IGC compiles it to native ISA at first launch. The zebin only exists in memory at runtime, so IGC_ShaderDumpEnable=1 is required to capture it. Use when extracting ASM from SYCL JIT kernels, torch-xpu-ops built with TORCH_XPU_ARCH_LIST=none, or standalone DPC++ without AOT flags.
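
As a rough illustration of the runtime-dump step, the sketch below builds an environment with IGC_ShaderDumpEnable=1 and launches a workload under it. Only the variable name comes from the description above; the helper names `shader_dump_env` and `run_with_shader_dump` are hypothetical, and where the dump files land is not covered here.

```python
import os
import subprocess


def shader_dump_env(base_env=None):
    """Return a copy of the environment with IGC shader dumping switched on."""
    env = dict(os.environ if base_env is None else base_env)
    # Variable name taken from the skill description; the value "1" enables it.
    env["IGC_ShaderDumpEnable"] = "1"
    return env


def run_with_shader_dump(cmd):
    """Run a command (list of strings) so that JIT-compiled kernels are dumped."""
    return subprocess.run(cmd, env=shader_dump_env(), check=True)
```

A call such as `run_with_shader_dump(["python", "my_workload.py"])` would run a (hypothetical) workload script with the variable set, leaving the caller's own environment untouched.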

extract-asm-triton

Extract GPU ISA from Triton kernels on XPU. Triton compiles through the same IGC backend as SYCL (Triton IR → SPIR-V → IGC → zebin). Extraction is identical to sycl-jit: IGC_ShaderDumpEnable=1 captures the zebin at runtime. Use when extracting ASM from torch.compile fusions (triton_per_fused, triton_poi, triton_red), standalone @triton.jit kernels, or Inductor-generated XPU kernels.

extract-xpu-kernel-asm

Extract Intel GPU ISA (assembly) from any XPU kernel. Classifies the codegen path (SYCL AOT, SYCL JIT, Triton, or oneDNN ngen) and delegates to the matching extraction skill. Use when asked to extract ASM, disassemble XPU kernels, get GPU ISA for an aten op, dump shader for PyTorch XPU, or disassemble a standalone DPC++/Triton binary.
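
The classification step can be pictured with a small name-based heuristic. This is only a sketch built from the kernel-name hints in the descriptions above (`_ZTS...` for SYCL symbols, `triton_` prefixes for Inductor kernels, `gemm_kernel` / `gen_conv_kernel` for oneDNN); the function `classify_codegen_path` is hypothetical, the actual skill may rely on other signals, and a name alone cannot distinguish SYCL AOT from SYCL JIT.

```python
def classify_codegen_path(kernel_name):
    """Guess the codegen path of an XPU kernel from its name alone."""
    # Inductor-generated Triton kernels: triton_per_fused, triton_poi, triton_red
    if kernel_name.startswith("triton_"):
        return "triton"
    # Mangled SYCL kernel symbols; AOT vs JIT needs build information, not the name
    if kernel_name.startswith("_ZTS"):
        return "sycl"
    # oneDNN ngen-JIT kernels named in the extract-asm-onednn description
    if kernel_name.startswith(("gemm_kernel", "gen_conv_kernel")):
        return "onednn-ngen"
    return "unknown"
```

Anything that falls through to `"unknown"` would need manual inspection before choosing an extraction skill.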

asm-source-mapping

Map Intel GPU ISA instruction addresses to precise SYCL/DPC++ source file:line numbers. Primary method reads the DWARF .debug_line section from the GPU zebin ELF. Fallback uses structural pattern recognition by opcode mix. Use when mapping ASM to source code, finding which source line a GPU instruction comes from, or doing DWARF line table analysis on GPU binaries.

2026-07-02

Currently showing the top 8 of 26 collected skills for this repository.

#002

auto-round

7 skills · 1.5k · 156 · updated 2026-06-17

16% of this creator's skills

skill

Occupation category

Description

Updated

adapt-new-diffusion-model

Adapt AutoRound to support a new diffusion model architecture (DiT, UNet, hybrid AR+DiT). Use when a new diffusion model fails quantization, needs custom output configs, requires a custom pipeline function, or is a hybrid architecture with both autoregressive and diffusion components.

2026-06-17

add-vlm-model

Add support for a new Vision-Language Model (VLM) to AutoRound, including multimodal block handler, calibration dataset template, and special model handling. Use when integrating a new VLM like LLaVA, Qwen2-VL, GLM-Image, Phi-Vision, or similar multi-modal models for quantization.

2026-06-17

review-pr

Software Quality Assurance Analysts and Testers

Review or prepare a pull request for the AutoRound repository — checks registration points for new data types/backends/VLMs, validates Chinese translation parity for modified markdown files, verifies quantization numerical stability (scale overflow, STE gradient flow, group_size padding), confirms test placement and fixture usage, and enforces Apache 2.0 headers and DCO sign-off. Use when performing a code review, running a PR checklist, preparing a merge request, or auditing a contribution before submit.

2026-06-10

adapt-new-llm

Adapt AutoRound to support a new LLM architecture that doesn't work out-of-the-box. Use when quantization fails for a new model type, block detection doesn't find layers, MoE models need unfusing, custom forward passes are needed, or non-standard linear layer types need handling.

2026-05-14

add-inference-backend

Add a new hardware inference backend to AutoRound for deploying quantized models (e.g., CUDA/Marlin, Triton, CPU, HPU, ARK). Use when implementing QuantLinear kernels, registering backend capabilities, or enabling quantized model inference on a new hardware platform.

2026-05-11

add-export-format

Add a new model export format to AutoRound (e.g., auto_round, auto_gptq, auto_awq, gguf, llm_compressor). Use when implementing a new quantized model serialization format, adding a new packing method, or extending export compatibility for deployment frameworks like vLLM, SGLang, or llama.cpp.

2026-04-17

add-quantization-datatype

Add a new quantization data type to AutoRound (e.g., INT, FP8, MXFP, NVFP, GGUF variants). Use when implementing a new weight/activation quantization scheme, registering a new quant function, or extending the data_type registry.

2026-04-17

#003

confidential-computing-zoo

3 skills · 35971 · updated 2026-02-11

6.8% of this creator's skills

skill

Occupation category

Description

Updated

check-td-runtime-environment

Check TD Runtime Environment

Get TDVM event log

Get TDVM Quote Information

2026-02-11

#004

intel-performance-skills

3 skills · 23030 · updated 2026-06-05

6.8% of this creator's skills

skill

Occupation category

Description

Updated

performance-patterns

Detect and fix x86/C/C++ performance patterns from source code or profiling output (perf, VTune, flamegraphs). Invoke when the user asks to optimize, review for performance, or write new SIMD/vectorized code — even without profiling data. Trigger on: serial accumulator loops, narrow SIMD (xmm/ymm that could be ymm/zmm), _mm* intrinsics, HITM/cmpxchg clusters, false sharing, missing restrict or vzeroupper, futex_wake/notify_all thundering herd, hot symbol inside a system library (.so) with a version gap, or any request to write a fast reduction, dot product, or CPU-dispatched function. Patterns: serial accumulator, TTAS spinlock, SIMD upconversion (zipper), false sharing, per-CPU stats, missing vzeroupper, missing restrict, cv-thundering-herd, mutex-to-rwlock, CPU dispatch, library version upgrade, fast CRC32C, known algorithms (Cosine Similarity, Hamming Distance, Jaccard Distance), SIMD sort (x86-simd-sort).

2026-06-05

linux-perf

Network and Computer Systems Administrators

Profile and fix Linux performance problems using `perf`. Workflows: (A) hardware counters -- IPC, cache-miss, branch mispredictions; (B) hotspot profiling -- which functions and source lines consume CPU, with SIMD and accumulator detection; (C) cache-line contention -- false sharing, HITM, `perf c2c`; (D) core-count scaling -- dual-profile comparison, bottleneck categorization; (E) structured hotspot report with annotated source and pattern observations. Resolution strategies: TTAS spinlock, SIMD upconversion, parallel accumulator, structured false-sharing fix, per-CPU stats. Trigger on: perf, profiling, profile, hotspot, hotspots, cache miss, IPC, false sharing, HITM, scaling, core count, thread scaling, bottleneck, slow code, CPU bound, why is this slow, where does time go, does not scale. When in doubt, invoke this skill -- better to use it unnecessarily than to miss a performance opportunity.

2026-05-22

phoronix-test-suite

Install, run, parse, and optimize benchmarks from the Phoronix Test Suite (PTS). Use this skill whenever the user mentions "phoronix", "pts/", or "phoronix-test-suite", or asks to run, measure, improve, or optimize a PTS test — e.g., "run pts/mt-dgemm", "optimize pts/compress-zstd", "what score does pts/x265 get". Trigger immediately on any `pts/<testname>` reference, even if the user doesn't explicitly say "phoronix". Also trigger when the user asks to find or edit the source code of a PTS test.

2026-05-19

#005

PerfSpect

1 skill · 45058 · updated 2026-04-10

skill

Occupation category

Description

Updated

functional-test

Software Quality Assurance Analysts and Testers

Use this skill when running functional tests to validate PerfSpect code changes, when the user says "run functional tests", "test my changes", "check for regressions", or when verifying a code change did not break existing functionality.

2026-04-10

#006

systemc-compiler

1 skill · 31245 · updated 2026-05-15

skill

Occupation category

Description

Updated

systemc-tools

Use when writing, editing, reviewing, or debugging synthesizable SystemC code for Intel SystemC Compiler (ICSC) and SingleSource library. Covers rules for module hierarchy, channels and ports, process declarations, reset behavior, sensitivity lists, SystemC and C++ data types and collections. Applies communication channels, memory modules, data types and utility functions from Single Source library.

2026-05-15

#007

intel-xpu-backend-for-triton

1 skill · 261106 · updated 2026-06-04

skill

Occupation category

Description

Updated

blockptr-to-tdesc

Translate a Triton kernel from the deprecated block-pointer API (tl.make_block_ptr / tl.advance / tl.load(boundary_check=...)) into an equivalent kernel using the modern device-side tensor-descriptor API (tl.make_tensor_descriptor / desc.load / desc.store) for the Intel XPU backend. Use this skill whenever the user wants to migrate, convert, translate, port, modernize, or "update" a kernel from block pointers to tensor descriptors; whenever they mention tl.make_block_ptr or tl.advance and ask for a modern/non-deprecated equivalent; whenever they ask how to use tensor descriptors in a kernel that currently uses block pointers; or when they paste a kernel using block pointers and ask how to speed it up or make it use DPAS / 2D block I/O on Intel GPU (PVC/BMG). Produce the descriptor form the XPU backend can lower efficiently, not just any form that compiles.

2026-06-04

#008

optimization-zone

1 skill · 10431 · updated 2026-07-10

skill

Occupation category

Description

Updated

vllm-xeon-cpu

Deploy, tune, validate, and benchmark vLLM on Intel Xeon CPUs (CPU-only inference, no GPU). USE FOR: serving and performance optimizing LLMs on Intel Xeon, vLLM CPU install, CPU inference tuning, AMX bfloat16 setup, NUMA pinning, VLLM_CPU_KVCACHE_SPACE, VLLM_CPU_OMP_THREADS_BIND, --dtype=bfloat16, vllm/vllm-openai-cpu Docker image, hardware validation for AMX (amx_tile, amx_bf16, amx_int8), KV cache sizing per NUMA node, --max-num-batched-tokens / --max-num-seqs tuning, vllm bench serve on CPU, TTFT/TPOT measurement. DO NOT USE FOR: GPU vLLM (use upstream vLLM docs), training, quantization tuning beyond INT8/AWQ pointers, model architecture selection (use Intel Xeon AI Performance Advisor), non-Xeon CPUs, vLLM source build deep-dives.

2026-07-10

#009

linux-kernel-oops

1 skill · 166 · updated 2026-05-01
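
The AMX hardware validation mentioned above can be sketched as a check of the CPU flags line in `/proc/cpuinfo` on Linux. The three flag names come from the description; the function `missing_amx_flags` is a hypothetical helper for illustration, not part of vLLM or of the skill itself.

```python
REQUIRED_AMX_FLAGS = ("amx_tile", "amx_bf16", "amx_int8")


def missing_amx_flags(cpuinfo_text):
    """Return the AMX flags that are absent from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        # /proc/cpuinfo lists CPU features as "flags : fpu vme ..."
        if line.startswith("flags") and ":" in line:
            present = set(line.split(":", 1)[1].split())
            return [flag for flag in REQUIRED_AMX_FLAGS if flag not in present]
    # No flags line found: treat every AMX flag as missing
    return list(REQUIRED_AMX_FLAGS)
```

On a Linux host the input would typically be `open("/proc/cpuinfo").read()`; an empty result suggests all three AMX features are reported by the kernel.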