name	KernelWiki
description	Use when the user asks about optimizing NVIDIA Blackwell (SM100, B200) or Hopper (SM90, H100) GPU kernels — tcgen05/TMEM/CLC/NVFP4/2-SM cooperative, warp specialization, FlashAttention-4, DeepGEMM, FlashMLA, MoE, grouped GEMM, CuTe-DSL/PTX/Triton on Blackwell, or wants concrete PR references from CUTLASS/SGLang/vLLM/FlashInfer/PyTorch. Do NOT use for generic CUDA Q&A that is not Blackwell/Hopper-specific, host-side framework integration, or distributed systems (DeepEP/EPLB/DualPipe).
argument-hint	[natural-language-question] \| [--tag foo --type kernel] \| [page-id]
allowed-tools	Bash Read Grep Glob

KernelWiki — Blackwell & Hopper Kernel Optimization Wiki

Knowledge cutoff: 2026-04-27. All upstream PR data, blog summaries, and version-claim entries reflect upstream state on or before this date (per data/refresh-cutoff.yaml). Re-run the refresh tooling to advance the cutoff.

Query a structured, cross-referenced knowledge base of GPU kernel optimization for NVIDIA Blackwell (SM100) and Hopper (SM90) — 2179 merged PRs, 48 wiki synthesis pages, 7 competitions, 20 blogs, 11 doc summaries.

When To Use This Skill

Trigger this skill when the user asks about:

Blackwell/SM100 kernel programming — tcgen05.mma, TMEM, CLC, 2-SM cooperative, NVFP4, FP8/FP4 block scaling, PDL/GDC
Kernel implementations — FlashAttention-4, DeepGEMM, FlashMLA, NSA, GatedDeltaNet, NVFP4 GEMM/GEMV, fused MoE, gated dual GEMM
Performance patterns — low SM utilization, memory-bound, register pressure, compute-bound, tail effects, pipeline stalls
DSLs for Blackwell — CuTe DSL, CUDA C++ with PTX inline, Triton on Blackwell
Hopper → Blackwell migration — wgmma → tcgen05, register → TMEM accumulators
PR references — "how did vLLM/SGLang/FlashInfer/CUTLASS/PyTorch implement X for SM100?"
Competition solutions — GPU Mode NVFP4 hackathon, FlashInfer MLSys 2026 submissions

Do NOT use this skill for:

Generic CUDA questions unrelated to Blackwell/Hopper tensor cores
Host-side framework integration (model loading, request routing, scheduling policy)
Distributed systems topics — DeepEP, EPLB, DualPipe are out of scope

How To Query

All commands below run from the skill directory (the clone root — the directory this SKILL.md lives in). The scripts auto-resolve the wiki root; no environment variable required.

Path 1: Unified search (preferred for natural language)

python3 scripts/query.py "how to fuse gate-up dual GEMM on Blackwell"
python3 scripts/query.py --tag nvfp4 --type kernel
python3 scripts/query.py --repo cutlass --limit 20
python3 scripts/query.py --symptom tail-effect --compact

Filters: --type, --tag, --repo, --language, --architecture, --symptom, --confidence, --limit, --compact, --paths-only. --tag and --architecture accept aliases — --tag UMMA matches tcgen05, --architecture B200 matches sm100, etc.

Path 2: Fetch a specific page by id or path

python3 scripts/get_page.py kernel-flash-attention-4
python3 scripts/get_page.py pr-cutlass-2472
python3 scripts/get_page.py kernel-flash-attention-4 --follow-sources
python3 scripts/get_page.py kernel-flash-attention-4 --body-only

Path 3: Regex text search across wiki bodies and PR pages

python3 scripts/grep_wiki.py "tcgen05\\.fence"
python3 scripts/grep_wiki.py "2-CTA backward" --only wiki
python3 scripts/grep_wiki.py "nvfp4" "block_scale" --any

Path 4: Pre-built cross-reference indices

Auto-generated under queries/:

queries/by-problem.md — symptom → pattern page → candidate techniques
queries/by-technique.md — 15 techniques with architectures, confidence, reproducibility, source count
queries/by-hardware-feature.md — tcgen05/tmem/clc/tma/nvfp4/etc. → related wiki + PR pages
queries/by-kernel-type.md — gemm/attention/moe/mla/gated-delta-net → pages
queries/by-language.md — cute-dsl/cuda-cpp/ptx/triton → guide page + related kernels/sources
queries/by-repo.md — all 2179 PRs across cutlass/sglang/vllm/flashinfer/pytorch/DeepGEMM

Path 5: Primer, schema, examples

Companion docs under references/:

references/primer.md — topic map: hardware features, techniques, symptoms, canonical page IDs. Read this first when the question is broad.
references/schema.md — condensed frontmatter schema, confidence rules, reproducibility ladder, controlled vocabulary, canonical aliases.
references/examples.md — 10 worked query patterns mapping user questions → command sequences → synthesis.

Output Pattern

When answering from this KB:

Cite specific pages with paths (e.g., wiki/kernels/flash-attention-4.md) and IDs (kernel-flash-attention-4).
Follow sources: fields to trace claims back to PRs/blogs/docs.
Respect confidence levels — verified > source-reported > inferred > experimental. Call out when a claim is experimental or inferred.
Include code snippets from wiki pages when they exist — technique/kernel/language pages are guaranteed snippet-reproducibility (validator-enforced).
Report performance claims with all six fields — gpu, dtype, shape, metric, value, source_id.

Knowledge Base Contents (knowledge cutoff: 2026-04-27)

2265 total markdown pages — 2179 PR references + 48 wiki synthesis + 20 blogs + 11 docs + 7 contests
6 candidate ledgers in candidates/ — 4,222 merged PRs classified (include/defer/exclude) Jan 2025 – Apr 2026
89 verbatim/extracted/derived asset bundles in artifacts/ (PR diffs, kernel files, blog code) — pinned to upstream SHAs via PROVENANCE.yaml
6 auto-generated query indices in queries/
Controlled vocabulary (80+ tags) in data/tags.yaml, alias map in data/aliases.yaml
Hybrid version-claim registry — per-page version_sensitive: <id> pointers + data/version-claims.yaml central registry, validated for bidirectional consistency
Validator scripts/validate.py — 2265 files / 89 bundles / 6 ledgers / 0 errors
Blackwell-first — SM90 pages only appear when they carry explicit blackwell_relevance

The knowledge cutoff date is the last day on which upstream PRs / blog snapshots were refreshed. To advance it: run scripts/refresh_candidate_ledger.py, regenerate PR pages, then bump data/refresh-cutoff.yaml::cutoff_date.

Quality Guarantees

Every verified page has official-doc + upstream-code evidence
Every technique/kernel/language page has a compilable snippet
Every PR page has inclusion_reason and status: merged
All Hopper-inclusive pages have explicit blackwell_relevance

name	KernelWiki
description	Use when the user asks about optimizing NVIDIA Blackwell (SM100, B200) or Hopper (SM90, H100) GPU kernels — tcgen05/TMEM/CLC/NVFP4/2-SM cooperative, warp specialization, FlashAttention-4, DeepGEMM, FlashMLA, MoE, grouped GEMM, CuTe-DSL/PTX/Triton on Blackwell, or wants concrete PR references from CUTLASS/SGLang/vLLM/FlashInfer/PyTorch. Do NOT use for generic CUDA Q&A that is not Blackwell/Hopper-specific, host-side framework integration, or distributed systems (DeepEP/EPLB/DualPipe).
argument-hint	[natural-language-question] \| [--tag foo --type kernel] \| [page-id]
allowed-tools	Bash Read Grep Glob

KernelWiki — Blackwell & Hopper Kernel Optimization Wiki

Knowledge cutoff: 2026-04-27. All upstream PR data, blog summaries, and version-claim entries reflect upstream state on or before this date (per data/refresh-cutoff.yaml). Re-run the refresh tooling to advance the cutoff.

When To Use This Skill

Trigger this skill when the user asks about:

Blackwell/SM100 kernel programming — tcgen05.mma, TMEM, CLC, 2-SM cooperative, NVFP4, FP8/FP4 block scaling, PDL/GDC
Kernel implementations — FlashAttention-4, DeepGEMM, FlashMLA, NSA, GatedDeltaNet, NVFP4 GEMM/GEMV, fused MoE, gated dual GEMM
Performance patterns — low SM utilization, memory-bound, register pressure, compute-bound, tail effects, pipeline stalls
DSLs for Blackwell — CuTe DSL, CUDA C++ with PTX inline, Triton on Blackwell
Hopper → Blackwell migration — wgmma → tcgen05, register → TMEM accumulators
PR references — "how did vLLM/SGLang/FlashInfer/CUTLASS/PyTorch implement X for SM100?"
Competition solutions — GPU Mode NVFP4 hackathon, FlashInfer MLSys 2026 submissions

Do NOT use this skill for:

Generic CUDA questions unrelated to Blackwell/Hopper tensor cores
Host-side framework integration (model loading, request routing, scheduling policy)
Distributed systems topics — DeepEP, EPLB, DualPipe are out of scope

How To Query

All commands below run from the skill directory (the clone root — the directory this SKILL.md lives in). The scripts auto-resolve the wiki root; no environment variable required.

Path 1: Unified search (preferred for natural language)

python3 scripts/query.py "how to fuse gate-up dual GEMM on Blackwell"
python3 scripts/query.py --tag nvfp4 --type kernel
python3 scripts/query.py --repo cutlass --limit 20
python3 scripts/query.py --symptom tail-effect --compact

Path 2: Fetch a specific page by id or path

python3 scripts/get_page.py kernel-flash-attention-4
python3 scripts/get_page.py pr-cutlass-2472
python3 scripts/get_page.py kernel-flash-attention-4 --follow-sources
python3 scripts/get_page.py kernel-flash-attention-4 --body-only

Path 3: Regex text search across wiki bodies and PR pages

python3 scripts/grep_wiki.py "tcgen05\\.fence"
python3 scripts/grep_wiki.py "2-CTA backward" --only wiki
python3 scripts/grep_wiki.py "nvfp4" "block_scale" --any

Path 4: Pre-built cross-reference indices

Auto-generated under queries/:

queries/by-problem.md — symptom → pattern page → candidate techniques
queries/by-technique.md — 15 techniques with architectures, confidence, reproducibility, source count
queries/by-hardware-feature.md — tcgen05/tmem/clc/tma/nvfp4/etc. → related wiki + PR pages
queries/by-kernel-type.md — gemm/attention/moe/mla/gated-delta-net → pages
queries/by-language.md — cute-dsl/cuda-cpp/ptx/triton → guide page + related kernels/sources
queries/by-repo.md — all 2179 PRs across cutlass/sglang/vllm/flashinfer/pytorch/DeepGEMM

Path 5: Primer, schema, examples

Companion docs under references/:

references/primer.md — topic map: hardware features, techniques, symptoms, canonical page IDs. Read this first when the question is broad.
references/schema.md — condensed frontmatter schema, confidence rules, reproducibility ladder, controlled vocabulary, canonical aliases.
references/examples.md — 10 worked query patterns mapping user questions → command sequences → synthesis.

Output Pattern

When answering from this KB:

Cite specific pages with paths (e.g., wiki/kernels/flash-attention-4.md) and IDs (kernel-flash-attention-4).
Follow sources: fields to trace claims back to PRs/blogs/docs.
Respect confidence levels — verified > source-reported > inferred > experimental. Call out when a claim is experimental or inferred.
Include code snippets from wiki pages when they exist — technique/kernel/language pages are guaranteed snippet-reproducibility (validator-enforced).
Report performance claims with all six fields — gpu, dtype, shape, metric, value, source_id.

Knowledge Base Contents (knowledge cutoff: 2026-04-27)

2265 total markdown pages — 2179 PR references + 48 wiki synthesis + 20 blogs + 11 docs + 7 contests
6 candidate ledgers in candidates/ — 4,222 merged PRs classified (include/defer/exclude) Jan 2025 – Apr 2026
89 verbatim/extracted/derived asset bundles in artifacts/ (PR diffs, kernel files, blog code) — pinned to upstream SHAs via PROVENANCE.yaml
6 auto-generated query indices in queries/
Controlled vocabulary (80+ tags) in data/tags.yaml, alias map in data/aliases.yaml
Hybrid version-claim registry — per-page version_sensitive: <id> pointers + data/version-claims.yaml central registry, validated for bidirectional consistency
Validator scripts/validate.py — 2265 files / 89 bundles / 6 ledgers / 0 errors
Blackwell-first — SM90 pages only appear when they carry explicit blackwell_relevance

Quality Guarantees

Every verified page has official-doc + upstream-code evidence
Every technique/kernel/language page has a compilable snippet
Every PR page has inclusion_reason and status: merged
All Hopper-inclusive pages have explicit blackwell_relevance

kernelwiki

KernelWiki — Blackwell & Hopper Kernel Optimization Wiki

When To Use This Skill

How To Query

Path 1: Unified search (preferred for natural language)

Path 2: Fetch a specific page by id or path

Path 3: Regex text search across wiki bodies and PR pages

Path 4: Pre-built cross-reference indices

Path 5: Primer, schema, examples

Output Pattern

Knowledge Base Contents (knowledge cutoff: 2026-04-27)

Quality Guarantees

KernelWiki — Blackwell & Hopper Kernel Optimization Wiki

When To Use This Skill

How To Query

Path 1: Unified search (preferred for natural language)

Path 2: Fetch a specific page by id or path

Path 3: Regex text search across wiki bodies and PR pages

Path 4: Pre-built cross-reference indices

Path 5: Primer, schema, examples

Output Pattern

Knowledge Base Contents (knowledge cutoff: 2026-04-27)

Quality Guarantees