facebookexperimental Agent Skills

skill

Job category

Description

Updated

Run the sched2tlx perf/correctness harness over the modulo-scheduling example corpus (case1-8: GEMM, persistent GEMM, FA fwd/bwd, addmm+bias, LayerNorm, wgrad+bias, multiphase GEMM). Use when the user asks to benchmark generated-vs-handwritten kernels, check corpus correctness, compare emitter revisions, or regenerate schedule_graph.json fixtures. Never run perf unless explicitly asked.

2026-07-16

barrier-visualization

Software Development Engineer

Produce a structured barrier report for AutoWS (automatic warp specialization) IR. Use when the user wants to visualize, audit, or debug barrier usage across warp-specialized partitions, or when debugging a GPU kernel hang (deadlock). For hangs, first dump IR using the ir-debugging skill, then run this barrier analysis to find the barrier that actually deadlocks -- reasoning with the mbarrier phase model (NOT raw arrive/wait counts, which give false positives), plus missing backward barriers and other synchronization issues. Covers mbarriers, named barriers, tcgen05 commit, TMA-implicit arrives, Aref-based synchronization, and producer/consumer barrier patterns.

2026-07-01

compute-sanitizer

Software Development Engineer

Run NVIDIA compute-sanitizer (memcheck, racecheck, initcheck, synccheck) against a Triton/TLX kernel to find runtime memory and synchronization bugs. Use when a kernel produces wrong results, crashes with an illegal/misaligned access, or is suspected of a shared-memory data race or invalid barrier usage — especially warp-specialized (WS) kernels using mbarriers, named barriers, TMA copies, or MMA accumulators. This is a runtime check: it runs the real kernel via its reproduce command, so it needs a working GPU and is 10-100x slower than a normal run.

2026-06-26

kernel-perf-testing

Software Development Engineer

Run TLX kernel performance benchmarks on Hopper, Blackwell, and AMD (gfx950/CDNA4, gfx1250) GPUs. Use when the user asks to benchmark, profile, or measure performance of any TLX kernel (GEMM, Flash Attention, addmm+GLU, IKBO variants). Handles GPU selection, denoise wrapping (NVIDIA only), and version flags. Never run unless explicitly asked.

2026-06-24

tlx-amd-testing

Software Development Engineer

Test and run TLX-AMD tutorial kernels (gfx950/CDNA4 and gfx1250) and understand their CI. Use when working on AMD TLX tutorial kernels — GEMM (warp-pipeline, LDS-pipelined, TDM, MXFP), Flash Attention (simple, prefetch, persistent), addmm+GLU, or IKBO (FA, LCE) — running their correctness or perf, checking arch gating (gfx950 vs gfx1250), or the MI350 CI workflow. Covers the standardized layout (one correctness file, one perf file per op×arch).

2026-06-24

autows-authoring

Software Development Engineer

Author Triton kernels with automatic warp specialization (AutoWS). Use when writing new AutoWS kernels, adding warp_specialize=True to tl.range loops, choosing tl.range kwargs and JIT options, debugging why WS was not applied, or structuring a kernel to work with both Meta WS and upstream OAI Triton. Covers GEMM and Flash Attention patterns on Hopper and Blackwell.

2026-06-23

ir-override-ablation

Software Development Engineer

Design and run Triton TTGIR debugging ablations using ir_override. Use when reducing a provided or dumped TTGIR, trying user-provided or agent-generated ablation/oblation ideas, updating a test harness around ir_override, or preserving a compile/runtime failure while simplifying IR to expose a fundamental compiler or lowering gap.

2026-06-18

running-with-buck

Software Development Engineer

How to build and run GPU targets under Buck in fbcode. Use when invoking buck2 run / buck2 build for any GPU benchmark, test, or kernel — selecting the GPU architecture and CUDA version, using @mode/opt and the beta Triton modifier, passing environment variables through, and running from the right directory. Covers the general requirements plus the B200/GB200 (b200a, CUDA >= 12.8) and GB300 (b300a, CUDA >= 13.0) hardware requirements.

2026-06-16

Currently showing the top 8 of 15 collected skills for this repository.

facebookexperimental

Which repositories the skills are distributed across

Repositories and representative skills