
triton

triton contains 12 skills collected from facebookexperimental, with per-repository occupational coverage and detail pages within the site.

Skills collected: 12
Stars: 173
Updated: 2026-06-23
Forks: 55
Occupational coverage: 2 occupational categories · 100% classified

Skills in this repository

autows-authoring
Software Developers

Author Triton kernels with automatic warp specialization (AutoWS). Use when writing new AutoWS kernels, adding warp_specialize=True to tl.range loops, choosing tl.range kwargs and JIT options, debugging why WS was not applied, or structuring a kernel to work with both Meta WS and upstream OAI Triton. Covers GEMM and Flash Attention patterns on Hopper and Blackwell.

2026-06-23
barrier-visualization
Software Developers

Produce a structured barrier report for AutoWS (automatic warp specialization) IR. Use when the user wants to visualize, audit, or debug barrier usage across warp-specialized partitions, or when debugging a GPU kernel hang (deadlock). For hangs, first dump IR using the ir-debugging skill, then run this barrier analysis to find the barrier that actually deadlocks -- reasoning with the mbarrier phase model (NOT raw arrive/wait counts, which give false positives), plus missing backward barriers and other synchronization issues. Covers mbarriers, named barriers, tcgen05 commit, TMA-implicit arrives, Aref-based synchronization, and producer/consumer barrier patterns.

2026-06-22
ir-override-ablation
Software Developers

Design and run Triton TTGIR debugging ablations using ir_override. Use when reducing a provided or dumped TTGIR, trying user-provided or agent-generated ablation/oblation ideas, updating a test harness around ir_override, or preserving a compile/runtime failure while simplifying IR to expose a fundamental compiler or lowering gap.

2026-06-18
running-with-buck
Software Developers

How to build and run GPU targets under Buck in fbcode. Use when invoking buck2 run / buck2 build for any GPU benchmark, test, or kernel — selecting the GPU architecture and CUDA version, using @mode/opt and the beta Triton modifier, passing environment variables through, and running from the right directory. Covers the general requirements plus the B200/GB200 (b200a, CUDA >= 12.8) and GB300 (b300a, CUDA >= 13.0) hardware requirements.

2026-06-16
debug-failing-gpu
Software Developers

Recover from GPU-busy / GPU-unavailable failures. Use when a command (pytest, python, a TLX/Triton kernel run, a benchmark) fails with errors indicating the GPU is busy, out of memory, or unavailable — e.g. "CUDA error: out of memory", "all CUDA-capable devices are busy or unavailable", "CUDA-capable device(s) is/are busy or unavailable", "RuntimeError: No CUDA GPUs are available", "device-side assert", or a hang on the first CUDA call. Runs find_working_gpu.sh to locate a healthy GPU and re-runs the failed command pinned to it via CUDA_VISIBLE_DEVICES.

2026-06-15
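
The pinning step described for debug-failing-gpu can be sketched in a few lines. This is a minimal illustration, not the skill's actual script: the behavior of find_working_gpu.sh is not shown here, `run_pinned` is a hypothetical helper, and the GPU index 3 is an arbitrary stand-in for whatever index a health check would report.

```python
import os
import subprocess
import sys


def run_pinned(cmd, gpu_index, base_env=None):
    """Re-run `cmd` with a single GPU exposed through CUDA_VISIBLE_DEVICES.

    `run_pinned` is a hypothetical helper name used only for this sketch.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Stand-in for the failed command: a child process that only reports which
# device index it was pinned to. No GPU is needed to run this sketch.
result = run_pinned(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
    gpu_index=3,  # assumed to be the index a health check reported as free
)
print(result.stdout.strip())
```

Setting the variable in the child's environment, rather than in the parent process, keeps the pin scoped to the single re-run.
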
tlx-api-reference
Software Developers

TLX DSL API reference for low-level GPU primitives. Use when writing or modifying TLX kernel code that uses barriers (mbarrier, named barriers), memory allocation (local_alloc, SMEM, TMEM), TMA operations, warp specialization (async_tasks, async_task), CLC (cluster launch control), or wgmma instructions. Covers Hopper and Blackwell hardware differences.

2026-06-08
proxy-fence-insertion
Software Developers

Use when working on fence-related compiler passes, TMA store lowering, proxy fence insertion, investigating missing or spurious fences, or debugging correctness issues in TLX kernels that use tlx.async_descriptor_store or MMA operations.

2026-05-22
autows-testing
Software Quality Assurance Analysts and Testers

Run autoWS (automatic warp specialization) correctness tests. Use when working on autoWS compiler code — files under WarpSpecialization/, partition scheduling, warp_specialize ops, WSCodePartition, WSDataPartition, WSTaskPartition, WSMemoryPlanner, or related passes. Do NOT use TLX correctness tests (third_party/tlx/tutorials/testing/test_correctness.py) for autoWS work — those test manual warp specialization via TLX, not the automatic compiler pipeline.

2026-05-21
autows-docs
Software Developers

Consult and maintain AutoWS documentation. Use BEFORE exploring AutoWS source code — when investigating, planning, or modifying files under WarpSpecialization/, partition scheduling, warp_specialize ops, WSCodePartition, WSDataPartition, WSTaskPartition, WSMemoryPlanner, or related passes. Also use AFTER making non-trivial changes to AutoWS code to keep docs in sync.

2026-04-25
tma-illegal-instruction
Software Developers

Diagnose CUDA "illegal instruction" / kernel crashes in Triton kernels that point to TMA loads or stores (`make_tensor_descriptor`, `TensorDescriptor`, `descriptor.load`, `descriptor.store`, `tl.async_descriptor_load`, async TMA copies) as the source code line. Use when the user reports CUDA error 716, "an illegal instruction was encountered", a segfault inside a TMA op, a kernel hang followed by an illegal instruction trap, or a crash that only fires on the first or last tile of a launch. Covers the pattern where a TMA store/load is issued at an offset entirely past a tensor's shape: TMA does NOT silently mask out-of-bounds tile accesses; it traps. The root cause is almost never "missing in-kernel mask"; it is commonly a structural launcher / tile-mapping bug.

2026-04-23
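
The launcher / tile-mapping failure mode named in tma-illegal-instruction can be checked on the host before a kernel is ever launched. The sketch below is an illustration under assumptions: `cdiv`, `out_of_range_tiles`, the 2D row-major tile mapping, and the shapes are all hypothetical and not taken from the skill itself. It flags any program ID whose tile starts entirely past the tensor's shape.

```python
def cdiv(a, b):
    """Ceiling division, as typically used to size a launch grid."""
    return (a + b - 1) // b


def out_of_range_tiles(shape, block, grid):
    """Return (pid_m, pid_n) pairs whose tile starts entirely past `shape`.

    Hypothetical helper: assumes each program handles one BLOCK_M x BLOCK_N
    tile starting at (pid_m * BLOCK_M, pid_n * BLOCK_N).
    """
    bad = []
    for pid_m in range(grid[0]):
        for pid_n in range(grid[1]):
            off_m = pid_m * block[0]
            off_n = pid_n * block[1]
            if off_m >= shape[0] or off_n >= shape[1]:
                bad.append((pid_m, pid_n))
    return bad


M, N = 300, 200
BLOCK_M, BLOCK_N = 128, 64

# A grid sized with ceiling division never starts a tile past the tensor.
good_grid = (cdiv(M, BLOCK_M), cdiv(N, BLOCK_N))
# A launcher bug that adds one extra row of tiles does.
buggy_grid = (cdiv(M, BLOCK_M) + 1, cdiv(N, BLOCK_N))

print(out_of_range_tiles((M, N), (BLOCK_M, BLOCK_N), good_grid))
print(out_of_range_tiles((M, N), (BLOCK_M, BLOCK_N), buggy_grid))
```

A check like this targets the structural cause the description points at, rather than adding a mask inside the kernel.
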
ir-debugging
Software Developers

Debug Triton compilation by dumping IR at each stage (TTIR, TTGIR, LLVM, PTX). Use when investigating compilation failures, kernel performance, register spills, or when user asks to inspect IR output. Covers TRITON_KERNEL_DUMP, MLIR_ENABLE_DUMP, LLVM_IR_ENABLE_DUMP, TRITON_DUMP_PTXAS_LOG, and related env vars.

2026-02-12
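
As a rough sketch of how the environment variables listed for ir-debugging might be switched on together, the snippet below builds an environment mapping that could be passed to a child process. The variable names come from the description above; the value "1" for each of them and the `dump_env` helper are assumptions for illustration, and the skill itself may recommend different values or additional variables.

```python
import os

# Variable names as listed in the ir-debugging description.
DUMP_VARS = (
    "TRITON_KERNEL_DUMP",
    "MLIR_ENABLE_DUMP",
    "LLVM_IR_ENABLE_DUMP",
    "TRITON_DUMP_PTXAS_LOG",
)


def dump_env(names=DUMP_VARS, base_env=None):
    """Return a copy of the environment with the given dump variables set.

    Hypothetical helper; "1" is an assumed enabling value for every variable.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name in names:
        env[name] = "1"
    return env


env = dump_env()
for name in DUMP_VARS:
    print(f"{name}={env[name]}")
```

The resulting mapping can be handed to `subprocess.run(..., env=env)` so the dump settings apply only to the run being investigated.
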
kernel-perf-testing
Software Developers

Run TLX kernel performance benchmarks on Hopper and Blackwell GPUs. Use when user asks to benchmark, profile, or measure performance of any TLX kernel (GEMM, Flash Attention variants). Handles GPU selection, denoise wrapping, and version flags. Never run unless explicitly asked.

2026-02-12