GitHub repository

triton

triton contains 12 skills collected from facebookexperimental, with occupational coverage per repository and detail pages on the site.

Skills collected: 12
Stars: 173
Updated: 2026-06-23
Forks: 55
Occupational coverage: 4 occupational categories · 100% classified

Skills in this repository

autows-authoring
Software developers

Author Triton kernels with automatic warp specialization (AutoWS). Use when writing new AutoWS kernels, adding warp_specialize=True to tl.range loops, choosing tl.range kwargs and JIT options, debugging why WS was not applied, or structuring a kernel to work with both Meta WS and upstream OAI Triton. Covers GEMM and Flash Attention patterns on Hopper and Blackwell.

2026-06-23
barrier-visualization
Software developers

Produce a structured barrier report for AutoWS (automatic warp specialization) IR. Use when the user wants to visualize, audit, or debug barrier usage across warp-specialized partitions, or when debugging a GPU kernel hang (deadlock). For hangs, first dump IR using the ir-debugging skill, then run this barrier analysis to find the barrier that actually deadlocks -- reasoning with the mbarrier phase model (NOT raw arrive/wait counts, which give false positives), plus missing backward barriers and other synchronization issues. Covers mbarriers, named barriers, tcgen05 commit, TMA-implicit arrives, Aref-based synchronization, and producer/consumer barrier patterns.

2026-06-22
ir-override-ablation
Software developers

Design and run Triton TTGIR debugging ablations using ir_override. Use when reducing a provided or dumped TTGIR, trying user-provided or agent-generated ablation/oblation ideas, updating a test harness around ir_override, or preserving a compile/runtime failure while simplifying IR to expose a fundamental compiler or lowering gap.

2026-06-18
running-with-buck
Software developers

How to build and run GPU targets under Buck in fbcode. Use when invoking buck2 run / buck2 build for any GPU benchmark, test, or kernel — selecting the GPU architecture and CUDA version, using @mode/opt and the beta Triton modifier, passing environment variables through, and running from the right directory. Covers the general requirements plus the B200/GB200 (b200a, CUDA >= 12.8) and GB300 (b300a, CUDA >= 13.0) hardware requirements.

2026-06-16
debug-failing-gpu
Network and computer systems administrators

Recover from GPU-busy / GPU-unavailable failures. Use when a command (pytest, python, a TLX/Triton kernel run, a benchmark) fails with errors indicating the GPU is busy, out of memory, or unavailable — e.g. "CUDA error: out of memory", "all CUDA-capable devices are busy or unavailable", "CUDA-capable device(s) is/are busy or unavailable", "RuntimeError: No CUDA GPUs are available", "device-side assert", or a hang on the first CUDA call. Runs find_working_gpu.sh to locate a healthy GPU and re-runs the failed command pinned to it via CUDA_VISIBLE_DEVICES.

2026-06-15
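
The pinning step described in the debug-failing-gpu entry above can be sketched in Python. This is a minimal illustration, not the skill's actual script: `pinned_env` and `run_pinned` are hypothetical helper names, and the GPU index is assumed to come from whatever find_working_gpu.sh reports (its output format is not given in this listing).

```python
import os
import subprocess
import sys


def pinned_env(gpu_index, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set to one GPU."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_pinned(cmd, gpu_index):
    """Re-run a failed command so that it only sees the chosen GPU."""
    return subprocess.run(
        cmd, env=pinned_env(gpu_index), capture_output=True, text=True
    )


if __name__ == "__main__":
    # Hypothetical: GPU 3 was reported healthy; the child prints what it can see.
    result = run_pinned(
        [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
        gpu_index=3,
    )
    print(result.stdout.strip())
```

Copying the environment rather than mutating `os.environ` keeps the pin scoped to the re-run command, so later commands in the same session are not silently restricted to one device.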
tlx-api-reference
Software developers

TLX DSL API reference for low-level GPU primitives. Use when writing or modifying TLX kernel code that uses barriers (mbarrier, named barriers), memory allocation (local_alloc, SMEM, TMEM), TMA operations, warp specialization (async_tasks, async_task), CLC (cluster launch control), or wgmma instructions. Covers Hopper and Blackwell hardware differences.

2026-06-08
proxy-fence-insertion
Software developers

Use when working on fence-related compiler passes, TMA store lowering, proxy fence insertion, investigating missing or spurious fences, or debugging correctness issues in TLX kernels that use tlx.async_descriptor_store or MMA operations.

2026-05-22
autows-testing
Software quality assurance analysts and testers

Run autoWS (automatic warp specialization) correctness tests. Use when working on autoWS compiler code — files under WarpSpecialization/, partition scheduling, warp_specialize ops, WSCodePartition, WSDataPartition, WSTaskPartition, WSMemoryPlanner, or related passes. Do NOT use TLX correctness tests (third_party/tlx/tutorials/testing/test_correctness.py) for autoWS work — those test manual warp specialization via TLX, not the automatic compiler pipeline.

2026-05-21
autows-docs
Software developers

Consult and maintain AutoWS documentation. Use BEFORE exploring AutoWS source code — when investigating, planning, or modifying files under WarpSpecialization/, partition scheduling, warp_specialize ops, WSCodePartition, WSDataPartition, WSTaskPartition, WSMemoryPlanner, or related passes. Also use AFTER making non-trivial changes to AutoWS code to keep docs in sync.

2026-04-25
tma-illegal-instruction
Software developers

Diagnose CUDA "illegal instruction" errors and kernel crashes in Triton kernels where the reported source line is a TMA load or store (`make_tensor_descriptor`, `TensorDescriptor`, `descriptor.load`, `descriptor.store`, `tl.async_descriptor_load`, async TMA copies). Use when the user reports CUDA error 716, "an illegal instruction was encountered", a segfault inside a TMA op, a kernel hang followed by an illegal instruction trap, or a crash that only fires on the first or last tile of a launch. Covers the pattern where a TMA store/load is issued at an offset entirely past a tensor's shape: TMA does NOT silently mask out-of-bounds tile accesses; it traps. The root cause is almost never "missing in-kernel mask"; it is commonly a structural launcher / tile-mapping bug.

2026-04-23
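
The tile-mapping failure described in the tma-illegal-instruction entry above can be illustrated with plain arithmetic. The sketch below is not taken from the skill: `out_of_bounds_tiles` is a hypothetical helper, and it assumes a row-major mapping from program id to tile (`pid // tiles_n`, `pid % tiles_n`), which a real kernel may not use. It lists programs whose tile would start entirely past the tensor's shape, for example when the launcher over-provisions the grid.

```python
def out_of_bounds_tiles(m, n, block_m, block_n, num_programs):
    """Return (pid, off_m, off_n) for every program whose tile starts past the tensor."""
    tiles_n = -(-n // block_n)  # ceil(n / block_n)
    bad = []
    for pid in range(num_programs):
        # Assumed row-major pid -> tile mapping.
        off_m = (pid // tiles_n) * block_m
        off_n = (pid % tiles_n) * block_n
        if off_m >= m or off_n >= n:
            bad.append((pid, off_m, off_n))
    return bad


# A 300x200 tensor with 128x128 tiles needs ceil(300/128) * ceil(200/128) = 3 * 2 = 6 programs.
print(out_of_bounds_tiles(300, 200, 128, 128, 6))  # []
# Launching 8 programs leaves two tiles starting at row 384, past the 300 rows.
print(out_of_bounds_tiles(300, 200, 128, 128, 8))  # [(6, 384, 0), (7, 384, 128)]
```

A check like this on the host side points at the launcher rather than the kernel body, which matches the entry's claim that the fix is usually structural.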
ir-debugging
Computer programmers

Debug Triton compilation by dumping IR at each stage (TTIR, TTGIR, LLVM, PTX). Use when investigating compilation failures, kernel performance, register spills, or when user asks to inspect IR output. Covers TRITON_KERNEL_DUMP, MLIR_ENABLE_DUMP, LLVM_IR_ENABLE_DUMP, TRITON_DUMP_PTXAS_LOG, and related env vars.

2026-02-12
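
As a rough illustration of the ir-debugging entry above, the dump variables it names can be collected into an environment for a child process. This is a sketch under assumptions: `dump_env` is a hypothetical helper, and setting each variable to "1" is assumed to be the enabling value; the skill itself may document other values or additional variables.

```python
import os

# Dump-related variables named in the ir-debugging description.
DUMP_VARS = (
    "TRITON_KERNEL_DUMP",
    "MLIR_ENABLE_DUMP",
    "LLVM_IR_ENABLE_DUMP",
    "TRITON_DUMP_PTXAS_LOG",
)


def dump_env(enabled=DUMP_VARS, base_env=None):
    """Return a copy of the environment with the selected dump variables enabled."""
    env = dict(os.environ if base_env is None else base_env)
    for name in enabled:
        if name not in DUMP_VARS:
            raise ValueError(f"unknown dump variable: {name}")
        env[name] = "1"  # assumption: "1" enables the dump
    return env


# Show which variables would be set when only MLIR pass dumps are requested.
env = dump_env(enabled=("MLIR_ENABLE_DUMP",), base_env={})
print(sorted(env))  # ['MLIR_ENABLE_DUMP']
```

The resulting dictionary can be passed as `env=` to `subprocess.run` so the dumps apply to a single compilation run instead of the whole shell session.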
kernel-perf-testing
Computer programmers

Run TLX kernel performance benchmarks on Hopper and Blackwell GPUs. Use when user asks to benchmark, profile, or measure performance of any TLX kernel (GEMM, Flash Attention variants). Handles GPU selection, denoise wrapping, and version flags. Never run unless explicitly asked.

2026-02-12