debug-failing-gpu

Stars: 173

Forks: 55

Updated: June 15, 2026, 15:05

Recover from GPU-busy / GPU-unavailable failures. Use when a command (pytest, python, a TLX/Triton kernel run, a benchmark) fails with errors indicating the GPU is busy, out of memory, or unavailable — e.g. "CUDA error: out of memory", "all CUDA-capable devices are busy or unavailable", "CUDA-capable device(s) is/are busy or unavailable", "RuntimeError: No CUDA GPUs are available", "device-side assert", or a hang on the first CUDA call. Runs find_working_gpu.sh to locate a healthy GPU and re-runs the failed command pinned to it via CUDA_VISIBLE_DEVICES.

Source

facebookexperimental

facebookexperimental/triton

Debug Failing GPU

A command failed because the GPU it landed on is busy, out of memory, or in a bad state. Find a GPU that actually works and re-run the command pinned to it.

When to trigger

Any failure whose root cause is the device, not the code. Common signatures:

- CUDA error: out of memory / torch.cuda.OutOfMemoryError
- all CUDA-capable devices are busy or unavailable
- CUDA-capable device(s) is/are busy or unavailable
- RuntimeError: No CUDA GPUs are available
- CUDA error: device-side assert triggered
- A kernel/test that hangs on the first CUDA call

Do not use this for kernel logic bugs, compilation errors, or numerical mismatches — those are not device-health problems.
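
The signature list above can be turned into a quick check on captured output. This is a minimal sketch, not something shipped in the repository: is_device_failure is a hypothetical helper name, and it does plain fixed-string matching only, so it cannot detect the hang case.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): reads a log on stdin and
# returns 0 if it contains one of the device-failure signatures listed
# above, 1 otherwise. A hang produces no log line, so it is not covered.
is_device_failure() {
  grep -qF \
    -e 'CUDA error: out of memory' \
    -e 'torch.cuda.OutOfMemoryError' \
    -e 'busy or unavailable' \
    -e 'No CUDA GPUs are available' \
    -e 'device-side assert triggered'
}

# Usage (commented out so the sketch has no side effects):
#   pytest path/to/test.py 2>&1 | tee run.log
#   if is_device_failure < run.log; then echo "device problem, rescan GPUs"; fi
```

A match only says the error text looks device-related; a numerical mismatch or a compile error will not match and should be debugged as a code problem.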

Procedure

1. Run the scanner:

   bash third_party/tlx/find_working_gpu.sh

2. Read the final line, WORKING_GPUS=... (e.g. WORKING_GPUS=0,2,3). These are physical GPU indices.
3. Pick the first working index. If several are free, mention them so the user can parallelize across GPUs.
4. Re-issue the original failing command with CUDA_VISIBLE_DEVICES=<idx>:
   - If the command had no CUDA_VISIBLE_DEVICES, prepend one.
   - If it already set CUDA_VISIBLE_DEVICES, replace that value; do not stack two assignments.
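
The parsing step in the procedure above can be sketched as a small filter. This assumes the scanner writes its WORKING_GPUS= line to stdout in exactly the form shown; first_working_gpu is a hypothetical name introduced here for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads scanner output on stdin and prints the first
# index from the last WORKING_GPUS= line. Returns 1 and prints nothing if
# the line is missing or the list is empty.
first_working_gpu() {
  local last list
  last=$(grep '^WORKING_GPUS=' | tail -n 1) || true
  list=${last#WORKING_GPUS=}
  [ -n "$list" ] || return 1
  # Keep everything before the first comma: "0,2,3" -> "0", "2" -> "2".
  printf '%s\n' "${list%%,*}"
}

# Usage (commented out; requires the repo script and real GPUs):
#   idx=$(bash third_party/tlx/find_working_gpu.sh | first_working_gpu) &&
#     CUDA_VISIBLE_DEVICES=$idx pytest third_party/tlx/tutorials/testing/test_correctness.py
```

Taking the last matching line rather than the literal last line of output keeps the filter working if a trailing blank line follows the summary.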

If no GPU works

If the final line is WORKING_GPUS= with nothing after it, no GPU passed the scan. The GPUs may be held by your own stuck processes:

1. Clear them: third_party/tlx/killgpu.sh
2. Re-run bash third_party/tlx/find_working_gpu.sh.
3. If the list is still empty, all GPUs are occupied by other users. Report that and stop; there is nothing to switch to.
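
The fallback above (scan, clear your own processes, scan again, then give up) might be scripted roughly as follows. SCAN_CMD, KILL_CMD, and both function names are hypothetical; the override variables exist only so the flow can be exercised without real GPUs, and the sketch assumes the scanner prints its summary line to stdout.

```shell
#!/usr/bin/env bash
# Hypothetical override hooks; the defaults are the repo scripts named above.
SCAN_CMD=${SCAN_CMD:-"bash third_party/tlx/find_working_gpu.sh"}
KILL_CMD=${KILL_CMD:-"third_party/tlx/killgpu.sh"}

# Prints the comma-separated list from the last WORKING_GPUS= line (may be empty).
scan_working_gpus() {
  $SCAN_CMD 2>/dev/null | grep '^WORKING_GPUS=' | tail -n 1 | cut -d= -f2
}

# Prints the working list and returns 0, or returns 1 if nothing is usable
# even after clearing our own stuck processes.
find_gpu_or_give_up() {
  local gpus
  gpus=$(scan_working_gpus) || true
  if [ -z "$gpus" ]; then
    $KILL_CMD >/dev/null 2>&1 || true   # step 1: clear our own processes
    gpus=$(scan_working_gpus) || true   # step 2: re-scan
  fi
  if [ -z "$gpus" ]; then
    # step 3: report and stop
    echo "No working GPU: all devices appear to be held by other users." >&2
    return 1
  fi
  printf '%s\n' "$gpus"
}
```

The kill step runs at most once, which matches the procedure: a second empty scan means the remaining load belongs to someone else, and retrying would not help.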

Command-rewrite examples

# No device set -> prepend
pytest third_party/tlx/tutorials/testing/test_correctness.py
# becomes
CUDA_VISIBLE_DEVICES=2 pytest third_party/tlx/tutorials/testing/test_correctness.py

# Device already set -> replace, don't stack
CUDA_VISIBLE_DEVICES=4 third_party/tlx/denoise.sh python bench.py
# becomes
CUDA_VISIBLE_DEVICES=2 third_party/tlx/denoise.sh python bench.py
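
The two rewrites above follow one rule: drop any leading CUDA_VISIBLE_DEVICES assignment, then prepend the new one. A minimal sketch of that rule, with pin_command as a hypothetical helper name; it only prints the rewritten command and only handles an assignment that appears as a leading word.

```shell
#!/usr/bin/env bash
# pin_command IDX CMD...
# Hypothetical helper: prints CMD with exactly one leading
# CUDA_VISIBLE_DEVICES assignment set to IDX.
pin_command() {
  local idx=$1
  shift
  # Drop leading CUDA_VISIBLE_DEVICES=... words so assignments do not stack.
  while [ $# -gt 0 ]; do
    case "$1" in
      CUDA_VISIBLE_DEVICES=*) shift ;;
      *) break ;;
    esac
  done
  printf 'CUDA_VISIBLE_DEVICES=%s %s\n' "$idx" "$*"
}

pin_command 2 pytest third_party/tlx/tutorials/testing/test_correctness.py
pin_command 2 CUDA_VISIBLE_DEVICES=4 third_party/tlx/denoise.sh python bench.py
```

Because the arguments are re-joined with spaces, quoting inside the original command is not preserved; the printed line is meant to be read and re-issued, not passed to eval.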

Note: denoise.sh defaults to device 4 when CUDA_VISIBLE_DEVICES is unset (third_party/tlx/denoise.sh:6), so always set it explicitly when wrapping a benchmark with denoise.sh after a failure.
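
One way to avoid silently landing on the default device is a guard that refuses to run when the variable is unset. This is a sketch only; require_pinned_gpu is a hypothetical helper and is not part of denoise.sh.

```shell
#!/usr/bin/env bash
# Hypothetical guard: fails when CUDA_VISIBLE_DEVICES is unset or empty,
# so a wrapped benchmark never falls back to a default device.
require_pinned_gpu() {
  if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
    echo "CUDA_VISIBLE_DEVICES is unset; pin a working GPU first." >&2
    return 1
  fi
}

# Usage (commented out; requires the repo script):
#   export CUDA_VISIBLE_DEVICES=2
#   require_pinned_gpu && third_party/tlx/denoise.sh python bench.py
```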

More Skills from the same repository

autows-authoring

facebookexperimental/triton

Author Triton kernels with automatic warp specialization (AutoWS). Use when writing new AutoWS kernels, adding warp_specialize=True to tl.range loops, choosing tl.range kwargs and JIT options, debugging why WS was not applied, or structuring a kernel to work with both Meta WS and upstream OAI Triton. Covers GEMM and Flash Attention patterns on Hopper and Blackwell.

2026-06-23 · 173 stars

barrier-visualization

facebookexperimental/triton

Produce a structured barrier report for AutoWS (automatic warp specialization) IR. Use when the user wants to visualize, audit, or debug barrier usage across warp-specialized partitions, or when debugging a GPU kernel hang (deadlock). For hangs, first dump IR using the ir-debugging skill, then run this barrier analysis to find the barrier that actually deadlocks -- reasoning with the mbarrier phase model (NOT raw arrive/wait counts, which give false positives), plus missing backward barriers and other synchronization issues. Covers mbarriers, named barriers, tcgen05 commit, TMA-implicit arrives, Aref-based synchronization, and producer/consumer barrier patterns.

2026-06-22 · 173 stars

ir-override-ablation

facebookexperimental/triton

Design and run Triton TTGIR debugging ablations using ir_override. Use when reducing a provided or dumped TTGIR, trying user-provided or agent-generated ablation/oblation ideas, updating a test harness around ir_override, or preserving a compile/runtime failure while simplifying IR to expose a fundamental compiler or lowering gap.

2026-06-18 · 173 stars

running-with-buck

facebookexperimental/triton

How to build and run GPU targets under Buck in fbcode. Use when invoking buck2 run / buck2 build for any GPU benchmark, test, or kernel — selecting the GPU architecture and CUDA version, using @mode/opt and the beta Triton modifier, passing environment variables through, and running from the right directory. Covers the general requirements plus the B200/GB200 (b200a, CUDA >= 12.8) and GB300 (b300a, CUDA >= 13.0) hardware requirements.

2026-06-16 · 173 stars

tlx-api-reference

facebookexperimental/triton

TLX DSL API reference for low-level GPU primitives. Use when writing or modifying TLX kernel code that uses barriers (mbarrier, named barriers), memory allocation (local_alloc, SMEM, TMEM), TMA operations, warp specialization (async_tasks, async_task), CLC (cluster launch control), or wgmma instructions. Covers Hopper and Blackwell hardware differences.

2026-06-08 · 173 stars

proxy-fence-insertion

facebookexperimental/triton

Use when working on fence-related compiler passes, TMA store lowering, proxy fence insertion, investigating missing or spurious fences, or debugging correctness issues in TLX kernels that use tlx.async_descriptor_store or MMA operations.

2026-05-22 · 173 stars
