hpc-gpu-stack

Build, review, debug, and launch CUDA- and GPU-accelerated HPC workflows. Use when working with `nvcc`, host-compiler compatibility, CUDA-aware MPI, rank-to-GPU mapping, `CUDA_VISIBLE_DEVICES`, Slurm GPU scheduling, GPU memory or stream behavior, or CUDA build and runtime failures.

Overview

Install command

npx skills add https://github.com/SciMate-AI/HPC-Skills --skill hpc-gpu-stack

Copy and paste this command into Claude Code to install the skill

Source

SciMate-AI/HPC-Skills

Stars: 45

Forks: 6

Updated: April 1, 2026 at 17:42

SKILL.md (one of 14 files)

name	hpc-gpu-stack
description	Build, review, debug, and launch CUDA- and GPU-accelerated HPC workflows. Use when working with `nvcc`, host-compiler compatibility, CUDA-aware MPI, rank-to-GPU mapping, `CUDA_VISIBLE_DEVICES`, Slurm GPU scheduling, GPU memory or stream behavior, or CUDA build and runtime failures.

HPC GPU Stack

Treat GPU execution as one coherent stack: CUDA toolchain, host compiler, launcher, scheduler mapping, and device visibility must agree before kernel tuning matters.

Start

Read references/cuda-and-host-compiler-matrix.md before choosing nvcc, host compiler, or a CUDA build baseline.
Read references/gpu-aware-mpi-and-rank-mapping.md when the workflow spans MPI ranks, one-rank-per-GPU layouts, or CUDA-aware MPI.
Read references/device-visibility-and-scheduler-integration.md when Slurm, CUDA_VISIBLE_DEVICES, MIG, or scheduler-provided GPU allocation is involved.
Read references/memory-streams-and-overlap-playbook.md when debugging device memory pressure, pinned-memory transfers, streams, or overlap assumptions.
Read references/build-and-launch-workflow.md when turning a CUDA code path into a reproducible compile-and-run workflow.
Read references/runtime-debugging-and-profiling.md when kernels fail at runtime, ranks see the wrong device, or performance is unexpectedly poor.
Read references/error-recovery.md when configure, compile, launch, or runtime CUDA behavior fails.
Read references/error-pattern-dictionary.md when a GPU failure needs a fast pattern match.

Work sequence

1. Confirm the execution model first:
   - single GPU
   - one MPI rank per GPU
   - hybrid MPI plus threads with explicit rank-to-GPU placement
2. Keep CUDA toolkit, host compiler, and MPI stack mutually compatible.
3. Let the scheduler expose the intended GPU allocation before forcing manual device selection.
4. Get a minimal kernel and launch baseline working before tuning streams, overlap, or transport variables.
5. Reproduce failures on one node and the smallest GPU count that still shows the issue before scaling out.
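
The step about letting the scheduler expose the allocation first can be illustrated with a small sketch. It is not taken from the skill's templates: the function names `visible_devices` and `pick_logical_device` are hypothetical, and refusing to guess when `CUDA_VISIBLE_DEVICES` is unset is a choice made for this example only. It relies on the fact that CUDA renumbers visible devices from 0, so the value returned is the logical index a rank would pass to `cudaSetDevice`.

```python
from typing import List, Mapping, Optional


def visible_devices(env: Mapping[str, str]) -> Optional[List[str]]:
    """Return the entries of CUDA_VISIBLE_DEVICES, or None when it is unset."""
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    # Entries may be indices or UUID strings; keep them as opaque labels.
    return [entry.strip() for entry in raw.split(",") if entry.strip()]


def pick_logical_device(local_rank: int, env: Mapping[str, str]) -> int:
    """Pick the logical device index for a rank (hypothetical helper)."""
    devices = visible_devices(env)
    if devices is None:
        # Assumption of this sketch: do not select a device manually
        # until the scheduler allocation has been checked.
        raise RuntimeError("CUDA_VISIBLE_DEVICES is unset; check the allocation first")
    if not devices:
        raise RuntimeError("CUDA_VISIBLE_DEVICES is empty; no GPU is visible")
    if len(devices) == 1:
        # One GPU already bound to this task: it is always logical device 0.
        return 0
    # Several GPUs visible to every rank: spread ranks across them.
    return local_rank % len(devices)


# Four GPUs visible to every rank on the node, local rank 2.
print(pick_logical_device(2, {"CUDA_VISIBLE_DEVICES": "0,1,2,3"}))
# One GPU bound per task, local rank 3.
print(pick_logical_device(3, {"CUDA_VISIBLE_DEVICES": "2"}))
```

The modulo wraps around when there are more local ranks than visible GPUs, which silently shares a device between ranks; whether that is acceptable depends on the execution model confirmed in the first step.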

Guardrails

Do not assume nvcc accepts any host compiler visible in PATH.
Do not mix rank-to-GPU mapping logic from Open MPI, MPICH-family, and Slurm without checking which environment variables are actually set.
Do not tune streams or overlap to compensate for a broken device-mapping or memory-capacity issue.
Do not debug multi-node GPU failures before a single-node baseline is trustworthy.
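
The guardrail about mixing rank-to-GPU mapping logic can be made concrete with a check of which local-rank variables are actually present. This is a minimal sketch under stated assumptions: the helper names are hypothetical, and the table lists only three variables (`SLURM_LOCALID`, `OMPI_COMM_WORLD_LOCAL_RANK`, and `MV2_COMM_WORLD_LOCAL_RANK` as one MPICH-family example); it is not an exhaustive list of what launchers export.

```python
from typing import Dict, Mapping

# Launcher name -> environment variable carrying the node-local rank.
# Assumption: a partial list chosen for illustration.
LOCAL_RANK_VARS = {
    "Slurm": "SLURM_LOCALID",
    "Open MPI": "OMPI_COMM_WORLD_LOCAL_RANK",
    "MVAPICH2": "MV2_COMM_WORLD_LOCAL_RANK",
}


def detect_local_rank_sources(env: Mapping[str, str]) -> Dict[str, int]:
    """Report every known local-rank variable that is set to an integer."""
    found: Dict[str, int] = {}
    for launcher, var in LOCAL_RANK_VARS.items():
        value = env.get(var)
        if value is not None and value.strip().isdigit():
            found[launcher] = int(value)
    return found


def resolve_local_rank(env: Mapping[str, str]) -> int:
    """Return the local rank only when all detected sources agree."""
    found = detect_local_rank_sources(env)
    if not found:
        raise RuntimeError("no known local-rank variable is set")
    distinct = set(found.values())
    if len(distinct) > 1:
        # Disagreement means at least one variable does not describe this rank.
        raise RuntimeError(f"launchers disagree on local rank: {found}")
    return distinct.pop()


# Two sources that agree resolve cleanly.
print(resolve_local_rank({"SLURM_LOCALID": "1", "OMPI_COMM_WORLD_LOCAL_RANK": "1"}))
```

Failing on disagreement rather than picking a preferred launcher is deliberate here: a variable inherited from an outer job step can be set without describing the current rank, and a silent choice would hide exactly the mapping problem the guardrail warns about.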

Additional References

Load these on demand:

references/cuda-and-host-compiler-matrix.md for compiler-compatibility and build-baseline decisions
references/gpu-aware-mpi-and-rank-mapping.md for CUDA-aware MPI and rank placement rules
references/device-visibility-and-scheduler-integration.md for scheduler-exposed GPU visibility and Slurm integration
references/memory-streams-and-overlap-playbook.md for memory hierarchy, streams, and transfer overlap
references/build-and-launch-workflow.md for reproducible build and launch sequencing
references/runtime-debugging-and-profiling.md for runtime inspection and performance triage
references/error-pattern-dictionary.md for common GPU failure signatures

Reusable Templates

Use assets/templates/ when a concrete starting point is faster than rebuilding the GPU workflow from scratch, especially:

cuda_vector_add_minimal.cu
nvcc_build_example.sh
cuda_single_gpu_slurm.sh
cuda_mpi_gpu_slurm.sh

Outputs

Summarize:

CUDA toolkit and host-compiler path chosen
rank-to-GPU mapping or single-GPU launch path
scheduler or visibility assumptions
memory and stream model if relevant
the exact build or runtime failure class if the workflow is being repaired
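
One way to keep that summary consistent across runs is a small record type. The sketch below is illustrative only: `GpuStackSummary` and its field names are hypothetical, and the values in the usage lines are placeholders rather than recommendations.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GpuStackSummary:
    """Hypothetical record mirroring the summary items listed above."""

    toolkit_and_host_compiler: str
    launch_path: str
    visibility_assumptions: str
    memory_and_streams: Optional[str] = None  # only if relevant
    failure_class: Optional[str] = None  # only when repairing a workflow

    def render(self) -> str:
        """Format the summary, skipping items that do not apply."""
        rows = [
            ("CUDA toolkit and host compiler", self.toolkit_and_host_compiler),
            ("Launch path", self.launch_path),
            ("Scheduler and visibility assumptions", self.visibility_assumptions),
            ("Memory and stream model", self.memory_and_streams),
            ("Failure class", self.failure_class),
        ]
        return "\n".join(f"{label}: {value}" for label, value in rows if value)


# Placeholder values for illustration.
summary = GpuStackSummary(
    toolkit_and_host_compiler="<toolkit version> with <host compiler>",
    launch_path="one MPI rank per GPU",
    visibility_assumptions="scheduler sets CUDA_VISIBLE_DEVICES per task",
)
print(summary.render())
```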
