Run any Skill in Manus with one click

$pwd:

cuda-cpp-kernel

Name: Cuda Cpp Kernel
Author: vipshop

// Use when writing, debugging, porting, reviewing, or optimizing CUDA C++ or PTX kernels; investigating CUDA Runtime or Driver API behavior; profiling kernels with Nsight Systems or Nsight Compute; or reasoning about Tensor Core instructions, shared memory, bank conflicts, occupancy, async copy, TMA, WGMMA, and architecture-specific behavior on Ampere, Hopper, or Blackwell.

Run Skill in Manus

$ git log --oneline --stat

stars:1,182

forks:71

updated:April 10, 2026 at 05:19

File Explorer

100 files

SKILL.md

readonly

related-skills.json

same repository

triton-kernel.md

from "vipshop/cache-dit"

Write optimized Triton GPU kernels for deep learning operations. Covers the full spectrum from basic vector ops to Flash Attention, persistent matmul, fused normalization, quantized GEMM, and memory-efficient patterns.

2026-05-191.2k

cute-dsl-kernel.md

from "vipshop/cache-dit"

Use when writing, modifying, porting, or optimizing CuTe DSL GPU kernels in Python; reading CuTe DSL API reference material; integrating a CuTe DSL kernel into a project; or rewriting an existing CUDA or C++ operator into CuTe DSL while preserving correctness and performance expectations.

2026-04-101.2k

cutlass-cpp-kernel.md

from "vipshop/cache-dit"

Use when writing, debugging, porting, reviewing, or optimizing CUTLASS or CuTe C++ kernels and templates; navigating CUTLASS examples, collectives, epilogues, pipelines, GEMM schedules, or CuTe headers; or analyzing template configuration, tiling, memory movement, and kernel structure for Hopper or Blackwell GPUs.

2026-04-101.2k

operator-migration.md

from "vipshop/cache-dit"

Use when doing operator migration or kernel migration for CUDA, Triton, or custom ops in cache-dit; porting kernels from nunchaku, deepcompressor, or other repos; designing operator registration and public wrappers; wiring build and packaging for optional extensions; or reviewing an operator migration plan. Guides survey, minimal-closure migration, API design, extension loading, packaging, and layered validation. Do not use for blind copy-paste ports.

2026-04-091.2k

ptq-workflow-integration.md

from "vipshop/cache-dit"

Use when integrating a new PTQ workflow into cache-dit; designing quantize/load API shape, backend-specific config validation, save/load manifests, benchmark and regression tests, or reviewing a PTQ integration plan. Uses the SVDQ PTQ integration only as a style and coverage reference. Do not copy the SVDQ implementation mechanically.

2026-04-081.2k

package.json

"author": "vipshop"

"repository": "vipshop/cache-dit"

View GitHub Repository View Creator Repositories

$ install --global

$ download --local

Run Skill in Manus

$ useful --forSOC

Software DevelopersComputer and Mathematical Occupations15-1252L4

name	cuda-cpp-kernel
description	Use when writing, debugging, porting, reviewing, or optimizing CUDA C++ or PTX kernels; investigating CUDA Runtime or Driver API behavior; profiling kernels with Nsight Systems or Nsight Compute; or reasoning about Tensor Core instructions, shared memory, bank conflicts, occupancy, async copy, TMA, WGMMA, and architecture-specific behavior on Ampere, Hopper, or Blackwell.
argument-hint	Describe the kernel or CUDA problem, target GPU architecture, dtype, shapes, current bottleneck, and whether the task is implementation, debugging, optimization, or migration.
user-invocable	true

CUDA C++ and PTX Kernel Development

Goal

Use the bundled CUDA, PTX, and profiling references in this skill to implement, debug, and optimize CUDA kernels without relying on agent-specific install paths or ad hoc web searches.

When to Use

Use this skill when you need to:

write or review CUDA C++ kernels or supporting host code
reason about PTX instructions, inline PTX, Tensor Core instructions, or memory model details
debug CUDA Runtime API or Driver API failures
profile or optimize a kernel with Nsight Systems, Nsight Compute, compute-sanitizer, or cuda-gdb
investigate shared memory bank conflicts, memory coalescing, occupancy, register pressure, async copy, TMA, or cluster behavior
compare a custom operator or kernel against a PyTorch baseline for correctness or performance

Do not use this skill for:

CUTLASS or CuTe template design as the primary task; use cutlass-cpp-kernel
CuTe DSL Python kernel authoring as the primary task; use cute-dsl-kernel
cache-dit operator registration, packaging, or public API migration work by itself; pair with operator-migration

Reference Style Rule

Use skill-local relative paths for bundled references, for example:

references/ptx-docs/
references/cuda-runtime-docs/
references/ncu-docs/ProfilingGuide.md

Do not write agent-specific install paths into follow-up notes or generated docs.

Bundled Reference Map

The following directories are bundled under references/ inside this skill:

references/ptx-docs/ — full PTX ISA reference
references/ptx-simple/ — condensed PTX quick reference
references/cuda-runtime-docs/ — CUDA Runtime API reference
references/cuda-driver-docs/ — CUDA Driver API reference
references/cuda-guide/ — CUDA Programming Guide
references/best-practices-guide/ — CUDA C++ Best Practices Guide
references/ncu-docs/ — Nsight Compute docs
references/nsys-docs/ — Nsight Systems docs
references/debugging-tools.md — debugging workflow notes
references/performance-traps.md — common optimization traps

The following architecture and kernel reference files are also bundled at the top level of this skill:

sm89-optimization-guide.md — Ada-specific optimization and profiling guidance
sm90-optimization-guide.md — Hopper-specific optimization and profiling guidance
sm100-optimization-guide.md — Blackwell datacenter optimization and profiling guidance
sm103-optimization-guide.md — Blackwell Ultra optimization and profiling guidance
sm120-optimization-guide.md — Blackwell desktop or workstation optimization and profiling guidance
kernel-templates.md — low-level CUDA kernel templates and implementation patterns
troubleshooting.md — debugging, compute-sanitizer, and profiling troubleshooting notes

How to Search the Bundle

Prefer narrow text search over loading large reference files into context.

Suggested workflow:

Identify the exact instruction, API, metric, or concept.
Search the narrowest relevant subdirectory first.
Read only the matching file or a short relevant range.
Translate the documentation into the specific kernel or operator constraint you are implementing.

Typical search targets:

PTX instruction syntax: references/ptx-docs/9-instruction-set/
Quick PTX lookup: references/ptx-simple/
CUDA Runtime APIs: references/cuda-runtime-docs/modules/
CUDA Driver APIs: references/cuda-driver-docs/modules/
architecture and programming-model behavior: references/cuda-guide/
profiling metrics and sections: references/ncu-docs/ProfilingGuide.md
timeline and launch behavior: references/nsys-docs/UserGuide.md

For architecture-specific tuning and profiling, read the matching optimization guide first:

sm89-optimization-guide.md
sm90-optimization-guide.md
sm100-optimization-guide.md
sm103-optimization-guide.md
sm120-optimization-guide.md

Architecture-Specific Profiling Workflow

When using Nsight Systems or Nsight Compute, interpret the results in the context of the target architecture instead of treating all GPUs the same.

Use the bundled architecture guides as the first reference for cross-architecture analysis:

For sm89, focus on memory throughput, L2 hit rate, kernel fusion opportunity, and the lack of TMA or cluster features.
For sm90, focus on TMA overlap, warpgroup behavior, shared-memory staging, and whether the timeline shows good load or compute overlap.
For sm100 and sm103, focus on tcgen05 usage, TMEM behavior, TMA v2 overlap, cluster behavior, and whether the kernel actually benefits from Blackwell datacenter features.
For sm120, treat it closer to Ada than to datacenter Blackwell for profiling purposes: watch memory throughput, L2 hit rate, shared-memory limits, and the lack of TMEM or cluster features while deciding explicitly whether TMA or cp.async is the better staging path.

Recommended order:

Read the matching smXX-optimization-guide.md file.
Use nsys to identify launch gaps, overlap, copy or compute concurrency, and end-to-end bottlenecks.
Use ncu to inspect architecture-specific limits such as occupancy, memory throughput, L2 hit rate, tensor core utilization, stall reasons, register pressure, or shared-memory pressure.
Compare the observations against the architecture guide before changing tile shapes, pipelines, or memory movement.

Implementation Checklist

Before changing code, answer these questions:

What is the exact shape, dtype, and layout contract?
What architectural assumptions exist, such as SM target, shared memory budget, alignment, or Tensor Core mode?
Is the bottleneck compute, memory, launch overhead, or synchronization?
Which CUDA Runtime, Driver, or PTX rules must be preserved?
What verification will prove the kernel is correct and faster enough to justify the change?

If the task is a migration into cache-dit, keep the kernel work separate from registration and packaging decisions and use operator-migration for the repository-integration layer.

Debugging Workflow

Reproduce the issue with the smallest input that still fails.
Confirm the failure mode: wrong value, launch error, illegal memory access, race, hang, or performance regression.
When the kernel uses shared memory, cp.async, or any other asynchronous staging path, treat data-synchronization bugs as a first-line hypothesis. If only some shapes or stage-count cases fail, suspect missing barriers, premature shared-memory slot reuse, or incomplete predicate protection before assuming the math is wrong.
Use compute-sanitizer or cuda-gdb for correctness problems.
Use Nsight Systems first for end-to-end bottlenecks, then Nsight Compute for per-kernel root cause.
After each change, rerun the focused correctness test before doing broader benchmarks.

Performance Workflow

Never optimize by intuition alone.

Establish a baseline wall-clock measurement.
Use Nsight Systems to see where time is spent.
Use Nsight Compute to explain why the kernel is slow.
Change one dimension at a time: tile shape, memory movement, synchronization, vectorization, or epilogue structure.
Re-measure and compare against the previous version, not just the current absolute time.

If the result differs across GPU generations, consult the matching smXX-optimization-guide.md file before generalizing the bottleneck diagnosis.

Validation Requirements

Every operator or kernel task completed under this skill must include validation.

Minimum requirements:

Add or update unit tests for correctness.
Compare numerical accuracy against a PyTorch baseline or another trusted high-level reference when applicable.
Compare performance against that baseline when the task claims a performance benefit or replaces a baseline path.
Record the exact benchmark setup: shapes, dtypes, device, warmup, iterations, and timing method.

Additional requirement for rewrites or migrations:

If you rewrite an existing operator, such as moving from one CUDA implementation to another, or from a handwritten kernel to a new implementation style, compare the new implementation against the pre-rewrite operator on both accuracy and performance.
Treat a PyTorch baseline and the pre-rewrite operator as separate comparisons when both exist.

Output Expectations

When you finish a task using this skill, report:

the implementation scope
the main architectural or API constraints
the tests added or run
the PyTorch-baseline accuracy result
the performance result
if applicable, the old-versus-new operator comparison

cuda-cpp-kernel

More from this repository

More from this repository

CUDA C++ and PTX Kernel Development

Goal

When to Use

Reference Style Rule

Bundled Reference Map

How to Search the Bundle

Architecture-Specific Profiling Workflow

Implementation Checklist

Debugging Workflow

Performance Workflow

Validation Requirements

Output Expectations

CUDA C++ and PTX Kernel Development

Goal

When to Use

Reference Style Rule

Bundled Reference Map

How to Search the Bundle

Architecture-Specific Profiling Workflow

Implementation Checklist

Debugging Workflow

Performance Workflow

Validation Requirements

Output Expectations