Run any Skill in Manus with one click

Get Started

add-sgl-kernel

Stars30,621

Forks7,323

UpdatedApril 28, 2026 at 09:05

Step-by-step tutorial for adding a heavyweight AOT CUDA/C++ kernel to sgl-kernel (including tests & benchmarks)

Installation

Install with Codex or Claude Copy this prompt, paste it into Codex, Claude, or another assistant, and let it review the skill page and install it for you.

Run Skill in Manus

Source

sgl-project

sgl-project/sglang

View GitHub Repository View Creator Repositories

Download

Run Skill in Manus

Related occupationsSOC

Based on SOC occupation classification

Software DevelopersComputer and Mathematical Occupations·SOC 15-1252

SKILL.md

readonly

Tutorial: Adding a New Kernel to `sgl-kernel` (AOT / Heavyweight)

This tutorial walks through adding a simple element-wise scale operation as an AOT kernel. We'll implement scale(x, factor) = x * factor to demonstrate the complete workflow.

Goal

Add a new operation that scales each element of a tensor by a scalar factor:

Input: tensor x (CUDA) and scalar factor (float)
Output: x * factor (element-wise, in-place or into pre-allocated out)
Supported dtypes: FP16 (torch.float16), BF16 (torch.bfloat16), FP32 (torch.float32)
- Dispatched via DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16 macro (defined in sgl-kernel/include/utils.h)

Two rules of thumb (must follow)

Prefer python/sglang/jit_kernel first when the kernel does not depend on CUTLASS or another large C++ project. This is the default path for lightweight kernels that benefit from rapid iteration.
Prefer sgl-kernel when the kernel does depend on CUTLASS or another large C++ project, or when it should be part of the AOT wheel / torch op registration flow.
Exception: if the dependency is flashinfer, or CUTLASS that is already provided through flashinfer, the kernel can still be implemented as jit_kernel.

In addition, every new kernel must ship with:

Tests (pytest)
A benchmark script (triton.testing)

Repository integration map

You will typically touch these files/areas:

Implementation: sgl-kernel/csrc/elementwise/scale.cu (pick the right subdirectory)
Public declarations: sgl-kernel/include/sgl_kernel_ops.h
Torch extension registration: sgl-kernel/csrc/common_extension.cc
Build: sgl-kernel/CMakeLists.txt (set(SOURCES ...))
Python API: sgl-kernel/python/sgl_kernel/ and sgl-kernel/python/sgl_kernel/__init__.py
Tests: sgl-kernel/tests/test_scale.py
Benchmarks: sgl-kernel/benchmark/bench_scale.py

Step 1: Implement the kernel in `csrc/`

Pick the right subdirectory:

csrc/elementwise/ — for element-wise ops (our example)
csrc/gemm/, csrc/attention/, csrc/moe/ — for other categories

Create sgl-kernel/csrc/elementwise/scale.cu:

#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>
#include <torch/all.h>

#include "utils.h"  // DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16

// scale_kernel: out[i] = input[i] * factor
// Supports float, half (__half), __nv_bfloat16 via template T
template <typename T>
__global__ void scale_kernel(T* __restrict__ out,
                              const T* __restrict__ input,
                              float factor,
                              int64_t n) {
  int64_t idx = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (idx < n) {
    out[idx] = static_cast<T>(static_cast<float>(input[idx]) * factor);
  }
}

void scale(at::Tensor& out, const at::Tensor& input, double factor) {
  TORCH_CHECK(input.is_cuda(),       "input must be a CUDA tensor");
  TORCH_CHECK(input.is_contiguous(), "input must be contiguous");
  TORCH_CHECK(out.is_cuda(),         "out must be a CUDA tensor");
  TORCH_CHECK(out.is_contiguous(),   "out must be contiguous");
  TORCH_CHECK(out.sizes() == input.sizes(),  "out and input must have the same shape");
  TORCH_CHECK(out.scalar_type() == input.scalar_type(),
              "out and input must have the same dtype");

  const int64_t n = input.numel();
  const int threads = 256;
  const int blocks  = (n + threads - 1) / threads;

  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
  const at::cuda::OptionalCUDAGuard device_guard(device_of(input));

  // Dispatches over float, float16, bfloat16
  DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16(input.scalar_type(), c_type, [&] {
    scale_kernel<c_type><<<blocks, threads, 0, stream>>>(
        static_cast<c_type*>(out.data_ptr()),
        static_cast<const c_type*>(input.data_ptr()),
        static_cast<float>(factor),
        n);
    cudaError_t status = cudaGetLastError();
    TORCH_CHECK(status == cudaSuccess,
                "scale_kernel launch failed: ", cudaGetErrorString(status));
    return true;
  });
}

Key points:

Use at::Tensor (PyTorch tensors), TORCH_CHECK for validation, at::cuda::getCurrentCUDAStream() for stream
Keep Python wrappers thin; do shape/dtype/device validation in C++ right around the launch path
DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16 covers float, half (FP16), __nv_bfloat16 (BF16)
Add device error checking after every kernel launch
If a kernel only works on certain architectures, enforce that with TORCH_CHECK and skip logic in tests

Step 2: Add a C++ declaration in `include/sgl_kernel_ops.h`

Edit sgl-kernel/include/sgl_kernel_ops.h, add to the elementwise section:

void scale(at::Tensor& out, const at::Tensor& input, double factor);

Step 3: Register the op in `csrc/common_extension.cc`

Edit sgl-kernel/csrc/common_extension.cc, inside TORCH_LIBRARY_FRAGMENT(sgl_kernel, m):

// From csrc/elementwise
m.def("scale(Tensor! out, Tensor input, float factor) -> ()");
m.impl("scale", torch::kCUDA, &scale);

Key points:

Tensor! means in-place / mutable output argument
The schema is important for torch.compile and for consistent call signatures
Keep the torch schema in PyTorch scalar types (float here), but note that the C++ launcher signature still needs double for scalar arguments accepted by torch::Library

Step 4: Add the new source file to `CMakeLists.txt`

Edit sgl-kernel/CMakeLists.txt, add to set(SOURCES ...):

csrc/elementwise/scale.cu

Key points:

Keep the list alphabetically sorted (the file explicitly requires this)
If the kernel has arch constraints, reflect that in tests/benchmarks via skip logic

Step 5: Expose a Python API under `sgl-kernel/python/sgl_kernel/`

Prefer following the existing module organization first. For elementwise kernels, the usual pattern is:

implement the Python wrapper in sgl-kernel/python/sgl_kernel/elementwise.py
then re-export it from sgl-kernel/python/sgl_kernel/__init__.py

For example, in sgl-kernel/python/sgl_kernel/elementwise.py, add:

import torch

def scale(
    input: torch.Tensor,
    factor: float,
    out: torch.Tensor | None = None,
) -> torch.Tensor:
    """
    Element-wise scale: out = input * factor.

    Supported dtypes: torch.float16, torch.bfloat16, torch.float32.

    Parameters
    ----------
    input  : CUDA input tensor
    factor : scale factor (float)
    out    : optional pre-allocated CUDA output tensor (same shape/dtype as input)
    """
    if out is None:
        out = torch.empty_like(input)
    torch.ops.sgl_kernel.scale.default(out, input, factor)
    return out

Then re-export it from sgl-kernel/python/sgl_kernel/__init__.py following the existing import style used by other kernels.

Step 6: Write tests (required)

Create sgl-kernel/tests/test_scale.py:

import pytest

import torch
import sgl_kernel

@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
@pytest.mark.parametrize("size", [128, 1024, 4096, 65536])
@pytest.mark.parametrize("factor", [0.5, 1.0, 2.0])
def test_scale_correctness(dtype, size, factor):
    input = torch.randn(size, dtype=dtype, device="cuda")
    out   = torch.empty_like(input)

    result = sgl_kernel.scale(input, factor, out=out)
    assert result is out

    expected = input * factor
    rtol, atol = (1e-5, 1e-6) if dtype == torch.float32 else (1e-2, 1e-2)
    torch.testing.assert_close(out, expected, rtol=rtol, atol=atol)


def test_scale_shape_mismatch():
    input = torch.randn(128, dtype=torch.float16, device="cuda")
    out   = torch.empty(256, dtype=torch.float16, device="cuda")
    with pytest.raises(RuntimeError, match="same shape"):
        sgl_kernel.scale(input, 2.0, out=out)


def test_scale_cpu_input():
    input = torch.randn(128, dtype=torch.float16)  # CPU
    out   = torch.empty_like(input)
    with pytest.raises(RuntimeError, match="CUDA"):
        sgl_kernel.scale(input, 2.0, out=out)


if __name__ == "__main__":
    import sys
    sys.exit(pytest.main([__file__, "-q"]))

Step 7: Add a benchmark (required)

Create sgl-kernel/benchmark/bench_scale.py:

import itertools

import torch
import triton
import triton.testing

import sgl_kernel
from sglang.utils import is_in_ci

IS_CI = is_in_ci()

dtypes  = [torch.float16] if IS_CI else [torch.float16, torch.bfloat16, torch.float32]
sizes   = [4096] if IS_CI else [2**n for n in range(10, 20)]  # 1K … 512K
factors = [2.0]

configs = list(itertools.product(dtypes, sizes))


def torch_scale(input: torch.Tensor, factor: float) -> torch.Tensor:
    return input * factor


@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["dtype", "size"],
        x_vals=configs,
        line_arg="provider",
        line_vals=["sglang", "torch"],
        line_names=["SGL Kernel", "PyTorch"],
        styles=[("green", "-"), ("red", "--")],
        ylabel="µs (median)",
        plot_name="scale-performance",
        args={},
    )
)
def benchmark(dtype, size, provider):
    input  = torch.randn(size, dtype=dtype, device="cuda")
    out    = torch.empty_like(input)
    factor = 2.0

    if provider == "sglang":
        fn = lambda: sgl_kernel.scale(input, factor, out=out)
    else:
        fn = lambda: torch_scale(input, factor)

    ms, min_ms, max_ms = triton.testing.do_bench_cudagraph(
        fn, quantiles=[0.5, 0.2, 0.8]
    )
    return 1000 * ms, 1000 * max_ms, 1000 * min_ms


if __name__ == "__main__":
    benchmark.run(print_data=True)

Step 8: Build

Build:

cd sgl-kernel
make build -j16

If you need to limit host resource usage:

cd sgl-kernel
make build -j1 MAX_JOBS=2 CMAKE_ARGS="-DSGL_KERNEL_COMPILE_THREADS=1"

Step 9: Validate

After building successfully, run the test and benchmark:

pytest sgl-kernel/tests/test_scale.py -q
python sgl-kernel/benchmark/bench_scale.py

PR CI also runs pr-test-sgl-kernel.yml, including the B200 job sgl-kernel-b200-test when kernel changes are detected. Use that job as the Blackwell coverage signal for AOT sgl-kernel changes.

Troubleshooting

Async CUDA errors: CUDA_LAUNCH_BLOCKING=1
Memory errors: compute-sanitizer --tool memcheck python ...
Build is too slow / OOM: reduce MAX_JOBS and SGL_KERNEL_COMPILE_THREADS
Binary bloat: use sgl-kernel/analyze_whl_kernel_sizes.py
CMake sources list: if your .cu file is missing from SOURCES, the symbol will be undefined at link time

References

sgl-kernel/README.md
sgl-kernel/include/sgl_kernel_ops.h
sgl-kernel/csrc/common_extension.cc
sgl-kernel/CMakeLists.txt
sgl-kernel/include/utils.h — DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16 macro and friends
sgl-kernel/csrc/elementwise/activation.cu — reference for the FP16/BF16/FP32 dispatch pattern

Summary of Files Created/Modified

sgl-kernel/csrc/elementwise/scale.cu          # NEW: CUDA kernel + launcher
sgl-kernel/include/sgl_kernel_ops.h           # MODIFIED: C++ declaration
sgl-kernel/csrc/common_extension.cc           # MODIFIED: schema + dispatch registration
sgl-kernel/CMakeLists.txt                     # MODIFIED: add source file (alphabetical)
sgl-kernel/python/sgl_kernel/elementwise.py   # MODIFIED: Python wrapper
sgl-kernel/python/sgl_kernel/__init__.py      # MODIFIED: re-export Python API
sgl-kernel/tests/test_scale.py                # NEW: tests
sgl-kernel/benchmark/bench_scale.py           # NEW: benchmark

name	add-sgl-kernel
description	Step-by-step tutorial for adding a heavyweight AOT CUDA/C++ kernel to sgl-kernel (including tests & benchmarks)

add-sgl-kernel

More from this repository

Tutorial: Adding a New Kernel to sgl-kernel (AOT / Heavyweight)

Goal

Two rules of thumb (must follow)

Repository integration map

Step 1: Implement the kernel in csrc/

Step 2: Add a C++ declaration in include/sgl_kernel_ops.h

Step 3: Register the op in csrc/common_extension.cc

Step 4: Add the new source file to CMakeLists.txt

Step 5: Expose a Python API under sgl-kernel/python/sgl_kernel/

Step 6: Write tests (required)

Step 7: Add a benchmark (required)

Step 8: Build

Step 9: Validate

Troubleshooting

References

Summary of Files Created/Modified

Tutorial: Adding a New Kernel to sgl-kernel (AOT / Heavyweight)

Goal

Two rules of thumb (must follow)

Repository integration map

Step 1: Implement the kernel in csrc/

Step 2: Add a C++ declaration in include/sgl_kernel_ops.h

Step 3: Register the op in csrc/common_extension.cc

Step 4: Add the new source file to CMakeLists.txt

Step 5: Expose a Python API under sgl-kernel/python/sgl_kernel/

Step 6: Write tests (required)

Step 7: Add a benchmark (required)

Step 8: Build

Step 9: Validate

Troubleshooting

References

Summary of Files Created/Modified

More from this repository

Tutorial: Adding a New Kernel to `sgl-kernel` (AOT / Heavyweight)

Step 1: Implement the kernel in `csrc/`

Step 2: Add a C++ declaration in `include/sgl_kernel_ops.h`

Step 3: Register the op in `csrc/common_extension.cc`

Step 4: Add the new source file to `CMakeLists.txt`

Step 5: Expose a Python API under `sgl-kernel/python/sgl_kernel/`

Tutorial: Adding a New Kernel to `sgl-kernel` (AOT / Heavyweight)

Step 1: Implement the kernel in `csrc/`

Step 2: Add a C++ declaration in `include/sgl_kernel_ops.h`

Step 3: Register the op in `csrc/common_extension.cc`

Step 4: Add the new source file to `CMakeLists.txt`

Step 5: Expose a Python API under `sgl-kernel/python/sgl_kernel/`