with one click
add-sgl-kernel
Step-by-step tutorial for adding a heavyweight AOT CUDA/C++ kernel to sgl-kernel (including tests & benchmarks)
Step-by-step tutorial for adding a heavyweight AOT CUDA/C++ kernel to sgl-kernel (including tests & benchmarks)
Conventions for SGLang environment variables โ where to define, how to access, how to name, and how to deprecate. Use when adding, renaming, or reviewing any `SGLANG_*` environment variable (or migrating a legacy `SGL_*` alias), or when touching `python/sglang/srt/environ.py`.
Step-by-step tutorial for adding a new lightweight JIT CUDA kernel to sglang's jit_kernel module
Guide to SGLang CI workflow orchestration โ stage ordering, fast-fail, gating, partitioning, execution modes, and debugging CI failures. Use when modifying CI workflows, adding stages, debugging CI pipeline issues, or understanding how tests are dispatched and gated across stages.
`__init__` style for SGLang `Scheduler`, `TokenizerManager`, and `ModelRunner`. Use when modifying the `__init__` of any of these three classes, or reviewing changes that add new construction logic to them.
Naming conventions for SGLang speculative decoding identifiers. Use when adding, renaming, or reviewing identifiers in speculative decoding code โ anything under `python/sglang/srt/speculative/`, related attention backends, scheduler accumulators, IPC fields, observability metrics, or CLI flags.
Clean up noisy startup warnings and spurious prints in SGLang server logs. Use when users ask to clean up unwanted warnings, deprecation messages, or third-party noise in the server startup output.
| name | add-sgl-kernel |
| description | Step-by-step tutorial for adding a heavyweight AOT CUDA/C++ kernel to sgl-kernel (including tests & benchmarks) |
sgl-kernel (AOT / Heavyweight)This tutorial walks through adding a simple element-wise scale operation as an AOT kernel. We'll implement scale(x, factor) = x * factor to demonstrate the complete workflow.
Add a new operation that scales each element of a tensor by a scalar factor:
x (CUDA) and scalar factor (float)x * factor (element-wise, in-place or into pre-allocated out)torch.float16), BF16 (torch.bfloat16), FP32 (torch.float32)
DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16 macro (defined in sgl-kernel/include/utils.h)python/sglang/jit_kernel first when the kernel does not depend on CUTLASS or another large C++ project. This is the default path for lightweight kernels that benefit from rapid iteration.sgl-kernel when the kernel does depend on CUTLASS or another large C++ project, or when it should be part of the AOT wheel / torch op registration flow.flashinfer, or CUTLASS that is already provided through flashinfer, the kernel can still be implemented as jit_kernel.In addition, every new kernel must ship with:
You will typically touch these files/areas:
sgl-kernel/csrc/elementwise/scale.cu (pick the right subdirectory)sgl-kernel/include/sgl_kernel_ops.hsgl-kernel/csrc/common_extension.ccsgl-kernel/CMakeLists.txt (set(SOURCES ...))sgl-kernel/python/sgl_kernel/ and sgl-kernel/python/sgl_kernel/__init__.pysgl-kernel/tests/test_scale.pysgl-kernel/benchmark/bench_scale.pycsrc/Pick the right subdirectory:
csrc/elementwise/ โ for element-wise ops (our example)csrc/gemm/, csrc/attention/, csrc/moe/ โ for other categoriesCreate sgl-kernel/csrc/elementwise/scale.cu:
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>
#include <torch/all.h>
#include "utils.h" // DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16
// scale_kernel: out[i] = input[i] * factor
// Supports float, half (__half), __nv_bfloat16 via template T
template <typename T>
__global__ void scale_kernel(T* __restrict__ out,
const T* __restrict__ input,
float factor,
int64_t n) {
int64_t idx = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
if (idx < n) {
out[idx] = static_cast<T>(static_cast<float>(input[idx]) * factor);
}
}
void scale(at::Tensor& out, const at::Tensor& input, double factor) {
TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
TORCH_CHECK(input.is_contiguous(), "input must be contiguous");
TORCH_CHECK(out.is_cuda(), "out must be a CUDA tensor");
TORCH_CHECK(out.is_contiguous(), "out must be contiguous");
TORCH_CHECK(out.sizes() == input.sizes(), "out and input must have the same shape");
TORCH_CHECK(out.scalar_type() == input.scalar_type(),
"out and input must have the same dtype");
const int64_t n = input.numel();
const int threads = 256;
const int blocks = (n + threads - 1) / threads;
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
// Dispatches over float, float16, bfloat16
DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16(input.scalar_type(), c_type, [&] {
scale_kernel<c_type><<<blocks, threads, 0, stream>>>(
static_cast<c_type*>(out.data_ptr()),
static_cast<const c_type*>(input.data_ptr()),
static_cast<float>(factor),
n);
cudaError_t status = cudaGetLastError();
TORCH_CHECK(status == cudaSuccess,
"scale_kernel launch failed: ", cudaGetErrorString(status));
return true;
});
}
Key points:
at::Tensor (PyTorch tensors), TORCH_CHECK for validation, at::cuda::getCurrentCUDAStream() for streamDISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16 covers float, half (FP16), __nv_bfloat16 (BF16)TORCH_CHECK and skip logic in testsinclude/sgl_kernel_ops.hEdit sgl-kernel/include/sgl_kernel_ops.h, add to the elementwise section:
void scale(at::Tensor& out, const at::Tensor& input, double factor);
csrc/common_extension.ccEdit sgl-kernel/csrc/common_extension.cc, inside TORCH_LIBRARY_FRAGMENT(sgl_kernel, m):
// From csrc/elementwise
m.def("scale(Tensor! out, Tensor input, float factor) -> ()");
m.impl("scale", torch::kCUDA, &scale);
Key points:
Tensor! means in-place / mutable output argumenttorch.compile and for consistent call signaturesfloat here), but note that the C++ launcher signature still needs double for scalar arguments accepted by torch::LibraryCMakeLists.txtEdit sgl-kernel/CMakeLists.txt, add to set(SOURCES ...):
csrc/elementwise/scale.cu
Key points:
sgl-kernel/python/sgl_kernel/Prefer following the existing module organization first. For elementwise kernels, the usual pattern is:
sgl-kernel/python/sgl_kernel/elementwise.pysgl-kernel/python/sgl_kernel/__init__.pyFor example, in sgl-kernel/python/sgl_kernel/elementwise.py, add:
import torch
def scale(
input: torch.Tensor,
factor: float,
out: torch.Tensor | None = None,
) -> torch.Tensor:
"""
Element-wise scale: out = input * factor.
Supported dtypes: torch.float16, torch.bfloat16, torch.float32.
Parameters
----------
input : CUDA input tensor
factor : scale factor (float)
out : optional pre-allocated CUDA output tensor (same shape/dtype as input)
"""
if out is None:
out = torch.empty_like(input)
torch.ops.sgl_kernel.scale.default(out, input, factor)
return out
Then re-export it from sgl-kernel/python/sgl_kernel/__init__.py following the existing import style used by other kernels.
Create sgl-kernel/tests/test_scale.py:
import pytest
import torch
import sgl_kernel
@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
@pytest.mark.parametrize("size", [128, 1024, 4096, 65536])
@pytest.mark.parametrize("factor", [0.5, 1.0, 2.0])
def test_scale_correctness(dtype, size, factor):
input = torch.randn(size, dtype=dtype, device="cuda")
out = torch.empty_like(input)
result = sgl_kernel.scale(input, factor, out=out)
assert result is out
expected = input * factor
rtol, atol = (1e-5, 1e-6) if dtype == torch.float32 else (1e-2, 1e-2)
torch.testing.assert_close(out, expected, rtol=rtol, atol=atol)
def test_scale_shape_mismatch():
input = torch.randn(128, dtype=torch.float16, device="cuda")
out = torch.empty(256, dtype=torch.float16, device="cuda")
with pytest.raises(RuntimeError, match="same shape"):
sgl_kernel.scale(input, 2.0, out=out)
def test_scale_cpu_input():
input = torch.randn(128, dtype=torch.float16) # CPU
out = torch.empty_like(input)
with pytest.raises(RuntimeError, match="CUDA"):
sgl_kernel.scale(input, 2.0, out=out)
if __name__ == "__main__":
import sys
sys.exit(pytest.main([__file__, "-q"]))
Create sgl-kernel/benchmark/bench_scale.py:
import itertools
import torch
import triton
import triton.testing
import sgl_kernel
from sglang.utils import is_in_ci
IS_CI = is_in_ci()
dtypes = [torch.float16] if IS_CI else [torch.float16, torch.bfloat16, torch.float32]
sizes = [4096] if IS_CI else [2**n for n in range(10, 20)] # 1K โฆ 512K
factors = [2.0]
configs = list(itertools.product(dtypes, sizes))
def torch_scale(input: torch.Tensor, factor: float) -> torch.Tensor:
return input * factor
@triton.testing.perf_report(
triton.testing.Benchmark(
x_names=["dtype", "size"],
x_vals=configs,
line_arg="provider",
line_vals=["sglang", "torch"],
line_names=["SGL Kernel", "PyTorch"],
styles=[("green", "-"), ("red", "--")],
ylabel="ยตs (median)",
plot_name="scale-performance",
args={},
)
)
def benchmark(dtype, size, provider):
input = torch.randn(size, dtype=dtype, device="cuda")
out = torch.empty_like(input)
factor = 2.0
if provider == "sglang":
fn = lambda: sgl_kernel.scale(input, factor, out=out)
else:
fn = lambda: torch_scale(input, factor)
ms, min_ms, max_ms = triton.testing.do_bench_cudagraph(
fn, quantiles=[0.5, 0.2, 0.8]
)
return 1000 * ms, 1000 * max_ms, 1000 * min_ms
if __name__ == "__main__":
benchmark.run(print_data=True)
Build:
cd sgl-kernel
make build -j16
If you need to limit host resource usage:
cd sgl-kernel
make build -j1 MAX_JOBS=2 CMAKE_ARGS="-DSGL_KERNEL_COMPILE_THREADS=1"
After building successfully, run the test and benchmark:
pytest sgl-kernel/tests/test_scale.py -q
python sgl-kernel/benchmark/bench_scale.py
PR CI also runs pr-test-sgl-kernel.yml, including the B200 job
sgl-kernel-b200-test when kernel changes are detected. Use that job as the
Blackwell coverage signal for AOT sgl-kernel changes.
CUDA_LAUNCH_BLOCKING=1compute-sanitizer --tool memcheck python ...MAX_JOBS and SGL_KERNEL_COMPILE_THREADSsgl-kernel/analyze_whl_kernel_sizes.py.cu file is missing from SOURCES, the symbol will be undefined at link timesgl-kernel/README.mdsgl-kernel/include/sgl_kernel_ops.hsgl-kernel/csrc/common_extension.ccsgl-kernel/CMakeLists.txtsgl-kernel/include/utils.h โ DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16 macro and friendssgl-kernel/csrc/elementwise/activation.cu โ reference for the FP16/BF16/FP32 dispatch patternsgl-kernel/csrc/elementwise/scale.cu # NEW: CUDA kernel + launcher
sgl-kernel/include/sgl_kernel_ops.h # MODIFIED: C++ declaration
sgl-kernel/csrc/common_extension.cc # MODIFIED: schema + dispatch registration
sgl-kernel/CMakeLists.txt # MODIFIED: add source file (alphabetical)
sgl-kernel/python/sgl_kernel/elementwise.py # MODIFIED: Python wrapper
sgl-kernel/python/sgl_kernel/__init__.py # MODIFIED: re-export Python API
sgl-kernel/tests/test_scale.py # NEW: tests
sgl-kernel/benchmark/bench_scale.py # NEW: benchmark