sglang-diffusion-benchmark-profile
Use when benchmarking denoise latency or profiling a diffusion bottleneck in SGLang.
| name | sglang-diffusion-benchmark-profile |
| description | Use when benchmarking denoise latency or profiling a diffusion bottleneck in SGLang. |
Use this skill when measuring denoise performance, finding the slow op, checking whether an existing fast path can solve it, or verifying that a hotspot is real before any kernel work in sglang.multimodal_gen.
This skill is diagnosis-first. It owns:
- torch.profiler trace capture and quick hotspot ranking

This skill does not own low-level kernel authoring or standalone Nsight workflows.
Before running any benchmark, profiler, or kernel-validation command:
- Use scripts/diffusion_skill_env.py to derive the repo root from sglang.__file__
- Set HF_TOKEN before using gated Hugging Face models such as black-forest-labs/FLUX.*
- Set FLASHINFER_DISABLE_VERSION_CHECK=1

All diffusion benchmark and profiling results owned by this skill must come from the native SGLang diffusion backend.
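The preflight checks above can be sketched in a few lines. This is a hypothetical stand-in, not the contents of scripts/diffusion_skill_env.py: the function names, the `.git` root marker, and the `needs_gated_model` switch are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Dict, List, Optional


def repo_root_from_module_file(module_file: str, marker: str = ".git") -> Optional[Path]:
    """Walk up from a module's __file__ until a directory containing `marker` exists.

    Assumption: the repo root is identified by a `.git` entry; the real helper
    may use a different rule.
    """
    for parent in Path(module_file).resolve().parents:
        if (parent / marker).exists():
            return parent
    return None


def preflight(env: Dict[str, str], needs_gated_model: bool) -> List[str]:
    """Return a list of problems; an empty list means the environment looks ready."""
    problems = []
    # Gated models such as black-forest-labs/FLUX.* need a Hugging Face token.
    if needs_gated_model and not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set but a gated Hugging Face model was requested")
    if env.get("FLASHINFER_DISABLE_VERSION_CHECK") != "1":
        problems.append("FLASHINFER_DISABLE_VERSION_CHECK is not set to 1")
    return problems
```

In practice this would be called as `preflight(dict(os.environ), needs_gated_model=True)` and `repo_root_from_module_file(sglang.__file__)` before any benchmark or profiler command runs.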
Treat any of the following as a hard stop condition:
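- Falling back to diffusers backend
- Using diffusers backend
- Loaded diffusers pipeline

If any benchmark, perf-dump, or torch.profiler command prints one of those signals, stop: that run did not use the native backend, so its results do not count for this skill.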
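A check for the hard-stop signals can be automated by scanning captured command output. The sketch below assumes the signals appear verbatim (case-sensitive) in the combined stdout/stderr text; the function names are hypothetical and not part of SGLang.

```python
from typing import List

# Signal strings taken from the hard-stop list above.
DIFFUSERS_FALLBACK_SIGNALS = (
    "Falling back to diffusers backend",
    "Using diffusers backend",
    "Loaded diffusers pipeline",
)


def find_fallback_signals(log_text: str) -> List[str]:
    """Return every hard-stop signal that occurs in the captured output."""
    return [signal for signal in DIFFUSERS_FALLBACK_SIGNALS if signal in log_text]


def assert_native_backend(log_text: str) -> None:
    """Raise if the run shows any sign of the diffusers backend."""
    hits = find_fallback_signals(log_text)
    if hits:
        raise RuntimeError(
            "Hard stop: run did not stay on the native SGLang diffusion backend: "
            + "; ".join(hits)
        )
```

Calling `assert_native_backend(output)` before recording any latency number keeps a fallback run from being reported as a native result.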
- torch.profiler workflow; uses checked-in nightly-aligned presets plus skill-only stress recipes such as LTX-2.3 one-stage/two-stage, HunyuanVideo, MOVA, Helios, JoyAI/FireRed image edit, and Hunyuan3D shape
- QK norm + RoPE, distributed overlap patterns, and open optimization PRs before proposing new code
- sglang.__file__, write-access probe, benchmark/profile output directories, idle GPU selection
- sglang generate; supports --no-torch-compile, validates nightly preset drift with --validate-nightly-alignment, and saves perf dumps by label for compare_perf.py

Before calling a diffusion hotspot "new", first classify it with existing-fast-paths.md.
Always rule out these existing families first:
- QK norm + RoPE
- torch.compile compute / communication reorder

If the user explicitly requires torch.compile to stay off, do not use the
default benchmark preset invocation unchanged. Either pass the checked-in
benchmark helper its no-compile switch or run the equivalent manual command
without --enable-torch-compile.
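For the manual route, one way to make sure the flag never reaches the command is to filter the argument list before launching it. This is a minimal sketch that assumes --enable-torch-compile is passed as a bare switch with no value; the helper name is hypothetical.

```python
from typing import List

TORCH_COMPILE_FLAG = "--enable-torch-compile"


def without_torch_compile(argv: List[str]) -> List[str]:
    """Return a copy of argv with the torch.compile switch removed.

    Assumption: the flag is a bare switch, so only exact matches are dropped.
    """
    return [arg for arg in argv if arg != TORCH_COMPILE_FLAG]
```

For example, `without_torch_compile(["sglang", "generate", "--enable-torch-compile"])` yields a command that can be handed to `subprocess.run` with torch.compile left off.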
For FLUX-family manual profiling runs with a quantized transformer override:
- sglang generate directly
- --transformer-path <dir>
- --prompt-path <file> when also fixing --output-file-name
- --model-path plus HF_HUB_OFFLINE=1
- --profile changes latency substantially; use the non-profile perf dump for the real before/after benchmark claim
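The flags in this checklist can be assembled into a pair of runs, one with --profile for the trace and one without it for the latency claim. The sketch below is an assumption-heavy illustration: it assumes each path flag takes a single value, omits any other arguments a real run needs, and uses a hypothetical helper name.

```python
from typing import Dict, List, Optional, Tuple


def build_flux_profile_cmd(
    model_path: str,
    transformer_dir: str,
    prompt_file: str,
    output_file_name: Optional[str] = None,
    profile: bool = False,
) -> Tuple[List[str], Dict[str, str]]:
    """Build argv and extra environment for a manual FLUX-family run."""
    argv = [
        "sglang", "generate",
        "--model-path", model_path,
        "--transformer-path", transformer_dir,  # quantized transformer override
        "--prompt-path", prompt_file,
    ]
    if output_file_name is not None:
        argv += ["--output-file-name", output_file_name]
    if profile:
        # Profiling changes latency substantially; never benchmark with this on.
        argv.append("--profile")
    extra_env = {"HF_HUB_OFFLINE": "1"}
    return argv, extra_env
```

Build the command twice with the same paths: `profile=True` for the trace, `profile=False` for the perf dump that backs the before/after number.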