sglang-diffusion-benchmark-profile

Name: Sglang Diffusion Benchmark Profile
Author: sgl-project

// Use when benchmarking denoise latency or profiling a diffusion bottleneck in SGLang.

stars: 28,163

forks: 6,084

updated: May 20, 2026, 04:18

SKILL.md

name	sglang-diffusion-benchmark-profile
description	Use when benchmarking denoise latency or profiling a diffusion bottleneck in SGLang.

SGLang Diffusion Benchmark and Profile

Use this skill when measuring denoise performance, finding the slow op, checking whether an existing fast path can solve it, or verifying that a hotspot is real before any kernel work in sglang.multimodal_gen.

This skill is diagnosis-first. It owns:

checked-in denoise benchmark presets
perf dump collection and before/after comparison
torch.profiler trace capture and quick hotspot ranking
mapping hot kernels back to known fast paths and fusion families
handing confirmed kernel work to a specialized optimization skill such as ../sglang-diffusion-ako4all-kernel/SKILL.md

This skill does not own low-level kernel authoring or standalone Nsight workflows.

Preflight

Before running any benchmark, profiler, or kernel-validation command:

use scripts/diffusion_skill_env.py to derive the repo root from sglang.__file__
verify the repo is writable
export HF_TOKEN before using gated Hugging Face models such as black-forest-labs/FLUX.*
export FLASHINFER_DISABLE_VERSION_CHECK=1
choose idle GPU(s) before starting perf work
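
The first, second, and fourth preflight steps above can be sketched with the Python standard library. This is an illustration only, not the contents of scripts/diffusion_skill_env.py: the function names are hypothetical, and the rule "the repo root is the nearest ancestor directory that contains a .git entry" is an assumption that only holds for an install from a source checkout.

```python
import tempfile
from pathlib import Path
from typing import Dict, Optional


def find_repo_root(module_file: str) -> Optional[Path]:
    """Walk up from a module file (for example sglang.__file__) to the
    nearest ancestor that contains a .git entry.

    Assumption: sglang is installed from a source checkout.
    """
    for parent in Path(module_file).resolve().parents:
        if (parent / ".git").exists():
            return parent
    return None


def is_writable(directory: Path) -> bool:
    """Probe write access by creating and removing a temporary file."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        return False


def preflight_env(base_env: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the environment with the version check disabled.

    HF_TOKEN is left untouched: it must be exported by the user.
    """
    env = dict(base_env)
    env["FLASHINFER_DISABLE_VERSION_CHECK"] = "1"
    return env
```

In a real session one would pass sglang.__file__ to find_repo_root, stop if the result is None or not writable, and pass preflight_env(os.environ) as the environment of the benchmark subprocess.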

Native Backend Gate

All diffusion benchmark and profiling results owned by this skill must come from the native SGLang diffusion backend.

Treat any of the following as a hard stop condition:

Falling back to diffusers backend
Using diffusers backend
Loaded diffusers pipeline

If any benchmark, perf-dump, or torch.profiler command prints one of those signals:

stop the workflow immediately
do not keep the generated numbers or traces as SGLang benchmark evidence
do not continue to hotspot classification or kernel work
first fix model resolution, pipeline selection, overlay/materialization, or other backend-selection issues so the model runs on the native SGLang diffusion path
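
The gate above can be enforced mechanically by scanning captured command output for the three signal strings. The sketch below is a hypothetical helper, not part of the skill's scripts. Matching case-insensitively and on substrings is an assumption made to tolerate log prefixes and formatting differences.

```python
from typing import Iterable, Optional

# Signal strings copied from the hard-stop list above.
DIFFUSERS_SIGNALS = (
    "Falling back to diffusers backend",
    "Using diffusers backend",
    "Loaded diffusers pipeline",
)


class NonNativeBackendError(RuntimeError):
    """Raised when a run did not stay on the native SGLang diffusion path."""


def first_backend_violation(log_lines: Iterable[str]) -> Optional[str]:
    """Return the first log line containing a hard-stop signal, else None."""
    for line in log_lines:
        lowered = line.lower()
        if any(signal.lower() in lowered for signal in DIFFUSERS_SIGNALS):
            return line.rstrip("\n")
    return None


def assert_native_backend(log_lines: Iterable[str]) -> None:
    """Stop the workflow if any hard-stop signal appears in the output."""
    offending = first_backend_violation(log_lines)
    if offending is not None:
        raise NonNativeBackendError(
            f"discard this run, diffusers backend detected: {offending!r}"
        )
```

Calling assert_native_backend on the captured output before saving a perf dump or trace keeps non-native numbers out of the evidence set.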

Main Reference

benchmark-and-profile.md — canonical denoise benchmark, perf dump, and torch.profiler workflow; uses checked-in nightly-aligned presets plus skill-only stress recipes such as LTX-2.3 one-stage/two-stage, HunyuanVideo, MOVA, Helios, JoyAI/FireRed image edit, and Hunyuan3D shape
existing-fast-paths.md — map bottlenecks to existing fused kernels, packed QKV paths, fused QK norm + RoPE, distributed overlap patterns, and open optimization PRs before proposing new code
scripts/diffusion_skill_env.py — preflight helper: repo root discovery via sglang.__file__, write-access probe, benchmark/profile output directories, idle GPU selection
scripts/bench_diffusion_denoise.py — end-to-end denoise benchmark preset runner via sglang generate; supports --no-torch-compile, validates nightly preset drift with --validate-nightly-alignment, and saves perf dumps by label for compare_perf.py

Opportunity Discovery Rule

Before calling a diffusion hotspot "new", first classify it with existing-fast-paths.md.

Always rule out these existing families first:

HunyuanVideo VAE GroupNorm+SiLU
Z-Image residual-form modulation
fused diffusion QK norm + RoPE
NVFP4 / Nunchaku packed QKV
Nunchaku fused GELU MLP
Ulysses / USP attention overlap
turbo-layer async all-to-all overlap
torch.compile compute / communication reorder
dual-stream diffusion execution

If the user explicitly requires torch.compile to stay off, do not use the default benchmark preset invocation unchanged. Either pass the checked-in benchmark helper its no-compile switch or run the equivalent manual command without --enable-torch-compile.
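
Both options in the paragraph above amount to a small change to the argument list. The sketch below is hypothetical: the function names are invented, and it assumes that --enable-torch-compile and --no-torch-compile are plain switches (optionally written as --enable-torch-compile=<value>), which should be checked against the actual CLI help.

```python
from typing import List

COMPILE_FLAG = "--enable-torch-compile"
NO_COMPILE_SWITCH = "--no-torch-compile"


def without_torch_compile(argv: List[str]) -> List[str]:
    """Manual command: drop the compile flag, keep everything else in order."""
    return [
        arg
        for arg in argv
        if arg != COMPILE_FLAG and not arg.startswith(COMPILE_FLAG + "=")
    ]


def with_no_compile_switch(helper_argv: List[str]) -> List[str]:
    """Benchmark helper: append its no-compile switch if it is not present."""
    if NO_COMPILE_SWITCH in helper_argv:
        return list(helper_argv)
    return [*helper_argv, NO_COMPILE_SWITCH]
```

For example, without_torch_compile(["sglang", "generate", "--enable-torch-compile"]) yields ["sglang", "generate"], and the input list is left unmodified.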

For FLUX-family manual profiling runs with a quantized transformer override:

use sglang generate directly
pass the override as --transformer-path <dir>
prefer --prompt-path <file> when also fixing --output-file-name
if the base model is already cached locally and the machine has unreliable HF access, use the local cached --model-path plus HF_HUB_OFFLINE=1
remember that --profile changes latency substantially; use the non-profile perf dump for the real before/after benchmark claim
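
The checklist above can be turned into a small command builder. This is a sketch under assumptions: the function is hypothetical, only flags named in the text are used, --profile is assumed to be a plain switch, and any other options a real run needs (prompt content, sampling parameters, GPU selection) are deliberately left out.

```python
import os
from typing import Dict, List, Optional, Tuple


def build_flux_profile_command(
    model_path: str,
    transformer_path: str,
    prompt_path: str,
    output_file_name: Optional[str] = None,
    profile: bool = False,
    offline: bool = False,
    base_env: Optional[Dict[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Assemble argv and environment for a manual `sglang generate` run."""
    argv = [
        "sglang", "generate",
        "--model-path", model_path,
        "--transformer-path", transformer_path,  # quantized override dir
        "--prompt-path", prompt_path,
    ]
    if output_file_name is not None:
        argv += ["--output-file-name", output_file_name]
    if profile:
        # Profiling distorts latency; never use this run for timing claims.
        argv.append("--profile")

    env = dict(os.environ if base_env is None else base_env)
    if offline:
        # Use the locally cached model without contacting the Hub.
        env["HF_HUB_OFFLINE"] = "1"
    return argv, env
```

A typical pattern is to build the command twice, once with profile=False for the perf dump that backs the before/after claim and once with profile=True for the trace, and to run each with subprocess.run(argv, env=env, check=True).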

related-skills.json

same repository

sglang-cherrypick.md

from "sgl-project/sglang"

Trigger the bot-cherry-pick workflow for a batch of merged PRs onto a release branch and monitor each run to completion. Use when an SGLang release manager asks to cherry-pick a list of PRs to a release branch.

2026-05-22, 28.2k

write-sglang-test.md

from "sgl-project/sglang"

Guide for writing SGLang CI/UT tests. Covers CustomTestCase, CI registration, server fixtures, model selection, mock testing, and test placement. Always read test/README.md for the full CI layout, how to run tests, and extra tips. Use when creating new tests, adding CI test cases, writing unit tests, or when the user asks to add tests for SGLang features.

2026-05-21, 28.2k

sglang-diffusion-performance.md

from "sgl-project/sglang"

Use when choosing the fastest SGLang Diffusion flags for a model, GPU, and VRAM budget.

2026-05-20, 28.2k

ci-workflow-guide.md

from "sgl-project/sglang"

Guide to SGLang CI workflow orchestration — stage ordering, fast-fail, gating, partitioning, execution modes, and debugging CI failures. Use when modifying CI workflows, adding stages, debugging CI pipeline issues, or understanding how tests are dispatched and gated across stages.

2026-05-19, 28.2k

add-jit-kernel.md

from "sgl-project/sglang"

Step-by-step tutorial for adding a new lightweight JIT CUDA kernel to sglang's jit_kernel module

2026-05-16, 28.2k

sglang-diffusion-modelopt-quant.md

from "sgl-project/sglang"

Use when quantizing a diffusion DiT with NVIDIA ModelOpt and making the resulting FP8 or NVFP4 checkpoint loadable, verifiable, and benchmarkable in SGLang Diffusion.

2026-05-05, 28.2k

package.json

"author": "sgl-project"

"repository": "sgl-project/sglang"
