원클릭으로 Manus에서 모든 스킬 실행

$pwd:

sglang-diffusion-ako4all-kernel

Name: Sglang Diffusion Ako4all Kernel
Author: sgl-project

// Use when optimizing an existing SGLang diffusion kernel with AKO4ALL, including AKO4ALL repo hygiene, custom microbench setup, ncu-guided iteration, and end-to-end denoise validation. Also use when a sibling AKO4ALL repo must be cloned or refreshed before starting kernel tuning work.

Manus에서 실행

$ git log --oneline --stat

stars:28,163

forks:6,084

updated:2026년 5월 2일 14:04

파일 탐색기

3 개 파일

SKILL.md

readonly

name	sglang-diffusion-ako4all-kernel
description	Use when optimizing an existing SGLang diffusion kernel with AKO4ALL, including AKO4ALL repo hygiene, custom microbench setup, ncu-guided iteration, and end-to-end denoise validation. Also use when a sibling AKO4ALL repo must be cloned or refreshed before starting kernel tuning work.

SGLang Diffusion AKO4ALL Kernel

Use this skill to run the full AKO4ALL-based optimization loop for an existing SGLang diffusion kernel. It is the default implementation path once the benchmark/profile skill has already shown that a hotspot is real and not covered by an existing fast path. This workflow bootstraps a custom AKO harness, benchmarks and profiles the kernel, iterates with ncu, ports the best version back to sglang, then validates with targeted tests and model-level denoise runs.

This skill assumes a sibling repo layout like:

<base-dir>/
├── sglang/
└── AKO4ALL/

If AKO4ALL/ is missing under the current base directory, clone it first.

Use This Skill When

tuning an existing diffusion Triton, CUDA JIT, CuTeDSL, or runtime-integrated kernel in sglang
sglang-diffusion-benchmark-profile has already ruled out an existing in-repo fast path or overlap family
creating a custom AKO4ALL harness for a real diffusion kernel instead of using the default benchmark tasks
validating that a kernel-level win transfers to Qwen, FLUX, Wan, Hunyuan, MOVA, or other diffusion denoise latency
preparing PR artifacts such as microbench tables, ncu before/after data, and proof image outputs

Do not start here when the bottleneck has not been proven yet. First use ../sglang-diffusion-benchmark-profile/SKILL.md to:

measure the real denoise regression
collect the perf dump baseline
capture one representative torch.profiler trace
rule out existing mainline fast paths
prove the run stayed on the native SGLang diffusion backend, not a diffusers fallback

Before opening AKO, also read ../sglang-diffusion-benchmark-profile/existing-fast-paths.md. It records current mainline fusions plus the open PR watchlist for diffusion kernel, VAE, attention, cache, and scheduling work. If an open PR already covers the same shape family, use it as prior art or decide whether to rebase/extend it instead of starting a duplicate kernel.

If a future specialized optimization skill matches the kernel family better than AKO4ALL, hand off there instead. The diagnosis contract stays the same.

Mandatory AKO4ALL Preflight

Before any AKO work:

Run scripts/ensure_ako4all_clean.sh [base-dir].
If <base-dir>/AKO4ALL does not exist, the script clones it.
Do not continue unless AKO4ALL is:
- on the upstream default branch, usually main
- fully clean with no tracked or untracked local changes
- exactly synced to upstream/<default-branch>
If the script reports local commits, divergence, or a dirty worktree, stop and clean or re-clone the repo before continuing.

The script creates an upstream remote automatically when missing. By default it uses the existing origin URL, or AKO4ALL_URL if you need to override the clone source.

Workflow

1. Scope the Kernel

Identify the exact kernel entry point and runtime call sites in sglang.
Record the target shapes, dtypes, model families, and whether the kernel is on a hot path.
Reuse existing unit tests and benchmark entry points when they already exist.
Record whether the hotspot overlaps an open PR from existing-fast-paths.md; if it does, note the PR number in the AKO context and final PR artifacts.

2. Bootstrap the AKO Harness

Inside the clean AKO4ALL repo:

read TASK.md and HINTS.md
create a custom harness instead of relying on the stock benchmark tasks
mirror the real SGLang kernel into:
- input/reference.py
- input/<kernel>.py
- solution/<kernel>.py
- bench/bench_<kernel>.py
keep a short context note in context/ when the kernel has model-specific shape assumptions or perf conclusions

The custom benchmark should:

cover representative diffusion shapes
check correctness against the reference kernel
report aggregate runtime plus per-shape results when useful

3. Establish the Baseline

run the AKO custom microbench before changing the kernel
capture one representative ncu baseline on the hottest meaningful shape
note whether the bottleneck looks like registers, occupancy, instruction count, launch config, or memory latency

4. Iterate in AKO4ALL

change one idea at a time
rerun the microbench after every change
update ITERATIONS.md with hypothesis, result, and next step
prefer simple, explainable wins over clever rewrites that do not transfer

After 3 consecutive no-improvement or regression iterations:

rerun ncu
re-read ITERATIONS.md
change direction instead of continuing blind sweeps

5. Port the Best Version Back to SGLang

apply the best candidate to the real sglang kernel file
run import or syntax checks and targeted tests first
keep the AKO solution/ version aligned with the main-tree version you actually want to keep

6. Validate on Real Models

use the benchmark/profile skill for denoise perf dumps and before/after comparison
prefer exact local snapshot validation when testing local edits on a GPU box
run targeted kernel tests first
run model-level denoise benchmarks with perf dumps
compare baseline vs optimized runs with compare_perf.py
if the PR needs proof that generation still works, save one real model output image

7. Prepare PR Artifacts

At minimum, keep:

one microbench table
one denoise-stage table
one end-to-end table
one ncu before/after pair on the most representative kernel shape
one generated image when the kernel affects production inference

See references/ako-loop.md for the checklist and common stop rules.

Operating Rules

Treat AKO4ALL repo hygiene as a gate, not a suggestion.
Prefer exact local snapshot validation over hand-wavy “remote tree is close enough”.
Do not start or justify kernel work from traces collected after Falling back to diffusers backend, Using diffusers backend, or Loaded diffusers pipeline; fix backend selection and rerun the benchmark/profile workflow first.
Keep model-level validation honest: if microbench improves but denoise does not, do not keep the AKO-only variant in the main code path.
When writing conclusions, explain the win in terms of measurable causes such as lower registers per thread, higher occupancy, fewer executed instructions, or better scheduler eligibility.

related-skills.json

같은 저장소

sglang-cherrypick.md

from "sgl-project/sglang"

Trigger the bot-cherry-pick workflow for a batch of merged PRs onto a release branch and monitor each run to completion. Use when an SGLang release manager asks to cherry-pick a list of PRs to a release branch.

2026-05-2228.2k

write-sglang-test.md

from "sgl-project/sglang"

Guide for writing SGLang CI/UT tests. Covers CustomTestCase, CI registration, server fixtures, model selection, mock testing, and test placement. Always read test/README.md for the full CI layout, how to run tests, and extra tips. Use when creating new tests, adding CI test cases, writing unit tests, or when the user asks to add tests for SGLang features.

2026-05-2128.2k

sglang-diffusion-benchmark-profile.md

from "sgl-project/sglang"

Use when benchmarking denoise latency or profiling a diffusion bottleneck in SGLang.

2026-05-2028.2k

sglang-diffusion-performance.md

from "sgl-project/sglang"

Use when choosing the fastest SGLang Diffusion flags for a model, GPU, and VRAM budget.

2026-05-2028.2k

ci-workflow-guide.md

from "sgl-project/sglang"

Guide to SGLang CI workflow orchestration — stage ordering, fast-fail, gating, partitioning, execution modes, and debugging CI failures. Use when modifying CI workflows, adding stages, debugging CI pipeline issues, or understanding how tests are dispatched and gated across stages.

2026-05-1928.2k

add-jit-kernel.md

from "sgl-project/sglang"

Step-by-step tutorial for adding a new lightweight JIT CUDA kernel to sglang's jit_kernel module

2026-05-1628.2k

package.json

"author": "sgl-project"

"repository": "sgl-project/sglang"

GitHub 저장소 열기 Creator 저장소 보기

$ install --global

$ download --local

Manus에서 실행

$ useful --forSOC

소프트웨어 개발자컴퓨터 및 수학직15-1252L4

name	sglang-diffusion-ako4all-kernel
description	Use when optimizing an existing SGLang diffusion kernel with AKO4ALL, including AKO4ALL repo hygiene, custom microbench setup, ncu-guided iteration, and end-to-end denoise validation. Also use when a sibling AKO4ALL repo must be cloned or refreshed before starting kernel tuning work.

SGLang Diffusion AKO4ALL Kernel

This skill assumes a sibling repo layout like:

<base-dir>/
├── sglang/
└── AKO4ALL/

If AKO4ALL/ is missing under the current base directory, clone it first.

Use This Skill When

tuning an existing diffusion Triton, CUDA JIT, CuTeDSL, or runtime-integrated kernel in sglang
sglang-diffusion-benchmark-profile has already ruled out an existing in-repo fast path or overlap family
creating a custom AKO4ALL harness for a real diffusion kernel instead of using the default benchmark tasks
validating that a kernel-level win transfers to Qwen, FLUX, Wan, Hunyuan, MOVA, or other diffusion denoise latency
preparing PR artifacts such as microbench tables, ncu before/after data, and proof image outputs

Do not start here when the bottleneck has not been proven yet. First use ../sglang-diffusion-benchmark-profile/SKILL.md to:

measure the real denoise regression
collect the perf dump baseline
capture one representative torch.profiler trace
rule out existing mainline fast paths
prove the run stayed on the native SGLang diffusion backend, not a diffusers fallback

If a future specialized optimization skill matches the kernel family better than AKO4ALL, hand off there instead. The diagnosis contract stays the same.

Mandatory AKO4ALL Preflight

Before any AKO work:

Run scripts/ensure_ako4all_clean.sh [base-dir].
If <base-dir>/AKO4ALL does not exist, the script clones it.
Do not continue unless AKO4ALL is:
- on the upstream default branch, usually main
- fully clean with no tracked or untracked local changes
- exactly synced to upstream/<default-branch>
If the script reports local commits, divergence, or a dirty worktree, stop and clean or re-clone the repo before continuing.

The script creates an upstream remote automatically when missing. By default it uses the existing origin URL, or AKO4ALL_URL if you need to override the clone source.

Workflow

1. Scope the Kernel

Identify the exact kernel entry point and runtime call sites in sglang.
Record the target shapes, dtypes, model families, and whether the kernel is on a hot path.
Reuse existing unit tests and benchmark entry points when they already exist.
Record whether the hotspot overlaps an open PR from existing-fast-paths.md; if it does, note the PR number in the AKO context and final PR artifacts.

2. Bootstrap the AKO Harness

Inside the clean AKO4ALL repo:

read TASK.md and HINTS.md
create a custom harness instead of relying on the stock benchmark tasks
mirror the real SGLang kernel into:
- input/reference.py
- input/<kernel>.py
- solution/<kernel>.py
- bench/bench_<kernel>.py
keep a short context note in context/ when the kernel has model-specific shape assumptions or perf conclusions

The custom benchmark should:

cover representative diffusion shapes
check correctness against the reference kernel
report aggregate runtime plus per-shape results when useful

3. Establish the Baseline

run the AKO custom microbench before changing the kernel
capture one representative ncu baseline on the hottest meaningful shape
note whether the bottleneck looks like registers, occupancy, instruction count, launch config, or memory latency

4. Iterate in AKO4ALL

change one idea at a time
rerun the microbench after every change
update ITERATIONS.md with hypothesis, result, and next step
prefer simple, explainable wins over clever rewrites that do not transfer

After 3 consecutive no-improvement or regression iterations:

rerun ncu
re-read ITERATIONS.md
change direction instead of continuing blind sweeps

5. Port the Best Version Back to SGLang

apply the best candidate to the real sglang kernel file
run import or syntax checks and targeted tests first
keep the AKO solution/ version aligned with the main-tree version you actually want to keep

6. Validate on Real Models

use the benchmark/profile skill for denoise perf dumps and before/after comparison
prefer exact local snapshot validation when testing local edits on a GPU box
run targeted kernel tests first
run model-level denoise benchmarks with perf dumps
compare baseline vs optimized runs with compare_perf.py
if the PR needs proof that generation still works, save one real model output image

7. Prepare PR Artifacts

At minimum, keep:

one microbench table
one denoise-stage table
one end-to-end table
one ncu before/after pair on the most representative kernel shape
one generated image when the kernel affects production inference

See references/ako-loop.md for the checklist and common stop rules.

Operating Rules

Treat AKO4ALL repo hygiene as a gate, not a suggestion.
Prefer exact local snapshot validation over hand-wavy “remote tree is close enough”.
Do not start or justify kernel work from traces collected after Falling back to diffusers backend, Using diffusers backend, or Loaded diffusers pipeline; fix backend selection and rerun the benchmark/profile workflow first.
Keep model-level validation honest: if microbench improves but denoise does not, do not keep the AKO-only variant in the main code path.
When writing conclusions, explain the win in terms of measurable causes such as lower registers per thread, higher occupancy, fewer executed instructions, or better scheduler eligibility.

sglang-diffusion-ako4all-kernel

SGLang Diffusion AKO4ALL Kernel

Use This Skill When

Mandatory AKO4ALL Preflight

Workflow

1. Scope the Kernel

2. Bootstrap the AKO Harness

3. Establish the Baseline

4. Iterate in AKO4ALL

5. Port the Best Version Back to SGLang

6. Validate on Real Models

7. Prepare PR Artifacts

Operating Rules

이 저장소의 다른 Skills

이 저장소의 다른 Skills

SGLang Diffusion AKO4ALL Kernel

Use This Skill When

Mandatory AKO4ALL Preflight

Workflow

1. Scope the Kernel

2. Bootstrap the AKO Harness

3. Establish the Baseline

4. Iterate in AKO4ALL

5. Port the Best Version Back to SGLang

6. Validate on Real Models

7. Prepare PR Artifacts

Operating Rules