llm-pipeline-analysis
// Inspect LLM torch profiler traces at forward-pass, layer, and kernel level. Use when you need layer timings, anchor-kernel boundaries, representative kernel flows, or Perfetto time ranges.
| name | llm-pipeline-analysis |
| description | Inspect LLM torch profiler traces at forward-pass, layer, and kernel level. Use when you need layer timings, anchor-kernel boundaries, representative kernel flows, or Perfetto time ranges. |
Use this when a whole-trace profiler summary is too coarse. The scripts read a Chrome-trace JSON file, find layer-boundary anchor kernels, group kernels into forward passes and layers, and print timing tables you can use for Perfetto navigation or detailed timing analysis.
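As a rough illustration of the first two steps described above (reading the trace and locating anchor kernels), here is a minimal sketch. It is not the bundled scripts' code. It assumes the trace follows the Chrome Trace Event Format, with GPU kernels recorded as complete events (`"ph": "X"`) in the `"kernel"` category, which is how torch.profiler exports usually look. The helper names `load_kernel_events` and `anchor_end_times` are hypothetical.

```python
import gzip
import json
from pathlib import Path


def load_kernel_events(trace_path):
    """Return GPU kernel events from a Chrome-trace file, sorted by start time."""
    path = Path(trace_path)
    # Transparently handle both .json and .json.gz traces.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as fh:
        trace = json.load(fh)
    # A Chrome trace is either {"traceEvents": [...]} or a bare list of events.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    # Assumption: kernels are complete events ("ph" == "X") with cat == "kernel".
    kernels = [
        e for e in events
        if e.get("ph") == "X" and e.get("cat") == "kernel"
    ]
    return sorted(kernels, key=lambda e: e["ts"])


def anchor_end_times(kernels, anchor_substring):
    """End timestamps (microseconds) of kernels whose name contains the anchor."""
    return [e["ts"] + e["dur"] for e in kernels if anchor_substring in e["name"]]
```

Consecutive anchor end times then delimit the blocks that get grouped into layers and forward passes.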
compress_ratios like DeepSeek-V4 NSA)

Before running scripts, collect or verify these inputs:
| Item | Why it matters | How to obtain | Default if user skips |
|---|---|---|---|
| Model name | Determines which config.json to use; affects layer classification | Ask user | — (required) |
| Model profile | Determines anchor kernel, blocks-per-layer, and kernel classification rules | Ask user or auto-infer from config | Auto-inferred from config |
| config.json path | Provides compress_ratios, num_hidden_layers, num_hash_layers, etc. | Ask user or search filesystem | — (required) |
| GPU type | Optional context for reports and hardware notes | Ask user | — |
| TP / EP | Parallelism config affects kernel naming and AllReduce count | Ask user or infer from trace filename (e.g. TP-0) | TP=8, EP=8 |
| Serving mode | Decode vs prefill changes kernel mix and FLOPs profile | Ask user | decode B=1 |
If the user cannot provide config.json, search common locations such as
/root/workspace/*/config.json and the HuggingFace cache. If it is still not
available, require an explicit --profile.
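The filesystem search described above could look like the following sketch. It assumes the standard HuggingFace hub cache layout (`$HF_HOME/hub/models--*/snapshots/*/config.json`, with `HF_HOME` defaulting to `~/.cache/huggingface`). The function name `find_config_candidates` is hypothetical and not part of the bundled scripts.

```python
import glob
import os


def find_config_candidates(workspace_root="/root/workspace", hf_cache=None):
    """List config.json files under the workspace and the HuggingFace hub cache."""
    if hf_cache is None:
        # Assumption: default hub cache location unless HF_HOME overrides it.
        hf_home = os.environ.get(
            "HF_HOME", os.path.expanduser("~/.cache/huggingface")
        )
        hf_cache = os.path.join(hf_home, "hub")
    patterns = [
        os.path.join(workspace_root, "*", "config.json"),
        os.path.join(hf_cache, "models--*", "snapshots", "*", "config.json"),
    ]
    found = []
    for pattern in patterns:
        # Sort each group so the result is deterministic.
        found.extend(sorted(glob.glob(pattern)))
    return found
```

If the list is empty, fall back to an explicit --profile as described above. If it contains several entries, confirm with the user which model the trace belongs to.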
Scripts use ModelProfile to determine layer boundary detection and kernel
classification. Profiles are auto-inferred from config.json or selected
via --profile:
| Profile | Anchor kernel | Blocks/layer | Layer structure | Auto-infer condition |
|---|---|---|---|---|
| dsv4_csa_hca | mhc_post_tilelang | 2 | attn + ffn halves | compress_ratios non-empty |
| dsv3_mla | flash_fwd_mla_combine | 1 | full layer | kv_lora_rank > 0 |
| generic | auto-detect or --anchor-kernel | 1 | full layer | fallback |
Use --profile generic --anchor-kernel YOUR_KERNEL for models not covered
by built-in profiles.
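The auto-infer conditions in the table can be read as a simple decision chain. The sketch below mirrors the table rows. The precedence (checking compress_ratios before kv_lora_rank) is an assumption taken from the row order, and `infer_profile` and `PROFILE_TABLE` are hypothetical names rather than the scripts' actual API.

```python
# Anchor kernel and blocks-per-layer for each built-in profile (from the table).
PROFILE_TABLE = {
    "dsv4_csa_hca": {"anchor": "mhc_post_tilelang", "blocks_per_layer": 2},
    "dsv3_mla": {"anchor": "flash_fwd_mla_combine", "blocks_per_layer": 1},
    # generic has no fixed anchor: it is auto-detected or set via --anchor-kernel.
    "generic": {"anchor": None, "blocks_per_layer": 1},
}


def infer_profile(config):
    """Pick a profile name from a parsed config.json dictionary."""
    # Assumption: conditions are checked in table order.
    if config.get("compress_ratios"):
        return "dsv4_csa_hca"
    if (config.get("kv_lora_rank") or 0) > 0:
        return "dsv3_mla"
    return "generic"
```

For example, a config with `"kv_lora_rank": 512` and no `compress_ratios` resolves to `dsv3_mla` under these rules.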
- torch.profiler trace in Chrome-trace JSON format (.json or .json.gz)
- config.json (for profile inference, compress_ratios, etc.)
- --anchor-kernel)

The scripts use an anchor kernel as a layer-boundary marker. The anchor and layer structure are determined by the active ModelProfile.
For example, with the dsv4_csa_hca profile, each transformer layer produces
2 consecutive mhc_post_tilelang calls:
mhc_post_tilelang ← end of attn half (attention + O-proj + AllReduce)
... ffn computation ...
mhc_post_tilelang ← end of ffn half (MoE experts + AllReduce)
... next layer attn ...
mhc_post_tilelang ← next layer's attn boundary
So for N layers with the dsv4_csa_hca profile, one forward pass has 2N
anchor blocks. With dsv3_mla or generic, each layer has 1 block.
Forward pass P starts at block index P * (N * blocks_per_layer).
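The block-index arithmetic above can be written out as follows. This is a minimal sketch assuming 0-based pass, layer, and block indices and half-open `[start, end)` ranges. Both function names are hypothetical.

```python
def pass_block_range(fwd_pass, num_layers, blocks_per_layer):
    """Half-open range of anchor-block indices covered by one forward pass."""
    blocks_per_pass = num_layers * blocks_per_layer
    start = fwd_pass * blocks_per_pass
    return start, start + blocks_per_pass


def layer_block_range(fwd_pass, layer, num_layers, blocks_per_layer):
    """Half-open range of anchor-block indices for one layer within a pass."""
    pass_start, _ = pass_block_range(fwd_pass, num_layers, blocks_per_layer)
    start = pass_start + layer * blocks_per_layer
    return start, start + blocks_per_layer
```

With N = 4 layers and the dsv4_csa_hca profile (2 blocks per layer), forward pass 5 covers blocks 40 to 47, and layer 3 of that pass covers blocks 46 (attn half) and 47 (ffn half).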
layer_timeline_analyzer.py: per-layer timeline and cluster stats

# Show all forward passes summary (cold-start vs steady-state)
python3 scripts/layer_timeline_analyzer.py \
--trace /path/to/TP-0.trace.json.gz \
--config /path/to/config.json \
--show-all-passes
# Detailed per-layer breakdown for a specific forward pass
python3 scripts/layer_timeline_analyzer.py \
--trace /path/to/TP-0.trace.json.gz \
--config /path/to/config.json \
--fwd-pass 5
# Auto-select first steady-state pass
python3 scripts/layer_timeline_analyzer.py \
--trace /path/to/TP-0.trace.json.gz \
--config /path/to/config.json
The script prints:
layer_kernel_breakdown.py: per-layer kernel detail and compute flow

# Single layer kernel dump
python3 scripts/layer_kernel_breakdown.py \
--trace /path/to/TP-0.trace.json.gz \
--config /path/to/config.json \
--fwd-pass 5 --layer 3
# Compute flow format (with model architecture summary and category column)
python3 scripts/layer_kernel_breakdown.py \
--trace /path/to/TP-0.trace.json.gz \
--config /path/to/config.json \
--fwd-pass 5 --layer 3 --format compute-flow
# JSON export
python3 scripts/layer_kernel_breakdown.py \
--trace /path/to/TP-0.trace.json.gz \
--config /path/to/config.json \
--fwd-pass 5 --layer 3 --format json
# Compare two layers side-by-side
python3 scripts/layer_kernel_breakdown.py \
--trace /path/to/TP-0.trace.json.gz \
--config /path/to/config.json \
--fwd-pass 5 --layer 2 --compare-layer 3
Output formats:
- --format text (default): grouped summary + top hot kernels ranked by duration, with simplified names and percentages
- --format compute-flow: model architecture summary + per-kernel hotness table with Category, %, and ts_rel(ms) columns
- --format json: machine-readable per-kernel detail ranked by duration

perfetto_time_mapper.py: Perfetto UI time navigation

# Show all forward pass time ranges in Perfetto
python3 scripts/perfetto_time_mapper.py \
--trace /path/to/TP-0.trace.json.gz \
--config /path/to/config.json
# Layer-level time ranges for a specific forward pass
python3 scripts/perfetto_time_mapper.py \
--trace /path/to/TP-0.trace.json.gz \
--config /path/to/config.json \
--fwd-pass 5 --layers 2,3,38,42
The script prints:
python3 scripts/layer_timeline_analyzer.py \
--trace $TRACE --config $CONFIG --show-all-passes
Read the "all-passes" table. The first pass is cold-start (few tokens). Find the first pass where layer-0 wall-clock stabilizes (typically pass 3-5).
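One way to make "stabilizes" concrete is to look for the first pass whose layer-0 time is within a small relative tolerance of the pass that follows it. The sketch below does that. The 5% tolerance and the function name `first_steady_pass` are assumptions for illustration, not the rule the analyzer uses when it auto-selects a pass.

```python
def first_steady_pass(layer0_us, rel_tol=0.05):
    """Index of the first pass whose layer-0 time matches the next pass.

    layer0_us holds one layer-0 wall-clock duration per forward pass, in
    microseconds. Returns None when no adjacent pair is within rel_tol.
    """
    for p in range(len(layer0_us) - 1):
        current, following = layer0_us[p], layer0_us[p + 1]
        if current <= 0:
            continue  # skip empty or malformed passes
        if abs(following - current) / current <= rel_tol:
            return p
    return None
```

Given layer-0 timings of 900, 300, 120, 100, 101, and 99 us, this picks pass 3, which agrees with the typical 3-5 range noted above.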
python3 scripts/layer_timeline_analyzer.py \
--trace $TRACE --config $CONFIG --fwd-pass 5
Identify:
Select 1-2 representative layers (one per bottleneck type), then:
# Human-readable compute flow table
python3 scripts/layer_kernel_breakdown.py \
--trace $TRACE --config $CONFIG \
--fwd-pass 5 --layer 3 --format compute-flow
# JSON export
python3 scripts/layer_kernel_breakdown.py \
--trace $TRACE --config $CONFIG \
--fwd-pass 5 --layer 3 --format json > /tmp/layer3_detail.json
The --format compute-flow output includes:
- Columns: # | Half | Category | Simplified Name | dur(us) | % | ts_rel(ms) | Input Dims
- Rows are sorted by dur(us) descending by default; use ts_rel(ms) to jump back to the kernel's trace location.

python3 scripts/layer_kernel_breakdown.py \
--trace $TRACE --config $CONFIG \
--fwd-pass 5 --layer 2 --compare-layer 3
This shows the exact kernel difference between the two layer types.
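Conceptually, a layer comparison like this reduces to a multiset difference over kernel names. The following sketch shows the idea with `collections.Counter`. It is an illustration under that assumption, not the script's implementation, and `kernel_count_diff` is a hypothetical name.

```python
from collections import Counter


def kernel_count_diff(layer_a, layer_b):
    """Kernels (with surplus call counts) present in one layer but not the other.

    layer_a and layer_b are lists of kernel names, one entry per launch.
    """
    a, b = Counter(layer_a), Counter(layer_b)
    # Counter subtraction keeps only positive counts, i.e. the surplus launches.
    return dict(a - b), dict(b - a)
```

For instance, comparing a layer that launches `hadamard` and `indexer` kernels against one that does not would report those two names in the first dictionary.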
python3 scripts/perfetto_time_mapper.py \
--trace $TRACE --config $CONFIG \
--fwd-pass 5 --layers 2,3,38,42
Use the printed time ranges to navigate directly in Perfetto.
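If you need to derive such a range by hand, note that Chrome-trace `ts` values are in microseconds, while timeline positions are usually read as offsets from the start of the trace. The sketch below performs that conversion. It assumes the timeline origin is the earliest event timestamp in the trace, and `to_relative_ms` is a hypothetical helper whose output format may differ from what the script prints.

```python
def to_relative_ms(start_us, end_us, trace_start_us):
    """Convert absolute trace timestamps (us) to offsets from trace start (ms)."""
    return (
        (start_us - trace_start_us) / 1000.0,
        (end_us - trace_start_us) / 1000.0,
    )


def format_range(start_us, end_us, trace_start_us):
    """Human-readable range string for one layer or forward pass."""
    start_ms, end_ms = to_relative_ms(start_us, end_us, trace_start_us)
    return f"{start_ms:.3f} ms -> {end_ms:.3f} ms ({end_ms - start_ms:.3f} ms)"
```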
The scripts classify layers based on config.json fields:
| Config field | Value | Layer Type | Description |
|---|---|---|---|
| compress_ratios[i] | 0 | FULL_ATTN | No NSA compression (layers 0-1) |
| compress_ratios[i] | 4 | C4_LIGHT | C128 sparse attention, fastest |
| compress_ratios[i] | 128 | C128_HEAVY | C4 attention + Hadamard + Indexer, bottleneck |
| i >= N - num_hash_layers | — | HASH | Hash-table routing with paged MQA |
| i == 0 | — | FIRST | First layer (empty KV cache) |
| i == N - 1 | — | FINAL | Final layer (lm_head output) |
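The table rules can be applied per layer index as in the sketch below. Because the table does not say which label wins when several rules match (for example, layer 0 can be both FULL_ATTN and FIRST), this sketch returns every matching label instead of guessing a precedence. `layer_labels` is a hypothetical name.

```python
# compress_ratios value -> layer type, as listed in the table.
RATIO_LABELS = {0: "FULL_ATTN", 4: "C4_LIGHT", 128: "C128_HEAVY"}


def layer_labels(i, config):
    """All layer-type labels from the table that apply to layer index i."""
    n = config["num_hidden_layers"]
    ratios = config.get("compress_ratios") or []
    labels = []
    if i < len(ratios) and ratios[i] in RATIO_LABELS:
        labels.append(RATIO_LABELS[ratios[i]])
    # With num_hash_layers == 0 this is never true for a valid layer index.
    if i >= n - config.get("num_hash_layers", 0):
        labels.append("HASH")
    if i == 0:
        labels.append("FIRST")
    if i == n - 1:
        labels.append("FINAL")
    return labels
```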
Kernels are classified by the active ModelProfile's rules. Categories marked
with (DSv4) are specific to the dsv4_csa_hca profile; all profiles include
the universal categories.
| Category | Match Pattern | Profile | Typical Share (DSv4) |
|---|---|---|---|
| ★ MLA Attention | flash_fwd_splitkv_mla | DSv4, DSv3 | 21-33% |
| ★ MoE Fused | fused_moe_kernel | DSv4, DSv3 | 11-17% |
| ● NCCL AllReduce | AllReduce | universal | 5-8% |
| GEMM fp8 | deep_gemm | universal | 12-25% |
| GEMM bf16 | nvjet | universal | 11-13% |
| Hadamard Xform | hadamard | DSv4 | 0-2.4% |
| Indexer Cache | indexer | DSv4 | 0-0.1% |
| Paged MQA | paged_mqa_logits | DSv4 | 0-1.8% |
| MHC | mhc_pre_gemm_sqrsum, mhc_pre_big_fuse, mhc_post_tilelang | DSv4 | 10-15% |
| C4/C128 Prefill | c4_prefill, c128_prefill | DSv4 | 0-0.3% |
| RMSNorm | RMSNorm, rms_normalize | universal | 1-2% |
| FP8 Quant | quant, Quant | universal | 1-2% |
| TopK | topk | universal | 0-0.7% |
| RoPE | deepseek_rope, fused_norm_rope | DSv4, DSv3 | 1-2% |
| Activation | silu_mul_clamp, act_and_mul | universal | 0-0.5% |
| Other | — | universal | 2-5% |
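Read as code, the match patterns in the table amount to a first-match substring lookup. The sketch below assumes case-sensitive substring matching and that rules are tried in table order. Both are assumptions, the ★ and ● markers are dropped from the category names, and `classify_kernel` is a hypothetical name rather than the ModelProfile API.

```python
# (category, substrings) pairs in table order; the first match wins.
CATEGORY_PATTERNS = [
    ("MLA Attention", ("flash_fwd_splitkv_mla",)),
    ("MoE Fused", ("fused_moe_kernel",)),
    ("NCCL AllReduce", ("AllReduce",)),
    ("GEMM fp8", ("deep_gemm",)),
    ("GEMM bf16", ("nvjet",)),
    ("Hadamard Xform", ("hadamard",)),
    ("Indexer Cache", ("indexer",)),
    ("Paged MQA", ("paged_mqa_logits",)),
    ("MHC", ("mhc_pre_gemm_sqrsum", "mhc_pre_big_fuse", "mhc_post_tilelang")),
    ("C4/C128 Prefill", ("c4_prefill", "c128_prefill")),
    ("RMSNorm", ("RMSNorm", "rms_normalize")),
    ("FP8 Quant", ("quant", "Quant")),
    ("TopK", ("topk",)),
    ("RoPE", ("deepseek_rope", "fused_norm_rope")),
    ("Activation", ("silu_mul_clamp", "act_and_mul")),
]


def classify_kernel(name):
    """Return the first category whose pattern occurs in the kernel name."""
    for category, patterns in CATEGORY_PATTERNS:
        if any(pattern in name for pattern in patterns):
            return category
    return "Other"
```

Summing `dur` per category over the kernels of one layer and dividing by the layer total gives shares comparable to the "Typical Share" column.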
Include:
config.json):
layer_timeline_analyzer.py --show-all-passes):
layer_timeline_analyzer.py --fwd-pass N):
layer_kernel_breakdown.py --format compute-flow
# | Half | Category | Simplified Name | dur(us) | % | ts_rel(ms) | Input Dims
dur(us) descending) by default
--format json)