model-compute-simulation
// Build an operator-level compute template for an LLM and estimate FLOPs/MFU for a serving shape. Use when you need tensor shapes, per-op FLOPs, kernel-to-op MFU mapping, or parallelism what-if analysis.
| name | model-compute-simulation |
| description | Build an operator-level compute template for an LLM and estimate FLOPs/MFU for a serving shape. Use when you need tensor shapes, per-op FLOPs, kernel-to-op MFU mapping, or parallelism what-if analysis. |
Use this when the question is about operator order, tensor dimensions, FLOPs, MFU, or parallelism checks. The simulator loads a model config, builds the representative operator sequence, prints tensor shapes and FLOPs, and can estimate MFU from measured latency.
Before running a simulation, collect or verify these inputs:
| Item | Why it matters | How to obtain | Default if user skips |
|---|---|---|---|
| Model name | Resolves to config in model-config-index.json; determines entire architecture | Ask user or infer from trace context | — (required) |
| Config accuracy | Indexed values may differ from actual serving config (e.g. routed_expert_intermediate_size, compress_ratios) | Ask user to provide config.json or verify key params against HuggingFace | Use indexed values with a caveat |
| GPU type | Determines peak FLOPS for MFU denominator | Ask user | — (required for MFU) |
| dtype (bf16 / fp8) | Affects peak FLOPS selection; fp8 doubles peak | Ask user | bf16 |
| Batch size & seq len | Directly affects FLOPs and tensor shapes | Ask user | B=1, S=1 (decode) |
| TP / DP / EP | TP splits GEMM FLOPs across GPUs; EP splits expert FLOPs | Ask user | TP=8, DP=1, EP=8 |
| Measured latency (ms) | Required for MFU numerator; must be per-GPU forward-pass wall-clock | Ask user or extract from a profiler trace | — (optional, no MFU without it) |
If the model is not in model-config-index.json, ask the user for a
config.json path or add an indexed config before running estimates.
Resolve the model name and load its configuration parameters:
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "<model name>" --list-models
The script resolves the model name against references/model-config-index.json, which stores public HuggingFace config parameters (hidden_size, num_experts, MLA ranks, etc.).
If the model is not indexed, tell the user to provide a config.json path or request an index update.
Run the simulator with batch size, sequence length, and parallelism configuration:
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
--batch-size 1 --seq-len 1 \
--tp 8 --dp 1 --ep 8 \
--gpu h20 --dtype bf16
The simulator prints:
For decode: use --seq-len 1.
For prefill: use --seq-len <prompt_length>.
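As a rough illustration of why the sequence length matters, the sketch below counts FLOPs for a single linear layer under the usual 2·M·K·N matmul convention. The function names, the hidden size of 4096, and the even column-parallel TP split are assumptions for illustration only; the simulator's own accounting may differ.

```python
def gemm_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) matmul, counting multiply and add separately."""
    return 2 * m * k * n


def per_gpu_linear_flops(batch_size: int, seq_len: int,
                         in_features: int, out_features: int, tp: int = 1) -> int:
    """Per-GPU FLOPs for one linear layer.

    Assumption: a column-parallel split that divides out_features evenly
    across TP ranks. Real layers may split along a different axis.
    """
    tokens = batch_size * seq_len
    return gemm_flops(tokens, in_features, out_features // tp)


# Hypothetical 4096 x 4096 projection with TP=8.
decode_flops = per_gpu_linear_flops(1, 1, 4096, 4096, tp=8)      # --seq-len 1
prefill_flops = per_gpu_linear_flops(1, 8192, 4096, 4096, tp=8)  # --seq-len 8192

print(decode_flops, prefill_flops, prefill_flops // decode_flops)
```

For this layer the prefill FLOPs scale linearly with the number of tokens, so an 8192-token prefill does 8192 times the work of a single decode step.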
Provide the measured forward-pass latency to compute MFU:
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
--batch-size 1 --seq-len 1 \
--tp 8 --dp 1 --ep 8 \
--gpu h20 --dtype bf16 \
--measured-ms 15.0
MFU = theoretical_min_time / measured_time × 100%
The simulator prints:
GPU peak FLOPS are loaded from references/gpu-specs.json. The bundled
hardware table includes H20, H100 SXM 80GB, H200 SXM 141GB, and B200 SXM
180GB. Use aliases such as --gpu h100, --gpu h200, or --gpu b200 when
running on those local boxes.
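The MFU formula above can be sketched in a few lines. This assumes the theoretical minimum time is per-GPU FLOPs divided by the GPU's peak FLOPS for the chosen dtype; the function names and the 100 TFLOPS peak are placeholders, not values from references/gpu-specs.json.

```python
def theoretical_min_ms(flops_per_gpu: float, peak_tflops: float) -> float:
    """Time in ms to execute flops_per_gpu at the given peak throughput."""
    return flops_per_gpu / (peak_tflops * 1e12) * 1e3


def mfu_percent(flops_per_gpu: float, peak_tflops: float, measured_ms: float) -> float:
    """MFU = theoretical_min_time / measured_time x 100%."""
    if measured_ms <= 0:
        raise ValueError("measured_ms must be positive")
    return theoretical_min_ms(flops_per_gpu, peak_tflops) / measured_ms * 100.0


# Placeholder numbers: 7.5e11 FLOPs per GPU, 100 TFLOPS peak, 15.0 ms measured.
print(f"{mfu_percent(7.5e11, 100.0, 15.0):.1f}%")
```

Because the measured latency is the denominator, it must be the per-GPU forward-pass wall-clock time; a latency that includes scheduling or network time would understate MFU.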
When you have per-kernel measured latency, compute per-operator MFU by mapping kernel durations to the compute flow.
--kernel-flow (kernel-level MFU, recommended)
Provide per-kernel detail as JSON, then feed it to the simulator for kernel-level MFU analysis. This preserves every kernel row from the compute flow and adds FLOPs/MFU columns.
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
--batch-size 1 --seq-len 8192 \
--tp 8 --dp 1 --ep 8 \
--gpu h20 --dtype bf16 \
--kernel-flow @/tmp/layer3_detail.json
The --kernel-flow parameter accepts a JSON string or @file path. It produces
a kernel-level MFU table that preserves all kernel rows from the compute
flow and adds:
- Mapped Op: which operator this kernel maps to
- FLOPs: operator's total FLOPs
- Theo(us): theoretical minimum time
- MFU%: measured FLOPs utilization
- shape_in→shape_out: operator tensor dimensions

When --kernel-flow is provided, the static per-operator template is omitted
because the kernel-level MFU table already carries per-kernel shape and FLOPs
information. The output keeps the model summary, serving configuration, total
FLOPs, and kernel-level MFU table.
Mapping rules:
FP8 kernel MFU correction: Kernels in categories moe (fused_moe_kernel)
and gemm_fp8 use fp8 math internally even when --dtype bf16 is specified.
For these kernels, the MFU denominator uses the GPU's fp8 peak FLOPS
(2x bf16 peak) instead of bf16 peak. The resulting MFU is marked with a
superscript ⁸ (for example, 63.7%⁸) to show that the fp8 denominator was
used. gemm_bf16 kernels still use the bf16 peak FLOPS denominator.
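The denominator selection described above might look like the following sketch. The function names and the 100 TFLOPS bf16 peak are hypothetical; only the category names, the 2x fp8 ratio, and the ⁸ marker come from the text, and the simulator's actual implementation may differ.

```python
# Categories that use fp8 math internally even under --dtype bf16 (per the text above).
FP8_CATEGORIES = {"moe", "gemm_fp8"}


def kernel_mfu(category: str, flops: float, dur_us: float,
               bf16_peak_tflops: float) -> tuple[float, bool]:
    """Return (MFU percent, whether the fp8 denominator was used)."""
    use_fp8 = category in FP8_CATEGORIES
    # fp8 peak is taken as 2x the bf16 peak.
    peak_tflops = 2.0 * bf16_peak_tflops if use_fp8 else bf16_peak_tflops
    theo_us = flops / (peak_tflops * 1e12) * 1e6
    return theo_us / dur_us * 100.0, use_fp8


def format_mfu(mfu: float, used_fp8: bool) -> str:
    """Append the superscript marker when the fp8 denominator was used."""
    return f"{mfu:.1f}%" + ("⁸" if used_fp8 else "")


# Placeholder kernel: 1e9 FLOPs measured at 10 us on a 100 TFLOPS (bf16) GPU.
for category in ("moe", "gemm_fp8", "gemm_bf16"):
    print(category, format_mfu(*kernel_mfu(category, 1e9, 10.0, 100.0)))
```

With the same FLOPs and duration, the fp8-category kernels report half the MFU of the bf16 kernel, which is why the marker matters when comparing rows.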
--kernel-detail (operator-level MFU, legacy)
Same input as --kernel-flow but outputs an operator-level summary table
(aggregated by operator, not per-kernel). Use when you want a compact view.
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
--batch-size 1 --seq-len 8192 \
--tp 8 --dp 1 --ep 8 \
--gpu h20 --dtype bf16 \
--kernel-ms '{
"mla": 4.922, "moe": 1.644, "allreduce": 0.769,
"hadamard": 0.348, "mhc": 1.388, "gemm_fp8": 1.692,
"gemm_bf16": 0.125, "rmsnorm": 0.227, "quant": 0.311,
"rope": 0.209, "topk": 0.122, "activation": 0.071,
"other": 0.437
}'
The --kernel-ms parameter accepts a JSON object mapping kernel category names
to their measured durations in milliseconds. It uses FLOPs-proportional
distribution across entire categories, which is less precise than --kernel-detail
because generic GEMM categories (gemm_fp8, gemm_bf16) span multiple operator categories.
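A minimal sketch of FLOPs-proportional distribution, assuming each category's measured time is split across the operators mapped to it in proportion to their FLOPs. The function name and the operator names are hypothetical, and the simulator's real mapping may differ.

```python
def distribute_category_ms(category_ms: float, op_flops: dict[str, float]) -> dict[str, float]:
    """Split one category's measured time across its operators by FLOPs share."""
    total = sum(op_flops.values())
    if total == 0:
        # No FLOPs to weight by (e.g. pure memory ops): fall back to an even split.
        return {op: category_ms / len(op_flops) for op in op_flops}
    return {op: category_ms * flops / total for op, flops in op_flops.items()}


# The 1.692 ms gemm_fp8 bucket from the example above, spread over two
# hypothetical operators with a 1:3 FLOPs ratio.
shares = distribute_category_ms(1.692, {"op_a": 1.0e9, "op_b": 3.0e9})
print(shares)
```

This shows the source of the imprecision: an operator that is slow for reasons unrelated to FLOPs still receives only its FLOPs-weighted share of the bucket.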
Output includes:
List known model IDs:
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py --list-models
List known GPU types:
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py --list-gpus
Emit JSON for automation:
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "GLM-5" --format json
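For scripted use, a thin wrapper along these lines could build the command and parse the result. It uses only flags shown in this document; the helper names are hypothetical, and it assumes that --format json can be combined with the other flags and that stdout then contains a single JSON document. The output schema is not documented here, so the sketch returns the parsed object unchanged.

```python
import json
import subprocess

SCRIPT = "skills/model-compute-simulation/scripts/model_compute_simulator.py"


def build_cmd(model, batch_size=1, seq_len=1, tp=8, dp=1, ep=8,
              gpu="h20", dtype="bf16", measured_ms=None):
    """Assemble the simulator command line as an argv list."""
    cmd = [
        "python3", SCRIPT, model,
        "--batch-size", str(batch_size), "--seq-len", str(seq_len),
        "--tp", str(tp), "--dp", str(dp), "--ep", str(ep),
        "--gpu", gpu, "--dtype", dtype,
        "--format", "json",
    ]
    if measured_ms is not None:
        cmd += ["--measured-ms", str(measured_ms)]
    return cmd


def run_simulator(**kwargs):
    """Run the simulator and parse stdout as JSON (assumed to be one document)."""
    proc = subprocess.run(build_cmd(**kwargs), capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)


print(" ".join(build_cmd("GLM-5", measured_ms=15.0)))
```

Passing an argv list rather than a shell string avoids quoting problems with model names that contain spaces or special characters.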
Include:
- Kernel-level MFU table (when --kernel-flow provided):
  # | Half | Category | Simplified Name | dur(us) | % | Mapped Op | FLOPs | Theo(us) | MFU% | shape_in→shape_out
- Operator-level MFU table (when --kernel-detail or --measured-ms provided):
Use scripts/extract_compute_flow_from_trace.py to extract the real operator sequence and tensor dimensions from a torch profiler trace, then compare against the static template as ground truth validation.
# Extract compute flow from a trace
python3 skills/model-compute-simulation/scripts/extract_compute_flow_from_trace.py \
--input /path/to/trace.json.gz --format text
# Compare trace against static template
python3 skills/model-compute-simulation/scripts/extract_compute_flow_from_trace.py \
--input /path/to/trace.json.gz \
--compare qwen3-235b-a22b \
--batch-size 1 --seq-len 1 --tp 8 --ep 8
When the static template or trace extraction cannot fully confirm the compute process (e.g. ambiguous scope, missing shapes, new model architecture), follow this escalation hierarchy:
1. Static template (model_compute_simulator.py + model-config-index.json): fast, covers known models
2. Trace extraction (extract_compute_flow_from_trace.py): validates template against real execution
3. Inference framework source code: when trace is insufficient (missing Input Dims, CUDA Graph replay, compiled kernels without scope), read the model's forward flow directly from the serving framework source:
   - python/sglang/srt/models/<model_name>.py: contains the forward() method with the exact operator sequence, tensor shapes, and parallelism split logic
   - vllm/model_executor/models/<model_name>.py
   - cpp/tensorrt_llm/pyexecutor/py_executor.cpp + model config files

When consulting framework source, focus on:
- forward() method: operator call order and residual connections
- (q_lora_rank, o_lora_rank)

Action: If the framework source reveals discrepancies with the static template, update model-config-index.json and/or build_layer_ops() accordingly.
| Limitation | Detail | Workaround |
|---|---|---|
| record_shapes=True required | Trace must be captured with shape recording enabled; without it, Input Dims fields are absent and FLOPs cannot be computed | SGLang live capture and vLLM torch_profiler_with_stack=true already enable this; TensorRT-LLM requires a py_executor.py override adding record_shapes=True |
| CUDA Graph mode | During graph replay, cpu_op events may only appear once (at capture time); shape information for replayed iterations is not re-recorded | The script detects graph capture phases and annotates affected ops; use eager-mode traces for full coverage |
| TP-sliced dimensions | Trace shows post-TP-split dimensions (e.g. H/TP), not the full-model view | Use --tp in --compare mode to scale trace FLOPs back to full-model equivalents |
| Scope attribution quality | Python scope depends on with_stack=True; some frameworks or compiled paths may produce shallow or missing scope chains | Graceful degradation: ops with unresolved scope are categorized as "other" |
| Not a replacement for static templates | Trace extraction is a validation and discovery tool; static templates remain the primary fast-analysis path | Use trace extraction to verify templates for new models, then update model-config-index.json if discrepancies are found |
- references/model-config-index.json: model configuration parameters (hidden_size, expert counts, MLA ranks, etc.).
- references/gpu-specs.json: GPU peak FLOPS specifications for MFU calculation.
- scripts/extract_compute_flow_from_trace.py: trace-based compute flow extraction and template validation tool.