Run any Skill in Manus with one click

$pwd:

ad-layer-visualizer

Name: Ad Layer Visualizer
Author: NVIDIA

// Visualize a specific transformer decoder layer from an AutoDeploy FX graph text dump as a hierarchical DOT/PNG diagram. Optionally annotate nodes with actual GPU kernel names and durations from an nsys trace. Use when the user wants to visualize, inspect, or debug a layer in an AutoDeploy model graph dump. Triggers on: "visualize layer", "show layer", "graph of layer", "layer visualization", "dump graph layer". Assumes graph dumps already exist in a directory (produced by AD_DUMP_GRAPHS_DIR).

Run Skill in Manus

$ git log --oneline --stat

stars:13,702

forks:2,406

updated:May 20, 2026 at 07:35

File Explorer

3 files

SKILL.md

readonly

related-skills.json

same repository

trtllm-model-onboard-multimodal.md

from "NVIDIA/TensorRT-LLM"

Onboard a HuggingFace multimodal model (vision/audio/video + text) to the TensorRT-LLM PyTorch backend. Use when writing a new `tensorrt_llm/_torch/models/modeling_<vlm>.py` plus its input processor and weight mapper, or extending an existing VLM. Not for AutoDeploy — use `ad-model-onboard` for that path.

2026-05-2113.7k

ad-accuracy-debug.md

from "NVIDIA/TensorRT-LLM"

Debug AutoDeploy accuracy regressions vs a reference score (PyTorch backend or published baseline). Use when an AutoDeploy model's eval score is significantly below the reference and the root cause is unknown.

2026-05-2013.7k

ad-add-fusion-transformation.md

from "NVIDIA/TensorRT-LLM"

Claude Code skill (trtllm-agent-toolkit): implement or extend TensorRT-LLM AutoDeploy fusion transforms under transform/library/ in a TensorRT-LLM checkout. Prefer existing kernels and custom ops; use Triton only when no viable existing-kernel path exists. Use ad-graph-dump for AD_DUMP_GRAPHS_DIR workflows. Covers TRT-LLM paths, registry, default.yaml registration, graph validation, tests, and a review checklist — without prescribing profiling tools or throughput targets.

2026-05-2013.7k

ad-graph-dump.md

from "NVIDIA/TensorRT-LLM"

Enable and interpret TensorRT-LLM AutoDeploy FX graph text dumps via AD_DUMP_GRAPHS_DIR. Use when you need before/after graphs per transform, to locate subgraphs, or to confirm a rewrite ran. Paths and behavior are grounded in tensorrt_llm/_torch/auto_deploy (GraphWriter, BaseTransform). Complements ad-add-fusion-transformation.

2026-05-2013.7k

kernel-cute-writing.md

from "NVIDIA/TensorRT-LLM"

Write and implement GPU kernels using NVIDIA CuTe DSL (CUTLASS 4.x Python API) — NOT for Triton, CUDA C++, or conceptual explanations. Trigger only when the user wants to write or implement a kernel, not when asking questions about CuTe DSL concepts or layouts. CuTe DSL uses cute.jit/cute.kernel decorators and cutlass.cute imports. Covers element-wise kernels, GEMM patterns, reductions, memory hierarchy (global/shared/register/TMA), MMA tensor core operations, software pipelining, and framework integration.

2026-05-2013.7k

kernel-tileir-optimization.md

from "NVIDIA/TensorRT-LLM"

Optimize existing Triton kernels for NVIDIA TileIR backend on Blackwell GPUs (sm_100+). Adds TileIR-specific autotune configs: occupancy, num_ctas, TMA descriptors. Covers kernel classification (dot-related, norm-like, elementwise, reduction), type-specific transformations, and PTX-vs-TileIR benchmarking. Triggered by: "optimize for TileIR", "add TileIR configs", "Blackwell optimization", "TMA descriptors", "2CTA mode", "occupancy tuning". Kernels use standard `import triton`; TileIR activates via ENABLE_TILE=1 when nvtriton is installed.

2026-05-2013.7k

package.json

"author": "NVIDIA"

"repository": "NVIDIA/TensorRT-LLM"

View GitHub Repository View Creator Repositories

$ install --global

$ download --local

Run Skill in Manus

$ useful --forSOC

Software DevelopersComputer and Mathematical Occupations15-1252L4

name	ad-layer-visualizer
description	Visualize a specific transformer decoder layer from an AutoDeploy FX graph text dump as a hierarchical DOT/PNG diagram. Optionally annotate nodes with actual GPU kernel names and durations from an nsys trace. Use when the user wants to visualize, inspect, or debug a layer in an AutoDeploy model graph dump. Triggers on: "visualize layer", "show layer", "graph of layer", "layer visualization", "dump graph layer". Assumes graph dumps already exist in a directory (produced by AD_DUMP_GRAPHS_DIR).
license	Apache-2.0
metadata	{"author":"NVIDIA Corporation"}

AutoDeploy Layer Visualizer

Visualize a single transformer decoder layer from an AutoDeploy SSA graph dump. Optionally overlay actual GPU kernel names and durations from an nsys trace.

Prerequisite knowledge: This skill assumes familiarity with the ad-graph-dump skill, which covers how to enable dumps via AD_DUMP_GRAPHS_DIR, file naming conventions, SSA format basics, and GraphModule section structure. Refer to ad-graph-dump for that context.

Inputs

Dump directory — path to a directory containing graph dump .txt files
Layer number — which decoder layer to visualize (e.g. 5)
(Optional) Dump file — specific .txt file. If not given, pick the file with the highest numeric prefix (final transform stage).
(Optional) Trace file — path to .nsys-rep or .sqlite trace file. When provided, GPU kernel names and durations are extracted and annotated onto each node in the visualization.

Workflow

Phase 0: Ask the user

Before starting, ask the user two questions (skip any already answered in their request):

Do you have an nsys trace file? (.nsys-rep or .sqlite) — if yes, kernel names and durations from the trace will be annotated directly on each node in the diagram.
If yes — prefill or decode? The CUDA graph structure differs between prefill and decode phases. Knowing which phase the trace captures helps correctly segment layers and map kernels.

Proceed once both are answered (or the user says no trace).

Phase 1: Select the dump file

If the user didn't specify a file, pick the file with the highest numeric prefix (the final transform stage). See ad-graph-dump for the file naming convention and how lexicographic sort matches pipeline order.

Phase 2: Read the graph dump

Read the selected .txt file. If it contains multiple GraphModule sections (delimited by ======== headers), pick the one labeled monolithic.model or the first/largest one with real operation nodes.

Phase 3: Extract the layer subgraph

This is the core step — you do this yourself by understanding the graph structure. Do NOT delegate to a script.

The dump is an SSA-form graph where each line is one of:

Placeholder: %name : shape : dtype (model inputs like input_ids, kv_cache, etc.)
Operation: %name = namespace.op_name(%input1, %input2, ..., const_args...) : shape : dtype
Output: output(%name1, %name2, ...)

How to identify which nodes belong to layer N:

A transformer decoder layer typically contains these blocks in order:

Input LayerNorm — consumes model_layers_N_input_layernorm_weight
Self-Attention (MLA/MHA/GQA) — consumes model_layers_N_self_attn_* weights (q_proj, k_proj, v_proj, o_proj, kv_a_proj, kv_b_proj, q_a_proj, q_b_proj, etc.)
Post-Attention LayerNorm + Residual — consumes model_layers_N_post_attention_layernorm_weight
FFN / MoE — consumes model_layers_N_mlp_* weights (gate_weight, shared_experts, etc.) and fused_*_N_* fused weight references

Extraction rules:

Seed nodes: Any operation whose arguments include a weight reference matching layers_N_ or layers.N. or a fused weight pattern like fused_*_N_* (where these are non-% references, i.e. weight parameters, not activation outputs from other ops).
Forward follow: From seeds, follow the dataflow forward — if a node consumes the output of a seed/layer node, it belongs to this layer too. Stop when you hit a node that consumes weights from a different layer (e.g. layers_M_ where M ≠ N).
Backward follow: From seeds, follow the dataflow backward to pick up intermediate ops (getitem, sub, floordiv, eq, mul, view, reshape, etc.) that feed into seed nodes. Stop when you hit:
- A placeholder node (model input)
- A node that belongs to a different layer (i.e., it was already identified as part of layer M ≠ N by the seed rule, or its own backward chain reaches a different layer's weights)
- The output of a residual-add from the previous layer (this is the boundary between layers)
Critical: Generic arithmetic ops like sub_1, floordiv_1, mul_1, eq_1 that sit between two layers' MoE routing logic can be tricky. Trace their inputs backward — if they ultimately derive from layer M's weights/operations (not layer N's), they belong to layer M, not layer N. The suffix number on these generic ops does NOT indicate which layer they belong to; you must trace the dataflow.
External inputs: Nodes from outside the layer that feed into layer nodes are "external inputs" (show them as input nodes in the diagram). These include:
- Previous layer's residual output
- Global model buffers (RoPE caches like _ad_rotary_cos_sin_N, batch metadata)
- KV cache placeholders

Phase 4: Produce the JSON

After extraction, output a JSON file at <dump_dir>/<dump_stem>_layer<N>.json with this structure:

{
  "layer": 5,
  "source_file": "085_compile_compile_model.txt",
  "nodes": [
    {
      "id": "noaux_tc_op_default_2",
      "op": "trtllm.noaux_tc_op.default",
      "shape": "(8x8, 8x8)",
      "dtype": "(torch.bfloat16, torch.int32)",
      "group": "moe",
      "sub_group": "moe_router",
      "inputs": ["dsv3_router_gemm_op_default_2"],
      "weight_inputs": [
        {"name": "model_layers_5_mlp_gate_e_score_correction_bias", "shape": "256", "dtype": "torch.bfloat16"}
      ]
    }
  ],
  "edges": [
    {"from": "dsv3_router_gemm_op_default_2", "to": "noaux_tc_op_default_2"}
  ],
  "external_inputs": [
    {
      "id": "trtllm_fused_allreduce_residual_rmsnorm_default_4",
      "label": "Layer 4 residual output",
      "shape": "(2x4x7168, 2x4x7168)"
    }
  ]
}

Node fields:

id: the SSA name (without % prefix)
op: the full operation target string
shape, dtype: output shape and dtype
group: one of norm, attention/mla, moe, mlp, mamba, gdn, other
sub_group (optional): finer classification like q_branch, kv_branch, rope, moe_router, moe_experts, shared_experts
inputs: list of node IDs that this node consumes (only nodes within the layer or external inputs)
weight_inputs: list of weight parameters consumed (name, shape, dtype)

Edge fields:

from, to: node IDs (both must be in nodes or external_inputs)

Group assignment heuristic:

norm: ops consuming input_layernorm or post_attention_layernorm weights
mla/attention: ops consuming self_attn_* weights or named *mla*, *rope*, *sdpa*
moe: ops consuming MoE weights (*moe*, *experts*), router ops (noaux_tc_op, topk), and the arithmetic ops that process router outputs (sub, floordiv, eq, mul between router and MoE fused op)
mlp: ops consuming mlp_* weights that aren't MoE (e.g., shared_experts, gate_proj, up_proj, down_proj)
other: everything else (getitem, view, reshape, etc.) — assign to the same group as neighbors

Phase 5: (Optional) Extract trace kernels

If the user provided a trace file, extract per-layer kernel sequences:

python <skill_dir>/scripts/extract_trace_kernels.py <trace_file> --layer <N> --output <dump_dir>/<dump_stem>_layer<N>_kernels.json

This script uses graphNodeId to extract a single CUDA graph replay, groups kernels by stream, and identifies which streams belong to which layer. It outputs a JSON with per-layer kernel sequences including short names, full names, durations, and stream IDs.

Phase 6: (Optional) Annotate each node with its trace kernels

When trace kernel data is available, you must map GPU kernels to individual FX graph nodes. This is the key step — the render script will display kernel names and durations directly on each node in the diagram.

How to map kernels to nodes:

Read the trace kernel JSON and the layer JSON side by side. For each FX graph node, identify which GPU kernel(s) it corresponds to based on the op type and the kernel execution order. Common mappings:

FX graph op	Trace kernel(s)
`flashinfer_mla_with_cache`	`fmhaSm100...`
`finegrained_fp8_linear`	`fp8_blockscale` + `pack_fp32_to_ue8m0` + `deep_gemm_fp8`
`torch_linear_simple`	`nvjet_gemm` + `splitKreduce`
`flashinfer_rms_norm`	`rms_norm_reduce_fusion`
`flashinfer_fused_add_rms_norm`	`fused_kernel_a5fe...`
`mlir_fused` (input_layernorm)	`fused_kernel_1984...`
`mlir_fused` (post_attn)	`fused_kernel_2e06...`
`trtllm_moe_fused`	`bmm_E4m3...` + `moe_activation_deepseek` + `bmm_Bfloat16...` + `moe_finalize`
`noaux_tc_op` (top-k routing)	`deepseek_v3_topk`
`fused_swiglu_mlp` / `fused_finegrained_fp8_swiglu_mlp`	`deep_gemm_fp8` (×2) + `act_and_mul`
`trtllm_dist_all_reduce`	`nccl_allreduce_symk` or `symm_mem_allreduce`
`symm_mem_all_gather`	`symm_mem_allgather` or `nccl_allgather`
`triton_rope_on_interleaved_qk_inputs`	`mla_rope_assign_qkv`

Add a "trace_kernels" list to each node that has corresponding GPU kernels:

{
  "id": "flashinfer_mla_with_cache_default_5",
  "op": "auto_deploy.flashinfer_mla_with_cache.default",
  "group": "mla",
  "trace_kernels": [
    {"kernel": "fmhaSm100...", "duration_us": 50.1}
  ],
  ...
}

Also add top-level trace_summary for the whole layer:

{
  "layer": 5,
  "trace_summary": {
    "total_duration_us": 650.9,
    "kernel_count": 66,
    "num_streams": 3
  },
  "nodes": [...]
}

The render script will display each node's kernels as ⚡ kernel_name (Xµs) lines below the node label. Nodes without trace_kernels are left unchanged.

Phase 7: Render

Run the bundled visualization script:

python <skill_dir>/scripts/render_layer.py <dump_dir>/<dump_stem>_layer<N>.json --output <dump_dir>/<dump_stem>_layer<N>

This reads the JSON and produces .dot and .png files. If nodes have trace_kernels fields, the script renders kernel names and durations directly on each node's label (prefixed with ⚡).

Present the output paths to the user.

Notes

The script requires graphviz (dot command) to be installed for PNG rendering
If the dump has multiple GraphModules, prefer the monolithic.model module or auto-select the largest
When in doubt about a node's layer membership, trace its weight dependencies all the way back — weight parameter names are the ground truth for layer assignment
The trace extraction script auto-converts .nsys-rep to .sqlite if needed (requires nsys on PATH)
CUDA graph kernels run across multiple streams per layer; the script handles multi-stream grouping automatically

ad-layer-visualizer

More from this repository

AutoDeploy Layer Visualizer

Inputs

Workflow

Phase 0: Ask the user

Phase 1: Select the dump file

Phase 2: Read the graph dump

Phase 3: Extract the layer subgraph

Phase 4: Produce the JSON

Phase 5: (Optional) Extract trace kernels

Phase 6: (Optional) Annotate each node with its trace kernels

Phase 7: Render

Notes

AutoDeploy Layer Visualizer

Inputs

Workflow

Phase 0: Ask the user

Phase 1: Select the dump file

Phase 2: Read the graph dump

Phase 3: Extract the layer subgraph

Phase 4: Produce the JSON

Phase 5: (Optional) Extract trace kernels

Phase 6: (Optional) Annotate each node with its trace kernels

Phase 7: Render

Notes

More from this repository