ad-add-fusion-transformation

Name: Ad Add Fusion Transformation
Author: NVIDIA

// Claude Code skill (trtllm-agent-toolkit): implement or extend TensorRT-LLM AutoDeploy fusion transforms under transform/library/ in a TensorRT-LLM checkout. Prefer existing kernels and custom ops; use Triton only when no viable existing-kernel path exists. Use ad-graph-dump for AD_DUMP_GRAPHS_DIR workflows. Covers TRT-LLM paths, registry, default.yaml registration, graph validation, tests, and a review checklist — without prescribing profiling tools or throughput targets.

stars: 13,702

forks: 2,406

updated: May 20, 2026, 07:35

package.json

"author": "NVIDIA"

"repository": "NVIDIA/TensorRT-LLM"

name	ad-add-fusion-transformation
description	Claude Code skill (trtllm-agent-toolkit): implement or extend TensorRT-LLM AutoDeploy fusion transforms under transform/library/ in a TensorRT-LLM checkout. Prefer existing kernels and custom ops; use Triton only when no viable existing-kernel path exists. Use ad-graph-dump for AD_DUMP_GRAPHS_DIR workflows. Covers TRT-LLM paths, registry, default.yaml registration, graph validation, tests, and a review checklist — without prescribing profiling tools or throughput targets.
license	Apache-2.0
tags	["tensorrt-llm","autodeploy","fusion","graph-transform","optimization"]
metadata	{"author":"NVIDIA Corporation"}

AutoDeploy: Add Fusion Transformation Pass

Where this skill applies

This file lives in the trtllm-agent-toolkit plugin. Paths such as tensorrt_llm/..., examples/auto_deploy/..., and tests/... are relative to a TensorRT-LLM source checkout on the user’s machine — not the plugin tree.

After installing the plugin (see the toolkit README.md), skills use the trtllm-agent-toolkit: prefix (for example trtllm-agent-toolkit:ad-add-fusion-transformation).

Related skills in this plugin

Skill	Use it for
ad-graph-dump	Enabling `AD_DUMP_GRAPHS_DIR`, dump file layout, and how to read SSA graph output.
trtllm-codebase-exploration	Mapping existing transforms, custom ops, and search patterns before writing a pass.
trtllm-code-contribution	TensorRT-LLM pre-commit, tests, DCO sign-off, and PR expectations.
triton-kernel-writing	Implementing a Triton op only after existing-kernel lookup fails.
triton-tileir-optimization	Tuning existing Triton kernels for the TileIR backend when that path applies.

Use this skill when you already know which subgraph or pattern you are targeting (from graph dumps, logs, or code reading). For dump capture and file semantics, follow ad-graph-dump first.

When to use this skill

Adding, extending, or reviewing a fusion under AutoDeploy transforms in a TensorRT-LLM tree.

Workflow (concise)

Confirm the pattern in current graph dumps (see ad-graph-dump).
Search for an existing kernel or custom-op path before new Triton or CUDA.
Implement the smallest change that proves correctness and matching; add tests.
Re-run dumps and tests; if outputs drift, separate matching issues from metadata loss from numeric differences.

Finding fusion candidates (lightweight)

Do this before writing a new pass so you work on real graph structure.

Inputs

Graph dump directory from a run with AD_DUMP_GRAPHS_DIR set (see ad-graph-dump).
Model id and active AutoDeploy config (registry YAML, default.yaml overlays).
TensorRT-LLM source tree for kernel and transform lookup.

Outputs

Ordered list of candidates with: graph evidence, existing-kernel lookup (found / not_found), recommendation (use_existing_kernel, needs_triton_fallback, defer), and trade-offs (complexity, correctness risk).

Discovery workflow

Parse dumps for repeated unfused patterns (element-wise chains, norm chains, epilogues, attention-adjacent ops).
Search the tree for equivalent transforms or custom ops; record file/symbol evidence.
If nothing fits, mark Triton or other kernel work as a deliberate fallback.
Prefer candidates with clear recurrence, existing support, and lower numerical risk.
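
As a rough illustration of the first discovery step, the sketch below counts runs of consecutive op targets in a graph text dump. It assumes each op line contains `target=<qualified.op.name>`, as in the text that `torch.fx` prints for a graph; the actual AutoDeploy dump layout may differ, so treat the regex and the sample lines as placeholders. Textual adjacency is only a hint: confirm real dataflow in the dump before treating a frequent chain as a candidate.

```python
import re
from collections import Counter

# Assumption: op lines contain "target=<qualified.op.name>".
# Adjust this regex to whatever the real dump files look like.
TARGET_RE = re.compile(r"target=([\w\.]+)")


def count_op_chains(dump_text: str, chain_len: int = 2) -> Counter:
    """Count runs of `chain_len` consecutive op targets in a graph text dump."""
    targets = []
    for line in dump_text.splitlines():
        match = TARGET_RE.search(line)
        if match:
            targets.append(match.group(1))
    chains = Counter()
    for i in range(len(targets) - chain_len + 1):
        chains[tuple(targets[i:i + chain_len])] += 1
    return chains


# Hypothetical dump excerpt, written in torch.fx print style.
sample = """\
%mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%x, %w), kwargs = {})
%add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%mul, %b), kwargs = {})
%mul_1 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%add, %w2), kwargs = {})
%add_1 : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%mul_1, %b2), kwargs = {})
"""

chains = count_op_chains(sample)
for chain, count in chains.most_common():
    print(count, " -> ".join(chain))
```

Chains that recur once per decoder layer are usually the first ones worth checking against existing transforms and custom ops.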

Per-candidate template

Candidate: <short-name>
Affected graph pattern: <pattern>
Existing kernel lookup: <found|not_found>
Evidence: <path/symbol>
Recommendation: <use_existing_kernel|needs_triton_fallback|defer>
Strengths / weaknesses / risks:
- ...
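
If candidates are tracked in a script rather than by hand, the template above could be modeled as a small record type. This is only a sketch: `FusionCandidate` and its field names are hypothetical, and the check that `use_existing_kernel` requires a successful lookup is one reading of the guardrails, not an API from TensorRT-LLM.

```python
from dataclasses import dataclass, field

# Recommendation values taken from the template above.
RECOMMENDATIONS = {"use_existing_kernel", "needs_triton_fallback", "defer"}


@dataclass
class FusionCandidate:
    """Hypothetical record mirroring the per-candidate template."""

    name: str
    pattern: str
    kernel_found: bool
    evidence: str
    recommendation: str
    risks: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.recommendation not in RECOMMENDATIONS:
            raise ValueError(f"unknown recommendation: {self.recommendation}")
        # Recommending an existing kernel only makes sense if the lookup found one.
        if self.recommendation == "use_existing_kernel" and not self.kernel_found:
            raise ValueError("use_existing_kernel requires a successful kernel lookup")

    def render(self) -> str:
        """Format the record in the same layout as the template."""
        lookup = "found" if self.kernel_found else "not_found"
        lines = [
            f"Candidate: {self.name}",
            f"Affected graph pattern: {self.pattern}",
            f"Existing kernel lookup: {lookup}",
            f"Evidence: {self.evidence}",
            f"Recommendation: {self.recommendation}",
            "Strengths / weaknesses / risks:",
        ]
        lines += [f"- {risk}" for risk in self.risks]
        return "\n".join(lines)


candidate = FusionCandidate(
    name="mul_add_epilogue",  # illustrative name only
    pattern="mul -> add",
    kernel_found=False,
    evidence="<path/symbol>",
    recommendation="defer",
    risks=["pattern seen in one dump only"],
)
print(candidate.render())
```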

Guardrails

Do not skip existing-kernel lookup.
Do not default to Triton when a viable existing op already exists.
If uncertain, defer and narrow the question with one more dump or test.

Inputs (implementation)

Chosen candidate or concrete subgraph.
Active model and config files.
Fresh graph dumps when available.
Current baseline: match counts from logs, unit test status, any accuracy notes you already maintain.

Outputs (implementation)

Pass design or patch: registered transform, default.yaml entry, optional model-registry YAML.
Path decision: existing_kernel_path vs triton_fallback_path (or other kernel stack).
Validation notes: graph evidence, [SUMMARY] matches=... before/after from AutoDeploy logs, test results.

Implementation workflow

Align the pass with observed graph structure from dumps — not assumed op names from docs alone.
Search transform/library/, custom_ops/, torch.ops.auto_deploy.*, and related tests for reuse.
Integrate an existing op when possible; otherwise delegate kernel work to the appropriate skill (triton-kernel-writing, cuda-kernel-writing, etc.).
Keep one logical change per patch; extend tests in the same change.
Re-read dumps after the change; if match counts collapse, suspect pattern availability or metadata propagation.

Where fusion passes live

Transforms: tensorrt_llm/_torch/auto_deploy/transform/library/
Registry / base behavior: tensorrt_llm/_torch/auto_deploy/transform/interface.py
Default transform list: tensorrt_llm/_torch/auto_deploy/config/default.yaml
Dump helper: tensorrt_llm/_torch/auto_deploy/utils/graph_writer.py
Graph utilities: tensorrt_llm/_torch/auto_deploy/utils/node_utils.py, tensorrt_llm/_torch/auto_deploy/utils/_graph.py
Custom ops: tensorrt_llm/_torch/auto_deploy/custom_ops/

Tests (typical):

tests/unittest/auto_deploy/singlegpu/transformations/library/
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (when behavior or numerics may change)

How to add a transform

Implement the pass

Create or update a module under transform/library/ and register the class:

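# Import path assumed from the registry location listed above (transform/interface.py).
from tensorrt_llm._torch.auto_deploy.transform.interface import BaseTransform, TransformRegistry

@TransformRegistry.register("my_transform_key")
class MyTransform(BaseTransform):
    @classmethod
    def get_config_class(cls):
        # MyTransformConfig is a placeholder for this pass's own config class.
        return MyTransformConfig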

Use a dedicated config class only when the pass needs parameters beyond the base transform config.

Register in `default.yaml`

Add a key under transforms: in tensorrt_llm/_torch/auto_deploy/config/default.yaml. Copy the field set from the closest existing transform in the same section of the file (required keys depend on the transform config class and on how peers are declared). New experimental passes should stay enabled: false until covered by tests and dumps.

Enable for a specific model

For targeted rollout, adjust registry YAMLs under examples/auto_deploy/model_registry/configs/ rather than turning on unproven passes globally.

Implementation rules

Prefer existing AutoDeploy / TRT-LLM ops and torch.ops.auto_deploy entries.
Prefer stable, backend-neutral graph contracts; avoid hiding real dataflow in node.meta when an edge should carry it.
Use metadata for observable tensor facts (shape, dtype) and preserve it across rewrites when replacements should remain traceable.
One hypothesis per patch — do not mix unrelated fusions.

Existing kernel first, Triton second

Before Triton:

Search transform/library/ and custom_ops/.
Search torch.ops.auto_deploy.* and TRT-LLM custom op definitions.
Read tests for similar integrations.

Use triton-kernel-writing only when no suitable op exists and you accept owning kernel + integration work.

Validation order

Graph dumps — pattern present, rewrite visible (see ad-graph-dump).
Unit tests for the transform.
Integration or accuracy checks when numerics or end-to-end behavior may change.

Match counts

AutoDeploy logs [SUMMARY] matches=<n> (or skipped / disabled) per transform. Compare before and after your change; a large drop usually indicates pattern or metadata issues, not “slow runs.”
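
A before/after comparison can be scripted once the log layout is known. The sketch below assumes a hypothetical line layout in which the transform key appears as `transform=<key>` somewhere before `[SUMMARY] matches=<n>`; only the `[SUMMARY] matches=<n>` fragment comes from the text above, so check the regex against real logs before relying on it. Lines reporting skipped or disabled transforms carry no match count and are ignored here.

```python
import re

# Assumed layout (hypothetical): "... transform=<key>] ... [SUMMARY] matches=<n>"
SUMMARY_RE = re.compile(r"transform=(?P<key>\w+).*?\[SUMMARY\]\s+matches=(?P<n>\d+)")


def parse_matches(log_text: str) -> dict[str, int]:
    """Map transform key -> match count for every summary line found."""
    counts: dict[str, int] = {}
    for line in log_text.splitlines():
        found = SUMMARY_RE.search(line)
        if found:
            counts[found.group("key")] = int(found.group("n"))
    return counts


def match_deltas(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Return after - before for every transform whose count changed."""
    deltas = {}
    for key in sorted(set(before) | set(after)):
        change = after.get(key, 0) - before.get(key, 0)
        if change != 0:
            deltas[key] = change
    return deltas


# Hypothetical log excerpts; stage and transform names are placeholders.
before_log = (
    "[stage=some_stage, transform=my_transform_key] [SUMMARY] matches=32\n"
    "[stage=some_stage, transform=other_transform] [SUMMARY] matches=4\n"
)
after_log = (
    "[stage=some_stage, transform=my_transform_key] [SUMMARY] matches=0\n"
    "[stage=some_stage, transform=other_transform] [SUMMARY] matches=4\n"
)

deltas = match_deltas(parse_matches(before_log), parse_matches(after_log))
print(deltas)
```

A transform whose count falls to zero, as in this made-up example, is the case to investigate first for pattern availability or metadata propagation.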

Testing expectations

Follow trtllm-code-contribution for repo conventions. Cover:

Happy-path micrograph or exported-graph rewrites.
Failure modes that must not fuse (multiple consumers, mixed consumers).
Metadata preservation when an upstream pass feeds your pattern.
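
The three cases above can be illustrated with a toy graph model before writing the real test. The sketch below is deliberately not `torch.fx` and not AutoDeploy code: `Node` and `fuse_mul_add` are hypothetical stand-ins that show a happy-path rewrite, a refusal to fuse when the producer has more than one consumer, and metadata carried over to the replacement node. It only matches a `mul` that feeds the first input of an `add`.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy graph node: inputs are referenced by producer name."""

    name: str
    op: str
    inputs: list[str] = field(default_factory=list)
    meta: dict = field(default_factory=dict)


def users(graph: list[Node], name: str) -> list[Node]:
    """All nodes that consume the value produced by `name`."""
    return [node for node in graph if name in node.inputs]


def fuse_mul_add(graph: list[Node]) -> tuple[list[Node], int]:
    """Rewrite add(mul(a, b), c) into fused_mul_add(a, b, c).

    The rewrite is skipped when the mul result has more than one consumer,
    because removing the mul would break the other consumers.
    """
    by_name = {node.name: node for node in graph}
    dead: set[str] = set()
    rewritten: list[Node] = []
    matches = 0
    for node in graph:
        producer = by_name.get(node.inputs[0]) if node.op == "add" and node.inputs else None
        if producer is not None and producer.op == "mul" and len(users(graph, producer.name)) == 1:
            # Keep the add node's name and metadata so downstream passes still see them.
            rewritten.append(
                Node(node.name, "fused_mul_add", producer.inputs + node.inputs[1:], dict(node.meta))
            )
            dead.add(producer.name)
            matches += 1
        else:
            rewritten.append(node)
    return [node for node in rewritten if node.name not in dead], matches


happy = [
    Node("x", "input"), Node("w", "input"), Node("b", "input"),
    Node("mul", "mul", ["x", "w"]),
    Node("add", "add", ["mul", "b"], {"dtype": "float16"}),
]
fused, n_fused = fuse_mul_add(happy)

# Same graph plus a second consumer of mul: the pass must leave it alone.
shared = happy + [Node("relu", "relu", ["mul"])]
unfused, n_unfused = fuse_mul_add(shared)
print(n_fused, n_unfused)
```

In a real unit test the same three assertions (match count, untouched graph for the shared-producer case, metadata on the replacement) would be made against the exported graph rather than this toy structure.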

Primary unittest location for library transforms:

tests/unittest/auto_deploy/singlegpu/transformations/library/

Review checklist

Target structure appears in current dumps.
Transform registered and listed in default.yaml consistently with peer entries.
Model-registry toggles are intentional.
Non-zero matches where expected, or skipped is explained.
Before/after dump snippets or diffs saved for the review thread.
Tests cover both success and intentional non-match cases.
If outputs change, classify match loss vs metadata loss vs acceptable numeric drift.

Guardrails

Do not bundle unrelated passes in one change.
If dumps contradict expectations, document what you observed before chasing unrelated hypotheses.

Iteration note (template)

Candidate: <name>
Path: <existing_kernel_path|triton_fallback_path|other>
Rationale:
- ...
Graph validation: <pass|fail — what files / ops>
Summary logs: <matches before / after>
Tests: <what ran>
Open risks:
- ...
