tt-lang-profile-optimize
// Profile and optimize TT-Lang kernels for performance. Covers auto-profiling, perf summary, signposts, and optimization workflow.
| name | tt-lang-profile-optimize |
| description | Profile and optimize TT-Lang kernels for performance. Covers auto-profiling, perf summary, signposts, and optimization workflow. |
| argument-hint | <kernel-file> |
Before optimizing, ask the user:
Keep real-world constraints in mind throughout. For example, do not move everything to L1 just because the test data fits -- if the real data is larger, you need streaming. Optimizations must hold for the production workload, not just the test case.
Three goals, in priority order:
The kernel MUST use all available cores. If the kernel runs on grid=(1, 1), it is leaving performance on the table. Partition work across cores using the multicore patterns from the tt-lang skill. Check PERF SUMMARY for grid size.
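The partitioning idea can be sketched in plain Python, independent of any TT-Lang API. The function name `partition_tiles`, the `(cols, rows)` grid ordering, and the row-major core numbering are illustrative assumptions; the actual multicore patterns are the ones described in the tt-lang skill.

```python
def partition_tiles(num_tiles, grid):
    """Split num_tiles work items as evenly as possible across a core grid.

    grid is (cols, rows); this ordering is an assumption for illustration.
    Returns a dict mapping (x, y) core coordinates to a half-open
    (start, end) range of tile indices.
    """
    cols, rows = grid
    num_cores = cols * rows
    base, extra = divmod(num_tiles, num_cores)
    ranges = {}
    start = 0
    for core_id in range(num_cores):
        # The first `extra` cores each take one additional tile.
        count = base + (1 if core_id < extra else 0)
        ranges[(core_id % cols, core_id // cols)] = (start, start + count)
        start += count
    return ranges
```

With 10 tiles on a 2x2 grid, the first two cores receive 3 tiles each and the remaining two receive 2 each, so no core sits idle and the imbalance is at most one tile.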
Minimize unnecessary DRAM reads and writes. Key strategies:
Note: if tensors are small enough, moving them to L1 memory space (ttnn.L1_MEMORY_CONFIG) avoids DRAM reads entirely, but this only helps when the data actually fits.
Check PERF SUMMARY for DRAM read/write volumes and effective bandwidth.
Larger DFB shapes (block sizes) mean fewer DMA transfers and better throughput. Keep increasing shape=(R, C) on dataflow buffers until you run out of L1 (~1.5MB per core). This is often a big win.
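Before growing a DFB shape, it can help to estimate its L1 footprint. The sketch below is a rough calculator, not a TT-Lang API. It assumes shapes are counted in 32x32 tiles, elements are 2 bytes (bfloat16), and each buffer is double-buffered; all three are assumptions to adjust for your kernel. The budget is taken from the approximate 1.5MB figure above and ignores any L1 used for other purposes.

```python
TILE_ROWS = 32               # assumed tile height
TILE_COLS = 32               # assumed tile width
L1_BUDGET_BYTES = 1_500_000  # approximate per-core figure from the text


def dfb_bytes(shape, bytes_per_element=2, buffer_factor=2):
    """Estimate L1 bytes for one dataflow buffer of shape=(R, C) tiles."""
    rows, cols = shape
    tile_bytes = TILE_ROWS * TILE_COLS * bytes_per_element
    return rows * cols * tile_bytes * buffer_factor


def fits_in_l1(shapes, budget=L1_BUDGET_BYTES):
    """Return True if all buffers together stay within the assumed budget."""
    return sum(dfb_bytes(s) for s in shapes) <= budget
```

Under these assumptions, three (4, 4) buffers need 196,608 bytes and fit comfortably, while two (16, 16) buffers need about 2 MB and do not. Treat the result as a first estimate and confirm on hardware.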
Ask the user how to verify correctness after each optimization. Examples: numerical comparison against a reference output, an assertion in the test script, visual inspection, or a tolerance threshold. Use these criteria throughout the optimization loop to ensure changes don't break the kernel.
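One possible numerical check is sketched below in plain Python. It combines a Pearson correlation threshold with a maximum absolute error; the function names and the default thresholds are illustrative assumptions, and the user's own criteria take precedence.

```python
import math


def pcc(expected, actual):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(expected)
    mean_e = sum(expected) / n
    mean_a = sum(actual) / n
    cov = sum((e - mean_e) * (a - mean_a) for e, a in zip(expected, actual))
    var_e = sum((e - mean_e) ** 2 for e in expected)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_e == 0 or var_a == 0:
        # Correlation is undefined for constant data; fall back to equality.
        return 1.0 if list(expected) == list(actual) else 0.0
    return cov / math.sqrt(var_e * var_a)


def check_output(expected, actual, min_pcc=0.999, max_abs_err=1e-2):
    """Return True if actual matches expected within both thresholds."""
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same length")
    worst = max(abs(e - a) for e, a in zip(expected, actual))
    return pcc(expected, actual) >= min_pcc and worst <= max_abs_err
```

Flatten the reference and kernel outputs to lists of floats before calling `check_output`, and rerun it after every change in the optimization loop.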
Run with perf profiling on hardware:
# Via run-test.sh:
run-test.sh --perf --hw /path/to/kernel.py
# Or directly:
TT_METAL_DEVICE_PROFILER=1 TT_METAL_PROFILER_MID_RUN_DUMP=1 TT_METAL_DEVICE_PROFILER_NOC_EVENTS=1 TTLANG_PERF_DUMP=1 python /path/to/kernel.py
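If you prefer to drive the direct invocation from a script and keep the log for later inspection, a minimal wrapper might look like the following. The environment variables are the ones listed above; the helper names, the `log_path` argument, and the assumption that the profiler report is written to stdout or stderr are illustrative and should be checked against your setup.

```python
import os
import subprocess
import sys

# Environment flags copied from the direct perf-profiling command.
PERF_FLAGS = {
    "TT_METAL_DEVICE_PROFILER": "1",
    "TT_METAL_PROFILER_MID_RUN_DUMP": "1",
    "TT_METAL_DEVICE_PROFILER_NOC_EVENTS": "1",
    "TTLANG_PERF_DUMP": "1",
}


def perf_env(base_env=None):
    """Return a copy of the environment with the perf flags set."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(PERF_FLAGS)
    return env


def run_perf(kernel_path, log_path):
    """Run the kernel script with perf flags and save its output to log_path."""
    result = subprocess.run(
        [sys.executable, kernel_path],
        env=perf_env(),
        capture_output=True,
        text=True,
    )
    with open(log_path, "w") as log_file:
        # Assumption: the report appears on stdout and/or stderr.
        log_file.write(result.stdout)
        log_file.write(result.stderr)
    return result.returncode
```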
Record:
PERF SUMMARY) -- this is your ground truth
From the baseline, determine the primary bottleneck:
Present your optimization plan to the user before making changes. Include:
Wait for user approval.
Make ONE change at a time. After each change:
Wall time is the metric that matters. Other metrics (cycles, BW) are diagnostic tools to understand why wall time changed.
Iterate as many times as needed. There is no limit on profiling runs. Keep going until you've exhausted the optimization targets or hit diminishing returns.
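Since each iteration changes one thing and wall time is the deciding metric, a small running record makes regressions easy to spot. The sketch below is one way to keep it; the function name, the record layout, and the example labels are hypothetical, and the wall times are whatever you read from each profiling run.

```python
def summarize_runs(runs):
    """Summarize a list of (label, wall_time) pairs, baseline first.

    Returns one dict per run with the speedup relative to the baseline
    and a flag set when the run is slower than the previous one.
    """
    baseline = runs[0][1]
    previous = baseline
    rows = []
    for label, wall_time in runs:
        rows.append({
            "label": label,
            "wall_time": wall_time,
            "speedup_vs_baseline": baseline / wall_time,
            "regressed": wall_time > previous,
        })
        previous = wall_time
    return rows
```

A run flagged as `regressed` is a candidate for reverting before trying the next change.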
Summarize:
The auto-profiler can only profile one kernel invocation at a time. Before profiling, read the file and check if there are multiple @ttl.kernel calls or if a kernel is called in a loop. If so, comment out extra invocations so only the target kernel runs once. If it's ambiguous which kernel to profile, ask the user.
The auto-profiler can produce line-by-line cycle counts. The user may request this, or you may find it helpful when optimizing. To auto-profile, follow the steps below.
Read the input file. Ensure only one kernel invocation will execute (see constraint above).
Run with auto-profiling on hardware:
# Via run-test.sh:
run-test.sh --auto-profile --hw /path/to/kernel.py
# Or directly:
TT_METAL_DEVICE_PROFILER=1 TT_METAL_PROFILER_MID_RUN_DUMP=1 TTLANG_AUTO_PROFILE=1 python /path/to/kernel.py
Then read the output log.
The log contains several sections. Grep for these markers:
- THREAD SUMMARY -- Per-thread cycle counts, op counts, and a compute-vs-memory bound analysis with a visual bar
- PERF SUMMARY -- Grid size, duration, DRAM read/write volumes, effective bandwidth, transfer sizes, barrier counts
- DFB FLOW GRAPH -- JSON describing dataflow buffer producer/consumer relationships and DMA ops
- PIPE GRAPH -- Inter-core pipe communication graph (may be empty if no pipes)
Summarize:
Be specific with numbers.
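As an alternative to grepping by hand, the log can be split on the four section markers with a few lines of Python. This sketch assumes each marker appears on its own header line and that a section runs until the next marker; the exact log layout is not specified here, so verify against a real log before relying on it.

```python
MARKERS = ("THREAD SUMMARY", "PERF SUMMARY", "DFB FLOW GRAPH", "PIPE GRAPH")


def split_sections(log_text, markers=MARKERS):
    """Map each marker found in the log to the text that follows it."""
    sections = {}
    current = None
    for line in log_text.splitlines():
        hit = next((m for m in markers if m in line), None)
        if hit is not None:
            # A header line starts a new section; the header itself is dropped.
            current = hit
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(lines) for name, lines in sections.items()}
```

Markers missing from the log are simply absent from the result, which makes it easy to tell when a section was not produced at all.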
Give the user a command they can run directly on the server for the full auto-profiler result:
TT_METAL_DEVICE_PROFILER=1 TT_METAL_PROFILER_MID_RUN_DUMP=1 TTLANG_AUTO_PROFILE=1 python <path_to_kernel.py>
Do NOT include TTLANG_PERF_DUMP=1 in this command.