tt-enable-tracing
// TTNN trace capture and replay for eliminating dispatch overhead. Essential for real-time inference and multi-chip performance.
| name | tt-enable-tracing |
| description | TTNN trace capture and replay for eliminating dispatch overhead. Essential for real-time inference and multi-chip performance. |
Trace capture records a sequence of TTNN operations once, then replays them without host dispatch overhead.
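To make the record-once, replay-many idea concrete, here is a toy model in plain Python. It is not TTNN and does not touch hardware; `ToyTrace`, `inp`, and `out` are hypothetical names invented for illustration. It shows why a replayed trace only sees new data when that data is written into the same buffers that were used during capture.

```python
from typing import Callable, List


class ToyTrace:
    """Pure-Python stand-in for a device trace: a fixed list of commands."""

    def __init__(self) -> None:
        self.commands: List[Callable[[], None]] = []

    def record(self, command: Callable[[], None]) -> None:
        command()                      # capture runs the op once...
        self.commands.append(command)  # ...and remembers it for later

    def replay(self) -> None:
        # Replay re-issues the same commands against the same buffers.
        for command in self.commands:
            command()


# Buffers are allocated once, before capture, and reused on every replay.
inp = [0.0, 0.0]
out = [0.0, 0.0]


def scale_then_relu() -> None:
    # Reads `inp` and writes `out` in place; no new buffers are created.
    for i, x in enumerate(inp):
        out[i] = max(0.0, 2.0 * x)


trace = ToyTrace()
trace.record(scale_then_relu)

inp[:] = [1.5, -3.0]  # update the input buffer in place, then replay
trace.replay()
print(out)  # [3.0, 0.0]
```

Rebinding `inp` to a new list instead of writing into it would leave the recorded command reading the old buffer, which is the toy analogue of why trace inputs are updated with an in-place copy.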
When opening the device, reserve space for the trace with trace_region_size:
import ttnn

# Single device
device = ttnn.open_device(device_id=0, trace_region_size=100000000)
# Multi-device mesh (N_CHIPS is the number of chips in the mesh)
ttnn.set_fabric_config(ttnn.FabricConfig.FABRIC_1D)
mesh_device = ttnn.open_mesh_device(ttnn.MeshShape(1, N_CHIPS),
                                    trace_region_size=100000000)
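One way to keep device lifetime tidy is a small context manager around the single-device call above. This is a sketch, not part of TTNN: `traced_device` is a hypothetical helper, and it assumes `ttnn.close_device(device)` is the matching teardown call. The module is passed in as an argument so the helper can be exercised with a stub when no hardware is available.

```python
from contextlib import contextmanager


@contextmanager
def traced_device(ttnn, device_id=0, trace_region_size=100_000_000):
    """Open a device with a trace region reserved and always close it.

    `ttnn` is the imported ttnn module (or a stub with the same two calls).
    """
    device = ttnn.open_device(device_id=device_id,
                              trace_region_size=trace_region_size)
    try:
        yield device
    finally:
        # Assumption: close_device is the counterpart of open_device.
        ttnn.close_device(device)
```

Typical use would be `with traced_device(ttnn) as device:` followed by the capture and replay code, so the device is closed even if capture raises.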
The trace replays the exact recorded command sequence. Everything inside the trace MUST be pure device work:
- ttnn.from_torch, ttnn.to_torch, and ttnn.copy_host_to_device_tensor calls must happen outside the trace.
- Pre-allocate outputs and pass them through output_tensor arguments. This avoids dynamic allocation inside the trace.

# 1. Pre-allocate all tensors that will be used in the trace
trace_input = ttnn.from_torch(dummy_input, dtype=ttnn.bfloat16,
                              layout=ttnn.TILE_LAYOUT, device=device,
                              memory_config=ttnn.DRAM_MEMORY_CONFIG)
# 2. Capture the trace (runs the ops once to record them)
trace_id = ttnn.begin_trace_capture(device, cq_id=0)
result = ttnn.matmul(trace_input, weights)
result = ttnn.relu(result)
ttnn.end_trace_capture(device, trace_id, cq_id=0)
ttnn.synchronize_device(device)
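# 3. Replay with new inputs (no dispatch overhead)
for batch_host_tensor in batches:
    # Overwrite the pre-allocated input in place, outside the trace
    ttnn.copy_host_to_device_tensor(batch_host_tensor, trace_input)
    ttnn.execute_trace(device, trace_id, cq_id=0, blocking=False)
    # Wait for the replay to finish before reading `result`
    ttnn.synchronize_device(device)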
synchronize_device is only needed if you use non-blocking execution. If you pass blocking=True to execute_trace, you don't need it (but you lose the ability to overlap host work).
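The capture and replay steps above can be bundled into a small wrapper so the trace id and output tensor travel together. This is a minimal sketch under stated assumptions: `TraceRunner` and `build_ops` are hypothetical names, the `ttnn` module is injected so the class can be tested against a stub, and `ttnn.release_trace(device, trace_id)` is assumed to be the call that frees a captured trace.

```python
class TraceRunner:
    """Capture a sequence of device ops once, then replay it on demand."""

    def __init__(self, ttnn, device, cq_id=0):
        self.ttnn = ttnn
        self.device = device
        self.cq_id = cq_id
        self.trace_id = None
        self.output = None

    def capture(self, build_ops):
        """Record the ops issued by `build_ops`.

        `build_ops` takes no arguments, issues only device ops on
        pre-allocated tensors, and returns the output tensor.
        """
        self.trace_id = self.ttnn.begin_trace_capture(self.device,
                                                      cq_id=self.cq_id)
        self.output = build_ops()
        self.ttnn.end_trace_capture(self.device, self.trace_id,
                                    cq_id=self.cq_id)
        self.ttnn.synchronize_device(self.device)
        return self.output

    def run(self, host_tensor, device_input, blocking=True):
        """Copy new data into the captured input tensor and replay."""
        if self.trace_id is None:
            raise RuntimeError("capture() must be called before run()")
        self.ttnn.copy_host_to_device_tensor(host_tensor, device_input)
        self.ttnn.execute_trace(self.device, self.trace_id,
                                cq_id=self.cq_id, blocking=blocking)
        # With blocking=False the caller must synchronize before reading.
        return self.output

    def release(self):
        """Free the trace (assumes ttnn.release_trace is available)."""
        if self.trace_id is not None:
            self.ttnn.release_trace(self.device, self.trace_id)
            self.trace_id = None
```

With the earlier example, usage would look like `runner.capture(lambda: ttnn.relu(ttnn.matmul(trace_input, weights)))` followed by `runner.run(batch_host_tensor, trace_input)` per batch. The default here is `blocking=True` for simplicity; switching to `blocking=False` trades that simplicity for the ability to overlap host work.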
Traces work with mesh devices and collective operations:
trace_id = ttnn.begin_trace_capture(mesh_device, cq_id=0)
partial = ttnn.matmul(x_sharded, w_sharded)
reduced = ttnn.all_reduce(partial)
ttnn.end_trace_capture(mesh_device, trace_id, cq_id=0)
ttnn.execute_trace(mesh_device, trace_id, cq_id=0, blocking=True)