ttnn
TTNN operations library reference for Tenstorrent hardware. Covers tensor APIs, ops catalog, model conversion from PyTorch, and memory/layout configuration.
| name | ttnn |
| description | TTNN operations library reference for Tenstorrent hardware. Covers tensor APIs, ops catalog, model conversion from PyTorch, and memory/layout configuration. |
TTNN natively supports multi-chip execution via the MeshDevice abstraction. See multi_device.md for full details.
# Single device
device = ttnn.open_device(device_id=0, trace_region_size=100000000)
# Multi-device mesh (e.g., 4 chips in a row)
N_CHIPS = 4  # example value; set to the number of chips in your mesh
ttnn.set_fabric_config(ttnn.FabricConfig.FABRIC_1D)
mesh_device = ttnn.open_mesh_device(ttnn.MeshShape(1, N_CHIPS),
                                    trace_region_size=100000000)
# Replicate a tensor to all devices
x = ttnn.from_torch(t, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT,
                    device=mesh_device,
                    mesh_mapper=ttnn.ReplicateTensorToMesh(mesh_device))
# Shard a tensor across devices along a dimension (tensor parallelism)
w = ttnn.from_torch(t, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT,
                    device=mesh_device,
                    mesh_mapper=ttnn.ShardTensorToMesh(mesh_device, dim=1))
# Read back sharded results by concatenating
result = ttnn.to_torch(w, mesh_composer=ttnn.ConcatMeshToTensor(mesh_device, dim=1))
# Tensor parallel matmul pattern: column parallel + row parallel + all_reduce
col_out = ttnn.matmul(x_replicated, w_col_sharded) # shard W along dim=1
row_out = ttnn.matmul(col_out, w_row_sharded) # shard W along dim=0
reduced = ttnn.all_reduce(row_out) # sync across chips
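To see why the column-parallel, row-parallel, all_reduce sequence above reproduces the unsharded result, here is a minimal host-side sketch in plain Python. It makes no TTNN calls. The helpers (matmul, split_cols, split_rows, all_reduce_sum) are hypothetical stand-ins that simulate each chip's shard with nested lists, and the 2-chip mesh and matrix values are made up for illustration.

```python
# Host-side simulation of the tensor-parallel pattern; no TTNN calls are made.
# All helper names here are illustrative and not part of the TTNN API.

def matmul(a, b):
    """Plain matrix multiply on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def split_cols(m, n):
    """Shard a matrix along dim=1 (columns) into n equal pieces."""
    step = len(m[0]) // n
    return [[row[i * step:(i + 1) * step] for row in m] for i in range(n)]

def split_rows(m, n):
    """Shard a matrix along dim=0 (rows) into n equal pieces."""
    step = len(m) // n
    return [m[i * step:(i + 1) * step] for i in range(n)]

def all_reduce_sum(parts):
    """Element-wise sum of the per-chip partial results."""
    return [[sum(p[i][j] for p in parts) for j in range(len(parts[0][0]))]
            for i in range(len(parts[0]))]

N_CHIPS = 2
x = [[1, 2], [3, 4]]                    # replicated activation, 2x2
w1 = [[1, 0, 2, 1], [0, 1, 1, 3]]       # 2x4, sharded along dim=1
w2 = [[1, 2], [0, 1], [2, 0], [1, 1]]   # 4x2, sharded along dim=0

w1_shards = split_cols(w1, N_CHIPS)
w2_shards = split_rows(w2, N_CHIPS)

# Each simulated chip multiplies by its column shard, then its matching row shard.
partials = [matmul(matmul(x, w1_shards[c]), w2_shards[c]) for c in range(N_CHIPS)]
reduced = all_reduce_sum(partials)

reference = matmul(matmul(x, w1), w2)   # unsharded computation
print(reduced == reference)
```

Each chip only ever sees a partial sum, which is why the final all_reduce is required: without it, every chip holds a different and incomplete row_out.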
Tensor sharding distributes data across cores for locality and reduced communication. See tensor_sharding.md for height, width, and block sharding strategies.
Large fused kernels can exceed the default kernel config buffer limit (~69KB). The fix is to reduce worker_l1_size, which trades user L1 (for CBs/buffers) for more kernel config space.
# Get the default worker L1 size
default_size = ttnn.device.get_max_worker_l1_unreserved_size()
# Subtract enough for your kernel's config buffer needs
# e.g., fused kernel is ~85KB, so give 88KB (90112 bytes) more config space
device = ttnn.open_device(device_id=0, worker_l1_size=default_size - 90112)
The tradeoff: slightly less L1 available for tile buffers. Start with a small reduction (e.g., 8192) and increase if you still hit the config buffer limit.
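The arithmetic behind the 88KB figure can be sketched as a small helper. This is a plain-Python illustration, not a TTNN API: the function name config_reduction_bytes is hypothetical, and rounding up to 8192-byte steps is an assumption chosen only because it matches both the suggested starting reduction and the 85KB to 88KB example above. The actual alignment the hardware requires is not stated here.

```python
KIB = 1024

def config_reduction_bytes(kernel_config_kib, step_bytes=8192):
    """Round a kernel's config-buffer need up to a whole number of steps.

    step_bytes=8192 is an assumed granularity, not a documented requirement.
    """
    needed = kernel_config_kib * KIB
    steps = -(-needed // step_bytes)  # ceiling division
    return steps * step_bytes

reduction = config_reduction_bytes(85)
print(reduction, reduction // KIB)
```

The returned value is what you would subtract from default_size when passing worker_l1_size to ttnn.open_device.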
TTNN supports captured traces for eliminating host overhead in hot loops. See the tt-enable-tracing skill for setup and usage.
Find the op name in api.rst, then fetch its full documentation:
curl https://docs.tenstorrent.com/tt-metal/latest/ttnn/ttnn/api/ttnn.<OP>.html
For example: api/ttnn.conv2d.html, api/ttnn.matmul.html, api/ttnn.softmax.html.
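If you look up ops often, the URL pattern above is easy to script. The sketch below only builds the URL string from the pattern shown; op_doc_url is a hypothetical helper name, and it does not check that the page exists.

```python
DOCS_BASE = "https://docs.tenstorrent.com/tt-metal/latest/ttnn/ttnn/api"

def op_doc_url(op_name):
    """Return the docs URL for an op, accepting 'matmul' or 'ttnn.matmul'."""
    prefix = "ttnn."
    short = op_name[len(prefix):] if op_name.startswith(prefix) else op_name
    return f"{DOCS_BASE}/ttnn.{short}.html"

for op in ("conv2d", "ttnn.matmul", "softmax"):
    print(op_doc_url(op))
```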
Most TTNN ops accept an output_tensor or optional_output_tensor parameter that lets you write the result into a pre-allocated tensor instead of allocating a new one. This is useful for reusing a single scratch buffer across several ops:
# Pre-allocate a scratch tensor
scratch = ttnn.zeros_like(x, device=device, memory_config=ttnn.DRAM_MEMORY_CONFIG)
# Reuse it across ops
ttnn.relu(x, output_tensor=scratch)
ttnn.add(scratch, bias, output_tensor=scratch)
Look up individual ops in the API reference to check whether they support output_tensor.
TTNN is the high-level operations library for Tenstorrent hardware. It provides a PyTorch-like API for tensor creation, manipulation, and computation on TT devices. TTNN ops run individually (one kernel launch per op call). For fusing multiple ops into a single kernel, use TT-Lang.
- Device placement: ttnn.to_device(tensor, device)
- Layout: ttnn.ROW_MAJOR_LAYOUT or ttnn.TILE_LAYOUT (32x32 tiles, required for most compute ops)
- Memory config: ttnn.DRAM_MEMORY_CONFIG (default, large) or ttnn.L1_MEMORY_CONFIG (fast, limited to ~1.5MB/core)
- Data types: ttnn.bfloat16 (standard), ttnn.float32, ttnn.bfloat8_b, ttnn.uint32
import torch
import ttnn
device = ttnn.open_device(device_id=0)
# Torch -> TTNN
x_torch = torch.randn(1, 1, 64, 64, dtype=torch.bfloat16)
x = ttnn.from_torch(x_torch, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT,
                    device=device, memory_config=ttnn.DRAM_MEMORY_CONFIG)
# Compute
y = ttnn.relu(x)
y = ttnn.matmul(x, y)  # both operands are 64x64 tiled tensors on the device
y = ttnn.softmax(y, dim=-1)
# TTNN -> Torch
result = ttnn.to_torch(y)
ttnn.close_device(device)
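The 32x32 tile size and the ~1.5MB/core L1 figure make it possible to estimate a tensor's footprint before choosing a memory config. The sketch below is plain Python with no TTNN calls. It assumes that TILE_LAYOUT pads the last two dimensions up to multiples of 32, and it treats the L1 figure as a rough ceiling that ignores circular buffers and anything else resident in L1. The helper names are hypothetical, and bfloat8_b is left out because it is a block format whose per-element size is not a fixed byte count.

```python
TILE = 32                                  # tile edge: 32x32 tiles
L1_BYTES_PER_CORE = int(1.5 * 1024 * 1024) # approximate "~1.5MB/core" figure
BYTES_PER_ELEMENT = {"bfloat16": 2, "float32": 4, "uint32": 4}

def round_up(n, multiple):
    """Round n up to the nearest multiple."""
    return -(-n // multiple) * multiple

def tiled_shape(shape):
    """Round the last two dims up to tile multiples; leading dims unchanged."""
    *lead, h, w = shape
    return (*lead, round_up(h, TILE), round_up(w, TILE))

def tiled_bytes(shape, dtype="bfloat16"):
    """Bytes occupied by the padded tensor for a fixed-width dtype."""
    count = 1
    for d in tiled_shape(shape):
        count *= d
    return count * BYTES_PER_ELEMENT[dtype]

shape = (1, 1, 64, 64)
size = tiled_bytes(shape)
print(tiled_shape(shape), size, size <= L1_BYTES_PER_CORE)
print(tiled_shape((1, 1, 50, 70)))         # a shape that needs padding
```

A tensor that passes this check may still not fit in L1 in practice, so treat the comparison as a quick filter rather than a guarantee.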