Run any Skill in Manus with one click

$pwd:

ttnn

Name: Ttnn
Author: tenstorrent

// TTNN operations library reference for Tenstorrent hardware. Covers tensor APIs, ops catalog, model conversion from PyTorch, and memory/layout configuration.

Run Skill in Manus

$ git log --oneline --stat

stars:244

forks:37

updated:March 31, 2026 at 16:27

File Explorer

6 files

SKILL.md

readonly

name	ttnn
description	TTNN operations library reference for Tenstorrent hardware. Covers tensor APIs, ops catalog, model conversion from PyTorch, and memory/layout configuration.

External Resources

TTNN Documentation
TT-Metal Repository
TTNN API Reference -- full operation catalog (~400+ ops organized by category)
Tensor Reference -- shapes, layouts, memory configs, data types
Converting PyTorch Models to TTNN -- step-by-step conversion guide
Multi-Device Programming -- MeshDevice, tensor parallelism, data parallelism, CCL ops
Tensor Sharding -- height, width, and block sharding strategies

Multi-Device

TTNN natively supports multi-chip execution via the MeshDevice abstraction. See multi_device.md for full details.

# Single device
device = ttnn.open_device(device_id=0, trace_region_size=100000000)

# Multi-device mesh (e.g., 4 chips in a row)
ttnn.set_fabric_config(ttnn.FabricConfig.FABRIC_1D)
mesh_device = ttnn.open_mesh_device(ttnn.MeshShape(1, N_CHIPS),
                                     trace_region_size=100000000)

# Replicate a tensor to all devices
x = ttnn.from_torch(t, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT,
                     device=mesh_device,
                     mesh_mapper=ttnn.ReplicateTensorToMesh(mesh_device))

# Shard a tensor across devices along a dimension (tensor parallelism)
w = ttnn.from_torch(t, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT,
                     device=mesh_device,
                     mesh_mapper=ttnn.ShardTensorToMesh(mesh_device, dim=1))

# Read back sharded results by concatenating
result = ttnn.to_torch(t, mesh_composer=ttnn.ConcatMeshToTensor(mesh_device, dim=1))

# Tensor parallel matmul pattern: column parallel + row parallel + all_reduce
col_out = ttnn.matmul(x_replicated, w_col_sharded)  # shard W along dim=1
row_out = ttnn.matmul(col_out, w_row_sharded)        # shard W along dim=0
reduced = ttnn.all_reduce(row_out)                    # sync across chips

Sharding

Tensor sharding distributes data across cores for locality and reduced communication. See tensor_sharding.md for height, width, and block sharding strategies.

Custom Program Sizes

Large fused kernels can exceed the default kernel config buffer limit (~69KB). The fix is to reduce worker_l1_size, which trades user L1 (for CBs/buffers) for more kernel config space.

# Get the default worker L1 size
default_size = ttnn.device.get_max_worker_l1_unreserved_size()

# Subtract enough for your kernel's config buffer needs
# e.g., fused kernel is ~85KB, so give 88KB (90112 bytes) more config space
device = ttnn.open_device(device_id=0, worker_l1_size=default_size - 90112)

The tradeoff: slightly less L1 available for tile buffers. Start with a small reduction (e.g., 8192) and increase if you still hit the config buffer limit.

Tracing

TTNN supports captured traces for eliminating host overhead in hot loops. See the tt-enable-tracing skill for setup and usage.

Looking Up Op Documentation

Find the op name in api.rst, then fetch its full documentation:

curl https://docs.tenstorrent.com/tt-metal/latest/ttnn/ttnn/api/ttnn.<OP>.html

For example: api/ttnn.conv2d.html, api/ttnn.matmul.html, api/ttnn.softmax.html.

Output Tensors and Scratch Memory

Most TTNN ops accept an output_tensor or optional_output_tensor parameter that lets you write the result into a pre-allocated tensor instead of allocating a new one. This is useful for:

Performance: avoids repeated allocation/deallocation overhead
Tracing: required for pre-allocating all tensors before trace capture
Scratch buffers: reuse the same tensor across ops or loop iterations

# Pre-allocate a scratch tensor
scratch = ttnn.zeros_like(x, device=device, memory_config=ttnn.DRAM_MEMORY_CONFIG)

# Reuse it across ops
ttnn.relu(x, output_tensor=scratch)
ttnn.add(scratch, bias, output_tensor=scratch)

Look up individual ops in the API reference to check whether they support output_tensor.

Overview

TTNN is the high-level operations library for Tenstorrent hardware. It provides a PyTorch-like API for tensor creation, manipulation, and computation on TT devices. TTNN ops run individually (one kernel launch per op call). For fusing multiple ops into a single kernel, use TT-Lang.

Key Concepts

Tensors must be moved to device before computation: ttnn.to_device(tensor, device)
Layouts: ttnn.ROW_MAJOR_LAYOUT or ttnn.TILE_LAYOUT (32x32 tiles, required for most compute ops)
Memory configs: ttnn.DRAM_MEMORY_CONFIG (default, large) or ttnn.L1_MEMORY_CONFIG (fast, limited ~1.5MB/core)
Data types: ttnn.bfloat16 (standard), ttnn.float32, ttnn.bfloat8_b, ttnn.uint32

Common Patterns

import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Torch -> TTNN
x_torch = torch.randn(1, 1, 64, 64, dtype=torch.bfloat16)
x = ttnn.from_torch(x_torch, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT,
                     device=device, memory_config=ttnn.DRAM_MEMORY_CONFIG)

# Compute
y = ttnn.relu(x)
y = ttnn.matmul(a, b)
y = ttnn.softmax(x, dim=-1)

# TTNN -> Torch
result = ttnn.to_torch(y)

ttnn.close_device(device)

related-skills.json

same repository

model-bringup-cpu.md

from "tenstorrent/tt-forge"

Write a ForgeModel-compatible loader for a HuggingFace model, validate it on CPU, and push the result to a branch on tenstorrent/tt-forge-models.

2026-05-20244

model-bringup-tt-hardware.md

from "tenstorrent/tt-forge"

Install tt-forge, run the model loader from the cpu bringup branch on Tenstorrent hardware, iterate on failures, and open a PR to tenstorrent/tt-forge-models on success.

2026-05-20244

tt-bug-report.md

from "tenstorrent/tt-forge"

File a bug report with a reproducer against Tenstorrent repos (tt-lang, tt-metal, tt-xla)

2026-03-31244

tt-connect-remote-device.md

from "tenstorrent/tt-forge"

Set up and verify remote connection to Tenstorrent hardware. Provides tools for running kernels, copying files, and reading logs on remote devices.

2026-03-31244

tt-enable-tracing.md

from "tenstorrent/tt-forge"

TTNN trace capture and replay for eliminating dispatch overhead. Essential for real-time inference and multi-chip performance.

2026-03-31244

tt-lang-profile-optimize.md

from "tenstorrent/tt-forge"

Profile and optimize TT-Lang kernels for performance. Covers auto-profiling, perf summary, signposts, and optimization workflow.

2026-03-31244

package.json

"author": "tenstorrent"

"repository": "tenstorrent/tt-forge"

View GitHub Repository View Creator Repositories

$ install --global

$ download --local

Run Skill in Manus

$ useful --forSOC

Software DevelopersComputer and Mathematical Occupations15-1252L4

# Single device device = ttnn.open_device(device_id=0, trace_region_size=100000000) # Multi-device mesh (e.g., 4 chips in a row) ttnn.set_fabric_config(ttnn.FabricConfig.FABRIC_1D) mesh_device = ttnn.open_mesh_device(ttnn.MeshShape(1, N_CHIPS), trace_region_size=100000000) # Replicate a tensor to all devices x = ttnn.from_torch(t, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=mesh_device, mesh_mapper=ttnn.ReplicateTensorToMesh(mesh_device)) # Shard a tensor across devices along a dimension (tensor parallelism) w = ttnn.from_torch(t, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=mesh_device, mesh_mapper=ttnn.ShardTensorToMesh(mesh_device, dim=1)) # Read back sharded results by concatenating result = ttnn.to_torch(t, mesh_composer=ttnn.ConcatMeshToTensor(mesh_device, dim=1)) # Tensor parallel matmul pattern: column parallel + row parallel + all_reduce col_out = ttnn.matmul(x_replicated, w_col_sharded) # shard W along dim=1 row_out = ttnn.matmul(col_out, w_row_sharded) # shard W along dim=0 reduced = ttnn.all_reduce(row_out) # sync across chips

# Get the default worker L1 size default_size = ttnn.device.get_max_worker_l1_unreserved_size() # Subtract enough for your kernel's config buffer needs # e.g., fused kernel is ~85KB, so give 88KB (90112 bytes) more config space device = ttnn.open_device(device_id=0, worker_l1_size=default_size - 90112)

# Pre-allocate a scratch tensor scratch = ttnn.zeros_like(x, device=device, memory_config=ttnn.DRAM_MEMORY_CONFIG) # Reuse it across ops ttnn.relu(x, output_tensor=scratch) ttnn.add(scratch, bias, output_tensor=scratch)

import torch import ttnn device = ttnn.open_device(device_id=0) # Torch -> TTNN x_torch = torch.randn(1, 1, 64, 64, dtype=torch.bfloat16) x = ttnn.from_torch(x_torch, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device, memory_config=ttnn.DRAM_MEMORY_CONFIG) # Compute y = ttnn.relu(x) y = ttnn.matmul(a, b) y = ttnn.softmax(x, dim=-1) # TTNN -> Torch result = ttnn.to_torch(y) ttnn.close_device(device)

ttnn

External Resources

Multi-Device

Sharding

Custom Program Sizes

Tracing

Looking Up Op Documentation

Output Tensors and Scratch Memory

Overview

Key Concepts

Common Patterns

More from this repository

More from this repository

External Resources

Multi-Device

Sharding

Custom Program Sizes

Tracing

Looking Up Op Documentation

Output Tensors and Scratch Memory

Overview

Key Concepts

Common Patterns