| name | attention-mechanisms |
| description | How to implement and understand attention mechanisms in neural networks and LLMs. Use this skill whenever the user needs to build self-attention layers, causal attention, multi-head attention, or understand how attention weights are calculated. Trigger this skill for any task involving attention scores, Q/K/V matrices, attention masking, or transformer architecture components. |
Attention Mechanisms Skill
This skill helps you implement and understand attention mechanisms used in neural networks and large language models (LLMs).
What This Skill Covers
- Self-attention: Computing attention weights between tokens in a sequence
- Scaled dot-product attention: Using Q/K/V matrices with proper scaling
- Causal attention: Masking future tokens for autoregressive generation
- Multi-head attention: Running multiple attention heads in parallel
- Manual calculations: Step-by-step attention weight computation
When to Use This Skill
Use this skill when you need to:
- Implement attention layers from scratch in PyTorch or similar frameworks
- Debug or visualize attention patterns in a model
- Understand how attention weights are calculated
- Build transformer components (encoder/decoder layers)
- Explain attention mechanisms to others
- Convert between manual calculations and code implementations
Core Concepts
Attention Mechanism Overview
Attention allows a model to focus on specific parts of the input when generating each output. It assigns different weights to different inputs based on their relevance.
Key components:
- Query (Q): What we're looking for
- Key (K): What each position contains
- Value (V): What each position contributes
- Attention weights: How much to attend to each position
Step-by-Step Attention Calculation
Step 1: Compute Attention Scores
Calculate the dot product between the query and each key:
attention_score[i] = query · key[i]
For embeddings, this is the sum of element-wise products.
Step 2: Scale the Scores
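Divide by the square root of the key dimension d_k. Dot products grow in magnitude as d_k increases, and large scores push softmax toward a near one-hot output with very small gradients: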
scaled_score = attention_score / sqrt(d_k)
Step 3: Apply Softmax
Normalize scores to get weights that sum to 1:
attention_weight[i] = exp(scaled_score[i]) / sum(exp(scaled_scores))
Step 4: Compute Context Vector
Weighted sum of values using attention weights:
context_vector = sum(attention_weight[i] * value[i])
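The four steps above can be sketched in plain Python for a single query. This is a minimal illustration, not a reference implementation: the function name `attention` and the toy inputs are hypothetical, and subtracting the maximum before `exp` is a numerical-stability choice that the steps above do not mention.

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for one query, using plain Python lists."""
    d_k = len(query)
    # Step 1: dot product of the query with each key
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Step 2: scale by sqrt(d_k)
    scaled = [s / math.sqrt(d_k) for s in scores]
    # Step 3: softmax (max subtraction is an added stability assumption)
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Step 4: weighted sum of the value vectors
    dim = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, context

# Hypothetical toy inputs: the query matches the first key
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
weights, context = attention(query, keys, values)
print(weights, context)
```

Because the query aligns with the first key, the first weight is the larger one and the context vector leans toward the first value.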
Implementation Patterns
Basic Self-Attention (PyTorch)
```python
import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        # Trainable projections for queries, keys, and values
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        # Pairwise scores, scaled by sqrt(d_k) before softmax
        attn_scores = queries @ keys.transpose(-2, -1)
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5,
            dim=-1
        )

        # Weighted sum of values for every position
        context_vec = attn_weights @ values
        return context_vec
```
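One way to sanity-check the math in the forward pass above is to compare it against PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention`. The sketch below assumes PyTorch 2.0 or later (where that function is available); the tensor shapes and seed are hypothetical.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical shapes: batch=2, heads=1, seq_len=4, d_k=8
q = torch.randn(2, 1, 4, 8)
k = torch.randn(2, 1, 4, 8)
v = torch.randn(2, 1, 4, 8)

# Manual computation, same math as the forward pass above
scores = q @ k.transpose(-2, -1)
weights = torch.softmax(scores / k.shape[-1] ** 0.5, dim=-1)
manual = weights @ v

# Built-in implementation (scales by 1/sqrt(d_k) by default)
builtin = F.scaled_dot_product_attention(q, k, v)

matches = torch.allclose(manual, builtin, atol=1e-5)
print(matches)
```

If the two results disagree, the usual suspects are a missing scale factor or a softmax over the wrong dimension.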
Causal Attention (Masked)
For autoregressive LLMs, mask the scores so that each token attends only to itself and earlier tokens:
```python
class CausalAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout=0.0, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        # Upper-triangular mask: 1 marks a future position to hide
        self.register_buffer(
            'mask',
            torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(-2, -1)
        # Set future positions to -inf so softmax gives them zero weight
        attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens],
            -torch.inf
        )
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5,
            dim=-1
        )
        attn_weights = self.dropout(attn_weights)

        context_vec = attn_weights @ values
        return context_vec
```
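A quick way to confirm that masking behaves causally is to perturb the last token and check that the outputs for earlier tokens do not move. The sketch below is a stripped-down functional version under stated assumptions: the helper `causal_attention` is hypothetical, it skips the learned projections and dropout, and the sizes are arbitrary.

```python
import torch

def causal_attention(q, k, v):
    """Functional causal attention on (seq_len, d) tensors; a sketch without dropout."""
    n = q.shape[0]
    scores = q @ k.T
    mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores / k.shape[-1] ** 0.5, dim=-1)
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(5, 4)            # hypothetical: 5 tokens, 4-dim embeddings
out, weights = causal_attention(x, x, x)

# Perturb only the last token and recompute
x_changed = x.clone()
x_changed[-1] += 1.0
out_changed, _ = causal_attention(x_changed, x_changed, x_changed)

# Outputs for tokens 0..3 must not depend on token 4
earlier_unchanged = torch.allclose(out[:-1], out_changed[:-1])
# All weights above the diagonal must be exactly zero
upper_is_zero = bool((weights.triu(diagonal=1) == 0).all())
print(earlier_unchanged, upper_is_zero)
```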
Multi-Head Attention
Run multiple attention heads in parallel:
```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  # width of each head
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # combines head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        # Split d_out into heads: (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Move heads before tokens: (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Per-head scores: (b, num_heads, num_tokens, num_tokens)
        attn_scores = queries @ keys.transpose(-2, -1)
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Merge heads back: (b, num_tokens, d_out)
        context_vec = (attn_weights @ values).transpose(1, 2)
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)
        return context_vec
```
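The `view` and `transpose` calls are the part of multi-head attention that most often goes wrong, so it can help to trace the shapes in isolation. The sketch below uses hypothetical sizes and a random tensor standing in for a projected `queries`/`keys`/`values` tensor; it only demonstrates the split and merge, not the full layer.

```python
import torch

b, num_tokens, d_out, num_heads = 2, 6, 8, 4   # hypothetical sizes
head_dim = d_out // num_heads

x = torch.randn(b, num_tokens, d_out)          # stands in for a projected tensor

# Split: (b, tokens, d_out) -> (b, tokens, heads, head_dim) -> (b, heads, tokens, head_dim)
split = x.view(b, num_tokens, num_heads, head_dim).transpose(1, 2)

# Per-head scores have shape (b, heads, tokens, tokens)
scores = split @ split.transpose(-2, -1)

# Merge: undo the transpose, then flatten the heads back into d_out
merged = split.transpose(1, 2).contiguous().view(b, num_tokens, d_out)

round_trip_ok = torch.equal(merged, x)
print(tuple(split.shape), tuple(scores.shape), round_trip_ok)
```

The `.contiguous()` call matters: after `transpose`, the tensor is no longer laid out contiguously in memory, and `view` would raise an error without it.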
Manual Calculation Example
For the sentence "Hello shiny sun!" with 3D embeddings:
| Word | Embedding |
|---|---|
| Hello | [0.34, 0.22, 0.54] |
| shiny | [0.53, 0.34, 0.98] |
| sun | [0.29, 0.54, 0.93] |
Compute attention for "shiny". To keep the arithmetic short, this example uses the raw embeddings as queries, keys, and values (no learned projections) and skips the sqrt(d_k) scaling step:
- Attention scores (dot products with "shiny" as query):
  - Hello: 0.34×0.53 + 0.22×0.34 + 0.54×0.98 = 0.7842
  - shiny: 0.53×0.53 + 0.34×0.34 + 0.98×0.98 = 1.3569
  - sun: 0.29×0.53 + 0.54×0.34 + 0.93×0.98 = 1.2487
- Apply softmax to get weights:
  - exp(0.7842) = 2.191
  - exp(1.3569) = 3.884
  - exp(1.2487) = 3.486
  - Sum = 9.561
  - Weights: [0.229, 0.406, 0.365]
- Context vector (weighted sum):
  - = 0.229×[0.34, 0.22, 0.54] + 0.406×[0.53, 0.34, 0.98] + 0.365×[0.29, 0.54, 0.93]
  - ≈ [0.399, 0.385, 0.861] (computed with unrounded weights)
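The hand calculation can be reproduced in a few lines of PyTorch, which is a convenient way to check the arithmetic. This is a sketch under the same simplifications as the example: the raw embeddings act as queries, keys, and values, and no sqrt(d_k) scaling is applied.

```python
import torch

# Embeddings from the table above
x = torch.tensor([
    [0.34, 0.22, 0.54],   # Hello
    [0.53, 0.34, 0.98],   # shiny
    [0.29, 0.54, 0.93],   # sun
])

query = x[1]                             # "shiny"
scores = x @ query                       # dot product with every token
weights = torch.softmax(scores, dim=0)   # no sqrt(d_k) scaling, as in the example
context = weights @ x                    # weighted sum of the embeddings

print("scores: ", scores)
print("weights:", weights)
print("context:", context)
```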
Common Issues and Solutions
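Issue: Attention weights are nearly one-hot (or, conversely, all similar)
Solution: Check that you're scaling by sqrt(d_k) exactly once. Without scaling, large scores saturate softmax and a single weight dominates; dividing twice, or by too large a value, flattens the weights toward uniform.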
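To see what scaling does to the softmax output, compare the same scores with and without the sqrt(d_k) division. The values below are hypothetical, chosen only to make the effect visible; the `softmax` helper is a plain-Python stand-in.

```python
import math

def softmax(xs):
    m = max(xs)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64                                 # hypothetical key dimension
raw_scores = [16.0, 8.0, 0.0]            # hypothetical unscaled dot products

unscaled = softmax(raw_scores)
scaled = softmax([s / math.sqrt(d_k) for s in raw_scores])

print([round(w, 3) for w in unscaled])   # almost all weight on the first entry
print([round(w, 3) for w in scaled])     # weight spread across the entries
```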
Issue: Model can't learn
Solution: Ensure Q/K/V matrices are trainable parameters (use nn.Linear or nn.Parameter).
Issue: Future tokens leaking in
Solution: Verify causal mask is applied BEFORE softmax, not after.
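The difference between the two orderings shows up in the row sums. In this sketch (random hypothetical scores, no projections), masking before softmax renormalizes each row over the visible tokens, while masking after softmax leaves rows that no longer sum to 1 and whose remaining weights were still normalized using future scores.

```python
import torch

torch.manual_seed(0)
scores = torch.randn(4, 4)                          # hypothetical attention scores
mask = torch.triu(torch.ones(4, 4), diagonal=1).bool()

# Correct: mask first, then softmax
before = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)

# Incorrect: softmax first, then zero out future positions
after = torch.softmax(scores, dim=-1).masked_fill(mask, 0.0)

print(before.sum(dim=-1))   # every row sums to 1
print(after.sum(dim=-1))    # rows with masked entries sum to less than 1
```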
Issue: Shape mismatches
Solution: Remember the transpose pattern:
- After Q @ K.T: (batch, seq_len, seq_len)
- After softmax: (batch, seq_len, seq_len)
- After weights @ V: (batch, seq_len, d_out)
Testing Your Implementation
Use the scripts/verify_attention.py script to:
- Verify attention weights sum to 1
- Check causal masking works correctly
- Validate multi-head attention shapes