---
title: "Attention Is All You Need"
arxiv_id: "1706.03762"
date_read: 2024-03-25
reading_time: 45min
difficulty: ⭐⭐⭐☆☆
---

# Attention Is All You Need

> [!abstract] Abstract
> The dominant sequence transduction models are based on complex recurrent or convolutional neural networks...

## 1. Introduction

Recurrent neural networks, long short-term memory and gated recurrent neural networks...

### 1.1 Background

The Transformer uses multi-headed self-attention...

## 2. Model Architecture

![Model Architecture](figures/model-architecture.png)

### 2.1 Encoder and Decoder Stacks

**Encoder**: The encoder is composed of a stack of N = 6 identical layers...

**Decoder**: The decoder is also composed of a stack of N = 6 identical layers...

## 3. Attention

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

## 4. Experiments

| Model | BLEU | Training Time |
|-------|------|---------------|
| Transformer (big) | 28.4 | 3.5 days |
| Transformer (base) | 27.3 | 12 hours |

## Key Insights

1. Self-attention allows modeling of dependencies regardless of distance
2. Multi-head attention enables attending to information from different positions
3. Positional encoding is necessary since the model contains no recurrence

## Questions

- [ ] Why divide by sqrt(d_k)?
- [ ] How does multi-head attention work in detail?
- [ ] What are the computational complexity trade-offs?

## References

1. Vaswani et al. (2017) - This paper
2. Bahdanau et al. (2015) - Attention mechanism
3. Gehring et al. (2017) - Convolutional sequence models

{
  "methods": [
    {
      "name": "Scaled Dot-Product Attention",
      "equation": "Attention(Q,K,V) = softmax(QK^T / sqrt(d_k))V",
      "purpose": "Compute attention weights efficiently"
    },
    {
      "name": "Multi-Head Attention",
      "description": "Run multiple attention operations in parallel",
      "benefit": "Attend to information from different positions"
    }
  ]
}

# Self-Attention

## Definition

Self-attention is a mechanism that relates different positions of a single sequence...

## How It Works

1. Compute Query, Key, Value matrices
2. Calculate attention weights
3. Apply weights to values
4. Output weighted sum

## Why It Matters

- Captures long-range dependencies
- Highly parallelizable
- Interpretable attention patterns

## Applications

- Machine translation
- Text summarization
- Image generation

## Related Concepts

- [[Attention Mechanism]]
- [[Transformer]]
- [[Multi-Head Attention]]
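The four steps listed under "How It Works" in the concept note above can be sketched in plain Python. This is a minimal illustration rather than the tool's or the paper's implementation: the helper names (`matmul`, `transpose`, `softmax`, `self_attention`) and the projection matrices `w_q`, `w_k`, `w_v` are assumptions introduced here, and nested lists stand in for tensors so that the example needs no third-party packages.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a matrix."""
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over one sequence.

    x holds one row per position; w_q, w_k, w_v are hypothetical
    projection matrices chosen by the caller.
    """
    # Step 1: compute Query, Key, and Value matrices from the same input.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Step 2: attention weights = softmax(Q K^T / sqrt(d_k)), row by row.
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Steps 3 and 4: apply the weights to the values and return the weighted sum.
    return matmul(weights, v), weights


# Three positions with two features each; identity projections keep it readable.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = self_attention(tokens, identity, identity, identity)
```

Each row of `weights` sums to 1, so every output row is a convex combination of the value rows. A real model would learn the projection matrices and use a tensor library instead of nested lists.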

# Multi-Head Attention

## Overview

Runs multiple self-attention operations in parallel...

## Implementation

```python
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        # Per-head dimension; assumes d_model is divisible by num_heads.
        self.d_k = d_model // num_heads
        # ...
```

## Reading Workflow

### Level 1: Quick Scan (5-10 min)

- Title and abstract
- Introduction
- Conclusion
- Key figures

**Output**: High-level understanding

### Level 2: Standard Read (30-60 min)

- All sections
- Important equations
- Key experiments
- Method details

**Output**: Detailed notes + questions

### Level 3: Deep Dive (2-4 hours)

- Every section in detail
- Derive equations
- Reproduce experiments
- Related work

**Output**: Comprehensive understanding + implementation

## Advanced Features

### Batch Processing

Process multiple PDFs:

```bash
python pdf-reader.py --dir D:\Papers\ --batch
```

| Key | Action |
|-----|--------|
| n | Next section |
| p | Previous section |
| q | Ask question |
| s | Save note |
| h | Highlight text |
| f | Find in paper |
| Esc | Exit |


name	PDF Reader
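description	"Read and analyze PDF papers in depth. Supports PDF-to-Markdown conversion, intelligent summarization, key information extraction, Q&A-style learning, and note generation."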

pdf-reader


When to Use

Quick Start

Read a PDF

Read from arXiv

Ask Questions

Manual Run

Features

1. PDF to Markdown Conversion

2. Intelligent Summarization

3. Information Extraction

4. Q&A Mode

5. Note Generation

Advantages

Trade-offs

Comparison Mode

Export Options

Integration with Zotero

Read from Zotero

Save Back to Zotero

Customization

Custom Prompts

Reading Templates

Troubleshooting

PDF Extraction Failed

Poor Quality Conversion

LLM Rate Limit

Best Practices

Keyboard Shortcuts (when using interactive mode)

Related Skills

Future Enhancements