LN: Vaswani et al. (2017) — Attention Is All You Need
Bibliographic Reference
Citation: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems. https://arxiv.org/abs/1706.03762
Pass 1 — Bird’s Eye View (5 Cs)
| C | Assessment |
|---|---|
| Category | Foundational architecture paper |
| Context | Google Brain / Google Research, NeurIPS 2017. Proposes a neural network architecture based solely on attention mechanisms, eliminating recurrence and convolution entirely |
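| Correctness | Empirically validated on WMT 2014 machine translation: 28.4 BLEU on English-to-German, more than 2 BLEU above the previous best results (ensembles included), and a new single-model state of the art on English-to-French. Now the foundation of nearly every modern LLM |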
| Contributions | (1) Scaled dot-product attention; (2) Multi-head attention; (3) Positional encoding; (4) Encoder-decoder Transformer; (5) Demonstration that attention alone suffices for sequence modelling |
| Clarity | Excellent — precise mathematical formulation with strong empirical grounding |
Relevance: ⭐⭐⭐⭐⭐
This paper is the architectural foundation of every model PUMA uses: Llama 3.2, Mistral 7B, GPT-4o, Claude. Understanding the Transformer is essential for explaining how PUMA’s LLMs process Jira issue text, represent tokens, and generate structured outputs.
Pass 2 — Content
The Core Mechanism: Scaled Dot-Product Attention
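$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$ (query), $K$ (key), and $V$ (value) are linear projections of the input and $d_k$ is the key dimension. The scaling factor $1/\sqrt{d_k}$ keeps the dot products from growing large as $d_k$ increases, which would otherwise push the softmax into regions with extremely small gradients.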
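A minimal sketch of the computation, softmax(Q Kᵀ / sqrt(d_k)) V, in plain Python may help make the shapes concrete. The function names and the toy matrices are illustrative assumptions, not taken from the paper; real implementations use batched tensor operations.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v (lists of rows).

    Returns (output, weights): output is n_q x d_v, weights is n_q x n_k.
    """
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    weights = []
    for q in Q:
        # One scaled dot product per key, then normalise across keys.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    output = [
        [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(d_v)]
        for w in weights
    ]
    return output, weights


# Toy input (hypothetical values): 2 queries attending over 3 key/value pairs.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, w = scaled_dot_product_attention(Q, K, V)
```

Each row of `w` is a probability distribution over the keys, and each row of `out` is the corresponding weighted average of the rows of `V`.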
Multi-Head Attention
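Rather than a single attention function, the Transformer runs attention $h$ times in parallel ($h = 8$ in the paper's base model), each with its own learned projections:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$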
Each head attends to different aspects of the representation — one head may track syntactic dependencies, another semantic relationships.
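The multi-head construction can be sketched as follows: project the input once per head, run attention in each head, concatenate the head outputs, and apply an output projection. This is an illustrative sketch only. The tiny dimensions, the random weights, and all helper names are assumptions; the paper's base model uses d_model = 512 and 8 heads of size 64.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention on lists of rows."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)


def random_matrix(rows, cols, rng):
    """Stand-in for learned weights (hypothetical initialisation)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(X, heads, W_O):
    """Self-attention over X; heads is a list of (W_Q, W_K, W_V) tuples."""
    head_outputs = [
        attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
        for W_Q, W_K, W_V in heads
    ]
    # Concatenate the per-head outputs along the feature dimension.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)


rng = random.Random(0)
n, d_model, h = 3, 8, 2          # toy sizes for illustration
d_k = d_model // h               # per-head dimension, as in the paper
X = random_matrix(n, d_model, rng)
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
W_O = random_matrix(h * d_k, d_model, rng)
Y = multi_head_attention(X, heads, W_O)
```

Because each head works in a `d_model // h` subspace, the total cost stays close to that of a single full-width attention, and the output `Y` has the same shape as the input `X`.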
Positional Encoding
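Since attention has no inherent notion of sequence order, position is injected by adding sinusoidal encodings to the input embeddings:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

where $pos$ is the token position and $i$ indexes pairs of embedding dimensions.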
The authors hypothesise that this may allow the model to extrapolate to sequence lengths longer than those seen during training.
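The sinusoidal scheme is short enough to write out directly: even dimensions take sin(pos / 10000^(2i/d_model)) and odd dimensions take the matching cosine. The sketch below assumes a plain list-of-lists layout and a hypothetical function name; it is meant to show the indexing, not to mirror any particular library.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # `even` is the even dimension index, i.e. 2i in the paper's notation.
        for even in range(0, d_model, 2):
            angle = pos / (10000 ** (even / d_model))
            pe[pos][even] = math.sin(angle)
            if even + 1 < d_model:
                pe[pos][even + 1] = math.cos(angle)
    return pe


pe = positional_encoding(max_len=50, d_model=8)
```

Low dimensions oscillate quickly with position and high dimensions slowly, so each position receives a distinct pattern across the embedding. Position 0 always encodes as alternating 0 and 1, since sin(0) = 0 and cos(0) = 1.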
Why Attention Replaces Recurrence
| Property | RNN/LSTM | Transformer |
|---|---|---|
| Parallelism | Sequential — cannot parallelise across time | Full parallel across all positions |
| Long-range dependencies | Gradient vanishing over long sequences | Direct O(1) path between any two positions |
| Training efficiency | Slow (sequential computation) | Fast (GPU-parallelisable) |
| Interpretability | Hidden state is opaque | Attention weights are inspectable |
This parallelism is what enabled scaling to billions of parameters and training on internet-scale corpora.
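The trade-off can be made concrete with the per-layer figures the paper reports for sequence length n and representation dimension d: self-attention costs on the order of n²·d operations with O(1) sequential steps, while a recurrent layer costs on the order of n·d² with O(n) sequential steps. The helper below only evaluates those expressions with constants dropped; the function name and the example sizes are assumptions for illustration.

```python
def layer_costs(n, d):
    """Order-of-magnitude per-layer costs (constant factors dropped)."""
    return {
        "self_attention": {
            "ops": n * n * d,          # every position attends to every position
            "sequential_steps": 1,     # all positions computed in parallel
            "max_path_length": 1,      # any two positions are directly connected
        },
        "recurrent": {
            "ops": n * d * d,          # one d x d state update per position
            "sequential_steps": n,     # each step waits for the previous one
            "max_path_length": n,      # signal crosses up to n steps
        },
    }


# Sentence-scale input: n is smaller than d, so self-attention does less work.
short = layer_costs(n=64, d=512)

# Long-context input (hypothetical sizes): the n^2 term dominates.
long_ctx = layer_costs(n=128_000, d=4096)
```

Self-attention is cheaper per layer whenever n is smaller than d, which covered typical translation sentences in 2017. At very long sequence lengths the quadratic term takes over, which is one reason context length remains a practical constraint.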
PUMA Integration
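Every model PUMA evaluates (Llama 3.2 8B, Mistral 7B, Phi-3.5 Mini, GPT-4o) is built on this architecture, in its decoder-only form for the open-weight models. The core of how PUMA’s prompts are processed, namely attention over the tokens of the issue description within a bounded context window, is described here. PUMA’s context window constraints (Llama: 128K, Mistral: 32K) trace back to two design points from this paper: the cost of self-attention grows quadratically with sequence length, and a model only handles positions that its positional scheme supports. Note that Llama and Mistral use rotary position embeddings (RoPE) rather than the sinusoidal encodings introduced here.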
Related Notes
- PN-LLM-Models-PUMA — model catalog with Transformer-based architecture notes
- LN-Fedus-2022-SwitchTransformers — MoE extension of the Transformer
- PN-FineTuning-LoRA-Quantization — LoRA modifies Transformer weight matrices
- LN-Wei-2022-ChainOfThought — CoT leverages Transformer reasoning capacity