LN: Vaswani et al. (2017) — Attention Is All You Need

Bibliographic Reference

Citation: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems. https://arxiv.org/abs/1706.03762


Pass 1 — Bird’s Eye View (5 Cs)

| C | Assessment |
| --- | --- |
| Category | Foundational architecture paper |
| Context | Google Brain / Google Research, NeurIPS 2017. Proposes a neural network architecture based solely on attention mechanisms, eliminating recurrence and convolution entirely. |
| Correctness | Empirically validated on the WMT 2014 translation benchmarks: 28.4 BLEU on English-to-German, more than 2 BLEU above the previous best results (ensembles included), and 41.8 BLEU on English-to-French, a new single-model state of the art. Now the foundation of nearly every modern LLM. |
| Contributions | (1) Scaled dot-product attention; (2) Multi-head attention; (3) Positional encoding; (4) Encoder-decoder Transformer; (5) Demonstration that attention alone suffices for sequence modelling |
| Clarity | Excellent: precise mathematical formulation with strong empirical grounding |

Relevance: ⭐⭐⭐⭐⭐

This paper is the architectural foundation of every model PUMA uses: Llama 3.2, Mistral 7B, GPT-4o, Claude. Understanding the Transformer is essential for explaining how PUMA’s LLMs process Jira issue text, represent tokens, and generate structured outputs.


Pass 2 — Content

The Core Mechanism: Scaled Dot-Product Attention

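$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

where Q (query), K (key), and V (value) are linear projections of the input and $d_k$ is the dimension of the keys. The scaling factor $1/\sqrt{d_k}$ prevents the dot products from growing too large in high-dimensional spaces, which would push the softmax into regions with vanishingly small gradients. If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$, so dividing by $\sqrt{d_k}$ brings the variance back to 1.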
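Scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, can be sketched in a few lines of plain Python. This is a minimal illustration rather than the paper's implementation: matrices are lists of rows, there is no masking or batching, and the toy Q, K, V values are made up.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q: n x d_k, K: m x d_k, V: m x d_v, all given as lists of rows."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    output, weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by 1/sqrt(d_k).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Each output row is a weighted average of the value rows.
        output.append([sum(wj * v[c] for wj, v in zip(w, V))
                       for c in range(len(V[0]))])
    return output, weights


# Toy input (made up for illustration): 2 queries, 3 keys/values, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
```

Each row of `attn` sums to 1, so every output row is a convex combination of the value rows. The first query aligns equally with the first and third keys, so those two receive equal weight.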

Multi-Head Attention

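Rather than a single attention function, the Transformer runs attention $h$ times in parallel with different learned projections:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)$$

Each head can attend to a different representation subspace: one head may track syntactic dependencies, another semantic relationships. The paper uses $h = 8$ heads with $d_k = d_v = d_{\text{model}}/h = 64$, so the total cost is similar to single-head attention at full dimensionality.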
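The multi-head idea (project the input separately for each head, attend, concatenate, project once more) can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names, the random weights, and the toy sizes (n = 4, d_model = 8, h = 2) are made up, whereas the paper uses d_model = 512 and h = 8 with learned weights.

```python
import math
import random


def matmul(A, B):
    """Multiply an (n x p) matrix by a (p x m) matrix, both lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for matrices given as lists of rows."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(x * y for x, y in zip(q, k)) / math.sqrt(d_k) for k in K]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[c] for e, v in zip(exps, V))
                    for c in range(len(V[0]))])
    return out


def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix (illustration only)."""
    return [[rng.gauss(0.0, 1.0 / math.sqrt(rows)) for _ in range(cols)]
            for _ in range(rows)]


def multi_head_self_attention(X, heads, W_o):
    """heads: one (W_q, W_k, W_v) triple per head; W_o: (h * d_v) x d_model."""
    head_outputs = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
                    for W_q, W_k, W_v in heads]
    # Concatenate the per-head outputs along the feature axis, position by position.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_o)


rng = random.Random(0)
n, d_model, h = 4, 8, 2            # toy sizes chosen for illustration
d_k = d_v = d_model // h
X = random_matrix(n, d_model, rng)
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
W_o = random_matrix(h * d_v, d_model, rng)
Y = multi_head_self_attention(X, heads, W_o)
```

The output `Y` has the same shape as the input `X` (n x d_model), which is what lets Transformer layers be stacked.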

Positional Encoding

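Since attention has no inherent notion of sequence order, position is injected via sinusoidal encodings that are added to the input embeddings:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Here $pos$ is the token position and $i$ indexes pairs of embedding dimensions. The authors hypothesise that this may allow the model to extrapolate to sequence lengths longer than those seen during training; the paper does not demonstrate this, and learned positional embeddings gave nearly identical results in their experiments.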
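The sinusoidal encoding, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), can be computed directly. The function name and the table sizes below are illustrative choices, and the sketch assumes an even d_model.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal encodings (d_model even)."""
    table = []
    for pos in range(max_len):
        row = [0.0] * d_model
        # `i` steps over the even dimensions, so i / d_model equals 2i'/d_model
        # in the paper's notation, where i' indexes sine/cosine pairs.
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row[i] = math.sin(angle)
            row[i + 1] = math.cos(angle)
        table.append(row)
    return table


pe = positional_encoding(max_len=50, d_model=16)
```

Row `pe[pos]` is added element-wise to the embedding of the token at position `pos`. Position 0 encodes as alternating 0 and 1, because sin(0) = 0 and cos(0) = 1.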

Why Attention Replaces Recurrence

| Property | RNN/LSTM | Transformer |
| --- | --- | --- |
| Parallelism | Sequential; cannot parallelise across time | Full parallel across all positions |
| Long-range dependencies | Gradient vanishing over long sequences | Direct O(1) path between any two positions |
| Training efficiency | Slow (sequential computation) | Fast (GPU-parallelisable) |
| Interpretability | Hidden state is opaque | Attention weights are inspectable |

This parallelism is what enabled scaling to billions of parameters and training on internet-scale corpora.
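The trade-off can be made concrete with the asymptotic per-layer costs reported in Table 1 of the paper, for sequence length n and representation dimension d. The sketch below drops constant factors, so the numbers are orders of magnitude rather than measured runtimes; the function name and the values n = 100, d = 512 are illustrative assumptions.

```python
def layer_costs(n, d):
    """Asymptotic per-layer costs from Table 1 of the paper, constants dropped."""
    return {
        # Self-attention: O(n^2 * d) work, O(1) sequential steps, O(1) path length.
        "self_attention": {"ops": n * n * d, "sequential_steps": 1, "max_path": 1},
        # Recurrent layer: O(n * d^2) work, O(n) sequential steps, O(n) path length.
        "recurrent": {"ops": n * d * d, "sequential_steps": n, "max_path": n},
    }


costs = layer_costs(n=100, d=512)
```

With n smaller than d, as in this example, the self-attention layer does less total work and needs only one sequential step, while the recurrent layer needs n. The advantage in total work reverses once n grows well beyond d, because the attention cost is quadratic in n.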


PUMA Integration

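Every model PUMA evaluates (Llama 3.2 8B, Mistral 7B, Phi-3.5 Mini, GPT-4o) descends from this architecture; the open-weight models are decoder-only variants of it. The way PUMA’s prompts are processed (attention over the tokenised issue description, within a bounded context window) follows from the design described here. PUMA’s context window constraints (Llama: 128K, Mistral: 32K) are a downstream consequence of two properties introduced in this paper: the cost of self-attention grows quadratically with sequence length, and position information must be injected explicitly, so a model is only reliable over the range of positions it was trained on. Llama and Mistral replace the paper’s sinusoidal encodings with rotary position embeddings (RoPE), but the constraint has the same origin.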
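One reason long context windows are expensive is that self-attention compares every token with every other token, so the score matrix has n x n entries per head per layer. The arithmetic below is a rough sketch: it assumes "32K" and "128K" mean 32 x 1024 and 128 x 1024 tokens, and it counts entries of a naively materialised score matrix, which optimised attention kernels avoid storing even though the compute stays quadratic.

```python
def attention_score_entries(context_tokens):
    """Entries in one n x n attention score matrix (one head, one layer)."""
    return context_tokens * context_tokens


mistral_ctx = 32 * 1024     # "32K" read as 32,768 tokens (assumption)
llama_ctx = 128 * 1024      # "128K" read as 131,072 tokens (assumption)

ratio = attention_score_entries(llama_ctx) / attention_score_entries(mistral_ctx)
```

Quadrupling the context length multiplies the pairwise comparisons by 16, which is why filling a long context window costs far more than its token count alone suggests.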

MOCs