LN: Vaswani et al. (2017) — Attention Is All You Need

Bibliographic Reference

Citation: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems. https://arxiv.org/abs/1706.03762

Pass 1 — Bird’s Eye View (5 Cs)

C	Assessment
Category	Foundational architecture paper
Context	Google Brain / Google Research, NeurIPS 2017. Proposes a neural network architecture based solely on attention mechanisms, eliminating recurrence and convolution entirely
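Correctness	Empirically validated on the WMT 2014 machine translation benchmarks: 28.4 BLEU on English-to-German and 41.8 BLEU on English-to-French, surpassing previously published single models at a fraction of the training cost. Now the foundation of nearly every modern LLM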
Contributions	(1) Scaled dot-product attention; (2) Multi-head attention; (3) Positional encoding; (4) Encoder-decoder Transformer; (5) Demonstration that attention alone suffices for sequence modelling
Clarity	Excellent — precise mathematical formulation with strong empirical grounding

Relevance: ⭐⭐⭐⭐⭐

This paper is the architectural foundation of every model PUMA uses: Llama 3.2, Mistral 7B, GPT-4o, Claude. Understanding the Transformer is essential for explaining how PUMA’s LLMs process Jira issue text, represent tokens, and generate structured outputs.

Pass 2 — Content

The Core Mechanism: Scaled Dot-Product Attention

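$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$

Here Q (query), K (key), and V (value) are linear projections of the input, and $d_{k}$ is the dimension of the key vectors. The scaling factor $\frac{1}{\sqrt{d_{k}}}$ keeps the dot products from growing too large in high-dimensional spaces: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_{k}$, and large scores push the softmax into regions where its gradients are extremely small.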
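A minimal sketch of the formula, using plain Python lists so that it runs without any dependencies. The function names and the toy matrices are illustrative assumptions, not code from the paper, and the scores are divided by $\sqrt{d_{k}}$ as in the paper.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v (lists of lists). Returns n_q x d_v."""
    d_k = len(K[0])
    d_v = len(V[0])
    scale = 1.0 / math.sqrt(d_k)
    output = []
    for q in Q:
        # Similarity of this query with every key, scaled by 1/sqrt(d_k).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)  # non-negative and sums to 1
        # Each output row is a weighted average of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return output

# Toy input (arbitrary values): 2 queries, 3 key/value pairs, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = scaled_dot_product_attention(Q, K, V)
```

Because the softmax weights are non-negative and sum to one, every output row is a convex combination of the rows of V. A query of all zeros gives uniform weights and therefore returns the plain mean of the value vectors.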

Multi-Head Attention

Rather than a single attention function, the Transformer runs attention $h$ times in parallel with different learned projections:

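$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_{1}, \dots, \mathrm{head}_{h}) W^{O}$, where $\mathrm{head}_{i} = \mathrm{Attention}(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V})$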

Each head can attend to a different aspect of the representation: one head may track syntactic dependencies, another semantic relationships.
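The multi-head computation can be sketched as follows for the self-attention case, where queries, keys, and values are all projections of the same input X. This is an illustrative sketch under stated assumptions: the helper names, the random initialisation, and the toy sizes (n = 4, d_model = 8, h = 2) are hypothetical, and real implementations use batched tensor operations rather than nested lists.

```python
import math
import random

def matmul(A, B):
    """Multiply an (n x m) matrix by an (m x p) matrix, both lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    return matmul([softmax(row) for row in scores], V)

def random_matrix(rows, cols, rng):
    """Small random weights; stands in for learned parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head_self_attention(X, heads, W_O):
    """X: n x d_model. heads: list of (W_Q, W_K, W_V) tuples, one per head."""
    head_outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
                    for W_Q, W_K, W_V in heads]
    # Concat(head_1, ..., head_h): join per-head outputs along the feature axis.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)

rng = random.Random(0)
n, d_model, h = 4, 8, 2
d_head = d_model // h  # each head works in a smaller subspace
X = random_matrix(n, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(h)]
W_O = random_matrix(h * d_head, d_model, rng)
Y = multi_head_self_attention(X, heads, W_O)  # n x d_model
```

Setting the per-head width to d_model / h, as the paper does, keeps the total cost close to that of a single full-width attention head.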

Positional Encoding

Since attention has no inherent notion of sequence order, position is injected via sinusoidal encodings:

$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i / d_{model}}\right)$

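Odd dimensions $2i + 1$ use the cosine of the same argument. The authors hypothesise that this form may let the model extrapolate to sequence lengths longer than those seen during training.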
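A short sketch of how the encoding table can be built. Following the paper, even dimensions use the sine and odd dimensions use the cosine of the same argument. The function name and the toy sizes are assumptions made for illustration.

```python
import math

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # two_i plays the role of 2i in the formula (the even dimension index).
        for two_i in range(0, d_model, 2):
            angle = pos / (10000 ** (two_i / d_model))
            pe[pos][two_i] = math.sin(angle)
            if two_i + 1 < d_model:
                pe[pos][two_i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=8)
```

In the paper the resulting rows are added to the token embeddings before the first layer. Position 0 encodes as alternating 0 and 1, since sin(0) = 0 and cos(0) = 1.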

Why Attention Replaces Recurrence

Property	RNN/LSTM	Transformer
Parallelism	Sequential — cannot parallelise across time	Full parallel across all positions
Long-range dependencies	Gradient vanishing over long sequences	Direct O(1) path between any two positions
Training efficiency	Slow (sequential computation)	Fast (GPU-parallelisable)
Interpretability	Hidden state is opaque	Attention weights are inspectable

This parallelism is what enabled scaling to billions of parameters and training on internet-scale corpora.
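The parallelism contrast in the table can be illustrated with a toy scalar example. It is a deliberately simplified sketch: the recurrence, the weights `w_h` and `w_x`, and the one-dimensional attention are assumptions chosen for brevity, not the architectures compared in the paper.

```python
import math

def rnn_states(xs, w_h=0.5, w_x=1.0):
    """Toy scalar RNN: each state needs the previous one, so the loop is sequential."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

def attention_row(i, xs):
    """Toy scalar self-attention output for position i; depends only on the inputs."""
    scores = [xs[i] * x for x in xs]  # d_k = 1, so the 1/sqrt(d_k) factor is 1
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * x for e, x in zip(exps, xs))

xs = [0.1, -0.4, 0.7, 0.2]
sequential = rnn_states(xs)

# Attention rows can be computed in any order (or concurrently) with identical results.
forward = [attention_row(i, xs) for i in range(len(xs))]
backward = [attention_row(i, xs) for i in reversed(range(len(xs)))][::-1]
```

The recurrent state at step t cannot be computed before step t - 1 finishes, whereas each attention row reads only the raw inputs, which is what lets a GPU evaluate all positions at once.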

PUMA Integration

Every model PUMA evaluates (Llama 3.2 8B, Mistral 7B, Phi-3.5 Mini, GPT-4o) is built on this architecture. The core mechanism by which PUMA’s prompts are processed, attention over the tokenised issue description within a fixed context window, is the one described here. PUMA’s context window constraints (Llama: 128K, Mistral: 32K) trace back to design choices introduced in this paper: the cost of self-attention grows quadratically with sequence length, and position information must be supplied explicitly. Llama and Mistral replace the sinusoidal encoding with rotary position embeddings (RoPE), but the underlying constraint is the same.
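To give a feel for why context length is costly, the sketch below estimates the memory needed to hold one full attention score matrix at the two context lengths quoted above. The assumptions are loud ones: 2 bytes per score (16-bit floats), "K" read as 1024 tokens, and a naive implementation that materialises the whole matrix for a single head in a single layer. Optimised attention kernels avoid storing this matrix, so the figures indicate quadratic scaling rather than the actual memory use of any particular model.

```python
def attention_matrix_bytes(seq_len, bytes_per_score=2):
    """Memory for one seq_len x seq_len score matrix (one head, one layer)."""
    return seq_len * seq_len * bytes_per_score

for name, n in [("32K", 32 * 1024), ("128K", 128 * 1024)]:
    gib = attention_matrix_bytes(n) / 2**30
    print(f"{name}: {gib:.0f} GiB per head per layer if materialised in full")
```

Quadrupling the context from 32K to 128K multiplies this figure by 16, from 2 GiB to 32 GiB under these assumptions.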

Related Notes

PN-LLM-Models-PUMA — model catalog with Transformer-based architecture notes
LN-Fedus-2022-SwitchTransformers — MoE extension of the Transformer
PN-FineTuning-LoRA-Quantization — LoRA modifies Transformer weight matrices
LN-Wei-2022-ChainOfThought — CoT leverages Transformer reasoning capacity