PN: Transformer Architecture and Mixture of Experts — Technical Reference for PUMA Models

Core Idea

The Transformer (Vaswani et al., 2017) is the architecture underlying all LLMs used in PUMA. It replaces recurrence with self-attention, enabling parallel sequence processing and scaling to billions of parameters. Mixture of Experts (MoE), as in Switch Transformers (Fedus et al., 2022), extends the Transformer FFN layer with sparse routing: each token activates only a subset of parameter “experts”, giving higher total capacity at roughly constant per-token compute. DeepSeek-V3, Mixtral, and other PUMA models use MoE.


The Standard Transformer Block

Each of these LLMs is a stack of N structurally identical Transformer blocks (each block has its own weights). Each block has two sub-layers:

Input Embedding + Positional Encoding
         │
┌────────▼────────────────────────────┐
│  Block 1 (repeated N times)         │
│                                     │
│   ┌─────────────────────────────┐   │
│   │   Multi-Head Self-Attention │   │
│   └──────────────┬──────────────┘   │
│           Add & LayerNorm           │
│   ┌──────────────▼──────────────┐   │
│   │   Feed-Forward Network (FFN)│   │
│   └──────────────┬──────────────┘   │
│           Add & LayerNorm           │
└────────────────────┬────────────────┘
                     │
              Output Logits → Softmax → Next Token

Self-Attention

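For an input sequence $X \in \mathbb{R}^{n \times d_{\text{model}}}$:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Where:

  • $Q = XW_Q$, $K = XW_K$, $V = XW_V$ (linear projections)
  • $d_k$ = head dimension (dividing by $\sqrt{d_k}$ scales the dot product to prevent vanishing gradients)
  • The softmax output is an attention weight matrix: each token attends to every other token

Multi-Head: run $h$ parallel attention heads with independent weights, then concatenate:

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W_O$$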
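
Scaled dot-product attention, softmax(QKᵀ/√d_k)·V, can be sketched for a single head in plain Python. This is a minimal illustration, not how any PUMA model implements it: Q, K and V are passed in directly (the linear projections are omitted), there is no causal mask, and the function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention.

    Q and K are n x d_k lists of lists, V is n x d_v.
    Returns (output, weights), where weights[i][j] is how much token i attends to token j.
    """
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    # scores[i][j] = (q_i . k_j) / sqrt(d_k)
    scores = [[scale * sum(a * b for a, b in zip(q, k)) for k in K] for q in Q]
    weights = [softmax(row) for row in scores]
    d_v = len(V[0])
    # output_i = sum_j weights[i][j] * v_j
    output = [[sum(w * v[c] for w, v in zip(w_row, V)) for c in range(d_v)]
              for w_row in weights]
    return output, weights

# Toy input: two tokens whose queries match their own keys.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
output, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1, and in this toy input each token places more weight on itself than on the other token. A multi-head layer would run this routine once per head on separately projected inputs and concatenate the outputs.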

Why Self-Attention Matters

| Property | RNN/LSTM | Transformer Self-Attention |
|---|---|---|
| Parallelism | Sequential (token by token) | Fully parallel |
| Long-range dependency | Decays with distance | Constant (full context) |
| Context window | ~100–200 tokens practical | 32k–1M+ tokens |
| Training speed | Slow (sequential backprop) | Fast (parallel) |

Feed-Forward Network (FFN)

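After attention, each position passes through a 2-layer MLP (shown here with the ReLU activation of the original Transformer):

$$\text{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$

Typical dimensions:

  • Hidden dimension $d_{\text{model}}$: 4,096 (7B models), 8,192 (70B models)
  • FFN inner dimension: $d_{ff} = 4 \times d_{\text{model}}$ in dense models

The FFN is where most of the parameter count lives (about 2/3 of the block weights in a dense Transformer with $d_{ff} = 4 \times d_{\text{model}}$).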
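
The 2/3 figure can be checked by counting weight matrices in one block. The sketch below assumes standard multi-head attention (four d_model × d_model matrices), a two-matrix FFN with inner dimension 4 × d_model, and ignores biases, normalisation and embeddings. GQA or SwiGLU would shift the numbers. The function name is hypothetical.

```python
def block_param_counts(d_model, d_ff=None):
    """Weight-matrix parameter counts for one dense Transformer block.

    Biases, normalisation layers and embeddings are ignored.
    """
    if d_ff is None:
        d_ff = 4 * d_model             # inner dimension assumed to be 4 x d_model
    attn = 4 * d_model * d_model       # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * d_ff           # W_1 and W_2
    return attn, ffn

attn, ffn = block_param_counts(4096)
ffn_share = ffn / (attn + ffn)
print(attn, ffn, round(ffn_share, 3))
```

With d_model = 4,096 the attention matrices hold 4·d² weights and the FFN holds 8·d², so the FFN share is 8/12 = 2/3 regardless of d_model.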


Positional Encoding

Self-attention is permutation-equivariant by default: without position information it cannot tell one token order from another. Positional encodings add position information:

| Method | Description | Used By |
|---|---|---|
| Sinusoidal (original) | Fixed sin/cos functions of position | Original Transformer |
| Learned absolute | Trained position embedding table | BERT, GPT-2, early GPT-3 |
| RoPE (Rotary) | Relative position via rotation of Q, K vectors | Llama, Mistral, DeepSeek |
| ALiBi | Attention bias that decays with distance | MPT, BLOOM |

RoPE is the modern standard for PUMA models. It enables context window extension (e.g., Llama-3.1’s 128k window) via RoPE scaling, which needs at most a comparatively short long-context training phase rather than retraining from scratch.
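
The key property of RoPE is that the attention score between a rotated query and a rotated key depends only on their relative offset. The sketch below rotates consecutive pairs of dimensions by position-dependent angles. It assumes the interleaved-pair layout and base 10000 of the original RoPE formulation; real implementations may pair dimensions differently, and the helper names are hypothetical.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by the angle pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)       # rotation frequency for this pair
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.1]

# Same relative offset (3 positions), different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
```

`score_near` and `score_far` agree up to floating-point error, which is why the scheme encodes relative rather than absolute position.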


Architectural Variants in PUMA Models

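| Model | Architecture | Attention | FFN | Context |
|---|---|---|---|---|
| Llama-3.1-8B | Decoder-only | GQA + RoPE | SwiGLU | 128k |
| Mistral-7B | Decoder-only | GQA + RoPE | SwiGLU | 32k |
| Phi-3-mini | Decoder-only | MHA + RoPE | SwiGLU | 128k |
| DeepSeek-V3 | Decoder-only MoE | MLA + RoPE | MoE-FFN | 128k |
| Claude Sonnet | Decoder-only | Undisclosed | Likely MoE (unconfirmed) | 200k |
| GPT-4o | Undisclosed | Undisclosed | Likely MoE (unconfirmed) | 128k |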

GQA = Grouped Query Attention (fewer KV heads → smaller KV cache → efficient inference)
MLA = Multi-head Latent Attention (DeepSeek’s compressed KV cache)
SwiGLU = Swish-gated linear unit (replaces ReLU FFN; better performance)


Mixture of Experts (MoE)

The Core Idea

Dense Transformer: every token activates every FFN parameter.
MoE Transformer: each token is routed to one (Switch) or a few (top-k) of $N$ expert FFNs:

Dense FFN:   token → [FFN with all parameters]

MoE FFN:     token → Router → selects Expert_i from {E_1, ..., E_N}
                    → [Expert_i FFN] (only 1/N parameters active)

This decouples total parameters (capacity) from activated parameters (compute cost).

Switch Transformer Routing (Fedus et al., 2022)

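Router = linear layer + softmax:

$$p_i(x) = \frac{\exp\big((W_r x)_i\big)}{\sum_{j=1}^{N} \exp\big((W_r x)_j\big)}$$

Token routes to the expert with the highest gate value $p_i(x)$.

Expert Capacity: $(\text{tokens per batch} / N) \times \text{cf}$ tokens per expert per batch (cf = capacity factor, typically 1.25). Overflow tokens skip the expert (residual passthrough).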
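
Top-1 routing with a capacity limit can be sketched as follows. The sketch assumes capacity = (tokens per batch / number of experts) × capacity factor, rounded down, and that tokens are processed in order so later tokens overflow first. The function names and the toy logits are illustrative only.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(router_logits, capacity_factor=1.25):
    """Top-1 routing. router_logits[t][i] is the logit of token t for expert i.

    Returns (assignments, gates, capacity). assignments[t] is the chosen expert
    index, or None when that expert is full and the token passes through the
    residual connection unchanged.
    """
    n_tokens = len(router_logits)
    n_experts = len(router_logits[0])
    capacity = int(n_tokens / n_experts * capacity_factor)
    load = [0] * n_experts
    assignments, gates = [], []
    for logits in router_logits:
        probs = softmax(logits)
        expert = max(range(n_experts), key=lambda i: probs[i])
        if load[expert] < capacity:
            load[expert] += 1
            assignments.append(expert)
        else:
            assignments.append(None)   # overflow: token skips the expert
        gates.append(probs[expert])    # gate value scales the expert output
    return assignments, gates, capacity

# 8 tokens, 4 experts; three tokens prefer expert 0.
toy_logits = [
    [2.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
]
assignments, gates, capacity = switch_route(toy_logits)
```

With 8 tokens, 4 experts and cf = 1.25 the capacity is 2, so the third token that prefers expert 0 overflows and is passed through unchanged.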

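Load Balancing Loss: auxiliary loss encourages uniform expert utilisation:

$$\mathcal{L}_{\text{aux}} = \alpha \cdot N \sum_{i=1}^{N} f_i \, P_i$$

Where $f_i$ = fraction of tokens routed to expert $i$, $P_i$ = mean routing probability for expert $i$, and $\alpha$ = weight of the auxiliary loss.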
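
The auxiliary loss α · N · Σ f_i · P_i can be computed directly from the router probabilities and the top-1 assignments. The sketch below uses α = 0.01 as an assumed default and made-up probabilities. It is meant only to show that the loss is smallest when routing is uniform.

```python
def load_balancing_loss(probs, assignments, alpha=0.01):
    """alpha * N * sum_i f_i * P_i for top-1 routing.

    probs[t][i] is the router probability of token t for expert i;
    assignments[t] is the expert index chosen for token t.
    """
    n_tokens = len(probs)
    n_experts = len(probs[0])
    f = [0.0] * n_experts   # fraction of tokens routed to each expert
    P = [0.0] * n_experts   # mean router probability per expert
    for t in range(n_tokens):
        f[assignments[t]] += 1.0 / n_tokens
        for i in range(n_experts):
            P[i] += probs[t][i] / n_tokens
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, P))

# Balanced routing: tokens alternate between two experts.
balanced = load_balancing_loss(
    [[0.6, 0.4], [0.4, 0.6], [0.6, 0.4], [0.4, 0.6]], [0, 1, 0, 1])

# Collapsed routing: every token goes to expert 0.
collapsed = load_balancing_loss(
    [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.9, 0.1]], [0, 0, 0, 0])
```

In the balanced case f = P = (0.5, 0.5) and the loss equals α. In the collapsed case it rises to 1.8 α, so gradient descent on this term pushes the router away from overloading one expert.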

MoE Models in PUMA

| Model | MoE Structure | Activated Params | Total Params | PUMA Role |
|---|---|---|---|---|
| DeepSeek-V3 | Top-8 routing over 256 routed experts, plus 1 shared expert | 37B | 671B | Cloud baseline |
| DeepSeek-R1 | Similar to V3 | ~37B | ~671B | Reasoning baseline |
| Mixtral 8×7B | Top-2, 8 experts | 13B | 47B | Alternative local model |
| Phi-3.5-MoE | Top-2, 16 experts | 6.6B | 41.9B | Efficient local option |

Why MoE matters for PUMA: DeepSeek-V3 pairs frontier-scale capacity (671B total parameters) with the per-token compute of a much smaller dense model (37B active parameters). This justifies using it as the PUMA cloud baseline: high quality at manageable API cost.
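
The gap between capacity and compute can be made concrete by computing the fraction of weights that are active per token. The figures below are the activated and total parameter counts listed in this note, in billions.

```python
# (activated, total) parameters in billions, as listed in this note
models = {
    "DeepSeek-V3": (37.0, 671.0),
    "Mixtral 8x7B": (13.0, 47.0),
    "Phi-3.5-MoE": (6.6, 41.9),
}

active_fraction = {name: act / total for name, (act, total) in models.items()}

for name, frac in active_fraction.items():
    print(f"{name}: {frac:.1%} of weights active per token")
```

DeepSeek-V3 activates under 6% of its weights per token, the smallest fraction of the three models.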


Scaling Laws

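The relationship between model size, training compute, and performance follows power laws (Hoffmann et al., 2022; Chinchilla):

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

Where $L$ = loss, $N$ = parameters, $D$ = training tokens, and $E$, $A$, $B$, $\alpha$, $\beta$ are fitted constants. Chinchilla optimal: $D \approx 20N$ (e.g., 7B model → 140B tokens).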
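
The Chinchilla rule of thumb is simple arithmetic. The sketch below also estimates training compute with the common approximation C ≈ 6 · N · D FLOPs, which is an assumption added here and not part of this note. Function names are hypothetical.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Compute-optimal training tokens under the D ~ 20 N rule of thumb."""
    return tokens_per_param * n_params

def training_flops(n_params, n_tokens):
    """Rough training compute, assuming C ~ 6 * N * D FLOPs."""
    return 6 * n_params * n_tokens

n_params = 7_000_000_000
optimal_tokens = chinchilla_tokens(n_params)
flops = training_flops(n_params, optimal_tokens)
print(f"{optimal_tokens:.3e} tokens, {flops:.3e} FLOPs")
```

For a 7B model this gives 140B tokens, matching the example above, and roughly 5.9 × 10²¹ training FLOPs under the 6·N·D assumption.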

For MoE: training and inference compute scale with activated parameters, not total. DeepSeek-V3 therefore costs about as much per token as a ~37B dense model, while its 671B total parameters provide the additional capacity.


Inference Efficiency for PUMA

KV Cache

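During inference, Key and Value matrices for all past tokens are cached:

$$\text{KV cache (bytes)} = 2 \times n_{\text{layers}} \times H \times d_{\text{head}} \times n_{\text{tokens}} \times \text{bytes per element}$$

Example (Llama-3.1-8B, 4k context):

  • $n_{\text{layers}} = 32$ layers, $H = 8$ KV heads, $d_{\text{head}} = 128$, $n_{\text{tokens}} = 4096$, fp16 (2 bytes per element)
  • KV cache: $2 \times 32 \times 8 \times 128 \times 4096 \times 2$ bytes = 536,870,912 bytes ≈ 537 MB (512 MiB)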

GQA reduces KV heads (H), directly cutting cache size. MLA (DeepSeek) compresses KV into a latent vector — reduces KV cache by ~10×.
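
The effect of GQA on cache size can be checked with a small calculator. It assumes one sequence, fp16 storage (2 bytes per element) and the Llama-3.1-8B figures from the example above. The 32-KV-head variant is a hypothetical MHA version of the same model, included only for comparison.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """KV cache size for one sequence; the leading 2 covers both K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

gqa_bytes = kv_cache_bytes(32, 8, 128, 4096)    # 8 KV heads (GQA)
mha_bytes = kv_cache_bytes(32, 32, 128, 4096)   # hypothetical: 32 KV heads (MHA)

print(gqa_bytes / 2**20, "MiB with GQA")
print(mha_bytes / 2**20, "MiB with MHA")
```

Cutting the KV heads from 32 to 8 shrinks the cache by 4×, from 2,048 MiB to 512 MiB at a 4k context. The cache grows linearly with context length, so the same model at 128k tokens needs 32× as much.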

Quantisation Impact on Architecture

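| Quantisation | Precision | Relative VRAM (vs fp32) | Quality Loss |
|---|---|---|---|
| fp32 | 32-bit | 1× (baseline) | None |
| fp16 / bf16 | 16-bit | 0.5× | Negligible |
| GGUF Q8_0 | ~8-bit | ~0.25× | Minimal |
| GGUF Q4_K_M | ~4-bit | ~0.125× | Small |
| GGUF Q2_K | ~2-bit | ~0.065× | Noticeable |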

For PUMA local deployment: Q4_K_M is the recommended compromise (acceptable quality, and 7B–8B models fit in 8–16 GB VRAM).
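
A rough estimate of weight memory follows from parameters × bits per weight. The sketch below uses the nominal bit widths from the table above. Real GGUF files are somewhat larger because they store per-block scales and keep some tensors at higher precision, and the KV cache and activations come on top. The function name is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes); excludes KV cache and activations."""
    return n_params * bits_per_weight / 8 / 1e9

# Nominal bit widths for an 8B-parameter model.
estimates = {
    name: weight_memory_gb(8e9, bits)
    for name, bits in [("fp16", 16), ("Q8_0", 8), ("Q4_K_M", 4), ("Q2_K", 2)]
}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.1f} GB")
```

At a nominal 4 bits per weight an 8B model needs about 4 GB for weights, which leaves headroom for the KV cache on an 8 GB card.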


Ch.3 Methods Reference

For PUMA thesis Chapter 3 (Methodology), reference this note when:

  • Explaining the model selection rationale (MoE efficiency → DeepSeek-V3 as API baseline)
  • Justifying context window choices (RoPE-extended 128k → sufficient for issue + few-shot examples)
  • Describing inference setup (quantisation level, KV cache overhead, batch size)
