LN: Fedus, Zoph & Shazeer (2022) — Switch Transformers

Bibliographic Reference

Citation: Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1–39. https://arxiv.org/abs/2101.03961

Pass 1 — Bird’s Eye View (5 Cs)

C	Assessment
Category	Architecture paper — sparse scaling
Context	Google Brain. Shows that Mixture-of-Experts (MoE) routing can be simplified to one expert per token (Switch) and scaled to over a trillion parameters without a proportional increase in compute per token
Correctness	Validated on C4 language modelling and SuperGLUE fine-tuning; up to 7× pre-training speedup over T5 baselines at matched FLOPs per token
Contributions	(1) Switch routing: each token routes to exactly one expert; (2) Expert capacity with overflow tokens; (3) Auxiliary load-balancing loss; (4) Selective precision: fp32 inside the router, bf16 for the rest of the model
Clarity	Very good — builds directly on Transformer foundations
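
Contribution (3), the auxiliary load-balancing loss, can be sketched in a few lines. The form used here is α · N · Σ_i f_i · P_i, where f_i is the fraction of tokens whose top-1 choice is expert i and P_i is the mean router probability given to expert i. The function name, the toy inputs, and the default `alpha=0.01` are assumptions for illustration, not code from the paper.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """Sketch of an auxiliary load-balancing loss for top-1 routing.

    router_probs: one list of expert probabilities per token (each sums to 1).
    alpha: loss weight (assumed default, chosen for illustration).
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])

    # f[i]: fraction of tokens whose highest-probability expert is i
    f = [0.0] * num_experts
    for probs in router_probs:
        chosen = max(range(num_experts), key=probs.__getitem__)
        f[chosen] += 1.0 / num_tokens

    # P[i]: mean router probability assigned to expert i across tokens
    P = [sum(p[i] for p in router_probs) / num_tokens for i in range(num_experts)]

    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, P))


# Hypothetical two-token, two-expert batches
balanced = [[0.9, 0.1], [0.1, 0.9]]   # tokens split evenly between experts
collapsed = [[0.9, 0.1], [0.9, 0.1]]  # every token prefers expert 0

loss_balanced = load_balancing_loss(balanced)
loss_collapsed = load_balancing_loss(collapsed)
```

In this toy setup the collapsed batch yields a larger loss than the balanced one, which is the pressure that pushes the router toward spreading tokens across experts.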

Relevance: ⭐⭐⭐⭐

DeepSeek-V3 (671B total / 37B active) and Mixtral 8×7B both use sparse MoE architectures in the line of work that Switch Transformers helped establish, although both route each token to more than one expert. PUMA’s rationale for using DeepSeek-V3 as a cloud baseline is directly grounded in the efficiency claims of Switch Transformers.

Pass 2 — Key Concepts

Mixture of Experts (MoE) Principle

A dense Transformer activates all parameters for every token. An MoE Transformer routes each token to a small subset of specialised “expert” feed-forward networks. Total parameters grow with the number of experts, but the parameters activated per token stay roughly constant.

Dense FFN:     token → single FFN (all parameters activated)
Switch MoE:    token → Router → Expert_i (1 of N experts activated)

Switch simplification: Earlier MoE layers used top-k routing with k ≥ 2. Switch uses k = 1, which reduces router computation and cross-device communication and allows a smaller expert capacity, while retaining most of the quality gains.
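
A minimal sketch of top-1 (Switch) routing for a single token, using only the Python standard library. The router logits would come from a learned linear layer in practice; here they are hard-coded, and the toy experts and function names are hypothetical. The chosen expert's output is scaled by its router probability (the gate value).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switch_route(router_logits):
    """Top-1 routing: return (expert_index, gate_value) for one token."""
    probs = softmax(router_logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]


def switch_layer(token, router_logits, experts):
    """Run only the chosen expert and scale its output by the gate value."""
    expert, gate = switch_route(router_logits)
    return [gate * y for y in experts[expert](token)]


# Hypothetical toy experts: each one multiplies the token by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

token = [0.5, -1.0]
router_logits = [0.1, 2.0, -0.3, 0.4]  # assumed values, not learned

expert, gate = switch_route(router_logits)
out = switch_layer(token, router_logits, experts)
```

Only one of the four experts runs for this token, so the compute per token does not grow when more experts are added.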

Expert Capacity

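Each expert has a fixed capacity $C = \left\lceil \frac{T}{N} \cdot \text{capacity\_factor} \right\rceil$, where $T$ is the number of tokens in the batch and $N$ is the number of experts. Tokens routed to an expert that is already full overflow: they skip the expert computation and are carried forward unchanged through the residual connection. The cap bounds the work each expert can receive, which prevents bottlenecks at popular experts.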
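
The capacity rule can be illustrated with a small dispatch sketch. It assumes tokens are assigned in batch order and that a token overflows as soon as its chosen expert is full; the function names and the example assignments are hypothetical.

```python
import math


def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    """C = ceil(T / N * capacity_factor)."""
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)


def dispatch(assignments, num_experts, capacity_factor):
    """Fill each expert up to capacity; remaining tokens overflow.

    assignments[t] is the expert index chosen by the router for token t.
    Returns (capacity, per-expert token lists, overflow token list).
    """
    cap = expert_capacity(len(assignments), num_experts, capacity_factor)
    buckets = [[] for _ in range(num_experts)]
    overflow = []
    for tok, e in enumerate(assignments):
        if len(buckets[e]) < cap:
            buckets[e].append(tok)
        else:
            overflow.append(tok)  # handled only by the residual connection
    return cap, buckets, overflow


# Hypothetical router choices for 8 tokens and 4 experts; expert 0 is popular.
assignments = [0, 0, 0, 1, 0, 2, 3, 0]
cap, buckets, overflow = dispatch(assignments, num_experts=4, capacity_factor=1.25)
# cap == 3, buckets == [[0, 1, 2], [3], [5], [6]], overflow == [4, 7]
```

A larger capacity factor drops fewer tokens but reserves more compute and memory per expert, so the factor trades quality against efficiency.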

Why This Matters for PUMA

Model	MoE Structure	PUMA Role
DeepSeek-V3	671B total, 37B active (MoE)	Cloud baseline
DeepSeek-R1	Similar MoE structure	Reasoning baseline
Mixtral 8×7B	8 experts, top-2 routing	Alternative local model

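MoE models combine a high total parameter count (reasoning capacity) with a much smaller per-token compute cost, which is critical for PUMA’s local-execution constraint. Memory is a separate limit: every expert must still be stored, even though only a few are active for a given token.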
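
As a quick worked example of the total-versus-active distinction, the snippet below computes the share of DeepSeek-V3 parameters used per token from the figures in the table above. The helper name is hypothetical.

```python
def active_fraction(total_params_b, active_params_b):
    """Share of parameters activated per token (both arguments in billions)."""
    return active_params_b / total_params_b


frac = active_fraction(total_params_b=671, active_params_b=37)
print(f"Active per token: {frac:.1%}")  # prints "Active per token: 5.5%"
```

Roughly one parameter in eighteen participates in each forward pass, while storage still scales with the full 671B.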

PUMA Integration

PUMA’s model selection rationale in Ch.3 (Methods) should reference Switch Transformers when explaining why DeepSeek-V3’s MoE architecture is both powerful and computationally feasible as an API baseline.

Related Notes

PN-LLM-Models-PUMA — model catalog: DeepSeek-V3 MoE entry
LN-Vaswani-2017-AttentionIsAllYouNeed — the Transformer base that MoE extends
PN-FineTuning-LoRA-Quantization — GGUF quantization of MoE models
