LN: Fedus, Zoph & Shazeer (2022) — Switch Transformers

Bibliographic Reference

Citation: Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1–39. https://arxiv.org/abs/2101.03961


Pass 1 — Bird’s Eye View (5 Cs)

C              Assessment
Category       Architecture paper (sparse scaling)
Context        Google Brain. Demonstrates that Mixture-of-Experts (MoE) routing can be simplified to a single expert per token (Switch) and scaled to a trillion parameters without a proportional increase in compute
Correctness    Validated on C4 language modelling and SuperGLUE; up to 7× pre-training speedup over T5 at equal FLOPs per token
Contributions  (1) Switch routing: each token routes to exactly one expert; (2) Expert capacity with overflow tokens; (3) Auxiliary load-balancing loss; (4) Selective precision: float32 for the router computation, bfloat16 for the rest of the model
Clarity        Very good; builds directly on Transformer foundations

Relevance: ⭐⭐⭐⭐

DeepSeek-V3 (671B total / 37B active) and Mixtral 8×7B both use sparse MoE architectures in the lineage of this work, although both route each token to more than one expert rather than to Switch's single expert. PUMA’s rationale for using DeepSeek-V3 as a cloud baseline is directly grounded in the efficiency claims of Switch Transformers.


Pass 2 — Key Concepts

Mixture of Experts (MoE) Principle

A dense Transformer activates all parameters for every token. A MoE Transformer routes each token to a subset of specialised “expert” feed-forward networks. Total parameters grow, but activated parameters per token stay constant.

Dense FFN:     token → single FFN (all parameters activated)
Switch MoE:    token → Router → Expert_i (1 of N experts activated)

Switch simplification: Earlier MoE layers used top-k routing with k ≥ 2. Switch uses k = 1, which reduces router computation and communication cost and lets each expert's capacity be at least halved, while retaining most of the quality gains.
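The routing step can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: the names `switch_route` and `switch_layer`, the list-of-lists router weights, and the callables standing in for expert FFNs are all hypothetical. It shows the one property that defines Switch routing: the router picks the single highest-probability expert and scales that expert's output by its router probability.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(router_logits):
    """Top-1 routing for one token: return (expert index, gate value).

    The gate is the router probability of the chosen expert.
    """
    probs = softmax(router_logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]

def switch_layer(token, router_weights, experts):
    """Apply a toy Switch layer to a single token vector.

    router_weights: one weight vector per expert (hypothetical toy values).
    experts: list of callables standing in for the expert FFNs.
    """
    # Router logits are a linear projection of the token representation.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    expert, gate = switch_route(logits)
    # Only the selected expert runs; its output is scaled by the gate.
    return expert, [gate * y for y in experts[expert](token)]

# Two toy "experts" and a router that favours expert 1 for this token.
toy_experts = [lambda v: [2.0 * x for x in v], lambda v: [x + 1.0 for x in v]]
toy_router = [[1.0, 0.0], [0.0, 1.0]]
chosen, output = switch_layer([0.5, 2.0], toy_router, toy_experts)
print(chosen, output)
```

With top-k routing for k ≥ 2, the last step would instead run k experts and sum their gated outputs, which is where the extra computation and communication come from.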

Expert Capacity

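Each expert has a fixed capacity: the maximum number of tokens it processes per batch, computed as (tokens per batch / number of experts) × capacity factor. Tokens that overflow this limit skip the expert and pass to the next layer through the residual connection. This prevents bottlenecks at popular experts.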
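A minimal sketch of the capacity rule, together with the auxiliary load-balancing loss listed among the paper's contributions, may help make the mechanics concrete. The function names, the first-come-first-served fill order, and rounding the capacity up are assumptions made for illustration. The loss follows the form α · N · Σ f_i · P_i, where f_i is the fraction of tokens routed to expert i and P_i is the mean router probability for expert i; α = 0.01 is the value the paper reports using.

```python
import math
from collections import Counter

def expert_capacity(tokens_per_batch, num_experts, capacity_factor=1.0):
    """(tokens per batch / number of experts) * capacity factor.

    Rounding up is an assumption of this sketch.
    """
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)

def dispatch(assignments, num_experts, capacity):
    """Fill experts in token order; tokens beyond capacity overflow.

    assignments: chosen expert index per token (top-1 routing).
    Returns (kept, overflow): kept maps expert -> token positions.
    """
    kept = {e: [] for e in range(num_experts)}
    overflow = []
    for pos, expert in enumerate(assignments):
        if len(kept[expert]) < capacity:
            kept[expert].append(pos)
        else:
            # Overflow token: bypasses the expert via the residual path.
            overflow.append(pos)
    return kept, overflow

def load_balancing_loss(assignments, router_probs, num_experts, alpha=0.01):
    """Auxiliary loss alpha * N * sum_i f_i * P_i over one batch."""
    n_tokens = len(assignments)
    counts = Counter(assignments)
    f = [counts[i] / n_tokens for i in range(num_experts)]
    p = [sum(row[i] for row in router_probs) / n_tokens
         for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Six tokens, three experts, capacity factor 1.0: each expert holds 2 tokens.
capacity = expert_capacity(6, 3, 1.0)
kept, overflow = dispatch([0, 0, 0, 1, 2, 0], 3, capacity)
print(capacity, kept, overflow)
```

Under perfectly uniform routing, f_i = P_i = 1/N and the loss equals α; skewed routing pushes it higher, which is what gives the router a gradient toward balance. A capacity factor above 1.0 trades extra compute and memory for fewer overflow tokens.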

Why This Matters for PUMA

Model          MoE Structure                  PUMA Role
DeepSeek-V3    671B total, 37B active (MoE)   Cloud baseline
DeepSeek-R1    Similar MoE structure          Reasoning baseline
Mixtral 8×7B   8 experts, top-2 routing       Alternative local model

MoE models deliver high parameter counts (reasoning capacity) at manageable inference cost, because only the active parameters are used for each token; this is critical for PUMA’s local-execution constraint.
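The gap between total and active parameters can be made concrete with the DeepSeek-V3 figures quoted above. This is simple arithmetic, not a benchmark; the helper name `active_fraction` is hypothetical, and the ratio describes per-token compute only, since all experts still have to be held in memory.

```python
def active_fraction(total_params_b, active_params_b):
    """Share of parameters used per token (both arguments in billions)."""
    return active_params_b / total_params_b

# DeepSeek-V3 figures as quoted in these notes: 671B total, 37B active.
fraction = active_fraction(671, 37)
print(f"active share per token: {fraction:.1%}")
print(f"total / active ratio:   {671 / 37:.1f}x")
```

Roughly one parameter in eighteen is used for any given token, which is the sense in which a sparse model's per-token cost tracks its active size rather than its total size.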


PUMA Integration

PUMA’s model selection rationale in Ch.3 (Methods) should reference Switch Transformers when explaining why DeepSeek-V3’s MoE architecture is both powerful and computationally feasible as an API baseline.

MOCs