PN: RLHF and Constitutional AI — LLM Alignment Training Paradigms

Core Idea

RLHF (Reinforcement Learning from Human Feedback) and Constitutional AI (CAI) are two widely used paradigms for aligning pre-trained LLMs with human values and instructions. RLHF trains a reward model on human preference comparisons and then optimises the policy against it, typically with PPO. CAI replaces human preference labels for harmlessness with labels from an AI judge guided by a written constitution, so alignment can scale without a proportional increase in human annotation cost. Both produce the instruction-following models used in PUMA.


Pre-Training vs. Alignment

A base language model (e.g., Llama-3 base, GPT-4 base) is trained to predict next tokens — it is not inherently helpful or safe. Alignment training adapts it to:

  1. Follow instructions (do what the user asks)
  2. Be helpful (produce useful, accurate responses)
  3. Be harmless (avoid dangerous, toxic, or deceptive outputs)
  4. Be honest (express uncertainty, avoid hallucination)

RLHF and CAI are the primary methods to achieve this.


RLHF — Reinforcement Learning from Human Feedback

Three-Stage Pipeline

Stage 1: Supervised Fine-Tuning (SFT)
  Pre-trained base model + demonstration dataset (human-written responses)
  → SFT model (follows instructions, but not yet preference-aligned)

Stage 2: Reward Model (RM) Training
  SFT model generates pairs of responses to the same prompt
  Human raters rank each pair: response A > response B
  RM is trained to assign a scalar score that predicts the human preference
  → Reward Model

Stage 3: PPO Fine-Tuning
  SFT policy optimised using PPO to maximise RM score
  KL penalty keeps the policy from drifting too far from the SFT model
  → RLHF-aligned model (e.g. ChatGPT, Claude, Llama-2-Chat)
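
Stage 2 is commonly implemented with a pairwise (Bradley-Terry) loss on the reward model's scalar scores. The following is a minimal sketch under that assumption; the function name `pairwise_rm_loss` is hypothetical, and a real implementation would compute the scores with a neural network and average the loss over a batch.

```python
import math


def pairwise_rm_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair.

    Computes -log(sigmoid(score_chosen - score_rejected)), which is small
    when the reward model scores the human-preferred response higher.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Minimising this loss pushes the score of the preferred response above that of the rejected one; only the score difference matters, not the absolute values.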

Mathematical Framework

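The RLHF objective is:

$$
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big] \;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\big[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big]
$$

Where:

  • $\pi_\theta$ = policy being trained (the LLM)
  • $r_\phi$ = reward model
  • $\beta$ = KL penalty coefficient (limits drift from the reference model, which mitigates reward hacking)
  • $\pi_{\mathrm{ref}}$ = SFT model as reference point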
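
In practice the KL term is usually estimated from the per-token log-probabilities of a sampled response. The sketch below shows that arithmetic for a single response; the function name, the default `beta=0.1`, and the single-sample KL estimate are assumptions made for illustration, not details taken from this note.

```python
from typing import Sequence


def kl_penalised_reward(
    rm_score: float,
    logp_policy: Sequence[float],
    logp_ref: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Reward-model score minus beta times a sampled KL estimate.

    logp_policy and logp_ref hold the log-probability of each generated
    token under the trained policy and the frozen reference model.
    """
    if len(logp_policy) != len(logp_ref):
        raise ValueError("log-probability sequences must have equal length")
    # Single-sample estimate of KL(policy || reference) for this response.
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return rm_score - beta * kl_estimate
```

When the policy assigns the same log-probabilities as the reference model, the penalty vanishes and the function returns the raw reward-model score.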

Failure Modes

Problem                  | Description                                     | Effect
-------------------------+-------------------------------------------------+--------------------------------------
Reward hacking           | Policy finds unintended ways to get high reward | Sycophancy; verbose-but-wrong answers
Mode collapse            | Policy converges to a narrow response style     | Low diversity; robustness loss
Human bias amplification | RM learns rater biases                          | Demographic skew; cultural bias
Scalability              | Human annotation is expensive                   | Few datasets; narrow coverage

Constitutional AI (Anthropic, 2022)

Motivation

Human annotation for RLHF is expensive, slow, and inconsistent. CAI replaces human preference labelling for harmlessness with an AI judge that applies a written constitution: a set of principles describing what helpful, harmless, and honest behaviour looks like. In the original paper (Bai et al., 2022), helpfulness preferences still came from human raters.

Two-Stage Process

Stage 1: Supervised Learning from AI Feedback (SL-CAI)
  1. Prompt a helpful-only model with red-teaming prompts to elicit a potentially harmful response
  2. Prompt model to critique its own response using constitution principles
  3. Prompt model to revise the response based on the critique
  4. Repeat critique-revise cycle (typically 2 rounds)
  5. Fine-tune on the final (revised) responses → SL-CAI model
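
The critique-revise loop in steps 1 to 4 can be sketched as follows. This is an illustrative outline only: `generate` is a hypothetical callable standing in for any text-generation backend, the prompt templates are invented for the example, and sampling one principle per round is an assumption of this sketch.

```python
import random
from typing import Callable, Optional, Sequence, Tuple


def critique_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: Sequence[str],
    rounds: int = 2,
    rng: Optional[random.Random] = None,
) -> Tuple[str, str]:
    """Return a (prompt, revised_response) pair for supervised fine-tuning."""
    rng = rng or random.Random(0)
    # Step 1: initial, possibly harmful, response to the red-teaming prompt.
    response = generate(prompt)
    for _ in range(rounds):
        # Assumption: one constitutional principle is sampled per round.
        principle = rng.choice(principles)
        # Step 2: the model critiques its own response against the principle.
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        # Step 3: the model revises the response in light of the critique.
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    # Step 5 (fine-tuning on the collected pairs) is outside this sketch.
    return prompt, response
```

Each call makes one generation for the initial response plus two per round, so the default of two rounds costs five generations per training example.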

Stage 2: Reinforcement Learning from AI Feedback (RLAIF)
  1. For each prompt, generate response pairs (SL-CAI model)
  2. AI judge (frontier model + constitution) ranks: A > B
  3. Train preference model on AI rankings
  4. PPO fine-tuning using AI preference model
  → Constitutional AI model (Anthropic's Claude family, e.g. Claude 1, Claude 2, and the Claude 3 Haiku / Sonnet / Opus tiers)

The Constitution

A constitution is a list of principles. Example principles from Anthropic’s published constitution:

  • “Choose the response that is least likely to contain harmful or unethical content.”
  • “Choose the response that is most supportive of human autonomy and individual rights.”
  • “Choose the response that is most honest and avoids claiming to be human when it isn’t.”

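For each comparison, the AI judge is shown a response pair together with a principle (in the original paper, one sampled at random from the constitution) and indicates which response better satisfies it. The resulting preference, either a hard binary label or a soft probability, becomes the training target for the preference model.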
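
One way to turn the judge's output into a preference label is to normalise the log-probabilities it assigns to the two answer options. The sketch below assumes the judge exposes those two log-probabilities; the function names and the 0.5 threshold are hypothetical choices for illustration.

```python
import math


def soft_preference(logp_a: float, logp_b: float) -> float:
    """Probability that response A is preferred, given the judge's
    log-probabilities for answering "A" and "B" (two-way softmax)."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logp_a, logp_b)
    exp_a = math.exp(logp_a - m)
    exp_b = math.exp(logp_b - m)
    return exp_a / (exp_a + exp_b)


def hard_label(p_a: float, threshold: float = 0.5) -> str:
    """Collapse a soft preference into a binary label."""
    return "A" if p_a >= threshold else "B"
```

Keeping the soft probability preserves the judge's uncertainty, while the hard label matches the binary A > B format used for human rankings.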

Advantages over RLHF

Dimension       | RLHF                             | Constitutional AI
----------------+----------------------------------+-------------------------------
Annotation cost | High (human raters)              | Low (AI judge)
Scalability     | Limited by annotation budget     | Scales with compute
Consistency     | Variable (rater disagreement)    | Consistent (same constitution)
Customisability | Hard (retrain RM for new values) | Easy (edit constitution)
Transparency    | RM is a black box                | Principles are human-readable

RLAIF — Reinforcement Learning from AI Feedback

RLAIF is the general term for using any AI model (not necessarily a constitution-based judge) to provide preference labels. CAI is one specific implementation of RLAIF.

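Related methods that reduce reliance on human preference labels or on the RL step:

  • Self-Rewarding Language Models (Yuan et al., 2024): the model acts as its own judge, scoring its own generations to build preference pairs
  • SPIN (Self-Play Fine-Tuning): human-written SFT responses serve as the preferred examples and the model's own generations as the rejected ones
  • Direct Preference Optimisation (DPO): not an RLAIF variant as such; it eliminates the RL step and optimises directly on preference data, whether labelled by humans or by an AI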

DPO — Direct Preference Optimisation

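DPO (Rafailov et al., 2023) reformulates RLHF without an explicit reward model or PPO:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

Where $y_w$ = preferred response, $y_l$ = dispreferred response, $\sigma$ = logistic sigmoid, $\pi_{\mathrm{ref}}$ = frozen reference model (typically the SFT model), and $\beta$ = strength of the implicit KL constraint.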

DPO is often preferred over PPO-based RLHF for preference tuning because:

  • No RL training instability (PPO is sensitive to hyperparameters and implementation details)
  • No separate reward model to train and maintain
  • Direct optimisation on preference pairs
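
A minimal sketch of the per-pair DPO loss from Rafailov et al. (2023), written on plain floats so the arithmetic is visible. The inputs are summed log-probabilities of each full response under the policy and under the frozen reference model; the function name and the default `beta=0.1` are assumptions for illustration.

```python
import math


def dpo_loss(
    logp_w: float,
    logp_l: float,
    ref_logp_w: float,
    ref_logp_l: float,
    beta: float = 0.1,
) -> float:
    """-log(sigmoid(beta * (log-ratio of preferred - log-ratio of dispreferred)))."""
    # Implicit reward of each response: how much more likely the policy
    # makes it compared with the reference model.
    ratio_w = logp_w - ref_logp_w
    ratio_l = logp_l - ref_logp_l
    margin = beta * (ratio_w - ratio_l)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy matches the reference model and falls as the policy shifts probability mass from the dispreferred response to the preferred one.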

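Used by: Llama-3-Instruct (alongside SFT and other post-training stages), Phi-3, and many other open-weight instruction models. DPO is not publicly documented for Mistral-7B-Instruct, whose release notes describe supervised fine-tuning on public instruction datasets.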


Impact on PUMA Models

All models in PUMA’s catalog have undergone some form of alignment training:

Model                 | Alignment Method                          | Notes
----------------------+-------------------------------------------+-------------------------------------
GPT-4o                | RLHF + RLAIF (OpenAI)                     | Full details undisclosed
Claude 3.7 Sonnet     | Constitutional AI (CAI)                   | Anthropic; most transparent
DeepSeek-R1           | GRPO (Group Relative Policy Optimisation) | Specialised for reasoning
Llama-3.1-8B-Instruct | SFT + DPO                                 | Meta; open weights
Mistral-7B-Instruct   | SFT (DPO not publicly documented)         | Mistral AI
Phi-3-mini            | SFT + DPO                                 | Microsoft; dataset-focused alignment
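
GRPO, listed above for DeepSeek-R1, replaces PPO's learned value function with a baseline computed from a group of responses sampled for the same prompt. The sketch below shows only that group-relative advantage step; the function name, the use of the population standard deviation, and the `eps` term are assumptions for illustration.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Normalise each reward against the mean and spread of its group.

    `rewards` holds one scalar reward per response sampled for one prompt.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so no separate critic network is needed.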

Implications for PUMA

  1. Frozen models: PUMA evaluates all models post-alignment; no fine-tuning in baseline experiments
  2. Alignment tax: aligned models may be more cautious on ambiguous triage tasks and may therefore produce “safe” but incorrect classifications
  3. Constitution-aware prompting: Claude’s CAI training may make it especially responsive to explicit normative constraints in system prompts (bounded autonomy framing)
  4. DPO models are PUMA-compatible: the alignment method does not change how a model is served, so open-weight DPO models run on Ollama with no special handling

RLHF vs. Constitutional AI: PUMA Design Implications

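Question                                | RLHF                       | CAI
----------------------------------------+----------------------------+------------------------------------------------------
Can we inspect alignment criteria?      | No (RM is opaque)          | Yes (constitution is readable)
Can we audit for PM-domain bias?        | Difficult                  | Feasible (check constitution principles)
Consistent with EU AI Act transparency? | Partial                    | Strong
Available for local deployment?         | Open RM checkpoints (rare) | N/A (CAI is a training method; Claude is API-only)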

For the PUMA ethics chapter (Section 5.3): Constitutional AI provides the strongest transparency story for regulatory compliance framing.

