PN: RLHF and Constitutional AI — LLM Alignment Training Paradigms

Core Idea

RLHF (Reinforcement Learning from Human Feedback) and Constitutional AI (CAI) are the two dominant paradigms for aligning pre-trained LLMs to human values and instructions. RLHF trains a reward model on human preference comparisons and uses PPO to optimise the policy. CAI replaces human raters with an AI judge operating under a written constitution, enabling scalable alignment without proportional human annotation cost. Both produce the instruction-following models used in PUMA.

Pre-Training vs. Alignment

A base language model (e.g., Llama-3 base, GPT-4 base) is trained to predict next tokens — it is not inherently helpful or safe. Alignment training adapts it to:

Follow instructions (do what the user asks)
Be helpful (produce useful, accurate responses)
Be harmless (avoid dangerous, toxic, or deceptive outputs)
Be honest (express uncertainty, avoid hallucination)

RLHF and CAI are the primary methods to achieve this.

RLHF — Reinforcement Learning from Human Feedback

Three-Stage Pipeline

Stage 1: Supervised Fine-Tuning (SFT)
  Pre-trained base model + demonstration dataset (human-written responses)
  → SFT model (follows instructions, but not yet preference-aligned)

Stage 2: Reward Model (RM) Training
  SFT model generates response pairs for the same prompt
  Human raters rank: response A > response B
  RM trained to predict human preference score
  → Reward Model

Stage 3: PPO Fine-Tuning
  SFT policy optimised using PPO to maximise RM score
  KL penalty prevents the policy from diverging too far from the SFT model
  → RLHF-aligned model (ChatGPT / Claude / Llama-Chat)
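
Stage 2 can be made concrete with a small sketch. The note does not say which loss the reward model uses; a common choice, assumed here, is the Bradley-Terry pairwise loss, which pushes the score of the preferred response above the score of the rejected one. The function name `pairwise_rm_loss` and the scalar scores are illustrative only.

```python
import math


def pairwise_rm_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair.

    Assumption: the reward model emits one scalar score per response.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Equal scores give log(2); the loss shrinks as the chosen score pulls ahead.
print(pairwise_rm_loss(0.0, 0.0))
print(pairwise_rm_loss(2.0, 0.0))
print(pairwise_rm_loss(0.0, 2.0))
```

In a real pipeline the scores would come from a neural reward model and the loss would be averaged over a batch of rated pairs.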

Mathematical Framework

The RLHF objective is:

$\max_{π_{θ}} \; \mathbb{E}_{x \sim D,\, y \sim π_{θ}(y ∣ x)} \left[ r_{ϕ}(x, y) \right] - β \cdot D_{\mathrm{KL}} \left[ π_{θ}(y ∣ x) ∥ π_{SFT}(y ∣ x) \right]$

Where:

$π_{θ}$ = policy being trained (the LLM)
$r_{ϕ}$ = reward model
$β$ = KL penalty coefficient; larger values keep the policy closer to $π_{SFT}$, which limits reward hacking
$π_{SFT}$ = SFT model as reference point
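
To illustrate how $β$ trades reward against drift, the sketch below evaluates the objective for a single prompt with three candidate responses. All numbers and the names `kl_divergence`, `rlhf_objective`, `greedy`, and `moderate` are made up for illustration; a real policy is a distribution over token sequences, not three outcomes.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two categorical distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def rlhf_objective(policy: Sequence[float], reference: Sequence[float],
                   rewards: Sequence[float], beta: float) -> float:
    """Expected reward under the policy minus beta * KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)


# Toy setup for one prompt (hypothetical numbers).
reference = [0.5, 0.3, 0.2]   # pi_SFT(y | x)
rewards = [0.0, 1.0, 3.0]     # r_phi(x, y)
greedy = [0.01, 0.01, 0.98]   # piles almost all mass on the top-reward response
moderate = [0.2, 0.3, 0.5]    # shifts mass towards high reward more gently

for beta in (0.0, 5.0):
    print(beta,
          rlhf_objective(greedy, reference, rewards, beta),
          rlhf_objective(moderate, reference, rewards, beta))
```

With $β = 0$ the greedy policy scores higher because only reward counts; with $β = 5$ the moderate policy wins because the greedy one pays a large KL penalty.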

Failure Modes

Problem	Description	Effect
Reward hacking	Policy finds unintended ways to get high reward	Sycophancy; verbose-but-wrong answers
Mode collapse	Policy converges to narrow response style	Low diversity; robustness loss
Human bias amplification	RM learns rater biases	Demographic skew; cultural bias
Scalability	Human annotation is expensive	Few datasets; narrow coverage

Constitutional AI (Anthropic, 2022)

Motivation

Human annotation for RLHF is expensive, slow, and inconsistent. CAI replaces human preference labelling with an AI judge that applies a written constitution — a set of principles describing what helpful, harmless, and honest behaviour looks like.

Two-Stage Process

Stage 1: Supervised Learning from AI Feedback (SL-CAI)
  1. Prompt model to generate a potentially harmful response (red-teaming)
  2. Prompt model to critique its own response using constitution principles
  3. Prompt model to revise the response based on the critique
  4. Repeat critique-revise cycle (typically 2 rounds)
  5. Fine-tune on the final (revised) responses → SL-CAI model

Stage 2: Reinforcement Learning from AI Feedback (RLAIF)
  1. For each prompt, generate response pairs (SL-CAI model)
  2. AI judge (frontier model + constitution) ranks: A > B
  3. Train preference model on AI rankings
  4. PPO fine-tuning using AI preference model
  → Constitutional AI model (the Claude family, e.g. Claude 1, Claude 2, and the Claude 3 Haiku, Sonnet, and Opus tiers)
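
The critique-revise loop in Stage 1 (steps 1 to 4) can be sketched as below. This is a minimal illustration, not Anthropic's implementation: `generate` is a hypothetical callable standing in for any text-generation model, the prompt wording is invented, and the fine-tuning in step 5 is out of scope. A stub model is used so the sketch runs without any external service.

```python
from typing import Callable, List, Sequence, Tuple


def critique_and_revise(prompt: str, principles: Sequence[str],
                        generate: Callable[[str], str], rounds: int = 2) -> str:
    """Return the response after `rounds` critique-revise cycles."""
    response = generate(prompt)  # step 1: initial (possibly harmful) response
    for i in range(rounds):
        principle = principles[i % len(principles)]
        # Step 2: ask the model to critique its own response against a principle.
        critique = generate(
            f"Critique the response using this principle: {principle}\n"
            f"Prompt: {prompt}\nResponse: {response}"
        )
        # Step 3: ask the model to revise the response in light of the critique.
        response = generate(
            f"Rewrite the response to address the critique.\n"
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}"
        )
    return response  # step 5 would fine-tune on (prompt, response) pairs


def make_stub() -> Tuple[Callable[[str], str], List[str]]:
    """Hypothetical stand-in model that records every call it receives."""
    calls: List[str] = []

    def generate(text: str) -> str:
        calls.append(text)
        return f"output-{len(calls)}"

    return generate, calls


stub_generate, stub_calls = make_stub()
final = critique_and_revise(
    "example red-team prompt",
    ["Choose the response that is least likely to contain harmful or unethical content."],
    stub_generate,
    rounds=2,
)
print(final, len(stub_calls))
```

Each round costs two model calls (one critique, one revision) on top of the initial generation.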

The Constitution

A constitution is a list of principles. Example principles from Anthropic’s published constitution:

“Choose the response that is least likely to contain harmful or unethical content.”
“Choose the response that is most supportive of human autonomy and individual rights.”
“Choose the response that is most honest and avoids claiming to be human when it isn’t.”

The AI judge evaluates response pairs against each principle and produces a binary preference label.
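
One way to turn per-principle judgements into a single binary label is sketched below. The note does not say how judgements are combined, so the majority vote here is an assumption, as are the `judge` callable signature and the toy judge that simply prefers the shorter response.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical judge signature: (principle, prompt, response_a, response_b) -> "A" or "B".
Judge = Callable[[str, str, str, str], str]


def preference_label(prompt: str, response_a: str, response_b: str,
                     principles: Sequence[str], judge: Judge) -> int:
    """Return 1 if A wins a majority of principles, else 0 (ties go to B, arbitrarily)."""
    votes = Counter(judge(p, prompt, response_a, response_b) for p in principles)
    return 1 if votes["A"] > votes["B"] else 0


def toy_judge(principle: str, prompt: str, response_a: str, response_b: str) -> str:
    """Stand-in judge for demonstration only: prefers the shorter response."""
    return "A" if len(response_a) <= len(response_b) else "B"


principles = ["principle 1", "principle 2", "principle 3"]
print(preference_label("p", "short", "a much longer answer", principles, toy_judge))
print(preference_label("p", "a much longer answer", "short", principles, toy_judge))
```

In Stage 2 these labels would form the training set for the preference model.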

Advantages over RLHF

Dimension	RLHF	Constitutional AI
Annotation cost	High (human raters)	Low (AI judge)
Scalability	Limited by annotation budget	Scales with compute
Consistency	Variable (rater disagreement)	Consistent (same constitution)
Customisability	Hard (retrain RM for new values)	Easy (edit constitution)
Transparency	RM is a black box	Principles are human-readable

RLAIF — Reinforcement Learning from AI Feedback

RLAIF is the general term for using any AI model (not necessarily a constitution-based judge) to provide preference labels. CAI is one specific implementation of RLAIF.

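Related methods that reduce reliance on human preference labels or on the RL step:

Self-Rewarding Language Models (Yuan et al., 2024): the model is its own judge
SPIN (Self-Play fine-tuning): the model's own generations serve as rejected responses against human-written SFT responses
Direct Preference Optimisation (DPO): not an RLAIF variant in itself; it eliminates the RL step and optimises directly on preference data, whether human- or AI-labelled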

DPO — Direct Preference Optimisation

DPO (Rafailov et al., 2023) reformulates RLHF without an explicit reward model or PPO:

$L_{DPO}(π_{θ}) = -\mathbb{E}_{(x, y_{w}, y_{l}) \sim D} \left[ \log σ \left( β \log \frac{π_{θ}(y_{w} ∣ x)}{π_{ref}(y_{w} ∣ x)} - β \log \frac{π_{θ}(y_{l} ∣ x)}{π_{ref}(y_{l} ∣ x)} \right) \right]$

Where $y_{w}$ = preferred response, $y_{l}$ = dispreferred response.
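
The loss for a single preference triple can be computed directly from four sequence log-probabilities, as in this minimal sketch. The function name `dpo_loss`, the default `beta=0.1`, and the example log-probabilities are assumptions for illustration; in practice the log-probabilities come from summing token log-probs under the policy and the frozen reference model.

```python
import math


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float, beta: float = 0.1) -> float:
    """DPO loss for one (x, y_w, y_l) triple, given sequence log-probabilities.

    logp_*     : log pi_theta(y | x) for the preferred (w) and dispreferred (l) response
    ref_logp_* : log pi_ref(y | x) for the same responses
    """
    # beta * [log-ratio of preferred - log-ratio of dispreferred]
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(logits)) as a numerically stable softplus(-logits).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# When the policy equals the reference, the loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Raising the policy's log-probability of the preferred response lowers the loss.
print(dpo_loss(-4.0, -7.0, -5.0, -7.0))
```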

DPO is preferred for instruction fine-tuning because:

No RL training instability (PPO is notoriously finicky)
No separate reward model to train and maintain
Direct optimisation on preference pairs
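
Why no separate reward model is needed can be sketched in three steps. The derivation follows Rafailov et al. (2023) rather than anything stated in this note, and it assumes a Bradley-Terry preference model; $π_{ref}$ plays the role of $π_{SFT}$ in the RLHF objective.

The KL-regularised objective has a closed-form maximiser:

$π^{*}(y ∣ x) = \frac{1}{Z(x)} π_{ref}(y ∣ x) \exp\left(\frac{1}{β} r(x, y)\right), \quad Z(x) = \sum_{y} π_{ref}(y ∣ x) \exp\left(\frac{1}{β} r(x, y)\right)$

Solving for the reward gives:

$r(x, y) = β \log \frac{π^{*}(y ∣ x)}{π_{ref}(y ∣ x)} + β \log Z(x)$

Under Bradley-Terry only reward differences matter, so the intractable $β \log Z(x)$ term cancels:

$p(y_{w} \succ y_{l} ∣ x) = σ\left(r(x, y_{w}) - r(x, y_{l})\right) = σ\left(β \log \frac{π^{*}(y_{w} ∣ x)}{π_{ref}(y_{w} ∣ x)} - β \log \frac{π^{*}(y_{l} ∣ x)}{π_{ref}(y_{l} ∣ x)}\right)$

Taking the negative log-likelihood of this probability with $π_{θ}$ in place of $π^{*}$ yields $L_{DPO}$: the policy itself acts as an implicit reward model.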

Used by: Llama-3-Instruct, Mistral-Instruct, Phi-3, most open-source instruction models.

Impact on PUMA Models

All models in PUMA’s catalog have undergone some form of alignment training:

Model	Alignment Method	Notes
GPT-4o	RLHF + RLAIF (OpenAI)	Undisclosed full details
Claude 3.7 Sonnet	Constitutional AI (CAI)	Anthropic; constitution is published
DeepSeek-R1	GRPO (Group Relative Policy Optimisation)	Specialised for reasoning
Llama-3.1-8B-Instruct	SFT + DPO	Meta; open-source
Mistral-7B-Instruct	SFT + DPO	Mistral AI
Phi-3-mini	SFT + DPO	Microsoft; dataset-focused alignment

Implications for PUMA

Frozen models: PUMA evaluates all models post-alignment; no fine-tuning in baseline experiments
Alignment tax: aligned models may be more cautious on ambiguous triage tasks → may produce “safe” but incorrect classifications
Constitution-aware prompting: Claude’s CAI training may make it more responsive to explicit normative constraints in system prompts (bounded autonomy framing)
DPO models are PUMA-compatible: open-source DPO models run on Ollama with no special considerations

RLHF vs. Constitutional AI: PUMA Design Implications

Question	RLHF	CAI
Can we inspect alignment criteria?	No (RM is opaque)	Yes (constitution is readable)
Can we audit for PM-domain bias?	Difficult	Feasible — check constitution principles
Consistent with EU AI Act transparency?	Partial	Strong
Available for local deployment?	Open RM checkpoints (rare)	N/A (CAI produces model; model is local)

For the PUMA ethics chapter (Section 5.3): Constitutional AI provides the strongest transparency story for regulatory compliance framing.

Related Notes

PN-FineTuning-LoRA-Quantization — SFT and LoRA fine-tuning (Stage 1 of alignment pipeline)
PN-LLM-Models-PUMA — model catalog: alignment method per model
PN-HITL-BoundedAutonomy — HITL as alignment-complement at inference time
PN-AlgorithmicBias — bias introduced or perpetuated by RLHF rater demographics
PN-Reflexion-SelfCritique — comparison table: Reflexion vs. RLHF vs. CAI
