PN: LLM Models Used in PUMA — Technical Reference

Core Idea

PUMA experiments compare local open-source models (run via Ollama) against proprietary cloud APIs. This note catalogs each model’s architecture, key specifications, Ollama identifiers, and relevance to PUMA’s evaluation methodology.


Model Architecture Background

Transformer Architecture (Vaswani et al., 2017)

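Every model in this note is based on the Transformer, as are most modern LLMs:

  • Self-attention: Each token attends to all tokens in the sequence; Attention(Q, K, V) = softmax(QKᵀ / √d_k) V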
  • Multi-head attention: Parallel attention heads capture different relationship types
  • Feed-forward layers: Position-wise transformation after attention
  • Positional encoding: Adds sequence position information (sinusoidal or RoPE)
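The self-attention bullet above can be sketched in a few lines of pure Python. This is a minimal illustration of scaled dot-product attention on toy vectors, not how any of the listed models implement it: the learned query/key/value projections, multiple heads, and masking are all omitted, and the function names are hypothetical.

```python
import math

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        all_weights.append(weights)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([
            sum(w * v[col] for w, v in zip(weights, values))
            for col in range(len(values[0]))
        ])
    return outputs, all_weights

# Toy input: 3 tokens with 2-dimensional embeddings (assumed values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the value vectors.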

Key Design Axes

Axis | Options | PUMA Implication
Attention type | Full, Flash Attention 2, GQA | Affects inference speed/VRAM
Context length | 4k–128k tokens | Limits few-shot example count
Tokenizer | BPE, SentencePiece | Affects token efficiency for code/PM text
Quantization | FP16, Q4_K_M, Q5_K_S (distributed as GGUF files) | Local VRAM vs. quality tradeoff

Mixture of Experts (MoE)

Instead of a dense feed-forward layer, MoE uses a router that activates only a subset of expert sub-networks per token:

  • Sparse activation: Only a few experts (typically 2–8) fire per token, even though the full model holds many more parameters
  • Benefit: Larger effective capacity at lower inference cost than an equivalent dense model
  • Examples: Mixtral 8×7B (top-2 routing), DeepSeek-V3 (256 routed experts, 37B active/671B total)
  • PUMA: DeepSeek-V3 is the frontier MoE model tested in cloud evaluation
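The routing step described above can be illustrated with a small sketch. It assumes Mixtral-style gating (softmax over the logits of the selected top-k experts); other models, including DeepSeek-V3, use different gating functions. The toy experts and logits are made-up values, and the function names are hypothetical.

```python
import math

def top_k_route(router_logits, k):
    """Return [(expert_index, gate_weight)] for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Softmax over the selected logits only, so the gate weights sum to 1.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_layer(x, experts, router_logits, k=2):
    # Only the k routed experts are evaluated; the rest cost nothing.
    return sum(gate * experts[idx](x) for idx, gate in top_k_route(router_logits, k))

# Four toy "experts" that just scale their input (assumed for illustration).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]
routes = top_k_route(logits, k=2)
y = moe_layer(1.0, experts, logits, k=2)

# Active-parameter share for DeepSeek-V3, using the figures from this note.
active_fraction = 37 / 671
print(f"active share: {active_fraction:.1%}")
```

With the 37B/671B figures quoted above, roughly 5.5% of the parameters participate in any single token's forward pass.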

Open-Source Local Models

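Llama 3.1 8B (Meta, 2024)

Property | Value
Parameters | 8B
Architecture | Transformer (dense), GQA
Context | 128K tokens
Training tokens | 15T
Ollama tag | llama3.1:8b
GGUF quantization | Q4_K_M (~5GB VRAM)
License | Llama 3.1 Community License (commercial use allowed; a separate license is required above 700M monthly active users)

Note: The 8B, 128K-context, 15T-token model is Llama 3.1. The Llama 3.2 release covers 1B and 3B text models and 11B and 90B vision models, so there is no llama3.2:8b tag.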

PUMA relevance: Primary local baseline; strong instruction following; 128K context allows multi-shot prompting. GQA reduces KV-cache memory for long contexts.
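To make the GQA claim concrete, the sketch below estimates KV-cache size at full context for standard multi-head attention versus GQA. The shape values (32 layers, 32 query heads, 8 KV heads, head dimension 128) are assumptions for an 8B Llama-style model and are not stated in this note; the function name is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Factor 2: one key tensor and one value tensor are cached per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

GIB = 1024 ** 3
SEQ_LEN = 128 * 1024  # 128K-token context, FP16 cache (2 bytes per value)

# Assumed shapes: 32 layers, head_dim 128; MHA keeps 32 KV heads, GQA keeps 8.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=SEQ_LEN)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)

print(f"MHA: {mha / GIB:.0f} GiB, GQA: {gqa / GIB:.0f} GiB")
# prints "MHA: 64 GiB, GQA: 16 GiB"
```

Under these assumptions GQA cuts the cache by the ratio of query heads to KV heads (4×), which is what makes long few-shot prompts practical on a single consumer GPU.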


Mistral 7B Instruct v0.3 (Mistral AI, 2023; v0.3 released 2024)

Property | Value
Parameters | 7.3B
Architecture | Transformer (dense), GQA; Sliding Window Attention in v0.1 only
Context | 32K tokens
Ollama tag | mistral:7b-instruct-v0.3
License | Apache 2.0 (fully open)

PUMA relevance: Strong at structured JSON output; v0.3 adds function-calling support. Sliding Window Attention applies to v0.1 only; v0.2 and later use full attention over the 32K context. Apache 2.0 is among the most permissive licenses for commercial/academic deployment.
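A minimal sketch of requesting JSON output from a local model through Ollama's REST API, using only the standard library. It assumes an Ollama server on the default port; the helper names and the example prompt are hypothetical and not part of PUMA's code. Nothing is sent until `generate_json` is called.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model, prompt):
    """Payload for /api/generate that constrains the reply to valid JSON."""
    return {
        "model": model,
        "prompt": prompt,
        "format": "json",                 # ask Ollama to emit a JSON object
        "stream": False,                  # single response instead of chunks
        "options": {"temperature": 0},    # reduce run-to-run variation
    }

def generate_json(model, prompt, timeout=120):
    """Send the request and parse the model's reply (requires a running server)."""
    body = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        reply = json.loads(resp.read())
    # The generated text is in the "response" field as a JSON string.
    return json.loads(reply["response"])

payload = build_request(
    "mistral:7b-instruct-v0.3",
    'Reply with a JSON object containing the key "estimate".',  # hypothetical prompt
)
```

The prompt should still ask for JSON explicitly; the `format` field constrains the output syntax but does not define which keys appear.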


Phi-3.5 Mini Instruct (Microsoft, 2024)

Property | Value
Parameters | 3.8B
Architecture | Transformer (dense)
Context | 128K tokens
Ollama tag | phi3.5:mini
License | MIT

PUMA relevance: Smallest model; tests feasibility under extreme resource constraints (edge deployment, no GPU). Strong reasoning-per-parameter ratio, attributed to “textbook quality” training data curation.


Gemma 2 9B Instruct (Google DeepMind, 2024)

Property | Value
Parameters | 9B
Architecture | Transformer, alternating local/global attention, logit soft-capping
Context | 8K tokens
Ollama tag | gemma2:9b
License | Gemma Terms of Use (commercial use permitted, subject to Google's prohibited-use policy)

PUMA relevance: Strong instruction following; alternating attention is a novel architectural choice. Shorter context (8K) limits few-shot example count — relevant constraint for H2 estimation.
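The effect of context length on few-shot capacity can be estimated with simple token arithmetic. The sketch below is illustrative only: the per-example, system-prompt, query, and output token counts are assumed values, not PUMA measurements, and the function name is hypothetical.

```python
def max_few_shot_examples(context_tokens, system_tokens, query_tokens,
                          reserved_output_tokens, tokens_per_example):
    """How many few-shot examples fit after fixed prompt parts are budgeted."""
    available = (context_tokens - system_tokens
                 - query_tokens - reserved_output_tokens)
    return max(0, available // tokens_per_example)

# Assumed budget: 300-token system prompt, 400-token query,
# 512 tokens reserved for the answer, 350 tokens per example.
budget = dict(system_tokens=300, query_tokens=400,
              reserved_output_tokens=512, tokens_per_example=350)

gemma_fit = max_few_shot_examples(8_192, **budget)      # 8K context
long_fit = max_few_shot_examples(131_072, **budget)     # 128K context

print(f"8K context: {gemma_fit} examples, 128K context: {long_fit} examples")
# prints "8K context: 19 examples, 128K context: 371 examples"
```

Token counts depend on the tokenizer, so the same examples will consume different budgets on different models.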


DeepSeek-R1 (DeepSeek, 2025)

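Property | Value
Parameters | Distilled dense variants from 1.5B to 70B; parent R1 is 671B
Architecture | Dense (distilled onto Qwen 2.5 and Llama 3 base models); parent uses MoE
Context | 128K tokens
Training | Parent R1: GRPO reinforcement learning after a small cold-start SFT stage (the R1-Zero precursor skipped SFT entirely); distilled variants: SFT on R1-generated reasoning traces
Ollama tag | deepseek-r1:7b, deepseek-r1:14b, deepseek-r1:32b
License | MIT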

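PUMA relevance: Open-weight reasoning model family. The R1-Zero precursor was trained with RL alone (no SFT, not RLHF), and R1 adds a small cold-start SFT stage before RL. Generates explicit chain-of-thought between <think> tags, which aligns with PUMA’s CoT prompting strategy. The full 671B R1 is reported as competitive with OpenAI o1 on AIME/MATH at a fraction of the cost; the distilled 7B variant is weaker.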
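Because the reasoning trace arrives inline with the answer, downstream parsing usually needs to separate the two. A minimal sketch, assuming a single <think>...</think> block per response; the function name and sample text are hypothetical.

```python
import re

# DOTALL lets "." match newlines, since reasoning traces span multiple lines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text):
    """Return (reasoning, answer); reasoning is empty if no <think> block exists."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the <think> block is treated as the final answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer

sample = '<think>\nTwo subtasks, each small.\n</think>\n{"estimate": 3}'
reasoning, answer = split_reasoning(sample)
```

Stripping the trace before JSON parsing avoids failures when the reasoning itself contains braces or quotes.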


Qwen 3 (Alibaba, 2025)

Property | Value
Parameters | 0.6B–32B (dense); 30B-A3B, 235B-A22B (MoE)
Context | 32K tokens native; up to 128K with YaRN scaling on the larger variants
Feature | Thinking mode (extended CoT), switchable per request
License | Apache 2.0

PUMA relevance: State-of-the-art open-weight model family at the time of the PUMA experiments; the 8B and 14B variants are reported to approach GPT-4o on some coding and instruction-following benchmarks.


Proprietary Cloud Models

GPT-4o (OpenAI, 2024)

Property | Value
Parameters | Undisclosed (the ~200B MoE figure is an unconfirmed third-party estimate)
Context | 128K tokens
API | gpt-4o via OpenAI API
Input cost | $2.50 / 1M tokens
Output cost | $10.00 / 1M tokens

PUMA relevance: Frontier cloud baseline; upper-bound reference for H1/H2 experiments. GPT-4o’s JSON mode (response_format={"type": "json_object"}) directly supports PUMA’s structured output requirements.
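The per-token prices above translate into experiment cost as follows. This is a back-of-the-envelope sketch: the prices are the ones listed in this note, while the request count and token sizes are assumed values for illustration and the function name is hypothetical.

```python
def gpt4o_cost_usd(input_tokens, output_tokens,
                   input_price_per_m=2.50, output_price_per_m=10.00):
    """Cost in USD given token counts and per-million-token prices."""
    return (input_tokens / 1_000_000 * input_price_per_m
            + output_tokens / 1_000_000 * output_price_per_m)

# Assumed workload: 1,000 requests, 3,000 prompt tokens and 200 output tokens each.
n_requests = 1_000
cost = gpt4o_cost_usd(n_requests * 3_000, n_requests * 200)

print(f"${cost:.2f}")
# prints "$9.50"
```

Few-shot prompts dominate the bill here: input tokens account for $7.50 of the $9.50 despite the lower per-token price.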


Claude 3.5 Sonnet / Claude 4 (Anthropic, 2024–2025)

Property | Value
Architecture | Transformer-based (details undisclosed); trained with Constitutional AI and RLHF
Context | 200K tokens
API | claude-3-5-sonnet-20241022 / claude-sonnet-4
Strength | Long context, structured output, careful instruction following

PUMA relevance: Tested as an alternative cloud model; the 200K context allows large TAWOS dataset subsets to be placed directly in the prompt. Well suited to few-shot examples with long issue descriptions.


DeepSeek-V3 (DeepSeek, 2024)

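Property | Value
Parameters | 671B total, 37B active (MoE)
Experts | 256 routed experts plus 1 shared expert per MoE layer; top-8 routing per token
Context | 128K tokens
Training cost | ~$5.6M reported for the final training run (GPU rental cost only; excludes prior research and ablations)
License | MIT (code); open weights (original release under the DeepSeek Model License)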

PUMA relevance: Among the most capable open-weight models; competitive with GPT-4o on coding. Shows that near-frontier performance is achievable with open-weight models, which supports PUMA’s open-model track.


Quantization Reference (Local Deployment)

Format | Bits | Quality Loss | VRAM (7B) | VRAM (13B)
FP16 | 16 | None (baseline) | ~14GB | ~26GB
Q8_0 | 8 | Minimal | ~8GB | ~14GB
Q5_K_S | 5 | Very small | ~5.5GB | ~9GB
Q4_K_M | 4 | Small | ~4.8GB | ~8GB
Q3_K_M | 3 | Moderate | ~4GB | ~6.5GB
Q2_K | 2 | Large | ~3GB | ~5GB

PUMA: Use Q4_K_M as default for local experiments (best quality/VRAM tradeoff). Run FP16 for final reported results when hardware allows.
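The FP16 rows in the table follow directly from parameter count times bits per weight. The sketch below computes that weights-only lower bound; it is an approximation under the assumption of nominal bit widths. Real GGUF K-quants store per-block scales (so effective bits per weight are higher than the nominal figure), and the KV cache and runtime buffers add further memory, which is why the quantized rows in the table sit above these values. The function name is hypothetical.

```python
def weights_gb(params_billion, bits_per_weight):
    """Weights-only memory in GB (1 GB = 1e9 bytes); excludes KV cache and overhead."""
    return params_billion * bits_per_weight / 8

for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q4_K_M", 4)]:
    print(f"{name}: 7B ≈ {weights_gb(7, bits):.1f} GB, "
          f"13B ≈ {weights_gb(13, bits):.1f} GB (nominal, weights only)")

fp16_7b = weights_gb(7, 16)    # matches the ~14GB FP16 row
fp16_13b = weights_gb(13, 16)  # matches the ~26GB FP16 row
```

For planning, treat the computed number as a floor and leave headroom for context: long few-shot prompts grow the KV cache independently of the weight format.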


Model Selection Rationale for PUMA

Model | Track | Rationale
Llama 3.1 8B | Local (primary) | Best balance of capability and resource requirements
Mistral 7B | Local | Apache 2.0 license; strong JSON compliance
Phi-3.5 Mini | Local (efficiency) | Tests minimum viable deployment
Gemma 2 9B | Local | Google architecture variant; novel attention
GPT-4o | Cloud (primary) | Performance upper bound
DeepSeek-R1 7B | Local (reasoning) | Open reasoning model; CoT alignment with PUMA
DeepSeek-V3 | Cloud (open-weight) | Validates open frontier model feasibility
