Local LLM inference trades capability ceiling for reproducibility, privacy, and zero marginal cost

Local LLMs (running on consumer hardware via Ollama) cannot match GPT-4’s raw capability, but for benchmark research they offer three decisive advantages that cloud models cannot replicate.

The Trade-off

Dimension	Local (Ollama + Llama 3.2 8B)	Cloud (GPT-4, Claude Opus)
Capability ceiling	Lower (8B parameters)	Higher (>100B parameters)
Reproducibility	Perfect (seed=42, temp=0, fixed version)	Imperfect (model updates, API non-determinism)
Cost per experiment	$0 (after hardware)	~ $0.10-$ 1.00 per 1k tokens
Privacy	Data stays local	Data sent to third-party servers
Carbon footprint	Measurable, lower	Difficult to measure, higher
Availability	Always (no internet required)	API-dependent

Why This Matters for PUMA

Angermeir et al. (2025) found that only 5 of 18 LLM-SE papers with published artefacts were actually executable. A primary cause: cloud API changes break experimental pipelines. Local inference with pinned model versions (Ollama uses content-addressable model registry) guarantees bit-identical reproduction.

Calikli & Alhamed (2025) showed that prompt format has non-monotonic effects on estimation quality — effects that vary across model versions. Only reproducible local inference allows controlled comparison of prompting strategies.

Falsifiable Claim for PUMA

H1 and H2 evaluate local models. If local models fail to surpass baselines, this is a valid scientific result (failure to reject H₀) — not a study weakness. It establishes the current capability boundary for local LLMs in PM tasks.

This is distinct from a study using GPT-4: if GPT-4 succeeds, the result cannot be reproduced without payment and may not hold with the next API update.

Practical Notes for PUMA

Llama 3.2 8B (Q4_K_M, ~5GB RAM): primary model. Good reasoning, instruction following.
Mistral 7B (Q4_K_M, ~4.5GB RAM): comparison model. Faster inference, different tokenisation.
Phi-3.5 Mini 3.8B (~2GB RAM): fallback if latency > 60s on target hardware.
Ollama commands: ollama pull llama3.2:8b-instruct-q4_K_M, ollama run mistral:7b-instruct-q4_K_M

PN-RAG-Embeddings-VectorDB — Stage 4 RAG on local models
PN-KeyConcepts-Agents-Reproducibility-RedTeam — Reproducibility rationale
PN-IssueTriage-StoryPoints — PUMA evaluation targets
SP-PUMA-Constitution — Art. 2 Local-Only Inference
PR-PUMA-Ch3-Methods — §3.3 Models
Carbon-Tracking-Log

PUMA Vault

Explorador

Local LLM inference trades capability ceiling for reproducibility, privacy, and zero marginal cost

Local LLM inference trades capability ceiling for reproducibility, privacy, and zero marginal cost

The Trade-off

Why This Matters for PUMA

Falsifiable Claim for PUMA

Practical Notes for PUMA

MOCs

Vista Gráfica

Tabla de Contenidos

Retroenlaces

PUMA Vault

Explorador

Local LLM inference trades capability ceiling for reproducibility, privacy, and zero marginal cost

Local LLM inference trades capability ceiling for reproducibility, privacy, and zero marginal cost

The Trade-off

Why This Matters for PUMA

Falsifiable Claim for PUMA

Practical Notes for PUMA

Related Notes

MOCs

Vista Gráfica

Tabla de Contenidos

Retroenlaces