Local LLM inference trades capability ceiling for reproducibility, privacy, and zero marginal cost

Local LLMs running on consumer hardware via Ollama cannot match GPT-4’s raw capability, but for benchmark research they offer three decisive advantages that cloud models cannot replicate: reproducibility, privacy, and zero marginal cost.


The Trade-off

Dimension           | Local (Ollama + Llama 3.1 8B)            | Cloud (GPT-4, Claude Opus)
Capability ceiling  | Lower (8B parameters)                    | Higher (>100B parameters)
Reproducibility     | Perfect (seed=42, temp=0, fixed version) | Imperfect (model updates, API non-determinism)
Cost per experiment | $0 (after hardware)                      | ~$1.00 per 1k tokens
Privacy             | Data stays local                         | Data sent to third-party servers
Carbon footprint    | Measurable, lower                        | Difficult to measure, higher
Availability        | Always (no internet required)            | API-dependent

Why This Matters for PUMA

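Angermeir et al. (2025) found that only 5 of 18 LLM-SE papers with published artefacts were actually executable. A primary cause is that cloud API changes break experimental pipelines. Local inference with pinned model versions removes that failure mode: Ollama addresses model weights by content digest, so the exact weights used can be recorded and verified, and with a fixed seed and temperature 0 a rerun on the same hardware and runtime should reproduce outputs exactly.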
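
As a minimal sketch of what pinned, deterministic inference could look like in practice, the Python below builds a request for Ollama's local REST endpoint (/api/generate, port 11434 by default) with the seed and temperature from the table above, and fingerprints each run so that two executions can be compared byte for byte. The helper names (build_request, fingerprint, generate) are illustrative and not part of PUMA; determinism across different hardware or Ollama versions is an assumption to verify, not something this code enforces.

```python
import hashlib
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(model: str, prompt: str, seed: int = 42) -> dict:
    """Build a non-streaming /api/generate payload with fixed sampling options."""
    return {
        "model": model,   # a fully qualified tag, e.g. "mistral:7b-instruct-q4_K_M"
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {"seed": seed, "temperature": 0},
    }


def fingerprint(payload: dict, response_text: str) -> str:
    """SHA-256 over the canonical request plus the response, for run-to-run comparison."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((canonical + "\n" + response_text).encode("utf-8")).hexdigest()


def generate(model: str, prompt: str, seed: int = 42) -> tuple[str, str]:
    """Send the request to a locally running Ollama server; returns (text, fingerprint)."""
    payload = build_request(model, prompt, seed)
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        text = json.loads(resp.read())["response"]
    return text, fingerprint(payload, text)
```

Storing the fingerprint alongside each result makes drift visible: if a rerun under the same model tag yields a different hash, either the model weights or the runtime changed.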

Calikli & Alhamed (2025) showed that prompt format has non-monotonic effects on estimation quality — effects that vary across model versions. Only reproducible local inference allows controlled comparison of prompting strategies.
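
A controlled comparison of that kind could be organised as below: every prompt format is applied to the same items and sent through the same generation function, so the template is the only thing that varies. This is a hedged sketch; the template names, the {task} placeholder, and the generate callable are hypothetical stand-ins, not the formats studied by Calikli & Alhamed.

```python
from typing import Callable, Dict, List

# Hypothetical prompt formats; each must contain a {task} placeholder.
TEMPLATES: Dict[str, str] = {
    "plain": "Estimate the effort for this task: {task}",
    "structured": "Task: {task}\nAnswer with a single number of hours.",
}


def run_grid(
    tasks: List[str],
    templates: Dict[str, str],
    generate: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Apply every template to every task, keeping task order fixed across templates."""
    results: Dict[str, List[str]] = {}
    for name in sorted(templates):  # sorted so the run order is itself reproducible
        results[name] = [generate(templates[name].format(task=task)) for task in tasks]
    return results
```

With a deterministic generate function, rerunning yields the same grid, so any difference between rows is attributable to the template rather than to sampling noise or a silent model update.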


Falsifiable Claim for PUMA

H1 and H2 evaluate local models. If local models fail to surpass baselines, this is a valid scientific result (failure to reject H₀) — not a study weakness. It establishes the current capability boundary for local LLMs in PM tasks.
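
To make "fail to surpass baselines" concrete, one possible decision rule is a one-sided exact sign test on paired per-item errors. This note does not say which test PUMA uses, so the choice of test, the 0.05 threshold, and the function names below are assumptions for illustration only.

```python
from math import comb
from typing import Sequence


def sign_test_p_value(model_errors: Sequence[float], baseline_errors: Sequence[float]) -> float:
    """One-sided exact sign test. H0: the model is not better than the baseline.

    A 'win' is an item where the model's error is strictly lower; ties are dropped.
    Returns P(X >= wins) for X ~ Binomial(n, 0.5).
    """
    if len(model_errors) != len(baseline_errors):
        raise ValueError("error lists must be paired item by item")
    wins = sum(m < b for m, b in zip(model_errors, baseline_errors))
    losses = sum(m > b for m, b in zip(model_errors, baseline_errors))
    n = wins + losses
    if n == 0:
        return 1.0  # all ties: no evidence either way
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


def reject_h0(p_value: float, alpha: float = 0.05) -> bool:
    """True: the model beat the baseline at level alpha. False is still a reportable result."""
    return p_value < alpha
```

Under this rule, a p-value above the threshold is recorded as a failure to reject H₀ and reported as such, which is the outcome the paragraph above treats as a legitimate finding.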

This is distinct from a study using GPT-4: if GPT-4 succeeds, the result cannot be reproduced without payment and may not hold with the next API update.


Practical Notes for PUMA

  • Llama 3.1 8B (Q4_K_M, ~5 GB RAM): primary model. Good reasoning and instruction following.
  • Mistral 7B (Q4_K_M, ~4.5 GB RAM): comparison model. Faster inference, different tokenisation.
  • Phi-3.5 Mini 3.8B (~2 GB RAM): fallback if latency > 60 s on target hardware.
  • Ollama commands: ollama pull llama3.1:8b-instruct-q4_K_M, ollama run mistral:7b-instruct-q4_K_M
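
The 60-second fallback rule in the list above could be automated roughly as follows. This is a sketch under assumptions: pick_model and the probe callable are hypothetical, the candidate tags are supplied by the caller in order of preference, and a single probe prompt is taken as representative of real workload latency.

```python
import time
from typing import Callable, Optional, Sequence

LATENCY_BUDGET_S = 60.0  # threshold from the notes above


def pick_model(
    candidates: Sequence[str],
    probe: Callable[[str], object],
    budget_s: float = LATENCY_BUDGET_S,
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[str]:
    """Return the first candidate whose probe call finishes within budget_s seconds.

    candidates: model tags in order of preference (primary first, fallback last).
    probe: runs one representative prompt against the given model tag.
    Returns None if every candidate exceeds the budget.
    """
    for tag in candidates:
        start = clock()
        probe(tag)
        elapsed = clock() - start
        if elapsed <= budget_s:
            return tag
    return None
```

The clock parameter exists only so the selection logic can be checked without waiting on real inference. Whichever model is selected should be recorded with the results, since it changes what the experiment measures.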
