ST: Reproducibility in LLM-SE Research — Structure Note

Theme: Why is reproducibility a crisis in LLM-SE research, and how does PUMA address it?


Core Claims

Reproducibility Crisis

Only 5 of 18 LLM-SE papers with published artefacts are actually executable (Angermeir 2025). PUMA is designed to be fully reproducible by construction.

  1. Only 5/18 LLM-SE papers with artefacts are executable (Angermeir 2025) → PN-KeyConcepts-Agents-Reproducibility-RedTeam
  2. Local inference + pinned models = bit-identical reproduction → PN-LLM-Local-vs-Cloud
  3. seed=42 + temperature=0 ensure determinism → SP-PUMA-Constitution
  4. RAG complicates reproducibility when the retrieval index changes → PN-RAG-Embeddings-VectorDB
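
Claim 4 can be made checkable by recording a fingerprint of the retrieval index alongside each result. The following is a minimal sketch, not PUMA's actual implementation: the function name `index_fingerprint`, the `documents` mapping, and the `embedding_model` string are all hypothetical, and it assumes the corpus fits in memory as plain text.

```python
import hashlib
import json


def index_fingerprint(documents: dict[str, str], embedding_model: str) -> str:
    """Return a SHA-256 hex digest over the corpus and the embedding model name.

    Hypothetical helper: `documents` maps a document id to its text.
    """
    # Canonical serialization: sorted keys make the digest independent of
    # the order in which documents were inserted.
    canonical = json.dumps(
        {"embedding_model": embedding_model, "documents": documents},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    corpus = {"doc-1": "first text", "doc-2": "second text"}
    print(index_fingerprint(corpus, "example-embedding-model"))
```

If the digest stored with a result differs from the digest of the current index, the retrieval context has changed and the result should not be expected to reproduce.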

Failure Taxonomy (Angermeir 2025)

Failure type | PUMA mitigation
Missing dependencies | requirements.txt with pinned versions
Undocumented preprocessing | Documented scripts in repo
Non-deterministic LLM calls | temperature=0, seed=42, local Ollama
Absent random seeds | seed=42 everywhere
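
The "Non-deterministic LLM calls" mitigation could look roughly like the sketch below, which sends a request to a local Ollama server with temperature=0 and seed=42. It assumes Ollama's default local endpoint and its `/api/generate` request format; the helper names `build_payload` and `generate` and the model tag in the usage comment are hypothetical. Whether these settings yield bit-identical output also depends on the Ollama version and hardware, so this is worth verifying per setup.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str, seed: int = 42) -> dict:
    """Build a non-streaming request body with fixed sampling options."""
    return {
        "model": model,  # pin an exact model tag, not a moving alias
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0, "seed": seed},
    }


def generate(model: str, prompt: str, url: str = OLLAMA_URL) -> str:
    """POST the payload to the local server and return the generated text."""
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# Usage (requires a running Ollama server; the model tag is a placeholder):
# text = generate("some-model:pinned-tag", "Summarise this requirement.")
```

Keeping the payload construction separate from the network call makes it easy to log the exact request alongside each result.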

LN-Angermeir-2025-Reproducibility → SP-PUMA-Constitution (Art. 1) → PR-PUMA-Ch3-Methods (§3.8) → MOC-Methods-Frameworks