ST: Reproducibility in LLM-SE Research — Structure Note
Theme: Why is reproducibility a crisis in LLM-SE research, and how does PUMA address it?
Core Claims
Reproducibility Crisis
Only 5 of 18 LLM-SE papers with published artefacts are actually executable (Angermeir 2025). PUMA is designed to be fully reproducible by construction.
- Only 5/18 LLM-SE papers with artefacts are executable (Angermeir 2025) → PN-KeyConcepts-Agents-Reproducibility-RedTeam
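- Local inference + pinned models = repeatable outputs on a fixed hardware and software stack (bit-identical results are not guaranteed across GPUs or library versions) → PN-LLM-Local-vs-Cloud
- seed=42 + temperature=0 remove sampling randomness (fixed seed, greedy decoding) → SP-PUMA-Constitution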
- RAG complicates reproducibility when retrieval index changes → PN-RAG-Embeddings-VectorDB
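
The claims above can be sketched in code. This is a minimal illustration, not PUMA's actual implementation: it assumes a local Ollama server on its default port and uses Ollama's `/api/generate` endpoint with the `seed` and `temperature` options. The helper names (`build_request`, `generate`, `fingerprint`) and the model tag in the usage note are hypothetical. The same fingerprint helper can be applied to the chunks in a retrieval index, so a changed index shows up as a changed hash.

```python
import hashlib
import json
import urllib.request
from typing import Sequence

# Default endpoint of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, seed: int = 42) -> dict:
    """Build a non-streaming /api/generate payload with pinned decoding options."""
    return {
        "model": model,  # assumption: caller passes an exact, pinned model tag
        "prompt": prompt,
        "stream": False,
        "options": {"seed": seed, "temperature": 0},
    }


def generate(payload: dict) -> str:
    """Send the payload to the local Ollama server and return the generated text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


def fingerprint(texts: Sequence[str]) -> str:
    """Order-sensitive SHA-256 over strings (model outputs or indexed chunks)."""
    h = hashlib.sha256()
    for t in texts:
        data = t.encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))  # length prefix keeps boundaries unambiguous
        h.update(data)
    return h.hexdigest()
```

A rerun check would call `generate(build_request("example-model:tag", prompt))` twice and compare `fingerprint` values of the outputs. Storing the fingerprint of the indexed chunks next to the results lets a later run detect that the retrieval index has changed.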
Failure Taxonomy (Angermeir 2025)
| Failure type | PUMA mitigation |
|---|---|
| Missing dependencies | requirements.txt with pinned versions |
| Undocumented preprocessing | Documented scripts in repo |
| Non-deterministic LLM calls | temperature=0, seed=42, local Ollama |
| Absent random seeds | seed=42 set for every random number generator and LLM call |
→ LN-Angermeir-2025-Reproducibility → SP-PUMA-Constitution (Art. 1) → PR-PUMA-Ch3-Methods (§3.8) → MOC-Methods-Frameworks
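
Two of the mitigations in the table can be checked mechanically. The sketch below is illustrative only and is not taken from the PUMA repository. `unpinned_requirements` is a hypothetical helper that flags `requirements.txt` lines without an exact `==` pin, using a deliberately simplified pattern that ignores URLs and environment markers. `seeded_sample` shows the seed=42 convention with a private `random.Random` instance.

```python
import random
import re

# Simplified pattern: package name, optional [extras], then an exact '==' pin.
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9,._-]+\])?==[^\s;]+")


def unpinned_requirements(text: str) -> list:
    """Return requirement lines that do not use an exact '==' pin."""
    bad = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options such as -r
            continue
        if not PINNED.match(line):
            bad.append(line)
    return bad


def seeded_sample(seed: int = 42, n: int = 5) -> list:
    """Draw n integers from a private RNG so the result depends only on the seed."""
    rng = random.Random(seed)
    return [rng.randint(0, 999) for _ in range(n)]
```

Running `unpinned_requirements` over the repository's `requirements.txt` in a pre-commit or CI step would catch a loosened pin before it reaches a published artefact. Using a private `random.Random(seed)` rather than the module-level generator keeps the draw independent of any other code that touches global random state.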