LN: Angermeir (2025) — Reproducibility of LLM Studies in SE

Bibliographic Reference

Citation: Angermeir, F., Kalinowski, M., & Méndez, D. (2025). Reproducibility of LLM studies in software engineering [Preprint]. arXiv:2510.25506.

Pass 1 Summary (5 Cs)

C	Assessment
Category	Empirical analysis / systematic mapping study
Context	Builds on EBSE tradition; related to Wohlin (2012) on SE experimentation
Correctness	Strong: reviews 85 papers, 18 with artefacts. Methodology clearly described.
Contributions	(1) Only 5/18 papers with published artefacts are executable. (2) Zero papers are fully reproducible end-to-end. (3) Taxonomy of reproducibility failures.
Clarity	Excellent. Well-structured, clear definitions.

Relevance: ⭐⭐⭐⭐⭐ (5/5)

Directly justifies PUMA’s reproducibility design

Pass 2 Key Points

Core finding: Out of 85 LLM-SE papers (2020–2024), 18 published artefacts. Of those, only 5 could be executed. Zero were fully reproducible (same results from scratch).
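The proportions behind these counts can be worked out directly. The sketch below uses only the three counts stated above; the variable and function names are illustrative.

```python
# Counts as recorded in this note for Angermeir et al. (2025).
papers_reviewed = 85
with_artefacts = 18
executable = 5


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Share of all reviewed papers that published artefacts.
artefact_rate = share(with_artefacts, papers_reviewed)
# Share of artefact-publishing papers whose artefacts could be executed.
executable_given_artefacts = share(executable, with_artefacts)
# Share of all reviewed papers with executable artefacts.
executable_overall = share(executable, papers_reviewed)

print(f"Published artefacts:       {artefact_rate}% of reviewed papers")
print(f"Executable (of artefacts): {executable_given_artefacts}%")
print(f"Executable (of all):       {executable_overall}%")
```

Read as a funnel: roughly one paper in five published artefacts, and roughly one in seventeen published artefacts that could be executed.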

Failure taxonomy:

Missing dependencies or broken environments
Undocumented data preprocessing steps
Non-deterministic LLM calls (API model updates, temperature > 0)
Absent or incorrect random seeds

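Implication for PUMA: PUMA’s constitution (Articles 1, 2, 4) directly addresses all four failure categories: fixed dependencies (broken environments), reproducible scripts (undocumented preprocessing), local inference with pinned models (API model updates), and seed=42 + temperature=0 (sampling non-determinism and missing seeds).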
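A minimal sketch of how the pinned-model and seed/temperature mitigations could be recorded in a run script. Only seed=42 and temperature=0 come from this note. `RunConfig`, its field names, the placeholder model identifier, and the manifest layout are hypothetical and are not taken from PUMA's constitution or from Angermeir et al.

```python
import hashlib
import json
import platform
import random
import sys
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical run settings; field names are illustrative only."""
    model_id: str = "local-model-placeholder"  # assumption: stands in for a pinned local model
    seed: int = 42                             # value stated in this note
    temperature: float = 0.0                   # value stated in this note


def start_run(config: RunConfig) -> dict:
    """Seed Python's RNG and return a manifest describing the run."""
    random.seed(config.seed)
    manifest = {
        "config": asdict(config),
        "python": platform.python_version(),
        "platform": sys.platform,
    }
    # Fingerprint only the config, so identical settings give an identical hash.
    payload = json.dumps(manifest["config"], sort_keys=True).encode("utf-8")
    manifest["config_sha256"] = hashlib.sha256(payload).hexdigest()
    return manifest


first = start_run(RunConfig())
sample_a = [random.random() for _ in range(3)]

second = start_run(RunConfig())
sample_b = [random.random() for _ in range(3)]

print(sample_a == sample_b)  # re-seeding replays the same random draws
print(first["config_sha256"] == second["config_sha256"])
```

Seeding here covers only Python's own `random` module. An inference backend has its own seed and decoding settings, which would need to be passed to it separately and stored in the same manifest.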

Key quote context: The study confirms that reproducibility in LLM-SE research is “a critical, largely unresolved problem” — the specific gap PUMA claims to address.

Pass 3 — Virtual Reconstruction

Methodology: Manual screening of arXiv + ACM DL + IEEE Xplore. Inclusion: LLM-based SE studies 2020–2024. Execution test: fresh environment, follow published instructions.

What I would change: The study doesn’t measure why researchers don’t share reproducible artefacts. A survey component would strengthen the taxonomy.

Challenge for PUMA: Angermeir et al. don’t define “fully reproducible” formally. PUMA should provide its own operational definition: “clean environment, seed=42, same results within statistical tolerance.”
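One way the proposed operational definition could be made checkable is sketched below. The note does not fix what "statistical tolerance" means, so the 1% relative tolerance is an assumption, as are the function name and the metric names. The metric values are invented purely for illustration and are not results from any study.

```python
import math


def is_reproduced(
    reference: dict[str, float],
    rerun: dict[str, float],
    rel_tol: float = 0.01,  # assumption: 1% relative tolerance
) -> bool:
    """Return True if the rerun reports the same metrics within rel_tol."""
    # A rerun that reports different metrics cannot count as a reproduction.
    if reference.keys() != rerun.keys():
        return False
    return all(
        math.isclose(reference[name], rerun[name], rel_tol=rel_tol)
        for name in reference
    )


# Invented illustration values, not measurements.
reference = {"accuracy": 0.812, "f1": 0.790}
close_rerun = {"accuracy": 0.815, "f1": 0.788}
drifted_rerun = {"accuracy": 0.700, "f1": 0.790}

print(is_reproduced(reference, close_rerun))
print(is_reproduced(reference, drifted_rerun))
```

A fixed relative tolerance is the simplest choice. A definition based on confidence intervals over repeated runs would be stricter, but it requires more than one rerun per condition.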

PUMA Integration

Used in: Section 1.1 (justification), Section 1.3 (contribution), Section 2.1 (SLR protocol)

Supports: H1 and H2 design (why we need reproducible conditions)

Permanent note generated: PN-KeyConcepts-Agents-Reproducibility-RedTeam

References in this paper (follow-up)

Wohlin et al. (2012) — Experimentation in SE ✅ already in vault
Kitchenham & Charters (2007) — SLR guidelines ✅ already in vault

🔗 Connected Notes

Permanent note: PN-KeyConcepts-Agents-Reproducibility-RedTeam (Reproducibility section)
Also in: LN-KeyPapers-CoGEE-Angermeir-Flyvbjerg
Cluster: ST-Reproducibility-Cluster
Constitution response: SP-PUMA-Constitution (Art. 1, 2)
Architecture response: SP-Architecture
Chapter citation: PR-PUMA-Ch1-Introduction (Gap 1) · PR-PUMA-Ch3-Methods (§3.8)
MOC: MOC-Literature-Review · MOC-PUMA-Master
