LN: Calikli & Alhamed (2025) — Request Formats and Effort Estimation

Bibliographic Reference

Citation: Calikli, G., & Alhamed, A. (2025). Request formats and effort estimation with LLMs. ACM Transactions on Software Engineering and Methodology. https://doi.org/10.1145/3715771


Pass 1 Summary (5 Cs)

C: Assessment
Category: Empirical evaluation (controlled experiment)
Context: Builds on CoGEE (Tawosi 2024), Few-Shot LLM literature
Correctness: Strong empirical design. Multiple datasets.
Contributions: Prompt format has non-monotonic, counter-intuitive effects on estimation quality. More examples ≠ better.
Clarity: Excellent. Clear experimental design.

Relevance: ⭐⭐⭐⭐⭐ (5/5)

Directly justifies PUMA’s systematic prompting comparison (H2)


Pass 2 Key Points

Core finding: The relationship between number of examples (k in few-shot) and estimation quality is non-monotonic. Neither zero-shot nor maximum-shot is consistently best. The optimal k varies by dataset, model, and task type.
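
Sketch (mine, not from the paper): what "non-monotonic" means operationally, assuming an error metric such as MAE has been measured once per value of k. The function names are hypothetical, and the numbers in the usage line are placeholders chosen for illustration, not results reported by Calikli & Alhamed.

```python
from typing import Dict


def is_monotonic(errors_by_k: Dict[int, float]) -> bool:
    """Return True if error never rises, or never falls, as k grows."""
    values = [errors_by_k[k] for k in sorted(errors_by_k)]
    pairs = list(zip(values, values[1:]))
    non_increasing = all(a >= b for a, b in pairs)
    non_decreasing = all(a <= b for a, b in pairs)
    return non_increasing or non_decreasing


def best_k(errors_by_k: Dict[int, float]) -> int:
    """Return the k with the lowest error (the smallest k wins ties)."""
    return min(sorted(errors_by_k), key=lambda k: errors_by_k[k])


# Placeholder values for illustration only, NOT taken from the paper.
placeholder_errors = {0: 2.0, 3: 1.5, 6: 1.75}
print(is_monotonic(placeholder_errors), best_k(placeholder_errors))
```

If the check returns False for a given dataset and model, the best k cannot be inferred from the endpoints (k = 0 or the largest k) and has to be found by testing the intermediate values.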

Why this is critical for PUMA: This is the empirical justification for testing multiple prompting strategies (zero-shot, few-shot-3, few-shot-6, CoT) rather than assuming more examples = better. PUMA’s independent variable (prompting strategy) is justified by this finding.
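
Sketch (assumption-heavy): one way the four strategies could be turned into prompts. The note does not record PUMA's actual prompt wording, so the instruction text, the `build_prompt` helper, and the choice to run CoT with zero examples are all hypothetical.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, int]  # (issue text, story points)

# Number of in-context examples per strategy.
# Assumption: CoT is run without examples; the note does not say.
STRATEGIES: Dict[str, int] = {
    "zero-shot": 0,
    "few-shot-3": 3,
    "few-shot-6": 6,
    "cot": 0,
}


def build_prompt(strategy: str, issue: str, pool: List[Example]) -> str:
    """Assemble a plain-text prompt for one strategy (hypothetical wording)."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    k = STRATEGIES[strategy]
    if k > len(pool):
        raise ValueError(f"{strategy} needs {k} examples, pool has {len(pool)}")

    parts = ["Estimate the story points for the issue below."]
    # Take the first k examples; a real study would control how they are sampled.
    for text, points in pool[:k]:
        parts.append(f"Issue: {text}\nStory points: {points}")
    parts.append(f"Issue: {issue}")
    if strategy == "cot":
        parts.append("Think step by step, then give the final estimate.")
    parts.append("Story points:")
    return "\n\n".join(parts)


# Hypothetical example pool, for illustration only.
pool = [(f"Example issue {i}", i) for i in range(1, 7)]
print(build_prompt("few-shot-3", "Fix login timeout", pool))
```

Keeping the strategy as a single named parameter makes it the only thing that varies between runs, which is what treating prompting strategy as the independent variable requires.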

Counter-intuitive result: In some conditions zero-shot outperforms few-shot, because the supplied examples can inadvertently anchor the model on non-representative cases drawn from the training set.

PUMA connection: H₀₂ (null: no few-shot configuration outperforms the mean historical baseline) must be tested, because assuming few-shot wins without evidence would repeat the error Calikli & Alhamed identify.
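
Sketch (assumed setup): comparing a configuration against the mean historical baseline, taking "mean historical baseline" to mean predicting the mean of past story points for every new issue, scored by MAE. The helper names and the numbers are hypothetical. This is a point comparison only; the note does not say which statistical test PUMA applies to H₀₂.

```python
from statistics import mean
from typing import Sequence


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error between paired actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def mean_baseline_mae(history: Sequence[float], test: Sequence[float]) -> float:
    """MAE of always predicting the mean of the historical story points."""
    baseline = mean(history)
    return mae(test, [baseline] * len(test))


def beats_baseline(history, test, predictions) -> bool:
    """True if the model's MAE is strictly lower than the baseline's MAE."""
    return mae(test, predictions) < mean_baseline_mae(history, test)


# Placeholder numbers for illustration only.
history = [2, 4]           # past story points (mean = 3)
test = [1, 5]              # actual story points of held-out issues
model_predictions = [1, 4]
print(mean_baseline_mae(history, test), beats_baseline(history, test, model_predictions))
```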


Pass 3 — Reconstruction

The paper likely tests:

  • k ∈ {0, 1, 3, 5, 10} examples per prompt
  • Multiple models (including proprietary)
  • TAWOS or similar dataset

Key weakness: Does not test CoT explicitly — PUMA adds this as Strategy 4.


Permanent Note Generated

PN-CoT-FewShot-Prompting


PUMA Integration

Used in: Section 1.1 (Limitation 2 — gap: no systematic prompting comparison), Section 2.3 (justification for 4-strategy design), H₀₂ framing

Key claim for PUMA Project: “Calikli & Alhamed (2025) demonstrate that prompt format has non-monotonic effects on LLM estimation quality — an effect no existing PM benchmark controls systematically.”


🔗 Connected Notes

  • Permanent note: PN-CoT-FewShot-Prompting (Few-Shot section)
  • Dataset: LN-Datasets-JiraSR-TAWOS (TAWOS)
  • Hypothesis: EX-Hypotheses-H1-H2 (H2: why we test multiple strategies)
  • PM concepts: PN-IssueTriage-StoryPoints (story point estimation)
  • Methods: PR-PUMA-Ch3-Methods (§3.4 prompting strategies)
  • Chapter 1: PR-PUMA-Ch1-Introduction (§1.1 Limitation 2, §1.3 gap justification)
  • Chapter 2: PR-PUMA-Ch2-Ch3-Ch4-Ch5 (§2.3 effort estimation related work)
  • Related paper: LN-KeyPapers-CoGEE-Angermeir-Flyvbjerg (CoGEE section)
  • MOC: MOC-Literature-Review · MOC-Methods-Frameworks · MOC-PUMA-Master