CoGEE: Story Point Estimation with Generative AI
One-sentence summary: CoGEE uses search-based optimisation (genetic algorithms) to find the best k-shot example set from a project’s history, achieving a mean absolute error (MAE) of about 1.9 story points (SP) with GPT-4 on the TAWOS dataset, the current state of the art.
📋 Bibliographic Reference
Authors: Tawosi, Vali · Alamir, Serena · Liu, Xiaomo | Year: 2024 | Venue: EASE 2024 | arXiv: 2403.08430
🎯 Core Contribution
Problem: Which examples to include in few-shot prompts for story point estimation is not obvious — random selection performs poorly.
Main claim: Search-based optimisation of the example set (selecting the k stories most similar to the target) significantly improves GPT-4 estimation quality compared to random few-shot selection.
Key insight: The selection of few-shot examples matters as much as the number of examples. Semantically similar historical stories provide better anchoring than random ones.
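The note names a genetic algorithm but does not record its operators or fitness function, so the following is only a minimal sketch of the idea under stated assumptions. The fitness function here is a hypothetical stand-in (it guesses the mean story points of the chosen examples) where a real pipeline would call the LLM with the selected shots; the function names, population size, and generation count are all illustrative.

```python
import random
from statistics import mean


def fitness(subset, pool_sp, val_sp):
    """Validation MAE for one candidate example set (lower is better).

    ASSUMPTION: a real system would prompt the LLM with these examples.
    Here the 'model' simply guesses the mean story points of the subset.
    """
    guess = mean(pool_sp[i] for i in subset)
    return mean(abs(guess - actual) for actual in val_sp)


def crossover(a, b, k, rng):
    """Child draws k distinct indices from the union of both parents."""
    merged = sorted(set(a) | set(b))
    return tuple(sorted(rng.sample(merged, k)))


def mutate(subset, pool_size, rng):
    """Swap one selected index for an index not currently selected."""
    child = list(subset)
    unused = [i for i in range(pool_size) if i not in child]
    child[rng.randrange(len(child))] = rng.choice(unused)
    return tuple(sorted(child))


def search_shots(pool_sp, val_sp, k=3, pop_size=8, generations=20, seed=42):
    """Elitist genetic search over k-element example sets."""
    rng = random.Random(seed)
    n = len(pool_sp)

    def score(subset):
        return fitness(subset, pool_sp, val_sp)

    population = [tuple(sorted(rng.sample(range(n), k))) for _ in range(pop_size)]
    initial_best = min(score(s) for s in population)
    for _ in range(generations):
        population.sort(key=score)
        elite = population[: pop_size // 2]  # best half survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            children.append(mutate(crossover(a, b, k, rng), n, rng))
        population = elite + children
    best = min(population, key=score)
    return best, score(best), initial_best


# Illustrative story points only, not taken from TAWOS.
pool = [1, 2, 3, 5, 8, 13, 2, 3, 5, 8]
validation = [3, 5, 5, 8]
best_subset, best_mae, start_mae = search_shots(pool, validation)
print(best_subset, best_mae, start_mae)
```

Because the best half of each generation is carried over unchanged, the best validation MAE can never get worse from one generation to the next. In a real setting each fitness evaluation costs one or more LLM calls, which is where the expense noted in the limitations comes from.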
🔬 Methods & Results
| Aspect | Detail |
|---|---|
| Model | GPT-4, GPT-3.5-turbo |
| Dataset | TAWOS (23,000+ user stories) |
| Approach | Genetic algorithm selects optimal k examples |
| Best MAE | ~1.9 SP (GPT-4 + optimised few-shot) |
| Baseline MAE | ~3.5 SP (historical mean) |
| Reproducibility | Code not publicly available |
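For readers less familiar with the metric, the two MAE rows in the table can be read through the small sketch below. It shows how MAE is computed and how a historical-mean baseline is formed: every test story is predicted as the mean story points of the training stories. The numbers are illustrative only and are not drawn from TAWOS or the paper.

```python
from statistics import mean


def mae(predictions, actuals):
    """Mean absolute error between predicted and actual story points."""
    if len(predictions) != len(actuals) or not actuals:
        raise ValueError("predictions and actuals must be non-empty and equal length")
    return mean(abs(p - a) for p, a in zip(predictions, actuals))


def historical_mean_baseline(train_sp, test_sp):
    """MAE obtained by predicting the training mean for every test story."""
    guess = mean(train_sp)
    return mae([guess] * len(test_sp), test_sp)


# Illustrative values only.
train = [2, 4, 6]
test = [3, 5, 8]
model_predictions = [3, 5, 5]

print("model MAE:", mae(model_predictions, test))
print("baseline MAE:", historical_mean_baseline(train, test))
```

A model is only interesting if its MAE is clearly below the baseline on the same test stories.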
🧠 Critical Analysis (DRCA + Red Team)
Strengths:
- Uses TAWOS — the standard dataset, enabling comparison
- Clearly demonstrates few-shot selection matters
- Large-scale empirical evaluation
Limitations (authors):
- Relies on proprietary GPT-4 API — cost and reproducibility concerns
- Optimisation is expensive at inference time
Limitations (mine — Red Team):
- Genetic algorithm overhead makes this impractical for real-time triage
- No comparison with local models → unknown if technique transfers
- No carbon/energy measurement
- Code not published → reproducibility gap (the gap PUMA directly addresses)
Relation to PUMA gap:
- CoGEE uses GPT-4 (API, closed) → PUMA uses Llama 3.2 + Mistral (local, open)
- CoGEE doesn’t compare prompting strategies → PUMA explicitly compares 4 strategies
- CoGEE doesn’t measure CO₂ → PUMA does (CodeCarbon)
🔗 Connections
- Dataset: LN-Datasets-JiraSR-TAWOS
- Task: PN-IssueTriage-StoryPoints (Story Points section)
- Hypothesis: EX-Hypotheses-H1-H2 (H2 baseline: PUMA must approach CoGEE’s MAE)
- Methods: PR-PUMA-Ch3-Methods (§3.5 baselines)
id: LN-Angermeir2025-Reproducibility
title: "Reproducibility of LLM Studies in Software Engineering (ICSE 2026)"
type: literature-note
subtype: paper
tags: [literature, paper, reproducibility, llm, se, benchmark, icse]
authors: ["Angermeir, Florian", "et al."]
year: 2025
venue: "ICSE 2026"
arxiv: "2510.25506"
url: "https://arxiv.org/abs/2510.25506"
zotero_key: "Angermeir2025"
relevance: high
puma_relevance: "Primary empirical evidence for PUMA’s Gap 1 (reproducibility). The finding that only 5/18 papers with published artefacts are actually executable is the core empirical motivation for PUMA’s reproducibility-first design."
read_status: processed
prisma_decision: include
created: 2026-03-01
Reproducibility of LLM Studies in Software Engineering
One-sentence summary: A large-scale meta-study of 85 LLM papers in SE finds that only 5 of 18 papers with published artefacts can actually be executed, and none are fully reproducible — establishing the primary research gap that PUMA addresses.
🎯 Core Contribution
Problem: LLM research in SE is rapidly growing but poorly reproducible, making scientific accumulation of knowledge difficult.
Main claim: Reproducibility is systemically absent: most papers either don’t publish code/data, or publish incomplete artefacts that cannot be executed.
Key figures:
- 85 papers analysed
- 18 with published artefacts
- 5 of 18 artefacts executable
- 0 of 18 fully reproducible (same results, clean environment)
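The counts above are easier to compare as percentages. The short calculation below derives them directly from the four figures in the list; the variable names are illustrative.

```python
def share(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


analysed = 85         # papers analysed
with_artefacts = 18   # papers with published artefacts
executable = 5        # artefacts that could be executed
reproducible = 0      # artefacts that were fully reproducible

print("artefacts published:", share(with_artefacts, analysed), "% of all papers")
print("executable:", share(executable, with_artefacts), "% of published artefacts")
print("executable:", share(executable, analysed), "% of all papers")
print("fully reproducible:", share(reproducible, with_artefacts), "% of published artefacts")
```

Roughly one paper in five publishes artefacts, and fewer than one in three of those artefacts runs at all.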
🧠 Critical Analysis
This is the strongest single justification for PUMA’s design. PUMA is explicitly designed to be the counter-example: 100% reproducible, seed=42, temperature=0, Docker option, MIT licence, ≤10 install commands.
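The note lists the settings (seed=42, temperature=0) but not how PUMA stores or applies them. As a hedged illustration only, a run configuration could be pinned and fingerprinted as below; `RunConfig`, its `model` field, and `sample_examples` are hypothetical names, not PUMA’s actual code.

```python
import hashlib
import json
import random
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for the settings that must be fixed per run."""
    seed: int = 42
    temperature: float = 0.0
    model: str = "example-local-model"  # placeholder, not a real model name

    def fingerprint(self) -> str:
        """Short hash of the settings, suitable for logging next to results."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def sample_examples(pool, k, config):
    """Draw k few-shot examples using a generator seeded from the config."""
    rng = random.Random(config.seed)
    return rng.sample(pool, k)


config = RunConfig()
pool = ["story-a", "story-b", "story-c", "story-d", "story-e"]
print(config.fingerprint(), sample_examples(pool, 2, config))
```

Using a private `random.Random` instance rather than the module-level generator keeps the sampling independent of any other code that touches global random state.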
Red Team: Could reproducibility have improved since the paper’s data collection (papers likely from 2023–2024)? Possibly, but the structural incentives haven’t changed. PUMA’s approach is still needed.
🔗 Connections
- Defines: PN-KeyConcepts-Agents-Reproducibility-RedTeam (Reproducibility section)
- Motivates: SP-Architecture · SP-PUMA-Constitution (Art. 1)
- Cited in: PR-PUMA-Ch1-Introduction (Gap 1)
- Cluster: ST-Reproducibility-Cluster
id: LN-Flyvbjerg2023-BigThings
title: "How Big Things Get Done — Flyvbjerg & Gardner (2023)"
type: literature-note
subtype: book
tags: [literature, book, project-management, planning-fallacy, uniqueness-trap]
authors: ["Flyvbjerg, Bent", "Gardner, Dan"]
year: 2023
venue: "Crown Publishers"
isbn: "978-0593239513"
zotero_key: "Flyvbjerg2023"
relevance: high
puma_relevance: "Introduces the ‘Uniqueness Trap’ (treating each project as unique, preventing statistical learning) — the third dimension of the PM problem that PUMA addresses via historical pattern learning."
read_status: processed
created: 2026-03-01
How Big Things Get Done
One-sentence summary: Flyvbjerg & Gardner demonstrate empirically that large projects systematically fail due to cognitive biases — especially the “Uniqueness Trap” — and argue for statistical learning from reference classes of similar projects.
🎯 Core Contribution for PUMA
The Uniqueness Trap (Chapter relevant to PUMA): Project managers systematically resist using historical data from past similar projects because they perceive their current project as unique. This prevents statistical learning and generates systematic forecast errors.
PUMA connection: LLMs can act as a “reference class database” — providing probabilistic estimates grounded in historical patterns (few-shot examples from past similar issues/stories) rather than treating each issue as unique. This is exactly what the few-shot prompting strategy implements in PUMA.
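To make the “reference class” idea concrete, here is a minimal sketch of assembling a few-shot prompt from the most similar past stories. It assumes word-overlap (Jaccard) similarity as a simple stand-in for whatever retrieval PUMA actually uses, and the prompt wording and function names are hypothetical.

```python
def tokens(text):
    """Lowercased word set; a deliberately crude text representation."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word-overlap similarity between two texts, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def select_examples(target, history, k=2):
    """Return the k (story, points) pairs most similar to the target story."""
    ranked = sorted(history, key=lambda ex: jaccard(target, ex[0]), reverse=True)
    return ranked[:k]


def build_prompt(target, history, k=2):
    """Few-shot prompt: similar past stories first, then the story to estimate."""
    parts = ["Estimate the story points for the final story."]
    for story, points in select_examples(target, history, k):
        parts.append(f"Story: {story}\nStory points: {points}")
    parts.append(f"Story: {target}\nStory points:")
    return "\n\n".join(parts)


# Illustrative history, not real project data.
history = [
    ("Add login page with password reset", 5),
    ("Fix typo in footer", 1),
    ("Add password reset email", 3),
    ("Upgrade database driver", 8),
]
print(build_prompt("Add password reset to login page", history))
```

The selected examples play the role of the reference class: the model sees what comparable stories actually cost before it produces its own estimate.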
Cornell Notes
| Questions | Notes |
|---|---|
| What is the Uniqueness Trap? | PM tendency to reject historical analogies as “not applicable to my unique project” |
| How does PUMA address this? | Few-shot prompts provide historical examples, forcing anchoring to base rates |
| What is “Reference Class Forecasting”? | Using statistics from similar past projects to anchor estimates |
Summary: The book provides the non-technical PM justification for why AI assistance in project management has value: humans are systematically biased in ways that statistical models are not. PUMA benchmarks whether LLMs can overcome these biases in triage and estimation tasks.
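Reference class forecasting, as described in the notes above, can be sketched in a few lines: take the outcomes of comparable past items, anchor on their median, and report how far an “inside view” guess sits from that anchor. The data and the function name are illustrative assumptions, not material from the book.

```python
from statistics import median, quantiles


def reference_class_forecast(past_outcomes, inside_view):
    """Anchor an estimate on the distribution of similar past outcomes."""
    q1, _, q3 = quantiles(past_outcomes, n=4)  # quartile cut points
    anchor = median(past_outcomes)
    return {
        "anchor": anchor,                        # base-rate estimate
        "low": q1,                               # lower quartile
        "high": q3,                              # upper quartile
        "inside_view_gap": inside_view - anchor, # negative means optimistic
    }


# Illustrative story points of similar past stories.
past = [2, 3, 3, 5, 5, 8, 8, 13]
print(reference_class_forecast(past, inside_view=2))
```

A large negative gap is the signature of the Uniqueness Trap: the estimator’s own guess is well below what similar work has historically required.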
🔗 Connections
- Defines concept: PN-KeyConcepts-Agents-Reproducibility-RedTeam (Uniqueness Trap section)
- Key thinker: PER-Flyvbjerg-Bent
- Cited in: PR-PUMA-Ch1-Introduction (§1.1 problem 3)
- Supports: EX-Hypotheses-H1-H2 (H2 rationale: few-shot as reference class)
- Applied via: PN-CoT-FewShot-Prompting (few-shot examples as reference class)