Chapter 5 — Discussion & Conclusions

Chapter Status

⏳ Pending F4 results Requires: Completed results tables from PR-PUMA-Ch4-Results

Chapter Structure

5.1 Interpretation of Results

H1 (Triage) interpretation framework:

If H₀₁ REJECTED (p < 0.05, r ≥ 0.1):
  → "Local LLMs provide measurable improvement over keyword heuristics..."
  → Discuss which model/strategy performed best and why
  → Discuss whether improvement is practically significant

If H₀₁ NOT REJECTED:
  → "No configuration achieved statistically significant improvement..."
  → This is a valid negative result — explain value
  → Possible explanations: 8B models too small for complex classification?
    CoT less effective at small model sizes? Dataset characteristics?

H2 (Estimation) interpretation framework:

If H₀₂ REJECTED:
  → Compare to CoGEE (MAE ~1.9 SP) — how close did local models get?
  → Discuss the quality-cost tradeoff (local vs GPT-4 API)

If H₀₂ NOT REJECTED:
  → Discuss why historical mean is hard to beat
  → Reference Flyvbjerg's Uniqueness Trap — teams use local context
    (not base rates) → model without fine-tuning faces same challenge

5.2 Implications for Practice

For ICT project managers:

Under what conditions should they adopt LLM-assisted triage?
What is the minimum acceptable F1 for practical deployment?
Carbon cost: is local inference sustainable at scale?

For researchers:

Which prompting strategies warrant further investigation?
Is fine-tuning necessary for PM tasks?
What would Stage 4 (RAG) add?

5.3 Limitations (Threats to Validity)

Internal validity threats:

Confound: model quantization level may affect results
Confound: hardware performance variation despite seed=42
Selection: Jira SR is Apache Software Foundation projects only → community OSS bias

External validity threats:

Results may not generalise to: closed-source Jira data, non-English issues, post-2014 issues
Local model results may not generalise to larger models (13B+)
8B parameter models may behave differently from 70B models

Construct validity threats:

F1-macro treats all classes equally — Critical errors may matter more in practice
Story points are team-relative — TAWOS cross-team ground truth has noise

Conclusion validity threats:

N=200 for triage, N=350 for estimation — may have limited power for small effects
Wilcoxon test assumptions: paired, continuous — per-issue scores may not be fully independent

5.4 Future Work

LIST
FROM "90 - GTD/93 Someday-Maybe"
WHERE contains(tags, "future-work")

Priority extensions:

Stage 3: Backlog prioritisation (Spearman correlation)
Stage 4: RAG-enhanced triage (ChromaDB + semantic retrieval)
Fine-tuning Llama 3.2 on Jira SR for fair comparison
Fairness analysis: does triage accuracy differ by project type?
Publication: MSR 2027 or EASE 2027

5.5 Conclusions

Answering the main RQ:
"Do statistically significant differences exist in automatic issue triage
quality and effort estimation when using different LLMs and prompting
strategies, evaluated on real ICT project datasets?"

Answer: [To be written based on results]

Contribution summary:
1. First fully reproducible PM+LLM benchmark (100%, local, MIT)
2. First systematic prompting strategy comparison for PM tasks
3. First carbon measurement dataset for PM+LLM research
4. [Result-dependent: evidence for/against local LLM viability in PM]

Red Team Preparation (Pre-Submission)

Before submitting Ch. 5, run these red team prompts:

PT-Advanced-Prompts-IIPR-Anchoring-AgentOS → rival hypothesis for main finding
PT-BMAD-Agent-Prompts → section review via Red Teamer agent
Manual: “If F1=0.58, is this practically meaningful or just statistically significant?”

Results source: PR-PUMA-Ch4-Results · EX-Hypotheses-H1-H2

Theoretical context: PN-LLM-Local-vs-Cloud · PN-KeyConcepts-Agents-Reproducibility-RedTeam (Red Teaming) PN-KeyConcepts-Agents-Reproducibility-RedTeam (Uniqueness Trap — §5.1 H2 discussion) PER-Flyvbjerg-Bent — Uniqueness Trap in H2 interpretation

Future work: Smart-PMO-Vision · PN-RAG-Embeddings-VectorDB (Stage 4) PN-MultiAgent-ArchitecturePatterns (Stage 5)

Navigation: PR-PUMA-Ch1-Introduction · MOC-PUMA-Master

PUMA Vault

Explorador

Chapter 5 — Discussion & Conclusions

Chapter 5 — Discussion & Conclusions

Chapter Structure

5.1 Interpretation of Results

5.2 Implications for Practice

5.3 Limitations (Threats to Validity)

5.4 Future Work

5.5 Conclusions

Red Team Preparation (Pre-Submission)

Vista Gráfica

Tabla de Contenidos

Retroenlaces

PUMA Vault

Explorador

Chapter 5 — Discussion & Conclusions

Chapter 5 — Discussion & Conclusions

Chapter Structure

5.1 Interpretation of Results

5.2 Implications for Practice

5.3 Limitations (Threats to Validity)

5.4 Future Work

5.5 Conclusions

Red Team Preparation (Pre-Submission)

🔗 Related Notes

Vista Gráfica

Tabla de Contenidos

Retroenlaces