Chapter 4 — Results

Chapter Status

Overview

⏳ Pending F4 (experiment completion + analysis). All tables are generated programmatically from the results/ directory.


Chapter Outline

4.1 Stage 1 — Issue Triage Results

Subsections:

  • 4.1.1 Baseline performance (B1 heuristic, B2 SVM)
  • 4.1.2 LLM condition results table (F1-macro × model × strategy)
  • 4.1.3 Statistical analysis (Wilcoxon, p-values, effect sizes)
  • 4.1.4 Per-class analysis (Critical, High, Medium, Low breakdown)
  • 4.1.5 Error analysis (top failure patterns + examples)
  • 4.1.6 H1 decision and interpretation
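
For reference, the macro-averaged columns planned for Table 4.1 could be computed along the following lines. This is a minimal sketch, assuming gold and predicted priority labels are available as two parallel lists; the function name macro_prf is hypothetical and the actual pipeline may rely on a library implementation instead.

```python
def macro_prf(y_true, y_pred, labels):
    """Return (per_class, macro) precision/recall/F1.

    per_class maps each label to (precision, recall, f1);
    macro is the unweighted mean of those three values over all labels.
    """
    per_class = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Undefined ratios (no predictions or no gold items) are set to 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[label] = (precision, recall, f1)
    n = len(labels)
    macro = tuple(sum(v[i] for v in per_class.values()) / n for i in range(3))
    return per_class, macro


# Toy labels for illustration only, not experimental data.
PRIORITIES = ["Critical", "High", "Medium", "Low"]
toy_true = ["Critical", "Critical", "High", "High", "Medium", "Medium", "Low", "Low"]
toy_pred = ["Critical", "High", "High", "High", "Medium", "Low", "Low", "Low"]
toy_per_class, toy_macro = macro_prf(toy_true, toy_pred, PRIORITIES)
```

Because the macro average weights all four classes equally, a weak result on a rare class such as Critical lowers the score as much as a weak result on a frequent class, which is why the per-class breakdown in 4.1.4 is reported alongside it.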

4.2 Stage 2 — Effort Estimation Results

Subsections:

  • 4.2.1 Baseline performance (historical mean; Deep-SE and CoGEE as reported in the literature)
  • 4.2.2 LLM condition results table (MAE × model × strategy)
  • 4.2.3 Statistical analysis (Wilcoxon vs historical mean)
  • 4.2.4 Per-SP-class analysis
  • 4.2.5 H2 decision and interpretation
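
The error metrics planned for Table 4.2 and the historical-mean baseline can be sketched as follows. This assumes the historical-mean baseline predicts the mean story-point value of the training issues for every test issue, which is one common reading of that baseline and not confirmed by this outline; all function names are hypothetical.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between paired true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between paired true and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def historical_mean_predictions(train_story_points, n_test):
    """Predict the training-set mean story points for each of n_test issues."""
    mean_sp = sum(train_story_points) / len(train_story_points)
    return [mean_sp] * n_test


# Toy story points for illustration only, not experimental data.
toy_train = [1, 2, 3, 5, 8]
toy_test = [2, 5]
toy_preds = historical_mean_predictions(toy_train, len(toy_test))
toy_mae = mae(toy_test, toy_preds)
toy_rmse = rmse(toy_test, toy_preds)
```

RMSE penalises large misses more heavily than MAE, so reporting both gives a sense of whether a condition's errors are evenly spread or dominated by a few badly estimated issues.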

4.3 System Performance Analysis

  • 4.3.1 Latency per condition (seconds per query, aggregated)
  • 4.3.2 Carbon footprint per condition (gCO₂eq, CodeCarbon)
  • 4.3.3 Quality vs carbon tradeoff frontier
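
One way the per-query latency in 4.3.1 could be measured and aggregated is sketched below. It assumes each condition is exposed as a callable that takes one query; run_query and both function names are hypothetical, and the choice of mean, median, and nearest-rank 95th percentile as summary statistics is an assumption rather than the chapter's fixed reporting format.

```python
import math
import statistics
import time


def time_queries(run_query, queries):
    """Return wall-clock latency in seconds for each query."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        run_query(query)
        latencies.append(time.perf_counter() - start)
    return latencies


def summarise_latency(latencies):
    """Aggregate per-query latencies into summary statistics (seconds)."""
    ordered = sorted(latencies)
    # Nearest-rank 95th percentile: smallest value with >= 95% of data at or below it.
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "n": len(ordered),
        "mean_s": statistics.mean(ordered),
        "median_s": statistics.median(ordered),
        "p95_s": ordered[p95_index],
    }
```

Since the carbon column is planned in gCO₂eq, note that CodeCarbon reports emissions in kg CO₂eq, so values would need to be multiplied by 1000 before entering Table 4.3.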

4.4 Cross-Stage Findings

  • Best model overall (Stage 1 + Stage 2 combined)
  • Best strategy overall
  • Quality-cost frontier recommendations
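
The quality-cost frontier can be read as a Pareto frontier: a condition stays on it only if no other condition has both lower (or equal) emissions and higher (or equal) quality. A minimal sketch follows, assuming each condition is summarised as a (name, gCO₂eq, F1-macro) tuple; the function name and the tuple layout are hypothetical, and for Stage 2 the quality value would need to be negated because lower MAE is better.

```python
def pareto_frontier(conditions):
    """Return the non-dominated conditions, ordered by increasing carbon.

    conditions: iterable of (name, gco2eq, f1) tuples, where lower gco2eq
    and higher f1 are both preferred.
    """
    # Sort by carbon ascending; for equal carbon, put the higher F1 first.
    ordered = sorted(conditions, key=lambda c: (c[1], -c[2]))
    frontier = []
    best_f1 = float("-inf")
    for name, carbon, f1 in ordered:
        # A point is kept only if it beats every cheaper (or equally cheap) point on F1.
        if f1 > best_f1:
            frontier.append((name, carbon, f1))
            best_f1 = f1
    return frontier


# Toy values for illustration only, not experimental data.
toy_conditions = [
    ("A", 1.0, 0.50),
    ("B", 2.0, 0.60),
    ("C", 3.0, 0.55),
    ("D", 4.0, 0.70),
]
toy_frontier = pareto_frontier(toy_conditions)
```

In the toy input, condition C is dropped because B achieves a higher score at lower emissions; the remaining points are the candidates a recommendation would choose between.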

Draft Tables (Placeholder — fill during F4)

Table 4.1 — Stage 1: Triage Results

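| Condition | Model       | Strategy   | F1-macro | Prec. (macro) | Rec. (macro) | W | p | r | sig.? |
|-----------|-------------|------------|----------|---------------|--------------|---|---|---|-------|
| B1        | Heuristic   |            | TBD      |               |              |   |   |   |       |
| B2        | TF-IDF+SVM  |            | TBD      |               |              |   |   |   |       |
| C1        | llama3.2:8b | zero-shot  | TBD      |               |              |   |   |   |       |
| C2        | llama3.2:8b | few-shot-3 | TBD      |               |              |   |   |   |       |
| C3        | llama3.2:8b | few-shot-6 | TBD      |               |              |   |   |   |       |
| C4        | llama3.2:8b | cot        | TBD      |               |              |   |   |   |       |
| C5        | mistral:7b  | zero-shot  | TBD      |               |              |   |   |   |       |
| C6        | mistral:7b  | few-shot-3 | TBD      |               |              |   |   |   |       |
| C7        | mistral:7b  | few-shot-6 | TBD      |               |              |   |   |   |       |
| C8        | mistral:7b  | cot        | TBD      |               |              |   |   |   |       |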

Note: Wilcoxon (W, p, r) computed vs B1 (heuristic baseline). α=0.05. Bonferroni correction applied (α_corrected = 0.05/8 = 0.00625).
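
The W, p, and r columns and the corrected threshold in the note above can be illustrated with a small standard-library sketch. It makes several assumptions that this outline does not confirm: that each LLM condition and B1 yield paired scores (for example per fold or per resample), that p comes from the normal approximation without tie or continuity correction, and that the effect size is r = |z| / sqrt(n) with n the number of non-zero differences. The function names are hypothetical; for small samples an exact test such as the one offered by scipy.stats.wilcoxon would be preferable.

```python
import math


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance level under Bonferroni correction."""
    return alpha / n_comparisons


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (W, p, r): W = min(W+, W-), the two-sided p-value, and the
    effect size r = |z| / sqrt(n) over the n non-zero paired differences.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(rank for rank, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p from the standard normal
    r = abs(z) / math.sqrt(n)
    return w, p, r


# Threshold for the 8 LLM conditions compared against B1.
ALPHA_CORRECTED = bonferroni_alpha(0.05, 8)
```

A condition would then be marked in the sig.? column only when its p-value falls below ALPHA_CORRECTED, not below the uncorrected 0.05.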

Table 4.2 — Stage 2: Estimation Results

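| Condition | Model           | Strategy   | MAE         | RMSE | W | p | r | sig.? |
|-----------|-----------------|------------|-------------|------|---|---|---|-------|
| B1        | Historical mean |            | TBD         |      |   |   |   |       |
| B2        | Deep-SE         |            | ~3.2 (lit.) |      |   |   |   |       |
| B3        | CoGEE (GPT-4)   |            | ~1.9 (lit.) |      |   |   |   |       |
| C1        | llama3.2:8b     | zero-shot  | TBD         |      |   |   |   |       |
| C2        | llama3.2:8b     | few-shot-3 | TBD         |      |   |   |   |       |
| C3        | llama3.2:8b     | cot        | TBD         |      |   |   |   |       |
| C4        | mistral:7b      | zero-shot  | TBD         |      |   |   |   |       |
| C5        | mistral:7b      | few-shot-3 | TBD         |      |   |   |   |       |
| C6        | mistral:7b      | cot        | TBD         |      |   |   |   |       |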

Table 4.3 — System Performance

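| Condition        | Model       | Strategy  | Task   | gCO₂eq | Latency (s/q) |
|------------------|-------------|-----------|--------|--------|---------------|
| C1               | llama3.2:8b | zero-shot | triage | TBD    | TBD           |
| [all conditions] |             |           |        |        |               |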

Visualisations Plan

  • Figure 4.1: F1-macro heatmap (model × strategy)
  • Figure 4.2: Per-class F1 grouped bar chart (best condition vs baseline)
  • Figure 4.3: MAE comparison bar chart with error bars
  • Figure 4.4: Carbon vs F1 scatter (quality-cost frontier)
  • Figure 4.5: Latency distribution box plots per model

All figures: matplotlib + seaborn, reproducible from notebooks/04_results_visualisation.ipynb


Experiments: EX-Hypotheses-H1-H2 · EX-Stages-Overview

Methodology: PR-PUMA-Ch3-Methods · PN-Wilcoxon-FINER-Cornell-PRISMA

PM concepts: PN-IssueTriage-StoryPoints · PN-CoT-FewShot-Prompting

Datasets: LN-Datasets-JiraSR-TAWOS · LN-KeyPapers-CoGEE-Angermeir-Flyvbjerg (baselines)

Carbon: Carbon-Tracking-Log
