Issue Triage in Software Project Management

Atomic Claim

Overview

Manual issue triage is a high-volume, low-individual-value activity that introduces systematic cognitive bias in priority assignment — making it the ideal first target for LLM automation in ICT project management.


💡 The Concept

Issue triage is the process of reviewing incoming work items (bugs, feature requests, tasks) and assigning: priority, severity, component, assignee, and milestone.

Why It’s a Problem

From the Jira Social Repository (Ortu et al., 2015 — 50,000+ real Apache issues):

  • Priority assigned manually varies significantly between projects for technically equivalent issues
  • Triage consumes disproportionate time relative to value added
  • Inconsistency cascades into planning errors and missed deadlines

Priority Classes (4-class schema used in PUMA)

ClassDefinitionProportion in Jira SR
CriticalSystem-down, data loss, security breach~8%
HighMajor feature broken, significant user impact~22%
MediumFeature degraded, workaround exists~45%
LowCosmetic, minor UX, documentation~25%

Class imbalance challenge: The 8%/22%/45%/25% distribution requires stratified sampling for fair evaluation. PUMA uses 50 issues per class (200 total) to create a balanced evaluation set.


📊 Evaluation Metrics for Triage

MetricFormulaWhy it matters
F1-macroMean F1 across all classesTreats each class equally regardless of size
F1-microWeighted by class frequencyDominated by majority class (Medium)
Precision per classTP/(TP+FP)Avoids false alarms
Recall per classTP/(TP+FN)Avoids missing critical issues

PUMA uses F1-macro because missing a Critical issue is as bad as miscategorising Low issues — no class should be downweighted.

Baseline: Heuristic Classifier

The reference baseline assigns priority based on keyword presence:

  • “crash”, “down”, “production”, “all users” → Critical
  • “slow”, “performance”, “timeout” → High
  • Default → Medium

This achieves approximately F1-macro = 0.45–0.52 on Jira SR. PUMA H1 tests whether LLMs exceed this.


🔗 Connected Ideas

Dataset: LN-Datasets-JiraSR-TAWOS (Jira SR) Hypothesis: EX-Hypotheses-H1-H2 (H1) Methods: PR-PUMA-Ch3-Methods (§3.2, §3.6) Prompts: PT-PUMA-Experiment-Prompts Related concept: PN-CoT-FewShot-Prompting · PN-KeyConcepts-Agents-Reproducibility-RedTeam (Uniqueness Trap) Glossary: Glossary-Master (Issue Triage, F1-macro) MOC: MOC-PUMA-Master · MOC-LLM-Benchmarks-PM-AI



id: PN-Story-Points title: “Story Points & Effort Estimation in Agile” type: permanent-note category: concept tags: [permanent, concept, story-points, estimation, agile, tawos, effort] aliases: [“Story Points”, “Effort Estimation”, “Sprint Estimation”, “SP”] created: 2026-03-01 maturity: evergreen sources: [“LN-Datasets-JiraSR-TAWOS”, “LN-KeyPapers-CoGEE-Angermeir-Flyvbjerg”]

Story Points & Effort Estimation in Agile

Atomic Claim

Story point estimation is the PM task with the highest variance in software engineering, exhibiting systematic over-estimation under sprint pressure and under-estimation in new projects — patterns that LLMs with few-shot examples may partially correct by anchoring to historical data.


💡 The Concept

Story points are a relative measure of effort, complexity, and uncertainty for a user story in Agile sprints. They are not hours — they are dimensionless units calibrated per team.

The Estimation Problem

From TAWOS (Tawosi et al., 2022 — 23,000+ real user stories):

  • Teams over-estimate under sprint pressure (conservative padding)
  • Teams under-estimate in new projects (optimism bias)
  • Estimates correlate poorly across teams even for similar stories

Reference Baselines for PUMA (Stage 2)

BaselineMAE (Story Points)Notes
Mean historical~3.5 SPProject historical average
Deep-SE~3.2 SPDeep learning model (Choetkiertikul et al.)
CoGEE (GPT-4)~1.9 SPBest published result (Tawosi et al., 2024)
PUMA target (H2)≤ 3.0 SPMVP threshold
PUMA ideal≤ 1.5 SPDesirable threshold

Fibonacci Scale (standard in PUMA experiments)

Story points follow a Fibonacci sequence: 1, 2, 3, 5, 8, 13, 21

This non-linearity must be accounted for in prompts:

  • Few-shot examples should include stories from across the scale
  • MAE computed in raw story points (not log-transformed)

🔗 Connected Ideas

Dataset: LN-Datasets-JiraSR-TAWOS (TAWOS) Hypothesis: EX-Hypotheses-H1-H2 (H2) Methods: PR-PUMA-Ch3-Methods (§3.2, §3.6) Core paper: LN-KeyPapers-CoGEE-Angermeir-Flyvbjerg Related concept: PN-CoT-FewShot-Prompting (few-shot as reference class) · PN-KeyConcepts-Agents-Reproducibility-RedTeam (Uniqueness Trap) Prompts: PT-PUMA-Experiment-Prompts Glossary: Glossary-Master (Story Points, MAE, Fibonacci scale) MOC: MOC-PUMA-Master · MOC-LLM-Benchmarks-PM-AI