Issue Triage in Software Project Management
Atomic Claim
Overview
Manual issue triage is a high-volume, low-individual-value activity that introduces systematic cognitive bias in priority assignment — making it the ideal first target for LLM automation in ICT project management.
💡 The Concept
Issue triage is the process of reviewing incoming work items (bugs, feature requests, tasks) and assigning: priority, severity, component, assignee, and milestone.
Why It’s a Problem
From the Jira Social Repository (Ortu et al., 2015 — 50,000+ real Apache issues):
- Priority assigned manually varies significantly between projects for technically equivalent issues
- Triage consumes disproportionate time relative to value added
- Inconsistency cascades into planning errors and missed deadlines
Priority Classes (4-class schema used in PUMA)
| Class | Definition | Proportion in Jira SR |
|---|---|---|
| Critical | System-down, data loss, security breach | ~8% |
| High | Major feature broken, significant user impact | ~22% |
| Medium | Feature degraded, workaround exists | ~45% |
| Low | Cosmetic, minor UX, documentation | ~25% |
Class imbalance challenge: With an 8%/22%/45%/25% distribution, a simple random sample would contain very few Critical issues, so the evaluation set has to be sampled per class. PUMA draws 50 issues per class (200 total) to create a balanced evaluation set.
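A minimal sketch of drawing such a balanced set, assuming each issue is a dict with a `priority` field holding one of the four class names. The function name, field name, and seed are illustrative and not taken from the PUMA code.

```python
import random
from collections import defaultdict

PRIORITY_CLASSES = ["Critical", "High", "Medium", "Low"]

def balanced_sample(issues, per_class=50, seed=0):
    """Draw `per_class` issues from every priority class, without replacement."""
    rng = random.Random(seed)  # fixed seed so the evaluation set is reproducible

    # Group issues by their (assumed) `priority` field.
    by_class = defaultdict(list)
    for issue in issues:
        by_class[issue["priority"]].append(issue)

    sample = []
    for label in PRIORITY_CLASSES:
        pool = by_class[label]
        if len(pool) < per_class:
            raise ValueError(f"only {len(pool)} issues available for class {label!r}")
        sample.extend(rng.sample(pool, per_class))
    return sample
```

With the default `per_class=50` this returns 200 issues, 50 per class, regardless of how skewed the source distribution is.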
📊 Evaluation Metrics for Triage
| Metric | Formula | Why it matters |
|---|---|---|
| F1-macro | Mean F1 across all classes | Treats each class equally regardless of size |
| F1-micro | F1 from TP, FP, FN pooled over all classes | Dominated by majority class (Medium) |
| Precision per class | TP/(TP+FP) | Avoids false alarms |
| Recall per class | TP/(TP+FN) | Avoids missing critical issues |
PUMA uses F1-macro because missing a Critical issue should count as much as miscategorising a Low issue: no class is downweighted because it is rare.
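The difference between the two averages can be sketched in plain Python, assuming single-label predictions. The function name is illustrative; in practice a library implementation would normally be used instead.

```python
def f1_scores(y_true, y_pred, labels):
    """Return (per-class F1, macro F1, micro F1) for single-label predictions."""
    per_class = {}
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        per_class[label] = 2 * tp / denom if denom else 0.0
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn

    # Macro: unweighted mean of per-class F1, so every class counts equally.
    macro = sum(per_class.values()) / len(labels)
    # Micro: one F1 computed from the pooled counts, so frequent classes dominate.
    micro_denom = 2 * tp_sum + fp_sum + fn_sum
    micro = 2 * tp_sum / micro_denom if micro_denom else 0.0
    return per_class, macro, micro


# A classifier that always answers "Medium" on an imbalanced toy set.
y_true = ["Medium"] * 8 + ["Critical"] * 2
y_pred = ["Medium"] * 10
per_class, macro, micro = f1_scores(y_true, y_pred, ["Critical", "Medium"])
print(per_class, round(macro, 3), round(micro, 3))
```

In this toy case micro F1 is 0.8 while macro F1 is about 0.444, because the Critical class scores 0 and macro averaging does not let the majority class hide that.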
Baseline: Heuristic Classifier
The reference baseline assigns priority based on keyword presence:
- “crash”, “down”, “production”, “all users” → Critical
- “slow”, “performance”, “timeout” → High
- Default → Medium
This achieves approximately F1-macro = 0.45–0.52 on Jira SR. PUMA H1 tests whether LLMs exceed this.
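The keyword rules above can be sketched as follows. The matching strategy is an assumption: this version uses case-insensitive substring matching and checks the Critical keywords first, neither of which is specified in the note. Substring matching is deliberately crude ("down" also matches "download"), so a real baseline may tighten it.

```python
# Rules are checked in order; the first class with a matching keyword wins.
RULES = [
    ("Critical", ("crash", "down", "production", "all users")),
    ("High", ("slow", "performance", "timeout")),
]
DEFAULT_PRIORITY = "Medium"

def heuristic_priority(text):
    """Assign a priority from keyword presence (assumed: substring, case-insensitive)."""
    lowered = text.lower()
    for label, keywords in RULES:
        if any(keyword in lowered for keyword in keywords):
            return label
    return DEFAULT_PRIORITY


print(heuristic_priority("App crashes on login"))         # Critical
print(heuristic_priority("Search results page is slow"))  # High
print(heuristic_priority("Typo in README"))               # Medium
```

As listed, the rules never emit Low, so this sketch scores an F1 of 0 on that class, which pulls down its F1-macro.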
🔗 Connected Ideas
- Dataset: LN-Datasets-JiraSR-TAWOS (Jira SR)
- Hypothesis: EX-Hypotheses-H1-H2 (H1)
- Methods: PR-PUMA-Ch3-Methods (§3.2, §3.6)
- Prompts: PT-PUMA-Experiment-Prompts
- Related concept: PN-CoT-FewShot-Prompting · PN-KeyConcepts-Agents-Reproducibility-RedTeam (Uniqueness Trap)
- Glossary: Glossary-Master (Issue Triage, F1-macro)
- MOC: MOC-PUMA-Master · MOC-LLM-Benchmarks-PM-AI
id: PN-Story-Points
title: "Story Points & Effort Estimation in Agile"
type: permanent-note
category: concept
tags: [permanent, concept, story-points, estimation, agile, tawos, effort]
aliases: ["Story Points", "Effort Estimation", "Sprint Estimation", "SP"]
created: 2026-03-01
maturity: evergreen
sources: ["LN-Datasets-JiraSR-TAWOS", "LN-KeyPapers-CoGEE-Angermeir-Flyvbjerg"]
Story Points & Effort Estimation in Agile
Atomic Claim
Story point estimation is the PM task with the highest variance in software engineering, exhibiting systematic over-estimation under sprint pressure and under-estimation in new projects — patterns that LLMs with few-shot examples may partially correct by anchoring to historical data.
💡 The Concept
Story points are a relative measure of effort, complexity, and uncertainty for a user story in Agile sprints. They are not hours — they are dimensionless units calibrated per team.
The Estimation Problem
From TAWOS (Tawosi et al., 2022 — 23,000+ real user stories):
- Teams over-estimate under sprint pressure (conservative padding)
- Teams under-estimate in new projects (optimism bias)
- Estimates correlate poorly across teams even for similar stories
Reference Baselines for PUMA (Stage 2)
| Baseline / target | MAE (story points) | Notes |
|---|---|---|
| Mean historical | ~3.5 SP | Project historical average |
| Deep-SE | ~3.2 SP | Deep learning model (Choetkiertikul et al.) |
| CoGEE (GPT-4) | ~1.9 SP | Best published result (Tawosi et al., 2024) |
| PUMA target (H2) | ≤ 3.0 SP | MVP threshold |
| PUMA ideal | ≤ 1.5 SP | Desirable threshold |
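MAE and the mean-historical baseline in the table can be sketched as below. The function names and the toy numbers are illustrative assumptions, not values from TAWOS.

```python
from statistics import mean

def mae(actual, predicted):
    """Mean absolute error in raw story points."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equally long")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_historical_baseline(history, n_predictions):
    """Predict the project's historical average for every new story."""
    return [mean(history)] * n_predictions


# Toy data: five past stories and three new ones (hypothetical values).
history = [2, 3, 5, 8, 2]   # historical mean is 4
actual = [1, 5, 13]
predicted = mean_historical_baseline(history, len(actual))
print(mae(actual, predicted))  # (3 + 1 + 9) / 3, about 4.33
```

Any model's predictions can be passed to `mae` in the same way and compared against the thresholds in the table.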
Fibonacci Scale (standard in PUMA experiments)
Story point values are restricted to a Fibonacci sequence: 1, 2, 3, 5, 8, 13, 21
The gaps between values widen up the scale. This non-linearity must be accounted for in prompts and in evaluation:
- Few-shot examples should include stories from across the scale
- MAE computed in raw story points (not log-transformed)
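Both points can be sketched in code. The helpers below are hypothetical: snapping a free-form model output to the nearest scale value is an assumption about post-processing that this note does not specify, and the `points` field name is illustrative.

```python
FIBONACCI_SCALE = (1, 2, 3, 5, 8, 13, 21)

def snap_to_scale(value):
    """Map a numeric estimate to the nearest scale value (ties go to the lower one)."""
    return min(FIBONACCI_SCALE, key=lambda s: (abs(s - value), s))

def pick_few_shot(stories, per_value=1):
    """Pick up to `per_value` example stories for each scale value, in scale order."""
    chosen = []
    for sp in FIBONACCI_SCALE:
        matches = [story for story in stories if story["points"] == sp]
        chosen.extend(matches[:per_value])
    return chosen


print(snap_to_scale(4))    # 3  (tie between 3 and 5, lower value wins)
print(snap_to_scale(6.9))  # 8
print(snap_to_scale(30))   # 21
```

Selecting examples per scale value keeps the prompt from showing only small stories, which are typically the most common. Errors are then measured on the snapped values in raw story points, so a miss of 13 versus 21 costs 8 points rather than one scale step.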
🔗 Connected Ideas
- Dataset: LN-Datasets-JiraSR-TAWOS (TAWOS)
- Hypothesis: EX-Hypotheses-H1-H2 (H2)
- Methods: PR-PUMA-Ch3-Methods (§3.2, §3.6)
- Core paper: LN-KeyPapers-CoGEE-Angermeir-Flyvbjerg
- Related concept: PN-CoT-FewShot-Prompting (few-shot as reference class) · PN-KeyConcepts-Agents-Reproducibility-RedTeam (Uniqueness Trap)
- Prompts: PT-PUMA-Experiment-Prompts
- Glossary: Glossary-Master (Story Points, MAE, Fibonacci scale)
- MOC: MOC-PUMA-Master · MOC-LLM-Benchmarks-PM-AI